What Goes Wrong With PC Hard Disks

by Grey Staples

This article deals with how data files on PC hard disks can be damaged or rendered inaccessible. The discussion will be purposely kept to a "generic" level of equipment and DOS release. We will not be discussing specific brands of computers, controllers, disks, or utility software.

DOS Disk Organization

Figure 1 shows the steps taken by data from the time when the operator saves a file to when the computer actually stores the bytes on hard disk media. To start things off, the operator requests the application program to write the data file to disk. Some software will hold the entire file in RAM and then proceed to write everything to the hard disk. This is generally true of older spreadsheet and word processor programs. Others only need to access a portion of a file at a time and will buffer blocks of data, reading and writing them dynamically as needed. This group of software may use the DOS buffer pool to hold the data sectors, and rely on DOS functions to write them to disk at a later time.

In any case, for the write operation, DOS is requested to locate the data in RAM and begin transferring it to disk, typically using the DOS Write Handle function. File oriented parameters like offset within the file are used by a program, and then translated by DOS into physical parameters such as drive, cylinder, head and sector. Figure 2 shows how a cluster number is translated into a disk address. Next BIOS is asked, with Interrupt 13 (hex), to write a sequence of 512 byte sectors to the disk. Up to this point we have been dealing with application software and DOS. We are now venturing into the controller and disk hardware. Generally the disk controller is the card that plugs into the system board and is connected by a ribbon cable to the sealed hard disk unit.
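The cluster to disk address translation can be sketched in a few lines of Python. This is a minimal illustration of the conventional arithmetic, not a copy of Figure 2. The function name, the parameter names and the sample geometry (17 sectors per track, 4 heads, a partition starting 17 sectors into the disk, 115 sectors of boot, FAT and root area ahead of the data clusters) are assumptions made for the example.

```python
def cluster_to_chs(cluster, first_data_sector, sectors_per_cluster,
                   sectors_per_track, heads, hidden_sectors=0):
    """Return (cylinder, head, sector) of the first sector of a cluster.

    first_data_sector is counted from the start of the partition;
    hidden_sectors is the number of sectors that precede the partition.
    All names here are hypothetical, chosen for this sketch only.
    """
    # Data clusters are numbered from 2, so cluster 2 sits at the
    # very start of the data area.
    logical = first_data_sector + (cluster - 2) * sectors_per_cluster
    absolute = logical + hidden_sectors
    sector = absolute % sectors_per_track + 1        # sectors are 1-based
    head = (absolute // sectors_per_track) % heads
    cylinder = absolute // (sectors_per_track * heads)
    return cylinder, head, sector


# Assumed layout: 1 boot sector + 2 FATs of 41 sectors + 32 root sectors = 115.
print(cluster_to_chs(2, 115, 4, 17, 4, hidden_sectors=17))  # (1, 3, 14)
```

In a real system the values passed in would come from the BIOS Parameter Block rather than being typed in by hand.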

Once the physical location of a sector has been defined and presented to BIOS, we encounter the device driver logic. The disk controller has to be addressed in its own terms. The driver constructs a command block in RAM, typically six bytes in size, and outputs it to the controller through a particular set of port addresses. The controller is responsible for converting the physical cylinder, head and sector coordinates into stepping motor commands. Finally the desired track is found (address mark is detected) and the sector is located (sector preamble). The sector is copied to the magnetic media and the Checksum or Error Correcting Code, computed by the controller, is written. We are dealing with a serial stream of bits directed to the hard disk unit itself. All this happens in the span of milliseconds.

Let us step back from the "flight of the data sector" and look at how the hard disk is organized. Figure 3 shows the usual physical layout of a track. Each 512 byte sector is surrounded by gap bytes and is preceded by a Preamble or Identification Block, containing cylinder, head and sector values. Right out of the box from the manufacturer we usually have an empty disk. A "Low Level Format" process is called for. Typically the program to perform this operation is already in the ROM BIOS logic on the controller. If this is the case, the manufacturer's instructions will talk about using the DEBUG utility and executing a particular address. Usually C800:5 or C800:CCC are used. After the Low Level Format the drive is still useless to DOS. All the process did was build the preamble for each sector and create data sectors with a fixed content, usually a hex "F6" character.

Every sector preamble identifies itself with cylinder, head and sector values. These provide the controller with the ability to recognize any sector. The interleaving of sector numbers is defined during this format. Interleave deals with just how many sectors are skipped between, say, sector number 1 and number 2. At issue is how long it takes for the disk to rotate to the next sector, and how long it takes the computer to decide to ask for it. Faster processors result in a smaller interleave factor. An optimal interleave will present the next desired sector just when the computer needs it. A wrong choice will waste the time it takes for a complete revolution of the disk (latency time). Only sequential accesses can be optimized with our present PC hardware; random requests are on their own. Generally DOS loads a program by requesting an entire track to read at a time. The controller finds the first sector, transfers it to RAM and then waits for the next one. Random sector requests are generally presented one at a time. In a multiprogramming system, like our mainframe cousins, the controller could be buffering multiple sector requests, and using a "position sensing" mechanism to optimally select a random sector to read next.
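To make interleave concrete, here is a small Python sketch that lays out the sector numbers around one track. It assumes one common convention: an interleave factor of N places consecutive sector numbers N physical slots apart, so a factor of 1 skips nothing. The function name and the 17 sector track are choices made for this example, and a given controller may number its tracks differently.

```python
def interleave_layout(sectors_per_track, interleave):
    """Return the sector numbers in the physical order they pass the head."""
    layout = [0] * sectors_per_track       # 0 marks a slot not yet assigned
    slot = 0
    for number in range(1, sectors_per_track + 1):
        # If the target slot is already taken, slide to the next free one.
        while layout[slot] != 0:
            slot = (slot + 1) % sectors_per_track
        layout[slot] = number
        slot = (slot + interleave) % sectors_per_track
    return layout


print(interleave_layout(17, 1))  # sectors in plain 1..17 order
print(interleave_layout(17, 3))  # [1, 7, 13, 2, 8, 14, 3, 9, 15, 4, 10, 16, 5, 11, 17, 6, 12]
```

With a factor of 3, two other sectors pass under the head between sector 1 and sector 2, which gives a slower machine time to ask for the next sector before it arrives.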

Figure 4 depicts the relationship between the boot sectors and the DOS control areas. The DOS utility FDISK is called upon to define the DOS partition. At this point there are subtle considerations about 32 megabyte limitations, extended partitions, and ways to overcome the inherent DOS limitations for large drives. Figure 5 shows the sequence of events during the booting process, and the sectors involved. One result of the FDISK step is the creation of the Master Boot sector on track 0, sector 1. The boot sector is the first thing read from the disk by the ROM booting process. If this sector is not found, the entire drive is "invalid". It is the responsibility of the Master Boot logic to locate a bootable DOS partition, read the Partition Boot sector and transfer control to it.

The DOS FORMAT utility will prepare the partition by building the Partition Boot Sector and constructing two copies (usually) of the File Allocation Table reflecting any unreadable sectors (bad clusters). FORMAT performs a "read only" scan of all sectors in the partition, building a new FAT in memory. All entries in the Root Directory are cleared by setting the first byte of each 32 byte entry to a hex 0. If the partition is to be bootable, the two hidden files IBMBIO.COM and IBMDOS.COM, and the command interpreter, COMMAND.COM, are placed at the beginning of the area of data clusters (starting with cluster 2). Note that the Boot Sector also contains the BIOS Parameter Block which defines all the parameters needed by DOS to translate logical, or cluster level, requests into physical parameters suitable for INT 13 requests. These values were used in Figure 2 to perform the cluster to physical address transformation.

The File Allocation Table relates one cluster to the next in a file. Each entry can be 12 or 16 bits in length. The Partition Table in the Master Boot Sector contains a type byte designating a 12 or 16 bit FAT format. A 10 megabyte drive using DOS 2.x versions would have a 12 bit FAT format. Here we have only eight sectors for the table, and the maximum cluster number would be 2500+. An IBM AT with 20 megabytes and using DOS 3.x would have a 41 sector FAT. Each cluster would be 2048 bytes in size and there would be 10,405 clusters. Larger drives would require larger FATs, and perhaps larger cluster sizes.
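The FAT sizes quoted above can be checked with a little arithmetic, and the way the table "relates one cluster to the next" can be shown by walking a chain. The Python sketch below is only an illustration. It models the FAT as a plain list of integers, the function names are invented for the example, and it assumes the usual convention that entries 0 and 1 are reserved and that values of FF8 hex (12 bit) or FFF8 hex (16 bit) and above mark the end of a file.

```python
def fat_sectors(cluster_count, fat_bits, bytes_per_sector=512):
    """Sectors needed for one FAT copy covering cluster_count data clusters."""
    entries = cluster_count + 2                    # entries 0 and 1 are reserved
    total_bytes = (entries * fat_bits + 7) // 8
    return (total_bytes + bytes_per_sector - 1) // bytes_per_sector


def fat_capacity(sectors, fat_bits, bytes_per_sector=512):
    """How many entries fit in a FAT of the given size."""
    return sectors * bytes_per_sector * 8 // fat_bits


def follow_chain(fat, start, fat_bits=16):
    """Return the list of clusters owned by a file, starting at 'start'."""
    end_of_chain = 0xFFF8 if fat_bits == 16 else 0xFF8
    chain, seen, cluster = [], set(), start
    while cluster < end_of_chain:
        if cluster < 2 or cluster >= len(fat) or cluster in seen:
            raise ValueError(f"damaged chain at cluster {cluster}")
        seen.add(cluster)
        chain.append(cluster)
        cluster = fat[cluster]
    return chain


print(fat_sectors(10405, 16))   # 41, the 20 megabyte AT example
print(fat_capacity(8, 12))      # 2730 entries in an eight sector 12 bit FAT
print(follow_chain([0xFFF8, 0xFFFF, 3, 5, 0, 0xFFFF], 2))  # [2, 3, 5]
```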

The hard disk Root Directory is usually sized at 32 or 64 sectors. Each sector holds 16 entries (of 32 bytes each). For our discussion the important items are file name and extension (the first 11 bytes of each directory entry), the attribute byte (next byte after the extension), the starting cluster number, and the file size. Together these describe the file and where it initially resides. Sub-Directories are essentially expandable files with a directory attribute. Being treated like a file allows the sub-directory to grow to multiple clusters, unlike the Root Directory which is anchored in place and cannot expand.
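The fields just listed can be picked out of a raw 32 byte directory entry with a short Python sketch. It assumes the standard DOS layout, with the starting cluster at byte offset 26 and the file size at offset 28, both stored least significant byte first. The function name, the dictionary keys and the sample entry are made up for the example.

```python
import struct

ATTR_DIRECTORY = 0x10   # attribute bit that marks a sub-directory


def parse_dir_entry(entry):
    """Decode the fields of one 32 byte directory entry."""
    if len(entry) != 32:
        raise ValueError("a directory entry is exactly 32 bytes")
    first = entry[0]
    if first == 0x00:
        status = "never used"       # DOS stops searching the directory here
    elif first == 0xE5:
        status = "erased"
    else:
        status = "active"
    start_cluster, size = struct.unpack_from("<HI", entry, 26)
    return {
        "status": status,
        "name": entry[0:8].decode("ascii", "replace").rstrip(),
        "ext": entry[8:11].decode("ascii", "replace").rstrip(),
        "is_directory": bool(entry[11] & ATTR_DIRECTORY),
        "start_cluster": start_cluster,
        "size": size,
    }


# A hand-built sample entry: name, extension, attribute, 14 unused bytes here,
# then starting cluster 2 and a size of 5000 bytes.
sample = b"BUDGET  WK1" + bytes([0x20]) + bytes(14) + struct.pack("<HI", 2, 5000)
print(parse_dir_entry(sample))
```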

What Can Go Wrong

Our focus in this discussion is on the possible damage to a user's data files. Application software modules are a separate issue. One can usually just restore any damaged software from distribution diskettes, or deal with the manufacturer on reinstallation problems caused by copy protection. Data files on the other hand, can be simply categorized as having either fixed length or variable length records. Figure 6 shows the difference between the two formats. Many times the fixed length variety have a variable length header or descriptor area at the start of the file. Such popular products as dBase and SMART (Data Manager) have a header which defines each data base field, and other characteristics. After the header they get down to a repetition of fixed length data base records. With dBase each record is of fixed size, preceded by an active / deleted flag. SMART uses a two byte prefix on each record to define the length of the following record, with a high order bit flag showing active and deleted status. For fixed length Data Manager files the length value is the same for each record. A variable length record format is also available.

Some fixed length formats are even simpler. I have encountered CAD and data base products in which each record is a constant 128 bytes in size. Record type and control flags are generally stored at the beginning of each record.

Variable length format files typically have record type and length bytes preceding each record. In Lotus 1-2-3 for example, the first two bytes define the record type (least significant byte first) and the next two define the record length, excluding the four control bytes just described. Files of this format are quite vulnerable to random sector level corruption. A single misplaced sector in a Lotus file will cause the software to report an error and stop loading the file at that point. The rest of the file is not readable and therefore is lost.
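The fragility of this kind of format is easy to demonstrate. The Python sketch below walks a file built from the four control bytes described above: a two byte record type and a two byte length, both least significant byte first, followed by the record body. It is a simplified illustration rather than a real worksheet reader. The record type numbers and the sample data are arbitrary, and the function name is invented for the example.

```python
import struct


def walk_records(data):
    """Return (records, offset), where offset is the point at which parsing stopped."""
    records = []
    offset = 0
    while offset + 4 <= len(data):
        rec_type, length = struct.unpack_from("<HH", data, offset)
        body_start = offset + 4
        if body_start + length > len(data):
            break                       # length runs off the end: treat as corruption
        records.append((rec_type, data[body_start:body_start + length]))
        offset = body_start + length
    return records, offset


def make_record(rec_type, body):
    return struct.pack("<HH", rec_type, len(body)) + body


good = make_record(0, b"\x06\x04") + make_record(15, b"Sales") + make_record(1, b"")
records, stopped = walk_records(good)
print(len(records), stopped == len(good))        # 3 True

# Overlay the second record with "alien" bytes, as a stray sector would.
bad = good[:6] + b"\xff" * 9 + good[15:]
records, stopped = walk_records(bad)
print(len(records), stopped)                     # 1 6
```

Everything after the first bad length value is out of step, so the reader has no way to find the start of the next good record on its own.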

Operator Problems

The first problem area involves the computer operator. The person at the keyboard is quite capable of creating havoc on our data files (which are usually not sufficiently preserved). With even faster processors and even larger disks, the operator can destroy more data faster than ever before. The operator problems sometimes derive from the simple omission of an important parameter from one of the DOS commands. The action of deleting all the files on a diskette and leaving out the important "A:" will result in EVERYTHING in the current directory (word processor, spreadsheet, data base, etc.) being deleted. Out of habit, the DOS question of "Are you sure(Y/N)" is automatically answered with a Yes reply by the operator. They expected to be asked that question anyway.

We now reach for our favorite Unerase utility and ask it to "save our bacon". In an erase operation all the clusters owned by the file are freed (FAT entries set to zero), and the directory entry is flagged with a hex "E5" overlaying the first byte of the file name. The file size and starting cluster number are left intact, allowing our Unerase utility to work. With luck the file will be automatically resurrected and returned to usefulness. On the down side, you will be required to inspect a myriad of clusters, all appearing in some strange Martian-like form, demanding that you know the intimate details of each file's format. One quickly learns how to identify ASCII labels in a spreadsheet, for example, to determine the appropriateness of any cluster for inclusion in the file being unerased. You next should invoke the application and ask it to validate your selections during the unerase process. Variable length format software will report an error if any alien clusters or records are encountered. Your 1000 line worksheet is reduced to 100 lines, everything else is not readable. We now get into long hours of trial and error, usually at 3 AM.
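The "automatic" part of an unerase can be sketched as follows. This Python example assumes the simplest possible strategy, which is a guess on my part and not the method of any particular utility: start at the cluster recorded in the erased directory entry and claim free clusters in ascending order until the file size is covered. The FAT is modeled as a list of integers and all names are hypothetical.

```python
def guess_erased_chain(fat, start_cluster, file_size, cluster_bytes):
    """Guess which clusters an erased file occupied; None if the guess fails."""
    needed = -(-file_size // cluster_bytes)      # ceiling division
    if fat[start_cluster] != 0:
        return None          # the first cluster has already been reused
    chain = []
    cluster = start_cluster
    while len(chain) < needed and cluster < len(fat):
        if fat[cluster] == 0:                    # zero means free, so a candidate
            chain.append(cluster)
        cluster += 1
    return chain if len(chain) == needed else None


# Cluster 4 belongs to another file, so the guess skips over it.
fat = [0xFFF8, 0xFFFF, 0, 0, 0xFFFF, 0, 0]
print(guess_erased_chain(fat, 2, 5000, 2048))    # [2, 3, 5]
```

The guess is only as good as the assumption behind it. If the file was fragmented among other erased files, the candidates will include alien clusters, which is where the manual inspection begins.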

When a user creates or extends a file, DOS has to find space for it. The File Allocation Table provides a list of available clusters, each a candidate to hold the new data sectors. In release 2.x of DOS, starting at the lowest cluster, the first available one (a zero entry in the FAT) is claimed. DOS repeats the process of using the next available until the entire file is saved. This approach creates a good deal of file fragments scattered around the disk. The Read / Write head is repeatedly moved around the disk, slowing down the total file loading time. In DOS 3.x the process is refined a bit by starting the search at the current cluster, and proceeding upwards. The dispersal of fragments is reduced, and less head movement is needed. The degree of file fragmentation can be easily shown by using the "CHKDSK *.*" command in any subdirectory. The "*.*" will cause the utility to report on any files which are not composed of contiguous clusters as reflected in the FAT. Routine running of a disk optimizer utility will reduce or eliminate fragmentation and tip the odds in your favor for a successful unerase.
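The two search strategies, and the notion of a fragment, can be illustrated with a short Python sketch. It is a simplified model and not the actual DOS logic: the FAT is a list of integers, the function names are invented, and wrapping back to cluster 2 when the search reaches the end of the table is an assumption of this example.

```python
def find_free(fat, search_from=2):
    """Return the first free cluster at or after search_from, wrapping around.

    search_from=2 models the DOS 2.x habit of always starting at the bottom;
    passing the last cluster written models the DOS 3.x refinement.
    """
    data_clusters = len(fat) - 2
    for step in range(data_clusters):
        cluster = 2 + (search_from - 2 + step) % data_clusters
        if fat[cluster] == 0:
            return cluster
    return None                                  # disk full


def count_fragments(chain):
    """Count the runs of contiguous clusters in a file's chain."""
    if not chain:
        return 0
    breaks = sum(1 for a, b in zip(chain, chain[1:]) if b != a + 1)
    return 1 + breaks


fat = [0xFFF8, 0xFFFF, 0xFFFF, 0, 0xFFFF, 0]
print(find_free(fat))          # 3, the lowest free cluster
print(find_free(fat, 4))       # 5, the next free cluster above the current one
print(count_fragments([2, 3, 5]), count_fragments([2, 3, 4]))   # 2 1
```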

An accidental DOS FORMAT operation is essentially an unerase problem in the extreme. As outlined above, the FORMAT program first reads every sector in the partition, looking for errors or unreadable sectors. A new, and empty, copy of the FAT is being constructed with any bad clusters flagged. The root directory is then subjected to what I call a "super erase". The first byte of every directory entry is reset to a hex 0. DOS uses the zero as a flag to determine when to stop searching a directory. There is no reason to read all 64 sectors in a root directory when only a few sectors are actually being used. The rest of the root is typically untouched. Finally the new, and empty, FAT is written to the disk. The "Format Complete" message appears. If you ever inadvertently start a FORMAT on your hard disk, the easiest answer is to turn the computer off as soon as possible. At best you will terminate the read only scan of the disk looking for bad sectors. At worst, parts of the root or FATs may be overlaid. It pays to use a commercial "Unformat" utility if accidental FORMAT of the hard disk is a worry.

It is possible for an operator to destroy the first track on a hard disk (Master Boot sector) by using an XT version of the IBM parking program on an AT. Older versions (1982 and 1983) of the SHIPDISK program on the IBM Diagnostics diskette for the XT have a format function executed on the current track if an unexpected error is received during the parking process. The AT produces such an error, triggering the XT version to issue an "05" command (format track). Unfortunately the AT expects interleave to be specified by two registers pointing to a table of sector numbers. The old XT program makes no attempt to set these registers. The result is that track zero is subjected to a low level format with inappropriate sector numbers. The bottom line is that there is no sector one, so POST fails for that drive. The AT is told that drive C is not valid. The recovery strategy would be to perform a low level format on track zero, run FDISK to rebuild the Master Boot Sector and redefine the partition(s). With luck, the undamaged partition will fall right into place and match the new partition table in the boot sector.

The term "corruption" will be freely used in the rest of this article. Let me offer a definition. In programming circles we frequently use the words "memory corruption" or "stack corruption". Simply put, some program has written over a sensitive section of RAM with other, or "alien" data. Typically your program, which depends on an understandable section of memory, will malfunction. Phrases like "out to lunch" or "off in the weeds" abound to describe what happens. Sometimes you have to turn the computer off to clear the damage from the computer memory. The same problems occur on disks. Sector level corruption is caused by damaged or "alien" sectors being written to the requested sector. In the following section we will explore how badly things can go wrong.

Software Problems

Software errors are a logical possibility. Since the application program, or the operating system in force, is making decisions about what data is to be written where, we have the possibility of file and disk corruption. Generally software products get a clean bill of health. But there are occasional bugs in programs which can damage your data files. Fouling your data is one thing, corrupting the FAT is quite another problem. The subsequent discussion of hardware induced errors will encompass most of the possible software caused corruptions.

One area of software induced problems needs to be mentioned. Trojan Horse programs, which masquerade as benign utilities but quietly trash your hard disk, are a malignancy on the computer scene. It is an ingenuity battle between "us" and "them". Corrupting, or clearing, the FAT is equivalent to the effects of a DOS FORMAT, with the advantage that, if the root is still intact, we have an easier job of finding each active file and then un-erasing each one. Trashing track zero is the same as running an old XT SHIPDISK on an AT. Luck might prevail and the partition would be reinstalled exactly as it was before.

Hardware Problems

Once DOS and BIOS have dispatched a sector to be written to a specific spot, or sector, on the disk, the controller takes over. Hardware malfunctions at this level can be quite insidious. Failing components, mis-seated controller boards, and "cold" or broken solder joints can cause a variety of problems. Each of these will change the intention of the requested disk write, or read, operation. For example, the controller decided NOT to write one or more sectors. What we then have residing on the disk, or diskette, surface is the good file intermixed with whatever used to be there. I have seen a 1-2-3 file with a sequence of sectors consisting of old COM and ASCII print file contents. Only one "alien" sector need be present to cause 1-2-3 to stop loading the worksheet, and report an error. In this case the 100 row worksheet was reduced to 3 rows. All we had were the column headings and none of the data.

A second possibility is that a sector was written to the wrong location. If any of the registers, used by BIOS to communicate the physical location parameters, are altered by whatever means, we then have a sector intended for one location being written elsewhere. This would constitute an "alien" sector which now is clobbering some other data or program. If a power transient causes the registers to be reset to zero, then guess what gets damaged. The DOS control areas are down at the low sector numbers. A stray sector overwriting part of the FAT will cause numerous problems. CHKDSK will report a wild assortment of cluster problems, with quite unreasonable cluster numbers being found. The program is taking some ASCII data, or program content at face value and assuming the values are cluster numbers. Look for error reports on clusters outside the reasonable range for the size of the drive.
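The "unreasonable cluster number" test is simple enough to sketch in Python. This is not how CHKDSK itself is written; it is a minimal range check under the usual conventions that zero means free and that values of FF0 hex (12 bit) or FFF0 hex (16 bit) and above are reserved, bad cluster or end of file markers. The FAT is modeled as a list of integers and the function name is hypothetical.

```python
def suspicious_entries(fat, fat_bits=16):
    """Return (cluster, value) pairs whose value cannot be a valid next cluster."""
    special_floor = 0xFFF0 if fat_bits == 16 else 0xFF0
    problems = []
    for cluster in range(2, len(fat)):
        value = fat[cluster]
        if value == 0 or value >= special_floor:
            continue                  # free, reserved, bad, or end of file
        if value < 2 or value >= len(fat):
            problems.append((cluster, value))
    return problems


# Entry 4 holds the ASCII letters "BA" (0x4142) from a stray text sector.
fat = [0xFFF8, 0xFFFF, 3, 0xFFFF, 0x4142, 0, 1]
print(suspicious_entries(fat))        # [(4, 16706), (6, 1)]
```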

Single sector damage to the first FAT copy can be repaired by finding the corresponding sector in the second copy of the FAT and pasting it over the damaged one. With lots of luck this will fix the problem. DOS updates both the first and second copies of the FAT one after the other. Sometimes a transient persists long enough to affect both copies. A repair strategy would be to clear, zero, the damaged sector and then see what files need that part of the FAT. CHKDSK will help in the analysis.
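Finding which sector to paste is a matter of comparing the two copies sector by sector. The Python sketch below works on two byte strings that are assumed to have been read from the disk already. Reading and writing the actual sectors is left out, and the function names are invented for the example.

```python
def differing_sectors(fat1, fat2, bytes_per_sector=512):
    """Return the sector numbers (within the FAT) where the two copies disagree."""
    longest = max(len(fat1), len(fat2))
    return [
        start // bytes_per_sector
        for start in range(0, longest, bytes_per_sector)
        if fat1[start:start + bytes_per_sector] != fat2[start:start + bytes_per_sector]
    ]


def patch_sector(damaged, good, sector, bytes_per_sector=512):
    """Return a copy of 'damaged' with one sector replaced from 'good'."""
    start = sector * bytes_per_sector
    end = start + bytes_per_sector
    return damaged[:start] + good[start:end] + damaged[end:]


second_copy = bytes(range(256)) * 4                   # a stand-in for a healthy FAT
first_copy = second_copy[:512] + b"X" * 512           # its second sector clobbered
print(differing_sectors(first_copy, second_copy))     # [1]
print(patch_sector(first_copy, second_copy, 1) == second_copy)   # True
```

A mismatch only says that the copies differ, not which one is right. Look at both versions of the sector before deciding which copy to trust.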

A more frequent problem is one in which the sector is written to the correct location, but it is corrupted during transmission. Transients are a typical cause of this problem. Remember the serial stream of bits being sent from the controller to the hard disk media. Any power disruption can result in timing being lost, and bits being dropped. Imagine what happens when a single bit is lost in a sector. All the remaining bytes are now shifted over one bit. A space character ("20" hex) now is stored as a hex "40". Additionally the checksum has been shifted so it will not match the result computed by the controller during the next read. "Error reading drive ..." is the result. See Figure 7 for an example demonstrating this problem. A sector from the root directory of a DOS 3.3 system diskette has been subjected (artificially) to two separate bit losses. The immediate observable effect is that the entries in the directory are garbled at a particular point. A recovery strategy would be to repair the checksums on the damaged sector (by simply rewriting it) and then use the DOS RECOVER utility to take the FAT and use it to reconstruct the root. Lots of manual labor is needed to redefine the sub-directories and move the reclaimed files to their proper place. RECOVER puts everything into the root directory.
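The effect of a single lost bit is easy to reproduce. The Python sketch below treats a run of bytes as one long string of bits, most significant bit first, removes one bit, and pads the end with a zero. The bit ordering and the padding are simplifying assumptions made for the example, since the real encoding on the media is more involved.

```python
def drop_bit(data, bit_index):
    """Return 'data' with one bit removed and everything after it shifted up."""
    bits = "".join(f"{byte:08b}" for byte in data)
    shifted = bits[:bit_index] + bits[bit_index + 1:] + "0"
    return bytes(int(shifted[i:i + 8], 2) for i in range(0, len(shifted), 8))


original = b"AB  "
damaged = drop_bit(original, 8)        # lose the first bit of the second byte
print(original.hex(), damaged.hex())   # 41422020 41844040
```

The first byte survives, the "B" turns into garbage, and both spaces (hex 20) come out as hex 40, just as described above.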

Media errors are caused by some outside electrical or physical action. The classic "head crash" problem is where the Read/Write head, which usually floats over the surface of the disk platter, has bounced on the media and gouged a part of the surface. The problem can vary from a single "touch down" with an attendant damaged spot, to a full fledged crash in which the head makes enough contact with the surface to score all the way through the platter. The outside ring then falls free. Ouch!

A variant on the transient problem is where either the address mark at the start of the track or the sector preamble has been damaged. In the first case the computer cannot even find the track. The message "address mark not found" results. That track is gone. Special electronic techniques are needed to read the remaining signals on the track and reconstruct the sectors. Commercial data recovery services exist. A low level format of the track will rebuild the address mark, but of course will obliterate the track's contents. If the preamble has been damaged (it also has a checksum), we get the "Sector not found" message. That sector is unreadable by BIOS and is considered lost. A Low Level Format is the usual way to rebuild the preamble record for each sector on that track. Again heroic technical measures are needed.

Let's now consider what happens when the sequence of clusters controlled by the FAT has been altered. Let's say one cluster from the middle of the data base has been dropped. We now have a mismatch of data from one cluster to the next. Fixed length format files will show the anomaly of wrong data in a field. For example, you might see an address in the person's name field in a Name and Address data base. From this point onwards all the records have been shifted. Variable length formats will probably die at this point when the file is being read. This symptom can also arise from an incorrect selection of clusters during an unerase operation. The wrong cluster pasted into the file will be quite apparent, but selecting the correct one is a challenge.

Bad electrical contacts are a source of a variety of problems. Sometimes a controller board can become mis-seated. That is, the electrical contact between the controller and the system board has been disrupted. A usual symptom is that the disk fails during POST testing (17xx errors) and refuses to respond. The easiest thing to do is remove the controller card, clean the contacts, and make sure the board is completely reseated in the system board. If this fails, take two aspirin and call the disk repair service in the morning.

Broken or marginal contacts cause a different problem. When the computer is first powered on the disk does not respond. A while later the disk can be accessed from a diskette booted system. A disk testing program shows a large number of defective clusters. After a while fewer clusters are bad, until finally no clusters test bad. The dynamic changes are due to heating effects within the computer causing a marginal contact to expand and close, making the defect appear to go away. This is a quite simple analysis of the problem. The recovery strategy is an opportunistic one. When the machine is warm enough to extract all the needed files, back them up and have the disk unit repaired or replaced.

So far we have been looking at failures during the writing of a sector. Read time errors are even worse. Consider the effect on the program innocently requesting a sector to read, if the hardware presents the wrong one. The software may report the same symptom as caused by a corrupted sector during a prior write. The software may accept the erroneous data and pass it on. For example, BACKUP would be vulnerable to this kind of failure. Your safely backed up files now contain defects. I call this result "Swiss Cheese", lots of holes. An early indication of this problem would be when you see strange data, resembling a list of zip codes, in the middle of a record in the Name and Address data base. Either the sector in the data base has been clobbered by a stray sector from the zip code index file, or we are experiencing random read time errors which are presenting the wrong sector. One interesting controller I encountered would successfully read the Master Boot Sector, and then produce that same sector on complete tracks elsewhere on the disk. Some tracks were readable, others seemed to contain the same, but illogical, contents for all 17 sectors. There is only one Master Boot sector on a hard disk, not one in each sector of an arbitrary track. It was as if the Master Boot sector got stuck in the controller's buffer.

Warnings

Some of the more insidious disk failures are preceded by some fairly obscure warning signs. One symptom is the "I'm sure I changed that data". Here the controller may be deciding not to write some updates back to the disk. Preserve your backups and keep an open eye for future anomalies. The intermittent update problem could be a cause of "cross linked" files as reported by CHKDSK. Consider that if FAT sectors are not being updated, then the next time DOS needs a free cluster it will retrieve an incorrect version of a FAT sector. The result is that two files now share the same starting cluster. This is a warning.
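A cross link check amounts to walking every file's chain and noting any cluster that is claimed twice. The Python sketch below shows the idea and is not a description of how CHKDSK does it. The FAT is modeled as a list of integers, the files are given as a dictionary of name to starting cluster, and the names and the 16 bit end of file value are assumptions of the example.

```python
def find_cross_links(fat, start_clusters, end_of_chain=0xFFF8):
    """Return (cluster, first_owner, second_owner) for each shared cluster found."""
    owner = {}
    cross_links = []
    for name, start in start_clusters.items():
        cluster = start
        visited = set()               # guards against a chain that loops on itself
        while 2 <= cluster < len(fat) and cluster < end_of_chain and cluster not in visited:
            visited.add(cluster)
            if cluster in owner:
                cross_links.append((cluster, owner[cluster], name))
                break                 # the rest of the chain belongs to the other file
            owner[cluster] = name
            cluster = fat[cluster]
    return cross_links


# File A owns clusters 2, 3 and 4; file B starts at 5 and then runs into 3.
fat = [0xFFF8, 0xFFFF, 3, 4, 0xFFFF, 3]
print(find_cross_links(fat, {"A": 2, "B": 5}))    # [(3, 'A', 'B')]
```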

In Conclusion

This article is admittedly highly subjective in that the analysis and repair strategies were arrived at on the spot. We had a dead or failed hard disk and needed to restore it as soon as possible. I encourage the readers to share their own experiences with disk corruption and file repair. We all can profit from each other's experiences.

Grey Staples, CDP, CCP is the President of Camelback Systems, Inc., Scottsdale, Arizona. The company specializes in data recovery and repair of damaged files and DOS control areas. Specific products supported include Lotus, dBase and SMART. Mr. Staples has degrees in Physics and Electrical Engineering, and is a 35 year veteran of the computer industry. He has achieved both Certified Data Processor and Certified Computer Programmer status, granted by the Institute for the Certification of Computer Professionals.

Published in 1988 Updated: 02-19-99