File puzzling

From Wikiversity
Jump to navigation Jump to search

If a file system is corrupted, this may result in file and folder references to get irreparably lost, among names, attributes, date and time information, original file sizes, and original on-disk locations (cluster numbers).

If only the file system itself is damaged or destroyed, but the data storage device is still functional, file contents usually still exist in the form of orphaned fragments scattered on the disk. Orphaned directories with references to other files and subdirectories may have survived as well.

In order to recreate the complete files, those file fragments will have to be located and puzzled together.

Prerequisites for this lesson are basic understanding of data storages' logical block addresses (LBAs) and conversion between ASCII and hexadecimal characters, as well as working with a HEX editor.

Causes[edit | edit source]

A flash memory controller may corrupt the frequently modified file system as a result of an unstable voltage. Such a condition, known as brown-out, may be caused by faulty USB and/or memory card reading hardware, or loose contacts.

It might also be caused by attaching too many devices to a USB hub connected to a too weak power source, such as the USB port of an older computer, or a smartphone using USB On-The-Go connection, where the power source is a lithium-ion battery with around 3.7 volts[a], boosted to 5 volts.

Another possible hazard is human error, such as choosing the wrong target device to overwrite a disk image to. The user can halt the process upon realizing the error, though the file system structure and root directory, located at the beginning, would have already been overwritten almost immediately.

Damage[edit | edit source]

Some file systems, such as ext4, store all file and folder paths and attributes in a tree structure within a static, fixed allocated space at early sectors (or blocks) of the disk space, while others, such as FAT32 and exFAT may store directories at any location, linked to from the parent directory like with usual files.

While the former method improves seeking speeds, the latter limits damage that affects early blocks of the disk, as names of files and subfolders may be stored at a farther location on the disc that is unaffected by the damage. Such orphaned directories can be discovered by data recovery software.

Additionally, the former is prone to so-called inode exhaustion, where the fixed space allocated for file and folder information may get full as a result of too many small files that do not fill the file content space.

With the latter, entries (names, attributes, and first logical block address (LBA)[b]) of files and directories) within orphaned directories are intact. However, the name of the orphaned directory itself is stored in its parent directory. Retrieving it requires that said parent directory is not affected by the damage. File fragmemt locations, known technically as cluster chains, are usually stored in a map at the beginning of a disk, though the exact implementation may vary depending on file system.

Directories created the earliest tend to have been written closer towards the beginning, therefore are more likely to have been overwritten by errors affecting early sectors. Particularly, the root directory is usually the first, as such the most vulnerable.

How files are written[edit | edit source]

Sequential writing with file appending causing file fragmentation

When files are written to the disk on a newly formatted file system, files can be written sequentially without fragmentation, except if existing files are appended to, meaning that their size is grown due to new added content, as shown in the image.

This image also shows how a file defragmentation process works, though that is unnecessary for flash storage due to almost latency-free random access, as no mechanical parts need to be moved in order to access a different part of the disc.

If a file is deleted, the blocks formerly occupied by it are deallocated, thus considered ready for new incoming data. If a new file with a larger size than the cleared blocks is being added, the following occupied blocks need to be skipped, and rest of the file needs to be written to free blocks thereafter. The file is now fragmented.

In simple terms, the puzzle pieces of the file are scattered, and the file system memorizes the on-disk location of said pieces. If the file system is corrupted, that information is lost.

More sophisticated file systems such as ext4 tend to start writing new files further apart from each other to alleviate (reduce) fragmentation forming in the first place, as the chances of new files outgrowing the remaining unreserved sequential blocks is reduced.

Recovery[edit | edit source]

Data recovery software detects file formats by their so-called header, where the first few bytes of the file mark its beginning. For example, for JPEG and BMP (bitmap) files, the hexadecimally represented header bytes, are FF D8 and 42 6D (BM as ASCII text) respectively. Many file formats also have a footer or distinct content which marks the end of a file, such as the FF D9 bytes for JPEG.

Data recovery software may neglect the file footer, expecting files to possibly be truncated (incomplete), which could cause a binary file such as a photo or video to be detected as multiple separate files. Those will have to be concatinated (stuck) together.

Plain text files, and UTF-8 text files which lack a so-called byte order mark (BOM) can only be found by scanning the entire partition for any text string known to have been written into the file. Hex editing software typically has such functionality. Some are able to save time by scanning through the full partition and list all results in one go rather than stopping at every result. The portion which contains any found text could be highlighted in the Hex editor and exported into a new text file.

Video files[edit | edit source]

Note that video files with the so-called moov atom at the end of the file might be unplayable unless complete, namely those with a program stream format, as a result of foregoing overhead information to slightly reduce file size at a similar quality.

Program streams are typically used by mobile camera software, whereas transport streams, commonly used by dedicated camcorders, remain playable even if truncated.

Orphaned directories[edit | edit source]

Similarly, data recovery software is able to detect orphaned directories, which reference other files and directories, but are not referenced to by any discoverable parent directory.

Legacy FAT[edit | edit source]

(FAT32 / FAT16 / FAT12)

For the widely used FAT32 file system (and also FAT16 and FAT12), the first two directory entries . and .. mark the beginning of a directory, which may be an orphaned one that contains references to other files and directories.[1]

The first twelve bytes of the . entry are hexadecimally 2E 20 20 20 20 20 20 20 20 20 20 10, where 2E is the actual dot, 20 are spaces that fill unused characters in the 8.3 file name entry, and 10 is attribute byte that marks the entry as directory.

exFAT[edit | edit source]

The successor file system exFAT, commonly preformatted on SD XC memory cards, mainly benefits by removing the 4 GiB (232-1 -byte) file size confinement, though it uses an overhauled layout where these two header entries do not exist, making orphaned directories more difficult to detect.[c]

Like with FAT32, FAT16, and FAT12, each entry is 32 Bytes long, and references with longer file names may reserve the space of multiple entries. An examination suggests, and documentation confirms that each reference starts with an 85 hexadecimal byte.[2]

The second entry (group of 32 bytes) contains a file or subfolder reference's metadata such as date and time, byte size, and number of the first containing cluster. It starts with C003 for unfragmented (contiguous) files, and C001 for fragmented. This is a benefit of exFAT, intended to improve efficiency by saving time when reading unfragmented files, as their file fragments do not need to be looked up and followed through the allocation table (fragment bitmap) at all. FAT32/16/12 entries do not contain the bit of information that signifies whether the referenced file is fragmented or contiguous.

The rest of the reference's entries, which contain its unicode name, start with C100, after which the name's characters follow. In distinction to FAT32/16/12, references extend downwards with increasing name length rather than upwards from the base entry that contains metadata.

If some file names or file name portions are known, and they only contain ASCII characters (which most files should do, as it includes alphanumericals), file names inside orphaned directories may be discoverable with each character separated by a NULL byte (00). For example, a file name reference that contained .mp4, hexadecimally 2E 6D 70 34, would have been written into the directory as 2E 00 6D 00 70 00 34 00, and .jpg as 2E 00 6A 00 70 00 67 00. Note that ASCII characters are case-sensitive.

For easier overview, grouping bytes into multiple can typically be set in the HEX editor software. In two-byte groups, this would make the above example appear as 2E00 6A00 7000 6700.

Hex editors' hexadecimal string search feature disregards whitespace, meaning how bytes input in search are grouped makes no difference.

Some Hex editors may be able to scan the entire storage for a desired string in one go, and list and even export the results, to save time.

ext3, ext4[edit | edit source]

These file systems are widely used among Linux-based systems. Unfortunately for recovery after accidental deletion, their deletion process is more destructive, as it nullifies information in the inode about clusters, extents (fragments), and file size, meaning that searching for file headers remains the only resort to discover files again, though fragmentation may be reduced in comparison to earlier file systems due to its more spread-out writing pattern.[3]

In comparison, the deletion process of FAT32/16/12 and exFAT is more peservative, leaving file names[d], file sizes, first clusters, and date/time information intact. Only references in the cluster allocation table are cleared immediately upon deletion. This allows for a more complete recovery imminently after a deletion accident.

Cluster sizes[edit | edit source]

Files recovered by a data recovery software could appear slightly larger than the original files, because each of their ends reserving an entire cluster without filling it out, meaning that the size of files recovered this way may be any multiple of the cluster size. Clusters are logical units used by file systems to facilitate handling files, and can be any double of 512 bytes (e.g. 512 bytes (1 block per cluster), 1024 bytes (1KB, 2 blocks), 2KB (4 blocks), 4KB (8 blocks), etc.). They are preset by the storage device vendor, though can be changed through reformatting.

Larger cluster sizes may enhance sequential reading performance and reduce fragmentation, but leave a larger unused overhead (wasted space) on smaller files, because any cluster occupied by the file can not be used by other files. For example, if the cluster size an is set to 16 KB (32 blocks per cluster), a file with any size under 16384 bytes, even a text file with just one character reserves 16384 bytes of space, and thus the space of 16384 plain text characters on disk.[e]

Salvaging the pieces[edit | edit source]

After all recoverable file fragments with a guessed file extension have been recovered by the software, those pieces need to be puzzled together.

By chance, much of it is already intact. Sometimes, multiple sequential file fragments detected as separate belong to the same original file. In such a case, the remaining file fragments may have a rarely used file extension, such as .BMP if that sector happens to start with the two letters BM.

A sign of a recovered file being a fragment of a different original file is if its first few sectors viewed using a HEX editor contain no human-readable metadata such as a time stamp, camera vendor name, and other EXIF information, though some readable information might also exist towards the end of an original file. This applies to binary files such as multimedia (pictures, video, audio), not plain text, markup, source code, spreadsheets, object notations, etc., which are largely human-readable, though less likely to be fragmented due to presumably smaller average sizes.

If fragmentation occured, the next fragment of a given file may be few recovered file fragments down, but could also be farther away.

Throughout the puzzling process, whole files and fragments of successfully pieced files can be moved into a separate directory, so that the pool of remaining fragments shrinks, making it easier to find remaining pieces.

From file lists[edit | edit source]

File lists created using the ls -alR in Linux or dir /s commands in Microsoft Windows (more details) contain file names, sizes, and last modified date/time attributes, which can migitate the loss of information and assist in the guesswork of putting the file pieces together.

Concatenating the pieces[edit | edit source]

In order to append one file to the end of another, a type file2 >>file1 command can be used in Windows, and cat file2 >>file1 in Linux. cat file1 file2 >>file3 produces a separate file out of any specified files before it.

The commands also support w:Wildcards. For example, *.txt would select all text files, and ?????? would match any file with a six-character name.

Notes[edit | edit source]

  1. Battery voltage depends on charging level and power demand while discharging, or input current while charging. However, USB-OTG reserves the charging port, and compatibility for adapters which allow simultaneous charging by simulating a docking station may vary. A full Lithium-Ion battery is at around 4.3 volts, while a depleted one is at around 3.3 volts, with quickly decreasing voltage upon further depletion.
  2. A file's first LBA is the address to the block its beginning occupies.
  3. The lack of these entries does not affect the functionability of . and .. to refer to the current and parent directory respectively in a terminal / command prompt.
  4. On FAT32/16/12, the first character of a short (8.3) file name is overwritten with an E5 byte. If no file name in the long VFAT format is stored, the first character of the file name is lost, though it can usually be guessed.
  5. Some more sophisticated file systems may support storing multiple small files within the same sector or cluster.

See also[edit | edit source]

References[edit | edit source]