I often see hard drives that have bad blocks, meaning that one or more blocks on the drive cannot be read. When the drive is in a Windows PC, the computer may appear very slow because the operating system repeatedly attempts to read the bad block. Sometimes the operating system cannot start at all because the bad block is part of a critical file. It is easy to identify the bad block using the badblocks command, but badblocks doesn't indicate which file is affected. This tutorial explains the three steps necessary to find the NTFS filename associated with the bad block.
The first step with any failing drive is to clone it. I prefer to clone by partition, so I start by checking the partition table using fdisk:
255 heads, 63 sectors/track, 0 cylinders, total 2048 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xd192b742

   Device Boot      Start         End      Blocks   Id  System
/dev/sdd1            2048    31459327    15728640   27  Hidden NTFS/WinRE
/dev/sdd2   *    31459328    31664127      102400    7  HPFS/NTFS/exFAT
/dev/sdd3        31664128  1250260991   609298432    7  HPFS/NTFS/exFAT
Then I clone the individual partitions using ddrescue:
bash# ddrescue -v -d -r 3 /dev/sdd3 sdd3.img logfile
In this case I'm only imaging the third partition because it contains the data (and the bad block). When ddrescue is finished, I have an image of the retrievable data in the sdd3.img file and a logfile which tells me where the bad data is located. Here are the contents of logfile:
# Rescue Logfile. Created by GNU ddrescue version 1.16
# Command line: ddrescue -d /dev/sdd3 sdd3.img logfile
# current_pos  current_status
0x02499600     +
#      pos        size  status
0x00000000  0x02499800  +
0x02499800  0x00000200  -
0x02499A00  0x9142566600  +
Error lines end with the - character. This logfile shows one bad block: it is located at 0x02499800 relative to the start of the partition, and it is 0x200 = 512 bytes long.
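Reading the logfile by eye works for one bad block, but a small shell helper can pull out every error line. This is only a sketch: bad_regions is a name I made up for illustration, it looks only at lines whose status is -, and it ignores any other status characters ddrescue may write.

```shell
# bad_regions LOGFILE
# Print "offset size" in decimal bytes for every region that ddrescue
# marked with '-' (bad) in the logfile. Hypothetical helper name.
bad_regions() {
    # Skip comment lines. The current_pos line has only two fields,
    # so its third field (status) is empty and it is skipped as well.
    grep -v '^#' "$1" | while read -r pos size status; do
        if [ "$status" = "-" ]; then
            # Shell arithmetic accepts the 0x-prefixed hex values directly.
            echo "$(($pos)) $(($size))"
        fi
    done
}
```

Run against the logfile above, bad_regions logfile should print 38377472 512, which is the position and length of the bad block in decimal.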
The second step is to convert the position of the bad block into an NTFS cluster number. The bad block is located at 0x02499800, which is 38377472 in decimal. NTFS is organized in clusters, so we need to know the cluster size. We can find it using the ntfsinfo command, like this:
bash# ntfsinfo -m -f sdd3.img
WARNING: Dirty volume mount was forced by the 'force' mount option.
Volume Information
    Name of device: sdd3.img
    Device state: 3
    Volume Name: Gateway
    Volume State: 27
    Volume Version: 3.1
    Sector Size: 512
    Cluster Size: 4096
    Index Block Size: 4096
    Volume Size in Clusters: 24414061
MFT Information
    MFT Record Size: 1024
    MFT Zone Multiplier: 0
    MFT Data Position: 24
    MFT Zone Start: 786432
    MFT Zone End: 3838189
    MFT Zone Position: 786432
    Current Position in First Data Zone: 3838189
    Current Position in Second Data Zone: 0
    LCN of Data Attribute for FILE_MFT: 786432
    FILE_MFTMirr Size: 4
    LCN of Data Attribute for File_MFTMirr: 2
    Size of Attribute Definition Table: 2560
FILE_Bitmap Information
    FILE_Bitmap MFT Record Number: 6
    State of FILE_Bitmap Inode: 80
    Length of Attribute List: 0
    Attribute List: (null)
    Number of Attached Extent Inodes: 0
FILE_Bitmap Data Attribute Information
    Decompressed Runlist: not done yet
    Base Inode: 6
    Attribute Types: not done yet
    Attribute Name Length: 0
    Attribute State: 3
    Attribute Allocated Size: 3055616
    Attribute Data Size: 3051760
    Attribute Initialized Size: 3051760
    Attribute Compressed Size: 0
    Compression Block Size: 0
    Compression Block Size Bits: 0
    Compression Block Clusters: 0
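So the cluster size is 4096 bytes. The cluster containing the bad block is then 38377472 / 4096 = 9369, rounding down (the exact quotient is 9369.5, so the bad block sits halfway through cluster 9369).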
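The same arithmetic can be done in the shell, which saves converting hex by hand. This is a minimal sketch: cluster_size and offset_to_cluster are hypothetical helper names, and the sed pattern assumes the "Cluster Size:" line looks like the ntfsinfo output above.

```shell
# cluster_size
# Read "ntfsinfo -m" output on stdin and print the Cluster Size value.
cluster_size() {
    sed -n 's/^[[:space:]]*Cluster Size:[[:space:]]*//p'
}

# offset_to_cluster BYTE_OFFSET CLUSTER_SIZE
# Integer division, so a block in the middle of a cluster rounds down
# to the cluster that contains it. BYTE_OFFSET may be hex (0x...).
offset_to_cluster() {
    echo "$(( $1 / $2 ))"
}

offset_to_cluster 0x02499800 4096   # prints 9369
```

With the image at hand, something like ntfsinfo -m -f sdd3.img 2> /dev/null | cluster_size should print 4096.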
Sometimes I make an image of the whole drive, not just the NTFS partition. In these cases, I need to use losetup to create a loopback device for the NTFS partition, then use ntfsinfo on that loopback device. Here are the relevant commands.
First find the byte offset of the partition using a calculator. For instance, if the starting sector of sdd3 is 31664128 and the sector size is 512 bytes, then the offset is 31664128 * 512 = 16212033536 bytes.
bash# losetup -o 16212033536 /dev/loop5 sdd.img
bash# ntfsinfo -m -f /dev/loop5
... ntfsinfo output ...
bash# losetup -d /dev/loop5
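The offset passed to losetup can also be computed by the shell instead of a calculator. A small sketch, with partition_offset as a made-up helper name; it relies on the shell doing 64-bit arithmetic, which bash and dash on Linux do.

```shell
# partition_offset START_SECTOR SECTOR_SIZE
# Byte offset of a partition inside a whole-disk image.
partition_offset() {
    echo "$(( $1 * $2 ))"
}

partition_offset 31664128 512   # prints 16212033536
```

It can then be used inline, for example losetup -o "$(partition_offset 31664128 512)" /dev/loop5 sdd.img.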
The last step is to find which file (if any) uses cluster 9369. We can do that using the ntfscluster command, like this:
bash# ntfscluster -f -c 9369 sdd3.img 2>> /dev/null
Searching for cluster 9369
Inode 89381 /Windows/System32/atidxx64.dll/$DATA
So the problem file in this case is the ATI DirectX driver, C:\Windows\System32\atidxx64.dll. I was able to mount the partition, delete the file, and force a write to the bad block, after which the drive passed its long SMART self-test.
It should also be pretty easy to adapt this procedure into a short script which will process a longer ddrescue log file and give a list of NTFS file names.
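Here is a rough sketch of what such a script might look like. The function names bad_clusters and bad_files are my own inventions, only - lines in the logfile are treated as errors, and ntfscluster is called with the same -f -c options used above, once per cluster. I have not tuned it for logfiles with very large bad regions.

```shell
# bad_clusters LOGFILE CLUSTER_SIZE
# Print every cluster touched by a bad ('-') region, one per line,
# sorted and without repeats.
bad_clusters() {
    grep -v '^#' "$1" | while read -r pos size status; do
        [ "$status" = "-" ] || continue
        # First and last cluster covered by this bad region.
        first=$(( $pos / $2 ))
        last=$(( ($pos + $size - 1) / $2 ))
        c=$first
        while [ "$c" -le "$last" ]; do
            echo "$c"
            c=$((c + 1))
        done
    done | sort -n -u
}

# bad_files LOGFILE IMAGE CLUSTER_SIZE
# Ask ntfscluster which file (if any) owns each bad cluster.
bad_files() {
    for c in $(bad_clusters "$1" "$3"); do
        ntfscluster -f -c "$c" "$2" 2> /dev/null
    done
}
```

For the example in this tutorial the call would be bad_files logfile sdd3.img 4096, and bad_clusters logfile 4096 on its own should print just 9369.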
ddrescue is slow; I usually let it run overnight. rsync would be much faster. Sometimes I clone using ntfsclone because it's faster, but the result is not as useful for file retrieval, and ntfsclone sometimes dies when it hits a read error.
The other issue I wanted to address, but didn't get to last night, is using the LBA number from SMART to find the NTFS filename. This would be faster than the ddrescue or rsync procedures (but won't give me a clone file). In this case the smartctl command gives us an LBA number, such as 0x01e44ccc (31739084 in base 10), which is the offset from the beginning of the disk, counting LBA 0 as the first block.
If the sdd3 partition starts at LBA 31664128 (see the fdisk output in step 1), then the LBA offset from the beginning of the partition is 31739084 - 31664128 = 74956. With an NTFS cluster size of 4096 and an LBA size of 512, there are 8 LBAs per cluster, so the cluster number is 74956 / 8 = 9369 (rounded down from 9369.5). This is the same cluster number we found in step 2 using ddrescue's logfile. Then proceed with step 3.
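The LBA arithmetic is easy to wrap in a shell function as well. This is only a sketch with a made-up name, lba_to_cluster, and it assumes the cluster size is a whole multiple of the sector size.

```shell
# lba_to_cluster DISK_LBA PARTITION_START_LBA SECTOR_SIZE CLUSTER_SIZE
# Convert an LBA counted from the start of the disk into an NTFS
# cluster number counted from the start of the partition.
lba_to_cluster() {
    sectors_per_cluster=$(( $4 / $3 ))
    # Integer division rounds down to the cluster holding the sector.
    echo "$(( ($1 - $2) / sectors_per_cluster ))"
}

lba_to_cluster 0x01e44ccc 31664128 512 4096   # prints 9369
```

The first argument can be given in hex exactly as smartctl reports it, or in decimal.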
For scripting, smartctl output would be a little harder to parse than a ddrescue logfile, but fifteen or twenty minutes of tuning a sed filter would probably give me what I want.