How to Repair HDD S.M.A.R.T. Errors



●What the self-test reveals (ultimately could not be fixed)

 Reference URL: Identifying the damaged file when a Linux HDD self-test reports an error
 Reference URL: Handling Current_Pending_Sector on Linux LVM
 Reference URL: How to check for and fix disk errors and bad sectors on Linux

 Run a short test.
# smartctl -t short /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.11.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Tue Dec 29 22:59:38 2020

Use smartctl -X to abort test.
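 The waiting time does not have to be read by eye. The following is a minimal sketch, assuming the "Please wait N minutes" line keeps the wording shown above; selftest_wait_minutes is a hypothetical helper name, not part of smartmontools.

```shell
# Print the number of minutes smartctl asks us to wait, reading the text
# of `smartctl -t ...` from standard input.
selftest_wait_minutes() {
    sed -n 's/^Please wait \([0-9][0-9]*\) minutes for test to complete\.$/\1/p'
}

# Hypothetical usage on a real system (not executed here):
#   mins=$(smartctl -t short /dev/sda | selftest_wait_minutes)
#   sleep "$((mins * 60))" && smartctl -l selftest /dev/sda
```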
 Check the result after about two minutes.
# smartctl -a /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.11.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Blue
Device Model:     WDC WD20EZRZ-00Z5HB0
Serial Number:    WD-WCC4M*******
LU WWN Device Id: 5 0014ee 2b83a244b
Firmware Version: 80.00A80
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Dec 29 23:00:27 2020 JST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
					was suspended by an interrupting command from host.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 121)	The previous self-test completed having
					the read element of the test failed.
Total time to complete Offline 
data collection: 		(25980) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 263) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x7035)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       61
  3 Spin_Up_Time            0x0027   186   174   021    Pre-fail  Always       -       3691
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       20
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       1
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   077   077   000    Old_age   Always       -       16794
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       20
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       9
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       1411117
194 Temperature_Celsius     0x0022   120   105   000    Old_age   Always       -       27
196 Reallocated_Event_Count 0x0032   199   199   000    Old_age   Always       -       1
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       26
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       18
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       42

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%     16794         12316904
# 2  Extended offline    Completed: read failure       90%     16761         12316904

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 A couple of entries stand out.
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       26
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       18
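 The normalized VALUE is still 200 for both, the same as WORST and well above THRESH, but the raw counts of 26 and 18 are a concern.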
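 To watch these two counters over time, the RAW_VALUE column can be pulled out of the attribute table. This is only a sketch, assuming the ten-column layout shown above; smart_raw_value is a hypothetical helper name.

```shell
# Print the RAW_VALUE column of one attribute from `smartctl -A` output
# on standard input. $1 is the attribute ID (for example 197).
smart_raw_value() {
    awk -v id="$1" '$1 == id && NF >= 10 { print $10 }'
}

# Hypothetical usage on a real system (not executed here):
#   smartctl -A /dev/sda | smart_raw_value 197
```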
 The self-test log also shows Completed: read failure.
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%     16794         12316904
# 2  Extended offline    Completed: read failure       90%     16761         12316904
 The error is at the LBA_of_first_error value, 12316904.
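 The failing LBAs can also be collected from the self-test log automatically. A small sketch, assuming the log lines start with "#" and end with the LBA as in the output above; failed_selftest_lbas is a hypothetical name.

```shell
# List the distinct LBA_of_first_error values of failed self-tests,
# reading `smartctl -l selftest` output from standard input.
failed_selftest_lbas() {
    awk '/^#/ && /read failure/ { print $NF }' | sort -n -u
}

# Hypothetical usage on a real system (not executed here):
#   smartctl -l selftest /dev/sda | failed_selftest_lbas
```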
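 The steps below work out which inode on the disk is affected by the error,
 and then identify the damaged file name from that inode.
 First, determine which inode the location given by LBA_of_first_error belongs to.
 In this example LBA_of_first_error is 12316904, which is the number of the sector where the error occurred.
 Check which partition this sector is on.
 Run fdisk.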
# fdisk -lu /dev/sda
WARNING: fdisk GPT support is currently new, and therefore in an experimental phase. Use at your own discretion.

Disk /dev/sda: 2000.4 GB, 2000398934016 bytes, 3907029168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk label type: gpt
Disk identifier: 5B9C0C8C-AEC6-4809-8D5E-C86F84FB09EF


#         Start          End    Size  Type            Name
 1         2048       411647    200M  EFI System      EFI System Partition
 2       411648      1435647    500M  Microsoft basic 
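 3      1435648   3907028991    1.8T  Linux LVM ← (if the type were Linux, the ordinary method would apply)
 The Start and End columns give each partition's first and last sector; 12316904 falls within partition 3.
 Because this is an LVM partition, the logical volume has to be determined as well.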
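 The range check against the Start and End columns can be written down as follows. This is a sketch with a hypothetical helper, partition_for_lba, that expects "partition start end" triples copied from the fdisk table rather than parsing fdisk output itself.

```shell
# Read "<partition> <start> <end>" triples (sector units) from standard
# input and print the partition whose range contains the LBA given as $1.
partition_for_lba() {
    awk -v lba="$1" '(lba + 0) >= ($2 + 0) && (lba + 0) <= ($3 + 0) { print $1 }'
}

# Example with the table above:
#   printf '1 2048 411647\n2 411648 1435647\n3 1435648 3907028991\n' |
#       partition_for_lba 12316904
```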

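 Subtract the starting position.
 (12316904 - 2048) = 12314856
 (In hindsight, the LVM partition, partition 3, starts at 1435648, so the offset within it should be 12316904 - 1435648 = 10881256; the steps below still use 12314856.)
 Get the PE size of the physical volume.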
# pvdisplay -c /dev/sdb1 | awk -F: '{print $8}'
4096
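 Note: this 4096 is not the filesystem block size. pvdisplay -c reports the PE size in KiB, and 1 KiB is two 512-byte sectors, so multiply by 2 to get the PE size in sectors.
 4096 * 2 = 8192
 Divide the sector offset by the PE size in sectors; the integer part is the PE number.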
 12314856 / 8192 = 1503.2783203125
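 The same division can be scripted. The sketch below is an assumption-laden variant: it measures the offset from the start of the LVM partition in the fdisk table (1435648) and also subtracts pe_start (2048 sectors, an LVM default assumed here), so for this disk it yields PE 1328 rather than 1503; both fall in the same extent range. pe_index_for_lba is a hypothetical name.

```shell
# Physical extent that holds a disk LBA.
#   $1 = LBA on the whole disk     $2 = first sector of the LVM partition
#   $3 = pe_start in sectors       $4 = PE size in KiB (pvdisplay -c, field 8)
pe_index_for_lba() {
    pe_sectors=$(( $4 * 2 ))                 # 1 KiB = two 512-byte sectors
    echo $(( ($1 - $2 - $3) / pe_sectors ))
}

# Example with the values of this disk:
#   pe_index_for_lba 12316904 1435648 2048 4096
```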
 Look up the PE number in the logical volumes.
# lvdisplay --maps | egrep 'Physical|LV Name'
  LV Name                root
    Physical volume	/dev/sdb1
    Physical extents	0 to 715395
    Physical volume	/dev/sda3
    Physical extents	0 to 224737 ← (1503 falls within this range)
  LV Name                home
    Physical volume	/dev/sda3
    Physical extents	224738 to 474737
  LV Name                swap
    Physical volume	/dev/sda3
    Physical extents	474738 to 476755
 smartctl -t short /dev/sdb and smartctl -a /dev/sdb reported no errors, so the bad sector is presumably on /dev/sda3.
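 Searching the extent ranges by hand gets tedious with many volumes. A sketch of the lookup, assuming the "LV Name", "Physical volume" and "Physical extents" line formats shown above; lv_for_pe is a hypothetical helper.

```shell
# Read `lvdisplay --maps` output on standard input and print the LV whose
# segment on physical volume $1 contains physical extent $2.
lv_for_pe() {
    awk -v pv="$1" -v pe="$2" '
        $1 == "LV" && $2 == "Name"          { lv = $3 }
        $1 == "Physical" && $2 == "volume"  { cur = $3 }
        $1 == "Physical" && $2 == "extents" {
            if (cur == pv && pe + 0 >= $3 + 0 && pe + 0 <= $5 + 0) print lv
        }'
}

# Hypothetical usage on a real system (not executed here):
#   lvdisplay --maps | lv_for_pe /dev/sda3 1503
```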
 Find the offset of the first PE from the start of the physical volume.
# grep pe_start $(grep -l /dev/sda3 /etc/lvm/backup/*)
			pe_start = 2048
			pe_start = 2048

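 The references say the following command gives the same value. The 1st PE column it prints, 1.00m, is 1 MiB, which is 2048 sectors of 512 bytes and so matches pe_start = 2048.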
# pvs -o+pe_start /dev/sda3
  PV         VG       Fmt  Attr PSize  PFree 1st PE 
  /dev/sda3  centos00 lvm2 a--  <1.82t    0    1.00m
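 Converting the 1st PE figure into sectors makes the comparison with pe_start explicit. A sketch, assuming the lowercase LVM suffixes k, m and g denote binary units; lvm_size_to_sectors is a hypothetical helper.

```shell
# Convert an LVM size such as "1.00m" (binary units: k, m, g) into
# 512-byte sectors. A value without a known suffix is treated as bytes.
lvm_size_to_sectors() {
    awk -v s="$1" 'BEGIN {
        unit = substr(s, length(s), 1)
        mult = (unit == "k") ? 1024 : (unit == "m") ? 1048576 : (unit == "g") ? 1073741824 : 1
        printf "%d\n", (s + 0) * mult / 512
    }'
}

# Example: lvm_size_to_sectors 1.00m
```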
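 Compute the bad block number with the following formulas.
(first PE of the LV on the partition × PE size) + offset = (0 * 8192) + 2048 = 2048
 Use this result as the partition start position below.
(sector offset within the physical partition - partition start position) / (filesystem block size / 512) = (12314856 - 2048) / (4096 / 512) = 1539101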

 Check whether a file is allocated to the bad block.
# debugfs
debugfs 1.42.9 (28-Dec-2013)
debugfs:  open /dev/mapper/centos00-root
/dev/mapper/centos00-root: Bad magic number in super-block while opening filesystem
debugfs:  icheck 163480835
icheck: Filesystem not open
debugfs:  quit
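 I got stuck here. debugfs only handles ext2/ext3/ext4, so the "Bad magic number" error suggests this volume uses a different filesystem, possibly XFS, the CentOS 7 default.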

●Situation 2

 /var/log/messages contained HDD errors like the following.
# grep "blk_update_request" /var/log/messages
Dec 29 12:25:39 serverA kernel: blk_update_request: I/O error, dev sda, sector 12316904
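 When the log holds many such lines, the affected device and sector numbers can be reduced to a short list. A sketch, assuming the kernel message format shown above; io_error_sectors is a hypothetical helper.

```shell
# Print "<device> <sector>" for every blk_update_request I/O error line
# read from standard input (for example from /var/log/messages).
io_error_sectors() {
    sed -n 's/.*blk_update_request: I\/O error, dev \([^,]*\), sector \([0-9][0-9]*\).*/\1 \2/p' | sort -u
}

# Hypothetical usage on a real system (not executed here):
#   io_error_sectors < /var/log/messages
```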
 Run smartctl to confirm.
# smartctl -t short /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.11.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Wed Dec 30 01:35:01 2020

Use smartctl -X to abort test.
 After about two minutes, run the following command to check the result.
# smartctl -a /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.11.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Blue
Device Model:     WDC WD20EZRZ-00Z5HB0
Serial Number:    WD-WCC4M*******
LU WWN Device Id: 5 0014ee 2b83a244b
Firmware Version: 80.00A80
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Dec 30 01:35:07 2020 JST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
					was suspended by an interrupting command from host.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 121)	The previous self-test completed having
					the read element of the test failed.
Total time to complete Offline 
data collection: 		(25980) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 263) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x7035)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       63
  3 Spin_Up_Time            0x0027   186   174   021    Pre-fail  Always       -       3691
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       20
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       1
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   077   077   000    Old_age   Always       -       16797
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       20
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       9
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       1411371
194 Temperature_Celsius     0x0022   119   105   000    Old_age   Always       -       28
196 Reallocated_Event_Count 0x0032   199   199   000    Old_age   Always       -       1
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       25
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       18
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       42

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%     16797         15757152
# 2  Short offline       Completed: read failure       90%     16794         12316904
# 3  Extended offline    Completed: read failure       90%     16761         12316904

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 Current_Pending_Sector is non-zero (as many as 25 sectors).
 The bad sectors appear to be at 15757152 and 12316904.
 If a sector is really bad, reading it with dd fails.
  Note: the value for dd's bs can be taken from the Sector Sizes line of the smartctl output.
  Example: for "Sector Sizes: 512 bytes logical, 4096 bytes physical" it is the logical size, 512.
# dd if=/dev/sda of=/dev/null bs=512 count=1 skip=15757152
1+0 records in
1+0 records out
512 bytes (512 B) copied, 2.38848 s, 0.2 kB/s
 The read succeeded, so nothing is done for this sector.
 Next, check the other bad sector (12316904).
# dd if=/dev/sda of=/dev/null bs=512 count=1 skip=12316904
dd: error reading `/dev/sda': Input/output error
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.924292 s, 0.0 kB/s
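 The dd read check is easy to wrap in a function so that it can be repeated for every suspect LBA. This is a sketch under the assumption that a sector counts as readable only when dd returns exactly one full sector; sector_readable is a hypothetical helper and works on an image file as well as on a device.

```shell
# Succeed only if one full logical sector can be read at the given LBA.
#   $1 = device or image file   $2 = LBA   $3 = logical sector size (default 512)
sector_readable() {
    bs=${3:-512}
    # Count the bytes dd actually delivers; a failed or short read gives fewer.
    n=$(dd if="$1" bs="$bs" skip="$2" count=1 2>/dev/null | wc -c | tr -d ' ')
    [ "$n" -eq "$bs" ]
}

# Hypothetical usage on a real system (not executed here):
#   sector_readable /dev/sda 12316904 || echo "12316904 is unreadable"
```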
 The read failed, so try writing to the sector with hdparm (--write-sector overwrites it with zeros).
# hdparm --write-sector 12316904 --yes-i-know-what-i-am-doing /dev/sda

/dev/sda:
re-writing sector 12316904: succeeded

# dd if=/dev/sda of=/dev/null bs=512 count=1 skip=12316904
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.00876814 s, 58.4 kB/s
 After the write, the sector became readable.
 There were other bad sectors as well, so I repeated the same procedure for them.
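 Repeating the procedure by hand invites typos in the LBA. The sketch below only prints the hdparm commands for review and does not run them, since each one overwrites a sector; print_rewrite_commands is a hypothetical helper, and the LBAs passed to it are assumed to have failed the dd read check first.

```shell
# Print (do not run) one hdparm rewrite command per LBA.
#   $1 = device, remaining arguments = LBAs confirmed unreadable with dd
print_rewrite_commands() {
    dev=$1
    shift
    for lba in "$@"; do
        echo "hdparm --write-sector $lba --yes-i-know-what-i-am-doing $dev"
    done
}

# Example: print_rewrite_commands /dev/sda 12316904
```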
 Check the result.
# smartctl -t short /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.11.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Wed Dec 30 17:01:17 2020

Use smartctl -X to abort test.
# smartctl -a /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.11.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Blue
Device Model:     WDC WD20EZRZ-00Z5HB0
Serial Number:    WD-WCC4M*******
LU WWN Device Id: 5 0014ee 2b83a244b
Firmware Version: 80.00A80
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Dec 30 17:11:50 2020 JST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
					was suspended by an interrupting command from host.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(25980) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 263) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x7035)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       98
  3 Spin_Up_Time            0x0027   186   174   021    Pre-fail  Always       -       3691
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       20
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       1
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   077   077   000    Old_age   Always       -       16813
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       20
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       9
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       1413054
194 Temperature_Celsius     0x0022   120   105   000    Old_age   Always       -       27
196 Reallocated_Event_Count 0x0032   199   199   000    Old_age   Always       -       1
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       13
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       18
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       42

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     16812         -
# 2  Short offline       Completed: read failure       10%     16797         15758016
# 3  Short offline       Completed: read failure       90%     16797         15757152
# 4  Short offline       Completed: read failure       90%     16797         15757152
# 5  Short offline       Completed: read failure       90%     16797         15757152
# 6  Short offline       Completed: read failure       90%     16794         12316904
# 7  Extended offline    Completed: read failure       90%     16761         12316904

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 Current_Pending_Sector has dropped from 25 to 13.

●Situation

 Mails like the following started arriving for root.
Subject
    OfflineUncorrectableSector
Body
    Device: /dev/sda [SAT], 18 Offline uncorrectable sectors
 The HDD apparently has bad sectors.
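 The state can be checked by running either of the following as root:

 ・short test
  Example: smartctl -t short /dev/sda

 ・long test
  Example: smartctl -t long /dev/sda

 The short test takes a few minutes; the long test took about 390 minutes, probably because of the drive's large capacity.
 The short test reported nothing in particular, so I also ran the long test.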

# smartctl -t long /dev/sda
smartctl 6.5 2016-05-07 r4318 [i686-linux-4.4.13-200.fc22.i686+PAE] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 390 minutes for test to complete.
Test will complete after Mon Jun 17 22:52:46 2016

Use smartctl -X to abort test.


 390 minutes have passed, so check the result.

# smartctl -A -l selftest /dev/sda
smartctl 6.5 2016-05-07 r4318 [i686-linux-4.4.13-200.fc22.i686+PAE] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       3
  3 Spin_Up_Time            0x0027   189   162   021    Pre-fail  Always       -       5516
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       632
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   065   065   000    Old_age   Always       -       25672
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       584
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       66
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       760102
194 Temperature_Celsius     0x0022   115   106   000    Old_age   Always       -       35
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       76
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       18
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   199   000    Old_age   Offline      -       9

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       10%     25614         3823431112
# 2  Short offline       Completed without error       00%     25607         -



 Current_Pending_Sector reports as many as 76 sectors. The first failing sector is 3823431112.

●Repair

 The pending sectors need to be either repaired or reallocated. Writing something to a failing sector should take care of this.

 Reference URL: Technical Memorandum: hard disk errors and SMART pending sectors / reallocated sectors
 Reference URL: Clearing HDD S.M.A.R.T. errors

 Write to the sector with hdparm, which accepts an LBA directly.
# hdparm --yes-i-know-what-i-am-doing --write-sector 3823431112 /dev/sda
/dev/sda:
re-writing sector 3823431112: succeeded
 The write appears to have succeeded.

●Verifying the repair

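 Run the S.M.A.R.T. self-test again to look for remaining errors (the short test sometimes misses them, in which case run the long test).
 Repeating this every 390 minutes for the remaining 75 sectors would be extremely inefficient.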

# smartctl -t long /dev/sda
smartctl 6.5 2016-05-07 r4318 [i686-linux-4.4.13-200.fc22.i686+PAE] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 390 minutes for test to complete.
Test will complete after Mon Jun 20 15:53:11 2016

Use smartctl -X to abort test.
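 One possible way to shorten the wait, not tried in these notes, is a selective self-test that scans only a window around the known LBA (smartctl accepts -t select,N-M on drives that support it). The sketch below only prints such a command; selective_test_command and the default margin of 1000 sectors are hypothetical choices.

```shell
# Print a selective self-test command covering `margin` sectors on each
# side of an LBA. $1 = device, $2 = LBA, $3 = margin (default 1000).
selective_test_command() {
    margin=${3:-1000}
    first=$(( $2 > margin ? $2 - margin : 0 ))   # do not go below LBA 0
    echo "smartctl -t select,${first}-$(( $2 + margin )) $1"
}

# Example: selective_test_command /dev/sda 3823431112 1000
```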


 I also ran a short self-test and checked the results; as expected, the short test does not seem to detect the error.

# smartctl -l selftest /dev/sda
smartctl 6.5 2016-05-07 r4318 [i686-linux-4.4.13-200.fc22.i686+PAE] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     25695         -
# 2  Extended offline    Completed: read failure       10%     25686         3823431113
# 3  Short offline       Completed without error       00%     25680         -
# 4  Short offline       Completed without error       00%     25679         -
# 5  Extended offline    Aborted by host               10%     25679         -
# 6  Extended offline    Completed: read failure       10%     25614         3823431112
# 7  Short offline       Completed without error       00%     25607         -


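 Writing one sector with hdparm and then waiting more than 390 minutes for a long test each time is tedious, so on the off chance that it would work I processed the remaining 75 sectors in one go.
 After that, the test results no longer showed any errors.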

# smartctl -A -l selftest /dev/sda
smartctl 6.5 2016-05-07 r4318 [i686-linux-4.4.13-200.fc22.i686+PAE] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       3
  3 Spin_Up_Time            0x0027   189   162   021    Pre-fail  Always       -       5516
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       632
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   065   065   000    Old_age   Always       -       25710
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       584
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       66
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       760102
194 Temperature_Celsius     0x0022   113   106   000    Old_age   Always       -       37
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       75
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       18
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   199   000    Old_age   Offline      -       7

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     25710         -
# 2  Short offline       Completed without error       00%     25703         -
# 3  Extended offline    Completed: read failure       10%     25701         3823431114
# 4  Short offline       Completed without error       00%     25695         -
# 5  Extended offline    Completed: read failure       10%     25686         3823431113
# 6  Short offline       Completed without error       00%     25680         -
# 7  Short offline       Completed without error       00%     25679         -
# 8  Extended offline    Aborted by host               10%     25679         -
# 9  Extended offline    Completed: read failure       10%     25614         3823431112
#10  Short offline       Completed without error       00%     25607         -
3 of 3 failed self-tests are outdated by newer successful extended offline self-test # 1



 However, the Current_Pending_Sector and Offline_Uncorrectable values did not change.
 So I decided to use dd to force the affected block to be reallocated.


●Reallocating the bad block

 Reference URL: Repairing bad sectors

 Find out which partition the bad sector is on.

# fdisk -l /dev/sda
Disk /dev/sda: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x000716fb

Device     Boot   Start        End    Sectors  Size Id Type
/dev/sda1  *       2048    1026047    1024000  500M 83 Linux
/dev/sda2       1026048 3907028991 3906002944  1.8T 8e Linux LVM


 The LBA 3823431112 lies between 1026048 and 3907028991, so it is on /dev/sda2.
 Next, check the block size.

# tune2fs -l /dev/sda2 | grep Block
tune2fs: Bad magic number in super-block while trying to open /dev/sda2
Couldn't find valid filesystem superblock

Output obtained at a later date:
# tune2fs -l /dev/mapper/fedora-home
tune2fs 1.42.12 (29-Aug-2014)
Filesystem volume name: <none>
Last mounted on: /home
Filesystem UUID: 80425ae2-99e5-4e58-befa-38f93adbdf30
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 118562816
Block count: 474220544
Reserved block count: 23711027
Free blocks: 464782326
Free inodes: 118403878
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 910
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
Flex block group size: 16
Filesystem created: Sun May 18 18:11:34 2014
Last mount time: Tue Jun 21 15:54:00 2016
Last write time: Tue Jun 21 15:54:00 2016
Mount count: 3
Maximum mount count: -1
Last checked: Fri Jun 17 15:40:40 2016
Check interval: 0 (<none>)
Lifetime writes: 328 GB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: 4c19090c-671e-463c-a1ab-a70e70ffb250
Journal backup: inode blocks


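 tune2fs could not read /dev/sda2 (the partition is an LVM physical volume, not a filesystem), which made me worry that the HDD was in serious trouble. The block size is presumably the usual default of 4096, which the later output confirms, so that value is used here.
 Work out which block contains the LBA in question. The formula is as follows.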
      b = (int)((L-S)*512/B)
    where:
    b = File System block number
    B = File system block size in bytes
    L = LBA of bad sector
    S = Starting sector of partition as shown by fdisk -lu
 In this case, b = (int)((3823431112-1026048)*512/4096) = 477800633.
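 The formula translates directly into shell integer arithmetic, which truncates like the (int) cast. A minimal sketch; fs_block_for_lba is a hypothetical name, and the arguments follow the L, S and B of the formula above.

```shell
# b = (int)((L - S) * 512 / B)
#   $1 = L, LBA of the bad sector      $2 = S, first sector of the partition
#   $3 = B, filesystem block size in bytes
fs_block_for_lba() {
    echo $(( ($1 - $2) * 512 / $3 ))
}

# Example with the values of this disk:
#   fs_block_for_lba 3823431112 1026048 4096
```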
# debugfs
debugfs 1.42.12 (29-Aug-2014)
debugfs:  open /dev/sda2
/dev/sda2: Bad magic number in super-block while opening filesystem
debugfs:  open /dev/sda
/dev/sda: Bad magic number in super-block while opening filesystem
debugfs:  ^C
debugfs:  q
Normally the output looks like this:
# debugfs
debugfs 1.40.4 (31-Dec-2007)
debugfs:  open /dev/sda5
debugfs:  icheck 18117607
Block   Inode number
18117607        <block not found>
debugfs:  quit
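Output obtained at a later date (the number passed to icheck here is the raw LBA, which is larger than the filesystem's Block count of 474220544, so "block not found" is the expected answer and says nothing about the bad sector):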
# debugfs
debugfs 1.42.12 (29-Aug-2014)
debugfs:  open /dev/mapper/fedora-home
debugfs:  icheck 3823431112
Block	Inode number
3823431112	<block not found>
debugfs:  quit
 Strictly speaking, this step should be done only after confirming that no inode uses the block, but this time I went ahead regardless.
# dd if=/dev/zero of=/dev/sda2 bs=4096 count=1 seek=477800633
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.000280253 s, 14.6 MB/s
# sync
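 Before a write like this it helps to double-check which disk sectors the block covers. The sketch below inverts the block formula, assuming 512-byte logical sectors; block_lba_range is a hypothetical helper.

```shell
# Print the first and last disk LBA touched by a write of one block.
#   $1 = block number (dd seek)   $2 = block size in bytes (dd bs)
#   $3 = first sector of the partition
block_lba_range() {
    per_block=$(( $2 / 512 ))
    first=$(( $1 * per_block + $3 ))
    echo "$first $(( first + per_block - 1 ))"
}

# Example with the values of the dd command above:
#   block_lba_range 477800633 4096 1026048
```

 For the values used here the range starts at 3823431112, the LBA reported by the first failed long test, and covers the eight sectors up to 3823431119.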
 Run a long test and check the result.

# smartctl -A /dev/sda
smartctl 6.5 2016-05-07 r4318 [i686-linux-4.4.13-200.fc22.i686+PAE] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       3
  3 Spin_Up_Time            0x0027   189   162   021    Pre-fail  Always       -       5516
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       632
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   065   065   000    Old_age   Always       -       25729
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       584
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       66
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       760102
194 Temperature_Celsius     0x0022   114   106   000    Old_age   Always       -       36
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       75
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       18
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   199   000    Old_age   Offline      -       5


 Hmm... the Current_Pending_Sector and Offline_Uncorrectable values have not changed.
 Try # smartctl -t offline /dev/sda as well.
 I then ran a long test to check, but Current_Pending_Sector and Offline_Uncorrectable still did not change.