Bug 226963

Summary: mkfs.ext3 says everything is ok even if disk failed.
Product: [Fedora] Fedora Reporter: Laurent Wandrebeck <l.wandrebeck>
Component: e2fsprogsAssignee: Eric Sandeen <esandeen>
Status: CLOSED NOTABUG QA Contact:
Severity: high Docs Contact:
Priority: medium    
Version: 6CC: kzak, sct
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2007-07-23 18:19:38 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Laurent Wandrebeck 2007-02-02 10:18:21 UTC
Description of problem:

mkfs.ext3 /dev/sdb1 (a 300 GB partition) runs fine, even if there are disk
failures. SMART disk status says OK. Disk is  Maxtor 6L300S0 (sata 150) on a
sata2 (sata 300, NCQ) JMicron Technologies, Inc. JMicron 20360/20363 AHCI
Controller (rev 02).
Errors appear while "writing superblocks and FS accounting information"
(approximative translation from my native language).
From logs:
(note the strange "synchronization", failures appear exactly 1 minute after ntpd
did its job)

Feb  1 17:47:07 a64 ntpd[2696]: synchronized to LOCAL(0), stratum 10           
                                                                              
Feb  1 17:47:07 a64 ntpd[2696]: kernel time sync enabled 0001                  
                                                                              
Feb  1 17:48:11 a64 ntpd[2696]: synchronized to 64.142.114.146, stratum 2      
                                                                              
Feb  1 17:49:11 a64 kernel: ata3.00: exception Emask 0x0 SAct 0x7fffffff SErr
0x0 action 0x2 frozen                                                           
Feb  1 17:49:11 a64 kernel: ata3.00: tag 0 cmd 0x61 Emask 0x4 stat 0x40 err 0x0
(timeout)                                                                     
Feb  1 17:49:11 a64 kernel: ata3.00: tag 1 cmd 0x61 Emask 0x4 stat 0x40 err 0x0
(timeout)                                                                     
Feb  1 17:49:11 a64 kernel: ata3.00: tag 2 cmd 0x61 Emask 0x4 stat 0x40 err 0x0
(timeout)
...(up to tag 30)
Feb  1 17:49:11 a64 kernel: ata3: soft resetting port                          
                                                                              
Feb  1 17:49:18 a64 kernel: ata3: port is slow to respond, please be patient
(Status 0xc0)                                                                    
Feb  1 17:49:41 a64 kernel: ata3: port failed to respond (30 secs, Status 0xc0)
                                                                              
Feb  1 17:49:41 a64 kernel: ata3: softreset failed (device not ready)          
                                                                              
Feb  1 17:49:41 a64 kernel: ata3: softreset failed, retrying in 5 secs         
                                                                              
Feb  1 17:49:46 a64 kernel: ata3: hard resetting port                          
                                                                              
Feb  1 17:49:46 a64 kernel: ata3: COMRESET failed (device not ready)           
                                                                              
Feb  1 17:49:46 a64 kernel: ata3: hardreset failed, retrying in 5 secs         
                                                                              
Feb  1 17:49:51 a64 kernel: ata3: hard resetting port                          
                                                                              
Feb  1 17:49:52 a64 kernel: ata3: COMRESET failed (device not ready)           
                                                                              
Feb  1 17:49:52 a64 kernel: ata3: reset failed, giving up                      
                                                                              
Feb  1 17:49:52 a64 kernel: ata3.00: disabled                                  
                                                                              
Feb  1 17:49:52 a64 kernel: ata3: EH complete                                  
                                                                              
Feb  1 17:49:52 a64 kernel: sd 3:0:0:0: SCSI error: return code = 0x00040000   
                                                                              
Feb  1 17:49:52 a64 kernel: end_request: I/O error, dev sdb, sector 114032703  
                                                                              
Feb  1 17:49:52 a64 kernel: Buffer I/O error on device sdb1, logical block
28508160                                                                           
Feb  1 17:49:52 a64 kernel: lost page write due to I/O error on sdb1           
                                                                              
Feb  1 17:49:52 a64 kernel: Buffer I/O error on device sdb1, logical block
28508161                                                                           
Feb  1 17:49:52 a64 kernel: lost page write due to I/O error on sdb1           
                                                                              
Feb  1 17:49:52 a64 kernel: Buffer I/O error on device sdb1, logical block
28508162                                                                           
Feb  1 17:49:52 a64 kernel: lost page write due to I/O error on sdb1           
                                                                              
Feb  1 17:49:52 a64 kernel: Buffer I/O error on device sdb1, logical block
28508163                                                                           
Feb  1 17:49:52 a64 kernel: lost page write due to I/O error on sdb1           
                                                                              
Feb  1 17:49:52 a64 kernel: sd 3:0:0:0: SCSI error: return code = 0x00040000   
                                                                              
Feb  1 17:49:52 a64 kernel: end_request: I/O error, dev sdb, sector 110362687  
                                                                              
Feb  1 17:49:52 a64 kernel: Buffer I/O error on device sdb1, logical block
27590656                                                                           
Feb  1 17:49:52 a64 kernel: lost page write due to I/O error on sdb1           
                                                                              
Feb  1 17:49:52 a64 kernel: Buffer I/O error on device sdb1, logical block
27590657                                                                           
Feb  1 17:49:52 a64 kernel: lost page write due to I/O error on sdb1           
                                                                              
Feb  1 17:49:52 a64 kernel: Buffer I/O error on device sdb1, logical block
27590658                                                                           
Feb  1 17:49:52 a64 kernel: lost page write due to I/O error on sdb1           
                                                                              
Feb  1 17:49:52 a64 kernel: Buffer I/O error on device sdb1, logical block
27590659                                                                           
Feb  1 17:49:52 a64 kernel: lost page write due to I/O error on sdb1           
                                                                              
Feb  1 17:49:52 a64 kernel: sd 3:0:0:0: SCSI error: return code = 0x00040000   
                                                                              
Feb  1 17:49:52 a64 kernel: end_request: I/O error, dev sdb, sector 117178431  
                                                                              
Feb  1 17:49:52 a64 kernel: Buffer I/O error on device sdb1, logical block
29294592                                                                           
Feb  1 17:49:52 a64 kernel: lost page write due to I/O error on sdb1           
                                                                              
Feb  1 17:49:52 a64 kernel: Buffer I/O error on device sdb1, logical block
29294593                                                                           
Feb  1 17:49:52 a64 kernel: lost page write due to I/O error on sdb1           
                                                                              
Feb  1 17:49:52 a64 kernel: sd 3:0:0:0: SCSI error: return code = 0x00040000   
                                                                              
Feb  1 17:49:52 a64 kernel: end_request: I/O error, dev sdb, sector 112984127  
                                                                              
Feb  1 17:49:52 a64 kernel: sd 3:0:0:0: SCSI error: return code = 0x00040000   
                                                                              
Feb  1 17:49:52 a64 kernel: end_request: I/O error, dev sdb, sector 116654143
(then loads of other sector errors)

Another try:
(apparently no more relation to ntpd synchronization)
Feb  2 10:14:56 a64 ntpd[2676]: synchronized to LOCAL(0), stratum 10           
                                                                              
Feb  2 10:14:56 a64 ntpd[2676]: kernel time sync enabled 0001                  
                                                                              
Feb  2 10:15:59 a64 ntpd[2676]: synchronized to 62.112.194.64, stratum 2       
                                                                              
Feb  2 10:16:03 a64 Installed: python-virtinst.noarch 0.98.0-1.fc6             
                                                                              
Feb  2 10:25:24 a64 kernel: ata3.00: exception Emask 0x0 SAct 0x7fffffff SErr
0x0 action 0x2 frozen                                                           
Feb  2 10:25:24 a64 kernel: ata3.00: tag 0 cmd 0x61 Emask 0x4 stat 0x40 err 0x0
(timeout)                                                                     
Feb  2 10:25:24 a64 kernel: ata3.00: tag 1 cmd 0x61 Emask 0x4 stat 0x40 err 0x0
(timeout)                                                                     
Feb  2 10:25:24 a64 kernel: ata3.00: tag 2 cmd 0x61 Emask 0x4 stat 0x40 err 0x0
(timeout)
then the same as previous try.

Disk seems to have a half SMART implementation:
Feb  2 10:41:53 a64 smartd[2987]: Device: /dev/sdb, not capable of SMART
self-check                                                                     
     
Feb  2 10:41:54 a64 smartd[2987]: Sending warning via mail to root ...         
                                                                              
Feb  2 10:41:56 a64 smartd[2987]: Warning via mail to root: successful         
                                                                              
Feb  2 10:41:56 a64 smartd[2987]: Device: /dev/sdb, failed to read SMART
Attribute Data      

[root@a64 ~]# smartctl /dev/sdb --all -T permissive
smartctl version 5.36 [x86_64-redhat-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Short INQUIRY response, skip product id
SMART Health Status: OK
Read defect list: asked for grown list but didn't get it

Error Counter logging not supported
Device does not support Self Test logging

mkfs.ext3 output:

[root@a64 ~]# mkfs.ext3 /dev/sdb1
mke2fs 1.39 (29-May-2006)
Étiquette de système de fichiers=
Type de système d'exploitation : Linux
Taille de bloc=4096 (log=2)
Taille de fragment=4096 (log=2)
36634624 i-noeuds, 73262416 blocs
3663120 blocs (5.00%) réservés pour le super utilisateur
Premier bloc de données=0
Nombre maximum de blocs du système de fichiers=4294967296
2236 groupes de blocs
32768 blocs par groupe, 32768 fragments par groupe
16384 i-noeuds par groupe
Superblocs de secours stockés sur les blocs : 
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
        4096000, 7962624, 11239424, 20480000, 23887872, 71663616

Écriture des tables d'i-noeuds : complété                        
Création du journal (32768 blocs) : complété
Écriture des superblocs et de l'information de comptabilité du système de
fichiers : complété

Le système de fichiers sera automatiquement vérifié tous les 20 montages ou
après 180 jours, selon la première éventualité. Utiliser tune2fs -c ou -i
pour écraser la valeur.

Version-Release number of selected component (if applicable):

1.39-7.fc6

How reproducible:

Unsure. formatted the same way a similar disk (same model but sata 300) without
problem. Is it a bug with sata 150 HD on sata 300 port ? Still have to find
hardware and time for further tests.

Steps to Reproduce:
1.
2.
3.
  
Actual results:

mkfs.ext3 says everything ran fine.

Expected results:

should be aware of the failure.

Additional info:

Comment 1 Eric Sandeen 2007-05-05 18:09:11 UTC
ok, so this bug is what mkfs does in the face of IO errors... I'll have to take
a look at the error checking.  Out of curiosity, what does subsequent mount or
fsck say?

The root cause is unknown, looks like you have ATA problems...

Comment 2 Eric Sandeen 2007-07-23 18:19:38 UTC
Actually, there is not much that mkfs can do here; it is doing buffered writes
(which is fine) which will actually be flushed to disk at a later time.  If
those later attempts to flush data fail, mkfs won't know; it may well have
already exited; there is no feedback possible.  Unless mkfs were to use direct
IO it would not catch this sort of error.  But, this is true of any application
which does buffered writes.

I'm going to close this as NOTABUG; I think the only real problem here is that
you have a failing disk.