226963 – mkfs.ext3 says everything is ok even if disk failed.

Bug 226963 - mkfs.ext3 says everything is ok even if disk failed.

Summary: mkfs.ext3 says everything is ok even if disk failed.

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	e2fsprogs
Sub Component:
Version:	6
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Eric Sandeen
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2007-02-02 10:18 UTC by Laurent Wandrebeck
Modified:	2007-11-30 22:11 UTC (History)
CC List:	2 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2007-07-23 18:19:38 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Description Laurent Wandrebeck 2007-02-02 10:18:21 UTC

Description of problem:

mkfs.ext3 /dev/sdb1 (a 300 GB partition) runs fine, even if there are disk
failures. SMART disk status says OK. Disk is  Maxtor 6L300S0 (sata 150) on a
sata2 (sata 300, NCQ) JMicron Technologies, Inc. JMicron 20360/20363 AHCI
Controller (rev 02).
Errors appear while "writing superblocks and FS accounting information"
(approximative translation from my native language).
From logs:
(note the strange "synchronization", failures appear exactly 1 minute after ntpd
did its job)

Feb  1 17:47:07 a64 ntpd[2696]: synchronized to LOCAL(0), stratum 10           
                                                                              
Feb  1 17:47:07 a64 ntpd[2696]: kernel time sync enabled 0001                  
                                                                              
Feb  1 17:48:11 a64 ntpd[2696]: synchronized to 64.142.114.146, stratum 2      
                                                                              
Feb  1 17:49:11 a64 kernel: ata3.00: exception Emask 0x0 SAct 0x7fffffff SErr
0x0 action 0x2 frozen                                                           
Feb  1 17:49:11 a64 kernel: ata3.00: tag 0 cmd 0x61 Emask 0x4 stat 0x40 err 0x0
(timeout)                                                                     
Feb  1 17:49:11 a64 kernel: ata3.00: tag 1 cmd 0x61 Emask 0x4 stat 0x40 err 0x0
(timeout)                                                                     
Feb  1 17:49:11 a64 kernel: ata3.00: tag 2 cmd 0x61 Emask 0x4 stat 0x40 err 0x0
(timeout)
...(up to tag 30)
Feb  1 17:49:11 a64 kernel: ata3: soft resetting port                          
                                                                              
Feb  1 17:49:18 a64 kernel: ata3: port is slow to respond, please be patient
(Status 0xc0)                                                                    
Feb  1 17:49:41 a64 kernel: ata3: port failed to respond (30 secs, Status 0xc0)
                                                                              
Feb  1 17:49:41 a64 kernel: ata3: softreset failed (device not ready)          
                                                                              
Feb  1 17:49:41 a64 kernel: ata3: softreset failed, retrying in 5 secs         
                                                                              
Feb  1 17:49:46 a64 kernel: ata3: hard resetting port                          
                                                                              
Feb  1 17:49:46 a64 kernel: ata3: COMRESET failed (device not ready)           
                                                                              
Feb  1 17:49:46 a64 kernel: ata3: hardreset failed, retrying in 5 secs         
                                                                              
Feb  1 17:49:51 a64 kernel: ata3: hard resetting port                          
                                                                              
Feb  1 17:49:52 a64 kernel: ata3: COMRESET failed (device not ready)           
                                                                              
Feb  1 17:49:52 a64 kernel: ata3: reset failed, giving up                      
                                                                              
Feb  1 17:49:52 a64 kernel: ata3.00: disabled                                  
                                                                              
Feb  1 17:49:52 a64 kernel: ata3: EH complete                                  
                                                                              
Feb  1 17:49:52 a64 kernel: sd 3:0:0:0: SCSI error: return code = 0x00040000   
                                                                              
Feb  1 17:49:52 a64 kernel: end_request: I/O error, dev sdb, sector 114032703  
                                                                              
Feb  1 17:49:52 a64 kernel: Buffer I/O error on device sdb1, logical block
28508160                                                                           
Feb  1 17:49:52 a64 kernel: lost page write due to I/O error on sdb1           
                                                                              
Feb  1 17:49:52 a64 kernel: Buffer I/O error on device sdb1, logical block
28508161                                                                           
Feb  1 17:49:52 a64 kernel: lost page write due to I/O error on sdb1           
                                                                              
Feb  1 17:49:52 a64 kernel: Buffer I/O error on device sdb1, logical block
28508162                                                                           
Feb  1 17:49:52 a64 kernel: lost page write due to I/O error on sdb1           
                                                                              
Feb  1 17:49:52 a64 kernel: Buffer I/O error on device sdb1, logical block
28508163                                                                           
Feb  1 17:49:52 a64 kernel: lost page write due to I/O error on sdb1           
                                                                              
Feb  1 17:49:52 a64 kernel: sd 3:0:0:0: SCSI error: return code = 0x00040000   
                                                                              
Feb  1 17:49:52 a64 kernel: end_request: I/O error, dev sdb, sector 110362687  
                                                                              
Feb  1 17:49:52 a64 kernel: Buffer I/O error on device sdb1, logical block
27590656                                                                           
Feb  1 17:49:52 a64 kernel: lost page write due to I/O error on sdb1           
                                                                              
Feb  1 17:49:52 a64 kernel: Buffer I/O error on device sdb1, logical block
27590657                                                                           
Feb  1 17:49:52 a64 kernel: lost page write due to I/O error on sdb1           
                                                                              
Feb  1 17:49:52 a64 kernel: Buffer I/O error on device sdb1, logical block
27590658                                                                           
Feb  1 17:49:52 a64 kernel: lost page write due to I/O error on sdb1           
                                                                              
Feb  1 17:49:52 a64 kernel: Buffer I/O error on device sdb1, logical block
27590659                                                                           
Feb  1 17:49:52 a64 kernel: lost page write due to I/O error on sdb1           
                                                                              
Feb  1 17:49:52 a64 kernel: sd 3:0:0:0: SCSI error: return code = 0x00040000   
                                                                              
Feb  1 17:49:52 a64 kernel: end_request: I/O error, dev sdb, sector 117178431  
                                                                              
Feb  1 17:49:52 a64 kernel: Buffer I/O error on device sdb1, logical block
29294592                                                                           
Feb  1 17:49:52 a64 kernel: lost page write due to I/O error on sdb1           
                                                                              
Feb  1 17:49:52 a64 kernel: Buffer I/O error on device sdb1, logical block
29294593                                                                           
Feb  1 17:49:52 a64 kernel: lost page write due to I/O error on sdb1           
                                                                              
Feb  1 17:49:52 a64 kernel: sd 3:0:0:0: SCSI error: return code = 0x00040000   
                                                                              
Feb  1 17:49:52 a64 kernel: end_request: I/O error, dev sdb, sector 112984127  
                                                                              
Feb  1 17:49:52 a64 kernel: sd 3:0:0:0: SCSI error: return code = 0x00040000   
                                                                              
Feb  1 17:49:52 a64 kernel: end_request: I/O error, dev sdb, sector 116654143
(then loads of other sector errors)

Another try:
(apparently no more relation to ntpd synchronization)
Feb  2 10:14:56 a64 ntpd[2676]: synchronized to LOCAL(0), stratum 10           
                                                                              
Feb  2 10:14:56 a64 ntpd[2676]: kernel time sync enabled 0001                  
                                                                              
Feb  2 10:15:59 a64 ntpd[2676]: synchronized to 62.112.194.64, stratum 2       
                                                                              
Feb  2 10:16:03 a64 Installed: python-virtinst.noarch 0.98.0-1.fc6             
                                                                              
Feb  2 10:25:24 a64 kernel: ata3.00: exception Emask 0x0 SAct 0x7fffffff SErr
0x0 action 0x2 frozen                                                           
Feb  2 10:25:24 a64 kernel: ata3.00: tag 0 cmd 0x61 Emask 0x4 stat 0x40 err 0x0
(timeout)                                                                     
Feb  2 10:25:24 a64 kernel: ata3.00: tag 1 cmd 0x61 Emask 0x4 stat 0x40 err 0x0
(timeout)                                                                     
Feb  2 10:25:24 a64 kernel: ata3.00: tag 2 cmd 0x61 Emask 0x4 stat 0x40 err 0x0
(timeout)
then the same as previous try.

Disk seems to have a half SMART implementation:
Feb  2 10:41:53 a64 smartd[2987]: Device: /dev/sdb, not capable of SMART
self-check                                                                     
     
Feb  2 10:41:54 a64 smartd[2987]: Sending warning via mail to root ...         
                                                                              
Feb  2 10:41:56 a64 smartd[2987]: Warning via mail to root: successful         
                                                                              
Feb  2 10:41:56 a64 smartd[2987]: Device: /dev/sdb, failed to read SMART
Attribute Data      

[root@a64 ~]# smartctl /dev/sdb --all -T permissive
smartctl version 5.36 [x86_64-redhat-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Short INQUIRY response, skip product id
SMART Health Status: OK
Read defect list: asked for grown list but didn't get it

Error Counter logging not supported
Device does not support Self Test logging

mkfs.ext3 output:

[root@a64 ~]# mkfs.ext3 /dev/sdb1
mke2fs 1.39 (29-May-2006)
Étiquette de système de fichiers=
Type de système d'exploitation : Linux
Taille de bloc=4096 (log=2)
Taille de fragment=4096 (log=2)
36634624 i-noeuds, 73262416 blocs
3663120 blocs (5.00%) réservés pour le super utilisateur
Premier bloc de données=0
Nombre maximum de blocs du système de fichiers=4294967296
2236 groupes de blocs
32768 blocs par groupe, 32768 fragments par groupe
16384 i-noeuds par groupe
Superblocs de secours stockés sur les blocs : 
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
        4096000, 7962624, 11239424, 20480000, 23887872, 71663616

Écriture des tables d'i-noeuds : complété                        
Création du journal (32768 blocs) : complété
Écriture des superblocs et de l'information de comptabilité du système de
fichiers : complété

Le système de fichiers sera automatiquement vérifié tous les 20 montages ou
après 180 jours, selon la première éventualité. Utiliser tune2fs -c ou -i
pour écraser la valeur.

Version-Release number of selected component (if applicable):

1.39-7.fc6

How reproducible:

Unsure. formatted the same way a similar disk (same model but sata 300) without
problem. Is it a bug with sata 150 HD on sata 300 port ? Still have to find
hardware and time for further tests.

Steps to Reproduce:
1.
2.
3.
  
Actual results:

mkfs.ext3 says everything ran fine.

Expected results:

should be aware of the failure.

Additional info:

Comment 1 Eric Sandeen 2007-05-05 18:09:11 UTC

ok, so this bug is what mkfs does in the face of IO errors... I'll have to take
a look at the error checking.  Out of curiosity, what does subsequent mount or
fsck say?

The root cause is unknown, looks like you have ATA problems...

Comment 2 Eric Sandeen 2007-07-23 18:19:38 UTC

Actually, there is not much that mkfs can do here; it is doing buffered writes
(which is fine) which will actually be flushed to disk at a later time.  If
those later attempts to flush data fail, mkfs won't know; it may well have
already exited; there is no feedback possible.  Unless mkfs were to use direct
IO it would not catch this sort of error.  But, this is true of any application
which does buffered writes.

I'm going to close this as NOTABUG; I think the only real problem here is that
you have a failing disk.

Note You need to log in before you can comment on or make changes to this bug.