Description of problem:
The afflicted machine is both a sendmail and a dovecot IMAP mail server. Local mail delivery is via procmail. We have now experienced three separate instances of mailbox file corruption resulting in minor to severe loss of mail data. Both procmail and dovecot are running on a local ext4 filesystem, so NFS locking issues should not be a problem.

Version-Release number of selected component (if applicable):
kernel-2.6.29.5-191.fc11.x86_64

How reproducible:
Random occurrence; has happened three times in roughly one week.

Steps to Reproduce:
I don't have a reproducer yet, but I would assume two tasks, both reading and writing the same file and using some sort of file locking to mediate access, should replicate the problem.

Actual results:
The affected files all have blocks of NULL characters. Usually, those blocks of NULL characters have overwritten previous data, and new data is appended beyond the NULL stream.

Expected results:
No file corruption!

Additional info:
I have corrupted files available for inspection if that would help.
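The access pattern hypothesized under Steps to Reproduce can be sketched as a hypothetical reproducer: one process appends mbox messages under an exclusive fcntl lock while a second repeatedly rewrites the file in place under the same lock. Function names and message contents here are invented, this assumes a Unix host, and this sketch merely exercises the pattern; it has not been shown to trigger the bug. On a healthy filesystem the resulting file should contain no NUL bytes.

```python
import fcntl
import os
import tempfile

def append_messages(path, count):
    # Mimic procmail delivery: exclusive fcntl lock, append one message, unlock.
    for i in range(count):
        with open(path, "a") as f:
            fcntl.lockf(f, fcntl.LOCK_EX)
            f.write("From test %d\nSubject: msg %d\n\nbody\n\n" % (i, i))
            f.flush()
            os.fsync(f.fileno())
            fcntl.lockf(f, fcntl.LOCK_UN)

def rewrite_in_place(path, count):
    # Mimic an IMAP expunge: lock, read the whole mbox, rewrite it, truncate.
    for _ in range(count):
        with open(path, "r+") as f:
            fcntl.lockf(f, fcntl.LOCK_EX)
            data = f.read()
            f.seek(0)
            f.write(data)
            f.truncate()
            f.flush()
            os.fsync(f.fileno())
            fcntl.lockf(f, fcntl.LOCK_UN)

fd, spool = tempfile.mkstemp()
os.close(fd)

pid = os.fork()
if pid == 0:
    # Child plays the IMAP server; the parent plays the delivery agent.
    rewrite_in_place(spool, 50)
    os._exit(0)
append_messages(spool, 50)
os.waitpid(pid, 0)

corrupted = b"\x00" in open(spool, "rb").read()
```

In practice a reproducer would run for much longer and check the file for NUL runs after every expunge cycle.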
Can you add details on the specific F11 kernel version & mount options used for the file system? Thanks!
Did the machine ever experience a power loss or crash?
Oh, another thing that may be interesting is to map the file; filefrag -v would suffice. This will tell us whether we have 0s on disk or holes in the file.

If there was no power loss or other odd event, my first hunch is delalloc problems; if this is at all repeatable and you're willing to experiment, mounting with -o nodelalloc would take that out of the picture. (Potential downside is that nodelalloc doesn't get a lot of testing...)

Ric also asks whether you may have ever run out of space during operation...
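The zeros-on-disk-vs-holes question can also be probed programmatically with lseek(SEEK_HOLE), the same information filefrag exposes. A minimal sketch, assuming a Linux filesystem with hole tracking (on filesystems without it, SEEK_HOLE simply reports end-of-file):

```python
import os
import tempfile

# Write one small record, then extend the file: the tail becomes a hole.
fd, path = tempfile.mkstemp()
os.pwrite(fd, b"data", 0)
os.ftruncate(fd, 1 << 20)   # 1 MiB logical size, mostly unwritten

# SEEK_HOLE finds the first hole; a hole reads back as zeros but has no
# blocks allocated, unlike zeros that were actually written to disk.
first_hole = os.lseek(fd, 0, os.SEEK_HOLE)
os.close(fd)
```

Running the same probe across a corrupted mailbox's NUL ranges would show whether they are mapped blocks containing zeros or unmapped holes.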
Kernel version is in comment #1.

Mount options:
/dev/md/root   /   ext4   defaults,noatime,nodiratime   1 1

No power loss. No recent crashes since disabling SELinux about 3 weeks ago. Prior to that I had some nfs<->SELinux related crashes: this machine was the NFS server, and when the client tried to access files, on random occasions and with random files it would fail to get an SELinux context.

[doug@firewall redhat-lists]$ sudo filefrag -v bugzilla
Checking bugzilla
Filesystem type is: ef53
Filesystem cylinder groups is approximately 18299
File is stored in extents format
Blocksize of file bugzilla is 4096
File size of bugzilla is 991855 (243 blocks)
First block: 22356864
Last block: 22471154
Discontinuity: Block 128 is at 22471040 (was 22356991)
bugzilla: 2 extents found
[doug@firewall redhat-lists]$

Never ran out of space. Current usage has over 1TB free.
Thanks Doug. Could you also:

# debugfs /dev/whatever
debugfs> stat relative/path/to/bugzilla

Just trying to collect some info about the file to try to correlate to the location of the 0s.

thanks,
-Eric
stat /home/doug/Mail/redhat-lists/bugzilla
Inode: 5541183   Type: regular    Mode:  0600   Flags: 0x80000
Generation: 3103348147    Version: 0x00000000:00000001
User:  2218   Group:  2218   Size: 995714
File ACL: 0    Directory ACL: 0
Links: 1   Blockcount: 1952
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x4a684b9e:819806a0 -- Thu Jul 23 07:38:06 2009
 atime: 0x4a37f0bd:00000000 -- Tue Jun 16 15:21:33 2009
 mtime: 0x4a684b9e:819806a0 -- Thu Jul 23 07:38:06 2009
crtime: 0x4a3804ec:961871c8 -- Tue Jun 16 16:47:40 2009
Size of extra inode fields: 28
Extended attributes stored in inode body:
  selinux = "unconfined_u:object_r:user_home_t:s0\000" (37)
BLOCKS:
(0-127):22356864-22356991, (128-243):22471040-22471155
TOTAL: 244

debugfs:
Umm.... Holy Crap. After the unclean shutdowns in the past it always came back up mostly clean: it just did a journal replay and went on. But, given the current issues, I ran fsck, and this is what I got:

[root@firewall openvpn]# fsck.ext4 -n /dev/md/root
e2fsck 1.41.4 (27-Jan-2009)
Warning!  /dev/md/root is mounted.
Warning: skipping journal recovery because doing a read-only filesystem check.
firewall/ contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Inodes that were part of a corrupted orphan linked list found.  Fix? no
Inode 4760 was part of the orphaned inode list.  IGNORED.
Inode 5374 was part of the orphaned inode list.  IGNORED.
Inode 5447 was part of the orphaned inode list.  IGNORED.
Inode 11048 was part of the orphaned inode list.  IGNORED.
Inode 11200 was part of the orphaned inode list.  IGNORED.
Inode 11231 was part of the orphaned inode list.  IGNORED.
Inode 11357 was part of the orphaned inode list.  IGNORED.
Inode 32344 was part of the orphaned inode list.  IGNORED.
Inode 99504 was part of the orphaned inode list.  IGNORED.
Inode 101530 was part of the orphaned inode list.  IGNORED.
Inode 103650 was part of the orphaned inode list.  IGNORED.
Deleted inode 105652 has zero dtime.  Fix? no
Inode 155994 was part of the orphaned inode list.  IGNORED.
Inode 155995 was part of the orphaned inode list.  IGNORED.
Inode 164750 was part of the orphaned inode list.  IGNORED.
Inode 166059 was part of the orphaned inode list.  IGNORED.
Inode 166122 was part of the orphaned inode list.  IGNORED.
Inode 167136 was part of the orphaned inode list.  IGNORED.
Inode 168909 was part of the orphaned inode list.  IGNORED.
Inode 169301 was part of the orphaned inode list.  IGNORED.
Inode 169302 was part of the orphaned inode list.  IGNORED.
Inode 169348 was part of the orphaned inode list.  IGNORED.
Inode 169389 was part of the orphaned inode list.  IGNORED.
Inode 169396 was part of the orphaned inode list.  IGNORED.
Inode 169443 was part of the orphaned inode list. IGNORED. Inode 169477 was part of the orphaned inode list. IGNORED. Inode 169484 was part of the orphaned inode list. IGNORED. Inode 169497 was part of the orphaned inode list. IGNORED. Inode 169516 was part of the orphaned inode list. IGNORED. Inode 169521 was part of the orphaned inode list. IGNORED. Inode 169547 was part of the orphaned inode list. IGNORED. Inode 169682 was part of the orphaned inode list. IGNORED. Inode 170515 was part of the orphaned inode list. IGNORED. Inode 170849 was part of the orphaned inode list. IGNORED. Inode 170850 was part of the orphaned inode list. IGNORED. Inode 171815 was part of the orphaned inode list. IGNORED. Inode 172876 was part of the orphaned inode list. IGNORED. Inode 175733 was part of the orphaned inode list. IGNORED. Inode 175833 was part of the orphaned inode list. IGNORED. Inode 175835 was part of the orphaned inode list. IGNORED. Inode 175901 was part of the orphaned inode list. IGNORED. Inode 175969 was part of the orphaned inode list. IGNORED. Inode 176022 was part of the orphaned inode list. IGNORED. Inode 176302 was part of the orphaned inode list. IGNORED. Inode 176353 was part of the orphaned inode list. IGNORED. Inode 176542 was part of the orphaned inode list. IGNORED. Inode 176588 was part of the orphaned inode list. IGNORED. Inode 176781 was part of the orphaned inode list. IGNORED. Inode 176855 was part of the orphaned inode list. IGNORED. Inode 176934 was part of the orphaned inode list. IGNORED. Inode 177176 was part of the orphaned inode list. IGNORED. Inode 177213 was part of the orphaned inode list. IGNORED. Inode 177382 was part of the orphaned inode list. IGNORED. Inode 177474 was part of the orphaned inode list. IGNORED. Inode 177488 was part of the orphaned inode list. IGNORED. Inode 177586 was part of the orphaned inode list. IGNORED. Inode 177709 was part of the orphaned inode list. IGNORED. Inode 177716 was part of the orphaned inode list. 
IGNORED. Inode 177766 was part of the orphaned inode list. IGNORED. Inode 177775 was part of the orphaned inode list. IGNORED. Inode 177814 was part of the orphaned inode list. IGNORED. Inode 177883 was part of the orphaned inode list. IGNORED. Inode 177950 was part of the orphaned inode list. IGNORED. Inode 177977 was part of the orphaned inode list. IGNORED. Inode 178227 was part of the orphaned inode list. IGNORED. Inode 178259 was part of the orphaned inode list. IGNORED. Inode 178326 was part of the orphaned inode list. IGNORED. Inode 178396 was part of the orphaned inode list. IGNORED. Inode 178467 was part of the orphaned inode list. IGNORED. Inode 178520 was part of the orphaned inode list. IGNORED. Inode 178584 was part of the orphaned inode list. IGNORED. Inode 178804 was part of the orphaned inode list. IGNORED. Inode 178867 was part of the orphaned inode list. IGNORED. Inode 197633 was part of the orphaned inode list. IGNORED. Inode 199282 was part of the orphaned inode list. IGNORED. Inode 199300 was part of the orphaned inode list. IGNORED. Inode 199348 was part of the orphaned inode list. IGNORED. Inode 201848 was part of the orphaned inode list. IGNORED. Inode 204603 was part of the orphaned inode list. IGNORED. Inode 204684 was part of the orphaned inode list. IGNORED. Inode 204688 was part of the orphaned inode list. IGNORED. Inode 204755 was part of the orphaned inode list. IGNORED. Inode 205220 was part of the orphaned inode list. IGNORED. Inode 205308 was part of the orphaned inode list. IGNORED. Inode 205309 was part of the orphaned inode list. IGNORED. Inode 208867 was part of the orphaned inode list. IGNORED. Inode 208874 was part of the orphaned inode list. IGNORED. Inode 208918 was part of the orphaned inode list. IGNORED. Inode 208922 was part of the orphaned inode list. IGNORED. Inode 208934 was part of the orphaned inode list. IGNORED. Inode 208979 was part of the orphaned inode list. IGNORED. 
Inode 208987 was part of the orphaned inode list. IGNORED. Inode 209186 was part of the orphaned inode list. IGNORED. Inode 209190 was part of the orphaned inode list. IGNORED. Inode 209210 was part of the orphaned inode list. IGNORED. Inode 209213 was part of the orphaned inode list. IGNORED. Inode 209214 was part of the orphaned inode list. IGNORED. Inode 209225 was part of the orphaned inode list. IGNORED. Inode 209335 was part of the orphaned inode list. IGNORED. Inode 209392 was part of the orphaned inode list. IGNORED. Inode 209634 was part of the orphaned inode list. IGNORED. Inode 212153 was part of the orphaned inode list. IGNORED. Inode 212262 was part of the orphaned inode list. IGNORED. Inode 212268 was part of the orphaned inode list. IGNORED. Inode 212269 was part of the orphaned inode list. IGNORED. Inode 212276 was part of the orphaned inode list. IGNORED. Inode 212281 was part of the orphaned inode list. IGNORED. Inode 212316 was part of the orphaned inode list. IGNORED. Inode 212682 was part of the orphaned inode list. IGNORED. Inode 212765 was part of the orphaned inode list. IGNORED. Inode 212803 was part of the orphaned inode list. IGNORED. Inode 216493 was part of the orphaned inode list. IGNORED. Inode 217003 was part of the orphaned inode list. IGNORED. Inode 218025 was part of the orphaned inode list. IGNORED. Inode 218080 was part of the orphaned inode list. IGNORED. Inode 218197 was part of the orphaned inode list. IGNORED. Inode 218312 was part of the orphaned inode list. IGNORED. Inode 218371 was part of the orphaned inode list. IGNORED. Inode 218397 was part of the orphaned inode list. IGNORED. Inode 218421 was part of the orphaned inode list. IGNORED. Inode 218446 was part of the orphaned inode list. IGNORED. Inode 218751 was part of the orphaned inode list. IGNORED. Inode 218903 was part of the orphaned inode list. IGNORED. Inode 219089 was part of the orphaned inode list. IGNORED. Inode 219094 was part of the orphaned inode list. 
IGNORED. Inode 219620 was part of the orphaned inode list. IGNORED. Inode 219814 was part of the orphaned inode list. IGNORED. Inode 219817 was part of the orphaned inode list. IGNORED. Inode 219835 was part of the orphaned inode list. IGNORED. Inode 240154 was part of the orphaned inode list. IGNORED. Inode 241714 was part of the orphaned inode list. IGNORED. Inode 241718 was part of the orphaned inode list. IGNORED. Inode 247978 was part of the orphaned inode list. IGNORED. Inode 247983 was part of the orphaned inode list. IGNORED. Inode 247992 was part of the orphaned inode list. IGNORED. Inode 248010 was part of the orphaned inode list. IGNORED. Inode 248290 was part of the orphaned inode list. IGNORED. Inode 248298 was part of the orphaned inode list. IGNORED. Inode 248299 was part of the orphaned inode list. IGNORED. Inode 248300 was part of the orphaned inode list. IGNORED. Inode 248301 was part of the orphaned inode list. IGNORED. Inode 248302 was part of the orphaned inode list. IGNORED. Inode 248306 was part of the orphaned inode list. IGNORED. Inode 248315 was part of the orphaned inode list. IGNORED. Inode 266976 was part of the orphaned inode list. IGNORED. Inode 291027 was part of the orphaned inode list. IGNORED. Inode 291546 was part of the orphaned inode list. IGNORED. Inode 291570 was part of the orphaned inode list. IGNORED. Inode 291572 was part of the orphaned inode list. IGNORED. Inode 291762 was part of the orphaned inode list. IGNORED. Inode 293046 was part of the orphaned inode list. IGNORED. Inode 309020 was part of the orphaned inode list. IGNORED. Inode 309081 was part of the orphaned inode list. IGNORED. Inode 309082 was part of the orphaned inode list. IGNORED. Inode 309083 was part of the orphaned inode list. IGNORED. Inode 5377632 was part of the orphaned inode list. IGNORED. Inode 5379671 was part of the orphaned inode list. IGNORED. Inode 6029642 was part of the orphaned inode list. IGNORED. 
Inode 6032427 was part of the orphaned inode list. IGNORED. Inode 6032568 was part of the orphaned inode list. IGNORED. Inode 6032728 was part of the orphaned inode list. IGNORED. Inode 6032731 was part of the orphaned inode list. IGNORED. Inode 6032736 was part of the orphaned inode list. IGNORED. Inode 6032944 was part of the orphaned inode list. IGNORED. Inode 6032975 was part of the orphaned inode list. IGNORED. Inode 6033281 was part of the orphaned inode list. IGNORED. Inode 6033432 was part of the orphaned inode list. IGNORED. Inode 6033685 was part of the orphaned inode list. IGNORED. Inode 6034543 was part of the orphaned inode list. IGNORED. Inode 6035373 was part of the orphaned inode list. IGNORED. Inode 6035533 was part of the orphaned inode list. IGNORED. Inode 6036431 was part of the orphaned inode list. IGNORED. Inode 6036448 was part of the orphaned inode list. IGNORED. Inode 6036482 was part of the orphaned inode list. IGNORED. Inode 6036517 was part of the orphaned inode list. IGNORED. Inode 6036555 was part of the orphaned inode list. IGNORED. Inode 6036561 was part of the orphaned inode list. IGNORED. Inode 6036593 was part of the orphaned inode list. IGNORED. Inode 6036602 was part of the orphaned inode list. IGNORED. Inode 6036618 was part of the orphaned inode list. IGNORED. Inode 6036621 was part of the orphaned inode list. IGNORED. Inode 6036626 was part of the orphaned inode list. IGNORED. Inode 6036650 was part of the orphaned inode list. IGNORED. Inode 6036656 was part of the orphaned inode list. IGNORED. Inode 6036665 was part of the orphaned inode list. IGNORED. Inode 6036667 was part of the orphaned inode list. IGNORED. Inode 6036675 was part of the orphaned inode list. IGNORED. Inode 6036684 was part of the orphaned inode list. IGNORED. Inode 6036691 was part of the orphaned inode list. IGNORED. Inode 6036696 was part of the orphaned inode list. IGNORED. Inode 6036701 was part of the orphaned inode list. IGNORED. 
Inode 6036703 was part of the orphaned inode list. IGNORED. Inode 6037595 was part of the orphaned inode list. IGNORED. Inode 10592426 was part of the orphaned inode list. IGNORED. Inode 12320927 was part of the orphaned inode list. IGNORED. Inode 12320928 was part of the orphaned inode list. IGNORED. Inode 12320929 was part of the orphaned inode list. IGNORED. Pass 2: Checking directory structure Entry '.fetchmail.pid' in /home/dledford (5541444) has deleted/unused inode 5379205. Clear? no Pass 3: Checking directory connectivity Pass 4: Checking reference counts Pass 5: Checking group summary information Block bitmap differences: -(16128--16169) -(17152--17221) -(35808--35837) -(40832--40946) -(95040--95102) -(115680--115708) -(516096--516131) -(656832--656865) -(657536--657553) -(659904--659938) -(660416--660467) -(660512--660538) -(661120--661165) -(677248--677375) -(681696--681714) -(690048--690068) -(690080--690095) -(693504--693623) -(695232--695294) -(696608--696624) -(701184--701245) -(710752--710778) -(712640--712702) -(725696--725732) -(727040--727106) -(727680--727753) -(729376--729400) -(734464--734502) -(736864--736881) -(737152--737251) -(751296--751333) -(755200--755358) -(760576--760613) -(778240--778405) -(788224--788244) -(799392--799417) -(820720--820735) -(825904--825919) -(837760--837793) -(837888--837911) -(839424--839666) -(839936--839983) -(840000--840042) -(842112--842136) -(845216--845233) -(857600--857913) -(858304--858322) -(861568--861644) -(864704--864739) -(864832--864885) -(864896--864932) -(864960--864981) -(869984--870011) -(875392--875410) -(876128--876156) -(879488--879543) -(880320--880337) -(885792--885808) -(886240--886263) -(887072--887100) -(887264--887282) -(887744--887795) -(887904--887930) -(993536--993602) -(1023392--1023415) -(1023488--1023504) -(1086464--1086499) -(1091840--1091896) -(1107456--1107648) -(1110016--1110319) -(1110528--1110894) -(1115648--1115683) -(1145344--1145434) -(1155328--1155432) 
-(1156608--1156718) -(1166976--1167018) -(1217600--1217621) -(1237504--1237594) -(1243264--1243291) -(1262592--1262746) -(1264896--1264943) -(1269760--1269844) -(1269888--1269930) -(1353472--1353642) -(1420032--1420171) -(1507824--1507836) -(1519040--1519072) -(1531392--1531681) -(1538560--1538704) -(1547264--1547727) -(1548494--1548504) -(1609728--1611775) -(1613824--1615871) -(1617920--1618960) -(1639424--1639561) -(1749472--1749488) -(1782272--1782719) -(20803584--20804063) -(20805120--20805631) -(22806754--22806776) -(22806975--22806978) -(22839376--22839381) -(22840351--22840354) -(22840519--22840526) -(24416704--24416736) -(24416768--24416805) -(24416832--24416864) -(24417024--24417175) -(26181993--26181997) -(26182073--26182078) -(26182589--26182611) -(26182743--26182755) -(26182780--26182791) -(26182805--26182810) -(26183084--26183091) -(26183110--26183120) -(26183140--26183145) -(26183613--26183619) -(26192279--26192288) -(26192357--26192364) -(26192453--26192460) -(26192681--26192689) -(26192994--26193000) -(26193098--26193103) -(26193124--26193135) -(26193472--26193476) -(26193732--26193738) -(26193879--26193888) -(26193905--26193916) -(48662074--48662079) -(48727051--48727055) -(48732056--48732062) -(48742236--48742245) -(48742259--48742284) -(48742297--48742306) -(49333056--49333112) -(54463467--54463468) -54463531 -54464487 -(54479728--54479737) -(54479853--54479857) -54494288 -54495066 -(54502541--54502545) -54541479 -54541511 -(54541513--54541514) -54541526 -(56197270--56197271) -(58004151--58004154) -(58011927--58011929) -58532044 -58758794 -58758848 -58758905 -58758971 -(58884347--58884353) -59475269 -(60523864--60523871) -(60571899--60571912) -(60571933--60571938) -(60589445--60589456) -(60590604--60590614) -(60591017--60591019) -(60607154--60607168) -(60607914--60607926) -(60609124--60609127) Fix? no Free blocks count wrong for group #1617 (7734, counted=7735). Fix? no Free blocks count wrong for group #1661 (24072, counted=24074). Fix? 
no Free blocks count wrong for group #1662 (31088, counted=31093). Fix? no Free blocks count wrong for group #1663 (29854, counted=29874). Fix? no Free blocks count wrong for group #1716 (22408, counted=22409). Fix? no Free blocks count wrong for group #1766 (24047, counted=24049). Fix? no Free blocks count wrong for group #1767 (21221, counted=21227). Fix? no Free blocks count wrong for group #1776 (4224, counted=4226). Fix? no Free blocks count wrong for group #1786 (13739, counted=13738). Fix? no Free blocks count wrong for group #1858 (30121, counted=30123). Fix? no Free blocks count wrong (524151895, counted=524151947). Fix? no Inode bitmap differences: -4760 -5374 -5447 -11048 -11200 -11231 -11357 -32344 -99504 -101530 -103650 -105652 -(155994--155995) -164750 -166059 -166122 -167136 -168909 -(169301--169302) -169348 -169389 -169396 -169443 -169477 -169484 -169497 -169516 -169521 -169547 -169682 -170515 -(170849--170850) -171815 -172876 -175733 -175833 -175835 -175901 -175969 -176022 -176302 -176353 -176542 -176588 -176781 -176855 -176934 -177176 -177213 -177382 -177474 -177488 -177586 -177709 -177716 -177766 -177775 -177814 -177883 -177950 -177977 -178227 -178259 -178326 -178396 -178467 -178520 -178584 -178804 -178867 -197633 -199282 -199300 -199348 -201848 -204603 -204684 -204688 -204755 -205220 -(205308--205309) -208867 -208874 -208918 -208922 -208934 -208979 -208987 -209186 -209190 -209210 -(209213--209214) -209225 -209335 -209392 -209634 -212153 -212262 -(212268--212269) -212276 -212281 -212316 -212682 -212765 -212803 -216493 -217003 -218025 -218080 -218197 -218312 -218371 -218397 -218421 -218446 -218751 -218903 -219089 -219094 -219620 -219814 -219817 -219835 -240154 -241714 -241718 -247978 -247983 -247992 -248010 -248290 -(248298--248302) -248306 -248315 -266976 -291027 -291546 -291570 -291572 -291762 -293046 -309020 -(309081--309083) -5377632 -5379205 -5379671 -6032427 -6032568 -6032728 -6032731 -6032736 -6032944 -6032975 -6033281 -6033685 -6035533 
-6036618 -10592426 -(12320927--12320929) Fix? no Free inodes count wrong for group #656 (199, counted=198). Fix? no Free inodes count wrong for group #736 (252, counted=277). Fix? no Free inodes count wrong for group #737 (7114, counted=7115). Fix? no Free inodes count wrong (150641317, counted=150641342). Fix? no firewall/: ********** WARNING: Filesystem still has errors ********** firewall/: 1647963/152289280 files (0.7% non-contiguous), 85002345/609154240 blocks [root@firewall openvpn]#
OK, the previous comment is a red herring. It was simply because I did a read-only check of a mounted filesystem. A similar read-only check while the fs was mounted ro came back clean. So, whatever corruption I have isn't a case of inconsistent metadata; the metadata is consistent, it's just pointing to data blocks that it shouldn't :-(
Heh, don't you wish you could delete comments ;)

Anyway, your 0s were to 0x5c00, or 23552 bytes, which is not on a 4k boundary; it's 5.75 4k blocks. It is on a 512-byte boundary, though. It also doesn't currently correlate with your block layout boundaries; your first extent is 244 blocks long, so it's not a sparse file leading to 0s; you really do seem to be pointing at valid blocks which just happen to have 0s in them.

Any idea if the 0s always start on 512-byte boundaries? Any other pattern to them? Working backwards from strings of 0s to "what went wrong" will be tough. :(

The 512 boundary is a little suspicious; if it were a delalloc problem I'd expect it to be on fs block boundaries (4k in your case).
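The boundary question above lends itself to a small script that finds every NUL run in a mailbox and reports its alignment. A rough sketch (function names are made up for illustration):

```python
def zero_runs(path, min_len=512):
    """Return (start_offset, length) for every run of NUL bytes of at
    least min_len bytes in the file at path."""
    runs = []
    start = None      # offset where the current NUL run began, if any
    pos = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(65536)
            if not chunk:
                break
            for byte in chunk:
                if byte == 0:
                    if start is None:
                        start = pos
                else:
                    if start is not None and pos - start >= min_len:
                        runs.append((start, pos - start))
                    start = None
                pos += 1
    if start is not None and pos - start >= min_len:
        runs.append((start, pos - start))
    return runs

def report(path):
    # For each run, show whether it starts on a 512-byte or 4k boundary.
    for start, length in zero_runs(path):
        print("0x%x +0x%x  512-aligned=%s  4k-aligned=%s"
              % (start, length, start % 512 == 0, start % 4096 == 0))
```

Run over each corrupted spool file, this would answer the 512-byte-boundary question for every instance of corruption at once.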
It would be interesting to get some idea of how your mail server is using the filesystem: is it using O_DIRECT? O_SYNC? etc. Is it multi-threaded? (This reminds me of a misbehaving multi-threaded app from my past that tried to be clever with its threads all writing to the same file.)
Doug, can you attach your dovecot config? I may try pointing http://www.imapwiki.org/ImapTest at dovecot over ext4. Are the dovecot mail files the only thing you've seen problems with?

-Eric
(In reply to comment #10)
> Anyway, your 0s were to 0x5c00, or 23552 bytes, which is not on a 4k boundary,
> it's 5.75 4k blocks. It is on a 512-byte boundary though.

You guys need glasses or something ;-) You asked for the kernel version when I gave it in comment #1, and the 0s were to 0x5c000, not 0x5c00, which is exactly 368k, which *is* a 4k boundary: it's file blocks 0 through 91 ;-)

That kinda makes the rest of your comment moot. Although I'll answer the 0s question for you: this is on a relatively new raid5 device. As part of the initial raid5 sync, the entire array gets zeroed out. So, any time the metadata points to blocks that have never actually been written, which is going to be almost all the time when it points to random, otherwise unallocated blocks, it's going to return 0s. In the future, when this array has had lots of stuff written and deleted, that will start to show random data instead.

Now, as for the file server/mail server: I'm using the exact same setup I've been using for several years. However, about a month ago, I upgraded from Fedora 9 to Fedora 11. During that upgrade, I put in new hard drives to expand capacity and made a new raid5 array on the new drives (the old setup had a 4-disk raid5 array, the new a 6-disk raid5 array). Obviously, the versions of sendmail and dovecot were updated during the f9-f11 update, but they are close enough that I was able to simply reuse my old config files. The only major difference between the old server and the new is the switch from ext3 to ext4. In the 6 years or so that I've been running my own mail server over ext3, I can't recall any problems like I've had recently with ext4, while this new setup has resulted in at least 5 known mail file corruptions so far.

That being said, I'm more than happy to give you my dovecot config and sendmail config files since they might help you track this down, but so you know, these are known working configurations on ext3.
Sigh, dropped a 0 on the size, sorry. OK, well, it's still not a sparse file, and the 0s are also not aligned on any extent boundaries ;)

Yeah, I'm not all that inclined to implicate dovecot or sendmail, but running an imap stress test over ext4 might be a way to replicate the bug. Thanks for the info; sorry about the misreading :)

-Eric
I'm not sure an IMAP stress test will help. I think you need a combined IMAP/procmail stress test to really get to things. The basic operation is that you have an IMAP mbox file with some number of emails. On occasion, you purge the deleted emails (I have no idea if dovecot mmaps the file, rewrites its own memory and then flushes the changes, or what it does, but it does it somehow). Simultaneously, you have new mails being delivered. Obviously, there is locking between a dovecot flush and a procmail append operation, but it's likely that somewhere in this locking is the problem.

Would it help if I did a raw grep of the hard disk looking for the missing email messages? If they are found, it would indicate that the blocks got redirected somewhere they shouldn't have (or the write from dovecot was supposed to go to these blocks instead of writing over the original blocks, but never made it). If they aren't found, then it's a clear indication that things got overwritten in some way. I suspect the data is there on the disk somewhere, and that instead of overwriting in place the ext4 code attempted to do a new block write, but the new data simply never made it to the disk. Of course, that's just my guess.
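A raw scan like the one proposed above can be sketched in a few lines: stream the block device and report the byte offset of every occurrence of a distinctive string (say, a Message-ID from one of the lost emails). The function name is invented, and in practice this would run as root against the md device itself:

```python
def find_on_device(dev_path, needle, chunk_size=1 << 20):
    """Stream dev_path and yield the byte offset of every occurrence of
    needle, handling matches that straddle chunk boundaries."""
    offset = 0   # bytes consumed before the current chunk
    tail = b""   # last len(needle)-1 bytes of the previous chunk
    with open(dev_path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf = tail + chunk
            base = offset - len(tail)
            i = buf.find(needle)
            while i != -1:
                yield base + i
                i = buf.find(needle, i + 1)
            # Keep just enough bytes to catch a match split across reads.
            tail = buf[-(len(needle) - 1):] if len(needle) > 1 else b""
            offset += len(chunk)
```

Any hit at an offset outside the corrupted file's extents would support the blocks-got-redirected theory; no hits at all would support plain overwriting.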
(In reply to comment #17) > Would it help if I did a raw grep of the hard disk looking for the missing > email messages? Sure, it'd be a clue ... -Eric
I had a new instance of corruption show up today (I'm making regular reviews of the mail spool files and IMAP folders for all accounts now). This one, unless I can work around it, will wipe out the wife's entire Inbox :-(
Doug, while we try to reproduce, it might be an interesting test to try using XFS for your mail spool; it has delayed allocation and other features similar to ext4's. If the corruption issues show up with XFS as well, it will help point us in a different direction (app issues? non-fs corruption?).
Doug, any time you find this corruption, can you provide details on where the 0s were in the file? I'm wondering: were the 0s where there was previously good data, or does it look like fresh data that should have been written?

Oops, putting on my glasses ;) You said: "Usually, those blocks of NULL characters have overwritten previous data and new data is appended beyond the NULL stream." Is that still the case? Is there any apparent pattern or consistency to this overwriting?

Did the grep through the disk find anything interesting?
Regarding switching the mail spool to XFS: I would have to shrink my / partition, free up some space on the disks, create a new raid array, put XFS on it, and mount it as the mail spool, plus do additional mounts in users' home directories for the IMAP mailboxes. Not a small task :-(

As for the corruption, the NULL bytes can happen in the middle of existing emails. However, when they do, they usually wipe out far more than just the one email they start in the middle of. The latest corruption looks like this:

/var/spool/mail/jessica
0001000 0000 0000 0000 0000 0000 0000 0000 0000
*
000b000 7274 6e6f 3e67 4c20 6f6f 206b 7075 7920
--
000c000 0000 0000 0000 0000 0000 0000 0000 0000
*
0028000 7720 7469 2068 5345 544d 2050 6469 6e20

So, the initial run of NULL bytes is smallish (0xa000 bytes), but it's right at the beginning of the file. It falls within the boundary of the second email in the spool file, and the transition looks like this in vi:

(Mostly that would^@^@^@ .... <a href="http

The email that starts the corruption is definitely a different mail than the one where the corruption ends. The next stretch of corruption starts in yet another email (not far away) and looks like this:

Received: from mail1.no^@^@^@ ... (version=TLSv1/SSLv3

Even though this corruption begins in a mail header and ends in a mail header, they are definitely not the same message.
Here's the filefrag output on this spool file as it stands:

[root@firewall ~]# filefrag -v /var/spool/mail/jessica
Checking /var/spool/mail/jessica
Filesystem type is: ef53
Filesystem cylinder groups is approximately 18299
File is stored in extents format
Blocksize of file /var/spool/mail/jessica is 4096
File size of /var/spool/mail/jessica is 1430389 (350 blocks)
First block: 24416256
Last block: 24416605
/var/spool/mail/jessica: 1 extent found
[root@firewall ~]#

And here's the debugfs output as it stands:

[root@firewall ~]# debugfs /dev/md/root
debugfs 1.41.4 (27-Jan-2009)
debugfs:  stat /var/spool/mail/jessica
Inode: 5384004   Type: regular    Mode:  0660   Flags: 0x80000
Generation: 3103287788    Version: 0x00000000:00000001
User:   501   Group:    12   Size: 1425738
File ACL: 0    Directory ACL: 0
Links: 1   Blockcount: 2800
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x4a6f3000:0d4152ec -- Tue Jul 28 13:06:08 2009
 atime: 0x4a692e4d:246c438c -- Thu Jul 23 23:45:17 2009
 mtime: 0x4a6f3000:0d4152ec -- Tue Jul 28 13:06:08 2009
crtime: 0x4a380231:36390498 -- Tue Jun 16 16:36:01 2009
Size of extra inode fields: 28
Extended attributes stored in inode body:
  selinux = "system_u:object_r:mail_spool_t:s0\000" (34)
BLOCKS:
(IND):44892265, (0-348):24416256-24416604
TOTAL: 350
debugfs:

So, the first 4k of the file is good, then a break, then another good 4k block, then another big break. If I had to guess, I would almost say this looks like some sort of defrag reallocation, where the fs moves data blocks to avoid fragmentation in a file that has lots of deletes in the middle of the file data, and the data is not getting moved from the old blocks to the new blocks properly. But that's just a guess on my part.
What helped me debug this kind of issue in the past (different distro/fs/app) was getting a trace of the IO-related system calls. We are only interested in the ones that could change the state of the file (open/creat/lseek/write/mmap). Before starting, verify that all of the dovecot mail files are clean; we can stop the trace once you see any zeros and then search backwards in the trace for the last touch. If you have not done the dump/restore yet, any chance you have enough time/space to store a massive trace :-) ?
One other question: how do we synchronize the RAID5 remirroring IO with FS IO? I wonder if delalloc plus the remirror operation has opened a window here where MD might zero blocks that should have valid data.
What are you calling "remirroring I/O"?
In comment 13 above (https://bugzilla.redhat.com/show_bug.cgi?id=513221#c13), you mentioned the original setup zeros data. Probably not related, but I do wonder if delayed allocation could play tricks with the MD threads scheduling this kind of zeroing. One interesting question: did you see this corruption in any files not related to dovecot (again, assuming you would notice)?
Ah, ok. The answer to that is no. The raid subsystem actually does a reconstruct on drive initialization, I said it zeroed the blocks because I knew the drives were brand new and it was going to read all zeros and construct parity to match. However, any time a write comes in, whether or not we've completed the reconstruction, the write goes to the correct data blocks on the disks and when we get around to those blocks we read back from disk what was written. So, no race condition. Actually, this description is still somewhat simplified from reality in that it describes a resync when what we do on raid creation is a reconstruction, but it gets the point across. As for other files, I didn't have an easy means of checking them for corruption. I could run an rpm verify, but that would only check rpm installed files, which are mostly static and probably wouldn't have this problem. Data, such as the files in a git repo, would eventually get caught by a hash/data mismatch. Other data files, like the ones I keep in /srv/Fedora that are mirrors of upstream get overwritten and so likely wouldn't stay corrupted even if they are today. Other data files, like my music collection, I wouldn't know are corrupted until I tried to play them, and that's over 7GB of music files so it would take over 7 days of 24 hour a day listening to check them all :-/
Ok, just a whim. I see that dovecot actually does use posix_fallocate in places:

int file_set_size(int fd, off_t size)
{
...
#ifdef HAVE_POSIX_FALLOCATE
	if (posix_fallocate_supported) {
		if (posix_fallocate(fd, st.st_size, size - st.st_size) == 0)
			return 0;
...

in f11, posix_fallocate on ext4 will actually call sys_fallocate, which creates unwritten extents. For unwritten extents, on-disk data is not exposed because they are flagged as uninitialized, and read back as 0s. It would be interesting to see if, possibly, your 0s are due to that range of blocks being marked as unwritten. It's a longshot, but ...

http://sandeen.fedorapeople.org/utilities/fiemap-test-new.c

is a bit of a hacky fiemap exerciser; unwritten extents will be shown in flags:

[root@inode tmp]# dd if=/dev/zero of=testfile bs=4k count=1
[root@inode tmp]# /root/fallocate -o 4k -l 4k testfile
[root@inode tmp]# ./fiemap-test-new -S testfile
ext:   0 logical: [   0..   0] phys: 1323743..1323743 flags: 0x000 tot: 1
ext:   1 logical: [   1..   1] phys: 1323739..1323739 flags: 0x801 tot: 1

0x800 is preallocated. Pointing the fiemap tool at a corrupted file *might* be interesting.

-Eric
Oh, and -if- by chance an extent shows up as preallocated/unwritten, then using the physical block number to go and read directly from disk may be the next step. -Eric
I have also seen this corruption several times in an mbox file written only by procmail (via sendmail) and dovecot (accessed using POP3). A compare with a pre-corruption backup of the file a few days earlier shows that the first 4096 bytes are set to NUL, with no other corruption. The 4096-byte mark is not a message boundary, or even a line boundary. There were some system reboots between the time of the good backup and when I noticed the corruption, but none were unclean shutdowns. The filesystem was originally created as ext3, and then converted to ext4 a while ago by adding the new options. The sendmail and dovecot configs are pretty close to stock, with no changes in procmail or dovecot file handling options. Other info is in the log below.

I'm sorry that I was unable to try reading from the preallocated physical block 26161055 before I fixed the file by rewriting the correct data (maybe next time). I might be willing to try a system call trace, if there is not a lot of overhead. Please let me know exactly how you'd like to see this set up. (Yes, I know that the filesystem on /dev/md1 is much smaller than the volume.)

---
# uname -r
2.6.30.8-64.fc11.x86_64
# cd /var/spool/mail
# ls -ls paula
3852 -rw-rw----. 1 paula paula 3936926 Oct 18 01:02 paula
# fiemap-test-new -vv paula
Printing values in 4096-byte-block units
Mapping 18446744073709551615 bytes from offset 0
Input:  start 0 length 18446744073709551615 flags 0x0 count 32
Output: start 0 length 18446744073709551615 flags 0x0 count 32 mapped 5
ext:   0 logical: [     0..     0] phys: 26161055..26161055 flags: 0x800 tot: 1
ext:   1 logical: [     1..   127] phys: 26161056..26161182 flags: 0x000 tot: 127
ext:   2 logical: [   128..   243] phys: 26149376..26149491 flags: 0x000 tot: 116
ext:   3 logical: [   244..   511] phys: 26214956..26215223 flags: 0x000 tot: 268
ext:   4 logical: [   512..   961] phys: 43226112..43226561 flags: 0x001 tot: 450
# debugfs /dev/md1
debugfs 1.41.4 (27-Jan-2009)
debugfs:  stat /var/spool/mail/paula
Inode: 5322715   Type: regular    Mode:  0660   Flags: 0x80000
Generation: 4234752118    Version: 0x00000000:00000001
User:  1065   Group:  1065   Size: 3936926
File ACL: 0    Directory ACL: 0
Links: 1   Blockcount: 7704
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x4adacb95:4bd9f51c -- Sun Oct 18 01:02:29 2009
 atime: 0x4adacc6a:372b140c -- Sun Oct 18 01:06:02 2009
 mtime: 0x4adacb95:4bd9f51c -- Sun Oct 18 01:02:29 2009
crtime: 0x4a960276:c16641bc -- Wed Aug 26 20:50:14 2009
Size of extra inode fields: 28
Extended attributes stored in inode body:
  selinux = "unconfined_u:object_r:default_t:s0\000" (35)
BLOCKS:
(IND):20021749, (0-127):26161055-26161182, (128-243):26149376-26149491, (244-511):26214956-26215223, (512-961):43226112-432226561
TOTAL: 963
debugfs:
#
# od -Ax -c paula | head
000000  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
*
001000   o   n   t   e   n   t   -   T   y   p   e   :       t   e   x
001010   t   /   h   t   m   l   ;  \n  \t   c   h   a   r   s   e   t
001020   =   "   U   S   -   A   S   C   I   I   "  \n   C   o   n   t
001030   e   n   t   -   T   r   a   n   s   f   e   r   -   E   n   c
001040   o   d   i   n   g   :       q   u   o   t   e   d   -   p   r
001050   i   n   t   a   b   l   e  \n  \n   <   h   t   m   l       x
001060   m   l   n   s   :   v   =   3   D   "   u   r   n   :   s   c
001070   h   e   m   a   s   -   m   i   c   r   o   s   o   f   t   -
# df .
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/md1             412990128 225527976 166494148  58% /
# mdadm -D /dev/md1
/dev/md1:
        Version : 1.00
  Creation Time : Tue Feb  3 05:41:05 2009
     Raid Level : raid10
     Array Size : 1744112384 (1663.32 GiB 1785.97 GB)
  Used Dev Size : 872056192 (831.66 GiB 892.99 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent
  Intent Bitmap : Internal
    Update Time : Sun Oct 18 01:00:35 2009
          State : active
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0
         Layout : offset=2
     Chunk Size : 64K
           Name : root
           UUID : a6be9de2:11f8f516:463ae293:3729c4a4
         Events : 66910
    Number   Major   Minor   RaidDevice State
       0       8        3        0      active sync   /dev/sda3
       1       8       22        1      active sync   /dev/sdb6
       2       8       35        2      active sync   /dev/sdc3
       3       8       51        3      active sync   /dev/sdd3
# dumpe2fs -h /dev/md1
dumpe2fs 1.41.4 (27-Jan-2009)
Filesystem volume name:   /
Last mounted on:          <not available>
Filesystem UUID:          7cbef9a4-c767-4841-8389-e5e234a16b68
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              25651200
Block count:              104857600
Reserved block count:     5242001
Free blocks:              46865504
Free inodes:              24266395
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      226
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8016
Inode blocks per group:   501
RAID stride:              16
Filesystem created:       Thu Nov  6 03:52:33 2008
Last mount time:          Sat Oct 17 20:45:37 2009
Last write time:          Sat Oct 17 20:45:37 2009
Mount count:              37
Maximum mount count:      -1
Last checked:             Wed Oct  7 03:43:41 2009
Check interval:           0 (<none>)
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
First orphan inode:       114603
Default directory hash:   half_md4
Directory Hash Seed:      b5d3063b-46f7-437e-9baf-c3158dcd4369
Journal backup:           inode blocks
Journal size:             128M
# diff /etc/dovecot.conf{.dist,}
20c20
< #protocols = imap imaps pop3 pop3s
---
> protocols = imap imaps pop3 pop3s
42c42
< #listen = *, [::]
---
> listen = *, [::]
95,96c95,96
< #ssl_cert_file = /etc/pki/dovecot/certs/dovecot.pem
< #ssl_key_file = /etc/pki/dovecot/private/dovecot.pem
---
> ssl_cert_file = /etc/pki/dovecot/certs/dovecot.pem
> ssl_key_file = /etc/pki/dovecot/private/dovecot.pem
#
I have seen this corruption again, with exactly the same symptom, on the same file. This time, I have read the raw disk data underlying the extent flagged erroneously as FIEMAP_EXTENT_UNWRITTEN and determined that it does contain the proper data that previously appeared at that place in the file (that is, the file had correct data, and then somehow the flag got set on an extent). I'm now running strace on dovecot to try to catch the next corruption.

The common sequence seems to be:

1. procmail appends messages to the file, extending it.

2. /usr/libexec/dovecot/pop3 reads messages from the file, and then uses ftruncate() to delete all but the first sentinel message. In this case, the file is left at 530 bytes, with one block in one extent allocated. Note that since the file never becomes empty (0 bytes), the first block has had the same block number through many cycles of append and truncate, and at least two cycles of corruption and repair.

3. Once the corruption occurs, dovecot refuses to use the file, conveniently preserving its state. procmail continues to append to it, however.

A log is below, slightly edited for privacy. It shows the erroneous state, then a file repair, then the state after.

---
# cd /var/spool/mail
# fiemap-test-new -vv paula
Printing values in 4096-byte-block units
Mapping 18446744073709551615 bytes from offset 0
Input:  start 0 length 18446744073709551615 flags 0x0 count 32
Output: start 0 length 18446744073709551615 flags 0x0 count 32 mapped 10
ext:   0 logical: [     0..     0] phys: 26161055..26161055 flags: 0x800 tot: 1
ext:   1 logical: [     1..    13] phys: 24102136..24102148 flags: 0x000 tot: 13
ext:   2 logical: [    14..    14] phys: 24102186..24102186 flags: 0x000 tot: 1
ext:   3 logical: [    15..    69] phys: 24084981..24085035 flags: 0x000 tot: 55
ext:   4 logical: [    70..   105] phys: 24085376..24085411 flags: 0x000 tot: 36
ext:   5 logical: [   106..   511] phys: 26870352..26870757 flags: 0x000 tot: 406
ext:   6 logical: [   512..  1023] phys: 26871296..26871807 flags: 0x000 tot: 512
ext:   7 logical: [  1024..  2047] phys: 43353088..43354111 flags: 0x000 tot: 1024
ext:   8 logical: [  2048..  4095] phys: 63455232..63457279 flags: 0x000 tot: 2048
ext:   9 logical: [  4096..  5071] phys: 63461376..63462351 flags: 0x001 tot: 976
# debugfs /dev/md1
debugfs 1.41.4 (27-Jan-2009)
debugfs:  stat /var/spool/mail/paula
Inode: 5322715   Type: regular    Mode:  0660   Flags: 0x80000
Generation: 4234752118    Version: 0x00000000:00000001
User:  1065   Group:  1065   Size: 20774824
File ACL: 0    Directory ACL: 0
Links: 1   Blockcount: 40584
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x4ae0ac06:6df02d80 -- Thu Oct 22 12:01:26 2009
 atime: 0x4ae0aa0a:528cc5d4 -- Thu Oct 22 11:52:58 2009
 mtime: 0x4ae0ac06:6df02d80 -- Thu Oct 22 12:01:26 2009
crtime: 0x4a960276:c16641bc -- Wed Aug 26 20:50:14 2009
Size of extra inode fields: 28
Extended attributes stored in inode body:
  selinux = "unconfined_u:object_r:default_t:s0\000" (35)
BLOCKS:
(IND):20021749, (0):26161055, (1-13):24102136-24102148, (14):24102186, (15-69):24084981-24085035, (70-105):24085376-24085411, (106-511):26870352-26870757, (512-1023):26871296-26871807, (1024-2047):43353088-43354111, (2048-4095):63455232-63457279, (4096-5071):63461376-63462351
TOTAL: 5073
debugfs:
#
# od -Ax -c paula | head
000000  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
*
001000   i   d       n   9   K   J   I   D   T   6   0   1   3   8
001010   2   1  \n  \t   f   o   r       <   _   _   _   _   _   _   _
001020   _   _   _   @   _   _   _   _   _   _   _   _   _   _   _   _
001030   .   _   _   _   >   ;       T   u   e   ,       2   0       O
001040   c   t       2   0   0   9       1   2   :   1   8   :   1   3
001050       -   0   7   0   0  \n   E   n   v   e   l   o   p   e   -
001060   T   o   :       <   _   _   _   _   _   _   _   _   _   _   @
001070   _   _   _   _   _   _   _   _   _   _   _   _   .   _   _   _
# dd if=/dev/md1 bs=4096 skip=26161055 count=1 | od -Ax -c | tail
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.00264496 s, 1.5 MB/s
000f70   s   t   -   m   a   i   l   .   c   o   m   >  \n   R   e   c
000f80   e   i   v   e   d   :       f   r   o   m       p   e   a   k
000f90   m   a   i   l   2   .   v   e   c   t   o   r   v   e   s   t
000fa0   .   l   o   c   a   l       (   m   a   i   l   .   v   e   c
000fb0   t   o   r   v   e   s   t   -   m   a   i   l   .   c   o   m
000fc0       [   6   6   .   1   2   9   .   7   2   .   2   2   8   ]
000fd0   )  \n  \t   b   y       _   _   _   _   _   _   _   _   _   _
000fe0   _   .   _   _   _       (   8   .   1   4   .   3   /   8   .
000ff0   1   4   .   3   )       w   i   t   h       E   S   M   T   P
001000
# dd if=/dev/md1 bs=4096 skip=26161055 count=1 | dd of=paula bs=4096 conv=notrunc
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 2.2419e-05 s, 183 MB/s
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 4.2116e-05 s, 97.3 MB/s
# fiemap-test-new -vv paula
Printing values in 4096-byte-block units
Mapping 18446744073709551615 bytes from offset 0
Input:  start 0 length 18446744073709551615 flags 0x0 count 32
Output: start 0 length 18446744073709551615 flags 0x0 count 32 mapped 10
ext:   0 logical: [     0..     0] phys: 26161055..26161055 flags: 0x000 tot: 1
ext:   1 logical: [     1..    13] phys: 24102136..24102148 flags: 0x000 tot: 13
ext:   2 logical: [    14..    14] phys: 24102186..24102186 flags: 0x000 tot: 1
ext:   3 logical: [    15..    69] phys: 24084981..24085035 flags: 0x000 tot: 55
ext:   4 logical: [    70..   105] phys: 24085376..24085411 flags: 0x000 tot: 36
ext:   5 logical: [   106..   511] phys: 26870352..26870757 flags: 0x000 tot: 406
ext:   6 logical: [   512..  1023] phys: 26871296..26871807 flags: 0x000 tot: 512
ext:   7 logical: [  1024..  2047] phys: 43353088..43354111 flags: 0x000 tot: 1024
ext:   8 logical: [  2048..  4095] phys: 63455232..63457279 flags: 0x000 tot: 2048
ext:   9 logical: [  4096..  5071] phys: 63461376..63462351 flags: 0x001 tot: 976
# debugfs /dev/md1
debugfs 1.41.4 (27-Jan-2009)
debugfs:  stat /var/spool/mail/paula
Inode: 5322715   Type: regular    Mode:  0660   Flags: 0x80000
Generation: 4234752118    Version: 0x00000000:00000001
User:  1065   Group:  1065   Size: 20774824
File ACL: 0    Directory ACL: 0
Links: 1   Blockcount: 40584
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x4ae0ad17:bfeca90c -- Thu Oct 22 12:05:59 2009
 atime: 0x4ae0ad38:aff3e468 -- Thu Oct 22 12:06:32 2009
 mtime: 0x4ae0ad17:bfeca90c -- Thu Oct 22 12:05:59 2009
crtime: 0x4a960276:c16641bc -- Wed Aug 26 20:50:14 2009
Size of extra inode fields: 28
Extended attributes stored in inode body:
  selinux = "unconfined_u:object_r:default_t:s0\000" (35)
BLOCKS:
(IND):20021749, (0):26161055, (1-13):24102136-24102148, (14):24102186, (15-69):24084981-24085035, (70-105):24085376-24085411, (106-511):26870352-26870757, (512-1023):26871296-26871807, (1024-2047):43353088-43354111, (2048-4095):63455232-63457279, (4096-5071):63461376-63462351
TOTAL: 5073
debugfs:
#
Oh! Ok, this is the first time, I think, that we've determined that it really is flagged as unwritten extents... That's a huge clue, and narrows down the problem significantly, thank you! I wonder if there's a way to turn that off or short-circuit it for the time being, just to save your mail spool. I have a version of fsx that does random fallocates; perhaps I can tweak it to do it more intensively and see if fsx can ever hit a corruption that way, or maybe after taking a hard look at your description of the IO we can come up with a testcase ... Otherwise maybe tracing read/write/fallocate/truncates would give us a clue, though I hate to sacrifice any more mail spools for this ...
Any chance you have errors like "Error: Corrupted index cache file" in dovecot error logs (maybe syslog?)
On 2.6.30, running dovecot's imaptest against ext4 gives me the corrupted index cache file messages above. However, it only seems to show up on sub-page blocks; so far 4k blocks seem ok, and I expect you guys all had 4k blocks. It was also ok w/ 1k blocks and nodelalloc.... In any case, 2.6.31 seems to run fine. Perhaps the problem is related; I'll see what fixed the sub-page-block case in .31 and see if it seems relevant...

Ok, that particular bug is likely to be:

c364b22c9580a885e0f8c0d0f9710d67dc448958 ext4: Fix mmap/truncate race when blocksize < pagesize && delayed allocation

though I'm not sure that's going to fix the original case here.
Don't let yourself get too excited about the "Error: Corrupted index cache file" messages from dovecot. I get them on ext3, ext4, and xfs. They do not, however, affect operation. I have removed the supposedly corrupt index files entirely so that dovecot could remake them and had them return in less than 24 hours. This appears to be an issue with dovecot that has nothing to do with the underlying file system. However, this issue has never resulted in a corrupt mail box or data, just warning messages in syslog (and possibly slower performance although I wouldn't know that without a better benchmarking system).
I'd just like to add a "me too"...

07ffff0 6564 3768 7569 5246 3465 7934 6f74 3556
0800000 0000 0000 0000 0000 0000 0000 0000 0000
*
2478130 7246 6d6f 6c20 6e69 7875 762d 7861 622d
2478140 756f 636e 7365 6d40 6961 2e6c 6570 6772

It basically seems as if the mbox file was overwritten from 0800000 and onwards; the text at 2478130 is a new mail that was received after the problem occurred. Fortunately it was my own inbox (a little less than two years' worth of mail :-( ) and not the old missus', as the latter would probably have caused the immediate termination of myself. This is on kernel 2.6.30.8-64.fc11.i686.PAE with postfix-2.5.6-3.fc11.i586 and dovecot-1.2.6-1_101.fc11.i586 ; the fs is ext4 with default options.
Managed to forget one thing: The only possible lead I have as to what might have caused the error is that it may have happened as the dovecot rpm was upgraded to 1.2.6-1_101 from the last version before that. The log files don't show any apparent problems; it seems as if dovecot was shut down and restarted cleanly.
Mikkel, if you still have the file around - and for anyone else who hits this - if you could point that fiemap-test application from comment #31 at the file in question, with -vv option, we can confirm that it is an unwritten extent problem. Thanks, -Eric
Sure - here you go:

[mikkel@server ~]$ ./fiemap-test-new -vv /var/spool/mail/mikkel
Printing values in 4096-byte-block units
Mapping 18446744073709551615 bytes from offset 0
Input:  start 0 length 18446744073709551615 flags 0x0 count 32
Output: start 0 length 18446744073709551615 flags 0x0 count 32 mapped 2
ext:   0 logical: [     0..  1305] phys:   155648..  156953 flags: 0x800 tot: 1306
ext:   1 logical: [  1306..  1311] phys:   156954..  156959 flags: 0x001 tot: 6
Yepper, 0x800 is the preallocated/unwritten flag, so that first range will just report 0s. You might try what the other reporter did in comment #35, and do a raw dd of those physical (4k blocks from 155648 to 156959) blocks - your data may well still be there, and would be a way to recover your mbox. If it does look intact that would be worth reporting back too, thanks! -Eric
As it turns out, something (dovecot, probably) had rewritten the mbox file to remove the zeroed block, so I wasn't able to recover anything. Oh well. I guess the lesson to learn is that if you end up with a corrupt inbox, don't touch it until you've tried recovering using the hack discussed above :-P
Tim, do you have those traces to share? Are you doing full traces or just some operations? I think creat/open/close/sync/mmap/write/fallocate/truncate would probably be sufficient... Thanks, -Eric
Created attachment 367000 [details]
strace excerpt of Dovecot file access before/after corruption

I was doing (almost) full straces of Dovecot with the following command:

strace -s128 -q -e!epoll_wait -tt -v -f

A trace of all syscalls outputs a trace line every second from the network listen process; excluding the tracing of epoll_wait() omits this clutter.

I had another file corruption while the trace was running, but on a different mail file than before. The trace excerpt shows the last good access to the file, when Dovecot updates data in the sentinel message in the first file block (lines 21-22), right after the ftruncate(). The next access several hours later shows that the first block reads as NULs (line 111). Note (line 100) that the atime has the timestamp of the read on line 21, so presumably procmail was the only other accessor in between, appending to the file. As before, the raw disk block still contains the file's correct block 0 data. Note that the value in the sentinel message (X-IMAP: 1251348499 0000026887) matches what was written in line 22 of the trace.

This data doesn't seem to help us in knowing when the corruption occurs. I see three options:

1. Dovecot triggered the setting of the unwritten flag. Either it happened after line 21, or somehow the block data was read (instead of returning NULs) despite the flag (from a cache?).

2. procmail triggered the setting of the unwritten flag. The simple method of tracing procmail would require tracing sendmail and all children, which is far too much overhead for us. I have some alternate ideas.

3. Something else happened, unrelated to dovecot or procmail. During the period in between the two trace times an rsync backup process ran which seemed to be hung for a long time reading some sparse VM image files of around 100 GB. I don't have any more info on that one yet.

Below is a log of the file state, and its repair.
One peculiar thing is that the flags for the first block still show 0x800 immediately after the dd write, but then show as 0 a short time later. Is this a journal/inode sync issue?

---
# cd /var/spool/mail/ralph
# fiemap-test-new -vv ralph
Printing values in 4096-byte-block units
Mapping 18446744073709551615 bytes from offset 0
Input:  start 0 length 18446744073709551615 flags 0x0 count 32
Output: start 0 length 18446744073709551615 flags 0x0 count 32 mapped 9
ext:   0 logical: [     0..     0] phys: 37507421..37507421 flags: 0x800 tot: 1
ext:   1 logical: [     1..  2148] phys: 42766313..42768460 flags: 0x000 tot: 2148
ext:   2 logical: [  2149..  2815] phys: 59170848..59171514 flags: 0x000 tot: 667
ext:   3 logical: [  2816..  2883] phys: 59171564..59171631 flags: 0x000 tot: 68
ext:   4 logical: [  2884..  2884] phys: 59171660..59171660 flags: 0x000 tot: 1
ext:   5 logical: [  2885..  2947] phys: 59171662..59171724 flags: 0x000 tot: 63
ext:   6 logical: [  2948..  2987] phys: 62042174..62042213 flags: 0x000 tot: 40
ext:   7 logical: [  2988..  3460] phys: 62032790..62033262 flags: 0x000 tot: 473
ext:   8 logical: [  3461..  3771] phys: 62036151..62036461 flags: 0x001 tot: 311
# dd if=/dev/md1 bs=4096 skip=37507421 count=1 | dd of=ralph bs=4096 conv=notrunc
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 2.3118e-05 s, 177 MB/s
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 5.8109e-05 s, 70.5 MB/s
# fiemap-test-new -vv ralph
Printing values in 4096-byte-block units
Mapping 18446744073709551615 bytes from offset 0
Input:  start 0 length 18446744073709551615 flags 0x0 count 32
Output: start 0 length 18446744073709551615 flags 0x0 count 32 mapped 9
ext:   0 logical: [     0..     0] phys: 37507421..37507421 flags: 0x800 tot: 1
ext:   1 logical: [     1..  2148] phys: 42766313..42768460 flags: 0x000 tot: 2148
ext:   2 logical: [  2149..  2815] phys: 59170848..59171514 flags: 0x000 tot: 667
ext:   3 logical: [  2816..  2883] phys: 59171564..59171631 flags: 0x000 tot: 68
ext:   4 logical: [  2884..  2884] phys: 59171660..59171660 flags: 0x000 tot: 1
ext:   5 logical: [  2885..  2947] phys: 59171662..59171724 flags: 0x000 tot: 63
ext:   6 logical: [  2948..  2987] phys: 62042174..62042213 flags: 0x000 tot: 40
ext:   7 logical: [  2988..  3460] phys: 62032790..62033262 flags: 0x000 tot: 473
ext:   8 logical: [  3461..  3771] phys: 62036151..62036461 flags: 0x001 tot: 311
# head ralph
From MAILER_DAEMON Mon Sep 28 11:28:14 2009
Date: Mon, 28 Sep 2009 11:28:14 -0700
From: Mail System Internal Data <MAILER-DAEMON@_____>
Subject: DON'T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA
Message-ID: <1254162494@_____>
X-IMAP: 1251348499 0000026887
Status: RO

This text is part of the internal format of your mail folder, and is not
a real message. It is created automatically by the mail system software.
# fiemap-test-new -vv ralph
Printing values in 4096-byte-block units
Mapping 18446744073709551615 bytes from offset 0
Input:  start 0 length 18446744073709551615 flags 0x0 count 32
Output: start 0 length 18446744073709551615 flags 0x0 count 32 mapped 9
ext:   0 logical: [     0..     0] phys: 37507421..37507421 flags: 0x000 tot: 1
ext:   1 logical: [     1..  2148] phys: 42766313..42768460 flags: 0x000 tot: 2148
ext:   2 logical: [  2149..  2815] phys: 59170848..59171514 flags: 0x000 tot: 667
ext:   3 logical: [  2816..  2883] phys: 59171564..59171631 flags: 0x000 tot: 68
ext:   4 logical: [  2884..  2884] phys: 59171660..59171660 flags: 0x000 tot: 1
ext:   5 logical: [  2885..  2947] phys: 59171662..59171724 flags: 0x000 tot: 63
ext:   6 logical: [  2948..  2987] phys: 62042174..62042213 flags: 0x000 tot: 40
ext:   7 logical: [  2988..  3460] phys: 62032790..62033262 flags: 0x000 tot: 473
ext:   8 logical: [  3461..  3771] phys: 62036151..62036461 flags: 0x001 tot: 311
#
Interesting, thanks - were there no fallocate calls in the trace?
The only fallocate call by processes 22115 or 26139 was on the file "/home/ralph/mail/.imap/INBOX/dovecot.index.cache". I presume this is not relevant, as this file was not corrupted. I'm sorry I can't post a large trace file, as it is too difficult to sanitize. If you're looking for anything else specific, please let me know.
Now it ate almost the entire rest of my inbox :-/ This time I restarted the server (cleanly).

[mikkel@server ~]$ hexdump /var/spool/mail/mikkel
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
051a000 382e 532f 6275 696d 2974 6920 2064 396e
051a010 4751 735a 366b 3130 3637 3138 0a3b 4d09
[...]
[mikkel@server ~]$ ./fiemap-test-new -vv /var/spool/mail/mikkel
Printing values in 4096-byte-block units
Mapping 18446744073709551615 bytes from offset 0
Input:  start 0 length 18446744073709551615 flags 0x0 count 32
Output: start 0 length 18446744073709551615 flags 0x0 count 32 mapped 4
ext:   0 logical: [     0..  1305] phys:   155648..  156953 flags: 0x800 tot: 1306
ext:   1 logical: [  1306..  1314] phys:   156954..  156962 flags: 0x000 tot: 9
ext:   2 logical: [  1315..  1366] phys: 11888901..11888952 flags: 0x000 tot: 52
ext:   3 logical: [  1367..  1370] phys: 11909120..11909123 flags: 0x001 tot: 4
FWIW, in this case the hint from comment #44 about restoring the missing data using dd worked fine.
Ok, focusing more on this bug now, I had to sort out an upstream corruption regression in .32 ... sorry. I'll look over your logs etc and see if there's anything useful and/or more I can ask for ... -Eric
(In reply to comment #47)
> This data doesn't seem to help us in knowing when the corruption occurs. I see
> three options:
>
> 1. Dovecot triggered the setting of the unwritten flag. Either it happened
> after line 21, or somehow the block data was read (instead of returning NULs)
> despite the flag (from a cache?).

Well, we don't see any allocation calls on this file ...

> 2. procmail triggered the setting of the unwritten flag. The simple method of
> tracing procmail would require tracing sendmail and all children, which is far
> too much overhead for us. I have some alternate ideas.

I don't see any calls in procmail that would result in preallocation ...

> 3. Something else happened, unrelated to dovecot or procmail. During the period
> in between the two trace times an rsync backup process ran which seemed to be
> hung for a long time reading some sparse VM image files of around 100 GB. I
> don't have any more info on that one yet.

I was wondering if we possibly reused a block which had been preallocated, and somehow got "old" prealloc state; but the file in question was truncated down to 530, so its first block never should have changed... throws out that idea I guess.

I wonder if a jprobe in the kernel's fallocate path would be worth it, to log something to dmesg whenever fallocate gets called on this fs, and by which process...?

> Below is a log of the file state, and it's repair. One peculiar thing is that
> the flags for the first block still show 0x800 immediately after the dd write,
> but then show as 0 a short time later. Is this a journal/inode sync issue?

Hm, I'll have to look into when the flag gets reset when the block gets written to.

-Eric
Created attachment 368953 [details]
jprobe for ext4_fallocate

This might be helpful... tar xvzf the tarball, and then in the ext4_falloc_probe/ directory, do:

make -C /lib/modules/`uname -r`/build M=`pwd`

(you'll need a matching kernel & kernel-devel rpm installed for the `uname -r` kernel)

That will build a .ko file, and if you insmod it you'll see messages like this on every call into ext4_fallocate:

Planted jprobe at ffffffff81225cf1, handler addr ffffffffa006d000
jprobe: ext4_fallocate by imap for inode 234972 (dovecot.index.cache), off 179200, len 5120
jprobe: ext4_fallocate by imap for inode 234972 (dovecot.index.cache), off 184320, len 1024
jprobe: ext4_fallocate by imap for inode 234972 (dovecot.index.cache), off 185344, len 7168
jprobe: ext4_fallocate by imap for inode 234972 (dovecot.index.cache), off 192512, len 1024
jprobe: ext4_fallocate by imap for inode 234972 (dovecot.index.cache), off 193536, len 1024
jprobe: ext4_fallocate by imap for inode 16 (testuser), off 12802047, len 6323
....

Not sure if this will be helpful or not, but it might be worth a shot.
I've run 20 fallocate-enabled fsx instances in parallel for 12 hours, with no corruption. Not hitting it by random ...
This may actually resolve it:

commit a41f20716975910d9beb90b7efc61107901492b8
Author: Aneesh Kumar K.V <aneesh.kumar.ibm.com>
Date:   Wed Jun 10 14:22:55 2009 -0400

    ext4: Avoid corrupting the uninitialized bit in the extent during truncate

    The unitialized bit was not properly getting preserved in in an extent
    which is partially truncated because the it was geting set to the value
    of the first extent to be removed or truncated as part of the truncate
    operation, and if there are multiple extents are getting removed or
    modified as part of the truncate operation, it is only the last extent
    which will might be partially truncated, and its uninitalized bit is
    not necessarily the same as the first extent to be truncated.

    Signed-off-by: Aneesh Kumar K.V <aneesh.kumar.ibm.com>
    Signed-off-by: "Theodore Ts'o" <tytso>

I don't think anyone has hit this in 2.6.31 and beyond, and that's where it landed. I'll build a test kernel if anyone is willing to try it ...

-Eric
kernel-2.6.30.9-99.fc11 has been submitted as an update for Fedora 11. http://admin.fedoraproject.org/updates/kernel-2.6.30.9-99.fc11
For the folks who haven't given up on ext4 for mail spools yet, please try the kernel in comment #57, I think it should resolve this issue. Thanks, -Eric
Thanks. Your efforts in fixing this issue are highly appreciated.
Sure thing, apologies it took so long, that fix just slipped my mind ... :(
kernel-2.6.30.9-99.fc11 has been pushed to the Fedora 11 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with su -c 'yum --enablerepo=updates-testing update kernel'. You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F11/FEDORA-2009-11803
kernel-2.6.30.9-99.fc11 has been pushed to the Fedora 11 stable repository. If problems still persist, please make note of it in this bug report.