Description of problem:
Corrupt xfs root filesystem with kernel-2.6.24.3-12.fc8 and kernel-2.6.24.3-34.fc8.

Version-Release number of selected component (if applicable):
kernel-2.6.24.3-12.fc8
kernel-2.6.24.3-34.fc8

How reproducible:
Install kernel 2.6.24.3-xx.

Steps to Reproduce:
1. Install a fresh F8
2. Update to kernel 2.6.24.3-xx
3. Reboot
4. Update everything else.
5. Reboot, boot from install CD and enter rescue mode
6. xfs_repair -n /dev/sdaX

Actual results:
xfs root filesystem is corrupt.

Expected results:
Filesystem should be clean.

Additional info:
Created attachment 298393 [details] xfs_repair output
Ok, that's odd. I've had xfs root on my F8 workstation for quite some time now w/ no trouble as far as I know... though I'll double check it from a rescue disk :) and I will try your testcase when I get some time. I'll ping the sgi guys on this one, too.

Thanks for saving the repair output! Have you actually reproduced this or was this a one-time occurrence thus far?

Thanks,
-Eric
FWIW all those bad magic numbers in the repair output are actually superblock magic numbers: 0x58465342 is "XFSB".

I see the repair output says "would have" - I guess you ran xfs_repair -n. Is the fs still in this state or is it fully repaired? If it's still in this state maybe you can also capture an xfs_metadump.

And just 'cause I have to ask, did the system lose power anywhere in between, and are you running any proprietary kernel modules?

And to be specific about it, what version of xfsprogs did you use?
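For anyone who wants to check the magic number claim, here is a minimal sketch that turns the 32-bit value from the repair output into its four ASCII bytes. It assumes the value is read big-endian, as XFS stores its on-disk metadata; the function name is only for illustration.

```python
import struct

XFS_SB_MAGIC = 0x58465342  # value quoted from the xfs_repair output

def magic_to_ascii(magic: int) -> str:
    """Pack a 32-bit magic number big-endian and decode the bytes as ASCII."""
    return struct.pack(">I", magic).decode("ascii")

print(magic_to_ascii(XFS_SB_MAGIC))  # XFSB
```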
Talked this over with the sgi guys, and because repair seems to be finding superblock magic where an inode should be... is there any chance that a mkfs happened over the top of a valid filesystem, or that there is any confusion between the filesystem being on /dev/sda vs on /dev/sda1? Is this a regular dos partition table? -Eric
(In reply to comment #2)
> Thanks for saving the repair output! Have you actually reproduced this or was
> this a one-time occurrence thus far?

Friday evening I updated 2 systems with the latest kernel from updates. Monday morning one of the systems did not start X because of errors (no kde plugins). I had a quick look in /var/log/messages and saw some errors about the XFS filesystem. I ran xfs_repair, rebooted, logged in, replaced about 40 damaged rpms, rebooted and ran xfs_repair again. Now I had even more errors. I ran xfs_repair once more and rebooted, and the system was damaged beyond repair (unable to log in at init 3, only at init 1). Directories /usr/lib and /usr/lib64 are no longer on the filesystem. The same happened with the second system, but in one step (/usr/lib and /usr/lib64 missing after the first xfs_repair).

Yes, I have reproduced this 3 times: 2 times on an AMD x64 computer and one time on an Intel x64.

The xfs_metadump is too large to be attached and can be downloaded from: http://www.vlasiu.net/xfs/

I ran:
xfs_metadump -o -w /dev/sdb3 /mnt/usbdisk/sdb3_metadata 2>/mnt/usbdisk/sdb3_metadata.warnings.txt

No, I do not use any proprietary kernel modules on the test case systems (plain fedora install). I do use vmware and nvidia modules, but not on the 3 systems above. The xfsprogs version is 2.9.4 (from the Fedora 8 install CD).

> is there any chance that a mkfs happened over the top of a valid filesystem,
> or that there is any confusion between the filesystem being on /dev/sda vs
> on /dev/sda1?

No, I do not think so. Everything was OK with the default kernel installed from the F8 CD and with all kernels from F8 updates except 2.6.24.3-xx. On every install each partition was reformatted.
Partition table is a regular one:

Disk /dev/sdb: 80.0 GB, 80026361856 bytes
255 heads, 63 sectors/track, 9729 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0xc7ad5de6

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *           1          19      152586   83  Linux
/dev/sdb2              20        7285    58364145   83  Linux
/dev/sdb3            7286        9197    15358140   83  Linux
/dev/sdb4            9198        9729     4273290    5  Extended
/dev/sdb5            9198        9328     1052226   83  Linux
/dev/sdb6            9329        9729     3221001   82  Linux swap / Solaris

Sincerely,
Gabriel
(In reply to comment #3)
> And just 'cause I have to ask, did the system lose power anywhere in between,

No. I would not file a bug report if I lost a filesystem because of a power failure. :-)

Sincerely,
Gabriel
XFS is working fine here with 2.6.24.3-12 and 2.6.24.3-34, on 32-bit x86
re: comment #5 > I had a quick look in /var/log/messages and I see some errors about > XFS filesystem). What were the errors? And any errors before the xfs errors? re: comment #6, if you had barriers enabled, a power loss, and subsequent corruption, it would still be bug-worthy. :)
(In reply to comment #8)
> What were the errors? And any errors before the xfs errors?

Something about XFS_WANT_CORRUPTED_GOTO. I do not have the log file anymore since I reinstalled the system several times. No errors before. This was the hint to run xfs_repair. The same error appeared on all corrupted installs, so I assumed it is an internal xfs error message after corruption of the filesystem.

Also, one time I had errors during the install of updates (update kernel, reboot, apply other updates). rpm was unable to write some files. Then I did a cd to the directory where rpm reported the problem and ran ls, and ls reported an I/O error.

> re: comment #6, if you had barriers enabled, a power loss, and subsequent
> corruption, it would still be bug-worthy. :)

:-)

Sincerely,
Gabriel
Can you let memtest86 run for a while? I'd also look carefully for any scsi/ide/IO type errors in the logs. I'll see if I can glean anything from the inode numbers with problems and their locations on disk...
(In reply to comment #10)
> Can you let memtest86 run for a while? I'd also look carefully for any
> scsi/ide/IO type errors in the logs.
>
> I'll see if I can glean anything from the inode numbers with problems and their
> locations on disk...

memtest86 has been running for more than 55 minutes and there are no errors (1 pass, second is 72% done). Anyway, I will let memtest run until tomorrow morning. But since there are no errors running with kernel 2.6.23, I do not think there are going to be errors.

I made another attempt to replicate this bug, this time in vmware (I did not install vmware tools or anything else inside the guest except a plain F8 x86_64 install) on another computer.

1. Install F8 (text mode, de-select everything when the installer asks what packages to install).
2. Reboot, power off and make a snapshot.
3. Boot into F8 and update the kernel (2.6.24.3-34).
4. Power off and make another snapshot.
5. Boot into F8 and install all other updates.
6. Power off and make another snapshot.
7. Boot in rescue mode and run xfs_repair. There are errors in the xfs filesystem.
8. Boot in rescue mode with the snapshot from step 4 and run xfs_repair. No errors in the / filesystem.

Sincerely,
Gabriel
> I made a second test in vmware on a different system which has worked perfectly for years (nothing installed inside the guest except a plain F8 x86_64, no vmware tools)

1. Install F8 (text mode, de-select everything when the installer asks what packages to install).
2. Reboot, power off and make a snapshot.
3. Boot into F8 and update the kernel (2.6.24.3-34).
4. Power off and make another snapshot.
5. Boot into F8 and install all other updates.

During this step I received lots and I mean LOTS of messages like this:

I/O error in filesystem ("sda2") meta-data dev sda2 block 0x752c40008 ("xfs_trans_read_buf") error 5 buf count 4096
Updating : sendmail #################### [187/365]I/O error in filesystem ("sda2") meta-data dev sda2 block 0x752c40008 ("xfs_trans_read_buf") error 5 buf count 4096
I/O error in filesystem ("sda2") meta-data dev sda2 block 0x752c40008 ("xfs_trans_read_buf") error 5 buf count 4096
I/O error in filesystem ("sda2") meta-data dev sda2 block 0x752c40008 ("xfs_trans_read_buf") error 5 buf count 4096
Updating : sendmail ##################### [187/365]
Error unpacking rpm package sendmail - 8.14.2-1.fc8.i386
warning: /etc/mail/sendmail.cf created as /etc/mail/sendmail.cf.rpmnew
warning: /etc/mail/submit.cf created as /etc/mail/submit.cf.rpmnew
error: unpacking of archive failed on file /usr/share/man/man1/mailq.sendmail.1.gz;47e14c60: cpio: open
Updating : NetworkManager #################### [199/365]I/O error in filesystem ("sda2") meta-data dev sda2 block 0x752c40008 ("xfs_trans_read_buf") error 5 buf count 4096
I/O error in filesystem ("sda2") meta-data dev sda2 block 0x752c40008 ("xfs_trans_read_buf") error 5 buf count 4096
Updating : NetworkManager ##################### [199/365]
Error unpacking rpm package NetworkManager - 1:0.7.0-0.6.7.svn3235.fc8.x86_64
error: unpacking of archive failed on file /usr/share/man/man1/nm-tool.1.gz;47e14c60: cpio: open
Installing: PolicyKit-gnome ##################### [200/365]
Installing: gail ##################### [201/365]
Installing: gnome-mount #################### [202/365]I/O error in filesystem ("sda2") meta-data dev sda2 block 0x752c40008 ("xfs_trans_read_buf") error 5 buf count 4096
I/O error in filesystem ("sda2") meta-data dev sda2 block 0x752c40008 ("xfs_trans_read_buf") error 5 buf count 4096
Installing: gnome-mount ##################### [202/365]
Error unpacking rpm package gnome-mount - 0.7-1.fc8.x86_64
error: unpacking of archive failed on file /usr/share/man/man1/gnome-mount.1.gz;47e14c60: cpio: open
I/O error in filesystem ("sda2") meta-data dev sda2 block 0x752c40008 ("xfs_trans_read_buf") error 5 buf count 4096
I/O error in filesystem ("sda2") meta-data dev sda2 block 0x752c40008 ("xfs_trans_read_buf") error 5 buf count 4096
Cleanup : bluez-utils ##################### [223/365]
I/O error in filesystem ("sda2") meta-data dev sda2 block 0x752c40008 ("xfs_trans_read_buf") error 5 buf count 4096

After update:

[root@f8test ~]# cd /usr/share/man/man8/
[root@f8test man8]# ls
I/O error in filesystem ("sda2") meta-data dev sda2 block 0x752c40008 ("xfs_trans_read_buf") error 5 buf count 4096
ls: reading directory .: Input/output error
[root@f8test man8]#

So, from my point of view the situation is clear: something is wrong with kernel 2.6.24.3-xx. All attempts to install and use kernel 2.6.24 failed for me on F8 x86_64.

Sincerely,
Gabriel
Thanks for doing the memtest; I know it's a bit of a pain but good to rule out if you don't mind. :) When you see this: > I/O error in filesystem ("sda2") meta-data dev sda2 block 0x752c40008 > ("xfs_trans_read_buf") error 5 buf count 4096 "error 5" is in fact EIO. *something* gave EIO and should have said so at the time... (this message should be *after* the fact of the EIO). If you do a dmesg when this happens, before the buffer rolls over, do you see *any* other errors before the "I/O error in filesystem" message? I'm sorry, I have not yet had any time to try to recreate this myself, but you're doing a fine job debugging so far :) Thanks, -Eric
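The "error 5" to EIO mapping mentioned above can be looked up from userspace. A small sketch using Python's standard errno table follows; it assumes a Linux host, where errno 5 is EIO.

```python
import errno
import os

code = 5  # the "error 5" from the XFS message

# errno.errorcode maps the numeric value to its symbolic name;
# os.strerror gives the text a tool like ls would print.
print(errno.errorcode[code], "-", os.strerror(code))
```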
(In reply to comment #13)
> Thanks for doing the memtest; I know it's a bit of a pain but good to rule out
> if you don't mind. :)

That's ok. :-)

> "error 5" is in fact EIO. *something* gave EIO and should have said so at the
> time... (this message should be *after* the fact of the EIO).
>
> If you do a dmesg when this happens, before the buffer rolls over, do you see
> *any* other errors before the "I/O error in filesystem" message?

/var/log/messages had about 900 entries like this:

Mar 19 19:33:27 f8test kernel: attempt to access beyond end of device
Mar 19 19:33:27 f8test kernel: sda2: rw=0, want=31453347856, limit=30716280
Mar 19 19:33:27 f8test kernel: I/O error in filesystem ("sda2") meta-data dev sda2 block 0x752c40008 ("xfs_trans_read_buf") error 5 buf count 4096
Mar 19 19:33:27 f8test kernel: attempt to access beyond end of device
Mar 19 19:33:27 f8test kernel: sda2: rw=0, want=31453347856, limit=30716280
Mar 19 19:33:27 f8test kernel: I/O error in filesystem ("sda2") meta-data dev sda2 block 0x752c40008 ("xfs_trans_read_buf") error 5 buf count 4096

I'm sorry, I forgot to tell you about this. It was a long day. :-)

Again, an ls in /usr/share/man/man8/ looks like this:

[root@f8test ~]# cd /usr/share/man/man8/
[root@f8test man8]# ls
I/O error in filesystem ("sda2") meta-data dev sda2 block 0x752c40008 ("xfs_trans_read_buf") error 5 buf count 4096
ls: reading directory .: Input/output error
[root@f8test man8]#

and /var/log/messages:

Mar 19 21:21:09 f8test kernel: attempt to access beyond end of device
Mar 19 21:21:09 f8test kernel: sda2: rw=0, want=31453347856, limit=30716280
Mar 19 21:21:09 f8test kernel: I/O error in filesystem ("sda2") meta-data dev sda2 block 0x752c40008 ("xfs_trans_read_buf") error 5 buf count 4096

> I'm sorry, I have not yet had any time to try to recreate this myself

Take your time. There's no need to hurry. 2.6.23.15-137 is just fine for me.

> but you're doing a fine job debugging so far :)

:-) Thank you.

Sincerely,
Gabriel
Ok, basically I'm trying to work backwards to the very first error encountered... so the IO errors are probably a result of some corrupt metadata which refers to blocks beyond the end of your device (which looks to be about 30G?) -Eric
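The numbers in those kernel messages can be cross-checked with a little arithmetic. The sketch below assumes 512-byte sectors and that the "block" in the XFS message is a 512-byte sector address; that reading fits the log, since the hex block plus the 8 sectors of a 4096-byte buffer gives exactly the "want" value. On the same assumption, limit=30716280 works out to roughly 14.6 GiB. The function and key names are made up for illustration.

```python
SECTOR_BYTES = 512  # assumption: the block layer counts in 512-byte sectors

def check_request(block_hex: str, buf_count: int, limit_sectors: int) -> dict:
    """Relate an XFS 'block 0x...' message to the 'want=/limit=' line."""
    start = int(block_hex, 16)                 # first sector of the read
    want = start + buf_count // SECTOR_BYTES   # first sector past the read
    return {
        "start_sector": start,
        "want": want,
        "beyond_end": want > limit_sectors,
        "device_gib": limit_sectors * SECTOR_BYTES / 2**30,
    }

# Values copied from the sda2 messages above.
info = check_request("0x752c40008", 4096, 30716280)
print(info["want"], info["beyond_end"], round(info["device_gib"], 2))
```

The requested sector is about a thousand times larger than the device, which fits the idea of a corrupt or misinterpreted block pointer better than an ordinary off-by-a-few overrun.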
Created attachment 298643 [details] messages.gz
(In reply to comment #15)
> Ok, basically I'm trying to work backwards to the very first error encountered...
>
> so the IO errors are probably a result of some corrupt metadata which refers to
> blocks beyond the end of your device (which looks to be about 30G?)

memtest86 did not report any errors after 17:08 hours. I rebooted the computer and during boot I received the same error messages about I/O errors in meta-data. There are no previous errors in /var/log/messages (see attached file messages.gz).

Sincerely,
Gabriel
It seems I was hit by same problem in one of machines here when running 'yum update' Mar 20 17:57:44 serwer yum: Updated: kdenetwork-devel - 7:3.5.9-2.fc8.i386 Mar 20 17:57:44 serwer yum: Updated: perl-devel - 4:5.8.8-36.fc8.i386 Mar 20 17:57:44 serwer yum: Updated: kdegraphics-devel - 7:3.5.9-1.fc8.x86_64 Mar 20 17:57:44 serwer yum: Updated: kdenetwork-devel - 7:3.5.9-2.fc8.x86_64 Mar 20 17:58:02 serwer dnsmasq[30157]: DHCPRELEASE(eth1) 192.168.10.29 00:50:ba:3e:4a:39 Mar 20 17:58:19 serwer squid[3560]: Squid Parent: child process 3563 exited with status 0 Mar 20 17:58:20 serwer squid[26417]: Squid Parent: child process 26420 started Mar 20 18:01:26 serwer kernel: attempt to access beyond end of device Mar 20 18:01:26 serwer kernel: dm-0: rw=0, want=26214400008, limit=204800000 Mar 20 18:01:26 serwer kernel: I/O error in filesystem ("dm-0") meta-data dev dm-0 block 0x61a800000 ("xfs_trans_read_buf") error 5 buf count 4096 Mar 20 18:01:26 serwer kernel: attempt to access beyond end of device Mar 20 18:01:26 serwer kernel: dm-0: rw=0, want=26214400008, limit=204800000 Mar 20 18:01:26 serwer kernel: I/O error in filesystem ("dm-0") meta-data dev dm-0 block 0x61a800000 ("xfs_trans_read_buf") error 5 buf count 4096 Mar 20 18:01:26 serwer kernel: attempt to access beyond end of device Mar 20 18:01:26 serwer kernel: dm-0: rw=0, want=26214400008, limit=204800000 Mar 20 18:01:26 serwer kernel: I/O error in filesystem ("dm-0") meta-data dev dm-0 block 0x61a800000 ("xfs_trans_read_buf") error 5 buf count 4096 Mar 20 18:01:26 serwer kernel: attempt to access beyond end of device Mar 20 18:01:26 serwer kernel: dm-0: rw=0, want=26214400008, limit=204800000 Mar 20 18:01:26 serwer kernel: I/O error in filesystem ("dm-0") meta-data dev dm-0 block 0x61a800000 ("xfs_trans_read_buf") error 5 buf count 4096 Mar 20 18:01:26 serwer kernel: attempt to access beyond end of device Mar 20 18:01:26 serwer kernel: dm-0: rw=0, want=26214400008, limit=204800000 Mar 20 18:01:26 serwer kernel: 
I/O error in filesystem ("dm-0") meta-data dev dm-0 block 0x61a800000 ("xfs_trans_read_buf") error 5 buf count 4096 Mar 20 18:01:26 serwer kernel: attempt to access beyond end of device Mar 20 18:01:26 serwer kernel: dm-0: rw=0, want=26214400008, limit=204800000 Mar 20 18:01:26 serwer kernel: I/O error in filesystem ("dm-0") meta-data dev dm-0 block 0x61a800000 ("xfs_trans_read_buf") error 5 buf count 4096 Mar 20 18:01:28 serwer kernel: attempt to access beyond end of device Mar 20 18:01:28 serwer kernel: dm-0: rw=0, want=24786214913768312, limit=204800000 Mar 20 18:01:28 serwer kernel: I/O error in filesystem ("dm-0") meta-data dev dm-0 block 0x580eee5f42fb70 ("xfs_trans_read_buf") error 5 buf count 4096 Mar 20 18:01:30 serwer kernel: attempt to access beyond end of device Mar 20 18:01:30 serwer kernel: dm-0: rw=0, want=26214400008, limit=204800000 Mar 20 18:01:30 serwer kernel: I/O error in filesystem ("dm-0") meta-data dev dm-0 block 0x61a800000 ("xfs_trans_read_buf") error 5 buf count 4096 Mar 20 18:01:30 serwer kernel: attempt to access beyond end of device Mar 20 18:01:30 serwer kernel: dm-0: rw=0, want=26214400008, limit=204800000 Mar 20 18:01:30 serwer kernel: I/O error in filesystem ("dm-0") meta-data dev dm-0 block 0x61a800000 ("xfs_trans_read_buf") error 5 buf count 4096 Mar 20 18:01:30 serwer kernel: attempt to access beyond end of device Mar 20 18:01:30 serwer kernel: dm-0: rw=0, want=26214400008, limit=204800000 Mar 20 18:01:30 serwer kernel: I/O error in filesystem ("dm-0") meta-data dev dm-0 block 0x61a800000 ("xfs_trans_read_buf") error 5 buf count 4096 Mar 20 18:01:30 serwer kernel: attempt to access beyond end of device Mar 20 18:01:30 serwer kernel: dm-0: rw=0, want=26214400008, limit=204800000 Mar 20 18:01:30 serwer kernel: I/O error in filesystem ("dm-0") meta-data dev dm-0 block 0x61a800000 ("xfs_trans_read_buf") error 5 buf count 4096 Mar 20 18:01:30 serwer kernel: attempt to access beyond end of device Mar 20 18:01:30 serwer kernel: dm-0: 
rw=0, want=26214400008, limit=204800000 Mar 20 18:01:30 serwer kernel: I/O error in filesystem ("dm-0") meta-data dev dm-0 block 0x61a800000 ("xfs_trans_read_buf") error 5 buf count 4096

This machine is an AMD64 server; it ran fine for the last few months with older kernels, and it worked on 2.6.24.3.XX until I ran yum to upgrade some packages. I was in a hurry to restore this server to a normal state, so I do not have xfsdump output, but it seems I lost mainly files from libs (lib and lib64) and some man files. Memtest is perfectly fine here, so I think something is definitely wrong with XFS in 2.6.24.

The filesystem was created using the F8 installation DVD; there are no binary packages around, just the pure F8 kernel image without any customizations. I can try to organize another x86_64 machine to reproduce the problem if required.
Ok, thanks for the corroboration... hrm...
Do you know exactly (or approximately) what the first kernel to exhibit problems was? Maybe looking through logs to see what was updated when..?
So, under 2.6.24.3-34.fc8 on x86_64 I did this on a temporarily reformatted swap partition...

  mkfs.xfs -f /dev/sda3
  mount /dev/sda3 /mnt/test
  yum --installroot=/mnt/test install filesystem
  mkdir -p /mnt/test/var/lib/yum/
  yum --installroot=/mnt/test install gnome-terminal amarok

which installed about 830M onto that filesystem, with no problems... any chance you guys could do the same test on a spare or swap partition?
I upgraded from .23 to 2.6.24.3-12.fc8, and that is when it showed up.

About the proposed test: I think copying data to the partition is not enough, as my partition was fine for 1-2 days and it failed on yum upgrade. As most of the damage was in libs, I think you could try copying many small files to the partition and overwriting them randomly (or changing their contents) in some kind of loop (script?).
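A rough sketch of the kind of loop suggested above: create many small files, then repeatedly overwrite randomly chosen ones with new contents of a different size. The function name, file counts and sizes are arbitrary choices, and whether this workload triggers the corruption at all is an open guess.

```python
import os
import random

def churn_small_files(root, n_files=200, rounds=1000, max_bytes=4096, seed=0):
    """Create n_files small files under root, then rewrite random ones."""
    rng = random.Random(seed)  # fixed seed so a failing run can be repeated
    os.makedirs(root, exist_ok=True)
    paths = [os.path.join(root, f"f{i:05d}") for i in range(n_files)]

    # Initial population of small files with random contents.
    for path in paths:
        with open(path, "wb") as fh:
            fh.write(os.urandom(rng.randint(1, max_bytes)))

    # Overwrite randomly chosen files, forcing each write out to disk.
    for _ in range(rounds):
        path = rng.choice(paths)
        with open(path, "wb") as fh:
            fh.write(os.urandom(rng.randint(1, max_bytes)))
            fh.flush()
            os.fsync(fh.fileno())
    return paths
```

One would call something like churn_small_files("/mnt/test/churn") on a scratch xfs mount (the path is only an example), then unmount and run xfs_repair -n on the device.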
I also tried an install of the same packages from the original F8 repo, followed by an upgrade from the updates repo; so far still no problems. I'll set up a test box to try an install under the older 2.6.23 kernel, followed by upgrades under 2.6.24, see if that trips anything. I probably won't be able to try a real install/upgrade 'til I get back into the office on Monday. What do you guys have for IO hardware? (maybe which sata/ide controller?) just in case that's relevant... -Eric
In my case corruption first appeared on a RAID array (4x320GB SATA disks) that runs on an ARC-1210 PCI-Express RAID controller. The machine has 4GB DDR2-667 ECC memory and runs on a SuperMicro Server Mainboard PDSME+ with a Quad-core Intel Xeon 3220 2,40 GHz 8MB FSB1066.
(In reply to comment #21)
> So, under 2.6.24.3-34.fc8 on x86_64 I did this on a temporarily reformatted swap
> partition...
>
>   mkfs.xfs -f /dev/sda3
>   mount /dev/sda3 /mnt/test
>   yum --installroot=/mnt/test install filesystem
>   mkdir -p /mnt/test/var/lib/yum/
>   yum --installroot=/mnt/test install gnome-terminal amarok

I cannot run yum or rpm anymore. Too many errors.

> which installed about 830M onto that filesystem, with no problems... any chance
> you guys could do the same test on a spare or swap partition?
(In reply to comment #23)
> What do you guys have for IO hardware? (maybe which sata/ide controller?) just
> in case that's relevant...

System1: nVidia CK8S Parallel ATA Controller (v2.5)
System2: SATA controller: Intel Corporation 82801HBM/HEM (ICH8M/ICH8M-E) SATA AHCI Controller (rev 03)
Just as a minor datapoint, I did a yum update this morning on 2.6.24.3-34.fc8, xfs root filesystem, without trouble... so it's apparently not universally broken...
(In reply to comment #27)
> Just as a minor datapoint, I did a yum update this morning on 2.6.24.3-34.fc8,
> xfs root filesystem, without trouble... so it's apparently not universally broken...

Hmm... Do you have a /usr partition/mount point? In fact, what mount points do you have? How many packages did yum install? Did you try on a freshly installed F8 system (only the kernel upgraded before yum)?

Sincerely,
Gabriel
I have only /, and no /usr partition. And, it was not an exceedingly large update - so not exactly the same test, I agree.

I've not had a chance to do the fresh f8 upgrade yet (sorry, my support for xfs in fedora has to be done more as a hobby than a profession right now... but I will try to get to it!)

-Eric
(In reply to comment #29)
> I've not had a chance to do the fresh f8 upgrade yet (sorry, my support for xfs
> in fedora has to be done more as a hobby than a profession right now... but I
> will try to get to it!)

That's OK. As long as the last 2.6.23 kernel works OK, I'm fine.

Gabriel
I have separated /usr, /var, /tmp and /home in my setup here.

/etc/fstab:
/dev/VolGroup00/LogVol01 /     xfs  defaults,noatime,nodiratime,logbufs=8 1 1
/dev/VolGroup00/LogVol03 /var  xfs  defaults,noatime,nodiratime,logbufs=8 1 2
/dev/VolGroup00/LogVol04 /home xfs  defaults,noatime,nodiratime,logbufs=8 1 2
/dev/VolGroup00/LogVol02 /tmp  ext3 defaults,noatime 1 2
LABEL=/boot              /boot ext3 defaults,noatime 1 2

Hmm, I wonder, can this be caused by a lack of write barriers? I guess XFS uses barriers by default, but when using LVM as here, barriers are not available/supported.
(In reply to comment #31)
> Hmm, I wonder can this be caused by lack of write barriers ? I guess XFS uses
> barriers by default, but when using LVM as here barriers are not
> available/supported.

I don't think so. I do not have LVM and / still becomes corrupt.

The strange part here is yum. Fresh F8 install, upgrade kernel to 2.6.24.3-XX. As long as I do not use yum, everything is OK:

  mount /dev/dvd /mnt/1
  mkdir 1
  cp /mnt/1/Packages/* /1
  cp -r /1 /2
  cp -r /1 /3

Reboot, rescue mode, xfs_repair, no errors. If I run yum update, / becomes unusable.

Sincerely,
Gabriel
Yes, I must say this is a little weird, as my server worked for 1-2 days on the .24 kernel without any errors in the logs (we have all documents on it + ftp + samba + httpd) until I decided to upgrade the system, and then it failed. Maybe the errors were there before, hard to say, but searching the logs from those two days doesn't show anything wrong until the yum update execution.

In my case most of the damage was in the /lib and /lib64 directories, so I wonder whether it's not yum but ldconfig and/or prelink?
(In reply to comment #33)
> I wonder in my case most damages was in /lib and /lib64 directories then maybe
> it's not yum but ldconfig and/or prelink ?

Are you sure it is not /usr/lib and /usr/lib64? ldconfig, maybe. Prelink, no. That's for sure. prelink did not have time to start (install, reboot, upgrade kernel, reboot, install updates, reboot, rescue mode, xfs_repair, errors).

Sincerely,
Gabriel
Ahh, sorry, it was in /usr/lib*, but my / and /usr are on the same partition, so probably this doesn't matter so much.
As far as I can see, fedora uses a few patches for XFS:

linux-2.6-xfs-optimize-away-realtime-tests.patch
linux-2.6-xfs-setfattr-32bit-compat.patch
linux-2.6-xfs-xfs_mount-refactor.patch

Looking at the changelog shows all of them are quite old. I ask because the gentoo, debian and suse bugzillas don't show any similar problem.
Testing a stock 2.6.24.3 kernel w/ the same config as fedora could be instructive... If it turns out that that fails, perhaps we can devise a fairly simple, repeatable automated test case to do a git bisect on, and narrow down when the failure occurred. Running with barriers only really matters when it comes time to do a log replay, so previous power losses w/o barriers could leave latent corruption. But as Gabriel said, he is on a regular block device, no lvm, so he should have had barriers in place...
I still can't do a full install today, but I did this test in the background, which is as close as I can get w/o doing an actual full/fresh install.

installed kernel-2.6.23.1-42.fc8.x86_64 and booted it.
mkfs'd a 3.8G (non-root) filesystem with F8-era xfsprogs
yum installed 1.8G worth of original F8 rpms on it, about 600 packages
ran xfs_repair -n, got no errors
installed kernel-2.6.24.3-34.fc8.x86_64 and booted it
yum updated the filesystem from above, it upgraded around 250 packages IIRC
ran xfs_repair -n, got no errors

Did you guys experience the first errors after some particular package installed? Perhaps some %post script is doing something interesting that triggers it...

Hopefully can do a real install before the end of the week.
I made some more tests this evening.

1. Installed a fresh F8. Updated the kernel to 2.6.24.3-xx. Downloaded kernel 2.6.24.3 from kernel.org and compiled a new kernel with a .config file generated from the fedora kernel-xxx.src.rpm:

  cat config-generic config-nodebug > temp-generic
  perl merge.pl config-x86_64-generic temp-generic > temp-x86_64-generic
  perl merge.pl /dev/null temp-x86_64-generic x86_64 > kernel-2.6.24.3-x86_64.config
  make menuconfig (load kernel-2.6.24.3-x86_64.config and save .config)
  make bzImage && make modules && etc.

Booted from the new kernel, yum update and restart. Booted in rescue mode, ran xfs_repair and got errors. :-(

2. New fresh F8, installed the new kernel 2.6.24.3-xx, rebooted, yum updated everything but selinux*, policycoreutils and audit*. Reboot, rescue mode, xfs_repair and no errors. Rebooted into F8, updated audit*, reboot, rescue mode, xfs_repair and still no errors. Rebooted again into F8 and updated selinux*, reboot, rescue mode, xfs_repair and the filesystem has errors. I made a new attempt, but this time I updated policycoreutils instead of selinux*, and / became corrupted again.

3. I tried again with the kernel from 1.) and / became corrupted when selinux* or policycoreutils was installed.

4. Made an attempt as Eric suggested in comment #21 and I cannot reproduce the filesystem corruption. But the selinux post-install script returned some errors (unable to load policy).

Could something in the selinux* and/or policycoreutils packages corrupt kernel memory somehow?

Sincerely,
Gabriel
Gabriel, thanks for that additional testing. From this it looks like it is probably selinux attribute related... will look into that (pinged the sgi guys again, too) -Eric
Hi, Okay, basing on an assumption of assumptions, :) if one gets a chance to give mkfs.xfs options then you could try with "-i attr=1" to try version#1 EAs (we have had probs with v2 in the past) and/or "-i size=512" to give better chance of EAs being inline within the inode (larger inode size on a SELinux system probably makes more sense anyway - I'm unsure what redhat set this to). --Tim
Tim, it was Fedora that set it ;) and it should be defaults except for attr=2, set by anaconda at install time. ugh, which is, of course, something I forgot to set in my tests. Gabriel can you confirm attr=2 on your root fs, with "xfs_info" on the mountpoint? -Eric (rerunning w/ attr2...)
F8-era installer did:

    rc = iutil.execWithRedirect("mkfs.xfs",
                                ["-f", "-l", "internal", "-i", "attr=2", devicePath],
                                stdout = "/dev/tty5",
                                stderr = "/dev/tty5", searchPath = 1)

but, due to the features2 flag packing/swapping issue, I'm not sure it gets properly picked up as attr2...

[root@inode tmp]# mkfs.xfs -V
mkfs.xfs version 2.9.4
[root@inode tmp]# mkfs.xfs -dfile,name=fsfile,size=32m -i attr=2
meta-data=fsfile                 isize=256    agcount=2, agsize=4096 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=8192, imaxpct=25
         =                       sunit=0      swidth=0 blks, unwritten=1
naming   =version 2              bsize=4096
log      =internal log           bsize=4096   blocks=1200, version=1
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=4096   blocks=0, rtextents=0
[root@inode tmp]# mkdir mnt
[root@inode tmp]# mount -o loop fsfile mnt/
[root@inode tmp]# xfs_info mnt
meta-data=/dev/loop0             isize=256    agcount=2, agsize=4096 blks
         =                       sectsz=512   attr=0
data     =                       bsize=4096   blocks=8192, imaxpct=25
         =                       sunit=0      swidth=0 blks, unwritten=1
naming   =version 2              bsize=4096
log      =internal               bsize=4096   blocks=1200, version=1
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=4096   blocks=0, rtextents=0
[root@inode tmp]# touch mnt/foo
[root@inode tmp]# xfs_info mnt
meta-data=/dev/loop0             isize=256    agcount=2, agsize=4096 blks
         =                       sectsz=512   attr=1
data     =                       bsize=4096   blocks=8192, imaxpct=25
         =                       sunit=0      swidth=0 blks, unwritten=1
naming   =version 2              bsize=4096
log      =internal               bsize=4096   blocks=1200, version=1
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=4096   blocks=0, rtextents=0
[root@inode tmp]# umount mnt
[root@inode tmp]# xfs_db fsfile
xfs_db> sb 0
xfs_db> p features2
features2 = 0
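To compare several machines quickly, the attr= field can be pulled out of saved xfs_info output. This is a small sketch that only parses text; the sample is the second xfs_info listing above with its whitespace collapsed, and the function name is made up.

```python
import re

# Sample text taken from the xfs_info output shown above.
SAMPLE = """\
meta-data=/dev/loop0 isize=256 agcount=2, agsize=4096 blks
 = sectsz=512 attr=1
data = bsize=4096 blocks=8192, imaxpct=25
"""

def attr_version(xfs_info_text):
    """Return the integer after 'attr=' in xfs_info output, or None."""
    match = re.search(r"\battr=(\d+)", xfs_info_text)
    return int(match.group(1)) if match else None

print(attr_version(SAMPLE))  # 1
```

On a live system the text would come from running xfs_info on the mountpoint and capturing its standard output.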
Yeah, I'm not sure attr2 gets thru either (the default is v2 but I'm not sure if an x86_64 or ia64 would see it). It's confusing how xfs_db's version command disagrees with features2, because db uses the structure (read using an endian conversion function) instead of the offset & size. i.e.

xfs_db> version
versionnum [0xb4a4+0xa] = V4,NLINK,ALIGN,DIRV2,LOGV2,EXTFLG,MOREBITS,ATTR2,LAZYSBCOUNT

===> note the 0xa in versionnum for features2

xfs_db> p features2
features2 = 0

Eric, so were we doing 32 bit tests with attr2 in the past, when we were going thru all the attr2 woes? It's all confusing :)

--Tim
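To see where the names in that version line come from, here is a sketch that decodes the two numbers into flag names. The bit values are the ones I recall from the XFS superblock headers of that era, limited to the flags that appear in the line above; they do reproduce it, but treat the tables as assumptions and check xfs_sb.h before relying on them.

```python
# Assumed bit values for the flags in the xfs_db line above.
VERSION_BITS = {
    0x0020: "NLINK",
    0x0080: "ALIGN",
    0x0400: "LOGV2",
    0x1000: "EXTFLG",
    0x2000: "DIRV2",
    0x8000: "MOREBITS",
}
FEATURES2_BITS = {
    0x2: "LAZYSBCOUNT",
    0x8: "ATTR2",
}

def decode(versionnum, features2):
    """Return flag names for a versionnum/features2 pair."""
    names = [f"V{versionnum & 0xF}"]  # low nibble is the version number
    names += [n for bit, n in VERSION_BITS.items() if versionnum & bit]
    if versionnum & 0x8000:  # features2 only counts when MOREBITS is set
        names += [n for bit, n in FEATURES2_BITS.items() if features2 & bit]
    return names

print(decode(0xB4A4, 0xA))  # what "version" reported
print(decode(0xB4A4, 0x0))  # what "p features2 = 0" implies
```

The second call shows the practical difference: with features2 read as 0, ATTR2 and LAZYSBCOUNT disappear from the list.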
(In reply to comment #42)
> Gabriel can you confirm attr=2 on your root fs, with "xfs_info" on the mountpoint?

Only one system (of the 3 tested) has attr=2. The other two have attr=1.
attr=2 on my system. I took a look at the logs from the faulty upgrade too, and it seems in my case there was a selinux policy upgrade. Anyway, I will upload a log file in a few minutes.
Created attachment 299137 [details] xfs_info output from partition that fails
Created attachment 299138 [details] syslog from faulty yum upgrade
FWIW, I managed to reproduce this last night, even w/o the selinux-related updates. I'll try to find some time this weekend to narrow it down.
From the testcase I came up with and some git bisecting this evening, looks like this is the mod that broke it:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=2bdf7cd0baa67608ada1517a281af359faf4c58c

[XFS] superblock endianess annotations

now to sort out why...
Ok, I see exactly what is happening; it's a bit of a perfect storm of bugs. I've sent a patch & explanation upstream tonight, we'll see what the sgi guys say. I'll put the gory details here once the solution is agreed upon.... hopefully will get this fixed up soon in F8. Hm, and probably F9 needs it as well. Thanks for your help on this, and sorry for questioning your hardware... :) -Eric
Ok, kernel-2.6.24.4-74 (from koji) seems to work just fine:

Apr 3 14:27:45 localhost kernel: XFS: correcting sb_features alignment problem

Nice job! :-)

Anyway, I still have some more questions:

1. What about xfs on other kernels, especially on rhel5/centos5? Do I have to backport this patch?
2. I do not see this patch in the latest F9 kernel (on koji). Do you think you can add this patch to the F9 kernel? That would be nice.

Sincerely,
Gabriel
So here's a decent description of the bug: http://oss.sgi.com/archives/xfs/2008-03/msg00355.html Essentially your filesystem "lost" its attr2 flag as far as the newer kernel knew; this actually should have been ok, except for the bug I pointed out in the above message. For now, for F8, I just put a patch in that stops it from "losing" the flag, as this is easiest to inspect as correct. The patch to make attr2 filesystems safe when mounted as attr1 takes a bit more care but will probably follow. rhel5/centos5 should not have this problem. (well, rhel5 certainly doesn't! ;) centos 5 likely does not even set attr2 by default, and if it does, both kernel & userspace mis-place the flag in the same way, so it's ok, at least until you migrate between x86 and x86_64, or explicitly mount an attr2 fs as attr1. So the patch could be ported there but it's probably very unlikely to be hit with the stock kernels & userspace. I did check the patch into F9, it looks like it's just not built yet. -Eric
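As a rough illustration of what "not losing the flag" could look like, here is a sketch that assumes the feature bits may have been stored in either of two adjacent 4-byte locations and takes the union of the two. This is not the actual F8 patch; the function name and the two-location model are hypothetical.

```python
def reconcile_features2(slot, padding):
    """Merge the feature bits found in the two candidate locations.

    Returns (value, corrected), where corrected is True when the two
    locations disagreed and had to be merged.
    """
    if slot == padding:
        return slot, False
    # Keep every bit that either location recorded, so a flag such as
    # attr2 is never dropped just because it was stored in the other place.
    return slot | padding, True


value, corrected = reconcile_features2(0x0, 0xA)
print(hex(value), corrected)
```

Taking the union errs on the side of keeping a flag, which seems the safer direction here: treating an attr2 filesystem as attr1 is the case that leads to corruption.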
(In reply to comment #53) > So here's a decent description of the bug: > > http://oss.sgi.com/archives/xfs/2008-03/msg00355.html Thank you. > (well, rhel5 certainly doesn't! ;) Well... not unless you do: perl -p -i -e 's/# CONFIG_XFS_FS is not set/CONFIG_XFS_FS=m\n# CONFIG_XFS_RT is not set\nCONFIG_XFS_QUOTA=y\nCONFIG_XFS_POSIX_ACL=y\nCONFIG_XFS_SECURITY=y/' config-rhel-generic :-) > centos 5 likely does not even set attr2 by default, and if it does, both kernel > & userspace mis-place the flag in the same way, so it's ok, at least until you > migrate between x86 and x86_64, or explicitly mount an attr2 fs as attr1. So what you are saying is I must use the noattr2 option when mounting an xfs filesystem if I use the latest kernel (let's say for a mobile hdd which can be mounted on several computers with different kernel/userspace versions)? > I did check the patch into F9, it looks like it's just not built yet. OK. Thank you. Sincerely, Gabriel
Hi everybody. I've read through the whole list of messages. If it helps, I can add myself to the list of those experiencing problems with xfs and 64-bit kernels. My file system layout is a bit complicated, but not that much if we consider only the linux part. I have an alu iMac with five partitions and a triple boot. The first partition is the standard EFI one Apple uses to boot the system. The second one contains the journaled HFS+ file system with Mac OS X. The third one contains the linux /boot partition I use to boot Fedora 8. The fourth one contains an NTFS file system, while the last one is an LVM with two xfs-formatted logical volumes, which hold the "/" (root) directory and the "/home" directory respectively. Everything worked fine until the last update. Unfortunately I was not able to take any snapshot or log message, but the problem looked very similar to all the others. A couple of days ago I "yum-updated" the system. I repeat: I didn't take much care in writing things down or saving log files because I was confident in the "usual" normal conclusion of the update process. I can only say that the latest kernel () was in the list. After the restart into the new kernel, the directories /usr/lib64/kde3, /usr/share/man/man8 and /usr/lib64/openoffice/share/ were gone. The directories were there, but unreadable (I got "io error" at any ls command). I then tried an xfs_repair. It completed successfully and the file system appeared to be clean, but the directories were definitely gone. Just a bunch of "node-number-named" files in the "lost+found" directory. I can only guess it may be a free space problem, because the gnome "free disk space" tool always showed a 100% full root ("/") directory BEFORE the xfs_repair attempt, regardless of the number of files I deleted in the vain attempt to restore things. Sorry for not being able to be more precise with log and dump files, but I didn't think of it at that moment.
I hope this can help in any case.
It does sound like the same issue (though all the extra info about your setup probably isn't relevant.) The new kernel in Koji should prevent the problem from occurring again; unfortunately I don't have a great recovery scheme for fs's which have already been hit by this, other than xfs_repair, which may wind up moving lots to lost+found. Thanks, -Eric
re: comment #54: > So what you are saying is I must use noattr2 option when mounting an xfs > filesystem if I use the latest kernel (let's say for a mobile hdd which can be > mounted on several computers with different kernel/userspace versions)? If you have a filesystem which really did use attr2 (which is basically a sliding divider in the inode between attribute & extent data, vs a fixed split point with attr1) then running 2.6.24+ without this patch, or the one I referenced in the above thread, is potentially dangerous, because if a file or dir with attrs needs to add information (attr or extent) to the inode structure it'll get the wrong split-point, and potentially corrupt that file. If you only use the latest kernels (>= 2.6.24), you're fine, but if you switch between <= 2.6.23 and >= 2.6.24 without these patches, or use <= 2.6.23 on both 32 and 64 bit machines, you'll be exposed to the bug. I think I have that all straight... ;)
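The split-point problem can be pictured with a toy model. This is only a sketch: the sizes and the "fixed split" value below are made-up numbers, not the real XFS inode layout, and the function names are hypothetical. The divider is the byte offset in the inode's data area where attribute data begins.

```python
FIXED_SPLIT = 78  # hypothetical attr1 split point (toy value, not the real one)


def split_point(stored_divider, treat_as_attr2):
    """Where the code believes the attribute data starts."""
    # attr2: honour the sliding divider recorded for this inode.
    # attr1: ignore it and assume the fixed split.
    return stored_divider if treat_as_attr2 else FIXED_SPLIT


def grow_data_corrupts(stored_divider, new_data_bytes, treat_as_attr2):
    """True if growing the extent data in place would overwrite attr data."""
    believed = split_point(stored_divider, treat_as_attr2)
    if new_data_bytes > believed:
        # The code sees that the data no longer fits and would take another
        # path (not modelled here), so nothing is overwritten in place.
        return False
    # The write is allowed in place; it is only safe if it stays below the
    # point where the attribute data really starts.
    return new_data_bytes > stored_divider


# An inode written with attr2: attribute data really starts at offset 16.
print(grow_data_corrupts(16, 48, treat_as_attr2=True))   # flag seen correctly
print(grow_data_corrupts(16, 48, treat_as_attr2=False))  # flag "lost"
```

In this toy model the write is only dangerous when the inode was laid out with a sliding divider but the code handling it believes in the fixed split, which is the situation a "lost" attr2 flag creates.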
(In reply to comment #57) > re: comment #54: > I think I have that all straight... ;) Well, that's bad. At least for me. Thank you. Sincerely, Gabriel
I wonder whether anyone else has tested this fix? Personally, I prefer to be sure before trying to install it on a production machine here, as my boss would castrate me if something goes wrong.
(In reply to comment #59) > I wonder whether anyone else has tested this fix? Personally, I prefer to be sure before trying > to install it on a production machine here, as my boss would castrate me if > something goes wrong. I made 2 fresh installs, updated the kernel to 2.6.24.4-74, rebooted, updated everything else, and had no problems at all. I also updated the kernel on another computer (2.6.23.15-137) and had no problems there either (production machine). Sincerely, Gabriel
Patch is in CVS now, for F8 as well as F9. FWIW, it was also requested that this fix get pulled for 2.6.25: http://oss.sgi.com/archives/xfs/2008-04/msg00230.html Thanks, -Eric
Fix is in 2.6.25-rc9
kernel-2.6.24.5-85.fc8 has been submitted as an update for Fedora 8
kernel-2.6.24.5-85.fc8 has been pushed to the Fedora 8 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with su -c 'yum --enablerepo=updates-testing update kernel'. You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F8/FEDORA-2008-3260
kernel-2.6.24.5-85.fc8 has been pushed to the Fedora 8 stable repository. If problems still persist, please make note of it in this bug report.