Bug 437968 - Corrupt xfs root filesystem with kernel kernel-2.6.24.3-xx
Status: CLOSED CURRENTRELEASE
Product: Fedora
Classification: Fedora
Component: kernel
Version: 8
Hardware: x86_64 Linux
Priority: low, Severity: high
Assigned To: Eric Sandeen
QA Contact: Fedora Extras Quality Assurance
Reported: 2008-03-18 09:37 EDT by Gabriel VLASIU
Modified: 2013-02-01 10:13 EST
Fixed In Version: 2.6.24.5-85.fc8
Doc Type: Bug Fix
Last Closed: 2008-04-29 16:54:50 EDT


Attachments
xfs_repair output (196.37 KB, text/plain)
2008-03-18 09:37 EDT, Gabriel VLASIU
messages.gz (21.02 KB, application/x-gzip)
2008-03-20 03:40 EDT, Gabriel VLASIU
xfs_info output from partition that fails (535 bytes, text/plain)
2008-03-26 08:22 EDT, Marcin Kurek
syslog from faulty yum upgrade (71.88 KB, text/plain)
2008-03-26 08:23 EDT, Marcin Kurek

Description Gabriel VLASIU 2008-03-18 09:37:47 EDT
Description of problem:
Corrupt xfs root filesystem with kernel kernel-2.6.24.3-12.fc8 and
kernel-2.6.24.3-34.fc8.

Version-Release number of selected component (if applicable):
kernel-2.6.24.3-12.fc8
kernel-2.6.24.3-34.fc8

How reproducible:
Install kernel 2.6.24.3-xx.

Steps to Reproduce:
1. Install a fresh F8
2. Update to kernel 2.6.24.3-xx
3. Reboot
4. Update everything else.
5. Reboot, boot from install CD and enter in rescue mode
6. xfs_repair -n /dev/sdaX

Actual results:
xfs root filesystem is corrupt.

Expected results:
Filesystem should be clean.

Additional info:
Comment 1 Gabriel VLASIU 2008-03-18 09:37:47 EDT
Created attachment 298393 [details]
xfs_repair output
Comment 2 Eric Sandeen 2008-03-18 23:26:58 EDT
Ok, that's odd.  I've had xfs root on my F8 workstation for quite some time now
w/ no trouble as far as I know... though I'll double check it from a rescue disk
:) and I will try your testcase when I get some time.

I'll ping the sgi guys on this one, too.

Thanks for saving the repair output!  Have you actually reproduced this or was
this a one-time occurrence thus far?

Thanks,
-Eric
Comment 3 Eric Sandeen 2008-03-18 23:32:49 EDT
FWIW all those bad magic numbers in the repair output are actually superblock
magic numbers:  0x58465342 is "XFSB"
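
The magic value is easy to check by hand. A minimal Python sketch (not part of xfsprogs; the function name is made up for illustration) that renders the 32-bit number as its four ASCII bytes, assuming the big-endian byte order XFS uses on disk:

```python
def magic_to_ascii(magic: int) -> str:
    """Render a 32-bit on-disk magic number as its four ASCII characters."""
    # Most significant byte first, matching big-endian on-disk order.
    return magic.to_bytes(4, byteorder="big").decode("ascii")


print(magic_to_ascii(0x58465342))
```

The bytes 0x58, 0x46, 0x53 and 0x42 are the letters X, F, S and B.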

I see the repair output says "would have" - I guess you ran xfs_repair -n, is
the fs still in this state or is it fully repaired?  If it's still in this state
maybe you can also capture an xfs_metadump.

And just 'cause I have to ask, did the system lose power anywhere in between,
and are you running any proprietary kernel modules?

And to be specific about it, what version of xfsprogs did you use?
Comment 4 Eric Sandeen 2008-03-19 00:13:51 EDT
Talked this over with the sgi guys, and because repair seems to be finding
superblock magic where an inode should be... is there any chance that a mkfs
happened over the top of a valid filesystem, or that there is any confusion
between the filesystem being on /dev/sda vs on /dev/sda1?

Is this a regular dos partition table?

-Eric
Comment 5 Gabriel VLASIU 2008-03-19 07:01:22 EDT
(In reply to comment #2)

> Thanks for saving the repair output!  Have you actually reproduced this or was
> this a one-time occurrence thus far?
Friday evening I updated 2 systems with the latest kernel from updates. Monday
morning one of the systems did not start X because of errors (no KDE
plugins). I had a quick look in /var/log/messages and saw some errors about the
XFS filesystem. I ran xfs_repair, rebooted, logged in, replaced about 40 damaged
rpms, rebooted and ran xfs_repair again. Now I had even more errors. I ran
xfs_repair again, rebooted, and the system was damaged beyond repair (unable to
log in at init 3, only at init 1). Directories /usr/lib and /usr/lib64 are no
longer on the filesystem.

The same happened with the second system, but in one step (/usr/lib and
/usr/lib64 missing after the first xfs_repair).

Yes, I have reproduced this 3 times: twice on an AMD x64 computer and once
on an Intel x64.

xfs_metadump is too large to be attached and can be downloaded from:
http://www.vlasiu.net/xfs/

I ran:
xfs_metadump -o -w /dev/sdb3 /mnt/usbdisk/sdb3_metadata
2>/mnt/usbdisk/sdb3_metadata.warnings.txt

No, I do not use any proprietary kernel modules with the test case systems
(plain fedora install). I only use vmware and nvidia modules but not on the 3
ones above.

xfs version is 2.9.4 (from Fedora 8 install CD).

> is there any chance that a mkfs happened over the top of a valid filesystem, 
> or that there is any confusion between the filesystem being on /dev/sda vs 
> on /dev/sda1?

No, I do not think so. Everything was OK with the default kernel installed from the F8
CD and with all kernels from F8 updates except 2.6.24.3-xx. On every install each
partition was reformatted.

Partition table is a regular one:

Disk /dev/sdb: 80.0 GB, 80026361856 bytes
255 heads, 63 sectors/track, 9729 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0xc7ad5de6

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *           1          19      152586   83  Linux
/dev/sdb2              20        7285    58364145   83  Linux
/dev/sdb3            7286        9197    15358140   83  Linux
/dev/sdb4            9198        9729     4273290    5  Extended
/dev/sdb5            9198        9328     1052226   83  Linux
/dev/sdb6            9329        9729     3221001   82  Linux swap / Solaris

Sincerely,
Gabriel
Comment 6 Gabriel VLASIU 2008-03-19 07:07:09 EDT
(In reply to comment #3)

> And just 'cause I have to ask, did the system lose power anywhere in between,
No. I would not file a bug report if I lost a filesystem because of a power
failure. :-)


Sincerely,
Gabriel
Comment 7 Chuck Ebbert 2008-03-19 08:54:43 EDT
XFS is working fine here with 2.6.24.3-12 and 2.6.24.3-34, on 32-bit x86
Comment 8 Eric Sandeen 2008-03-19 09:05:39 EDT
re: comment #5

> I had a quick look in /var/log/messages and saw some errors about the
> XFS filesystem.

What were the errors?  And any errors before the xfs errors?

re: comment #6, if you had barriers enabled, a power loss, and subsequent
corruption, it would still be bug-worthy. :)
Comment 9 Gabriel VLASIU 2008-03-19 09:27:31 EDT
(In reply to comment #8)
> What were the errors?  And any errors before the xfs errors?
Something about XFS_WANT_CORRUPTED_GOTO. I do not have the log file anymore since I
reinstalled the system several times. No errors before that. This was the hint
to run xfs_repair.

The same error appeared on all corrupted installs, so I assumed it is an internal
XFS error message emitted after the filesystem is corrupted.

Also, one time I had errors during the install of updates (update kernel, reboot,
apply other updates). rpm was unable to write some files. When I changed to
the directory where rpm reported the problem and ran ls, ls reported an I/O
error.

> re: comment #6, if you had barriers enabled, a power loss, and subsequent
> corruption, it would still be bug-worthy. :)
:-)

Sincerely,
Gabriel
Comment 10 Eric Sandeen 2008-03-19 09:49:00 EDT
Can you let memtest86 run for a while?  I'd also look carefully for any
scsi/ide/IO type errors in the logs.

I'll see if I can glean anything from the inode numbers with problems and their
locations on disk...
Comment 11 Gabriel VLASIU 2008-03-19 10:53:01 EDT
(In reply to comment #10)
> Can you let memtest86 run for a while?  I'd also look carefully for any
> scsi/ide/IO type errors in the logs.
> 
> I'll see if I can glean anything from the inode numbers with problems and their
> locations on disk...

memtest86 has been running for more than 55 minutes with no errors (1 pass,
second is 72% done). Anyway, I will let memtest run until tomorrow morning.
But since there are no errors when running kernel 2.6.23, I do not think there
are going to be any.

I made another attempt to replicate this bug, this time in vmware (I did not
install vmware tools or anything else inside the guest except a plain F8 x86_64
install) on another computer.
 
1. Install F8 (text mode, de-select everything when the installer asks which
packages to install).
2. Reboot, power off and take a snapshot.
3. Boot into F8 and update the kernel (2.6.24.3-34).
4. Power off and take another snapshot.
5. Boot into F8 and install all other updates.
6. Power off and take another snapshot.
7. Boot in rescue mode and run xfs_repair. There are errors in the xfs filesystem.
8. Boot in rescue mode from the snapshot taken in step 4 and run xfs_repair. No
errors in the / filesystem.


Sincerely,
Gabriel
Comment 12 Gabriel VLASIU 2008-03-19 14:41:31 EDT
I made a second test in vmware on a different system which has worked perfectly for
years (no vmware tools installed inside the guest, only a plain F8 x86_64)

1. Install F8 (text mode, de-select everything when the installer asks which
packages to install).
2. Reboot, power off and take a snapshot.
3. Boot into F8 and update the kernel (2.6.24.3-34).
4. Power off and take another snapshot.
5. Boot into F8 and install all other updates.
During this step I received lots and I mean LOTS of messages like this:
I/O error in filesystem ("sda2") meta-data dev sda2 block 0x752c40008      
("xfs_trans_read_buf") error 5 buf count 4096

  Updating  : sendmail                     ####################  [187/365]I/O
error in filesystem ("sda2") meta-data dev sda2 block 0x752c40008      
("xfs_trans_read_buf") error 5 buf count 4096       
I/O error in filesystem ("sda2") meta-data dev sda2 block 0x752c40008      
("xfs_trans_read_buf") error 5 buf count 4096
I/O error in filesystem ("sda2") meta-data dev sda2 block 0x752c40008      
("xfs_trans_read_buf") error 5 buf count 4096
  Updating  : sendmail                     ##################### [187/365]
Error unpacking rpm package sendmail - 8.14.2-1.fc8.i386
warning: /etc/mail/sendmail.cf created as /etc/mail/sendmail.cf.rpmnew
warning: /etc/mail/submit.cf created as /etc/mail/submit.cf.rpmnew             
                                                                              
error: unpacking of archive failed on file
/usr/share/man/man1/mailq.sendmail.1.gz;47e14c60: cpio: open

  Updating  : NetworkManager               ####################  [199/365]I/O
error in filesystem ("sda2") meta-data dev sda2 block 0x752c40008      
("xfs_trans_read_buf") error 5 buf count 4096
I/O error in filesystem ("sda2") meta-data dev sda2 block 0x752c40008      
("xfs_trans_read_buf") error 5 buf count 4096
  Updating  : NetworkManager               ##################### [199/365]
Error unpacking rpm package NetworkManager - 1:0.7.0-0.6.7.svn3235.fc8.x86_64
error: unpacking of archive failed on file
/usr/share/man/man1/nm-tool.1.gz;47e14c60: cpio: open                          
                                      Installing: PolicyKit-gnome             
##################### [200/365]
  Installing: gail                         ##################### [201/365]  
  Installing: gnome-mount                  ####################  [202/365]I/O
error in filesystem ("sda2") meta-data dev sda2 block 0x752c40008      
("xfs_trans_read_buf") error 5 buf count 4096
I/O error in filesystem ("sda2") meta-data dev sda2 block 0x752c40008      
("xfs_trans_read_buf") error 5 buf count 4096
  Installing: gnome-mount                  ##################### [202/365]
Error unpacking rpm package gnome-mount - 0.7-1.fc8.x86_64
error: unpacking of archive failed on file
/usr/share/man/man1/gnome-mount.1.gz;47e14c60: cpio: open

  
I/O error in filesystem ("sda2") meta-data dev sda2 block 0x752c40008      
("xfs_trans_read_buf") error 5 buf count 4096
I/O error in filesystem ("sda2") meta-data dev sda2 block 0x752c40008      
("xfs_trans_read_buf") error 5 buf count 4096
  Cleanup   : bluez-utils                  ##################### [223/365] 
I/O error in filesystem ("sda2") meta-data dev sda2 block 0x752c40008      
("xfs_trans_read_buf") error 5 buf count 4096

After update:

[root@f8test ~]# cd /usr/share/man/man8/
[root@f8test man8]# ls
I/O error in filesystem ("sda2") meta-data dev sda2 block 0x752c40008      
("xfs_trans_read_buf") error 5 buf count 4096
ls: reading directory .: Input/output error
[root@f8test man8]#

So, from my point of view the situation is clear: something is wrong with kernel
2.6.24.3-xx. All attempts to install and use kernel 2.6.24 have failed for me on F8
x86_64.


Sincerely,
Gabriel
Comment 13 Eric Sandeen 2008-03-19 15:16:57 EDT
Thanks for doing the memtest; I know it's a bit of a pain but good to rule out
if you don't mind.  :)

When you see this:

> I/O error in filesystem ("sda2") meta-data dev sda2 block 0x752c40008      
> ("xfs_trans_read_buf") error 5 buf count 4096

"error 5" is in fact EIO.  *something* gave EIO and should have said so at the
time... (this message should be *after* the fact of the EIO).
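
For anyone who wants to confirm the errno mapping, a minimal Python sketch using only the standard library. It assumes it is run on a Linux system, where errno values match the kernel's:

```python
import errno
import os

code = 5  # the "error 5" from the xfs_trans_read_buf message

# errno.errorcode maps the number to its symbolic name,
# os.strerror gives the human-readable text.
print(errno.errorcode[code], "-", os.strerror(code))
```

The symbolic name for 5 is EIO, and the description is the same "Input/output error" text that ls prints when reading the directory fails.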

If you do a dmesg when this happens, before the buffer rolls over, do you see
*any* other errors before the "I/O error in filesystem" message?

I'm sorry,  I have not yet had any time to try to recreate this myself, but
you're doing a fine job debugging so far :)

Thanks,
-Eric
Comment 14 Gabriel VLASIU 2008-03-19 15:50:58 EDT
(In reply to comment #13)
> Thanks for doing the memtest; I know it's a bit of a pain but good to rule out
> if you don't mind.  :)
That's ok. :-)


> "error 5" is in fact EIO.  *something* gave EIO and should have said so at the
> time... (this message should be *after* the fact of the EIO).
> 
> If you do a dmesg when this happens, before the buffer rolls over, do you see
> *any* other errors before the "I/O error in filesystem" message?
/var/log/messages had about 900 entries like this:

Mar 19 19:33:27 f8test kernel: attempt to access beyond end of device
Mar 19 19:33:27 f8test kernel: sda2: rw=0, want=31453347856, limit=30716280
Mar 19 19:33:27 f8test kernel: I/O error in filesystem ("sda2") meta-data dev
sda2 block 0x752c40008       ("xfs_trans_read_buf") error 5 buf count 4096
Mar 19 19:33:27 f8test kernel: attempt to access beyond end of device
Mar 19 19:33:27 f8test kernel: sda2: rw=0, want=31453347856, limit=30716280
Mar 19 19:33:27 f8test kernel: I/O error in filesystem ("sda2") meta-data dev
sda2 block 0x752c40008       ("xfs_trans_read_buf") error 5 buf count 4096

I'm sorry, I forgot to tell you about this. It was a long day. :-)

Again, an ls in /usr/share/man/man8/ looks like this:

[root@f8test ~]# cd /usr/share/man/man8/
[root@f8test man8]# ls
I/O error in filesystem ("sda2") meta-data dev sda2 block 0x752c40008      
("xfs_trans_read_buf") error 5 buf count 4096
ls: reading directory .: Input/output error
[root@f8test man8]#

and /var/log/messages:

Mar 19 21:21:09 f8test kernel: attempt to access beyond end of device
Mar 19 21:21:09 f8test kernel: sda2: rw=0, want=31453347856, limit=30716280
Mar 19 21:21:09 f8test kernel: I/O error in filesystem ("sda2") meta-data dev
sda2 block 0x752c40008       ("xfs_trans_read_buf") error 5 buf count 4096


> I'm sorry,  I have not yet had any time to try to recreate this myself
Take your time. There's no need to hurry. 2.6.23.15-137 is just fine for me.

> but you're doing a fine job debugging so far :)
:-)

Thank you.


Sincerely,
Gabriel

Comment 15 Eric Sandeen 2008-03-19 16:10:48 EDT
Ok, basically I'm trying to work backwards to the very first error encountered...

so the IO errors are probably a result of some corrupt metadata which refers to
blocks beyond the end of your device (which looks to be about 30G?)
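
The two kernel messages from the log can be cross-checked with a little arithmetic. A rough Python sketch, assuming (this is not stated anywhere in the thread) that `want` and `limit` are counts of 512-byte sectors and that the XFS `block` value is a sector address. The log lines are copied from comment 14 with the wrapped line rejoined:

```python
import re

SECTOR_SIZE = 512  # assumption: want/limit are counted in 512-byte sectors

# Two lines from the syslog excerpt in comment 14 (wrapped line rejoined).
LOG = """\
Mar 19 19:33:27 f8test kernel: sda2: rw=0, want=31453347856, limit=30716280
Mar 19 19:33:27 f8test kernel: I/O error in filesystem ("sda2") meta-data dev sda2 block 0x752c40008       ("xfs_trans_read_buf") error 5 buf count 4096
"""

want, limit = map(int, re.search(r"want=(\d+), limit=(\d+)", LOG).groups())
m = re.search(
    r'block (0x[0-9a-f]+)\s+\("xfs_trans_read_buf"\) error \d+ buf count (\d+)', LOG
)
block = int(m.group(1), 16)   # start of the failed read, in sectors
count = int(m.group(2))       # size of the failed read, in bytes

request_end = block + count // SECTOR_SIZE  # first sector past the request
device_bytes = limit * SECTOR_SIZE

print("request ends at sector", request_end, "(matches want:", request_end == want, ")")
print("device size in bytes:", device_bytes)
```

Under those assumptions the failed read ends exactly at `want`, far past `limit`, and `limit` corresponds to 15,726,735,360 bytes, so the device may be closer to 15 GB than 30G.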

-Eric
Comment 16 Gabriel VLASIU 2008-03-20 03:40:02 EDT
Created attachment 298643 [details]
messages.gz
Comment 17 Gabriel VLASIU 2008-03-20 03:43:23 EDT
(In reply to comment #15)
> Ok, basically I'm trying to work backwards to the very first error encountered...
> 
> so the IO errors are probably a result of some corrupt metadata which refers to
> blocks beyond the end of your device (which looks to be about 30G?)
memtest86 did not report any errors after 17:08 hours. I rebooted the computer and
during boot I received the same error messages about I/O errors in meta-data. No
previous errors in /var/log/messages (see attached file messages.gz).


Sincerely,
Gabriel
Comment 18 Marcin Kurek 2008-03-21 04:31:31 EDT
It seems I was hit by the same problem on one of the machines here when running 'yum
update'

Mar 20 17:57:44 serwer yum: Updated: kdenetwork-devel - 7:3.5.9-2.fc8.i386
Mar 20 17:57:44 serwer yum: Updated: perl-devel - 4:5.8.8-36.fc8.i386
Mar 20 17:57:44 serwer yum: Updated: kdegraphics-devel - 7:3.5.9-1.fc8.x86_64
Mar 20 17:57:44 serwer yum: Updated: kdenetwork-devel - 7:3.5.9-2.fc8.x86_64
Mar 20 17:58:02 serwer dnsmasq[30157]: DHCPRELEASE(eth1) 192.168.10.29
00:50:ba:3e:4a:39
Mar 20 17:58:19 serwer squid[3560]: Squid Parent: child process 3563 exited with
status 0
Mar 20 17:58:20 serwer squid[26417]: Squid Parent: child process 26420 started
Mar 20 18:01:26 serwer kernel: attempt to access beyond end of device
Mar 20 18:01:26 serwer kernel: dm-0: rw=0, want=26214400008, limit=204800000
Mar 20 18:01:26 serwer kernel: I/O error in filesystem ("dm-0") meta-data dev
dm-0 block 0x61a800000       ("xfs_trans_read_buf") error 5 buf count 4096
Mar 20 18:01:26 serwer kernel: attempt to access beyond end of device
Mar 20 18:01:26 serwer kernel: dm-0: rw=0, want=26214400008, limit=204800000
Mar 20 18:01:26 serwer kernel: I/O error in filesystem ("dm-0") meta-data dev
dm-0 block 0x61a800000       ("xfs_trans_read_buf") error 5 buf count 4096
Mar 20 18:01:26 serwer kernel: attempt to access beyond end of device
Mar 20 18:01:26 serwer kernel: dm-0: rw=0, want=26214400008, limit=204800000
Mar 20 18:01:26 serwer kernel: I/O error in filesystem ("dm-0") meta-data dev
dm-0 block 0x61a800000       ("xfs_trans_read_buf") error 5 buf count 4096
Mar 20 18:01:26 serwer kernel: attempt to access beyond end of device
Mar 20 18:01:26 serwer kernel: dm-0: rw=0, want=26214400008, limit=204800000
Mar 20 18:01:26 serwer kernel: I/O error in filesystem ("dm-0") meta-data dev
dm-0 block 0x61a800000       ("xfs_trans_read_buf") error 5 buf count 4096
Mar 20 18:01:26 serwer kernel: attempt to access beyond end of device
Mar 20 18:01:26 serwer kernel: dm-0: rw=0, want=26214400008, limit=204800000
Mar 20 18:01:26 serwer kernel: I/O error in filesystem ("dm-0") meta-data dev
dm-0 block 0x61a800000       ("xfs_trans_read_buf") error 5 buf count 4096
Mar 20 18:01:26 serwer kernel: attempt to access beyond end of device
Mar 20 18:01:26 serwer kernel: dm-0: rw=0, want=26214400008, limit=204800000
Mar 20 18:01:26 serwer kernel: I/O error in filesystem ("dm-0") meta-data dev
dm-0 block 0x61a800000       ("xfs_trans_read_buf") error 5 buf count 4096
Mar 20 18:01:28 serwer kernel: attempt to access beyond end of device
Mar 20 18:01:28 serwer kernel: dm-0: rw=0, want=24786214913768312, limit=204800000
Mar 20 18:01:28 serwer kernel: I/O error in filesystem ("dm-0") meta-data dev
dm-0 block 0x580eee5f42fb70       ("xfs_trans_read_buf") error 5 buf count 4096
Mar 20 18:01:30 serwer kernel: attempt to access beyond end of device
Mar 20 18:01:30 serwer kernel: dm-0: rw=0, want=26214400008, limit=204800000
Mar 20 18:01:30 serwer kernel: I/O error in filesystem ("dm-0") meta-data dev
dm-0 block 0x61a800000       ("xfs_trans_read_buf") error 5 buf count 4096
Mar 20 18:01:30 serwer kernel: attempt to access beyond end of device
Mar 20 18:01:30 serwer kernel: dm-0: rw=0, want=26214400008, limit=204800000
Mar 20 18:01:30 serwer kernel: I/O error in filesystem ("dm-0") meta-data dev
dm-0 block 0x61a800000       ("xfs_trans_read_buf") error 5 buf count 4096
Mar 20 18:01:30 serwer kernel: attempt to access beyond end of device
Mar 20 18:01:30 serwer kernel: dm-0: rw=0, want=26214400008, limit=204800000
Mar 20 18:01:30 serwer kernel: I/O error in filesystem ("dm-0") meta-data dev
dm-0 block 0x61a800000       ("xfs_trans_read_buf") error 5 buf count 4096
Mar 20 18:01:30 serwer kernel: attempt to access beyond end of device
Mar 20 18:01:30 serwer kernel: dm-0: rw=0, want=26214400008, limit=204800000
Mar 20 18:01:30 serwer kernel: I/O error in filesystem ("dm-0") meta-data dev
dm-0 block 0x61a800000       ("xfs_trans_read_buf") error 5 buf count 4096
Mar 20 18:01:30 serwer kernel: attempt to access beyond end of device
Mar 20 18:01:30 serwer kernel: dm-0: rw=0, want=26214400008, limit=204800000
Mar 20 18:01:30 serwer kernel: I/O error in filesystem ("dm-0") meta-data dev
dm-0 block 0x61a800000       ("xfs_trans_read_buf") error 5 buf count 4096

This machine is an AMD64 server; it ran fine for the last few months with older
kernels, and it worked on 2.6.24.3.XX until I ran yum to upgrade some packages. I
was in a hurry to restore this server to a normal state, so I do not have xfsdump
output, but it seems I lost mainly files from libs (lib and lib64) and some man
files.

Memtest is perfectly fine here, so I think something is definitely wrong with
XFS in 2.6.24

The filesystem was created using the F8 installation DVD; there are no binary
packages around, just the pure F8 kernel image without any customizations. I can
try to organize another x86_64 machine to reproduce the problem if required.
Comment 19 Eric Sandeen 2008-03-21 09:30:20 EDT
Ok, thanks for the corroboration... hrm...
Comment 20 Eric Sandeen 2008-03-21 09:36:04 EDT
do you know exactly (or approximately) what the first kernel to exhibit problems
was?  Maybe looking through logs to see what was updated when..?
Comment 21 Eric Sandeen 2008-03-21 10:03:01 EDT
So, under 2.6.24.3-34.fc8 on x86_64 I did this on a temporarily reformatted swap
partition...

 1029  mkfs.xfs -f /dev/sda3
 1030  mount /dev/sda3 /mnt/test
 1031  yum --installroot=/mnt/test install filesystem
 1032  mkdir -p /mnt/test/var/lib/yum/
 1033  yum --installroot=/mnt/test install gnome-terminal amarok

which installed about 830M onto that filesystem, with no problems... any chance
you guys could do the same test on a spare or swap partition?
Comment 22 Marcin Kurek 2008-03-21 11:08:27 EDT
I upgraded from .23 to 2.6.24.3-12.fc8 when it showed up. As for the proposed test,
I think copying data to the partition is not enough: my partition was fine for
1-2 days and failed on yum upgrade. Since most of the damage was in libs, could you
try copying many small files to the partition and overwriting them randomly (or
changing their contents) in some kind of loop (script?)
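
A loop like the one suggested here could be sketched in Python as follows. This is a hypothetical script that nobody in this thread has run; the function name, file naming scheme, sizes and counts are all arbitrary assumptions, and it only exercises plain file data, not package scripts or extended attributes:

```python
import os
import random
import tempfile


def churn_small_files(root, n_files=200, rounds=1000, max_size=8192, seed=0):
    """Create many small files under root, then overwrite random ones in a loop."""
    rng = random.Random(seed)  # fixed seed so the same file order can be replayed
    paths = [os.path.join(root, f"f{i:05d}.dat") for i in range(n_files)]
    for path in paths:
        with open(path, "wb") as fh:
            fh.write(os.urandom(rng.randint(1, max_size)))
    for _ in range(rounds):
        path = rng.choice(paths)
        with open(path, "wb") as fh:  # truncate and rewrite with new contents
            fh.write(os.urandom(rng.randint(1, max_size)))
            fh.flush()
            os.fsync(fh.fileno())  # push the rewrite to disk before moving on
    return paths


if __name__ == "__main__":
    # Point this at a directory on the XFS filesystem under test instead of a temp dir.
    with tempfile.TemporaryDirectory() as scratch:
        print(len(churn_small_files(scratch, n_files=50, rounds=100)), "files churned")
```

After a run, the filesystem would be unmounted and checked with xfs_repair -n, as in the other tests in this report.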
Comment 23 Eric Sandeen 2008-03-21 11:31:54 EDT
I also tried an install of the same packages from the original F8 repo, followed
by an upgrade from the updates repo; so far still no problems.

I'll set up a test box to try an install under the older 2.6.23 kernel, followed
by upgrades under 2.6.24, see if that trips anything.  I probably won't be able
to try a real install/upgrade 'til I get back into the office on Monday.

What do you guys have for IO hardware?  (maybe which sata/ide controller?) just
in case that's relevant...

-Eric
Comment 24 Marcin Kurek 2008-03-21 17:24:34 EDT
In my case corruption first appeared on a RAID array (4x320GB SATA disks) that
runs on an ARC-1210 PCI-Express RAID controller. The machine has 4GB DDR2-667 ECC
memory and runs on a SuperMicro PDSME+ server mainboard with a quad-core Intel Xeon
3220 2.40 GHz 8MB FSB1066
Comment 25 Gabriel VLASIU 2008-03-24 03:49:24 EDT
(In reply to comment #21)
> So, under 2.6.24.3-34.fc8 on x86_64 I did this on a temporarily reformatted swap
> partition...
> 
>  1029  mkfs.xfs -f /dev/sda3
>  1030  mount /dev/sda3 /mnt/test
>  1031  yum --installroot=/mnt/test install filesystem
>  1032  mkdir -p /mnt/test/var/lib/yum/
>  1033  yum --installroot=/mnt/test install gnome-terminal amarok
I cannot run yum or rpm anymore. Too many errors.

> which installed about 830M onto that filesystem, with no problems... any chance
> you guys could do the same test on a spare or swap partition?

Comment 26 Gabriel VLASIU 2008-03-24 03:54:21 EDT
(In reply to comment #23)
> What do you guys have for IO hardware?  (maybe which sata/ide controller?) just
> in case that's relevant...
System1: nVidia CK8S Parallel ATA Controller (v2.5)
System2:  SATA controller: Intel Corporation 82801HBM/HEM (ICH8M/ICH8M-E) SATA
AHCI Controller (rev 03)
 
Comment 27 Eric Sandeen 2008-03-24 11:03:53 EDT
Just as a minor datapoint, I did a yum update this morning on 2.6.24.3-34.fc8,
xfs root filesystem, without trouble... so it's apparently not universally broken...
Comment 28 Gabriel VLASIU 2008-03-24 13:16:12 EDT
(In reply to comment #27)
> Just as a minor datapoint, I did a yum update this morning on 2.6.24.3-34.fc8,
> xfs root filesystem, without trouble... so it's apparently not universally
> broken...

Hmm... Do you have a /usr partition/mount point? In fact what mount points do
you have? How many packages did yum install? Did you try on a freshly installed
F8 system (only the kernel upgraded before yum)?

Sincerely,
Gabriel
Comment 29 Eric Sandeen 2008-03-24 13:25:45 EDT
I have only /, and no /usr partition.  And, it was not an exceedingly large
update - so not exactly the same test, I agree.

I've not had a chance to do the fresh f8 upgrade yet (sorry, my support for xfs
in fedora has to be done more as a hobby than a profession right now... but I
will try to get to it!)

-Eric
Comment 30 Gabriel VLASIU 2008-03-24 13:39:56 EDT
(In reply to comment #29)
> I've not had a chance to do the fresh f8 upgrade yet (sorry, my support for xfs
> in fedora has to be done more as a hobby than a profession right now... but I
> will try to get to it!)
That's OK. As long as the last 2.6.23 kernel works OK, I'm fine.

Gabriel


Comment 31 Marcin Kurek 2008-03-25 06:48:46 EDT
I have separated /usr, /var, /tmp and /home in my setup here. 

/etc/fstab:

/dev/VolGroup00/LogVol01 /       xfs     defaults,noatime,nodiratime,logbufs=8   1 1
/dev/VolGroup00/LogVol03 /var    xfs     defaults,noatime,nodiratime,logbufs=8   1 2
/dev/VolGroup00/LogVol04 /home   xfs     defaults,noatime,nodiratime,logbufs=8   1 2
/dev/VolGroup00/LogVol02 /tmp    ext3    defaults,noatime                        1 2
LABEL=/boot              /boot   ext3    defaults,noatime                        1 2

Hmm, I wonder can this be caused by lack of write barriers ? I guess XFS uses
barriers by default, but when using LVM as here barriers are not
available/supported.
Comment 32 Gabriel VLASIU 2008-03-25 07:05:54 EDT
(In reply to comment #31)
> Hmm, I wonder can this be caused by lack of write barriers ? I guess XFS uses
> barriers by default, but when using LVM as here barriers are not
> available/supported.
I don't think so. I do not have LVM and / still became corrupt.
The strange part here is yum.
Fresh F8 install, upgrade kernel to 2.6.24.3-XX. As long as I do not use yum
everything is OK.
mount /dev/dvd /mnt/1
mkdir 1
cp /mnt/1/Packages/* /1
cp -r /1 /2
cp -r /1 /3
Reboot, rescue mode, xfs_repair, no errors.

If I run yum update, / becomes unusable.

Sincerely,
Gabriel
 
Comment 33 Marcin Kurek 2008-03-25 07:14:56 EDT
Yes, I must say this is a little weird, as my server worked 1-2 days on the .24 kernel
without any errors in the logs (we have all documents on it + ftp + samba + httpd)
until I decided to upgrade the system; then it failed. Maybe errors were there before,
hard to say, but searching the logs from those two days doesn't show anything wrong
until the yum update execution.

I wonder in my case most damages was in /lib and /lib64 directories then maybe
it's not yum but ldconfig and/or prelink ? 
Comment 34 Gabriel VLASIU 2008-03-25 07:29:51 EDT
(In reply to comment #33)
> I wonder in my case most damages was in /lib and /lib64 directories then maybe
> it's not yum but ldconfig and/or prelink ? 
Are you sure it is not /usr/lib and /usr/lib64?

ldconfig, maybe. Prelink - no. That's for sure. prelink did not have time to
start  (install, reboot, upgrade kernel, reboot, install updates, reboot, rescue
mode, xfs_repair, errors).

Sincerely,
Gabriel

Comment 35 Marcin Kurek 2008-03-25 07:52:26 EDT
Ahh, sorry, in /usr/lib*, but my / and /usr are on the same partition, so
this probably doesn't matter so much.
Comment 36 Marcin Kurek 2008-03-25 09:39:39 EDT
As I can see fedora uses a few patches for XFS:

linux-2.6-xfs-optimize-away-realtime-tests.patch
linux-2.6-xfs-setfattr-32bit-compat.patch
linux-2.6-xfs-xfs_mount-refactor.patch

And looking at the changelog shows all of them are quite old. I ask because
looking at the gentoo, debian and suse bugzillas doesn't show me any similar problem.
Comment 37 Eric Sandeen 2008-03-25 10:04:01 EDT
Testing a stock 2.6.24.3 kernel w/ the same config as fedora could be instructive...

If it turns out that that fails, perhaps we can devise a fairly simple,
repeatable automated test case to do a git bisect on, and narrow down when the
failure occurred. 

Running with barriers only really matters when it comes time to do a log replay,
so previous power losses w/o barriers could leave latent corruption.  But as
Gabriel said, he is on a regular block device, no lvm, so he should have had
barriers in place...

Comment 38 Eric Sandeen 2008-03-25 12:46:32 EDT
I still can't do a full install today, but I did this test in the background,
which is as close as I can get w/o doing an actual full/fresh install.

installed kernel-2.6.23.1-42.fc8.x86_64 and booted it.
mkfs'd a 3.8G (non-root) filesystem with F8-era xfsprogs
yum installed 1.8G worth of original F8 rpms on it, about 600 packages
ran xfs_repair -n, got no errors
installed kernel-2.6.24.3-34.fc8.x86_64 and booted it
yum updated the filesystem from above, it upgraded around 250 packages IIRC
ran xfs_repair -n, got no errors

Did you guys experience the first errors after some particular package
installed?  Perhaps some %post script is doing something interesting that
triggers it...

Hopefully can do a real install before the end of the week.
Comment 39 Gabriel VLASIU 2008-03-25 16:38:38 EDT
I made some more tests this evening.

1. Installed a fresh F8. Updated the kernel to 2.6.24.3-xx. Downloaded kernel 2.6.24.3
from kernel.org and compiled a new kernel with a .config file generated from the
fedora kernel-xxx.src.rpm
cat config-generic config-nodebug > temp-generic
perl merge.pl config-x86_64-generic temp-generic  > temp-x86_64-generic
perl merge.pl /dev/null temp-x86_64-generic x86_64 > kernel-2.6.24.3-x86_64.config
make menuconfig (load kernel-2.6.24.3-x86_64.config and save .config)
make bzImage && make modules && etc.
Booted the new kernel, ran yum update and restarted. Booted in rescue mode, ran
xfs_repair and got errors. :-(

2. New fresh F8, install new kernel 2.6.24.3-xx, reboot, yum update everything
but selinux*, policycoreutils and audit*. Reboot, rescue mode, xfs_repair and no
errors. Reboot in F8, update audit*, reboot, rescue mode, xfs_repair and still
no errors. Reboot again in F8 and update selinux*, reboot, rescue mode,
xfs_repair and the filesystem has errors.
I made a new attempt, but this time I updated policycoreutils instead of selinux*,
and / became corrupted again.

3. I tried again with the kernel from 1.) and / became corrupted when selinux* or
policycoreutils was installed.

4. I made an attempt as Eric suggested in comment #21 and could not reproduce the
filesystem corruption. But the selinux post-install script returned some errors
(unable to load policy).

Could something in the selinux* and/or policycoreutils packages corrupt kernel
memory somehow?

Sincerely,
Gabriel


Comment 40 Eric Sandeen 2008-03-25 19:10:42 EDT
Gabriel, thanks for that additional testing.  From this it looks like it is
probably selinux attribute related...  will look into that (pinged the sgi guys
again, too)

-Eric
Comment 41 Timothy Shimmin 2008-03-25 20:57:01 EDT
Hi,

Okay, basing on an assumption of assumptions, :)
if one gets a chance to give mkfs.xfs options then you could
try with "-i attr=1" to try version#1 EAs (we have had probs with v2 in the past)
and/or "-i size=512" to give better chance of EAs being inline within the inode
(larger inode size on a SELinux system probably makes more sense anyway -
I'm unsure what redhat set this to).

--Tim
Comment 42 Eric Sandeen 2008-03-25 21:32:44 EDT
Tim, it was Fedora that set it ;) and it should be defaults except for attr=2,
set by anaconda at install time.

ugh, which is, of course, something I forgot to set in my tests.

Gabriel can you confirm attr=2 on your root fs, with "xfs_info" on the mountpoint?

-Eric (rerunning w/ attr2...)
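
For anyone checking several machines, the attr= value can be pulled out of xfs_info output with a few lines of Python. This is only a sketch: the helper name attr_version is made up for illustration, and the sample text assumes the output layout shown elsewhere in this bug.

```python
import re


def attr_version(xfs_info_output):
    """Return the attr= value from xfs_info output as an int, or None if absent."""
    match = re.search(r"\battr=(\d+)", xfs_info_output)
    return int(match.group(1)) if match else None


# Sample text in the layout xfs_info prints in this bug report.
sample = """\
meta-data=/dev/loop0             isize=256    agcount=2, agsize=4096 blks
         =                       sectsz=512   attr=1
"""

print(attr_version(sample))
```

On a live system the text would come from running xfs_info on the mountpoint (for example through subprocess.run) instead of from the hard-coded sample.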
Comment 43 Eric Sandeen 2008-03-25 23:32:56 EDT
F8-era installer did:

        rc = iutil.execWithRedirect("mkfs.xfs",
                                    ["-f", "-l", "internal",
                                     "-i", "attr=2", devicePath],
                                    stdout = "/dev/tty5",
                                    stderr = "/dev/tty5", searchPath = 1)

but, due to the features2 flag packing/swapping issue, I'm not sure it gets
properly picked up as attr2...

[root@inode tmp]# mkfs.xfs -V
mkfs.xfs version 2.9.4
[root@inode tmp]# mkfs.xfs -dfile,name=fsfile,size=32m -i attr=2
meta-data=fsfile                 isize=256    agcount=2, agsize=4096 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=8192, imaxpct=25
         =                       sunit=0      swidth=0 blks, unwritten=1
naming   =version 2              bsize=4096  
log      =internal log           bsize=4096   blocks=1200, version=1
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=4096   blocks=0, rtextents=0
[root@inode tmp]# mkdir mnt
[root@inode tmp]# mount -o loop fsfile mnt/
[root@inode tmp]# xfs_info mnt
meta-data=/dev/loop0             isize=256    agcount=2, agsize=4096 blks
         =                       sectsz=512   attr=0
data     =                       bsize=4096   blocks=8192, imaxpct=25
         =                       sunit=0      swidth=0 blks, unwritten=1
naming   =version 2              bsize=4096  
log      =internal               bsize=4096   blocks=1200, version=1
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=4096   blocks=0, rtextents=0
[root@inode tmp]# touch mnt/foo
[root@inode tmp]# xfs_info mnt
meta-data=/dev/loop0             isize=256    agcount=2, agsize=4096 blks
         =                       sectsz=512   attr=1
data     =                       bsize=4096   blocks=8192, imaxpct=25
         =                       sunit=0      swidth=0 blks, unwritten=1
naming   =version 2              bsize=4096  
log      =internal               bsize=4096   blocks=1200, version=1
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=4096   blocks=0, rtextents=0

[root@inode tmp]# umount mnt
[root@inode tmp]# xfs_db fsfile 
xfs_db> sb 0
xfs_db> p features2
features2 = 0
Comment 44 Timothy Shimmin 2008-03-26 00:51:50 EDT
Yeah, I'm not sure attr2 gets through either (the default is v2 but I'm
not sure if an x86_64 or ia64 would see it).

It's confusing how xfs_db's version command disagrees with
features2, because db uses the structure (read using an
endian conversion function) instead of the offset & size.
i.e.
xfs_db> version
versionnum [0xb4a4+0xa] = 
V4,NLINK,ALIGN,DIRV2,LOGV2,EXTFLG,MOREBITS,ATTR2,LAZYSBCOUNT
===> note the 0xa in versionnum for features2
xfs_db> p features2
features2 = 0

Eric, so were we doing 32 bit tests with attr2 in the past,
when we were going thru all the attr2 woes?
It's all confusing :)

--Tim
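
To make the disagreement between "version" and "p features2" concrete, here is a minimal Python sketch of the packing/swapping problem as it is described in this bug. It is an illustration, not the kernel or xfsprogs code. It assumes that a 4-byte big-endian field followed by 4 bytes of structure padding gets converted as one 8-byte integer on a little-endian 64-bit host. All function names are hypothetical; 0xa is the value xfs_db printed above.

```python
import struct

FEATURES2 = 0xA  # value reported by "xfs_db> version" in this bug


def write_by_offset(value):
    """Intended layout: 4-byte big-endian field, then 4 unused padding bytes."""
    return struct.pack(">I", value) + b"\x00" * 4


def write_as_padded_u64(value):
    """Assumed buggy path: field plus padding swapped as one 64-bit integer."""
    incore = struct.pack("<I", value) + b"\x00" * 4  # little-endian in-core bytes
    (as_u64,) = struct.unpack("<Q", incore)          # misread as an 8-byte integer
    return struct.pack(">Q", as_u64)                 # written out big-endian


def read_by_offset(disk):
    """Read the 4-byte field at its nominal offset."""
    return struct.unpack(">I", disk[:4])[0]


def read_padding_slot(disk):
    """Read the 4 bytes that follow the nominal field."""
    return struct.unpack(">I", disk[4:8])[0]


disk = write_as_padded_u64(FEATURES2)
print(disk.hex())                    # 000000000000000a
print(read_by_offset(disk))          # 0
print(hex(read_padding_slot(disk)))  # 0xa
```

Under that assumption the flag bits land in the padding slot, so a reader that goes by offset and size sees 0 while a reader that repeats the same 8-byte conversion still sees 0xa.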
Comment 45 Gabriel VLASIU 2008-03-26 07:45:42 EDT
(In reply to comment #42)
> Gabriel can you confirm attr=2 on your root fs, with "xfs_info" on the mountpoint?
Only one of the three systems I tested has attr=2. The other two have attr=1.
Comment 46 Marcin Kurek 2008-03-26 08:15:57 EDT
attr=2 on my system. I took a look at the logs from the faulty upgrade too, and
it seems in my case there was a selinux policy upgrade. Anyway, I will upload a
log file in a few minutes.
Comment 47 Marcin Kurek 2008-03-26 08:22:20 EDT
Created attachment 299137 [details]
xfs_info output from partition that fails
Comment 48 Marcin Kurek 2008-03-26 08:23:49 EDT
Created attachment 299138 [details]
syslog from faulty yum upgrade
Comment 49 Eric Sandeen 2008-03-27 11:26:48 EDT
FWIW, I managed to reproduce this last night, even w/o the selinux-related updates.

I'll try to find some time this weekend to narrow it down.
Comment 50 Eric Sandeen 2008-03-27 23:49:28 EDT
From the testcase I came up with and some git bisecting this evening, looks like
this is the mod that broke it:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=2bdf7cd0baa67608ada1517a281af359faf4c58c

[XFS] superblock endianess annotations

now to sort out why...
Comment 51 Eric Sandeen 2008-03-29 01:17:06 EDT
Ok, I see exactly what is happening; it's a bit of a perfect storm of bugs.

I've sent a patch & explanation upstream tonight, we'll see what the sgi guys say.

I'll put the gory details here once the solution is agreed upon.... hopefully
will get this fixed up soon in F8.  Hm, and probably F9 needs it as well.

Thanks for your help on this, and sorry for questioning your hardware... :)

-Eric
Comment 52 Gabriel VLASIU 2008-04-03 09:19:42 EDT
Ok, kernel-2.6.24.4-74 (from koji) seems to work just fine:
Apr  3 14:27:45 localhost kernel: XFS: correcting sb_features alignment problem

Nice job! :-)

Anyway, I still have some more questions:
1. What about xfs on other kernels, especially on rhel5/centos5? Do I have to
backport this patch?

2. I do not see this patch in the latest F9 kernel (on koji). Do you think you
can add this patch to the F9 kernel? That would be nice.

Sincerely,
Gabriel
Comment 53 Eric Sandeen 2008-04-03 10:33:51 EDT
So here's a decent description of the bug:

http://oss.sgi.com/archives/xfs/2008-03/msg00355.html

Essentially your filesystem "lost" its attr2 flag as far as the newer kernel
knew; this actually should have been ok, except for the bug I pointed out in the
above message.  For now, for F8, I just put a patch in that stops it from
"losing" the flag, as this is easiest to inspect as correct.  The patch to make
attr2 filesystems safe when mounted as attr1 takes a bit more care but will
probably follow.

rhel5/centos5 should not have this problem.  (well, rhel5 certainly doesn't! ;)

centos 5 likely does not even set attr2 by default, and if it does, both kernel
& userspace mis-place the flag in the same way, so it's ok, at least until you
migrate between x86 and x86_64, or explicitly mount an attr2 fs as attr1.  So
the patch could be ported there but it's probably very unlikely to be hit with
the stock kernels & userspace.

I did check the patch into F9, it looks like it's just not built yet.

-Eric
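
One way to picture a fix that stops the flag from being "lost" is the sketch below. It is a guess at the general shape of such a workaround, not the actual F8 patch: it assumes the flag bits may sit either in the real 4-byte features2 slot or in the 4 bytes right after it, and it merges the two at mount time. The function name and the 8-byte input are hypothetical.

```python
import struct


def repair_features2(raw8):
    """Merge flag bits from the nominal slot and the slot that follows it.

    raw8 is 8 on-disk bytes: the 4-byte big-endian features2 field plus the
    next 4 bytes. Returns (new_bytes, changed).
    """
    features2, misplaced = struct.unpack(">II", raw8)
    if features2 == misplaced:
        return raw8, False            # already consistent, nothing to do
    merged = features2 | misplaced    # keep every bit either reader could see
    return struct.pack(">II", merged, merged), True


fixed, changed = repair_features2(bytes.fromhex("000000000000000a"))
print(fixed.hex(), changed)
```

Writing the merged value into both slots means a reader using either layout sees the same flags, which is easy to inspect as correct.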
Comment 54 Gabriel VLASIU 2008-04-03 12:40:11 EDT
(In reply to comment #53)
> So here's a decent description of the bug:
> 
> http://oss.sgi.com/archives/xfs/2008-03/msg00355.html
Thank you.

> (well, rhel5 certainly doesn't! ;)
Well... not unless you do: 
perl -p -i -e 's/# CONFIG_XFS_FS is not set/CONFIG_XFS_FS=m\n# CONFIG_XFS_RT is not set\nCONFIG_XFS_QUOTA=y\nCONFIG_XFS_POSIX_ACL=y\nCONFIG_XFS_SECURITY=y/' config-rhel-generic
:-)

> centos 5 likely does not even set attr2 by default, and if it does, both kernel
> & userspace mis-place the flag in the same way, so it's ok, at least until you
> migrate between x86 and x86_64, or explicitly mount an attr2 fs as attr1.
So what you are saying is I must use the noattr2 option when mounting an xfs
filesystem if I use the latest kernel (let's say for a mobile hdd which can be
mounted on several computers with different kernel/userspace versions)?

> I did check the patch into F9, it looks like it's just not built yet.
OK. Thank you.

Sincerely,
Gabriel
Comment 55 Raffaele Candeliere 2008-04-03 13:55:34 EDT
Hi everybody.
I've read through the whole list of messages. If it helps, I can add myself to
the list of those experiencing problems with xfs and 64-bit kernels.
My filesystem layout is a bit complicated, but not that much if we consider only
the Linux part.
I have an aluminum iMac with five partitions and a triple boot. The first
partition is the standard EFI one Apple uses to boot the system. The second
contains the journaled HFS+ filesystem with Mac OS X. The third contains the
Linux /boot partition I use to boot Fedora 8. The fourth contains an NTFS
filesystem, while the last one is an LVM with two xfs-formatted logical volumes
which hold, respectively, the "/" (root) directory and the "/home" directory.
Everything worked fine until the last update. Unfortunately I was not able to
take any snapshot or log message, but the problem looked very similar to all the
others.
A couple of days ago I "yum-updated" the system. Again, I didn't take much care
in writing things down or saving log files, because I was confident in the
"usual" normal conclusion of the update process. I can only say that the latest
kernel () was in the list. After the restart into the new kernel, the
directories /usr/lib64/kde3, /usr/share/man/man8 and /usr/lib64/openoffice/share/
were gone. The directories were there, but unreadable (I got "io error" at any
ls command). I then tried an xfs_repair. It completed successfully and the
filesystem appeared to be clean, but the directories were definitely gone. Just
a bunch of "inode-number-named" files in the "lost+found" directory.
I can only guess it may be a free space problem, because the GNOME "free disk
space" tool always showed a 100% full root ("/") directory BEFORE the xfs_repair
attempt, regardless of how many files I deleted in the vain attempt to restore
things.
Sorry for not being able to be more precise with log and dump files, but I
didn't think of it at that moment.
I hope this helps in any case.
Comment 56 Eric Sandeen 2008-04-03 14:16:15 EDT
It does sound like the same issue (though all the extra info about your setup
probably isn't relevant.)

The new kernel in Koji should prevent the problem from occurring again;
unfortunately I don't have a great recovery scheme for fs's which have already
been hit by this, other than xfs_repair, which may wind up moving lots to
lost+found.

Thanks,
-Eric
Comment 57 Eric Sandeen 2008-04-03 14:25:30 EDT
re: comment #54:

> So what you are saying is I must use the noattr2 option when mounting an xfs
> filesystem if I use the latest kernel (let's say for a mobile hdd which can be
> mounted on several computers with different kernel/userspace versions)?

If you have a filesystem which really did use attr2 (which is basically a
sliding divider in the inode between attribute & extent data, vs a fixed split
point with attr1) then running 2.6.24+ without this patch, or the one I
referenced in the above thread, is potentially dangerous, because if a file or
dir with attrs needs to add information (attr or extent) to the inode structure
it'll get the wrong split-point, and potentially corrupt that file.

If you only use the latest kernels (>= 2.6.24), you're fine, but if you switch
between <= 2.6.23 and >= 2.6.24 without these patches, or use <= 2.6.23 on 32
and 64 bit machines both, you'll be exposed to the bug.  I think I have that all
straight... ;)
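
The split-point problem can be pictured with a toy model. This is a deliberately simplified sketch, not the XFS algorithm: the 156-byte literal area, the halfway fixed split and the byte counts are all illustrative assumptions, and the function names are made up.

```python
LITERAL_AREA = 156  # illustrative: bytes left after the inode core in a 256-byte inode


def fixed_split(literal_area=LITERAL_AREA):
    """attr1-style toy rule: the attribute area always starts at the same offset."""
    return literal_area // 2


def sliding_split(attr_bytes, literal_area=LITERAL_AREA):
    """attr2-style toy rule: the attribute area takes only what it needs at the end."""
    return literal_area - attr_bytes


def clobbered_bytes(data_bytes, assumed_split):
    """Data bytes lying beyond the split point that the kernel assumed."""
    return max(0, data_bytes - assumed_split)


data_bytes, attr_bytes = 100, 40
real = sliding_split(attr_bytes)   # where the inode was actually laid out
assumed = fixed_split()            # where an attr1-minded kernel would split
print(real, assumed, clobbered_bytes(data_bytes, assumed))  # 116 78 22
```

In this toy case the data area legitimately extends to byte 100, but a kernel applying the fixed rule would treat everything past byte 78 as attribute space, so 22 bytes of extent data would be at risk when the inode is next modified.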
Comment 58 Gabriel VLASIU 2008-04-03 14:50:39 EDT
(In reply to comment #57)
> re: comment #54:
> I think I have that all straight... ;)
Well, that's bad. At least for me.
Thank you.


Sincerely,
Gabriel

Comment 59 Marcin Kurek 2008-04-04 02:23:40 EDT
I wonder if anyone else has tested this fix? Personally I prefer to be sure
before trying to install it on a production machine here, as my boss would
castrate me if something goes wrong.
Comment 60 Gabriel VLASIU 2008-04-04 02:51:51 EDT
(In reply to comment #59)
> I wonder if anyone else has tested this fix? Personally I prefer to be sure
> before trying to install it on a production machine here, as my boss would
> castrate me if something goes wrong.

Made 2 fresh installs, updated the kernel to 2.6.24.4-74, rebooted, updated
everything else and had no problems at all. I also updated the kernel on another
computer (2.6.23.15-137) and have had no problems at all (production machine).


Sincerely,
Gabriel
 
Comment 61 Eric Sandeen 2008-04-10 10:19:00 EDT
Patch is in CVS now, for F8 as well as F9.

FWIW, it was also requested that this fix get pulled for 2.6.25:

http://oss.sgi.com/archives/xfs/2008-04/msg00230.html

Thanks,
-Eric
Comment 62 Chuck Ebbert 2008-04-11 17:36:39 EDT
Fix is in 2.6.25-rc9
Comment 63 Fedora Update System 2008-04-21 20:04:37 EDT
kernel-2.6.24.5-85.fc8 has been submitted as an update for Fedora 8
Comment 64 Fedora Update System 2008-04-22 18:43:06 EDT
kernel-2.6.24.5-85.fc8 has been pushed to the Fedora 8 testing repository.  If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
su -c 'yum --enablerepo=updates-testing update kernel'
You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F8/FEDORA-2008-3260
Comment 65 Fedora Update System 2008-04-29 16:54:27 EDT
kernel-2.6.24.5-85.fc8 has been pushed to the Fedora 8 stable repository.  If problems still persist, please make note of it in this bug report.
