Bug 60886

Summary: Serious fs problems on IBM xseries 250
Product: [Retired] Red Hat Linux
Reporter: Vidar Langseid <vl>
Component: kernel
Assignee: Arjan van de Ven <arjanv>
Status: CLOSED WORKSFORME
QA Contact: Brian Brock <bbrock>
Severity: medium
Priority: medium
Version: 7.1
CC: sct
Target Milestone: ---   
Target Release: ---   
Hardware: i386   
OS: Linux   
Last Closed: 2002-03-13 11:18:14 UTC
Attachments:
Problemdescription for firmware patch (flags: none)

Description Vidar Langseid 2002-03-08 08:48:22 UTC
Description of Problem:

I was met with these messages when I got to work today:

Mar  8 05:46:38 valen kernel: EXT2-fs error (device sd(8,7)): ext2_check_page:
bad entry in directory #9733406: directory entry across blocks - offset=0,
inode=1967530032, rec_len=29804, name_len=111


During the last 3 mounts, this has happened twice. It looks like something
serious is wrong here, since this happens now and then.

The server is running apache, MySQL and the nfs server.

The server is running hardware RAID 5
# cat /proc/scsi/ips/2
IBM ServeRAID General Information:
 
        Controller Type                   : ServeRAID 4L
        Memory region                     : 0xf6ffe000 (8192 bytes)
        Shared memory address             : 0xf884b000
        IRQ number                        : 21
        BIOS Version                      : 4.70.17
        Firmware Version                  : 4.70.17p
        Boot Block Version                : 4.30.04
        Driver Version                    : 4.80.26
        Max Physical Devices              : 15
        Max Active Commands               : 64
        Current Queued Commands           : 0
        Current Active Commands           : 3
        Current Queued PT Commands        : 0
        Current Active PT Commands        : 0


Version-Release number of selected component (if applicable):
kernel-2.4.9-12smp

How Reproducible:
I am not able to reproduce this on demand. It just happens now and then....

Steps to Reproduce:
1. 
2. 
3. 

Actual Results:


Expected Results:


Additional Information:
Before the server was put into production, we had to upgrade the firmware. We
experienced bad performance problems with the nfs server. That problem may be
related to BUG 42355? Anyway, the firmware update solved that problem. The
problem description from IBM will follow in a separate attachment.

Comment 1 Vidar Langseid 2002-03-08 08:51:17 UTC
Created attachment 47880 [details]
Problemdescription for firmware patch

Comment 2 Arjan van de Ven 2002-03-08 10:22:33 UTC
Can you run a full fsck to see if the filesystem is ok ?

Comment 3 Stephen Tweedie 2002-03-08 10:43:41 UTC
Are there any other kernel filesystem or IO errors reported in the logs?

Comment 4 Vidar Langseid 2002-03-08 11:51:18 UTC
> Can you run a full fsck to see if the filesystem is ok ?
Yes, of course. But I have to wait until later today. You see, the server is in
production. (and that worries me)


> Are there any other kernel filesystem or IO errors reported in the logs?
No, but I did a "dmesg" and found a couple of instances with:
(ips0) Resetting controller.

Is it possible to find out which directory #9733406 is while the volume is
mounted?


Comment 5 Stephen Tweedie 2002-03-08 12:47:43 UTC
Yes --- "find" and "debugfs" will both work.

  find /path/to/mount/point -xdev -type d -inum 9733406

should do the trick.  For /path/to/mount/point you want to use whatever mount
point /dev/sda7 is on.
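
For reference, the "sd(8,7)" in the kernel message is a major/minor device
number pair. A minimal sketch of how that pair maps to /dev/sda7, assuming the
conventional Linux SCSI disk numbering for major 8 (16 minors per disk, with
the first minor of each group being the whole disk). The helper name
scsi_dev_name is made up for illustration:

```python
import string


def scsi_dev_name(major: int, minor: int) -> str:
    """Map a (major, minor) pair to a /dev/sdXN name (hypothetical helper)."""
    # Assumption: only the first SCSI disk major (8) is handled here.
    if major != 8:
        raise ValueError("this sketch only handles SCSI disk major 8")
    if not 0 <= minor <= 255:
        raise ValueError("minor must be in the range 0..255")
    # Each disk owns 16 consecutive minors: 0 is the whole disk,
    # 1..15 are its partitions.
    disk, part = divmod(minor, 16)
    name = "/dev/sd" + string.ascii_lowercase[disk]
    return name if part == 0 else name + str(part)


print(scsi_dev_name(8, 7))  # /dev/sda7
```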

I'd also like to see what "debugfs" shows for that inode:

  debugfs 1.23, 15-Aug-2001 for EXT2 FS 0.5b, 95/08/09
  debugfs:  stat <9733406> 

to see if we're just looking at a transient read IO failure from the device
driver, or a true on-disk corruption.

--Stephen

Comment 6 Vidar Langseid 2002-03-08 13:44:30 UTC
# debugfs /dev/sda7
debugfs 1.25 (20-Sep-2001)
debugfs:  stat <9733406>
Inode: 9733406   Type: directory    Mode:  0775   Flags: 0x0   Generation:
541074356
User:  1520   Group:  1524   Size: 4096
File ACL: 0    Directory ACL: 0
Links: 3   Blockcount: 8
Fragment:  Address: 0    Number: 0    Size: 0
ctime: 0x3c5aadeb -- Fri Feb  1 16:02:03 2002
atime: 0x3c88422e -- Fri Mar  8 05:46:38 2002
mtime: 0x3c5aadeb -- Fri Feb  1 16:02:03 2002
BLOCKS:
(0):19468861
TOTAL: 1


As you see, I am running 1.25 of e2fsprogs.
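
The hex values in the stat output are seconds since the Unix epoch. A small
sketch that decodes them, assuming the server's local time was UTC+1 (this is
inferred from the one-hour difference between the raw values and the printed
times; it is not stated anywhere in this report). Under that assumption the
atime coincides with the 05:46:38 kernel error:

```python
from datetime import datetime, timedelta, timezone

# Assumption: the machine's local zone was UTC+1 at the time.
LOCAL = timezone(timedelta(hours=1))


def decode(hex_seconds: str) -> datetime:
    """Convert a hex epoch value as printed by debugfs to a local datetime."""
    return datetime.fromtimestamp(int(hex_seconds, 16), tz=LOCAL)


# Values copied from the debugfs "stat <9733406>" output.
for label, raw in [("ctime", "0x3c5aadeb"),
                   ("atime", "0x3c88422e"),
                   ("mtime", "0x3c5aadeb")]:
    print(label, decode(raw).strftime("%Y-%m-%d %H:%M:%S"))
```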

Comment 7 Stephen Tweedie 2002-03-08 14:40:43 UTC
Could you dump the contents of the directory to a file and attach that to this
report?

  # debugfs /dev/sda7
  debugfs 1.25 (20-Sep-2001)
  debugfs:  dump <9733406> /tmp/dir.img

should do the trick.  I'd like to know if the directory is still corrupt, and if
so, what the corruption looks like.  

When you have seen this in the past, what has happened afterwards?  Did the
kernel carry on, or did you reboot?  Was there a forced fsck, and if so, did it
find any problems?

The issue here is to find out if we're getting corruption on disk, or if it is
just the read of the disk data into memory which is getting messed up.  I've
seen hardware problems cause the latter type of corruption on several platforms.


Comment 8 Stephen Tweedie 2002-03-08 14:44:43 UTC
Just for pattern-hunting in the corruption report, what motherboard/cpu/chipset
are you using?

Comment 9 Vidar Langseid 2002-03-08 15:15:10 UTC
debugfs:  ncheck 9733406
Inode   Pathname
9733406
.../mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod

I couldn't figure out how to attach that dump in bugzilla without making it
publicly available, so I'll send it directly to you (sct). (It's not that
confidential, but anyway.)


I have the opportunity to do a disk check (fsck) soon. Do you need more
information from the filesystem before I go ahead with that?





Comment 10 Vidar Langseid 2002-03-08 15:26:52 UTC
>When you have seen this in the past, what has happened afterwards?  Did the
> kernel carry on, or did you reboot?  Was there a forced fsck, and if so, did
>it find any problems?
I unmounted the volume and ran e2fsck. It found several errors. I had to delete
numerous inodes. After that I rebooted the system.
I don't recall if the fs was marked as clean or not.



>Just for pattern-hunting in the corruption report, what motherboard/cpu/chipset
> are you using?
Dual Xeon, 700MHz, 1GB RAM
It's an IBM xseries 250. I don't know anything more about the
motherboard/chipset. Can I find this information in /proc?

[root@valen /root]# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 10
model name      : Pentium III (Cascades)
stepping        : 1
cpu MHz         : 699.199
cache size      : 2048 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 mmx fxsr sse
bogomips        : 1395.91
 
processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 10
model name      : Pentium III (Cascades)
stepping        : 1
cpu MHz         : 699.199
cache size      : 2048 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 mmx fxsr sse
bogomips        : 1395.91






Comment 11 Stephen Tweedie 2002-03-08 17:05:16 UTC
OK, that file contains a bunch of XML, not just mildly corrupt directory data. 
This is looking like serious drive corruption.  It would be useful to know if
the corruption is just in memory or if it is present on disk, too --- if a
reboot/fsck shows the same problem, that points to the latter.

Certainly, this sort of corruption is most commonly caused by driver or hardware
problems/interactions.  It's not a minor data shift or bit flip indicative of a
DMA problem or memory problem.
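
One way to see that the block held file content rather than a damaged
directory entry is to re-pack the three header fields from the original kernel
message into on-disk byte order. This is only a sketch, assuming the usual ext2
directory entry layout (32-bit little-endian inode, 16-bit little-endian
rec_len, 8-bit name_len) and the 4096-byte block size implied by the debugfs
output. The variable names are illustrative, and the span test is a simplified
stand-in for the kernel's own check:

```python
import struct

# Assumption: 4096-byte blocks, as suggested by "Size: 4096" with one block.
BLOCK_SIZE = 4096

# Values copied from the EXT2-fs error line in the description.
inode, rec_len, name_len = 1967530032, 29804, 111
offset = 0

# Rebuild the first 7 bytes of the entry as they would sit on disk.
header = struct.pack("<IHB", inode, rec_len, name_len)
print(header)  # b'0 Fulto'

# Simplified span test: the entry would run past the end of its block.
crosses_block = offset + rec_len > BLOCK_SIZE
print(crosses_block)  # True
```

The bytes are all printable ASCII, which fits text having been read in place of
the directory block, and a rec_len larger than the block is consistent with the
"directory entry across blocks" message.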

Comment 12 Vidar Langseid 2002-03-08 17:57:01 UTC
I ran e2fsck. It found some unattached inodes and some wrong ref counts on some
inodes.

I'll contact IBM and let them have a look at it.


Comment 13 Stephen Tweedie 2002-03-08 18:03:08 UTC
The unattached inodes and the wrong refcounts would be natural results of the
directory being corrupt --- the directory references that the inode was
recording were not there.

Were there not errors concerning the directory itself, though?

Comment 14 Vidar Langseid 2002-03-08 18:21:24 UTC
As I reported earlier, I did a:
debugfs:  ncheck 9733406
 Inode   Pathname
 9733406
.../mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod

These directories were moved to lost+found. I also got reports claiming that
"." and ".." were corrupted for that directory.




Comment 15 Vidar Langseid 2002-03-13 09:09:46 UTC
It looks like IBM hasn't experienced any similar problems on these servers.
They want to replace the RAID controller and see if that resolves the problem.