Bug 60886
Summary: | Serious fs problems on IBM xseries 250
---|---
Product: | [Retired] Red Hat Linux
Component: | kernel
Version: | 7.1
Hardware: | i386
OS: | Linux
Status: | CLOSED WORKSFORME
Severity: | medium
Priority: | medium
Reporter: | Vidar Langseid <vl>
Assignee: | Arjan van de Ven <arjanv>
QA Contact: | Brian Brock <bbrock>
CC: | sct
Doc Type: | Bug Fix
Last Closed: | 2002-03-13 11:18:14 UTC

Description
Vidar Langseid
2002-03-08 08:48:22 UTC
Created attachment 47880
Problemdescription for firmware patch
Can you run a full fsck to see if the filesystem is OK? Are there any other kernel filesystem or IO errors reported in the logs?

> Can you run a full fsck to see if the filesystem is ok ?

Yes, of course. But I have to wait until later today. You see, the server is in production (and that worries me).

> Are there any other kernel filesystem or IO errors reported in the logs?

No, but I ran "dmesg" and found a couple of instances of:

    (ips0) Resetting controller.

Is it possible to find out which directory #9733406 is while the volume is mounted?

Yes: "find" and "debugfs" will both work.

    find /path/to/mount/point -xdev -type d -inum 9733406

should do the trick. For /path/to/mount/point you want to use whatever mount point /dev/sda7 is on.

I'd also like to see what "debugfs" shows for that inode:

    debugfs 1.23, 15-Aug-2001 for EXT2 FS 0.5b, 95/08/09
    debugfs: stat <9733406>

to see if we're just looking at a transient read IO failure from the device driver, or a true on-disk corruption.

--Stephen

    # debugfs /dev/sda7
    debugfs 1.25 (20-Sep-2001)
    debugfs: stat <9733406>
    Inode: 9733406   Type: directory   Mode: 0775   Flags: 0x0   Generation: 541074356
    User: 1520   Group: 1524   Size: 4096
    File ACL: 0   Directory ACL: 0
    Links: 3   Blockcount: 8
    Fragment:  Address: 0   Number: 0   Size: 0
    ctime: 0x3c5aadeb -- Fri Feb 1 16:02:03 2002
    atime: 0x3c88422e -- Fri Mar 8 05:46:38 2002
    mtime: 0x3c5aadeb -- Fri Feb 1 16:02:03 2002
    BLOCKS:
    (0):19468861
    TOTAL: 1

As you see, I am running version 1.25 of e2fsprogs.

Could you dump the contents of the directory to a file and attach that to this report?

    # debugfs /dev/sda7
    debugfs 1.25 (20-Sep-2001)
    debugfs: dump <9733406> /tmp/dir.img

should do the trick. I'd like to know if the directory is still corrupt, and if so, what the corruption looks like.

When you have seen this in the past, what has happened afterwards? Did the kernel carry on, or did you reboot? Was there a forced fsck, and if so, did it find any problems?
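For readers comparing the hex and human-readable timestamps in the "stat" output above, here is a minimal Python sketch that decodes the raw fields. It assumes the hex values are plain seconds since the Unix epoch; the helper name decode_inode_time is made up for illustration and is not part of e2fsprogs.

```python
from datetime import datetime, timezone


def decode_inode_time(hex_field):
    """Convert a hex timestamp such as '0x3c5aadeb' to an aware UTC datetime.

    Assumption: the field holds seconds since the Unix epoch.
    """
    return datetime.fromtimestamp(int(hex_field, 16), tz=timezone.utc)


if __name__ == "__main__":
    # Values copied from the debugfs output in this report.
    for name, raw in [("ctime", "0x3c5aadeb"),
                      ("atime", "0x3c88422e"),
                      ("mtime", "0x3c5aadeb")]:
        print(name, decode_inode_time(raw).isoformat())
```

The decoded values are in UTC and come out one hour earlier than the times debugfs printed (15:02:03 rather than 16:02:03 for ctime), which would be consistent with debugfs having shown local time on a machine set to UTC+1.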
The issue here is to find out if we're getting corruption on disk, or if it is just the read of the disk data into memory which is getting messed up. I've seen hardware problems cause the latter type of corruption on several platforms.

Just for pattern-hunting in the corruption report, what motherboard/cpu/chipset are you using?

    debugfs: ncheck 9733406
    Inode     Pathname
    9733406   .../mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod

I couldn't figure out how to attach that dump in Bugzilla without making it publicly available, so I'll send it directly to you (sct). (It's not that confidential, but anyway.)

I have the opportunity to do a chkdsk soon. Do you need more information from the filesystem before I go ahead with that?

> When you have seen this in the past, what has happened afterwards? Did the
> kernel carry on, or did you reboot? Was there a forced fsck, and if so, did
> it find any problems?

I unmounted the volume and ran e2fsck. It found several errors, and I had to delete numerous inodes. After that I rebooted the system. I don't recall whether the fs was marked as clean or not.

> Just for pattern-hunting in the corruption report, what motherboard/cpu/chipset
> are you using?

Dual Xeon, 700 MHz, 1 GB RAM. It's an IBM xseries 250; I don't know anything more about the motherboard/chipset. Can I find this information in /proc?
    [root@valen /root]# cat /proc/cpuinfo
    processor       : 0
    vendor_id       : GenuineIntel
    cpu family      : 6
    model           : 10
    model name      : Pentium III (Cascades)
    stepping        : 1
    cpu MHz         : 699.199
    cache size      : 2048 KB
    fdiv_bug        : no
    hlt_bug         : no
    f00f_bug        : no
    coma_bug        : no
    fpu             : yes
    fpu_exception   : yes
    cpuid level     : 2
    wp              : yes
    flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse
    bogomips        : 1395.91

    processor       : 1
    vendor_id       : GenuineIntel
    cpu family      : 6
    model           : 10
    model name      : Pentium III (Cascades)
    stepping        : 1
    cpu MHz         : 699.199
    cache size      : 2048 KB
    fdiv_bug        : no
    hlt_bug         : no
    f00f_bug        : no
    coma_bug        : no
    fpu             : yes
    fpu_exception   : yes
    cpuid level     : 2
    wp              : yes
    flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse
    bogomips        : 1395.91

OK, that file contains a bunch of XML, not just mildly corrupt directory data. This is looking like serious drive corruption. It would be useful to know if the corruption is just in memory or if it is present on disk, too; if a reboot/fsck shows the same problem, that points to the latter.

Certainly, this sort of corruption is most commonly caused by driver or hardware problems/interactions. It's not a minor data shift or bit flip indicative of a DMA problem or memory problem.

I ran e2fsck. It found some unattached inodes and some wrong ref counts on some inodes. I'll contact IBM and let them have a look at it.

The unattached inodes and the wrong refcounts would be natural results of the directory being corrupt: the directory references that the inode was recording were not there. Were there not errors concerning the directory itself, though?

As I reported earlier, I ran:

    debugfs: ncheck 9733406
    Inode     Pathname
    9733406   .../mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod/manual/mod

These directories were moved to lost+found. I also got reports claiming that "." and ".." were corrupted for that directory.

It looks like IBM hasn't experienced any similar problems on these servers. They want to replace the RAID controller and see if that resolves the problem.
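
For anyone who wants to script the inode-to-path lookup discussed in this report, the following is a rough Python sketch of what the suggested "find ... -xdev -type d -inum" command does on a mounted filesystem. The function name find_dirs_by_inode is hypothetical, and the sketch only illustrates the idea; it does not replace find or debugfs ncheck, and it cannot see anything the kernel refuses to read from a corrupt directory.

```python
import os


def find_dirs_by_inode(mount_point, inode):
    """Return directories under mount_point whose inode number equals inode.

    Stays on the filesystem that mount_point itself lives on, similar in
    spirit to `find -xdev -type d -inum N`.
    """
    root_dev = os.lstat(mount_point).st_dev
    matches = []
    # os.walk does not follow symlinks by default, so every dirpath it
    # yields is a real directory.
    for dirpath, dirnames, _filenames in os.walk(mount_point):
        st = os.lstat(dirpath)
        if st.st_dev != root_dev:
            # Another filesystem is mounted here: skip it and do not descend.
            dirnames[:] = []
            continue
        if st.st_ino == inode:
            matches.append(dirpath)
    return matches
```

A call such as find_dirs_by_inode("/path/to/mount/point", 9733406) would mirror the command given earlier, with the placeholder path replaced by wherever /dev/sda7 is mounted.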