Bug 201461

Summary: ext3 filesystem problems
Product: Red Hat Enterprise Linux 3
Reporter: Karl O. Pinc <kop>
Component: kernel
Assignee: Eric Sandeen <esandeen>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Brian Brock <bbrock>
Severity: medium
Priority: medium
Version: 3.0
CC: petrides
Hardware: athlon   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2006-09-20 20:28:33 UTC
Attachments:
find /proc/ide -type f -exec bash -c 'echo {} ; cat {} ; echo' \; (flags: none)

Description Karl O. Pinc 2006-08-05 18:22:36 UTC
Description of problem:

Did an ls -lah on a directory and it showed only '.', not '..'
(and nothing else; there should have been files).

The system was slow logging in (ssh), but vmstat showed
no cpu usage.  iostat showed a continuous writing of about
50 blocks/sec (in total) to the /, /var, and /usr filesystems.
I could not get lsof to show any files open for writing
on /usr.  (I filtered the lsof output for write access on
files after scanning it manually.  There _might_ have been
directory access??)  Normally there's less than 10 blocks/sec,
most of it on /var logging spam & the occasional email.
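
The lsof filtering described above can be sketched as a small shell function. This is only an illustration, not what was actually run: the name filter_write_fds is made up here, and it assumes lsof's default column layout (FD in the fourth column, with a trailing w or u marking write or read/write access) and command names without whitespace.

```shell
# Hypothetical helper: keep the lsof header plus rows whose FD column
# (4th field in lsof's default output) is a numeric descriptor opened
# for writing (w) or for read/write (u).
# Reads lsof output on stdin so it can sit at the end of a pipeline.
filter_write_fds() {
    awk 'NR == 1 || $4 ~ /^[0-9]+[wu]/'
}
```

Something like `lsof /usr | filter_write_fds` would then list only descriptors open for writing on the filesystem mounted at /usr. Entries such as cwd or txt are dropped by this filter, so directory access would have to be checked separately.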

Went into single user mode, did an fsck -f on all filesystems.
Got no errors.  After remounting, the directory once again
contained all the files it should.
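
A check for the symptom described above (a listing with '.' but no '..') can be scripted so it is easy to repeat before and after a remount. A minimal sketch: check_dot_entries is a hypothetical helper, not an existing tool, and it assumes one entry per line, which is how ls -a prints when its output is piped.

```shell
# Hypothetical helper: read a directory listing (one entry per line) on
# stdin and report whether both "." and ".." are present.
# Prints "ok" and returns 0 if both are there; otherwise prints one
# "missing: <entry>" line per absent entry and returns 1.
check_dot_entries() {
    listing=$(cat)
    status=0
    for entry in . ..; do
        # -F: literal match, -x: whole line, so "." does not match "..".
        if ! printf '%s\n' "$listing" | grep -q -x -F -e "$entry"; then
            echo "missing: $entry"
            status=1
        fi
    done
    if [ "$status" -eq 0 ]; then
        echo "ok"
    fi
    return "$status"
}
```

Usage would be along the lines of `ls -a /path/to/dir | check_dot_entries`, run once while the problem is visible and again after the remount.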

We've also been getting errors from rsync like:

Aug  5 04:44:05 jcp rsyncd[28183]: rsync to jcp-backup from root.edu (128.135.44.143)
Aug  5 04:44:38 jcp rsyncd[28183]: write failed on etc/ld.so.cache : Success
Aug  5 04:44:38 jcp rsyncd[28183]: rsync error: error in file IO (code 11) at receiver.c(272)
Aug  5 04:44:38 jcp rsyncd[28183]: rsync: connection unexpectedly closed (4255086 bytes read so far)
Aug  5 04:44:38 jcp rsyncd[28183]: rsync error: error in rsync protocol data stream (code 12) at io.c(165)

Version-Release number of selected component (if applicable):
# uname -a
Linux jcp.uchicago.edu 2.4.21-40.EL #1 Thu Feb 2 22:22:40 EST 2006 i686 athlon i386 GNU/Linux


How reproducible:

Haven't really tried too much.  Only found the "disappearing directory
contents" by accident.

Steps to Reproduce:
1.
2.
3.
  
Actual results:

The rsync problems started about the time I upgraded to 2.4.21-47.EL
(I think), so I rebooted into 2.4.21-40.EL and things seem normal.

Expected results:


Additional info:

We've got an academic support contract, but I can't for the life of me
figure out how to submit a bug using it.  Maybe it only gets us
updates?

No SATA drives, although the motherboard supports SATA.
All problems are on IDE drives:

# lspci
00:00.0 Host bridge: VIA Technologies, Inc. VT8385 [K8T800 AGP] Host Bridge (rev 01)
00:01.0 PCI bridge: VIA Technologies, Inc. VT8237 PCI bridge [K8T800/K8T890 South]
00:08.0 Class 4401: C-Media Electronics Inc CM8738 (rev 10)
00:0b.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit Ethernet (rev 10)
00:0d.0 RAID bus controller: Promise Technology, Inc. PDC20378 (FastTrak 378/SATA 378) (rev 02)
00:0e.0 FireWire (IEEE 1394): VIA Technologies, Inc. IEEE 1394 Host Controller (rev 80)
00:0f.0 IDE interface: VIA Technologies, Inc. VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE (rev 06)
00:10.0 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 81)
00:10.1 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 81)
00:10.2 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 81)
00:10.3 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 81)
00:10.4 USB Controller: VIA Technologies, Inc. USB 2.0 (rev 86)
00:11.0 ISA bridge: VIA Technologies, Inc. VT8237 ISA bridge [KT600/K8T800/K8T890 South]
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
01:00.0 VGA compatible controller: nVidia Corporation NV34 [GeForce FX 5200] (rev a1)

Comment 1 Karl O. Pinc 2006-08-05 18:22:36 UTC
Created attachment 133696 [details]
find /proc/ide -type f -exec bash -c 'echo {} ; cat {} ; echo' \;

Comment 2 Karl O. Pinc 2006-08-07 20:56:09 UTC
The slow logging-in was due to a bad nameserver elsewhere in the network.

Last night this message showed up on the target system when copying from the
problem system:
Aug  7 04:31:20 jcp-backup rsyncd[23849]: readlink jcp/u/ems/C6.06.241/Editor/Refltr/rltr.rr1.-1.C6.06.241.20Jul06.txt: Input/output error
The /jcp/u/ems/C6.06.241/ directory is the one mentioned at the very beginning
of this bug report, the directory which contained only a "." and no "..".
As of today I see no problems while poking about that part of the filesystem.
(There's nothing in the problem system's logs.  Clocks are synchronized.)

Today I'm able to reproduce some sort of rsync problem when rsync-ing _to_ the
box with the problem, using 2.4.21-40.EL kernels on both sides, without having
to run out of inodes.  (The server side rsync process seems to disappear.)  Log
messages on the server side (the machine with the reported problem) are:

Aug  7 13:05:14 jcp rsyncd[32555]: rsync allowed access on module jcp-backup from jcp-backup.uchicago.edu (128.135.44.143)
Aug  7 13:05:14 jcp rsyncd[32555]: rsync to jcp-backup from root.edu (128.135.44.143)
Aug  7 13:06:03 jcp rsyncd[32555]: write failed on etc/ld.so.cache : Success
Aug  7 13:06:03 jcp rsyncd[32555]: rsync error: error in file IO (code 11) at receiver.c(272)
Aug  7 13:06:03 jcp rsyncd[32555]: rsync: connection unexpectedly closed (4255086 bytes read so far)
Aug  7 13:06:03 jcp rsyncd[32555]: rsync error: error in rsync protocol data stream (code 12) at io.c(165)

Note that this is a completely different filesystem, on the problem machine.

Maybe the rsync issue is unrelated as well?  Or maybe there's something else
going on.  I've scheduled downtime for a memory test.

Comment 3 Karl O. Pinc 2006-08-08 17:13:25 UTC
Ran memtest86+ v 1.65 on both source and destination machine and everything passed.

The rsync errors from the last post are from filling up the filesystem (blocks,
not inodes).  Sorry.

I still have no explanation for the original problem, the directory that showed
up with just a "." as contents.  There may yet be a problem with ext3 when you
run out of inodes.
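
Running out of blocks and running out of inodes are easy to confuse, and both can be checked with df. A minimal sketch, assuming a df that supports the POSIX -P output format together with -i (GNU df does) and mount points without spaces; the function name full_mounts and the 95 percent default are arbitrary choices for illustration.

```shell
# Hypothetical helper: print mount points whose usage is at or above a
# threshold (percent, default 95).
# Reads `df -P` (blocks) or `df -P -i` (inodes) output on stdin; in both
# layouts the 5th column is the use percentage and the 6th the mount point.
full_mounts() {
    threshold=${1:-95}
    awk -v t="$threshold" '
        NR > 1 {
            pct = $5
            sub(/%/, "", pct)          # "100%" -> "100"; "-" stays and counts as 0
            if (pct + 0 >= t) print $6
        }'
}
```

`df -P | full_mounts` would report filesystems that are nearly out of blocks, and `df -P -i | full_mounts` those that are nearly out of inodes.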

My plan now is to switch back to the 2.4.21-47.EL kernel and let you guys worry
about possible ext3 problems.

Comment 4 Eric Sandeen 2006-08-08 17:39:55 UTC
So, to summarize: the one actual problem is the missing entries from "ls -lah",
and this problem disappears after a remount?  The slowness was due to a bad
nameserver and the rsync errors due to a full filesystem, right?

Do these missing entries happen only after the filesystem runs out of space?
Does it always take a remount for the missing entries to re-appear?
Is it always the same directory which has this problem?

Thanks,
-Eric

Comment 5 Karl O. Pinc 2006-08-08 18:28:50 UTC
Sorry about the confused bug report, wanted to get in any info that might be
relevant.

Yes, the one problem is that I did a "ls -lah" and got only ".".

The missing entries reappear after reboot and fsck -f (which gave me no
errors).  I did not just try remounting, sorry.  I can't say more as I've
not seen the problem re-occur.

The only other thing to say is that when the problem occurred there were
2 other unusual occurrences.  (Neither of which should matter. :)  The system
that had the "." problem had a full filesystem (out of blocks) on another
partition, not the partition with the "." problem.  An rsync was periodically
running and failing while trying to put more data on the full partition.  (I
suppose it's remotely possible that the directory where I discovered the
problem was on the full partition.  You know how these things get to be a blur
after a while.)  Meanwhile, another rsync was periodically running, copying the
partition that had the "." problem to a remote machine -- and the remote
machine's partition ran out of inodes.  (See
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=381486  Three machines are
involved here: the one with the problem is a RH AS3, as is the one copying to
the problem machine.  The third, receiving data from the problem machine, is a
Debian box.)

So, I don't know what to tell you here.  Unlikely as it seems, maybe if the
system's busy dealing with a full partition it will somehow, sometimes, show
something weird happening in a directory on another partition?  The strange
thing is that fsck -f reported no problems.  The most likely answer is that I've
made some sort of mistake in my reporting.  But I really did see a directory
with only a "." in it.  Given what's being backed up where, it is possible that
the directory with the "." problem was on the full partition.  The filesystem
path would differ by only 1 component between what I thought I was looking at
and the full partition.  Still, why no errors reported by fsck?

(Next time I'll remember to use script when recovering a system.)

I'm willing to answer more questions, but would not blame you if you wanted to
close the bug.  I'm afraid the system's in production use so I can't run trials
filling up the disk.

One more thought.  I was running rsync with the option that preserves hard
links.  Maybe it can leave a directory in a strange state when the filesystem
fills?

Comment 6 Eric Sandeen 2006-09-20 20:28:33 UTC
I think I'm going to have to close this one - if you see this again, and
can come up with a bit clearer path to reproduction, please do reopen.

Thanks,
-Eric