Description of problem:

Did an ls -lah on a directory; it showed only '.', not '..' (and nothing else, although there should have been files).

The system was slow logging in (ssh), but vmstat showed no cpu usage. iostat showed continuous writing of about 50 blocks/sec (in total) to the /, /var, and /usr filesystems. I could not get lsof to show any files open for writing on /usr. (I filtered the lsof output looking for write access on files, after scanning manually. There _might_ have been directory access??) Normally there's less than 10 blocks/sec, most of it on /var, logging spam and the occasional email.

Went into single user mode and did an fsck -f on all filesystems. Got no errors. After remounting, the directory once again contained all the files it should.

We've also been getting errors from rsync like:

Aug 5 04:44:05 jcp rsyncd[28183]: rsync to jcp-backup from root.edu (128.135.44.143)
Aug 5 04:44:38 jcp rsyncd[28183]: write failed on etc/ld.so.cache : Success
Aug 5 04:44:38 jcp rsyncd[28183]: rsync error: error in file IO (code 11) at receiver.c(272)
Aug 5 04:44:38 jcp rsyncd[28183]: rsync: connection unexpectedly closed (4255086 bytes read so far)
Aug 5 04:44:38 jcp rsyncd[28183]: rsync error: error in rsync protocol data stream (code 12) at io.c(165)

Version-Release number of selected component (if applicable):

# uname -a
Linux jcp.uchicago.edu 2.4.21-40.EL #1 Thu Feb 2 22:22:40 EST 2006 i686 athlon i386 GNU/Linux

How reproducible:

Haven't really tried too much. Only found the "disappearing directory contents" by accident.

Steps to Reproduce:
1.
2.
3.

Actual results:

The rsync problems started about the time I upgraded to 2.4.21-47.EL (I think), so I rebooted into 2.4.21-40.EL and things seem normal.

Expected results:

Additional info:

We've got an academic support contract, but I can't for the life of me figure out how to submit a bug using it. Maybe it only gets us updates?

No sata drives, although the mobo supports sata. All problems are on ide drives:

# lspci
00:00.0 Host bridge: VIA Technologies, Inc. VT8385 [K8T800 AGP] Host Bridge (rev 01)
00:01.0 PCI bridge: VIA Technologies, Inc. VT8237 PCI bridge [K8T800/K8T890 South]
00:08.0 Class 4401: C-Media Electronics Inc CM8738 (rev 10)
00:0b.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit Ethernet (rev 10)
00:0d.0 RAID bus controller: Promise Technology, Inc. PDC20378 (FastTrak 378/SATA 378) (rev 02)
00:0e.0 FireWire (IEEE 1394): VIA Technologies, Inc. IEEE 1394 Host Controller (rev 80)
00:0f.0 IDE interface: VIA Technologies, Inc. VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE (rev 06)
00:10.0 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 81)
00:10.1 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 81)
00:10.2 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 81)
00:10.3 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 81)
00:10.4 USB Controller: VIA Technologies, Inc. USB 2.0 (rev 86)
00:11.0 ISA bridge: VIA Technologies, Inc. VT8237 ISA bridge [KT600/K8T800/K8T890 South]
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
01:00.0 VGA compatible controller: nVidia Corporation NV34 [GeForce FX 5200] (rev a1)
Created attachment 133696
find /proc/ide -type f -exec bash -c 'echo {} ; cat {} ; echo' \;
The slow logging-in was due to a bad nameserver elsewhere in the network.

Last night this message showed up on the target system when copying from the problem system:

Aug 7 04:31:20 jcp-backup rsyncd[23849]: readlink jcp/u/ems/C6.06.241/Editor/Refltr/rltr.rr1.-1.C6.06.241.20Jul06.txt: Input/output error

The /jcp/u/ems/C6.06.241/ directory is the one mentioned at the very beginning of this bug report, the directory which contained only a "." and no "..". As of today I see no problems while poking about that part of the filesystem. (There's nothing in the problem system's logs. Clocks are synchronized.)

Today I'm able to reproduce some sort of rsync problem when rsync-ing _to_ the box with the problem, using 2.4.21-40.EL kernels on both sides, without having to run out of inodes. (The server side rsync process seems to disappear.) Log messages on the server side (the machine with the reported problem) are:

Aug 7 13:05:14 jcp rsyncd[32555]: rsync allowed access on module jcp-backup from jcp-backup.uchicago.edu (128.135.44.143)
Aug 7 13:05:14 jcp rsyncd[32555]: rsync to jcp-backup from root.edu (128.135.44.143)
Aug 7 13:06:03 jcp rsyncd[32555]: write failed on etc/ld.so.cache : Success
Aug 7 13:06:03 jcp rsyncd[32555]: rsync error: error in file IO (code 11) at receiver.c(272)
Aug 7 13:06:03 jcp rsyncd[32555]: rsync: connection unexpectedly closed (4255086 bytes read so far)
Aug 7 13:06:03 jcp rsyncd[32555]: rsync error: error in rsync protocol data stream (code 12) at io.c(165)

Note that this is a completely different filesystem on the problem machine. Maybe the rsync issue is unrelated as well? Or maybe there's something else going on. I've scheduled downtime for a memory test.
Ran memtest86+ v 1.65 on both the source and destination machines and everything passed.

The rsync errors from the last post are from filling up the filesystem (blocks, not inodes). Sorry.

I still have no explanation for the original problem, the directory that showed up with just a "." as its contents. There may yet be a problem with ext3 when you run out of inodes.

My plan now is to switch back to the 2.4.21-47.EL kernel and let you guys worry about possible ext3 problems.
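Since "out of blocks" and "out of inodes" got mixed up above, here is a minimal sketch of how the two cases could be told apart from df output. It assumes a df that supports -i alongside -P (GNU df does; -i is not required by POSIX), and the function names classify_fs and check_fs are made up for illustration.

```shell
#!/bin/sh
# classify_fs FREE_BLOCKS FREE_INODES
# Hypothetical helper: prints which resource (if any) is exhausted.
classify_fs() {
    free_blocks=$1
    free_inodes=$2
    if [ "$free_blocks" -le 0 ] && [ "$free_inodes" -le 0 ]; then
        echo "out of blocks and inodes"
    elif [ "$free_blocks" -le 0 ]; then
        echo "out of blocks"
    elif [ "$free_inodes" -le 0 ]; then
        echo "out of inodes"
    else
        echo "ok"
    fi
}

# check_fs PATH
# Reads column 4 of the second line of "df -P" (Available blocks) and of
# "df -Pi" (IFree). Assumption: the device name contains no spaces, and the
# filesystem reports real inode counts (some report 0 or "-" instead).
check_fs() {
    blocks=$(df -P "$1" | awk 'NR==2 {print $4}')
    inodes=$(df -Pi "$1" | awk 'NR==2 {print $4}')
    classify_fs "$blocks" "$inodes"
}
```

Something like check_fs /var would then report which limit was hit. Note that on ext2/ext3 the Available column excludes the blocks reserved for root, so a process running as root may still be able to write when this reports "out of blocks".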
So, to summarize: the one actual problem is the missing entries from "ls -lah", and this problem disappears after a remount? The slowness was due to a bad nameserver and the rsync errors due to a full filesystem, right?

Do these missing entries happen only after the filesystem runs out of space?

Does it always take a remount for the missing entries to re-appear?

Is it always the same directory which has this problem?

Thanks,
-Eric
Sorry about the confused bug report; I wanted to get in any info that might be relevant.

Yes, the one problem is that I did an "ls -lah" and got only ".". The missing entries reappeared after a reboot and fsck -f (which gave me no errors). I did not try just remounting, sorry.

I can't say more as I've not seen the problem re-occur. The only other thing to say is that when the problem occurred there were 2 other unusual occurrences. (Neither of which should matter. :)

The system that had the "." problem had a full filesystem (out of blocks) on another partition, not the partition with the "." problem. An rsync was periodically running and failing while trying to put more data on the full partition. (I suppose it's remotely possible that the directory where I discovered the problem was on the full partition. You know how these things get to be a blur after a while.)

Meanwhile, another rsync was periodically running, copying the partition that had the "." problem to a remote machine, and the remote machine's partition ran out of inodes. (See http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=381486 for that one. Three machines are involved here: the one with the problem is a RH AS3, as is the one copying to the problem machine. The third, receiving data from the problem machine, is a Debian box.)

So, I don't know what to tell you here. Unlikely as it seems, maybe if the system's busy dealing with a full partition it will somehow, sometimes, show something weird happening in a directory on another partition? The strange thing is that fsck -f reported no problems.

The most likely answer is that I've made some sort of mistake in my reporting. But I really did see a directory with only a "." in it. Given what's being backed up where, it is possible that the directory with the "." problem was on the full partition. The filesystem path would differ by only 1 component between what I thought I was looking at and the full partition. Still, why no errors reported by fsck? (Next time I'll remember to use script when recovering a system.)

I'm willing to answer more questions, but would not blame you if you wanted to close the bug. I'm afraid the system's in production use, so I can't run trials filling up the disk.

One more thought. I was running rsync with the option that preserves hard links. Maybe it can leave a directory in a strange state when the filesystem fills?
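For anyone trying to catch this again, a minimal sketch of a check that could be run periodically (e.g. from cron) against a suspect directory. It only tests whether "ls -a" lists both "." and ".."; the function name has_dot_entries is made up for illustration, and this is not a substitute for fsck.

```shell
#!/bin/sh
# has_dot_entries DIR
# Returns 0 if "ls -a" lists both "." and ".." for DIR,
# 1 if either entry is missing, 2 if DIR cannot be listed at all.
has_dot_entries() {
    listing=$(ls -a1 "$1") || return 2
    # grep -x matches whole lines only, so "." does not match "..".
    printf '%s\n' "$listing" | grep -qx '\.' || return 1
    printf '%s\n' "$listing" | grep -qx '\.\.' || return 1
    return 0
}
```

Usage would look something like:

has_dot_entries /jcp/u/ems/C6.06.241 || echo "directory listing looks wrong"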
I think I'm going to have to close this one. If you see this again and can come up with a somewhat clearer path to reproduction, please do reopen.

Thanks,
-Eric