Recently, on a Fedora 20 machine with Rawhide kernels (from the fedora-rawhide-kernel-nodebug repo), I've had some errors cropping up with ext4 (once) and tmpfs (twice). The symptoms are that df reports the partition as full, but du shows there is plenty of space remaining.

First, the ext4 problem: last week, I ran out of space on /home. I proceeded to weed out some things (old build trees, various media files, etc.). du showed the space as being reclaimed, but df didn't show anything until the machine was rebooted. The problem hasn't shown up again, but I'm also no longer near the limit.

For the tmpfs problems, I've run out of space (and it is currently showing an overflow as well) after using the machine for a day or two:

% df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb3        49G   19G   28G  40% /
devtmpfs        3.9G     0  3.9G   0% /dev
tmpfs           3.9G   58M  3.9G   2% /dev/shm
tmpfs           3.9G   13M  3.9G   1% /run
tmpfs           3.9G     0  3.9G   0% /sys/fs/cgroup
tmpfs           3.9G  -64Z -208K 100% /tmp
/dev/sdb1       190M  119M   58M  68% /boot
/dev/sda1       212G  130G   72G  65% /home

# du -bhs /tmp
311     /tmp

As root, after some cleanup attempts (e.g., shutting down services which had directories there), du reports only 311 bytes in use (a tmux socket, some X stuff, a few directories, and an ssh-agent socket).

Since this has occurred on both ext4 and tmpfs, it makes me think something above the filesystem layer has gone wrong, but I don't know how to track it down. The other possibility is a service holding open an ever-growing file, but I've seen nothing in /proc indicating that, and I'd expect that to be more consistent than this (plus /home and /tmp are completely different filesystems showing the same symptoms). System logs show nothing of note (no oops). I do have the vbox module installed; I've blacklisted it for future boots and will see whether things still occur.
% uname -a
Linux erythro 3.13.0-0.rc8.git2.2.fc21.x86_64 #1 SMP Wed Jan 15 16:18:47 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

I'll leave the system with the "full" tmpfs up tonight and try some data gathering in the morning when I get back into work; after that I'll have to reboot to be able to compile things again, and further data gathering would have to wait until it happens again.
OK, so it's also a gradual thing. Any way to find out what it is before I'm forced to reboot again?

# du -bhsx /tmp
1.2K    /tmp
# df -h /tmp
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           3.9G  2.2G  1.8G  55% /tmp
and neither lsof nor /proc/*/fd/* points at anything in /tmp?

du stats files one at a time and adds them up; df does a statfs and queries the fs-wide stats in the superblock. It seems really unlikely that there is a common bug between ext4 & tmpfs, tbh.

On ext4, did e.g. "sync" change the df numbers?
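To illustrate the distinction, here's a minimal sketch (not part of the bug report, just for clarity) of the two code paths in Python: du-style per-entry stat accumulation versus a single df-style statvfs call. An open-but-unlinked file is the classic way the two disagree, since df still counts its blocks while du never sees a name to stat.

```python
import os

def du_bytes(path):
    """Sum apparent sizes the way 'du -b' does: walk the tree and
    stat each entry individually, adding the sizes up."""
    total = 0
    for root, dirs, files in os.walk(path):
        for name in files + dirs:
            try:
                st = os.lstat(os.path.join(root, name))
            except OSError:
                continue  # entry vanished mid-walk
            total += st.st_size
    return total

def df_bytes(path):
    """Query the filesystem-wide counters in one statvfs call, the
    way df does: used = (total blocks - free blocks) * block size."""
    st = os.statvfs(path)
    return (st.f_blocks - st.f_bfree) * st.f_frsize
```

If the superblock counters are mis-accounted (as the -64Z overflow suggests), df_bytes goes wrong while du_bytes stays correct, since the latter never consults those counters at all.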
(In reply to Eric Sandeen from comment #2)
> and neither lsof nor /proc/*/fd/* points at anything in /tmp?

# ls -l /proc/*/fd/* | grep tmp
ls: cannot access /proc/5082/fd/255: No such file or directory
ls: cannot access /proc/5082/fd/3: No such file or directory
ls: cannot access /proc/self/fd/255: No such file or directory
ls: cannot access /proc/self/fd/3: No such file or directory
lr-x------. 1 boeckb boeckb 64 Jan 28 14:44 /proc/1530/fd/38 -> /tmp/.esd-1000
# lsof /tmp
COMMAND    PID   USER  FD  TYPE DEVICE SIZE/OFF  NODE NAME
pulseaudi 1530 boeckb 38r   DIR   0,33       60 24897 /tmp/.esd-1000

> It seems really unlikely that there is a common bug between ext4 & tmpfs,
> tbh.

That's fine; I'm sorta shooting into the dark there :) .

> On ext4, did i.e. "sync" change the df numbers?

I don't remember if I tried "sync" for the ext4 issue (I think I did, but I'm not sure), but it doesn't work for tmpfs at least.
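For anyone hitting the same symptoms, here is a sketch of how to hunt for unlinked-but-open files, the usual cause of a du/df mismatch (the /tmp path is illustrative):

```shell
# Symlinks in /proc/<pid>/fd read "NAME (deleted)" once the file is
# unlinked; such files still consume blocks that df counts but that
# du cannot see, because no directory entry remains to stat.
ls -l /proc/[0-9]*/fd 2>/dev/null | grep '/tmp/.*(deleted)'

# lsof can do the same in one step: +L1 selects open files whose
# link count is zero (unlinked but still held open).
lsof +L1 /tmp
```

In this bug neither approach turned anything up, which is what points the suspicion at the tmpfs accounting itself rather than a leaked descriptor.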
And is /tmp/.esd-1000 big and/or present? Or, if it's deleted, does "du /proc/1530/fd/38" work? (I'm not sure if it does, tbh.)

Not sure how to debug the tmpfs thing; there's no way to get a filesystem image - it seems to be a bug in accounting somewhere. If you can work out a testcase to provoke it...
.esd-1000 is a directory holding a single socket. Running `du` on any /proc/*/fd/* returns 0.
So it looks like it is holding steady today, so I'm guessing it happens in chunks?

# du -bhsx /tmp
1.3K    /tmp
# df -h /tmp
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           3.9G  2.2G  1.8G  56% /tmp
It just happened again, while compiling large projects (VTK and ParaView). This would probably explain the "happening in chunks" behavior, but not why it doesn't happen sooner.

Possibly relevant package information:

kernel-3.13.0-0.rc8.git2.2.fc21.x86_64
gcc-4.8.2-7.fc20.x86_64
ccache-3.1.9-4.fc20.x86_64
I've confirmed that it is compilation that is causing this (unmounted /tmp, remounted, started compiling and watched /tmp usage shoot up in df).
Created attachment 857208 [details]
strace log of GCC which permanently increases tmpfs usage

Here's an strace log of a compile which takes up tmpfs space. Using -pipe fixes the issue for now.
This does not have anything to do with the file system. /tmp is clearly being filled with gcc temporary files, and it goes away when using -pipe, which forces GCC to use pipes instead of files; that confirms it.
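For reference, the workaround looks like this (circular.cxx is the attached reproducer; any source file is compiled the same way). Redirecting TMPDIR is an alternative workaround that should also apply, since GCC honors it when placing its temporary files:

```shell
# -pipe streams intermediate results (e.g. the assembler input)
# between compiler stages through pipes instead of temporary files
# under $TMPDIR, so nothing is written to the tmpfs /tmp at all.
g++ -pipe -c circular.cxx -o circular.o

# Alternatively, point GCC's temporary files at a disk-backed path:
TMPDIR=/var/tmp g++ -c circular.cxx -o circular.o
```

Neither of these explains the du/df disagreement after the compiler has exited, though; they only avoid triggering it.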
OK. That's fine, but how do I remove them? They don't have names, so "rm" is useless. *Something* is wrong here, since du and df disagree even after all GCC processes have exited.
The tmpfs statfs overflow certainly looks like a problem, still...

> tmpfs           3.9G  -64Z -208K 100% /tmp

Also, Lukas, I might have missed something, but: if ls -a shows no files in /tmp, and they aren't open+unlinked (not found in /proc), how is this attributable to a simple filling of the filesystem?

Ben, perhaps some more targeted tests which involve filling & removing files in a tmpfs mount yourself might be interesting, to see if you can provoke a problem in a simpler, more controlled test. Tight testcases almost always lead to bugfixes. :)
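A sketch of the kind of targeted test Eric suggests: fill and empty a private tmpfs mount and compare df's "used" counter before and after. The mount point, size, and file counts are illustrative, and the mount step needs root.

```shell
#!/bin/sh
# Fill/remove stress test for a private tmpfs mount, to check
# whether "used" blocks leak after every file has been deleted.
set -e
mnt=$(mktemp -d)
mount -t tmpfs -o size=256m tmpfs "$mnt"

before=$(df -P "$mnt" | awk 'NR==2 {print $3}')
for i in $(seq 1 200); do
    dd if=/dev/zero of="$mnt/f$i" bs=64k count=4 2>/dev/null
done
rm -f "$mnt"/f*
sync
after=$(df -P "$mnt" | awk 'NR==2 {print $3}')

echo "used before: $before  after: $after"
# On a healthy kernel the two numbers should match.
umount "$mnt" && rmdir "$mnt"
```

If plain create/delete cycles don't reproduce it, the trigger may be something more specific in GCC's temp-file pattern (e.g. unlinking files while they are still open), which a refinement of this loop could imitate.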
OK, found an example that I had lying around which reproduces it. It seems not to occur if the binary is small enough; removing one line causes it not to occur (commented in the source). strace logs with and without -pipe are also attached.

Before:
tmpfs    4083668 2753500 1330168  68% /tmp
After:
tmpfs    4083668 2753624 1330044  68% /tmp

Occasionally, it will drop by 4 bytes after compiling a file, but it's sporadic.
Created attachment 858836 [details]
strace log which does not cause the issue (without -pipe)

Caught an instance where it doesn't occur without -pipe.
Created attachment 858837 [details] strace log which does not cause the issue (with -pipe)
Created attachment 858838 [details] strace log when g++ showed the issue
Created attachment 858839 [details]
source code for strace logs

Just a plain "g++ circular.cxx"; boost-devel is required. I can try to find an example without boost if that would help as well.
Eric, it would be better to know what gcc and ccache are actually doing there, because I have no idea and it might explain a few things. Not sure who I can cc for the gcc side :) Any idea ?

-Lukas
Lukas, I don't really know. It'd be nice if someone ;) had time to look at the strace & concoct a testcase that doesn't require boost-devel etc.

Too bad there seem to be no tracepoints in mm/shmem.c

-Eric
Seems to have stopped occurring with 3.14.0-0.rc1.git5.2.fc21.1.x86_64, but I'll keep an eye on it for a few weeks yet.
So I haven't seen this in months. Presumably it has been fixed or is no longer triggered by my workflows.
Ok, thanks. Closing, then.