Recently, on a Fedora 20 machine with Rawhide kernels (from the fedora-rawhide-kernel-nodebug repo), I've had some errors cropping up with ext4 (once) and tmpfs (twice). The symptoms are that df reports the partition as full, but du shows there is plenty of space remaining.

First, the ext4 problem: last week, I ran out of space on /home. I proceeded to weed out some things (old build trees, various media files, etc.). du showed the space as being reclaimed, but df didn't show anything until the machine was rebooted. The problem hasn't shown up again, but I'm also no longer near the limit.

For the tmpfs problems, I've run out of space (and it is currently showing an overflow as well) after using the machine for a day or two:

% df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb3        49G   19G   28G  40% /
devtmpfs        3.9G     0  3.9G   0% /dev
tmpfs           3.9G   58M  3.9G   2% /dev/shm
tmpfs           3.9G   13M  3.9G   1% /run
tmpfs           3.9G     0  3.9G   0% /sys/fs/cgroup
tmpfs           3.9G  -64Z -208K 100% /tmp
/dev/sdb1       190M  119M   58M  68% /boot
/dev/sda1       212G  130G   72G  65% /home

# du -bhs /tmp
311     /tmp

As root, after some cleanup attempts (e.g., shutting down services which had directories there), du reports only 311 bytes in use (a tmux socket, some X stuff, a few directories, and an ssh-agent socket).

Since this has occurred on both ext4 and tmpfs, it makes me think something above the filesystem layer has gone wrong, but I don't know how to track it down. The other possibility is a service holding open an ever-growing file, but I've seen nothing in /proc indicating that, and I'd expect that to be more consistent than this (plus /home and /tmp are completely different filesystems showing the same symptoms). System logs show nothing of note (no oops). I do have the vbox module installed; I've blacklisted it for future boots and will see whether things still occur.
% uname -a
Linux erythro 3.13.0-0.rc8.git2.2.fc21.x86_64 #1 SMP Wed Jan 15 16:18:47 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

I'll leave the system with the "full" tmpfs up tonight and try some data gathering in the morning when I get back into work; after that I'll have to reboot to be able to compile things again, and further data gathering would have to wait until it happens again.
OK, so it's also a gradual thing. Any way to find out what it is before I'm forced to reboot again?

# du -bhsx /tmp
1.2K    /tmp
# df -h /tmp
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           3.9G  2.2G  1.8G  55% /tmp
and neither lsof nor /proc/*/fd/* points at anything in /tmp?

du stats files one at a time and adds them up; df does a statfs and queries the fs-wide stats in the superblock. It seems really unlikely that there is a common bug between ext4 & tmpfs, tbh.

On ext4, did e.g. "sync" change the df numbers?
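To illustrate the distinction, here's a minimal sketch (not part of the bug report, just for clarity) of the two code paths in Python: du-style per-entry stat accumulation versus a single df-style statvfs call. An open-but-unlinked file is the classic way the two disagree, since df still counts its blocks while du never sees a name to stat.

```python
import os

def du_bytes(path):
    """Sum apparent sizes the way 'du -b' does: walk the tree and
    stat each entry individually, adding the sizes up."""
    total = 0
    for root, dirs, files in os.walk(path):
        for name in files + dirs:
            try:
                st = os.lstat(os.path.join(root, name))
            except OSError:
                continue  # entry vanished mid-walk
            total += st.st_size
    return total

def df_bytes(path):
    """Query the filesystem-wide counters in one statvfs call, the
    way df does: used = (total blocks - free blocks) * block size."""
    st = os.statvfs(path)
    return (st.f_blocks - st.f_bfree) * st.f_frsize
```

If the superblock counters are mis-accounted (as the -64Z overflow suggests), df_bytes goes wrong while du_bytes stays correct, since the latter never consults those counters at all.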
(In reply to Eric Sandeen from comment #2)
> and neither lsof nor /proc/*/fd/* points at anything in /tmp?

# ls -l /proc/*/fd/* | grep tmp
ls: cannot access /proc/5082/fd/255: No such file or directory
ls: cannot access /proc/5082/fd/3: No such file or directory
ls: cannot access /proc/self/fd/255: No such file or directory
ls: cannot access /proc/self/fd/3: No such file or directory
lr-x------. 1 boeckb boeckb 64 Jan 28 14:44 /proc/1530/fd/38 -> /tmp/.esd-1000
# lsof /tmp
COMMAND    PID   USER  FD  TYPE DEVICE SIZE/OFF  NODE NAME
pulseaudi 1530 boeckb 38r   DIR   0,33       60 24897 /tmp/.esd-1000

> It seems really unlikely that there is a common bug between ext4 & tmpfs,
> tbh.

That's fine; I'm sorta shooting into the dark there :) .

> On ext4, did i.e. "sync" change the df numbers?

I don't remember if I tried "sync" for the ext4 issue (I think I did, but I'm not sure), but it doesn't work for tmpfs at least.
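For anyone hitting the same symptoms, here is a sketch of how to hunt for unlinked-but-open files, the usual cause of a du/df mismatch (the /tmp path is illustrative):

```shell
# Symlinks in /proc/<pid>/fd read "NAME (deleted)" once the file is
# unlinked; such files still consume blocks that df counts but that
# du cannot see, because no directory entry remains to stat.
ls -l /proc/[0-9]*/fd 2>/dev/null | grep '/tmp/.*(deleted)'

# lsof can do the same in one step: +L1 selects open files whose
# link count is zero (unlinked but still held open).
lsof +L1 /tmp
```

In this bug neither approach turned anything up, which is what points the suspicion at the tmpfs accounting itself rather than a leaked descriptor.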
And is /tmp/.esd-1000 big and/or present? Or, if it's deleted, does "du /proc/1530/fd/38" work? (I'm not sure if it does, tbh.)

Not sure how to debug the tmpfs thing; there's no way to get a filesystem image - it seems to be a bug in accounting somewhere. If you can work out a testcase to provoke it...
.esd-1000 is a directory holding a single socket. Running `du` on any /proc/*/fd/* returns 0.
So it looks like it is holding steady today, so I'm guessing it happens in chunks?

# du -bhsx /tmp
1.3K    /tmp
# df -h /tmp
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           3.9G  2.2G  1.8G  56% /tmp
It just happened again, while compiling large projects (VTK and ParaView). This would probably explain the "happening in chunks" behavior, but not why it doesn't happen sooner.

Possibly relevant package information:

kernel-3.13.0-0.rc8.git2.2.fc21.x86_64
gcc-4.8.2-7.fc20.x86_64
ccache-3.1.9-4.fc20.x86_64
I've confirmed that it is compilation that is causing this (unmounted /tmp, remounted, started compiling and watched /tmp usage shoot up in df).
Created attachment 857208 [details]
strace log of GCC which permanently increases tmpfs usage

Here's an strace log of a compile which takes up tmpfs space. Using -pipe fixes the issue for now.
This does not have anything to do with the file system. /tmp is clearly being filled with gcc temporary files, and it goes away when using -pipe, which forces GCC to use pipes instead of files; that confirms it.
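For reference, the workaround looks like this (circular.cxx is the attached reproducer; any source file is compiled the same way). Redirecting TMPDIR is an alternative workaround that should also apply, since GCC honors it when placing its temporary files:

```shell
# -pipe streams intermediate results (e.g. the assembler input)
# between compiler stages through pipes instead of temporary files
# under $TMPDIR, so nothing is written to the tmpfs /tmp at all.
g++ -pipe -c circular.cxx -o circular.o

# Alternatively, point GCC's temporary files at a disk-backed path:
TMPDIR=/var/tmp g++ -c circular.cxx -o circular.o
```

Neither of these explains the du/df disagreement after the compiler has exited, though; they only avoid triggering it.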
OK. That's fine, but how do I remove them? They don't have names, so "rm" is useless. *Something* is wrong here, since du and df disagree even after all GCC processes have exited.
The tmpfs statfs overflow certainly looks like a problem, still...

> tmpfs           3.9G  -64Z -208K 100% /tmp

Also, Lukas, I might have missed something, but: if ls -a shows no files in /tmp, and they aren't open+unlinked (not found in /proc), how is this attributable to a simple filling of the filesystem?

Ben, perhaps some more targeted tests which involve filling & removing files in a tmpfs mount yourself might be interesting, to see if you can provoke a problem in a simpler, more controlled test. Tight testcases almost always lead to bugfixes. :)
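A sketch of the kind of targeted test Eric suggests: fill and empty a private tmpfs mount and compare df's "used" counter before and after. The mount point, size, and file counts are illustrative, and the mount step needs root.

```shell
#!/bin/sh
# Fill/remove stress test for a private tmpfs mount, to check
# whether "used" blocks leak after every file has been deleted.
set -e
mnt=$(mktemp -d)
mount -t tmpfs -o size=256m tmpfs "$mnt"

before=$(df -P "$mnt" | awk 'NR==2 {print $3}')
for i in $(seq 1 200); do
    dd if=/dev/zero of="$mnt/f$i" bs=64k count=4 2>/dev/null
done
rm -f "$mnt"/f*
sync
after=$(df -P "$mnt" | awk 'NR==2 {print $3}')

echo "used before: $before  after: $after"
# On a healthy kernel the two numbers should match.
umount "$mnt" && rmdir "$mnt"
```

If plain create/delete cycles don't reproduce it, the trigger may be something more specific in GCC's temp-file pattern (e.g. unlinking files while they are still open), which a refinement of this loop could imitate.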
OK, found an example that I had lying around which reproduces it. It seems not to occur if the binary is small enough; removing one line causes it not to occur (commented in the source). strace logs with and without -pipe are also attached.

Before:
tmpfs    4083668 2753500 1330168  68% /tmp
After:
tmpfs    4083668 2753624 1330044  68% /tmp

Occasionally, it will drop by 4 bytes after compiling a file, but it's sporadic.
Created attachment 858836 [details]
strace log which does not cause the issue (without -pipe)

Caught an instance where it doesn't occur without -pipe.
Created attachment 858837 [details] strace log which does not cause the issue (with -pipe)
Created attachment 858838 [details] strace log when g++ showed the issue
Created attachment 858839 [details]
source code for strace logs

Just a plain "g++ circular.cxx"; boost-devel is required. I can try to find an example without boost if that would help as well.
Eric, it would be better to know what gcc and ccache are actually doing there, because I have no idea and it might explain a few things. Not sure who I can cc for the gcc side :) Any idea ?

-Lukas
Lukas, I don't really know. It'd be nice if someone ;) had time to look at the strace & concoct a testcase that doesn't require boost-devel etc.

Too bad there seem to be no tracepoints in mm/shmem.c

-Eric
Seems to have stopped occurring with 3.14.0-0.rc1.git5.2.fc21.1.x86_64, but I'll keep an eye on it for a few weeks yet.
So I haven't seen this in months. Presumably it has been fixed or is no longer triggered by my workflows.
Ok, thanks. Closing, then.