Bug 750344

Summary: fsck on reboot fails on very large filesystems with small ram machines unless swap is enabled first
Product: Red Hat Enterprise Linux 5
Reporter: Mike Hardy <redhat>
Component: initscripts
Assignee: Lukáš Nykrýn <lnykryn>
Status: CLOSED WONTFIX
QA Contact: qe-baseos-daemons
Severity: high
Priority: unspecified
Version: 5.7
CC: esandeen, initscripts-maint-list, notting
Target Milestone: rc
Keywords: Reopened
Hardware: i386
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2013-03-11 14:14:19 UTC
Attachments:
Start swap before executing fsck instead of after (patch)

Description Mike Hardy 2011-10-31 19:06:49 UTC
Created attachment 531021 [details]
Start swap before executing fsck instead of after

Description of problem:

I have a machine with 1GB of RAM but a very large software RAID filesystem (4TB). I reached the maximum number of mounts for the filesystem in between filesystem checks, so on reboot the system attempted to fsck the filesystem - this is all as expected and works well.

However, for a very large filesystem on a machine with limited RAM (where 1GB of RAM vs. a 4TB ext3 filesystem apparently counts as limited), the fsck fails with this error message:

Error allocating block bitmap (4): Memory allocation failed
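(For a rough sense of scale, assuming 4KiB blocks: a 4TB filesystem has on the order of a billion blocks, so a single in-memory block bitmap is already roughly a billion bits, or ~128MB, and e2fsck keeps several such bitmaps plus per-inode bookkeeping across its passes - so 1GB of RAM can genuinely run out.)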

If you enable swap prior to performing the fsck, things work fine.


Version-Release number of selected component (if applicable):
RHEL 5.7

How reproducible:
every time

Steps to Reproduce:
1. Create a very large filesystem, and restrict the amount of physical RAM the kernel may use
2. Somehow trigger a boot-time fsck for the filesystem (set /forcefsck or use tune2fs to trigger a check after each mount - see the example after this list)
3. Reboot
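
As a concrete example for step 2 (the device name is illustrative):

tune2fs -c 1 /dev/md4    # force a filesystem check after every mount
touch /forcefsck         # or: request a one-shot check on the next boot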
  
Actual results:
System reports a boot error and drops to a single-user repair shell

Expected results:
System performs the fsck and completes bootup

Additional info:
I fixed the problem for myself by doing this:
mount -o rw,remount /
vi /etc/fstab
(edit the fstab to comment out the very large filesystem)
shutdown -r now

At that point I edited /etc/rc.sysinit and moved the swap stanza so that it was above the stanza where the fsck command was actually called. I'm attaching a patch that expresses that - it appeared to be a minimal change and works for me - but I recognize it may have unintended side effects, since swap may have other dependencies I'm not using, so someone well-versed in rc.sysinit should examine it.
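
For illustration only - this is not the attached patch verbatim, and the exact stanza wording in RHEL 5's rc.sysinit differs - the idea is simply to activate swap before the filesystem checks run:

# before (simplified): filesystem checks run first, swap is enabled later
fsck -T -a /dev/md4
swapon -a -e

# after (simplified): swap is enabled first, so a memory-hungry fsck can page
swapon -a -e
fsck -T -a /dev/md4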

Hope this helps

Comment 1 Bill Nottingham 2011-10-31 20:22:51 UTC
I'm a bit leery of changing the order at this point. Eric, is this expected behavior for large ext3 filesystems, or is it a bug that it's taking this much memory?

Comment 2 Mike Hardy 2011-10-31 20:46:03 UTC
I was leery too, as I know swap can be encrypted or network-backed and I don't have those dependency graphs in my head, but for my boring non-RAID, non-encrypted, local swap setup it was good enough.

I'm sure you guys have virtual playgrounds where you can test e2fsprogs changes like this, but if this results in an e2fsprogs patch to reduce memory consumption, I'm happy to beta test a sample RPM or binary with the changes. I have a non-prod machine to do it with and am very comfortable with manual rpm installation etc., as well as collecting stats from a fsck while it's running (info from /proc or /sys or similar) if that's useful. Likewise, if there's interest but you have reproduction difficulties, I'm happy to provide more details about my current setup.

I don't have anything else to offer though so I'll butt out otherwise. Thanks!

Comment 3 Eric Sandeen 2011-10-31 20:52:59 UTC
I don't think it's likely a bug, though I guess it could be.  I'd probably need to see an image of the fs to offer better guidance.

Fsck memory usage depends on the details of the filesystem; I could imagine running out in this situation, yes.

It's a crummy end-result that a forced-fsck of an as-far-as-we-know healthy filesystem caused a boot failure... (!)

I guess I too like the idea of swapon before running fsck - fsck is kind of a wildcard, and it could certainly be very memory-intensive...

Maybe we could also look into changing the exit code for -ENOMEM if no problems have been found - there should be no need for manual intervention if no problems were indicated or found and fsck simply keeled over on its own...

-Eric

Comment 4 Eric Sandeen 2011-10-31 20:55:31 UTC
Mike, if you could provide a zipped-up "e2image -r" output I could look into the memory usage.  No file data, but by default filenames are in the clear.  There is an option to obfuscate filenames as well.  If you don't want to attach it you could shoot it to me in an email.

Offhand it doesn't sound like a bug but it might be worth looking into.

-Eric

Comment 5 Mike Hardy 2011-10-31 23:53:56 UTC
Hi Eric, happy to. From the man pages it appeared that obfuscating it might have the side effect of making analysis unreliable. I don't believe I have anything in there that's more than simply boring w.r.t. filenames, so I just used '-r', but the size of the file is likely to be large. I'll put it in a spot where you can get it (once it's done generating - it's been running 2.5hrs already...) then email you a URL.

I was thinking along the same lines as your comment #3 though - if rc.sysinit can't be altered to give fsck the same virtual memory capacity at boot time that it would have at run time, then from a programming perspective you may know the size limits of filesystems (and thus the maximum RAM any of the fscks would use) and can make them use memory more stingily, but you never know how little memory a system has and you can't control that. There were reports of this happening in the embedded space with e.g. DD-WRT and an external 1TB USB drive (as an extreme version of the problem). This sort of mismatch seems likely to appear more frequently, as data seems to grow faster than RAM capacity as a trend. At least failing nicely in this situation with the ENOMEM error-handling change you mention maintains the principle of least surprise, and OCD folks like myself will know we still have problems via e.g. Nagios check_next_fsck.

My specific filesystem is hosting the rsync point-in-time symbolic-link-forest equivalent of Time Machine backups for a few hosts, so there is a stupendous number of FS entries relative to disk usage. That usage pattern is possibly putting me on the bleeding edge here.

Comment 6 Eric Sandeen 2011-11-01 16:14:49 UTC
If you use tar to compress it, i.e. something like:

e2image -r /dev/hda1 hda1.e2i && tar Sjcvf e2i.tar.bz2 hda1.e2i

then I can re-sparsify it relatively easily on my end.

You are probably right that a massive number of inodes is at the root of your memory usage requirements...

-Eric

Comment 7 Mike Hardy 2011-11-01 16:32:46 UTC
Hmm - the incantation I used was cribbed from the manpage (tailored for my device name and an output spot with enough space):

e2image -r /dev/md4 - | bzip2 > /raid/md4.e2i.bz2

...and it's still running - it had a zero-length bz2 file for 2.5 hours (indicating to me that e2image took that long to do its thing - roughly the time it takes to fsck, incidentally), then bzip2 started using as much CPU as it could, and it's been doing that for the last 12 hours or so while steadily growing the file. The file is currently 87MB and still growing, though I'm obviously unable to determine progress/completion.

Is this incantation one that will allow you to maintain the sparseness? If not, I could see how that's an important enough property on a 4TB filesystem to warrant re-doing it, even with the long turnaround times here.

It also may be that I'm inadvertently asking bzip2 to process a seeming 4TB of data here, which could take longer than re-doing it with tar in the pipeline. I'm not familiar enough with sparse files in practice to know if I'm doing something silly along those lines.

Which I guess boils down to - given my incantation above vs the one you post in Comment #6, would you recommend I restart with yours, or wait mine out?

Thanks Eric
-Mike

Comment 8 Eric Sandeen 2011-11-01 16:55:02 UTC
I can live with your bzip, it'll just take me longer to unpack on my end.

Comment 9 Eric Sandeen 2011-11-01 18:37:05 UTC
I did send a patch upstream to exit with more status even when we abort due to ENOMEM, but I'm not sure that's terribly useful here.

Upstream we have removed the forced fsck intervals altogether, and initscripts would have to parse out the information for it to be useful.

Still, if the patch gets merged, it will return:

FSCK_REBOOT, FSCK_UNCORRECTED, and/or FSCK_NONDESTRUCT in addition to FSCK_ERROR based on what had happened to the filesystem up to that point.
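
For context, fsck's exit status is a bit mask (1 = errors corrected, 2 = reboot needed, 4 = errors left uncorrected, 8 = operational error), so a consumer of the richer status could in principle do something like the following - a sketch only, not actual initscripts code:

fsck -T -a /dev/md4
rc=$?
if [ $((rc & 4)) -ne 0 ]; then
    echo "uncorrected errors found; drop to the repair shell"   # as today
elif [ $((rc & 8)) -ne 0 ]; then
    echo "fsck hit an operational error (e.g. ENOMEM) but reported no corruption; continue boot"
fi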

Comment 10 Mike Hardy 2011-11-02 19:39:28 UTC
Interesting that forced fsck intervals are removed as policy now - good to know. It definitely preserves least surprise on reboots, but it also makes me curious about the capability for online checks vs. monitoring check status and scheduling offline ones. I'll read up on that.

The e2image/bzip2 pipeline I fired up was still chugging along today (day 2...), and I suspected this could take much longer than expected, so I read up on sparseness support in the utilities and did a quick test on a 10GB filesystem: doing the e2image -r and a tar -Sz(etc) step separately took 1min 50sec for the tar/compress, while using bzip2 took 10min 50sec. That's a pretty significant difference, and it means at worst I'm sacrificing 5-6 hours by restarting things and doing it the definitely-sparse way, and I may be saving days of time.

This doesn't necessarily qualify as a bug report against the e2image man page, but I cribbed my command pipeline directly from it, and it appears that it may at worst be wrong w.r.t. sparseness support and is at minimum a much slower-than-optimal way to go about things compared to the method in Comment #6.

So I aborted the current attempt and just restarted the e2image outside of a pipeline (~3hrs) and will run it through tar -Szc(etc) afterwards (ETA unknown).
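
For reference, the two-step, sparse-friendly sequence described here would look something like this (paths as used earlier in this report):

e2image -r /dev/md4 /raid/md4.e2i
tar -Szcf /raid/md4.e2i.tar.gz -C /raid md4.e2i    # -S keeps the image sparse while archiving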

Hopefully a result soon - there definitely appears to be something algorithmically difficult going on with regard to e2fsprogs, and I'll be curious to know whether it's something that can't be optimized away, or whether it can be and/or is worth it...

Comment 11 Eric Sandeen 2011-11-02 19:43:06 UTC
Yes, it'll be up to you to do periodic checks now.  But a good fs should detect metadata corruption at runtime anyway, and not propagate it ... I really don't see the value in boot-time fsck on a journaling filesystem, ever.

A patch has actually been sent for the e2image manpage already, I think.  :)

I'm thinking that the end result of all this is "yes, you have a big, metadata-complex filesystem and yes, you need more memory than you've got to check it."

Comment 12 Mike Hardy 2011-11-06 03:15:57 UTC
Had a second to check in on the e2image -r step run by itself, and it has "lseek: Invalid argument" errors. I agree with your Comment 11 that my original report is likely not a practically fixable problem but in the hope I am perhaps providing value fuzz-busting the associated toolchain, I'm persisting ;-)

This time the problem appears to be that e2image -r can't generate files this large - the resulting file topped out at this byte count:

[root@maximegalon raid]# ls -lat md4.e2i
-rw------- 1 root root 2196875759616 Nov  2 18:00 md4.e2i
[root@maximegalon raid]# du -h md4.e2i
3.3G	md4.e2i

That looks suspiciously close to a 2TB powers-of-two limitation on total file size, even though it's only 3.3GB of actual disk usage. I'm unsure whether it's possible to exceed that on ext3 with sparse files by changing e2image, or whether e2image could at least produce a better error message, but it means running e2image -r as a solo step on a filesystem >= 2TB will likely never work.
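
(For reference, and assuming a 4KiB block size on this filesystem: 2196875759616 bytes is about 1.998TiB, just under 2^41 bytes = 2TiB, which is ext3's per-file size limit at that block size - consistent with the suspicion above.)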

Given that, I wasn't able to quickly come up with a way to send the e2image output through a pipe while still being assured that it would maintain its sparseness.

I checked out the discussion regarding the e2image man page and the example command there so I'm guessing you guys may have thought of this. Anything you'd recommend?

Comment 13 Eric Sandeen 2011-11-08 16:51:22 UTC
Oh... yeah, for a very large fs you may be hitting the max file offset limit on ext3.  Sigh.  ext4 or xfs could host it.

It may be time to let this one go ;)

Comment 14 Mike Hardy 2011-11-08 17:01:43 UTC
Agreed - I'm happy closing this one knowing it's been thought about and shouldn't bite people, at least with future revs. And I've learned a fair bit about how to handle sparseness. Thanks for the support, Eric.

Comment 15 Eric Sandeen 2011-11-08 17:07:37 UTC
Mike, no problem.

Bill, I'll leave it up to you whether you want this one open to consider changing when swap gets activated...

-Eric

Comment 16 Lukáš Nykrýn 2013-03-11 14:14:19 UTC
Based on previous comments -> closed.