Bug 467677 - Old fsck (e2fsprogs 1.39) is unable to check a 5 TB volume
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: e2fsprogs
Hardware: x86_64 Linux
Priority: high
Severity: high
Target Milestone: rc
Assigned To: Eric Sandeen
Reported: 2008-10-20 03:21 EDT by Oliver Falk
Modified: 2009-07-31 14:49 EDT (History)
Doc Type: Bug Fix
Last Closed: 2009-07-31 13:38:02 EDT
Attachments: None

Description Oliver Falk 2008-10-20 03:21:31 EDT
Description of problem:
We have a host here with some volumes mapped from an IBM DS4800. About a week ago the host crashed a few times, and each time we had to reset it.

The host came up again with no problem. The journal was recovered and everything seemed to be fine.

However, a few days later (with no further crashes) the host encountered an ext3_abort (see attached file).

I restarted the machine and it started with fsck. After some hours (two or so), fsck repaired some errors and then started using 100% CPU (according to top), but did no more I/O, according to iostat and the fibre channel switch.

I tried again (about 3 times), with the same behavior every time: an infinite loop. I waited for about 13 - 14 hours; no I/O happened on this volume, but fsck used 100% CPU the whole time.
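
For anyone who wants to check for the same symptom (a process pinned at 100% CPU while doing no I/O), here is a minimal sketch. It assumes a Linux /proc with /proc/<pid>/stat and /proc/<pid>/io; the latter needs per-task I/O accounting in the kernel and may be missing on older kernels. The function names and the 90% threshold are illustrative only and are not part of e2fsprogs. Zero I/O over a long interval is a hint, not proof, of an infinite loop.

```python
import os
import time


def parse_cpu_ticks(stat_text):
    """Return utime + stime (in clock ticks) from the text of /proc/<pid>/stat."""
    # The command name (field 2) is wrapped in parentheses and may contain
    # spaces, so split on the last ')' and count fields from there.
    fields = stat_text.rsplit(")", 1)[1].split()
    # fields[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(fields[11]) + int(fields[12])


def parse_io_bytes(io_text):
    """Return read_bytes + write_bytes from the text of /proc/<pid>/io."""
    values = {}
    for line in io_text.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        values[key.strip()] = int(value)
    return values["read_bytes"] + values["write_bytes"]


def looks_stuck(cpu_ticks_delta, io_bytes_delta, interval_s, ticks_per_s,
                cpu_threshold=0.9):
    """True if the process used most of one CPU but moved no data to or from disk."""
    cpu_fraction = cpu_ticks_delta / (ticks_per_s * interval_s)
    return cpu_fraction >= cpu_threshold and io_bytes_delta == 0


def sample(pid, interval_s=60):
    """Sample a live process twice and report whether it looks stuck."""
    def read(name):
        with open(f"/proc/{pid}/{name}") as f:
            return f.read()

    cpu0, io0 = parse_cpu_ticks(read("stat")), parse_io_bytes(read("io"))
    time.sleep(interval_s)
    cpu1, io1 = parse_cpu_ticks(read("stat")), parse_io_bytes(read("io"))
    return looks_stuck(cpu1 - cpu0, io1 - io0, interval_s,
                       os.sysconf("SC_CLK_TCK"))
```

Calling sample() with the PID of the running fsck (as root, since /proc/<pid>/io is restricted) returns True when the process burned CPU for the whole interval without reading or writing any bytes.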

Since a few "infinite loop" bugs were fixed between e2fsprogs 1.39 and 1.41.3, I upgraded e2fsprogs with a package from Fedora (e2fsprogs-1.41.3-2). I restarted fsck on Friday and today I discovered that it finished successfully.

From my graphs I can see that this time fsck did I/O the whole time it was running, from about Friday 9:30 until Saturday 8:30.

Version-Release number of selected component (if applicable): 1.39, 1.41.3.

How reproducible:
I hope it's not reproducible :-/

Steps to Reproduce:
1. Create a 5 TB LUN on an IBM DS4800
2. Map it to a host
3. Create a Volume
4. Let the machine crash (?)
5. Restart it
6. Wait a few days until it encounters an ext3_abort (unlikely to *really* reproduce)
7. Start fsck with e2fsprogs 1.39
8. Watch it go into an infinite loop
9. Stop fsck
10. Rebuild e2fsprogs 1.41.3 (pkg from Fedora)
11. Restart fsck from e2fsprogs 1.41.3 and wait 23 hours
12. See that it works :-)

Actual results:
fsck from RHEL5(.2) goes into an infinite loop when checking a 5 TB volume.

Expected results:
e2fsprogs from RHEL5(.2) should JustWork (tm).

Additional info:
I'm afraid I don't have the fsck output from either 1.39 or 1.41.3. Since this is a data pre-processing machine in *production*, I had to make it work ASAP and therefore forgot to save the relevant logs.

I think the easiest solution would be to upgrade e2fsprogs in RHEL5, but I'm not sure if this is possible for you.

Please also note that I have a few more machines with RHEL5 and many TB of disk space. So if it happens again, I already know the solution. I don't know how many other RHEL users have ext3 volumes as large as mine.

Also, if you want to suggest a different filesystem, please don't hesitate to do so.

A few command line outputs:
[root@aprocp01 log]# df -i /data01
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
                     671088640 124733964 546354676   19% /data01
[root@aprocp01 log]# df -h /data01
Filesystem            Size  Used Avail Use% Mounted on
                      5.0T  1.4T  2.9T  33% /data01
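
As a rough sanity check on the figures above, the sketch below derives the bytes-per-inode ratio and, under an assumed geometry, the number of block groups e2fsck has to walk. It assumes that the 5.0T shown by df is exactly 5 TiB and that the filesystem uses 4 KiB blocks with 32768 blocks per group; these are common ext3 values but are not stated in this report.

```python
TIB = 2**40

# Figures copied from the "df -i" output above.
inodes_total = 671088640
inodes_used = 124733964
inodes_free = 546354676

# Assumptions (not stated in the report): the volume is exactly 5 TiB,
# with 4 KiB blocks and 32768 blocks per block group.
volume_bytes = 5 * TIB
block_size = 4096
blocks_per_group = 32768

bytes_per_inode = volume_bytes // inodes_total
inode_use_pct = 100 * inodes_used / inodes_total
block_groups = volume_bytes // block_size // blocks_per_group
inodes_per_group = inodes_total // block_groups

print(f"bytes per inode:   {bytes_per_inode}")
print(f"inode usage:       {inode_use_pct:.1f}%")
print(f"block groups:      {block_groups}")
print(f"inodes per group:  {inodes_per_group}")
```

Under those assumptions this works out to 8192 bytes per inode and 40960 block groups of 16384 inodes each. The inode usage comes to a little under 19%, in line with the 19% that df reports.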

If you need additional information, let me know. For me/us large volumes are 'mission critical'!
Comment 1 Eric Sandeen 2008-11-06 15:01:21 EST
Thanks for the report; I'll have to go looking for these "infinite loop" fixes, I haven't seen them reported against RHEL5 before.  Simply updating e2fsprogs to the bleeding-edge upstream version isn't really a viable option for RHEL, though, I'm afraid.

Comment 2 Oliver Falk 2008-11-07 16:59:20 EST
Eric, I know it's not a viable option for RHEL. And I know we don't have a problem with Fedora, since there we already have an updated version.
And Eric, if you want me to bug RH via Support Services, I can do that as well :-)
Comment 3 Eric Sandeen 2009-03-09 15:47:08 EDT
Oliver, do you have any idea at least which phase of fsck encountered this loop?
Comment 4 Eric Sandeen 2009-07-30 16:32:36 EDT
Oliver, I know this bug has languished a while, but: if you see this again, and can provide an e2image (compressed) of the problematic filesystem, it would greatly help to track this down.
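
For reference, a compressed image of the kind requested here can be produced by piping e2image's raw output through a compressor. The sketch below is one way to do that, not a prescribed procedure: the device path and output file name are placeholders rather than values from this report, the helper names are hypothetical, and bzip2 is an assumed choice of compressor. It needs root and is best run while the filesystem is not mounted read-write.

```python
import subprocess


def e2image_argv(device):
    """Argument list for e2image in raw mode, writing the image to stdout."""
    # "-r" selects a raw image; "-" as the file name means standard output.
    return ["e2image", "-r", device, "-"]


def write_compressed_e2image(device, output_path):
    """Pipe `e2image -r <device> -` through bzip2 into output_path."""
    with open(output_path, "wb") as out:
        dump = subprocess.Popen(e2image_argv(device), stdout=subprocess.PIPE)
        pack = subprocess.Popen(["bzip2", "-c"], stdin=dump.stdout, stdout=out)
        # Close our copy of the pipe so e2image gets SIGPIPE if bzip2 exits early.
        dump.stdout.close()
        pack_rc = pack.wait()
        dump_rc = dump.wait()
    if dump_rc != 0 or pack_rc != 0:
        raise RuntimeError(
            f"e2image exited with {dump_rc}, bzip2 exited with {pack_rc}")


# Example with placeholder names (not from this report):
# write_compressed_e2image("/dev/example_lun", "data01.e2i.bz2")
```

The image holds filesystem metadata rather than file contents, but with more than a hundred million inodes in use it may still be large.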

Comment 5 Oliver Falk 2009-07-31 13:13:18 EDT
Eric, it hasn't actually happened again, and I hope it stays that stable.
I don't know in which routine/phase it looped...
So, how shall we proceed? Close as postponed and reopen if it happens again?
For now, I think it might be worth an entry in the knowledge base, so that support knows about the problem...
Comment 6 Eric Sandeen 2009-07-31 13:38:02 EDT
Ok, I'll close INSUFFICIENT_DATA for now; as for support knowing about it, I'd suggest that you contact them directly I guess... Being an engineer I don't know the ins & outs of support knowledge bases. :)   I've not seen any other reports of this, though, myself.
Comment 7 Oliver Falk 2009-07-31 14:49:03 EDT
I guess support will query bz if they encounter such a problem... I thought it would be easy for you to get in touch with them....
Thanks, however!