Red Hat Bugzilla – Bug 467677
Old fsck (e2fsprogs 1.39) is unable to check a 5 TB volume
Last modified: 2009-07-31 14:49:03 EDT
Description of problem:
We have a host here with some volumes mapped from an IBM DS4800. About a week ago the host crashed a few times, and each time we had to reset it.
The host came up again with no problem. The journal was recovered and everything seemed to be fine.
However, a few days later (without further crashes) the host encountered an ext3_abort (see attached file).
I restarted the machine and it started with fsck. After some hours (two or so), fsck repaired some errors and then started using 100% CPU (according to top), but did no more IO, according to iostat and the fibre channel switch.
I tried it again (about 3 times), but the behavior was the same every time: an infinite loop. I waited for about 13-14 hours and no IO happened on this volume, while fsck used 100% CPU the whole time.
Since there are a few "infinite loop" bugs fixed between e2fsprogs 1.39 and 1.41.3, I upgraded e2fsprogs with a package from Fedora (e2fsprogs-1.41.3-2). I restarted fsck on Friday and today I discovered that it had finished successfully.
From my graphs I can see that, while this fsck was running, it was always doing IO, from about Friday 9:30 until Saturday 8:30.
Version-Release number of selected component (if applicable): 1.39, 1.41.3.
I hope it's not reproducible :-/
Steps to Reproduce:
1. Create a 5 TB LUN on an IBM DS4800
2. Map it to a host
3. Create a Volume
4. Let the machine crash (?)
5. Restart it
6. Wait a few days until it encounters an ext3_abort (unlikely to *really* reproduce)
7. Start fsck with e2fsprogs 1.39
8. Watch it go into an infinite loop
9. Stop fsck
10. Rebuild e2fsprogs 1.41.3 (package from Fedora)
11. Restart fsck from e2fsprogs 1.41.3 and wait 23 hours
12. See that it works :-)
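Assuming e2fsprogs is installed, the filesystem-level part of the steps above (create the volume, then force a check) can be sketched on a small sparse image file standing in for the 5 TB DS4800 LUN; file names here are illustrative:

```shell
# Scaled-down sketch of the fsck steps above: a sparse file stands in
# for the real LUN (assumes mke2fs/e2fsck from e2fsprogs are available).
IMG=$(mktemp /tmp/fsck-demo.XXXXXX.img)
truncate -s 512M "$IMG"        # sparse: allocates almost no real disk
mke2fs -q -j -F "$IMG"         # step 3: create the volume (-j => ext3 journal)
e2fsck -f -n "$IMG"            # steps 7/11: forced, read-only full check
STATUS=$?
echo "e2fsck exit status: $STATUS"
rm -f "$IMG"
```

On the real machine the same check would run against the block device for the LUN; with 1.39 this is where the loop appeared.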
Actual results:
fsck from RHEL5(.2) goes into an infinite loop when checking a 5 TB volume.
Expected results:
e2fsprogs from RHEL5(.2) should JustWork (tm).
I'm afraid I don't have the output of fsck from 1.39 or 1.41.3. Since this is a data pre-processing machine in *production*, I had to get it working ASAP and therefore forgot to save all the relevant logs.
I think the easiest solution would be to upgrade e2fsprogs in RHEL5, but I'm not sure if this is possible for you.
Please also note that I have a few more machines with RHEL5 and many TB of disk space. So if it happens again, I already know the solution. I don't know how many other RHEL users actually have volumes (with ext3) as large as mine.
Also, if you want to suggest a different filesystem, please don't hesitate to do so.
A few command line outputs:
[root@aprocp01 log]# df -i /data01
Filesystem      Inodes      IUsed       IFree       IUse%  Mounted on
                671088640   124733964   546354676   19%    /data01
[root@aprocp01 log]# df -h /data01
Filesystem      Size   Used   Avail  Use%  Mounted on
                5.0T   1.4T   2.9T   33%   /data01
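As a sanity check, the figures in the df output above are internally consistent (all numbers are taken from the report; the 8192-byte inode ratio is mke2fs's documented default, assumed here):

```shell
# mke2fs's default is one inode per 8192 bytes, so a 5 TiB filesystem
# gets 5 * 2^40 / 8192 inodes -- matching the Inodes column above.
INODES=$(( 5 * 1024 * 1024 * 1024 * 1024 / 8192 ))
echo "$INODES"                        # 671088640

# IUsed + IFree adds back up to the inode total:
echo $(( 124733964 + 546354676 ))     # 671088640

# IUse% is IUsed / Inodes, rounded up by df:
IUSE=$(awk 'BEGIN { r = 124733964 / 671088640 * 100; print int(r) + (r > int(r)) }')
echo "$IUSE%"                         # 19%
```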
If you need additional information, let me know. For me/us large volumes are 'mission critical'!
Thanks for the report; I'll have to go looking for these "infinite loop" fixes, as I haven't seen them reported against RHEL5 before. Simply updating e2fsprogs to the bleeding-edge upstream version isn't really a viable option for RHEL, though, I'm afraid.
Eric, I know it's not a viable option for RHEL. And I know we don't have a problem with Fedora, since there we already have an updated version.
And Eric, if you want me to bug RH via Support Services, I can do that as well :-)
Oliver, do you have any idea at least which phase of fsck encountered this loop?
Oliver, I know this bug has languished a while, but: if you see this again, and can provide an e2image (compressed) of the problematic filesystem, it would greatly help to track this down.
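For reference, such an image could be produced roughly like this (a sketch assuming e2fsprogs is installed; /dev/sdX is an illustrative stand-in for the problematic volume, and a small scratch image is used below so the commands are runnable):

```shell
# On the real machine (illustrative device name):
#   e2image /dev/sdX - | gzip > data01.e2i.gz
# e2image saves only filesystem metadata, so the result compresses well.
# Demo on a small scratch filesystem image:
IMG=$(mktemp /tmp/e2image-demo.XXXXXX.img)
truncate -s 64M "$IMG"
mke2fs -q -j -F "$IMG"
e2image "$IMG" - | gzip > "$IMG.e2i.gz"
SIZE=$(stat -c %s "$IMG.e2i.gz")
echo "compressed image: $SIZE bytes"
rm -f "$IMG" "$IMG.e2i.gz"
```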
Eric, it actually didn't happen again and I hope it will stay that stable.
I don't know in which routine/phase it looped...
So, how shall we proceed? Close as POSTPONED and reopen if it happens again?
For now, I think it might be worth an entry in the knowledge base, so that support knows about the problem...
Ok, I'll close as INSUFFICIENT_DATA for now. As for support knowing about it, I'd suggest that you contact them directly, I guess... Being an engineer, I don't know the ins & outs of support knowledge bases. :) I've not seen any other reports of this myself, though.
I guess support will query Bugzilla if they encounter such a problem... I thought it would be easy for you to connect the two...