Description of problem:
We have a host here with some volumes mapped from an IBM DS4800. About a week ago the host crashed a few times and we had to reset it each time. The host came up again with no problem: the journal was recovered and everything seemed to be fine. However, a few days later (with no further crashes) the host encountered an ext3_abort (see attached file).

I restarted the machine and it started with fsck. After some hours (two or so), fsck had repaired some errors and then started using 100% CPU (according to top), but did no more IO, according to iostat and the fibre channel switch. I tried again (about 3 times), with the same behavior every time: an infinite loop. I waited for about 13 - 14 hours; no IO happened on this volume, but fsck took 100% CPU all the time.

Since a few "infinite loop" bugs were fixed between e2fsprogs 1.39 and 1.41.3, I upgraded e2fsprogs with a package from Fedora (e2fsprogs-1.41.3-2). I restarted fsck on Friday and today I discovered that it finished successfully. From my graphs I can see that fsck did IO the whole time it was running, from about Friday 9:30 until Saturday 8:30.

Version-Release number of selected component (if applicable):
1.39, 1.41.3

How reproducible:
I hope it's not reproducible :-/

Steps to Reproduce:
1. Create a 5 TB LUN on an IBM DS4800
2. Map it to a host
3. Create a Volume
4. Let the machine crash (?)
5. Restart it
6. Wait a few days until it encounters an ext3_abort (unlikely to *really* reproduce)
7. Start fsck with e2fsprogs 1.39
8. Watch it go into an infinite loop
9. Stop fsck
10. Rebuild e2fsprogs 1.41.3 (pkg from Fedora)
11. Restart fsck from e2fsprogs 1.41.3 and wait 23 hours
12. See that it works :-)

Actual results:
fsck from RHEL5(.2) goes into an infinite loop when checking a 5 TB volume.

Expected results:
e2fsprogs from RHEL5(.2) should JustWork (tm).

Additional info:
I'm afraid I don't have the output of fsck from 1.39, nor from 1.41.3.
Since this is a data pre-processing machine in *production*, I had to make it work ASAP and therefore forgot to save all the relevant logs. I think the easiest solution would be to upgrade e2fsprogs in RHEL5, but I'm not sure whether this is possible for you. Please also note that I have a few more machines with RHEL5 and lots of TB of disk space, so if it happens again, I already know the solution for myself. I don't know how many other RHEL users have volumes (with ext3) as large as mine. Also, if you want to suggest a different filesystem, please don't hesitate to do so.

A few command line outputs:

[root@aprocp01 log]# df -i /data01
Filesystem                       Inodes     IUsed     IFree IUse% Mounted on
/dev/mapper/VolGroup01-data01 671088640 124733964 546354676   19% /data01

[root@aprocp01 log]# df -h /data01
Filesystem                     Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup01-data01  5.0T  1.4T  2.9T  33% /data01

If you need additional information, let me know. For us, large volumes are 'mission critical'!
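The symptom described in this report (fsck at 100% CPU with no IO) can also be checked per process instead of from top and iostat. The following is a minimal sketch, assuming a Linux host where /proc/<pid>/stat and /proc/<pid>/io are readable (the latter normally needs root for another user's process). The names `Sample`, `read_sample`, `looks_stuck` and `watch`, the 60-second interval and the 0.9 threshold are illustrative assumptions, not part of e2fsprogs.

```python
import os
import time
from typing import NamedTuple


class Sample(NamedTuple):
    cpu_ticks: int  # utime + stime from /proc/<pid>/stat, in clock ticks
    io_bytes: int   # read_bytes + write_bytes from /proc/<pid>/io


def read_sample(pid: int) -> Sample:
    """Read the CPU and storage IO counters of one process (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        # The command name sits in parentheses and may contain spaces,
        # so only parse what follows the last ')'.
        fields = f.read().rsplit(")", 1)[1].split()
    # fields[0] is field 3 (state); utime and stime are fields 14 and 15.
    cpu_ticks = int(fields[11]) + int(fields[12])

    io = {}
    with open(f"/proc/{pid}/io") as f:
        for line in f:
            key, _, value = line.partition(":")
            io[key.strip()] = int(value)
    return Sample(cpu_ticks, io["read_bytes"] + io["write_bytes"])


def looks_stuck(before: Sample, after: Sample, interval_s: float,
                ticks_per_s: int, busy_fraction: float = 0.9) -> bool:
    """True if the process was busy on CPU for most of the interval
    while its storage IO counters did not move at all."""
    cpu_s = (after.cpu_ticks - before.cpu_ticks) / ticks_per_s
    no_io = after.io_bytes == before.io_bytes
    return no_io and cpu_s >= busy_fraction * interval_s


def watch(pid: int, interval_s: float = 60.0) -> bool:
    """Take two samples interval_s apart and classify the process."""
    ticks_per_s = os.sysconf("SC_CLK_TCK")
    before = read_sample(pid)
    time.sleep(interval_s)
    after = read_sample(pid)
    return looks_stuck(before, after, interval_s, ticks_per_s)
```

Calling `watch(pid)` with the PID of the running fsck returns True when the process looks CPU-bound with no storage IO. A single True result is only a hint: a checker can legitimately compute for a while on data it has already read, so the observation is only meaningful when it repeats over hours, as it did here.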
Thanks for the report; I'll have to go looking for these "infinite loop" fixes, I haven't seen them reported against RHEL5 before. Simply updating e2fsprogs to the bleeding-edge upstream version isn't really a viable option for RHEL, though, I'm afraid. -Eric
Eric, I know it's not a viable option for RHEL. And I know we don't have a problem with Fedora, since there we already have an updated version. And Eric, if you want me to bug RH via Support Services, I can do that as well :-)
Oliver, do you have any idea at least which phase of fsck encountered this loop?
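If the fsck output is saved on a future run (for example by piping it through tee), the last pass header in the log answers this question. The following is a minimal sketch, assuming the log is plain e2fsck output with its usual "Pass N: ..." header lines. The function name `last_pass` and the sample text are illustrative and not taken from this report.

```python
import re
from typing import Optional, Tuple

# e2fsck announces each phase with a header such as
# "Pass 1: Checking inodes, blocks, and sizes" or "Pass 1B: ...".
PASS_HEADER = re.compile(r"^Pass (\d+[A-Z]?): (.*)$")


def last_pass(log_text: str) -> Optional[Tuple[str, str]]:
    """Return (pass id, description) of the last pass header in the log,
    or None if the log contains no pass header."""
    last = None
    for line in log_text.splitlines():
        match = PASS_HEADER.match(line.strip())
        if match:
            last = (match.group(1), match.group(2))
    return last


# Hypothetical log excerpt, for illustration only.
example_log = """\
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
"""
example_result = last_pass(example_log)  # ("2", "Checking directory structure")
```

The last header only shows which pass was running when the output stopped; it does not by itself prove that the loop is in that pass.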
Oliver, I know this bug has languished a while, but: if you see this again, and can provide an e2image (compressed) of the problematic filesystem, it would greatly help to track this down. Thanks, -Eric
Eric, it actually didn't happen again and I hope it will stay that stable. I don't know in which routine/phase it looped... So, how shall we proceed? Close with postponed and reopen if it happens again? For now, I think it might be worth an entry in the knowledge base, so that support knows about the problem...
Ok, I'll close INSUFFICIENT_DATA for now; as for support knowing about it, I'd suggest that you contact them directly I guess... Being an engineer I don't know the ins & outs of support knowledge bases. :) I've not seen any other reports of this, though, myself.
I guess support will query bz if they encounter such a problem... I thought it would be easy for you to connect with them.... Thanks anyway!