Bug 467677 - Old fsck (e2fsprogs 1.39) is unable to check 5 TB volume
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: e2fsprogs
Version: 5.2
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Assignee: Eric Sandeen
QA Contact: BaseOS QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2008-10-20 07:21 UTC by Oliver Falk
Modified: 2009-07-31 18:49 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-07-31 17:38:02 UTC
Target Upstream Version:
Embargoed:



Description Oliver Falk 2008-10-20 07:21:31 UTC
Description of problem:
We have a host here with several volumes mapped from an IBM DS4800. About a week ago the host crashed a few times, and we had to reset it each time.

The host came up again with no problem. The journal was recovered and everything seemed to be fine.

However, a few days later (without any further crash), the host encountered an ext3_abort (see attached file).

I restarted the machine, and it started with fsck. After about two hours, fsck repaired some errors and then started using 100% CPU (according to top) but did no more I/O, according to iostat and the Fibre Channel switch.

I tried again (about three times), with the same behavior every time: an infinite loop. I waited about 13 to 14 hours, and no I/O happened on this volume, while fsck used 100% CPU the whole time.
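
One way to put numbers behind the "100% CPU, no I/O" observation is to sample the per-process I/O counters twice and compare them. The following is a minimal sketch, assuming a kernel that exposes /proc/<pid>/io (the stock RHEL 5 kernel may not); the helper names sum_io_bytes and io_delta are made up for illustration.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, and /proc/<pid>/io requires a
# kernel built with per-task I/O accounting.

# Read the contents of /proc/<pid>/io on stdin and print read_bytes + write_bytes.
sum_io_bytes() {
    awk '/^(read_bytes|write_bytes):/ { s += $2 } END { printf "%.0f\n", s }'
}

# Print how many bytes process $1 read or wrote during an interval of $2 seconds.
io_delta() {
    before=$(sum_io_bytes < "/proc/$1/io") || return 1
    sleep "$2"
    after=$(sum_io_bytes < "/proc/$1/io") || return 1
    echo $((after - before))
}

# Example (not run here), with <pid> replaced by the PID of the running fsck:
#   io_delta <pid> 60
```

A result of 0 over a long interval, while top still shows the process at 100% CPU, would be consistent with the loop described above.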

Since a few "infinite loop" bugs were fixed between e2fsprogs 1.39 and 1.41.3, I upgraded e2fsprogs with a package from Fedora (e2fsprogs-1.41.3-2). I restarted fsck on Friday, and today I discovered that it had finished successfully.

From my graphs I can see that fsck did I/O the whole time it was running, from about Friday 9:30 until Saturday 8:30.

Version-Release number of selected component (if applicable): 1.39, 1.41.3.

How reproducible:
I hope it's not reproducible :-/

Steps to Reproduce:
1. Create a 5 TB LUN on an IBM DS4800
2. Map it to a host
3. Create a Volume
4. Let the machine crash (?)
5. Restart it
6. Wait a few days until it encounters an ext3_abort (unlikely to *really* reproduce)
7. Start fsck with e2fsprogs 1.39
8. Watch it go into an infinite loop
9. Stop fsck
10. Rebuild e2fsprogs 1.41.3 (pkg from Fedora)
11. Restart fsck from e2fsprogs 1.41.3 and wait 23 hours
12. See that it works :-)
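
For steps 7 to 11 it may help to record which e2fsck is actually installed before starting a long check. The following is a minimal sketch; version_lt is a hypothetical helper, and 1.41.3 is only the version that happened to work in this report, not a known fix point.

```shell
#!/bin/sh
# Sketch only: version_lt is a hypothetical helper name.

# Succeed (exit 0) if dotted version $1 is strictly older than $2,
# comparing numerically field by field.
version_lt() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = split(a, x, ".")
        m = split(b, y, ".")
        max = (n > m) ? n : m
        for (i = 1; i <= max; i++) {
            xi = (i <= n) ? x[i] + 0 : 0
            yi = (i <= m) ? y[i] + 0 : 0
            if (xi < yi) exit 0
            if (xi > yi) exit 1
        }
        exit 1   # versions are equal
    }'
}

# Example (not run here): e2fsck -V prints a banner whose first line has the
# form "e2fsck <version> (<date>)".
#   installed=$(e2fsck -V 2>&1 | awk 'NR == 1 { print $2 }')
#   version_lt "$installed" 1.41.3 && echo "e2fsck $installed predates 1.41.3"
```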
  
Actual results:
fsck from RHEL5(.2) enters an infinite loop when checking a 5 TB volume.

Expected results:
e2fsprogs from RHEL5(.2) should JustWork (tm).

Additional info:
I'm afraid I don't have the fsck output from either 1.39 or 1.41.3. Since this is a production data pre-processing machine, I had to get it working ASAP and therefore forgot to save the relevant logs.
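
For a future occurrence, the fsck output could be kept by running the check through a small wrapper that copies everything to a log file. This is a sketch under assumptions: run_logged is a hypothetical helper, and the log path in the example is made up (it should live on a filesystem other than the one being checked).

```shell
#!/bin/sh
# Sketch only: run_logged is a hypothetical helper name.

# Run a command, showing its output on the terminal while also appending
# stdout and stderr to the log file given as the first argument.
run_logged() {
    log=$1
    shift
    {
        echo "=== $(date) : $*"
        "$@" 2>&1
    } | tee -a "$log"
}

# Example (not run here): a forced, read-only check of the volume from this
# report; -f forces the check and -n answers "no" to every question.
#   run_logged /root/fsck-data01.log e2fsck -f -n /dev/mapper/VolGroup01-data01
```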

I think the easiest solution would be to upgrade e2fsprogs in RHEL5, but I'm not sure if this is possible for you.

Please also note that I have a few more machines with RHEL5 and many TB of disk space, so if it happens to me again, I already know the solution. I don't know how many other RHEL users have ext3 volumes as large as mine.

Also, if you want to suggest a different filesystem, please don't hesitate to do so.



A few command line outputs:
[root@aprocp01 log]# df -i /data01
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/mapper/VolGroup01-data01
                     671088640 124733964 546354676   19% /data01
[root@aprocp01 log]# df -h /data01
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup01-data01
                      5.0T  1.4T  2.9T  33% /data01
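
As a rough cross-check of the figures above, the inode density and inode usage can be recomputed from the df numbers. The sketch assumes the volume is exactly 5 TiB, which df -h only shows rounded to one decimal place.

```shell
#!/bin/sh
# Figures copied from the df output above; the exact 5 TiB size is an
# assumption, since df -h rounds to one decimal place.
inodes=671088640
iused=124733964
fs_bytes=$((5 * 1024 * 1024 * 1024 * 1024))

# Bytes of filesystem per inode.
bytes_per_inode=$((fs_bytes / inodes))

# Inode usage in percent, rounded up.
iuse_pct=$(( (iused * 100 + inodes - 1) / inodes ))

echo "bytes per inode: $bytes_per_inode"
echo "inode usage:     $iuse_pct%"
```

Under that assumption the filesystem has one inode per 8192 bytes, and fsck has roughly 125 million in-use inodes to examine.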

If you need additional information, let me know. For me/us large volumes are 'mission critical'!

Comment 1 Eric Sandeen 2008-11-06 20:01:21 UTC
Thanks for the report; I'll have to go looking for these "infinite loop" fixes, I haven't seen them reported against RHEL5 before.  Simply updating e2fsprogs to the bleeding-edge upstream version isn't really a viable option for RHEL, though, I'm afraid.

-Eric

Comment 2 Oliver Falk 2008-11-07 21:59:20 UTC
Eric, I know it's not a viable option for RHEL. And I know we don't have a problem with Fedora, since there we already have an updated version.
And Eric, if you want me to bug RH via Support Services, I can do that as well :-)

Comment 3 Eric Sandeen 2009-03-09 19:47:08 UTC
Oliver, do you have any idea at least which phase of fsck encountered this loop?
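
If the fsck output had been saved, the phase could be read from the last "Pass N:" header that e2fsck printed before it stopped making progress. A minimal sketch, with last_fsck_pass as a hypothetical helper name and the log path in the example made up:

```shell
#!/bin/sh
# Sketch only: last_fsck_pass is a hypothetical helper name.

# Read saved e2fsck output on stdin and print the last "Pass N: ..." header,
# for example "Pass 2: Checking directory structure".
last_fsck_pass() {
    grep '^Pass [0-9]' | tail -n 1
}

# Example (not run here):
#   last_fsck_pass < /root/fsck-data01.log
```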

Comment 4 Eric Sandeen 2009-07-30 20:32:36 UTC
Oliver, I know this bug has languished a while, but if you see this again and can provide a compressed e2image of the problematic filesystem, it would greatly help to track this down.
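
A compressed image of this kind could be produced roughly as follows. This is a sketch, not a tested procedure for a 5 TB volume: save_metadata_image is a hypothetical helper, the output path in the example is made up, and the pipeline follows the raw-image example in the e2image man page. The image holds filesystem metadata (including file names) rather than file contents.

```shell
#!/bin/sh
# Sketch only: save_metadata_image is a hypothetical helper name.

# Write a bzip2-compressed raw metadata image of device $1 to file $2.
save_metadata_image() {
    dev=$1
    out=$2
    if [ ! -e "$dev" ]; then
        echo "no such device: $dev" >&2
        return 2
    fi
    # -r writes a raw image; "-" sends it to stdout so it can be compressed.
    e2image -r "$dev" - | bzip2 > "$out"
}

# Example (not run here); the output file should be on a different filesystem:
#   save_metadata_image /dev/mapper/VolGroup01-data01 /tmp/data01.e2i.bz2
```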

Thanks,
-Eric

Comment 5 Oliver Falk 2009-07-31 17:13:18 UTC
Eric, it actually didn't happen again, and I hope it will stay that stable.
I don't know in which routine/phase it looped...
So, how shall we proceed? Close as postponed and reopen if it happens again?
For now, I think it might be worth an entry in the knowledge base, so that support knows about the problem...

Comment 6 Eric Sandeen 2009-07-31 17:38:02 UTC
Ok, I'll close INSUFFICIENT_DATA for now; as for support knowing about it, I'd suggest that you contact them directly I guess... Being an engineer I don't know the ins & outs of support knowledge bases. :)   I've not seen any other reports of this, though, myself.

Comment 7 Oliver Falk 2009-07-31 18:49:03 UTC
I guess support will query bz if they encounter such a problem... I thought it would be easy for you to connect with them...
Thanks, however!

