Description of problem: Online resize of an ext3 fs fails. The system keeps working normally for a few minutes, then freezes. It is possible to login on the console or via ssh, but one never gets to the prompt (ctrl-c or ctrl-z doesn't help either). Nothing turns up in the logs. It seems to work sometimes if one increases the size by a small increment at a time. E.g. after extending the SAN LUN, and rescanning the devices, reloading the dm-multipath map etc., then do lvextend -L +10G ..., resize2fs in a loop until the all the new space is in use. Yesterday we had a crash when trying to increase the size of a volume by 50G, previously crashes have occurred trying to resize 500G and 100G, respectively (size of filesystem in question is 2.7T at the moment). Resizing 10-20G at a time has worked so far. Version-Release number of selected component (if applicable): kernel-PAE-2.6.18-128.1.14.el5.i686 e2fsprogs-1.39-20.el5.i386 How reproducible: Not sure, so far it seems to happen with higher probability if one increases the size by a lot at a time. Steps to Reproduce: 1. Increase size of underlying LUN 2. echo 1 > /sys/block/sdX/device/rescan for every device in multipath map 3. Check with sfdisk -s sdX that new size is seen 4. Enter multipath prompt (multipathd -k), del path sdX, add path sdX for every path, then "resize map mapname" 5. Check with multipath -ll that new size is seen 6. pvresize /dev/dm-X 2. lvextend 3. resize2fs => crash Actual results: System hangs. Expected results: Size of filesystem increased without having to unmount it or otherwise disrupt service. Additional info: Hardware is HP BL460, qlogic dual port FC card (qla2xxx driver), HP EVA 6100 storage, dm-multipath for multipathing.
so is it a crash or a hang? If the system is hung, sysrq-t would show us backtraces of all kernel processes and might offer a clue. If it's a hang (panic, bug, etc) please include that output.
So I upgraded the NFS server to RHEL 5.4, and tried again. This time the same happened when trying to expand by only +5G; so much for my theory that this happens only for large expansions. However, before it hung I managed to attach strace on the resize2fs process, and it produced no output at all but it didn't seem to be hanging in a syscall (as one can sometimes see with strace) either. Neither did it consume any cpu, so I'm not sure what it was actually doing. Also, from our statistic graphing (munin) it seems that load average shot up from the usual ~2 to 65 before crashing; however this graphing is at 1 minute intervals so it might have gone much higher still. I suppose this is what you call a crash and not a hang, there is no kernel panic or such. The system just becomes unresponsive and I have to reset it from the remote console. Unfortunately I did this before receiving your message, so using sysrq-t didn't occur to me. I'll keep that in mind for the next time I need to expand some fs on this server. This is a production machine, so I'm not particularly eager to try to make it hang and piss off our users. According to some googling it should even be possible to trigger sysrq via the remote console with some tweaking; the machine doesn't have a physical console at all. I realize it's pretty difficult if not impossible to fix this with no more info; in any case thanks for the quick reply and I'll try to get the sysrq output next time.
Issue Tracker 366564 is on RHEL 4.7, so I don't know if we need to clone this for RHEL 4 or...
I've been able to reproduce this bug. I setup a 1G logical volume, mkfs'd with ext3, mounted it and exported it via NFS. I ran fsstress on the NFS client and filled up the filesystem. I then used lvextend to grow the volume out to 2G and that completed successfully. Then I used resize2fs to grow the filesystem and almost immediately it got stuck. I then tried the same test again without using NFS and running fsstress on the ext3 filesystem and it took a few tries but eventually got stuck too. Here's the output from sysrq-t for the resize2fs process: Nov 18 16:59:31 localhost kernel: resize2fs D ffff81000103e980 0 7145 3014 6274 (NOTLB) Nov 18 16:59:31 localhost kernel: ffff810009e1fc28 0000000000000086 ffff810009e1fc38 0000004400000046 Nov 18 16:59:31 localhost kernel: ffff81000000dc00 0000000000000008 ffff81000e9147e0 ffff8100011e1100 Nov 18 16:59:31 localhost kernel: 00000337b24ba86d 0000000000034e42 ffff81000e9149c8 0000000100000000 Nov 18 16:59:31 localhost kernel: Call Trace: Nov 18 16:59:31 localhost kernel: [<ffffffff88035fc7>] :jbd:journal_cancel_revoke+0x8e/0xb5 Nov 18 16:59:31 localhost kernel: [<ffffffff88032002>] :jbd:start_this_handle+0x2e5/0x36c Nov 18 16:59:31 localhost kernel: [<ffffffff8009db21>] autoremove_wake_function+0x0/0x2e Nov 18 16:59:31 localhost kernel: [<ffffffff880312e8>] :jbd:__journal_file_buffer+0x13e/0x243 Nov 18 16:59:31 localhost kernel: [<ffffffff88058491>] :ext3:extend_or_restart_transaction+0x31/0x6b Nov 18 16:59:31 localhost kernel: [<ffffffff8805942a>] :ext3:ext3_group_add+0x8ac/0x11c3 Nov 18 16:59:31 localhost kernel: [<ffffffff8009db4f>] wake_bit_function+0x0/0x23 Nov 18 16:59:31 localhost kernel: [<ffffffff8805138c>] :ext3:ext3_ioctl+0x4f8/0x588 Nov 18 16:59:31 localhost kernel: [<ffffffff80063097>] thread_return+0x62/0xfe Nov 18 16:59:31 localhost kernel: [<ffffffff80041b62>] do_ioctl+0x55/0x6b Nov 18 16:59:31 localhost kernel: [<ffffffff8002fd1e>] vfs_ioctl+0x248/0x261 Nov 18 16:59:31 localhost kernel: [<ffffffff8004c0a3>] sys_ioctl+0x59/0x78 Nov 18 16:59:31 localhost kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Lachlan, which kernel did you reproduce this on? Want to see just where :jbd:journal_cancel_revoke+0x8e/0xb5 is. Thanks, -Eric
Eric, that was 2.6.18-128.el5 on x86_64.
Hi Devs, I'd need a status update for my customer. Do we have an ETA for a fix? best regards, Hari
I'll raise the priority on this, I have backtraces from Lachlan but I've been pulled into other things lately; I'm getting back to this now. Sorry for the delay. -Eric
(In reply to comment #6) > Issue Tracker 366564 is on RHEL 4.7, so I don't know if we need to clone this > for RHEL 4 or... Same code in both, likely the same root cause of the bug for both ... (In reply to comment #7) > I've been able to reproduce this bug. ... > Nov 18 16:59:31 localhost kernel: resize2fs D ffff81000103e980 0 7145 > 3014 6274 (NOTLB) > Nov 18 16:59:31 localhost kernel: ffff810009e1fc28 0000000000000086 > ffff810009e1fc38 0000004400000046 > Nov 18 16:59:31 localhost kernel: ffff81000000dc00 0000000000000008 > ffff81000e9147e0 ffff8100011e1100 > Nov 18 16:59:31 localhost kernel: 00000337b24ba86d 0000000000034e42 > ffff81000e9149c8 0000000100000000 > Nov 18 16:59:31 localhost kernel: Call Trace: > Nov 18 16:59:31 localhost kernel: [<ffffffff88035fc7>] .... Lachlan, which kernel did you test that on? Thanks, -Eric
argh Lachlan, sorry, you already told me which kernel, missed that reply! Thanks, -Eric
Eric, I have been able to reliably reproduce this issue on 2.6.9-78.0.1 as well. It's the exact same stack trace for the ext2online and resize2fs processes but I doubt that's where the root cause is. There must be a transaction commit somewhere that is not progressing and holding up a lot of other threads stuck in start_this_handle(). I tried this fix on a whim (since it has been included in a later RHEL5 version): http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=9bb91784de6618c955994b2d5be332fb68c87ef1 but it did not help. I'm going to try the latest RHEL4 kernel to see if the problem persists. Lachlan
Just tried 2.6.9-89.17 and it hangs too.
Lachlan, I can reproduce it too though apparently not as reliably as you can. Josef eyeballed this and pretty quickly saw that it's a deadlock around lock_super that gets taken in the orphan_add path. I'm trying a fix now... Thanks, -Eric
Ok, actually this fix should be pretty trivial. Ext4 moved to having separate mutexes for the orphan list and for resizing; doing the same for ext3 will get them out of each others' way. I'll test that... should have something testable shortly if it works (and I expect that it will...) Thanks, -Eric
Created attachment 377586 [details] Proposed patch This fixes it in my testing, by giving resize its own lock rather than overloading the superblock lock. It's a cross-port to ext3 of upstream commit: 32ed5058ce90024efcd811254b4b1de0468099df ext4: Replace lock/unlock_super() with an explicit lock for resizing There were other associated commits (another explicit mutex for the orphan list) which we may want to pull over to ext3 as well, but that's not strictly necessary to fix this. I'll submit these changes upstream soon and see if we can still get the change into RHEL5.5. Thanks and sorry for the delay, -Eric
To test this, I created a script called setup-resize.sh: #!/bin/bash mkdir -p /tmp/mnt umount /tmp/mnt rm -f /tmp/bigfile touch /tmp/bigfile dd if=/dev/zero of=/tmp/bigfile bs=1 seek=17179869184 count=1 mkfs.ext3 -F -b 4096 /tmp/bigfile 262144 mount -o loop /tmp/bigfile /tmp/mnt/ # end of script That makes a 4G sparse file and mkfs's it to only 1G. and then in a shell, did a loop with resizes out to 4G while under fsstress load: # while true; do ./setup-resize.sh ; fsstress -d /tmp/mnt/ -n 10000 -p 16 & sleep 60; echo -n "=========================== >Resizing starting at: "; date; resize2fs /dev/loop0; killall fsstress-ltp; sleep 10; done Eventually I'd get a hanging resize.
Great work Eric, thanks a lot.
I ported this fix to RHEL4 and it's fixed that issue too. I'll clone the BZ and attach the patch soon.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
in kernel-2.6.18-182.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5 Please update the appropriate value in the Verified field (cf_verified) to indicate this fix has been successfully verified. Include a comment with verification details.
can reproduce on 2.6.18-128.el5 use https://bugzilla.redhat.com/show_bug.cgi?id=525100#c21 resize2fs process stuck and the output shown as below: ... Jan 26 00:43:41 dell-pe1650-1 kernel: ======================= Jan 26 00:43:41 dell-pe1650-1 kernel: fsstress D 00000592 2464 9093 9076 9092 (NOTLB) Jan 26 00:43:41 dell-pe1650-1 kernel: c7ce6eb0 00000086 13c6c42b 00000592 00000005 00050000 f07983e4 00000007 Jan 26 00:43:41 dell-pe1650-1 kernel: f7fd9aa0 c21ce000 13c97669 00000592 0002b23e 00000000 f7fd9bac c200cc60 Jan 26 00:43:41 dell-pe1650-1 kernel: 00000004 00000000 00000080 c7ce6ec8 f7676a38 00000002 c0434767 f7676a38 Jan 26 00:43:41 dell-pe1650-1 kernel: Call Trace: Jan 26 00:43:41 dell-pe1650-1 kernel: [<c0434767>] prepare_to_wait+0x24/0x42 Jan 26 00:43:41 dell-pe1650-1 kernel: [<f884cd35>] start_this_handle+0x21b/0x30d [jbd] Jan 26 00:43:41 dell-pe1650-1 kernel: [<c043461f>] autoremove_wake_function+0x0/0x2d Jan 26 00:43:41 dell-pe1650-1 kernel: [<f884ced3>] journal_start+0xac/0xdb [jbd] Jan 26 00:43:41 dell-pe1650-1 kernel: [<f8888ba2>] ext3_symlink+0x6f/0x190 [ext3] Jan 26 00:43:41 dell-pe1650-1 kernel: [<c047e4da>] vfs_symlink+0xb8/0x11c Jan 26 00:43:41 dell-pe1650-1 kernel: [<c0480974>] sys_symlinkat+0x74/0xab Jan 26 00:43:41 dell-pe1650-1 kernel: [<c044a26f>] audit_syscall_entry+0x14b/0x17d Jan 26 00:43:41 dell-pe1650-1 kernel: [<c04809ba>] sys_symlink+0xf/0x13 Jan 26 00:43:41 dell-pe1650-1 kernel: [<c0404f17>] syscall_call+0x7/0xb Jan 26 00:43:41 dell-pe1650-1 kernel: ======================= Jan 26 00:43:41 dell-pe1650-1 kernel: resize2fs D 00000591 2836 9097 1 9076 (NOTLB) Jan 26 00:43:41 dell-pe1650-1 kernel: eead9e08 00000086 ef777fe4 00000591 c0474475 00048202 00000000 00000005 Jan 26 00:43:41 dell-pe1650-1 kernel: f76e8550 c06793c0 ef7c43d0 00000591 0004c3ec 00000000 f76e865c c200cc60 Jan 26 00:43:41 dell-pe1650-1 kernel: d81539c4 00000000 00000000 eead9e20 ffffffff 00000000 00000000 f7676a38 Jan 26 00:43:41 dell-pe1650-1 kernel: Call Trace: Jan 26 00:43:41 dell-pe1650-1 kernel: [<c0474475>] __find_get_block+0x15c/0x166 Jan 26 00:43:41 dell-pe1650-1 kernel: [<f884cd35>] start_this_handle+0x21b/0x30d [jbd] Jan 26 00:43:41 dell-pe1650-1 kernel: [<c043461f>] autoremove_wake_function+0x0/0x2d Jan 26 00:43:41 dell-pe1650-1 kernel: [<f888ce90>] extend_or_restart_transaction+0x2d/0x5d [ext3] Jan 26 00:43:41 dell-pe1650-1 kernel: [<f888dc8e>] ext3_group_add+0x76c/0xf71 [ext3] Jan 26 00:43:41 dell-pe1650-1 kernel: [<f888689a>] ext3_ioctl+0x4a2/0x51c [ext3] Jan 26 00:43:41 dell-pe1650-1 kernel: [<c04821a7>] do_ioctl+0x47/0x5d Jan 26 00:43:41 dell-pe1650-1 kernel: [<c0482407>] vfs_ioctl+0x24a/0x25c Jan 26 00:43:41 dell-pe1650-1 kernel: [<c0482461>] sys_ioctl+0x48/0x5f Jan 26 00:43:41 dell-pe1650-1 kernel: [<c0404f17>] syscall_call+0x7/0xb ...
That is to be expected since the bug isn't fixed until kernel-2.6.18-182.el5.
(In reply to comment #33) > That is to be expected since the bug isn't fixed until kernel-2.6.18-182.el5. yes, you're right, sorry for my uncomplete comments, correct as below: can reproduce on 2.6.18-128.el5 use #c21: resize2fs process stuck and the output shown as below: ... Jan 26 00:43:41 dell-pe1650-1 kernel: ======================= Jan 26 00:43:41 dell-pe1650-1 kernel: fsstress D 00000592 2464 9093 9076 9092 (NOTLB) Jan 26 00:43:41 dell-pe1650-1 kernel: c7ce6eb0 00000086 13c6c42b 00000592 00000005 00050000 f07983e4 00000007 Jan 26 00:43:41 dell-pe1650-1 kernel: f7fd9aa0 c21ce000 13c97669 00000592 0002b23e 00000000 f7fd9bac c200cc60 Jan 26 00:43:41 dell-pe1650-1 kernel: 00000004 00000000 00000080 c7ce6ec8 f7676a38 00000002 c0434767 f7676a38 Jan 26 00:43:41 dell-pe1650-1 kernel: Call Trace: Jan 26 00:43:41 dell-pe1650-1 kernel: [<c0434767>] prepare_to_wait+0x24/0x42 Jan 26 00:43:41 dell-pe1650-1 kernel: [<f884cd35>] start_this_handle+0x21b/0x30d [jbd] Jan 26 00:43:41 dell-pe1650-1 kernel: [<c043461f>] autoremove_wake_function+0x0/0x2d Jan 26 00:43:41 dell-pe1650-1 kernel: [<f884ced3>] journal_start+0xac/0xdb [jbd] Jan 26 00:43:41 dell-pe1650-1 kernel: [<f8888ba2>] ext3_symlink+0x6f/0x190 [ext3] Jan 26 00:43:41 dell-pe1650-1 kernel: [<c047e4da>] vfs_symlink+0xb8/0x11c Jan 26 00:43:41 dell-pe1650-1 kernel: [<c0480974>] sys_symlinkat+0x74/0xab Jan 26 00:43:41 dell-pe1650-1 kernel: [<c044a26f>] audit_syscall_entry+0x14b/0x17d Jan 26 00:43:41 dell-pe1650-1 kernel: [<c04809ba>] sys_symlink+0xf/0x13 Jan 26 00:43:41 dell-pe1650-1 kernel: [<c0404f17>] syscall_call+0x7/0xb Jan 26 00:43:41 dell-pe1650-1 kernel: ======================= Jan 26 00:43:41 dell-pe1650-1 kernel: resize2fs D 00000591 2836 9097 1 9076 (NOTLB) Jan 26 00:43:41 dell-pe1650-1 kernel: eead9e08 00000086 ef777fe4 00000591 c0474475 00048202 00000000 00000005 Jan 26 00:43:41 dell-pe1650-1 kernel: f76e8550 c06793c0 ef7c43d0 00000591 0004c3ec 00000000 f76e865c c200cc60 Jan 26 00:43:41 dell-pe1650-1 kernel: d81539c4 00000000 00000000 eead9e20 ffffffff 00000000 00000000 f7676a38 Jan 26 00:43:41 dell-pe1650-1 kernel: Call Trace: Jan 26 00:43:41 dell-pe1650-1 kernel: [<c0474475>] __find_get_block+0x15c/0x166 Jan 26 00:43:41 dell-pe1650-1 kernel: [<f884cd35>] start_this_handle+0x21b/0x30d [jbd] Jan 26 00:43:41 dell-pe1650-1 kernel: [<c043461f>] autoremove_wake_function+0x0/0x2d Jan 26 00:43:41 dell-pe1650-1 kernel: [<f888ce90>] extend_or_restart_transaction+0x2d/0x5d [ext3] Jan 26 00:43:41 dell-pe1650-1 kernel: [<f888dc8e>] ext3_group_add+0x76c/0xf71 [ext3] Jan 26 00:43:41 dell-pe1650-1 kernel: [<f888689a>] ext3_ioctl+0x4a2/0x51c [ext3] Jan 26 00:43:41 dell-pe1650-1 kernel: [<c04821a7>] do_ioctl+0x47/0x5d Jan 26 00:43:41 dell-pe1650-1 kernel: [<c0482407>] vfs_ioctl+0x24a/0x25c Jan 26 00:43:41 dell-pe1650-1 kernel: [<c0482461>] sys_ioctl+0x48/0x5f Jan 26 00:43:41 dell-pe1650-1 kernel: [<c0404f17>] syscall_call+0x7/0xb ... verified on 2.6.18-185.el5: # uname -a Linux amd-dinar-02.lab.bos.redhat.com 2.6.18-185.el5 #1 SMP Thu Jan 14 16:43:14 EST 2010 i686 athlon i386 GNU/Linux resize2fs process is normal,without hung occur on.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2010-0178.html