Bug 290821
Summary: cmirror write path appears deadlocked after recovery is successful

Product: Red Hat Enterprise Linux 4
Component: kernel
Version: 4.0
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: urgent
Reporter: Corey Marthaler <cmarthal>
Assignee: Jonathan Earl Brassow <jbrassow>
QA Contact: Cluster QE <mspqa-list>
CC: jbaron, lwang, rkenna
Keywords: ZStream
Fixed In Version: RHBA-2007-0991
Doc Type: Bug Fix
Last Closed: 2007-11-21 21:15:23 UTC
Bug Blocks: 358881, 381081, 422551, 430698
Description
Corey Marthaler
2007-09-14 14:24:50 UTC
Created attachment 196081 [details]
stack traces from link-07
Created attachment 196101 [details]
stack traces from link-08
Created attachment 196121 [details]
stack traces from link-02
I appear to have reproduced this bug and this may be related to bz 238629. The test hung in the same spot, during a sync, right after a leg failure:

================================================================================
Iteration 0.7 started at Mon Sep 24 17:45:04 CDT 2007
================================================================================
Scenario: Kill secondary leg of non synced 2 leg mirror(s)

****** Mirror hash info for this scenario ******
* name:    fail_secondary_non_synced_2_legs_1_mirrors
* sync:    0
* mirrors: 1
* disklog: 1
* failpv:  /dev/sde1
* legs:    2
* pvs:     /dev/sda1 /dev/sde1 /dev/sdf1
************************************************

Creating mirror(s) on link-02...
qarsh root@link-02 lvcreate -m 1 -n fail_secondary_non_synced_2_legs_1_mirrors_1 -L 800M helter_skelter /dev/sda1:0-1000 /dev/sde1:0-1000 /dev/sdf1:0-150
mirror(s) currently at 5.50% synced right now
Creating gfs on top of mirror(s) on link-02...
Mounting mirrored gfs filesystems on link-02...
Mounting mirrored gfs filesystems on link-07...
Mounting mirrored gfs filesystems on link-08...
Writing verification files (checkit) to mirror(s) on...
link-02: checkit starting with: CREATE
    Num files: 100
    Random Seed: 12193
    Verify XIOR Stream: /tmp/checkit_fail_secondary_non_synced_2_legs_1_mirrors_1
    Working dir: /mnt/fail_secondary_non_synced_2_legs_1_mirrors_1/checkit
link-07: checkit starting with: CREATE
    Num files: 100
    Random Seed: 11603
    Verify XIOR Stream: /tmp/checkit_fail_secondary_non_synced_2_legs_1_mirrors_1
    Working dir: /mnt/fail_secondary_non_synced_2_legs_1_mirrors_1/checkit
link-08: checkit starting with: CREATE
    Num files: 100
    Random Seed: 12456
    Verify XIOR Stream: /tmp/checkit_fail_secondary_non_synced_2_legs_1_mirrors_1
    Working dir: /mnt/fail_secondary_non_synced_2_legs_1_mirrors_1/checkit

Starting the io load (collie/xdoio) on mirror(s)
Sleeping 15 seconds to get some I/O locks outstanding before the failure
Disabling device sde on link-02
Disabling device sde on link-07
Disabling device sde on link-08
Attempting I/O to cause mirror down conversion(s) on link-02
10+0 records in
10+0 records out

[Here's where the sync hangs]

The reason I think that this is bz 238629 is due to the flood of "conflict" messages:

[...]
Sep 24 17:39:03 link-02 kernel: dm-cmirror: Notifying server(1) of sync change: 62/Q0c5dDtC
Sep 24 17:39:03 link-02 last message repeated 127 times
Sep 24 17:39:03 link-02 kernel: dm-cmirror: Notifying server(1) of sync change: 63/Q0c5dDtC
Sep 24 17:39:03 link-02 last message repeated 46 times
Sep 24 17:39:03 link-02 kernel: dm-cmirror: Notifying server(1) of sync change: 748/Q0c5dDtC
Sep 24 17:39:03 link-02 last message repeated 2 times
Sep 24 17:39:03 link-02 kernel: dm-cmirror: Remote recovery conflict: 0/Q0c5dDtC
Sep 24 17:39:03 link-02 last message repeated 2 times
Sep 24 17:39:03 link-02 kernel: scsi0 (0:5): rejecting I/O to offline device
Sep 24 17:39:04 link-02 kernel: dm-cmirror: Remote recovery conflict: 0/Q0c5dDtC
Sep 24 17:39:04 link-02 last message repeated 2 times
Sep 24 17:39:04 link-02 kernel: dm-cmirror: Notifying server(1) of sync change: 749/Q0c5dDtC
Sep 24 17:39:04 link-02 kernel: dm-cmirror: Remote recovery conflict: 0/Q0c5dDtC
Sep 24 17:39:05 link-02 kernel: dm-cmirror: Remote recovery conflict: 0/Q0c5dDtC
Sep 24 17:39:06 link-02 kernel: dm-cmirror: Remote recovery conflict: 0/Q0c5dDtC
[...]

More debugging from the only node not completely hung...

[root@link-02 tmp]# ./lvm_backtraces.pl
Backtrace for lvs-a-o+devices (489):
#1  0x000000000044fde3 in pool_labeller_create ()
#2  0x00000000004500e3 in pool_labeller_create ()
#3  0x000000000043cea6 in check_lvm1_vg_inactive ()
#4  0x000000000043d0b9 in lock_vol ()
#5  0x000000000041f2ef in process_each_lv ()
#6  0x000000000041ceaa in pvscan ()
#7  0x0000000000414c42 in lvm_run_command ()
#8  0x00000000004155f2 in lvm2_main ()
#9  0x00000031b871c3fb in __libc_start_main () from /lib64/tls/libc.so.6
#10 0x000000000040c84a in _start ()
#11 0x0000007fbffffa38 in ?? ()
#12 0x000000000000001c in ?? ()
#13 0x0000000000000004 in ?? ()
#14 0x0000007fbffffc26 in ?? ()
#15 0x0000007fbffffc2a in ?? ()
#16 0x0000007fbffffc2d in ?? ()
#17 0x0000007fbffffc30 in ?? ()

Sep 25 08:38:36 link-02 kernel: dm-cmirror: Remote recovery conflict: 0/Q0c5dDtC
Sep 25 08:38:36 link-02 last message repeated 3 times
Sep 25 08:38:36 link-02 qarshd[12229]: Nothing to do
Sep 25 08:38:36 link-02 kernel: dm-cmirror: Remote recovery conflict: 0/Q0c5dDtC
Sep 25 08:38:39 link-02 last message repeated 55 times
Sep 25 08:38:39 link-02 qarshd[12199]: Nothing to do
Sep 25 08:38:39 link-02 kernel: dm-cmirror: Remote recovery conflict: 0/Q0c5dDtC
Sep 25 08:38:39 link-02 last message repeated 3 times
Sep 25 08:38:39 link-02 qarshd[12229]: Nothing to do
Sep 25 08:38:39 link-02 kernel: dm-cmirror: Remote recovery conflict: 0/Q0c5dDtC
Sep 25 08:38:41 link-02 last message repeated 39 times
Sep 25 08:38:41 link-02 kernel: dm-cmirror: LOG INFO:
Sep 25 08:38:41 link-02 kernel: dm-cmirror:   uuid: LVM-vEL8KkKGwA4X8JFaykzpGjepOwn908GGm3jPYhF61PD5tuohdgtzNmEIQ0c5dDtC
Sep 25 08:38:41 link-02 kernel: dm-cmirror:   uuid_ref    : 1
Sep 25 08:38:41 link-02 kernel: dm-cmirror:   log type    : disk
Sep 25 08:38:41 link-02 kernel: dm-cmirror:   region_count: 1600
Sep 25 08:38:41 link-02 kernel: dm-cmirror:   sync_count  : 1079
Sep 25 08:38:41 link-02 kernel: dm-cmirror:   sync_search : 0
Sep 25 08:38:41 link-02 kernel: dm-cmirror:   in_sync     : NO
Sep 25 08:38:41 link-02 kernel: dm-cmirror:   suspended   : NO
Sep 25 08:38:41 link-02 kernel: dm-cmirror:   server_id   : [[link-02]]
Sep 25 08:38:41 link-02 kernel: dm-cmirror:   server_valid: YES
Sep 25 08:38:41 link-02 kernel: dm-cmirror: Remote recovery conflict: 0/Q0c5dDtC
Sep 25 08:38:42 link-02 last message repeated 15 times
Sep 25 08:38:42 link-02 qarshd[12199]: Nothing to do
Sep 25 08:38:42 link-02 kernel: dm-cmirror: Remote recovery conflict: 0/Q0c5dDtC
Sep 25 08:38:42 link-02 last message repeated 3 times
Sep 25 08:38:42 link-02 qarshd[12229]: Nothing to do
Sep 25 08:38:42 link-02 kernel: dm-cmirror: Remote recovery conflict: 0/Q0c5dDtC
Sep 25 08:38:45 link-02 last message repeated 55 times
Sep 25 08:38:45 link-02 qarshd[12199]: Nothing to do
Sep 25 08:38:45 link-02 kernel: dm-cmirror: Remote recovery conflict: 0/Q0c5dDtC

DLM lockspace 'clvmd'

Resource 000001002fbcfe88 (parent 0000000000000000). Name (len=64) "vEL8KkKGwA4X8JFaykzpGjepOwn908GG19cPvkWWQOJsav7yEvp20ydUxz2F8SZP"
Local Copy, Master is node [[link-08]]
Granted Queue
000103b6 CR 8657 Master: 00010201
Conversion Queue
Waiting Queue

Resource 0000010032a4d6e8 (parent 0000000000000000). Name (len=16) "V_helter_skelter"
Local Copy, Master is node [[link-07]]
Granted Queue
000103d0 PW 12221 Master: 00010007
Conversion Queue
Waiting Queue

Resource 000001002e7db9d8 (parent 0000000000000000). Name (len=12) "V_VolGroup00"
Master Copy
Granted Queue
000100a9 PR 490
Conversion Queue
Waiting Queue

This was with the latest code:
2.6.9-59.ELsmp
cmirror-kernel-smp-2.6.9-36.2

It's a good bet this was an earlier screw-up. We can't allow writes to regions that have not been recovered yet; we see whether they have been recovered by looking at the 'sync_bits'. In some device failure cases, regions must be marked 'out-of-sync'. This was causing a following write to block because it thought the region had not yet been recovered, when in fact it had just been put out-of-sync due to the failing device.

new -> post

This bug still appears to very much exist. The sync hung once again right after mirror leg failure to three mirrors, all having three legs. I appear to see this more often with the multiple mirror scenarios.
There is a constant stream of the following messages:

dm-cmirror: Priority recovery region: 18446744073709551615/Pn10cLPV

The stack trace for the sync is exactly the same:

Oct 12 11:42:31 link-02 kernel: sync          R running task       0  7539  7538                     (NOTLB)
Oct 12 11:42:31 link-02 kernel: kmirrord      S 000001003e8baf48     0  7540     6          7558  7355 (L-TLB)
Oct 12 11:42:31 link-02 kernel: 0000010022de7e68 0000000000000046 000001002e327260 0000010000000073
Oct 12 11:42:31 link-02 kernel: 0000010022de7e68 000000003e8bae18 000001000101ba80 000000013e8bae00
Oct 12 11:42:31 link-02 kernel: 0000010027b007f0 0000000000000840
Oct 12 11:42:31 link-02 kernel: Call Trace:<ffffffffa00cd6a2>{:dm_mirror:do_work+0} <ffffffff80148119>{worker_thread+226}
Oct 12 11:42:31 link-02 kernel: <ffffffff8013461d>{default_wake_function+0} <ffffffff8013466e>{__wake_up_common+67}
Oct 12 11:42:31 link-02 kernel: <ffffffff8013461d>{default_wake_function+0} <ffffffff8014bf28>{keventd_create_kthread+0}
Oct 12 11:42:31 link-02 kernel: <ffffffff80148037>{worker_thread+0} <ffffffff8014bf28>{keventd_create_kthread+0}
Oct 12 11:42:31 link-02 kernel: <ffffffff8014beff>{kthread+200} <ffffffff80110f47>{child_rip+8}
Oct 12 11:42:31 link-02 kernel: <ffffffff8014bf28>{keventd_create_kthread+0} <ffffffff8014be37>{kthread+0}
Oct 12 11:42:31 link-02 kernel: <ffffffff80110f3f>{child_rip+0}

This doesn't cause future I/O to block, and clvmd is not deadlocked here either. The leg failure recovery properly takes place.
[root@link-08 syncd_primary_3legs_3]# lvs -a -o +devices
/dev/sda1: open failed: No such device or address
LV                                VG             Attr   LSize   Origin Snap% Move Log                        Copy%  Devices
LogVol00                          VolGroup00     -wi-ao  72.38G                                                     /dev/hda2(0)
LogVol01                          VolGroup00     -wi-ao   1.94G                                                     /dev/hda2(2316)
syncd_primary_3legs_1             helter_skelter mwi-ao 800.00M                   syncd_primary_3legs_1_mlog 100.00 syncd_primary_3legs_1_mimage_2(0),syncd_primary_3legs_1_mimage_1(0)
[syncd_primary_3legs_1_mimage_1]  helter_skelter iwi-ao 800.00M                                                     /dev/sdf1(0)
[syncd_primary_3legs_1_mimage_2]  helter_skelter iwi-ao 800.00M                                                     /dev/sdc1(0)
[syncd_primary_3legs_1_mlog]      helter_skelter lwi-ao   4.00M                                                     /dev/sdg1(0)
syncd_primary_3legs_2             helter_skelter mwi-ao 800.00M                   syncd_primary_3legs_2_mlog 100.00 syncd_primary_3legs_2_mimage_2(0),syncd_primary_3legs_2_mimage_1(0)
[syncd_primary_3legs_2_mimage_1]  helter_skelter iwi-ao 800.00M                                                     /dev/sdf1(200)
[syncd_primary_3legs_2_mimage_2]  helter_skelter iwi-ao 800.00M                                                     /dev/sdc1(200)
[syncd_primary_3legs_2_mlog]      helter_skelter lwi-ao   4.00M                                                     /dev/sdg1(1)
syncd_primary_3legs_3             helter_skelter mwi-ao 800.00M                   syncd_primary_3legs_3_mlog 100.00 syncd_primary_3legs_3_mimage_2(0),syncd_primary_3legs_3_mimage_1(0)
[syncd_primary_3legs_3_mimage_1]  helter_skelter iwi-ao 800.00M                                                     /dev/sdf1(400)
[syncd_primary_3legs_3_mimage_2]  helter_skelter iwi-ao 800.00M                                                     /dev/sdc1(400)
[syncd_primary_3legs_3_mlog]      helter_skelter lwi-ao   4.00M                                                     /dev/sdg1(2)

This is very strange. I can't seem to make things "hang", but things do become abysmally slow for certain processes. Perhaps your type of workload backs up enough I/O that when the sync happens, it appears to hang (because it is so slow). I'm not sure where the slowdown is coming from. cmirror network calls take virtually no time. However, I have noticed considerable delay between (core) mirror 'do_writes' calls.
From SystemTap:

Entering: do_writes(time_between=879,consult_calls=1,consult_time=0)
Entering: do_writes(time_between=201,consult_calls=5,consult_time=1)
Entering: do_writes(time_between=0,consult_calls=3,consult_time=0)
Entering: do_writes(time_between=31,consult_calls=3,consult_time=0)
Entering: do_writes(time_between=200,consult_calls=1,consult_time=0)
Entering: do_writes(time_between=0,consult_calls=3,consult_time=0)
Entering: do_writes(time_between=28,consult_calls=3,consult_time=0)
Entering: do_writes(time_between=4998,consult_calls=1,consult_time=0)

We can see that the time between calls (in ms) can be as high as 5 seconds... Also during my tests, I've seen processes slow down on mirrors which have not suffered a failure (after another mirror has recovered from a failure). I'll have to test whether the slowdown occurs without a failure.

The problem could be occurring in the core mirroring code. From drivers/md/dm-raid1.c:

static void do_work(void *data)
{
        while (do_mirror(data)) {
                set_current_state(TASK_INTERRUPTIBLE);
                schedule_timeout(HZ/5);
        }
}

'wake(ms)' is the usual call that triggers 'do_work'. However, do_mirror will return 1 if there is still work to be done, causing a 200ms delay before it is run again. Normally, this happens very infrequently, like when a failure occurs. In those cases, we may _want_ to delay and wait for a suspend. However, it also happens in response to the normal write queue having data on it. This queue can be filled by nominal I/O, or by I/O placed back on the queue due to remote recovery conflicts (from cluster mirror). In the event of remote recovery conflicts, we may want the delay. However, we never want the delay when the write queue is filled by normal means... and that seems to be exactly what is happening.
This is probably happening with single-machine mirroring too, but to a lesser extent, because more delay is introduced by the network calls cmirror makes, allowing more time to add something to the nominal queue before 'do_mirror' completes. Will test theory.

that's it.

Summary: The small sleep in drivers/md/dm-raid1.c:do_work() was designed to:
1) allow the userspace process to more easily proceed to reconfigure a mirror in the event of a failure.
2) allow remote machines (cluster mirror) to finish recovery of a region before resubmitting the I/O.

However, the sleep is having adverse effects on nominal I/O, sometimes slowing processes down by greater than a factor of 10. This can make applications appear hung. This problem occurs most frequently in cluster mirrors under heavy I/O (or during failure scenarios). It is possible in single-machine mirroring, but hasn't been seen yet. This bug does not corrupt data, but it certainly affects "AVAILABILITY", which is especially bad considering mirrors are used when you want to increase availability.

Just a quick note that the only issue I'm seeing with this fix is bz 359341, and so far that only happens when there are simultaneous failures to three 3-legged mirrors. Otherwise, I haven't been able to reproduce this sync "hang" anymore.

Just a note that I ran the single node lvm tests with this patch and didn't see any issues. That said, I wasn't doing the extensive failure scenario testing that is done on cluster mirrors.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0991.html