Description of problem:
While cleaning up from this mirror failure scenario, lvchange hung.

================================================================================
Iteration 2.10 started at Tue Nov 13 01:21:22 CST 2007
================================================================================
Scenario: Kill disk log of synced 2 leg mirror(s)

****** Mirror hash info for this scenario ******
* name:    syncd_log_2legs
* sync:    1
* mirrors: 3
* disklog: 1
* failpv:  /dev/sdd1
* legs:    2
* pvs:     /dev/sda1 /dev/sdc1 /dev/sdd1
************************************************

Version-Release number of selected component (if applicable):
cmirror-1.0.1-1
cmirror-kernel-2.6.9-38.5
kernel-hugemem-2.6.9-67.EL
device-mapper-1.02.21-1.el4

How reproducible:
Unknown

Actual results:
The problem started at 1:28 am on tank-02.

Nov 13 01:28:43 tank-02 qarshd[16144]: Running cmdline: lvchange -an /dev/helter_skelter/syncd_log_2legs_3
Nov 13 01:28:43 tank-02 kernel: cdrom: open failed.
Nov 13 01:28:44 tank-02 lvm[23704]: No longer monitoring mirror device helter_skelter-syncd_log_2legs_3 for events
Nov 13 01:28:46 tank-02 kernel: SM: 0100013b process_stop_request: uevent already set
Nov 13 01:28:52 tank-02 kernel: irror: starter : 2
Nov 13 01:28:52 tank-02 last message repeated 4 times
Nov 13 01:28:52 tank-02 kernel: <7irror: starter : 2
Nov 13 01:28:52 tank-02 kernel: <7irror: starter : 2
Nov 13 01:28:53 tank-02 kernel: <irror: starter : 2
Nov 13 01:28:53 tank-02 kernel: irror: starter : 2

tank-03 started producing the same messages:

Nov 13 01:28:43 tank-03 lvm[23694]: No longer monitoring mirror device helter_skelter-syncd_log_2legs_1 for events
Nov 13 01:28:44 tank-03 lvm[23694]: No longer monitoring mirror device helter_skelter-syncd_log_2legs_2 for events
Nov 13 01:28:46 tank-03 lvm[23694]: No longer monitoring mirror device helter_skelter-syncd_log_2legs_3 for events
Nov 13 01:28:54 tank-03 kernel: irror: starter : 2
Nov 13 01:28:54 tank-03 kernel: irror: starter : 2
Nov 13 01:28:54 tank-03 kernel: <irror: starter : 2

tank-01 and tank-04 stopped monitoring the mirror devices at about the same time:

Nov 13 01:28:41 tank-01 lvm[14164]: No longer monitoring mirror device helter_skelter-syncd_log_2legs_1 for events
Nov 13 01:28:42 tank-01 lvm[14164]: No longer monitoring mirror device helter_skelter-syncd_log_2legs_2 for events
Nov 13 01:28:43 tank-01 lvm[14164]: No longer monitoring mirror device helter_skelter-syncd_log_2legs_3 for events

Nov 13 01:28:39 tank-04 lvm[24693]: No longer monitoring mirror device helter_skelter-syncd_log_2legs_1 for events
Nov 13 01:28:40 tank-04 lvm[24693]: No longer monitoring mirror device helter_skelter-syncd_log_2legs_2 for events
Nov 13 01:28:41 tank-04 lvm[24693]: No longer monitoring mirror device helter_skelter-syncd_log_2legs_3 for events

Output from cman_tool nodes and cman_tool services:

[root@tank-02 ~]# cman_tool nodes
Node  Votes Exp Sts  Name
   1    1    4   M   tank-01
   2    1    4   M   tank-03
   3    1    4   M   tank-04
   4    1    4   M   tank-02

tank-01:
[root@tank-01 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           2   2 run       -
[1 4 2 3]

DLM Lock Space:  "clvmd"                           107 107 run       -
[1 4 2 3]

tank-02:
[root@tank-02 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           2   2 run       -
[1 4 2 3]

DLM Lock Space:  "clvmd"                           107 107 run       -
[1 4 2 3]

DLM Lock Space:  "clustered_log"                   315 308 run       S-10,280,2
[2 4]

tank-03:
[root@tank-03 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           2   2 run       -
[1 2 4 3]

DLM Lock Space:  "clvmd"                           107 107 run       -
[1 2 4 3]

DLM Lock Space:  "clustered_log"                   315 308 run       S-13,210,1
[2 4]

tank-04:
[root@tank-04 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           2   2 run       -
[1 2 3 4]

DLM Lock Space:  "clvmd"                           107 107 run       -
[1 2 3 4]

Output of lvm_backtraces.pl on tank-02:

Backtrace for lvchange-an/dev/helter_skelter/syncd_log_2legs_3 (16145):
#1  0x00471673 in __read_nocancel () from /lib/tls/libpthread.so.0
#2  0x080a1000 in _lock_for_cluster (cmd=51 '3', flags=Variable "flags" is not available.) at locking/cluster_locking.c:115
#3  0x080a1466 in _lock_resource (cmd=0x8ff4a88, resource=Variable "resource" is not available.) at locking/cluster_locking.c:415
#4  0x0808bb9a in _lock_vol (cmd=0x8ff4a88, resource=0xfef3f430 "helter_skelter", flags=6) at locking/locking.c:321
#######################################
# _lock_vol flags = LCK_VG | LCK_UNLOCK
#######################################
#5  0x0808bdb8 in lock_vol (cmd=0x8ff4a88, vol=0x900ab20 "helter_skelter", flags=6) at locking/locking.c:354
#6  0x0806827e in process_each_lv (cmd=0x8ff4a88, argc=1, argv=0xfef410bc, lock_type=36, handle=0x0, process_single=0x80521bd <lvchange_single>) at toollib.c:407
#7  0x08053bbb in lvchange (cmd=0x8ff4a88, argc=1, argv=0xfffffe00) at lvchange.c:728
#8  0x0805b571 in lvm_run_command (cmd=0x8ff4a88, argc=1, argv=0xfef410bc) at lvmcmdline.c:927
#9  0x0805c1d9 in lvm2_main (argc=3, argv=0xfef410b4, is_static=0) at lvmcmdline.c:1205

Backtrace for lvs (16831):
#1  0x00471673 in __read_nocancel () from /lib/tls/libpthread.so.0
#2  0x080a1000 in _lock_for_cluster (cmd=51 '3', flags=Variable "flags" is not available.) at locking/cluster_locking.c:115
#3  0x080a1466 in _lock_resource (cmd=0x8fd5a40, resource=Variable "resource" is not available.) at locking/cluster_locking.c:415
#4  0x0808bb9a in _lock_vol (cmd=0x8fd5a40, resource=0xfee8b6a0 "VolGroup00", flags=33) at locking/locking.c:321
#################################################
# _lock_vol flags = LCK_VG | LCK_HOLD | LCK_READ
#################################################
#5  0x0808bdb8 in lock_vol (cmd=0x8fd5a40, vol=0x8fecba8 "VolGroup00", flags=33) at locking/locking.c:354
#6  0x08067dc6 in process_each_lv (cmd=0x8fd5a40, argc=0, argv=0xfee8d368, lock_type=33, handle=0x8ff25b8, process_single=0x8064ece <_lvs_single>) at toollib.c:346
#7  0x08065864 in _report (cmd=0x8fd5a40, argc=0, argv=0xfee8d368, report_type=LVS) at reporter.c:325
#8  0x0805b571 in lvm_run_command (cmd=0x8fd5a40, argc=0, argv=0xfee8d368) at lvmcmdline.c:927
#9  0x0805c1d9 in lvm2_main (argc=1, argv=0xfee8d364, is_static=0) at lvmcmdline.c:1205
Command failed

Nov 13 09:29:31 tank-02 kernel: <562
Nov 13 09:29:31 tank-02 kernel: <7566
Nov 13 09:29:31 tank-02 kernel: 584
Nov 13 09:29:31 tank-02 kernel: <594
Nov 13 09:29:31 tank-02 kernel: 608
Nov 13 09:29:31 tank-02 kernel: 830
Nov 13 09:29:31 tank-02 kernel: 300
Nov 13 09:29:32 tank-02 kernel: <650
Nov 13 09:29:32 tank-02 kernel: <7652
Nov 13 09:29:32 tank-02 kernel: 870

DLM lockspace 'clvmd'
Resource 44b52ecc (parent 00000000). Name (len=64) "bgBQiuke1HPFXUNCblZcYdkujzWktG2SfdTn2Vf3HNJ8P8y3dzagNQUQhyvUhGd3"
Master Copy
Granted Queue
00010222 CR 23420
000101cb CR 26067 Remote: tank-03 000203d2
Conversion Queue
Waiting Queue

Resource 48a4d9ac (parent 00000000). Name (len=12) "V_VolGroup00"
Master Copy
Granted Queue
0002021a PR 16832
Conversion Queue
Waiting Queue

Output of lvm_backtraces.pl on tank-03:

[root@tank-03 ~]# /tmp/lvm_backtraces.pl
Command failed

Nov 13 09:29:34 tank-03 kernel: 011
Nov 13 09:29:34 tank-03 kernel: <7019
Nov 13 09:29:34 tank-03 kernel: <7115
Nov 13 09:29:34 tank-03 kernel: <7845
Nov 13 09:29:34 tank-03 kernel: 847
Nov 13 09:29:34 tank-03 kernel: <700687
Nov 13 09:29:34 tank-03 kernel: <n: 00713
Nov 13 09:29:34 tank-03 kernel: 469
Nov 13 09:29:35 tank-03 kernel: <669
Nov 13 09:29:35 tank-03 kernel: <673

DLM lockspace 'clvmd'
Resource 44fd6af8 (parent 00000000). Name (len=64) "bgBQiuke1HPFXUNCblZcYdkujzWktG2SfdTn2Vf3HNJ8P8y3dzagNQUQhyvUhGd3"
Local Copy, Master is node tank-02
Granted Queue
000203d2 CR 26067 Master: 000101cb
Conversion Queue
Waiting Queue

Expected results:


Additional info:
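For anyone trying to follow along without the test harness (the qarsh-driven
harness itself is not attached to this report), a rough manual sketch of the
scenario, with VG/LV/PV names taken from the hash info above -- the exact
lvcreate sizing and loop are assumptions:

# Set up the VG and the three mirrored LVs ("mirrors: 3"); _3 is the one
# whose deactivation hung.
pvcreate /dev/sda1 /dev/sdc1 /dev/sdd1
vgcreate helter_skelter /dev/sda1 /dev/sdc1 /dev/sdd1
for i in 1 2 3; do
    lvcreate -m 1 --mirrorlog disk -L 500M -n syncd_log_2legs_$i helter_skelter
done
# "sync: 1" -- wait until Copy% reaches 100 for all mirrors.
watch lvs -o name,copy_percent helter_skelter
# Fail the PV holding the disk log ("failpv: /dev/sdd1"), same /sys
# offline technique used in the 4.7 reproduction further down, then
# attempt the cleanup that hung here.
echo offline > /sys/block/sdd/device/state
lvchange -an /dev/helter_skelter/syncd_log_2legs_3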
I did end up finding a bz where clustered_log gets stuck (bz 254225), though I'm not sure that it's related to this one.
I wonder if this is 435491/435341/217895? The fix for the first two (which are what I think this bug is most closely related to) is in the latest LVM[-cluster] code. The third bug I listed is what you will hit after getting past the first two... I committed the fix for 217895 on 2/20/08; look for a cmirror package newer than that. Marking as MODIFIED, as I believe this has been fixed by the resolution of the above three bugs.
I hit something like this today while running on 4.7 bits. The test case I was running was:

Scenario: Kill primary leg of synced 2 leg mirror(s)

****** Mirror hash info for this scenario ******
* name:    syncd_primary_2legs
* sync:    1
* mirrors: 1
* disklog: 1
* failpv:  /dev/sdf1
* legs:    2
* pvs:     /dev/sdf1 /dev/sdd1 /dev/sde1
************************************************

After disabling /dev/sdf1 and trying to force the down conversion with some light I/O and a sync, the I/O and sync hung. Further LVM commands on the system hang as well. In the logs I see a lot of warnings about dev_open() being called while suspended, on all sorts of devices.

May 21 22:32:06 morph-02 qarshd[9315]: Running cmdline: echo offline > /sys/block/sdf/device/state
May 21 22:32:07 morph-02 kernel: scsi0 (0:6): rejecting I/O to offline device
May 21 22:32:07 morph-02 lvm[24528]: Mirror device, 253:3, has failed.
May 21 22:32:07 morph-02 lvm[24528]: Device failure in helter_skelter-syncd_primary_2legs_1
May 21 22:32:07 morph-02 lvm[24528]: WARNING: dev_open(/etc/lvm/lvm.conf) called while suspended
May 21 22:32:07 morph-02 kernel: scsi0 (0:6): rejecting I/O to offline device
May 21 22:32:07 morph-02 lvm[24528]: WARNING: dev_open(/dev/ramdisk) called while suspended
May 21 22:32:07 morph-02 last message repeated 2 times
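For context, the "light I/O and a sync" step above would look roughly like
this; a sketch only, with the LV name taken from the hash info (the harness's
exact commands are not shown):

# Offline the primary leg, then write through the mirror so dmeventd sees
# the failure and attempts the down-conversion.
echo offline > /sys/block/sdf/device/state
dd if=/dev/zero of=/dev/helter_skelter/syncd_primary_2legs_1 bs=4k count=25
sync    # in this run, both the dd and the sync hung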
The Red Hat Cluster Suite product is past end-of-life; closing.