Bug 380181 - lvchange -an hung after a series of mirror removes
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: cmirror-kernel
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Jonathan Earl Brassow
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2007-11-13 15:32 UTC by Nate Straz
Modified: 2013-09-23 15:32 UTC (History)
CC List: 0 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-09-23 15:32:39 UTC
Embargoed:



Description Nate Straz 2007-11-13 15:32:08 UTC
Description of problem:

While cleaning up from this mirror failure scenario, lvchange hung.

================================================================================
Iteration 2.10 started at Tue Nov 13 01:21:22 CST 2007
================================================================================
Scenario: Kill disk log of synced 2 leg mirror(s)

****** Mirror hash info for this scenario ******
* name:      syncd_log_2legs
* sync:      1
* mirrors:   3
* disklog:   1
* failpv:    /dev/sdd1
* legs:      2
* pvs:       /dev/sda1 /dev/sdc1 /dev/sdd1
************************************************
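The scenario above can be replayed by hand. A hedged sketch follows — the harness internals aren't shown in this report, and the LV size and command details are assumptions; only the VG/LV names and PVs come from the hash info:

```shell
#!/bin/sh
# Sketch: "Kill disk log of synced 2 leg mirror" by hand.
# Echoes the plan by default; set DRY_RUN=0 on a real cluster node.
DRY_RUN=${DRY_RUN:-1}
CMDS=""
run() { CMDS="$CMDS | $*"; [ "$DRY_RUN" = 1 ] && echo "+ $*" || "$@"; }

# 1. Create a 2-leg clustered mirror with its disk log on the PV we
#    intend to fail (/dev/sdd1, per the hash info above).
run lvcreate -m 1 -L 500M -n syncd_log_2legs_1 helter_skelter \
    /dev/sda1 /dev/sdc1 /dev/sdd1
# 2. Wait for Copy% to reach 100 before failing anything (sync: 1).
run lvs -o lv_name,copy_percent helter_skelter
# 3. Kill the log device, then deactivate; the lvchange -an here is
#    the command that hung in this report.
run sh -c 'echo offline > /sys/block/sdd/device/state'
run lvchange -an /dev/helter_skelter/syncd_log_2legs_1
```

Run once with the default DRY_RUN=1 to review the plan; nothing is executed.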



Version-Release number of selected component (if applicable):
cmirror-1.0.1-1
cmirror-kernel-2.6.9-38.5
kernel-hugemem-2.6.9-67.EL
device-mapper-1.02.21-1.el4

How reproducible:
Unknown

Actual results:

The problem started at 1:28 am on tank-02.  

Nov 13 01:28:43 tank-02 qarshd[16144]: Running cmdline: lvchange -an
/dev/helter_skelter/syncd_log_2legs_3 
Nov 13 01:28:43 tank-02 kernel: cdrom: open failed.
Nov 13 01:28:44 tank-02 lvm[23704]: No longer monitoring mirror device
helter_skelter-syncd_log_2legs_3 for events 
Nov 13 01:28:46 tank-02 kernel: SM: 0100013b process_stop_request: uevent
already set
Nov 13 01:28:52 tank-02 kernel: irror:   starter     : 2
Nov 13 01:28:52 tank-02 last message repeated 4 times
Nov 13 01:28:52 tank-02 kernel: <7irror:   starter     : 2
Nov 13 01:28:52 tank-02 kernel: <7irror:   starter     : 2
Nov 13 01:28:53 tank-02 kernel: <irror:   starter     : 2
Nov 13 01:28:53 tank-02 kernel: irror:   starter     : 2

tank-03 started producing the same messages:

Nov 13 01:28:43 tank-03 lvm[23694]: No longer monitoring mirror device
helter_skelter-syncd_log_2legs_1 for events 
Nov 13 01:28:44 tank-03 lvm[23694]: No longer monitoring mirror device
helter_skelter-syncd_log_2legs_2 for events 
Nov 13 01:28:46 tank-03 lvm[23694]: No longer monitoring mirror device
helter_skelter-syncd_log_2legs_3 for events 
Nov 13 01:28:54 tank-03 kernel: irror:   starter     : 2
Nov 13 01:28:54 tank-03 kernel: irror:   starter     : 2
Nov 13 01:28:54 tank-03 kernel: <irror:   starter     : 2

tank-01 and tank-04 stopped monitoring the mirror device about the same time:

Nov 13 01:28:41 tank-01 lvm[14164]: No longer monitoring mirror device
helter_skelter-syncd_log_2legs_1 for events 
Nov 13 01:28:42 tank-01 lvm[14164]: No longer monitoring mirror device
helter_skelter-syncd_log_2legs_2 for events 
Nov 13 01:28:43 tank-01 lvm[14164]: No longer monitoring mirror device
helter_skelter-syncd_log_2legs_3 for events 

Nov 13 01:28:39 tank-04 lvm[24693]: No longer monitoring mirror device
helter_skelter-syncd_log_2legs_1 for events 
Nov 13 01:28:40 tank-04 lvm[24693]: No longer monitoring mirror device
helter_skelter-syncd_log_2legs_2 for events 
Nov 13 01:28:41 tank-04 lvm[24693]: No longer monitoring mirror device
helter_skelter-syncd_log_2legs_3 for events 

Output from cman_tool nodes and cman_tool services:
[root@tank-02 ~]# cman_tool nodes
Node  Votes Exp Sts  Name
   1    1    4   M   tank-01
   2    1    4   M   tank-03
   3    1    4   M   tank-04
   4    1    4   M   tank-02

tank-01:
[root@tank-01 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           2   2 run       -
[1 4 2 3]

DLM Lock Space:  "clvmd"                           107 107 run       -
[1 4 2 3]

tank-02:
[root@tank-02 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           2   2 run       -
[1 4 2 3]

DLM Lock Space:  "clvmd"                           107 107 run       -
[1 4 2 3]

DLM Lock Space:  "clustered_log"                   315 308 run       S-10,280,2
[2 4]

tank-03:
[root@tank-03 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           2   2 run       -
[1 2 4 3]

DLM Lock Space:  "clvmd"                           107 107 run       -
[1 2 4 3]

DLM Lock Space:  "clustered_log"                   315 308 run       S-13,210,1
[2 4]

tank-04:
[root@tank-04 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           2   2 run       -
[1 2 3 4]

DLM Lock Space:  "clvmd"                           107 107 run       -
[1 2 3 4]

Output of lvm_backtraces.pl on tank-02:

Backtrace for lvchange-an/dev/helter_skelter/syncd_log_2legs_3 (16145):
#1  0x00471673 in __read_nocancel () from /lib/tls/libpthread.so.0
#2  0x080a1000 in _lock_for_cluster (cmd=51 '3', flags=Variable "flags" is not
available.)    at locking/cluster_locking.c:115
#3  0x080a1466 in _lock_resource (cmd=0x8ff4a88, resource=Variable "resource" is
not available.)    at locking/cluster_locking.c:415
#4  0x0808bb9a in _lock_vol (cmd=0x8ff4a88,     resource=0xfef3f430
"helter_skelter", flags=6) at locking/locking.c:321
        #######################################
        # _lock_vol flags = LCK_VG | LCK_UNLOCK
        #######################################
#5  0x0808bdb8 in lock_vol (cmd=0x8ff4a88, vol=0x900ab20 "helter_skelter",    
flags=6) at locking/locking.c:354
#6  0x0806827e in process_each_lv (cmd=0x8ff4a88, argc=1, argv=0xfef410bc,    
lock_type=36, handle=0x0, process_single=0x80521bd <lvchange_single>)    at
toollib.c:407
#7  0x08053bbb in lvchange (cmd=0x8ff4a88, argc=1, argv=0xfffffe00)    at
lvchange.c:728
#8  0x0805b571 in lvm_run_command (cmd=0x8ff4a88, argc=1, argv=0xfef410bc)    at
lvmcmdline.c:927
#9  0x0805c1d9 in lvm2_main (argc=3, argv=0xfef410b4, is_static=0)    at
lvmcmdline.c:1205

Backtrace for lvs (16831):
#1  0x00471673 in __read_nocancel () from /lib/tls/libpthread.so.0
#2  0x080a1000 in _lock_for_cluster (cmd=51 '3', flags=Variable "flags" is not
available.)    at locking/cluster_locking.c:115
#3  0x080a1466 in _lock_resource (cmd=0x8fd5a40, resource=Variable "resource" is
not available.)    at locking/cluster_locking.c:415
#4  0x0808bb9a in _lock_vol (cmd=0x8fd5a40, resource=0xfee8b6a0 "VolGroup00",  
  flags=33) at locking/locking.c:321
        #######################################
        # _lock_vol flags = LCK_VG | LCK_HOLD | LCK_READ
        #######################################
#5  0x0808bdb8 in lock_vol (cmd=0x8fd5a40, vol=0x8fecba8 "VolGroup00",    
flags=33) at locking/locking.c:354
#6  0x08067dc6 in process_each_lv (cmd=0x8fd5a40, argc=0, argv=0xfee8d368,    
lock_type=33, handle=0x8ff25b8, process_single=0x8064ece <_lvs_single>)    at
toollib.c:346
#7  0x08065864 in _report (cmd=0x8fd5a40, argc=0, argv=0xfee8d368,    
report_type=LVS) at reporter.c:325
#8  0x0805b571 in lvm_run_command (cmd=0x8fd5a40, argc=0, argv=0xfee8d368)    at
lvmcmdline.c:927
#9  0x0805c1d9 in lvm2_main (argc=1, argv=0xfee8d364, is_static=0)    at
lvmcmdline.c:1205

Command failed
Nov 13 09:29:31 tank-02 kernel: <562
Nov 13 09:29:31 tank-02 kernel: <7566
Nov 13 09:29:31 tank-02 kernel: 584
Nov 13 09:29:31 tank-02 kernel: <594
Nov 13 09:29:31 tank-02 kernel: 608
Nov 13 09:29:31 tank-02 kernel: 830
Nov 13 09:29:31 tank-02 kernel: 300
Nov 13 09:29:32 tank-02 kernel: <650
Nov 13 09:29:32 tank-02 kernel: <7652
Nov 13 09:29:32 tank-02 kernel: 870
DLM lockspace 'clvmd'

Resource 44b52ecc (parent 00000000). Name (len=64)
"bgBQiuke1HPFXUNCblZcYdkujzWktG2SfdTn2Vf3HNJ8P8y3dzagNQUQhyvUhGd3"  
Master Copy
Granted Queue
00010222 CR 23420
000101cb CR 26067 Remote:   tank-03 000203d2
Conversion Queue
Waiting Queue

Resource 48a4d9ac (parent 00000000). Name (len=12) "V_VolGroup00"  
Master Copy
Granted Queue
0002021a PR 16832
Conversion Queue
Waiting Queue

Output of lvm_backtraces.pl on tank-03:
[root@tank-03 ~]# /tmp/lvm_backtraces.pl 
Command failed
Nov 13 09:29:34 tank-03 kernel: 011
Nov 13 09:29:34 tank-03 kernel: <7019
Nov 13 09:29:34 tank-03 kernel: <7115
Nov 13 09:29:34 tank-03 kernel: <7845
Nov 13 09:29:34 tank-03 kernel: 847
Nov 13 09:29:34 tank-03 kernel: <700687
Nov 13 09:29:34 tank-03 kernel: <n: 00713
Nov 13 09:29:34 tank-03 kernel: 469
Nov 13 09:29:35 tank-03 kernel: <669
Nov 13 09:29:35 tank-03 kernel: <673
DLM lockspace 'clvmd'

Resource 44fd6af8 (parent 00000000). Name (len=64)
"bgBQiuke1HPFXUNCblZcYdkujzWktG2SfdTn2Vf3HNJ8P8y3dzagNQUQhyvUhGd3"  
Local Copy, Master is node tank-02
Granted Queue
000203d2 CR 26067 Master:     000101cb
Conversion Queue
Waiting Queue
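Both stuck commands above are blocked in read() inside _lock_for_cluster(), i.e. waiting on a reply from clvmd, so the hang sits below the commands themselves. If lvm_backtraces.pl isn't at hand, comparable backtraces can be gathered directly (a sketch; assumes gdb and the lvm2 debuginfo are installed, and collect_bt is a hypothetical helper name, not part of any LVM tool):

```shell
# Sketch: collect backtraces from hung LVM commands with gdb.
collect_bt() {
    for pid in $(pidof "$@" 2>/dev/null); do
        echo "=== backtrace for PID $pid ==="
        # Attach non-interactively, print the stack, and detach.
        gdb --batch -p "$pid" -ex bt -ex detach 2>/dev/null
    done
}

collect_bt lvchange lvs
```

If no matching processes are running, the helper simply prints nothing.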

Expected results:


Additional info:

Comment 1 Corey Marthaler 2007-11-13 15:44:20 UTC
I did end up finding a bz where clustered_log gets stuck (bz 254225), though I'm
not sure that it's related to this one.

Comment 2 Jonathan Earl Brassow 2008-03-20 16:20:48 UTC
I wonder if this is 435491/435341/217895?  The fix for the first 2 (which is
what I think this bug is most related to) is in the latest LVM[-cluster] code.

The third bug I listed is what you will hit after getting past the first
two.  I committed the fix for 217895 on 2/20/08; look for a cmirror package
newer than that.

Marking as MODIFIED, as I believe this has been fixed by clearing the above 3 bugs.

Comment 3 Nate Straz 2008-05-22 16:07:58 UTC
I hit something like this today while running on 4.7 bits.  The test case I was
running was:

Scenario: Kill primary leg of synced 2 leg mirror(s)

****** Mirror hash info for this scenario ******
* name:      syncd_primary_2legs
* sync:      1
* mirrors:   1
* disklog:   1
* failpv:    /dev/sdf1
* legs:      2
* pvs:       /dev/sdf1 /dev/sdd1 /dev/sde1
************************************************

After disabling /dev/sdf1 and trying to force the down conversion with some
light I/O and a sync, both the I/O and the sync hung.  Further LVM commands
also hang on the system.

In the logs I see a lot of warnings about dev_open() being called on all sorts of devices while suspended:

May 21 22:32:06 morph-02 qarshd[9315]: Running cmdline: echo offline >
/sys/block/sdf/device/state 
May 21 22:32:07 morph-02 kernel: scsi0 (0:6): rejecting I/O to offline device
May 21 22:32:07 morph-02 lvm[24528]: Mirror device, 253:3, has failed. 
May 21 22:32:07 morph-02 lvm[24528]: Device failure in
helter_skelter-syncd_primary_2legs_1 
May 21 22:32:07 morph-02 lvm[24528]: WARNING: dev_open(/etc/lvm/lvm.conf) called
while suspended
May 21 22:32:07 morph-02 kernel: scsi0 (0:6): rejecting I/O to offline device
May 21 22:32:07 morph-02 lvm[24528]: WARNING: dev_open(/dev/ramdisk) called
while suspended
May 21 22:32:07 morph-02 last message repeated 2 times
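The comment-3 variant can be driven the same way. A hedged sketch using the commands visible in the log above — the LV name comes from the hash info, while the I/O size and count are assumptions:

```shell
#!/bin/sh
# Sketch: fail the primary leg of a synced mirror under light I/O.
# Echoes the plan by default; set DRY_RUN=0 on a real cluster node.
DRY_RUN=${DRY_RUN:-1}
CMDS=""
run() { CMDS="$CMDS | $*"; [ "$DRY_RUN" = 1 ] && echo "+ $*" || "$@"; }

# Offline the primary leg, as qarshd did in the log above.
run sh -c 'echo offline > /sys/block/sdf/device/state'
# Push a little I/O at the mirror, then sync; in this report the I/O
# and the sync are what hung.
run dd if=/dev/zero of=/dev/helter_skelter/syncd_primary_2legs_1 \
    bs=4k count=100
run sync
```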


Comment 5 Lon Hohberger 2013-09-23 15:32:39 UTC
The Red Hat Cluster Suite product is past end-of-life; closing.

