Bug 625609

Summary: simultaneous cmirror operations fail due to locking issues
Product: Red Hat Enterprise Linux 6
Reporter: Corey Marthaler <cmarthal>
Component: lvm2
Assignee: Jonathan Earl Brassow <jbrassow>
Status: CLOSED WONTFIX
QA Contact: Corey Marthaler <cmarthal>
Severity: high
Docs Contact:
Priority: high
Version: 6.0
CC: agk, ddumas, dwysocha, heinzm, jbrassow, joe.thornber, mbroz, prockai
Target Milestone: rc
Keywords: Regression, TestBlocker
Target Release: ---
Flags: rlerch: needinfo+
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Attempting to run multiple LVM commands in quick succession might cause a backlog of these commands. Consequently, some of the operations requested might time-out, and subsequently, fail.
Last Closed: 2010-11-22 23:14:56 UTC
Bug Depends On:
Bug Blocks: 653628, 682649

Description Corey Marthaler 2010-08-19 22:30:47 UTC
Description of problem:
I've reproduced this problem on two clusters.

[cmarthal@silver bin]$ ./cmirror_lock_stress -l /home/msp/cmarthal/work/rhel6/sts-root -r /usr/tests/sts-rhel6.0 -R ../../var/share/resource_files/grant.xml

[...]

creating lvm devices...
Create 7 PV(s) for lock_stress on grant-01
Create VG lock_stress on grant-01
Creating herd file /tmp/cmirror_lock_stress.201008191629/lock_stress.h2 containing lock_ops cmds for collie
Starting lock operations on all cluster nodes
lock operations either failed or timed out, check in /tmp/cmirror_lock_stress.201008191629


Individual cmd output:
Creating a 4 redundant legged cmirror named grant-03.5104
  Logical volume "grant-03.5104" created

Down converting cmirror from 4 legs to 1 on grant-03
  Error locking on node grant-03: Command timed out
  Problem reactivating grant-03.5104
couldn't down convert cmirror on grant-03

Aug 19 16:33:30 grant-03 qarshd[2632]: Running cmdline: lvconvert -m 1 lock_stress/grant-03.5104
Aug 19 16:33:33 grant-03 lvm[2145]: No longer monitoring mirror device lock_stress-grant--02.5101 for events.
Aug 19 16:33:45 grant-03 lvm[2145]: Monitoring mirror device lock_stress-grant--02.5101 for events.
Aug 19 16:33:55 grant-03 lvm[2145]: No longer monitoring mirror device lock_stress-grant--02.5101 for events.
Aug 19 16:34:12 grant-03 lvm[2145]: Monitoring mirror device lock_stress-grant--02.5101 for events.
Aug 19 16:34:24 grant-03 lvm[2145]: No longer monitoring mirror device lock_stress-grant--01.5105 for events.
Aug 19 16:34:39 grant-03 lvm[2145]: Monitoring mirror device lock_stress-grant--01.5105 for events.
Aug 19 16:35:01 grant-03 lvm[2145]: No longer monitoring mirror device lock_stress-grant--01.5105 for events.
Aug 19 16:35:08 grant-03 lvm[2145]: Monitoring mirror device lock_stress-grant--01.5105 for events.
Aug 19 16:35:18 grant-03 lvm[2145]: No longer monitoring mirror device lock_stress-grant--03.5104 for events.
Aug 19 16:36:43 grant-03 lvm[2145]: Monitoring mirror device lock_stress-grant--03.5104 for events.
Aug 19 16:36:43 grant-03 xinetd[1664]: EXIT: qarsh status=0 pid=2632 duration=193(sec)
Aug 19 16:36:48 grant-03 lvm[2145]: No longer monitoring mirror device lock_stress-grant--02.5101 for events.
Aug 19 16:37:20 grant-03 lvm[2145]: No longer monitoring mirror device lock_stress-grant--01.5105 for events.
Aug 19 16:45:21 grant-03 lvm[2145]: lock_stress-grant--03.5104 is now in-sync.
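
To make the stalls in the log above easier to see, the following is a minimal sketch that pairs each "No longer monitoring" message with the next "Monitoring" message for the same device and reports the gap in seconds. The function name `unmonitored_gaps`, the regular expression, and the assumed year are illustrative and not part of any LVM tooling; the log lines are copied from this report.

```python
import re
from datetime import datetime

# lvm[2145] lines copied from the grant-03 log in this report.
LOG = """\
Aug 19 16:33:33 grant-03 lvm[2145]: No longer monitoring mirror device lock_stress-grant--02.5101 for events.
Aug 19 16:33:45 grant-03 lvm[2145]: Monitoring mirror device lock_stress-grant--02.5101 for events.
Aug 19 16:33:55 grant-03 lvm[2145]: No longer monitoring mirror device lock_stress-grant--02.5101 for events.
Aug 19 16:34:12 grant-03 lvm[2145]: Monitoring mirror device lock_stress-grant--02.5101 for events.
Aug 19 16:34:24 grant-03 lvm[2145]: No longer monitoring mirror device lock_stress-grant--01.5105 for events.
Aug 19 16:34:39 grant-03 lvm[2145]: Monitoring mirror device lock_stress-grant--01.5105 for events.
Aug 19 16:35:01 grant-03 lvm[2145]: No longer monitoring mirror device lock_stress-grant--01.5105 for events.
Aug 19 16:35:08 grant-03 lvm[2145]: Monitoring mirror device lock_stress-grant--01.5105 for events.
Aug 19 16:35:18 grant-03 lvm[2145]: No longer monitoring mirror device lock_stress-grant--03.5104 for events.
Aug 19 16:36:43 grant-03 lvm[2145]: Monitoring mirror device lock_stress-grant--03.5104 for events.
"""

# Hypothetical pattern for the two message forms seen above.
PATTERN = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ lvm\[\d+\]: "
    r"(?P<state>No longer monitoring|Monitoring) mirror device "
    r"(?P<dev>\S+) for events\.$"
)


def unmonitored_gaps(log_text, year=2010):
    """Return (device, seconds) for each unmonitor -> monitor pair, in log order.

    Syslog timestamps carry no year, so one is assumed.
    """
    pending = {}  # device -> time of its latest "No longer monitoring" line
    gaps = []
    for line in log_text.splitlines():
        match = PATTERN.match(line.strip())
        if not match:
            continue
        ts = datetime.strptime(f"{year} {match['ts']}", "%Y %b %d %H:%M:%S")
        dev = match["dev"]
        if match["state"] == "No longer monitoring":
            pending[dev] = ts
        elif dev in pending:
            gaps.append((dev, int((ts - pending.pop(dev)).total_seconds())))
    return gaps


if __name__ == "__main__":
    for dev, seconds in unmonitored_gaps(LOG):
        print(f"{dev}: {seconds}s")
```

On these lines the longest gap is the 85 seconds for lock_stress-grant--03.5104, the mirror whose down-convert reported the locking timeout.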


[root@grant-03 ~]# lvs -a -o +devices
  LV                       VG          Attr   LSize   Log                Copy%  Devices
  grant-01.5105            lock_stress -wi-a- 500.00m                           /dev/sdc4(0)
  grant-02.5101            lock_stress -wi-a- 500.00m                           /dev/sdc4(125)
  grant-03.5104            lock_stress mwi-a- 500.00m grant-03.5104_mlog 100.00 grant-03.5104_mimage_0(0),grant-03.5104_mimage_1(0)
  [grant-03.5104_mimage_0] lock_stress iwi-ao 500.00m                           /dev/sdc4(250)
  [grant-03.5104_mimage_1] lock_stress iwi-ao 500.00m                           /dev/sdc3(250)
  grant-03.5104_mimage_2   lock_stress -wi-a- 500.00m                           /dev/sdc2(250)
  grant-03.5104_mimage_3   lock_stress -wi-a- 500.00m                           /dev/sdc1(125)
  grant-03.5104_mimage_4   lock_stress -wi-a- 500.00m                           /dev/sdb4(125)
  [grant-03.5104_mlog]     lock_stress lwi-ao   4.00m                           /dev/sdb2(2)


Version-Release number of selected component (if applicable):
2.6.32-59.1.el6.x86_64

lvm2-2.02.72-8.el6    BUILT: Wed Aug 18 10:41:52 CDT 2010
lvm2-libs-2.02.72-8.el6    BUILT: Wed Aug 18 10:41:52 CDT 2010
lvm2-cluster-2.02.72-8.el6    BUILT: Wed Aug 18 10:41:52 CDT 2010
udev-147-2.22.el6    BUILT: Fri Jul 23 07:21:33 CDT 2010
device-mapper-1.02.53-8.el6    BUILT: Wed Aug 18 10:41:52 CDT 2010
device-mapper-libs-1.02.53-8.el6    BUILT: Wed Aug 18 10:41:52 CDT 2010
device-mapper-event-1.02.53-8.el6    BUILT: Wed Aug 18 10:41:52 CDT 2010
device-mapper-event-libs-1.02.53-8.el6    BUILT: Wed Aug 18 10:41:52 CDT 2010
cmirror-2.02.72-8.el6    BUILT: Wed Aug 18 10:41:52 CDT 2010


How reproducible:
Every time

Comment 1 Corey Marthaler 2010-08-19 22:32:35 UTC
Marking this a regression since this test (cmirror_lock_stress) used to pass in RHEL5.5.

Comment 2 Corey Marthaler 2010-08-24 21:23:01 UTC
There should be a 6.0 release note for this issue.

Comment 3 Corey Marthaler 2010-08-26 21:46:14 UTC
FWIW, I'm able to hit this bug with mirror_sanity as well.

SCENARIO - [verify_sync_completions]
Create 8 mirrors and verify that their copy percents complete
hayes-03: lvcreate -m 1 -n sync_check_1 -L 500M mirror_sanity
hayes-01: lvcreate -m 1 -n sync_check_2 -L 500M mirror_sanity
hayes-02: lvcreate -m 1 -n sync_check_3 -L 500M mirror_sanity
hayes-01: lvcreate -m 1 -n sync_check_4 -L 500M mirror_sanity
hayes-01: lvcreate -m 1 -n sync_check_5 -L 500M mirror_sanity
hayes-02: lvcreate -m 1 -n sync_check_6 -L 500M mirror_sanity
  Error locking on node hayes-03: Command timed out
  Aborting. Failed to activate new LV to wipe the start of it.
couldn't create mirror:
        hayes-02 lvcreate -m 1 -n sync_check_6 -L 500M mirror_sanity

Comment 5 Denise Dumas 2010-09-15 19:05:35 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Do not attempt to flood a production system with LVM commands, as the backlog of commands to be processed may increase to such a level that some operations fail due to timeouts.

Comment 7 Ryan Lerch 2010-10-13 03:54:44 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1 @@
-Do not attempt to flood a production system with LVM commands, as the backlog of commands to be processed may increase to such a level that some operations fail due to timeouts.
+Attempting to run multiple LVM commands in quick succession might cause a backlog of these commands. Consequently, some of the operations requested might time-out, and subsequently, fail.
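
One way a caller might act on this note is to issue LVM commands one at a time and retry only when the output contains the "Command timed out" text quoted in this report. The sketch below is an assumption-laden illustration, not a fix endorsed in this bug: the names `run_lvm` and `run_paced`, the retry count, and the linear backoff are all hypothetical.

```python
import subprocess
import time

# Error text taken from the failures quoted in this report.
TIMEOUT_MARKER = "Command timed out"


def run_lvm(argv):
    """Run one command and return (exit_code, combined stdout and stderr)."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    return proc.returncode, proc.stdout + proc.stderr


def run_paced(commands, runner=run_lvm, retries=3, delay=5.0, sleep=time.sleep):
    """Run commands sequentially, retrying those that hit a lock timeout.

    `retries`, `delay`, and the linear backoff are assumed values,
    not settings taken from LVM or from this bug.
    """
    results = []
    for argv in commands:
        for attempt in range(retries + 1):
            code, output = runner(argv)
            # Stop on success, or on any failure that is not a lock timeout.
            if code == 0 or TIMEOUT_MARKER not in output:
                break
            if attempt < retries:
                sleep(delay * (attempt + 1))
        results.append((argv, code, output))
    return results


if __name__ == "__main__":
    # Command taken from comment 3; requires a host with that volume group.
    cmds = [["lvcreate", "-m", "1", "-n", "sync_check_1", "-L", "500M", "mirror_sanity"]]
    for argv, code, output in run_paced(cmds):
        print(code, " ".join(argv))
```

A blind retry is not always safe: the `lvs` output in the description shows leftover `_mimage` volumes after a timed-out conversion, so the state of the LV should be checked before a command is reissued.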

Comment 8 RHEL Program Management 2010-11-22 23:14:56 UTC
Development Management has reviewed and declined this request.  You may appeal
this decision by reopening this request.