Bug 612221 - cmirror create deadlock - 'clog_resume: Failed to create cluster CPG'
cmirror create deadlock - 'clog_resume: Failed to create cluster CPG'
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: lvm2 (Show other bugs)
6.0
All Linux
high Severity high
: rc
: ---
Assigned To: Jonathan Earl Brassow
Corey Marthaler
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2010-07-07 11:26 EDT by Corey Marthaler
Modified: 2010-11-10 16:08 EST (History)
10 users (show)

See Also:
Fixed In Version: lvm2-2.02.72-4.el6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2010-11-10 16:08:19 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:


Attachments (Terms of Use)
log from grant-01 (274.93 KB, text/plain)
2010-07-07 11:55 EDT, Corey Marthaler
no flags Details
log from grant-02 (235.12 KB, text/plain)
2010-07-07 11:56 EDT, Corey Marthaler
no flags Details
log from grant-03 (235.87 KB, text/plain)
2010-07-07 11:57 EDT, Corey Marthaler
no flags Details

  None (edit)
Description Corey Marthaler 2010-07-07 11:26:16 EDT
Description of problem:
While running standard cmirror tests, I noticed a basic cmirror create attempt stall. This could be related to the underlying cluster components, but according to cman, all is fine.


---------------------------------------------
 base legs 4; convert to 1 legs;
 base log mirrored; convert to core log;
 active 1; sync=1; pvs=sufficient
---------------------------------------------
creating a base mirror on grant-02
Waiting until all mirrors become fully syncd...
   0/1 mirror(s) are fully synced: ( 0.00% )
   0/1 mirror(s) are fully synced: ( 0.00% )
   0/1 mirror(s) are fully synced: ( 0.00% )
sync percent for centi_base hasn't changed in the past minute
HACKING AROUND BZ 464550 by dd'ing to this stale mirror in order to restart the sync
   0/1 mirror(s) are fully synced: ( 0.00% )
   0/1 mirror(s) are fully synced: ( 0.00% )
   0/1 mirror(s) are fully synced: ( 0.00% )
sync percent for centi_base hasn't changed in the past 5 minutes
ADDITIONAL HACK AROUND BZ 474174 by suspending/resuming this mirror's
dm devices in order to restart the sync process


Jul  7 09:33:15 grant-02 qarshd[14388]: Running cmdline: lvcreate -n centi_base --mirrorlog mirrored -m 4 -L 700M centipede
Jul  7 09:33:15 grant-02 lvm[2205]: Monitoring mirror device centipede-centi_base_mlog for events.
Jul  7 09:33:15 grant-02 lvm[2205]: centipede-centi_base_mlog is now in-sync.
Jul  7 09:33:19 grant-02 lvm[2205]: No longer monitoring mirror device centipede-centi_base_mlog for events.
Jul  7 09:33:20 grant-02 cmirrord[2137]: Log entry already exists: LVM-c1zIi8aMH3XKZJicu9es8kVboDzjDbFU2EDAEPruHxPlnV7dnwqUet3IEfh3M2R1
Jul  7 09:33:20 grant-02 cmirrord[2137]: clog_resume:  Failed to create cluster CPG
Jul  7 09:33:20 grant-02 lvm[2205]: Monitoring mirror device centipede-centi_base_mlog for events.
Jul  7 09:33:20 grant-02 lvm[2205]: Monitoring mirror device centipede-centi_base for events.
Jul  7 09:33:20 grant-02 xinetd[1554]: EXIT: qarsh status=0 pid=14388 duration=5(sec)
Jul  7 10:03:09 grant-02 kernel: device-mapper: dm-log-userspace: [Efh3M2R1] Request timed out: [12/51626] - retrying
Jul  7 10:03:24 grant-02 kernel: device-mapper: dm-log-userspace: [Efh3M2R1] Request timed out: [12/51627] - retrying
Jul  7 10:03:39 grant-02 kernel: device-mapper: dm-log-userspace: [Efh3M2R1] Request timed out: [12/51628] - retrying
Jul  7 10:03:54 grant-02 kernel: device-mapper: dm-log-userspace: [Efh3M2R1] Request timed out: [12/51629] - retrying


[root@grant-02 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M     80   2010-07-02 11:58:34  grant-01
   2   M     80   2010-07-02 11:58:34  grant-02
   3   M     80   2010-07-02 11:58:34  grant-03

I'll attach the logs/dumps from the each node in the cluster.


Version-Release number of selected component (if applicable):
2.6.32-24.el6.x86_64

lvm2-2.02.69-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010
lvm2-libs-2.02.69-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010
lvm2-cluster-2.02.69-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010
device-mapper-1.02.51-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010
device-mapper-libs-1.02.51-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010
device-mapper-event-1.02.51-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010
device-mapper-event-libs-1.02.51-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010
cmirror-2.02.69-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010


How reproducible:
Only once so far
Comment 1 Corey Marthaler 2010-07-07 11:55:36 EDT
Created attachment 430098 [details]
log from grant-01
Comment 2 Corey Marthaler 2010-07-07 11:56:30 EDT
Created attachment 430099 [details]
log from grant-02
Comment 3 Corey Marthaler 2010-07-07 11:57:33 EDT
Created attachment 430100 [details]
log from grant-03
Comment 4 RHEL Product and Program Management 2010-07-15 10:15:00 EDT
This issue has been proposed when we are only considering blocker
issues in the current Red Hat Enterprise Linux release. It has
been denied for the current Red Hat Enterprise Linux release.

** If you would still like this issue considered for the current
release, ask your support representative to file as a blocker on
your behalf. Otherwise ask that it be considered for the next
Red Hat Enterprise Linux release. **
Comment 5 Corey Marthaler 2010-07-27 17:36:02 EDT
Reproduced this issue with the latest builds:
2.6.32-44.el6.x86_64

lvm2-2.02.69-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010
lvm2-libs-2.02.69-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010
lvm2-cluster-2.02.69-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010
device-mapper-1.02.51-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010
device-mapper-libs-1.02.51-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010
device-mapper-event-1.02.51-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010
device-mapper-event-libs-1.02.51-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010
cmirror-2.02.69-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010


---------------------------------------------
 base legs 0; convert to 1 legs;
 base log core; convert to mirrored log;
 active 0; sync=0; pvs=sufficient
---------------------------------------------
creating a base linear on grant-03
Unable to convert the log of an inactive cluster mirror
leaving mirror active for this operation!
Converting from 0 leg(s) core log; to 1 leg(s) mirrored log on grant-01

legs_up_needed: 1
legs_down_needed: 0
logs_up_needed: 2
logs_down_needed: 0

current free pvs: /dev/sdb2 /dev/sdb3 /dev/sdb5 /dev/sdb6 /dev/sdc1 /dev/sdc2 /dev/sdc3 /dev/sdc5
legs_up=/dev/sdc1
logs_up=/dev/sdb3 /dev/sdc3
legs_down=
logs_down=
total=/dev/sdc1 /dev/sdb3 /dev/sdc3

lvconvert --mirrorlog mirrored -m 1 centipede/centi_base /dev/sdc1 /dev/sdb3 /dev/sdc3
[HANG]


Jul 27 16:10:15 grant-01 qarshd[23177]: Running cmdline: lvconvert --mirrorlog mirrored -m 1 centipede/centi_base /dev/sdc1 /dev/sdb3 /dev/sdc3
Jul 27 16:10:16 grant-01 lvm[27664]: Monitoring mirror device centipede-centi_base_mlog for events.
Jul 27 16:10:16 grant-01 lvm[27664]: centipede-centi_base_mlog is now in-sync.
Jul 27 16:10:20 grant-01 lvm[27664]: No longer monitoring mirror device centipede-centi_base_mlog for events.
Jul 27 16:10:20 grant-01 cmirrord[17880]: Log entry already exists: LVM-N2o1fHSLSETgzUXUIi02ke7ZvEdDoMJfiC9uRh8ZBoe6JjJ0Q2tGAueoVWA7yC39
Jul 27 16:10:20 grant-01 cmirrord[17880]: clog_resume:  Failed to create cluster CPG
Jul 27 16:10:21 grant-01 lvm[27664]: Monitoring mirror device centipede-centi_base_mlog for events.
Jul 27 16:10:21 grant-01 lvm[27664]: Monitoring mirror device centipede-centi_base for events.
Comment 6 Corey Marthaler 2010-07-27 17:38:08 EDT
I still think this should block RHEL6.0
Comment 8 Jonathan Earl Brassow 2010-07-28 17:09:13 EDT
One command:
'lvcreate -m1 -l1 -n lv vg --mirrorlog mirrored'

Result (everytime?):
Jul 28 16:02:36 bp-01 cmirrord[1993]: Log entry already exists: LVM-sf0xvb3Q5GPl2F9LnM5M6S0SoVOTKsb2BCh1fdwl96XKKb3IMwCTDeoBk21hO2ji
Jul 28 16:02:36 bp-01 cmirrord[1993]: clog_resume:  Failed to create cluster CPG



I see that you are using 'mirrored' log in comment #5 - this seems to be the problem.  Could be that LVM is passing down the wrong UUID for the mirrored log identifier.
Comment 9 Jonathan Earl Brassow 2010-07-28 17:22:40 EDT
code is using UUID of the mirror if log type is 'core' and UUID of the log_lv if log type is 'disk|mirrored'.  Therefore, the main mirror is using the UUID of the log as the unique identifier, but so is the underlying mirrored log - this leads to conflict.
Comment 10 Jonathan Earl Brassow 2010-07-29 19:19:29 EDT
UUID conflict resolved, but now top-level mirror log refuses to get past "resume" - unclear why yet.
Comment 11 Jonathan Earl Brassow 2010-07-30 10:42:10 EDT
Mirrored log type for cluster mirrors will have to be deferred to a later release.  I will soon commit the patch that disallows this.  Reason from my patch:

Disallow mirrored logs in cluster mirrors.

The cluster log daemon (cmirrord) is not multi-threaded and
can handle only one request at a time.  When a log is stacked
on top of a mirror (which itself contains a 'core' log), it
creates a situation that cannot be solved without threading.

When the top level mirror issues a "resume", the log daemon
attempts to read from the log device to retrieve the log
state.  However, the log is a mirror which, before issuing
the read, attempts to determine the 'sync' status of the
region of the mirror which is to be read.  This sync status
request cannot be completed by the daemon because it is
blocked on a read I/O to the very mirror requesting the
sync status.
Comment 12 Jonathan Earl Brassow 2010-07-30 10:46:16 EDT
This bug should be resolved by virtue of the fact that the configuration that causes the issue is disallowed.  Another bug will have to be opened to request the feature to be enabled.  To complete this feature, threading will need to be added to the cluster log daemon - at least at a remedial enough level to process requests from different logs simultaneously.

It will still be necessary to verify that this bug does not show up in other configurations (cluster mirrors with 'core' or 'disk' log types).  Will change state from 'ASSIGNED' when patch is checked in.
Comment 13 Jonathan Earl Brassow 2010-08-02 17:09:35 EDT
checked in upstream - destined for 2.02.73
Comment 15 Corey Marthaler 2010-08-13 18:22:55 EDT
[root@hayes-02 ~]# lvcreate -m 1 --mirrorlog mirrored -L 100M -n mirror centipede
  Log type, "mirrored", is unavailable to cluster mirrors

Verified that mirrored logs are no longer allowed in the latest lvm rpms. Like mentioned in comment #12, if we are to ever support mirrored logs with cluster mirrors, this bug will need to actually be fixed.
Comment 16 releng-rhel@redhat.com 2010-11-10 16:08:19 EST
Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.

Note You need to log in before you can comment on or make changes to this bug.