Bug 432109 - RHEL5 cmirror tracker: cpg_mcast_joined error while mirror is suspended
Summary: RHEL5 cmirror tracker: cpg_mcast_joined error while mirror is suspended
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cmirror
Version: 5.2
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Jonathan Earl Brassow
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks: 431714
 
Reported: 2008-02-08 21:22 UTC by Corey Marthaler
Modified: 2010-01-12 02:07 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-01-20 21:25:16 UTC
Target Upstream Version:
Embargoed:


Attachments
log from taft-01 (300.38 KB, text/plain)
2008-02-12 15:26 UTC, Corey Marthaler
log from taft-02 (14.10 KB, text/plain)
2008-02-12 15:27 UTC, Corey Marthaler
log from taft-03 (10.04 KB, text/plain)
2008-02-12 15:28 UTC, Corey Marthaler
log from taft-04 (6.53 KB, text/plain)
2008-02-12 15:29 UTC, Corey Marthaler


Links
System ID: Red Hat Product Errata RHEA-2009:0158
Priority: normal
Status: SHIPPED_LIVE
Summary: new package: cmirror
Last Updated: 2009-01-20 16:05:16 UTC

Description Corey Marthaler 2008-02-08 21:22:41 UTC
Description of problem:
Saw these CPG error messages during log device failure testing. In this
case, the result was a deadlocked (not just really slow) mirror.

cmirror-1.1.13-1.el5/kmod-cmirror-0.1.6-1.el5.
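
For context, the number in the "cpg_mcast_joined error: 9" lines below is presumably the raw cpg_error_t return value from the openais CPG API, and the "Invalid exchange" / -52 pairs in the logs line up with the Linux errno EBADE. A minimal, hypothetical sketch of the kind of send wrapper that would produce such a message (illustrative only, not clogd's actual cluster_send(); the header path differs between openais and corosync releases):

#include <errno.h>
#include <stdio.h>
#include <sys/uio.h>
#include <openais/cpg.h>   /* <corosync/cpg.h> on later cluster stacks */

/*
 * Hypothetical send wrapper (illustrative only, not clogd's cluster.c):
 * multicast one buffer to the already-joined CPG group, retry the
 * transient "try again" case, and report the raw error number otherwise.
 * The "cpg_mcast_joined error: 9" messages in this report are presumably
 * that numeric cpg_error_t value; the -52 in the kernel messages is
 * EBADE, which strerror() renders as "Invalid exchange".
 */
static int send_to_group(cpg_handle_t handle, void *buf, size_t len)
{
        struct iovec iov = { .iov_base = buf, .iov_len = len };
        cpg_error_t r;
        int retries = 5;

        do {
                r = cpg_mcast_joined(handle, CPG_TYPE_AGREED, &iov, 1);
        } while (r == CPG_ERR_TRY_AGAIN && retries--);

        if (r != CPG_OK) {
                fprintf(stderr, "cpg_mcast_joined error: %d\n", (int)r);
                return -EBADE;   /* "Invalid exchange" */
        }
        return 0;
}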


Feb  8 11:40:57 taft-04 kernel: sd 1:0:0:7: rejecting I/O to offline device
Feb  8 11:40:57 taft-04 lvm[7071]: No longer monitoring mirror device
helter_skelter-syncd_log_3legs_1 for events
Feb  8 11:41:12 taft-04 kernel: device-mapper: dm-log-clustered: Request timed
out on DM_CLOG_GET_RESYNC_WORK:1068855 - retrying
Feb  8 11:41:12 taft-04 clogd[6767]: rw_log:  write failure: Input/output error
Feb  8 11:41:12 taft-04 clogd[6767]: Error writing to disk log
Feb  8 11:41:12 taft-04 kernel: sd 1:0:0:7: rejecting I/O to offline device
Feb  8 11:41:12 taft-04 kernel: device-mapper: dm-log-clustered: Server error
while processing request [DM_CLOG_FLUSH]: -5
Feb  8 11:41:12 taft-04 lvm[7071]: Monitoring mirror device
helter_skelter-syncd_log_3legs_1 for events
Feb  8 11:41:13 taft-04 clogd[6767]: cpg_mcast_joined error: 9
Feb  8 11:41:13 taft-04 clogd[6767]: cluster_send failed at: local.c:212
(do_local_work)
Feb  8 11:41:13 taft-04 clogd[6767]: [] Unable to send (null) to cluster:
Invalid exchange
Feb  8 11:41:13 taft-04 clogd[6767]: Bad callback on local/4
Feb  8 11:41:13 taft-04 kernel: device-mapper: dm-log-clustered: Stray request
returned: <NULL>, 0
Feb  8 11:41:14 taft-04 lvm[7071]: No longer monitoring mirror device
helter_skelter-syncd_log_3legs_2 for events
Feb  8 11:41:28 taft-04 kernel: device-mapper: dm-log-clustered: Request timed
out on DM_CLOG_IN_SYNC:1077480 - retrying
Feb  8 11:41:28 taft-04 kernel: device-mapper: dm-log-clustered: Server error
while processing request [DM_CLOG_FLUSH]: -5
Feb  8 11:41:28 taft-04 kernel: device-mapper: dm-log-clustered: Server error
while processing request [DM_CLOG_FLUSH]: -5
Feb  8 11:41:28 taft-04 clogd[6767]: [QbkqfJyQ] No match for cluster response:
DM_CLOG_IS_REMOTE_RECOVERING:1083771
Feb  8 11:41:28 taft-04 clogd[6767]: Current list:
Feb  8 11:41:28 taft-04 clogd[6767]:    [none]
Feb  8 11:41:28 taft-04 clogd[6767]: [QbkqfJyQ] Error while processing CPG
message, DM_CLOG_IS_REMOTE_RECOVERING: Invalid argument
Feb  8 11:41:28 taft-04 clogd[6767]: [QbkqfJyQ]    Response  : YES
Feb  8 11:41:28 taft-04 clogd[6767]: [QbkqfJyQ]    Originator: 4
Feb  8 11:41:28 taft-04 clogd[6767]: [QbkqfJyQ]    Responder : 1
Feb  8 11:41:28 taft-04 clogd[6767]: HISTORY::
Feb  8 11:41:28 taft-04 clogd[6767]: 0:10) SEQ#=1083769, UUID=2RP77UEX,
TYPE=DM_CLOG_FLUSH, ORIG=4, RESP=YES, RSPR=1
Feb  8 11:41:28 taft-04 clogd[6767]: 1:11) SEQ#=777788, UUID=QbkqfJyQ,
TYPE=DM_CLOG_IS_REMOTE_RECOVERING, ORIG=1, RESP=NO
Feb  8 11:41:28 taft-04 clogd[6767]: 2:12) SEQ#=1083770, UUID=QbkqfJyQ,
TYPE=DM_CLOG_IS_REMOTE_RECOVERING, ORIG=4, RESP=YES, RSPR=1
Feb  8 11:41:28 taft-04 clogd[6767]: 3:13) SEQ#=758142, UUID=QbkqfJyQ,
TYPE=DM_CLOG_IS_REMOTE_RECOVERING, ORIG=2, RESP=NO
Feb  8 11:41:28 taft-04 clogd[6767]: 4:14) SEQ#=777789, UUID=QbkqfJyQ,
TYPE=DM_CLOG_IS_REMOTE_RECOVERING, ORIG=1, RESP=NO
Feb  8 11:41:28 taft-04 clogd[6767]: 5:15) SEQ#=758143, UUID=QbkqfJyQ,
TYPE=DM_CLOG_IS_REMOTE_RECOVERING, ORIG=2, RESP=NO
Feb  8 11:41:28 taft-04 clogd[6767]: 6:16) SEQ#=777790, UUID=QbkqfJyQ,
TYPE=DM_CLOG_IS_REMOTE_RECOVERING, ORIG=1, RESP=NO
Feb  8 11:41:28 taft-04 clogd[6767]: 7:17) SEQ#=758144, UUID=QbkqfJyQ,
TYPE=DM_CLOG_IS_REMOTE_RECOVERING, ORIG=2, RESP=NO
Feb  8 11:41:28 taft-04 clogd[6767]: 8:18) SEQ#=777791, UUID=QbkqfJyQ,
TYPE=DM_CLOG_IS_REMOTE_RECOVERING, ORIG=1, RESP=NO
Feb  8 11:41:28 taft-04 clogd[6767]: 9:19) SEQ#=758145, UUID=QbkqfJyQ,
TYPE=DM_CLOG_IS_REMOTE_RECOVERING, ORIG=2, RESP=NO
Feb  8 11:41:28 taft-04 clogd[6767]: 10:0) SEQ#=777792, UUID=QbkqfJyQ,
TYPE=DM_CLOG_IS_REMOTE_RECOVERING, ORIG=1, RESP=NO
Feb  8 11:41:28 taft-04 clogd[6767]: 11:1) SEQ#=758146, UUID=QbkqfJyQ,
TYPE=DM_CLOG_IS_REMOTE_RECOVERING, ORIG=2, RESP=NO
Feb  8 11:41:28 taft-04 clogd[6767]: 12:2) SEQ#=777793, UUID=QbkqfJyQ,
TYPE=DM_CLOG_IS_REMOTE_RECOVERING, ORIG=1, RESP=NO
Feb  8 11:41:28 taft-04 clogd[6767]: 13:3) SEQ#=758147, UUID=QbkqfJyQ,
TYPE=DM_CLOG_IS_REMOTE_RECOVERING, ORIG=2, RESP=NO
Feb  8 11:41:28 taft-04 clogd[6767]: 14:4) SEQ#=777794, UUID=QbkqfJyQ,
TYPE=DM_CLOG_IS_REMOTE_RECOVERING, ORIG=1, RESP=NO
Feb  8 11:41:28 taft-04 clogd[6767]: 15:5) SEQ#=758148, UUID=QbkqfJyQ,
TYPE=DM_CLOG_IS_REMOTE_RECOVERING, ORIG=2, RESP=NO
Feb  8 11:41:28 taft-04 clogd[6767]: 16:6) SEQ#=777795, UUID=QbkqfJyQ,
TYPE=DM_CLOG_IS_REMOTE_RECOVERING, ORIG=1, RESP=NO
Feb  8 11:41:28 taft-04 clogd[6767]: 17:7) SEQ#=758149, UUID=QbkqfJyQ,
TYPE=DM_CLOG_IS_REMOTE_RECOVERING, ORIG=2, RESP=NO
Feb  8 11:41:28 taft-04 clogd[6767]: 18:8) SEQ#=1083771, UUID=QbkqfJyQ,
TYPE=DM_CLOG_IS_REMOTE_RECOVERING, ORIG=4, RESP=NO
Feb  8 11:41:28 taft-04 clogd[6767]: 19:9) SEQ#=777796, UUID=QbkqfJyQ,
TYPE=DM_CLOG_IS_REMOTE_RECOVERING, ORIG=1, RESP=NO
Feb  8 11:41:43 taft-04 kernel: device-mapper: dm-log-clustered: Request timed
out on DM_CLOG_IS_REMOTE_RECOVERING:1083771 - retrying
Feb  8 11:41:44 taft-04 lvm[7071]: Monitoring mirror device
helter_skelter-syncd_log_3legs_2 for events
Feb  8 11:42:24 taft-04 clogd[6767]: cpg_mcast_joined error: 9
Feb  8 11:42:24 taft-04 clogd[6767]: cluster_send failed at: local.c:212
(do_local_work)
Feb  8 11:42:24 taft-04 clogd[6767]: [] Unable to send (null) to cluster:
Invalid exchange
Feb  8 11:42:24 taft-04 clogd[6767]: Bad callback on local/4
Feb  8 11:42:24 taft-04 kernel: device-mapper: dm-log-clustered: Stray request
returned: <NULL>, 0
Feb  8 11:42:39 taft-04 kernel: device-mapper: dm-log-clustered: Request timed
out on DM_CLOG_IS_REMOTE_RECOVERING:1102107 - retrying
[...]

Comment 1 RHEL Program Management 2008-02-08 21:27:58 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 3 Corey Marthaler 2008-02-12 15:01:00 UTC
I was able to reproduce these cpg messages during log failure testing again, and
this time there was also apparent data corruption.

Scenario: Kill disk log of synced 2 leg mirror(s)

****** Mirror hash info for this scenario ******
* name:      syncd_log_2legs
* sync:      1
* mirrors:   2
* disklog:   1
* failpv:    /dev/sde1
* legs:      2
* pvs:       /dev/sdh1 /dev/sdg1 /dev/sde1
************************************************

Creating mirror(s) on taft-02...
taft-02: lvcreate -m 1 -n syncd_log_2legs_1 -L 800M helter_skelter
/dev/sdh1:0-1000 /dev/sdg1:0-1000 /dev/sde1:0-150
taft-02: lvcreate -m 1 -n syncd_log_2legs_2 -L 800M helter_skelter
/dev/sdh1:0-1000 /dev/sdg1:0-1000 /dev/sde1:0-150

Waiting until all mirrors become fully syncd...
        0/2 mirror(s) are fully synced: ( 1=6.00% 2=0.50% )
        0/2 mirror(s) are fully synced: ( 1=24.00% 2=18.50% )
        0/2 mirror(s) are fully synced: ( 1=41.00% 2=35.50% )
        0/2 mirror(s) are fully synced: ( 1=56.50% 2=51.00% )
        0/2 mirror(s) are fully synced: ( 1=72.00% 2=66.50% )
        0/2 mirror(s) are fully synced: ( 1=88.00% 2=82.00% )
        2/2 mirror(s) are fully synced: ( 1=100.00% 2=100.00% )

Creating gfs on top of mirror(s) on taft-02...
Mounting mirrored gfs filesystems on taft-01...
Mounting mirrored gfs filesystems on taft-02...
Mounting mirrored gfs filesystems on taft-03...
Mounting mirrored gfs filesystems on taft-04...

Writing verification files (checkit) to mirror(s) on...

Disabling device sde on taft-01
Disabling device sde on taft-02
Disabling device sde on taft-03
Disabling device sde on taft-04

Attempting I/O to cause mirror down conversion(s) on taft-02
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.087572 seconds, 479 MB/s
Verifying the down conversion of the failed mirror(s)
  /dev/sde1: open failed: No such device or address
Verifying LEG device /dev/sdh1 *IS* in the volume(s)
  /dev/sde1: open failed: No such device or address
Verifying LEG device /dev/sdg1 *IS* in the volume(s)
  /dev/sde1: open failed: No such device or address
Verifying FAILED device /dev/sde1 is *NOT* in the volume(s)
  /dev/sde1: open failed: No such device or address

Verifying files (checkit) on mirror(s) on...
<fail name="taft-01_2" pid="1563" time="Mon Feb 11 21:23:17 2008" type="cmd"
duration="1202786597" ec="1" />
ALL STOP!
Can not stat jtdkqxtobtydyhhgxaujlprikeijwdsia: Input/output error
checkit write verify failed
run_iteration() method failed at
/home/msp/cmarthal/work/rhel5/sts-root/lib/FI_engine.pm line 21.


I'll post the syslogs...


Comment 4 Corey Marthaler 2008-02-12 15:26:30 UTC
Created attachment 294664 [details]
log from taft-01

Comment 5 Corey Marthaler 2008-02-12 15:27:10 UTC
Created attachment 294665 [details]
log from taft-02

Comment 6 Corey Marthaler 2008-02-12 15:28:40 UTC
Created attachment 294666 [details]
log from taft-03

Comment 7 Corey Marthaler 2008-02-12 15:29:15 UTC
Created attachment 294667 [details]
log from taft-04

Comment 8 Corey Marthaler 2008-02-14 16:13:26 UTC
Although I have not seen an I/O deadlock yet, I can reproduce these messages w/o
any disk failure. All I was doing was running I/O to 4 cmirrors w/ GFS on top.

Feb 14 10:06:57 taft-01 clogd[6683]: cpg_mcast_joined error: 9
Feb 14 10:06:57 taft-01 clogd[6683]: cluster_send failed at: local.c:212
(do_local_work)
Feb 14 10:06:57 taft-01 clogd[6683]: [] Unable to send (null) to cluster:
Invalid exchange
Feb 14 10:06:57 taft-01 clogd[6683]: Bad callback on local/4
Feb 14 10:06:57 taft-01 kernel: device-mapper: dm-log-clustered: Stray request
returned: <NULL>, 0
device-mapper: dm-log-clustered: Request timed out on DM_CLOG_IN_SYNC:508917 -
retrying
device-mapper: dm-log-clustered: Stray request returned: <NULL>, 0
Feb 14 10:07:12 taft-01 kernel: device-mapper: dm-log-clustered: Request timed
out on DM_CLOG_IN_SYNC:508917 - reg


Comment 9 Corey Marthaler 2008-02-18 22:43:29 UTC
Updating summary to reflect that a log failure isn't required to cause this. I
just saw this again w/ a primary leg failure and the result was a mirror stuck
at 0.00%.

Comment 10 Corey Marthaler 2008-02-20 20:22:18 UTC
I'm seeing this issue while doing resyncing and simple deletion testing for bz
431714. 

This issue is blocking the testing of quite a few bugs as most cmirror testing
eventually results in this. Marking this bug as QA enemy number 1.

Comment 11 Corey Marthaler 2008-02-29 14:58:05 UTC
I reproduced what appears to be this issue last night with the latest rpms. The
cmirror failure test used to hit this bug within 30 mins, but it took over 2
hours last night so it appears that progress is being made.

[Test output after a failure appeared to properly down convert]
Enabling device sdc on taft-01
Enabling device sdc on taft-02
Enabling device sdc on taft-03
Enabling device sdc on taft-04

Recreating PV /dev/sdc1
  WARNING: Volume group helter_skelter is not consistent
  WARNING: Volume Group helter_skelter is not consistent
  WARNING: Volume group helter_skelter is not consistent
Extending the recreated PV back into VG helter_skelter
Since we can't yet up convert existing mirrors, down converting to linear(s)
on taft-02 before re-converting back to original mirror(s)
  Error locking on node taft-02: LV helter_skelter/nonsyncd_log_3legs_1_mimage_1
in use: not deactivating
couldn't down convert mirror to linear


Feb 28 20:47:22 taft-01 lvm[6759]: Locking LV
4OXemQTNnv5zREyz3iwzZVq0BQxVlIz4uI32KyQKcoP215L422r7DcfaV0ptManO PW C (0x9c)
Feb 28 20:47:22 taft-01 clogd[6676]: cpg_mcast_joined error: 11
Feb 28 20:47:22 taft-01 clogd[6676]: cluster_send failed: Invalid exchange
Feb 28 20:47:22 taft-01 clogd[6676]: [OnrNtAna] Error while processing CPG
message, DM_CLOG_GET_RESYNC_WORK: Invalid exchange
Feb 28 20:47:22 taft-01 clogd[6676]: [OnrNtAna]    Response  : NO
Feb 28 20:47:22 taft-01 clogd[6676]: [OnrNtAna]    Originator: 4
Feb 28 20:47:22 taft-01 clogd[6676]: HISTORY::
Feb 28 20:47:22 taft-01 clogd[6676]: 0:70) SEQ#=4747360, UUID=OnrNtAna,
TYPE=(null), ORIG=1, RESP=NO
Feb 28 20:47:22 taft-01 clogd[6676]: 1:71) SEQ#=4747360, UUID=OnrNtAna,
TYPE=DM_CLOG_GET_RESYNC_WORK, ORIG=1, RESP=YES, RSPR=1
Feb 28 20:47:22 taft-01 clogd[6676]: 2:72) SEQ#=4747363, UUID=OnrNtAna,
TYPE=(null), ORIG=1, RESP=NO
Feb 28 20:47:22 taft-01 clogd[6676]: 3:73) SEQ#=4747363, UUID=OnrNtAna,
TYPE=DM_CLOG_CLEAR_REGION, ORIG=1, RESP=YES, RSPR=1

[...]

Feb 28 20:47:22 taft-01 clogd[6676]: 97:67) SEQ#=4747435, UUID=OnrNtAna,
TYPE=DM_CLOG_GET_RESYNC_WORK, ORIG=1, RESP=YES, RSPR=1
Feb 28 20:47:22 taft-01 clogd[6676]: 98:68) SEQ#=4747437, UUID=OnrNtAna,
TYPE=(null), ORIG=1, RESP=NO
Feb 28 20:47:22 taft-01 clogd[6676]: 99:69) SEQ#=4747437, UUID=OnrNtAna,
TYPE=DM_CLOG_GET_RESYNC_WORK, ORIG=1, RESP=YES, RSPR=1
Feb 28 20:47:22 taft-01 clogd[6676]: cpg_mcast_joined error: 11
Feb 28 20:47:22 taft-01 clogd[6676]: cluster_send failed: Invalid exchange
Feb 28 20:47:22 taft-01 clogd[6676]: [OnrNtAna] Error while processing CPG
message, DM_CLOG_GET_RESYNC_WORK: Invalid exchange
Feb 28 20:47:22 taft-01 clogd[6676]: [OnrNtAna]    Response  : NO
Feb 28 20:47:22 taft-01 clogd[6676]: [OnrNtAna]    Originator: 2
Feb 28 20:47:22 taft-01 clogd[6676]: HISTORY::
Feb 28 20:47:22 taft-01 clogd[6676]: 0:70) SEQ#=4747360, UUID=OnrNtAna,
TYPE=(null), ORIG=1, RESP=NO
Feb 28 20:47:22 taft-01 clogd[6676]: 1:71) SEQ#=4747360, UUID=OnrNtAna,
TYPE=DM_CLOG_GET_RESYNC_WORK, ORIG=1, RESP=YES, RSPR=1
Feb 28 20:47:22 taft-01 clogd[6676]: 2:72) SEQ#=4747363, UUID=OnrNtAna,
TYPE=(null), ORIG=1, RESP=NO
Feb 28 20:47:22 taft-01 clogd[6676]: 3:73) SEQ#=4747363, UUID=OnrNtAna,
TYPE=DM_CLOG_CLEAR_REGION, ORIG=1, RESP=YES, RSPR=1


This was with:
2.6.18-71.el5
cmirror-1.1.15-1.el5
kmod-cmirror-0.1.8-1.el5
lvm2-cluster-2.02.32-1.el5

Comment 12 Jonathan Earl Brassow 2008-04-29 14:49:40 UTC
Do the above 'Invalid exchange' messages correspond to the time when the mirror was
down converted?

Can you still reproduce this?

Just to recap, you had been able to reproduce this just by running a create/delete
loop; but now you can only reproduce it with a device failure followed by a down-convert?

Comment 13 Corey Marthaler 2008-05-05 16:15:45 UTC
FYI - Originally, I was only able to reproduce this issue during log device
failure, but I also saw it once during heavy GFS I/O on top of cmirrors (no
failure). 

I'm currently attempting to reproduce this bug...

Comment 14 Corey Marthaler 2008-05-05 16:39:16 UTC
Reproduced this while adding the log back into the mirror, after having killed
it on 2 synced 3-way mirrors.

================================================================================
Iteration 1.2 started at Mon May  5 11:18:48 CDT 2008
================================================================================
Scenario: Kill disk log of synced 3 leg mirror(s)

****** Mirror hash info for this scenario ******
* name:      syncd_log_3legs
* sync:      1
* mirrors:   2
* disklog:   1
* failpv:    /dev/sdg1
* legs:      3
* pvs:       /dev/sdf1 /dev/sdh1 /dev/sde1 /dev/sdg1
************************************************

Creating mirror(s) on taft-01...
taft-01: lvcreate -m 2 -n syncd_log_3legs_1 -L 800M helter_skelter
/dev/sdf1:0-1000 /dev/sdh1:0-1000 /dev/sde1:0-1000 /dev/sdg1:0-150
taft-01: lvcreate -m 2 -n syncd_log_3legs_2 -L 800M helter_skelter
/dev/sdf1:0-1000 /dev/sdh1:0-1000 /dev/sde1:0-1000 /dev/sdg1:0-150

Waiting until all mirrors become fully syncd...
        0/2 mirror(s) are fully synced: ( 1=2.50% 2=0.00% )
        0/2 mirror(s) are fully synced: ( 1=22.50% 2=21.00% )
        0/2 mirror(s) are fully synced: ( 1=42.50% 2=41.50% )
        0/2 mirror(s) are fully synced: ( 1=62.50% 2=63.50% )
        0/2 mirror(s) are fully synced: ( 1=82.50% 2=84.00% )
        2/2 mirror(s) are fully synced: ( 1=100.00% 2=100.00% )

Creating gfs on top of mirror(s) on taft-01...
Mounting mirrored gfs filesystems on taft-01...
Mounting mirrored gfs filesystems on taft-02...
Mounting mirrored gfs filesystems on taft-03...
Mounting mirrored gfs filesystems on taft-04...

Writing verification files (checkit) to mirror(s) on...
        ---- taft-01 ----
        ---- taft-02 ----
        ---- taft-03 ----
        ---- taft-04 ----

Sleeping 12 seconds to get some outstanding I/O locks before the failure

Disabling device sdg on taft-01
Disabling device sdg on taft-02
Disabling device sdg on taft-03
Disabling device sdg on taft-04

Attempting I/O to cause mirror down conversion(s) on taft-01
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.085393 seconds, 491 MB/s
Verifying the down conversion of the failed mirror(s)
  /dev/sdg1: open failed: No such device or address
Verifying LEG device /dev/sdf1 *IS* in the volume(s)
  /dev/sdg1: open failed: No such device or address
Verifying LEG device /dev/sdh1 *IS* in the volume(s)
  /dev/sdg1: open failed: No such device or address
Verifying LEG device /dev/sde1 *IS* in the volume(s)
  /dev/sdg1: open failed: No such device or address
Verifying FAILED device /dev/sdg1 is *NOT* in the volume(s)
  /dev/sdg1: open failed: No such device or address

Verifying files (checkit) on mirror(s) on...
        ---- taft-01 ----
        ---- taft-02 ----
        ---- taft-03 ----
        ---- taft-04 ----

Enabling device sdg on taft-01
Enabling device sdg on taft-02
Enabling device sdg on taft-03
Enabling device sdg on taft-04

Recreating PV /dev/sdg1
  WARNING: Volume group helter_skelter is not consistent
  WARNING: Volume Group helter_skelter is not consistent
  WARNING: Volume group helter_skelter is not consistent
Extending the recreated PV back into VG helter_skelter
Since we can't yet up convert existing mirrors, down converting to linear(s)
on taft-01 before re-converting back to original mirror(s)
Up converting linear(s) back to mirror(s) on taft-01...
taft-01: lvconvert -m 2 -b helter_skelter/syncd_log_3legs_1 /dev/sdf1:0-1000
/dev/sdh1:0-1000 /dev/sde1:0-1000 /dev/sdg1:0-150
taft-01: lvconvert -m 2 -b helter_skelter/syncd_log_3legs_2 /dev/sdf1:0-1000
/dev/sdh1:0-1000 /dev/sde1:0-1000 /dev/sdg1:0-150
Verifying the up conversion from linear(s) to mirror(s)
Verifying device /dev/sdf1 *IS* in the mirror(s)
Verifying device /dev/sdh1 *IS* in the mirror(s)
Verifying device /dev/sde1 *IS* in the mirror(s)
Verifying device /dev/sdg1 *IS* in the mirror(s)

Waiting until all mirrors become fully syncd...
        0/2 mirror(s) are fully synced: ( 1=6.00% 2=4.00% )
        0/2 mirror(s) are fully synced: ( 1=15.50% 2=10.00% )
        0/2 mirror(s) are fully synced: ( 1=15.50% 2=21.50% )
        0/2 mirror(s) are fully synced: ( 1=17.00% 2=41.50% )
        0/2 mirror(s) are fully synced: ( 1=25.00% 2=55.00% )
        0/2 mirror(s) are fully synced: ( 1=45.00% 2=55.00% )
        0/2 mirror(s) are fully synced: ( 1=45.00% 2=73.00% )
        0/2 mirror(s) are fully synced: ( 1=58.00% 2=84.50% )
        0/2 mirror(s) are fully synced: ( 1=81.00% 2=84.50% )
        0/2 mirror(s) are fully synced: ( 1=96.00% 2=84.50% )
        0/2 mirror(s) are fully synced: ( 1=96.00% 2=84.50% )
        1/2 mirror(s) are fully synced: ( 1=100.00% 2=84.50% )
        1/2 mirror(s) are fully synced: ( 1=100.00% 2=84.50% )
        1/2 mirror(s) are fully synced: ( 1=100.00% 2=84.50% )
        1/2 mirror(s) are fully synced: ( 1=100.00% 2=84.50% )
        1/2 mirror(s) are fully synced: ( 1=100.00% 2=84.50% )
        1/2 mirror(s) are fully synced: ( 1=100.00% 2=84.50% )
        1/2 mirror(s) are fully synced: ( 1=100.00% 2=84.50% )

That second mirror got stuck at 84.50%.


taft-01:
May  5 11:21:00 taft-01 qarshd[26671]: Running cmdline: echo running >
/sys/block/sdg/device/state
May  5 11:21:00 taft-01 xinetd[6203]: EXIT: qarsh status=0 pid=26671 duration=0(sec)
May  5 11:21:00 taft-01 xinetd[6203]: START: qarsh pid=26673 from=10.15.80.47
May  5 11:21:00 taft-01 qarshd[26673]: Talking to peer 10.15.80.47:52970
May  5 11:21:00 taft-01 qarshd[26673]: Running cmdline: pvcreate /dev/sdg1
May  5 11:21:11 taft-01 kernel: device-mapper: dm-log-clustered: Request timed
out on DM_CLOG_GET_SYNC_COUNT:741994 - retrying
May  5 11:21:11 taft-01 clogd[22460]: Invalid request_type
May  5 11:21:26 taft-01 kernel: device-mapper: dm-log-clustered: Request timed
out on DM_CLOG_IN_SYNC:742092 - retrying
May  5 11:21:26 taft-01 clogd[22460]: cpg_mcast_joined error: 9
May  5 11:21:26 taft-01 clogd[22460]: cluster_send failed at: local.c:221
(do_local_work)
May  5 11:21:26 taft-01 clogd[22460]: [] Unable to send DM_CLOG_FLUSH to
cluster: Invalid exchange
May  5 11:21:26 taft-01 clogd[22460]: Bad callback on local/4
May  5 11:21:26 taft-01 kernel: device-mapper: dm-log-clustered: Server error
while processing request [DM_CLOG_FLUSH]: -52


taft-02:
May  5 11:20:24 taft-02 qarshd[29957]: Running cmdline: echo running >
/sys/block/sdg/device/state
May  5 11:20:24 taft-02 xinetd[5886]: EXIT: qarsh status=0 pid=29957 duration=0(sec)
May  5 11:20:37 taft-02 kernel: device-mapper: dm-log-clustered: Request timed
out on DM_CLOG_GET_RESYNC_WORK:1488366 - retrying
May  5 11:20:47 taft-02 clogd[25202]: Invalid request_type
May  5 11:20:51 taft-02 lvm[25433]: No longer monitoring mirror device
helter_skelter-syncd_log_3legs_1 for events
May  5 11:20:53 taft-02 lvm[25433]: No longer monitoring mirror device
helter_skelter-syncd_log_3legs_2 for events
May  5 11:21:02 taft-02 kernel: device-mapper: dm-log-clustered: Request timed
out on DM_CLOG_IN_SYNC:1488767 - retrying
May  5 11:21:02 taft-02 clogd[25202]: cpg_mcast_joined error: 11
May  5 11:21:02 taft-02 clogd[25202]: cluster_send failed at: local.c:221
(do_local_work)
May  5 11:21:02 taft-02 clogd[25202]: [W6Subc39] Unable to send
DM_CLOG_CLEAR_REGION to cluster: Invalid exchange
May  5 11:21:02 taft-02 clogd[25202]: Bad callback on local/4
May  5 11:21:02 taft-02 kernel: device-mapper: dm-log-clustered: Server error
while processing request [DM_CLOG_CLEAR_REGION]: -52
May  5 11:21:06 taft-02 [25433]: Monitoring mirror device
helter_skelter-syncd_log_3legs_1 for events
May  5 11:21:08 taft-02 lvm[25433]: Monitoring mirror device
helter_skelter-syncd_log_3legs_2 for events
May  5 11:21:38 taft-02 clogd[25202]: Invalid request_type
May  5 11:21:53 taft-02 kernel: device-mapper: dm-log-clustered: Request timed
out on DM_CLOG_GET_SYNC_COUNT:1551230 - retrying
May  5 11:21:56 taft-02 clogd[25202]: Invalid request_type
May  5 11:22:11 taft-02 kernel: device-mapper: dm-log-clustered: Request timed
out on DM_CLOG_GET_SYNC_COUNT:1613764 - retrying
May  5 11:22:48 taft-02 clogd[25202]: Invalid request_type


taft-03:
May  5 11:20:58 taft-03 qarshd[26345]: Running cmdline: echo running > /sys/block/sdg/device/state
May  5 11:20:58 taft-03 xinetd[6220]: EXIT: qarsh status=0 pid=26345 duration=0(sec)
May  5 11:21:25 taft-03 lvm[22539]: No longer monitoring mirror device helter_skelter-syncd_log_3legs_1 for events
May  5 11:21:27 taft-03 lvm[22539]: No longer monitoring mirror device helter_skelter-syncd_log_3legs_2 for events
May  5 11:21:40 taft-03 [22539]: Monitoring mirror device helter_skelter-syncd_log_3legs_1 for events
May  5 11:21:41 taft-03 lvm[22539]: Monitoring mirror device helter_skelter-syncd_log_3legs_2 for events
May  5 11:21:49 taft-03 clogd[22389]: Invalid request_type
May  5 11:22:04 taft-03 kernel: device-mapper: dm-log-clustered: Request timed out on DM_CLOG_GET_SYNC_COUNT:1272854 - retrying
May  5 11:22:10 taft-03 clogd[22389]: Invalid request_type
May  5 11:22:25 taft-03 kernel: device-mapper: dm-log-clustered: Request timed out on DM_CLOG_GET_SYNC_COUNT:1322379 - retrying
May  5 11:22:26 taft-03 clogd[22389]: cpg_mcast_joined error: 9
May  5 11:22:26 taft-03 clogd[22389]: cluster_send failed at: local.c:221 (do_local_work)
May  5 11:22:26 taft-03 clogd[22389]: [] Unable to send DM_CLOG_GET_RESYNC_WORK to cluster: Invalid exchange
May  5 11:22:26 taft-03 clogd[22389]: Bad callback on local/4
May  5 11:22:26 taft-03 kernel: device-mapper: dm-log-clustered: Server error while processing request [DM_CLOG_GET_RESYNC_WORK]: -52
May  5 11:22:28 taft-03 clogd[22389]: Invalid request_type


taft-04:
May  5 11:20:32 taft-04 qarshd[26535]: Running cmdline: echo running >
/sys/block/sdg/device/state
May  5 11:20:32 taft-04 xinetd[6228]: EXIT: qarsh status=0 pid=26535 duration=0(sec)
May  5 11:20:59 taft-04 lvm[22495]: No longer monitoring mirror device
helter_skelter-syncd_log_3legs_1 for events
May  5 11:21:00 taft-04 lvm[22495]: No longer monitoring mirror device
helter_skelter-syncd_log_3legs_2 for events
May  5 11:21:13 taft-04 [22495]: Monitoring mirror device
helter_skelter-syncd_log_3legs_1 for events
May  5 11:21:15 taft-04 lvm[22495]: Monitoring mirror device
helter_skelter-syncd_log_3legs_2 for events
May  5 11:21:24 taft-04 clogd[22347]: Invalid request_type
May  5 11:21:39 taft-04 kernel: device-mapper: dm-log-clustered: Request timed
out on DM_CLOG_GET_RESYNC_WORK:1973299 - retrying
May  5 11:21:39 taft-04 clogd[22347]: Invalid request_type
May  5 11:21:54 taft-04 kernel: device-mapper: dm-log-clustered: Request timed
out on DM_CLOG_GET_RESYNC_WORK:2005341 - retrying
May  5 11:21:54 taft-04 clogd[22347]: Invalid request_type
May  5 11:22:09 taft-04 kernel: device-mapper: dm-log-clustered: Request timed
out on DM_CLOG_GET_SYNC_COUNT:2168500 - retrying
May  5 11:22:09 taft-04 clogd[22347]: Invalid request_type
May  5 11:22:24 taft-04 kernel: device-mapper: dm-log-clustered: Request timed
out on DM_CLOG_GET_RESYNC_WORK:2274721 - retrying
May  5 11:22:25 taft-04 clogd[22347]: Invalid request_type


Comment 15 Corey Marthaler 2008-05-05 19:19:25 UTC
I was able to reproduce this again, this time with the data corruption mentioned
in comment #3.

May  5 13:01:26 taft-01 kernel: GFS: fsid=TAFT:gfs2.0: fatal: invalid metadata block
May  5 13:01:26 taft-01 kernel: GFS: fsid=TAFT:gfs2.0:   bh = 161804 (type: exp=6, found=0)
May  5 13:01:26 taft-01 kernel: GFS: fsid=TAFT:gfs2.0:   function = get_leaf
May  5 13:01:26 taft-01 kernel: GFS: fsid=TAFT:gfs2.0:   file = /builddir/build/BUILD/gfs-kmod-0.1.23/_kmod_build_/src/gfs/dir.c, line = 435
May  5 13:01:26 taft-01 kernel: GFS: fsid=TAFT:gfs2.0:   time = 1210010486
May  5 13:01:26 taft-01 kernel: GFS: fsid=TAFT:gfs2.0: about to withdraw from the cluster
May  5 13:01:26 taft-01 kernel: GFS: fsid=TAFT:gfs2.0: telling LM to withdraw
May  5 13:01:26 taft-01 xinetd[6225]: EXIT: qarsh status=0 pid=12039 duration=182(sec)
May  5 13:01:26 taft-01 qarshd[12040]: Sending child 12042 signal 2
May  5 13:01:26 taft-01 clogd[6703]: cpg_mcast_joined error: 11
May  5 13:01:26 taft-01 clogd[6703]: cluster_send failed: Invalid exchange
May  5 13:01:26 taft-01 clogd[6703]: [vnMJdfKt] Error while processing CPG message, DM_CLOG_CLEAR_REGION: Invalid exchange
May  5 13:01:26 taft-01 clogd[6703]: [vnMJdfKt]    Response  : NO
May  5 13:01:26 taft-01 clogd[6703]: [vnMJdfKt]    Originator: 1
May  5 13:01:26 taft-01 clogd[6703]: HISTORY::
May  5 13:01:26 taft-01 clogd[6703]: 0:46) SEQ#=917241, UUID=N2m5GWYN, TYPE=(null), ORIG=1, RESP=NO
May  5 13:01:26 taft-01 clogd[6703]: 1:47) SEQ#=917241, UUID=N2m5GWYN, TYPE=DM_CLOG_GET_RESYNC_WORK, ORIG=1, RESP=YES, RSPR=1
May  5 13:01:26 taft-01 clogd[6703]: 2:48) SEQ#=917243, UUID=N2m5GWYN, TYPE=(null), ORIG=1, RESP=NO
May  5 13:01:26 taft-01 clogd[6703]: 3:49) SEQ#=917243, UUID=N2m5GWYN, TYPE=DM_CLOG_GET_RESYNC_WORK, ORIG=1, RESP=YES, RSPR=1
May  5 13:01:26 taft-01 clogd[6703]: 4:50) SEQ#=917245, UUID=N2m5GWYN, TYPE=(null), ORIG=1, RESP=NO

Comment 16 Jonathan Earl Brassow 2008-05-09 04:21:45 UTC
OpenAIS is giving us messages after a cpg_leave, and when we try to process them
(do a cluster_send), it tells us that we have disconnected.  The solution is to
ignore messages received after cpg_leave...  I don't think AIS should do this in
the first place...
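
A minimal sketch of the workaround described above, assuming the openais CPG API: set a flag before cpg_leave() and drop anything the deliver callback still receives afterwards. Names such as log_request_deliver and the leaving flag are illustrative, not the actual clogd change, and exact callback signatures vary slightly between openais and corosync releases.

#include <stdint.h>
#include <stdio.h>
#include <openais/cpg.h>   /* <corosync/cpg.h> on later cluster stacks */

/*
 * Illustrative sketch of the workaround, not the actual clogd patch:
 * remember that we have asked to leave the group and silently ignore
 * anything OpenAIS still delivers afterwards, instead of attempting a
 * cluster_send() on a connection that is going away.
 */
static int leaving = 0;

static void log_request_deliver(cpg_handle_t handle,
                                struct cpg_name *group,
                                uint32_t nodeid, uint32_t pid,
                                void *msg, int msg_len)
{
        if (leaving) {
                fprintf(stderr, "Ignoring %d-byte CPG message received "
                        "after cpg_leave\n", msg_len);
                return;
        }
        /* ... normal request processing / cluster_send() would go here ... */
}

static cpg_callbacks_t callbacks = {
        .cpg_deliver_fn = log_request_deliver,
        .cpg_confchg_fn = NULL,
};

static int leave_group(cpg_handle_t handle, struct cpg_name *group)
{
        leaving = 1;                      /* flag first, then leave */
        return (cpg_leave(handle, group) == CPG_OK) ? 0 : -1;
}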



Comment 19 Corey Marthaler 2008-08-21 15:45:21 UTC
This isn't seen anymore, marking verified.

Comment 21 errata-xmlrpc 2009-01-20 21:25:16 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-0158.html

