Description of problem:
Saw these CPG error messages during log device failure testing. In this case, the result was a deadlocked (not just really slow) mirror.

cmirror-1.1.13-1.el5/kmod-cmirror-0.1.6-1.el5

Feb 8 11:40:57 taft-04 kernel: sd 1:0:0:7: rejecting I/O to offline device
Feb 8 11:40:57 taft-04 lvm[7071]: No longer monitoring mirror device helter_skelter-syncd_log_3legs_1 for events
Feb 8 11:41:12 taft-04 kernel: device-mapper: dm-log-clustered: Request timed out on DM_CLOG_GET_RESYNC_WORK:1068855 - retrying
Feb 8 11:41:12 taft-04 clogd[6767]: rw_log: write failure: Input/output error
Feb 8 11:41:12 taft-04 clogd[6767]: Error writing to disk log
Feb 8 11:41:12 taft-04 kernel: sd 1:0:0:7: rejecting I/O to offline device
Feb 8 11:41:12 taft-04 kernel: device-mapper: dm-log-clustered: Server error while processing request [DM_CLOG_FLUSH]: -5
Feb 8 11:41:12 taft-04 lvm[7071]: Monitoring mirror device helter_skelter-syncd_log_3legs_1 for events
Feb 8 11:41:13 taft-04 clogd[6767]: cpg_mcast_joined error: 9
Feb 8 11:41:13 taft-04 clogd[6767]: cluster_send failed at: local.c:212 (do_local_work)
Feb 8 11:41:13 taft-04 clogd[6767]: [] Unable to send (null) to cluster: Invalid exchange
Feb 8 11:41:13 taft-04 clogd[6767]: Bad callback on local/4
Feb 8 11:41:13 taft-04 kernel: device-mapper: dm-log-clustered: Stray request returned: <NULL>, 0
Feb 8 11:41:14 taft-04 lvm[7071]: No longer monitoring mirror device helter_skelter-syncd_log_3legs_2 for events
Feb 8 11:41:28 taft-04 kernel: device-mapper: dm-log-clustered: Request timed out on DM_CLOG_IN_SYNC:1077480 - retrying
Feb 8 11:41:28 taft-04 kernel: device-mapper: dm-log-clustered: Server error while processing request [DM_CLOG_FLUSH]: -5
Feb 8 11:41:28 taft-04 kernel: device-mapper: dm-log-clustered: Server error while processing request [DM_CLOG_FLUSH]: -5
Feb 8 11:41:28 taft-04 clogd[6767]: [QbkqfJyQ] No match for cluster response: DM_CLOG_IS_REMOTE_RECOVERING:1083771
Feb 8 11:41:28 taft-04 clogd[6767]: Current list:
Feb 8 11:41:28 taft-04 clogd[6767]: [none]
Feb 8 11:41:28 taft-04 clogd[6767]: [QbkqfJyQ] Error while processing CPG message, DM_CLOG_IS_REMOTE_RECOVERING: Invalid argument
Feb 8 11:41:28 taft-04 clogd[6767]: [QbkqfJyQ] Response : YES
Feb 8 11:41:28 taft-04 clogd[6767]: [QbkqfJyQ] Originator: 4
Feb 8 11:41:28 taft-04 clogd[6767]: [QbkqfJyQ] Responder : 1
Feb 8 11:41:28 taft-04 clogd[6767]: HISTORY::
Feb 8 11:41:28 taft-04 clogd[6767]: 0:10) SEQ#=1083769, UUID=2RP77UEX, TYPE=DM_CLOG_FLUSH, ORIG=4, RESP=YES, RSPR=1
Feb 8 11:41:28 taft-04 clogd[6767]: 1:11) SEQ#=777788, UUID=QbkqfJyQ, TYPE=DM_CLOG_IS_REMOTE_RECOVERING, ORIG=1, RESP=NO
Feb 8 11:41:28 taft-04 clogd[6767]: 2:12) SEQ#=1083770, UUID=QbkqfJyQ, TYPE=DM_CLOG_IS_REMOTE_RECOVERING, ORIG=4, RESP=YES, RSPR=1
Feb 8 11:41:28 taft-04 clogd[6767]: 3:13) SEQ#=758142, UUID=QbkqfJyQ, TYPE=DM_CLOG_IS_REMOTE_RECOVERING, ORIG=2, RESP=NO
Feb 8 11:41:28 taft-04 clogd[6767]: 4:14) SEQ#=777789, UUID=QbkqfJyQ, TYPE=DM_CLOG_IS_REMOTE_RECOVERING, ORIG=1, RESP=NO
Feb 8 11:41:28 taft-04 clogd[6767]: 5:15) SEQ#=758143, UUID=QbkqfJyQ, TYPE=DM_CLOG_IS_REMOTE_RECOVERING, ORIG=2, RESP=NO
Feb 8 11:41:28 taft-04 clogd[6767]: 6:16) SEQ#=777790, UUID=QbkqfJyQ, TYPE=DM_CLOG_IS_REMOTE_RECOVERING, ORIG=1, RESP=NO
Feb 8 11:41:28 taft-04 clogd[6767]: 7:17) SEQ#=758144, UUID=QbkqfJyQ, TYPE=DM_CLOG_IS_REMOTE_RECOVERING, ORIG=2, RESP=NO
Feb 8 11:41:28 taft-04 clogd[6767]: 8:18) SEQ#=777791, UUID=QbkqfJyQ, TYPE=DM_CLOG_IS_REMOTE_RECOVERING, ORIG=1, RESP=NO
Feb 8 11:41:28 taft-04 clogd[6767]: 9:19) SEQ#=758145, UUID=QbkqfJyQ, TYPE=DM_CLOG_IS_REMOTE_RECOVERING, ORIG=2, RESP=NO
Feb 8 11:41:28 taft-04 clogd[6767]: 10:0) SEQ#=777792, UUID=QbkqfJyQ, TYPE=DM_CLOG_IS_REMOTE_RECOVERING, ORIG=1, RESP=NO
Feb 8 11:41:28 taft-04 clogd[6767]: 11:1) SEQ#=758146, UUID=QbkqfJyQ, TYPE=DM_CLOG_IS_REMOTE_RECOVERING, ORIG=2, RESP=NO
Feb 8 11:41:28 taft-04 clogd[6767]: 12:2) SEQ#=777793, UUID=QbkqfJyQ, TYPE=DM_CLOG_IS_REMOTE_RECOVERING, ORIG=1, RESP=NO
Feb 8 11:41:28 taft-04 clogd[6767]: 13:3) SEQ#=758147, UUID=QbkqfJyQ, TYPE=DM_CLOG_IS_REMOTE_RECOVERING, ORIG=2, RESP=NO
Feb 8 11:41:28 taft-04 clogd[6767]: 14:4) SEQ#=777794, UUID=QbkqfJyQ, TYPE=DM_CLOG_IS_REMOTE_RECOVERING, ORIG=1, RESP=NO
Feb 8 11:41:28 taft-04 clogd[6767]: 15:5) SEQ#=758148, UUID=QbkqfJyQ, TYPE=DM_CLOG_IS_REMOTE_RECOVERING, ORIG=2, RESP=NO
Feb 8 11:41:28 taft-04 clogd[6767]: 16:6) SEQ#=777795, UUID=QbkqfJyQ, TYPE=DM_CLOG_IS_REMOTE_RECOVERING, ORIG=1, RESP=NO
Feb 8 11:41:28 taft-04 clogd[6767]: 17:7) SEQ#=758149, UUID=QbkqfJyQ, TYPE=DM_CLOG_IS_REMOTE_RECOVERING, ORIG=2, RESP=NO
Feb 8 11:41:28 taft-04 clogd[6767]: 18:8) SEQ#=1083771, UUID=QbkqfJyQ, TYPE=DM_CLOG_IS_REMOTE_RECOVERING, ORIG=4, RESP=NO
Feb 8 11:41:28 taft-04 clogd[6767]: 19:9) SEQ#=777796, UUID=QbkqfJyQ, TYPE=DM_CLOG_IS_REMOTE_RECOVERING, ORIG=1, RESP=NO
Feb 8 11:41:43 taft-04 kernel: device-mapper: dm-log-clustered: Request timed out on DM_CLOG_IS_REMOTE_RECOVERING:1083771 - retrying
Feb 8 11:41:44 taft-04 lvm[7071]: Monitoring mirror device helter_skelter-syncd_log_3legs_2 for events
Feb 8 11:42:24 taft-04 clogd[6767]: cpg_mcast_joined error: 9
Feb 8 11:42:24 taft-04 clogd[6767]: cluster_send failed at: local.c:212 (do_local_work)
Feb 8 11:42:24 taft-04 clogd[6767]: [] Unable to send (null) to cluster: Invalid exchange
Feb 8 11:42:24 taft-04 clogd[6767]: Bad callback on local/4
Feb 8 11:42:24 taft-04 kernel: device-mapper: dm-log-clustered: Stray request returned: <NULL>, 0
Feb 8 11:42:39 taft-04 kernel: device-mapper: dm-log-clustered: Request timed out on DM_CLOG_IS_REMOTE_RECOVERING:1102107 - retrying
[...]
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
I was able to reproduce these CPG messages during log failure testing again, and this time there was apparent data corruption as well.

Senario: Kill disk log of synced 2 leg mirror(s)

****** Mirror hash info for this scenario ******
* name:    syncd_log_2legs
* sync:    1
* mirrors: 2
* disklog: 1
* failpv:  /dev/sde1
* legs:    2
* pvs:     /dev/sdh1 /dev/sdg1 /dev/sde1
************************************************

Creating mirror(s) on taft-02...
taft-02: lvcreate -m 1 -n syncd_log_2legs_1 -L 800M helter_skelter /dev/sdh1:0-1000 /dev/sdg1:0-1000 /dev/sde1:0-150
taft-02: lvcreate -m 1 -n syncd_log_2legs_2 -L 800M helter_skelter /dev/sdh1:0-1000 /dev/sdg1:0-1000 /dev/sde1:0-150

Waiting until all mirrors become fully syncd...
0/2 mirror(s) are fully synced: ( 1=6.00% 2=0.50% )
0/2 mirror(s) are fully synced: ( 1=24.00% 2=18.50% )
0/2 mirror(s) are fully synced: ( 1=41.00% 2=35.50% )
0/2 mirror(s) are fully synced: ( 1=56.50% 2=51.00% )
0/2 mirror(s) are fully synced: ( 1=72.00% 2=66.50% )
0/2 mirror(s) are fully synced: ( 1=88.00% 2=82.00% )
2/2 mirror(s) are fully synced: ( 1=100.00% 2=100.00% )

Creating gfs on top of mirror(s) on taft-02...
Mounting mirrored gfs filesystems on taft-01...
Mounting mirrored gfs filesystems on taft-02...
Mounting mirrored gfs filesystems on taft-03...
Mounting mirrored gfs filesystems on taft-04...
Writing verification files (checkit) to mirror(s) on...

Disabling device sde on taft-01
Disabling device sde on taft-02
Disabling device sde on taft-03
Disabling device sde on taft-04

Attempting I/O to cause mirror down conversion(s) on taft-02
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.087572 seconds, 479 MB/s

Verifying the down conversion of the failed mirror(s)
  /dev/sde1: open failed: No such device or address
Verifying LEG device /dev/sdh1 *IS* in the volume(s)
  /dev/sde1: open failed: No such device or address
Verifying LEG device /dev/sdg1 *IS* in the volume(s)
  /dev/sde1: open failed: No such device or address
Verifying FAILED device /dev/sde1 is *NOT* in the volume(s)
  /dev/sde1: open failed: No such device or address

Verifying files (checkit) on mirror(s) on...
<fail name="taft-01_2" pid="1563" time="Mon Feb 11 21:23:17 2008" type="cmd" duration="1202786597" ec="1" />
ALL STOP!
Can not stat jtdkqxtobtydyhhgxaujlprikeijwdsia: Input/output error
checkit write verify failed
run_iteration() method failed at /home/msp/cmarthal/work/rhel5/sts-root/lib/FI_engine.pm line 21.

I'll post the syslogs...
Created attachment 294664 [details] log from taft-01
Created attachment 294665 [details] log from taft-02
Created attachment 294666 [details] log from taft-03
Created attachment 294667 [details] log from taft-04
Although I have not seen an I/O deadlock yet, I can reproduce these messages without any disk failure. All I was doing was running I/O to 4 cmirrors with GFS on top.

Feb 14 10:06:57 taft-01 clogd[6683]: cpg_mcast_joined error: 9
Feb 14 10:06:57 taft-01 clogd[6683]: cluster_send failed at: local.c:212 (do_local_work)
Feb 14 10:06:57 taft-01 clogd[6683]: [] Unable to send (null) to cluster: Invalid exchange
Feb 14 10:06:57 taft-01 clogd[6683]: Bad callback on local/4
Feb 14 10:06:57 taft-01 kernel: device-mapper: dm-log-clustered: Stray request returned: <NULL>, 0
device-mapper: dm-log-clustered: Request timed out on DM_CLOG_IN_SYNC:508917 - retrying
device-mapper: dm-log-clustered: Stray request returned: <NULL>, 0
Feb 14 10:07:12 taft-01 kernel: device-mapper: dm-log-clustered: Request timed out on DM_CLOG_IN_SYNC:508917 - retrying
Updating summary to reflect that a log failure isn't required to cause this. I just saw this again with a primary leg failure, and the result was a mirror stuck at 0.00%.
I'm seeing this issue while doing resyncing and simple deletion testing for bz 431714. This issue is blocking the testing of quite a few bugs, as most cmirror testing eventually results in this. Marking this bug as QA enemy number 1.
I reproduced what appears to be this issue last night with the latest rpms. The cmirror failure test used to hit this bug within 30 mins, but it took over 2 hours last night, so it appears that progress is being made.

[Test output after a failure appeared to properly down convert]

Enabling device sdc on taft-01
Enabling device sdc on taft-02
Enabling device sdc on taft-03
Enabling device sdc on taft-04

Recreating PV /dev/sdc1
WARNING: Volume group helter_skelter is not consistent
WARNING: Volume Group helter_skelter is not consistent
WARNING: Volume group helter_skelter is not consistent
Extending the recreated PV back into VG helter_skelter
Since we can't yet up convert existing mirrors, down converting to linear(s) on taft-02 before re-converting back to original mirror(s)
Error locking on node taft-02: LV helter_skelter/nonsyncd_log_3legs_1_mimage_1 in use: not deactivating
couldn't down convert mirror to linear

Feb 28 20:47:22 taft-01 lvm[6759]: Locking LV 4OXemQTNnv5zREyz3iwzZVq0BQxVlIz4uI32KyQKcoP215L422r7DcfaV0ptManO PW C (0x9c)
Feb 28 20:47:22 taft-01 clogd[6676]: cpg_mcast_joined error: 11
Feb 28 20:47:22 taft-01 clogd[6676]: cluster_send failed: Invalid exchange
Feb 28 20:47:22 taft-01 clogd[6676]: [OnrNtAna] Error while processing CPG message, DM_CLOG_GET_RESYNC_WORK: Invalid exchange
Feb 28 20:47:22 taft-01 clogd[6676]: [OnrNtAna] Response : NO
Feb 28 20:47:22 taft-01 clogd[6676]: [OnrNtAna] Originator: 4
Feb 28 20:47:22 taft-01 clogd[6676]: HISTORY::
Feb 28 20:47:22 taft-01 clogd[6676]: 0:70) SEQ#=4747360, UUID=OnrNtAna, TYPE=(null), ORIG=1, RESP=NO
Feb 28 20:47:22 taft-01 clogd[6676]: 1:71) SEQ#=4747360, UUID=OnrNtAna, TYPE=DM_CLOG_GET_RESYNC_WORK, ORIG=1, RESP=YES, RSPR=1
Feb 28 20:47:22 taft-01 clogd[6676]: 2:72) SEQ#=4747363, UUID=OnrNtAna, TYPE=(null), ORIG=1, RESP=NO
Feb 28 20:47:22 taft-01 clogd[6676]: 3:73) SEQ#=4747363, UUID=OnrNtAna, TYPE=DM_CLOG_CLEAR_REGION, ORIG=1, RESP=YES, RSPR=1
[...]
Feb 28 20:47:22 taft-01 clogd[6676]: 97:67) SEQ#=4747435, UUID=OnrNtAna, TYPE=DM_CLOG_GET_RESYNC_WORK, ORIG=1, RESP=YES, RSPR=1
Feb 28 20:47:22 taft-01 clogd[6676]: 98:68) SEQ#=4747437, UUID=OnrNtAna, TYPE=(null), ORIG=1, RESP=NO
Feb 28 20:47:22 taft-01 clogd[6676]: 99:69) SEQ#=4747437, UUID=OnrNtAna, TYPE=DM_CLOG_GET_RESYNC_WORK, ORIG=1, RESP=YES, RSPR=1
Feb 28 20:47:22 taft-01 clogd[6676]: cpg_mcast_joined error: 11
Feb 28 20:47:22 taft-01 clogd[6676]: cluster_send failed: Invalid exchange
Feb 28 20:47:22 taft-01 clogd[6676]: [OnrNtAna] Error while processing CPG message, DM_CLOG_GET_RESYNC_WORK: Invalid exchange
Feb 28 20:47:22 taft-01 clogd[6676]: [OnrNtAna] Response : NO
Feb 28 20:47:22 taft-01 clogd[6676]: [OnrNtAna] Originator: 2
Feb 28 20:47:22 taft-01 clogd[6676]: HISTORY::
Feb 28 20:47:22 taft-01 clogd[6676]: 0:70) SEQ#=4747360, UUID=OnrNtAna, TYPE=(null), ORIG=1, RESP=NO
Feb 28 20:47:22 taft-01 clogd[6676]: 1:71) SEQ#=4747360, UUID=OnrNtAna, TYPE=DM_CLOG_GET_RESYNC_WORK, ORIG=1, RESP=YES, RSPR=1
Feb 28 20:47:22 taft-01 clogd[6676]: 2:72) SEQ#=4747363, UUID=OnrNtAna, TYPE=(null), ORIG=1, RESP=NO
Feb 28 20:47:22 taft-01 clogd[6676]: 3:73) SEQ#=4747363, UUID=OnrNtAna, TYPE=DM_CLOG_CLEAR_REGION, ORIG=1, RESP=YES, RSPR=1

This was with:
2.6.18-71.el5
cmirror-1.1.15-1.el5
kmod-cmirror-0.1.8-1.el5
lvm2-cluster-2.02.32-1.el5
Do the above 'Invalid exchange' messages correspond to the time when the mirror was down converted? Can you still reproduce this? Just to recap: you had been able to reproduce this just by running a create/delete loop, but now you can only reproduce it with a device failure followed by a down-convert?
FYI - Originally, I was only able to reproduce this issue during log device failure, but I also saw it once during heavy GFS I/O on top of cmirrors (no failure). I'm currently attempting to reproduce this bug...
Reproduced this while adding the log back into the mirror, after having killed it on 2 synced 3-way mirrors.

================================================================================
Iteration 1.2 started at Mon May 5 11:18:48 CDT 2008
================================================================================
Senario: Kill disk log of synced 3 leg mirror(s)

****** Mirror hash info for this scenario ******
* name:    syncd_log_3legs
* sync:    1
* mirrors: 2
* disklog: 1
* failpv:  /dev/sdg1
* legs:    3
* pvs:     /dev/sdf1 /dev/sdh1 /dev/sde1 /dev/sdg1
************************************************

Creating mirror(s) on taft-01...
taft-01: lvcreate -m 2 -n syncd_log_3legs_1 -L 800M helter_skelter /dev/sdf1:0-1000 /dev/sdh1:0-1000 /dev/sde1:0-1000 /dev/sdg1:0-150
taft-01: lvcreate -m 2 -n syncd_log_3legs_2 -L 800M helter_skelter /dev/sdf1:0-1000 /dev/sdh1:0-1000 /dev/sde1:0-1000 /dev/sdg1:0-150

Waiting until all mirrors become fully syncd...
0/2 mirror(s) are fully synced: ( 1=2.50% 2=0.00% )
0/2 mirror(s) are fully synced: ( 1=22.50% 2=21.00% )
0/2 mirror(s) are fully synced: ( 1=42.50% 2=41.50% )
0/2 mirror(s) are fully synced: ( 1=62.50% 2=63.50% )
0/2 mirror(s) are fully synced: ( 1=82.50% 2=84.00% )
2/2 mirror(s) are fully synced: ( 1=100.00% 2=100.00% )

Creating gfs on top of mirror(s) on taft-01...
Mounting mirrored gfs filesystems on taft-01...
Mounting mirrored gfs filesystems on taft-02...
Mounting mirrored gfs filesystems on taft-03...
Mounting mirrored gfs filesystems on taft-04...

Writing verification files (checkit) to mirror(s) on...
---- taft-01 ----
---- taft-02 ----
---- taft-03 ----
---- taft-04 ----

Sleeping 12 seconds to get some outsanding I/O locks before the failure

Disabling device sdg on taft-01
Disabling device sdg on taft-02
Disabling device sdg on taft-03
Disabling device sdg on taft-04

Attempting I/O to cause mirror down conversion(s) on taft-01
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.085393 seconds, 491 MB/s

Verifying the down conversion of the failed mirror(s)
  /dev/sdg1: open failed: No such device or address
Verifying LEG device /dev/sdf1 *IS* in the volume(s)
  /dev/sdg1: open failed: No such device or address
Verifying LEG device /dev/sdh1 *IS* in the volume(s)
  /dev/sdg1: open failed: No such device or address
Verifying LEG device /dev/sde1 *IS* in the volume(s)
  /dev/sdg1: open failed: No such device or address
Verifying FAILED device /dev/sdg1 is *NOT* in the volume(s)
  /dev/sdg1: open failed: No such device or address

Verifying files (checkit) on mirror(s) on...
---- taft-01 ----
---- taft-02 ----
---- taft-03 ----
---- taft-04 ----

Enabling device sdg on taft-01
Enabling device sdg on taft-02
Enabling device sdg on taft-03
Enabling device sdg on taft-04

Recreating PV /dev/sdg1
WARNING: Volume group helter_skelter is not consistent
WARNING: Volume Group helter_skelter is not consistent
WARNING: Volume group helter_skelter is not consistent
Extending the recreated PV back into VG helter_skelter
Since we can't yet up convert existing mirrors, down converting to linear(s) on taft-01 before re-converting back to original mirror(s)
Up converting linear(s) back to mirror(s) on taft-01...
taft-01: lvconvert -m 2 -b helter_skelter/syncd_log_3legs_1 /dev/sdf1:0-1000 /dev/sdh1:0-1000 /dev/sde1:0-1000 /dev/sdg1:0-150
taft-01: lvconvert -m 2 -b helter_skelter/syncd_log_3legs_2 /dev/sdf1:0-1000 /dev/sdh1:0-1000 /dev/sde1:0-1000 /dev/sdg1:0-150

Verifying the up conversion from linear(s) to mirror(s)
Verifying device /dev/sdf1 *IS* in the mirror(s)
Verifying device /dev/sdh1 *IS* in the mirror(s)
Verifying device /dev/sde1 *IS* in the mirror(s)
Verifying device /dev/sdg1 *IS* in the mirror(s)

Waiting until all mirrors become fully syncd...
0/2 mirror(s) are fully synced: ( 1=6.00% 2=4.00% )
0/2 mirror(s) are fully synced: ( 1=15.50% 2=10.00% )
0/2 mirror(s) are fully synced: ( 1=15.50% 2=21.50% )
0/2 mirror(s) are fully synced: ( 1=17.00% 2=41.50% )
0/2 mirror(s) are fully synced: ( 1=25.00% 2=55.00% )
0/2 mirror(s) are fully synced: ( 1=45.00% 2=55.00% )
0/2 mirror(s) are fully synced: ( 1=45.00% 2=73.00% )
0/2 mirror(s) are fully synced: ( 1=58.00% 2=84.50% )
0/2 mirror(s) are fully synced: ( 1=81.00% 2=84.50% )
0/2 mirror(s) are fully synced: ( 1=96.00% 2=84.50% )
0/2 mirror(s) are fully synced: ( 1=96.00% 2=84.50% )
1/2 mirror(s) are fully synced: ( 1=100.00% 2=84.50% )
1/2 mirror(s) are fully synced: ( 1=100.00% 2=84.50% )
1/2 mirror(s) are fully synced: ( 1=100.00% 2=84.50% )
1/2 mirror(s) are fully synced: ( 1=100.00% 2=84.50% )
1/2 mirror(s) are fully synced: ( 1=100.00% 2=84.50% )
1/2 mirror(s) are fully synced: ( 1=100.00% 2=84.50% )
1/2 mirror(s) are fully synced: ( 1=100.00% 2=84.50% )

That second mirror got stuck at 84.50%.

taft-01:
May 5 11:21:00 taft-01 qarshd[26671]: Running cmdline: echo running > /sys/block/sdg/device/state
May 5 11:21:00 taft-01 xinetd[6203]: EXIT: qarsh status=0 pid=26671 duration=0(sec)
May 5 11:21:00 taft-01 xinetd[6203]: START: qarsh pid=26673 from=10.15.80.47
May 5 11:21:00 taft-01 qarshd[26673]: Talking to peer 10.15.80.47:52970
May 5 11:21:00 taft-01 qarshd[26673]: Running cmdline: pvcreate /dev/sdg1
May 5 11:21:11 taft-01 kernel: device-mapper: dm-log-clustered: Request timed out on DM_CLOG_GET_SYNC_COUNT:741994 - retrying
May 5 11:21:11 taft-01 clogd[22460]: Invalid request_type
May 5 11:21:26 taft-01 kernel: device-mapper: dm-log-clustered: Request timed out on DM_CLOG_IN_SYNC:742092 - retrying
May 5 11:21:26 taft-01 clogd[22460]: cpg_mcast_joined error: 9
May 5 11:21:26 taft-01 clogd[22460]: cluster_send failed at: local.c:221 (do_local_work)
May 5 11:21:26 taft-01 clogd[22460]: [] Unable to send DM_CLOG_FLUSH to cluster: Invalid exchange
May 5 11:21:26 taft-01 clogd[22460]: Bad callback on local/4
May 5 11:21:26 taft-01 kernel: device-mapper: dm-log-clustered: Server error while processing request [DM_CLOG_FLUSH]: -52

taft-02:
May 5 11:20:24 taft-02 qarshd[29957]: Running cmdline: echo running > /sys/block/sdg/device/state
May 5 11:20:24 taft-02 xinetd[5886]: EXIT: qarsh status=0 pid=29957 duration=0(sec)
May 5 11:20:37 taft-02 kernel: device-mapper: dm-log-clustered: Request timed out on DM_CLOG_GET_RESYNC_WORK:1488366 - retrying
May 5 11:20:47 taft-02 clogd[25202]: Invalid request_type
May 5 11:20:51 taft-02 lvm[25433]: No longer monitoring mirror device helter_skelter-syncd_log_3legs_1 for events
May 5 11:20:53 taft-02 lvm[25433]: No longer monitoring mirror device helter_skelter-syncd_log_3legs_2 for events
May 5 11:21:02 taft-02 kernel: device-mapper: dm-log-clustered: Request timed out on DM_CLOG_IN_SYNC:1488767 - retrying
May 5 11:21:02 taft-02 clogd[25202]: cpg_mcast_joined error: 11
May 5 11:21:02 taft-02 clogd[25202]: cluster_send failed at: local.c:221 (do_local_work)
May 5 11:21:02 taft-02 clogd[25202]: [W6Subc39] Unable to send DM_CLOG_CLEAR_REGION to cluster: Invalid exchange
May 5 11:21:02 taft-02 clogd[25202]: Bad callback on local/4
May 5 11:21:02 taft-02 kernel: device-mapper: dm-log-clustered: Server error while processing request [DM_CLOG_CLEAR_REGION]: -52
May 5 11:21:06 taft-02 [25433]: Monitoring mirror device helter_skelter-syncd_log_3legs_1 for events
May 5 11:21:08 taft-02 lvm[25433]: Monitoring mirror device helter_skelter-syncd_log_3legs_2 for events
May 5 11:21:38 taft-02 clogd[25202]: Invalid request_type
May 5 11:21:53 taft-02 kernel: device-mapper: dm-log-clustered: Request timed out on DM_CLOG_GET_SYNC_COUNT:1551230 - retrying
May 5 11:21:56 taft-02 clogd[25202]: Invalid request_type
May 5 11:22:11 taft-02 kernel: device-mapper: dm-log-clustered: Request timed out on DM_CLOG_GET_SYNC_COUNT:1613764 - retrying
May 5 11:22:48 taft-02 clogd[25202]: Invalid request_type

taft-03:
May 5 11:20:58 taft-03 qarshd[26345]: Running cmdline: echo running > /sys/block/sdg/device/state
May 5 11:20:58 taft-03 xinetd[6220]: EXIT: qarsh status=0 pid=26345 duration=0(sec)
May 5 11:21:25 taft-03 lvm[22539]: No longer monitoring mirror device helter_skelter-syncd_log_3legs_1 for events
May 5 11:21:27 taft-03 lvm[22539]: No longer monitoring mirror device helter_skelter-syncd_log_3legs_2 for events
May 5 11:21:40 taft-03 [22539]: Monitoring mirror device helter_skelter-syncd_log_3legs_1 for events
May 5 11:21:41 taft-03 lvm[22539]: Monitoring mirror device helter_skelter-syncd_log_3legs_2 for events
May 5 11:21:49 taft-03 clogd[22389]: Invalid request_type
May 5 11:22:04 taft-03 kernel: device-mapper: dm-log-clustered: Request timed out on DM_CLOG_GET_SYNC_COUNT:1272854 - retrying
May 5 11:22:10 taft-03 clogd[22389]: Invalid request_type
May 5 11:22:25 taft-03 kernel: device-mapper: dm-log-clustered: Request timed out on DM_CLOG_GET_SYNC_COUNT:1322379 - retrying
May 5 11:22:26 taft-03 clogd[22389]: cpg_mcast_joined error: 9
May 5 11:22:26 taft-03 clogd[22389]: cluster_send failed at: local.c:221 (do_local_work)
May 5 11:22:26 taft-03 clogd[22389]: [] Unable to send DM_CLOG_GET_RESYNC_WORK to cluster: Invalid exchange
May 5 11:22:26 taft-03 clogd[22389]: Bad callback on local/4
May 5 11:22:26 taft-03 kernel: device-mapper: dm-log-clustered: Server error while processing request [DM_CLOG_GET_RESYNC_WORK]: -52
May 5 11:22:28 taft-03 clogd[22389]: Invalid request_type

taft-04:
May 5 11:20:32 taft-04 qarshd[26535]: Running cmdline: echo running > /sys/block/sdg/device/state
May 5 11:20:32 taft-04 xinetd[6228]: EXIT: qarsh status=0 pid=26535 duration=0(sec)
May 5 11:20:59 taft-04 lvm[22495]: No longer monitoring mirror device helter_skelter-syncd_log_3legs_1 for events
May 5 11:21:00 taft-04 lvm[22495]: No longer monitoring mirror device helter_skelter-syncd_log_3legs_2 for events
May 5 11:21:13 taft-04 [22495]: Monitoring mirror device helter_skelter-syncd_log_3legs_1 for events
May 5 11:21:15 taft-04 lvm[22495]: Monitoring mirror device helter_skelter-syncd_log_3legs_2 for events
May 5 11:21:24 taft-04 clogd[22347]: Invalid request_type
May 5 11:21:39 taft-04 kernel: device-mapper: dm-log-clustered: Request timed out on DM_CLOG_GET_RESYNC_WORK:1973299 - retrying
May 5 11:21:39 taft-04 clogd[22347]: Invalid request_type
May 5 11:21:54 taft-04 kernel: device-mapper: dm-log-clustered: Request timed out on DM_CLOG_GET_RESYNC_WORK:2005341 - retrying
May 5 11:21:54 taft-04 clogd[22347]: Invalid request_type
May 5 11:22:09 taft-04 kernel: device-mapper: dm-log-clustered: Request timed out on DM_CLOG_GET_SYNC_COUNT:2168500 - retrying
May 5 11:22:09 taft-04 clogd[22347]: Invalid request_type
May 5 11:22:24 taft-04 kernel: device-mapper: dm-log-clustered: Request timed out on DM_CLOG_GET_RESYNC_WORK:2274721 - retrying
May 5 11:22:25 taft-04 clogd[22347]: Invalid request_type
I was able to reproduce this again, this time with the data corruption mentioned in comment #3.

May 5 13:01:26 taft-01 kernel: GFS: fsid=TAFT:gfs2.0: fatal: invalid metadata block
May 5 13:01:26 taft-01 kernel: GFS: fsid=TAFT:gfs2.0: bh = 161804 (type: exp=6, found=0)
May 5 13:01:26 taft-01 kernel: GFS: fsid=TAFT:gfs2.0: function = get_leaf
May 5 13:01:26 taft-01 kernel: GFS: fsid=TAFT:gfs2.0: file = /builddir/build/BUILD/gfs-kmod-0.1.23/_kmod_build_/src/gfs/dir.c, line = 435
May 5 13:01:26 taft-01 kernel: GFS: fsid=TAFT:gfs2.0: time = 1210010486
May 5 13:01:26 taft-01 kernel: GFS: fsid=TAFT:gfs2.0: about to withdraw from the cluster
May 5 13:01:26 taft-01 kernel: GFS: fsid=TAFT:gfs2.0: telling LM to withdraw
May 5 13:01:26 taft-01 xinetd[6225]: EXIT: qarsh status=0 pid=12039 duration=182(sec)
May 5 13:01:26 taft-01 qarshd[12040]: Sending child 12042 signal 2
May 5 13:01:26 taft-01 clogd[6703]: cpg_mcast_joined error: 11
May 5 13:01:26 taft-01 clogd[6703]: cluster_send failed: Invalid exchange
May 5 13:01:26 taft-01 clogd[6703]: [vnMJdfKt] Error while processing CPG message, DM_CLOG_CLEAR_REGION: Invalid exchange
May 5 13:01:26 taft-01 clogd[6703]: [vnMJdfKt] Response : NO
May 5 13:01:26 taft-01 clogd[6703]: [vnMJdfKt] Originator: 1
May 5 13:01:26 taft-01 clogd[6703]: HISTORY::
May 5 13:01:26 taft-01 clogd[6703]: 0:46) SEQ#=917241, UUID=N2m5GWYN, TYPE=(null), ORIG=1, RESP=NO
May 5 13:01:26 taft-01 clogd[6703]: 1:47) SEQ#=917241, UUID=N2m5GWYN, TYPE=DM_CLOG_GET_RESYNC_WORK, ORIG=1, RESP=YES, RSPR=1
May 5 13:01:26 taft-01 clogd[6703]: 2:48) SEQ#=917243, UUID=N2m5GWYN, TYPE=(null), ORIG=1, RESP=NO
May 5 13:01:26 taft-01 clogd[6703]: 3:49) SEQ#=917243, UUID=N2m5GWYN, TYPE=DM_CLOG_GET_RESYNC_WORK, ORIG=1, RESP=YES, RSPR=1
May 5 13:01:26 taft-01 clogd[6703]: 4:50) SEQ#=917245, UUID=N2m5GWYN, TYPE=(null), ORIG=1, RESP=NO
OpenAIS is giving us messages after a cpg_leave, and when we try to process them (do a cluster_send), it tells us that we have disconnected. The solution is to ignore messages received after cpg_leave... I don't think AIS should do this in the first place...
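For illustration only, here is a minimal sketch of the "ignore messages received after cpg_leave" approach described above. This is not the actual clogd patch: the struct, variable, and function names are hypothetical, and it is written against the corosync-style CPG header (<corosync/cpg.h>); the openais-era header and callback signature differ slightly.

/*
 * Sketch: once cpg_leave() has been requested for a log's CPG group,
 * flag the group so the deliver callback silently drops any messages
 * OpenAIS still hands us, instead of attempting a cluster_send() on a
 * connection we are leaving.
 */
#include <corosync/cpg.h>
#include <stdint.h>

struct log_cpg {
        cpg_handle_t handle;
        struct cpg_name name;
        int leaving;            /* set once cpg_leave() has been issued */
};

/* Hypothetical single-group state, for the example only. */
static struct log_cpg the_log;

static void log_cpg_deliver(cpg_handle_t handle,
                            const struct cpg_name *group_name,
                            uint32_t nodeid, uint32_t pid,
                            void *msg, size_t msg_len)
{
        (void)handle; (void)group_name; (void)nodeid;
        (void)pid; (void)msg; (void)msg_len;

        /* Ignore anything delivered after we have asked to leave. */
        if (the_log.leaving)
                return;

        /* ... normal processing (and any cluster_send()) would go here ... */
}

static int log_cpg_leave(struct log_cpg *lc)
{
        /* Flag first, so the dispatch path stops acting on late messages. */
        lc->leaving = 1;
        return (cpg_leave(lc->handle, &lc->name) == CS_OK) ? 0 : -1;
}

In a real daemon the deliver callback would be registered through cpg_initialize() in a cpg_callbacks_t structure; the key point is simply that the "leaving" flag is set before cpg_leave() is called, so late deliveries from the dispatch loop never reach cluster_send().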
This isn't seen anymore, marking verified.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHEA-2009-0158.html