Description of problem:
I have a test case to check the resyncing of mirrors (now that bz 385761 is fixed) which is failing once in a while (like one time in 20 or 30 attempts). It basically checks the sync percent after 10 seconds of syncing, resyncs the mirror, and then checks the sync percent after 30 seconds. The assumption is that after 30 seconds the percent should be larger (about 3X) than before, and if it's not at least larger, the test case fails. This doesn't really seem to be a big deal because the mirror does eventually become fully synced, but the messages in the log may point to something more serious.

# Grant-01:
Feb 6 10:18:59 grant-01 kernel: device-mapper: dm-log-clustered: Timed out waiting for cluster log server [DM_CLOG_SET_REGION_SYNC]
Feb 6 10:18:59 grant-01 kernel: device-mapper: dm-log-clustered: Retrying request [DM_CLOG_SET_REGION_SYNC]
Feb 6 10:18:59 grant-01 kernel: device-mapper: dm-log-clustered: [331003][0][0][0][0][0] [32]
Feb 6 10:19:00 grant-01 clogd[3249]: [] Unable to send (null) to cluster: Invalid exchange
Feb 6 10:19:00 grant-01 clogd[3249]: Bad callback on local/4
Feb 6 10:19:05 grant-01 kernel: device-mapper: dm-log-clustered: Timed out waiting for cluster log server [DM_CLOG_SET_REGION_SYNC]
Feb 6 10:19:05 grant-01 kernel: device-mapper: dm-log-clustered: Retrying request [DM_CLOG_SET_REGION_SYNC]
Feb 6 10:19:05 grant-01 kernel: device-mapper: dm-log-clustered: [332809][0][0][0][0][0] [33]
Feb 6 10:19:05 grant-01 clogd[3249]: [] Unable to send (null) to cluster: Invalid exchange
Feb 6 10:19:05 grant-01 clogd[3249]: Bad callback on local/4
Feb 6 10:19:09 grant-01 lvm[3425]: mirror_sanity-resync_nosync is now in-sync
Feb 6 10:19:10 grant-01 kernel: device-mapper: dm-log-clustered: Timed out waiting for cluster log server [DM_CLOG_SET_REGION_SYNC]
Feb 6 10:19:10 grant-01 kernel: device-mapper: dm-log-clustered: Retrying request [DM_CLOG_SET_REGION_SYNC]
Feb 6 10:19:10 grant-01 kernel: device-mapper: dm-log-clustered: [334112][0][0][0][0][0] [34]
Feb 6 10:20:17 grant-01 lvm[3425]: mirror_sanity-resync_regular is now in-sync

# Grant-02:
Feb 6 10:17:22 grant-02 [3357]: Monitoring mirror device mirror_sanity-resync_nosync for events
Feb 6 10:17:22 grant-02 lvm[3357]: mirror_sanity-resync_nosync is now in-sync
Feb 6 10:17:24 grant-02 clogd[3244]: [bCTpsP10] Cluster log created
Feb 6 10:17:24 grant-02 clogd[3244]: [bCTpsP10] Non-master resume: bits pre-loaded
Feb 6 10:17:24 grant-02 lvm[3357]: Monitoring mirror device mirror_sanity-resync_regular for events
Feb 6 10:17:24 grant-02 clogd[3244]: Region already in-sync: 0
Feb 6 10:17:35 grant-02 lvm[3357]: No longer monitoring mirror device mirror_sanity-resync_regular for events
Feb 6 10:17:35 grant-02 clogd[3244]: WARNING: Log has changed during a flush operation
Feb 6 10:17:35 grant-02 clogd[3244]: [bCTpsP10] clog_postsuspend: leaving CPG
Feb 6 10:17:35 grant-02 clogd[3244]: [bCTpsP10] clog_postsuspend: finalizing
Feb 6 10:17:35 grant-02 clogd[3244]: [bCTpsP10] Cluster log removed
Feb 6 10:17:35 grant-02 lvm[3357]: No longer monitoring mirror device mirror_sanity-resync_nosync for events
Feb 6 10:17:35 grant-02 clogd[3244]: [yyGHMGry] clog_postsuspend: leaving CPG
Feb 6 10:17:35 grant-02 clogd[3244]: [yyGHMGry] clog_postsuspend: finalizing
Feb 6 10:17:35 grant-02 clogd[3244]: [yyGHMGry] Cluster log removed
Feb 6 10:17:38 grant-02 clogd[3244]: [bCTpsP10] Cluster log created
Feb 6 10:17:38 grant-02 clogd[3244]: [bCTpsP10] Non-master resume: bits pre-loaded
Feb 6 10:17:38 grant-02 [3357]: Monitoring mirror device mirror_sanity-resync_regular for events
Feb 6 10:17:38 grant-02 clogd[3244]: [yyGHMGry] Cluster log created
Feb 6 10:17:38 grant-02 clogd[3244]: [yyGHMGry] Master resume: reading disk log
Feb 6 10:17:38 grant-02 clogd[3244]: Header not valid
Feb 6 10:17:38 grant-02 clogd[3244]: magic : 0 (expected: 4d695272)
Feb 6 10:17:38 grant-02 clogd[3244]: version : 0
Feb 6 10:17:38 grant-02 clogd[3244]: nr_regions: 0
Feb 6 10:17:38 grant-02 clogd[3244]: *** Invalid argument ***
Feb 6 10:17:38 grant-02 lvm[3357]: Monitoring mirror device mirror_sanity-resync_nosync for events
Feb 6 10:18:14 grant-02 clogd[3244]: Unknown request

Version-Release number of selected component (if applicable):
cmirror-1.1.11-1.el5
kmod-cmirror-0.1.5-2.el5
lvm2-2.02.32-1.el5
lvm2-cluster-2.02.32-1.el5

How reproducible:
Not Often
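For reference, the check the test case performs boils down to something like the following. This is a rough sketch, not the actual test script; the VG/LV name and the use of 'lvchange --resync' and the copy_percent field are my shorthand for what the test does:

  # Rough sketch of the test's resync check (not the actual test script);
  # LV name is just an example.
  LV=mirror_sanity/resync_regular

  sleep 10                              # 10 seconds into the initial sync
  P10=$(lvs --noheadings -o copy_percent $LV | tr -d ' ')

  lvchange -y --resync $LV              # clear the log and start the resync
  sleep 30                              # 30 seconds into the resync
  P30=$(lvs --noheadings -o copy_percent $LV | tr -d ' ')

  echo "after 10s of sync: $P10%, after 30s of resync: $P30%"
  # Expectation is roughly 3x the progress; the test only fails if P30
  # isn't at least larger than P10.
  awk -v a="$P10" -v b="$P30" 'BEGIN { exit !(b > a) }' || echo FAIL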
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Actually this appears to be happening more often. I've also seen a case where after the resync, it gets stuck:

[root@grant-03 ~]# lvs
  LV             VG            Attr   LSize  Origin Snap%  Move Log                 Copy%  Convert
  LogVol00       VolGroup00    -wi-ao 72.44G
  LogVol01       VolGroup00    -wi-ao  1.94G
  resync_nosync  mirror_sanity mwi-a-  2.00G                    resync_nosync_mlog  100.00
  resync_regular mirror_sanity mwi-a-  2.00G                    resync_regular_mlog  10.55
[root@grant-03 ~]# lvs
  LV             VG            Attr   LSize  Origin Snap%  Move Log                 Copy%  Convert
  LogVol00       VolGroup00    -wi-ao 72.44G
  LogVol01       VolGroup00    -wi-ao  1.94G
  resync_nosync  mirror_sanity mwi-a-  2.00G                    resync_nosync_mlog  100.00
  resync_regular mirror_sanity mwi-a-  2.00G                    resync_regular_mlog  10.55
[root@grant-03 ~]# lvs
  LV             VG            Attr   LSize  Origin Snap%  Move Log                 Copy%  Convert
  LogVol00       VolGroup00    -wi-ao 72.44G
  LogVol01       VolGroup00    -wi-ao  1.94G
  resync_nosync  mirror_sanity mwi-a-  2.00G                    resync_nosync_mlog  100.00
  resync_regular mirror_sanity mwi-a-  2.00G                    resync_regular_mlog  10.55
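A rough sketch of how the stall can be checked from the shell, by polling the copy percent and seeing whether it ever moves (LV name taken from the output above):

  # Poll the copy percent every 30s; if it hasn't moved after 10 samples, call it stuck.
  LV=mirror_sanity/resync_regular
  LAST=$(lvs --noheadings -o copy_percent $LV | tr -d ' ')
  for i in $(seq 1 10); do
      sleep 30
      CUR=$(lvs --noheadings -o copy_percent $LV | tr -d ' ')
      if [ "$CUR" != "$LAST" ]; then
          echo "still progressing: $LAST -> $CUR"
          exit 0
      fi
  done
  echo "resync appears stuck at $LAST%"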
I'll attempt this with single node mirroring to see if this is a base lvm issue.
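For the single node comparison, something along these lines should exercise the same resync check without the cluster log (a sketch only; names and sizes are just examples copied from the cluster test):

  # Same mirror layout in a non-clustered VG, so the in-kernel disk log is used
  # instead of the cluster log daemon (clogd).
  vgchange -cn mirror_sanity                     # or build a fresh, non-clustered VG
  lvcreate -m1 --mirrorlog disk -L 2G -n resync_regular mirror_sanity
  lvcreate -m1 --nosync         -L 2G -n resync_nosync  mirror_sanity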
I haven't hit this with single node mirrors in over 80 consecutive iterations of this test case. That said, the "after 30 sec" sync percent is rarely 3X the "after 10 sec" sync percent, and is sometimes only barely larger.
Having reproduced the case where the resync gets stuck (in comment #2), I'm going to open a new bug for that and leave this one to cover the fact that resync can be really slow and cause funky "invalid header" messages.
My last checkin should improve the throughput of the resync I/O and nominal I/O. While it should be a big improvement, I would still expect nominal I/O to be pretty slow during the process.
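One way to gauge the nominal I/O slowdown during a resync is to time a direct write to the mirror while it is syncing and compare it against the same write once Copy% reaches 100. A rough sketch; the LV name and write size are just examples, and this overwrites the LV, so it should only be run against a scratch test mirror:

  # Time nominal (direct) writes to the mirror while it is resyncing,
  # then repeat after Copy% hits 100.00 and compare the reported throughput.
  LV=mirror_sanity/resync_regular
  lvchange -y --resync $LV
  dd if=/dev/zero of=/dev/$LV bs=1M count=256 oflag=direct
  # ...wait for 'lvs' to show Copy% at 100.00, then run the same dd again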
This appears to be fixed as well; however, like the other issues marked modified, I eventually hit the 'cpg_mcast_joined error: 9' message (bz 432109), which causes testing to fail, so I can't be positive that issues like this one can be marked verified.
Hit 432109 again while attempting to verify this issue.
Untouched for almost 3 years: Are there still issues, or can this be closed?
Closing due to inactivity. Will reopen if seen again.