Bug 431714 - RHEL5 cmirror tracker: issues with resync can cause it to be really slow
Status: CLOSED WORKSFORME
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cmirror
Version: 5.2
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: rc
Assignee: Jonathan Earl Brassow
QA Contact: Cluster QE
Depends On: 432109
 
Reported: 2008-02-06 16:55 UTC by Corey Marthaler
Modified: 2011-04-07 22:05 UTC
CC List: 6 users

Doc Type: Bug Fix
Last Closed: 2011-04-07 22:05:35 UTC



Description Corey Marthaler 2008-02-06 16:55:23 UTC
Description of problem:
I have a test case to check the resyncing of mirrors (now that bz 385761 is
fixed) which fails once in a while (roughly one time in 20 or 30 attempts). It
checks the sync percent after 10 seconds of syncing, resyncs the mirror, and
then checks the sync percent again after 30 seconds. The assumption is that
after 30 seconds the percent should be larger (about 3X) than before, and if
it's not at least larger, the test case fails. This doesn't really seem like a
big deal because the mirror does eventually become fully synced, but the
messages in the log may point to something more serious.
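
For illustration only, here is a minimal sketch of the kind of timing check
the test performs (hypothetical shell; it assumes the LV is
mirror_sanity/resync_regular, and the real test harness differs):

  # Hypothetical sketch of the timing check -- not the actual harness.
  LV=mirror_sanity/resync_regular

  # 1) Let the initial sync run for 10 seconds, then record Copy%.
  sleep 10
  before=$(lvs --noheadings -o copy_percent "$LV" | tr -d ' ')

  # 2) Force a resync of the mirror (answer yes to any deactivate prompt).
  lvchange --resync -y "$LV"

  # 3) Let the resync run for 30 seconds, then record Copy% again.
  sleep 30
  after=$(lvs --noheadings -o copy_percent "$LV" | tr -d ' ')

  # Expectation: $after should be larger (roughly 3X) than $before;
  # if it is not even larger, the test case fails.
  echo "after 10s: $before%   after 30s: $after%"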


# Grant-01:
Feb  6 10:18:59 grant-01 kernel: device-mapper: dm-log-clustered: Timed out
waiting for cluster log server [DM_CLOG_SET_REGION_SYNC]
Feb  6 10:18:59 grant-01 kernel: device-mapper: dm-log-clustered: Retrying
request [DM_CLOG_SET_REGION_SYNC]
Feb  6 10:18:59 grant-01 kernel: device-mapper: dm-log-clustered:
[331003][0][0][0][0][0] [32]
Feb  6 10:19:00 grant-01 clogd[3249]: [] Unable to send (null) to cluster:
Invalid exchange
Feb  6 10:19:00 grant-01 clogd[3249]: Bad callback on local/4
Feb  6 10:19:05 grant-01 kernel: device-mapper: dm-log-clustered: Timed out
waiting for cluster log server [DM_CLOG_SET_REGION_SYNC]
Feb  6 10:19:05 grant-01 kernel: device-mapper: dm-log-clustered: Retrying
request [DM_CLOG_SET_REGION_SYNC]
Feb  6 10:19:05 grant-01 kernel: device-mapper: dm-log-clustered:
[332809][0][0][0][0][0] [33]
Feb  6 10:19:05 grant-01 clogd[3249]: [] Unable to send (null) to cluster:
Invalid exchange
Feb  6 10:19:05 grant-01 clogd[3249]: Bad callback on local/4
Feb  6 10:19:09 grant-01 lvm[3425]: mirror_sanity-resync_nosync is now in-sync
Feb  6 10:19:10 grant-01 kernel: device-mapper: dm-log-clustered: Timed out
waiting for cluster log server [DM_CLOG_SET_REGION_SYNC]
Feb  6 10:19:10 grant-01 kernel: device-mapper: dm-log-clustered: Retrying
request [DM_CLOG_SET_REGION_SYNC]
Feb  6 10:19:10 grant-01 kernel: device-mapper: dm-log-clustered:
[334112][0][0][0][0][0] [34]
Feb  6 10:20:17 grant-01 lvm[3425]: mirror_sanity-resync_regular is now in-sync


# Grant-02:
Feb  6 10:17:22 grant-02 [3357]: Monitoring mirror device
mirror_sanity-resync_nosync for events
Feb  6 10:17:22 grant-02 lvm[3357]: mirror_sanity-resync_nosync is now in-sync
Feb  6 10:17:24 grant-02 clogd[3244]: [bCTpsP10] Cluster log created
Feb  6 10:17:24 grant-02 clogd[3244]: [bCTpsP10] Non-master resume: bits pre-loaded
Feb  6 10:17:24 grant-02 lvm[3357]: Monitoring mirror device
mirror_sanity-resync_regular for events
Feb  6 10:17:24 grant-02 clogd[3244]:   Region already in-sync: 0
Feb  6 10:17:35 grant-02 lvm[3357]: No longer monitoring mirror device
mirror_sanity-resync_regular for events
Feb  6 10:17:35 grant-02 clogd[3244]: WARNING:  Log has changed during a flush
operation
Feb  6 10:17:35 grant-02 clogd[3244]: [bCTpsP10] clog_postsuspend: leaving CPG
Feb  6 10:17:35 grant-02 clogd[3244]: [bCTpsP10] clog_postsuspend: finalizing
Feb  6 10:17:35 grant-02 clogd[3244]: [bCTpsP10] Cluster log removed
Feb  6 10:17:35 grant-02 lvm[3357]: No longer monitoring mirror device
mirror_sanity-resync_nosync for events
Feb  6 10:17:35 grant-02 clogd[3244]: [yyGHMGry] clog_postsuspend: leaving CPG
Feb  6 10:17:35 grant-02 clogd[3244]: [yyGHMGry] clog_postsuspend: finalizing
Feb  6 10:17:35 grant-02 clogd[3244]: [yyGHMGry] Cluster log removed
Feb  6 10:17:38 grant-02 clogd[3244]: [bCTpsP10] Cluster log created
Feb  6 10:17:38 grant-02 clogd[3244]: [bCTpsP10] Non-master resume: bits pre-loaded
Feb  6 10:17:38 grant-02 [3357]: Monitoring mirror device
mirror_sanity-resync_regular for events
Feb  6 10:17:38 grant-02 clogd[3244]: [yyGHMGry] Cluster log created
Feb  6 10:17:38 grant-02 clogd[3244]: [yyGHMGry] Master resume: reading disk log
Feb  6 10:17:38 grant-02 clogd[3244]: Header not valid
Feb  6 10:17:38 grant-02 clogd[3244]:   magic     : 0  (expected: 4d695272)
Feb  6 10:17:38 grant-02 clogd[3244]:   version   : 0
Feb  6 10:17:38 grant-02 clogd[3244]:   nr_regions: 0
Feb  6 10:17:38 grant-02 clogd[3244]: *** Invalid argument ***
Feb  6 10:17:38 grant-02 lvm[3357]: Monitoring mirror device
mirror_sanity-resync_nosync for events
Feb  6 10:18:14 grant-02 clogd[3244]: Unknown request


Version-Release number of selected component (if applicable):
cmirror-1.1.11-1.el5
kmod-cmirror-0.1.5-2.el5
lvm2-2.02.32-1.el5
lvm2-cluster-2.02.32-1.el5

How reproducible:
Not Often

Comment 1 RHEL Program Management 2008-02-06 17:17:22 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 2 Corey Marthaler 2008-02-06 19:57:07 UTC
Actually, this appears to be happening more often.

I've also seen a case where, after the resync, it gets stuck:
[root@grant-03 ~]# lvs
  LV             VG            Attr   LSize  Origin Snap%  Move Log                  Copy%  Convert
  LogVol00       VolGroup00    -wi-ao 72.44G
  LogVol01       VolGroup00    -wi-ao  1.94G
  resync_nosync  mirror_sanity mwi-a-  2.00G                    resync_nosync_mlog  100.00
  resync_regular mirror_sanity mwi-a-  2.00G                    resync_regular_mlog  10.55


[root@grant-03 ~]# lvs
  LV             VG            Attr   LSize  Origin Snap%  Move Log                  Copy%  Convert
  LogVol00       VolGroup00    -wi-ao 72.44G
  LogVol01       VolGroup00    -wi-ao  1.94G
  resync_nosync  mirror_sanity mwi-a-  2.00G                    resync_nosync_mlog  100.00
  resync_regular mirror_sanity mwi-a-  2.00G                    resync_regular_mlog  10.55


[root@grant-03 ~]# lvs
  LV             VG            Attr   LSize  Origin Snap%  Move Log                  Copy%  Convert
  LogVol00       VolGroup00    -wi-ao 72.44G
  LogVol01       VolGroup00    -wi-ao  1.94G
  resync_nosync  mirror_sanity mwi-a-  2.00G                    resync_nosync_mlog  100.00
  resync_regular mirror_sanity mwi-a-  2.00G                    resync_regular_mlog  10.55
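
A hypothetical way to watch for that stall (not part of the test case) is to
poll Copy% and flag it when it stops moving, e.g.:

  # Hypothetical stall watch; assumes the same LV name as above.
  LV=mirror_sanity/resync_regular
  prev=""
  while true; do
      cur=$(lvs --noheadings -o copy_percent "$LV" | tr -d ' ')
      [ "$cur" = "100.00" ] && break
      [ -n "$prev" ] && [ "$cur" = "$prev" ] && echo "resync appears stuck at ${cur}%"
      prev=$cur
      sleep 30
  done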


Comment 3 Corey Marthaler 2008-02-06 19:57:50 UTC
I'll attempt this with single node mirroring to see if this is a base lvm issue.

Comment 4 Corey Marthaler 2008-02-06 21:29:43 UTC
I haven't hit this with single-node mirrors in over 80 consecutive iterations
of this test case. That said, the "after 30 sec" sync percent is rarely 3X as
big as the "after 10 sec" sync percent, and is sometimes just barely larger.

Comment 5 Corey Marthaler 2008-02-06 22:27:40 UTC
Having reproduced the case where the resync gets stuck (in comment #2) I'm going
to open a new bug for that and leave this one for the fact that resync can be
really slow and cause funky "invalid header" messages.

Comment 7 Jonathan Earl Brassow 2008-02-15 16:25:43 UTC
My last checkin should improve the throughput of the resync I/O and nominal I/O.
 While it should be a big improvement, I would still expect nominal I/O to be
pretty slow during the process.

Comment 8 Corey Marthaler 2008-02-20 20:20:20 UTC
This appears to be fixed as well, however like the other issues marked modified,
I  eventually hit the 'cpg_mcast_joined error: 9' (bz  432109) message which
causes testing to fail, so I can't be positive that issues like this can be
marked verified.

Comment 9 Corey Marthaler 2008-03-04 21:38:16 UTC
Hit 432109 again while attempting to verify this issue.

Comment 11 Alasdair Kergon 2011-02-09 22:02:12 UTC
Untouched for almost 3 years: Are there still issues, or can this be closed?

Comment 12 Corey Marthaler 2011-04-07 22:05:35 UTC
Closing due to inactivity. Will reopen if seen again.

