Bug 429599 - RHEL5 cmirror tracker: dm-log-clustered time out while waiting for log server
Summary: RHEL5 cmirror tracker: dm-log-clustered time out while waiting for log server
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cmirror
Version: 5.2
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Jonathan Earl Brassow
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks: 430797
 
Reported: 2008-01-21 21:15 UTC by Corey Marthaler
Modified: 2010-01-12 02:06 UTC (History)
5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-01-20 21:25:33 UTC
Target Upstream Version:
Embargoed:


Attachments
log from hayes-01 (200.28 KB, text/plain)
2008-06-26 20:18 UTC, Corey Marthaler
log from hayes-02 (201.07 KB, text/plain)
2008-06-26 20:19 UTC, Corey Marthaler
log from hayes-03 (199.48 KB, text/plain)
2008-06-26 20:19 UTC, Corey Marthaler


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2009:0158 0 normal SHIPPED_LIVE new package: cmirror 2009-01-20 16:05:16 UTC

Description Corey Marthaler 2008-01-21 21:15:45 UTC
Description of problem:
The userspace processes seem to be slow enough to continually cause timeout messages during most cmirror operations.


Jan 21 15:05:35 hayes-03 kernel: device-mapper: dm-log-clustered: Timed out waiting for cluster log server [DM_CLOG_GET_SYNC_COUNT]
Jan 21 15:05:35 hayes-03 kernel: device-mapper: dm-log-clustered: Retrying request [DM_CLOG_GET_SYNC_COUNT]
Jan 21 15:05:35 hayes-03 kernel: device-mapper: dm-log-clustered: [77938][0][0][0][0][0] [10]


Version-Release number of selected component (if applicable):
cmirror-1.1.5-4.el5


How reproducible:
Often
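
For illustration only, the pattern behind the "Timed out ... / Retrying request ..." pairs above can be sketched in a few lines of userspace C: the kernel-side log client sends a request to the cluster log server, waits a bounded time for the reply, and simply reissues the same request when the wait expires. Everything named below (send_and_wait, the simulated timeout) is a hypothetical stand-in, not the real dm-log-clustered code.

/* Minimal userspace sketch (illustration only) of the request/retry pattern
 * behind the "Timed out ... / Retrying request ..." messages above.  Names
 * here are hypothetical, not the actual dm-log-clustered symbols. */
#include <stdio.h>
#include <stdbool.h>

/* Stand-in for "send the request to the cluster log server and wait for a
 * reply"; returns false when the wait would have timed out. */
static bool send_and_wait(const char *request, unsigned seq, unsigned attempt)
{
    printf("sending %s (seq %u, attempt %u)\n", request, seq, attempt);
    /* pretend the first attempt of every third request times out */
    return !(seq % 3 == 0 && attempt == 1);
}

int main(void)
{
    for (unsigned seq = 1; seq <= 6; seq++) {
        unsigned attempt = 1;
        while (!send_and_wait("DM_CLOG_GET_SYNC_COUNT", seq, attempt)) {
            /* Same behaviour as the kernel messages: log the timeout and
             * reissue the identical request until the log server answers. */
            fprintf(stderr, "Timed out waiting for cluster log server, retrying\n");
            attempt++;
        }
    }
    return 0;
}

The point is that a single timeout is not fatal by itself; the request is retried indefinitely, which is why a slow or wedged log server shows up as an endless stream of these messages.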

Comment 1 RHEL Program Management 2008-02-04 15:17:47 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 2 Jonathan Earl Brassow 2008-02-04 17:12:52 UTC
This is a real problem if it repeats over and over... it means processing isn't
getting done for some reason.  If it only happens once in a while and things
keep working, then it is probably a minor issue.  I can see that you have had 10
such timeouts... how often do you see this message?

I've probably done most of the optimizing I can do in this release.  So, if it's
timing out due to slowness, then we may have to skip this bug.  If it's timing
out due to a userspace failure to process, then this bug must be fixed.


Comment 3 Corey Marthaler 2008-02-05 20:38:42 UTC
I'm still at times seeing these timeout messages cause problems, like in bz 430296.

Comment 5 Corey Marthaler 2008-02-18 20:43:37 UTC
Saw a similar issue while testing for bz 429708. 

device-mapper: dm-log-clustered: Request timed out on DM_CLOG_FLUSH:81746 - retrying
device-mapper: dm-log-clustered: Request timed out on DM_CLOG_FLUSH:87824 - retrying
device-mapper: dm-log-clustered: Request timed out on DM_CLOG_GET_RESYNC_WORK:87826 - retrying
device-mapper: dm-log-clustered: Stray request returned: <NULL>, 0
device-mapper: dm-log-clustered: Request timed out on DM_CLOG_CLEAR_REGION:91049 - retrying

The sync is either stuck or so slow that it no longer moves:
[root@taft-01 ~]# date
Mon Feb 18 14:33:47 CST 2008
[root@taft-01 ~]# lvs
  LV            VG          Attr   LSize   Origin Snap%  Move Log                Copy%  Convert
  taft-02.28775 lock_stress mwi-a- 500.00M                    taft-02.28775_mlog  24.00
  taft-03.28789 lock_stress mwi-a- 500.00M                    taft-03.28789_mlog  79.20

[root@taft-01 ~]# date
Mon Feb 18 14:42:15 CST 2008
[root@taft-01 ~]# lvs
  LV            VG          Attr   LSize   Origin Snap%  Move Log                Copy%  Convert
  taft-02.28775 lock_stress mwi-a- 500.00M                    taft-02.28775_mlog  24.00
  taft-03.28789 lock_stress mwi-a- 500.00M                    taft-03.28789_mlog  79.20
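
The manual check above (sampling Copy% a few minutes apart) can be automated. A rough sketch, assuming the copy_percent reporting field of lvs and a placeholder LV name:

/* Sketch: sample a mirror's Copy% twice and report whether it moved.
 * The LV name and interval are placeholders for illustration. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Read the copy percentage of one logical volume via lvs; returns < 0 on error. */
static double copy_percent(const char *lv)
{
    char cmd[256], buf[64];
    double pct = -1.0;

    snprintf(cmd, sizeof(cmd), "lvs --noheadings -o copy_percent %s", lv);
    FILE *p = popen(cmd, "r");
    if (!p)
        return -1.0;
    if (fgets(buf, sizeof(buf), p))
        pct = strtod(buf, NULL);
    pclose(p);
    return pct;
}

int main(void)
{
    const char *lv = "lock_stress/taft-02.28775";   /* placeholder LV name */
    double before = copy_percent(lv);
    sleep(300);                                     /* wait a few minutes */
    double after = copy_percent(lv);

    printf("Copy%% went from %.2f to %.2f\n", before, after);
    if (after >= 0 && after <= before)
        printf("sync appears stuck\n");
    return 0;
}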



Comment 6 Corey Marthaler 2008-03-04 15:56:32 UTC
I believe that I'm still seeing this timing issue cause problems during lvm
operations.

qarshd[25822]: Running cmdline: lvconvert -m 2 lock_stress/taft-02.5326

Down converting cmirror from 3 legs to 2 on taft-02
  Error locking on node taft-03: LV lock_stress/taft-02.5326_mimage_3 in use: not deactivating

Taft-03:
Mar  3 16:49:51 taft-03 lvm[6770]: Monitoring mirror device lock_stress-taft--02.5326 for events
Mar  3 16:49:51 taft-03 clogd[6682]: Invalid request_type
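
A hedged sketch of what a check like the one that prints "Invalid request_type" plausibly looks like: the daemon refuses to service a request whose type field is outside the range it understands, and the kernel side later times out and retries. The enum values and struct layout below are illustrative, not the actual clogd definitions.

/* Illustrative only: hypothetical request-type validation in a cluster log
 * daemon.  The enum and wire format are stand-ins, not the real clogd code. */
#include <stdio.h>
#include <stdint.h>

enum clog_request_type {            /* illustrative subset */
    CLOG_CTR = 1,
    CLOG_GET_SYNC_COUNT,
    CLOG_CLEAR_REGION,
    CLOG_FLUSH,
    CLOG_REQUEST_TYPE_MAX,
};

struct clog_request {               /* illustrative wire format */
    uint32_t seq;
    uint32_t request_type;
};

static int handle_request(const struct clog_request *rq)
{
    if (rq->request_type == 0 || rq->request_type >= CLOG_REQUEST_TYPE_MAX) {
        /* An unparsable or zeroed request is dropped; the kernel side will
         * eventually time out and retry it, as seen in the logs. */
        fprintf(stderr, "Invalid request_type\n");
        return -1;
    }
    printf("servicing request %u (seq %u)\n", rq->request_type, rq->seq);
    return 0;
}

int main(void)
{
    struct clog_request good = { .seq = 7, .request_type = CLOG_FLUSH };
    struct clog_request bad  = { .seq = 8, .request_type = 0 };  /* zeroed payload */

    handle_request(&good);
    handle_request(&bad);
    return 0;
}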


Comment 7 Corey Marthaler 2008-03-04 19:09:31 UTC
On taft-03, you can see that the third image didn't get deleted when the down
conversion happened.

  taft-02.5326            lock_stress mwi-a- 500.00M                    taft-02.5326_mlog 100.00  taft-02.5326_mimage_0(0),taft-02.5326_mimage_1(0),taft-02.5326_mimage_2(0)
  [taft-02.5326_mimage_0] lock_stress iwi-ao 500.00M                                              /dev/sdb1(125)
  [taft-02.5326_mimage_1] lock_stress iwi-ao 500.00M                                              /dev/sdd1(125)
  [taft-02.5326_mimage_2] lock_stress iwi-ao 500.00M                                              /dev/sde1(125)
  taft-02.5326_mimage_3   lock_stress -wi--- 500.00M                                              /dev/sdf1(125)


Comment 8 Corey Marthaler 2008-03-05 17:58:51 UTC
I saw another instance where an lvm command failed and the only thing in the logs was
this dm-log-clustered timeout message.


Mar  4 19:05:55 taft-04 qarshd[17728]: Running cmdline: lvchange --resync -y mirror_sanity/resync_regular

Mar  4 19:05:46 taft-02 kernel: device-mapper: dm-log-clustered: Request timed out on DM_CLOG_STATUS_INFO:836712 - retrying

Resyncing resync_regular mirror
  Error locking on node taft-02: LV mirror_sanity/resync_regular_mlog in use: not deactivating
  Unable to deactivate log LV resync_regular_mlog after wiping for resync


Comment 9 Corey Marthaler 2008-03-05 18:01:22 UTC
Making this bz a little more visible.

Comment 10 Corey Marthaler 2008-04-22 21:14:31 UTC
Just an FYI: with enough lvm processes, you'll see these messages repeat
even if the commands aren't failing or stuck.

2.6.18-90.el5
lvm2-2.02.32-4.el5
lvm2-cluster-2.02.32-4.el5
kmod-cmirror-0.1.8-1.el5

[...]
device-mapper: dm-log-clustered: Request timed out on DM_CLOG_GET_SYNC_COUNT:151238 - retrying
device-mapper: dm-log-clustered: Request timed out on DM_CLOG_CLEAR_REGION:161349 - retrying
device-mapper: dm-log-clustered: Request timed out on DM_CLOG_CLEAR_REGION:161609 - retrying
device-mapper: dm-log-clustered: Request timed out on DM_CLOG_CLEAR_REGION:173434 - retrying
device-mapper: dm-log-clustered: Request timed out on DM_CLOG_CLEAR_REGION:194022 - retrying
[...]

Comment 11 Jonathan Earl Brassow 2008-05-13 15:23:07 UTC
I believe this is now fixed.

I have seen "Request timed out..." messages in further testing, but they were due to
an AIS failure.  If you see a message from OpenAIS like "disconnecting from
library", then that is a different bug.


Comment 12 Jonathan Earl Brassow 2008-05-15 18:35:59 UTC
This should also help (at least in the generic sense of preventing 'timed out'
messages):

commit a21d1c03600e011cf1ee132100ac823bf3478877
Author: Jonathan Brassow <jbrassow>
Date:   Thu May 15 13:02:19 2008 -0500

    clogd + dm-log-clustered: Improve communication error handling (K <-> U)

    Connector (netlink) is an unreliable protocol.  From time to time, I get
    a connector header in userspace that has all the right fields (seq #, size
    of data, etc), but the data is all zeros.  When this happens, I use the seq
    number in the header to request the kernel to resend the request.  This
    should avoid those nasty 'Invalid request' messages... which are then
    followed by 'Request timed out' (because the invalid request was ignored).
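
A rough userspace sketch of the recovery path that commit describes: if a connector message arrives with a sane header but an all-zero payload, use the sequence number from the header to ask the kernel to resend that request instead of dropping it. request_resend() and the struct layout here are assumed for illustration; they are not the actual clogd/netlink interface.

/* Illustrative only: detect a zeroed connector payload and request a resend
 * keyed by the header's sequence number.  Names and layout are stand-ins. */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

struct cn_hdr {                      /* simplified connector-style header */
    uint32_t seq;
    uint32_t len;                    /* length of the payload that follows */
};

static bool payload_is_zeroed(const uint8_t *data, uint32_t len)
{
    for (uint32_t i = 0; i < len; i++)
        if (data[i])
            return false;
    return len > 0;
}

/* Hypothetical stand-in for sending "please resend seq N" back to the kernel. */
static void request_resend(uint32_t seq)
{
    fprintf(stderr, "payload for seq %u was zeroed, asking kernel to resend\n", seq);
}

static void receive(const struct cn_hdr *hdr, const uint8_t *data)
{
    if (payload_is_zeroed(data, hdr->len)) {
        request_resend(hdr->seq);    /* avoids the 'Invalid request' -> timeout chain */
        return;
    }
    printf("processing request seq %u (%u bytes)\n", hdr->seq, hdr->len);
}

int main(void)
{
    uint8_t good[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    uint8_t lost[8] = { 0 };

    struct cn_hdr h1 = { .seq = 41, .len = sizeof(good) };
    struct cn_hdr h2 = { .seq = 42, .len = sizeof(lost) };

    receive(&h1, good);
    receive(&h2, lost);
    return 0;
}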


Comment 13 Corey Marthaler 2008-06-26 19:46:53 UTC
I was able to hit this while running the test cmirror_lock_stress on the latest
5.2.Z kernel.

All user space cmds (lvs) are hung.

The following is scrolling over and over on all nodes:
Jun 26 14:42:28 hayes-02 kernel: device-mapper: dm-log-clustered: [cE1hohMV] Request timed out: [DM_CLOG_CTg
Jun 26 14:42:28 hayes-02 clogd[5954]: Invalid log request received, ignoring.
device-mapper: dm-log-clustered: [cE1hohMV] Request timed out: [DM_CLOG_CTR/726] - retrying
Jun 26 14:42:43 hayes-02 kernel: device-mapper: dm-log-clustered: [cE1hohMV] Request timed out: [DM_CLOG_CTg
Jun 26 14:42:43 hayes-02 clogd[5954]: Invalid log request received, ignoring.

I'll include a dump of all the running processes.

2.6.18-92.1.5.el5
cmirror-1.1.18-1.el5 BUILT: Tue 20 May 2008 02:53:58 PM CDT
kmod-cmirror-0.1.10-1.el5 BUILT: Build Date: Tue 20 May 2008 02:55:48 PM CDT


Comment 14 Corey Marthaler 2008-06-26 20:18:17 UTC
Created attachment 310372 [details]
log from hayes-01

Comment 15 Corey Marthaler 2008-06-26 20:19:08 UTC
Created attachment 310373 [details]
log from hayes-02

Comment 16 Corey Marthaler 2008-06-26 20:19:46 UTC
Created attachment 310374 [details]
log from hayes-03

Comment 17 Jonathan Earl Brassow 2008-07-21 14:42:01 UTC
I would be _very_ surprised to see you get this anymore...
"clogd[5954]: Invalid log request received, ignoring."

I believe this was an artifact of the bug having been fixed, but the build issue we
were having caused things not to work properly.

Comment 18 Corey Marthaler 2008-08-26 14:58:38 UTC
I don't see the 'clogd Invalid log request received' messages anymore. I am however still seeing a potential deadlock when running cmirror_lock_stress. I'll file a new bz for that and mark this one verified.

Comment 20 errata-xmlrpc 2009-01-20 21:25:33 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-0158.html

