Bug 1035418
| Summary: | cmirrord: cpg_dispatch failed: CS_ERR_BAD_HANDLE | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Corey Marthaler <cmarthal> |
| Component: | lvm2 | Assignee: | Jonathan Earl Brassow <jbrassow> |
| lvm2 sub component: | Default / Unclassified | QA Contact: | Cluster QE <mspqa-list> |
| Status: | CLOSED CURRENTRELEASE | Docs Contact: | |
| Severity: | medium | | |
| Priority: | unspecified | CC: | agk, heinzm, jbrassow, msnitzer, nperic, prajnoha, prockai, thornber, zkabelac |
| Version: | 7.0 | | |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | lvm2-2.02.105-14.el7 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-06-13 10:11:19 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1075262 | | |
Description
Corey Marthaler
2013-11-27 17:21:46 UTC
I believe I have a fix for this. cpg_dispatch is being run after the node has left the group. We can check the flags to see if LEAVING and INVALID are set and, if so, simply skip the cpg_dispatch. This should eliminate the message.

Patch committed upstream:

commit 52aa3dbcabe85b38b51c68c27fb2397eb0fb1efd
Author: Jonathan Brassow <jbrassow>
Date: Wed Mar 5 10:44:20 2014 -0600

    cmirrord: Clean-up stray warning message

    cmirrord polls for messages on the kernel and cluster interfaces.
    Sometimes it is possible for messages to be received on the cluster
    interface and be waiting for processing while the node is in the
    process of leaving the cluster group. When this happens, the messages
    received on the cluster interface are attempted to be dispatched, but
    an error is returned because the connection is no longer valid. It is
    a harmless situation. So, if we get the specific error
    (CS_ERR_BAD_HANDLE) and we know that we have left the group, then
    simply don't print the message.

This still exists in the latest rpms.

[root@host-003 ~]# grep CS_ERR_BAD_HANDLE /var/log/messages
Mar 17 18:48:24 host-003 cmirrord[17493]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 17 18:48:28 host-003 cmirrord[17493]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 17 18:48:34 host-003 cmirrord[17493]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 17 18:48:52 host-003 cmirrord[17493]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 17 18:49:51 host-003 cmirrord[17493]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 17 18:50:33 host-003 cmirrord[17493]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 17 18:50:43 host-003 cmirrord[17493]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 17 18:51:14 host-003 cmirrord[17493]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 17 18:51:48 host-003 cmirrord[17493]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 17 18:52:22 host-003 cmirrord[17493]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 17 18:52:45 host-003 cmirrord[17493]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 17 18:53:06 host-003 cmirrord[17493]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 17 18:53:18 host-003 cmirrord[17493]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 17 18:53:26 host-003 cmirrord[17493]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 17 18:54:00 host-003 cmirrord[17493]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 17 18:54:26 host-003 cmirrord[17493]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 17 18:54:40 host-003 cmirrord[17493]: cpg_dispatch failed: CS_ERR_BAD_HANDLE

[root@host-003 ~]# /usr/tests/sts-rhel7.0/lvm2/bin/lvm_rpms
3.10.0-112.el7.x86_64
lvm2-2.02.105-12.el7 BUILT: Wed Mar 12 10:49:52 CDT 2014
lvm2-libs-2.02.105-12.el7 BUILT: Wed Mar 12 10:49:52 CDT 2014
lvm2-cluster-2.02.105-12.el7 BUILT: Wed Mar 12 10:49:52 CDT 2014
device-mapper-1.02.84-12.el7 BUILT: Wed Mar 12 10:49:52 CDT 2014
device-mapper-libs-1.02.84-12.el7 BUILT: Wed Mar 12 10:49:52 CDT 2014
device-mapper-event-1.02.84-12.el7 BUILT: Wed Mar 12 10:49:52 CDT 2014
device-mapper-event-libs-1.02.84-12.el7 BUILT: Wed Mar 12 10:49:52 CDT 2014
device-mapper-persistent-data-0.2.8-4.el7 BUILT: Fri Jan 24 14:28:55 CST 2014
cmirror-2.02.105-12.el7 BUILT: Wed Mar 12 10:49:52 CDT 2014
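For context, the guard described in that first commit amounts to something along the following lines. This is only a minimal sketch: the struct, flag, and helper names (struct clog_cpg, LEAVING, INVALID, log_error, do_cluster_work) are assumptions for illustration, not code copied from the cmirrord source.

```c
/* Hedged sketch of the first fix: skip the dispatch (and the warning)
 * once we know this node is on its way out of the CPG group.
 * All names below are illustrative assumptions, not cmirrord's code. */
#include <stdint.h>
#include <stdio.h>
#include <corosync/cpg.h>

#define LEAVING 0x1	/* cpg_leave() has been issued for this log */
#define INVALID 0x2	/* log is no longer valid for cluster traffic */

/* Stand-in for cmirrord's logging helper. */
#define log_error(fmt, ...) fprintf(stderr, fmt "\n", ##__VA_ARGS__)

struct clog_cpg {
	cpg_handle_t handle;	/* CPG connection for this mirror log */
	uint32_t state;		/* LEAVING / INVALID flags */
};

static void do_cluster_work(struct clog_cpg *entry)
{
	cs_error_t r;

	/* First fix, as described above: if the node is already leaving
	 * and the log is invalid, skip cpg_dispatch() entirely. */
	if (entry->state & (LEAVING | INVALID))
		return;

	r = cpg_dispatch(entry->handle, CS_DISPATCH_ALL);
	if (r != CS_OK)
		log_error("cpg_dispatch failed: %d", r);
}
```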
The previous commit did not cover all of the possible cases during the process of a node leaving the cluster. The following upstream commit finishes the job.

commit bdd7baeab39f7ef61817b2c48008852aefdbcdb0
Author: Jonathan Brassow <jbrassow>
Date: Wed Mar 19 14:43:00 2014 -0500

    cmirrord: Clean-up stray warning message (attempt #2)

    There are two types of CPG communications in a corosync cluster:
    messages and state transitions. Cmirrord processes the state
    transitions first. When a cluster mirror issues a POSTSUSPEND, it
    signals the end of cluster communication with the rest of the nodes
    in the cluster. The POSTSUSPEND marks the last communication of the
    'message' type that will go around the cluster. The node then calls
    cpg_leave which causes a final 'state transition' communication to
    all of the nodes. Once the out-going node receives its own state
    transition notice from the cluster, it finalizes the leave. At this
    point, the state of the log is 'INVALID'; but it is possible that
    there remains some cluster traffic that was queued up behind the
    state transition that still wants to be processed. It is harmless to
    attempt to dispatch any remaining messages - they won't be delivered
    because the node is no longer in the cluster. However, there was a
    warning message that was being printed in this case that is now
    removed by this patch. The failure of the dispatch created a false
    positive condition that triggered the message.
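In terms of the same hedged sketch as above (same assumed names; the function name below is likewise hypothetical), the effect of this second clean-up is to stop warning about this specific, harmless dispatch failure while still reporting anything else:

```c
/* Continuation of the sketch above (same assumed definitions).
 * After cpg_leave() has completed and the log has been marked INVALID,
 * cluster traffic queued behind the final state transition may still be
 * dispatched; that dispatch fails with CS_ERR_BAD_HANDLE and is harmless,
 * so it is silently ignored rather than logged. */
static void flush_remaining_cluster_work(struct clog_cpg *entry)
{
	cs_error_t r = cpg_dispatch(entry->handle, CS_DISPATCH_ALL);

	if (r == CS_OK)
		return;

	/* We already left the group; the queued traffic simply cannot be
	 * delivered to this node any more. */
	if (r == CS_ERR_BAD_HANDLE && (entry->state & (LEAVING | INVALID)))
		return;

	log_error("cpg_dispatch failed: %d", r);
}
```

Any other dispatch error is still logged, so genuine CPG problems remain visible.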
I just ran the 'revolution' test with a leg/log fault policy of "allocate", and am getting the errors on all the nodes:

Mar 27 16:37:27 virt-010 lvm[31056]: No longer monitoring mirror device revolution_9-mirror_1 for events.
Mar 27 16:37:27 virt-010 cmirrord[30645]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 27 16:37:30 virt-010 lvm[31056]: Monitoring mirror device revolution_9-mirror_1 for events.
Mar 27 16:37:30 virt-010 lvm[31056]: revolution_9-mirror_1 is now in-sync.
Mar 27 16:37:32 virt-010 lvm[31056]: No longer monitoring mirror device revolution_9-mirror_1 for events.
Mar 27 16:37:32 virt-010 cmirrord[30645]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 27 16:37:42 virt-010 lvm[31056]: Monitoring mirror device revolution_9-mirror_1 for events.
Mar 27 16:37:42 virt-010 lvm[31056]: revolution_9-mirror_1 is now in-sync.

[root@virt-010 ~]# grep CS_ERR_BAD_HANDLE /var/log/messages
Mar 27 16:33:21 virt-010 cmirrord[30645]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 27 16:37:26 virt-010 cmirrord[30645]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 27 16:37:27 virt-010 cmirrord[30645]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 27 16:37:32 virt-010 cmirrord[30645]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 27 16:39:51 virt-010 cmirrord[30645]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 27 16:39:52 virt-010 cmirrord[30645]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 27 16:39:58 virt-010 cmirrord[30645]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 27 16:42:25 virt-010 cmirrord[30645]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 27 16:42:25 virt-010 cmirrord[30645]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 27 16:42:25 virt-010 cmirrord[30645]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 27 16:43:33 virt-010 cmirrord[30645]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 27 16:43:35 virt-010 cmirrord[30645]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 27 16:43:40 virt-010 cmirrord[30645]: cpg_dispatch failed: CS_ERR_BAD_HANDLE

Versions tested on:

3.10.0-113.el7.x86_64
lvm2-2.02.105-14.el7 BUILT: Wed Mar 26 14:29:41 CET 2014
lvm2-libs-2.02.105-14.el7 BUILT: Wed Mar 26 14:29:41 CET 2014
lvm2-cluster-2.02.105-14.el7 BUILT: Wed Mar 26 14:29:41 CET 2014
device-mapper-1.02.84-14.el7 BUILT: Wed Mar 26 14:29:41 CET 2014
device-mapper-libs-1.02.84-14.el7 BUILT: Wed Mar 26 14:29:41 CET 2014
device-mapper-event-1.02.84-14.el7 BUILT: Wed Mar 26 14:29:41 CET 2014
device-mapper-event-libs-1.02.84-14.el7 BUILT: Wed Mar 26 14:29:41 CET 2014
device-mapper-persistent-data-0.2.8-5.el7 BUILT: Sat Mar 1 02:15:56 CET 2014
cmirror-2.02.105-14.el7 BUILT: Wed Mar 26 14:29:41 CET 2014

I wasn't receiving any of these messages when I was testing, but I didn't test both leg and log as "allocate". The only thing I can think of now is that a 'resume' command comes in before the state transition phase is processed as part of the preceding 'postsuspend'. If that were the case, I would have to move the point at which the kernel is given acknowledgement of the processing of the 'postsuspend' command. I'm not sure I'm willing to do that with such a short window in the RHEL 7.0 timeframe and such a low priority for the bug. I will at least make sure I can reproduce with the same RPMs.

After an hour of running revolution_9, I can't reproduce with the default fault policies. Also, please remember to post the command used to produce the bug. In this case, my unsuccessful attempt was from:

./revolution_9 -l /usr/tests/sts-rhel7.0/ -r /usr/tests/sts-rhel7.0/ -R ~/bp.xml

I will try with the '-F' flag ("allocate" policies for leg and log).

OK, that's another hour of running without reproducing, this time with the following:

./revolution_9 -l /usr/tests/sts-rhel7.0/ -r /usr/tests/sts-rhel7.0/ -R ~/bp.xml -F

Could you please retest? Are you sure you haven't upgraded but left an old cmirrord running?
Here is what I am running with:

[root]# /usr/tests/sts-rhel7.0/lvm2/bin/lvm_rpms
3.10.0-115.el7.x86_64
lvm2-2.02.105-14.el7 BUILT: Wed Mar 26 08:29:41 CDT 2014
lvm2-libs-2.02.105-14.el7 BUILT: Wed Mar 26 08:29:41 CDT 2014
lvm2-cluster-2.02.105-14.el7 BUILT: Wed Mar 26 08:29:41 CDT 2014
device-mapper-1.02.84-14.el7 BUILT: Wed Mar 26 08:29:41 CDT 2014
device-mapper-libs-1.02.84-14.el7 BUILT: Wed Mar 26 08:29:41 CDT 2014
device-mapper-event-1.02.84-14.el7 BUILT: Wed Mar 26 08:29:41 CDT 2014
device-mapper-event-libs-1.02.84-14.el7 BUILT: Wed Mar 26 08:29:41 CDT 2014
device-mapper-persistent-data-0.2.8-5.el7 BUILT: Fri Feb 28 19:15:56 CST 2014
cmirror-2.02.105-14.el7 BUILT: Wed Mar 26 08:29:41 CDT 2014

That may well be the case. I did the upgrade, but I cannot remember now whether I restarted the cluster. I will re-test now after a reboot (just in case something is left over) and post back the results as soon as possible.

We should probably add an automatic restart on update like we already have for clvmd. The only question is what happens if the restart lands in the middle of a cmirror operation - this must be done gracefully (or cmirror must be able to continue with the operation once it has been restarted).

The messages are no longer showing up, although I ran into different issues. This bug can be closed, since the fix works once the cluster is restarted after the update (I hadn't realized that the resource was not restarted/refreshed after the update). Yeah, it would be good if we could somehow replace the cmirrord process without the node being fenced off or causing mirror corruption. Anyway, I can now mark this VERIFIED with the same packages as in Comment #8.

This request was resolved in Red Hat Enterprise Linux 7.0. Contact your manager or support representative in case you have further questions about the request.