Bug 1312094

Summary: crmd can crash after unexpected remote connection takeover
Product: Red Hat Enterprise Linux 7 Reporter: Ken Gaillot <kgaillot>
Component: pacemaker Assignee: Ken Gaillot <kgaillot>
Status: CLOSED ERRATA QA Contact: cluster-qe <cluster-qe>
Severity: medium Docs Contact:
Priority: high    
Version: 7.2 CC: abeekhof, cluster-maint, cluster-qe, phagara
Target Milestone: rc   
Target Release: 7.3   
Hardware: All   
OS: All   
Whiteboard:
Fixed In Version: pacemaker-1.1.15-1.2c148ac.git.el7 Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: 1312092 Environment:
Last Closed: 2016-11-03 18:58:51 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1304771    
Bug Blocks: 1379784    

Comment 3 Mike McCune 2016-03-28 22:52:26 UTC
This bug was accidentally moved from POST to MODIFIED by an error in automation. Please contact mmccune with any questions.

Comment 5 Patrik Hagara 2016-09-08 14:28:34 UTC
Setup: 3-node cluster + 1 pacemaker_remote node

Before the fix:

> Sep 08 14:26:11 [27822] virt-166       crmd:    error: remote_lrm_op_callback:	Unexpected pacemaker_remote client takeover. Disconnecting
> Sep 08 14:26:11 [27822] virt-166       crmd:     info: lrmd_api_disconnect:	Disconnecting from 3 lrmd service
> Sep 08 14:26:11 [27822] virt-166       crmd:     info: lrmd_api_disconnect:	Disconnecting from 3 lrmd service
> Sep 08 14:26:11 [27822] virt-166       crmd:     info: lrmd_tls_connection_destroy:	TLS connection destroyed
> Sep 08 14:26:11 [27816] virt-166 pacemakerd:    error: child_waitpid:	Managed process 27822 (crmd) dumped core
> Sep 08 14:26:11 [27816] virt-166 pacemakerd:    error: pcmk_child_exit:	The crmd process (27822) terminated with signal 6 (core=1)

The pacemaker_remote node was disconnected from the cluster, and crmd on the cluster node hosting the pacemaker_remote connection crashed and was restarted. The cluster returned to a fully operational state shortly thereafter.
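
To check other logs for the same failure signature, the pacemakerd line above can be matched mechanically. The following is a minimal sketch, not part of Pacemaker: the helper name find_child_crashes and the regular expression are illustrative and assume the log layout matches the excerpt quoted in this comment.

```python
import re

# Pattern modeled on the pcmk_child_exit line quoted above; other Pacemaker
# versions may word this message differently (assumption).
CRASH_RE = re.compile(
    r"pacemakerd:\s+error:\s+pcmk_child_exit:\s+"
    r"The (?P<daemon>\S+) process \((?P<pid>\d+)\) "
    r"terminated with signal (?P<signal>\d+) \(core=(?P<core>\d)\)"
)


def find_child_crashes(lines):
    """Return one dict per pacemakerd child process reported as killed by a signal."""
    crashes = []
    for line in lines:
        match = CRASH_RE.search(line)
        if match:
            crashes.append({
                "daemon": match["daemon"],
                "pid": int(match["pid"]),
                "signal": int(match["signal"]),  # 6 is SIGABRT on Linux
                "core_dumped": match["core"] == "1",
            })
    return crashes


# Two lines copied from the "Before the fix" excerpt.
SAMPLE = [
    "Sep 08 14:26:11 [27822] virt-166       crmd:     info: "
    "lrmd_tls_connection_destroy:\tTLS connection destroyed",
    "Sep 08 14:26:11 [27816] virt-166 pacemakerd:    error: "
    "pcmk_child_exit:\tThe crmd process (27822) terminated with signal 6 (core=1)",
]

for crash in find_child_crashes(SAMPLE):
    print(crash)
```

Running the helper over the full log of a cluster node should report no crmd entries once the fixed package is installed.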


After the fix:

> Sep  8 16:23:46 virt-055 pacemaker_remoted[17977]:  notice: LRMD client connection established. 0xd8e120 id: f93cb6a1-a321-4ff5-8c75-398190f50b28
> Sep  8 16:23:56 virt-055 pacemaker_remoted[17977]:  notice: LRMD client disconnecting remote client - name: <unknown> id: f93cb6a1-a321-4ff5-8c75-398190f50b28
> Sep  8 16:23:56 virt-055 pacemaker_remoted[17977]:   error: Remote client authentication timed out

The cluster remained fully operational without service disruption. The cluster node hosting the pacemaker_remote connection logged no messages, and the remote node itself logged an authentication timeout error.
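
This comment does not say how the stray connection was produced. One plausible way to exercise the same path, offered only as an assumption, is a plain TCP client that connects to the pacemaker_remote port (3121 by default) and never authenticates. The sketch below does that and measures how long the peer keeps the socket open; hold_idle_connection is a hypothetical helper, not a Pacemaker tool.

```python
import socket
import time


def hold_idle_connection(host, port=3121, max_wait=15.0):
    """Connect, send nothing, and time how long the peer keeps the socket open.

    Returns the seconds elapsed until the peer closed the connection, or
    None if the connection was still open after max_wait seconds.
    """
    with socket.create_connection((host, port), timeout=5.0) as sock:
        start = time.monotonic()
        deadline = start + max_wait
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return None
            sock.settimeout(remaining)
            try:
                chunk = sock.recv(4096)
            except socket.timeout:
                return None  # peer never closed within max_wait
            except ConnectionResetError:
                return time.monotonic() - start
            if chunk == b"":  # orderly close by the peer
                return time.monotonic() - start
            # Any bytes received are ignored; only the close matters here.
```

Called as hold_idle_connection("virt-055") against the remote node from the excerpt above, a result of roughly 10 seconds would be consistent with the gap between the 16:23:46 and 16:23:56 log lines. Whether this reproduces the original takeover on an unfixed build is not confirmed by this report.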

Marking as verified in pacemaker-1.1.15-1.2c148ac.git.el7

Comment 7 errata-xmlrpc 2016-11-03 18:58:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2578.html