1296876 – Monitor unknow error of systemd resources(samba,rpcbind),lrmd connect failed and cluster failover

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1296876 - Monitor unknow error of systemd resources(samba,rpcbind),lrmd connect failed and cluster failover

Summary: Monitor unknow error of systemd resources(samba,rpcbind),lrmd connect failed ...

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	pacemaker
Sub Component:
Version:	7.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	rc
Target Release:	7.2
Assignee:	Ken Gaillot
QA Contact:	cluster-qe@redhat.com
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2016-01-08 10:15 UTC by Zhaoming Zhang
Modified:	2016-06-01 07:22 UTC (History)
CC List:	3 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2016-05-16 17:36:25 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
system messages of the cluster nodes (710.40 KB, text/plain) 2016-01-11 02:25 UTC, Zhaoming Zhang	no flags	Details
system message of the cluster node nas1 (319.08 KB, text/plain) 2016-01-11 02:27 UTC, Zhaoming Zhang	no flags	Details
View All

Description Zhaoming Zhang 2016-01-08 10:15:33 UTC

Description of problem:
Monitor unknow error of systemd resources(samba,rpcbind),lrmd connect failed and cluster failover

Version-Release number of selected component (if applicable):
pacemaker-1.1.10-29.el7.x86_64

How reproducible:
unpredictable,do nothing,just running the cluster

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:


node0's log:
Dec  7 20:30:01 nas1 systemd: Starting Session 3016 of user root.
Dec  7 20:30:01 nas1 systemd: Started Session 3016 of user root.
Dec  7 20:40:01 nas1 systemd: Starting Session 3017 of user root.
Dec  7 20:40:01 nas1 systemd: Started Session 3017 of user root.
Dec  7 20:45:06 nas1 systemd: Reloading.
Dec  7 20:45:56 nas1 crmd[35821]: notice: process_lrm_event: LRM operation nas_samba_monitor_15000 (call=89, rc=1, cib-update=49, confirmed=false) unknown error
Dec  7 20:45:56 nas1 attrd[16368]: notice: attrd_cs_dispatch: Update relayed from node1
Dec  7 20:45:56 nas1 attrd[16368]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-nas_samba (9)
Dec  7 20:45:56 nas1 attrd[16368]: notice: attrd_perform_update: Sent update 85: fail-count-nas_samba=9
Dec  7 20:45:56 nas1 attrd[16368]: notice: attrd_cs_dispatch: Update relayed from node1
Dec  7 20:45:56 nas1 attrd[16368]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-nas_samba (1449492356)
Dec  7 20:45:56 nas1 attrd[16368]: notice: attrd_perform_update: Sent update 88: last-failure-nas_samba=1449492356
Dec  7 20:45:56 nas1 crmd[35821]: error: crm_ipc_read: Connection to lrmd failed
Dec  7 20:45:56 nas1 crmd[35821]: error: mainloop_gio_callback: Connection to lrmd[0x215b020] closed (I/O condition=17)
Dec  7 20:45:56 nas1 crmd[35821]: crit: lrm_connection_destroy: LRM Connection failed
Dec  7 20:45:56 nas1 crmd[35821]: error: do_log: FSA: Input I_ERROR from lrm_connection_destroy() received in state S_NOT_DC
Dec  7 20:45:56 nas1 crmd[35821]: notice: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=lrm_connection_destroy ]
Dec  7 20:45:56 nas1 crmd[35821]: warning: do_recover: Fast-tracking shutdown in response to errors
Dec  7 20:45:56 nas1 crmd[35821]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Dec  7 20:45:56 nas1 pacemakerd[16364]: error: pcmk_child_exit: Child process lrmd (35818) exited: Operation not permitted (1)
Dec  7 20:45:56 nas1 pacemakerd[16364]: notice: pcmk_process_exit: Respawning failed child process: lrmd
Dec  7 20:45:56 nas1 crmd[35821]: notice: lrm_state_verify_stopped: Recurring action fence_ipmi1:97 (fence_ipmi1_monitor_60000) incomplete at shutdown
Dec  7 20:45:56 nas1 crmd[35821]: error: lrm_state_verify_stopped: 1 resources were active at shutdown.
Dec  7 20:45:56 nas1 crmd[35821]: notice: do_lrm_control: Disconnected from the LRM
Dec  7 20:45:56 nas1 crmd[35821]: notice: terminate_cs_connection: Disconnecting from Corosync
Dec  7 20:45:56 nas1 cib[16365]: warning: qb_ipcs_event_sendv: new_event_notification (16365-35821-9): Broken pipe (32)
Dec  7 20:45:56 nas1 cib[16365]: warning: do_local_notify: A-Sync reply to crmd failed: No message of desired type
Dec  7 20:45:56 nas1 crmd[35821]: error: crmd_fast_exit: Could not recover from internal error
Dec  7 20:45:56 nas1 pacemakerd[16364]: error: pcmk_child_exit: Child process crmd (35821) exited: Generic Pacemaker error (201)
Dec  7 20:45:56 nas1 pacemakerd[16364]: notice: pcmk_process_exit: Respawning failed child process: crmd
Dec  7 20:45:56 nas1 crmd[43515]: notice: main: CRM Git Version: 368c726
Dec  7 20:45:56 nas1 crmd[43515]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Dec  7 20:45:56 nas1 crmd[43515]: notice: cluster_connect_quorum: Quorum acquired
Dec  7 20:45:56 nas1 crmd[43515]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node0[1] - state is now member (was (null))
Dec  7 20:45:56 nas1 crmd[43515]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node1[2] - state is now member (was (null))
Dec  7 20:45:56 nas1 crmd[43515]: notice: do_started: The local CRM is operational
Dec  7 20:45:56 nas1 crmd[43515]: notice: do_state_transition: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
Dec  7 20:45:58 nas1 crmd[43515]: notice: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
Dec  7 20:45:58 nas1 attrd[16368]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Dec  7 20:45:58 nas1 attrd[16368]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-nas_samba (1449492356)
Dec  7 20:45:58 nas1 attrd[16368]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Dec  7 20:45:58 nas1 attrd[16368]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-nas_samba (9)
Dec  7 20:45:58 nas1 attrd[16368]: warning: attrd_cib_callback: Update 90 for last-failure-nas_samba=1449492356 failed: No such device or address
Dec  7 20:45:58 nas1 attrd[16368]: warning: attrd_cib_callback: Update 92 for probe_complete=true failed: No such device or address
Dec  7 20:45:58 nas1 attrd[16368]: warning: attrd_cib_callback: Update 94 for fail-count-nas_samba=9 failed: No such device or address
Dec  7 20:45:59 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_rpcbind_monitor_0 (call=41, rc=0, cib-update=10, confirmed=true) ok
Dec  7 20:45:59 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_samba_monitor_0 (call=49, rc=0, cib-update=11, confirmed=true) ok
Dec  7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_tomcat_monitor_0 (call=73, rc=0, cib-update=14, confirmed=true) ok
Dec  7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_shared_metadata_monitor_0 (call=5, rc=0, cib-update=15, confirmed=true) ok
Dec  7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_filesystem_test_monitor_0 (call=9, rc=0, cib-update=16, confirmed=true) ok
Dec  7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_filesystem_Users_monitor_0 (call=13, rc=0, cib-update=17, confirmed=true) ok
Dec  7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_filesystem_gsics_monitor_0 (call=17, rc=0, cib-update=18, confirmed=true) ok
Dec  7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_filesystem_Gsix_monitor_0 (call=21, rc=0, cib-update=19, confirmed=true) ok
Dec  7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_filesystem_GX_monitor_0 (call=37, rc=0, cib-update=20, confirmed=true) ok
Dec  7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_filesystem_MAS_monitor_0 (call=25, rc=0, cib-update=21, confirmed=true) ok
Dec  7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_filesystem_User2_monitor_0 (call=33, rc=0, cib-update=22, confirmed=true) ok
Dec  7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_filesystem_kyfz_monitor_0 (call=29, rc=0, cib-update=23, confirmed=true) ok
Dec  7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_syslun_monitor_0 (call=65, rc=0, cib-update=24, confirmed=true) ok
Dec  7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_service_ip_192_168_3_100_monitor_0 (call=69, rc=0, cib-update=25, confirmed=true) ok
Dec  7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_vip_10_24_171_90_monitor_0 (call=53, rc=0, cib-update=26, confirmed=true) ok
Dec  7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_ip_monitor_0 (call=77, rc=0, cib-update=27, confirmed=true) ok
Dec  7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_nfs_monitor_0 (call=45, rc=0, cib-update=28, confirmed=true) ok
Dec  7 20:46:00 nas1 stonith-ng[16366]: notice: stonith_device_register: Device 'fence_ipmi1' already existed in device list (2 active devices)



node1's log:
Dec  7 20:38:36 nas2 crmd[33274]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
Dec  7 20:38:36 nas2 pengine[33273]: notice: unpack_config: On loss of CCM Quorum: Ignore
Dec  7 20:38:36 nas2 pengine[33273]: notice: process_pe_message: Calculated Transition 809: /var/lib/pacemaker/pengine/pe-input-117.bz2
Dec  7 20:38:36 nas2 crmd[33274]: notice: run_graph: Transition 809 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-117.bz2): Complete
Dec  7 20:38:36 nas2 crmd[33274]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Dec  7 20:40:01 nas2 systemd: Starting Session 2955 of user root.
Dec  7 20:40:01 nas2 systemd: Started Session 2955 of user root.
Dec  7 20:45:56 nas2 crmd[33274]: warning: update_failcount: Updating failcount for nas_samba on node0 after failed monitor: rc=1 (update=value++, time=1449492356)
Dec  7 20:45:56 nas2 crmd[33274]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Dec  7 20:45:56 nas2 crmd[33274]: warning: match_down_event: No match for shutdown action on 1
Dec  7 20:45:56 nas2 crmd[33274]: notice: peer_update_callback: Stonith/shutdown of node0 not matched
Dec  7 20:45:56 nas2 pengine[33273]: notice: unpack_config: On loss of CCM Quorum: Ignore
Dec  7 20:45:56 nas2 pengine[33273]: warning: unpack_rsc_op: Processing failed op monitor for nas_samba on node0: unknown error (1)
Dec  7 20:45:56 nas2 pengine[33273]: notice: process_pe_message: Calculated Transition 810: /var/lib/pacemaker/pengine/pe-input-118.bz2
Dec  7 20:45:56 nas2 pengine[33273]: notice: unpack_config: On loss of CCM Quorum: Ignore
Dec  7 20:45:56 nas2 pengine[33273]: warning: pe_fence_node: Node node0 is unclean because it is partially and/or un-expectedly down
Dec  7 20:45:56 nas2 pengine[33273]: warning: determine_online_status: Node node0 is unclean
Dec  7 20:45:56 nas2 pengine[33273]: warning: stage6: Node node0 is unclean!
Dec  7 20:45:56 nas2 pengine[33273]: warning: stage6: YOUR RESOURCES ARE NOW LIKELY COMPROMISED
Dec  7 20:45:56 nas2 pengine[33273]: error: stage6: ENABLE STONITH TO KEEP YOUR RESOURCES SAFE
Dec  7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start   nas_shared_metadata	(node1)
Dec  7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start   nas_filesystem_test	(node1)
Dec  7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start   nas_filesystem_Users	(node1)
Dec  7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start   nas_filesystem_gsics	(node1)
Dec  7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start   nas_filesystem_Gsix	(node1)
Dec  7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start   nas_filesystem_MAS	(node1)
Dec  7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start   nas_filesystem_kyfz	(node1)
Dec  7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start   nas_filesystem_User2	(node1)
Dec  7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start   nas_filesystem_GX	(node1)
Dec  7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start   nas_rpcbind	(node1)
Dec  7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start   nas_nfs	(node1)
Dec  7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start   nas_samba	(node1)
Dec  7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start   nas_vip_10_24_171_90	(node1)
Dec  7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start   fence_ipmi1	(node1)
Dec  7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start   nas_syslun	(node1)
Dec  7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start   nas_service_ip_192_168_3_100	(node1)
Dec  7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start   nas_tomcat	(node1)
Dec  7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start   nas_ip	(node1)
Dec  7 20:45:56 nas2 pengine[33273]: error: process_pe_message: Calculated Transition 811: /var/lib/pacemaker/pengine/pe-error-3.bz2
Dec  7 20:45:56 nas2 pengine[33273]: notice: process_pe_message: Configuration ERRORs found during PE processing.  Please run "crm_verify -L" to identify issues.
Dec  7 20:45:56 nas2 crmd[33274]: warning: do_log: FSA: Input I_PE_SUCCESS from handle_response() received in state S_INTEGRATION
Dec  7 20:45:58 nas2 cib[33269]: warning: cib_process_request: Completed cib_modify operation for section status: No such device or address (rc=-6, origin=node0/attrd/90, version=0.84.231)
Dec  7 20:45:58 nas2 cib[33269]: warning: cib_process_request: Completed cib_modify operation for section status: No such device or address (rc=-6, origin=node0/attrd/92, version=0.84.231)
Dec  7 20:45:58 nas2 cib[33269]: warning: cib_process_request: Completed cib_modify operation for section status: No such device or address (rc=-6, origin=node0/attrd/94, version=0.84.231)
Dec  7 20:45:59 nas2 attrd[33272]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Dec  7 20:45:59 nas2 attrd[33272]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Dec  7 20:45:59 nas2 pengine[33273]: notice: unpack_config: On loss of CCM Quorum: Ignore
Dec  7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start   nas_shared_metadata	(node0)
Dec  7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start   nas_filesystem_test	(node0)
Dec  7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start   nas_filesystem_Users	(node0)
Dec  7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start   nas_filesystem_gsics	(node0)
Dec  7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start   nas_filesystem_Gsix	(node0)
Dec  7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start   nas_filesystem_MAS	(node0)
Dec  7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start   nas_filesystem_kyfz	(node0)
Dec  7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start   nas_filesystem_User2	(node0)
Dec  7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start   nas_filesystem_GX	(node0)
Dec  7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start   nas_rpcbind	(node0)
Dec  7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start   nas_nfs	(node0)
Dec  7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start   nas_samba	(node0)
Dec  7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start   nas_vip_10_24_171_90	(node0)
Dec  7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start   fence_ipmi1	(node0)
Dec  7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start   nas_syslun	(node0)
Dec  7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start   nas_service_ip_192_168_3_100	(node0)
Dec  7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start   nas_tomcat	(node0)
Dec  7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start   nas_ip	(node0)
Dec  7 20:45:59 nas2 pengine[33273]: notice: process_pe_message: Calculated Transition 812: /var/lib/pacemaker/pengine/pe-input-119.bz2
Dec  7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 5: monitor nas_shared_metadata_monitor_0 on node0
Dec  7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 6: monitor nas_filesystem_test_monitor_0 on node0
Dec  7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 7: monitor nas_filesystem_Users_monitor_0 on node0
Dec  7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 8: monitor nas_filesystem_gsics_monitor_0 on node0
Dec  7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 9: monitor nas_filesystem_Gsix_monitor_0 on node0
Dec  7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 10: monitor nas_filesystem_MAS_monitor_0 on node0
Dec  7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 11: monitor nas_filesystem_kyfz_monitor_0 on node0
Dec  7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 12: monitor nas_filesystem_User2_monitor_0 on node0
Dec  7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 13: monitor nas_filesystem_GX_monitor_0 on node0
Dec  7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 14: monitor nas_rpcbind_monitor_0 on node0
Dec  7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 15: monitor nas_nfs_monitor_0 on node0
Dec  7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 16: monitor nas_samba_monitor_0 on node0
Dec  7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 17: monitor nas_vip_10_24_171_90_monitor_0 on node0
Dec  7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 18: monitor fence_ipmi1_monitor_0 on node0
Dec  7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 19: monitor fence_ipmi0_monitor_0 on node0
Dec  7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 20: monitor nas_syslun_monitor_0 on node0
Dec  7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 21: monitor nas_service_ip_192_168_3_100_monitor_0 on node0
Dec  7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 22: monitor nas_tomcat_monitor_0 on node0
Dec  7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 23: monitor nas_ip_monitor_0 on node0
Dec  7 20:45:59 nas2 crmd[33274]: warning: status_from_rc: Action 14 (nas_rpcbind_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec  7 20:45:59 nas2 crmd[33274]: warning: status_from_rc: Action 16 (nas_samba_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec  7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 22 (nas_tomcat_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec  7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 5 (nas_shared_metadata_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec  7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 6 (nas_filesystem_test_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec  7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 7 (nas_filesystem_Users_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec  7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 8 (nas_filesystem_gsics_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec  7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 9 (nas_filesystem_Gsix_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec  7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 13 (nas_filesystem_GX_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec  7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 10 (nas_filesystem_MAS_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec  7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 12 (nas_filesystem_User2_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec  7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 11 (nas_filesystem_kyfz_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec  7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 20 (nas_syslun_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec  7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 21 (nas_service_ip_192_168_3_100_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec  7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 17 (nas_vip_10_24_171_90_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec  7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 23 (nas_ip_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec  7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 15 (nas_nfs_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec  7 20:46:00 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 4: probe_complete probe_complete on node0 - no waiting
Dec  7 20:46:00 nas2 crmd[33274]: notice: run_graph: Transition 812 (Complete=22, Pending=0, Fired=0, Skipped=40, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-119.bz2): Stopped
Dec  7 20:46:00 nas2 pengine[33273]: notice: unpack_config: On loss of CCM Quorum: Ignore
Dec  7 20:46:00 nas2 pengine[33273]: notice: LogActions: Start   fence_ipmi1	(node0)



------------------------------------------------------------ + pcs config + ------------------------------------------------------------
Cluster Name: my_cluster
Corosync Nodes:
 node0 node1 
Pacemaker Nodes:
 node0 node1 

Resources: 
 Group: nas_group
  Resource: nas_shared_metadata (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/mapper/metadata directory=/nasgui/metadata fstype=xfs options=sync 
   Operations: start interval=0s timeout=60 (nas_shared_metadata-start-timeout-60)
               stop interval=0s timeout=60 (nas_shared_metadata-stop-timeout-60)
               monitor interval=15s timeout=15s (nas_shared_metadata-monitor-interval-15s)
  Resource: nas_filesystem_test (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/mapper/3600b342b8d58e87d8863d70d4d0000d9 directory=/mnt/test fstype=xfs options=sync,usrquota,prjquota 
   Operations: start interval=0s timeout=60 (nas_filesystem_test-start-timeout-60)
               stop interval=0s timeout=60 (nas_filesystem_test-stop-timeout-60)
               monitor interval=15s timeout=20s (nas_filesystem_test-monitor-interval-15s)
  Resource: nas_filesystem_Users (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/mapper/3600b342cefeea4cdfb66db3eed0000d9 directory=/mnt/Users fstype=xfs options=sync,usrquota,prjquota 
   Operations: start interval=0s timeout=60 (nas_filesystem_Users-start-timeout-60)
               stop interval=0s timeout=60 (nas_filesystem_Users-stop-timeout-60)
               monitor interval=15s timeout=20s (nas_filesystem_Users-monitor-interval-15s)
  Resource: nas_filesystem_gsics (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/mapper/3600b342196550d2db9fcd9419d0000d9 directory=/mnt/gsics fstype=xfs options=sync,usrquota,grpquota 
   Operations: start interval=0s timeout=60 (nas_filesystem_gsics-start-timeout-60)
               stop interval=0s timeout=60 (nas_filesystem_gsics-stop-timeout-60)
               monitor interval=15s timeout=20s (nas_filesystem_gsics-monitor-interval-15s)
  Resource: nas_filesystem_Gsix (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/mapper/3600b342e465188ddc88ed217fd0000d9 directory=/mnt/Gsix fstype=xfs options=sync,usrquota,grpquota 
   Operations: start interval=0s timeout=60 (nas_filesystem_Gsix-start-timeout-60)
               stop interval=0s timeout=60 (nas_filesystem_Gsix-stop-timeout-60)
               monitor interval=15s timeout=20s (nas_filesystem_Gsix-monitor-interval-15s)
  Resource: nas_filesystem_MAS (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/mapper/3600b342a25ca66fde792d83ddd0000d9 directory=/mnt/MAS fstype=xfs options=sync,usrquota,grpquota 
   Operations: start interval=0s timeout=60 (nas_filesystem_MAS-start-timeout-60)
               stop interval=0s timeout=60 (nas_filesystem_MAS-stop-timeout-60)
               monitor interval=15s timeout=20s (nas_filesystem_MAS-monitor-interval-15s)
  Resource: nas_filesystem_kyfz (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/mapper/3600b342f6432060d750fd1114d0000d9 directory=/mnt/kyfz fstype=xfs options=sync,uquota,gquota 
   Operations: start interval=0s timeout=60 (nas_filesystem_kyfz-start-timeout-60)
               stop interval=0s timeout=60 (nas_filesystem_kyfz-stop-timeout-60)
               monitor interval=15s timeout=20s (nas_filesystem_kyfz-monitor-interval-15s)
  Resource: nas_filesystem_User2 (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/mapper/3600b342aa9dc9d7df167dbb03d0000d9 directory=/mnt/User2 fstype=xfs options=sync,usrquota,prjquota 
   Operations: start interval=0s timeout=60 (nas_filesystem_User2-start-timeout-60)
               stop interval=0s timeout=60 (nas_filesystem_User2-stop-timeout-60)
               monitor interval=15s timeout=20s (nas_filesystem_User2-monitor-interval-15s)
  Resource: nas_filesystem_GX (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/mapper/3600b342bf60c8a1d404bdd6c0d0000d9 directory=/mnt/GX fstype=xfs options=sync,uquota,gquota 
   Operations: start interval=0s timeout=60 (nas_filesystem_GX-start-timeout-60)
               stop interval=0s timeout=60 (nas_filesystem_GX-stop-timeout-60)
               monitor interval=15s timeout=20s (nas_filesystem_GX-monitor-interval-15s)
  Resource: nas_rpcbind (class=systemd type=rpcbind)
   Operations: monitor interval=15s start-delay=10s timeout=20s (nas_rpcbind-monitor-interval-15s)
  Resource: nas_nfs (class=ocf provider=heartbeat type=nfsserver)
   Attributes: nfs_shared_infodir=/nasgui/metadata/nfsinfo nfsd_args= 
   Operations: start interval=0s timeout=40 (nas_nfs-start-timeout-40)
               stop interval=0s timeout=20s (nas_nfs-stop-timeout-20s)
               monitor interval=15s start-delay=10s timeout=20s (nas_nfs-monitor-interval-15s)
  Resource: nas_samba (class=systemd type=smb)
   Operations: monitor interval=60s start-delay=10s (nas_samba-monitor-start-delay-10s)
               monitor interval=15s timeout=20s (nas_samba-monitor-interval-15s)
  Resource: nas_vip_10_24_171_90 (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=10.24.171.90 nic=bond1 cidr_netmask=24 
   Operations: monitor interval=20s timeout=20s (nas_vip_10_24_171_90-monitor-20s)
 Group: nas_tomcat_service
  Resource: nas_syslun (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/mapper/sys_lun directory=/nasgui/share fstype=xfs options=sync 
   Operations: start interval=0s timeout=60 (nas_syslun-start-timeout-60)
               stop interval=0s timeout=60 (nas_syslun-stop-timeout-60)
               monitor interval=15s timeout=15s (nas_syslun-monitor-interval-15s)
  Resource: nas_service_ip_192_168_3_100 (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=192.168.3.100 cidr_netmask=24 
   Operations: start interval=0s timeout=20s (nas_service_ip_192_168_3_100-start-timeout-20s)
               stop interval=0s timeout=20s (nas_service_ip_192_168_3_100-stop-timeout-20s)
               monitor interval=20s timeout=20s (nas_service_ip_192_168_3_100-monitor-interval-20s)
  Resource: nas_tomcat (class=ocf provider=heartbeat type=tomcat)
   Attributes: java_home=/usr/local/jdk/jdk1.6.0_37 catalina_home=/usr/local/nas/apache-tomcat-6.0.35 
   Operations: start interval=0s timeout=60s (nas_tomcat-start-timeout-60s)
               stop interval=0s timeout=120s (nas_tomcat-stop-timeout-120s)
               monitor interval=10s start-delay=5s timeout=30s (nas_tomcat-monitor-interval-10s)
  Resource: nas_ip (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=10.24.171.91 cidr_netmask=24 
   Operations: start interval=0s timeout=20s (nas_ip-start-timeout-20s)
               stop interval=0s timeout=20s (nas_ip-stop-timeout-20s)
               monitor interval=20s timeout=20s (nas_ip-monitor-interval-20s)

Stonith Devices: 
 Resource: fence_ipmi1 (class=stonith type=fence_ipmilan)
  Attributes: pcmk_host_list=node1 pcmk_host_check=static-list ipaddr=192.168.66.41 action=reboot login=root passwd=passwd delay=15 auth=password lanplus=true 
  Operations: monitor interval=60s (fence_ipmi1-monitor-interval-60s)
 Resource: fence_ipmi0 (class=stonith type=fence_ipmilan)
  Attributes: pcmk_host_list=node0 pcmk_host_check=static-list ipaddr=192.168.66.40 action=reboot login=root passwd=passwd delay=15 auth=password lanplus=true 
  Operations: monitor interval=60s (fence_ipmi0-monitor-interval-60s)
Fencing Levels: 

Location Constraints:
  Resource: fence_ipmi0
    Enabled on: node1 (score:INFINITY) (id:location-fence_ipmi0-node1-INFINITY)
  Resource: fence_ipmi1
    Enabled on: node0 (score:INFINITY) (id:location-fence_ipmi1-node0-INFINITY)
  Resource: nas_tomcat_service
    Enabled on: node0 (score:INFINITY) (id:location-nas_tomcat_service-node0-INFINITY)
Ordering Constraints:
Colocation Constraints:

Cluster Properties:
 cluster-infrastructure: corosync
 dc-version: 1.1.10-29.el7-368c726
 no-quorum-policy: ignore
 stonith-enabled: false
------------------------------------------------------------ ------------------------------------------------------------

Comment 2 Ken Gaillot 2016-01-08 22:20:44 UTC

As discussed on #1284069, the "node0 is unclean" log message indicates that the two nodes lost communication with each other, possibly due to a network issue or load issue. Normally, one node would fence the other in this situation, but fencing is disabled here.

Looking at the messages in the system log around the time of the problem might give further clues as to what caused the problem.

I'd also recommend upgrading to RHEL 7.2 if possible, to get more recent versions of pacemaker and its dependencies.

Comment 3 Zhaoming Zhang 2016-01-11 02:19:29 UTC

Thanks for your reply, Ken.
(In reply to Ken Gaillot from comment #2)
> As discussed on #1284069, the "node0 is unclean" log message indicates that
> the two nodes lost communication with each other, possibly due to a network
> issue or load issue. Normally, one node would fence the other in this
> situation, but fencing is disabled here.

We've tested to enable the fencing,and it would failover. But the power lose is another problem. We add scsi 3 reservation to the Filesystem resourece to avoid the power lose problem for the moment. And this failover bug would lead to reservation conflict.

And I' quite sure that the reason of the two nodes lost communication with each other is not related to network issue or load issue.  
The LOM for the cluster heartbeat is exclusive use.
When we have no load, we still meet the problem. And the log just says:
 
Dec  7 20:45:56 nas1 crmd[35821]: notice: process_lrm_event: LRM operation nas_samba_monitor_15000 (call=89, rc=1, cib-update=49, confirmed=false) unknown error
Dec  7 20:45:56 nas1 attrd[16368]: notice: attrd_cs_dispatch: Update relayed from node1
Dec  7 20:45:56 nas1 attrd[16368]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-nas_samba (9)
Dec  7 20:45:56 nas1 attrd[16368]: notice: attrd_perform_update: Sent update 85: fail-count-nas_samba=9
Dec  7 20:45:56 nas1 attrd[16368]: notice: attrd_cs_dispatch: Update relayed from node1
Dec  7 20:45:56 nas1 attrd[16368]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-nas_samba (1449492356)
Dec  7 20:45:56 nas1 attrd[16368]: notice: attrd_perform_update: Sent update 88: last-failure-nas_samba=1449492356
Dec  7 20:45:56 nas1 crmd[35821]: error: crm_ipc_read: Connection to lrmd failed
Dec  7 20:45:56 nas1 crmd[35821]: error: mainloop_gio_callback: Connection to lrmd[0x215b020] closed (I/O condition=17)
Dec  7 20:45:56 nas1 crmd[35821]: crit: lrm_connection_destroy: LRM Connection failed


> 
> Looking at the messages in the system log around the time of the problem
> might give further clues as to what caused the problem.
> 
> I'd also recommend upgrading to RHEL 7.2 if possible, to get more recent
> versions of pacemaker and its dependencies.

We are planning to upgrade to RHEL 7.2 to verify the bug recently and will attache the messages to this bug record.

Comment 4 Zhaoming Zhang 2016-01-11 02:25:20 UTC

Created attachment 1113470 [details]
system messages of the cluster nodes

Dec  7 20:45:56 nas2 crmd[33274]: warning: match_down_event: No match for shutdown action on 1
Dec  7 20:45:56 nas2 crmd[33274]: notice: peer_update_callback: Stonith/shutdown of node0 not matched
Dec  7 20:45:56 nas2 pengine[33273]: notice: unpack_config: On loss of CCM Quorum: Ignore
Dec  7 20:45:56 nas2 pengine[33273]: warning: unpack_rsc_op: Processing failed op monitor for nas_samba on node0: unknown error (1)
Dec  7 20:45:56 nas2 pengine[33273]: notice: process_pe_message: Calculated Transition 810: /var/lib/pacemaker/pengine/pe-input-118.bz2
Dec  7 20:45:56 nas2 pengine[33273]: notice: unpack_config: On loss of CCM Quorum: Ignore
Dec  7 20:45:56 nas2 pengine[33273]: warning: pe_fence_node: Node node0 is unclean because it is partially and/or un-expectedly down
Dec  7 20:45:56 nas2 pengine[33273]: warning: determine_online_status: Node node0 is unclean
Dec  7 20:45:56 nas2 pengine[33273]: warning: stage6: Node node0 is unclean!
Dec  7 20:45:56 nas2 pengine[33273]: warning: stage6: YOUR RESOURCES ARE NOW LIKELY COMPROMISED
Dec  7 20:45:56 nas2 pengine[33273]: error: stage6: ENABLE STONITH TO KEEP YOUR RESOURCES SAFE

Comment 5 Zhaoming Zhang 2016-01-11 02:27:14 UTC

Created attachment 1113471 [details]
system message of the cluster node nas1

Dec  7 20:28:01 nas1 systemd: Starting Session 3015 of user pcp.
Dec  7 20:28:01 nas1 systemd: Started Session 3015 of user pcp.
Dec  7 20:30:01 nas1 systemd: Starting Session 3016 of user root.
Dec  7 20:30:01 nas1 systemd: Started Session 3016 of user root.
Dec  7 20:40:01 nas1 systemd: Starting Session 3017 of user root.
Dec  7 20:40:01 nas1 systemd: Started Session 3017 of user root.
Dec  7 20:45:06 nas1 systemd: Reloading.
Dec  7 20:45:56 nas1 crmd[35821]: notice: process_lrm_event: LRM operation nas_samba_monitor_15000 (call=89, rc=1, cib-update=49, confirmed=false) unknown error
Dec  7 20:45:56 nas1 attrd[16368]: notice: attrd_cs_dispatch: Update relayed from node1
Dec  7 20:45:56 nas1 attrd[16368]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-nas_samba (9)
Dec  7 20:45:56 nas1 attrd[16368]: notice: attrd_perform_update: Sent update 85: fail-count-nas_samba=9
Dec  7 20:45:56 nas1 attrd[16368]: notice: attrd_cs_dispatch: Update relayed from node1
Dec  7 20:45:56 nas1 attrd[16368]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-nas_samba (1449492356)
Dec  7 20:45:56 nas1 attrd[16368]: notice: attrd_perform_update: Sent update 88: last-failure-nas_samba=1449492356
Dec  7 20:45:56 nas1 crmd[35821]: error: crm_ipc_read: Connection to lrmd failed
Dec  7 20:45:56 nas1 crmd[35821]: error: mainloop_gio_callback: Connection to lrmd[0x215b020] closed (I/O condition=17)
Dec  7 20:45:56 nas1 crmd[35821]: crit: lrm_connection_destroy: LRM Connection failed
Dec  7 20:45:56 nas1 crmd[35821]: error: do_log: FSA: Input I_ERROR from lrm_connection_destroy() received in state S_NOT_DC
Dec  7 20:45:56 nas1 crmd[35821]: notice: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=lrm_connection_destroy ]
Dec  7 20:45:56 nas1 crmd[35821]: warning: do_recover: Fast-tracking shutdown in response to errors
Dec  7 20:45:56 nas1 crmd[35821]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Dec  7 20:45:56 nas1 pacemakerd[16364]: error: pcmk_child_exit: Child process lrmd (35818) exited: Operation not permitted (1)
Dec  7 20:45:56 nas1 pacemakerd[16364]: notice: pcmk_process_exit: Respawning failed child process: lrmd
Dec  7 20:45:56 nas1 crmd[35821]: notice: lrm_state_verify_stopped: Recurring action fence_ipmi1:97 (fence_ipmi1_monitor_60000) incomplete at shutdown
Dec  7 20:45:56 nas1 crmd[35821]: error: lrm_state_verify_stopped: 1 resources were active at shutdown.
Dec  7 20:45:56 nas1 crmd[35821]: notice: do_lrm_control: Disconnected from the LRM
Dec  7 20:45:56 nas1 crmd[35821]: notice: terminate_cs_connection: Disconnecting from Corosync
Dec  7 20:45:56 nas1 cib[16365]: warning: qb_ipcs_event_sendv: new_event_notification (16365-35821-9): Broken pipe (32)
Dec  7 20:45:56 nas1 cib[16365]: warning: do_local_notify: A-Sync reply to crmd failed: No message of desired type
Dec  7 20:45:56 nas1 crmd[35821]: error: crmd_fast_exit: Could not recover from internal error
Dec  7 20:45:56 nas1 pacemakerd[16364]: error: pcmk_child_exit: Child process crmd (35821) exited: Generic Pacemaker error (201)
Dec  7 20:45:56 nas1 pacemakerd[16364]: notice: pcmk_process_exit: Respawning failed child process: crmd
Dec  7 20:45:56 nas1 crmd[43515]: notice: main: CRM Git Version: 368c726
Dec  7 20:45:56 nas1 crmd[43515]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Dec  7 20:45:56 nas1 crmd[43515]: notice: cluster_connect_quorum: Quorum acquired
Dec  7 20:45:56 nas1 crmd[43515]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node0[1] - state is now member (was (null))
Dec  7 20:45:56 nas1 crmd[43515]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node1[2] - state is now member (was (null))
Dec  7 20:45:56 nas1 crmd[43515]: notice: do_started: The local CRM is operational

Comment 6 Zhaoming Zhang 2016-01-11 02:39:20 UTC

(In reply to Ken Gaillot from comment #2)
> As discussed on #1284069, the "node0 is unclean" log message indicates that
> the two nodes lost communication with each other, possibly due to a network
> issue or load issue. Normally, one node would fence the other in this
> situation, but fencing is disabled here.
> 
> Looking at the messages in the system log around the time of the problem
> might give further clues as to what caused the problem.
> 
> I'd also recommend upgrading to RHEL 7.2 if possible, to get more recent
> versions of pacemaker and its dependencies.

One more thing, we didn't find the error in pacemaker-1.1.10-29.el7.x86_64 source code 
as describled on on #1284069:
"
Additional info: This is the problem and fix:

- crm_err("Could not issue %s for %s: %s (%s)", op->action, op->rsc, error.message);
+ crm_err("Could not issue %s for %s: %s", op->action, op->rsc, error.message);

The first line has an extra %s format specifier, which causes it to try to read an argument that was never passed, thus accessing an unknown memory address.
"
Thanks again.

Comment 7 Ken Gaillot 2016-01-11 18:08:22 UTC

(In reply to Zhaoming Zhang from comment #3)
> We've tested to enable the fencing,and it would failover. But the power lose
> is another problem. We add scsi 3 reservation to the Filesystem resourece to
> avoid the power lose problem for the moment. And this failover bug would
> lead to reservation conflict.

That sounds like a good solution. So, each node would have a fencing topology with level 1 = fence_ipmilan and level 2 = fence_iscsi.

> And I' quite sure that the reason of the two nodes lost communication with
> each other is not related to network issue or load issue.  
> The LOM for the cluster heartbeat is exclusive use.
> When we have no load, we still meet the problem. And the log just says:

That is strange, especially with no corosync messages around the time of the problem. I'm looking forward to hearing whether this occurs under RHEL 7.2.

Comment 8 Zhaoming Zhang 2016-01-12 02:02:13 UTC

I noticed that 
"Dec  7 20:45:56 nas1 pacemakerd[16364]: error: pcmk_child_exit: Child process lrmd (35818) exited: Operation not permitted (1)" 
and 
"Dec  7 20:45:56 nas1 crmd[35821]: notice: terminate_cs_connection: Disconnecting from Corosync".

Is there something wrong with the lrmd when using systemd resource?

Tests for this problem is in progress under RHEL 7.2. I wish we can get the result soon.

Thank u!

Comment 9 Ken Gaillot 2016-01-12 16:31:35 UTC

In this case, the lrmd is exiting for some unknown reason. It is exiting with exit code 1, which happens to match the system error "Operation not permitted", which is why that is displayed in the logs, but I don't think that is related to why it exited.

The crmd disconnects from corosync after it loses its connection to lrmd.

I initially thought that the "unclean" message indicated a wider problem, but it's possible that crmd's exit led to that. So lrmd might indeed be the root of the problem, but the logs don't give much information why.

Hopefully, the RHEL 7.2 tests will give us more information (or eliminate the problem entirely).

Comment 10 Zhaoming Zhang 2016-01-13 05:23:46 UTC

(In reply to Ken Gaillot from comment #9)
> In this case, the lrmd is exiting for some unknown reason. It is exiting
> with exit code 1, which happens to match the system error "Operation not
> permitted", which is why that is displayed in the logs, but I don't think
> that is related to why it exited.
> 
> The crmd disconnects from corosync after it loses its connection to lrmd.
> 
> I initially thought that the "unclean" message indicated a wider problem,
> but it's possible that crmd's exit led to that. So lrmd might indeed be the
> root of the problem, but the logs don't give much information why.
> 
> Hopefully, the RHEL 7.2 tests will give us more information (or eliminate
> the problem entirely).

I'm afraid that "Operation not permitted" is related to why lrmd exit. Every time lrmd failed, I found "Operation not permitted" was there.
Here is the logs:



first time:
Dec 16 03:58:01 nas2 systemd: Starting Session 5177 of user pcp.
Dec 16 03:58:01 nas2 systemd: Started Session 5177 of user pcp.
Dec 16 04:00:01 nas2 systemd: Starting Session 5178 of user root.
Dec 16 04:00:01 nas2 systemd: Started Session 5178 of user root.
Dec 16 04:00:26 nas2 systemd: Reloading.
Dec 16 04:01:01 nas2 systemd: Starting Session 5179 of user root.
Dec 16 04:01:01 nas2 systemd: Started Session 5179 of user root.
Dec 16 04:01:16 nas2 crmd[33274]: notice: process_lrm_event: LRM operation nas_rpcbind_monitor_15000 (call=119, rc=1, cib-update=1853, confirmed=false) unknown error
Dec 16 04:01:16 nas2 crmd[33274]: warning: update_failcount: Updating failcount for nas_rpcbind on node1 after failed monitor: rc=1 (update=value++, time=1450209676)
Dec 16 04:01:16 nas2 crmd[33274]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Dec 16 04:01:16 nas2 attrd[33272]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-nas_rpcbind (1)
Dec 16 04:01:16 nas2 attrd[33272]: notice: attrd_perform_update: Sent update 108: fail-count-nas_rpcbind=1
Dec 16 04:01:16 nas2 attrd[33272]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-nas_rpcbind (1450209676)
Dec 16 04:01:16 nas2 attrd[33272]: notice: attrd_perform_update: Sent update 111: last-failure-nas_rpcbind=1450209676
Dec 16 04:01:16 nas2 crmd[33274]: error: crm_ipc_read: Connection to lrmd failed
Dec 16 04:01:16 nas2 crmd[33274]: error: mainloop_gio_callback: Connection to lrmd[0x1fdb770] closed (I/O condition=17)
Dec 16 04:01:16 nas2 pacemakerd[33257]: error: pcmk_child_exit: Child process lrmd (33271) exited: Operation not permitted (1)
Dec 16 04:01:16 nas2 pacemakerd[33257]: notice: pcmk_process_exit: Respawning failed child process: lrmd
Dec 16 04:01:16 nas2 crmd[33274]: crit: lrm_connection_destroy: LRM Connection failed
Dec 16 04:01:16 nas2 crmd[33274]: error: do_log: FSA: Input I_ERROR from lrm_connection_destroy() received in state S_POLICY_ENGINE
Dec 16 04:01:16 nas2 crmd[33274]: warning: do_state_transition: State transition S_POLICY_ENGINE -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=lrm_connection_destroy ]
Dec 16 04:01:16 nas2 crmd[33274]: warning: do_recover: Fast-tracking shutdown in response to errors
Dec 16 04:01:16 nas2 crmd[33274]: warning: do_election_vote: Not voting in election, we're in state S_RECOVERY
Dec 16 04:01:16 nas2 pengine[33273]: notice: unpack_config: On loss of CCM Quorum: Ignore
Dec 16 04:01:16 nas2 crmd[33274]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
......




second time:
Dec 18 05:25:01 nas2 systemd: Started Session 5731 of user pcp.
Dec 18 05:27:17 nas2 systemd: Reloading.
Dec 18 05:28:01 nas2 systemd: Created slice user-995.slice.
Dec 18 05:28:01 nas2 systemd: Starting Session 5732 of user pcp.
Dec 18 05:28:01 nas2 systemd: Started Session 5732 of user pcp.
Dec 18 05:28:07 nas2 crmd[6632]: notice: process_lrm_event: LRM operation nas_samba_monitor_15000 (call=89, rc=1, cib-update=45, confirmed=false) unknown error
Dec 18 05:28:07 nas2 attrd[33272]: notice: attrd_cs_dispatch: Update relayed from node0
Dec 18 05:28:07 nas2 attrd[33272]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-nas_samba (3)
Dec 18 05:28:07 nas2 crmd[6632]: error: crm_ipc_read: Connection to lrmd failed
Dec 18 05:28:07 nas2 crmd[6632]: error: mainloop_gio_callback: Connection to lrmd[0x2024020] closed (I/O condition=17)
Dec 18 05:28:07 nas2 crmd[6632]: crit: lrm_connection_destroy: LRM Connection failed
Dec 18 05:28:07 nas2 crmd[6632]: error: do_log: FSA: Input I_ERROR from lrm_connection_destroy() received in state S_NOT_DC
Dec 18 05:28:07 nas2 crmd[6632]: notice: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=lrm_connection_destroy ]
Dec 18 05:28:07 nas2 pacemakerd[33257]: error: pcmk_child_exit: Child process lrmd (6631) exited: Operation not permitted (1)
Dec 18 05:28:07 nas2 crmd[6632]: warning: do_recover: Fast-tracking shutdown in response to errors
Dec 18 05:28:07 nas2 crmd[6632]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Dec 18 05:28:07 nas2 pacemakerd[33257]: notice: pcmk_process_exit: Respawning failed child process: lrmd
Dec 18 05:28:07 nas2 crmd[6632]: notice: lrm_state_verify_stopped: Recurring action fence_ipmi0:93 (fence_ipmi0_monitor_60000) incomplete at shutdown
Dec 18 05:28:07 nas2 crmd[6632]: error: lrm_state_verify_stopped: 1 resources were active at shutdown.
Dec 18 05:28:07 nas2 crmd[6632]: notice: do_lrm_control: Disconnected from the LRM
Dec 18 05:28:07 nas2 crmd[6632]: notice: terminate_cs_connection: Disconnecting from Corosync
Dec 18 05:28:07 nas2 attrd[33272]: notice: attrd_perform_update: Sent update 138: fail-count-nas_samba=3
Dec 18 05:28:07 nas2 attrd[33272]: notice: attrd_cs_dispatch: Update relayed from node0
Dec 18 05:28:07 nas2 attrd[33272]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-nas_samba (1450387688)
Dec 18 05:28:07 nas2 cib[33269]: warning: qb_ipcs_event_sendv: new_event_notification (33269-6632-9): Broken pipe (32)
Dec 18 05:28:07 nas2 crmd[6632]: error: crmd_fast_exit: Could not recover from internal error
Dec 18 05:28:07 nas2 cib[33269]: warning: do_local_notify: A-Sync reply to crmd failed: No message of desired type
Dec 18 05:28:07 nas2 attrd[33272]: notice: attrd_perform_update: Sent update 140: last-failure-nas_samba=1450387688
Dec 18 05:28:07 nas2 pacemakerd[33257]: error: pcmk_child_exit: Child process crmd (6632) exited: Generic Pacemaker error (201)
Dec 18 05:28:07 nas2 pacemakerd[33257]: notice: pcmk_process_exit: Respawning failed child process: crmd
Dec 18 05:28:07 nas2 crmd[1500]: notice: main: CRM Git Version: 368c726
Dec 18 05:28:07 nas2 crmd[1500]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Dec 18 05:28:07 nas2 crmd[1500]: notice: cluster_connect_quorum: Quorum acquired
Dec 18 05:28:07 nas2 crmd[1500]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node0[1] - state is now member (was (null))
Dec 18 05:28:07 nas2 crmd[1500]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node1[2] - state is now member (was (null))
Dec 18 05:28:07 nas2 crmd[1500]: notice: do_started: The local CRM is operational
......




third time:
Dec 20 06:50:01 nas2 systemd: Starting Session 6281 of user root.
Dec 20 06:50:01 nas2 systemd: Started Session 6281 of user root.
Dec 20 06:54:09 nas2 systemd: Reloading.
Dec 20 06:54:59 nas2 crmd[1500]: notice: process_lrm_event: LRM operation nas_samba_monitor_15000 (call=89, rc=1, cib-update=45, confirmed=false) unknown error
Dec 20 06:54:59 nas2 crmd[1500]: error: crm_ipc_read: Connection to lrmd failed
Dec 20 06:54:59 nas2 crmd[1500]: error: mainloop_gio_callback: Connection to lrmd[0xee3020] closed (I/O condition=1)
Dec 20 06:54:59 nas2 crmd[1500]: crit: lrm_connection_destroy: LRM Connection failed
Dec 20 06:54:59 nas2 crmd[1500]: error: do_log: FSA: Input I_ERROR from lrm_connection_destroy() received in state S_NOT_DC
Dec 20 06:54:59 nas2 crmd[1500]: notice: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=lrm_connection_destroy ]
Dec 20 06:54:59 nas2 crmd[1500]: warning: do_recover: Fast-tracking shutdown in response to errors
Dec 20 06:54:59 nas2 crmd[1500]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Dec 20 06:54:59 nas2 crmd[1500]: notice: lrm_state_verify_stopped: Recurring action fence_ipmi0:93 (fence_ipmi0_monitor_60000) incomplete at shutdown
Dec 20 06:54:59 nas2 crmd[1500]: error: lrm_state_verify_stopped: 1 resources were active at shutdown.
Dec 20 06:54:59 nas2 crmd[1500]: notice: do_lrm_control: Disconnected from the LRM
Dec 20 06:54:59 nas2 crmd[1500]: notice: terminate_cs_connection: Disconnecting from Corosync
Dec 20 06:54:59 nas2 pacemakerd[33257]: error: pcmk_child_exit: Child process lrmd (1499) exited: Operation not permitted (1)
Dec 20 06:54:59 nas2 pacemakerd[33257]: notice: pcmk_process_exit: Respawning failed child process: lrmd
Dec 20 06:54:59 nas2 attrd[33272]: notice: attrd_cs_dispatch: Update relayed from node0
Dec 20 06:54:59 nas2 attrd[33272]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-nas_samba (4)
Dec 20 06:54:59 nas2 attrd[33272]: notice: attrd_perform_update: Sent update 171: fail-count-nas_samba=4
Dec 20 06:54:59 nas2 attrd[33272]: notice: attrd_cs_dispatch: Update relayed from node0
Dec 20 06:54:59 nas2 attrd[33272]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-nas_samba (1450565700)
Dec 20 06:54:59 nas2 crmd[1500]: error: crmd_fast_exit: Could not recover from internal error
Dec 20 06:54:59 nas2 attrd[33272]: notice: attrd_perform_update: Sent update 173: last-failure-nas_samba=1450565700
Dec 20 06:54:59 nas2 pacemakerd[33257]: error: pcmk_child_exit: Child process crmd (1500) exited: Generic Pacemaker error (201)
Dec 20 06:54:59 nas2 pacemakerd[33257]: notice: pcmk_process_exit: Respawning failed child process: crmd
Dec 20 06:54:59 nas2 crmd[36457]: notice: main: CRM Git Version: 368c726
Dec 20 06:54:59 nas2 crmd[36457]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Dec 20 06:54:59 nas2 crmd[36457]: notice: cluster_connect_quorum: Quorum acquired
......




forth time:
Dec 22 08:20:01 nas2 systemd: Starting Session 6832 of user root.
Dec 22 08:20:01 nas2 systemd: Started Session 6832 of user root.
Dec 22 08:21:59 nas2 systemd: Reloading.
Dec 22 08:22:49 nas2 crmd[36457]: notice: process_lrm_event: LRM operation nas_samba_monitor_15000 (call=116, rc=1, cib-update=71, confirmed=false) unknown error
Dec 22 08:22:49 nas2 attrd[33272]: notice: attrd_cs_dispatch: Update relayed from node0
Dec 22 08:22:49 nas2 attrd[33272]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-nas_samba (7)
Dec 22 08:22:49 nas2 crmd[36457]: error: crm_ipc_read: Connection to lrmd failed
Dec 22 08:22:49 nas2 crmd[36457]: error: mainloop_gio_callback: Connection to lrmd[0xd6e020] closed (I/O condition=17)
Dec 22 08:22:49 nas2 pacemakerd[33257]: error: pcmk_child_exit: Child process lrmd (36456) exited: Operation not permitted (1)
Dec 22 08:22:49 nas2 crmd[36457]: crit: lrm_connection_destroy: LRM Connection failed
Dec 22 08:22:49 nas2 crmd[36457]: error: do_log: FSA: Input I_ERROR from lrm_connection_destroy() received in state S_NOT_DC
Dec 22 08:22:49 nas2 crmd[36457]: notice: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=lrm_connection_destroy ]
Dec 22 08:22:49 nas2 crmd[36457]: warning: do_recover: Fast-tracking shutdown in response to errors
Dec 22 08:22:49 nas2 pacemakerd[33257]: notice: pcmk_process_exit: Respawning failed child process: lrmd
Dec 22 08:22:49 nas2 crmd[36457]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Dec 22 08:22:49 nas2 attrd[33272]: notice: attrd_perform_update: Sent update 213: fail-count-nas_samba=7
Dec 22 08:22:49 nas2 attrd[33272]: notice: attrd_cs_dispatch: Update relayed from node0
Dec 22 08:22:49 nas2 attrd[33272]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-nas_samba (1450743771)
Dec 22 08:22:49 nas2 attrd[33272]: notice: attrd_perform_update: Sent update 215: last-failure-nas_samba=1450743771
Dec 22 08:22:49 nas2 crmd[36457]: notice: lrm_state_verify_stopped: Recurring action fence_ipmi0:81 (fence_ipmi0_monitor_60000) incomplete at shutdown
Dec 22 08:22:49 nas2 crmd[36457]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_test:96 (nas_filesystem_test_monitor_15000) incomplete at shutdown
Dec 22 08:22:49 nas2 crmd[36457]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_Gsix:102 (nas_filesystem_Gsix_monitor_15000) incomplete at shutdown
Dec 22 08:22:49 nas2 crmd[36457]: notice: lrm_state_verify_stopped: Recurring action nas_samba:116 (nas_samba_monitor_15000) incomplete at shutdown
Dec 22 08:22:49 nas2 crmd[36457]: notice: lrm_state_verify_stopped: Recurring action nas_samba:117 (nas_samba_monitor_60000) incomplete at shutdown
Dec 22 08:22:49 nas2 crmd[36457]: notice: lrm_state_verify_stopped: Recurring action nas_rpcbind:112 (nas_rpcbind_monitor_15000) incomplete at shutdown
Dec 22 08:22:49 nas2 crmd[36457]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_kyfz:106 (nas_filesystem_kyfz_monitor_15000) incomplete at shutdown
Dec 22 08:22:49 nas2 crmd[36457]: notice: lrm_state_verify_stopped: Recurring action nas_vip_10_24_171_90:119 (nas_vip_10_24_171_90_monitor_20000) incomplete at shutdown
Dec 22 08:22:49 nas2 crmd[36457]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_User2:108 (nas_filesystem_User2_monitor_15000) incomplete at shutdown
Dec 22 08:22:49 nas2 crmd[36457]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_GX:110 (nas_filesystem_GX_monitor_15000) incomplete at shutdown
Dec 22 08:22:49 nas2 crmd[36457]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_gsics:100 (nas_filesystem_gsics_monitor_15000) incomplete at shutdown
Dec 22 08:22:49 nas2 crmd[36457]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_Users:98 (nas_filesystem_Users_monitor_15000) incomplete at shutdown
Dec 22 08:22:49 nas2 crmd[36457]: notice: lrm_state_verify_stopped: Recurring action nas_nfs:114 (nas_nfs_monitor_15000) incomplete at shutdown
Dec 22 08:22:49 nas2 crmd[36457]: notice: lrm_state_verify_stopped: Recurring action nas_shared_metadata:94 (nas_shared_metadata_monitor_15000) incomplete at shutdown
Dec 22 08:22:49 nas2 crmd[36457]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_MAS:104 (nas_filesystem_MAS_monitor_15000) incomplete at shutdown
Dec 22 08:22:49 nas2 crmd[36457]: error: lrm_state_verify_stopped: 14 resources were active at shutdown.
Dec 22 08:22:49 nas2 crmd[36457]: notice: do_lrm_control: Disconnected from the LRM
Dec 22 08:22:49 nas2 crmd[36457]: notice: terminate_cs_connection: Disconnecting from Corosync
Dec 22 08:22:49 nas2 cib[33269]: warning: qb_ipcs_event_sendv: new_event_notification (33269-36457-9): Broken pipe (32)
Dec 22 08:22:49 nas2 cib[33269]: warning: do_local_notify: A-Sync reply to crmd failed: No message of desired type
Dec 22 08:22:49 nas2 crmd[36457]: error: crmd_fast_exit: Could not recover from internal error
......




5th time:
Dec 24 09:40:01 nas2 systemd: Starting Session 7381 of user root.
Dec 24 09:40:01 nas2 systemd: Started Session 7381 of user root.
Dec 24 09:48:51 nas2 systemd: Reloading.
Dec 24 09:49:41 nas2 crmd[14257]: notice: process_lrm_event: LRM operation nas_samba_monitor_15000 (call=89, rc=1, cib-update=45, confirmed=false) unknown error
Dec 24 09:49:41 nas2 attrd[33272]: notice: attrd_cs_dispatch: Update relayed from node0
Dec 24 09:49:41 nas2 attrd[33272]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-nas_samba (8)
Dec 24 09:49:41 nas2 crmd[14257]: error: crm_ipc_read: Connection to lrmd failed
Dec 24 09:49:41 nas2 crmd[14257]: error: mainloop_gio_callback: Connection to lrmd[0x14e8020] closed (I/O condition=17)
Dec 24 09:49:41 nas2 pacemakerd[33257]: error: pcmk_child_exit: Child process lrmd (14256) exited: Operation not permitted (1)
Dec 24 09:49:41 nas2 pacemakerd[33257]: notice: pcmk_process_exit: Respawning failed child process: lrmd
Dec 24 09:49:41 nas2 crmd[14257]: crit: lrm_connection_destroy: LRM Connection failed
Dec 24 09:49:41 nas2 crmd[14257]: error: do_log: FSA: Input I_ERROR from lrm_connection_destroy() received in state S_NOT_DC
Dec 24 09:49:41 nas2 crmd[14257]: notice: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=lrm_connection_destroy ]
Dec 24 09:49:41 nas2 crmd[14257]: warning: do_recover: Fast-tracking shutdown in response to errors
Dec 24 09:49:41 nas2 crmd[14257]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Dec 24 09:49:41 nas2 attrd[33272]: notice: attrd_perform_update: Sent update 248: fail-count-nas_samba=8
Dec 24 09:49:41 nas2 attrd[33272]: notice: attrd_cs_dispatch: Update relayed from node0
Dec 24 09:49:41 nas2 attrd[33272]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-nas_samba (1450921782)
Dec 24 09:49:41 nas2 attrd[33272]: notice: attrd_perform_update: Sent update 250: last-failure-nas_samba=1450921782
Dec 24 09:49:41 nas2 crmd[14257]: notice: lrm_state_verify_stopped: Recurring action fence_ipmi0:93 (fence_ipmi0_monitor_60000) incomplete at shutdown
Dec 24 09:49:41 nas2 crmd[14257]: error: lrm_state_verify_stopped: 1 resources were active at shutdown.
Dec 24 09:49:41 nas2 crmd[14257]: notice: do_lrm_control: Disconnected from the LRM
Dec 24 09:49:41 nas2 crmd[14257]: notice: terminate_cs_connection: Disconnecting from Corosync
Dec 24 09:49:41 nas2 crmd[14257]: error: crmd_fast_exit: Could not recover from internal error
Dec 24 09:49:41 nas2 pacemakerd[33257]: error: pcmk_child_exit: Child process crmd (14257) exited: Generic Pacemaker error (201)
Dec 24 09:49:41 nas2 pacemakerd[33257]: notice: pcmk_process_exit: Respawning failed child process: crmd
Dec 24 09:49:41 nas2 crmd[47742]: notice: main: CRM Git Version: 368c726
Dec 24 09:49:41 nas2 crmd[47742]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Dec 24 09:49:41 nas2 crmd[47742]: notice: cluster_connect_quorum: Quorum acquired
Dec 24 09:49:41 nas2 crmd[47742]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node0[1] - state is now member (was (null))
Dec 24 09:49:41 nas2 crmd[47742]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node1[2] - state is now member (was (null))
Dec 24 09:49:41 nas2 crmd[47742]: notice: do_started: The local CRM is operational
Dec 24 09:49:41 nas2 crmd[47742]: notice: do_state_transition: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
Dec 24 09:49:43 nas2 crmd[47742]: notice: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
Dec 24 09:49:43 nas2 attrd[33272]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
......




6th time:
Dec 26 11:10:01 nas2 systemd: Started Session 7932 of user root.
Dec 26 11:15:42 nas2 systemd: Reloading.
Dec 26 11:16:32 nas2 crmd[47742]: notice: process_lrm_event: LRM operation nas_samba_monitor_15000 (call=89, rc=1, cib-update=45, confirmed=false) unknown error
Dec 26 11:16:32 nas2 crmd[47742]: error: crm_ipc_read: Connection to lrmd failed
Dec 26 11:16:32 nas2 crmd[47742]: error: mainloop_gio_callback: Connection to lrmd[0x2216020] closed (I/O condition=1)
Dec 26 11:16:32 nas2 crmd[47742]: crit: lrm_connection_destroy: LRM Connection failed
Dec 26 11:16:32 nas2 pacemakerd[33257]: error: pcmk_child_exit: Child process lrmd (47737) exited: Operation not permitted (1)
Dec 26 11:16:32 nas2 pacemakerd[33257]: notice: pcmk_process_exit: Respawning failed child process: lrmd
Dec 26 11:16:32 nas2 crmd[47742]: error: do_log: FSA: Input I_ERROR from lrm_connection_destroy() received in state S_NOT_DC
Dec 26 11:16:32 nas2 crmd[47742]: notice: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=lrm_connection_destroy ]
Dec 26 11:16:32 nas2 crmd[47742]: warning: do_recover: Fast-tracking shutdown in response to errors
Dec 26 11:16:32 nas2 crmd[47742]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Dec 26 11:16:32 nas2 attrd[33272]: notice: attrd_cs_dispatch: Update relayed from node0
Dec 26 11:16:32 nas2 attrd[33272]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-nas_samba (9)
Dec 26 11:16:32 nas2 crmd[47742]: notice: lrm_state_verify_stopped: Recurring action fence_ipmi0:93 (fence_ipmi0_monitor_60000) incomplete at shutdown
Dec 26 11:16:32 nas2 crmd[47742]: error: lrm_state_verify_stopped: 1 resources were active at shutdown.
Dec 26 11:16:32 nas2 crmd[47742]: notice: do_lrm_control: Disconnected from the LRM
Dec 26 11:16:32 nas2 crmd[47742]: notice: terminate_cs_connection: Disconnecting from Corosync
Dec 26 11:16:32 nas2 attrd[33272]: notice: attrd_perform_update: Sent update 279: fail-count-nas_samba=9
Dec 26 11:16:32 nas2 cib[33269]: warning: qb_ipcs_event_sendv: new_event_notification (33269-47742-9): Broken pipe (32)
Dec 26 11:16:32 nas2 cib[33269]: warning: do_local_notify: A-Sync reply to crmd failed: No message of desired type
Dec 26 11:16:32 nas2 attrd[33272]: notice: attrd_cs_dispatch: Update relayed from node0
Dec 26 11:16:32 nas2 attrd[33272]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-nas_samba (1451099794)
Dec 26 11:16:32 nas2 crmd[47742]: error: crmd_fast_exit: Could not recover from internal error
Dec 26 11:16:32 nas2 attrd[33272]: notice: attrd_perform_update: Sent update 281: last-failure-nas_samba=1451099794
Dec 26 11:16:32 nas2 pacemakerd[33257]: error: pcmk_child_exit: Child process crmd (47742) exited: Generic Pacemaker error (201)
Dec 26 11:16:32 nas2 pacemakerd[33257]: notice: pcmk_process_exit: Respawning failed child process: crmd
Dec 26 11:16:32 nas2 crmd[25757]: notice: main: CRM Git Version: 368c726
Dec 26 11:16:32 nas2 crmd[25757]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Dec 26 11:16:32 nas2 crmd[25757]: notice: cluster_connect_quorum: Quorum acquired
Dec 26 11:16:32 nas2 crmd[25757]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node0[1] - state is now member (was (null))
Dec 26 11:16:32 nas2 crmd[25757]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node1[2] - state is now member (was (null))
Dec 26 11:16:32 nas2 crmd[25757]: notice: do_started: The local CRM is operational
......




7th time:
Dec 28 12:40:01 nas2 systemd: Starting Session 8483 of user root.
Dec 28 12:40:01 nas2 systemd: Started Session 8483 of user root.
Dec 28 12:43:18 nas2 systemd: Reloading.
Dec 28 12:44:08 nas2 pacemakerd[33257]: error: pcmk_child_exit: Child process lrmd (25755) exited: Operation not permitted (1)
Dec 28 12:44:08 nas2 crmd[25757]: notice: process_lrm_event: LRM operation nas_samba_monitor_15000 (call=116, rc=1, cib-update=71, confirmed=false) unknown error
Dec 28 12:44:08 nas2 crmd[25757]: error: crm_ipc_read: Connection to lrmd failed
Dec 28 12:44:08 nas2 pacemakerd[33257]: notice: pcmk_process_exit: Respawning failed child process: lrmd
Dec 28 12:44:08 nas2 crmd[25757]: error: mainloop_gio_callback: Connection to lrmd[0xc98020] closed (I/O condition=1)
Dec 28 12:44:08 nas2 crmd[25757]: crit: lrm_connection_destroy: LRM Connection failed
Dec 28 12:44:08 nas2 crmd[25757]: error: do_log: FSA: Input I_ERROR from lrm_connection_destroy() received in state S_NOT_DC
Dec 28 12:44:08 nas2 crmd[25757]: notice: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=lrm_connection_destroy ]
Dec 28 12:44:08 nas2 crmd[25757]: warning: do_recover: Fast-tracking shutdown in response to errors
Dec 28 12:44:08 nas2 crmd[25757]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Dec 28 12:44:08 nas2 crmd[25757]: notice: lrm_state_verify_stopped: Recurring action fence_ipmi0:80 (fence_ipmi0_monitor_60000) incomplete at shutdown
Dec 28 12:44:08 nas2 crmd[25757]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_test:96 (nas_filesystem_test_monitor_15000) incomplete at shutdown
Dec 28 12:44:08 nas2 crmd[25757]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_Gsix:102 (nas_filesystem_Gsix_monitor_15000) incomplete at shutdown
Dec 28 12:44:08 nas2 crmd[25757]: notice: lrm_state_verify_stopped: Recurring action nas_samba:116 (nas_samba_monitor_15000) incomplete at shutdown
Dec 28 12:44:08 nas2 crmd[25757]: notice: lrm_state_verify_stopped: Recurring action nas_samba:117 (nas_samba_monitor_60000) incomplete at shutdown
Dec 28 12:44:08 nas2 crmd[25757]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_kyfz:106 (nas_filesystem_kyfz_monitor_15000) incomplete at shutdown
Dec 28 12:44:08 nas2 crmd[25757]: notice: lrm_state_verify_stopped: Recurring action nas_rpcbind:112 (nas_rpcbind_monitor_15000) incomplete at shutdown
Dec 28 12:44:08 nas2 crmd[25757]: notice: lrm_state_verify_stopped: Recurring action nas_vip_10_24_171_90:119 (nas_vip_10_24_171_90_monitor_20000) incomplete at shutdown
Dec 28 12:44:08 nas2 crmd[25757]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_User2:108 (nas_filesystem_User2_monitor_15000) incomplete at shutdown
Dec 28 12:44:08 nas2 crmd[25757]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_GX:110 (nas_filesystem_GX_monitor_15000) incomplete at shutdown
Dec 28 12:44:08 nas2 crmd[25757]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_Users:98 (nas_filesystem_Users_monitor_15000) incomplete at shutdown
Dec 28 12:44:08 nas2 crmd[25757]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_gsics:100 (nas_filesystem_gsics_monitor_15000) incomplete at shutdown
Dec 28 12:44:08 nas2 crmd[25757]: notice: lrm_state_verify_stopped: Recurring action nas_nfs:114 (nas_nfs_monitor_15000) incomplete at shutdown
Dec 28 12:44:08 nas2 crmd[25757]: notice: lrm_state_verify_stopped: Recurring action nas_shared_metadata:94 (nas_shared_metadata_monitor_15000) incomplete at shutdown
Dec 28 12:44:08 nas2 crmd[25757]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_MAS:104 (nas_filesystem_MAS_monitor_15000) incomplete at shutdown
Dec 28 12:44:08 nas2 crmd[25757]: error: lrm_state_verify_stopped: 14 resources were active at shutdown.
Dec 28 12:44:08 nas2 crmd[25757]: notice: do_lrm_control: Disconnected from the LRM
Dec 28 12:44:08 nas2 crmd[25757]: notice: terminate_cs_connection: Disconnecting from Corosync
Dec 28 12:44:08 nas2 attrd[33272]: notice: attrd_cs_dispatch: Update relayed from node0
Dec 28 12:44:08 nas2 attrd[33272]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-nas_samba (12)
Dec 28 12:44:08 nas2 attrd[33272]: notice: attrd_perform_update: Sent update 322: fail-count-nas_samba=12
Dec 28 12:44:08 nas2 attrd[33272]: notice: attrd_cs_dispatch: Update relayed from node0
Dec 28 12:44:08 nas2 attrd[33272]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-nas_samba (1451277850)
Dec 28 12:44:08 nas2 cib[33269]: warning: qb_ipcs_event_sendv: new_event_notification (33269-25757-9): Broken pipe (32)
Dec 28 12:44:08 nas2 cib[33269]: warning: do_local_notify: A-Sync reply to crmd failed: No message of desired type
Dec 28 12:44:08 nas2 crmd[25757]: error: crmd_fast_exit: Could not recover from internal error
Dec 28 12:44:08 nas2 attrd[33272]: notice: attrd_perform_update: Sent update 324: last-failure-nas_samba=1451277850
Dec 28 12:44:08 nas2 pacemakerd[33257]: error: pcmk_child_exit: Child process crmd (25757) exited: Generic Pacemaker error (201)
Dec 28 12:44:08 nas2 pacemakerd[33257]: notice: pcmk_process_exit: Respawning failed child process: crmd
Dec 28 12:44:08 nas2 crmd[42199]: notice: main: CRM Git Version: 368c726
Dec 28 12:44:08 nas2 crmd[42199]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Dec 28 12:44:08 nas2 crmd[42199]: notice: cluster_connect_quorum: Quorum acquired
Dec 28 12:44:08 nas2 crmd[42199]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node0[1] - state is now member (was (null))
Dec 28 12:44:08 nas2 crmd[42199]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node1[2] - state is now member (was (null))
Dec 28 12:44:08 nas2 crmd[42199]: notice: do_started: The local CRM is operational
......




8th time:
Dec 30 14:01:01 nas2 systemd: Started Session 9033 of user root.
Dec 30 14:10:01 nas2 systemd: Starting Session 9034 of user root.
Dec 30 14:10:01 nas2 systemd: Started Session 9034 of user root.
Dec 30 14:10:52 nas2 systemd: Reloading.
Dec 30 14:11:42 nas2 crmd[42199]: notice: process_lrm_event: LRM operation nas_samba_monitor_15000 (call=116, rc=1, cib-update=71, confirmed=false) unknown error
Dec 30 14:11:42 nas2 crmd[42199]: error: crm_ipc_read: Connection to lrmd failed
Dec 30 14:11:42 nas2 crmd[42199]: error: mainloop_gio_callback: Connection to lrmd[0x7ab020] closed (I/O condition=1)
Dec 30 14:11:42 nas2 crmd[42199]: crit: lrm_connection_destroy: LRM Connection failed
Dec 30 14:11:42 nas2 pacemakerd[33257]: error: pcmk_child_exit: Child process lrmd (42197) exited: Operation not permitted (1)
Dec 30 14:11:42 nas2 crmd[42199]: error: do_log: FSA: Input I_ERROR from lrm_connection_destroy() received in state S_NOT_DC
Dec 30 14:11:42 nas2 crmd[42199]: notice: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=lrm_connection_destroy ]
Dec 30 14:11:42 nas2 crmd[42199]: warning: do_recover: Fast-tracking shutdown in response to errors
Dec 30 14:11:42 nas2 crmd[42199]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Dec 30 14:11:42 nas2 crmd[42199]: notice: lrm_state_verify_stopped: Recurring action fence_ipmi0:80 (fence_ipmi0_monitor_60000) incomplete at shutdown
Dec 30 14:11:42 nas2 crmd[42199]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_test:96 (nas_filesystem_test_monitor_15000) incomplete at shutdown
Dec 30 14:11:42 nas2 crmd[42199]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_Gsix:102 (nas_filesystem_Gsix_monitor_15000) incomplete at shutdown
Dec 30 14:11:42 nas2 crmd[42199]: notice: lrm_state_verify_stopped: Recurring action nas_samba:116 (nas_samba_monitor_15000) incomplete at shutdown
Dec 30 14:11:42 nas2 crmd[42199]: notice: lrm_state_verify_stopped: Recurring action nas_samba:117 (nas_samba_monitor_60000) incomplete at shutdown
Dec 30 14:11:42 nas2 crmd[42199]: notice: lrm_state_verify_stopped: Recurring action nas_rpcbind:112 (nas_rpcbind_monitor_15000) incomplete at shutdown
Dec 30 14:11:42 nas2 crmd[42199]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_kyfz:106 (nas_filesystem_kyfz_monitor_15000) incomplete at shutdown
Dec 30 14:11:42 nas2 crmd[42199]: notice: lrm_state_verify_stopped: Recurring action nas_vip_10_24_171_90:119 (nas_vip_10_24_171_90_monitor_20000) incomplete at shutdown
Dec 30 14:11:42 nas2 crmd[42199]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_User2:108 (nas_filesystem_User2_monitor_15000) incomplete at shutdown
Dec 30 14:11:42 nas2 crmd[42199]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_GX:110 (nas_filesystem_GX_monitor_15000) incomplete at shutdown
Dec 30 14:11:42 nas2 crmd[42199]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_Users:98 (nas_filesystem_Users_monitor_15000) incomplete at shutdown
Dec 30 14:11:42 nas2 crmd[42199]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_gsics:100 (nas_filesystem_gsics_monitor_15000) incomplete at shutdown
Dec 30 14:11:42 nas2 crmd[42199]: notice: lrm_state_verify_stopped: Recurring action nas_nfs:114 (nas_nfs_monitor_15000) incomplete at shutdown
Dec 30 14:11:42 nas2 crmd[42199]: notice: lrm_state_verify_stopped: Recurring action nas_shared_metadata:94 (nas_shared_metadata_monitor_15000) incomplete at shutdown
Dec 30 14:11:42 nas2 crmd[42199]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_MAS:104 (nas_filesystem_MAS_monitor_15000) incomplete at shutdown
Dec 30 14:11:42 nas2 crmd[42199]: error: lrm_state_verify_stopped: 14 resources were active at shutdown.
Dec 30 14:11:42 nas2 crmd[42199]: notice: do_lrm_control: Disconnected from the LRM
Dec 30 14:11:42 nas2 crmd[42199]: notice: terminate_cs_connection: Disconnecting from Corosync
Dec 30 14:11:42 nas2 pacemakerd[33257]: notice: pcmk_process_exit: Respawning failed child process: lrmd
Dec 30 14:11:42 nas2 attrd[33272]: notice: attrd_cs_dispatch: Update relayed from node0
Dec 30 14:11:42 nas2 attrd[33272]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-nas_samba (15)
Dec 30 14:11:42 nas2 attrd[33272]: notice: attrd_perform_update: Sent update 361: fail-count-nas_samba=15
Dec 30 14:11:42 nas2 attrd[33272]: notice: attrd_cs_dispatch: Update relayed from node0
Dec 30 14:11:42 nas2 attrd[33272]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-nas_samba (1451455905)
Dec 30 14:11:42 nas2 crmd[42199]: error: crmd_fast_exit: Could not recover from internal error
Dec 30 14:11:42 nas2 attrd[33272]: notice: attrd_perform_update: Sent update 363: last-failure-nas_samba=1451455905
Dec 30 14:11:42 nas2 pacemakerd[33257]: error: pcmk_child_exit: Child process crmd (42199) exited: Generic Pacemaker error (201)
Dec 30 14:11:42 nas2 pacemakerd[33257]: notice: pcmk_process_exit: Respawning failed child process: crmd
Dec 30 14:11:42 nas2 crmd[14225]: notice: main: CRM Git Version: 368c726
Dec 30 14:11:42 nas2 crmd[14225]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Dec 30 14:11:42 nas2 crmd[14225]: notice: cluster_connect_quorum: Quorum acquired
Dec 30 14:11:42 nas2 crmd[14225]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node0[1] - state is now member (was (null))
Dec 30 14:11:42 nas2 crmd[14225]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node1[2] - state is now member (was (null))
Dec 30 14:11:42 nas2 crmd[14225]: notice: do_started: The local CRM is operational
......




9th time:
Jan  1 15:30:01 nas2 systemd: Starting Session 9584 of user root.
Jan  1 15:30:01 nas2 systemd: Started Session 9584 of user root.
Jan  1 15:38:26 nas2 systemd: Reloading.
Jan  1 15:39:16 nas2 pacemakerd[33257]: error: pcmk_child_exit: Child process lrmd (14224) exited: Operation not permitted (1)
Jan  1 15:39:16 nas2 crmd[14225]: notice: process_lrm_event: LRM operation nas_samba_monitor_15000 (call=116, rc=1, cib-update=71, confirmed=false) unknown error
Jan  1 15:39:16 nas2 crmd[14225]: error: crm_ipc_read: Connection to lrmd failed
Jan  1 15:39:16 nas2 pacemakerd[33257]: notice: pcmk_process_exit: Respawning failed child process: lrmd
Jan  1 15:39:16 nas2 crmd[14225]: error: mainloop_gio_callback: Connection to lrmd[0x9bb020] closed (I/O condition=1)
Jan  1 15:39:16 nas2 crmd[14225]: crit: lrm_connection_destroy: LRM Connection failed
Jan  1 15:39:16 nas2 crmd[14225]: error: do_log: FSA: Input I_ERROR from lrm_connection_destroy() received in state S_NOT_DC
Jan  1 15:39:16 nas2 crmd[14225]: notice: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=lrm_connection_destroy ]
Jan  1 15:39:16 nas2 crmd[14225]: warning: do_recover: Fast-tracking shutdown in response to errors
Jan  1 15:39:16 nas2 crmd[14225]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Jan  1 15:39:16 nas2 crmd[14225]: notice: lrm_state_verify_stopped: Recurring action fence_ipmi0:80 (fence_ipmi0_monitor_60000) incomplete at shutdown
Jan  1 15:39:16 nas2 crmd[14225]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_test:96 (nas_filesystem_test_monitor_15000) incomplete at shutdown
Jan  1 15:39:16 nas2 crmd[14225]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_Gsix:102 (nas_filesystem_Gsix_monitor_15000) incomplete at shutdown
Jan  1 15:39:16 nas2 crmd[14225]: notice: lrm_state_verify_stopped: Recurring action nas_samba:116 (nas_samba_monitor_15000) incomplete at shutdown
Jan  1 15:39:16 nas2 crmd[14225]: notice: lrm_state_verify_stopped: Recurring action nas_samba:117 (nas_samba_monitor_60000) incomplete at shutdown
Jan  1 15:39:16 nas2 crmd[14225]: notice: lrm_state_verify_stopped: Recurring action nas_rpcbind:112 (nas_rpcbind_monitor_15000) incomplete at shutdown
Jan  1 15:39:16 nas2 crmd[14225]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_kyfz:106 (nas_filesystem_kyfz_monitor_15000) incomplete at shutdown
Jan  1 15:39:16 nas2 crmd[14225]: notice: lrm_state_verify_stopped: Recurring action nas_vip_10_24_171_90:119 (nas_vip_10_24_171_90_monitor_20000) incomplete at shutdown
Jan  1 15:39:16 nas2 crmd[14225]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_User2:108 (nas_filesystem_User2_monitor_15000) incomplete at shutdown
Jan  1 15:39:16 nas2 crmd[14225]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_GX:110 (nas_filesystem_GX_monitor_15000) incomplete at shutdown
Jan  1 15:39:16 nas2 crmd[14225]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_gsics:100 (nas_filesystem_gsics_monitor_15000) incomplete at shutdown
Jan  1 15:39:16 nas2 crmd[14225]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_Users:98 (nas_filesystem_Users_monitor_15000) incomplete at shutdown
Jan  1 15:39:16 nas2 crmd[14225]: notice: lrm_state_verify_stopped: Recurring action nas_nfs:114 (nas_nfs_monitor_15000) incomplete at shutdown
Jan  1 15:39:16 nas2 crmd[14225]: notice: lrm_state_verify_stopped: Recurring action nas_shared_metadata:94 (nas_shared_metadata_monitor_15000) incomplete at shutdown
Jan  1 15:39:16 nas2 crmd[14225]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_MAS:104 (nas_filesystem_MAS_monitor_15000) incomplete at shutdown
Jan  1 15:39:16 nas2 crmd[14225]: error: lrm_state_verify_stopped: 14 resources were active at shutdown.
Jan  1 15:39:16 nas2 crmd[14225]: notice: do_lrm_control: Disconnected from the LRM
Jan  1 15:39:16 nas2 crmd[14225]: notice: terminate_cs_connection: Disconnecting from Corosync
Jan  1 15:39:16 nas2 attrd[33272]: notice: attrd_cs_dispatch: Update relayed from node0
Jan  1 15:39:16 nas2 attrd[33272]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-nas_samba (18)
Jan  1 15:39:16 nas2 attrd[33272]: notice: attrd_perform_update: Sent update 404: fail-count-nas_samba=18
Jan  1 15:39:16 nas2 attrd[33272]: notice: attrd_cs_dispatch: Update relayed from node0
Jan  1 15:39:16 nas2 attrd[33272]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-nas_samba (1451633959)
Jan  1 15:39:16 nas2 crmd[14225]: error: crmd_fast_exit: Could not recover from internal error
Jan  1 15:39:16 nas2 attrd[33272]: notice: attrd_perform_update: Sent update 406: last-failure-nas_samba=1451633959
Jan  1 15:39:16 nas2 pacemakerd[33257]: error: pcmk_child_exit: Child process crmd (14225) exited: Generic Pacemaker error (201)
Jan  1 15:39:16 nas2 pacemakerd[33257]: notice: pcmk_process_exit: Respawning failed child process: crmd
Jan  1 15:39:16 nas2 crmd[38163]: notice: main: CRM Git Version: 368c726
Jan  1 15:39:16 nas2 crmd[38163]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Jan  1 15:39:16 nas2 crmd[38163]: notice: cluster_connect_quorum: Quorum acquired
Jan  1 15:39:16 nas2 crmd[38163]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node0[1] - state is now member (was (null))
Jan  1 15:39:16 nas2 crmd[38163]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node1[2] - state is now member (was (null))
Jan  1 15:39:16 nas2 crmd[38163]: notice: do_started: The local CRM is operational
......




10th time:
Jan  3 17:01:01 nas2 systemd: Starting Session 10136 of user root.
Jan  3 17:01:01 nas2 systemd: Started Session 10136 of user root.
Jan  3 17:05:59 nas2 systemd: Reloading.
Jan  3 17:06:49 nas2 crmd[38163]: notice: process_lrm_event: LRM operation nas_samba_monitor_15000 (call=116, rc=1, cib-update=71, confirmed=false) unknown error
Jan  3 17:06:49 nas2 attrd[33272]: notice: attrd_cs_dispatch: Update relayed from node0
Jan  3 17:06:49 nas2 attrd[33272]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-nas_samba (21)
Jan  3 17:06:49 nas2 crmd[38163]: error: crm_ipc_read: Connection to lrmd failed
Jan  3 17:06:49 nas2 pacemakerd[33257]: error: pcmk_child_exit: Child process lrmd (38162) exited: Operation not permitted (1)
Jan  3 17:06:49 nas2 pacemakerd[33257]: notice: pcmk_process_exit: Respawning failed child process: lrmd
Jan  3 17:06:49 nas2 crmd[38163]: error: mainloop_gio_callback: Connection to lrmd[0x979020] closed (I/O condition=17)
Jan  3 17:06:49 nas2 crmd[38163]: crit: lrm_connection_destroy: LRM Connection failed
Jan  3 17:06:49 nas2 crmd[38163]: error: do_log: FSA: Input I_ERROR from lrm_connection_destroy() received in state S_NOT_DC
Jan  3 17:06:49 nas2 crmd[38163]: notice: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=lrm_connection_destroy ]
Jan  3 17:06:49 nas2 crmd[38163]: warning: do_recover: Fast-tracking shutdown in response to errors
Jan  3 17:06:49 nas2 crmd[38163]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Jan  3 17:06:49 nas2 attrd[33272]: notice: attrd_perform_update: Sent update 443: fail-count-nas_samba=21
Jan  3 17:06:49 nas2 attrd[33272]: notice: attrd_cs_dispatch: Update relayed from node0
Jan  3 17:06:49 nas2 attrd[33272]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-nas_samba (1451812012)
Jan  3 17:06:49 nas2 attrd[33272]: notice: attrd_perform_update: Sent update 445: last-failure-nas_samba=1451812012
Jan  3 17:06:49 nas2 crmd[38163]: notice: lrm_state_verify_stopped: Recurring action fence_ipmi0:80 (fence_ipmi0_monitor_60000) incomplete at shutdown
Jan  3 17:06:49 nas2 crmd[38163]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_test:96 (nas_filesystem_test_monitor_15000) incomplete at shutdown
Jan  3 17:06:49 nas2 crmd[38163]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_Gsix:102 (nas_filesystem_Gsix_monitor_15000) incomplete at shutdown
Jan  3 17:06:49 nas2 crmd[38163]: notice: lrm_state_verify_stopped: Recurring action nas_samba:116 (nas_samba_monitor_15000) incomplete at shutdown
Jan  3 17:06:49 nas2 crmd[38163]: notice: lrm_state_verify_stopped: Recurring action nas_samba:117 (nas_samba_monitor_60000) incomplete at shutdown
Jan  3 17:06:49 nas2 crmd[38163]: notice: lrm_state_verify_stopped: Recurring action nas_rpcbind:112 (nas_rpcbind_monitor_15000) incomplete at shutdown
Jan  3 17:06:49 nas2 crmd[38163]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_kyfz:106 (nas_filesystem_kyfz_monitor_15000) incomplete at shutdown
Jan  3 17:06:49 nas2 crmd[38163]: notice: lrm_state_verify_stopped: Recurring action nas_vip_10_24_171_90:119 (nas_vip_10_24_171_90_monitor_20000) incomplete at shutdown
Jan  3 17:06:49 nas2 crmd[38163]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_User2:108 (nas_filesystem_User2_monitor_15000) incomplete at shutdown
Jan  3 17:06:49 nas2 crmd[38163]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_GX:110 (nas_filesystem_GX_monitor_15000) incomplete at shutdown
Jan  3 17:06:49 nas2 crmd[38163]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_Users:98 (nas_filesystem_Users_monitor_15000) incomplete at shutdown
Jan  3 17:06:49 nas2 crmd[38163]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_gsics:100 (nas_filesystem_gsics_monitor_15000) incomplete at shutdown
Jan  3 17:06:49 nas2 crmd[38163]: notice: lrm_state_verify_stopped: Recurring action nas_nfs:114 (nas_nfs_monitor_15000) incomplete at shutdown
Jan  3 17:06:49 nas2 crmd[38163]: notice: lrm_state_verify_stopped: Recurring action nas_shared_metadata:94 (nas_shared_metadata_monitor_15000) incomplete at shutdown
Jan  3 17:06:49 nas2 crmd[38163]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_MAS:104 (nas_filesystem_MAS_monitor_15000) incomplete at shutdown
Jan  3 17:06:49 nas2 crmd[38163]: error: lrm_state_verify_stopped: 14 resources were active at shutdown.
Jan  3 17:06:49 nas2 crmd[38163]: notice: do_lrm_control: Disconnected from the LRM
Jan  3 17:06:49 nas2 crmd[38163]: notice: terminate_cs_connection: Disconnecting from Corosync
Jan  3 17:06:49 nas2 cib[33269]: warning: qb_ipcs_event_sendv: new_event_notification (33269-38163-9): Broken pipe (32)
Jan  3 17:06:49 nas2 cib[33269]: warning: do_local_notify: A-Sync reply to crmd failed: No message of desired type
Jan  3 17:06:49 nas2 crmd[38163]: error: crmd_fast_exit: Could not recover from internal error
Jan  3 17:06:49 nas2 pacemakerd[33257]: error: pcmk_child_exit: Child process crmd (38163) exited: Generic Pacemaker error (201)
Jan  3 17:06:49 nas2 pacemakerd[33257]: notice: pcmk_process_exit: Respawning failed child process: crmd
Jan  3 17:06:49 nas2 crmd[8674]: notice: main: CRM Git Version: 368c726
Jan  3 17:06:49 nas2 crmd[8674]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Jan  3 17:06:49 nas2 crmd[8674]: notice: cluster_connect_quorum: Quorum acquired
Jan  3 17:06:49 nas2 crmd[8674]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node0[1] - state is now member (was (null))
Jan  3 17:06:49 nas2 crmd[8674]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node1[2] - state is now member (was (null))
Jan  3 17:06:49 nas2 crmd[8674]: notice: do_started: The local CRM is operational
......




Waiting for the RHEL 7.2 tests results,too.

Comment 11 Ken Gaillot 2016-01-13 17:03:14 UTC

The message will be printed every time, but by "not related" I mean that the text of the error message is likely inaccurate.

Pacemakerd always prints the system error message corresponding to the child daemon's exit code. Because lrmd is exiting with exit code 1 every time, and exit code 1 corresponds to the system error message "Operation not permitted", pacemakerd is printing that message. But I do not think lrmd is using exit code 1 because of a permissions error.

I don't see a likely code path for lrmd to return exit code 1, so I'm not sure why it's doing so.

Comment 12 Zhaoming Zhang 2016-02-19 07:02:24 UTC

So far we have not reproduce this problem in RHEL 7.2 tests.
But when we added more systemd resources to the cluster of RHEL 7.0 enviroments, it reproduced more frequently.
Just make you informed, thanks.

Comment 13 Ken Gaillot 2016-02-19 15:13:38 UTC

Zhaoming,

That's good news about RHEL 7.2. I'm not aware of a specific fix related to this issue, but there were major systemd-related enhancements between 7.0 and 7.2.

Comment 14 Ken Gaillot 2016-05-16 17:36:25 UTC

Zhaoming,

I am marking this issue as fixed in the 7.2 release. If you encounter this issue or any other in the 7.2 release, feel free to re-open this report or open a new one.

Comment 15 Zhaoming Zhang 2016-06-01 07:22:42 UTC

(In reply to Ken Gaillot from comment #14)
> Zhaoming,
> 
> I am marking this issue as fixed in the 7.2 release. If you encounter this
> issue or any other in the 7.2 release, feel free to re-open this report or
> open a new one.

OK，thanks a lot！

Note You need to log in before you can comment on or make changes to this bug.