Bug 1284069

Summary: Pacemaker's lrmd crashes after certain systemd errors
Product: Red Hat Enterprise Linux 7
Reporter: Ken Gaillot <kgaillot>
Component: pacemaker
Assignee: Ken Gaillot <kgaillot>
Status: CLOSED ERRATA
QA Contact: cluster-qe <cluster-qe>
Severity: urgent
Docs Contact: Milan Navratil <mnavrati>
Priority: urgent
Version: 7.2
CC: 252150241, cfeist, cluster-maint, jkortus, mjuricek, mnavrati, royoung, tlavigne
Target Milestone: rc
Keywords: ZStream
Target Release: 7.3
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: pacemaker-1.1.15-1.2c148ac.git.el7
Doc Type: Bug Fix
Doc Text:
*lrmd* logs errors as expected and no longer crashes. Previously, Pacemaker's Local Resource Management Daemon (lrmd) used an invalid format string when logging certain rare *systemd* errors. As a consequence, *lrmd* could terminate unexpectedly with a segmentation fault. A patch has been applied to fix the format string. As a result, *lrmd* no longer crashes and logs the aforementioned rare error messages as intended.
Story Points: ---
Clone Of:
Clones: 1299339 (view as bug list)
Environment:
Last Closed: 2016-11-03 18:56:39 UTC
Type: Bug
Bug Depends On: 1304771    
Bug Blocks: 1299339, 1394068    

Description Ken Gaillot 2015-11-20 17:30:50 UTC
Description of problem: When a Pacemaker cluster executes an operation for a systemd resource and an error occurs under certain narrow conditions, lrmd may segfault and dump core while trying to log the error.

Version-Release number of selected component (if applicable): 1.1.13 and earlier


How reproducible: I have been unable to come up with a reproducer; it may only occur under high-load conditions. However, it is a trivial, self-evident bug with no other effects.

Actual results: lrmd crashes.

Expected results: lrmd logs the error.


Additional info: This is the problem and fix:

- crm_err("Could not issue %s for %s: %s (%s)", op->action, op->rsc, error.message);
+ crm_err("Could not issue %s for %s: %s", op->action, op->rsc, error.message);

The first line has an extra %s format specifier, which causes it to try to read an argument that was never passed, thus accessing an unknown memory address.

Comment 1 Ken Gaillot 2015-11-20 17:32:07 UTC
Fixed upstream in 6c495a4

Comment 4 Zhaoming Zhang 2016-01-06 11:18:12 UTC
I found a problem in my two-node Pacemaker cluster.

node0's log:
Dec  7 20:30:01 nas1 systemd: Starting Session 3016 of user root.
Dec  7 20:30:01 nas1 systemd: Started Session 3016 of user root.
Dec  7 20:40:01 nas1 systemd: Starting Session 3017 of user root.
Dec  7 20:40:01 nas1 systemd: Started Session 3017 of user root.
Dec  7 20:45:06 nas1 systemd: Reloading.
Dec  7 20:45:56 nas1 crmd[35821]: notice: process_lrm_event: LRM operation nas_samba_monitor_15000 (call=89, rc=1, cib-update=49, confirmed=false) unknown error
Dec  7 20:45:56 nas1 attrd[16368]: notice: attrd_cs_dispatch: Update relayed from node1
Dec  7 20:45:56 nas1 attrd[16368]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-nas_samba (9)
Dec  7 20:45:56 nas1 attrd[16368]: notice: attrd_perform_update: Sent update 85: fail-count-nas_samba=9
Dec  7 20:45:56 nas1 attrd[16368]: notice: attrd_cs_dispatch: Update relayed from node1
Dec  7 20:45:56 nas1 attrd[16368]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-nas_samba (1449492356)
Dec  7 20:45:56 nas1 attrd[16368]: notice: attrd_perform_update: Sent update 88: last-failure-nas_samba=1449492356
Dec  7 20:45:56 nas1 crmd[35821]: error: crm_ipc_read: Connection to lrmd failed
Dec  7 20:45:56 nas1 crmd[35821]: error: mainloop_gio_callback: Connection to lrmd[0x215b020] closed (I/O condition=17)
Dec  7 20:45:56 nas1 crmd[35821]: crit: lrm_connection_destroy: LRM Connection failed
Dec  7 20:45:56 nas1 crmd[35821]: error: do_log: FSA: Input I_ERROR from lrm_connection_destroy() received in state S_NOT_DC
Dec  7 20:45:56 nas1 crmd[35821]: notice: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=lrm_connection_destroy ]
Dec  7 20:45:56 nas1 crmd[35821]: warning: do_recover: Fast-tracking shutdown in response to errors
Dec  7 20:45:56 nas1 crmd[35821]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Dec  7 20:45:56 nas1 pacemakerd[16364]: error: pcmk_child_exit: Child process lrmd (35818) exited: Operation not permitted (1)
Dec  7 20:45:56 nas1 pacemakerd[16364]: notice: pcmk_process_exit: Respawning failed child process: lrmd
Dec  7 20:45:56 nas1 crmd[35821]: notice: lrm_state_verify_stopped: Recurring action fence_ipmi1:97 (fence_ipmi1_monitor_60000) incomplete at shutdown
Dec  7 20:45:56 nas1 crmd[35821]: error: lrm_state_verify_stopped: 1 resources were active at shutdown.
Dec  7 20:45:56 nas1 crmd[35821]: notice: do_lrm_control: Disconnected from the LRM
Dec  7 20:45:56 nas1 crmd[35821]: notice: terminate_cs_connection: Disconnecting from Corosync
Dec  7 20:45:56 nas1 cib[16365]: warning: qb_ipcs_event_sendv: new_event_notification (16365-35821-9): Broken pipe (32)
Dec  7 20:45:56 nas1 cib[16365]: warning: do_local_notify: A-Sync reply to crmd failed: No message of desired type
Dec  7 20:45:56 nas1 crmd[35821]: error: crmd_fast_exit: Could not recover from internal error
Dec  7 20:45:56 nas1 pacemakerd[16364]: error: pcmk_child_exit: Child process crmd (35821) exited: Generic Pacemaker error (201)
Dec  7 20:45:56 nas1 pacemakerd[16364]: notice: pcmk_process_exit: Respawning failed child process: crmd
Dec  7 20:45:56 nas1 crmd[43515]: notice: main: CRM Git Version: 368c726
Dec  7 20:45:56 nas1 crmd[43515]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Dec  7 20:45:56 nas1 crmd[43515]: notice: cluster_connect_quorum: Quorum acquired
Dec  7 20:45:56 nas1 crmd[43515]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node0[1] - state is now member (was (null))
Dec  7 20:45:56 nas1 crmd[43515]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node1[2] - state is now member (was (null))
Dec  7 20:45:56 nas1 crmd[43515]: notice: do_started: The local CRM is operational
Dec  7 20:45:56 nas1 crmd[43515]: notice: do_state_transition: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
Dec  7 20:45:58 nas1 crmd[43515]: notice: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
Dec  7 20:45:58 nas1 attrd[16368]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Dec  7 20:45:58 nas1 attrd[16368]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-nas_samba (1449492356)
Dec  7 20:45:58 nas1 attrd[16368]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Dec  7 20:45:58 nas1 attrd[16368]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-nas_samba (9)
Dec  7 20:45:58 nas1 attrd[16368]: warning: attrd_cib_callback: Update 90 for last-failure-nas_samba=1449492356 failed: No such device or address
Dec  7 20:45:58 nas1 attrd[16368]: warning: attrd_cib_callback: Update 92 for probe_complete=true failed: No such device or address
Dec  7 20:45:58 nas1 attrd[16368]: warning: attrd_cib_callback: Update 94 for fail-count-nas_samba=9 failed: No such device or address
Dec  7 20:45:59 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_rpcbind_monitor_0 (call=41, rc=0, cib-update=10, confirmed=true) ok
Dec  7 20:45:59 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_samba_monitor_0 (call=49, rc=0, cib-update=11, confirmed=true) ok
Dec  7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_tomcat_monitor_0 (call=73, rc=0, cib-update=14, confirmed=true) ok
Dec  7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_shared_metadata_monitor_0 (call=5, rc=0, cib-update=15, confirmed=true) ok
Dec  7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_filesystem_test_monitor_0 (call=9, rc=0, cib-update=16, confirmed=true) ok
Dec  7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_filesystem_Users_monitor_0 (call=13, rc=0, cib-update=17, confirmed=true) ok
Dec  7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_filesystem_gsics_monitor_0 (call=17, rc=0, cib-update=18, confirmed=true) ok
Dec  7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_filesystem_Gsix_monitor_0 (call=21, rc=0, cib-update=19, confirmed=true) ok
Dec  7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_filesystem_GX_monitor_0 (call=37, rc=0, cib-update=20, confirmed=true) ok
Dec  7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_filesystem_MAS_monitor_0 (call=25, rc=0, cib-update=21, confirmed=true) ok
Dec  7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_filesystem_User2_monitor_0 (call=33, rc=0, cib-update=22, confirmed=true) ok
Dec  7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_filesystem_kyfz_monitor_0 (call=29, rc=0, cib-update=23, confirmed=true) ok
Dec  7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_syslun_monitor_0 (call=65, rc=0, cib-update=24, confirmed=true) ok
Dec  7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_service_ip_192_168_3_100_monitor_0 (call=69, rc=0, cib-update=25, confirmed=true) ok
Dec  7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_vip_10_24_171_90_monitor_0 (call=53, rc=0, cib-update=26, confirmed=true) ok
Dec  7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_ip_monitor_0 (call=77, rc=0, cib-update=27, confirmed=true) ok
Dec  7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_nfs_monitor_0 (call=45, rc=0, cib-update=28, confirmed=true) ok
Dec  7 20:46:00 nas1 stonith-ng[16366]: notice: stonith_device_register: Device 'fence_ipmi1' already existed in device list (2 active devices)



node1's log:
Dec  7 20:38:36 nas2 crmd[33274]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
Dec  7 20:38:36 nas2 pengine[33273]: notice: unpack_config: On loss of CCM Quorum: Ignore
Dec  7 20:38:36 nas2 pengine[33273]: notice: process_pe_message: Calculated Transition 809: /var/lib/pacemaker/pengine/pe-input-117.bz2
Dec  7 20:38:36 nas2 crmd[33274]: notice: run_graph: Transition 809 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-117.bz2): Complete
Dec  7 20:38:36 nas2 crmd[33274]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Dec  7 20:40:01 nas2 systemd: Starting Session 2955 of user root.
Dec  7 20:40:01 nas2 systemd: Started Session 2955 of user root.
Dec  7 20:45:56 nas2 crmd[33274]: warning: update_failcount: Updating failcount for nas_samba on node0 after failed monitor: rc=1 (update=value++, time=1449492356)
Dec  7 20:45:56 nas2 crmd[33274]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Dec  7 20:45:56 nas2 crmd[33274]: warning: match_down_event: No match for shutdown action on 1
Dec  7 20:45:56 nas2 crmd[33274]: notice: peer_update_callback: Stonith/shutdown of node0 not matched
Dec  7 20:45:56 nas2 pengine[33273]: notice: unpack_config: On loss of CCM Quorum: Ignore
Dec  7 20:45:56 nas2 pengine[33273]: warning: unpack_rsc_op: Processing failed op monitor for nas_samba on node0: unknown error (1)
Dec  7 20:45:56 nas2 pengine[33273]: notice: process_pe_message: Calculated Transition 810: /var/lib/pacemaker/pengine/pe-input-118.bz2
Dec  7 20:45:56 nas2 pengine[33273]: notice: unpack_config: On loss of CCM Quorum: Ignore
Dec  7 20:45:56 nas2 pengine[33273]: warning: pe_fence_node: Node node0 is unclean because it is partially and/or un-expectedly down
Dec  7 20:45:56 nas2 pengine[33273]: warning: determine_online_status: Node node0 is unclean
Dec  7 20:45:56 nas2 pengine[33273]: warning: stage6: Node node0 is unclean!
Dec  7 20:45:56 nas2 pengine[33273]: warning: stage6: YOUR RESOURCES ARE NOW LIKELY COMPROMISED
Dec  7 20:45:56 nas2 pengine[33273]: error: stage6: ENABLE STONITH TO KEEP YOUR RESOURCES SAFE
Dec  7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start   nas_shared_metadata	(node1)
Dec  7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start   nas_filesystem_test	(node1)
Dec  7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start   nas_filesystem_Users	(node1)
Dec  7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start   nas_filesystem_gsics	(node1)
Dec  7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start   nas_filesystem_Gsix	(node1)
Dec  7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start   nas_filesystem_MAS	(node1)
Dec  7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start   nas_filesystem_kyfz	(node1)
Dec  7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start   nas_filesystem_User2	(node1)
Dec  7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start   nas_filesystem_GX	(node1)
Dec  7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start   nas_rpcbind	(node1)
Dec  7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start   nas_nfs	(node1)
Dec  7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start   nas_samba	(node1)
Dec  7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start   nas_vip_10_24_171_90	(node1)
Dec  7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start   fence_ipmi1	(node1)
Dec  7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start   nas_syslun	(node1)
Dec  7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start   nas_service_ip_192_168_3_100	(node1)
Dec  7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start   nas_tomcat	(node1)
Dec  7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start   nas_ip	(node1)
Dec  7 20:45:56 nas2 pengine[33273]: error: process_pe_message: Calculated Transition 811: /var/lib/pacemaker/pengine/pe-error-3.bz2
Dec  7 20:45:56 nas2 pengine[33273]: notice: process_pe_message: Configuration ERRORs found during PE processing.  Please run "crm_verify -L" to identify issues.
Dec  7 20:45:56 nas2 crmd[33274]: warning: do_log: FSA: Input I_PE_SUCCESS from handle_response() received in state S_INTEGRATION
Dec  7 20:45:58 nas2 cib[33269]: warning: cib_process_request: Completed cib_modify operation for section status: No such device or address (rc=-6, origin=node0/attrd/90, version=0.84.231)
Dec  7 20:45:58 nas2 cib[33269]: warning: cib_process_request: Completed cib_modify operation for section status: No such device or address (rc=-6, origin=node0/attrd/92, version=0.84.231)
Dec  7 20:45:58 nas2 cib[33269]: warning: cib_process_request: Completed cib_modify operation for section status: No such device or address (rc=-6, origin=node0/attrd/94, version=0.84.231)
Dec  7 20:45:59 nas2 attrd[33272]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Dec  7 20:45:59 nas2 attrd[33272]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Dec  7 20:45:59 nas2 pengine[33273]: notice: unpack_config: On loss of CCM Quorum: Ignore
Dec  7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start   nas_shared_metadata	(node0)
Dec  7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start   nas_filesystem_test	(node0)
Dec  7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start   nas_filesystem_Users	(node0)
Dec  7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start   nas_filesystem_gsics	(node0)
Dec  7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start   nas_filesystem_Gsix	(node0)
Dec  7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start   nas_filesystem_MAS	(node0)
Dec  7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start   nas_filesystem_kyfz	(node0)
Dec  7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start   nas_filesystem_User2	(node0)
Dec  7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start   nas_filesystem_GX	(node0)
Dec  7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start   nas_rpcbind	(node0)
Dec  7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start   nas_nfs	(node0)
Dec  7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start   nas_samba	(node0)
Dec  7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start   nas_vip_10_24_171_90	(node0)
Dec  7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start   fence_ipmi1	(node0)
Dec  7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start   nas_syslun	(node0)
Dec  7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start   nas_service_ip_192_168_3_100	(node0)
Dec  7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start   nas_tomcat	(node0)
Dec  7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start   nas_ip	(node0)
Dec  7 20:45:59 nas2 pengine[33273]: notice: process_pe_message: Calculated Transition 812: /var/lib/pacemaker/pengine/pe-input-119.bz2
Dec  7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 5: monitor nas_shared_metadata_monitor_0 on node0
Dec  7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 6: monitor nas_filesystem_test_monitor_0 on node0
Dec  7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 7: monitor nas_filesystem_Users_monitor_0 on node0
Dec  7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 8: monitor nas_filesystem_gsics_monitor_0 on node0
Dec  7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 9: monitor nas_filesystem_Gsix_monitor_0 on node0
Dec  7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 10: monitor nas_filesystem_MAS_monitor_0 on node0
Dec  7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 11: monitor nas_filesystem_kyfz_monitor_0 on node0
Dec  7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 12: monitor nas_filesystem_User2_monitor_0 on node0
Dec  7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 13: monitor nas_filesystem_GX_monitor_0 on node0
Dec  7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 14: monitor nas_rpcbind_monitor_0 on node0
Dec  7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 15: monitor nas_nfs_monitor_0 on node0
Dec  7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 16: monitor nas_samba_monitor_0 on node0
Dec  7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 17: monitor nas_vip_10_24_171_90_monitor_0 on node0
Dec  7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 18: monitor fence_ipmi1_monitor_0 on node0
Dec  7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 19: monitor fence_ipmi0_monitor_0 on node0
Dec  7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 20: monitor nas_syslun_monitor_0 on node0
Dec  7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 21: monitor nas_service_ip_192_168_3_100_monitor_0 on node0
Dec  7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 22: monitor nas_tomcat_monitor_0 on node0
Dec  7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 23: monitor nas_ip_monitor_0 on node0
Dec  7 20:45:59 nas2 crmd[33274]: warning: status_from_rc: Action 14 (nas_rpcbind_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec  7 20:45:59 nas2 crmd[33274]: warning: status_from_rc: Action 16 (nas_samba_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec  7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 22 (nas_tomcat_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec  7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 5 (nas_shared_metadata_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec  7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 6 (nas_filesystem_test_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec  7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 7 (nas_filesystem_Users_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec  7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 8 (nas_filesystem_gsics_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec  7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 9 (nas_filesystem_Gsix_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec  7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 13 (nas_filesystem_GX_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec  7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 10 (nas_filesystem_MAS_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec  7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 12 (nas_filesystem_User2_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec  7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 11 (nas_filesystem_kyfz_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec  7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 20 (nas_syslun_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec  7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 21 (nas_service_ip_192_168_3_100_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec  7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 17 (nas_vip_10_24_171_90_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec  7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 23 (nas_ip_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec  7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 15 (nas_nfs_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec  7 20:46:00 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 4: probe_complete probe_complete on node0 - no waiting
Dec  7 20:46:00 nas2 crmd[33274]: notice: run_graph: Transition 812 (Complete=22, Pending=0, Fired=0, Skipped=40, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-119.bz2): Stopped
Dec  7 20:46:00 nas2 pengine[33273]: notice: unpack_config: On loss of CCM Quorum: Ignore
Dec  7 20:46:00 nas2 pengine[33273]: notice: LogActions: Start   fence_ipmi1	(node0)



------------------------------------------------------------
+ pcs config +
------------------------------------------------------------
Cluster Name: my_cluster
Corosync Nodes:
 node0 node1 
Pacemaker Nodes:
 node0 node1 

Resources: 
 Group: nas_group
  Resource: nas_shared_metadata (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/mapper/metadata directory=/nasgui/metadata fstype=xfs options=sync 
   Operations: start interval=0s timeout=60 (nas_shared_metadata-start-timeout-60)
               stop interval=0s timeout=60 (nas_shared_metadata-stop-timeout-60)
               monitor interval=15s timeout=15s (nas_shared_metadata-monitor-interval-15s)
  Resource: nas_filesystem_test (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/mapper/3600b342b8d58e87d8863d70d4d0000d9 directory=/mnt/test fstype=xfs options=sync,usrquota,prjquota 
   Operations: start interval=0s timeout=60 (nas_filesystem_test-start-timeout-60)
               stop interval=0s timeout=60 (nas_filesystem_test-stop-timeout-60)
               monitor interval=15s timeout=20s (nas_filesystem_test-monitor-interval-15s)
  Resource: nas_filesystem_Users (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/mapper/3600b342cefeea4cdfb66db3eed0000d9 directory=/mnt/Users fstype=xfs options=sync,usrquota,prjquota 
   Operations: start interval=0s timeout=60 (nas_filesystem_Users-start-timeout-60)
               stop interval=0s timeout=60 (nas_filesystem_Users-stop-timeout-60)
               monitor interval=15s timeout=20s (nas_filesystem_Users-monitor-interval-15s)
  Resource: nas_filesystem_gsics (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/mapper/3600b342196550d2db9fcd9419d0000d9 directory=/mnt/gsics fstype=xfs options=sync,usrquota,grpquota 
   Operations: start interval=0s timeout=60 (nas_filesystem_gsics-start-timeout-60)
               stop interval=0s timeout=60 (nas_filesystem_gsics-stop-timeout-60)
               monitor interval=15s timeout=20s (nas_filesystem_gsics-monitor-interval-15s)
  Resource: nas_filesystem_Gsix (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/mapper/3600b342e465188ddc88ed217fd0000d9 directory=/mnt/Gsix fstype=xfs options=sync,usrquota,grpquota 
   Operations: start interval=0s timeout=60 (nas_filesystem_Gsix-start-timeout-60)
               stop interval=0s timeout=60 (nas_filesystem_Gsix-stop-timeout-60)
               monitor interval=15s timeout=20s (nas_filesystem_Gsix-monitor-interval-15s)
  Resource: nas_filesystem_MAS (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/mapper/3600b342a25ca66fde792d83ddd0000d9 directory=/mnt/MAS fstype=xfs options=sync,usrquota,grpquota 
   Operations: start interval=0s timeout=60 (nas_filesystem_MAS-start-timeout-60)
               stop interval=0s timeout=60 (nas_filesystem_MAS-stop-timeout-60)
               monitor interval=15s timeout=20s (nas_filesystem_MAS-monitor-interval-15s)
  Resource: nas_filesystem_kyfz (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/mapper/3600b342f6432060d750fd1114d0000d9 directory=/mnt/kyfz fstype=xfs options=sync,uquota,gquota 
   Operations: start interval=0s timeout=60 (nas_filesystem_kyfz-start-timeout-60)
               stop interval=0s timeout=60 (nas_filesystem_kyfz-stop-timeout-60)
               monitor interval=15s timeout=20s (nas_filesystem_kyfz-monitor-interval-15s)
  Resource: nas_filesystem_User2 (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/mapper/3600b342aa9dc9d7df167dbb03d0000d9 directory=/mnt/User2 fstype=xfs options=sync,usrquota,prjquota 
   Operations: start interval=0s timeout=60 (nas_filesystem_User2-start-timeout-60)
               stop interval=0s timeout=60 (nas_filesystem_User2-stop-timeout-60)
               monitor interval=15s timeout=20s (nas_filesystem_User2-monitor-interval-15s)
  Resource: nas_filesystem_GX (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/mapper/3600b342bf60c8a1d404bdd6c0d0000d9 directory=/mnt/GX fstype=xfs options=sync,uquota,gquota 
   Operations: start interval=0s timeout=60 (nas_filesystem_GX-start-timeout-60)
               stop interval=0s timeout=60 (nas_filesystem_GX-stop-timeout-60)
               monitor interval=15s timeout=20s (nas_filesystem_GX-monitor-interval-15s)
  Resource: nas_rpcbind (class=systemd type=rpcbind)
   Operations: monitor interval=15s start-delay=10s timeout=20s (nas_rpcbind-monitor-interval-15s)
  Resource: nas_nfs (class=ocf provider=heartbeat type=nfsserver)
   Attributes: nfs_shared_infodir=/nasgui/metadata/nfsinfo nfsd_args= 
   Operations: start interval=0s timeout=40 (nas_nfs-start-timeout-40)
               stop interval=0s timeout=20s (nas_nfs-stop-timeout-20s)
               monitor interval=15s start-delay=10s timeout=20s (nas_nfs-monitor-interval-15s)
  Resource: nas_samba (class=systemd type=smb)
   Operations: monitor interval=60s start-delay=10s (nas_samba-monitor-start-delay-10s)
               monitor interval=15s timeout=20s (nas_samba-monitor-interval-15s)
  Resource: nas_vip_10_24_171_90 (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=10.24.171.90 nic=bond1 cidr_netmask=24 
   Operations: monitor interval=20s timeout=20s (nas_vip_10_24_171_90-monitor-20s)
 Group: nas_tomcat_service
  Resource: nas_syslun (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/mapper/sys_lun directory=/nasgui/share fstype=xfs options=sync 
   Operations: start interval=0s timeout=60 (nas_syslun-start-timeout-60)
               stop interval=0s timeout=60 (nas_syslun-stop-timeout-60)
               monitor interval=15s timeout=15s (nas_syslun-monitor-interval-15s)
  Resource: nas_service_ip_192_168_3_100 (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=192.168.3.100 cidr_netmask=24 
   Operations: start interval=0s timeout=20s (nas_service_ip_192_168_3_100-start-timeout-20s)
               stop interval=0s timeout=20s (nas_service_ip_192_168_3_100-stop-timeout-20s)
               monitor interval=20s timeout=20s (nas_service_ip_192_168_3_100-monitor-interval-20s)
  Resource: nas_tomcat (class=ocf provider=heartbeat type=tomcat)
   Attributes: java_home=/usr/local/jdk/jdk1.6.0_37 catalina_home=/usr/local/nas/apache-tomcat-6.0.35 
   Operations: start interval=0s timeout=60s (nas_tomcat-start-timeout-60s)
               stop interval=0s timeout=120s (nas_tomcat-stop-timeout-120s)
               monitor interval=10s start-delay=5s timeout=30s (nas_tomcat-monitor-interval-10s)
  Resource: nas_ip (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=10.24.171.91 cidr_netmask=24 
   Operations: start interval=0s timeout=20s (nas_ip-start-timeout-20s)
               stop interval=0s timeout=20s (nas_ip-stop-timeout-20s)
               monitor interval=20s timeout=20s (nas_ip-monitor-interval-20s)

Stonith Devices: 
 Resource: fence_ipmi1 (class=stonith type=fence_ipmilan)
  Attributes: pcmk_host_list=node1 pcmk_host_check=static-list ipaddr=192.168.66.41 action=reboot login=root passwd=passwd delay=15 auth=password lanplus=true 
  Operations: monitor interval=60s (fence_ipmi1-monitor-interval-60s)
 Resource: fence_ipmi0 (class=stonith type=fence_ipmilan)
  Attributes: pcmk_host_list=node0 pcmk_host_check=static-list ipaddr=192.168.66.40 action=reboot login=root passwd=passwd delay=15 auth=password lanplus=true 
  Operations: monitor interval=60s (fence_ipmi0-monitor-interval-60s)
Fencing Levels: 

Location Constraints:
  Resource: fence_ipmi0
    Enabled on: node1 (score:INFINITY) (id:location-fence_ipmi0-node1-INFINITY)
  Resource: fence_ipmi1
    Enabled on: node0 (score:INFINITY) (id:location-fence_ipmi1-node0-INFINITY)
  Resource: nas_tomcat_service
    Enabled on: node0 (score:INFINITY) (id:location-nas_tomcat_service-node0-INFINITY)
Ordering Constraints:
Colocation Constraints:

Cluster Properties:
 cluster-infrastructure: corosync
 dc-version: 1.1.10-29.el7-368c726
 no-quorum-policy: ignore
 stonith-enabled: false
------------------------------------------------------------



Pacemaker version:
pacemaker-1.1.10-29.el7.x86_64



This happens repeatedly and confuses me.
Is this the same problem as Bug 1284069 (https://bugzilla.redhat.com/show_bug.cgi?id=1284069)?
Any comments would be helpful.
Thanks.

Comment 5 Ken Gaillot 2016-01-06 15:22:56 UTC
Zhaoming,

It does look like lrmd is crashing on node0 after a failed monitor on the nas_samba resource, which is a systemd resource, so it is possible that the lrmd crash is related to this bug -- though it's not certain from just these logs.

However, that would be a symptom of a larger issue, as this bug happens when reporting an existing error condition.

I notice you have fence devices defined, but stonith-enabled is false. Enabling stonith would allow the cluster to recover safely from lost connections. Beware that using only onboard IPMI for fencing is vulnerable to a complete power loss.

The most important errors in your logs are:

Dec  7 20:45:56 nas2 crmd[33274]: warning: match_down_event: No match for shutdown action on 1
...
Dec  7 20:45:56 nas2 pengine[33273]: warning: pe_fence_node: Node node0 is unclean because it is partially and/or un-expectedly down

That means that corosync lost the connection between the nodes at that time. There could be a network issue, or one node was unresponsive for a length of time due to high load, or something like that. If fencing were enabled, nas2 would have fenced nas1.

You probably have more information in /var/log/corosync.log (or /var/log/cluster/corosync.log).

Comment 6 Zhaoming Zhang 2016-01-07 03:10:10 UTC
Ken, thanks for your reply.

(In reply to Ken Gaillot from comment #5)
> Zhaoming,
> 
> It does look like lrmd is crashing on node0 after a failed monitor on the
> nas_samba resource, which is a systemd resource, so it is possible that the
> lrmd crash is related to his bug -- though it's not certain from just these
> logs.
Anything else needed to confirm?


> 
> However, that would be a symptom of a larger issue, as this bug happens when
> reporting an existing error condition.
> 
> I notice you have fence devices defined, but stonith-enabled is false.
> Enabling stonith would allow the cluster to recover safely from lost
> connections. Beware that using only onboard IPMI for fencing is vulnerable
> to a complete power loss.
Thanks for the tips. I've found that using only onboard IPMI for fencing makes the fence action keep retrying, and the resources will not fail over when the master node loses power completely. That's the reason I configured stonith-enabled to false. Is there any other solution to the complete power loss problem?


> 
> The most important errors in your logs are:
> 
> Dec  7 20:45:56 nas2 crmd[33274]: warning: match_down_event: No match for
> shutdown action on 1
> ...
> Dec  7 20:45:56 nas2 pengine[33273]: warning: pe_fence_node: Node node0 is
> unclean because it is partially and/or un-expectedly down
> 
> That means that corosync lost the connection between the nodes at that time.
> There could be a network issue, or one node was unresponsive for a length of
> time due to high load, or something like that. If fencing were enabled, nas2
> would have fenced nas1.

Yes, I've found the failure of the connection to LRM in nas1's logs (shown in Comment 4, Zhaoming Zhang 2016-01-06 06:18:12 EST): 
Dec  7 20:45:56 nas1 crmd[35821]: error: crm_ipc_read: Connection to lrmd failed
Dec  7 20:45:56 nas1 crmd[35821]: error: mainloop_gio_callback: Connection to lrmd[0x215b020] closed (I/O condition=17)
Dec  7 20:45:56 nas1 crmd[35821]: crit: lrm_connection_destroy: LRM Connection failed
...
Dec  7 20:45:56 nas1 crmd[35821]: notice: do_lrm_control: Disconnected from the LRM
Dec  7 20:45:56 nas1 crmd[35821]: notice: terminate_cs_connection: Disconnecting from Corosync
Dec  7 20:45:56 nas1 cib[16365]: warning: qb_ipcs_event_sendv: new_event_notification (16365-35821-9): Broken pipe (32)
Dec  7 20:45:56 nas1 cib[16365]: warning: do_local_notify: A-Sync reply to crmd failed: No message of desired type
Dec  7 20:45:56 nas1 crmd[35821]: error: crmd_fast_exit: Could not recover from internal error
Dec  7 20:45:56 nas1 pacemakerd[16364]: error: pcmk_child_exit: Child process crmd (35821) exited: Generic Pacemaker error (201)
Dec  7 20:45:56 nas1 pacemakerd[16364]: notice: pcmk_process_exit: Respawning failed child process: crmd
Dec  7 20:45:56 nas1 crmd[43515]: notice: main: CRM Git Version: 368c726
Dec  7 20:45:56 nas1 crmd[43515]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Dec  7 20:45:56 nas1 crmd[43515]: notice: cluster_connect_quorum: Quorum acquired

I'm not sure what the reason for the failure is, or whether it is related to this Bug 1284069.



> 
> You probably have more information in /var/log/corosync.log (or
> /var/log/cluster/corosync.log).

Sorry, here is /etc/corosync/corosync.conf, which explains why there is no corosync.log in /var/log or /var/log/cluster:
cat /etc/corosync/corosync.conf
totem {
    version: 2
    secauth: off
    cluster_name: my_cluster
    transport: udpu
}

nodelist {
    node {
        ring0_addr: node0
        nodeid: 1
    }
    node {
        ring0_addr: node1
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}

logging {
    to_syslog: yes
}
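If a dedicated corosync logfile would help with debugging, the logging section could be extended along these lines (a sketch based on standard corosync 2.x logging options; the logfile path is only an example and the directory must exist):

```
logging {
    to_syslog: yes
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
}
```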



Shall I attach the /var/log/messages file or something else?

Comment 7 Ken Gaillot 2016-01-08 16:48:24 UTC
(In reply to Zhaoming Zhang from comment #6)
> Ken,hanks for your reply.
> 
> (In reply to Ken Gaillot from comment #5)
> > Zhaoming,
> > 
> > It does look like lrmd is crashing on node0 after a failed monitor on the
> > nas_samba resource, which is a systemd resource, so it is possible that the
> > lrmd crash is related to his bug -- though it's not certain from just these
> > logs.
> Anything else needed to confirm?

I don't see a "core dumped" message in your logs, so I think it's unlikely. If you haven't disabled cores and don't see one in /var/lib/pacemaker/cores when the problem occurs, it's probably not related.
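To check for the crash described in this bug, something like the following could be run on the affected node (a diagnostic sketch; the paths are the RHEL 7 defaults and may differ on other builds):

```shell
# 1. Are core dumps enabled? "0" means disabled; "unlimited" or a size means enabled.
ulimit -c

# 2. Did a Pacemaker daemon leave a core file behind?
ls -l /var/lib/pacemaker/cores/ 2>/dev/null || echo "no cores directory found"

# 3. If a core exists, a backtrace would confirm whether lrmd crashed in logging code:
#    gdb /usr/libexec/pacemaker/lrmd /var/lib/pacemaker/cores/core.NNNN -batch -ex bt
```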

> > I notice you have fence devices defined, but stonith-enabled is false.
> > Enabling stonith would allow the cluster to recover safely from lost
> > connections. Beware that using only onboard IPMI for fencing is vulnerable
> > to a complete power loss.
> Thanks for the tips. I've found that using only onboard IPMI for fencing
> makes the fence action keep trying and the resources will not failover when
> the Master node lose power completely. That's the reason I configed
> stonith-enabled false. Is there any other solution to the complete power
> lose problem?

The only solution is using a second level of fencing: typically an intelligent power switch, or potentially a shared-storage cutoff (iSCSI reservations) and/or network cutoff (an intelligent Ethernet switch). Pacemaker allows you to configure a fencing "topology" with multiple levels -- for example, try IPMI first, and if that fails, try the power switch.
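Such a topology could be sketched with pcs roughly as follows (device and node names taken from the configuration shown earlier, except "my_apc_switch", which is a hypothetical second-level power-switch device; exact syntax may vary between pcs versions):

```shell
# Level 1: try IPMI fencing first
pcs stonith level add 1 node0 fence_ipmi0
pcs stonith level add 1 node1 fence_ipmi1

# Level 2: fall back to the (hypothetical) power switch if IPMI fails
pcs stonith level add 2 node0 my_apc_switch
pcs stonith level add 2 node1 my_apc_switch

# Fencing itself must be enabled for any of this to take effect
pcs property set stonith-enabled=true
```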

Fencing is a bit of a pain to set up, but it's the only way to recover from certain error situations safely.

> > The most important errors in your logs are:
> > 
> > Dec  7 20:45:56 nas2 crmd[33274]: warning: match_down_event: No match for
> > shutdown action on 1
> > ...
> > Dec  7 20:45:56 nas2 pengine[33273]: warning: pe_fence_node: Node node0 is
> > unclean because it is partially and/or un-expectedly down
> > 
> > That means that corosync lost the connection between the nodes at that time.
> > There could be a network issue, or one node was unresponsive for a length of
> > time due to high load, or something like that. If fencing were enabled, nas2
> > would have fenced nas1.
> 
> Yes, I've found the failure of connection to LRM in nas1's logs(showed in
> Comment 4 Zhaoming Zhang 2016-01-06 06:18:12 EST): 
> Dec  7 20:45:56 nas1 crmd[35821]: error: crm_ipc_read: Connection to lrmd
> failed

The "node0 is unclean" message means that the connection issue was not limited to lrmd, but affected the entire node. The usual culprit is either a network issue, or high load on a node that makes it unresponsive for an extended time. You can usually tell from the system logs on that node (either a bunch of services report network-related trouble, or the kernel reports scheduling problems). 

> > You probably have more information in /var/log/corosync.log (or
> > /var/log/cluster/corosync.log).
> 
> I'm sorry to show the /etc/corosync/corosync.conf which explains I got
> nothing about corosync.log in /var/log or /var/log/cluster.

Sometimes packages will ship with corosync.log enabled somewhere, but most likely the system log (/var/log/messages or whatever you've got) will have corosync messages. If you grep for corosync and look around the time of the error, you might see something useful.

Comment 9 Mike McCune 2016-03-28 22:52:26 UTC
This bug was accidentally moved from POST to MODIFIED via an error in automation, please see mmccune with any questions

Comment 13 errata-xmlrpc 2016-11-03 18:56:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2578.html