Bug 1284069
Summary: Pacemaker's lrmd crashes after certain systemd errors
Product: Red Hat Enterprise Linux 7
Component: pacemaker
Version: 7.2
Target Release: 7.3
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Target Milestone: rc
Keywords: ZStream
Reporter: Ken Gaillot <kgaillot>
Assignee: Ken Gaillot <kgaillot>
QA Contact: cluster-qe <cluster-qe>
Docs Contact: Milan Navratil <mnavrati>
CC: 252150241, cfeist, cluster-maint, jkortus, mjuricek, mnavrati, royoung, tlavigne
Fixed In Version: pacemaker-1.1.15-1.2c148ac.git.el7
Doc Type: Bug Fix
Doc Text:
    *lrmd* logs errors as expected and no longer crashes
    Previously, Pacemaker's Local Resource Management Daemon (lrmd) used an invalid format string when logging certain rare *systemd* errors. As a consequence, *lrmd* could terminate unexpectedly with a segmentation fault. A patch has been applied to fix the format string. As a result, *lrmd* no longer crashes and logs these rare error messages as intended.
Cloned As: 1299339 (view as bug list)
Last Closed: 2016-11-03 18:56:39 UTC
Type: Bug
Bug Depends On: 1304771
Bug Blocks: 1299339, 1394068
Description
Ken Gaillot 2015-11-20 17:30:50 UTC

Fixed upstream in 6c495a4

Comment (Zhaoming Zhang):

I found a problem in my two-node pacemaker cluster.

node0's log:

Dec 7 20:30:01 nas1 systemd: Starting Session 3016 of user root.
Dec 7 20:30:01 nas1 systemd: Started Session 3016 of user root.
Dec 7 20:40:01 nas1 systemd: Starting Session 3017 of user root.
Dec 7 20:40:01 nas1 systemd: Started Session 3017 of user root.
Dec 7 20:45:06 nas1 systemd: Reloading.
Dec 7 20:45:56 nas1 crmd[35821]: notice: process_lrm_event: LRM operation nas_samba_monitor_15000 (call=89, rc=1, cib-update=49, confirmed=false) unknown error
Dec 7 20:45:56 nas1 attrd[16368]: notice: attrd_cs_dispatch: Update relayed from node1
Dec 7 20:45:56 nas1 attrd[16368]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-nas_samba (9)
Dec 7 20:45:56 nas1 attrd[16368]: notice: attrd_perform_update: Sent update 85: fail-count-nas_samba=9
Dec 7 20:45:56 nas1 attrd[16368]: notice: attrd_cs_dispatch: Update relayed from node1
Dec 7 20:45:56 nas1 attrd[16368]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-nas_samba (1449492356)
Dec 7 20:45:56 nas1 attrd[16368]: notice: attrd_perform_update: Sent update 88: last-failure-nas_samba=1449492356
Dec 7 20:45:56 nas1 crmd[35821]: error: crm_ipc_read: Connection to lrmd failed
Dec 7 20:45:56 nas1 crmd[35821]: error: mainloop_gio_callback: Connection to lrmd[0x215b020] closed (I/O condition=17)
Dec 7 20:45:56 nas1 crmd[35821]: crit: lrm_connection_destroy: LRM Connection failed
Dec 7 20:45:56 nas1 crmd[35821]: error: do_log: FSA: Input I_ERROR from lrm_connection_destroy() received in state S_NOT_DC
Dec 7 20:45:56 nas1 crmd[35821]: notice: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=lrm_connection_destroy ]
Dec 7 20:45:56 nas1 crmd[35821]: warning: do_recover: Fast-tracking shutdown in response to errors
Dec 7 20:45:56 nas1 crmd[35821]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Dec 7 20:45:56 nas1 pacemakerd[16364]: error: pcmk_child_exit: Child process lrmd (35818) exited: Operation not permitted (1)
Dec 7 20:45:56 nas1 pacemakerd[16364]: notice: pcmk_process_exit: Respawning failed child process: lrmd
Dec 7 20:45:56 nas1 crmd[35821]: notice: lrm_state_verify_stopped: Recurring action fence_ipmi1:97 (fence_ipmi1_monitor_60000) incomplete at shutdown
Dec 7 20:45:56 nas1 crmd[35821]: error: lrm_state_verify_stopped: 1 resources were active at shutdown.
Dec 7 20:45:56 nas1 crmd[35821]: notice: do_lrm_control: Disconnected from the LRM
Dec 7 20:45:56 nas1 crmd[35821]: notice: terminate_cs_connection: Disconnecting from Corosync
Dec 7 20:45:56 nas1 cib[16365]: warning: qb_ipcs_event_sendv: new_event_notification (16365-35821-9): Broken pipe (32)
Dec 7 20:45:56 nas1 cib[16365]: warning: do_local_notify: A-Sync reply to crmd failed: No message of desired type
Dec 7 20:45:56 nas1 crmd[35821]: error: crmd_fast_exit: Could not recover from internal error
Dec 7 20:45:56 nas1 pacemakerd[16364]: error: pcmk_child_exit: Child process crmd (35821) exited: Generic Pacemaker error (201)
Dec 7 20:45:56 nas1 pacemakerd[16364]: notice: pcmk_process_exit: Respawning failed child process: crmd
Dec 7 20:45:56 nas1 crmd[43515]: notice: main: CRM Git Version: 368c726
Dec 7 20:45:56 nas1 crmd[43515]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Dec 7 20:45:56 nas1 crmd[43515]: notice: cluster_connect_quorum: Quorum acquired
Dec 7 20:45:56 nas1 crmd[43515]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node0[1] - state is now member (was (null))
Dec 7 20:45:56 nas1 crmd[43515]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node1[2] - state is now member (was (null))
Dec 7 20:45:56 nas1 crmd[43515]: notice: do_started: The local CRM is operational
Dec 7 20:45:56 nas1 crmd[43515]: notice: do_state_transition: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
Dec 7 20:45:58 nas1 crmd[43515]: notice: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
Dec 7 20:45:58 nas1 attrd[16368]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Dec 7 20:45:58 nas1 attrd[16368]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-nas_samba (1449492356)
Dec 7 20:45:58 nas1 attrd[16368]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Dec 7 20:45:58 nas1 attrd[16368]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-nas_samba (9)
Dec 7 20:45:58 nas1 attrd[16368]: warning: attrd_cib_callback: Update 90 for last-failure-nas_samba=1449492356 failed: No such device or address
Dec 7 20:45:58 nas1 attrd[16368]: warning: attrd_cib_callback: Update 92 for probe_complete=true failed: No such device or address
Dec 7 20:45:58 nas1 attrd[16368]: warning: attrd_cib_callback: Update 94 for fail-count-nas_samba=9 failed: No such device or address
Dec 7 20:45:59 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_rpcbind_monitor_0 (call=41, rc=0, cib-update=10, confirmed=true) ok
Dec 7 20:45:59 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_samba_monitor_0 (call=49, rc=0, cib-update=11, confirmed=true) ok
Dec 7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_tomcat_monitor_0 (call=73, rc=0, cib-update=14, confirmed=true) ok
Dec 7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_shared_metadata_monitor_0 (call=5, rc=0, cib-update=15, confirmed=true) ok
Dec 7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_filesystem_test_monitor_0 (call=9, rc=0, cib-update=16, confirmed=true) ok
Dec 7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_filesystem_Users_monitor_0 (call=13, rc=0, cib-update=17, confirmed=true) ok
Dec 7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_filesystem_gsics_monitor_0 (call=17, rc=0, cib-update=18, confirmed=true) ok
Dec 7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_filesystem_Gsix_monitor_0 (call=21, rc=0, cib-update=19, confirmed=true) ok
Dec 7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_filesystem_GX_monitor_0 (call=37, rc=0, cib-update=20, confirmed=true) ok
Dec 7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_filesystem_MAS_monitor_0 (call=25, rc=0, cib-update=21, confirmed=true) ok
Dec 7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_filesystem_User2_monitor_0 (call=33, rc=0, cib-update=22, confirmed=true) ok
Dec 7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_filesystem_kyfz_monitor_0 (call=29, rc=0, cib-update=23, confirmed=true) ok
Dec 7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_syslun_monitor_0 (call=65, rc=0, cib-update=24, confirmed=true) ok
Dec 7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_service_ip_192_168_3_100_monitor_0 (call=69, rc=0, cib-update=25, confirmed=true) ok
Dec 7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_vip_10_24_171_90_monitor_0 (call=53, rc=0, cib-update=26, confirmed=true) ok
Dec 7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_ip_monitor_0 (call=77, rc=0, cib-update=27, confirmed=true) ok
Dec 7 20:46:00 nas1 crmd[43515]: notice: process_lrm_event: LRM operation nas_nfs_monitor_0 (call=45, rc=0, cib-update=28, confirmed=true) ok
Dec 7 20:46:00 nas1 stonith-ng[16366]: notice: stonith_device_register: Device 'fence_ipmi1' already existed in device list (2 active devices)

node1's log:

Dec 7 20:38:36 nas2 crmd[33274]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
Dec 7 20:38:36 nas2 pengine[33273]: notice: unpack_config: On loss of CCM Quorum: Ignore
Dec 7 20:38:36 nas2 pengine[33273]: notice: process_pe_message: Calculated Transition 809: /var/lib/pacemaker/pengine/pe-input-117.bz2
Dec 7 20:38:36 nas2 crmd[33274]: notice: run_graph: Transition 809 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-117.bz2): Complete
Dec 7 20:38:36 nas2 crmd[33274]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Dec 7 20:40:01 nas2 systemd: Starting Session 2955 of user root.
Dec 7 20:40:01 nas2 systemd: Started Session 2955 of user root.
Dec 7 20:45:56 nas2 crmd[33274]: warning: update_failcount: Updating failcount for nas_samba on node0 after failed monitor: rc=1 (update=value++, time=1449492356)
Dec 7 20:45:56 nas2 crmd[33274]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Dec 7 20:45:56 nas2 crmd[33274]: warning: match_down_event: No match for shutdown action on 1
Dec 7 20:45:56 nas2 crmd[33274]: notice: peer_update_callback: Stonith/shutdown of node0 not matched
Dec 7 20:45:56 nas2 pengine[33273]: notice: unpack_config: On loss of CCM Quorum: Ignore
Dec 7 20:45:56 nas2 pengine[33273]: warning: unpack_rsc_op: Processing failed op monitor for nas_samba on node0: unknown error (1)
Dec 7 20:45:56 nas2 pengine[33273]: notice: process_pe_message: Calculated Transition 810: /var/lib/pacemaker/pengine/pe-input-118.bz2
Dec 7 20:45:56 nas2 pengine[33273]: notice: unpack_config: On loss of CCM Quorum: Ignore
Dec 7 20:45:56 nas2 pengine[33273]: warning: pe_fence_node: Node node0 is unclean because it is partially and/or un-expectedly down
Dec 7 20:45:56 nas2 pengine[33273]: warning: determine_online_status: Node node0 is unclean
Dec 7 20:45:56 nas2 pengine[33273]: warning: stage6: Node node0 is unclean!
Dec 7 20:45:56 nas2 pengine[33273]: warning: stage6: YOUR RESOURCES ARE NOW LIKELY COMPROMISED
Dec 7 20:45:56 nas2 pengine[33273]: error: stage6: ENABLE STONITH TO KEEP YOUR RESOURCES SAFE
Dec 7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start nas_shared_metadata (node1)
Dec 7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start nas_filesystem_test (node1)
Dec 7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start nas_filesystem_Users (node1)
Dec 7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start nas_filesystem_gsics (node1)
Dec 7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start nas_filesystem_Gsix (node1)
Dec 7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start nas_filesystem_MAS (node1)
Dec 7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start nas_filesystem_kyfz (node1)
Dec 7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start nas_filesystem_User2 (node1)
Dec 7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start nas_filesystem_GX (node1)
Dec 7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start nas_rpcbind (node1)
Dec 7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start nas_nfs (node1)
Dec 7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start nas_samba (node1)
Dec 7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start nas_vip_10_24_171_90 (node1)
Dec 7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start fence_ipmi1 (node1)
Dec 7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start nas_syslun (node1)
Dec 7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start nas_service_ip_192_168_3_100 (node1)
Dec 7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start nas_tomcat (node1)
Dec 7 20:45:56 nas2 pengine[33273]: notice: LogActions: Start nas_ip (node1)
Dec 7 20:45:56 nas2 pengine[33273]: error: process_pe_message: Calculated Transition 811: /var/lib/pacemaker/pengine/pe-error-3.bz2
Dec 7 20:45:56 nas2 pengine[33273]: notice: process_pe_message: Configuration ERRORs found during PE processing. Please run "crm_verify -L" to identify issues.
Dec 7 20:45:56 nas2 crmd[33274]: warning: do_log: FSA: Input I_PE_SUCCESS from handle_response() received in state S_INTEGRATION
Dec 7 20:45:58 nas2 cib[33269]: warning: cib_process_request: Completed cib_modify operation for section status: No such device or address (rc=-6, origin=node0/attrd/90, version=0.84.231)
Dec 7 20:45:58 nas2 cib[33269]: warning: cib_process_request: Completed cib_modify operation for section status: No such device or address (rc=-6, origin=node0/attrd/92, version=0.84.231)
Dec 7 20:45:58 nas2 cib[33269]: warning: cib_process_request: Completed cib_modify operation for section status: No such device or address (rc=-6, origin=node0/attrd/94, version=0.84.231)
Dec 7 20:45:59 nas2 attrd[33272]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Dec 7 20:45:59 nas2 attrd[33272]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Dec 7 20:45:59 nas2 pengine[33273]: notice: unpack_config: On loss of CCM Quorum: Ignore
Dec 7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start nas_shared_metadata (node0)
Dec 7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start nas_filesystem_test (node0)
Dec 7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start nas_filesystem_Users (node0)
Dec 7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start nas_filesystem_gsics (node0)
Dec 7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start nas_filesystem_Gsix (node0)
Dec 7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start nas_filesystem_MAS (node0)
Dec 7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start nas_filesystem_kyfz (node0)
Dec 7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start nas_filesystem_User2 (node0)
Dec 7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start nas_filesystem_GX (node0)
Dec 7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start nas_rpcbind (node0)
Dec 7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start nas_nfs (node0)
Dec 7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start nas_samba (node0)
Dec 7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start nas_vip_10_24_171_90 (node0)
Dec 7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start fence_ipmi1 (node0)
Dec 7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start nas_syslun (node0)
Dec 7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start nas_service_ip_192_168_3_100 (node0)
Dec 7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start nas_tomcat (node0)
Dec 7 20:45:59 nas2 pengine[33273]: notice: LogActions: Start nas_ip (node0)
Dec 7 20:45:59 nas2 pengine[33273]: notice: process_pe_message: Calculated Transition 812: /var/lib/pacemaker/pengine/pe-input-119.bz2
Dec 7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 5: monitor nas_shared_metadata_monitor_0 on node0
Dec 7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 6: monitor nas_filesystem_test_monitor_0 on node0
Dec 7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 7: monitor nas_filesystem_Users_monitor_0 on node0
Dec 7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 8: monitor nas_filesystem_gsics_monitor_0 on node0
Dec 7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 9: monitor nas_filesystem_Gsix_monitor_0 on node0
Dec 7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 10: monitor nas_filesystem_MAS_monitor_0 on node0
Dec 7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 11: monitor nas_filesystem_kyfz_monitor_0 on node0
Dec 7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 12: monitor nas_filesystem_User2_monitor_0 on node0
Dec 7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 13: monitor nas_filesystem_GX_monitor_0 on node0
Dec 7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 14: monitor nas_rpcbind_monitor_0 on node0
Dec 7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 15: monitor nas_nfs_monitor_0 on node0
Dec 7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 16: monitor nas_samba_monitor_0 on node0
Dec 7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 17: monitor nas_vip_10_24_171_90_monitor_0 on node0
Dec 7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 18: monitor fence_ipmi1_monitor_0 on node0
Dec 7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 19: monitor fence_ipmi0_monitor_0 on node0
Dec 7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 20: monitor nas_syslun_monitor_0 on node0
Dec 7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 21: monitor nas_service_ip_192_168_3_100_monitor_0 on node0
Dec 7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 22: monitor nas_tomcat_monitor_0 on node0
Dec 7 20:45:59 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 23: monitor nas_ip_monitor_0 on node0
Dec 7 20:45:59 nas2 crmd[33274]: warning: status_from_rc: Action 14 (nas_rpcbind_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec 7 20:45:59 nas2 crmd[33274]: warning: status_from_rc: Action 16 (nas_samba_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec 7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 22 (nas_tomcat_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec 7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 5 (nas_shared_metadata_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec 7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 6 (nas_filesystem_test_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec 7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 7 (nas_filesystem_Users_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec 7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 8 (nas_filesystem_gsics_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec 7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 9 (nas_filesystem_Gsix_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec 7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 13 (nas_filesystem_GX_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec 7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 10 (nas_filesystem_MAS_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec 7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 12 (nas_filesystem_User2_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec 7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 11 (nas_filesystem_kyfz_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec 7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 20 (nas_syslun_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec 7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 21 (nas_service_ip_192_168_3_100_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec 7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 17 (nas_vip_10_24_171_90_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec 7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 23 (nas_ip_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec 7 20:46:00 nas2 crmd[33274]: warning: status_from_rc: Action 15 (nas_nfs_monitor_0) on node0 failed (target: 7 vs. rc: 0): Error
Dec 7 20:46:00 nas2 crmd[33274]: notice: te_rsc_command: Initiating action 4: probe_complete probe_complete on node0 - no waiting
Dec 7 20:46:00 nas2 crmd[33274]: notice: run_graph: Transition 812 (Complete=22, Pending=0, Fired=0, Skipped=40, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-119.bz2): Stopped
Dec 7 20:46:00 nas2 pengine[33273]: notice: unpack_config: On loss of CCM Quorum: Ignore
Dec 7 20:46:00 nas2 pengine[33273]: notice: LogActions: Start fence_ipmi1 (node0)

------------------------------------------------------------
+ pcs config +
------------------------------------------------------------
Cluster Name: my_cluster
Corosync Nodes:
 node0 node1
Pacemaker Nodes:
 node0 node1

Resources:
 Group: nas_group
  Resource: nas_shared_metadata (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/mapper/metadata directory=/nasgui/metadata fstype=xfs options=sync
   Operations: start interval=0s timeout=60 (nas_shared_metadata-start-timeout-60)
               stop interval=0s timeout=60 (nas_shared_metadata-stop-timeout-60)
               monitor interval=15s timeout=15s (nas_shared_metadata-monitor-interval-15s)
  Resource: nas_filesystem_test (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/mapper/3600b342b8d58e87d8863d70d4d0000d9 directory=/mnt/test fstype=xfs options=sync,usrquota,prjquota
   Operations: start interval=0s timeout=60 (nas_filesystem_test-start-timeout-60)
               stop interval=0s timeout=60 (nas_filesystem_test-stop-timeout-60)
               monitor interval=15s timeout=20s (nas_filesystem_test-monitor-interval-15s)
  Resource: nas_filesystem_Users (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/mapper/3600b342cefeea4cdfb66db3eed0000d9 directory=/mnt/Users fstype=xfs options=sync,usrquota,prjquota
   Operations: start interval=0s timeout=60 (nas_filesystem_Users-start-timeout-60)
               stop interval=0s timeout=60 (nas_filesystem_Users-stop-timeout-60)
               monitor interval=15s timeout=20s (nas_filesystem_Users-monitor-interval-15s)
  Resource: nas_filesystem_gsics (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/mapper/3600b342196550d2db9fcd9419d0000d9 directory=/mnt/gsics fstype=xfs options=sync,usrquota,grpquota
   Operations: start interval=0s timeout=60 (nas_filesystem_gsics-start-timeout-60)
               stop interval=0s timeout=60 (nas_filesystem_gsics-stop-timeout-60)
               monitor interval=15s timeout=20s (nas_filesystem_gsics-monitor-interval-15s)
  Resource: nas_filesystem_Gsix (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/mapper/3600b342e465188ddc88ed217fd0000d9 directory=/mnt/Gsix fstype=xfs options=sync,usrquota,grpquota
   Operations: start interval=0s timeout=60 (nas_filesystem_Gsix-start-timeout-60)
               stop interval=0s timeout=60 (nas_filesystem_Gsix-stop-timeout-60)
               monitor interval=15s timeout=20s (nas_filesystem_Gsix-monitor-interval-15s)
  Resource: nas_filesystem_MAS (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/mapper/3600b342a25ca66fde792d83ddd0000d9 directory=/mnt/MAS fstype=xfs options=sync,usrquota,grpquota
   Operations: start interval=0s timeout=60 (nas_filesystem_MAS-start-timeout-60)
               stop interval=0s timeout=60 (nas_filesystem_MAS-stop-timeout-60)
               monitor interval=15s timeout=20s (nas_filesystem_MAS-monitor-interval-15s)
  Resource: nas_filesystem_kyfz (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/mapper/3600b342f6432060d750fd1114d0000d9 directory=/mnt/kyfz fstype=xfs options=sync,uquota,gquota
   Operations: start interval=0s timeout=60 (nas_filesystem_kyfz-start-timeout-60)
               stop interval=0s timeout=60 (nas_filesystem_kyfz-stop-timeout-60)
               monitor interval=15s timeout=20s (nas_filesystem_kyfz-monitor-interval-15s)
  Resource: nas_filesystem_User2 (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/mapper/3600b342aa9dc9d7df167dbb03d0000d9 directory=/mnt/User2 fstype=xfs options=sync,usrquota,prjquota
   Operations: start interval=0s timeout=60 (nas_filesystem_User2-start-timeout-60)
               stop interval=0s timeout=60 (nas_filesystem_User2-stop-timeout-60)
               monitor interval=15s timeout=20s (nas_filesystem_User2-monitor-interval-15s)
  Resource: nas_filesystem_GX (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/mapper/3600b342bf60c8a1d404bdd6c0d0000d9 directory=/mnt/GX fstype=xfs options=sync,uquota,gquota
   Operations: start interval=0s timeout=60 (nas_filesystem_GX-start-timeout-60)
               stop interval=0s timeout=60 (nas_filesystem_GX-stop-timeout-60)
               monitor interval=15s timeout=20s (nas_filesystem_GX-monitor-interval-15s)
  Resource: nas_rpcbind (class=systemd type=rpcbind)
   Operations: monitor interval=15s start-delay=10s timeout=20s (nas_rpcbind-monitor-interval-15s)
  Resource: nas_nfs (class=ocf provider=heartbeat type=nfsserver)
   Attributes: nfs_shared_infodir=/nasgui/metadata/nfsinfo nfsd_args=
   Operations: start interval=0s timeout=40 (nas_nfs-start-timeout-40)
               stop interval=0s timeout=20s (nas_nfs-stop-timeout-20s)
               monitor interval=15s start-delay=10s timeout=20s (nas_nfs-monitor-interval-15s)
  Resource: nas_samba (class=systemd type=smb)
   Operations: monitor interval=60s start-delay=10s (nas_samba-monitor-start-delay-10s)
               monitor interval=15s timeout=20s (nas_samba-monitor-interval-15s)
  Resource: nas_vip_10_24_171_90 (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=10.24.171.90 nic=bond1 cidr_netmask=24
   Operations: monitor interval=20s timeout=20s (nas_vip_10_24_171_90-monitor-20s)
 Group: nas_tomcat_service
  Resource: nas_syslun (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/mapper/sys_lun directory=/nasgui/share fstype=xfs options=sync
   Operations: start interval=0s timeout=60 (nas_syslun-start-timeout-60)
               stop interval=0s timeout=60 (nas_syslun-stop-timeout-60)
               monitor interval=15s timeout=15s (nas_syslun-monitor-interval-15s)
  Resource: nas_service_ip_192_168_3_100 (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=192.168.3.100 cidr_netmask=24
   Operations: start interval=0s timeout=20s (nas_service_ip_192_168_3_100-start-timeout-20s)
               stop interval=0s timeout=20s (nas_service_ip_192_168_3_100-stop-timeout-20s)
               monitor interval=20s timeout=20s (nas_service_ip_192_168_3_100-monitor-interval-20s)
  Resource: nas_tomcat (class=ocf provider=heartbeat type=tomcat)
   Attributes: java_home=/usr/local/jdk/jdk1.6.0_37 catalina_home=/usr/local/nas/apache-tomcat-6.0.35
   Operations: start interval=0s timeout=60s (nas_tomcat-start-timeout-60s)
               stop interval=0s timeout=120s (nas_tomcat-stop-timeout-120s)
               monitor interval=10s start-delay=5s timeout=30s (nas_tomcat-monitor-interval-10s)
  Resource: nas_ip (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=10.24.171.91 cidr_netmask=24
   Operations: start interval=0s timeout=20s (nas_ip-start-timeout-20s)
               stop interval=0s timeout=20s (nas_ip-stop-timeout-20s)
               monitor interval=20s timeout=20s (nas_ip-monitor-interval-20s)

Stonith Devices:
 Resource: fence_ipmi1 (class=stonith type=fence_ipmilan)
  Attributes: pcmk_host_list=node1 pcmk_host_check=static-list ipaddr=192.168.66.41 action=reboot login=root passwd=passwd delay=15 auth=password lanplus=true
  Operations: monitor interval=60s (fence_ipmi1-monitor-interval-60s)
 Resource: fence_ipmi0 (class=stonith type=fence_ipmilan)
  Attributes: pcmk_host_list=node0 pcmk_host_check=static-list ipaddr=192.168.66.40 action=reboot login=root passwd=passwd delay=15 auth=password lanplus=true
  Operations: monitor interval=60s (fence_ipmi0-monitor-interval-60s)
Fencing Levels:

Location Constraints:
  Resource: fence_ipmi0
    Enabled on: node1 (score:INFINITY) (id:location-fence_ipmi0-node1-INFINITY)
  Resource: fence_ipmi1
    Enabled on: node0 (score:INFINITY) (id:location-fence_ipmi1-node0-INFINITY)
  Resource: nas_tomcat_service
    Enabled on: node0 (score:INFINITY) (id:location-nas_tomcat_service-node0-INFINITY)
Ordering Constraints:
Colocation Constraints:

Cluster Properties:
 cluster-infrastructure: corosync
 dc-version: 1.1.10-29.el7-368c726
 no-quorum-policy: ignore
 stonith-enabled: false
------------------------------------------------------------

pacemaker edition: pacemaker-1.1.10-29.el7.x86_64

It happens repeatedly and confuses me. Is it the same problem as this one (Bug 1284069, https://bugzilla.redhat.com/show_bug.cgi?id=1284069)? Any comments would be helpful. Thanks.

Ken Gaillot:

Zhaoming,

It does look like lrmd is crashing on node0 after a failed monitor on the nas_samba resource, which is a systemd resource, so it is possible that the lrmd crash is related to this bug -- though it's not certain from just these logs.

However, that would be a symptom of a larger issue, as this bug happens when reporting an existing error condition.

I notice you have fence devices defined, but stonith-enabled is false. Enabling stonith would allow the cluster to recover safely from lost connections. Beware that using only onboard IPMI for fencing is vulnerable to a complete power loss.

The most important errors in your logs are:

Dec 7 20:45:56 nas2 crmd[33274]: warning: match_down_event: No match for shutdown action on 1
...
Dec 7 20:45:56 nas2 pengine[33273]: warning: pe_fence_node: Node node0 is unclean because it is partially and/or un-expectedly down

That means that corosync lost the connection between the nodes at that time. There could be a network issue, or one node was unresponsive for a length of time due to high load, or something like that. If fencing were enabled, nas2 would have fenced nas1.

You probably have more information in /var/log/corosync.log (or /var/log/cluster/corosync.log).

Zhaoming Zhang:

Ken, thanks for your reply.

(In reply to Ken Gaillot from comment #5)
> It does look like lrmd is crashing on node0 after a failed monitor on the
> nas_samba resource, which is a systemd resource, so it is possible that the
> lrmd crash is related to this bug -- though it's not certain from just these
> logs.

Anything else needed to confirm?

> However, that would be a symptom of a larger issue, as this bug happens when
> reporting an existing error condition.
>
> I notice you have fence devices defined, but stonith-enabled is false.
> Enabling stonith would allow the cluster to recover safely from lost
> connections. Beware that using only onboard IPMI for fencing is vulnerable
> to a complete power loss.

Thanks for the tips. I've found that using only onboard IPMI for fencing makes the fence action keep retrying, and the resources will not fail over when the master node loses power completely. That's the reason I configured stonith-enabled to false. Is there any other solution to the complete power loss problem?

> The most important errors in your logs are:
>
> Dec 7 20:45:56 nas2 crmd[33274]: warning: match_down_event: No match for
> shutdown action on 1
> ...
> Dec 7 20:45:56 nas2 pengine[33273]: warning: pe_fence_node: Node node0 is
> unclean because it is partially and/or un-expectedly down
>
> That means that corosync lost the connection between the nodes at that time.
> There could be a network issue, or one node was unresponsive for a length of
> time due to high load, or something like that. If fencing were enabled, nas2
> would have fenced nas1.
Yes, I've found the failure of the connection to the LRM in nas1's logs (shown in Comment 4, Zhaoming Zhang 2016-01-06 06:18:12 EST):

Dec 7 20:45:56 nas1 crmd[35821]: error: crm_ipc_read: Connection to lrmd failed
Dec 7 20:45:56 nas1 crmd[35821]: error: mainloop_gio_callback: Connection to lrmd[0x215b020] closed (I/O condition=17)
Dec 7 20:45:56 nas1 crmd[35821]: crit: lrm_connection_destroy: LRM Connection failed
...
Dec 7 20:45:56 nas1 crmd[35821]: notice: do_lrm_control: Disconnected from the LRM
Dec 7 20:45:56 nas1 crmd[35821]: notice: terminate_cs_connection: Disconnecting from Corosync
Dec 7 20:45:56 nas1 cib[16365]: warning: qb_ipcs_event_sendv: new_event_notification (16365-35821-9): Broken pipe (32)
Dec 7 20:45:56 nas1 cib[16365]: warning: do_local_notify: A-Sync reply to crmd failed: No message of desired type
Dec 7 20:45:56 nas1 crmd[35821]: error: crmd_fast_exit: Could not recover from internal error
Dec 7 20:45:56 nas1 pacemakerd[16364]: error: pcmk_child_exit: Child process crmd (35821) exited: Generic Pacemaker error (201)
Dec 7 20:45:56 nas1 pacemakerd[16364]: notice: pcmk_process_exit: Respawning failed child process: crmd
Dec 7 20:45:56 nas1 crmd[43515]: notice: main: CRM Git Version: 368c726
Dec 7 20:45:56 nas1 crmd[43515]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Dec 7 20:45:56 nas1 crmd[43515]: notice: cluster_connect_quorum: Quorum acquired

I'm not sure what the reason for the failure is, or whether it is related to this Bug 1284069.

> You probably have more information in /var/log/corosync.log (or
> /var/log/cluster/corosync.log).

Here is my /etc/corosync/corosync.conf, which explains why I have no corosync.log in /var/log or /var/log/cluster:
cat /etc/corosync/corosync.conf
totem {
    version: 2
    secauth: off
    cluster_name: my_cluster
    transport: udpu
}

nodelist {
    node {
        ring0_addr: node0
        nodeid: 1
    }
    node {
        ring0_addr: node1
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}

logging {
    to_syslog: yes
}

Shall I attach the /var/log/messages file or something else?

(In reply to Zhaoming Zhang from comment #6)
> Ken, thanks for your reply.
>
> (In reply to Ken Gaillot from comment #5)
> > Zhaoming,
> >
> > It does look like lrmd is crashing on node0 after a failed monitor on the
> > nas_samba resource, which is a systemd resource, so it is possible that the
> > lrmd crash is related to this bug -- though it's not certain from just these
> > logs.
> Anything else needed to confirm?

I don't see a "core dumped" message in your logs, so I think it's unlikely. If you haven't disabled cores and don't see one in /var/lib/pacemaker/cores when the problem occurs, it's probably not related.

> > I notice you have fence devices defined, but stonith-enabled is false.
> > Enabling stonith would allow the cluster to recover safely from lost
> > connections. Beware that using only onboard IPMI for fencing is vulnerable
> > to a complete power loss.
> Thanks for the tips. I've found that using only onboard IPMI for fencing
> makes the fence action keep trying and the resources will not failover when
> the Master node loses power completely. That's the reason I configured
> stonith-enabled false. Is there any other solution to the complete power
> loss problem?

The only solution is using a second level of fencing, typically an intelligent power switch, or potentially a shared storage cutoff (iSCSI reservations) and/or network cutoff (intelligent Ethernet switch). Pacemaker allows you to configure a fencing "topology" with multiple levels, for example, try IPMI first, and if that fails, try the power switch. Fencing is a bit of a pain to set up, but it's the only way to recover from certain error situations safely.
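The fencing topology Ken describes can be sketched with pcs roughly as follows. This is a configuration sketch, not commands from this bug: the stonith resource names (ipmi-node0, ipmi-node1, apc-switch) are hypothetical placeholders for whatever fence devices are actually defined in the cluster.

```shell
# Level 1: try the onboard IPMI device for each node first.
pcs stonith level add 1 node0 ipmi-node0
pcs stonith level add 1 node1 ipmi-node1

# Level 2: if IPMI fails (e.g. the node lost power completely),
# fall back to an intelligent power switch outlet.
pcs stonith level add 2 node0 apc-switch
pcs stonith level add 2 node1 apc-switch

# Check that every referenced device and node exists,
# then turn fencing back on.
pcs stonith level verify
pcs property set stonith-enabled=true
```

With this in place, a total power loss on node0 makes the IPMI attempt fail, and the cluster escalates to the power switch instead of retrying IPMI forever.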
> > The most important errors in your logs are:
> >
> > Dec 7 20:45:56 nas2 crmd[33274]: warning: match_down_event: No match for
> > shutdown action on 1
> > ...
> > Dec 7 20:45:56 nas2 pengine[33273]: warning: pe_fence_node: Node node0 is
> > unclean because it is partially and/or un-expectedly down
> >
> > That means that corosync lost the connection between the nodes at that time.
> > There could be a network issue, or one node was unresponsive for a length of
> > time due to high load, or something like that. If fencing were enabled, nas2
> > would have fenced nas1.
>
> Yes, I've found the failure of connection to LRM in nas1's logs (showed in
> Comment 4 Zhaoming Zhang 2016-01-06 06:18:12 EST):
> Dec 7 20:45:56 nas1 crmd[35821]: error: crm_ipc_read: Connection to lrmd
> failed

The "node0 is unclean" message means that the connection issue was not limited to lrmd, but affected the entire node. The usual culprit is either a network issue, or high load on a node that makes it unresponsive for an extended time. You can usually tell from the system logs on that node (either a bunch of services report network-related trouble, or the kernel reports scheduling problems).

> > You probably have more information in /var/log/corosync.log (or
> > /var/log/cluster/corosync.log).
>
> I'm sorry to show the /etc/corosync/corosync.conf which explains I got
> nothing about corosync.log in /var/log or /var/log/cluster.

Sometimes packages will ship with corosync.log enabled somewhere, but most likely the system log (/var/log/messages or whatever you've got) will have corosync messages. If you grep for corosync and look around the time of the error, you might see something useful.

This bug was accidentally moved from POST to MODIFIED via an error in automation; please see mmccune with any questions.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2578.html