| Field | Value |
|---|---|
| Summary | Monitor unknown error of systemd resources (samba, rpcbind), lrmd connection failure, and cluster failover |
| Product | Red Hat Enterprise Linux 7 |
| Component | pacemaker |
| Version | 7.0 |
| Status | CLOSED CURRENTRELEASE |
| Severity | unspecified |
| Priority | unspecified |
| Reporter | Zhaoming Zhang <252150241> |
| Assignee | Ken Gaillot <kgaillot> |
| QA Contact | cluster-qe <cluster-qe> |
| CC | 252150241, abeekhof, cluster-maint |
| Target Milestone | rc |
| Target Release | 7.2 |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | Bug Fix |
| Type | Bug |
| Regression | --- |
| Last Closed | 2016-05-16 17:36:25 UTC |
Description
Zhaoming Zhang
2016-01-08 10:15:33 UTC
As discussed on #1284069, the "node0 is unclean" log message indicates that the two nodes lost communication with each other, possibly due to a network issue or load issue. Normally, one node would fence the other in this situation, but fencing is disabled here. Looking at the messages in the system log around the time of the problem might give further clues as to what caused it. I'd also recommend upgrading to RHEL 7.2 if possible, to get more recent versions of pacemaker and its dependencies.

Thanks for your reply, Ken.

(In reply to Ken Gaillot from comment #2)
> As discussed on #1284069, the "node0 is unclean" log message indicates that
> the two nodes lost communication with each other, possibly due to a network
> issue or load issue. Normally, one node would fence the other in this
> situation, but fencing is disabled here.

We have tested with fencing enabled, and the cluster does fail over. But the power loss is another problem. For now we have added a SCSI-3 reservation to the Filesystem resource to work around the power loss problem, and this failover bug leads to reservation conflicts.

I'm quite sure that the two nodes losing communication with each other is not related to a network or load issue: the LOM used for the cluster heartbeat is dedicated to that purpose, and we still hit the problem when there is no load.
And the log just says:

Dec 7 20:45:56 nas1 crmd[35821]: notice: process_lrm_event: LRM operation nas_samba_monitor_15000 (call=89, rc=1, cib-update=49, confirmed=false) unknown error
Dec 7 20:45:56 nas1 attrd[16368]: notice: attrd_cs_dispatch: Update relayed from node1
Dec 7 20:45:56 nas1 attrd[16368]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-nas_samba (9)
Dec 7 20:45:56 nas1 attrd[16368]: notice: attrd_perform_update: Sent update 85: fail-count-nas_samba=9
Dec 7 20:45:56 nas1 attrd[16368]: notice: attrd_cs_dispatch: Update relayed from node1
Dec 7 20:45:56 nas1 attrd[16368]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-nas_samba (1449492356)
Dec 7 20:45:56 nas1 attrd[16368]: notice: attrd_perform_update: Sent update 88: last-failure-nas_samba=1449492356
Dec 7 20:45:56 nas1 crmd[35821]: error: crm_ipc_read: Connection to lrmd failed
Dec 7 20:45:56 nas1 crmd[35821]: error: mainloop_gio_callback: Connection to lrmd[0x215b020] closed (I/O condition=17)
Dec 7 20:45:56 nas1 crmd[35821]: crit: lrm_connection_destroy: LRM Connection failed

> Looking at the messages in the system log around the time of the problem
> might give further clues as to what caused the problem.
>
> I'd also recommend upgrading to RHEL 7.2 if possible, to get more recent
> versions of pacemaker and its dependencies.

We are planning to upgrade to RHEL 7.2 soon to verify the bug, and will attach the messages to this bug record.
Created attachment 1113470 [details]
system messages of the cluster nodes

Dec 7 20:45:56 nas2 crmd[33274]: warning: match_down_event: No match for shutdown action on 1
Dec 7 20:45:56 nas2 crmd[33274]: notice: peer_update_callback: Stonith/shutdown of node0 not matched
Dec 7 20:45:56 nas2 pengine[33273]: notice: unpack_config: On loss of CCM Quorum: Ignore
Dec 7 20:45:56 nas2 pengine[33273]: warning: unpack_rsc_op: Processing failed op monitor for nas_samba on node0: unknown error (1)
Dec 7 20:45:56 nas2 pengine[33273]: notice: process_pe_message: Calculated Transition 810: /var/lib/pacemaker/pengine/pe-input-118.bz2
Dec 7 20:45:56 nas2 pengine[33273]: notice: unpack_config: On loss of CCM Quorum: Ignore
Dec 7 20:45:56 nas2 pengine[33273]: warning: pe_fence_node: Node node0 is unclean because it is partially and/or un-expectedly down
Dec 7 20:45:56 nas2 pengine[33273]: warning: determine_online_status: Node node0 is unclean
Dec 7 20:45:56 nas2 pengine[33273]: warning: stage6: Node node0 is unclean!
Dec 7 20:45:56 nas2 pengine[33273]: warning: stage6: YOUR RESOURCES ARE NOW LIKELY COMPROMISED
Dec 7 20:45:56 nas2 pengine[33273]: error: stage6: ENABLE STONITH TO KEEP YOUR RESOURCES SAFE

Created attachment 1113471 [details]
system message of the cluster node nas1
Dec 7 20:28:01 nas1 systemd: Starting Session 3015 of user pcp.
Dec 7 20:28:01 nas1 systemd: Started Session 3015 of user pcp.
Dec 7 20:30:01 nas1 systemd: Starting Session 3016 of user root.
Dec 7 20:30:01 nas1 systemd: Started Session 3016 of user root.
Dec 7 20:40:01 nas1 systemd: Starting Session 3017 of user root.
Dec 7 20:40:01 nas1 systemd: Started Session 3017 of user root.
Dec 7 20:45:06 nas1 systemd: Reloading.
Dec 7 20:45:56 nas1 crmd[35821]: notice: process_lrm_event: LRM operation nas_samba_monitor_15000 (call=89, rc=1, cib-update=49, confirmed=false) unknown error
Dec 7 20:45:56 nas1 attrd[16368]: notice: attrd_cs_dispatch: Update relayed from node1
Dec 7 20:45:56 nas1 attrd[16368]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-nas_samba (9)
Dec 7 20:45:56 nas1 attrd[16368]: notice: attrd_perform_update: Sent update 85: fail-count-nas_samba=9
Dec 7 20:45:56 nas1 attrd[16368]: notice: attrd_cs_dispatch: Update relayed from node1
Dec 7 20:45:56 nas1 attrd[16368]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-nas_samba (1449492356)
Dec 7 20:45:56 nas1 attrd[16368]: notice: attrd_perform_update: Sent update 88: last-failure-nas_samba=1449492356
Dec 7 20:45:56 nas1 crmd[35821]: error: crm_ipc_read: Connection to lrmd failed
Dec 7 20:45:56 nas1 crmd[35821]: error: mainloop_gio_callback: Connection to lrmd[0x215b020] closed (I/O condition=17)
Dec 7 20:45:56 nas1 crmd[35821]: crit: lrm_connection_destroy: LRM Connection failed
Dec 7 20:45:56 nas1 crmd[35821]: error: do_log: FSA: Input I_ERROR from lrm_connection_destroy() received in state S_NOT_DC
Dec 7 20:45:56 nas1 crmd[35821]: notice: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=lrm_connection_destroy ]
Dec 7 20:45:56 nas1 crmd[35821]: warning: do_recover: Fast-tracking shutdown in response to errors
Dec 7 20:45:56 nas1 crmd[35821]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Dec 7 20:45:56 nas1 pacemakerd[16364]: error: pcmk_child_exit: Child process lrmd (35818) exited: Operation not permitted (1)
Dec 7 20:45:56 nas1 pacemakerd[16364]: notice: pcmk_process_exit: Respawning failed child process: lrmd
Dec 7 20:45:56 nas1 crmd[35821]: notice: lrm_state_verify_stopped: Recurring action fence_ipmi1:97 (fence_ipmi1_monitor_60000) incomplete at shutdown
Dec 7 20:45:56 nas1 crmd[35821]: error: lrm_state_verify_stopped: 1 resources were active at shutdown.
Dec 7 20:45:56 nas1 crmd[35821]: notice: do_lrm_control: Disconnected from the LRM
Dec 7 20:45:56 nas1 crmd[35821]: notice: terminate_cs_connection: Disconnecting from Corosync
Dec 7 20:45:56 nas1 cib[16365]: warning: qb_ipcs_event_sendv: new_event_notification (16365-35821-9): Broken pipe (32)
Dec 7 20:45:56 nas1 cib[16365]: warning: do_local_notify: A-Sync reply to crmd failed: No message of desired type
Dec 7 20:45:56 nas1 crmd[35821]: error: crmd_fast_exit: Could not recover from internal error
Dec 7 20:45:56 nas1 pacemakerd[16364]: error: pcmk_child_exit: Child process crmd (35821) exited: Generic Pacemaker error (201)
Dec 7 20:45:56 nas1 pacemakerd[16364]: notice: pcmk_process_exit: Respawning failed child process: crmd
Dec 7 20:45:56 nas1 crmd[43515]: notice: main: CRM Git Version: 368c726
Dec 7 20:45:56 nas1 crmd[43515]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Dec 7 20:45:56 nas1 crmd[43515]: notice: cluster_connect_quorum: Quorum acquired
Dec 7 20:45:56 nas1 crmd[43515]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node0[1] - state is now member (was (null))
Dec 7 20:45:56 nas1 crmd[43515]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node1[2] - state is now member (was (null))
Dec 7 20:45:56 nas1 crmd[43515]: notice: do_started: The local CRM is operational
(In reply to Ken Gaillot from comment #2)
> As discussed on #1284069, the "node0 is unclean" log message indicates that
> the two nodes lost communication with each other, possibly due to a network
> issue or load issue. Normally, one node would fence the other in this
> situation, but fencing is disabled here.
>
> Looking at the messages in the system log around the time of the problem
> might give further clues as to what caused the problem.
>
> I'd also recommend upgrading to RHEL 7.2 if possible, to get more recent
> versions of pacemaker and its dependencies.

One more thing: we did not find the error in the pacemaker-1.1.10-29.el7.x86_64 source code as described on #1284069:

"Additional info: This is the problem and fix:

- crm_err("Could not issue %s for %s: %s (%s)", op->action, op->rsc, error.message);
+ crm_err("Could not issue %s for %s: %s", op->action, op->rsc, error.message);

The first line has an extra %s format specifier, which causes it to try to read an argument that was never passed, thus accessing an unknown memory address."

Thanks again.

(In reply to Zhaoming Zhang from comment #3)
> We've tested with fencing enabled, and it does fail over. But the power loss
> is another problem. We added a SCSI-3 reservation to the Filesystem resource
> to work around the power loss problem for the moment. And this failover bug
> would lead to reservation conflicts.

That sounds like a good solution. So each node would have a fencing topology with level 1 = fence_ipmilan and level 2 = fence_iscsi.

> And I'm quite sure that the two nodes losing communication with each other
> is not related to a network or load issue.
> The LOM for the cluster heartbeat is for exclusive use.
> When we have no load, we still meet the problem. And the log just says:

That is strange, especially with no corosync messages around the time of the problem. I'm looking forward to hearing whether this occurs under RHEL 7.2.
I noticed "Dec 7 20:45:56 nas1 pacemakerd[16364]: error: pcmk_child_exit: Child process lrmd (35818) exited: Operation not permitted (1)" and "Dec 7 20:45:56 nas1 crmd[35821]: notice: terminate_cs_connection: Disconnecting from Corosync". Is there something wrong with the lrmd when using systemd resources? Tests for this problem are in progress under RHEL 7.2; I hope we can get the result soon. Thank you!

In this case, the lrmd is exiting for some unknown reason. It is exiting with exit code 1, which happens to match the system error "Operation not permitted", which is why that is displayed in the logs, but I don't think that is related to why it exited. The crmd disconnects from corosync after it loses its connection to the lrmd. I initially thought that the "unclean" message indicated a wider problem, but it's possible that crmd's exit led to it. So lrmd might indeed be the root of the problem, but the logs don't give much information why. Hopefully, the RHEL 7.2 tests will give us more information (or eliminate the problem entirely).

(In reply to Ken Gaillot from comment #9)
> In this case, the lrmd is exiting for some unknown reason. It is exiting
> with exit code 1, which happens to match the system error "Operation not
> permitted", which is why that is displayed in the logs, but I don't think
> that is related to why it exited.
>
> The crmd disconnects from corosync after it loses its connection to lrmd.
>
> I initially thought that the "unclean" message indicated a wider problem,
> but it's possible that crmd's exit led to that. So lrmd might indeed be the
> root of the problem, but the logs don't give much information why.
>
> Hopefully, the RHEL 7.2 tests will give us more information (or eliminate
> the problem entirely).

I'm afraid that "Operation not permitted" is related to why the lrmd exits: every time the lrmd failed, "Operation not permitted" was there. Here are the logs:

first time:

Dec 16 03:58:01 nas2 systemd: Starting Session 5177 of user pcp.
Dec 16 03:58:01 nas2 systemd: Started Session 5177 of user pcp.
Dec 16 04:00:01 nas2 systemd: Starting Session 5178 of user root.
Dec 16 04:00:01 nas2 systemd: Started Session 5178 of user root.
Dec 16 04:00:26 nas2 systemd: Reloading.
Dec 16 04:01:01 nas2 systemd: Starting Session 5179 of user root.
Dec 16 04:01:01 nas2 systemd: Started Session 5179 of user root.
Dec 16 04:01:16 nas2 crmd[33274]: notice: process_lrm_event: LRM operation nas_rpcbind_monitor_15000 (call=119, rc=1, cib-update=1853, confirmed=false) unknown error
Dec 16 04:01:16 nas2 crmd[33274]: warning: update_failcount: Updating failcount for nas_rpcbind on node1 after failed monitor: rc=1 (update=value++, time=1450209676)
Dec 16 04:01:16 nas2 crmd[33274]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Dec 16 04:01:16 nas2 attrd[33272]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-nas_rpcbind (1)
Dec 16 04:01:16 nas2 attrd[33272]: notice: attrd_perform_update: Sent update 108: fail-count-nas_rpcbind=1
Dec 16 04:01:16 nas2 attrd[33272]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-nas_rpcbind (1450209676)
Dec 16 04:01:16 nas2 attrd[33272]: notice: attrd_perform_update: Sent update 111: last-failure-nas_rpcbind=1450209676
Dec 16 04:01:16 nas2 crmd[33274]: error: crm_ipc_read: Connection to lrmd failed
Dec 16 04:01:16 nas2 crmd[33274]: error: mainloop_gio_callback: Connection to lrmd[0x1fdb770] closed (I/O condition=17)
Dec 16 04:01:16 nas2 pacemakerd[33257]: error: pcmk_child_exit: Child process lrmd (33271) exited: Operation not permitted (1)
Dec 16 04:01:16 nas2 pacemakerd[33257]: notice: pcmk_process_exit: Respawning failed child process: lrmd
Dec 16 04:01:16 nas2 crmd[33274]: crit: lrm_connection_destroy: LRM Connection failed
Dec 16 04:01:16 nas2 crmd[33274]: error: do_log: FSA: Input I_ERROR from lrm_connection_destroy() received in state S_POLICY_ENGINE
Dec 16 04:01:16 nas2 crmd[33274]: warning: do_state_transition: State transition S_POLICY_ENGINE -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=lrm_connection_destroy ]
Dec 16 04:01:16 nas2 crmd[33274]: warning: do_recover: Fast-tracking shutdown in response to errors
Dec 16 04:01:16 nas2 crmd[33274]: warning: do_election_vote: Not voting in election, we're in state S_RECOVERY
Dec 16 04:01:16 nas2 pengine[33273]: notice: unpack_config: On loss of CCM Quorum: Ignore
Dec 16 04:01:16 nas2 crmd[33274]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
......

second time:

Dec 18 05:25:01 nas2 systemd: Started Session 5731 of user pcp.
Dec 18 05:27:17 nas2 systemd: Reloading.
Dec 18 05:28:01 nas2 systemd: Created slice user-995.slice.
Dec 18 05:28:01 nas2 systemd: Starting Session 5732 of user pcp.
Dec 18 05:28:01 nas2 systemd: Started Session 5732 of user pcp.
Dec 18 05:28:07 nas2 crmd[6632]: notice: process_lrm_event: LRM operation nas_samba_monitor_15000 (call=89, rc=1, cib-update=45, confirmed=false) unknown error
Dec 18 05:28:07 nas2 attrd[33272]: notice: attrd_cs_dispatch: Update relayed from node0
Dec 18 05:28:07 nas2 attrd[33272]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-nas_samba (3)
Dec 18 05:28:07 nas2 crmd[6632]: error: crm_ipc_read: Connection to lrmd failed
Dec 18 05:28:07 nas2 crmd[6632]: error: mainloop_gio_callback: Connection to lrmd[0x2024020] closed (I/O condition=17)
Dec 18 05:28:07 nas2 crmd[6632]: crit: lrm_connection_destroy: LRM Connection failed
Dec 18 05:28:07 nas2 crmd[6632]: error: do_log: FSA: Input I_ERROR from lrm_connection_destroy() received in state S_NOT_DC
Dec 18 05:28:07 nas2 crmd[6632]: notice: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=lrm_connection_destroy ]
Dec 18 05:28:07 nas2 pacemakerd[33257]: error: pcmk_child_exit: Child process lrmd (6631) exited: Operation not permitted (1)
Dec 18 05:28:07 nas2 crmd[6632]: warning: do_recover: Fast-tracking shutdown in response to errors
Dec 18 05:28:07 nas2 crmd[6632]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Dec 18 05:28:07 nas2 pacemakerd[33257]: notice: pcmk_process_exit: Respawning failed child process: lrmd
Dec 18 05:28:07 nas2 crmd[6632]: notice: lrm_state_verify_stopped: Recurring action fence_ipmi0:93 (fence_ipmi0_monitor_60000) incomplete at shutdown
Dec 18 05:28:07 nas2 crmd[6632]: error: lrm_state_verify_stopped: 1 resources were active at shutdown.
Dec 18 05:28:07 nas2 crmd[6632]: notice: do_lrm_control: Disconnected from the LRM
Dec 18 05:28:07 nas2 crmd[6632]: notice: terminate_cs_connection: Disconnecting from Corosync
Dec 18 05:28:07 nas2 attrd[33272]: notice: attrd_perform_update: Sent update 138: fail-count-nas_samba=3
Dec 18 05:28:07 nas2 attrd[33272]: notice: attrd_cs_dispatch: Update relayed from node0
Dec 18 05:28:07 nas2 attrd[33272]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-nas_samba (1450387688)
Dec 18 05:28:07 nas2 cib[33269]: warning: qb_ipcs_event_sendv: new_event_notification (33269-6632-9): Broken pipe (32)
Dec 18 05:28:07 nas2 crmd[6632]: error: crmd_fast_exit: Could not recover from internal error
Dec 18 05:28:07 nas2 cib[33269]: warning: do_local_notify: A-Sync reply to crmd failed: No message of desired type
Dec 18 05:28:07 nas2 attrd[33272]: notice: attrd_perform_update: Sent update 140: last-failure-nas_samba=1450387688
Dec 18 05:28:07 nas2 pacemakerd[33257]: error: pcmk_child_exit: Child process crmd (6632) exited: Generic Pacemaker error (201)
Dec 18 05:28:07 nas2 pacemakerd[33257]: notice: pcmk_process_exit: Respawning failed child process: crmd
Dec 18 05:28:07 nas2 crmd[1500]: notice: main: CRM Git Version: 368c726
Dec 18 05:28:07 nas2 crmd[1500]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Dec 18 05:28:07 nas2 crmd[1500]: notice: cluster_connect_quorum: Quorum acquired
Dec 18 05:28:07 nas2 crmd[1500]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node0[1] - state is now member (was (null))
Dec 18 05:28:07 nas2 crmd[1500]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node1[2] - state is now member (was (null))
Dec 18 05:28:07 nas2 crmd[1500]: notice: do_started: The local CRM is operational
......

third time:

Dec 20 06:50:01 nas2 systemd: Starting Session 6281 of user root.
Dec 20 06:50:01 nas2 systemd: Started Session 6281 of user root.
Dec 20 06:54:09 nas2 systemd: Reloading.
Dec 20 06:54:59 nas2 crmd[1500]: notice: process_lrm_event: LRM operation nas_samba_monitor_15000 (call=89, rc=1, cib-update=45, confirmed=false) unknown error
Dec 20 06:54:59 nas2 crmd[1500]: error: crm_ipc_read: Connection to lrmd failed
Dec 20 06:54:59 nas2 crmd[1500]: error: mainloop_gio_callback: Connection to lrmd[0xee3020] closed (I/O condition=1)
Dec 20 06:54:59 nas2 crmd[1500]: crit: lrm_connection_destroy: LRM Connection failed
Dec 20 06:54:59 nas2 crmd[1500]: error: do_log: FSA: Input I_ERROR from lrm_connection_destroy() received in state S_NOT_DC
Dec 20 06:54:59 nas2 crmd[1500]: notice: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=lrm_connection_destroy ]
Dec 20 06:54:59 nas2 crmd[1500]: warning: do_recover: Fast-tracking shutdown in response to errors
Dec 20 06:54:59 nas2 crmd[1500]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Dec 20 06:54:59 nas2 crmd[1500]: notice: lrm_state_verify_stopped: Recurring action fence_ipmi0:93 (fence_ipmi0_monitor_60000) incomplete at shutdown
Dec 20 06:54:59 nas2 crmd[1500]: error: lrm_state_verify_stopped: 1 resources were active at shutdown.
Dec 20 06:54:59 nas2 crmd[1500]: notice: do_lrm_control: Disconnected from the LRM
Dec 20 06:54:59 nas2 crmd[1500]: notice: terminate_cs_connection: Disconnecting from Corosync
Dec 20 06:54:59 nas2 pacemakerd[33257]: error: pcmk_child_exit: Child process lrmd (1499) exited: Operation not permitted (1)
Dec 20 06:54:59 nas2 pacemakerd[33257]: notice: pcmk_process_exit: Respawning failed child process: lrmd
Dec 20 06:54:59 nas2 attrd[33272]: notice: attrd_cs_dispatch: Update relayed from node0
Dec 20 06:54:59 nas2 attrd[33272]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-nas_samba (4)
Dec 20 06:54:59 nas2 attrd[33272]: notice: attrd_perform_update: Sent update 171: fail-count-nas_samba=4
Dec 20 06:54:59 nas2 attrd[33272]: notice: attrd_cs_dispatch: Update relayed from node0
Dec 20 06:54:59 nas2 attrd[33272]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-nas_samba (1450565700)
Dec 20 06:54:59 nas2 crmd[1500]: error: crmd_fast_exit: Could not recover from internal error
Dec 20 06:54:59 nas2 attrd[33272]: notice: attrd_perform_update: Sent update 173: last-failure-nas_samba=1450565700
Dec 20 06:54:59 nas2 pacemakerd[33257]: error: pcmk_child_exit: Child process crmd (1500) exited: Generic Pacemaker error (201)
Dec 20 06:54:59 nas2 pacemakerd[33257]: notice: pcmk_process_exit: Respawning failed child process: crmd
Dec 20 06:54:59 nas2 crmd[36457]: notice: main: CRM Git Version: 368c726
Dec 20 06:54:59 nas2 crmd[36457]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Dec 20 06:54:59 nas2 crmd[36457]: notice: cluster_connect_quorum: Quorum acquired
......

fourth time:

Dec 22 08:20:01 nas2 systemd: Starting Session 6832 of user root.
Dec 22 08:20:01 nas2 systemd: Started Session 6832 of user root.
Dec 22 08:21:59 nas2 systemd: Reloading.
Dec 22 08:22:49 nas2 crmd[36457]: notice: process_lrm_event: LRM operation nas_samba_monitor_15000 (call=116, rc=1, cib-update=71, confirmed=false) unknown error
Dec 22 08:22:49 nas2 attrd[33272]: notice: attrd_cs_dispatch: Update relayed from node0
Dec 22 08:22:49 nas2 attrd[33272]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-nas_samba (7)
Dec 22 08:22:49 nas2 crmd[36457]: error: crm_ipc_read: Connection to lrmd failed
Dec 22 08:22:49 nas2 crmd[36457]: error: mainloop_gio_callback: Connection to lrmd[0xd6e020] closed (I/O condition=17)
Dec 22 08:22:49 nas2 pacemakerd[33257]: error: pcmk_child_exit: Child process lrmd (36456) exited: Operation not permitted (1)
Dec 22 08:22:49 nas2 crmd[36457]: crit: lrm_connection_destroy: LRM Connection failed
Dec 22 08:22:49 nas2 crmd[36457]: error: do_log: FSA: Input I_ERROR from lrm_connection_destroy() received in state S_NOT_DC
Dec 22 08:22:49 nas2 crmd[36457]: notice: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=lrm_connection_destroy ]
Dec 22 08:22:49 nas2 crmd[36457]: warning: do_recover: Fast-tracking shutdown in response to errors
Dec 22 08:22:49 nas2 pacemakerd[33257]: notice: pcmk_process_exit: Respawning failed child process: lrmd
Dec 22 08:22:49 nas2 crmd[36457]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Dec 22 08:22:49 nas2 attrd[33272]: notice: attrd_perform_update: Sent update 213: fail-count-nas_samba=7
Dec 22 08:22:49 nas2 attrd[33272]: notice: attrd_cs_dispatch: Update relayed from node0
Dec 22 08:22:49 nas2 attrd[33272]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-nas_samba (1450743771)
Dec 22 08:22:49 nas2 attrd[33272]: notice: attrd_perform_update: Sent update 215: last-failure-nas_samba=1450743771
Dec 22 08:22:49 nas2 crmd[36457]: notice: lrm_state_verify_stopped: Recurring action fence_ipmi0:81 (fence_ipmi0_monitor_60000) incomplete at shutdown
Dec 22 08:22:49 nas2 crmd[36457]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_test:96 (nas_filesystem_test_monitor_15000) incomplete at shutdown
Dec 22 08:22:49 nas2 crmd[36457]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_Gsix:102 (nas_filesystem_Gsix_monitor_15000) incomplete at shutdown
Dec 22 08:22:49 nas2 crmd[36457]: notice: lrm_state_verify_stopped: Recurring action nas_samba:116 (nas_samba_monitor_15000) incomplete at shutdown
Dec 22 08:22:49 nas2 crmd[36457]: notice: lrm_state_verify_stopped: Recurring action nas_samba:117 (nas_samba_monitor_60000) incomplete at shutdown
Dec 22 08:22:49 nas2 crmd[36457]: notice: lrm_state_verify_stopped: Recurring action nas_rpcbind:112 (nas_rpcbind_monitor_15000) incomplete at shutdown
Dec 22 08:22:49 nas2 crmd[36457]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_kyfz:106 (nas_filesystem_kyfz_monitor_15000) incomplete at shutdown
Dec 22 08:22:49 nas2 crmd[36457]: notice: lrm_state_verify_stopped: Recurring action nas_vip_10_24_171_90:119 (nas_vip_10_24_171_90_monitor_20000) incomplete at shutdown
Dec 22 08:22:49 nas2 crmd[36457]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_User2:108 (nas_filesystem_User2_monitor_15000) incomplete at shutdown
Dec 22 08:22:49 nas2 crmd[36457]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_GX:110 (nas_filesystem_GX_monitor_15000) incomplete at shutdown
Dec 22 08:22:49 nas2 crmd[36457]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_gsics:100 (nas_filesystem_gsics_monitor_15000) incomplete at shutdown
Dec 22 08:22:49 nas2 crmd[36457]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_Users:98 (nas_filesystem_Users_monitor_15000) incomplete at shutdown
Dec 22 08:22:49 nas2 crmd[36457]: notice: lrm_state_verify_stopped: Recurring action nas_nfs:114 (nas_nfs_monitor_15000) incomplete at shutdown
Dec 22 08:22:49 nas2 crmd[36457]: notice: lrm_state_verify_stopped: Recurring action nas_shared_metadata:94 (nas_shared_metadata_monitor_15000) incomplete at shutdown
Dec 22 08:22:49 nas2 crmd[36457]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_MAS:104 (nas_filesystem_MAS_monitor_15000) incomplete at shutdown
Dec 22 08:22:49 nas2 crmd[36457]: error: lrm_state_verify_stopped: 14 resources were active at shutdown.
Dec 22 08:22:49 nas2 crmd[36457]: notice: do_lrm_control: Disconnected from the LRM
Dec 22 08:22:49 nas2 crmd[36457]: notice: terminate_cs_connection: Disconnecting from Corosync
Dec 22 08:22:49 nas2 cib[33269]: warning: qb_ipcs_event_sendv: new_event_notification (33269-36457-9): Broken pipe (32)
Dec 22 08:22:49 nas2 cib[33269]: warning: do_local_notify: A-Sync reply to crmd failed: No message of desired type
Dec 22 08:22:49 nas2 crmd[36457]: error: crmd_fast_exit: Could not recover from internal error
......

5th time:

Dec 24 09:40:01 nas2 systemd: Starting Session 7381 of user root.
Dec 24 09:40:01 nas2 systemd: Started Session 7381 of user root.
Dec 24 09:48:51 nas2 systemd: Reloading.
Dec 24 09:49:41 nas2 crmd[14257]: notice: process_lrm_event: LRM operation nas_samba_monitor_15000 (call=89, rc=1, cib-update=45, confirmed=false) unknown error
Dec 24 09:49:41 nas2 attrd[33272]: notice: attrd_cs_dispatch: Update relayed from node0
Dec 24 09:49:41 nas2 attrd[33272]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-nas_samba (8)
Dec 24 09:49:41 nas2 crmd[14257]: error: crm_ipc_read: Connection to lrmd failed
Dec 24 09:49:41 nas2 crmd[14257]: error: mainloop_gio_callback: Connection to lrmd[0x14e8020] closed (I/O condition=17)
Dec 24 09:49:41 nas2 pacemakerd[33257]: error: pcmk_child_exit: Child process lrmd (14256) exited: Operation not permitted (1)
Dec 24 09:49:41 nas2 pacemakerd[33257]: notice: pcmk_process_exit: Respawning failed child process: lrmd
Dec 24 09:49:41 nas2 crmd[14257]: crit: lrm_connection_destroy: LRM Connection failed
Dec 24 09:49:41 nas2 crmd[14257]: error: do_log: FSA: Input I_ERROR from lrm_connection_destroy() received in state S_NOT_DC
Dec 24 09:49:41 nas2 crmd[14257]: notice: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=lrm_connection_destroy ]
Dec 24 09:49:41 nas2 crmd[14257]: warning: do_recover: Fast-tracking shutdown in response to errors
Dec 24 09:49:41 nas2 crmd[14257]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Dec 24 09:49:41 nas2 attrd[33272]: notice: attrd_perform_update: Sent update 248: fail-count-nas_samba=8
Dec 24 09:49:41 nas2 attrd[33272]: notice: attrd_cs_dispatch: Update relayed from node0
Dec 24 09:49:41 nas2 attrd[33272]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-nas_samba (1450921782)
Dec 24 09:49:41 nas2 attrd[33272]: notice: attrd_perform_update: Sent update 250: last-failure-nas_samba=1450921782
Dec 24 09:49:41 nas2 crmd[14257]: notice: lrm_state_verify_stopped: Recurring action fence_ipmi0:93 (fence_ipmi0_monitor_60000) incomplete at shutdown
Dec 24 09:49:41 nas2 crmd[14257]: error: lrm_state_verify_stopped: 1 resources were active at shutdown.
Dec 24 09:49:41 nas2 crmd[14257]: notice: do_lrm_control: Disconnected from the LRM
Dec 24 09:49:41 nas2 crmd[14257]: notice: terminate_cs_connection: Disconnecting from Corosync
Dec 24 09:49:41 nas2 crmd[14257]: error: crmd_fast_exit: Could not recover from internal error
Dec 24 09:49:41 nas2 pacemakerd[33257]: error: pcmk_child_exit: Child process crmd (14257) exited: Generic Pacemaker error (201)
Dec 24 09:49:41 nas2 pacemakerd[33257]: notice: pcmk_process_exit: Respawning failed child process: crmd
Dec 24 09:49:41 nas2 crmd[47742]: notice: main: CRM Git Version: 368c726
Dec 24 09:49:41 nas2 crmd[47742]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Dec 24 09:49:41 nas2 crmd[47742]: notice: cluster_connect_quorum: Quorum acquired
Dec 24 09:49:41 nas2 crmd[47742]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node0[1] - state is now member (was (null))
Dec 24 09:49:41 nas2 crmd[47742]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node1[2] - state is now member (was (null))
Dec 24 09:49:41 nas2 crmd[47742]: notice: do_started: The local CRM is operational
Dec 24 09:49:41 nas2 crmd[47742]: notice: do_state_transition: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
Dec 24 09:49:43 nas2 crmd[47742]: notice: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
Dec 24 09:49:43 nas2 attrd[33272]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
......

6th time:

Dec 26 11:10:01 nas2 systemd: Started Session 7932 of user root.
Dec 26 11:15:42 nas2 systemd: Reloading.
Dec 26 11:16:32 nas2 crmd[47742]: notice: process_lrm_event: LRM operation nas_samba_monitor_15000 (call=89, rc=1, cib-update=45, confirmed=false) unknown error
Dec 26 11:16:32 nas2 crmd[47742]: error: crm_ipc_read: Connection to lrmd failed
Dec 26 11:16:32 nas2 crmd[47742]: error: mainloop_gio_callback: Connection to lrmd[0x2216020] closed (I/O condition=1)
Dec 26 11:16:32 nas2 crmd[47742]: crit: lrm_connection_destroy: LRM Connection failed
Dec 26 11:16:32 nas2 pacemakerd[33257]: error: pcmk_child_exit: Child process lrmd (47737) exited: Operation not permitted (1)
Dec 26 11:16:32 nas2 pacemakerd[33257]: notice: pcmk_process_exit: Respawning failed child process: lrmd
Dec 26 11:16:32 nas2 crmd[47742]: error: do_log: FSA: Input I_ERROR from lrm_connection_destroy() received in state S_NOT_DC
Dec 26 11:16:32 nas2 crmd[47742]: notice: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=lrm_connection_destroy ]
Dec 26 11:16:32 nas2 crmd[47742]: warning: do_recover: Fast-tracking shutdown in response to errors
Dec 26 11:16:32 nas2 crmd[47742]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Dec 26 11:16:32 nas2 attrd[33272]: notice: attrd_cs_dispatch: Update relayed from node0
Dec 26 11:16:32 nas2 attrd[33272]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-nas_samba (9)
Dec 26 11:16:32 nas2 crmd[47742]: notice: lrm_state_verify_stopped: Recurring action fence_ipmi0:93 (fence_ipmi0_monitor_60000) incomplete at shutdown
Dec 26 11:16:32 nas2 crmd[47742]: error: lrm_state_verify_stopped: 1 resources were active at shutdown.
Dec 26 11:16:32 nas2 crmd[47742]: notice: do_lrm_control: Disconnected from the LRM
Dec 26 11:16:32 nas2 crmd[47742]: notice: terminate_cs_connection: Disconnecting from Corosync
Dec 26 11:16:32 nas2 attrd[33272]: notice: attrd_perform_update: Sent update 279: fail-count-nas_samba=9
Dec 26 11:16:32 nas2 cib[33269]: warning: qb_ipcs_event_sendv: new_event_notification (33269-47742-9): Broken pipe (32)
Dec 26 11:16:32 nas2 cib[33269]: warning: do_local_notify: A-Sync reply to crmd failed: No message of desired type
Dec 26 11:16:32 nas2 attrd[33272]: notice: attrd_cs_dispatch: Update relayed from node0
Dec 26 11:16:32 nas2 attrd[33272]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-nas_samba (1451099794)
Dec 26 11:16:32 nas2 crmd[47742]: error: crmd_fast_exit: Could not recover from internal error
Dec 26 11:16:32 nas2 attrd[33272]: notice: attrd_perform_update: Sent update 281: last-failure-nas_samba=1451099794
Dec 26 11:16:32 nas2 pacemakerd[33257]: error: pcmk_child_exit: Child process crmd (47742) exited: Generic Pacemaker error (201)
Dec 26 11:16:32 nas2 pacemakerd[33257]: notice: pcmk_process_exit: Respawning failed child process: crmd
Dec 26 11:16:32 nas2 crmd[25757]: notice: main: CRM Git Version: 368c726
Dec 26 11:16:32 nas2 crmd[25757]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Dec 26 11:16:32 nas2 crmd[25757]: notice: cluster_connect_quorum: Quorum acquired
Dec 26 11:16:32 nas2 crmd[25757]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node0[1] - state is now member (was (null))
Dec 26 11:16:32 nas2 crmd[25757]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node1[2] - state is now member (was (null))
Dec 26 11:16:32 nas2 crmd[25757]: notice: do_started: The local CRM is operational
......
7th time:
Dec 28 12:40:01 nas2 systemd: Starting Session 8483 of user root.
Dec 28 12:40:01 nas2 systemd: Started Session 8483 of user root.
Dec 28 12:43:18 nas2 systemd: Reloading.
Dec 28 12:44:08 nas2 pacemakerd[33257]: error: pcmk_child_exit: Child process lrmd (25755) exited: Operation not permitted (1)
Dec 28 12:44:08 nas2 crmd[25757]: notice: process_lrm_event: LRM operation nas_samba_monitor_15000 (call=116, rc=1, cib-update=71, confirmed=false) unknown error
Dec 28 12:44:08 nas2 crmd[25757]: error: crm_ipc_read: Connection to lrmd failed
Dec 28 12:44:08 nas2 pacemakerd[33257]: notice: pcmk_process_exit: Respawning failed child process: lrmd
Dec 28 12:44:08 nas2 crmd[25757]: error: mainloop_gio_callback: Connection to lrmd[0xc98020] closed (I/O condition=1)
Dec 28 12:44:08 nas2 crmd[25757]: crit: lrm_connection_destroy: LRM Connection failed
Dec 28 12:44:08 nas2 crmd[25757]: error: do_log: FSA: Input I_ERROR from lrm_connection_destroy() received in state S_NOT_DC
Dec 28 12:44:08 nas2 crmd[25757]: notice: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=lrm_connection_destroy ]
Dec 28 12:44:08 nas2 crmd[25757]: warning: do_recover: Fast-tracking shutdown in response to errors
Dec 28 12:44:08 nas2 crmd[25757]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Dec 28 12:44:08 nas2 crmd[25757]: notice: lrm_state_verify_stopped: Recurring action fence_ipmi0:80 (fence_ipmi0_monitor_60000) incomplete at shutdown
Dec 28 12:44:08 nas2 crmd[25757]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_test:96 (nas_filesystem_test_monitor_15000) incomplete at shutdown
Dec 28 12:44:08 nas2 crmd[25757]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_Gsix:102 (nas_filesystem_Gsix_monitor_15000) incomplete at shutdown
Dec 28 12:44:08 nas2 crmd[25757]: notice: lrm_state_verify_stopped: Recurring action nas_samba:116 (nas_samba_monitor_15000) incomplete at shutdown
Dec 28 12:44:08 nas2 crmd[25757]: notice: lrm_state_verify_stopped: Recurring action nas_samba:117 (nas_samba_monitor_60000) incomplete at shutdown
Dec 28 12:44:08 nas2 crmd[25757]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_kyfz:106 (nas_filesystem_kyfz_monitor_15000) incomplete at shutdown
Dec 28 12:44:08 nas2 crmd[25757]: notice: lrm_state_verify_stopped: Recurring action nas_rpcbind:112 (nas_rpcbind_monitor_15000) incomplete at shutdown
Dec 28 12:44:08 nas2 crmd[25757]: notice: lrm_state_verify_stopped: Recurring action nas_vip_10_24_171_90:119 (nas_vip_10_24_171_90_monitor_20000) incomplete at shutdown
Dec 28 12:44:08 nas2 crmd[25757]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_User2:108 (nas_filesystem_User2_monitor_15000) incomplete at shutdown
Dec 28 12:44:08 nas2 crmd[25757]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_GX:110 (nas_filesystem_GX_monitor_15000) incomplete at shutdown
Dec 28 12:44:08 nas2 crmd[25757]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_Users:98 (nas_filesystem_Users_monitor_15000) incomplete at shutdown
Dec 28 12:44:08 nas2 crmd[25757]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_gsics:100 (nas_filesystem_gsics_monitor_15000) incomplete at shutdown
Dec 28 12:44:08 nas2 crmd[25757]: notice: lrm_state_verify_stopped: Recurring action nas_nfs:114 (nas_nfs_monitor_15000) incomplete at shutdown
Dec 28 12:44:08 nas2 crmd[25757]: notice: lrm_state_verify_stopped: Recurring action nas_shared_metadata:94 (nas_shared_metadata_monitor_15000) incomplete at shutdown
Dec 28 12:44:08 nas2 crmd[25757]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_MAS:104 (nas_filesystem_MAS_monitor_15000) incomplete at shutdown
Dec 28 12:44:08 nas2 crmd[25757]: error: lrm_state_verify_stopped: 14 resources were active at shutdown.
Dec 28 12:44:08 nas2 crmd[25757]: notice: do_lrm_control: Disconnected from the LRM
Dec 28 12:44:08 nas2 crmd[25757]: notice: terminate_cs_connection: Disconnecting from Corosync
Dec 28 12:44:08 nas2 attrd[33272]: notice: attrd_cs_dispatch: Update relayed from node0
Dec 28 12:44:08 nas2 attrd[33272]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-nas_samba (12)
Dec 28 12:44:08 nas2 attrd[33272]: notice: attrd_perform_update: Sent update 322: fail-count-nas_samba=12
Dec 28 12:44:08 nas2 attrd[33272]: notice: attrd_cs_dispatch: Update relayed from node0
Dec 28 12:44:08 nas2 attrd[33272]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-nas_samba (1451277850)
Dec 28 12:44:08 nas2 cib[33269]: warning: qb_ipcs_event_sendv: new_event_notification (33269-25757-9): Broken pipe (32)
Dec 28 12:44:08 nas2 cib[33269]: warning: do_local_notify: A-Sync reply to crmd failed: No message of desired type
Dec 28 12:44:08 nas2 crmd[25757]: error: crmd_fast_exit: Could not recover from internal error
Dec 28 12:44:08 nas2 attrd[33272]: notice: attrd_perform_update: Sent update 324: last-failure-nas_samba=1451277850
Dec 28 12:44:08 nas2 pacemakerd[33257]: error: pcmk_child_exit: Child process crmd (25757) exited: Generic Pacemaker error (201)
Dec 28 12:44:08 nas2 pacemakerd[33257]: notice: pcmk_process_exit: Respawning failed child process: crmd
Dec 28 12:44:08 nas2 crmd[42199]: notice: main: CRM Git Version: 368c726
Dec 28 12:44:08 nas2 crmd[42199]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Dec 28 12:44:08 nas2 crmd[42199]: notice: cluster_connect_quorum: Quorum acquired
Dec 28 12:44:08 nas2 crmd[42199]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node0[1] - state is now member (was (null))
Dec 28 12:44:08 nas2 crmd[42199]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node1[2] - state is now member (was (null))
Dec 28 12:44:08 nas2 crmd[42199]: notice: do_started: The local CRM is operational
......
8th time:
Dec 30 14:01:01 nas2 systemd: Started Session 9033 of user root.
Dec 30 14:10:01 nas2 systemd: Starting Session 9034 of user root.
Dec 30 14:10:01 nas2 systemd: Started Session 9034 of user root.
Dec 30 14:10:52 nas2 systemd: Reloading.
Dec 30 14:11:42 nas2 crmd[42199]: notice: process_lrm_event: LRM operation nas_samba_monitor_15000 (call=116, rc=1, cib-update=71, confirmed=false) unknown error
Dec 30 14:11:42 nas2 crmd[42199]: error: crm_ipc_read: Connection to lrmd failed
Dec 30 14:11:42 nas2 crmd[42199]: error: mainloop_gio_callback: Connection to lrmd[0x7ab020] closed (I/O condition=1)
Dec 30 14:11:42 nas2 crmd[42199]: crit: lrm_connection_destroy: LRM Connection failed
Dec 30 14:11:42 nas2 pacemakerd[33257]: error: pcmk_child_exit: Child process lrmd (42197) exited: Operation not permitted (1)
Dec 30 14:11:42 nas2 crmd[42199]: error: do_log: FSA: Input I_ERROR from lrm_connection_destroy() received in state S_NOT_DC
Dec 30 14:11:42 nas2 crmd[42199]: notice: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=lrm_connection_destroy ]
Dec 30 14:11:42 nas2 crmd[42199]: warning: do_recover: Fast-tracking shutdown in response to errors
Dec 30 14:11:42 nas2 crmd[42199]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Dec 30 14:11:42 nas2 crmd[42199]: notice: lrm_state_verify_stopped: Recurring action fence_ipmi0:80 (fence_ipmi0_monitor_60000) incomplete at shutdown
Dec 30 14:11:42 nas2 crmd[42199]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_test:96 (nas_filesystem_test_monitor_15000) incomplete at shutdown
Dec 30 14:11:42 nas2 crmd[42199]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_Gsix:102 (nas_filesystem_Gsix_monitor_15000) incomplete at shutdown
Dec 30 14:11:42 nas2 crmd[42199]: notice: lrm_state_verify_stopped: Recurring action nas_samba:116 (nas_samba_monitor_15000) incomplete at shutdown
Dec 30 14:11:42 nas2 crmd[42199]: notice: lrm_state_verify_stopped: Recurring action nas_samba:117 (nas_samba_monitor_60000) incomplete at shutdown
Dec 30 14:11:42 nas2 crmd[42199]: notice: lrm_state_verify_stopped: Recurring action nas_rpcbind:112 (nas_rpcbind_monitor_15000) incomplete at shutdown
Dec 30 14:11:42 nas2 crmd[42199]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_kyfz:106 (nas_filesystem_kyfz_monitor_15000) incomplete at shutdown
Dec 30 14:11:42 nas2 crmd[42199]: notice: lrm_state_verify_stopped: Recurring action nas_vip_10_24_171_90:119 (nas_vip_10_24_171_90_monitor_20000) incomplete at shutdown
Dec 30 14:11:42 nas2 crmd[42199]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_User2:108 (nas_filesystem_User2_monitor_15000) incomplete at shutdown
Dec 30 14:11:42 nas2 crmd[42199]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_GX:110 (nas_filesystem_GX_monitor_15000) incomplete at shutdown
Dec 30 14:11:42 nas2 crmd[42199]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_Users:98 (nas_filesystem_Users_monitor_15000) incomplete at shutdown
Dec 30 14:11:42 nas2 crmd[42199]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_gsics:100 (nas_filesystem_gsics_monitor_15000) incomplete at shutdown
Dec 30 14:11:42 nas2 crmd[42199]: notice: lrm_state_verify_stopped: Recurring action nas_nfs:114 (nas_nfs_monitor_15000) incomplete at shutdown
Dec 30 14:11:42 nas2 crmd[42199]: notice: lrm_state_verify_stopped: Recurring action nas_shared_metadata:94 (nas_shared_metadata_monitor_15000) incomplete at shutdown
Dec 30 14:11:42 nas2 crmd[42199]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_MAS:104 (nas_filesystem_MAS_monitor_15000) incomplete at shutdown
Dec 30 14:11:42 nas2 crmd[42199]: error: lrm_state_verify_stopped: 14 resources were active at shutdown.
Dec 30 14:11:42 nas2 crmd[42199]: notice: do_lrm_control: Disconnected from the LRM
Dec 30 14:11:42 nas2 crmd[42199]: notice: terminate_cs_connection: Disconnecting from Corosync
Dec 30 14:11:42 nas2 pacemakerd[33257]: notice: pcmk_process_exit: Respawning failed child process: lrmd
Dec 30 14:11:42 nas2 attrd[33272]: notice: attrd_cs_dispatch: Update relayed from node0
Dec 30 14:11:42 nas2 attrd[33272]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-nas_samba (15)
Dec 30 14:11:42 nas2 attrd[33272]: notice: attrd_perform_update: Sent update 361: fail-count-nas_samba=15
Dec 30 14:11:42 nas2 attrd[33272]: notice: attrd_cs_dispatch: Update relayed from node0
Dec 30 14:11:42 nas2 attrd[33272]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-nas_samba (1451455905)
Dec 30 14:11:42 nas2 crmd[42199]: error: crmd_fast_exit: Could not recover from internal error
Dec 30 14:11:42 nas2 attrd[33272]: notice: attrd_perform_update: Sent update 363: last-failure-nas_samba=1451455905
Dec 30 14:11:42 nas2 pacemakerd[33257]: error: pcmk_child_exit: Child process crmd (42199) exited: Generic Pacemaker error (201)
Dec 30 14:11:42 nas2 pacemakerd[33257]: notice: pcmk_process_exit: Respawning failed child process: crmd
Dec 30 14:11:42 nas2 crmd[14225]: notice: main: CRM Git Version: 368c726
Dec 30 14:11:42 nas2 crmd[14225]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Dec 30 14:11:42 nas2 crmd[14225]: notice: cluster_connect_quorum: Quorum acquired
Dec 30 14:11:42 nas2 crmd[14225]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node0[1] - state is now member (was (null))
Dec 30 14:11:42 nas2 crmd[14225]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node1[2] - state is now member (was (null))
Dec 30 14:11:42 nas2 crmd[14225]: notice: do_started: The local CRM is operational
......
9th time:
Jan 1 15:30:01 nas2 systemd: Starting Session 9584 of user root.
Jan 1 15:30:01 nas2 systemd: Started Session 9584 of user root.
Jan 1 15:38:26 nas2 systemd: Reloading.
Jan 1 15:39:16 nas2 pacemakerd[33257]: error: pcmk_child_exit: Child process lrmd (14224) exited: Operation not permitted (1)
Jan 1 15:39:16 nas2 crmd[14225]: notice: process_lrm_event: LRM operation nas_samba_monitor_15000 (call=116, rc=1, cib-update=71, confirmed=false) unknown error
Jan 1 15:39:16 nas2 crmd[14225]: error: crm_ipc_read: Connection to lrmd failed
Jan 1 15:39:16 nas2 pacemakerd[33257]: notice: pcmk_process_exit: Respawning failed child process: lrmd
Jan 1 15:39:16 nas2 crmd[14225]: error: mainloop_gio_callback: Connection to lrmd[0x9bb020] closed (I/O condition=1)
Jan 1 15:39:16 nas2 crmd[14225]: crit: lrm_connection_destroy: LRM Connection failed
Jan 1 15:39:16 nas2 crmd[14225]: error: do_log: FSA: Input I_ERROR from lrm_connection_destroy() received in state S_NOT_DC
Jan 1 15:39:16 nas2 crmd[14225]: notice: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=lrm_connection_destroy ]
Jan 1 15:39:16 nas2 crmd[14225]: warning: do_recover: Fast-tracking shutdown in response to errors
Jan 1 15:39:16 nas2 crmd[14225]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Jan 1 15:39:16 nas2 crmd[14225]: notice: lrm_state_verify_stopped: Recurring action fence_ipmi0:80 (fence_ipmi0_monitor_60000) incomplete at shutdown
Jan 1 15:39:16 nas2 crmd[14225]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_test:96 (nas_filesystem_test_monitor_15000) incomplete at shutdown
Jan 1 15:39:16 nas2 crmd[14225]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_Gsix:102 (nas_filesystem_Gsix_monitor_15000) incomplete at shutdown
Jan 1 15:39:16 nas2 crmd[14225]: notice: lrm_state_verify_stopped: Recurring action nas_samba:116 (nas_samba_monitor_15000) incomplete at shutdown
Jan 1 15:39:16 nas2 crmd[14225]: notice: lrm_state_verify_stopped: Recurring action nas_samba:117 (nas_samba_monitor_60000) incomplete at shutdown
Jan 1 15:39:16 nas2 crmd[14225]: notice: lrm_state_verify_stopped: Recurring action nas_rpcbind:112 (nas_rpcbind_monitor_15000) incomplete at shutdown
Jan 1 15:39:16 nas2 crmd[14225]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_kyfz:106 (nas_filesystem_kyfz_monitor_15000) incomplete at shutdown
Jan 1 15:39:16 nas2 crmd[14225]: notice: lrm_state_verify_stopped: Recurring action nas_vip_10_24_171_90:119 (nas_vip_10_24_171_90_monitor_20000) incomplete at shutdown
Jan 1 15:39:16 nas2 crmd[14225]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_User2:108 (nas_filesystem_User2_monitor_15000) incomplete at shutdown
Jan 1 15:39:16 nas2 crmd[14225]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_GX:110 (nas_filesystem_GX_monitor_15000) incomplete at shutdown
Jan 1 15:39:16 nas2 crmd[14225]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_gsics:100 (nas_filesystem_gsics_monitor_15000) incomplete at shutdown
Jan 1 15:39:16 nas2 crmd[14225]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_Users:98 (nas_filesystem_Users_monitor_15000) incomplete at shutdown
Jan 1 15:39:16 nas2 crmd[14225]: notice: lrm_state_verify_stopped: Recurring action nas_nfs:114 (nas_nfs_monitor_15000) incomplete at shutdown
Jan 1 15:39:16 nas2 crmd[14225]: notice: lrm_state_verify_stopped: Recurring action nas_shared_metadata:94 (nas_shared_metadata_monitor_15000) incomplete at shutdown
Jan 1 15:39:16 nas2 crmd[14225]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_MAS:104 (nas_filesystem_MAS_monitor_15000) incomplete at shutdown
Jan 1 15:39:16 nas2 crmd[14225]: error: lrm_state_verify_stopped: 14 resources were active at shutdown.
Jan 1 15:39:16 nas2 crmd[14225]: notice: do_lrm_control: Disconnected from the LRM
Jan 1 15:39:16 nas2 crmd[14225]: notice: terminate_cs_connection: Disconnecting from Corosync
Jan 1 15:39:16 nas2 attrd[33272]: notice: attrd_cs_dispatch: Update relayed from node0
Jan 1 15:39:16 nas2 attrd[33272]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-nas_samba (18)
Jan 1 15:39:16 nas2 attrd[33272]: notice: attrd_perform_update: Sent update 404: fail-count-nas_samba=18
Jan 1 15:39:16 nas2 attrd[33272]: notice: attrd_cs_dispatch: Update relayed from node0
Jan 1 15:39:16 nas2 attrd[33272]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-nas_samba (1451633959)
Jan 1 15:39:16 nas2 crmd[14225]: error: crmd_fast_exit: Could not recover from internal error
Jan 1 15:39:16 nas2 attrd[33272]: notice: attrd_perform_update: Sent update 406: last-failure-nas_samba=1451633959
Jan 1 15:39:16 nas2 pacemakerd[33257]: error: pcmk_child_exit: Child process crmd (14225) exited: Generic Pacemaker error (201)
Jan 1 15:39:16 nas2 pacemakerd[33257]: notice: pcmk_process_exit: Respawning failed child process: crmd
Jan 1 15:39:16 nas2 crmd[38163]: notice: main: CRM Git Version: 368c726
Jan 1 15:39:16 nas2 crmd[38163]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Jan 1 15:39:16 nas2 crmd[38163]: notice: cluster_connect_quorum: Quorum acquired
Jan 1 15:39:16 nas2 crmd[38163]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node0[1] - state is now member (was (null))
Jan 1 15:39:16 nas2 crmd[38163]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node1[2] - state is now member (was (null))
Jan 1 15:39:16 nas2 crmd[38163]: notice: do_started: The local CRM is operational
......
10th time:
Jan 3 17:01:01 nas2 systemd: Starting Session 10136 of user root.
Jan 3 17:01:01 nas2 systemd: Started Session 10136 of user root.
Jan 3 17:05:59 nas2 systemd: Reloading.
Jan 3 17:06:49 nas2 crmd[38163]: notice: process_lrm_event: LRM operation nas_samba_monitor_15000 (call=116, rc=1, cib-update=71, confirmed=false) unknown error
Jan 3 17:06:49 nas2 attrd[33272]: notice: attrd_cs_dispatch: Update relayed from node0
Jan 3 17:06:49 nas2 attrd[33272]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-nas_samba (21)
Jan 3 17:06:49 nas2 crmd[38163]: error: crm_ipc_read: Connection to lrmd failed
Jan 3 17:06:49 nas2 pacemakerd[33257]: error: pcmk_child_exit: Child process lrmd (38162) exited: Operation not permitted (1)
Jan 3 17:06:49 nas2 pacemakerd[33257]: notice: pcmk_process_exit: Respawning failed child process: lrmd
Jan 3 17:06:49 nas2 crmd[38163]: error: mainloop_gio_callback: Connection to lrmd[0x979020] closed (I/O condition=17)
Jan 3 17:06:49 nas2 crmd[38163]: crit: lrm_connection_destroy: LRM Connection failed
Jan 3 17:06:49 nas2 crmd[38163]: error: do_log: FSA: Input I_ERROR from lrm_connection_destroy() received in state S_NOT_DC
Jan 3 17:06:49 nas2 crmd[38163]: notice: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=lrm_connection_destroy ]
Jan 3 17:06:49 nas2 crmd[38163]: warning: do_recover: Fast-tracking shutdown in response to errors
Jan 3 17:06:49 nas2 crmd[38163]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Jan 3 17:06:49 nas2 attrd[33272]: notice: attrd_perform_update: Sent update 443: fail-count-nas_samba=21
Jan 3 17:06:49 nas2 attrd[33272]: notice: attrd_cs_dispatch: Update relayed from node0
Jan 3 17:06:49 nas2 attrd[33272]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-nas_samba (1451812012)
Jan 3 17:06:49 nas2 attrd[33272]: notice: attrd_perform_update: Sent update 445: last-failure-nas_samba=1451812012
Jan 3 17:06:49 nas2 crmd[38163]: notice: lrm_state_verify_stopped: Recurring action fence_ipmi0:80 (fence_ipmi0_monitor_60000) incomplete at shutdown
Jan 3 17:06:49 nas2 crmd[38163]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_test:96 (nas_filesystem_test_monitor_15000) incomplete at shutdown
Jan 3 17:06:49 nas2 crmd[38163]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_Gsix:102 (nas_filesystem_Gsix_monitor_15000) incomplete at shutdown
Jan 3 17:06:49 nas2 crmd[38163]: notice: lrm_state_verify_stopped: Recurring action nas_samba:116 (nas_samba_monitor_15000) incomplete at shutdown
Jan 3 17:06:49 nas2 crmd[38163]: notice: lrm_state_verify_stopped: Recurring action nas_samba:117 (nas_samba_monitor_60000) incomplete at shutdown
Jan 3 17:06:49 nas2 crmd[38163]: notice: lrm_state_verify_stopped: Recurring action nas_rpcbind:112 (nas_rpcbind_monitor_15000) incomplete at shutdown
Jan 3 17:06:49 nas2 crmd[38163]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_kyfz:106 (nas_filesystem_kyfz_monitor_15000) incomplete at shutdown
Jan 3 17:06:49 nas2 crmd[38163]: notice: lrm_state_verify_stopped: Recurring action nas_vip_10_24_171_90:119 (nas_vip_10_24_171_90_monitor_20000) incomplete at shutdown
Jan 3 17:06:49 nas2 crmd[38163]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_User2:108 (nas_filesystem_User2_monitor_15000) incomplete at shutdown
Jan 3 17:06:49 nas2 crmd[38163]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_GX:110 (nas_filesystem_GX_monitor_15000) incomplete at shutdown
Jan 3 17:06:49 nas2 crmd[38163]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_Users:98 (nas_filesystem_Users_monitor_15000) incomplete at shutdown
Jan 3 17:06:49 nas2 crmd[38163]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_gsics:100 (nas_filesystem_gsics_monitor_15000) incomplete at shutdown
Jan 3 17:06:49 nas2 crmd[38163]: notice: lrm_state_verify_stopped: Recurring action nas_nfs:114 (nas_nfs_monitor_15000) incomplete at shutdown
Jan 3 17:06:49 nas2 crmd[38163]: notice: lrm_state_verify_stopped: Recurring action nas_shared_metadata:94 (nas_shared_metadata_monitor_15000) incomplete at shutdown
Jan 3 17:06:49 nas2 crmd[38163]: notice: lrm_state_verify_stopped: Recurring action nas_filesystem_MAS:104 (nas_filesystem_MAS_monitor_15000) incomplete at shutdown
Jan 3 17:06:49 nas2 crmd[38163]: error: lrm_state_verify_stopped: 14 resources were active at shutdown.
Jan 3 17:06:49 nas2 crmd[38163]: notice: do_lrm_control: Disconnected from the LRM
Jan 3 17:06:49 nas2 crmd[38163]: notice: terminate_cs_connection: Disconnecting from Corosync
Jan 3 17:06:49 nas2 cib[33269]: warning: qb_ipcs_event_sendv: new_event_notification (33269-38163-9): Broken pipe (32)
Jan 3 17:06:49 nas2 cib[33269]: warning: do_local_notify: A-Sync reply to crmd failed: No message of desired type
Jan 3 17:06:49 nas2 crmd[38163]: error: crmd_fast_exit: Could not recover from internal error
Jan 3 17:06:49 nas2 pacemakerd[33257]: error: pcmk_child_exit: Child process crmd (38163) exited: Generic Pacemaker error (201)
Jan 3 17:06:49 nas2 pacemakerd[33257]: notice: pcmk_process_exit: Respawning failed child process: crmd
Jan 3 17:06:49 nas2 crmd[8674]: notice: main: CRM Git Version: 368c726
Jan 3 17:06:49 nas2 crmd[8674]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Jan 3 17:06:49 nas2 crmd[8674]: notice: cluster_connect_quorum: Quorum acquired
Jan 3 17:06:49 nas2 crmd[8674]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node0[1] - state is now member (was (null))
Jan 3 17:06:49 nas2 crmd[8674]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node1[2] - state is now member (was (null))
Jan 3 17:06:49 nas2 crmd[8674]: notice: do_started: The local CRM is operational
......

We are waiting for the RHEL 7.2 test results, too.

The message will be printed every time, but by "not related" I mean that the text of the error message is likely inaccurate. Pacemakerd always prints the system error message corresponding to the child daemon's exit code.
Because lrmd is exiting with exit code 1 every time, and exit code 1 corresponds to the system error message "Operation not permitted", pacemakerd is printing that message. But I do not think lrmd is using exit code 1 because of a permissions error. I don't see a likely code path for lrmd to return exit code 1, so I'm not sure why it's doing so.

So far we have not reproduced this problem in the RHEL 7.2 tests. But when we added more systemd resources to the cluster in the RHEL 7.0 environment, it reproduced more frequently. Just to keep you informed, thanks.

Zhaoming,

That's good news about RHEL 7.2. I'm not aware of a specific fix related to this issue, but there were major systemd-related enhancements between 7.0 and 7.2.

Zhaoming,

I am marking this issue as fixed in the 7.2 release. If you encounter this issue or any other in the 7.2 release, feel free to re-open this report or open a new one.

(In reply to Ken Gaillot from comment #14)
> Zhaoming,
>
> I am marking this issue as fixed in the 7.2 release. If you encounter this
> issue or any other in the 7.2 release, feel free to re-open this report or
> open a new one.

OK, thanks a lot!