Bug 1114852
Summary: | Sometimes cluster unexpectedly dies and fails to start. | |
---|---|---|---
Product: | Red Hat Enterprise Linux 7 | Reporter: | Leonid Natapov <lnatapov>
Component: | libqb | Assignee: | Christine Caulfield <ccaulfie>
Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe>
Severity: | urgent | Docs Contact: |
Priority: | urgent | |
Version: | 7.1 | CC: | abeekhof, arkadiusz.bubala, ccaulfie, cfeist, cluster-maint, ctowsley, djansa, fdinitto, jfriesse, jkortus, jkurik, jpokorny, jruemker, konstantin.ponomarenko, lmiksik, lnatapov, Marcin.Dulak, markd, oblaut, sbradley, skinjo, snagar, srevivo, tatsu-ab1, tojeline, ykawada
Target Milestone: | rc | Keywords: | Reopened, ZStream
Target Release: | --- | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | libqb-1.0-1.el7 | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | |
: | 1276345 1277538 (view as bug list) | Environment: |
Last Closed: | 2016-11-04 04:39:51 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1276345, 1277538 | |
Attachments: | | |
This looks to be the root cause of the restart issues:

Jul 1 10:09:15 puma43 pacemakerd[108542]: notice: mcp_read_config: Configured corosync to accept connections from group 189: Library error (2)

Something to do with the corosync API by the looks of it. Chrissie: I need your eyes here....

I'm looking in corosync-blackbox-live.txt ... The last lines I see are:

trace Jul 01 09:21:44 messages_deliver_to_app(3901):12: Delivering 35f3c to 35f3d
trace Jul 01 09:21:44 messages_deliver_to_app(3970):12: Delivering MCAST message with seq 0 to pending delivery queue

yet pacemakerd is doing things (like trying to connect) that I'd have thought would produce logging much later than that. E.g. this call that fails:

Jul 1 09:33:04 puma43 pacemakerd[104109]: notice: mcp_read_config: Configured corosync to accept connections from group 189: Library error (2)

Could you take a look? It's starting to smell like a corosync bug to me.

Andrew, for me the interesting lines are:

trace Jul 01 09:21:34 messages_deliver_toERROR: qb_rb_chunk_read failed: Connection timed out
[trace] ENTERING qb_rb_close()
[debug] Free'ing ringbuffer: /dev/shm/qb-create_from_file-header

So I believe the messages AFTER these lines are just the result of reading uninitialized memory. What also seems interesting is "Could not establish pacemakerd connection: Connection refused (111)". Is this normal?

But there is one thing I don't understand at all: cmap_set_uint8 returning CS_ERR_LIBRARY. I mean, if cmap_init failed, I would believe it is a corosync problem (or the already-fixed one with kernel and unix socket credentials). But the cmap_set_uint8 call simply cannot return CS_ERR_LIBRARY (directly) in its code path. The only place it can come from is libqb (qb_ipcc_sendv_recv probably?). This is why I believe the problem is hidden somewhere in libqb. Dave, any idea? Maybe you've already fixed this bug upstream? Please report back if you believe that this is not a libqb but a corosync problem.
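A side note for reading the mixed error codes in this thread: "Library error (2)" is corosync's cs_error_t value CS_ERR_LIBRARY (the value 2 is taken from the log text itself, not verified against corosync headers), while the other parenthesised numbers in the pacemaker and libqb lines are plain Linux errno values, which the standard library can decode:

```python
import errno
import os

# corosync renders CS_ERR_LIBRARY as "Library error (2)" in the logs
# above; treating 2 as its numeric value is an assumption from that text.
CS_ERR_LIBRARY = 2

# The errno values seen throughout this thread, decoded on Linux:
for code in (11, 107, 111):
    print(code, errno.errorcode[code], os.strerror(code))
# 11  EAGAIN       "Resource temporarily unavailable"
# 107 ENOTCONN     "Transport endpoint is not connected"
# 111 ECONNREFUSED "Connection refused"
```

This matches the log text: the repeated "(11)" retries are EAGAIN, stonithd's "(-107)" is a negated ENOTCONN, and pacemakerd's "(111)" is ECONNREFUSED from its self-check.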
(In reply to Jan Friesse from comment #6)
> Andrew, for me the interesting lines are:
>
> trace Jul 01 09:21:34 messages_deliver_toERROR: qb_rb_chunk_read failed: Connection timed out
> [trace] ENTERING qb_rb_close()
> [debug] Free'ing ringbuffer: /dev/shm/qb-create_from_file-header
>
> So I believe messages AFTER these lines are just the result of reading uninitialized memory.

Is that not from the subsequent calling of qb-blackbox though? Not the actual running of corosync.

> What also seems interesting is "Could not establish pacemakerd connection: Connection refused (111)". Is this normal?

Yes, that's pacemakerd checking for another instance of itself.

(In reply to Andrew Beekhof from comment #7)
> Is that not from the subsequent calling of qb-blackbox though? Not the actual running of corosync

qb-create_from_file-header is a blackbox thing.

I'm not seeing anything here that indicates a libqb bug. I haven't encountered anything like this before.

Leonid, let's try to isolate the faulty component. When you hit the bug again, can you please try to log in to the nodes, run corosync-cmapctl (on every node) and send the result (output + error code)?

Leonid, can you please provide the requested info?

I faced the same issue. The OS is Debian. Two-node cluster. The issue occurs when the cluster is configured for two nodes but there is no second node. In this case "node-1" is the only node which I try to start, and "node-0" is not used (it is shut down).
The software: -libqb0_0.17.1 -corosync_2.3.4 -pacemaker_1.1.12 /var/log/messages: ------------------ Mar 11 17:17:02 daemon.notice<29> corosync[101978]: [MAIN ] Corosync Cluster Engine ('2.3.4'): started and ready to provide service. Mar 11 17:17:02 daemon.info<30> corosync[101978]: [MAIN ] Corosync built-in features: pie relro bindnow Mar 11 17:17:02 daemon.notice<29> corosync[101979]: [TOTEM ] Initializing transport (UDP/IP Unicast). Mar 11 17:17:02 daemon.notice<29> corosync[101979]: [TOTEM ] Initializing transmit/receive security (NSS) crypto: none hash: none Mar 11 17:17:02 daemon.notice<29> corosync[101979]: [TOTEM ] Initializing transport (UDP/IP Unicast). Mar 11 17:17:02 daemon.notice<29> corosync[101979]: [TOTEM ] Initializing transmit/receive security (NSS) crypto: none hash: none Mar 11 17:17:03 daemon.notice<29> corosync[101979]: [TOTEM ] The network interface [169.254.0.3] is now up. Mar 11 17:17:03 daemon.notice<29> corosync[101979]: [SERV ] Service engine loaded: corosync configuration map access [0] Mar 11 17:17:03 daemon.info<30> corosync[101979]: [QB ] server name: cmap Mar 11 17:17:03 daemon.notice<29> corosync[101979]: [SERV ] Service engine loaded: corosync configuration service [1] Mar 11 17:17:03 daemon.info<30> corosync[101979]: [QB ] server name: cfg Mar 11 17:17:03 daemon.notice<29> corosync[101979]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2] Mar 11 17:17:03 daemon.info<30> corosync[101979]: [QB ] server name: cpg Mar 11 17:17:03 daemon.notice<29> corosync[101979]: [SERV ] Service engine loaded: corosync profile loading service [4] Mar 11 17:17:03 daemon.notice<29> corosync[101979]: [QUORUM] Using quorum provider corosync_votequorum Mar 11 17:17:03 daemon.notice<29> corosync[101979]: [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5] Mar 11 17:17:03 daemon.info<30> corosync[101979]: [QB ] server name: votequorum Mar 11 17:17:03 daemon.notice<29> corosync[101979]: [SERV ] Service 
engine loaded: corosync cluster quorum service v0.1 [3] Mar 11 17:17:03 daemon.info<30> corosync[101979]: [QB ] server name: quorum Mar 11 17:17:03 daemon.notice<29> corosync[101979]: [TOTEM ] adding new UDPU member {169.254.0.2} Mar 11 17:17:03 daemon.notice<29> corosync[101979]: [TOTEM ] adding new UDPU member {169.254.0.3} Mar 11 17:17:03 daemon.notice<29> corosync[101979]: [TOTEM ] The network interface [169.254.1.3] is now up. Mar 11 17:17:03 daemon.notice<29> corosync[101979]: [TOTEM ] adding new UDPU member {169.254.1.2} Mar 11 17:17:03 daemon.notice<29> corosync[101979]: [TOTEM ] adding new UDPU member {169.254.1.3} Mar 11 17:17:03 daemon.notice<29> corosync[101979]: [TOTEM ] A new membership (169.254.0.3:848) was formed. Members joined: 2 Mar 11 17:17:03 daemon.notice<29> corosync[101979]: [QUORUM] Members[1]: 2 Mar 11 17:17:03 daemon.notice<29> corosync[101979]: [MAIN ] Completed service synchronization, ready to provide service. Mar 11 17:17:03 daemon.notice<29> pacemaker: Starting Pacemaker Cluster Manager Mar 11 17:17:03 daemon.notice<29> pacemakerd[101990]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log Mar 11 17:17:03 daemon.err<27> pacemakerd[101990]: error: mcp_read_config: Couldn't create logfile: /var/log/pacemaker.log Mar 11 17:17:03 daemon.notice<29> pacemakerd[101990]: notice: mcp_read_config: Configured corosync to accept connections from group 107: Library error (2) Mar 11 17:17:03 daemon.notice<29> pacemakerd[101990]: notice: main: Starting Pacemaker 1.1.12 (Build: 561c4cf): generated-manpages agent-manpages ascii-docs ncurses libqb-logging libqb-ipc lha-fencing upstart nagios corosync-native snmp libesmtp acls Mar 11 17:17:03 daemon.notice<29> pacemakerd[101990]: notice: cluster_connect_quorum: Quorum lost Mar 11 17:17:03 daemon.notice<29> stonithd[101993]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync Mar 11 17:17:03 daemon.notice<29> attrd[101995]: notice: 
crm_cluster_connect: Connecting to cluster infrastructure: corosync Mar 11 17:17:03 daemon.err<27> corosync[101979]: [MAIN ] Denied connection attempt from 105:107 Mar 11 17:17:03 daemon.err<27> attrd[101995]: error: cluster_connect_cpg: Could not connect to the Cluster Process Group API: 11 Mar 11 17:17:03 daemon.err<27> attrd[101995]: error: main: Cluster connection failed Mar 11 17:17:03 daemon.notice<29> attrd[101995]: notice: main: Cleaning up before exit Mar 11 17:17:03 daemon.notice<29> cib[101992]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync Mar 11 17:17:03 daemon.err<27> corosync[101979]: [QB ] Invalid IPC credentials (101979-101995-2). Mar 11 17:17:03 kern.info<6> kernel: [447239.222441] attrd[101995]: segfault at 1b8 ip 00007f5e62f0e9e1 sp 00007fffe35300a0 error 4 in libqb.so.0.17.1[7f5e62eff000+22000] Mar 11 17:17:03 daemon.err<27> corosync[101979]: [MAIN ] Denied connection attempt from 105:107 Mar 11 17:17:03 daemon.err<27> corosync[101979]: [QB ] Invalid IPC credentials (101979-101992-2). Mar 11 17:17:03 daemon.err<27> cib[101992]: error: cluster_connect_cpg: Could not connect to the Cluster Process Group API: 11 Mar 11 17:17:03 daemon.crit<26> cib[101992]: crit: cib_init: Cannot sign in to the cluster... terminating Mar 11 17:17:03 daemon.notice<29> crmd[101997]: notice: main: CRM Git Version: 561c4cf Mar 11 17:17:03 daemon.notice<29> pacemakerd[101990]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node-1[2] - state is now member (was (null)) Mar 11 17:17:03 daemon.err<27> pacemakerd[101990]: error: pcmk_child_exit: Child process cib (101992) exited: Network is down (100) Mar 11 17:17:03 daemon.warning<28> pacemakerd[101990]: warning: pcmk_child_exit: Pacemaker child process cib no longer wishes to be respawned. Shutting ourselves down. 
Mar 11 17:17:03 daemon.err<27> pacemakerd[101990]: error: child_waitpid: Managed process 101995 (attrd) dumped core Mar 11 17:17:03 daemon.notice<29> pacemakerd[101990]: notice: pcmk_child_exit: Child process attrd terminated with signal 11 (pid=101995, core=1) Mar 11 17:17:03 daemon.notice<29> pacemakerd[101990]: notice: pcmk_shutdown_worker: Shuting down Pacemaker Mar 11 17:17:03 daemon.notice<29> pacemakerd[101990]: notice: stop_child: Stopping crmd: Sent -15 to process 101997 Mar 11 17:17:03 daemon.warning<28> crmd[101997]: warning: do_cib_control: Couldn't complete CIB registration 1 times... pause and retry Mar 11 17:17:03 daemon.notice<29> crmd[101997]: notice: crm_shutdown: Requesting shutdown, upper limit is 1200000ms Mar 11 17:17:03 daemon.warning<28> crmd[101997]: warning: do_log: FSA: Input I_SHUTDOWN from crm_shutdown() received in state S_STARTING Mar 11 17:17:03 daemon.notice<29> crmd[101997]: notice: do_state_transition: State transition S_STARTING -> S_STOPPING [ input=I_SHUTDOWN cause=C_SHUTDOWN origin=crm_shutdown ] Mar 11 17:17:03 daemon.notice<29> crmd[101997]: notice: terminate_cs_connection: Disconnecting from Corosync Mar 11 17:17:03 daemon.notice<29> pacemakerd[101990]: notice: stop_child: Stopping pengine: Sent -15 to process 101996 Mar 11 17:17:03 daemon.notice<29> pacemakerd[101990]: notice: stop_child: Stopping lrmd: Sent -15 to process 101994 Mar 11 17:17:03 daemon.notice<29> pacemakerd[101990]: notice: stop_child: Stopping stonith-ng: Sent -15 to process 101993 Mar 11 17:17:12 daemon.err<27> stonithd[101993]: error: setup_cib: Could not connect to the CIB service: Transport endpoint is not connected (-107) Mar 11 17:17:12 daemon.notice<29> pacemakerd[101990]: notice: pcmk_shutdown_worker: Shutdown complete Mar 11 17:17:12 daemon.notice<29> pacemakerd[101990]: notice: pcmk_shutdown_worker: Attempting to inhibit respawning after fatal error corosync-cmapctl: ----------------- config.totemconfig_reload_in_progress (u8) = 0 
internal_configuration.service.0.name (str) = corosync_cmap internal_configuration.service.0.ver (u32) = 0 internal_configuration.service.1.name (str) = corosync_cfg internal_configuration.service.1.ver (u32) = 0 internal_configuration.service.2.name (str) = corosync_cpg internal_configuration.service.2.ver (u32) = 0 internal_configuration.service.3.name (str) = corosync_quorum internal_configuration.service.3.ver (u32) = 0 internal_configuration.service.4.name (str) = corosync_pload internal_configuration.service.4.ver (u32) = 0 internal_configuration.service.5.name (str) = corosync_votequorum internal_configuration.service.5.ver (u32) = 0 logging.debug (str) = off logging.fileline (str) = off logging.logfile (str) = /var/log/cluster/corosync.log logging.logfile_priority (str) = err logging.logger_subsys.QUORUM.debug (str) = off logging.logger_subsys.QUORUM.subsys (str) = QUORUM logging.timestamp (str) = on logging.to_logfile (str) = yes logging.to_stderr (str) = no logging.to_syslog (str) = yes nodelist.local_node_pos (u32) = 1 nodelist.node.0.name (str) = node-0 nodelist.node.0.nodeid (u32) = 1 nodelist.node.0.ring0_addr (str) = 169.254.0.2 nodelist.node.0.ring1_addr (str) = 169.254.1.2 nodelist.node.1.name (str) = node-1 nodelist.node.1.nodeid (u32) = 2 nodelist.node.1.ring0_addr (str) = 169.254.0.3 nodelist.node.1.ring1_addr (str) = 169.254.1.3 quorum.provider (str) = corosync_votequorum runtime.blackbox.dump_flight_data (str) = no runtime.blackbox.dump_state (str) = no runtime.connections.active (u64) = 1 runtime.connections.closed (u64) = 36 runtime.connections.corosync-cmapct:102017:0x7f8fd0008000.client_pid (u32) = 102017 runtime.connections.corosync-cmapct:102017:0x7f8fd0008000.dispatched (u64) = 0 runtime.connections.corosync-cmapct:102017:0x7f8fd0008000.flow_control (u32) = 0 runtime.connections.corosync-cmapct:102017:0x7f8fd0008000.flow_control_count (u64) = 0 runtime.connections.corosync-cmapct:102017:0x7f8fd0008000.invalid_request (u64) = 0 
runtime.connections.corosync-cmapct:102017:0x7f8fd0008000.name (str) = corosync-cmapct runtime.connections.corosync-cmapct:102017:0x7f8fd0008000.overload (u64) = 0 runtime.connections.corosync-cmapct:102017:0x7f8fd0008000.queue_size (u32) = 0 runtime.connections.corosync-cmapct:102017:0x7f8fd0008000.recv_retries (u64) = 0 runtime.connections.corosync-cmapct:102017:0x7f8fd0008000.requests (u64) = 0 runtime.connections.corosync-cmapct:102017:0x7f8fd0008000.responses (u64) = 0 runtime.connections.corosync-cmapct:102017:0x7f8fd0008000.send_retries (u64) = 0 runtime.connections.corosync-cmapct:102017:0x7f8fd0008000.service_id (u32) = 0 runtime.services.cfg.0.rx (u64) = 0 runtime.services.cfg.0.tx (u64) = 0 runtime.services.cfg.1.rx (u64) = 0 runtime.services.cfg.1.tx (u64) = 0 runtime.services.cfg.2.rx (u64) = 0 runtime.services.cfg.2.tx (u64) = 0 runtime.services.cfg.3.rx (u64) = 0 runtime.services.cfg.3.tx (u64) = 0 runtime.services.cfg.service_id (u16) = 1 runtime.services.cmap.0.rx (u64) = 1 runtime.services.cmap.0.tx (u64) = 1 runtime.services.cmap.service_id (u16) = 0 runtime.services.cpg.0.rx (u64) = 2 runtime.services.cpg.0.tx (u64) = 2 runtime.services.cpg.1.rx (u64) = 2 runtime.services.cpg.1.tx (u64) = 2 runtime.services.cpg.2.rx (u64) = 0 runtime.services.cpg.2.tx (u64) = 0 runtime.services.cpg.3.rx (u64) = 12 runtime.services.cpg.3.tx (u64) = 12 runtime.services.cpg.4.rx (u64) = 0 runtime.services.cpg.4.tx (u64) = 0 runtime.services.cpg.5.rx (u64) = 1 runtime.services.cpg.5.tx (u64) = 1 runtime.services.cpg.service_id (u16) = 2 runtime.services.pload.0.rx (u64) = 0 runtime.services.pload.0.tx (u64) = 0 runtime.services.pload.1.rx (u64) = 0 runtime.services.pload.1.tx (u64) = 0 runtime.services.pload.service_id (u16) = 4 runtime.services.quorum.service_id (u16) = 3 runtime.services.votequorum.0.rx (u64) = 3 runtime.services.votequorum.0.tx (u64) = 2 runtime.services.votequorum.1.rx (u64) = 0 runtime.services.votequorum.1.tx (u64) = 0 
runtime.services.votequorum.2.rx (u64) = 0 runtime.services.votequorum.2.tx (u64) = 0 runtime.services.votequorum.3.rx (u64) = 0 runtime.services.votequorum.3.tx (u64) = 0 runtime.services.votequorum.service_id (u16) = 5 runtime.totem.pg.mrp.rrp.0.faulty (u8) = 0 runtime.totem.pg.mrp.rrp.1.faulty (u8) = 0 runtime.totem.pg.mrp.srp.avg_backlog_calc (u32) = 0 runtime.totem.pg.mrp.srp.avg_token_workload (u32) = 0 runtime.totem.pg.mrp.srp.commit_entered (u64) = 1 runtime.totem.pg.mrp.srp.commit_token_lost (u64) = 0 runtime.totem.pg.mrp.srp.consensus_timeouts (u64) = 0 runtime.totem.pg.mrp.srp.continuous_gather (u32) = 0 runtime.totem.pg.mrp.srp.continuous_sendmsg_failures (u32) = 0 runtime.totem.pg.mrp.srp.firewall_enabled_or_nic_failure (u8) = 0 runtime.totem.pg.mrp.srp.gather_entered (u64) = 1 runtime.totem.pg.mrp.srp.gather_token_lost (u64) = 0 runtime.totem.pg.mrp.srp.mcast_retx (u64) = 0 runtime.totem.pg.mrp.srp.mcast_rx (u64) = 0 runtime.totem.pg.mrp.srp.mcast_tx (u64) = 22 runtime.totem.pg.mrp.srp.memb_commit_token_rx (u64) = 2 runtime.totem.pg.mrp.srp.memb_commit_token_tx (u64) = 3 runtime.totem.pg.mrp.srp.memb_join_rx (u64) = 1 runtime.totem.pg.mrp.srp.memb_join_tx (u64) = 1 runtime.totem.pg.mrp.srp.memb_merge_detect_rx (u64) = 2062 runtime.totem.pg.mrp.srp.memb_merge_detect_tx (u64) = 2062 runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0 runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(169.254.0.3) r(1) ip(169.254.1.3) runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1 runtime.totem.pg.mrp.srp.members.2.status (str) = joined runtime.totem.pg.mrp.srp.mtt_rx_token (u32) = 180 runtime.totem.pg.mrp.srp.operational_entered (u64) = 1 runtime.totem.pg.mrp.srp.operational_token_lost (u64) = 0 runtime.totem.pg.mrp.srp.orf_token_rx (u64) = 4511 runtime.totem.pg.mrp.srp.orf_token_tx (u64) = 1 runtime.totem.pg.mrp.srp.recovery_entered (u64) = 1 runtime.totem.pg.mrp.srp.recovery_token_lost (u64) = 0 runtime.totem.pg.mrp.srp.rx_msg_dropped (u64) = 0 
runtime.totem.pg.mrp.srp.token_hold_cancel_rx (u64) = 6 runtime.totem.pg.mrp.srp.token_hold_cancel_tx (u64) = 6 runtime.totem.pg.msg_queue_avail (u32) = 0 runtime.totem.pg.msg_reserved (u32) = 1 runtime.votequorum.ev_barrier (u32) = 2 runtime.votequorum.this_node_id (u32) = 2 runtime.votequorum.two_node (u8) = 0 totem.cluster_name (str) = avid_storage_cluster totem.crypto_cipher (str) = none totem.crypto_hash (str) = none totem.interface.0.bindnetaddr (str) = 169.254.0.3 totem.interface.0.mcastaddr (str) = 239.192.90.15 totem.interface.0.mcastport (u16) = 5405 totem.interface.1.bindnetaddr (str) = 169.254.1.3 totem.interface.1.mcastaddr (str) = 239.192.90.16 totem.interface.1.mcastport (u16) = 5405 totem.rrp_mode (str) = passive totem.transport (str) = udpu totem.version (u32) = 2

John, do you have coredumps of the crashed corosync?

According to "Mar 27 14:00:50 daemon.err<27> pacemakerd[111069]: error: child_waitpid: Managed process 111074 (attrd) dumped core", there is a core dump in "/var/cores/core.attrd-111074-1427464849". How can I attach the core dump here?

While interesting, attrd crashing is not related to this bug. You need a crash from corosync.

"/var/cores/" doesn't have any other core dumps. Does corosync use the same place for dumps? I am not sure that corosync actually crashed. I will try to reproduce this state, and as soon as I get more info I will get back with it.

I faced this issue one more time. Now I can surely say that corosync doesn't crash. On a working machine I stopped Pacemaker and Corosync. Then I started them with the following commands and got this:

------------------
# /etc/init.d/corosync start
Starting Corosync Cluster Engine (corosync): [ OK ]
# /etc/init.d/corosync status
corosync (pid 100837) is running...
# /etc/init.d/pacemaker start Starting Pacemaker Cluster Manager[ OK ] # /etc/init.d/pacemaker status pacemakerd is stopped ------------------ /var/log/messages: ------------------ Apr 22 10:49:08 daemon.notice<29> pacemaker: Starting Pacemaker Cluster Manager Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log Apr 22 10:49:08 daemon.err<27> pacemakerd[114133]: error: mcp_read_config: Couldn't create logfile: /var/log/pacemaker.log Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]: notice: mcp_read_config: Configured corosync to accept connections from group 107: Library error (2) Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]: notice: main: Starting Pacemaker 1.1.12 (Build: 561c4cf): generated-manpages agent-manpages ascii-docs ncurses libqb-logging libqb-ipc lha-fencing upstart nagios corosync-native snmp libesmtp acls Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]: notice: cluster_connect_quorum: Quorum lost Apr 22 10:49:08 daemon.notice<29> stonithd[114136]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync Apr 22 10:49:08 daemon.notice<29> attrd[114138]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync Apr 22 10:49:08 daemon.err<27> corosync[100837]: [MAIN ] Denied connection attempt from 105:107 Apr 22 10:49:08 daemon.err<27> attrd[114138]: error: cluster_connect_cpg: Could not connect to the Cluster Process Group API: 11 Apr 22 10:49:08 daemon.err<27> attrd[114138]: error: main: Cluster connection failed Apr 22 10:49:08 daemon.notice<29> attrd[114138]: notice: main: Cleaning up before exit Apr 22 10:49:08 kern.info<6> kernel: [162259.416242] attrd[114138]: segfault at 1b8 ip 00007f375481c9e1 sp 00007fff7ddf0d50 error 4 in libqb.so.0.17.1[7f375480d000+22000] Apr 22 10:49:08 daemon.err<27> corosync[100837]: [QB ] Invalid IPC credentials (100837-114138-2). 
Apr 22 10:49:08 daemon.notice<29> cib[114135]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync Apr 22 10:49:08 daemon.err<27> cib[114135]: error: cluster_connect_cpg: Could not connect to the Cluster Process Group API: 11 Apr 22 10:49:08 daemon.crit<26> cib[114135]: crit: cib_init: Cannot sign in to the cluster... terminating Apr 22 10:49:08 daemon.err<27> corosync[100837]: [MAIN ] Denied connection attempt from 105:107 Apr 22 10:49:08 daemon.err<27> corosync[100837]: [QB ] Invalid IPC credentials (100837-114135-3). Apr 22 10:49:08 daemon.notice<29> crmd[114140]: notice: main: CRM Git Version: 561c4cf Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node-0[1] - state is now member (was (null)) Apr 22 10:49:08 daemon.err<27> pacemakerd[114133]: error: pcmk_child_exit: Child process cib (114135) exited: Network is down (100) Apr 22 10:49:08 daemon.warning<28> pacemakerd[114133]: warning: pcmk_child_exit: Pacemaker child process cib no longer wishes to be respawned. Shutting ourselves down. Apr 22 10:49:08 daemon.err<27> pacemakerd[114133]: error: child_waitpid: Managed process 114138 (attrd) dumped core Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]: notice: pcmk_child_exit: Child process attrd terminated with signal 11 (pid=114138, core=1) Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]: notice: pcmk_shutdown_worker: Shuting down Pacemaker Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]: notice: stop_child: Stopping crmd: Sent -15 to process 114140 Apr 22 10:49:08 daemon.warning<28> crmd[114140]: warning: do_cib_control: Couldn't complete CIB registration 1 times... 
pause and retry
Apr 22 10:49:08 daemon.notice<29> crmd[114140]: notice: crm_shutdown: Requesting shutdown, upper limit is 1200000ms
Apr 22 10:49:08 daemon.warning<28> crmd[114140]: warning: do_log: FSA: Input I_SHUTDOWN from crm_shutdown() received in state S_STARTING
Apr 22 10:49:08 daemon.notice<29> crmd[114140]: notice: do_state_transition: State transition S_STARTING -> S_STOPPING [ input=I_SHUTDOWN cause=C_SHUTDOWN origin=crm_shutdown ]
Apr 22 10:49:08 daemon.notice<29> crmd[114140]: notice: terminate_cs_connection: Disconnecting from Corosync
Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]: notice: stop_child: Stopping pengine: Sent -15 to process 114139
Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]: notice: stop_child: Stopping lrmd: Sent -15 to process 114137
Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]: notice: stop_child: Stopping stonith-ng: Sent -15 to process 114136
Apr 22 10:49:17 daemon.err<27> stonithd[114136]: error: setup_cib: Could not connect to the CIB service: Transport endpoint is not connected (-107)
Apr 22 10:49:17 daemon.notice<29> pacemakerd[114133]: notice: pcmk_shutdown_worker: Shutdown complete
Apr 22 10:49:17 daemon.notice<29> pacemakerd[114133]: notice: pcmk_shutdown_worker: Attempting to inhibit respawning after fatal error
------------------

"/var/cores/" contains only "core.attrd-*". What else can I do? Could the problem be in 'libqb'? I noticed this line in the log:

Apr 22 10:49:08 kern.info<6> kernel: [162259.416242] attrd[114138]: segfault at 1b8 ip 00007f375481c9e1 sp 00007fff7ddf0d50 error 4 in libqb.so.0.17.1[7f375480d000+22000]

Konstantyn, of course it can be a bug in libqb, but I don't think that's very probable. The really important line is:

Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]: notice: mcp_read_config: Configured corosync to accept connections from group 107: Library error (2)

What exactly is your reproducer? Just stop pacemaker + corosync and then start them again?
How reliable is such a reproducer?

Yes, that is correct. You need to sequentially start/stop Corosync/Pacemaker some number of times (~5-10).

Kostiantyn, there was a bug in libqb, bug 1211375. After installing the fixed version, is the problem still there?

Jan, I haven't tried the fixed version yet. I am still using "v0.17.1 Final Release".

Hello, I've got the same issue. Updating libqb to the latest version didn't resolve the problem. It seems to be much harder to reproduce, but it's still here. The steps to reproduce this bug are:
1. Halt the secondary node.
2. Stop Corosync/Pacemaker.
3. Start Corosync/Pacemaker. Corosync is running but pacemaker exits.
4. Start the secondary node.
5. Start Corosync/Pacemaker on the secondary node. Corosync is running but pacemaker exits.
6. There is no segmentation fault error. Pacemaker exits because it can't connect to corosync. After a few retries, corosync and pacemaker start correctly.

Arkadiusz, is halting the secondary node needed, or does a sequential start/stop of pacemaker on one node also reproduce the problem?

Jan, I've never reproduced this problem with a sequential start/stop of pacemaker, but I could try to test it with stop/start in a loop. This issue also seldom happens while rebooting the secondary node. The second thing is that this issue only occurs on a very powerful environment. Currently both nodes are running on an Intel Server Board S2600IP with two Intel Xeon E5-2643 3.30GHz CPUs and 128 GB of RAM.

Jan, I can surely say that it is reproducible with only one node up. I don't remember whether it is reproducible with two nodes up. BTW, my environment: Intel(R) Xeon(R) CPU E5-2618L v2 @ 2.00GHz, 32 GB RAM.

Jan, Kostiantyn, I tested it with a stop/start sequence and it is reproducible, but both corosync and pacemaker have to be restarted. When I did stop/start of pacemaker only, this issue didn't occur. It is also impossible to reproduce this with debug log level, so I've added some log prints to libqb.
Below are the results: When everything works fine: Aug 04 12:09:06 [61070] node2 pacemakerd: info: crm_ipc_connect: Could not establish pacemakerd connection: Connection refused (111) Aug 04 12:09:06 [61070] node2 pacemakerd: info: get_cluster_type: Detected an active 'corosync' cluster Aug 04 12:09:06 [61070] node2 pacemakerd: info: mcp_read_config: Reading configure for stack: corosync Aug 04 12:09:06 [61070] node2 pacemakerd: warning: qb_ipc_shm_fc_get: shm _fc_get rc = 2 Aug 04 12:09:06 [61070] node2 pacemakerd: warning: qb_ipc_shm_fc_get: shm _fc_get rc = 2 Aug 04 12:09:06 [61070] node2 pacemakerd: warning: qb_ipcc_sendv_recv: res: 40: Resource temporarily unavailable (11) Aug 04 12:09:06 [61070] node2 pacemakerd: warning: qb_ipc_shm_fc_get: shm _fc_get rc = 2 Aug 04 12:09:06 [61070] node2 pacemakerd: warning: qb_ipc_shm_fc_get: shm _fc_get rc = 2 Aug 04 12:09:06 [61070] node2 pacemakerd: warning: qb_ipcc_sendv_recv: res: 44: Resource temporarily unavailable (11) Aug 04 12:09:06 [61070] node2 pacemakerd: warning: qb_ipc_shm_fc_get: shm _fc_get rc = 2 Aug 04 12:09:06 [61070] node2 pacemakerd: warning: qb_ipc_shm_fc_get: shm _fc_get rc = 2 Aug 04 12:09:06 [61070] node2 pacemakerd: warning: qb_ipcc_sendv_recv: res: 40: Resource temporarily unavailable (11) Aug 04 12:09:06 [61070] node2 pacemakerd: warning: qb_ipc_shm_fc_get: shm _fc_get rc = 2 Aug 04 12:09:06 [61070] node2 pacemakerd: warning: qb_ipc_shm_fc_get: shm _fc_get rc = 2 Aug 04 12:09:06 [61070] node2 pacemakerd: warning: qb_ipcc_sendv_recv: res: 43: Resource temporarily unavailable (11) Aug 04 12:09:06 [61070] node2 pacemakerd: warning: qb_ipc_shm_fc_get: shm _fc_get rc = 2 Aug 04 12:09:06 [61070] node2 pacemakerd: warning: qb_ipc_shm_fc_get: shm _fc_get rc = 2 Aug 04 12:09:06 [61070] node2 pacemakerd: warning: qb_ipcc_sendv_recv: res: 40: Resource temporarily unavailable (11) Aug 04 12:09:06 [61070] node2 pacemakerd: warning: qb_ipc_shm_fc_get: shm _fc_get rc = 2 Aug 04 12:09:06 [61070] node2 
pacemakerd: warning: qb_ipc_shm_fc_get: shm _fc_get rc = 2
Aug 04 12:09:06 [61070] node2 pacemakerd: warning: qb_ipcc_sendv_recv: res: 62: Resource temporarily unavailable (11)
Aug 04 12:09:06 [61070] node2 pacemakerd: warning: qb_ipc_shm_fc_get: shm _fc_get rc = 2
Aug 04 12:09:06 [61070] node2 pacemakerd: warning: qb_ipc_shm_fc_get: shm _fc_get rc = 2
Aug 04 12:09:06 [61070] node2 pacemakerd: warning: qb_ipcc_sendv_recv: res: 40: Resource temporarily unavailable (11)
Aug 04 12:09:06 [61070] node2 pacemakerd: warning: qb_ipc_shm_fc_get: shm _fc_get rc = 2
Aug 04 12:09:06 [61070] node2 pacemakerd: warning: qb_ipc_shm_fc_get: shm _fc_get rc = 2
Aug 04 12:09:06 [61070] node2 pacemakerd: warning: qb_ipcc_sendv_recv: res: 44: Resource temporarily unavailable (11)
Aug 04 12:09:06 [61070] node2 pacemakerd: warning: qb_ipc_shm_fc_get: shm _fc_get rc = 2
Aug 04 12:09:06 [61070] node2 pacemakerd: warning: qb_ipc_shm_fc_get: shm _fc_get rc = 2
Aug 04 12:09:06 [61070] node2 pacemakerd: warning: qb_ipcc_sendv_recv: res: 40: Resource temporarily unavailable (11)
Aug 04 12:09:06 [61070] node2 pacemakerd: warning: qb_ipc_shm_fc_get: shm _fc_get rc = 2
Aug 04 12:09:06 [61070] node2 pacemakerd: warning: qb_ipc_shm_fc_get: shm _fc_get rc = 2
Aug 04 12:09:06 [61070] node2 pacemakerd: warning: qb_ipcc_sendv_recv: res: 24: Success (0)
Aug 04 12:09:06 [61070] node2 pacemakerd: notice: mcp_read_config: Configured corosync to accept connections from group 112: OK (1)

When the error occurs:

Aug 04 12:13:22 [110652] node2 pacemakerd: info: crm_ipc_connect: Could not establish pacemakerd connection: Connection refused (111)
Aug 04 12:13:22 [110652] node2 pacemakerd: info: get_cluster_type: Detected an active 'corosync' cluster
Aug 04 12:13:22 [110652] node2 pacemakerd: info: mcp_read_config: Reading configure for stack: corosync
Aug 04 12:13:22 [110652] node2 pacemakerd: warning: qb_ipc_shm_fc_get: shm_fc_get_rc = 1
Aug 04 12:13:22 [110652] node2 pacemakerd: error: qb_ipcc_sendv_recv: fc_get failed, result = -107: Resource temporarily unavailable (11)
Aug 04 12:13:22 [110652] node2 pacemakerd: warning: qb_ipc_shm_fc_get: shm _fc_get rc = 1
Aug 04 12:13:22 [110652] node2 pacemakerd: error: qb_ipcc_sendv_recv: fc_get failed, result = -107: Resource temporarily unavailable (11)
Aug 04 12:13:22 [110652] node2 pacemakerd: warning: qb_ipc_shm_fc_get: shm _fc_get rc = 1
Aug 04 12:13:22 [110652] node2 pacemakerd: error: qb_ipcc_sendv_recv: fc_get failed, result = -107: Resource temporarily unavailable (11)
Aug 04 12:13:22 [110652] node2 pacemakerd: warning: qb_ipc_shm_fc_get: shm _fc_get rc = 0
Aug 04 12:13:22 [110652] node2 pacemakerd: error: qb_ipcc_sendv_recv: fc_get failed, result = -107: Resource temporarily unavailable (11)
Aug 04 12:13:22 [110652] node2 pacemakerd: warning: qb_ipc_shm_fc_get: shm _fc_get rc = 0
Aug 04 12:13:22 [110652] node2 pacemakerd: error: qb_ipcc_sendv_recv: fc_get failed, result = -107: Resource temporarily unavailable (11)
Aug 04 12:13:22 [110652] node2 pacemakerd: warning: qb_ipc_shm_fc_get: shm _fc_get rc = 0
Aug 04 12:13:22 [110652] node2 pacemakerd: error: qb_ipcc_sendv_recv: fc_get failed, result = -107: Success (0)
Aug 04 12:13:22 [110652] node2 pacemakerd: notice: mcp_read_config: Configured corosync to accept connections from group 112: Library error (2)

I'll continue debugging and let you know when I find something.

Ok, so it looks like some kind of funny race condition, which may be why I'm unable to reproduce it. Guys, what exactly are you using to restart? Just service (or systemctl) pacemaker stop; service corosync stop; service corosync start; service pacemaker start?

I found the root of the issue. In the pacemaker code, the function mcp_read_config contains:

cmap_initialize(&local_handle);
...
stack = get_cluster_type();

This creates two cmap sockets from the pacemakerd daemon in a short time.
From pacemaker.log:

Aug 05 09:14:10 [112876] node2 pacemakerd: warning: qb_ipcc_shm_connect: (1438758850834812) ipcc_shm_connect: cmap
Aug 05 09:14:10 [112876] node2 pacemakerd: warning: qb_rb_open_2: Increase ref_count for /dev/shm/qb-cmap-request-110743-112876-1-data
Aug 05 09:14:10 [112876] node2 pacemakerd: warning: qb_rb_open_2: Increase ref_count for /dev/shm/qb-cmap-response-110743-112876-1-data
Aug 05 09:14:10 [112876] node2 pacemakerd: warning: qb_rb_open_2: Increase ref_count for /dev/shm/qb-cmap-event-110743-112876-1-data
Aug 05 09:14:10 [112876] node2 pacemakerd: warning: qb_ipcc_shm_connect: (1438758850838617) ipcc_shm_connect: cmap
Aug 05 09:14:10 [112876] node2 pacemakerd: warning: qb_rb_open_2: Increase ref_count for /dev/shm/qb-cmap-request-110743-112876-1-data
Aug 05 09:14:10 [112876] node2 pacemakerd: warning: qb_rb_open_2: Increase ref_count for /dev/shm/qb-cmap-response-110743-112876-1-data
Aug 05 09:14:10 [112876] node2 pacemakerd: warning: qb_rb_open_2: Increase ref_count for /dev/shm/qb-cmap-event-110743-112876-1-data
Aug 05 09:14:10 [112876] node2 pacemakerd: warning: qb_ipcc_shm_disconnect: shm_disconnect
Aug 05 09:14:10 [112876] node2 pacemakerd: warning: qb_rb_close: Decrease ref_count for /dev/shm/qb-cmap-request-110743-112876-1-data
Aug 05 09:14:10 [112876] node2 pacemakerd: warning: qb_rb_close: Decrease ref_count for /dev/shm/qb-cmap-response-110743-112876-1-data
Aug 05 09:14:10 [112876] node2 pacemakerd: warning: qb_rb_close: Decrease ref_count for /dev/shm/qb-cmap-event-110743-112876-1-data

From corosync.log:

Aug 5 09:14:10 node2 corosync[110743]: [QB ] (1438758850832743) ipcs_shm_connect: cmap (pid: 112876)
Aug 5 09:14:10 node2 corosync[110743]: [QB ] creae new ref_count on /dev/shm/qb-cmap-request-110743-112876-1-data
Aug 5 09:14:10 node2 corosync[110743]: [QB ] creae new ref_count on /dev/shm/qb-cmap-response-110743-112876-1-data
Aug 5 09:14:10 node2 corosync[110743]: [QB ] creae new ref_count on /dev/shm/qb-cmap-event-110743-112876-1-data
Aug 5 09:14:10 node2 corosync[110743]: [QB ] (1438758850836802) ipcs_shm_connect: cmap (pid: 112876)
Aug 5 09:14:10 node2 corosync[110743]: [QB ] creae new ref_count on /dev/shm/qb-cmap-request-110743-112876-1-data
Aug 5 09:14:10 node2 corosync[110743]: [QB ] creae new ref_count on /dev/shm/qb-cmap-response-110743-112876-1-data
Aug 5 09:14:10 node2 corosync[110743]: [QB ] creae new ref_count on /dev/shm/qb-cmap-event-110743-112876-1-data

The reference counter was overwritten when get_cluster_type created a new connection, and then decreased when it closed this connection. When I moved the get_cluster_type() call before cmap_initialize() in the pacemaker code, this issue did not occur.

Update:

I found what really generates this problem. It is the length of the CONNECTION_DESCRIPTION field, which is defined as 16 by default in libqb's ipc_int.h:

#define CONNECTION_DESCRIPTION (16)

When both corosync and pacemaker have a PID >= 10000 and the fd >= 10, the description field will have at least 17 chars. Increasing the length of the description field resolved this issue.

(In reply to Arkadiusz Bubała from comment #41)
> Update:
>
> I found what really generates this problem. It is the length of the
> CONNECTION_DESCRIPTION field, which is defined as 16 by default in libqb's
> ipc_int.h:
> #define CONNECTION_DESCRIPTION (16)
>
> When both corosync and pacemaker have a PID >= 10000 and the fd >= 10, the
> description field will have at least 17 chars. Increasing the length of the
> description field resolved this issue.

The format string is pid-pid-fd. 10000-10000-10 is only 14 characters. Looking at your debug output, the pids are actually greater than 100,000, which is way larger than the default maximum pid used in RHEL 6 and RHEL 7:

# cat /proc/sys/kernel/pid_max
32768

Under normal operation 16 characters should work. You all have increased the maximum pid size (likely for a really good reason), which has unexpectedly broken IPC.
As a workaround, is it possible to lower the pid_max value back to the default range, or to something <= 99,999? This will resolve the issue in the short term. Of course we'll fix this in libqb; I just wanted to make you aware there's a workaround.

David, my mistake, of course the pid is larger than 100000, not 10000. I forgot one zero. The pid_max on the tested environments was 122880.

Yes, you are right. Here is what I have:

# cat /proc/sys/kernel/pid_max
122880

Which version of "libqb" will get the fix, the next one - v0.17.2?

We should be able to get this in 0.17.2, yes.

Sorry, my mistake. I should have said that it will be in 0.17.3. 0.17.2 will be released this week, and it won't happen that soon.

I guess a backport to RHEL 6.7 is needed too. I'm seeing pacemaker crashing, and have:

# cat /proc/sys/kernel/pid_max
147456

When will there be an erratum for this bug?

(In reply to marcindulak from comment #48)
> I guess a backport to RHEL 6.7 is needed too.
> I'm seeing pacemaker crashing, and have:
> # cat /proc/sys/kernel/pid_max
> 147456

Hello,

Do you have a case open with Red Hat Global Support Services on this issue? If not, I'd like to request that you please open one so we can work with you on getting this fixed in the release you need it for.

Thanks,
John Ruemker
Principal Software Maintenance Engineer
Red Hat Global Support Services

This will be covered by the rebase BZ #1299968.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2363.html
Created attachment 913689 [details]
crm report

Sometimes pacemaker unexpectedly dies and fails to start. I have a cluster with 3 nodes that is running successfully. Sometimes the cluster dies on one of the nodes: suddenly "pcs status" shows "Error: cluster is not currently running on this node", and trying to start the cluster fails. It happens from time to time on all 3 nodes. Here is the /var/log/messages log; crm_report is attached.

pacemaker-cluster-libs-1.1.10-31.el7_0.x86_64
pacemaker-cli-1.1.10-31.el7_0.x86_64
pacemaker-libs-1.1.10-31.el7_0.x86_64
pacemaker-1.1.10-31.el7_0.x86_64

--------------------------------------------------------------------------

Jul 1 10:09:15 puma43 systemd: Started Corosync Cluster Engine.
Jul 1 10:09:15 puma43 systemd: Starting Pacemaker High Availability Cluster Manager...
Jul 1 10:09:15 puma43 systemd: Started Pacemaker High Availability Cluster Manager.
Jul 1 10:09:15 puma43 systemd: Failed to reset devices.list on /system.slice/system-epmd.slice: Invalid argument
Jul 1 10:09:15 puma43 pacemakerd[108542]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
Jul 1 10:09:15 puma43 pacemakerd: Could not establish pacemakerd connection: Connection refused (111)
Jul 1 10:09:15 puma43 pacemakerd[108542]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
Jul 1 10:09:15 puma43 pacemakerd[108542]: notice: mcp_read_config: Configured corosync to accept connections from group 189: Library error (2)
Jul 1 10:09:15 puma43 pacemakerd[108542]: notice: main: Starting Pacemaker 1.1.10-31.el7_0 (Build: 368c726): generated-manpages agent-manpages ascii-docs publican-docs ncurses libqb-logging libqb-ipc systemd nagios corosync-native
Jul 1 10:09:15 puma43 pacemakerd[108542]: notice: corosync_node_name: Unable to get node name for nodeid 2
Jul 1 10:09:15 puma43 pacemakerd[108542]: notice: get_node_name: Could not obtain a node name for corosync nodeid 2
Jul 1 10:09:15 puma43 pacemakerd[108542]: notice: cluster_connect_quorum: Quorum acquired
Jul 1 10:09:15 puma43 pacemakerd[108542]: notice: corosync_node_name: Unable to get node name for nodeid 1
Jul 1 10:09:15 puma43 pacemakerd[108542]: notice: corosync_node_name: Unable to get node name for nodeid 1
Jul 1 10:09:15 puma43 pacemakerd[108542]: notice: get_node_name: Could not obtain a node name for corosync nodeid 1
Jul 1 10:09:15 puma43 pacemakerd[108542]: notice: corosync_node_name: Unable to get node name for nodeid 2
Jul 1 10:09:15 puma43 pacemakerd[108542]: notice: get_node_name: Defaulting to uname -n for the local corosync node name
Jul 1 10:09:15 puma43 cib[108543]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
Jul 1 10:09:15 puma43 lrmd[108545]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
Jul 1 10:09:15 puma43 attrd[108546]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
Jul 1 10:09:15 puma43 stonith-ng[108544]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
Jul 1 10:09:15 puma43 pacemakerd[108542]: notice: corosync_node_name: Unable to get node name for nodeid 1
Jul 1 10:09:15 puma43 pacemakerd[108542]: notice: get_node_name: Could not obtain a node name for corosync nodeid 1
Jul 1 10:09:15 puma43 attrd[108546]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Jul 1 10:09:15 puma43 stonith-ng[108544]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Jul 1 10:09:15 puma43 pengine[108547]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
Jul 1 10:09:15 puma43 crmd[108548]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
Jul 1 10:09:15 puma43 corosync[102268]: [MAIN ] Denied connection attempt from 189:189
Jul 1 10:09:15 puma43 corosync[102268]: [QB ] Invalid IPC credentials (102268-108546-2).
Jul 1 10:09:15 puma43 attrd[108546]: error: cluster_connect_cpg: Could not connect to the Cluster Process Group API: 11
Jul 1 10:09:15 puma43 attrd[108546]: error: main: HA Signon failed
Jul 1 10:09:15 puma43 attrd[108546]: error: main: Aborting startup
Jul 1 10:09:15 puma43 crmd[108548]: notice: main: CRM Git Version: 368c726
Jul 1 10:09:15 puma43 pacemakerd[108542]: notice: corosync_node_name: Unable to get node name for nodeid 1
Jul 1 10:09:15 puma43 cib[108543]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Jul 1 10:09:15 puma43 corosync[102268]: [MAIN ] Denied connection attempt from 189:189
Jul 1 10:09:15 puma43 corosync[102268]: [QB ] Invalid IPC credentials (102268-108543-3).
Jul 1 10:09:15 puma43 cib[108543]: error: cluster_connect_cpg: Could not connect to the Cluster Process Group API: 11
Jul 1 10:09:15 puma43 pacemakerd[108542]: notice: corosync_node_name: Unable to get node name for nodeid 1
Jul 1 10:09:15 puma43 cib[108543]: crit: cib_init: Cannot sign in to the cluster... terminating
Jul 1 10:09:15 puma43 pacemakerd[108542]: notice: get_node_name: Could not obtain a node name for corosync nodeid 1
Jul 1 10:09:15 puma43 pacemakerd[108542]: notice: crm_update_peer_state: pcmk_quorum_notification: Node (null)[1] - state is now member (was (null))
Jul 1 10:09:15 puma43 pacemakerd[108542]: notice: crm_update_peer_state: pcmk_quorum_notification: Node puma43.scl.lab.tlv.redhat.com[2] - state is now member (was (null))
Jul 1 10:09:15 puma43 stonith-ng[108544]: notice: corosync_node_name: Unable to get node name for nodeid 2
Jul 1 10:09:15 puma43 stonith-ng[108544]: notice: get_node_name: Could not obtain a node name for corosync nodeid 2
Jul 1 10:09:15 puma43 stonith-ng[108544]: notice: corosync_node_name: Unable to get node name for nodeid 2
Jul 1 10:09:15 puma43 stonith-ng[108544]: notice: get_node_name: Defaulting to uname -n for the local corosync node name
Jul 1 10:09:15 puma43 pacemakerd[108542]: notice: corosync_node_name: Unable to get node name for nodeid 3
Jul 1 10:09:15 puma43 pacemakerd[108542]: notice: get_node_name: Could not obtain a node name for corosync nodeid 3
Jul 1 10:09:15 puma43 pacemakerd[108542]: notice: corosync_node_name: Unable to get node name for nodeid 3
Jul 1 10:09:15 puma43 pacemakerd[108542]: notice: corosync_node_name: Unable to get node name for nodeid 3
Jul 1 10:09:15 puma43 pacemakerd[108542]: notice: get_node_name: Could not obtain a node name for corosync nodeid 3
Jul 1 10:09:15 puma43 pacemakerd[108542]: notice: crm_update_peer_state: pcmk_quorum_notification: Node (null)[3] - state is now member (was (null))
Jul 1 10:09:15 puma43 pacemakerd[108542]: error: pcmk_child_exit: Child process cib (108543) exited: Network is down (100)
Jul 1 10:09:15 puma43 pacemakerd[108542]: warning: pcmk_child_exit: Pacemaker child process cib no longer wishes to be respawned. Shutting ourselves down.
Jul 1 10:09:15 puma43 pacemakerd[108542]: error: pcmk_child_exit: Child process attrd (108546) exited: Network is down (100)
Jul 1 10:09:15 puma43 pacemakerd[108542]: warning: pcmk_child_exit: Pacemaker child process attrd no longer wishes to be respawned. Shutting ourselves down.
Jul 1 10:09:15 puma43 pacemakerd[108542]: notice: pcmk_shutdown_worker: Shuting down Pacemaker
Jul 1 10:09:15 puma43 pacemakerd[108542]: notice: stop_child: Stopping crmd: Sent -15 to process 108548
Jul 1 10:09:15 puma43 crmd[108548]: warning: do_cib_control: Couldn't complete CIB registration 1 times... pause and retry
Jul 1 10:09:15 puma43 crmd[108548]: notice: crm_shutdown: Requesting shutdown, upper limit is 1200000ms
Jul 1 10:09:15 puma43 crmd[108548]: warning: do_log: FSA: Input I_SHUTDOWN from crm_shutdown() received in state S_STARTING
Jul 1 10:09:15 puma43 crmd[108548]: notice: do_state_transition: State transition S_STARTING -> S_STOPPING [ input=I_SHUTDOWN cause=C_SHUTDOWN origin=crm_shutdown ]
Jul 1 10:09:15 puma43 crmd[108548]: notice: terminate_cs_connection: Disconnecting from Corosync
Jul 1 10:09:15 puma43 pacemakerd[108542]: notice: stop_child: Stopping pengine: Sent -15 to process 108547
Jul 1 10:09:15 puma43 pacemakerd[108542]: notice: stop_child: Stopping lrmd: Sent -15 to process 108545
Jul 1 10:09:15 puma43 pacemakerd[108542]: notice: stop_child: Stopping stonith-ng: Sent -15 to process 108544
Jul 1 10:09:24 puma43 stonith-ng[108544]: error: setup_cib: Could not connect to the CIB service: Transport endpoint is not connected (-107)
Jul 1 10:09:24 puma43 pacemakerd[108542]: notice: pcmk_shutdown_worker: Shutdown complete