Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1158815

Summary: MariaDB service appears down after some time of operation.
Product: Red Hat OpenStack Reporter: Leonid Natapov <lnatapov>
Component: openstack-foreman-installer Assignee: Crag Wolfe <cwolfe>
Status: CLOSED CURRENTRELEASE QA Contact: Omri Hochman <ohochman>
Severity: unspecified Docs Contact:
Priority: high    
Version: Foreman (RHEL 6) CC: cwolfe, fdinitto, lnatapov, mburns, morazi, rhos-maint, rohara, sasha, yeylon
Target Milestone: z2   
Target Release: Installer   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-11-04 15:19:08 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
logs (flags: none)
crm_report output (flags: none)

Description Leonid Natapov 2014-10-30 10:03:14 UTC
MariaDB service appears down after some time of operation.
I have an HA + Neutron deployment: 3 controllers and 1 compute.
After some time of working with no problems (all services up and running), I noticed that the mariadb service was down on two of the three cluster nodes, so I checked the third node (the one where mariadb was still running).
If you have 3 nodes in a Galera cluster and you stop 2, the remaining node goes to non-primary state because quorum is lost. The question is why the other two nodes were stopped. At the time mysql stopped, I noticed the following messages in the /var/log/messages file:
-----------------------------------------------------------------------------

Oct 30 01:00:01 mac000af779932a systemd: Failed to reset devices.list on /system.slice: Invalid argument
Oct 30 01:00:05 mac000af779932a crmd[3090]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
Oct 30 01:00:05 mac000af779932a pengine[3088]: warning: unpack_rsc_op: Processing failed op start for mysqld:0 on mac000af779931a.example.com: not running (7)
Oct 30 01:00:05 mac000af779932a pengine[3088]: warning: common_apply_stickiness: Forcing mysqld-clone away from mac000af779931a.example.com after 1000000 failures (max=1000000)
Oct 30 01:00:05 mac000af779932a pengine[3088]: warning: common_apply_stickiness: Forcing mysqld-clone away from mac000af779931a.example.com after 1000000 failures (max=1000000)
Oct 30 01:00:05 mac000af779932a pengine[3088]: warning: common_apply_stickiness: Forcing mysqld-clone away from mac000af779931a.example.com after 1000000 failures (max=1000000)
Oct 30 01:00:05 mac000af779932a pengine[3088]: notice: process_pe_message: Calculated Transition 43: /var/lib/pacemaker/pengine/pe-input-20.bz2
Oct 30 01:00:05 mac000af779932a crmd[3090]: notice: run_graph: Transition 43 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-20.bz2): Complete
Oct 30 01:00:05 mac000af779932a crmd[3090]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
--------------------------------------------------------------------------------
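The quorum behavior described above (stop 2 of 3 nodes and the survivor drops to non-primary) follows from Galera's majority rule. A minimal sketch for illustration only; `has_quorum` is a hypothetical helper, not part of Galera, and it ignores Galera's weighted-quorum options:

```shell
# Sketch of Galera's quorum rule: a partition stays Primary only if it holds
# a strict majority of the last known cluster membership.
has_quorum() {
  local alive="$1" total="$2"
  if [ $(( alive * 2 )) -gt "$total" ]; then
    echo "Primary"
  else
    echo "non-Primary"
  fi
}

has_quorum 3 3   # all three nodes up
has_quorum 1 3   # two nodes stopped: the survivor goes non-Primary
```

With 1 of 3 members alive there is no majority, which matches the non-Primary state seen on the surviving node below.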

[root@mac000af7799318 ~]# mysql -e "SHOW STATUS LIKE 'wsrep%';"
+------------------------------+--------------------------------------+
| Variable_name                | Value                                |
+------------------------------+--------------------------------------+
| wsrep_local_state_uuid       | b2467da9-5fad-11e4-963a-1e12c959fd42 |
| wsrep_protocol_version       | 5                                    |
| wsrep_last_committed         | 317                                  |
| wsrep_replicated             | 22                                   |
| wsrep_replicated_bytes       | 9298                                 |
| wsrep_repl_keys              | 110                                  |
| wsrep_repl_keys_bytes        | 1386                                 |
| wsrep_repl_data_bytes        | 6504                                 |
| wsrep_repl_other_bytes       | 0                                    |
| wsrep_received               | 3                                    |
| wsrep_received_bytes         | 226                                  |
| wsrep_local_commits          | 22                                   |
| wsrep_local_cert_failures    | 0                                    |
| wsrep_local_replays          | 0                                    |
| wsrep_local_send_queue       | 0                                    |
| wsrep_local_send_queue_avg   | 0.041667                             |
| wsrep_local_recv_queue       | 0                                    |
| wsrep_local_recv_queue_avg   | 0.000000                             |
| wsrep_local_cached_downto    | 296                                  |
| wsrep_flow_control_paused_ns | 0                                    |
| wsrep_flow_control_paused    | 0.000000                             |
| wsrep_flow_control_sent      | 0                                    |
| wsrep_flow_control_recv      | 0                                    |
| wsrep_cert_deps_distance     | 1.772727                             |
| wsrep_apply_oooe             | 0.000000                             |
| wsrep_apply_oool             | 0.000000                             |
| wsrep_apply_window           | 1.000000                             |
| wsrep_commit_oooe            | 0.000000                             |
| wsrep_commit_oool            | 0.000000                             |
| wsrep_commit_window          | 1.000000                             |
| wsrep_local_state            | 0                                    |
| wsrep_local_state_comment    | Initialized                          |
| wsrep_cert_index_size        | 8                                    |
| wsrep_causal_reads           | 0                                    |
| wsrep_cert_interval          | 0.000000                             |
| wsrep_incoming_addresses     |                                      |
| wsrep_cluster_conf_id        | 18446744073709551615                 |
| wsrep_cluster_size           | 0                                    |
| wsrep_cluster_state_uuid     | b2467da9-5fad-11e4-963a-1e12c959fd42 |
| wsrep_cluster_status         | non-Primary                          |
| wsrep_connected              | ON                                   |
| wsrep_local_bf_aborts        | 0                                    |
| wsrep_local_index            | 18446744073709551615                 |
| wsrep_provider_name          | Galera                               |
| wsrep_provider_vendor        | Codership Oy <info>    |
| wsrep_provider_version       | 3.5(rXXXX)                           |
| wsrep_ready                  | OFF                                  |
+------------------------------+--------------------------------------+

I am attaching the mariadb log files from all three cluster nodes. The mariadb_bad files are from the nodes where mysql stopped; the mariadb file is from the node where it was still running. The /var/log/messages file is also attached.
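The state shown in the wsrep table above can be checked in a script. A minimal sketch, assuming the whitespace-separated batch output that `mysql -e` produces when stdout is not a terminal; the function name and sample values are illustrative, with the sample mirroring the broken node in this report:

```shell
# Sample batch-mode output (values mirror the non-Primary node above).
status_output='wsrep_cluster_status non-Primary
wsrep_ready OFF
wsrep_cluster_size 0'

# Hypothetical health check: a Galera node is usable only when it is part of
# a Primary component and wsrep_ready is ON.
check_galera() {
  local out="$1" cluster_status ready
  cluster_status=$(printf '%s\n' "$out" | awk '$1=="wsrep_cluster_status"{print $2}')
  ready=$(printf '%s\n' "$out" | awk '$1=="wsrep_ready"{print $2}')
  if [ "$cluster_status" = "Primary" ] && [ "$ready" = "ON" ]; then
    echo "healthy"
  else
    echo "degraded: status=$cluster_status ready=$ready"
  fi
}

check_galera "$status_output"
```

In practice the input would come from `mysql -e "SHOW STATUS LIKE 'wsrep%';"` on each controller.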

Comment 1 Leonid Natapov 2014-10-30 10:08:52 UTC
Created attachment 952049 [details]
logs

Comment 2 Jason Guiditta 2014-10-30 13:17:59 UTC
OK, I am no galera expert, but I see in the logs:

141030  1:03:05 [ERROR] WSREP: got error while reading ist stream: 1
141030  1:03:05 [ERROR] WSREP: IST didn't contain all write sets, expected last: 41405 last received: 20917
141030  1:03:05 [ERROR] WSREP: receiving IST failed, node restart required: IST receiver reported error: 71 (Protocol error)
         at galera/src/ist.cpp:recv():501

So my guess is we need to look into why IST would fail. Is this config using fencing or not? I have one with fencing disabled that has been running without issue for 2 days now. Also, were there any network blips/drops? I have seen those cause such failures in the past.

Comment 3 Jason Guiditta 2014-10-30 13:30:29 UTC
In bad2, I also see:

141029 13:32:20 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
         at gcomm/src/pc.cpp:connect():141
141029 13:32:20 [ERROR] WSREP: gcs/src/gcs_core.c:gcs_core_open():202: Failed to open backend connection: -110 (Connection timed out)
141029 13:32:20 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1291: Failed to open channel 'galera_cluster' at 'gcomm://10.19.95.22,10.19.95.23,10.19.95.24': -110 (Connection timed out)
141029 13:32:20 [ERROR] WSREP: gcs connect failed: Connection timed out
141029 13:32:20 [ERROR] WSREP: wsrep::connect() failed: 7
141029 13:32:20 [ERROR] Aborting

Comment 4 Ryan O'Hara 2014-10-30 15:01:32 UTC
(In reply to Jason Guiditta from comment #2)
> OK, I am no galera expert, but I see in the logs:
> 
> 141030  1:03:05 [ERROR] WSREP: got error while reading ist stream: 1
> 141030  1:03:05 [ERROR] WSREP: IST didn't contain all write sets, expected
> last: 41405 last received: 20917
> 141030  1:03:05 [ERROR] WSREP: receiving IST failed, node restart required:
> IST receiver reported error: 71 (Protocol error)
>          at galera/src/ist.cpp:recv():501
> 
> So my guess is we need to look into why IST would fail.  Is this config
> using fencing or no fencing?  I have one with fencing disabled that has been
> running without issue for 2 days now.  Also, were there any network
> blips/drops?  I have seen that cause such failures in the past

The output from 'mysql -e "SHOW STATUS LIKE 'wsrep%';"' shows that the node is in non-Primary state. I am assuming this is coming from the node that still has mariadb running. This indicates loss of quorum. The other nodes are failing to join the cluster, which is expected because you cannot join a node that does not have a primary view of the cluster.
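For reference, once the underlying cause is fixed, a lone surviving non-Primary node can usually be forced back to Primary by bootstrapping a new primary component via Galera's documented `pc.bootstrap` provider option; the other nodes can then rejoin it. This is an assumption about the general recovery path, not something done in this report. A sketch; `bootstrap_primary` is a hypothetical helper that only prints the statement unless RUN=1:

```shell
# Hypothetical helper: emit (or, with RUN=1, execute) the Galera statement
# that bootstraps a new Primary component on the surviving node.
bootstrap_primary() {
  local stmt="SET GLOBAL wsrep_provider_options='pc.bootstrap=YES';"
  if [ "${RUN:-0}" = "1" ]; then
    mysql -e "$stmt"    # run against the local, still-running mysqld
  else
    echo "$stmt"        # dry run: just show the statement
  fi
}

bootstrap_primary       # dry run by default
```

Afterwards, `SHOW STATUS LIKE 'wsrep_cluster_status'` on the survivor should report Primary again. Only do this on the node with the most advanced data set.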

Comment 5 Ryan O'Hara 2014-10-30 15:36:20 UTC
Please check the version of pacemaker that you have installed.

Comment 6 Mike Burns 2014-10-30 15:36:55 UTC
confirmed with QE:  pacemaker-1.1.10-32.el7_0.1.x86_64

Comment 7 Fabio Massimo Di Nitto 2014-10-30 15:40:31 UTC
Please attach a crm_report.

Comment 8 Alexander Chuzhoy 2014-10-30 16:02:24 UTC
Created attachment 952249 [details]
crm_report output

Comment 10 Fabio Massimo Di Nitto 2014-10-31 05:10:45 UTC
It appears that the problem starts here:

Oct 29 17:06:51 mac000af7799318 dbus-daemon: dbus[1175]: [system] Rejected send message, 2 matched rules; type="method_call", sender=":1.19" (uid=189 pid=3012 comm="/usr/libexec/pacemaker/crmd ") interface="org.freedesktop.systemd1.Manager" member="Reload" error name="(unset)" requested_reply="0" destination="org.freedesktop.systemd1" (uid=0 pid=1 comm="/usr/lib/systemd/systemd --switched-root --system ")
Oct 29 17:06:51 mac000af7799318 dbus[1175]: [system] Rejected send message, 2 matched rules; type="method_call", sender=":1.19" (uid=189 pid=3012 comm="/usr/libexec/pacemaker/crmd ") interface="org.freedesktop.systemd1.Manager" member="Reload" error name="(unset)" requested_reply="0" destination="org.freedesktop.systemd1" (uid=0 pid=1 comm="/usr/lib/systemd/systemd --switched-root --system ")
Oct 29 17:06:51 mac000af7799318 crmd[3012]: notice: process_lrm_event: LRM operation mysqld_monitor_0 (call=57, rc=7, cib-update=20, confirmed=true) not running

We see several of those messages rejected by dbus, and monitor operations failing.

Later, they start succeeding again:

Oct 30 01:02:27 mac000af7799318 crmd[3012]: notice: process_lrm_event: LRM operation mysqld_monitor_0 (call=380, rc=0, cib-update=182, confirmed=true) ok

Oct 30 01:02:57 mac000af7799318 corosync[2353]: [TOTEM ] A new membership (192.168.0.2:188) was formed. Members left: 3
Oct 30 01:02:57 mac000af7799318 corosync[2353]: [QUORUM] Members[2]: 1 2

There is a membership change here that makes me think there was a network hiccup. Or were you doing failure testing?

From there on, mysql fails to start, probably because the database is out of sync and needs manual intervention.

From another node:
Oct 29 17:14:03 mac000af779931a lrmd[2832]: notice: action_complete: mysqld monitor failed (rc=196): re-scheduling (time_sum=5360ms, start_delay=2000ms, timeout=14640ms)
Oct 29 17:14:04 mac000af779931a mysqld_safe: 141029 17:14:04 mysqld_safe mysqld from pid file /var/run/mariadb/mariadb.pid ended
Oct 29 17:14:05 mac000af779931a crmd[2838]: notice: process_lrm_event: LRM operation mysqld_start_0 (call=259, rc=7, cib-update=94, confirmed=true) not running
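The "manual intervention" mentioned above typically starts with finding the most advanced node to recover from: Galera records the last committed position in grastate.dat (by default under /var/lib/mysql/ on RHEL). A sketch of extracting the seqno; the helper name and the sample content are illustrative:

```shell
# Hypothetical helper: pull the committed seqno out of a grastate.dat body.
# The node with the highest seqno is the safest one to bootstrap from.
grastate_seqno() {
  printf '%s\n' "$1" | awk -F': *' '$1=="seqno"{print $2}'
}

# Sample content mirroring /var/lib/mysql/grastate.dat (values illustrative):
sample='# GALERA saved state
version: 2.1
uuid:    b2467da9-5fad-11e4-963a-1e12c959fd42
seqno:   317'

grastate_seqno "$sample"
```

Running `grastate_seqno "$(cat /var/lib/mysql/grastate.dat)"` on each node and comparing the values would identify the recovery candidate.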

Comment 11 Leonid Natapov 2014-10-31 07:49:44 UTC
I was doing fencing tests, but when I left the setup, before the problem appeared, all nodes were up with all services running.

According to /var/log/messages from one of the nodes, there was a reboot initiated by stonith on Oct 30 01:03, the time when you say "they start succeeding again".
Maybe it has something to do with the problem. BTW, I didn't touch the nodes at that time.

Oct 30 01:03:11 mac000af779931a stonith-ng[2831]: notice: remote_op_done: Operation reboot of mac000af779932a.example.com by mac000af7799318.example.com for crmd.3012.com.b48b3d12: OK
Oct 30 01:03:11 mac000af779931a crmd[2838]: notice: tengine_stonith_notify: Peer mac000af779932a.example.com was terminated (reboot) by mac000af7799318.example.com for mac000af7799318.example.com: OK (ref=b48b3d12-d9ea-4325-8632-d9fd1b3c0542) by client crmd.3012

Comment 12 Alexander Chuzhoy 2014-11-04 15:02:32 UTC
Could not reproduce on another bare-metal setup. The service has been up for more than 22 hours on all controllers.

Comment 13 Mike Burns 2014-11-04 15:19:08 UTC
Given comment 12, and conversations with the cluster team, this seems to have been a network outage affecting all hosts simultaneously.