Bug 1158815
| Summary: | MariaDB service appears down after some time of working. | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Leonid Natapov <lnatapov> |
| Component: | openstack-foreman-installer | Assignee: | Crag Wolfe <cwolfe> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Omri Hochman <ohochman> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | high | | |
| Version: | Foreman (RHEL 6) | CC: | cwolfe, fdinitto, lnatapov, mburns, morazi, rhos-maint, rohara, sasha, yeylon |
| Target Milestone: | z2 | | |
| Target Release: | Installer | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-11-04 15:19:08 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description — Leonid Natapov — 2014-10-30 10:03:14 UTC

Created attachment 952049 [details]
logs
Comment 2 — Jason Guiditta

OK, I am no Galera expert, but I see in the logs:

141030 1:03:05 [ERROR] WSREP: got error while reading ist stream: 1
141030 1:03:05 [ERROR] WSREP: IST didn't contain all write sets, expected last: 41405 last received: 20917
141030 1:03:05 [ERROR] WSREP: receiving IST failed, node restart required: IST receiver reported error: 71 (Protocol error)
at galera/src/ist.cpp:recv():501
So my guess is we need to look into why IST would fail. Is this config using fencing or no fencing? I have one with fencing disabled that has been running without issue for two days now. Also, were there any network blips or drops? I have seen those cause such failures in the past.
In bad2, I also see:
141029 13:32:20 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
at gcomm/src/pc.cpp:connect():141
141029 13:32:20 [ERROR] WSREP: gcs/src/gcs_core.c:gcs_core_open():202: Failed to open backend connection: -110 (Connection timed out)
141029 13:32:20 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1291: Failed to open channel 'galera_cluster' at 'gcomm://10.19.95.22,10.19.95.23,10.19.95.24': -110 (Connection timed out)
141029 13:32:20 [ERROR] WSREP: gcs connect failed: Connection timed out
141029 13:32:20 [ERROR] WSREP: wsrep::connect() failed: 7
141029 13:32:20 [ERROR] Aborting
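The gcomm connection timeouts above mean the restarting node could not reach any of the peers listed in the `gcomm://` URL on the Galera replication port (4567 by default). As a hedged illustration, not part of the original report, a minimal TCP reachability probe for those peers could look like this (function name and timeout are illustrative):

```python
import socket

def can_reach(host, port=4567, timeout=3):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Port 4567 is the default Galera replication (gcomm) port; adjust it if
    the cluster is configured differently.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Peers taken from the gcomm:// URL in the log above:
# for peer in ("10.19.95.22", "10.19.95.23", "10.19.95.24"):
#     print(peer, "reachable" if can_reach(peer) else "unreachable")
```

A probe like this only distinguishes a network-level failure (which the -110 errors suggest) from a service-level one; it says nothing about cluster quorum.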
(In reply to Jason Guiditta from comment #2)
> OK, I am no galera expert, but I see in the logs:
>
> 141030 1:03:05 [ERROR] WSREP: got error while reading ist stream: 1
> 141030 1:03:05 [ERROR] WSREP: IST didn't contain all write sets, expected last: 41405 last received: 20917
> 141030 1:03:05 [ERROR] WSREP: receiving IST failed, node restart required: IST receiver reported error: 71 (Protocol error)
> at galera/src/ist.cpp:recv():501
>
> So my guess is we need to look into why IST would fail. Is this config using fencing or no fencing? I have one with fencing disabled that has been running without issue for 2 days now. Also, were there any network blips/drops? I have seen that cause such failures in the past

The output from `mysql -e "SHOW STATUS LIKE 'wsrep%';"` shows that the node is in the non-Primary state. I am assuming this is coming from the node that still has mariadb running. This indicates loss of quorum. The other nodes are failing to join the cluster, which is expected, because you cannot join a node that does not have a primary view of the cluster.

Please check the version of pacemaker that you have installed.

Confirmed with QE: pacemaker-1.1.10-32.el7_0.1.x86_64

Please attach a crm_report.

Created attachment 952249 [details]
crm_report output
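The non-Primary state described above can be spotted mechanically in the `SHOW STATUS LIKE 'wsrep%'` output. A minimal sketch, assuming the tab-separated `Variable_name<TAB>Value` format that `mysql -e` emits (function names and the diagnosis strings are illustrative, not from the report):

```python
def parse_wsrep_status(text):
    """Parse tab-separated 'Variable_name\tValue' lines into a dict."""
    status = {}
    for line in text.strip().splitlines():
        parts = line.split("\t")
        if len(parts) == 2 and parts[0].startswith("wsrep_"):
            status[parts[0]] = parts[1]
    return status

def diagnose(status):
    """Return a short diagnosis based on the Galera cluster state."""
    if status.get("wsrep_cluster_status") != "Primary":
        return "non-Primary: node lost quorum, cannot serve writes"
    if status.get("wsrep_ready") != "ON":
        return "Primary view but wsrep not ready"
    return "healthy"

# Sample modeled on the situation in this bug (values are illustrative):
sample = (
    "wsrep_cluster_size\t1\n"
    "wsrep_cluster_status\tnon-Primary\n"
    "wsrep_ready\tOFF\n"
)
print(diagnose(parse_wsrep_status(sample)))
# -> non-Primary: node lost quorum, cannot serve writes
```

`wsrep_cluster_status` other than `Primary` is exactly the "non-Primary" condition called out in the comment above: the node is still running mariadb but has lost quorum.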
It appears that the problem starts here:

Oct 29 17:06:51 mac000af7799318 dbus-daemon: dbus[1175]: [system] Rejected send message, 2 matched rules; type="method_call", sender=":1.19" (uid=189 pid=3012 comm="/usr/libexec/pacemaker/crmd ") interface="org.freedesktop.systemd1.Manager" member="Reload" error name="(unset)" requested_reply="0" destination="org.freedesktop.systemd1" (uid=0 pid=1 comm="/usr/lib/systemd/systemd --switched-root --system ")
Oct 29 17:06:51 mac000af7799318 dbus[1175]: [system] Rejected send message, 2 matched rules; type="method_call", sender=":1.19" (uid=189 pid=3012 comm="/usr/libexec/pacemaker/crmd ") interface="org.freedesktop.systemd1.Manager" member="Reload" error name="(unset)" requested_reply="0" destination="org.freedesktop.systemd1" (uid=0 pid=1 comm="/usr/lib/systemd/systemd --switched-root --system ")
Oct 29 17:06:51 mac000af7799318 crmd[3012]: notice: process_lrm_event: LRM operation mysqld_monitor_0 (call=57, rc=7, cib-update=20, confirmed=true) not running

We see several of those messages rejected by dbus, and monitor operations failing. Later, they start succeeding again:

Oct 30 01:02:27 mac000af7799318 crmd[3012]: notice: process_lrm_event: LRM operation mysqld_monitor_0 (call=380, rc=0, cib-update=182, confirmed=true) ok
Oct 30 01:02:57 mac000af7799318 corosync[2353]: [TOTEM ] A new membership (192.168.0.2:188) was formed. Members left: 3
Oct 30 01:02:57 mac000af7799318 corosync[2353]: [QUORUM] Members[2]: 1 2

There is a membership change here that makes me think there was a network hiccup. Or were you trying to do failure testing? From there on, mysql fails to start, probably because the db is out of sync and needs manual intervention.
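On the "manual intervention" point: after a whole-cluster outage, the usual Galera recovery step is to bootstrap from the node with the highest committed sequence number, which each node records as `seqno` in its `grastate.dat`. A hypothetical helper sketching that choice (node names and sample contents are illustrative, with seqnos modeled on the IST error earlier in this bug):

```python
def parse_grastate(text):
    """Parse grastate.dat-style 'key: value' lines into a dict."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        if ":" in line and not line.startswith("#"):
            key, _, value = line.partition(":")
            state[key.strip()] = value.strip()
    return state

def pick_bootstrap_node(grastates):
    """Given {node_name: grastate.dat contents}, pick the node to bootstrap.

    Note: a node that crashed uncleanly records seqno: -1; real recovery
    then needs mysqld's wsrep-recover pass to find the true position.
    """
    return max(
        grastates,
        key=lambda n: int(parse_grastate(grastates[n]).get("seqno", -1)),
    )

samples = {
    "node1": "# GALERA saved state\nversion: 2.1\nseqno: 41405\n",
    "node2": "# GALERA saved state\nversion: 2.1\nseqno: 20917\n",
}
print(pick_bootstrap_node(samples))
# -> node1
```

This only illustrates the selection rule; the actual bootstrap procedure (and whether pacemaker or an operator performs it) is outside what this bug records.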
From another node:

Oct 29 17:14:03 mac000af779931a lrmd[2832]: notice: action_complete: mysqld monitor failed (rc=196): re-scheduling (time_sum=5360ms, start_delay=2000ms, timeout=14640ms)
Oct 29 17:14:04 mac000af779931a mysqld_safe: 141029 17:14:04 mysqld_safe mysqld from pid file /var/run/mariadb/mariadb.pid ended
Oct 29 17:14:05 mac000af779931a crmd[2838]: notice: process_lrm_event: LRM operation mysqld_start_0 (call=259, rc=7, cib-update=94, confirmed=true) not running

I was doing fencing tests, but when I left the setup, before the problem appeared, all nodes were up and running with all services up and running. According to /var/log/messages from one of the nodes, there was a reboot initiated by stonith on Oct 30 01:03, the time when you say "they start succeeding again". Maybe it has something to do with the problem. BTW, I didn't touch the nodes at that time.

Oct 30 01:03:11 mac000af779931a stonith-ng[2831]: notice: remote_op_done: Operation reboot of mac000af779932a.example.com by mac000af7799318.example.com for crmd.3012.com.b48b3d12: OK
Oct 30 01:03:11 mac000af779931a crmd[2838]: notice: tengine_stonith_notify: Peer mac000af779932a.example.com was terminated (reboot) by mac000af7799318.example.com for mac000af7799318.example.com: OK (ref=b48b3d12-d9ea-4325-8632-d9fd1b3c0542) by client crmd.3012

Didn't reproduce in another bare-metal setup. The service has been up for more than 22 hours on all controllers.

Given comment 12, and conversations with the cluster team, this seems like it was just a network-down issue on all hosts simultaneously.