Bug 1432571
| Summary: | [OSP10] after failed openstack overcloud upgrade galera takes >1h to start | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Ondrej <ochalups> |
| Component: | rhosp-director | Assignee: | Sofer Athlan-Guyot <sathlang> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Amit Ugol <augol> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 10.0 (Newton) | CC: | cpaquin, dbecker, dciabrin, fdinitto, lbezdick, mbayer, mburns, michele, morazi, mschuppe, ochalups, pablo.iranzo, rhel-osp-director-maint, rlopez, sathlang, sbaker, shardy, srevivo |
| Target Milestone: | --- | Keywords: | Triaged, ZStream |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-05-30 07:49:23 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | journalctl -t 'galera(galera)' -t rsyncd (attachment 1266594) | | |
Description by Ondrej, 2017-03-15 17:11:18 UTC
The logs available in collab-shell are still showing the same error:

2017-02-17 16:41:14.042 4612 ERROR heat.engine.resource ResourceFailure: resources.CephStorage: resources[0]: Error: resources.UpdateDeployment: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 3

Hi,
so the last error during the upgrade is an "exit 1", not an "exit 3".
The exact error is a failure to restart the galera cluster[1] in time.
At 9:31 we start the update and stop the cluster, and at 9:33 we try
to restart it. We wait half an hour for galera to come back[2] and
then give up with an "exit 1"[3].
We can see that the promotion of the cluster fails in the corosync
log[4] at 9:38.
At 10:45, the galera cluster was successfully "created"[5].
Did you do any form of cluster manipulation to bring the galera cluster
back, like a "pcs resource cleanup galera" for instance, at around 10:45?
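For reference, a manual intervention of that sort would typically look like the sketch below; the exact commands run on the node are an assumption on our side:

```
# Clear the resource's failure history so pacemaker retries galera
pcs resource cleanup galera

# Then watch the cluster until a galera master is promoted again
pcs status
```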
We're going to pull in a galera specialist to look at those logs to
better understand the root cause of the issue.
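For context, the 30-minute wait and the final "exit 1" come from the galera sync loop in yum_update.sh (linked in [3] below). Paraphrased rather than copied verbatim, it behaves like this:

```
# Paraphrase of the galera sync wait in yum_update.sh (see [3] for the
# actual code): poll clustercheck until the local node reports synced,
# and give up with exit 1 after galera_sync_timeout seconds.
galera_sync_timeout=1800   # the 30 minutes observed here
tstart=$(date +%s)
while ! clustercheck >/dev/null 2>&1; do
    sleep 5
    tnow=$(date +%s)
    if (( tnow - tstart > galera_sync_timeout )); then
        echo "ERROR galera sync timed out"
        exit 1
    fi
done
```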
[1]: "ERROR galera sync timed out" from sosreport-20170310-122136/overcloud-oscar1ctr03.localdomain/sos_commands/pacemaker/crm_report/overcloud-oscar1ctr03.localdomain/journal.log
[2]: from the same file, the clustercheck output (repeated ~360 times):
Galera cluster node is not synced.
HTTP/1.1 503 Service Unavailable
Content-Type: text/plain
Connection: close
Content-Length: 36
[3]: https://github.com/openstack/tripleo-heat-templates/blob/stable/newton/extraconfig/tasks/yum_update.sh#L123-L130
[4]: in sosreport-20170310-122136/overcloud-oscar1ctr03.localdomain/var/log/cluster/corosync.log:
Mar 10 09:38:48 [790616] overcloud-oscar1ctr03.localdomain crmd: error: process_lrm_event: Result of promote operation for galera on overcloud-oscar1ctr03: Timed Out | call=57 key=galera_promote_0 timeout=300000ms
[5]: from sosreport-20170310-122136/overcloud-oscar1ctr03.localdomain/var/log/mysqld.log
170310 9:31:44 [ERROR] mysqld: Got an error writing communication packets
170310 09:33:36 mysqld_safe Number of processes running now: 0
170310 09:33:36 mysqld_safe WSREP: not restarting wsrep node automatically
170310 09:33:36 mysqld_safe mysqld from pid file /var/run/mysql/mysqld.pid ended
170310 09:33:48 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
170310 09:33:48 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.f5tFvN' --pid-file='/var/lib/mysql/overcloud-oscar1ctr03.localdomain-recover.pid'
170310 9:33:48 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295
170310 10:45:23 mysqld_safe WSREP: Recovered position 1da1081c-de49-11e6-a27e-6f95973e550b:22857127
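The 1h12 figure comes straight from the two timestamps above (recovery starts at 09:33:48, the position is finally reported at 10:45:23); a quick sanity check of the arithmetic with GNU date:

```
start="2017-03-10 09:33:48"   # mysqld_safe starts position recovery
end="2017-03-10 10:45:23"     # recovered position finally reported
echo "$(( ($(date -d "$end" +%s) - $(date -d "$start" +%s)) / 60 )) minutes"
# -> 71 minutes, i.e. ~1h12
```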
Hi,
so we have confirmation that the command that looks up the sequence
number for starting galera took 1h12 (from log[5] above).
To check what is happening there, could you run these commands on oscar1ctr03 as root:
1. Stop galera on the oscar1ctr03 node (the cluster will remain
functional):
pcs resource ban --wait galera overcloud-oscar1ctr03
2. Execute the command that took 1h12 in the failed update attempt:
nohup mysqld_safe --wsrep-recover --pid-file=/var/run/mysql/mysqld.pid > /tmp/mysqld_safe_run-$(hostname).log
You should see this kind of output, and the command should take
less than a minute:
170316 18:20:55 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
/bin/mysqld_safe: line 755: ulimit: -1: invalid option
ulimit: usage: ulimit [-SHacdefilmnpqrstuvx] [limit]
170316 18:20:55 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
170316 18:20:55 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.c15tD9' --pid-file='/var/lib/mysql/overcloud-controller-0-recover.pid'
170316 18:20:58 mysqld_safe WSREP: Recovered position ea7c7d3e-0a42-11e7-adbd-b3c69a709c8a:46938
170316 18:21:01 mysqld_safe mysqld from pid file /var/run/mysql/mysqld.pid ended
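One caveat about the command in step 2 as written: the redirection only captures stdout. A variant that also captures stderr and records the duration (our suggestion, not part of the original instructions):

```
# Same recovery run, but keep stderr too and time the whole thing
time nohup mysqld_safe --wsrep-recover \
    --pid-file=/var/run/mysql/mysqld.pid \
    > /tmp/mysqld_safe_run-$(hostname).log 2>&1
```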
This is the command that took 1h12 during the failed update, so we
expect to get some log of what is happening there. It does nothing
besides reading the recovery position, and once it finishes, the mysql
daemon will be stopped.
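For context, this --wsrep-recover pass mirrors what the galera resource agent does at promotion time to find each node's last committed seqno. Once the run finishes, the position can be pulled out of the log like this (a sketch against the log path used above):

```
# Extract the recovered GTID (uuid:seqno) from the recovery log
grep 'Recovered position' /tmp/mysqld_safe_run-$(hostname).log | awk '{print $NF}'
# e.g. 1da1081c-de49-11e6-a27e-6f95973e550b:22857127
```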
3. To put the galera cluster back:
pcs resource clear galera --wait
For the moment we cannot tell why the command took so long. While it
runs, the mysqld process is started and the database files are opened,
but besides that nothing much happens. The usual suspects would be disk
errors, but I cannot see anything in dmesg[1], and swap.
For swap it's interesting to see (both points can be re-checked live,
see the sketch below):
- the swap is a 9GB file in /root/tmp-swapfile, which is half full[2]
- mongodb is eating 14GB[3]
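Both observations can be re-checked on the live node with commands along the lines of what the sosreport captured ([2] and [3] below):

```
# Swap devices/files and how full they are (cf. [2])
swapon --bytes --show

# Largest processes by resident memory (cf. [3]; mongodb showed ~14GB)
ps axo pid,rss,cmd --sort=-rss | head
```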
So let's see how long the command takes, and send back the
/tmp/mysqld_safe_run-$(hostname).log so that we can debug further.
[1] sosreport-20170310-122136/overcloud-oscar1ctr03.localdomain/sos_commands/kernel/dmesg
[2] sosreport-20170310-122136/overcloud-oscar1ctr03.localdomain/sos_commands/memory/swapon_--bytes_--show
[3] sort -rn -k11,11 sosreport-20170310-122136/overcloud-oscar1ctr03.localdomain/sos_commands/process/ps_axo_flags_state_uid_pid_ppid_pgid_sid_cls_pri_addr_sz_wchan_stime_tty_time_cmd | head
Thanks to Michele and Damien for their help so far.
Regards,
Hi Ondrej,
hmm, the last output looks like the one you put in comment 15, only shorter. Am I missing something? If the one from comment 15 is the right one, we can see here that the command did not take 1h to complete. I've checked, and the only network-related operation that happens during this command is binding to port 3306 when the mysql server is started. Let's investigate why the promotion is not working. The files /var/log/mysqld.log and /var/log/mariadb/mariadb.log and the output of journalctl -t 'galera(galera)' -t rsyncd would help here as a starter.
Thanks,

Created attachment 1266594 [details]
journalctl -t 'galera(galera)' -t rsyncd
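As a side note while the promotion failure is investigated: the 300000ms that timed out in the corosync log above is the promote operation timeout configured on the galera resource. It can be inspected and, as a stopgap only (an assumption on our side: this papers over the slow recovery rather than fixing it), raised:

```
# Show the galera resource configuration, including op timeouts
pcs resource show galera

# Raise the promote timeout, e.g. to 20 minutes
pcs resource update galera op promote timeout=1200s
```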