Bug 1432571
Field | Value
---|---
Summary: | [OSP10] after failed openstack overcloud upgrade galera takes >1h to start
Product: | Red Hat OpenStack
Component: | rhosp-director
Version: | 10.0 (Newton)
Status: | CLOSED CURRENTRELEASE
Severity: | high
Priority: | unspecified
Reporter: | Ondrej <ochalups>
Assignee: | Sofer Athlan-Guyot <sathlang>
QA Contact: | Amit Ugol <augol>
CC: | cpaquin, dbecker, dciabrin, fdinitto, lbezdick, mbayer, mburns, michele, morazi, mschuppe, ochalups, pablo.iranzo, rhel-osp-director-maint, rlopez, sathlang, sbaker, shardy, srevivo
Keywords: | Triaged, ZStream
Hardware: | Unspecified
OS: | Unspecified
Type: | Bug
Last Closed: | 2017-05-30 07:49:23 UTC
Attachments: | journalctl -t 'galera(galera)' -t rsyncd (attachment 1266594)
Description
Ondrej 2017-03-15 17:11:18 UTC
The logs available in collab-shell are still showing the same error:

    2017-02-17 16:41:14.042 4612 ERROR heat.engine.resource ResourceFailure: resources.CephStorage: resources[0]: Error: resources.UpdateDeployment: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 3

Hi,

so the last error during the upgrade is an "exit 1", not "exit 3". The exact error is a failure to restart the galera cluster in time[1]. At 9:31 we start the update and stop the cluster, and at 9:33 we try to restart it. We wait half an hour for galera to come back[2] and then give up with an "exit 1"[3]. We can see in the corosync log[4] that the promotion of the cluster fails at 9:38. At 10:45, the galera cluster was successfully "created"[5].

Did you do any form of cluster manipulation at 10:45 to bring the galera cluster back, like a "pcs resource cleanup galera" for instance?

We're going to pull in a galera specialist to look at those logs to better understand what could be the root cause of the issue.

[1]: "ERROR galera sync timed out" in sosreport-20170310-122136/overcloud-oscar1ctr03.localdomain/sos_commands/pacemaker/crm_report/overcloud-oscar1ctr03.localdomain/journal.log

[2]: from the same file, here is the clustercheck output (repeated ~360 times):

    Galera cluster node is not synced.
    HTTP/1.1 503 Service Unavailable
    Content-Type: text/plain
    Connection: close
    Content-Length: 36

[3]: https://github.com/openstack/tripleo-heat-templates/blob/stable/newton/extraconfig/tasks/yum_update.sh#L123-L130

[4]: in sosreport-20170310-122136/overcloud-oscar1ctr03.localdomain/var/log/cluster/corosync.log:

    Mar 10 09:38:48 [790616] overcloud-oscar1ctr03.localdomain crmd: error: process_lrm_event: Result of promote operation for galera on overcloud-oscar1ctr03: Timed Out | call=57 key=galera_promote_0 timeout=300000ms

[5]: from sosreport-20170310-122136/overcloud-oscar1ctr03.localdomain/var/log/mysqld.log:

    170310  9:31:44 [ERROR] mysqld: Got an error writing communication packets
    170310 09:33:36 mysqld_safe Number of processes running now: 0
    170310 09:33:36 mysqld_safe WSREP: not restarting wsrep node automatically
    170310 09:33:36 mysqld_safe mysqld from pid file /var/run/mysql/mysqld.pid ended
    170310 09:33:48 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
    170310 09:33:48 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.f5tFvN' --pid-file='/var/lib/mysql/overcloud-oscar1ctr03.localdomain-recover.pid'
    170310  9:33:48 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295
    170310 10:45:23 mysqld_safe WSREP: Recovered position 1da1081c-de49-11e6-a27e-6f95973e550b:22857127
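For reference, the check behind [3] amounts to a bounded wait on clustercheck. The following is a simplified sketch of that logic, not a verbatim quote of yum_update.sh; the 1800s value and the 5s polling interval are assumptions chosen to match the half-hour wait seen in the logs:

    # Bounded wait for the local galera node to report Synced. clustercheck
    # exits 0 (HTTP 200) once wsrep_local_state reaches 4 (Synced), and
    # non-zero (the HTTP 503 response quoted in [2]) until then.
    galera_sync_timeout=1800
    deadline=$(( $(date +%s) + galera_sync_timeout ))
    until clustercheck >/dev/null 2>&1; do
        if (( $(date +%s) >= deadline )); then
            echo "ERROR galera sync timed out" >&2
            exit 1    # surfaces as the deployment's "exit 1" described above
        fi
        sleep 5
    done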
Hi,

so we have confirmation that the command that looks up the sequence number needed to start galera took 1h12 (from log [5] above). To check what is happening there, could you run these commands on the oscar1ctr03 node as root.

Stop galera on the oscar1ctr03 node (the cluster will still be functional):

    pcs resource ban --wait galera overcloud-oscar1ctr03

Execute the command that took 1h12 in the failed update attempt:

    nohup mysqld_safe --wsrep-recover --pid-file=/var/run/mysql/mysqld.pid > /tmp/mysqld_safe_run-$(hostname).log

You should get this kind of output, and the command should last less than a minute:

    170316 18:20:55 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
    /bin/mysqld_safe: line 755: ulimit: -1: invalid option
    ulimit: usage: ulimit [-SHacdefilmnpqrstuvx] [limit]
    170316 18:20:55 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
    170316 18:20:55 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.c15tD9' --pid-file='/var/lib/mysql/overcloud-controller-0-recover.pid'
    170316 18:20:58 mysqld_safe WSREP: Recovered position ea7c7d3e-0a42-11e7-adbd-b3c69a709c8a:46938
    170316 18:21:01 mysqld_safe mysqld from pid file /var/run/mysql/mysqld.pid ended

This is the command that took 1h12 during the failed update, so we expect to get some log of what is happening there. It does nothing besides reading the recovery position, and after it finishes the mysql daemon will be stopped.

To put the galera cluster back:

    pcs resource clear galera --wait

For the moment we cannot tell why the command took so long. During it, the mysqld process is started and the database files are opened, but besides that nothing really happens. The usual suspects would be disk errors (but I cannot see anything in dmesg[1]) and swap. For swap, it is interesting to see that:

- the swap is a 9GB file in /root/tmp-swapfile, which is half full[2]
- mongodb is eating 14GB[3]

So let's see how long the command takes, and send back the /tmp/mysqld_safe_run-$(hostname).log so that we can debug further.

[1] sosreport-20170310-122136/overcloud-oscar1ctr03.localdomain/sos_commands/kernel/dmesg
[2] sosreport-20170310-122136/overcloud-oscar1ctr03.localdomain/sos_commands/memory/swapon_--bytes_--show
[3] sort -rn -k11,11 sosreport-20170310-122136/overcloud-oscar1ctr03.localdomain/sos_commands/process/ps_axo_flags_state_uid_pid_ppid_pgid_sid_cls_pri_addr_sz_wchan_stime_tty_time_cmd | head

Thanks to Michele and Damien for their help so far.

Regards,

Hi Ondrej,

hum, the last output looks like the one you put in comment 15, only shorter. Am I missing something? If the one from comment 15 is the right one, we can see here that the command didn't take 1h to complete. I've checked, and the only network-related operation that happens during this command is binding to port 3306 when the mysql server is started.

Can we investigate why the promotion is not working? The files /var/log/mysqld.log and /var/log/mariadb/mariadb.log and the output of journalctl -t 'galera(galera)' -t rsyncd would help here as a starter.

Thanks,
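To collect exactly that material in one pass, something like the following should work on the controller; the grep pattern is taken from the mysqld.log excerpt in [5], the output path is illustrative, and mariadb.log may not exist on every node (hence the 2>/dev/null):

    # Journal entries for the galera resource agent and rsyncd (used for SST):
    journalctl -t 'galera(galera)' -t rsyncd > /tmp/galera-journal-$(hostname).log
    # Recovered positions; the number after the colon is the seqno that the
    # galera resource agent compares across controllers when picking the
    # bootstrap node during promotion:
    grep 'WSREP: Recovered position' /var/log/mysqld.log /var/log/mariadb/mariadb.log 2>/dev/null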
Created attachment 1266594 [details]
journalctl -t 'galera(galera)' -t rsyncd
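Once the ban is cleared, sync state can also be confirmed directly instead of via clustercheck. A minimal sketch, assuming root's .my.cnf supplies the database credentials (the wsrep_* names are standard galera status variables):

    # All three checks should agree before the cluster is considered healthy:
    mysql -e "SHOW STATUS LIKE 'wsrep_local_state_comment';"  # expect: Synced
    mysql -e "SHOW STATUS LIKE 'wsrep_cluster_size';"         # expect: 3 on a 3-controller overcloud
    pcs status | grep -A 3 galera                             # all controllers listed as Masters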