Description of problem:
We troubleshot several other issues and bugs along the way and ended up with the update always hanging on one of the controller nodes.

1) The overcloud package update was failing with yum_update.sh erroring out (BZ 1416228):

Error: resources.UpdateDeployment: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 3

- The script was failing because 'pacemaker_status=$(systemctl is-active pacemaker)' also runs on compute nodes, where systemd returns 3 since there is no pacemaker on the computes.
- Fix on compute_node:/var/lib/heat-config/deployed/b00c2a7c-defe-442f-9c91-8d47a5d9cac5.json:

$ diff -u b00c2a7c-defe-442f-9c91-8d47a5d9cac5.org b00c2a7c-defe-442f-9c91-8d47a5d9cac5
--- b00c2a7c-defe-442f-9c91-8d47a5d9cac5.org	2017-02-24 04:58:56.880000000 -0500
+++ b00c2a7c-defe-442f-9c91-8d47a5d9cac5	2017-02-24 05:00:31.180000000 -0500
@@ -365,7 +365,7 @@
 exit 0
 fi
 
-pacemaker_status=$(systemctl is-active pacemaker)
+pacemaker_status=$(systemctl is-active pacemaker || :)
 
 # Fix the redis/rabbit resource start/stop timeouts. See https://bugs.launchpad.net/tripleo/+bug/1633455
 # and https://bugs.launchpad.net/tripleo/+bug/1634851

- This worked locally.
- Provided hotfix openstack-tripleo-heat-templates-5.2.0-5.el7ost-hotfix.tar, which guards the check instead:

<snip>
pacemaker_status=""
if hiera -c /etc/puppet/hiera.yaml service_names | grep -q pacemaker; then
    pacemaker_status=$(systemctl is-active pacemaker)
fi
</snip>

2) We found out that the template is not updated in the undercloud swift container that is supplied to the overcloud nodes.
- Fix: openstack overcloud deploy --update-plan-only ... <your environment options as used at deploy time>

3) The update kept erroring with "pre-update hook is not defined":

$ openstack overcloud update stack -i overcloud
starting package update on stack overcloud
WAITING
on_breakpoint: [u'overcloud-oscar1com03', u'overcloud-oscar1com06', u'overcloud-oscar1ctr01', u'overcloud-oscar1com02', u'overcloud-oscar1com01', u'overcloud-oscar1ctr02', u'overcloud-oscar1cep02', u'overcloud-oscar1cep01', u'overcloud-oscar1com04', u'overcloud-oscar1ctr03', u'overcloud-oscar1com05']
Breakpoint reached, continue? Regexp or Enter=proceed (will clear 39b815f7-3e11-4985-8755-d98fa9cee117), no=cancel update, C-c=quit interactive mode:
failed to remove breakpoint on overcloud-oscar1com05: ERROR: The "pre-update" hook is not defined on SoftwareDeployment "UpdateDeployment" [39b815f7-3e11-4985-8755-d98fa9cee117] Stack "overcloud-Compute-tgkbzaqcyn42-4-3rqvymqchyht" [7bcf2ac6-d672-479b-bc2d-bf7574743eb0]
FAILED
update finished with status FAILED
Stack update failed.

- Fix: change line 469 in /usr/lib/python2.7/site-packages/heat/engine/resource.py as follows:

# diff -u /usr/lib/python2.7/site-packages/heat/engine/resource.py.org /usr/lib/python2.7/site-packages/heat/engine/resource.py
--- /usr/lib/python2.7/site-packages/heat/engine/resource.py.org	2017-03-08 04:00:21.046912939 -0500
+++ /usr/lib/python2.7/site-packages/heat/engine/resource.py	2017-03-08 04:00:34.852912939 -0500
@@ -466,7 +466,7 @@
             self.trigger_hook(hook)
             LOG.info(_LI('Reached hook on %s'), six.text_type(self))
 
-            while self.has_hook(hook) and self.status != self.FAILED:
+            while self.has_hook(hook):
                 try:
                     yield
                 except BaseException as exc:

4) The Compute5 update timed out.
- Fix: systemctl restart os-collect-config

5) The Controller03 update failed.
- The node had been successfully updated a day earlier (9 March).
- We found the galera and pacemaker resources in the wrong state.
- Fix: pcs resource cleanup

6) The Controller03 update hung and failed.
- It seems the node goes through the update fine and its resource status is updated to UPDATE_COMPLETE, but the update does not continue to the rest of the nodes.

#############################################################################
from heat resource-list -n5 overcloud
| 0 | b4f757b2-d7eb-44b1-9574-0b6706c12104 | OS::TripleO::Controller | UPDATE_FAILED   | 2017-03-10T14:50:38Z | overcloud-Controller-ga3365oyszje |
| 2 | 11f6b434-2cb9-468b-9853-ededc08355ff | OS::TripleO::Controller | UPDATE_COMPLETE | 2017-03-10T14:50:39Z | overcloud-Controller-ga3365oyszje |
| 1 | cac72b7d-e042-428d-a2f7-520569839c73 | OS::TripleO::Controller | UPDATE_FAILED   | 2017-03-10T14:50:41Z | overcloud-Controller-ga3365oyszje |
#############################################################################
heat-engine.log
2017-03-10 16:00:22.401 4530 INFO heat.engine.resource [req-41a4c73f-da6f-4dfd-a15c-d298cf8ca936 - 3e1c2a4813c24e23bcab940b9cdf8fc6 - - -] Clearing pre-update hook on SoftwareDeployment "UpdateDeployment" [626cb5b4-f9da-4c34-bda5-569bf15dfb68] Stack "overcloud-Controller-ga3365oyszje-2-siicr4iazlzq" [11f6b434-2cb9-468b-9853-ededc08355ff]
2017-03-10 16:02:02.254 4528 INFO heat.engine.resource [req-d7a09136-4914-492a-aea7-d8c32af5d5f0 - - - - -] deleting SoftwareDeployment "UpdateDeployment" [626cb5b4-f9da-4c34-bda5-569bf15dfb68] Stack "overcloud-Controller-ga3365oyszje-2-siicr4iazlzq*" [068c2982-ad75-475d-8180-4dec0f744657]
....
2017-03-10 16:02:03.670 4528 DEBUG heat.engine.stack [req-d7a09136-4914-492a-aea7-d8c32af5d5f0 - - - - -] Deleting backup stack update_task /usr/lib/python2.7/site-packages/heat/engine/stack.py:1503
2017-03-10 16:02:03.692 4528 INFO heat.engine.stack [req-d7a09136-4914-492a-aea7-d8c32af5d5f0 - - - - -] Stack DELETE IN_PROGRESS (overcloud-Controller-ga3365oyszje-2-siicr4iazlzq*): Stack DELETE started
2017-03-10 16:02:13.784 4528 INFO heat.engine.stack [req-d7a09136-4914-492a-aea7-d8c32af5d5f0 - - - - -] Stack DELETE COMPLETE (overcloud-Controller-ga3365oyszje-2-siicr4iazlzq*): Stack DELETE completed successfully
2017-03-10 16:02:13.825 4528 INFO heat.engine.stack [req-d7a09136-4914-492a-aea7-d8c32af5d5f0 - - - - -] Stack UPDATE COMPLETE (overcloud-Controller-ga3365oyszje-2-siicr4iazlzq): Stack UPDATE completed successfully
2017-03-10 16:02:13.880 4528 DEBUG heat.engine.scheduler [req-d7a09136-4914-492a-aea7-d8c32af5d5f0 - - - - -] Task update_task from Stack "overcloud-Controller-ga3365oyszje-2-siicr4iazlzq" [11f6b434-2cb9-468b-9853-ededc08355ff] complete step /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:222
2017-03-10 16:02:13.882 4528 DEBUG heat.engine.stack_lock [req-d7a09136-4914-492a-aea7-d8c32af5d5f0 - - - - -] Engine d2a40ac7-1b1d-493d-96e8-31c8fa9f8894 released lock on stack 11f6b434-2cb9-468b-9853-ededc08355ff release /usr/lib/python2.7/site-packages/heat/engine/stack_lock.py:124

7) An update without the -i interactive mode failed too:

$ openstack overcloud update stack overcloud
starting package update on stack overcloud
stack overcloud status: FAILED

8) Currently "openstack overcloud update" is stuck on overcloud-oscar1ctr02.
- The observation from the current failures is that the update is always stuck after updating one of the controller nodes: after a node is updated, the process does not proceed and times out. (A sketch for checking whether a pre-update hook is still pending on a nested stack is included after "Additional info" at the end of this report.)
$ openstack overcloud update stack -i overcloud
starting package update on stack overcloud
WAITING
on_breakpoint: [u'overcloud-oscar1cep01', u'overcloud-oscar1com06', u'overcloud-oscar1com02', u'overcloud-oscar1ctr01', u'overcloud-oscar1com03', u'overcloud-oscar1com05', u'overcloud-oscar1com01', u'overcloud-oscar1cep02', u'overcloud-oscar1ctr02', u'overcloud-oscar1com04', u'overcloud-oscar1ctr03']
Breakpoint reached, continue? Regexp or Enter=proceed (will clear 626cb5b4-f9da-4c34-bda5-569bf15dfb68), no=cancel update, C-c=quit interactive mode:
IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS IN_PROGRESS
FAILED
update finished with status FAILED
Stack update failed.

Version-Release number of selected component (if applicable):
heat-cfntools-1.3.0-2.el7ost.noarch
openstack-heat-api-7.0.1-2.el7ost.noarch
openstack-heat-api-cfn-7.0.1-2.el7ost.noarch
openstack-heat-common-7.0.1-2.el7ost.noarch
openstack-heat-engine-7.0.1-2.el7ost.noarch
openstack-heat-templates-0-0.10.1e6015dgit.el7ost.noarch
openstack-tripleo-heat-templates-5.2.0-5.el7ost.noarch
openstack-tripleo-heat-templates-compat-2.0.0-41.el7ost.noarch
puppet-heat-9.4.1-2.el7ost.noarch
python-heat-agent-0-0.10.1e6015dgit.el7ost.noarch
python-heat-tests-7.0.1-2.el7ost.noarch
python-heatclient-1.5.0-1.el7ost.noarch

How reproducible:
always

Steps to Reproduce:
1.
2.
3.

Actual results:
"openstack overcloud update stack -i overcloud" fails

Expected results:
overcloud updated successfully

Additional info:
logs available in collab-shell /cases/01794388
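For reference (not part of the original report): assuming the heat CLI shipped with python-heatclient 1.5.0 still provides the hook-poll and hook-clear subcommands, the pending pre-update hooks mentioned in items 3, 6 and 8 can be inspected and, if needed, cleared by hand. The nested stack and resource names below are taken from the heat-engine.log excerpt in item 6 and will differ in other runs:

# list resources in the overcloud tree that are still paused on a hook; depth 5
# matches the "heat resource-list -n5 overcloud" output shown in item 6
heat hook-poll overcloud --nested-depth 5

# clear any pending hook on the UpdateDeployment resource of one controller's
# nested stack (substitute the stack/resource reported by hook-poll)
heat hook-clear overcloud-Controller-ga3365oyszje-2-siicr4iazlzq UpdateDeployment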
The logs available in collab-shell are still showing the same error:

2017-02-17 16:41:14.042 4612 ERROR heat.engine.resource ResourceFailure: resources.CephStorage: resources[0]: Error: resources.UpdateDeployment: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 3
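Since this is the same "status code: 3" failure as item 1, a quick hypothetical check (the deployed-config path comes from the description above; the JSON file names differ per node) is to grep the software configs actually delivered to the node:

# deployed configs that contain the pacemaker check at all (old or hotfixed yum_update.sh)
grep -Fl 'systemctl is-active pacemaker' /var/lib/heat-config/deployed/*.json

# only the hotfixed script also carries the hiera guard
grep -Fl 'service_names | grep -q pacemaker' /var/lib/heat-config/deployed/*.json

If the first command matches files that the second does not, the node is still being handed the pre-hotfix script, which would point back at the swift plan not having been refreshed (item 2).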
Hi,

so the last error during the upgrade is an "exit 1", not "exit 3". The exact error is a failure to restart the galera cluster[1] in time. At 9:31 we start the update and stop the cluster, and at 9:33 we try to restart it. We wait half an hour for galera to come back[2] and then give up with an "exit 1"[3].

We can see that the promotion of the cluster fails in the corosync log[4] at 9:38. At 10:45, the galera cluster was successfully "created"[5]. Did you do any form of cluster manipulation to get the galera cluster back, like a "pcs resource cleanup galera" for instance, at 10:45?

We're going to pull in a galera specialist to look at those logs to better understand what could be the root cause of the issue.

[1]: "ERROR galera sync timed out" from sosreport-20170310-122136/overcloud-oscar1ctr03.localdomain/sos_commands/pacemaker/crm_report/overcloud-oscar1ctr03.localdomain/journal.log

[2]: from the same file, here is the clustercheck output (~360 times):
Galera cluster node is not synced.
HTTP/1.1 503 Service Unavailable
Content-Type: text/plain
Connection: close
Content-Length: 36

[3]: https://github.com/openstack/tripleo-heat-templates/blob/stable/newton/extraconfig/tasks/yum_update.sh#L123-L130

[4]: in sosreport-20170310-122136/overcloud-oscar1ctr03.localdomain/var/log/cluster/corosync.log:
Mar 10 09:38:48 [790616] overcloud-oscar1ctr03.localdomain crmd: error: process_lrm_event: Result of promote operation for galera on overcloud-oscar1ctr03: Timed Out | call=57 key=galera_promote_0 timeout=300000ms

[5]: from sosreport-20170310-122136/overcloud-oscar1ctr03.localdomain/var/log/mysqld.log:
170310  9:31:44 [ERROR] mysqld: Got an error writing communication packets
170310 09:33:36 mysqld_safe Number of processes running now: 0
170310 09:33:36 mysqld_safe WSREP: not restarting wsrep node automatically
170310 09:33:36 mysqld_safe mysqld from pid file /var/run/mysql/mysqld.pid ended
170310 09:33:48 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
170310 09:33:48 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.f5tFvN' --pid-file='/var/lib/mysql/overcloud-oscar1ctr03.localdomain-recover.pid'
170310  9:33:48 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295
170310 10:45:23 mysqld_safe WSREP: Recovered position 1da1081c-de49-11e6-a27e-6f95973e550b:22857127
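For context, the check behind [3] is essentially a "wait until clustercheck reports the node synced, then give up" loop. The following is only a rough sketch of that pattern, not the literal stable/newton code; the timeout and sleep values are assumptions inferred from the ~360 clustercheck attempts over roughly half an hour:

# rough sketch of the galera sync wait that ends with "ERROR galera sync timed out"
timeout_seconds=1800                              # assumed: ~30 minutes, as observed
deadline=$(( $(date +%s) + timeout_seconds ))
# clustercheck answers "HTTP/1.1 503 Service Unavailable" while the node is not synced
until clustercheck 2>/dev/null | grep -q '200 OK'; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
        echo "ERROR galera sync timed out"
        exit 1
    fi
    sleep 5                                       # assumed interval (~360 tries in 30 min)
done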
Hi,

so we have confirmation that the command that looks up the sequence number for starting galera took 1h12 (from log [5] above). To check what is happening there, could you run the following commands on oscar1ctr03 as root?

Stop galera on the oscar1ctr03 node (the cluster will still be functional):

pcs resource ban --wait galera overcloud-oscar1ctr03

Execute the command that took 1h12 in the failed update attempt:

nohup mysqld_safe --wsrep-recover --pid-file=/var/run/mysql/mysqld.pid > /tmp/mysqld_safe_run-$(hostname).log

You should get this kind of output, and the command should last less than a minute:

170316 18:20:55 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
/bin/mysqld_safe: line 755: ulimit: -1: invalid option
ulimit: usage: ulimit [-SHacdefilmnpqrstuvx] [limit]
170316 18:20:55 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
170316 18:20:55 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.c15tD9' --pid-file='/var/lib/mysql/overcloud-controller-0-recover.pid'
170316 18:20:58 mysqld_safe WSREP: Recovered position ea7c7d3e-0a42-11e7-adbd-b3c69a709c8a:46938
170316 18:21:01 mysqld_safe mysqld from pid file /var/run/mysql/mysqld.pid ended

This is the command that took 1h12 during the failed update, so we expect to get some log of what is happening there. It does nothing besides reading the recovery position, and after it finishes the mysql daemon will be stopped.

To put the galera cluster back:

pcs resource clear galera --wait

For the moment we cannot tell why the command took so long. While it runs, the mysqld process is started and the database files are opened, but besides that nothing much happens. The usual suspects would be a disk error (but I cannot see anything in dmesg[1]) and swap. For swap it is interesting to see that:
- the swap is a 9GB file in /root/tmp-swapfile, which is half full[2]
- mongodb is eating 14GB[3]

So let's see how long the command takes, and send back /tmp/mysqld_safe_run-$(hostname).log so that we can debug further.

[1] sosreport-20170310-122136/overcloud-oscar1ctr03.localdomain/sos_commands/kernel/dmesg
[2] sosreport-20170310-122136/overcloud-oscar1ctr03.localdomain/sos_commands/memory/swapon_--bytes_--show
[3] sort -rn -k11,11 sosreport-20170310-122136/overcloud-oscar1ctr03.localdomain/sos_commands/process/ps_axo_flags_state_uid_pid_ppid_pgid_sid_cls_pri_addr_sz_wchan_stime_tty_time_cmd | head

Thanks to Michele and Damien for their help so far.

Regards,
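Purely as an optional extra (not part of the procedure above), a few quick checks for the suspects mentioned, swap usage, memory pressure and disk errors, that can be run on oscar1ctr03 while the recovery command is running:

# current usage of the /root/tmp-swapfile swap
swapon --bytes --show
# processes holding the most resident memory (mongodb was around 14GB in the sosreport)
ps -eo pid,rss,comm --sort=-rss | head
# any recent kernel-reported disk/I/O errors
dmesg -T | grep -i error | tail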
Hi Ondrej, hum the last output look a like the on you put in comment 15 but only shorter. Am it missing something ? If the one from comment 15 is the right one, we can see here that the command didn't take 1h to complete. I've checked and the only network related execution that happen during this command is binding to port 3306 when the mysql server is started. Can we investigate as why the promotion is not working. The files /var/log/mysqld.log and /var/log/mariadb/mariadb.log and the output of journalctl -t 'galera(galera)' -t rsyncd would help here as a starter. Thanks,
Created attachment 1266594 [details] journalctl -t 'galera(galera)' -t rsyncd