Bug 1432571
| Summary: | [OSP10] after failed openstack overcloud upgrade galera takes >1h to start | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Ondrej <ochalups> |
| Component: | rhosp-director | Assignee: | Sofer Athlan-Guyot <sathlang> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Amit Ugol <augol> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 10.0 (Newton) | CC: | cpaquin, dbecker, dciabrin, fdinitto, lbezdick, mbayer, mburns, michele, morazi, mschuppe, ochalups, pablo.iranzo, rhel-osp-director-maint, rlopez, sathlang, sbaker, shardy, srevivo |
| Target Milestone: | --- | Keywords: | Triaged, ZStream |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-05-30 07:49:23 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | journalctl -t 'galera(galera)' -t rsyncd (attachment 1266594) | | |
Description by Ondrej, 2017-03-15 17:11:18 UTC
The logs available in collab-shell are still showing the same error:

2017-02-17 16:41:14.042 4612 ERROR heat.engine.resource ResourceFailure: resources.CephStorage: resources[0]: Error: resources.UpdateDeployment: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 3

Hi,
so the last error during the upgrade is an "exit 1", not an "exit 3".
The exact error is a failure to restart the galera cluster[1] in time.
At 9:31 we start the update and stop the cluster, and at 9:33 we try
to restart it. We wait half an hour for galera to come back[2] and
then give up with an "exit 1"[3].
We can see that the promotion of the cluster fails in the corosync
log[4] at 9:38.
At 10:45, the galera cluster was successfully "created"[5].
Did you do any form of cluster manipulation to bring the galera cluster
back, like a "pcs resource cleanup galera" for instance, at around 10:45?
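For reference, a manual intervention of that sort would typically look like the sketch below; the exact commands run on the node are an assumption on our side:

```
# Clear the resource's failure history so pacemaker retries galera
pcs resource cleanup galera

# Then watch the cluster until a galera master is promoted again
pcs status
```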
We're going to pull in a galera specialist to look at those logs to
better understand the root cause of the issue.
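For context, the 30-minute wait and the final "exit 1" come from the galera sync loop in yum_update.sh (linked in [3] below). Paraphrased rather than copied verbatim, it behaves like this:

```
# Paraphrase of the galera sync wait in yum_update.sh (see [3] for the
# actual code): poll clustercheck until the local node reports synced,
# and give up with exit 1 after galera_sync_timeout seconds.
galera_sync_timeout=1800   # the 30 minutes observed here
tstart=$(date +%s)
while ! clustercheck >/dev/null 2>&1; do
    sleep 5
    tnow=$(date +%s)
    if (( tnow - tstart > galera_sync_timeout )); then
        echo "ERROR galera sync timed out"
        exit 1
    fi
done
```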
[1]: "ERROR galera sync timed out" from sosreport-20170310-122136/overcloud-oscar1ctr03.localdomain/sos_commands/pacemaker/crm_report/overcloud-oscar1ctr03.localdomain/journal.log
[2]: from the same file, the clustercheck output (repeated ~360 times):
Galera cluster node is not synced.
HTTP/1.1 503 Service Unavailable
Content-Type: text/plain
Connection: close
Content-Length: 36
[3]: https://github.com/openstack/tripleo-heat-templates/blob/stable/newton/extraconfig/tasks/yum_update.sh#L123-L130
[4]: in sosreport-20170310-122136/overcloud-oscar1ctr03.localdomain/var/log/cluster/corosync.log:
Mar 10 09:38:48 [790616] overcloud-oscar1ctr03.localdomain crmd: error: process_lrm_event: Result of promote operation for galera on overcloud-oscar1ctr03: Timed Out | call=57 key=galera_promote_0 timeout=300000ms
[5]: from sosreport-20170310-122136/overcloud-oscar1ctr03.localdomain/var/log/mysqld.log
170310 9:31:44 [ERROR] mysqld: Got an error writing communication packets
170310 09:33:36 mysqld_safe Number of processes running now: 0
170310 09:33:36 mysqld_safe WSREP: not restarting wsrep node automatically
170310 09:33:36 mysqld_safe mysqld from pid file /var/run/mysql/mysqld.pid ended
170310 09:33:48 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
170310 09:33:48 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.f5tFvN' --pid-file='/var/lib/mysql/overcloud-oscar1ctr03.localdomain-recover.pid'
170310 9:33:48 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295
170310 10:45:23 mysqld_safe WSREP: Recovered position 1da1081c-de49-11e6-a27e-6f95973e550b:22857127
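The 1h12 figure comes straight from the two timestamps above (recovery starts at 09:33:48, the position is finally reported at 10:45:23); a quick sanity check of the arithmetic with GNU date:

```
start="2017-03-10 09:33:48"   # mysqld_safe starts position recovery
end="2017-03-10 10:45:23"     # recovered position finally reported
echo "$(( ($(date -d "$end" +%s) - $(date -d "$start" +%s)) / 60 )) minutes"
# -> 71 minutes, i.e. ~1h12
```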
Hi,
so we have confirmation that the command that looks up the sequence
number for starting galera took 1h12 (from log[5] above).
To check what is happening there, could you run these commands on oscar1ctr03 as root:
1. Stop galera on the oscar1ctr03 node (the cluster will remain
functional):
pcs resource ban --wait galera overcloud-oscar1ctr03
2. Execute the command that took 1h12 in the failed update attempt:
nohup mysqld_safe --wsrep-recover --pid-file=/var/run/mysql/mysqld.pid > /tmp/mysqld_safe_run-$(hostname).log
You should see this kind of output, and the command should take
less than a minute:
170316 18:20:55 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
/bin/mysqld_safe: line 755: ulimit: -1: invalid option
ulimit: usage: ulimit [-SHacdefilmnpqrstuvx] [limit]
170316 18:20:55 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
170316 18:20:55 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.c15tD9' --pid-file='/var/lib/mysql/overcloud-controller-0-recover.pid'
170316 18:20:58 mysqld_safe WSREP: Recovered position ea7c7d3e-0a42-11e7-adbd-b3c69a709c8a:46938
170316 18:21:01 mysqld_safe mysqld from pid file /var/run/mysql/mysqld.pid ended
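One caveat about the command in step 2 as written: the redirection only captures stdout. A variant that also captures stderr and records the duration (our suggestion, not part of the original instructions):

```
# Same recovery run, but keep stderr too and time the whole thing
time nohup mysqld_safe --wsrep-recover \
    --pid-file=/var/run/mysql/mysqld.pid \
    > /tmp/mysqld_safe_run-$(hostname).log 2>&1
```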
This is the command that took 1h12 during the failed update, so we
expect to get some log of what is happening there. It does nothing
besides reading the recovery position, and once it finishes, the mysql
daemon will be stopped.
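For context, this --wsrep-recover pass mirrors what the galera resource agent does at promotion time to find each node's last committed seqno. Once the run finishes, the position can be pulled out of the log like this (a sketch against the log path used above):

```
# Extract the recovered GTID (uuid:seqno) from the recovery log
grep 'Recovered position' /tmp/mysqld_safe_run-$(hostname).log | awk '{print $NF}'
# e.g. 1da1081c-de49-11e6-a27e-6f95973e550b:22857127
```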
3. To put the galera cluster back:
pcs resource clear galera --wait
For the moment we cannot tell why the command took so long. While it
runs, the mysqld process is started and the database files are opened,
but besides that nothing much happens. The usual suspects would be disk
errors, but I cannot see anything in dmesg[1], and swap.
For swap it's interesting to see (both points can be re-checked live,
see the sketch below):
- the swap is a 9GB file in /root/tmp-swapfile, which is half full[2]
- mongodb is eating 14GB[3]
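Both observations can be re-checked on the live node with commands along the lines of what the sosreport captured ([2] and [3] below):

```
# Swap devices/files and how full they are (cf. [2])
swapon --bytes --show

# Largest processes by resident memory (cf. [3]; mongodb showed ~14GB)
ps axo pid,rss,cmd --sort=-rss | head
```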
So let's see how long the command takes, and send back the
/tmp/mysqld_safe_run-$(hostname).log so that we can debug further.
[1] sosreport-20170310-122136/overcloud-oscar1ctr03.localdomain/sos_commands/kernel/dmesg
[2] sosreport-20170310-122136/overcloud-oscar1ctr03.localdomain/sos_commands/memory/swapon_--bytes_--show
[3] sort -rn -k11,11 sosreport-20170310-122136/overcloud-oscar1ctr03.localdomain/sos_commands/process/ps_axo_flags_state_uid_pid_ppid_pgid_sid_cls_pri_addr_sz_wchan_stime_tty_time_cmd | head
Thanks to Michele and Damien for their help so far.
Regards,
Hi Ondrej,
hmm, the last output looks like the one you put in comment 15, only shorter. Am I missing something? If the one from comment 15 is the right one, we can see here that the command did not take 1h to complete. I've checked, and the only network-related operation that happens during this command is binding to port 3306 when the mysql server is started. Let's investigate why the promotion is not working. The files /var/log/mysqld.log and /var/log/mariadb/mariadb.log and the output of journalctl -t 'galera(galera)' -t rsyncd would help here as a starter.
Thanks,

Created attachment 1266594 [details]
journalctl -t 'galera(galera)' -t rsyncd
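As a side note while the promotion failure is investigated: the 300000ms that timed out in the corosync log above is the promote operation timeout configured on the galera resource. It can be inspected and, as a stopgap only (an assumption on our side: this papers over the slow recovery rather than fixing it), raised:

```
# Show the galera resource configuration, including op timeouts
pcs resource show galera

# Raise the promote timeout, e.g. to 20 minutes
pcs resource update galera op promote timeout=1200s
```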