Bug 1575620

Summary: [UPDATE] Compute update exits with error code while executing step 5
Product: Red Hat OpenStack
Component: rhosp-director
Version: 13.0 (Queens)
Status: CLOSED DUPLICATE
Severity: unspecified
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Reporter: Raviv Bar-Tal <rbartal>
Assignee: Jiri Stransky <jstransk>
QA Contact: Amit Ugol <augol>
CC: dbecker, mandreou, mburns, mcornea, morazi, rbartal
Fixed In Version: python-tripleoclient-9.2.1-4.el7ost openstack-tripleo-common-8.6.1-4.el7ost
Doc Type: If docs needed, set a value
Clones: 1607143 (view as bug list)
Last Closed: 2018-05-10 11:04:23 UTC
Type: Bug
Bug Depends On: 1568714
Bug Blocks: 1607143
Attachments:
  stack files
  compute logs
  undercloud sosreport part 1
  undercloud sosreport part 2
  some errors from compute /var/log/messages

Description Raviv Bar-Tal 2018-05-07 12:54:51 UTC
Description of problem:
Compute node update fails and exits during step 5.

Version-Release number of selected component (if applicable):


How reproducible:
Run the overcloud_update_run-Compute.sh script.

Steps to Reproduce:
1.
2.
3.

Actual results:
overcloud_update_run_Compute.log exits unexpectedly:
u'TASK [Check if /var/lib/docker-puppet/docker-puppet-tasks5.json exists] ********',
 u'ok: [192.168.24.10] => {"changed": false, "stat": {"exists": false}}']
(undercloud) [stack@undercloud-0 ~]$ less !$
less overcloud_update_run_Compute_new.log
(undercloud) [stack@undercloud-0 ~]$ tail -20 overcloud_update_run_Compute_new.log
 u'TASK [Debug output for task which failed: Run docker-puppet tasks (generate config) during step 5] ***',
 u'skipping: [192.168.24.15] => {"changed": false, "skip_reason": "Conditional result was False"}',
 u'skipping: [192.168.24.10] => {"changed": false, "skip_reason": "Conditional result was False"}',
 u'',
 u'TASK [Start containers for step 5] *********************************************',
 u'ok: [192.168.24.10] => {"censored": "the output has been hidden due to the fact that \'no_log: true\' was specified for this result", "changed": false}',
 u'ok: [192.168.24.15] => {"censored": "the output has been hidden due to the fact that \'no_log: true\' was specified for this result", "changed": false}',
 u'',
 u'TASK [Debug output for task which failed: Start containers for step 5] *********',
 u'ok: [192.168.24.15] => {',
 u'    "failed_when_result": false, ',
 u'    "outputs.stdout_lines|default([])|union(outputs.stderr_lines|default([]))": []',
 u'}',
 u'ok: [192.168.24.10] => {',
 u'    "failed_when_result": false, ',
 u'    "outputs.stdout_lines|default([])|union(outputs.stderr_lines|default([]))": []',
 u'}',
 u'',
 u'TASK [Check if /var/lib/docker-puppet/docker-puppet-tasks5.json exists] ********',
 u'ok: [192.168.24.10] => {"changed": false, "stat": {"exists": false}}']

Expected results:


Additional info:

Comment 1 Marios Andreou 2018-05-07 13:08:24 UTC
Can you please triage this (from the triage call round robin)?

Comment 2 Raviv Bar-Tal 2018-05-08 06:58:51 UTC
Created attachment 1433023 [details]
stack files

Comment 3 Raviv Bar-Tal 2018-05-08 07:23:14 UTC
Created attachment 1433026 [details]
compute logs

Comment 4 Raviv Bar-Tal 2018-05-08 08:24:37 UTC
Created attachment 1433060 [details]
undercloud sosreport part 1

Comment 5 Raviv Bar-Tal 2018-05-08 08:25:31 UTC
Created attachment 1433061 [details]
undercloud sosreport part 2

Comment 6 Marios Andreou 2018-05-08 13:20:50 UTC
Created attachment 1433200 [details]
some errors from compute /var/log/messages

Having a look here for triage (IRC request via morazi). To be clear, since it isn't stated in comment #0: this is the OSP13 minor update process. The overcloud was deployed like this (from the attached undercloud /home/stack/overcloud_deploy.sh):
        #!/bin/bash

        openstack overcloud deploy \
        --timeout 100 \
        --templates /usr/share/openstack-tripleo-heat-templates \
        --stack overcloud \
        --libvirt-type kvm \
        --ntp-server clock.redhat.com \
        -e /home/stack/virt/internal.yaml \
        -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
        -e /home/stack/virt/network/network-environment.yaml \
        -e /home/stack/virt/hostnames.yml \
        -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml \
        -e /home/stack/virt/nodes_data.yaml \
        -e /home/stack/virt/extra_templates.yaml \
        -e /home/stack/virt/docker-images.yaml \
        --log-file overcloud_deployment_63.log
        
The update was run like this (from the attached undercloud /home/stack/overcloud_update_run-Compute.sh):

        #!/bin/env bash
        #
        # Run minor update on overcloud nodes
        #
        set -euo pipefail

        source /home/stack/stackrc


        echo "Running minor update all playbooks for Compute role"
        openstack overcloud update run \
                    --nodes Compute --playbook all 2>&1
             
Looking at the compute logs, I see two possibly relevant issues (attached here for convenience). One is an error finding a container, so there may be a problem with the container images being used here (the ones that were passed into the update prepare command):

        warning msg="libcontainerd: failed to retrieve container bb23f544ee04047dc4a8d89ca9f9b75b898428119632ef53fa82605681e0ec63 
        and
        error msg="Handler for GET /v1.26/containers/06f92879af1a/json returned error: No such container: 06f92879af1a"$

But I think the root cause is more related to OVS/containers, because there is a long error trace after what looks like:

        dockerd-current[2143]: + /usr/bin/neutron-openvswitch-agent ...
        error : virNetSocketReadWire:1808 : End of file while reading data: Input/output error
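
For anyone re-checking these, a quick way to pull the same errors back out of the attached compute logs (grep patterns taken from the messages quoted above; the path assumes an extracted sosreport, so adjust as needed):

        # search the compute's syslog for the docker container lookup failures
        grep -E 'failed to retrieve container|No such container' var/log/messages
        # and for the socket error near the OVS agent trace
        grep -n 'virNetSocketReadWire' var/log/messages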

Comment 7 Jiri Stransky 2018-05-09 13:30:08 UTC
A couple of observations:

* We dropped some params on the `update prepare` command line. It seems the compute node lost its NTP configuration during the prepare: at one point in /var/log/messages, it went "back in time" from May 7 to May 2. We need to keep providing the NTP param to the prepare command (a sketch of such an invocation follows this list).

* I can't see anything broken regarding containers on the compute. E.g. in the sosreport from compute, the `docker ps -a` output only has Running or Exited(0) containers, no failed or restarting ones.

* The OVS agent issue that Marios found does look suspicious, but again I don't think it caused a container failure, so it shouldn't manifest as "failed step 5". Furthermore, the OVS container is started in step 4, which passed successfully.

* I think step 5 might have passed successfully too, and we just don't know about it because it was the client output that broke. Notice that it doesn't say Ansible failed; the output is simply cut short.
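
To the first point above, a minimal sketch of a prepare invocation that keeps the NTP setting -- assuming the Queens `openstack overcloud update prepare` accepts the same CLI options as `overcloud deploy`, and reusing the environment files from overcloud_deploy.sh quoted in comment #6:

        # sketch only: carry over all the -e files from the original deploy,
        # and in particular keep --ntp-server so nodes don't lose time sync
        openstack overcloud update prepare \
            --templates /usr/share/openstack-tripleo-heat-templates \
            --stack overcloud \
            --ntp-server clock.redhat.com \
            -e /home/stack/virt/internal.yaml \
            -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
            -e /home/stack/virt/network/network-environment.yaml \
            -e /home/stack/virt/hostnames.yml \
            -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml \
            -e /home/stack/virt/nodes_data.yaml \
            -e /home/stack/virt/extra_templates.yaml \
            -e /home/stack/virt/docker-images.yaml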

Suggestions:

* If we still have the environment, please get /var/log/mistral from the undercloud separately, as the sosreport is not sufficient there (it only contains api.log, we need the others too, and the api.log there doesn't open for me anyway). This could help us validate whether the mistral workflow indeed failed or whether we just reproduced the client output issue from bug 1568714. Also let's get `openstack workflow execution list` output (a collection sketch follows this list).

* If we don't see a failed update workflow, or we no longer have the env, let's assume we just reproduced bug 1568714 and move this to MODIFIED or ON_QA with the same Fixed In Version. If we can still reproduce the same problem after that, let's mark it FailedQA.
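
For the log collection above, something along these lines should suffice on the undercloud (standard log paths assumed):

        # run as the stack user on the undercloud
        source /home/stack/stackrc
        # check whether the update workflow execution actually failed
        openstack workflow execution list > workflow_executions.txt
        # grab all of /var/log/mistral, not just api.log
        sudo tar czf mistral_logs.tar.gz /var/log/mistral/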

Comment 8 Jiri Stransky 2018-05-09 14:15:55 UTC
Re the NTP parameter issue -- we do indeed need to persist it, but the root cause of the five-day time jump was that this is a virtual environment: the *host machine* likely had desynced time, so the virtual compute node got the wrong time on reboot (May 2) before its time was restored (back to May 7).


Re this bug -- we don't have the environment to collect additional info, so let's assume we hit bug 1568714 and re-test with the latest puddle, where this should be fixed.

Comment 9 Raviv Bar-Tal 2018-05-10 11:04:23 UTC

*** This bug has been marked as a duplicate of bug 1568714 ***

Comment 10 Raviv Bar-Tal 2018-05-10 11:25:31 UTC
In this bug the behaviour is the same as in bug 1568714 and it is solved by the same fix, so it is closed as a duplicate.

As for the NTP problem, it will have to be investigated separately, but from a quick check, it looks like the "host machine" is in sync:
[root@seal02 ~]# timedatectl 
      Local time: Thu 2018-05-10 14:25:14 IDT
  Universal time: Thu 2018-05-10 11:25:14 UTC
        RTC time: Thu 2018-05-10 11:25:14
       Time zone: Asia/Jerusalem (IDT, +0300)
     NTP enabled: yes
NTP synchronized: yes
 RTC in local TZ: no
      DST active: yes
 Last DST change: DST began at
                  Fri 2018-03-23 01:59:59 IST
                  Fri 2018-03-23 03:00:00 IDT
 Next DST change: DST ends (the clock jumps one hour backwards) at
                  Sun 2018-10-28 01:59:59 IDT
                  Sun 2018-10-28 01:00:00 IST
[root@seal02 ~]#