Bug 1575620 - [UPDATE] Compute update exits with error code while executing step 5
Summary: [UPDATE] Compute update exits with error code while executing step 5
Keywords:
Status: CLOSED DUPLICATE of bug 1568714
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Jiri Stransky
QA Contact: Amit Ugol
URL:
Whiteboard:
Depends On: 1568714
Blocks: 1607143
 
Reported: 2018-05-07 12:54 UTC by Raviv Bar-Tal
Modified: 2023-02-22 23:02 UTC
CC List: 6 users

Fixed In Version: python-tripleoclient-9.2.1-4.el7ost openstack-tripleo-common-8.6.1-4.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1607143 (view as bug list)
Environment:
Last Closed: 2018-05-10 11:04:23 UTC
Target Upstream Version:
Embargoed:


Attachments
stack files (12.70 MB, application/x-xz), 2018-05-08 06:58 UTC, Raviv Bar-Tal
compute logs (11.44 MB, application/x-xz), 2018-05-08 07:23 UTC, Raviv Bar-Tal
undercloud sosreport part 1 (19.00 MB, application/x-xz), 2018-05-08 08:24 UTC, Raviv Bar-Tal
undercloud sosreport part 2 (15.21 MB, application/octet-stream), 2018-05-08 08:25 UTC, Raviv Bar-Tal
some errors from compute /var/log/messages (15.47 KB, text/plain), 2018-05-08 13:20 UTC, Marios Andreou

Description Raviv Bar-Tal 2018-05-07 12:54:51 UTC
Description of problem:
Compute node update fails and exits during step 5.

Version-Release number of selected component (if applicable):


How reproducible:
Run the overcloud_update_run-Compute.sh script.

Steps to Reproduce:

Actual results:
overcloud_update_run_Compute.log ends unexpectedly:
u'TASK [Check if /var/lib/docker-puppet/docker-puppet-tasks5.json exists] ********',
 u'ok: [192.168.24.10] => {"changed": false, "stat": {"exists": false}}']
(undercloud) [stack@undercloud-0 ~]$ less !$
less overcloud_update_run_Compute_new.log
(undercloud) [stack@undercloud-0 ~]$ tail -20 overcloud_update_run_Compute_new.log
 u'TASK [Debug output for task which failed: Run docker-puppet tasks (generate config) during step 5] ***',
 u'skipping: [192.168.24.15] => {"changed": false, "skip_reason": "Conditional result was False"}',
 u'skipping: [192.168.24.10] => {"changed": false, "skip_reason": "Conditional result was False"}',
 u'',
 u'TASK [Start containers for step 5] *********************************************',
 u'ok: [192.168.24.10] => {"censored": "the output has been hidden due to the fact that \'no_log: true\' was specified for this result", "changed": false}',
 u'ok: [192.168.24.15] => {"censored": "the output has been hidden due to the fact that \'no_log: true\' was specified for this result", "changed": false}',
 u'',
 u'TASK [Debug output for task which failed: Start containers for step 5] *********',
 u'ok: [192.168.24.15] => {',
 u'    "failed_when_result": false, ',
 u'    "outputs.stdout_lines|default([])|union(outputs.stderr_lines|default([]))": []',
 u'}',
 u'ok: [192.168.24.10] => {',
 u'    "failed_when_result": false, ',
 u'    "outputs.stdout_lines|default([])|union(outputs.stderr_lines|default([]))": []',
 u'}',
 u'',
 u'TASK [Check if /var/lib/docker-puppet/docker-puppet-tasks5.json exists] ********',
 u'ok: [192.168.24.10] => {"changed": false, "stat": {"exists": false}}']

Expected results:


Additional info:

Comment 1 Marios Andreou 2018-05-07 13:08:24 UTC
Can you please triage this? (From the triage call round robin.)

Comment 2 Raviv Bar-Tal 2018-05-08 06:58:51 UTC
Created attachment 1433023 [details]
stack files

Comment 3 Raviv Bar-Tal 2018-05-08 07:23:14 UTC
Created attachment 1433026 [details]
compute logs

Comment 4 Raviv Bar-Tal 2018-05-08 08:24:37 UTC
Created attachment 1433060 [details]
undercloud sosreport part 1

Comment 5 Raviv Bar-Tal 2018-05-08 08:25:31 UTC
Created attachment 1433061 [details]
undercloud sosreport part 2

Comment 6 Marios Andreou 2018-05-08 13:20:50 UTC
Created attachment 1433200 [details]
some errors from compute /var/log/messages

Having a look here for triage (IRC request via morazi). To be clear, since it isn't stated in comment #0: this is the OSP13 minor update process. The overcloud was deployed as follows (from the attached undercloud /home/stack/overcloud_deploy.sh):
        #!/bin/bash

        openstack overcloud deploy \
        --timeout 100 \
        --templates /usr/share/openstack-tripleo-heat-templates \
        --stack overcloud \
        --libvirt-type kvm \
        --ntp-server clock.redhat.com \
        -e /home/stack/virt/internal.yaml \
        -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
        -e /home/stack/virt/network/network-environment.yaml \
        -e /home/stack/virt/hostnames.yml \
        -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml \
        -e /home/stack/virt/nodes_data.yaml \
        -e /home/stack/virt/extra_templates.yaml \
        -e /home/stack/virt/docker-images.yaml \
        --log-file overcloud_deployment_63.log
        
The update was run as follows (from the attached undercloud /home/stack/overcloud_update_run-Compute.sh):

        #!/bin/env bash
        #
        # Run minor update on overcloud nodes
        #
        set -euo pipefail

        source /home/stack/stackrc


        echo "Running minor update all playbooks for Compute role"
        openstack overcloud update run \
                    --nodes Compute --playbook all 2>&1
             
Looking at the compute logs I see two possibly relevant issues (attached here for convenience). One is an error finding a container, so there may be a problem with the container images being used here (the ones passed into the update prepare command):

        warning msg="libcontainerd: failed to retrieve container bb23f544ee04047dc4a8d89ca9f9b75b898428119632ef53fa82605681e0ec63 
        and
        error msg="Handler for GET /v1.26/containers/06f92879af1a/json returned error: No such container: 06f92879af1a"$

But I think the root cause is more related to OVS/containers, because there is a long error trace after what looks like:

        dockerd-current[2143]: + /usr/bin/neutron-openvswitch-agent ...
        error : virNetSocketReadWire:1808 : End of file while reading data: Input/output error
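
For reference, a quick way to pull these errors out of the compute's /var/log/messages (a minimal sketch; the grep patterns simply match the messages quoted above):

        # On the compute node: find the docker errors about missing containers.
        sudo grep -E 'libcontainerd: failed to retrieve container|No such container' /var/log/messages

        # Show context around the error trace that follows the neutron-openvswitch-agent start.
        sudo grep -B 2 -A 10 'virNetSocketReadWire' /var/log/messages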

Comment 7 Jiri Stransky 2018-05-09 13:30:08 UTC
A couple of observations:

* We dropped some params on the `update prepare` command line. It seems the compute node lost its NTP configuration during the prepare: at one point in /var/log/messages it went "back in time" from May 7 to May 2. We need to keep providing the NTP param to the prepare command (see the sketch after this list).

* I can't see anything broken regarding containers on the compute. E.g. in the sosreport from compute, the `docker ps -a` output only has Running or Exited(0) containers, no failed or restarting ones.

* The OVS agent issue that Marios found indeed looks suspicious, but again I don't think it caused a container failure, so it shouldn't manifest as "failed step 5". Furthermore, the OVS container is started in step 4, which passed successfully.

* I think step 5 might have passed successfully too, and we just don't know about it because it was the client output that got broken. Notice that it doesn't say that ansible failed; the output is simply cut short.
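
For illustration, a minimal sketch of a prepare command that keeps the NTP parameter, reusing the environment files from the deploy script above (assuming, as in Queens, that `update prepare` accepts the same arguments as `overcloud deploy`; this is a sketch, not the exact command that was run):

        # Sketch: re-run update prepare keeping --ntp-server so the nodes stay in sync.
        openstack overcloud update prepare \
        --timeout 100 \
        --templates /usr/share/openstack-tripleo-heat-templates \
        --stack overcloud \
        --libvirt-type kvm \
        --ntp-server clock.redhat.com \
        -e /home/stack/virt/internal.yaml \
        -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
        -e /home/stack/virt/network/network-environment.yaml \
        -e /home/stack/virt/hostnames.yml \
        -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml \
        -e /home/stack/virt/nodes_data.yaml \
        -e /home/stack/virt/extra_templates.yaml \
        -e /home/stack/virt/docker-images.yaml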

Suggestions:

* If we still have the environment, please get /var/log/mistral from the undercloud separately, as the sosreport is not sufficient there (it only contains api.log, we need the others too, and that api.log doesn't open for me anyway). This would help us validate whether the mistral workflow indeed failed or whether we just reproduced the client output issue from bug 1568714. Also let's get the `openstack workflow execution list` output (see the sketch after this list).

* If we don't see a failed update workflow, or we don't have the env anymore, let's assume we just reproduced bug 1568714 and put this to MODIFIED or ON_QA with the same fixed-in-version. If we can still reproduce the same problem after that, let's do FailedQA.
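
A minimal sketch of the collection and validation steps suggested above (run on the undercloud; `openstack workflow execution list` is provided by the mistral client):

        # Grab the full mistral logs -- the sosreport only captured api.log.
        sudo tar -cJf mistral-logs.tar.xz /var/log/mistral

        # Check whether the minor update workflow actually failed or completed.
        source /home/stack/stackrc
        openstack workflow execution list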

Comment 8 Jiri Stransky 2018-05-09 14:15:55 UTC
Re the NTP parameter issue -- we do indeed need to persist it, but the root cause of the time flapping by 5 days was that this is a virtual environment, and likely the *host machine* had desynced time, so the virtual compute node got the wrong time on reboot (May 2) before it restored its time (back to May 7).


Re this bug -- we don't have the environment to collect additional info, so let's assume we hit bug 1568714 and re-test with the latest puddle, where this should be fixed.

Comment 9 Raviv Bar-Tal 2018-05-10 11:04:23 UTC

*** This bug has been marked as a duplicate of bug 1568714 ***

Comment 10 Raviv Bar-Tal 2018-05-10 11:25:31 UTC
In this bug the behaviour is the same as in bug 1568714 and it is solved by the same fix, so it is closed as a duplicate.

As for the NTP problem, it will have to be investigated separately, but from a quick check it looks like the "host machine" is in sync:
[root@seal02 ~]# timedatectl 
      Local time: Thu 2018-05-10 14:25:14 IDT
  Universal time: Thu 2018-05-10 11:25:14 UTC
        RTC time: Thu 2018-05-10 11:25:14
       Time zone: Asia/Jerusalem (IDT, +0300)
     NTP enabled: yes
NTP synchronized: yes
 RTC in local TZ: no
      DST active: yes
 Last DST change: DST began at
                  Fri 2018-03-23 01:59:59 IST
                  Fri 2018-03-23 03:00:00 IDT
 Next DST change: DST ends (the clock jumps one hour backwards) at
                  Sun 2018-10-28 01:59:59 IDT
                  Sun 2018-10-28 01:00:00 IST
[root@seal02 ~]#

