Bug 1607143

Summary: [UPDATE] Compute and Controller update exit with error code while executing step 5
Product: Red Hat OpenStack Reporter: Raviv Bar-Tal <rbartal>
Component: openstack-tripleo-common    Assignee: Jiri Stransky <jstransk>
Status: CLOSED ERRATA QA Contact: Raviv Bar-Tal <rbartal>
Severity: high Docs Contact:
Priority: high    
Version: 12.0 (Pike)    CC: augol, ccamacho, dbecker, jfrancoa, jstransk, mandreou, mburns, mcornea, morazi, rbartal, slinaber
Target Milestone: z3    Keywords: Regression, Triaged, ZStream
Target Release: 12.0 (Pike)   
Hardware: All   
OS: All   
Whiteboard:
Fixed In Version: openstack-tripleo-common-7.6.13-3.el7ost python-tripleoclient-7.3.10-3.el7ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1575620 Environment:
Last Closed: 2018-08-20 13:02:42 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1568714, 1575620, 1600038    
Bug Blocks:    
Attachments:
  Description          Flags
  mistral logs         none
  stack update logs    none

Description Raviv Bar-Tal 2018-07-22 13:06:23 UTC
+++ This bug was initially created as a clone of Bug #1575620 +++

Description of problem:
Compute node update fails and exits during step 5.

Version-Release number of selected component (if applicable):


How reproducible:
run the overcloud_update_run-Compute.sh script

Steps to Reproduce:
1.
2.
3.

Actual results:
overcloud_update_run_Compute.log exits unexpectedly:
u'TASK [Check if /var/lib/docker-puppet/docker-puppet-tasks5.json exists] ********',
 u'ok: [192.168.24.10] => {"changed": false, "stat": {"exists": false}}']
(undercloud) [stack@undercloud-0 ~]$ less !$
less overcloud_update_run_Compute_new.log
(undercloud) [stack@undercloud-0 ~]$ tail -20 overcloud_update_run_Compute_new.log
 u'TASK [Debug output for task which failed: Run docker-puppet tasks (generate config) during step 5] ***',
 u'skipping: [192.168.24.15] => {"changed": false, "skip_reason": "Conditional result was False"}',
 u'skipping: [192.168.24.10] => {"changed": false, "skip_reason": "Conditional result was False"}',
 u'',
 u'TASK [Start containers for step 5] *********************************************',
 u'ok: [192.168.24.10] => {"censored": "the output has been hidden due to the fact that \'no_log: true\' was specified for this result", "changed": false}',
 u'ok: [192.168.24.15] => {"censored": "the output has been hidden due to the fact that \'no_log: true\' was specified for this result", "changed": false}',
 u'',
 u'TASK [Debug output for task which failed: Start containers for step 5] *********',
 u'ok: [192.168.24.15] => {',
 u'    "failed_when_result": false, ',
 u'    "outputs.stdout_lines|default([])|union(outputs.stderr_lines|default([]))": []',
 u'}',
 u'ok: [192.168.24.10] => {',
 u'    "failed_when_result": false, ',
 u'    "outputs.stdout_lines|default([])|union(outputs.stderr_lines|default([]))": []',
 u'}',
 u'',
 u'TASK [Check if /var/lib/docker-puppet/docker-puppet-tasks5.json exists] ********',
 u'ok: [192.168.24.10] => {"changed": false, "stat": {"exists": false}}']

Expected results:


Additional info:

--- Additional comment from Marios Andreou on 2018-05-07 09:08:24 EDT ---

Can you please triage this? (from triage call round robin)

--- Additional comment from Raviv Bar-Tal on 2018-05-08 02:58 EDT ---



--- Additional comment from Raviv Bar-Tal on 2018-05-08 03:23 EDT ---



--- Additional comment from Raviv Bar-Tal on 2018-05-08 04:24 EDT ---



--- Additional comment from Raviv Bar-Tal on 2018-05-08 04:25 EDT ---



--- Additional comment from Marios Andreou on 2018-05-08 09:20 EDT ---

Having a look here for triage on IRC request via morazi. To be clear, since it isn't stated in comment #0: this is the OSP13 minor update process. The overcloud was deployed as follows (from the attached undercloud /home/stack/overcloud_deploy.sh):
        #!/bin/bash

        openstack overcloud deploy \
        --timeout 100 \
        --templates /usr/share/openstack-tripleo-heat-templates \
        --stack overcloud \
        --libvirt-type kvm \
        --ntp-server clock.redhat.com \
        -e /home/stack/virt/internal.yaml \
        -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
        -e /home/stack/virt/network/network-environment.yaml \
        -e /home/stack/virt/hostnames.yml \
        -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml \
        -e /home/stack/virt/nodes_data.yaml \
        -e /home/stack/virt/extra_templates.yaml \
        -e /home/stack/virt/docker-images.yaml \
        --log-file overcloud_deployment_63.log
        
The update was run as follows (from the attached undercloud /home/stack/overcloud_update_run-Compute.sh):

        #!/bin/env bash
        #
        # Run minor update on overcloud nodes
        #
        set -euo pipefail

        source /home/stack/stackrc


        echo "Running minor update all playbooks for Compute role"
        openstack overcloud update run \
                    --nodes Compute --playbook all 2>&1
             
Looking at the compute logs I see two possibly relevant issues (attaching here for convenience). One is an error finding a container, so there may be a problem with the container images being used here (the ones passed into the update prepare command):

        warning msg="libcontainerd: failed to retrieve container bb23f544ee04047dc4a8d89ca9f9b75b898428119632ef53fa82605681e0ec63 
        and
        error msg="Handler for GET /v1.26/containers/06f92879af1a/json returned error: No such container: 06f92879af1a"$

But I think the root cause is more related to OVS/containers, because there is a long error trace after what looks like:

        dockerd-current[2143]: + /usr/bin/neutron-openvswitch-agent ...
        error : virNetSocketReadWire:1808 : End of file while reading data: Input/output error

--- Additional comment from Jiri Stransky on 2018-05-09 09:30:08 EDT ---

A couple of observations:

* We dropped some params on the `update prepare` command line. It seems the compute node lost NTP info during the prepare: at one point in /var/log/messages it went "back in time" from May 7 to May 2. We need to keep providing the NTP param to the prepare command (see the sketch after this list).

* I can't see anything broken regarding containers on the compute. E.g. in the sosreport from compute, the `docker ps -a` output only has Running or Exited(0) containers, no failed or restarting ones.

* The OVS agent issue that Marios found indeed looks suspicious, but again I don't think it caused a container failure, so it shouldn't manifest as "failed step 5". Furthermore, the OVS container is started in step 4, which passed successfully.

* I think step 5 might have passed successfully too, and we just don't know about it because it was the client output that got broken. Notice that it doesn't say that ansible failed; the output is simply cut short.
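
For reference, a minimal sketch of re-running the prepare step with the NTP parameter kept. This assumes `openstack overcloud update prepare` accepts the same --ntp-server option and environment files as the deploy command above, which should be confirmed for the installed version:

        # hypothetical prepare invocation, re-using the deploy-time options from overcloud_deploy.sh
        openstack overcloud update prepare \
        --templates /usr/share/openstack-tripleo-heat-templates \
        --stack overcloud \
        --ntp-server clock.redhat.com \
        -e /home/stack/virt/internal.yaml \
        -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
        -e /home/stack/virt/network/network-environment.yaml \
        -e /home/stack/virt/hostnames.yml \
        -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml \
        -e /home/stack/virt/nodes_data.yaml \
        -e /home/stack/virt/extra_templates.yaml \
        -e /home/stack/virt/docker-images.yaml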

Suggestions:

* If we still have the environment, please get /var/log/mistral from the undercloud separately, as the sosreport is not sufficient there (it only contains api.log; we need the others too, and the api.log there doesn't open for me anyway). This could help us validate whether the mistral workflow indeed failed or whether we just reproduced the client output issue from bug 1568714. Also let's get `openstack workflow execution list` output (see the sketch after this list).

* If we don't see a failed update workflow, or we don't have the env anymore, let's assume we just reproduced bug 1568714, put this to MODIFIED or ON_QA with the same fixed-in-version. If we can still reproduce the same problem after that, let's do FailedQA.
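
A minimal sketch of collecting that data on the undercloud (assuming the stackrc credentials as used in the update script above):

        source /home/stack/stackrc
        # archive the full mistral log directory (the sosreport only carries api.log)
        sudo tar czf /home/stack/mistral-logs.tar.gz /var/log/mistral/
        # check whether the update workflow execution actually ended in ERROR
        openstack workflow execution list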

--- Additional comment from Jiri Stransky on 2018-05-09 10:15:55 EDT ---

Re the NTP parameter issue -- we do need to persist it indeed, but the root cause of the time flapping by a 5-day difference was that this is a virtual environment, and likely the *host machine* had desynced time, so the virtual compute node got the wrong time on reboot (May 2) before it restored its time (back to May 7).


Re this bug -- we don't have the environment to collect additional info, so let's assume we hit bug 1568714 and re-test with the latest puddle, where this should be fixed.

--- Additional comment from Raviv Bar-Tal on 2018-05-10 07:04:23 EDT ---



--- Additional comment from Raviv Bar-Tal on 2018-05-10 07:25:31 EDT ---

In this bug the behaviour is the same as in bug 1568714 and it is solved by the same fix, so it is closed as a duplicate.

As for the NTP problem, it will have to be investigated separately, but from a quick check, it looks like the "host machine" is in sync.
[root@seal02 ~]# timedatectl 
      Local time: Thu 2018-05-10 14:25:14 IDT
  Universal time: Thu 2018-05-10 11:25:14 UTC
        RTC time: Thu 2018-05-10 11:25:14
       Time zone: Asia/Jerusalem (IDT, +0300)
     NTP enabled: yes
NTP synchronized: yes
 RTC in local TZ: no
      DST active: yes
 Last DST change: DST began at
                  Fri 2018-03-23 01:59:59 IST
                  Fri 2018-03-23 03:00:00 IDT
 Next DST change: DST ends (the clock jumps one hour backwards) at
                  Sun 2018-10-28 01:59:59 IDT
                  Sun 2018-10-28 01:00:00 IST
[root@seal02 ~]#

Comment 1 Raviv Bar-Tal 2018-07-22 13:11:14 UTC
Hi all,
I get the same behaviour while updating OSP 12 from GA to passed_phase2 (currently 2018-07-19.1).

You can find the relevant logs attached and on the job logs page:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/upgrades/view/update/job/DFG-upgrades-updates-12_director-rhel-7.5-virthost-3cont_2comp_3ceph-ipv6-vxlan/1/artifact/

Comment 2 Raviv Bar-Tal 2018-07-22 13:14:26 UTC
Created attachment 1469768 [details]
mistral logs

Comment 3 Raviv Bar-Tal 2018-07-22 13:14:51 UTC
Created attachment 1469769 [details]
stack update logs

Comment 4 Jiri Stransky 2018-07-25 09:02:55 UTC
In the job artifact undercloud.tar.gz, looking at /home/stack/oc-update-00-Controller.log, I can see that the output ends abruptly without any failure message:

2018-07-22 08:25:03 |  u'TASK [Write the config_step hieradata] *****************'headers'
2018-07-22 08:25:03 | ************************',
2018-07-22 08:25:03 |  u'changed: [192.168.24.8]',
2018-07-22 08:25:03 |  u'',
2018-07-22 08:25:03 |  u'TASK [Run puppet host configuration for step 5] ********************************',
2018-07-22 08:25:03 |  u'ok: [192.168.24.8]']

That indeed looks a lot like bug 1568714, so I cherry-picked the relevant patches, but they still need to be tested.

Comment 6 Jiri Stransky 2018-07-26 15:31:52 UTC
Tested, all patches are ready for review. Testing won't prove that the bug is fixed, as I think that to hit it we need the update to cross some time boundary. But I at least wanted to validate that the patches don't break the workflow.
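
For re-testing, a quick sanity check is to compare the installed package versions on the undercloud against the Fixed In Version above (a sketch; the exact NVRs depend on the puddle):

        rpm -q openstack-tripleo-common python-tripleoclient
        # expected at or above:
        #   openstack-tripleo-common-7.6.13-3.el7ost
        #   python-tripleoclient-7.3.10-3.el7ost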

Comment 14 errata-xmlrpc 2018-08-20 13:02:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2331