Description of problem:
Compute node update fails and exits during step 5.

Version-Release number of selected component (if applicable):

How reproducible:
Run the overcloud_update_run-Compute.sh script.

Steps to Reproduce:
1.
2.
3.

Actual results:
overcloud_update_run_Compute.log exits unexpectedly:

u'TASK [Check if /var/lib/docker-puppet/docker-puppet-tasks5.json exists] ********',
u'ok: [192.168.24.10] => {"changed": false, "stat": {"exists": false}}']

(undercloud) [stack@undercloud-0 ~]$ less !$
less overcloud_update_run_Compute_new.log
(undercloud) [stack@undercloud-0 ~]$ tail -20 overcloud_update_run_Compute_new.log
u'TASK [Debug output for task which failed: Run docker-puppet tasks (generate config) during step 5] ***',
u'skipping: [192.168.24.15] => {"changed": false, "skip_reason": "Conditional result was False"}',
u'skipping: [192.168.24.10] => {"changed": false, "skip_reason": "Conditional result was False"}',
u'',
u'TASK [Start containers for step 5] *********************************************',
u'ok: [192.168.24.10] => {"censored": "the output has been hidden due to the fact that \'no_log: true\' was specified for this result", "changed": false}',
u'ok: [192.168.24.15] => {"censored": "the output has been hidden due to the fact that \'no_log: true\' was specified for this result", "changed": false}',
u'',
u'TASK [Debug output for task which failed: Start containers for step 5] *********',
u'ok: [192.168.24.15] => {',
u'    "failed_when_result": false, ',
u'    "outputs.stdout_lines|default([])|union(outputs.stderr_lines|default([]))": []',
u'}',
u'ok: [192.168.24.10] => {',
u'    "failed_when_result": false, ',
u'    "outputs.stdout_lines|default([])|union(outputs.stderr_lines|default([]))": []',
u'}',
u'',
u'TASK [Check if /var/lib/docker-puppet/docker-puppet-tasks5.json exists] ********',
u'ok: [192.168.24.10] => {"changed": false, "stat": {"exists": false}}']

Expected results:

Additional info:
Can you please triage this (from the triage call round robin)?
Created attachment 1433023 [details] stack files
Created attachment 1433026 [details] compute logs
Created attachment 1433060 [details] undercloud sosreport part 1
Created attachment 1433061 [details] undercloud sosreport part 2
Created attachment 1433200 [details]
some errors from compute /var/log/messages

Having a look here for triage on IRC request via morazi. To be clear, as it isn't stated in comment #0: this is the OSP13 minor update process.

The overcloud was deployed like this (from the attached undercloud /home/stack/overcloud_deploy.sh):

#!/bin/bash
openstack overcloud deploy \
  --timeout 100 \
  --templates /usr/share/openstack-tripleo-heat-templates \
  --stack overcloud \
  --libvirt-type kvm \
  --ntp-server clock.redhat.com \
  -e /home/stack/virt/internal.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
  -e /home/stack/virt/network/network-environment.yaml \
  -e /home/stack/virt/hostnames.yml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml \
  -e /home/stack/virt/nodes_data.yaml \
  -e /home/stack/virt/extra_templates.yaml \
  -e /home/stack/virt/docker-images.yaml \
  --log-file overcloud_deployment_63.log

The update was run like this (from the attached undercloud /home/stack/overcloud_update_run-Compute.sh):

#!/bin/env bash
#
# Run minor update on overcloud nodes
#
set -euo pipefail

source /home/stack/stackrc

echo "Running minor update all playbooks for Compute role"
openstack overcloud update run \
    --nodes Compute --playbook all 2>&1

Looking at the compute logs, I see two possibly relevant issues (attaching them here for convenience).

One is an error finding a container, so there may be a problem with the container images being used here (the ones passed into the update prepare command):

warning msg="libcontainerd: failed to retrieve container bb23f544ee04047dc4a8d89ca9f9b75b898428119632ef53fa82605681e0ec63"

and

error msg="Handler for GET /v1.26/containers/06f92879af1a/json returned error: No such container: 06f92879af1a"$

But I think the root cause is more related to OVS/containers, because there is a long error trace after what looks like:

dockerd-current[2143]: + /usr/bin/neutron-openvswitch-agent ...
error : virNetSocketReadWire:1808 : End of file while reading data: Input/output error
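The container errors quoted above can be pulled out of the compute's /var/log/messages with a quick filter. A minimal sketch, using shortened stand-in sample lines rather than the actual attached log:

```shell
# Scan a messages file for dockerd/libcontainerd errors about missing
# or unretrievable containers. The sample lines below are abbreviated
# stand-ins for the attached compute /var/log/messages.
cat > /tmp/messages.sample <<'EOF'
May  7 10:01:02 compute-0 dockerd-current[2143]: warning msg="libcontainerd: failed to retrieve container bb23f544ee04"
May  7 10:01:03 compute-0 dockerd-current[2143]: error msg="Handler for GET /v1.26/containers/06f92879af1a/json returned error: No such container: 06f92879af1a"
May  7 10:01:04 compute-0 systemd: Started Session 42 of user root.
EOF

# Print only the container-retrieval failures.
grep -E 'libcontainerd: failed to retrieve container|No such container' /tmp/messages.sample
```

On a live compute node the same grep would be pointed at /var/log/messages directly.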
A couple of observations:

* We dropped some params on the `update prepare` command line. The compute node seems to have lost its NTP configuration during the prepare: at one point in /var/log/messages it went "back in time" from May 7 to May 2. We need to keep providing the NTP param to the prepare command.

* I can't see anything broken regarding containers on the compute. E.g. in the sosreport from the compute, the `docker ps -a` output only has Running or Exited(0) containers, no failed or restarting ones.

* The OVS agent issue that Marios found indeed looks suspicious, but again I don't think it caused a container failure, so it shouldn't manifest as "failed step 5". Furthermore, the OVS container is started in step 4, which passed successfully.

* I think step 5 might have passed successfully too, and we just don't know about it because it was the client output that got broken. Notice that it doesn't say that Ansible failed; the output is simply cut short.

Suggestions:

* If we still have the environment, please get /var/log/mistral from the undercloud separately, as the sosreport is not sufficient there (it only contains api.log; we need the others too, and the api.log there doesn't open for me anyway). This could help us validate whether the Mistral workflow indeed failed or whether we just reproduced the client output issue from bug 1568714. Also, let's get the `openstack workflow execution list` output.

* If we don't see a failed update workflow, or we don't have the env anymore, let's assume we just reproduced bug 1568714 and put this to MODIFIED or ON_QA with the same fixed-in-version. If we can still reproduce the same problem after that, let's do FailedQA.
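The workflow check suggested above can be sketched roughly as follows. The table is a hypothetical sample of `openstack workflow execution list` output (the IDs and workflow names are made up for illustration, not taken from this environment):

```shell
# Sketch: filter a saved `openstack workflow execution list` dump for
# executions that did not finish in SUCCESS state. The table below is
# an illustrative sample, not output from this environment.
cat > /tmp/executions.sample <<'EOF'
| ID   | Workflow name                            | State   |
| aaaa | tripleo.package_update.v1.update_nodes   | SUCCESS |
| bbbb | tripleo.deployment.v1.deploy_on_servers  | ERROR   |
EOF

# State is the 4th pipe-delimited column; flag anything not SUCCESS,
# skipping the header row.
flagged=$(awk -F'|' 'NR > 1 && $4 !~ /SUCCESS/ {print $2, $4}' /tmp/executions.sample)
echo "$flagged"
```

If the filter prints nothing, no update workflow failed, which would support the "client output got cut short" theory.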
Re the NTP parameter issue -- we do need to persist it indeed, but the root cause of the time flapping by a 5-day difference was that this is a virtual environment, and the *host machine* likely had desynced time, so the virtual compute node got the wrong time on reboot (May 2) before restoring it (back to May 7).

Re this bug -- we don't have the environment to collect additional info, so let's assume we hit bug 1568714 and re-test with the latest puddle, where this should be fixed.
*** This bug has been marked as a duplicate of bug 1568714 ***
In this bug the behaviour is the same as in bug 1568714 and it is solved by the same fix, so it is closed as a duplicate.

As for the NTP problem, it will have to be investigated separately, but from a quick check, it looks like the "host machine" is in sync:

[root@seal02 ~]# timedatectl
      Local time: Thu 2018-05-10 14:25:14 IDT
  Universal time: Thu 2018-05-10 11:25:14 UTC
        RTC time: Thu 2018-05-10 11:25:14
       Time zone: Asia/Jerusalem (IDT, +0300)
     NTP enabled: yes
NTP synchronized: yes
 RTC in local TZ: no
      DST active: yes
 Last DST change: DST began at
                  Fri 2018-03-23 01:59:59 IST
                  Fri 2018-03-23 03:00:00 IDT
 Next DST change: DST ends (the clock jumps one hour backwards) at
                  Sun 2018-10-28 01:59:59 IDT
                  Sun 2018-10-28 01:00:00 IST
[root@seal02 ~]#
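For future triage of similar time-flap reports, the host-sync check can be scripted rather than eyeballed. A minimal sketch that parses `timedatectl`-style output; the captured sample here is a stand-in so the snippet is self-contained (on a live host, substitute the real `timedatectl` output):

```shell
# Sketch: decide whether a hypervisor host's clock is NTP-synchronized
# by parsing `timedatectl`-style output. The sample variable stands in
# for real command output so the snippet runs anywhere.
sample='     NTP enabled: yes
NTP synchronized: yes'

# Pull the value after "NTP synchronized:" (tolerating leading spaces).
status=$(printf '%s\n' "$sample" | awk -F': *' '/NTP synchronized/ {print $2}')

if [ "$status" = "yes" ]; then
  echo "host clock is NTP-synchronized"
else
  echo "WARNING: host clock is NOT synchronized" >&2
fi
```

Note that newer systemd versions label this field "System clock synchronized" instead, so the pattern may need adjusting per release.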