Created attachment 1634873
Jenkins job logs

Description of problem:
Running the job below
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/df/view/deployment/job/DFG-df-deployment-16-virthost-1cont_1comp_3ceph-no_UC_SSL-no_OC_SSL-ceph-ipv6-vlan-RHELOSP-31817/
fails at the overcloud reboot stage with:
FAILED - RETRYING: Waiting for ovn-controller agent on compute node to come up
See the attached logs.

Version-Release number of selected component (if applicable):
rhos16

How reproducible:
always

Steps to Reproduce:
1. Run the job https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/df/view/deployment/job/DFG-df-deployment-16-virthost-1cont_1comp_3ceph-no_UC_SSL-no_OC_SSL-ceph-ipv6-vlan-RHELOSP-31817/

Actual results:
Overcloud reboot failed.

Expected results:
Overcloud reboot succeeds.
Terry, can you update this BZ with the latest status? This issue is blocking a fair number of OSP16 regressions. Thanks...
pweeks: The linked gerrit review is the patch that fixes this. Updates on its progress will show up there until it merges and is backported to the stable branches, at which point the status here goes to POST.
*** Bug 1779177 has been marked as a duplicate of this bug. ***
Hi,

I've tested an "update" from RHOS_TRUNK-16.0-RHEL-8-20191213.n.1 to RHOS_TRUNK-16.0-RHEL-8-20191213.n.1, both of which should contain the fix. The problem is that it fails again during reboot:

    TASK [Waiting for ovn-controller agent on compute node to come up] *************
    task path: /home/rhos-ci/jenkins/workspace/DFG-upgrades-updates-16-from-passed_phase1-HA-ipv4/infrared/plugins/tripleo-overcloud/overcloud_reboot.yml:430
    Friday 13 December 2019 22:50:30 +0000 (0:00:00.197) 0:19:07.591 *******
    FAILED - RETRYING: Waiting for ovn-controller agent on compute node to come up (60 retries left).
    FAILED - RETRYING: Waiting for ovn-controller agent on compute node to come up (59 retries left).
    ...
    FAILED - RETRYING: Waiting for ovn-controller agent on compute node to come up (3 retries left).
    FAILED - RETRYING: Waiting for ovn-controller agent on compute node to come up (2 retries left).
    FAILED - RETRYING: Waiting for ovn-controller agent on compute node to come up (1 retries left).
    fatal: [compute-1 -> 172.16.0.54]: FAILED! => {
        "attempts": 60,
        "changed": true,
        "cmd": "source /home/stack/qe-Cloud-0rc\n openstack network agent list --host compute-1.redhat.local -f json | jq -r -c '.[] | select(.Binary | contains(\"ovn-controller\")) | .Alive'",
        "delta": "0:00:02.415348",
        "end": "2019-12-13 22:57:53.960529",
        "rc": 0,
        "start": "2019-12-13 22:57:51.545181"
    }

So it seems we still have the error. The full logs are here:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/upgrades/view/update/job/DFG-upgrades-updates-16-from-passed_phase1-HA-ipv4/23/

Tell me if you need anything more to check why the patch would not work there.
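Note that "rc" is 0 in the failure above, so the command itself ran; a task that keeps retrying with rc 0 is most likely failing because the condition it checks on the output is never met. A minimal shell sketch of that pattern follows. It is not the infrared task; the function name retry_until_output and its arguments are hypothetical and only illustrate polling on stdout rather than on the exit code.

```shell
#!/usr/bin/env bash
# Hypothetical polling helper (NOT the infrared task):
# run a command up to $1 times, $2 seconds apart, and succeed only
# when its stdout is exactly $3. The command's exit code is ignored,
# which is why a check can "run fine" and still never pass.
retry_until_output() {
    local attempts="$1" delay="$2" expected="$3" out i
    shift 3
    for ((i = 1; i <= attempts; i++)); do
        out="$("$@" 2>/dev/null)"
        if [ "$out" = "$expected" ]; then
            return 0
        fi
        # Do not sleep after the final attempt.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}
```

For example, retry_until_output 3 0 "true" echo true succeeds on the first attempt, while retry_until_output 2 0 ":-)" echo true exhausts its attempts even though echo exits with 0 every time.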
Hi,

I went onto an OSP16 environment that had failed during reboot and ran the check manually:

    source /home/stack/qe-Cloud-0rc
    openstack network agent list --host compute-1.redhat.local -f json | jq -r -c '.[] | select(.Binary | contains("ovn-controller")) | .Alive'

Before the patch, the output of that command was flapping (false, true, false, true ...), and since the reboot still wasn't working I expected to see the same behavior. It did not reproduce: I only got "true". So I went through the infrared code, and I think I've found the issue. Infrared expects to find ":-)" in the output to count it as a success. In OSP16 we no longer get the smiley face, only "true", so the infrared check was always false. I've pushed this review to solve it:
https://review.gerrithub.io/c/redhat-openstack/infrared/+/478249

I've checked the flapping issue like this:

    for i in $(seq 1 60); do openstack network agent list --host compute-1.redhat.local -f json | jq -r -c '.[] | select(.Binary | contains("networking-ovn-metadata-agent")) | .Alive' >> meta.txt; sleep 1;done
    for i in $(seq 1 60); do openstack network agent list --host compute-1.redhat.local -f json | jq -r -c '.[] | select(.Binary | contains("ovn-controller")) | .Alive' >> ovn.txt; sleep 1;done &

    grep true ovn.txt | wc -l
    60
    grep true meta.txt | wc -l
    60

For me, that validates the patch.
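The mismatch between ":-)" and "true" can be sketched as a small shell helper. This is only an illustration, not the code in the infrared review; the function names is_alive_value and never_flapped are hypothetical, and accepting both markers is an assumption about what a tolerant check could look like.

```shell
#!/usr/bin/env bash
# Hypothetical helper (NOT the infrared code): treat both the smiley
# marker ":-)" and the JSON boolean "true" as "agent is alive".
is_alive_value() {
    case "$1" in
        ":-)"|true) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: succeed only if the sample file has at least one
# non-empty line and every non-empty line is an alive value, i.e. the
# agent state never flapped while it was being sampled.
never_flapped() {
    local file="$1" line seen=0
    while IFS= read -r line || [ -n "$line" ]; do
        if [ -z "$line" ]; then
            continue
        fi
        seen=1
        is_alive_value "$line" || return 1
    done < "$file"
    [ "$seen" -eq 1 ]
}
```

With sample files like ovn.txt and meta.txt from the loops above, never_flapped ovn.txt && never_flapped meta.txt would replace counting the "true" lines by hand.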
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2020:0283