Bug 1770907
| Summary: | RHOS16: FAILED - RETRYING: Waiting for ovn-controller agent on compute node to come up | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Jad Haj Yahya <jhajyahy> | ||||
| Component: | python-networking-ovn | Assignee: | Terry Wilson <twilson> | ||||
| Status: | CLOSED ERRATA | QA Contact: | Eran Kuris <ekuris> | ||||
| Severity: | urgent | Docs Contact: | |||||
| Priority: | urgent | ||||||
| Version: | 16.0 (Train) | CC: | apevec, bdobreli, cjeanner, dalvarez, drosenfe, jlibosva, lbezdick, lhh, majopela, michele, pweeks, rrasouli, rsafrono, sathlang, sclewis, scohen, shrjoshi | ||||
| Target Milestone: | rc | Keywords: | AutomationBlocker, TestBlocker, Triaged | ||||
| Target Release: | 16.0 (Train on RHEL 8.1) | ||||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | python-networking-ovn-7.0.1-0.20191205040313.2ef5322.el8ost | Doc Type: | If docs needed, set a value | ||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2020-02-06 14:42:51 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Attachments: |
Description
Jad Haj Yahya
2019-11-11 13:28:00 UTC
Terry, can you update this BZ with the latest status? This issue is blocking a fair number of OSP16 regressions. Thanks...
pweeks: The linked gerrit review is the patch that fixes this. Updates on what is going on will show up there until it merges and is backported/merged to the stable branches and the status here goes to POST.
*** Bug 1779177 has been marked as a duplicate of this bug. ***
Hi,
so I've tested an "update" from RHOS_TRUNK-16.0-RHEL-8-20191213.n.1 to RHOS_TRUNK-16.0-RHEL-8-20191213.n.1, both of which should contain the fix.
Problem is that it fails again during reboot:
TASK [Waiting for ovn-controller agent on compute node to come up] *************
task path: /home/rhos-ci/jenkins/workspace/DFG-upgrades-updates-16-from-passed_phase1-HA-ipv4/infrared/plugins/tripleo-overcloud/overcloud_reboot.yml:430
Friday 13 December 2019 22:50:30 +0000 (0:00:00.197) 0:19:07.591 *******
FAILED - RETRYING: Waiting for ovn-controller agent on compute node to come up (60 retries left).
FAILED - RETRYING: Waiting for ovn-controller agent on compute node to come up (59 retries left).
...
FAILED - RETRYING: Waiting for ovn-controller agent on compute node to come up (3 retries left).
FAILED - RETRYING: Waiting for ovn-controller agent on compute node to come up (2 retries left).
FAILED - RETRYING: Waiting for ovn-controller agent on compute node to come up (1 retries left).
fatal: [compute-1 -> 172.16.0.54]: FAILED! => {
"attempts": 60,
"changed": true,
"cmd": "source /home/stack/qe-Cloud-0rc\n openstack network agent list --host compute-1.redhat.local -f json | jq -r -c '.[] | select(.Binary | contains(\"ovn-controller\")) | .Alive'",
"delta": "0:00:02.415348",
"end": "2019-12-13 22:57:53.960529",
"rc": 0,
"start": "2019-12-13 22:57:51.545181"
}
So it seems we still have the error. The full logs are here: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/upgrades/view/update/job/DFG-upgrades-updates-16-from-passed_phase1-HA-ipv4/23/
Let me know if you need anything else to work out why the patch does not work there.
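For reference, the failing task above reports "rc": 0 on every attempt, so the retry decision must depend on what the command prints rather than on its exit code. A minimal sketch of that kind of retry loop follows. The function name wait_for_value and its arguments are hypothetical, and this is not infrared's actual implementation, only an illustration of the pattern:

```shell
# Hypothetical helper: rerun a command until it prints the expected value.
# usage: wait_for_value <expected> <retries> <delay_seconds> <command...>
wait_for_value() {
    local expected="$1" retries="$2" delay="$3"
    shift 3
    local attempt output
    for attempt in $(seq 1 "$retries"); do
        # A zero exit code is not enough: the printed value must match too.
        output="$("$@" 2>/dev/null)"
        if [ "$output" = "$expected" ]; then
            echo "succeeded after $attempt attempt(s)"
            return 0
        fi
        echo "attempt $attempt of $retries did not print '$expected'" >&2
        sleep "$delay"
    done
    return 1
}
```

With 60 retries as in the log, a call could look like `wait_for_value true 60 1 check_ovn_alive`, where check_ovn_alive is an assumed wrapper around the openstack/jq pipeline from the task and the 1 second delay is a guess. If the expected string never appears in the output, the loop uses up every retry even though each command run succeeds.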
Hi,
I went onto an OSP16 environment that had failed during reboot and ran the check manually:
source /home/stack/qe-Cloud-0rc
openstack network agent list --host compute-1.redhat.local -f json | jq -r -c '.[] | select(.Binary | contains("ovn-controller")) | .Alive'
Before the patch, the output of that command was flapping (false, true, false, true ...), and since the reboot wasn't working I expected to see the same behavior. It did not reproduce: I only got "true". So I went through the infrared code and I think I've found the issue. Infrared expects to find ":-)" in the output to report success. In OSP16 we no longer have the smiley face, only "true", so the infrared check always failed. I've pushed this review to solve it: https://review.gerrithub.io/c/redhat-openstack/infrared/+/478249
I checked the flapping issue like this:
for i in $(seq 1 60); do openstack network agent list --host compute-1.redhat.local -f json | jq -r -c '.[] | select(.Binary | contains("networking-ovn-metadata-agent")) | .Alive' >> meta.txt; sleep 1;done
for i in $(seq 1 60); do openstack network agent list --host compute-1.redhat.local -f json | jq -r -c '.[] | select(.Binary | contains("ovn-controller")) | .Alive' >> ovn.txt; sleep 1;done &
grep true ovn.txt | wc -l
60
grep true meta.txt | wc -l
60
For me that validates the patch.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHEA-2020:0283
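As an illustration of the mismatch described in the comment above (infrared looking for ":-)" while OSP16 prints "true"), here is a minimal sketch of a check that accepts either value, plus a counter for sample files such as ovn.txt and meta.txt. The function names is_alive and count_alive are hypothetical, and this is not the content of the linked infrared review:

```shell
# Hypothetical helper: accept both the ":-)" marker infrared was looking
# for and the "true" value that OSP16 prints for a live agent.
is_alive() {
    if [ "$1" = "true" ] || [ "$1" = ":-)" ]; then
        return 0
    fi
    return 1
}

# Hypothetical helper: count how many sampled lines in a file report "alive".
# The file is expected to hold one .Alive value per line.
count_alive() {
    local file="$1" line count=0
    while IFS= read -r line; do
        if is_alive "$line"; then
            count=$((count + 1))
        fi
    done < "$file"
    echo "$count"
}
```

Running `count_alive ovn.txt` after the 60-sample loop should print 60 when the agent never flapped. Any lower number means some samples reported the agent as not alive.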