Bug 1770907 - RHOS16: FAILED - RETRYING: Waiting for ovn-controller agent on compute node to come up
Summary: RHOS16: FAILED - RETRYING: Waiting for ovn-controller agent on compute node t...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-networking-ovn
Version: 16.0 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 16.0 (Train on RHEL 8.1)
Assignee: Terry Wilson
QA Contact: Eran Kuris
URL:
Whiteboard:
Duplicates: 1779177
Depends On:
Blocks:
Reported: 2019-11-11 13:28 UTC by Jad Haj Yahya
Modified: 2020-02-06 14:43 UTC

Fixed In Version: python-networking-ovn-7.0.1-0.20191205040313.2ef5322.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-02-06 14:42:51 UTC
Target Upstream Version:
Embargoed:


Attachments
Jenkins job logs (674.96 KB, application/zip)
2019-11-11 13:28 UTC, Jad Haj Yahya


Links
System | ID | Private | Priority | Status | Summary | Last Updated
OpenStack gerrit | 694840 | 0 | None | MERGED | Fix agent extension support after hashring merge | 2019-12-08 16:35:57 UTC
OpenStack gerrit | 696936 | 0 | None | MERGED | Fix agent extension support after hashring merge | 2019-12-08 20:59:51 UTC
Red Hat Product Errata | RHEA-2020:0283 | 0 | None | None | None | 2020-02-06 14:43:22 UTC

Description Jad Haj Yahya 2019-11-11 13:28:00 UTC
Created attachment 1634873 [details]
Jenkins job logs

Description of problem:
Run the job below:

https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/df/view/deployment/job/DFG-df-deployment-16-virthost-1cont_1comp_3ceph-no_UC_SSL-no_OC_SSL-ceph-ipv6-vlan-RHELOSP-31817/

It fails at the overcloud reboot stage with:
FAILED - RETRYING: Waiting for ovn-controller agent on compute node to come up

See the attached logs.

Version-Release number of selected component (if applicable):
rhos16

How reproducible:
always

Steps to Reproduce:
1. Run the job https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/df/view/deployment/job/DFG-df-deployment-16-virthost-1cont_1comp_3ceph-no_UC_SSL-no_OC_SSL-ceph-ipv6-vlan-RHELOSP-31817/

Actual results:
overcloud reboot failed

Expected results:
To succeed

Additional info:

Comment 5 pweeks 2019-11-25 15:16:37 UTC
Terry, can you update this BZ with latest status?
This issue is blocking a fair number of OSP16 regressions.
Thanks...

Comment 6 Terry Wilson 2019-11-27 18:16:00 UTC
pweeks: The linked gerrit review is the patch that fixes this. Progress updates will show up there until it merges, is backported and merged to the stable branches, and the status here goes to POST.

Comment 10 Michele Baldessari 2019-12-03 14:02:44 UTC
*** Bug 1779177 has been marked as a duplicate of this bug. ***

Comment 16 Sofer Athlan-Guyot 2019-12-16 17:09:20 UTC
Hi,

so I've tested an "update" from RHOS_TRUNK-16.0-RHEL-8-20191213.n.1 to RHOS_TRUNK-16.0-RHEL-8-20191213.n.1, both of which should contain the fix.

Problem is that it fails again during reboot:

TASK [Waiting for ovn-controller agent on compute node to come up] *************
task path: /home/rhos-ci/jenkins/workspace/DFG-upgrades-updates-16-from-passed_phase1-HA-ipv4/infrared/plugins/tripleo-overcloud/overcloud_reboot.yml:430
Friday 13 December 2019  22:50:30 +0000 (0:00:00.197)       0:19:07.591 ******* 
FAILED - RETRYING: Waiting for ovn-controller agent on compute node to come up (60 retries left).
FAILED - RETRYING: Waiting for ovn-controller agent on compute node to come up (59 retries left).
...
FAILED - RETRYING: Waiting for ovn-controller agent on compute node to come up (3 retries left).
FAILED - RETRYING: Waiting for ovn-controller agent on compute node to come up (2 retries left).
FAILED - RETRYING: Waiting for ovn-controller agent on compute node to come up (1 retries left).
fatal: [compute-1 -> 172.16.0.54]: FAILED! => {
    "attempts": 60, 
    "changed": true, 
    "cmd": "source /home/stack/qe-Cloud-0rc\n openstack network agent list --host compute-1.redhat.local -f json | jq -r -c '.[] | select(.Binary | contains(\"ovn-controller\")) | .Alive'", 
    "delta": "0:00:02.415348", 
    "end": "2019-12-13 22:57:53.960529", 
    "rc": 0, 
    "start": "2019-12-13 22:57:51.545181"
}

So it seems we still have the error. The full logs are here: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/upgrades/view/update/job/DFG-upgrades-updates-16-from-passed_phase1-HA-ipv4/23/

Let me know if you need anything more to check why the patch would not work there.

Comment 18 Sofer Athlan-Guyot 2019-12-18 02:16:06 UTC
Hi,

so I logged into an OSP16 environment that had failed during reboot and ran the check

   source /home/stack/qe-Cloud-0rc
   openstack network agent list --host compute-1.redhat.local -f json | jq -r -c '.[] | select(.Binary | contains("ovn-controller")) | .Alive'

manually.

Before the patch, the output of that command was flapping (false,
true, false, true ...), and since the reboot was still failing I
expected to see the same behavior.

However, the behavior didn't reproduce: I got only "true" output. So I
went on to check the infrared code, and I think I've found the
issue.

Infrared expects to find ":-)" in the output to report success. In
OSP16 we no longer get the smiley face, only "true". So the
infrared test always failed.
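
To illustrate the mismatch, here is a minimal sketch of a check that accepts both forms of the "Alive" value. It is not the actual Infrared change; the function name is_agent_alive is hypothetical, and it assumes the command prints a single value for the host.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not the Infrared code): succeed when
# the "Alive" value is either the older ":-)" marker or the OSP16 "true".
is_agent_alive() {
    case "$1" in
        ":-)"|true) return 0 ;;   # agent reported alive in either format
        *) return 1 ;;            # anything else ("false", empty, ...) is a failure
    esac
}
```

It could be called with the output of the command above, for example is_agent_alive "$(openstack network agent list ... | jq ...)", so the same test works before and after the output format changed.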

I've pushed the review
https://review.gerrithub.io/c/redhat-openstack/infrared/+/478249 to
fix this.

I've checked the flapping issue like this:

  for i in $(seq 1 60); do openstack network agent list --host compute-1.redhat.local -f json | jq -r -c '.[] | select(.Binary | contains("networking-ovn-metadata-agent")) | .Alive' >> meta.txt; sleep 1;done
  for i in $(seq 1 60); do openstack network agent list --host compute-1.redhat.local -f json | jq -r -c '.[] | select(.Binary | contains("ovn-controller")) | .Alive' >> ovn.txt; sleep 1;done &

grep true ovn.txt  | wc -l
60
grep true meta.txt  | wc -l
60

For me, that validates the patch.
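
The two grep counts can also be folded into one small helper that reports alive and not-alive samples together, which makes flapping easier to spot. This is only a sketch: the name summarize_samples is hypothetical, and it assumes each line of the file is one "Alive" reading, as written by the loops above.

```shell
#!/bin/sh
# Hypothetical helper: summarize a sample file such as ovn.txt or meta.txt,
# where each line is assumed to be one "Alive" reading ("true" or "false").
summarize_samples() {
    file=$1
    total=$(wc -l < "$file")
    # grep -c still prints 0 but exits non-zero when nothing matches;
    # ignore that exit status so the count is always captured.
    alive=$(grep -c '^true$' "$file" || true)
    echo "alive=$((alive)) not_alive=$((total - alive))"
}
```

For example, summarize_samples ovn.txt prints one line with both counts; any non-zero not_alive value would indicate that the agent was reported down at least once during the sampling window.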

Comment 22 errata-xmlrpc 2020-02-06 14:42:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:0283

