Bug 1770907

Summary: RHOS16: FAILED - RETRYING: Waiting for ovn-controller agent on compute node to come up
Product: Red Hat OpenStack Reporter: Jad Haj Yahya <jhajyahy>
Component: python-networking-ovn Assignee: Terry Wilson <twilson>
Status: CLOSED ERRATA QA Contact: Eran Kuris <ekuris>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 16.0 (Train) CC: apevec, bdobreli, cjeanner, dalvarez, drosenfe, jlibosva, lbezdick, lhh, majopela, michele, pweeks, rrasouli, rsafrono, sathlang, sclewis, scohen, shrjoshi
Target Milestone: rc Keywords: AutomationBlocker, TestBlocker, Triaged
Target Release: 16.0 (Train on RHEL 8.1)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: python-networking-ovn-7.0.1-0.20191205040313.2ef5322.el8ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-02-06 14:42:51 UTC Type: Bug
Attachments:
Description Flags
Jenkins job logs none

Description Jad Haj Yahya 2019-11-11 13:28:00 UTC
Created attachment 1634873 [details]
Jenkins job logs

Description of problem:
Run the job below:

https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/df/view/deployment/job/DFG-df-deployment-16-virthost-1cont_1comp_3ceph-no_UC_SSL-no_OC_SSL-ceph-ipv6-vlan-RHELOSP-31817/

It fails at the overcloud reboot stage with:
FAILED - RETRYING: Waiting for ovn-controller agent on compute node to come up

See attached logs

Version-Release number of selected component (if applicable):
rhos16

How reproducible:
always

Steps to Reproduce:
1. Run the job https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/df/view/deployment/job/DFG-df-deployment-16-virthost-1cont_1comp_3ceph-no_UC_SSL-no_OC_SSL-ceph-ipv6-vlan-RHELOSP-31817/

Actual results:
The overcloud reboot fails.

Expected results:
The overcloud reboot succeeds.

Additional info:

Comment 5 pweeks 2019-11-25 15:16:37 UTC
Terry, can you update this BZ with latest status?
This issue is blocking a fair number of OSP16 regressions.
Thanks...

Comment 6 Terry Wilson 2019-11-27 18:16:00 UTC
pweeks: The linked gerrit review is the patch that fixes this. Updates on what is going on will show up there until it merges and is backported/merged to the stable branches, at which point the status here goes to POST.

Comment 10 Michele Baldessari 2019-12-03 14:02:44 UTC
*** Bug 1779177 has been marked as a duplicate of this bug. ***

Comment 16 Sofer Athlan-Guyot 2019-12-16 17:09:20 UTC
Hi,

so I've tested an "update" from RHOS_TRUNK-16.0-RHEL-8-20191213.n.1 to RHOS_TRUNK-16.0-RHEL-8-20191213.n.1, both of which should contain the fix.

Problem is that it fails again during reboot:

TASK [Waiting for ovn-controller agent on compute node to come up] *************
task path: /home/rhos-ci/jenkins/workspace/DFG-upgrades-updates-16-from-passed_phase1-HA-ipv4/infrared/plugins/tripleo-overcloud/overcloud_reboot.yml:430
Friday 13 December 2019  22:50:30 +0000 (0:00:00.197)       0:19:07.591 ******* 
FAILED - RETRYING: Waiting for ovn-controller agent on compute node to come up (60 retries left).
FAILED - RETRYING: Waiting for ovn-controller agent on compute node to come up (59 retries left).
...
FAILED - RETRYING: Waiting for ovn-controller agent on compute node to come up (3 retries left).
FAILED - RETRYING: Waiting for ovn-controller agent on compute node to come up (2 retries left).
FAILED - RETRYING: Waiting for ovn-controller agent on compute node to come up (1 retries left).
fatal: [compute-1 -> 172.16.0.54]: FAILED! => {
    "attempts": 60, 
    "changed": true, 
    "cmd": "source /home/stack/qe-Cloud-0rc\n openstack network agent list --host compute-1.redhat.local -f json | jq -r -c '.[] | select(.Binary | contains(\"ovn-controller\")) | .Alive'", 
    "delta": "0:00:02.415348", 
    "end": "2019-12-13 22:57:53.960529", 
    "rc": 0, 
    "start": "2019-12-13 22:57:51.545181"
}

So it seems we still have the error. The full logs are here: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/upgrades/view/update/job/DFG-upgrades-updates-16-from-passed_phase1-HA-ipv4/23/

Tell me if you need anything more to check why the patch would not work there.

Comment 18 Sofer Athlan-Guyot 2019-12-18 02:16:06 UTC
Hi,

so I went to an osp16 environment that had failed during reboot and ran the check

   source /home/stack/qe-Cloud-0rc
   openstack network agent list --host compute-1.redhat.local -f json | jq -r -c '.[] | select(.Binary | contains("ovn-controller")) | .Alive'

manually.
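
As a rough illustration of what that jq filter selects, here is a minimal Python sketch. It only assumes that the JSON is a list of objects carrying the "Binary" and "Alive" keys used in the filter; the function name and the sample data are made up for the example and are not taken from a real deployment.

```python
import json


def alive_values(agent_list_json, binary_substring):
    """Return the Alive field of each agent whose Binary contains the substring.

    Mirrors: jq '.[] | select(.Binary | contains("...")) | .Alive'
    """
    agents = json.loads(agent_list_json)
    return [
        agent["Alive"]
        for agent in agents
        if binary_substring in agent.get("Binary", "")
    ]


# Hypothetical sample shaped like `openstack network agent list -f json` output.
sample = json.dumps([
    {"Host": "compute-1.redhat.local", "Binary": "ovn-controller", "Alive": True},
    {"Host": "compute-1.redhat.local", "Binary": "networking-ovn-metadata-agent", "Alive": True},
])

print(alive_values(sample, "ovn-controller"))
```

With one matching agent the result is a single-element list, which is what the reboot check then has to interpret as alive or not.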

Before the patch, the output of that command was flapping (false,
true, false, true ...), and since the reboot wasn't working I expected
to see the same behavior.

First, the behavior didn't reproduce: I got only "true" output. So I
went on to check the infrared code, and I think I've found the
issue.

Infrared expects to find ":-)" in the output to report success.  In
osp16 we no longer have the smiley face, only "true".  So the
infrared test was always false.
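
To make the mismatch concrete, here is a minimal sketch of a check that accepts both forms of the Alive value. This is not the content of the infrared review; the function name is hypothetical and the accepted values are just the two seen in this bug (":-)" and "true").

```python
def is_agent_alive(value):
    """Treat both the older ':-)' marker and the newer 'true' output as alive."""
    if isinstance(value, bool):
        # JSON booleans arrive as Python bools when parsed with the json module.
        return value
    # Command output captured as text may carry whitespace or a trailing newline.
    return str(value).strip().lower() in {":-)", "true"}


for raw in (":-)", "true\n", "false", "XXX"):
    print(repr(raw), is_agent_alive(raw))
```

A check that only looks for ":-)" returns false for "true", which matches the always-failing retries seen above.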

I've pushed this review to fix it:
https://review.gerrithub.io/c/redhat-openstack/infrared/+/478249

I've checked the flapping issue like this:

  for i in $(seq 1 60); do openstack network agent list --host compute-1.redhat.local -f json | jq -r -c '.[] | select(.Binary | contains("networking-ovn-metadata-agent")) | .Alive' >> meta.txt; sleep 1;done
  for i in $(seq 1 60); do openstack network agent list --host compute-1.redhat.local -f json | jq -r -c '.[] | select(.Binary | contains("ovn-controller")) | .Alive' >> ovn.txt; sleep 1;done &

grep true ovn.txt  | wc -l
60
grep true meta.txt  | wc -l
60

For me, that validates the patch.
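
Counting "true" lines shows that every sample was alive; a small sketch like the one below would also report how often the value changed between consecutive samples, which is the flapping seen before the patch. It assumes one value per line, as written to ovn.txt and meta.txt by the loops above; the function name is hypothetical.

```python
def summarize_samples(lines):
    """Count samples, 'true' samples, and transitions between consecutive samples."""
    samples = [line.strip() for line in lines if line.strip()]
    alive = sum(1 for s in samples if s == "true")
    flaps = sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)
    return {"samples": len(samples), "alive": alive, "flaps": flaps}


# Stable output, like the 60 "true" lines above.
print(summarize_samples(["true\n"] * 60))
# Flapping output, like the behavior before the patch.
print(summarize_samples(["false\n", "true\n", "false\n", "true\n"]))
```

To run it on the real data, pass the lines of the file, for example summarize_samples(open("ovn.txt")).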

Comment 22 errata-xmlrpc 2020-02-06 14:42:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:0283