Bug 1779177

Summary: [OSP16] After minor update and overcloud reboot ovn-controller is not Alive on compute node
Product: Red Hat OpenStack Reporter: Roman Safronov <rsafrono>
Component: python-networking-ovn Assignee: Assaf Muller <amuller>
Status: CLOSED DUPLICATE QA Contact: Eran Kuris <ekuris>
Severity: high Docs Contact:
Priority: unspecified    
Version: 16.0 (Train) CC: apevec, lhh, majopela, michele, scohen
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-12-03 14:02:44 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Roman Safronov 2019-12-03 13:14:33 UTC
Description of problem:
The minor update CI job failed at the 'overcloud reboot' stage because ovn-controller is not Alive on at least one of the nodes (compute-1); see [0] below.

Link to the CI minor update job
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-update-16_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/6/


Version-Release number of selected component (if applicable):
Minor update from RHOS_TRUNK-16.0-RHEL-8-20191115.n.0 to RHOS_TRUNK-16.0-RHEL-8-20191126.n.2


How reproducible:
Tried this scenario once and the issue occurred

Note: I also tried updating from RHOS_TRUNK-16.0-RHEL-8-20191115.n.0 to the brand-new passed_phase1 puddle (RHOS_TRUNK-16.0-RHEL-8-20191202.n.1), but the job failed at the undercloud update stage; see https://bugzilla.redhat.com/show_bug.cgi?id=1779165


Steps to Reproduce:
1. Run the minor update CI job: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-update-16_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/

Specify the following parameters for the build:
PRODUCT_BUILD: RHOS_TRUNK-16.0-RHEL-8-20191115.n.0
UPDATE_TO: RHOS_TRUNK-16.0-RHEL-8-20191126.n.2
DIRECTOR_UPDATE_BUILD: RHOS_TRUNK-16.0-RHEL-8-20191126.n.2



Actual results:
Not all network agents are Alive after the overcloud reboot, so the CI job fails.

Expected results:
The minor update completes successfully and all network agents are Alive after the overcloud reboot.


Additional info:

[0] From the job log:

TASK [Waiting for ovn-controller agent on compute node to come up] *************
task path: /home/rhos-ci/jenkins/workspace/DFG-network-networking-ovn-update-16_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/infrared/plugins/tripleo-overcloud/overcloud_reboot.yml:430
Sunday 01 December 2019  06:50:38 +0000 (0:00:00.214)       0:12:40.928 ******* 
FAILED - RETRYING: Waiting for ovn-controller agent on compute node to come up (60 retries left).
FAILED - RETRYING: Waiting for ovn-controller agent on compute node to come up (59 retries left).
.....
FAILED - RETRYING: Waiting for ovn-controller agent on compute node to come up (2 retries left).
FAILED - RETRYING: Waiting for ovn-controller agent on compute node to come up (1 retries left).
fatal: [compute-1 -> 172.16.0.98]: FAILED! => {
    "attempts": 60, 
    "changed": true, 
    "cmd": "source /home/stack/overcloudrc\n openstack network agent list --host compute-1.redhat.local -f json | jq -r -c '.[] | select(.Binary | contains(\"ovn-controller\")) | .Alive'", 
    "delta": "0:00:03.076233", 
    "end": "2019-12-01 06:59:49.934447", 
    "rc": 0, 
    "start": "2019-12-01 06:59:46.858214"
}

STDOUT:

false


NO MORE HOSTS LEFT *****
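
For anyone triaging this, the check that the failing task runs can be approximated offline. The following is a minimal Python sketch, not the infrared implementation. It assumes the output of `openstack network agent list -f json` is a JSON list of objects carrying `Binary` and `Alive` keys (the two fields the jq filter in the task uses). The function name and the sample data are hypothetical and only illustrate the filter.

```python
import json


def ovn_controller_alive(agent_list_json):
    """Return the Alive value of every agent whose Binary contains 'ovn-controller'.

    Mirrors the jq filter from the failing task:
      .[] | select(.Binary | contains("ovn-controller")) | .Alive
    """
    agents = json.loads(agent_list_json)
    return [
        agent["Alive"]
        for agent in agents
        if "ovn-controller" in agent.get("Binary", "")
    ]


# Hypothetical sample, trimmed to the two fields the check reads.
sample = json.dumps([
    {"Binary": "ovn-controller", "Alive": False},
    {"Binary": "some-other-agent", "Alive": True},
])

print(ovn_controller_alive(sample))
```

With the sample above, the ovn-controller entry is reported as not alive, which corresponds to the `false` in the task's STDOUT. An empty result would instead mean that no ovn-controller agent is registered for the host at all, which is a different failure mode from an agent that is registered but not Alive.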