Bug 1573973

Summary: VMs sometimes fail to start (no compute host available) when one controller gets removed from the cluster.
Product: Red Hat OpenStack
Component: opendaylight
Version: 13.0 (Queens)
Status: CLOSED DUPLICATE
Severity: high
Priority: high
Target Milestone: beta
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard: odl_netvirt, odl_ha
Reporter: Tomas Jamrisko <tjamrisk>
Assignee: Stephen Kitt <skitt>
QA Contact: Itzik Brown <itbrown>
CC: aadam, jluhrsen, mkolesni, nyechiel, skitt, tjamrisk
Doc Type: If docs needed, set a value
Environment: N/A
Last Closed: 2018-05-15 08:01:17 UTC
Type: Bug
Attachments:
Description                                      Flags
Artifacts from the job                           none
controller-0 karaf.log                           none
odl and neutron logs for all three controllers   none

Description Tomas Jamrisko 2018-05-02 15:43:17 UTC
Created attachment 1430151 [details]
Artifacts from the job

Description of problem:
We're seeing issues where VMs sometimes fail to start after one of the controllers is taken down (either by stopping the container or by blocking port 2550).

Version-Release number of selected component (if applicable):
opendaylight-8.0.0-8.el7ost

How reproducible:
Random

Steps to Reproduce:
1. Remove a controller from the cluster
2. Start a VM

Actual results:
The VM stays in BUILD, eventually transitions to ERROR, and reports that it can't get a valid host.

- https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/28/robot/report/log.html
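
The BUILD-then-ERROR behaviour described above can be checked with a small polling loop. The following is only a minimal sketch, not the code used by the CSIT job: `wait_for_server` and its `get_status` callable are hypothetical names, and the caller is assumed to supply a function that returns the current server status string (for example, by wrapping "server show").

```python
import time


def wait_for_server(get_status, timeout=300.0, interval=5.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until the server leaves BUILD or the timeout expires.

    get_status is a caller-supplied (hypothetical) callable returning the
    server status string, e.g. "BUILD", "ACTIVE" or "ERROR".
    Returns the first non-BUILD status, or "TIMEOUT" if the server is
    still in BUILD once the deadline has passed.
    """
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status != "BUILD":
            return status
        if clock() >= deadline:
            return "TIMEOUT"
        sleep(interval)


# Example with canned statuses standing in for real "server show" calls.
canned = iter(["BUILD", "BUILD", "ERROR"])
result = wait_for_server(lambda: next(canned), timeout=10, interval=0,
                         sleep=lambda seconds: None)
print(result)
```

Passing `sleep` and `clock` as parameters keeps the loop easy to exercise without waiting in real time.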

Expected results:
VM should start

Additional info:

Comment 3 jamo luhrsen 2018-05-03 21:50:44 UTC
Another job that saw this is here:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/26/robot/report/log.html#s1-s5-t11-k4-k1-k3-k4-k2

the "fault" in "server show" is:
  {"message": "Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 464b5640-8ef3-4237-9a3b-22fb05f29787.", "code": 500, "details": " File \"/usr/lib/python2.7/site-packages/nova/conductor/manager.py\", line 580, in build_instances | | | raise exception.MaxRetriesExceeded(reason=msg) 

I think we need to dig into the other OpenStack logs, such as nova and maybe
neutron.
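
When triaging several failed runs, it can help to pull the key facts out of the fault shown above. This is a rough sketch only: `summarize_fault` is a hypothetical helper, and it assumes the fault has already been loaded into a dict with the "message" and "code" keys visible in the output above (the truncated "details" value is left out).

```python
import re

# Matches the instance UUID as it appears at the end of the fault message.
UUID_RE = re.compile(
    r"instance ([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})"
)


def summarize_fault(fault):
    """Return the code, the instance UUID, and whether retries were exhausted."""
    message = fault.get("message", "")
    match = UUID_RE.search(message)
    return {
        "code": fault.get("code"),
        "max_retries_exceeded": message.startswith(
            "Exceeded maximum number of retries"
        ),
        "instance": match.group(1) if match else None,
    }


# Message and code copied from the fault quoted in this comment.
fault = {
    "message": (
        "Exceeded maximum number of retries. Exhausted all hosts available "
        "for retrying build failures for instance "
        "464b5640-8ef3-4237-9a3b-22fb05f29787."
    ),
    "code": 500,
}
summary = summarize_fault(fault)
print(summary)
```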

Comment 4 Mike Kolesnik 2018-05-07 06:53:43 UTC
Please attach logs from neutron & ODL.

Comment 5 jamo luhrsen 2018-05-07 16:48:27 UTC
Created attachment 1432698 [details]
controller-0 karaf.log

Comment 6 jamo luhrsen 2018-05-07 16:49:38 UTC
Created attachment 1432699 [details]
odl and neutron logs for all three controllers

Comment 7 Mike Kolesnik 2018-05-15 08:01:17 UTC
Judging by the logs, this seems like a duplicate of bug 1575150: both show the same timeouts while trying to contact ODL.

The difference seems to be that here the "agents" were detected as "dead" before the VM creation, so it looks like a different error, but the root cause is the same.
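
One way to confirm the "dead" agents state before creating a VM is to filter the Neutron agent list for entries whose `alive` flag is false. The sketch below is illustrative only: `dead_agents` is a hypothetical helper, the input is assumed to be the list of agent dicts from the Neutron agents API (with its `host`, `agent_type` and `alive` fields), and the sample hosts and agent type are made up.

```python
def dead_agents(agents):
    """Return (host, agent_type) pairs for agents not reported as alive.

    An agent with no "alive" key is treated as dead, which is a cautious
    assumption made for this sketch.
    """
    return [
        (agent.get("host"), agent.get("agent_type"))
        for agent in agents
        if not agent.get("alive", False)
    ]


# Made-up sample data; real entries would come from the agent list.
sample = [
    {"host": "compute-0.example", "agent_type": "example-agent", "alive": True},
    {"host": "compute-1.example", "agent_type": "example-agent", "alive": False},
]
print(dead_agents(sample))
```

If this returns every compute host, the scheduler has nowhere to place the instance, which would match the "exhausted all hosts" fault reported earlier.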

If the fix for the other bug doesn't resolve this one, please reopen.

*** This bug has been marked as a duplicate of bug 1575150 ***