Bug 1573973

Summary:

VMs sometimes fail to start (no compute host available) when one controller gets removed from the cluster.

Product:

Red Hat OpenStack

Reporter:

Tomas Jamrisko <tjamrisk>

Component:

opendaylight

Assignee:

Stephen Kitt <skitt>

Status:

CLOSED DUPLICATE

QA Contact:

Itzik Brown <itbrown>

Severity:

high

Docs Contact:

Priority:

high

Version:

13.0 (Queens)

CC:

aadam, jluhrsen, mkolesni, nyechiel, skitt, tjamrisk

Target Milestone:

beta

Target Release:

---

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

odl_netvirt, odl_ha

Environment:

N/A

Last Closed:

2018-05-15 08:01:17 UTC

Type:

Bug

Attachments:

Description	Flags
Artifacts from the job	none
controller-0 karaf.log	none
odl and neutron logs for all three controllers	none

Description Tomas Jamrisko 2018-05-02 15:43:17 UTC

Created attachment 1430151 [details]
Artifacts from the job

Description of problem:
We're seeing VMs sometimes fail to start after one of the controllers is taken down (either by stopping the container or by blocking port 2550).

Version-Release number of selected component (if applicable):
opendaylight-8.0.0-8.el7ost

How reproducible:
Random

Steps to Reproduce:
1. Remove a controller from the cluster
2. Start a VM

Actual results:
The VM stays in BUILD, eventually transitions to ERROR, and reports that it can't get a valid host.

- https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/28/robot/report/log.html

Expected results:
VM should start

Additional info:

Comment 3 jamo luhrsen 2018-05-03 21:50:44 UTC

Another job that hit this is here:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/26/robot/report/log.html#s1-s5-t11-k4-k1-k3-k4-k2

the "fault" in "server show" is:
  {"message": "Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 464b5640-8ef3-4237-9a3b-22fb05f29787.", "code": 500, "details": " File \"/usr/lib/python2.7/site-packages/nova/conductor/manager.py\", line 580, in build_instances | | | raise exception.MaxRetriesExceeded(reason=msg) 

I think we need to dig into the other OpenStack logs, such as nova and
maybe neutron.
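
For triaging further occurrences, here is a minimal Python sketch that pulls the instance ID out of a fault like the one quoted above and flags the "retries exhausted" case. It assumes the fault has already been loaded as a dict with "message" and "code" keys; the function name and the category labels are hypothetical and not part of nova.

```python
import re

# UUID that nova embeds in the fault message ("... for instance <uuid>.").
_INSTANCE_RE = re.compile(
    r"for instance "
    r"([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})"
)


def classify_fault(fault):
    """Summarize a 'fault' dict taken from "server show" output.

    Assumption: `fault` has 'message' and 'code' keys as in the excerpt
    above. The category strings are illustrative labels only.
    """
    message = fault.get("message", "")
    match = _INSTANCE_RE.search(message)
    if "Exhausted all hosts available" in message:
        category = "scheduling_retries_exhausted"
    else:
        category = "other"
    return {
        "category": category,
        "code": fault.get("code"),
        "instance_id": match.group(1) if match else None,
    }


# Fault message and code copied from the comment above.
example_fault = {
    "message": (
        "Exceeded maximum number of retries. Exhausted all hosts available "
        "for retrying build failures for instance "
        "464b5640-8ef3-4237-9a3b-22fb05f29787."
    ),
    "code": 500,
}
summary = classify_fault(example_fault)
```

A "scheduling_retries_exhausted" result only says that every candidate compute host was tried and rejected; the underlying reason still has to come from the nova and neutron logs.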

Comment 4 Mike Kolesnik 2018-05-07 06:53:43 UTC

Please attach logs from neutron & ODL.

Comment 5 jamo luhrsen 2018-05-07 16:48:27 UTC

Created attachment 1432698 [details]
controller-0 karaf.log

Comment 6 jamo luhrsen 2018-05-07 16:49:38 UTC

Created attachment 1432699 [details]
odl and neutron logs for all three controllers

Comment 7 Mike Kolesnik 2018-05-15 08:01:17 UTC

Judging by the logs, this seems like a duplicate of bug 1575150: both show the same timeouts while trying to contact ODL.

The difference seems to be that here the "agents" were already detected as "dead" before the VM was created, so the error looks different, but the root cause is the same.

If the fix for the other bug doesn't resolve this one, please reopen.

*** This bug has been marked as a duplicate of bug 1575150 ***