Bug 1573973 - VMs sometimes fail to start (no compute host available) when one controller gets removed from the cluster.
Keywords:
Status: CLOSED DUPLICATE of bug 1575150
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: opendaylight
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: beta
Target Release: ---
Assignee: Stephen Kitt
QA Contact: Itzik Brown
URL:
Whiteboard: odl_netvirt, odl_ha
Depends On:
Blocks:
Reported: 2018-05-02 15:43 UTC by Tomas Jamrisko
Modified: 2018-10-24 12:37 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
N/A
Last Closed: 2018-05-15 08:01:17 UTC
Target Upstream Version:


Attachments
Artifacts from the job (521.62 KB, application/zip)
2018-05-02 15:43 UTC, Tomas Jamrisko
controller-0 karaf.log (17.20 MB, text/plain)
2018-05-07 16:48 UTC, jamo luhrsen
odl and neutron logs for all three controllers (4.45 MB, application/x-gzip)
2018-05-07 16:49 UTC, jamo luhrsen

Description Tomas Jamrisko 2018-05-02 15:43:17 UTC
Created attachment 1430151 [details]
Artifacts from the job

Description of problem:
We're seeing issues where VMs sometimes fail to start after one of the controllers is taken down (either by stopping the container or by blocking port 2550).

Version-Release number of selected component (if applicable):
opendaylight-8.0.0-8.el7ost

How reproducible:
Random

Steps to Reproduce:
1. Remove a controller from the cluster
2. Start a VM

Actual results:
The VM stays in BUILD and eventually transitions to ERROR, reporting that it can't get a valid host.

- https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/28/robot/report/log.html

Expected results:
VM should start
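
The check implied by the actual and expected results can be sketched as a small poll loop. This is only an illustration, not the CSIT job's code: `wait_for_final_status` and the `get_status` callable are hypothetical names, and the caller is assumed to supply a function that returns the server's current status string (BUILD, ACTIVE or ERROR).

```python
import time
from typing import Callable


def wait_for_final_status(
    get_status: Callable[[], str],
    timeout: float = 300.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll a (hypothetical) status getter until the VM leaves BUILD.

    Returns the last status seen: ACTIVE on success, ERROR on failure,
    or whatever status was current when the timeout expired.
    """
    deadline = clock() + timeout
    while True:
        status = get_status()
        # ACTIVE and ERROR are final for the purposes of this check.
        if status in ("ACTIVE", "ERROR"):
            return status
        # Still building: give up once the deadline has passed.
        if clock() >= deadline:
            return status
        sleep(interval)
```

With this bug, such a loop would return ERROR (or time out in BUILD) instead of ACTIVE.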

Additional info:

Comment 3 jamo luhrsen 2018-05-03 21:50:44 UTC
Another job that saw this is here:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/26/robot/report/log.html#s1-s5-t11-k4-k1-k3-k4-k2

The "fault" field in the "server show" output is:
  {"message": "Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 464b5640-8ef3-4237-9a3b-22fb05f29787.", "code": 500, "details": " File \"/usr/lib/python2.7/site-packages/nova/conductor/manager.py\", line 580, in build_instances | | | raise exception.MaxRetriesExceeded(reason=msg) 

I think we need to dig into the other OpenStack logs, such as nova and maybe
neutron.
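
When sifting through many failed instances, the fault shown above can be summarized mechanically. The following is a rough sketch, assuming the fault has already been loaded as a Python dict with "message" and "code" keys as in the output quoted in this comment; `summarize_fault` is a hypothetical helper, not part of nova or any OpenStack client.

```python
import re

# Matches a lowercase UUID such as the instance ID in the fault message.
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def summarize_fault(fault: dict) -> dict:
    """Extract the fields of interest from a 'fault' dict (hypothetical helper)."""
    message = fault.get("message", "")
    match = UUID_RE.search(message)
    return {
        "code": fault.get("code"),
        "instance": match.group(0) if match else None,
        # True when the scheduler ran out of hosts to retry on.
        "retries_exhausted": "Exceeded maximum number of retries" in message,
    }


# The message and code are copied from the fault quoted in this comment.
fault = {
    "message": (
        "Exceeded maximum number of retries. Exhausted all hosts available "
        "for retrying build failures for instance "
        "464b5640-8ef3-4237-9a3b-22fb05f29787."
    ),
    "code": 500,
}
summary = summarize_fault(fault)
```

The instance ID recovered this way can then be used to search the nova and neutron logs for the matching build attempts.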

Comment 4 Mike Kolesnik 2018-05-07 06:53:43 UTC
Please attach logs from neutron & ODL.

Comment 5 jamo luhrsen 2018-05-07 16:48:27 UTC
Created attachment 1432698 [details]
controller-0 karaf.log

Comment 6 jamo luhrsen 2018-05-07 16:49:38 UTC
Created attachment 1432699 [details]
odl and neutron logs for all three controllers

Comment 7 Mike Kolesnik 2018-05-15 08:01:17 UTC
Judging by the logs, this seems to be a duplicate of bug 1575150: both show the same timeouts while trying to contact ODL.

The difference seems to be that here the "agents" were detected as "dead" before the VM creation, so it looks like a different error, but the root cause is the same.

If the fix for the other bug doesn't solve this one, please reopen.

*** This bug has been marked as a duplicate of bug 1575150 ***

