Bug 1575150 - [HA] ODL Cluster stops responding when one controller gets removed from the cluster
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: opendaylight
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: urgent
Target Milestone: z5
Target Release: 13.0 (Queens)
Assignee: Stephen Kitt
QA Contact: Tomas Jamrisko
URL:
Whiteboard: HA
Duplicates: 1573973 1574739
Depends On:
Blocks:
 
Reported: 2018-05-04 21:36 UTC by jamo luhrsen
Modified: 2019-03-06 16:17 UTC
CC: 8 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
There is a known issue where the OpenDaylight cluster may stop responding for up to 30 minutes when an OpenDaylight cluster member is stopped (due to failure or otherwise). The workaround is to wait until the cluster becomes active again.
Clone Of:
Environment:
Last Closed: 2019-03-06 16:16:03 UTC
Target Upstream Version:
Embargoed:


Attachments
nova, neutron, odl logs (2.53 MB, application/x-xz)
2018-05-12 21:34 UTC, jamo luhrsen


Links
System ID Private Priority Status Summary Last Updated
OpenDaylight Bug NETVIRT-1460 0 None None None 2018-10-17 13:26:14 UTC
OpenDaylight Bug NETVIRT-1461 0 None None None 2018-10-17 13:26:37 UTC
OpenDaylight Bug OPNFLWPLUG-1013 0 None None None 2018-06-04 09:38:11 UTC
OpenDaylight Bug OPNFLWPLUG-1039 0 None None None 2018-10-17 13:27:30 UTC
OpenDaylight gerrit 72239 0 None None None 2018-09-12 09:06:40 UTC

Description jamo luhrsen 2018-05-04 21:36:49 UTC
Description of problem:

Instance goes to ERROR state with the following fault:

| fault | {"message": "Build of instance 50c22b9b-7e4c-4c8e-a075-ee8c4e857ff5 aborted: Failed to allocate the network(s), not rescheduling.", "code": 500, "details": " File \"/usr/lib/python2.7/site-packages/nova/compute/manager.py\", line 1840, in _do_build_and_run_instance |
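
For reference, the fault above can be read back from the Nova API. A rough
sketch with python-novaclient (assumes overcloudrc-style credentials are
exported in the environment; the "Default" domain names are a guess at the
deployment defaults):

  import os
  from keystoneauth1 import session
  from keystoneauth1.identity import v3
  from novaclient import client

  auth = v3.Password(auth_url=os.environ["OS_AUTH_URL"],
                     username=os.environ["OS_USERNAME"],
                     password=os.environ["OS_PASSWORD"],
                     project_name=os.environ["OS_PROJECT_NAME"],
                     user_domain_name="Default",      # assumption
                     project_domain_name="Default")   # assumption
  nova = client.Client("2.1", session=session.Session(auth=auth))

  server = nova.servers.get("50c22b9b-7e4c-4c8e-a075-ee8c4e857ff5")
  # 'fault' is only populated while the instance is in ERROR state
  print(getattr(server, "fault", None))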

Version-Release number of selected component (if applicable):
opendaylight-8.0.0-9.el7ost.noarch.rpm

How reproducible:
sporadically

Steps to Reproduce:
1. run this job:
     https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/


Actual results:

Instance stuck in ERROR state

Expected results:

Instance should be ACTIVE with connectivity

Additional info:

There is a similar bug, but there the nova fault indicates that no compute host
was available to spawn the instance. In this case, the failure is due to network
allocation.

We'll need to dig through the neutron and ODL logs first, I think.
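
Something along these lines can pull the ODL-facing errors out of neutron.log
so the timestamps can be lined up against odl.log (the match patterns are only
a guess at what networking-odl logs; adjust them to what the logs actually
contain):

  import re
  import sys

  # Guessed patterns for REST failures towards ODL; tune as needed.
  PATTERNS = re.compile(r"(404|Not Found|Timeout|timed out|"
                        r"Connection refused|Failed to allocate the network)")

  def scan(path):
      with open(path, errors="replace") as f:
          for line in f:
              if PATTERNS.search(line):
                  # neutron log lines start with "YYYY-MM-DD HH:MM:SS.mmm"
                  print(line[:23], line.strip()[:200])

  if __name__ == "__main__":
      for path in sys.argv[1:]:
          print("==", path)
          scan(path)

e.g. run it as: python scan_neutron.py controller-*/neutron.log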

This is the specific place to see the error in the job's robot log:

https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/32/robot/report/log.html#s1-s5-t11-k4-k1-k3-k5-k2

(NOTE: sometimes the Jenkins server does not serve the robot HTML files
and you get a blank page. If that happens, just wget the log.html file to your
local system and view it there.)

Controller logs (including neutron, ODL, etc.) are here:

  https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/32/artifact/

Comment 1 Mike Kolesnik 2018-05-07 06:55:39 UTC
Please attach logs from neutron & ODL.

Comment 2 jamo luhrsen 2018-05-12 21:34:32 UTC
Created attachment 1435512 [details]
nova, neutron, odl logs

logs that should be included:

$ tree -s
.
├── [       4096]  compute-0
│   └── [    8754072]  nova.log
├── [       4096]  compute-1
│   └── [    5343585]  nova.log
├── [       4096]  controller-0
│   ├── [    1563711]  neutron-dhcp.log
│   ├── [    7526355]  neutron.log
│   └── [   19361348]  odl.log
├── [       4096]  controller-1
│   ├── [    1542974]  neutron-dhcp.log
│   ├── [    7361950]  neutron.log
│   └── [   14770551]  odl.log
├── [       4096]  controller-2
│   ├── [    1528786]  neutron-dhcp.log
│   ├── [    8322507]  neutron.log
│   └── [   36139353]  odl.log
└── [    2649048]  logs.tar.xz

Comment 3 Mike Kolesnik 2018-05-15 07:49:24 UTC
From the neutron logs it seems that between 2018-05-03 03:25:34.160 and 2018-05-03 03:52:51.995 the whole cluster was unresponsive; it either reported 404 or the connection timed out entirely.
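
A quick way to tell the two failure modes apart (404 from RESTCONF vs. nothing
answering behind the VIP) is to probe the neutron northbound directly. Just a
sketch; the VIP address, port and credentials below are placeholders for
whatever the deployment actually uses:

  import requests

  # Placeholders: substitute the real HAProxy VIP, port and credentials.
  URL = "http://192.0.2.10:8081/controller/nb/v2/neutron/networks"

  try:
      r = requests.get(URL, auth=("admin", "admin"), timeout=10)
      print("HTTP", r.status_code)   # a 404 here means RESTCONF answered
  except requests.exceptions.Timeout:
      print("timed out - nothing responded behind the VIP")
  except requests.exceptions.ConnectionError as exc:
      print("connection error:", exc)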

Comment 4 Mike Kolesnik 2018-05-15 08:01:17 UTC
*** Bug 1573973 has been marked as a duplicate of this bug. ***

Comment 5 jamo luhrsen 2018-05-16 00:28:59 UTC
Mike,

what does "whole cluster" mean in this context? This is just rest
api calls from neutron to our haproxy VIP right?

404 would mean that at least RESTCONF is working, but unresponsive
could mean haproxy is sending the requests to a downed ODL.

I am digging through the odl logs to look at the cluster state
changes to see if I can map anything to problems in the robot time
stamps or what you see in the neutron logs.
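
In parallel I'll check the datastore shard state on each controller directly
over Jolokia, roughly like this (the host, port, credentials, member and shard
names are placeholders; the real ones depend on the deployment):

  import requests

  HOST = "172.17.1.10"   # controller's internal API address (placeholder)
  SHARD = ("org.opendaylight.controller:Category=Shards,"
           "name=member-1-shard-default-config,type=DistributedConfigDatastore")

  r = requests.get("http://%s:8181/jolokia/read/%s" % (HOST, SHARD),
                   auth=("admin", "admin"), timeout=10)
  value = r.json().get("value", {})
  # A healthy shard reports Leader or Follower and a non-empty Leader field.
  print(value.get("RaftState"), value.get("Leader"))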

Comment 6 jdchester7 2018-05-20 20:41:10 UTC
I am also having this problem. Has anyone found a solution? If you could, please help me out.

Comment 7 jamo luhrsen 2018-05-21 21:36:56 UTC
*** Bug 1574739 has been marked as a duplicate of this bug. ***

Comment 8 jamo luhrsen 2018-05-25 18:08:16 UTC
I was assuming that this bug was only hitting us in our d/s jobs,
so I thought we could focus on how we did our deployments, but that is
not the case. I see this in u/s ODL CSIT too:

https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-3node-openstack-queens-upstream-stateful-oxygen/270/robot-plugin/log_full.html.gz#s1-s5-t21-k3-k1-k3-k1-k2

Comment 9 Stephen Kitt 2018-05-29 08:31:19 UTC
(In reply to jamo luhrsen from comment #8)
> I was assuming that this bug was only hitting us in our d/s jobs,
> so I thought we could focus on how we did our deployments, but that is
> not the case. I see this in u/s ODL CSIT too:
> 
> https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-
> 3node-openstack-queens-upstream-stateful-oxygen/270/robot-plugin/log_full.
> html.gz#s1-s5-t21-k3-k1-k3-k1-k2

So it’s not related to HAProxy...

Comment 10 Stephen Kitt 2018-05-29 09:10:02 UTC
(In reply to jdchester7 from comment #6)
> I am also having this problem. Has anyone found a solution? If you could,
> please help me out.

We haven’t found a solution yet, but we’re interested in any information you might have — do you have logs from OpenStack and OpenDaylight when the problem occurs? (karaf.log at least from the OpenDaylight controllers.)

Comment 22 Stephen Kitt 2018-10-04 08:09:30 UTC
This still shows up in Oxygen CSIT.

Comment 25 Franck Baudin 2019-03-06 16:16:03 UTC
As per the deprecation notice [1], closing this bug. Please reopen if relevant for RHOSP13, as this is the only version shipping ODL.

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/14/html-single/release_notes/index#deprecated_functionality


