Bug 1575150 - [HA] ODL Cluster stops responding when one controller gets removed from the cluster
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: opendaylight
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: urgent
Target Milestone: z5
Target Release: 13.0 (Queens)
Assignee: Stephen Kitt
QA Contact: Tomas Jamrisko
URL:
Whiteboard: HA
Duplicates: 1573973 1574739
Depends On:
Blocks:
 
Reported: 2018-05-04 21:36 UTC by jamo luhrsen
Modified: 2019-03-06 16:17 UTC
CC: 8 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
There is a known issue where the OpenDaylight cluster may stop responding for up to 30 minutes when an OpenDaylight cluster member is stopped (due to failure or otherwise). The workaround is to wait until the cluster becomes active again.
Clone Of:
Environment:
Last Closed: 2019-03-06 16:16:03 UTC
Target Upstream Version:
Embargoed:


Attachments
nova, neutron, odl logs (2.53 MB, application/x-xz)
2018-05-12 21:34 UTC, jamo luhrsen


Links
System ID Private Priority Status Summary Last Updated
OpenDaylight Bug NETVIRT-1460 0 None None None 2018-10-17 13:26:14 UTC
OpenDaylight Bug NETVIRT-1461 0 None None None 2018-10-17 13:26:37 UTC
OpenDaylight Bug OPNFLWPLUG-1013 0 None None None 2018-06-04 09:38:11 UTC
OpenDaylight Bug OPNFLWPLUG-1039 0 None None None 2018-10-17 13:27:30 UTC
OpenDaylight gerrit 72239 0 None None None 2018-09-12 09:06:40 UTC

Description jamo luhrsen 2018-05-04 21:36:49 UTC
Description of problem:

Instance goes to ERROR state with the following fault:

| fault | {"message": "Build of instance 50c22b9b-7e4c-4c8e-a075-ee8c4e857ff5 aborted: Failed to allocate the network(s), not rescheduling.", "code": 500, "details": " File \"/usr/lib/python2.7/site-packages/nova/compute/manager.py\", line 1840, in _do_build_and_run_instance |
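
For reference, the fault above can be read back from the Nova API. A rough
sketch with python-novaclient (assumes overcloudrc-style credentials are
exported in the environment; the "Default" domain names are a guess at the
deployment defaults):

  import os
  from keystoneauth1 import session
  from keystoneauth1.identity import v3
  from novaclient import client

  auth = v3.Password(auth_url=os.environ["OS_AUTH_URL"],
                     username=os.environ["OS_USERNAME"],
                     password=os.environ["OS_PASSWORD"],
                     project_name=os.environ["OS_PROJECT_NAME"],
                     user_domain_name="Default",      # assumption
                     project_domain_name="Default")   # assumption
  nova = client.Client("2.1", session=session.Session(auth=auth))

  server = nova.servers.get("50c22b9b-7e4c-4c8e-a075-ee8c4e857ff5")
  # 'fault' is only populated while the instance is in ERROR state
  print(getattr(server, "fault", None))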

Version-Release number of selected component (if applicable):
opendaylight-8.0.0-9.el7ost.noarch.rpm

How reproducible:
sporadically

Steps to Reproduce:
1. run this job:
     https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/


Actual results:

Instance stuck in ERROR state

Expected results:

Instance should be ACTIVE with connectivity

Additional info:

There is a similar bug, but there the nova fault indicates that no compute host
was available to spawn the instance. In this case, the failure is due to network
allocation.

We'll need to dig through the neutron and ODL logs first, I think.
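
Something along these lines can pull the ODL-facing errors out of neutron.log
so the timestamps can be lined up against odl.log (the match patterns are only
a guess at what networking-odl logs; adjust them to what the logs actually
contain):

  import re
  import sys

  # Guessed patterns for REST failures towards ODL; tune as needed.
  PATTERNS = re.compile(r"(404|Not Found|Timeout|timed out|"
                        r"Connection refused|Failed to allocate the network)")

  def scan(path):
      with open(path, errors="replace") as f:
          for line in f:
              if PATTERNS.search(line):
                  # neutron log lines start with "YYYY-MM-DD HH:MM:SS.mmm"
                  print(line[:23], line.strip()[:200])

  if __name__ == "__main__":
      for path in sys.argv[1:]:
          print("==", path)
          scan(path)

e.g. run it as: python scan_neutron.py controller-*/neutron.log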

This is the specific place to see the error in the job's robot log:

https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/32/robot/report/log.html#s1-s5-t11-k4-k1-k3-k5-k2

(NOTE: sometimes the Jenkins server does not serve the robot HTML files
and you get a blank page. If that happens, just wget the log.html file to your
local system and view it there.)

Controller logs (including neutron, ODL, etc.) are here:

  https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/32/artifact/

Comment 1 Mike Kolesnik 2018-05-07 06:55:39 UTC
Please attach logs from neutron & ODL.

Comment 2 jamo luhrsen 2018-05-12 21:34:32 UTC
Created attachment 1435512 [details]
nova, neutron, odl logs

logs that should be included:

$ tree -s
.
├── [       4096]  compute-0
│   └── [    8754072]  nova.log
├── [       4096]  compute-1
│   └── [    5343585]  nova.log
├── [       4096]  controller-0
│   ├── [    1563711]  neutron-dhcp.log
│   ├── [    7526355]  neutron.log
│   └── [   19361348]  odl.log
├── [       4096]  controller-1
│   ├── [    1542974]  neutron-dhcp.log
│   ├── [    7361950]  neutron.log
│   └── [   14770551]  odl.log
├── [       4096]  controller-2
│   ├── [    1528786]  neutron-dhcp.log
│   ├── [    8322507]  neutron.log
│   └── [   36139353]  odl.log
└── [    2649048]  logs.tar.xz

Comment 3 Mike Kolesnik 2018-05-15 07:49:24 UTC
From the neutron logs it seems that between 2018-05-03 03:25:34.160 and 2018-05-03 03:52:51.995 the whole cluster was unresponsive; it either reported 404 or the connection timed out entirely.
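
A quick way to tell the two failure modes apart (404 from RESTCONF vs. nothing
answering behind the VIP) is to probe the neutron northbound directly. Just a
sketch; the VIP address, port and credentials below are placeholders for
whatever the deployment actually uses:

  import requests

  # Placeholders: substitute the real HAProxy VIP, port and credentials.
  URL = "http://192.0.2.10:8081/controller/nb/v2/neutron/networks"

  try:
      r = requests.get(URL, auth=("admin", "admin"), timeout=10)
      print("HTTP", r.status_code)   # a 404 here means RESTCONF answered
  except requests.exceptions.Timeout:
      print("timed out - nothing responded behind the VIP")
  except requests.exceptions.ConnectionError as exc:
      print("connection error:", exc)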

Comment 4 Mike Kolesnik 2018-05-15 08:01:17 UTC
*** Bug 1573973 has been marked as a duplicate of this bug. ***

Comment 5 jamo luhrsen 2018-05-16 00:28:59 UTC
Mike,

what does "whole cluster" mean in this context? This is just rest
api calls from neutron to our haproxy VIP right?

404 would mean that at least RESTCONF is working, but unresponsive
could mean haproxy is sending the requests to a downed ODL.

I am digging through the odl logs to look at the cluster state
changes to see if I can map anything to problems in the robot time
stamps or what you see in the neutron logs.
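
In parallel I'll check the datastore shard state on each controller directly
over Jolokia, roughly like this (the host, port, credentials, member and shard
names are placeholders; the real ones depend on the deployment):

  import requests

  HOST = "172.17.1.10"   # controller's internal API address (placeholder)
  SHARD = ("org.opendaylight.controller:Category=Shards,"
           "name=member-1-shard-default-config,type=DistributedConfigDatastore")

  r = requests.get("http://%s:8181/jolokia/read/%s" % (HOST, SHARD),
                   auth=("admin", "admin"), timeout=10)
  value = r.json().get("value", {})
  # A healthy shard reports Leader or Follower and a non-empty Leader field.
  print(value.get("RaftState"), value.get("Leader"))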

Comment 6 jdchester7 2018-05-20 20:41:10 UTC
I am also having this problem. Has anyone found a solution? If you could, please help me out.

Comment 7 jamo luhrsen 2018-05-21 21:36:56 UTC
*** Bug 1574739 has been marked as a duplicate of this bug. ***

Comment 8 jamo luhrsen 2018-05-25 18:08:16 UTC
I was assuming that this bug was only hitting us in our d/s jobs,
so I thought we could focus on how we did our deployments, but that is
not the case. I see this in u/s ODL CSIT too:

https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-3node-openstack-queens-upstream-stateful-oxygen/270/robot-plugin/log_full.html.gz#s1-s5-t21-k3-k1-k3-k1-k2

Comment 9 Stephen Kitt 2018-05-29 08:31:19 UTC
(In reply to jamo luhrsen from comment #8)
> I was assuming that this bug was only hitting us in our d/s jobs,
> so I thought we could focus on how we did our deployments, but that is
> not the case. I see this in u/s ODL CSIT too:
> 
> https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-
> 3node-openstack-queens-upstream-stateful-oxygen/270/robot-plugin/log_full.
> html.gz#s1-s5-t21-k3-k1-k3-k1-k2

So it’s not related to HAProxy...

Comment 10 Stephen Kitt 2018-05-29 09:10:02 UTC
(In reply to jdchester7 from comment #6)
> I am also having this problem. Has anyone found a solution? If you could,
> please help me out.

We haven’t found a solution yet, but we’re interested in any information you might have — do you have logs from OpenStack and OpenDaylight when the problem occurs? (karaf.log at least from the OpenDaylight controllers.)

Comment 22 Stephen Kitt 2018-10-04 08:09:30 UTC
This still shows up in Oxygen CSIT.

Comment 25 Franck Baudin 2019-03-06 16:16:03 UTC
As per the deprecation notice [1], closing this bug. Please reopen if relevant for RHOSP13, as this is the only version shipping ODL.

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/14/html-single/release_notes/index#deprecated_functionality


