Bug 1575150 - [HA] ODL Cluster stops responding when one controller gets removed from the cluster
Status: ASSIGNED
Product: Red Hat OpenStack
Classification: Red Hat
Component: opendaylight
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: z4
Target Release: 13.0 (Queens)
Assigned To: Stephen Kitt
QA Contact: Tomas Jamrisko
Whiteboard: HA
Keywords: AutomationBlocker, Triaged, ZStream
Duplicates: 1573973, 1574739
Depends On:
Blocks:
 
Reported: 2018-05-04 17:36 EDT by jamo luhrsen
Modified: 2018-10-17 09:27 EDT
CC List: 9 users

See Also:
Fixed In Version:
Doc Type: Known Issue
Doc Text:
There is a known issue where the OpenDaylight cluster may stop responding for up to 30 minutes when an OpenDaylight cluster member is stopped (due to failure or otherwise). The workaround is to wait until the cluster becomes active again.
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---


Attachments:
nova, neutron, odl logs (2.53 MB, application/x-xz)
2018-05-12 17:34 EDT, jamo luhrsen


External Trackers:
OpenDaylight Bug NETVIRT-1460 (last updated 2018-10-17 09:26 EDT)
OpenDaylight Bug NETVIRT-1461 (last updated 2018-10-17 09:26 EDT)
OpenDaylight Bug OPNFLWPLUG-1013 (last updated 2018-06-04 05:38 EDT)
OpenDaylight Bug OPNFLWPLUG-1039 (last updated 2018-10-17 09:27 EDT)
OpenDaylight gerrit 72239 (last updated 2018-09-12 05:06 EDT)

Description jamo luhrsen 2018-05-04 17:36:49 EDT
Description of problem:

Instance goes to ERROR state with the following fault:

| fault | {"message": "Build of instance 50c22b9b-7e4c-4c8e-a075-ee8c4e857ff5 aborted: Failed to allocate the network(s), not rescheduling.", "code": 500, "details": " File \"/usr/lib/python2.7/site-packages/nova/compute/manager.py\", line 1840, in _do_build_and_run_instance |
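For reference, the full fault body can be pulled from Nova for the instance stuck in ERROR. A rough sketch using openstacksdk (the clouds.yaml entry name is a placeholder; the server ID is the one from the fault above):

import openstack

conn = openstack.connect(cloud="overcloud")  # placeholder clouds.yaml entry

server_id = "50c22b9b-7e4c-4c8e-a075-ee8c4e857ff5"
# Nova includes a "fault" field in the server body for instances in ERROR.
server = conn.compute.get(f"/servers/{server_id}").json()["server"]
print(server["status"])
print(server.get("fault", {}).get("message"))
print(server.get("fault", {}).get("details"))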

Version-Release number of selected component (if applicable):
opendaylight-8.0.0-9.el7ost.noarch.rpm

How reproducible:
sporadically

Steps to Reproduce:
1. run this job:
     https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/


Actual results:

Instance stuck in ERROR state

Expected results:

Instance should be ACTIVE with connectivity

Additional info:

There is a similar bug, but the nova fault indicates there was no compute host available
to spawn the instance. In this case, the reason for the failure is due to network allocation.

We'll need to dig through the neutron and ODL logs first, I think.

This is the specific place to see the error in the jobs robot log:

https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/32/robot/report/log.html#s1-s5-t11-k4-k1-k3-k5-k2

(NOTE: sometimes the Jenkins server doesn't serve up the robot HTML files
and you get a blank page. If that happens, just wget the log.html file to your
local system and view it there.)

controller logs (including neutron, odl, etc logs) are here:

  https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/32/artifact/
Comment 1 Mike Kolesnik 2018-05-07 02:55:39 EDT
Please attach logs from neutron & ODL.
Comment 2 jamo luhrsen 2018-05-12 17:34 EDT
Created attachment 1435512 [details]
nova, neutron, odl logs

logs that should be included:

$ tree -s
.
├── [       4096]  compute-0
│   └── [    8754072]  nova.log
├── [       4096]  compute-1
│   └── [    5343585]  nova.log
├── [       4096]  controller-0
│   ├── [    1563711]  neutron-dhcp.log
│   ├── [    7526355]  neutron.log
│   └── [   19361348]  odl.log
├── [       4096]  controller-1
│   ├── [    1542974]  neutron-dhcp.log
│   ├── [    7361950]  neutron.log
│   └── [   14770551]  odl.log
├── [       4096]  controller-2
│   ├── [    1528786]  neutron-dhcp.log
│   ├── [    8322507]  neutron.log
│   └── [   36139353]  odl.log
└── [    2649048]  logs.tar.xz
Comment 3 Mike Kolesnik 2018-05-15 03:49:24 EDT
From the neutron logs it seems that between 2018-05-03 03:25:34.160 and 2018-05-03 03:52:51.995 the whole cluster was unresponsive: it either reported 404 or the connection timed out entirely.
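To spot-check whether the datastore shards still had a leader during that window, the shard MBeans can be read through Jolokia on each controller. A rough sketch (the controller IPs, Jolokia credentials and member names are placeholders, and the port may differ in this deployment):

import requests
from requests.auth import HTTPBasicAuth

controllers = {
    "controller-0": ("192.0.2.11", "member-1"),
    "controller-1": ("192.0.2.12", "member-2"),
    "controller-2": ("192.0.2.13", "member-3"),
}
auth = HTTPBasicAuth("admin", "admin")  # placeholder credentials

for node, (ip, member) in controllers.items():
    mbean = ("org.opendaylight.controller:type=DistributedConfigDatastore,"
             f"Category=Shards,name={member}-shard-default-config")
    url = f"http://{ip}:8181/jolokia/read/{mbean}"
    try:
        value = requests.get(url, auth=auth, timeout=5).json().get("value", {})
        print(node, value.get("RaftState"), "leader:", value.get("Leader"))
    except requests.RequestException as exc:
        print(node, "no answer:", exc)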
Comment 4 Mike Kolesnik 2018-05-15 04:01:17 EDT
*** Bug 1573973 has been marked as a duplicate of this bug. ***
Comment 5 jamo luhrsen 2018-05-15 20:28:59 EDT
Mike,

what does "whole cluster" mean in this context? These are just REST
API calls from neutron to our HAProxy VIP, right?

A 404 would mean that at least RESTCONF is working, but "unresponsive"
could mean HAProxy is sending the requests to a downed ODL.

I am digging through the ODL logs to look at the cluster state
changes, to see if I can map anything to problems in the robot
timestamps or to what you see in the neutron logs.
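One way to tell those two apart would be to hit each ODL member directly instead of going through the VIP. A rough sketch (the controller IPs, credentials and RESTCONF port are placeholders for this deployment):

import requests
from requests.auth import HTTPBasicAuth

members = ["192.0.2.11", "192.0.2.12", "192.0.2.13"]  # placeholder controller IPs
auth = HTTPBasicAuth("admin", "admin")                # placeholder credentials

for ip in members:
    # /restconf/modules is a cheap GET that any live RESTCONF endpoint answers.
    url = f"http://{ip}:8081/restconf/modules"        # 8181 on a default ODL install
    try:
        r = requests.get(url, auth=auth, timeout=5)
        print(f"{ip}: HTTP {r.status_code}")
    except requests.RequestException as exc:
        print(f"{ip}: unreachable ({exc})")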
Comment 6 jdchester7 2018-05-20 16:41:10 EDT
I am also having this problem. Has anyone found a solution? If you could, please help me out.
Comment 7 jamo luhrsen 2018-05-21 17:36:56 EDT
*** Bug 1574739 has been marked as a duplicate of this bug. ***
Comment 8 jamo luhrsen 2018-05-25 14:08:16 EDT
I was making the assumption that this bug was only hitting us in our d/s jobs
so thought we could focus on how we did our deployments, but that is not the
case. I see this in u/s ODL CSIT too:

https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-3node-openstack-queens-upstream-stateful-oxygen/270/robot-plugin/log_full.html.gz#s1-s5-t21-k3-k1-k3-k1-k2
Comment 9 Stephen Kitt 2018-05-29 04:31:19 EDT
(In reply to jamo luhrsen from comment #8)
> I was making the assumption that this bug was only hitting us in our d/s jobs
> so thought we could focus on how we did our deployments, but that is not the
> case. I see this in u/s ODL CSIT too:
> 
> https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-
> 3node-openstack-queens-upstream-stateful-oxygen/270/robot-plugin/log_full.
> html.gz#s1-s5-t21-k3-k1-k3-k1-k2

So it’s not related to HAProxy...
Comment 10 Stephen Kitt 2018-05-29 05:10:02 EDT
(In reply to jdchester7 from comment #6)
> I am also having this problem. Has anyone found a solution? If you could,
> please help me out.

We haven’t found a solution yet, but we’re interested in any information you might have — do you have logs from OpenStack and OpenDaylight when the problem occurs? (karaf.log at least from the OpenDaylight controllers.)
Comment 22 Stephen Kitt 2018-10-04 04:09:30 EDT
This still shows up in Oxygen CSIT.
