Bug 1568976 - [Deployment] One ODL controller is not started correctly returning 404 on every REST call leading to failed OSP+ODL deploy
Summary: [Deployment] One ODL controller is not started correctly returning 404 on every REST call leading to failed OSP+ODL deploy
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: opendaylight
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z1
Target Release: 13.0 (Queens)
Assignee: Stephen Kitt
QA Contact: Waldemar Znoinski
URL:
Whiteboard: Deployment,
Depends On: 1570848
Blocks:
 
Reported: 2018-04-18 13:14 UTC by Sai Sindhur Malleni
Modified: 2018-10-18 07:23 UTC
CC List: 10 users

Fixed In Version: opendaylight-8.3.0-1.el7ost
Doc Type: Known Issue
Doc Text:
During deployment, one or more OpenDaylight instances may fail to start correctly due to a feature loading bug. This may lead to a deployment or functional failure. Because only two of the three OpenDaylight instances need to be functional for a deployment to succeed, a deployment can pass even though the third instance started incorrectly. Check the health status of each container with the `docker ps` command; if a container is unhealthy, restart it with `docker restart opendaylight_api`. If the deployment fails, the only option is to restart the deployment. For TLS-based deployments, all OpenDaylight instances must boot correctly or the deployment will fail.
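For example, a minimal health check might look like this (a sketch; the --format string is just one way to surface the health state Docker reports):

  # Show the OpenDaylight container and its reported status/health
  docker ps --filter name=opendaylight_api --format '{{.Names}}: {{.Status}}'
  # A broken instance typically shows "(unhealthy)" in the status; restart it:
  docker restart opendaylight_api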
Clone Of:
Environment:
N/A
Last Closed: 2018-07-19 13:53:05 UTC
Target Upstream Version:
Embargoed:


Links
OpenDaylight gerrit 70979 (last updated 2018-04-18 13:15:19 UTC)
Red Hat Product Errata RHBA-2018:2215 (last updated 2018-07-19 13:53:47 UTC)

Description Sai Sindhur Malleni 2018-04-18 13:14:55 UTC
Description of problem:
When deploying 3 controllers + 6 compute nodes with ODL in HA colocated on the OSP controllers, we occasionally see the deploy fail with the following error:

curl -k -o /dev/null --fail --silent --head -u admin:admin http://172.16.0.13:8081/restconf/operational/network-topology:network-topology/topology/netvirt:1 returned 22 instead of one of [0]

172.16.0.13 is the VIP for ODL.

On further investigation by Tim Rozet, we found that there were no networking issues. The curl would work from all 3 controllers and from some, but not all, of the computes.

The issue seems to be that ODL features are sometimes not loaded in the correct order, leaving a non-functional ODL (started, but returning HTTP 404) on one of the controllers. This appears to be an initialization race condition where Jersey needs to finish initializing before ODL starts serving requests. More details are in the commit message here: https://git.opendaylight.org/gerrit/#/c/70979/
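A quick way to narrow down which instance is affected (a sketch; the per-controller IPs are placeholders, only the VIP 172.16.0.13 appears in this report) is to query each ODL instance directly on port 8081 instead of going through the VIP:

  # Query each controller's ODL REST endpoint directly; a broken instance returns 404
  for ip in <controller-1-ip> <controller-2-ip> <controller-3-ip>; do
    curl -k -s -o /dev/null -w "$ip -> %{http_code}\n" -u admin:admin \
      "http://$ip:8081/restconf/operational/network-topology:network-topology/topology/netvirt:1"
  done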

java.lang.RuntimeException: Error obtaining AAAShiroProvider

Version-Release number of selected component (if applicable):
OSP 13

How reproducible:

Very occasionally, and mostly only during scale deploys.

Steps to Reproduce:
1. Deploy OSP + ODL setup with a lot of computes (6 in our case)

Actual results:
Deploy failed

Expected results:
Deploy should succeed every time

Additional info:

Comment 15 Sai Sindhur Malleni 2018-04-30 13:42:14 UTC
Mike,

This is happening pretty consistently in my environment. So the solution is: if the deploy fails, you manually restart the ODL controllers and run a stack update? Is this OK? Shouldn't we have documentation in place that describes this? I believe a failed overcloud stack isn't a great sign.
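For reference, the manual workaround described here would look roughly like this (a sketch; the deploy command and environment files are placeholders for whatever was used in the original deployment):

  # On each controller whose opendaylight_api container is unhealthy:
  docker restart opendaylight_api

  # Then re-run the deploy from the undercloud to update the stack, with the
  # same templates and environment files as the original deployment:
  openstack overcloud deploy --templates -e <original-environment-files>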

Comment 16 Itzik Brown 2018-05-01 08:14:23 UTC
*** Bug 1573224 has been marked as a duplicate of this bug. ***

Comment 22 Mike Kolesnik 2018-05-21 12:53:08 UTC
This should be available once we rebase to stable/oxygen, moving to POST

Comment 32 Janki 2018-07-17 04:33:13 UTC
I have been doing successful deployments with this rpm for quite some time now and have not encountered this error.

Comment 34 errata-xmlrpc 2018-07-19 13:53:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2215

