Bug 1568976

Summary: [Deployment] One ODL controller is not started correctly returning 404 on every REST call leading to failed OSP+ODL deploy
Product: Red Hat OpenStack
Reporter: Sai Sindhur Malleni <smalleni>
Component: opendaylight
Assignee: Stephen Kitt <skitt>
Status: CLOSED ERRATA
QA Contact: Waldemar Znoinski <wznoinsk>
Severity: high
Priority: high
Version: 13.0 (Queens)
CC: aadam, jchhatba, jluhrsen, lmarsh, mkolesni, nyechiel, sclewis, skitt, trozet, wznoinsk
Target Milestone: z1
Keywords: Triaged, ZStream
Target Release: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Whiteboard: Deployment
Fixed In Version: opendaylight-8.3.0-1.el7ost
Doc Type: Known Issue
Doc Text:
During deployment, one or more OpenDaylight instances may fail to start correctly because of a feature-loading bug. Since only two of the three OpenDaylight instances need to be functional for a deployment to succeed, a deployment can pass even though the third instance started incorrectly. Check the health status of each container with the `docker ps` command; if a container is unhealthy, restart it with `docker restart opendaylight_api`. If the deployment fails, the only option is to restart the deployment. For TLS-based deployments, all OpenDaylight instances must boot correctly or the deployment will fail.
Story Points: ---
Environment: N/A
Last Closed: 2018-07-19 13:53:05 UTC
Type: Bug
Bug Depends On: 1570848
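The recovery steps from the Doc Text above can be sketched as a small shell helper. This is a sketch, not part of the product: the `needs_restart` function name is ours, and the status matching relies on the "(unhealthy)" marker that `docker ps` prints when a container's healthcheck is failing.

```shell
# Decide, from the status text that `docker ps` prints, whether the
# opendaylight_api container should be restarted. The "(unhealthy)"
# marker is what docker shows when a container's healthcheck fails.
needs_restart() {
  case "$1" in
    *unhealthy*) return 0 ;;  # healthcheck failing: restart it
    *)           return 1 ;;  # healthy or no healthcheck: leave it
  esac
}

# Typical use on a controller (requires docker; repeat on each controller):
#   status=$(docker ps --filter name=opendaylight_api --format '{{.Status}}')
#   needs_restart "$status" && docker restart opendaylight_api
```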

Description Sai Sindhur Malleni 2018-04-18 13:14:55 UTC
Description of problem:
When deploying a 3 controller + 6 compute node deployment with ODL in HA collocated with the OSP controllers, we occasionally see the deploy fail with the following error:

curl -k -o /dev/null --fail --silent --head -u admin:admin http://172.16.0.13:8081/restconf/operational/network-topology:network-topology/topology/netvirt:1 returned 22 instead of one of [0]

172.16.0.13 is the VIP for ODL.
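The "returned 22" in the failing check is curl's exit code, not an HTTP status: with `--fail`, curl exits 22 when the server responds with an HTTP error such as the 404 described below. A small hedged sketch for reading these codes (the `interpret_curl_rc` function name is ours):

```shell
# Map a curl exit code from the deploy health check to a diagnosis.
# With --fail, curl exits 22 on an HTTP error response (e.g. 404),
# while exit 7 means the TCP connection itself was refused.
interpret_curl_rc() {
  case "$1" in
    0)  echo "healthy: REST call succeeded" ;;
    22) echo "HTTP error (e.g. 404): ODL is up but not serving REST" ;;
    7)  echo "connection refused: ODL is not listening" ;;
    *)  echo "other curl failure (rc=$1)" ;;
  esac
}
```

An exit code of 22 against the VIP therefore means one of the backends answered, but with an error page, which matches the symptom below: the process is started, yet every REST call returns 404.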

On further investigation by Tim Rozet, we found that there were no networking issues: the curl succeeded from all 3 controllers and from some, but not all, of the computes.

The issue seems to be that ODL features are sometimes loaded in the wrong order, leaving a non-functional ODL (started, but returning HTTP 404) on one of the controllers. This appears to be an initialization race condition in which Jersey needs to finish initializing before ODL starts serving REST. More details in the commit message here: https://git.opendaylight.org/gerrit/#/c/70979/

java.lang.RuntimeException: Error obtaining AAAShiroProvider

Version-Release number of selected component (if applicable):
OSP 13

How reproducible:

Occasionally; mostly during at-scale deploys

Steps to Reproduce:
1. Deploy an OSP + ODL setup with many compute nodes (6 in our case)

Actual results:
Deploy failed

Expected results:
Deploy should succeed every time

Additional info:

Comment 15 Sai Sindhur Malleni 2018-04-30 13:42:14 UTC
Mike,

This is happening pretty consistently in my environment. So the solution, if the deploy fails, is to manually restart the ODL controllers and run a stack update? Is this OK? Shouldn't we have documentation in place that covers this? I believe a failed overcloud stack isn't a great sign.
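The manual workaround being discussed (restart ODL on each controller, then run a stack update) can be sketched as below. The controller host names and the `heat-admin` SSH user are assumptions for a typical OSP 13 director deployment; the function only prints the commands so they can be reviewed before running.

```shell
# Print, without executing, the restart commands for the manual
# workaround: one `docker restart opendaylight_api` per controller.
# After the restarts, re-run the original `openstack overcloud deploy`
# command to perform the stack update.
restart_cmds() {
  for ctrl in "$@"; do
    echo "ssh heat-admin@$ctrl sudo docker restart opendaylight_api"
  done
}

# Example: restart_cmds controller-0 controller-1 controller-2
```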

Comment 16 Itzik Brown 2018-05-01 08:14:23 UTC
*** Bug 1573224 has been marked as a duplicate of this bug. ***

Comment 22 Mike Kolesnik 2018-05-21 12:53:08 UTC
This should be available once we rebase to stable/oxygen, moving to POST

Comment 32 Janki 2018-07-17 04:33:13 UTC
I have been doing successful deployments with this rpm for quite some time now and have not encountered this error.

Comment 34 errata-xmlrpc 2018-07-19 13:53:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2215