Bug 1568976

Summary: [Deployment] One ODL controller is not started correctly returning 404 on every REST call leading to failed OSP+ODL deploy
Product: Red Hat OpenStack
Reporter: Sai Sindhur Malleni <smalleni>
Component: opendaylight
Assignee: Stephen Kitt <skitt>
Status: CLOSED ERRATA
QA Contact: Waldemar Znoinski <wznoinsk>
Severity: high
Priority: high
Version: 13.0 (Queens)
CC: aadam, jchhatba, jluhrsen, lmarsh, mkolesni, nyechiel, sclewis, skitt, trozet, wznoinsk
Target Milestone: z1
Keywords: Triaged, ZStream
Target Release: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Whiteboard: Deployment
Fixed In Version: opendaylight-8.3.0-1.el7ost
Doc Type: Known Issue
Doc Text:
During deployment, one or more OpenDaylight instances may fail to start correctly because of a feature-loading bug. Since only two of the three OpenDaylight instances need to be functional for a deployment to succeed, a deployment can pass even though the third instance started incorrectly. Check the health status of each container with the `docker ps` command; if a container is unhealthy, restart it with `docker restart opendaylight_api`. If the deployment fails, the only option is to restart the deployment. For TLS-based deployments, all OpenDaylight instances must boot correctly or the deployment will fail.
Story Points: ---
Environment: N/A
Last Closed: 2018-07-19 13:53:05 UTC
Type: Bug
Bug Depends On: 1570848
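The recovery steps from the Doc Text above can be sketched as a small shell helper. This is a sketch, not part of the product: the `needs_restart` function name is ours, and the status matching relies on the "(unhealthy)" marker that `docker ps` prints when a container's healthcheck is failing.

```shell
# Decide, from the status text that `docker ps` prints, whether the
# opendaylight_api container should be restarted. The "(unhealthy)"
# marker is what docker shows when a container's healthcheck fails.
needs_restart() {
  case "$1" in
    *unhealthy*) return 0 ;;  # healthcheck failing: restart it
    *)           return 1 ;;  # healthy or no healthcheck: leave it
  esac
}

# Typical use on a controller (requires docker; repeat on each controller):
#   status=$(docker ps --filter name=opendaylight_api --format '{{.Status}}')
#   needs_restart "$status" && docker restart opendaylight_api
```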

Description Sai Sindhur Malleni 2018-04-18 13:14:55 UTC
Description of problem:
When deploying a 3 controller + 6 compute node deployment with ODL in HA collocated with the OSP controllers, we occasionally see the deploy fail with the following error:

curl -k -o /dev/null --fail --silent --head -u admin:admin http://172.16.0.13:8081/restconf/operational/network-topology:network-topology/topology/netvirt:1 returned 22 instead of one of [0]

172.16.0.13 is the VIP for ODL.
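The "returned 22" in the failing check is curl's exit code, not an HTTP status: with `--fail`, curl exits 22 when the server responds with an HTTP error such as the 404 described below. A small hedged sketch for reading these codes (the `interpret_curl_rc` function name is ours):

```shell
# Map a curl exit code from the deploy health check to a diagnosis.
# With --fail, curl exits 22 on an HTTP error response (e.g. 404),
# while exit 7 means the TCP connection itself was refused.
interpret_curl_rc() {
  case "$1" in
    0)  echo "healthy: REST call succeeded" ;;
    22) echo "HTTP error (e.g. 404): ODL is up but not serving REST" ;;
    7)  echo "connection refused: ODL is not listening" ;;
    *)  echo "other curl failure (rc=$1)" ;;
  esac
}
```

An exit code of 22 against the VIP therefore means one of the backends answered, but with an error page, which matches the symptom below: the process is started, yet every REST call returns 404.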

On further investigation by Tim Rozet, we found that there were no networking issues: the curl succeeded from all 3 controllers and from some, but not all, of the computes.

The issue seems to be that ODL features are sometimes loaded in the wrong order, leaving a non-functional ODL (started, but returning HTTP 404) on one of the controllers. This appears to be an initialization race condition in which Jersey needs to finish initializing before ODL starts serving REST. More details in the commit message here: https://git.opendaylight.org/gerrit/#/c/70979/

java.lang.RuntimeException: Error obtaining AAAShiroProvider

Version-Release number of selected component (if applicable):
OSP 13

How reproducible:

Occasionally; mostly during at-scale deploys

Steps to Reproduce:
1. Deploy an OSP + ODL setup with many compute nodes (6 in our case)

Actual results:
Deploy failed

Expected results:
Deploy should succeed every time

Additional info:

Comment 15 Sai Sindhur Malleni 2018-04-30 13:42:14 UTC
Mike,

This is happening pretty consistently in my environment. So the solution, if the deploy fails, is to manually restart the ODL controllers and run a stack update? Is this OK? Shouldn't we have documentation in place that covers this? I believe a failed overcloud stack isn't a great sign.
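The manual workaround being discussed (restart ODL on each controller, then run a stack update) can be sketched as below. The controller host names and the `heat-admin` SSH user are assumptions for a typical OSP 13 director deployment; the function only prints the commands so they can be reviewed before running.

```shell
# Print, without executing, the restart commands for the manual
# workaround: one `docker restart opendaylight_api` per controller.
# After the restarts, re-run the original `openstack overcloud deploy`
# command to perform the stack update.
restart_cmds() {
  for ctrl in "$@"; do
    echo "ssh heat-admin@$ctrl sudo docker restart opendaylight_api"
  done
}

# Example: restart_cmds controller-0 controller-1 controller-2
```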

Comment 16 Itzik Brown 2018-05-01 08:14:23 UTC
*** Bug 1573224 has been marked as a duplicate of this bug. ***

Comment 22 Mike Kolesnik 2018-05-21 12:53:08 UTC
This should be available once we rebase to stable/oxygen, moving to POST

Comment 32 Janki 2018-07-17 04:33:13 UTC
I have been doing successful deployments with this rpm for quite some time now and have not encountered this error.

Comment 34 errata-xmlrpc 2018-07-19 13:53:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2215