Bug 1488907
| Summary: | [HA] When OpenDaylight controller is isolated from the cluster it should be removed from the Neutron Northbound VIP | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Itzik Brown <itbrown> |
| Component: | puppet-tripleo | Assignee: | Tim Rozet <trozet> |
| Status: | CLOSED ERRATA | QA Contact: | Tomas Jamrisko <tjamrisk> |
| Severity: | high | Docs Contact: | |
| Priority: | medium | | |
| Version: | 12.0 (Pike) | CC: | aadam, apevec, itbrown, jjoyce, jluhrsen, joflynn, jschluet, lhh, mburns, mkolesni, nyechiel, ramishra, rhel-osp-director-maint, shague, skitt, slinaber, srevivo, tjamrisk, trozet, tvignaud, vorburger |
| Target Milestone: | z3 | Keywords: | TestOnly, Triaged, ZStream |
| Target Release: | 13.0 (Queens) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | HA | | |
| Fixed In Version: | puppet-tripleo-8.3.4-3.el7ost | Doc Type: | Bug Fix |
| Doc Text: | When a single OpenDaylight instance was removed from a cluster, the instance moved into an isolated state and no longer acted on incoming requests. HA Proxy still load-balanced requests to the isolated OpenDaylight instance, which could result in OpenStack network commands failing or not working correctly. HA Proxy now detects the isolated OpenDaylight instance as unhealthy and does not forward requests to it. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-11-13 22:26:39 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1566080 | | |
| Bug Blocks: | | | |
Description
Itzik Brown
2017-09-06 12:37:46 UTC
Can you elaborate more on what you mean by "isolated from other controllers"? What exactly are you doing here? Can you please add steps to reproduce.

It means that the other controllers can't reach it, for example because port 2550 on the controller is blocked. When the controller can't reach the other controllers, it moves to a 'read-only' state for 10 minutes and then kills itself. During these 10 minutes HAProxy still forwards requests from Neutron to the controller, and these requests will fail in such a scenario. We need a way to check that the controller is in the read-only state and stop forwarding requests to it.

So currently with HAProxy we detect whether the TCP port is open on the controller to determine if it is up. We could add a health-check REST call instead. However, I'm not sure how we determine if an ODL instance is in 'read-only' mode. If we have a 3-node cluster and one ODL's connection on 2550 is severed, so that we now have 1 ODL and 2 ODLs, how does the single ODL decide it should be read-only? What if that single ODL was the leader when the connections were severed?

Jamo, Sam, can you provide these answers?

(In reply to Tim Rozet from comment #3)
> So currently with HAProxy we detect whether the TCP port is open on the controller to determine if it is up. We could add a health-check REST call instead. However, I'm not sure how we determine if an ODL instance is in 'read-only' mode. If we have a 3-node cluster and one ODL's connection on 2550 is severed, so that we now have 1 ODL and 2 ODLs, how does the single ODL decide it should be read-only? What if that single ODL was the leader when the connections were severed?
>
> Jamo, Sam, can you provide these answers?

The isolated ODL will notice that it can no longer talk to the other two cluster members, which means it's in a minority. If it was leader in any shard, it will switch to isolated leader and wait until either it can rejoin the cluster or the timeout expires, at which point it will kill itself. If it was a follower, it will call elections and end up being isolated leader too (IIRC). It's not so much that the isolated nodes decide to become read-only; the reason the datastore becomes read-only on minority nodes is that transactions need a majority vote before they're committed.

I think the datastore status REST endpoint indicates that it's not operational, I'll check what that looks like...

(In reply to Stephen Kitt from comment #4)
> I think the datastore status REST endpoint indicates that it's not operational, I'll check what that looks like...

It's a two-step process:

* determine the controller and shard names using http://10.10.10.115:8181/jolokia/read/org.opendaylight.controller:Category=ShardManager,name=shard-manager-operational,type=DistributedOperationalDatastore (replacing 10.10.10.115 and 8181 as appropriate)
* retrieve the shard status using http://10.10.10.115:8181/jolokia/read/org.opendaylight.controller:Category=Shards,name=member-1-shard-default-operational,type=DistributedOperationalDatastore (replacing 10.10.10.115, 8181, member-1, and default as appropriate)

In the second JSON document, the items to look at are:

* RaftStatus (should be Leader or Follower, or sometimes Candidate)
* FailedTransactionsCount (should be 0)

(A scripted sketch of this two-step query appears below.)

*** Bug 1535481 has been marked as a duplicate of this bug. ***

@Tim, do we have a plan to fix this for 13?
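For reference, the two-step query described above could be scripted roughly as follows. This is only a sketch: the address 10.10.10.115:8181, the member name member-1, and the default shard are the example values from the comments, and the RaftStatus/FailedTransactionsCount checks are simple text matches that should be confirmed against the JSON a live node actually returns.

    #!/bin/bash
    # Sketch: check one ODL node's operational datastore shard status via Jolokia.
    # Host, port and member name default to the example values used in this bug
    # and must be adapted to the node being queried.
    set -euo pipefail

    HOST="${1:-10.10.10.115}"
    PORT="${2:-8181}"
    MEMBER="${3:-member-1}"
    BASE="http://${HOST}:${PORT}/jolokia/read/org.opendaylight.controller"

    # Step 1: dump the shard manager MBean to discover the controller and shard names.
    curl -sf "${BASE}:Category=ShardManager,name=shard-manager-operational,type=DistributedOperationalDatastore"
    echo

    # Step 2: read this member's default operational shard and look at the two
    # items called out above: RaftStatus and FailedTransactionsCount.
    STATUS=$(curl -sf "${BASE}:Category=Shards,name=${MEMBER}-shard-default-operational,type=DistributedOperationalDatastore")
    echo "${STATUS}"

    if echo "${STATUS}" | grep -Eq 'RaftStatus.{0,3}(Leader|Follower|Candidate)' &&
       echo "${STATUS}" | grep -Eq 'FailedTransactionsCount[":= ]*0[,}" ]'; then
      echo "shard looks healthy"
    else
      echo "shard does not look healthy (possibly isolated / no majority)" >&2
      exit 1
    fi

Note that, as the next comment points out, this particular query names a fixed member, so it is not suitable as-is for a per-backend HAProxy health check.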
We could do something like:

    option httpchk GET /jolokia/read/org.opendaylight.controller:Category=Shards,name=member-1-shard-default-operational,type=DistributedOperationalDatastore
    http-check expect rstring RaftStatus=(Leader|Follower|Candidate)

However, the query URL needs to be unique to the queried node (querying the status of that node). This one is not, since it filters on a fixed member name, and the status returned would be the same no matter which node is queried. I'll look to see if there is something else we can query to get the right result. Otherwise we will need to use external-check and write a bash/python script to do the member-to-node mapping logic (a sketch of such a check appears further below). Will work on it.

Tim, any progress on this?

Tim wrote: "If we can add this behavior so that we get an HTTP error code when it is isolated leader, then that would help solve this issue." So this now depends on Bug 1566080 and the (still TBD) GENIUS-138.

This bug is marked for inclusion in the errata but does not currently contain draft documentation text. To ensure the timely release of this advisory, please provide draft documentation text for this bug as soon as possible. If you do not think this bug requires errata documentation, set the requires_doc_text flag to "-".

To add draft documentation text:

* Select the documentation type from the "Doc Type" drop-down field.
* A template will be provided in the "Doc Text" field based on the "Doc Type" value selected. Enter draft text in the "Doc Text" field.
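The external-check idea mentioned above would look roughly like the following. This is only a sketch, not the script that was eventually written: it assumes HAProxy's external-check convention of passing the proxy address/port and server address/port as arguments, the member-to-address mapping is a hypothetical placeholder (only 172.17.1.14 appears in this bug; the other addresses are invented for illustration), and the Jolokia port and RaftStatus match are the same assumptions as in the previous sketch.

    #!/bin/bash
    # Hypothetical HAProxy external-check script: with "option external-check" and
    # "external-check command /path/to/this/script", HAProxy invokes it as
    #   <script> <proxy_address> <proxy_port> <server_address> <server_port>
    # so the backend node being probed arrives in $3/$4.
    set -euo pipefail

    SERVER_ADDR="$3"
    SERVER_PORT="$4"    # e.g. 8181, the Jolokia/REST port used in the examples above

    # Placeholder mapping from backend address to ODL cluster member name; this is
    # the member-to-node mapping logic mentioned above and would be generated from
    # the actual cluster configuration in a real deployment.
    declare -A MEMBER_OF=(
      ["172.17.1.14"]="member-1"
      ["172.17.1.15"]="member-2"
      ["172.17.1.16"]="member-3"
    )
    MEMBER="${MEMBER_OF[$SERVER_ADDR]:-member-1}"

    URL="http://${SERVER_ADDR}:${SERVER_PORT}/jolokia/read/org.opendaylight.controller:Category=Shards,name=${MEMBER}-shard-default-operational,type=DistributedOperationalDatastore"

    # Healthy (exit 0) only if this node's own shard reports a usable Raft state;
    # any other state, or an unreachable endpoint, marks the backend as down.
    if curl -sf --max-time 2 "$URL" | grep -Eq 'RaftStatus.{0,3}(Leader|Follower|Candidate)'; then
      exit 0
    fi
    exit 1

As the following comments show, the discussion moved toward giving ODL's /diagstatus endpoint a meaningful HTTP status code (tracked in Bug 1566080), which would avoid the member-name mapping entirely.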
I'm slightly confused about what exactly we're suggesting to fix and how, but in any case, based on https://jira.opendaylight.org/browse/INFRAUTILS-33, which specified 418:

1. If you don't think the 418 is all that funny, I'm happy to change it to 503 instead. If this is what you want, would you mind raising a new Improvement (not bug) on jira.opendaylight.org dedicated to (only) this, with a summary such as "/diagstatus/ should return HTTP status code 503 instead of 418", and I'll implement that ASAP.

2. If you are saying that HEAD vs GET return inconsistent HTTP status codes, that surprises me... I originally probably didn't really test HEAD, only GET. However, normally the web server used in ODL (Jetty, from Karaf) should handle this automatically. It looks like you are saying, and showing with the example above, that it doesn't? But beware that it is possible for /diagstatus/ to change over time: if the HEAD and GET requests on http://172.17.1.14:8081/diagstatus shown above were taken some interval apart, then it could be perfectly normal that they return different codes. But if you did this at virtually the same instant, on the same URL, then I can dig into this. If so, would you mind raising a new Bug on jira.opendaylight.org dedicated to (only) this, with a summary such as "/diagstatus/ returns inconsistent HTTP status codes from GET vs HEAD", so that I can investigate this more? (Perhaps there is some subtlety in the Servlet API re. response.setStatus() for HEAD vs GET.)

3. In what you pasted above, something very curious jumps out: did you notice how your "curl -i -X GET http://172.17.1.14:8081/diagstatus" returns a 200 with "systemReadyState": "FAILURE"? That's... technically impossible. Unfortunately the rest (statusSummary) was cut off above, so it's really unclear. If this isn't a copy/paste mistake but was really reported like that, I would love to see the full "statusSummary" and figure out how that can be. Maybe it's a concurrency issue where it's JUST changing status and needs better thread safety around where it determines what goes in the response body vs the returned status code... but I'm a bit surprised this could hit us (here). Another jira.opendaylight.org issue dedicated to (only) this, with a summary such as "/diagstatus/ sometimes returns 200 despite systemReadyState FAILURE" and additional details, would be useful for me to look into that.

(In reply to Michael Vorburger from comment #33)
> I'm slightly confused about what exactly we're suggesting to fix and how, but in any case, based on https://jira.opendaylight.org/browse/INFRAUTILS-33, which specified 418:
>
> 1. If you don't think the 418 is all that funny, I'm happy to change it to 503 instead. If this is what you want, would you mind raising a new Improvement (not bug) on jira.opendaylight.org dedicated to (only) this, with a summary such as "/diagstatus/ should return HTTP status code 503 instead of 418", and I'll implement that ASAP.

I'll open some bugs, and I'm going to track this now in bug 1566080 as that is the ODL component.

> 2. If you are saying that HEAD vs GET return inconsistent HTTP status codes, that surprises me... I originally probably didn't really test HEAD, only GET. However, normally the web server used in ODL (Jetty, from Karaf) should handle this automatically. It looks like you are saying, and showing with the example above, that it doesn't? But beware that it is possible for /diagstatus/ to change over time: if the HEAD and GET requests on http://172.17.1.14:8081/diagstatus shown above were taken some interval apart, then it could be perfectly normal that they return different codes. But if you did this at virtually the same instant, on the same URL, then I can dig into this. If so, would you mind raising a new Bug on jira.opendaylight.org dedicated to (only) this, with a summary such as "/diagstatus/ returns inconsistent HTTP status codes from GET vs HEAD", so that I can investigate this more? (Perhaps there is some subtlety in the Servlet API re. response.setStatus() for HEAD vs GET.)

It's not a timing problem. If I send them in the same command, the results differ. In fact, the HEAD (418) hangs until it reaches the max timeout:

    [root@controller-0 ~]# curl -i -X GET http://172.17.1.14:8081/diagstatus && curl -i -X HEAD http://172.17.1.14:8081/diagstatus --max-time 5
    HTTP/1.1 200 OK
    Content-Type: application/json;charset=utf-8
    Content-Length: 132

    { "timeStamp": "Tue Aug 21 21:20:07 UTC 2018", "isOperational": false, "systemReadyState": "FAILURE", "statusSummary": [] }
    HTTP/1.1 418 418
    Content-Type: application/json;charset=utf-8
    Content-Length: 132

    curl: (28) Operation timed out after 5001 milliseconds with 0 out of 132 bytes received

> 3. In what you pasted above, something very curious jumps out: did you notice how your "curl -i -X GET http://172.17.1.14:8081/diagstatus" returns a 200 with "systemReadyState": "FAILURE"? That's... technically impossible. Unfortunately the rest (statusSummary) was cut off above, so it's really unclear. If this isn't a copy/paste mistake but was really reported like that, I would love to see the full "statusSummary" and figure out how that can be. Maybe it's a concurrency issue where it's JUST changing status and needs better thread safety around where it determines what goes in the response body vs the returned status code... but I'm a bit surprised this could hit us (here). Another jira.opendaylight.org issue dedicated to (only) this, with a summary such as "/diagstatus/ sometimes returns 200 despite systemReadyState FAILURE" and additional details, would be useful for me to look into that.

As can be seen from the output I provided in #2, the output isn't cut off there.
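Given the behaviour shown above (HEAD hangs, and a 200 can still carry "systemReadyState": "FAILURE"), a GET-based probe that checks both the HTTP status code and the isOperational field is one way to test a node from the load-balancer side. A rough sketch, using the example address 172.17.1.14:8081 from the output above:

    #!/bin/bash
    # Rough GET-based probe for /diagstatus: avoid HEAD (which hangs above) and
    # accept a node only when the HTTP code is 200 AND the body reports
    # "isOperational": true. Host/port default to the example values above.
    set -euo pipefail

    HOST="${1:-172.17.1.14}"
    PORT="${2:-8081}"

    BODY_FILE=$(mktemp)
    trap 'rm -f "$BODY_FILE"' EXIT

    CODE=$(curl -s -o "$BODY_FILE" -w '%{http_code}' --max-time 5 \
      "http://${HOST}:${PORT}/diagstatus" || true)

    if [ "$CODE" = "200" ] && grep -q '"isOperational": *true' "$BODY_FILE"; then
      echo "node ${HOST}: diagstatus OK"
      exit 0
    fi

    echo "node ${HOST}: diagstatus unhealthy (HTTP ${CODE:-none})" >&2
    cat "$BODY_FILE" >&2
    exit 1

Once /diagstatus returns a proper error code for an isolated node (the change tracked in Bug 1566080), HAProxy's built-in httpchk against /diagstatus should be able to do the same job without an external script.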
Fixes have just been proposed upstream, and should trickle downstream eventually.

According to our records, this should be resolved by puppet-tripleo-8.3.4-5.el7ost. This build is available now.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3587
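As a closing illustration (not from the original report), one way to verify the fixed behaviour end to end is to compare each controller's own /diagstatus with what is served through the Neutron Northbound VIP. All addresses below except 172.17.1.14 are placeholders, and this assumes the VIP forwards the same port and that the new health check marks an isolated backend as down:

    #!/bin/bash
    # Placeholder verification: per-node /diagstatus should expose the isolated
    # controller, while requests through the VIP should only ever reach healthy
    # nodes once HAProxy's health check is in place.
    set -u

    CONTROLLERS=("172.17.1.14" "172.17.1.15" "172.17.1.16")   # placeholder controller IPs
    VIP="172.17.1.10"                                          # placeholder Neutron Northbound VIP
    PORT=8081

    # Per-node view: an isolated node is expected to return a non-200 code.
    for node in "${CONTROLLERS[@]}"; do
      code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "http://${node}:${PORT}/diagstatus" || true)
      echo "node ${node}: HTTP ${code}"
    done

    # VIP view: repeated requests should consistently return 200, i.e. HAProxy is
    # no longer forwarding to the isolated instance.
    for i in $(seq 1 10); do
      code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "http://${VIP}:${PORT}/diagstatus" || true)
      echo "vip attempt ${i}: HTTP ${code}"
    done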