Bug 1488907
| Summary: | [HA] When OpenDaylight controller is isolated from the cluster it should be removed from the Neutron Northbound VIP | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Itzik Brown <itbrown> |
| Component: | puppet-tripleo | Assignee: | Tim Rozet <trozet> |
| Status: | CLOSED ERRATA | QA Contact: | Tomas Jamrisko <tjamrisk> |
| Severity: | high | Docs Contact: | |
| Priority: | medium | | |
| Version: | 12.0 (Pike) | CC: | aadam, apevec, itbrown, jjoyce, jluhrsen, joflynn, jschluet, lhh, mburns, mkolesni, nyechiel, ramishra, rhel-osp-director-maint, shague, skitt, slinaber, srevivo, tjamrisk, trozet, tvignaud, vorburger |
| Target Milestone: | z3 | Keywords: | TestOnly, Triaged, ZStream |
| Target Release: | 13.0 (Queens) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | HA | | |
| Fixed In Version: | puppet-tripleo-8.3.4-3.el7ost | Doc Type: | Bug Fix |
| Doc Text: | When a single OpenDaylight instance was removed from a cluster, the instance moved into an isolated state and no longer acted on incoming requests. HA Proxy still load-balanced requests to the isolated OpenDaylight instance, which could result in OpenStack network commands failing or not working correctly. HA Proxy now detects the isolated OpenDaylight instance as unhealthy and does not forward requests to it. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-11-13 22:26:39 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1566080 | | |
| Bug Blocks: | | | |
Description
Itzik Brown
2017-09-06 12:37:46 UTC
Can you elaborate more on what you mean by "isolated from other controllers"? What exactly are you doing here? Can you please add steps to reproduce.

It means that the other controllers can't reach it, for example because port 2550 on the controller is blocked. When the controller can't reach the other controllers, it moves to a 'read-only' state for 10 minutes and then kills itself. During these 10 minutes HAProxy still forwards requests from Neutron to the controller, and these requests will fail in such a scenario. We need a way to check that the controller is in the read-only state and stop forwarding requests to it.

So currently with HAProxy we detect whether the TCP port is open on the controller to determine if it is up. We could add a health-check REST call instead. However, I'm not sure how we determine if an ODL instance is in 'read-only' mode. If we have a 3-node cluster and one ODL's connection on 2550 is severed, so that we now have 1 ODL and 2 ODLs, how does the single ODL decide it should be read-only? What if that single ODL was the leader when the connections were severed?

Jamo, Sam, can you provide these answers?

(In reply to Tim Rozet from comment #3)
> So currently with HAProxy we detect whether the TCP port is open on the controller to determine if it is up. We could add a health-check REST call instead. However, I'm not sure how we determine if an ODL instance is in 'read-only' mode. If we have a 3-node cluster and one ODL's connection on 2550 is severed, so that we now have 1 ODL and 2 ODLs, how does the single ODL decide it should be read-only? What if that single ODL was the leader when the connections were severed?
>
> Jamo, Sam, can you provide these answers?

The isolated ODL will notice that it can no longer talk to the other two cluster members, which means it's in a minority. If it was leader in any shard, it will switch to isolated leader and wait until either it can rejoin the cluster or the timeout expires, at which point it will kill itself. If it was a follower, it will call elections and end up being isolated leader too (IIRC). It's not so much that the isolated nodes decide to become read-only; the reason the datastore becomes read-only on minority nodes is that transactions need a majority vote before they're committed.

I think the datastore status REST endpoint indicates that it's not operational, I'll check what that looks like...

(In reply to Stephen Kitt from comment #4)
> I think the datastore status REST endpoint indicates that it's not operational, I'll check what that looks like...

It's a two-step process:

* determine the controller and shard names using http://10.10.10.115:8181/jolokia/read/org.opendaylight.controller:Category=ShardManager,name=shard-manager-operational,type=DistributedOperationalDatastore (replacing 10.10.10.115 and 8181 as appropriate)
* retrieve the shard status using http://10.10.10.115:8181/jolokia/read/org.opendaylight.controller:Category=Shards,name=member-1-shard-default-operational,type=DistributedOperationalDatastore (replacing 10.10.10.115, 8181, member-1, and default as appropriate)

In the second JSON document, the items to look at are:

* RaftStatus (should be Leader or Follower, or sometimes Candidate)
* FailedTransactionsCount (should be 0)

(A scripted sketch of this two-step query appears below.)

*** Bug 1535481 has been marked as a duplicate of this bug. ***

@Tim, do we have a plan to fix this for 13?
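For reference, the two-step query described above could be scripted roughly as follows. This is only a sketch: the address 10.10.10.115:8181, the member name member-1, and the default shard are the example values from the comments, and the RaftStatus/FailedTransactionsCount checks are simple text matches that should be confirmed against the JSON a live node actually returns.

    #!/bin/bash
    # Sketch: check one ODL node's operational datastore shard status via Jolokia.
    # Host, port and member name default to the example values used in this bug
    # and must be adapted to the node being queried.
    set -euo pipefail

    HOST="${1:-10.10.10.115}"
    PORT="${2:-8181}"
    MEMBER="${3:-member-1}"
    BASE="http://${HOST}:${PORT}/jolokia/read/org.opendaylight.controller"

    # Step 1: dump the shard manager MBean to discover the controller and shard names.
    curl -sf "${BASE}:Category=ShardManager,name=shard-manager-operational,type=DistributedOperationalDatastore"
    echo

    # Step 2: read this member's default operational shard and look at the two
    # items called out above: RaftStatus and FailedTransactionsCount.
    STATUS=$(curl -sf "${BASE}:Category=Shards,name=${MEMBER}-shard-default-operational,type=DistributedOperationalDatastore")
    echo "${STATUS}"

    if echo "${STATUS}" | grep -Eq 'RaftStatus.{0,3}(Leader|Follower|Candidate)' &&
       echo "${STATUS}" | grep -Eq 'FailedTransactionsCount[":= ]*0[,}" ]'; then
      echo "shard looks healthy"
    else
      echo "shard does not look healthy (possibly isolated / no majority)" >&2
      exit 1
    fi

Note that, as the next comment points out, this particular query names a fixed member, so it is not suitable as-is for a per-backend HAProxy health check.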
We could do something like:

    option httpchk GET /jolokia/read/org.opendaylight.controller:Category=Shards,name=member-1-shard-default-operational,type=DistributedOperationalDatastore
    http-check expect rstring RaftStatus=(Leader|Follower|Candidate)

However, the query URL needs to be unique to the queried node (querying the status of that node). This one is not, since it filters on a fixed member name, and the status returned would be the same no matter which node is queried. I'll look to see if there is something else we can query to get the right result. Otherwise we will need to use external-check and write a bash/python script to do the member-to-node mapping logic (a sketch of such a check appears further below). Will work on it.

Tim, any progress on this?

Tim wrote: "If we can add this behavior so that we get an HTTP error code when it is isolated leader, then that would help solve this issue." So this now depends on Bug 1566080 and the (still TBD) GENIUS-138.

This bug is marked for inclusion in the errata but does not currently contain draft documentation text. To ensure the timely release of this advisory, please provide draft documentation text for this bug as soon as possible. If you do not think this bug requires errata documentation, set the requires_doc_text flag to "-".

To add draft documentation text:

* Select the documentation type from the "Doc Type" drop-down field.
* A template will be provided in the "Doc Text" field based on the "Doc Type" value selected. Enter draft text in the "Doc Text" field.
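The external-check idea mentioned above would look roughly like the following. This is only a sketch, not the script that was eventually written: it assumes HAProxy's external-check convention of passing the proxy address/port and server address/port as arguments, the member-to-address mapping is a hypothetical placeholder (only 172.17.1.14 appears in this bug; the other addresses are invented for illustration), and the Jolokia port and RaftStatus match are the same assumptions as in the previous sketch.

    #!/bin/bash
    # Hypothetical HAProxy external-check script: with "option external-check" and
    # "external-check command /path/to/this/script", HAProxy invokes it as
    #   <script> <proxy_address> <proxy_port> <server_address> <server_port>
    # so the backend node being probed arrives in $3/$4.
    set -euo pipefail

    SERVER_ADDR="$3"
    SERVER_PORT="$4"    # e.g. 8181, the Jolokia/REST port used in the examples above

    # Placeholder mapping from backend address to ODL cluster member name; this is
    # the member-to-node mapping logic mentioned above and would be generated from
    # the actual cluster configuration in a real deployment.
    declare -A MEMBER_OF=(
      ["172.17.1.14"]="member-1"
      ["172.17.1.15"]="member-2"
      ["172.17.1.16"]="member-3"
    )
    MEMBER="${MEMBER_OF[$SERVER_ADDR]:-member-1}"

    URL="http://${SERVER_ADDR}:${SERVER_PORT}/jolokia/read/org.opendaylight.controller:Category=Shards,name=${MEMBER}-shard-default-operational,type=DistributedOperationalDatastore"

    # Healthy (exit 0) only if this node's own shard reports a usable Raft state;
    # any other state, or an unreachable endpoint, marks the backend as down.
    if curl -sf --max-time 2 "$URL" | grep -Eq 'RaftStatus.{0,3}(Leader|Follower|Candidate)'; then
      exit 0
    fi
    exit 1

As the following comments show, the discussion moved toward giving ODL's /diagstatus endpoint a meaningful HTTP status code (tracked in Bug 1566080), which would avoid the member-name mapping entirely.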
I'm slightly confused about what exactly we're suggesting to fix and how, but in any case, based on https://jira.opendaylight.org/browse/INFRAUTILS-33, which specified 418:

1. If you don't think the 418 is all that funny, I'm happy to change it to 503 instead. If this is what you want, would you mind raising a new Improvement (not bug) on jira.opendaylight.org dedicated to (only) this, with a summary such as "/diagstatus/ should return HTTP status code 503 instead of 418", and I'll implement that ASAP.

2. If you are saying that HEAD vs GET return inconsistent HTTP status codes, that surprises me... I originally probably didn't really test HEAD, only GET. However, normally the web server used in ODL (Jetty, from Karaf) should handle this automatically. It looks like you are saying, and showing with the example above, that it doesn't? But beware that it is possible for /diagstatus/ to change over time: if the HEAD and GET requests on http://172.17.1.14:8081/diagstatus shown above were taken some interval apart, then it could be perfectly normal that they return different codes. But if you did this at virtually the same instant, on the same URL, then I can dig into this. If so, would you mind raising a new Bug on jira.opendaylight.org dedicated to (only) this, with a summary such as "/diagstatus/ returns inconsistent HTTP status codes from GET vs HEAD", so that I can investigate this more? (Perhaps there is some subtlety in the Servlet API re. response.setStatus() for HEAD vs GET.)

3. In what you pasted above, something very curious jumps out: did you notice how your "curl -i -X GET http://172.17.1.14:8081/diagstatus" returns a 200 with "systemReadyState": "FAILURE"? That's... technically impossible. Unfortunately the rest (statusSummary) was cut off above, so it's really unclear. If this isn't a copy/paste mistake but was really reported like that, I would love to see the full "statusSummary" and figure out how that can be. Maybe it's a concurrency issue where it's JUST changing status and needs better thread safety around where it determines what goes in the response body vs the returned status code... but I'm a bit surprised this could hit us (here). Another jira.opendaylight.org issue dedicated to (only) this, with a summary such as "/diagstatus/ sometimes returns 200 despite systemReadyState FAILURE" and additional details, would be useful for me to look into that.

(In reply to Michael Vorburger from comment #33)
> I'm slightly confused about what exactly we're suggesting to fix and how, but in any case, based on https://jira.opendaylight.org/browse/INFRAUTILS-33, which specified 418:
>
> 1. If you don't think the 418 is all that funny, I'm happy to change it to 503 instead. If this is what you want, would you mind raising a new Improvement (not bug) on jira.opendaylight.org dedicated to (only) this, with a summary such as "/diagstatus/ should return HTTP status code 503 instead of 418", and I'll implement that ASAP.

I'll open some bugs, and I'm going to track this now in bug 1566080 as that is the ODL component.

> 2. If you are saying that HEAD vs GET return inconsistent HTTP status codes, that surprises me... I originally probably didn't really test HEAD, only GET. However, normally the web server used in ODL (Jetty, from Karaf) should handle this automatically. It looks like you are saying, and showing with the example above, that it doesn't? But beware that it is possible for /diagstatus/ to change over time: if the HEAD and GET requests on http://172.17.1.14:8081/diagstatus shown above were taken some interval apart, then it could be perfectly normal that they return different codes. But if you did this at virtually the same instant, on the same URL, then I can dig into this. If so, would you mind raising a new Bug on jira.opendaylight.org dedicated to (only) this, with a summary such as "/diagstatus/ returns inconsistent HTTP status codes from GET vs HEAD", so that I can investigate this more? (Perhaps there is some subtlety in the Servlet API re. response.setStatus() for HEAD vs GET.)

It's not a timing problem. If I send them in the same command, the results differ. In fact, the HEAD (418) hangs until it reaches the max timeout:

    [root@controller-0 ~]# curl -i -X GET http://172.17.1.14:8081/diagstatus && curl -i -X HEAD http://172.17.1.14:8081/diagstatus --max-time 5
    HTTP/1.1 200 OK
    Content-Type: application/json;charset=utf-8
    Content-Length: 132

    { "timeStamp": "Tue Aug 21 21:20:07 UTC 2018", "isOperational": false, "systemReadyState": "FAILURE", "statusSummary": [] }
    HTTP/1.1 418 418
    Content-Type: application/json;charset=utf-8
    Content-Length: 132

    curl: (28) Operation timed out after 5001 milliseconds with 0 out of 132 bytes received

> 3. In what you pasted above, something very curious jumps out: did you notice how your "curl -i -X GET http://172.17.1.14:8081/diagstatus" returns a 200 with "systemReadyState": "FAILURE"? That's... technically impossible. Unfortunately the rest (statusSummary) was cut off above, so it's really unclear. If this isn't a copy/paste mistake but was really reported like that, I would love to see the full "statusSummary" and figure out how that can be. Maybe it's a concurrency issue where it's JUST changing status and needs better thread safety around where it determines what goes in the response body vs the returned status code... but I'm a bit surprised this could hit us (here). Another jira.opendaylight.org issue dedicated to (only) this, with a summary such as "/diagstatus/ sometimes returns 200 despite systemReadyState FAILURE" and additional details, would be useful for me to look into that.

As can be seen from the output I provided in #2, the output isn't cut off there.
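Given the behaviour shown above (HEAD hangs, and a 200 can still carry "systemReadyState": "FAILURE"), a GET-based probe that checks both the HTTP status code and the isOperational field is one way to test a node from the load-balancer side. A rough sketch, using the example address 172.17.1.14:8081 from the output above:

    #!/bin/bash
    # Rough GET-based probe for /diagstatus: avoid HEAD (which hangs above) and
    # accept a node only when the HTTP code is 200 AND the body reports
    # "isOperational": true. Host/port default to the example values above.
    set -euo pipefail

    HOST="${1:-172.17.1.14}"
    PORT="${2:-8081}"

    BODY_FILE=$(mktemp)
    trap 'rm -f "$BODY_FILE"' EXIT

    CODE=$(curl -s -o "$BODY_FILE" -w '%{http_code}' --max-time 5 \
      "http://${HOST}:${PORT}/diagstatus" || true)

    if [ "$CODE" = "200" ] && grep -q '"isOperational": *true' "$BODY_FILE"; then
      echo "node ${HOST}: diagstatus OK"
      exit 0
    fi

    echo "node ${HOST}: diagstatus unhealthy (HTTP ${CODE:-none})" >&2
    cat "$BODY_FILE" >&2
    exit 1

Once /diagstatus returns a proper error code for an isolated node (the change tracked in Bug 1566080), HAProxy's built-in httpchk against /diagstatus should be able to do the same job without an external script.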
Fixes have just been proposed upstream, and should trickle downstream eventually.

According to our records, this should be resolved by puppet-tripleo-8.3.4-5.el7ost. This build is available now.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3587
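As a closing illustration (not from the original report), one way to verify the fixed behaviour end to end is to compare each controller's own /diagstatus with what is served through the Neutron Northbound VIP. All addresses below except 172.17.1.14 are placeholders, and this assumes the VIP forwards the same port and that the new health check marks an isolated backend as down:

    #!/bin/bash
    # Placeholder verification: per-node /diagstatus should expose the isolated
    # controller, while requests through the VIP should only ever reach healthy
    # nodes once HAProxy's health check is in place.
    set -u

    CONTROLLERS=("172.17.1.14" "172.17.1.15" "172.17.1.16")   # placeholder controller IPs
    VIP="172.17.1.10"                                          # placeholder Neutron Northbound VIP
    PORT=8081

    # Per-node view: an isolated node is expected to return a non-200 code.
    for node in "${CONTROLLERS[@]}"; do
      code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "http://${node}:${PORT}/diagstatus" || true)
      echo "node ${node}: HTTP ${code}"
    done

    # VIP view: repeated requests should consistently return 200, i.e. HAProxy is
    # no longer forwarding to the isolated instance.
    for i in $(seq 1 10); do
      code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "http://${VIP}:${PORT}/diagstatus" || true)
      echo "vip attempt ${i}: HTTP ${code}"
    done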