Bug 1574708 - [HA] ODL cluster member fails to fully sync back to the cluster after isolation test
Status: ASSIGNED
Product: Red Hat OpenStack
Classification: Red Hat
Component: opendaylight
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z4
Target Release: 13.0 (Queens)
Assigned To: Michael Vorburger
QA Contact: Toni Freger
Whiteboard: HA
Keywords: Triaged, ZStream
Depends On:
Blocks:
Reported: 2018-05-03 17:18 EDT by jamo luhrsen
Modified: 2018-09-06 15:17 EDT
CC List: 5 users

See Also:
Fixed In Version:
Doc Type: Known Issue
Doc Text:
When an OpenDaylight instance is removed from a cluster and reconnected, the instance may not successfully join the cluster; the node will eventually re-join the cluster. The following actions should be taken in such a situation:
* Restart the faulty node.
* Monitor the REST endpoint to verify that the cluster member is healthy: http://$ODL_IP:8081/jolokia/read/org.opendaylight.controller:Category=ShardManager,name=shard-manager-config,type=DistributedConfigDatastore
* The response should contain a "SyncStatus" field; a value of "true" indicates a healthy cluster member.
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
three opendaylight logs (862.29 KB, application/x-xz)
2018-05-03 17:30 EDT, jamo luhrsen
controller logs (1.76 MB, application/x-gzip)
2018-06-25 16:56 EDT, Victor Pickard
Test script to reproduce issue (1.07 KB, text/plain)
2018-06-25 16:57 EDT, Victor Pickard


External Trackers
Tracker ID Priority Status Summary Last Updated
OpenDaylight Bug CONTROLLER-1849 None None None 2018-08-02 06:47 EDT
OpenDaylight gerrit 72658 None None None 2018-06-04 12:02 EDT

Description jamo luhrsen 2018-05-03 17:18:54 EDT
Description of problem:

An ODL node is isolated from a 3-node cluster (via iptables rules) and
then communication is restored. However, the cluster sync status never
becomes "True" even after polling for 60s.


Version-Release number of selected component (if applicable):
opendaylight-8.0.0-9.el7ost.noarch.rpm

How reproducible:
sporadically

Steps to Reproduce:
1. run the u/s CSIT suites in a job like this:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit

2. cross fingers that we hit it, and not some other bug first.

3.

Actual results:

polling quits after 60s and the test is marked as FAIL

Expected results:

I am really not sure, but I want to say the internal timeout mechanisms
are in the 5s range, so I would expect that cluster sync could be
complete in a few iterations of that, at worst. It's possible that
a full sync with lots of data could take longer, but these are not
heavy tests and the data needed to sync is probably not very big.


Additional info:

This is/was seen in u/s CSIT jobs as well, and I think other (non-netvirt)
projects might even have 5m timeouts before failing the tests. If that's
still the case, then probably something is just totally stuck and broken,
and this is not a case where we can just wait a little longer.
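
For reference, the isolation in this scenario is just a matter of iptables rules blocking cluster traffic to one member, which are later removed to restore communication. Below is a minimal sketch of that kind of isolation; the port (2550, the default akka clustering port) is an assumption and would need to match the actual deployment — this is not the test suite's own code.

  # Isolate this ODL node from its cluster peers (assumes the default
  # akka clustering port 2550; adjust to match the deployment).
  iptables -A INPUT  -p tcp --dport 2550 -j DROP
  iptables -A OUTPUT -p tcp --dport 2550 -j DROP

  # ...leave the node isolated for a while, then restore communication
  # by deleting the same rules.
  iptables -D INPUT  -p tcp --dport 2550 -j DROP
  iptables -D OUTPUT -p tcp --dport 2550 -j DROP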
Comment 1 jamo luhrsen 2018-05-03 17:28:46 EDT
exact test case that fails:

  https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit

output of the cluster status REST query:

{
  "request": {
    "mbean": "org.opendaylight.controller:Category=ShardManager,name=shard-manager-operational,type=DistributedOperationalDatastore",
    "type": "read"
  },
  "value": {
    "LocalShards": [
      "member-0-shard-default-operational"
    ],
    "SyncStatus": false,
    "MemberName": "member-0"
  },
  "timestamp": 1525213624,
  "status": 200
}

SyncStatus is the key we poll on, waiting for it to become "true".
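
For reference, a minimal sketch of the kind of polling described above (check SyncStatus every few seconds, give up after 60s). The controller address and the exact timing are assumptions for illustration, not the actual test code:

  # Hypothetical controller address; the jolokia URL is the one queried above.
  ODL_IP=192.0.2.10
  URL="http://$ODL_IP:8081/jolokia/read/org.opendaylight.controller:Category=ShardManager,name=shard-manager-operational,type=DistributedOperationalDatastore"

  # Poll SyncStatus every 5s, for up to 60s.
  for i in $(seq 1 12); do
      if curl -s "$URL" | grep -q '"SyncStatus": *true'; then
          echo "cluster member is in sync"
          exit 0
      fi
      sleep 5
  done
  echo "cluster member did not report SyncStatus=true within 60s"
  exit 1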
Comment 2 jamo luhrsen 2018-05-03 17:30 EDT
Created attachment 1430918 [details]
three opendaylight logs

you can get all of the logs collected from the artifacts of the
job:

https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/27/artifact/

but, attached (for easier consumption) are just the three ODL logs.
Comment 3 jamo luhrsen 2018-05-03 17:53:56 EDT
(In reply to jamo luhrsen from comment #0)
> Additional info:
> 
> This is/was seen in u/s CSIT jobs as well, and I think other (non-netvirt)
> projects might even have 5m timeouts before failing the tests.

actually, I found an example of this 5m timeout in the d/s job too:

  https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/26/robot/report/log.html#s1-s3-t33-k2-k2-k8
Comment 4 jamo luhrsen 2018-05-03 18:02:07 EDT
same basic issue here:

https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/28/robot/report/log.html

It is easy to see in the HA L2 suite, but in the HA L3 suite there is a failure
where ODL3 returns a 404 on a GET to /restconf/modules, and I'm guessing that's
because it's still dead (i.e., SyncStatus: false) from the problem in the HA L2
suite.
Comment 5 Mike Kolesnik 2018-05-21 08:59:24 EDT
Stephen, any update on this issue?
Comment 6 Stephen Kitt 2018-06-04 11:46:16 EDT
I’m seeing https://jira.opendaylight.org/browse/OPNFLWPLUG-1013 in suspicious places in the logs here too, so https://code.engineering.redhat.com/gerrit/140588 might help. Additional MD-SAL traces (https://git.opendaylight.org/gerrit/72658) should help us determine exactly what’s going wrong.
Comment 7 Victor Pickard 2018-06-25 16:56 EDT
Created attachment 1454492 [details]
controller logs
Comment 8 Victor Pickard 2018-06-25 16:57 EDT
Created attachment 1454493 [details]
Test script to reproduce issue
Comment 9 Victor Pickard 2018-06-25 17:04:16 EDT
I've been looking at this a little and wrote a small script to reproduce the issue (attached).

Here is a snippet of the output below.

This was from a fresh deploy of the overcloud. No automated tests were executed. Basically, no config (networks, subnets, VMs, ports, etc.).

controller-0 172.17.1.24
controller-1 172.17.1.20
controller-2 172.17.1.23

I ran this script on controller-0, which is the one that did not go back in sync.

I was monitoring the karaf logs on controller-0 when this failure occurred; it looks like ODL did not start up correctly due to some blueprint issues.



ODL starting
...............................Controller in sync, took 31 seconds!
Stopping ODL, iteration 9
opendaylight_api
sleeping 10 secs
Starting ODL
opendaylight_api
ODL starting
...............................Controller in sync, took 31 seconds!
Stopping ODL, iteration 10
opendaylight_api
sleeping 10 secs
Starting ODL
opendaylight_api
ODL starting
........................................................................................................................Controller is NOT in sync!
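
For readers without access to the attachment, here is a rough sketch of what a restart-loop reproducer producing output like the above could look like. The opendaylight_api container name matches the output, but the docker commands, iteration count, and sync check are assumptions, not the attached script itself:

  # Hypothetical values; point URL at the local node's jolokia endpoint.
  ODL_IP=172.17.1.24
  URL="http://$ODL_IP:8081/jolokia/read/org.opendaylight.controller:Category=ShardManager,name=shard-manager-operational,type=DistributedOperationalDatastore"

  for i in $(seq 1 10); do
      echo "Stopping ODL, iteration $i"
      docker stop opendaylight_api
      echo "sleeping 10 secs"
      sleep 10

      echo "Starting ODL"
      docker start opendaylight_api
      echo "ODL starting"

      # Wait (with a generous cap) for SyncStatus to come back to true.
      synced=no
      for t in $(seq 1 120); do
          if curl -s "$URL" | grep -q '"SyncStatus": *true'; then
              echo "Controller in sync, took $t seconds!"
              synced=yes
              break
          fi
          sleep 1
      done
      if [ "$synced" = no ]; then
          echo "Controller is NOT in sync!"
          break
      fi
  done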
Comment 11 jamo luhrsen 2018-08-01 09:47:35 EDT
I would assume this is also tracked here:
https://jira.opendaylight.org/browse/CONTROLLER-1849

which is making progress (slow, but still progressing)
