Red Hat Bugzilla – Bug 1574708
[HA] ODL cluster member fails to fully sync back to the cluster after isolation test
Last modified: 2018-09-06 15:17:22 EDT
Description of problem:

An ODL node is isolated from a 3 node cluster (via iptables rules) and then communication is restored. However, the cluster sync status never becomes "True" after polling for 60s.

Version-Release number of selected component (if applicable):
opendaylight-8.0.0-9.el7ost.noarch.rpm

How reproducible:
sporadically

Steps to Reproduce:
1. run the u/s CSIT suites in a job like this:
   https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit
2. cross fingers that we hit it, and not some other bug first.

Actual results:
polling quits after 60s and the test is marked as FAIL

Expected results:
I am really not sure, but the internal timeout mechanisms are in the 5s range, so I would expect cluster sync to complete within a few iterations of that, at worst. A full sync with lots of data could take longer, but these are not heavy tests and the data to sync is probably not very big.

Additional info:
This is/was seen in u/s CSIT jobs as well, and I think other (non-netvirt) projects might even have 5m timeouts before failing the tests. If that's still the case, then something is probably totally stuck and broken, and this is not a case where we can just wait a little longer.
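For reference, a minimal sketch of the isolation step described above, in case someone wants to reproduce it by hand. This is not the actual CSIT code: the peer IPs are just the other two controller addresses reported later in this bug, it assumes it runs as root on the member being isolated, and the real suite may scope its iptables rules more narrowly than "drop everything":

#!/usr/bin/env python
# Sketch of the isolate/restore step (assumptions noted above; not the CSIT suite).
import subprocess
import time

PEERS = ["172.17.1.20", "172.17.1.23"]  # the other two cluster members

def set_isolation(enabled):
    action = "-I" if enabled else "-D"
    for peer in PEERS:
        # Drop all traffic exchanged with each peer while isolated.
        subprocess.check_call(["iptables", action, "INPUT", "-s", peer, "-j", "DROP"])
        subprocess.check_call(["iptables", action, "OUTPUT", "-d", peer, "-j", "DROP"])

set_isolation(True)   # partition this member from the rest of the cluster
time.sleep(60)        # leave it partitioned long enough for re-election
set_isolation(False)  # restore communication; SyncStatus should return to true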
exact test case that fails:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit

output of the cluster status REST query:

{
  "request": {
    "mbean": "org.opendaylight.controller:Category=ShardManager,name=shard-manager-operational,type=DistributedOperationalDatastore",
    "type": "read"
  },
  "value": {
    "LocalShards": [
      "member-0-shard-default-operational"
    ],
    "SyncStatus": false,
    "MemberName": "member-0"
  },
  "timestamp": 1525213624,
  "status": 200
}

SyncStatus is the key we are polling on to be "true"
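For reference, the polling the test does amounts to something like the sketch below. It reads the same ShardManager MBean shown above; the Jolokia endpoint on port 8181 and the admin/admin credentials are assumptions here, not taken from the suite:

#!/usr/bin/env python
# Minimal sketch of the SyncStatus polling (assumed port/credentials, see above).
import time
import requests

MBEAN = ("org.opendaylight.controller:Category=ShardManager,"
         "name=shard-manager-operational,type=DistributedOperationalDatastore")

def wait_for_sync(host, timeout=60, interval=5):
    url = "http://%s:8181/jolokia/read/%s" % (host, MBEAN)
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            value = requests.get(url, auth=("admin", "admin"), timeout=5).json().get("value", {})
            if value.get("SyncStatus") is True:
                return True
        except (requests.RequestException, ValueError):
            pass  # controller may still be coming up
        time.sleep(interval)
    return False

print(wait_for_sync("172.17.1.24"))  # e.g. controller-0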
Created attachment 1430918 [details]
three opendaylight logs

You can get all of the logs collected from the artifacts of the job:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/27/artifact/
but, attached (for easier consumption), are just the three ODL logs.
(In reply to jamo luhrsen from comment #0)
> Additional info:
>
> This is/was seen in u/s CSIT jobs as well, and I think other (non-netvirt)
> projects might even have 5m timeouts before failing the tests.

actually, I found an example of this 5m timeout in the d/s job too:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/26/robot/report/log.html#s1-s3-t33-k2-k2-k8
same basic issue here:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/28/robot/report/log.html

It is easy to see in the HA L2 suite, but in the HA L3 suite there is a failure where ODL3 returns 404 on a GET to /restconf/modules. I'm guessing that's because it's still dead (aka SyncStatus: false) from the problem in the HA L2 suite.
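For what it's worth, that liveness check is roughly the following (a sketch; restconf on port 8181 and admin/admin are assumptions, and the controller IPs are the ones listed later in this bug). A 404 or connection error here lines up with SyncStatus staying false:

# Quick check of the symptom: healthy members answer GET /restconf/modules
# with 200, the stuck one returns 404 (assumed port/credentials, see above).
import requests

for host in ("172.17.1.24", "172.17.1.20", "172.17.1.23"):
    try:
        resp = requests.get("http://%s:8181/restconf/modules" % host,
                            auth=("admin", "admin"), timeout=5)
        print(host, resp.status_code)
    except requests.RequestException as exc:
        print(host, "unreachable:", exc)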
Stephen, any update on this issue?
I’m seeing https://jira.opendaylight.org/browse/OPNFLWPLUG-1013 in suspicious places in the logs here too, so https://code.engineering.redhat.com/gerrit/140588 might help. Additional MD-SAL traces (https://git.opendaylight.org/gerrit/72658) should help us determine exactly what’s going wrong.
Created attachment 1454492 [details] controller logs
Created attachment 1454493 [details] Test script to reproduce issue
I've been looking at this a little, and wrote a small script (attached) to reproduce the issue. Here is a snippet of the output below. This was from a fresh deploy of the overcloud. No automated tests were executed, so basically no config (networks, subnets, VMs, ports, etc).

controller-0 172.17.1.24
controller-1 172.17.1.20
controller-2 172.17.1.23

I ran this script on controller-0, which is the one that did not go back in sync. I was monitoring karaf logs on controller-0 when this failure occurred; it looks like ODL did not start up correctly due to some blueprint issues.

ODL starting ...............................Controller in sync, took 31 seconds!
Stopping ODL, iteration 9
opendaylight_api
sleeping 10 secs
Starting ODL
opendaylight_api
ODL starting ...............................Controller in sync, took 31 seconds!
Stopping ODL, iteration 10
opendaylight_api
sleeping 10 secs
Starting ODL
opendaylight_api
ODL starting ........................................................................................................................Controller is NOT in sync!
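A rough Python equivalent of that loop is sketched below (the attachment itself is a shell script and may differ). Assumptions: ODL runs in the "opendaylight_api" docker container, as the output suggests, and the sync check is the same Jolokia ShardManager query quoted earlier, with an assumed port 8181 and admin/admin credentials:

#!/usr/bin/env python
# Sketch of the stop/start/poll reproducer loop (assumptions noted above).
import subprocess
import sys
import time
import requests

HOST = "172.17.1.24"  # controller-0, the node the script was run on
URL = ("http://%s:8181/jolokia/read/org.opendaylight.controller:"
       "Category=ShardManager,name=shard-manager-operational,"
       "type=DistributedOperationalDatastore" % HOST)

def in_sync():
    try:
        return requests.get(URL, auth=("admin", "admin"), timeout=5).json()["value"]["SyncStatus"]
    except (requests.RequestException, ValueError, KeyError):
        return False

for iteration in range(1, 101):
    print("Stopping ODL, iteration %d" % iteration)
    subprocess.check_call(["docker", "stop", "opendaylight_api"])
    print("sleeping 10 secs")
    time.sleep(10)
    print("Starting ODL")
    subprocess.check_call(["docker", "start", "opendaylight_api"])
    start = time.time()
    while not in_sync():
        if time.time() - start > 600:
            print("Controller is NOT in sync!")
            sys.exit(1)
        time.sleep(1)
    print("Controller in sync, took %d seconds!" % (time.time() - start))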
I would assume this is also tracked here:
https://jira.opendaylight.org/browse/CONTROLLER-1849
which is making progress (slowly, but still progressing).