Bug 1574739
Summary: | [Netvirt][HA] instances do not get their dhcp lease |
---|---|
Product: | Red Hat OpenStack |
Component: | opendaylight |
Version: | 13.0 (Queens) |
Status: | CLOSED DUPLICATE |
Severity: | high |
Priority: | high |
Reporter: | jamo luhrsen <jluhrsen> |
Assignee: | Vishal Thapar <vthapar> |
QA Contact: | Itzik Brown <itbrown> |
CC: | jluhrsen, mkolesni, nyechiel, wznoinsk |
Keywords: | Triaged, ZStream |
Target Milestone: | z1 |
Target Release: | 13.0 (Queens) |
Hardware: | Unspecified |
OS: | Unspecified |
Whiteboard: | odl_netvirt, odl_ha |
Type: | Bug |
Last Closed: | 2018-05-21 21:36:56 UTC |
Description
jamo luhrsen — 2018-05-03 22:49:34 UTC

Please attach logs from neutron (server & dhcp) & ODL.

(In reply to jamo luhrsen from comment #0)
> Description of problem:
>
> In an HA setup, sometimes instances do not receive their dhcp lease.
>
> Version-Release number of selected component (if applicable):
> opendaylight-8.0.0-9.el7ost.noarch.rpm
>
> How reproducible:
> sporadically
>
> Steps to Reproduce:
> 1. run this job:
> https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/
>
> Actual results:
> some instances (not all) do not receive their dhcp lease
>
> Expected results:
> instances should get their address.
>
> Additional info:
> The same sporadic issues are also seen in these same tests run in upstream
> ODL CSIT. Here is a direct link to the robot test logs, where you can see an
> instance's console log and the 20 attempts to get a dhcp lease before
> finally giving up:
> https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/

I forgot to give a specific job as an example of this problem, but it happens often enough that we don't need to worry about one job. Either way, here is one:

https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/27

Created attachment 1435528 [details]
nova, neutron, odl logs
.xz file should contain:
$ tree -s
.
├── [ 4096] compute-0
│ └── [ 4718163] nova.log
├── [ 4096] compute-1
│ └── [ 7257573] nova.log
├── [ 4096] controller-0
│ ├── [ 1508876] neutron-dhcp.log
│ ├── [ 6726385] neutron.log
│ └── [ 16900312] odl.log
├── [ 4096] controller-1
│ ├── [ 1522104] neutron-dhcp.log
│ ├── [ 7276265] neutron.log
│ └── [ 21092528] odl.log
├── [ 4096] controller-2
│ ├── [ 1530288] neutron-dhcp.log
│ ├── [ 6518581] neutron.log
│ └── [ 15316198] odl.log
└── [ 2125828] logs.tar.xz
We have the same problem (VM doesn't get a fixed IP from DHCP) in non-HA jobs, reported in https://bugzilla.redhat.com/show_bug.cgi?id=1573224. Referencing it here because it may be easier to troubleshoot the non-HA logs.

There are the following exceptions in the neutron logs:

2018-05-01 23:05:32.436 27 INFO neutron.wsgi [-] 172.17.1.21 "OPTIONS / HTTP/1.0" status: 200 len: 248 time: 0.0015409
2018-05-01 23:05:33.251 34 ERROR networking_odl.common.client [req-8cdb564b-7dce-4026-996c-4fa6efab8442 - - - - -] REST request ( post ) to url ( restconf/operations/sal-remote:create-data-change-event-subscription ) is failed. Request body : [{'input': {'path': '/neutron:neutron/neutron:ports', 'sal-remote-augment:notification-output-type': 'JSON', 'sal-remote-augment:datastore': 'OPERATIONAL', 'sal-remote-augment:scope': 'SUBTREE'}}] service: ReadTimeout: HTTPConnectionPool(host='172.17.1.13', port=8081): Read timed out. (read timeout=10)

and

2018-05-01 23:28:18.922 27 ERROR networking_odl.common.client [req-b7b2a77f-e889-47c5-8f40-ba43e67fc06f - - - - -] REST request ( post ) to url ( routers ) is failed. Request body : [{u'router': {'status': 'ACTIVE', 'external_gateway_info': None, 'availability_zone_hints': [], 'ha': False, 'ha_vr_id': 0, 'description': '', 'gw_port_id': None, 'tags': [], 'tenant_id': u'37ed46f5d05049dca077948ef7aa2fa5', 'created_at': '2018-05-01T22:28:08Z', 'admin_state_up': True, 'distributed': False, 'updated_at': '2018-05-01T22:28:08Z', 'revision_number': 0, 'routes': [], 'project_id': u'37ed46f5d05049dca077948ef7aa2fa5', 'id': '1c1a2c43-fbaa-43e0-a6be-c4e2562c5e2a', 'name': u'cl3_bp_router_3'}}] service: ReadTimeout: HTTPConnectionPool(host='172.17.1.13', port=8081): Read timed out. (read timeout=10)
2018-05-01 23:28:18.923 27 ERROR networking_odl.journal.journal [req-b7b2a77f-e889-47c5-8f40-ba43e67fc06f - - - - -] Error while processing (Entry ID: 1505) - create router 1c1a2c43-fbaa-43e0-a6be-c4e2562c5e2a (Time stamp: 63660814098.9): ReadTimeout: HTTPConnectionPool(host='172.17.1.13', port=8081): Read timed out. (read timeout=10)

In the karaf logs there are exceptions:

2018-05-01T21:36:40,323 | INFO | pipe-log:log "ROBOT MESSAGE: Starting test L2 & L3 & Ha L2 & Ha L3 & Ha L3 Block Port & Neutron Security Group.Ha L2.Connectivity Tests From Vm Instance2 In net_1" | core | 123 - org.apache.karaf.log.core - 4.1.3 | ROBOT MESSAGE: Starting test L2 & L3 & Ha L2 & Ha L3 & Ha L3 Block Port & Neutron Security Group.Ha L2.Bring Up ODL3
2018-05-01T21:36:43,168 | WARN | opendaylight-cluster-data-akka.actor.default-dispatcher-48 | NettyTransport | 46 - com.typesafe.akka.slf4j - 2.5.4 | Remote connection to [null] failed with java.net.ConnectException: Connection refused: /172.17.1.14:2550

This implies the node is unable to talk to its cluster peers. A couple of such entries are expected in the logs during bring-up, as a node tries to find its peers, but not once things are up.

Then there are ModifiedNodeDoesntExist exceptions on opendaylight-inventory, which suggest either the port or the device went missing [a disconnect, or again cluster issues].

Before we go looking at flows, we need to check whether ODL has all the data or not. Someone [Michael???] can take a look at the logs and confirm whether the cluster is actually healthy during the run. We need ODL datastore dumps for the failing scenarios to check whether the data is present in the datastore. If it is, then the cluster logs are just a red herring. If not, then it is an issue with core MD-SAL.

Vishal, hey, any chance you can cross-compare the non-HA version of this problem in https://bugzilla.redhat.com/show_bug.cgi?id=1573224?

Created attachment 1439743 [details]
file 1 of split robot's log.html
run this to get the full log.html back:
cat xaa xab xac xad > log.html
Created attachment 1439744 [details]
file 2 of split robot's log.html
run this to get the full log.html back:
cat xaa xab xac xad > log.html
Created attachment 1439745 [details]
file 3 of split robot's log.html
run this to get the full log.html back:
cat xaa xab xac xad > log.html
Created attachment 1439746 [details]
file 4 of split robot's log.html
run this to get the full log.html back:
cat xaa xab xac xad > log.html
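The four attachments above reassemble into a single log.html. A minimal sketch of the whole round trip, assuming GNU `split` (whose default output names are xaa, xab, xac, ...; the sample file here stands in for the real robot log):

```shell
# Build a stand-in for robot's log.html, split it into 4 pieces, and
# reassemble with cat exactly as the attachment notes describe.
printf 'line %d\n' $(seq 1 5000) > log.html
split -n 4 log.html                  # produces xaa xab xac xad
cat xaa xab xac xad > log_joined.html
cmp log.html log_joined.html && echo "reassembly is lossless"
```

Splitting is only needed because of Bugzilla's per-attachment size limit; `cat` concatenates the pieces back byte-for-byte.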
Inline below, but first I want to point out that the job I used to file this bug has failures straight away, before any ODL nodes are stopped, blocked, etc. So I think we should skip worrying about the points in the tests that deal with all of that mess, which will just confuse things. I say we dig in on that first L2 suite's failure where.... oh shoot. I think this specific job is actually failing because of that missing table-48 bug; I can see that flow is missing on the 2nd compute node. This bug:

https://bugzilla.redhat.com/show_bug.cgi?id=1568989

I left my comments inline below, as I wrote them before realizing this was failing due to the missing table 48. I'll go dig through other jobs to find any cases that wouldn't be explained by this. If there are none, we can DUP this, I think.

(In reply to Vishal Thapar from comment #9)
> There are the following exceptions in the neutron logs:
> [...]
> service: ReadTimeout: HTTPConnectionPool(host='172.17.1.13', port=8081):
> Read timed out. (read timeout=10)

I was trying to match up these exceptions with what's going on in the robot logs, but the timestamps aren't working. The robot suite seems to have been done at 22:47, and both of those entries are after that (23:05 and 23:28). That is more than 15 minutes after the robot CSIT is done, and the Jenkins job finishes log collection well before that (afaict). The timestamps in the karaf log do seem to sync up, so I'm not sure how to explain it.

> In the karaf logs there are exceptions:
> [...]
> Remote connection to [null] failed with java.net.ConnectException:
> Connection refused: /172.17.1.14:2550
>
> This implies the node is unable to talk to its cluster peers. A couple of
> such entries are expected in the logs during bring-up, but not once things
> are up.

Right. Looking at the robot logs, I would think this ConnectException is OK at that timestamp (21:36:43), as that's when ODL3 is still coming up. It's up, with data sync set to True, at 21:37:22, so anything suspicious after that is what we can look for. But the very next test case is dropping ODL1 and ODL2, so again we'd expect to see more of these kinds of issues.

> Then there are ModifiedNodeDoesntExist exceptions on opendaylight-inventory
> which suggest either the port or device went missing [disconnect or again
> cluster issues].
>
> Before we go looking at flows, we need to check whether ODL has all the
> data or not. Someone [Michael???] can take a look at the logs and confirm
> whether the cluster is actually healthy during the run. We need ODL
> datastore dumps for the failing scenarios to check whether the data is
> present in the datastore. If it is, then the cluster logs are just a red
> herring. If not, then it is an issue with core MD-SAL.

I can help dig some of this up, as we collect the info in the robot test teardowns.
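The ReadTimeout tracebacks quoted above all describe the same failure mode: the ODL endpoint accepts the TCP connection but never answers within networking_odl's 10-second read budget. That behavior can be reproduced in isolation with a deliberately silent server (a sketch only; the 127.0.0.1:18081 address and the 2-second budget are arbitrary stand-ins for the deployment's 172.17.1.13:8081 and timeout=10):

```shell
# A server that accepts a TCP connection and then stays silent -- a stand-in
# for a wedged ODL REST endpoint.
python3 - <<'EOF' &
import socket, time
s = socket.socket()
s.bind(("127.0.0.1", 18081))
s.listen(1)
conn, _ = s.accept()   # accept the connection, then never respond
time.sleep(30)
EOF
SRV=$!
sleep 1

# Give the request a 2 s budget; curl exit code 28 means "operation timed
# out", i.e. the connection succeeded but no response arrived in time --
# the same shape as the HTTPConnectionPool ReadTimeout in the neutron logs.
rc=0
curl -s --max-time 2 http://127.0.0.1:18081/ || rc=$?
echo "curl exit code: $rc"
kill $SRV 2>/dev/null
```

A connection-refused failure (a node that is actually down) would instead fail immediately with a different curl exit code, which is one way to tell a wedged node from a dead one.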
I dug through as many CSIT reports as were available to find this high level symptom of "instances not getting their dhcp lease" and I am fairly confident they all are coming from two other bugs we can use to track with: missing table 48 on a compute: https://bugzilla.redhat.com/show_bug.cgi?id=1568989 agents think ODL is down, so instances go to ERROR state: https://bugzilla.redhat.com/show_bug.cgi?id=1575150 I'm going to DUP this bug to 1575150 because that was the more frequent reason for the problem. Fixing the above two bugs will go a long way in making our 3node CSIT jobs more stable and remove a lot of failure noise. *** This bug has been marked as a duplicate of bug 1575150 *** |