Bug 1574739

Summary: [Netvirt][HA] instances do not get their dhcp lease
Product: Red Hat OpenStack
Component: opendaylight
Version: 13.0 (Queens)
Target Release: 13.0 (Queens)
Target Milestone: z1
Hardware: Unspecified
OS: Unspecified
Status: CLOSED DUPLICATE
Severity: high
Priority: high
Keywords: Triaged, ZStream
Whiteboard: odl_netvirt, odl_ha
Type: Bug
Reporter: jamo luhrsen <jluhrsen>
Assignee: Vishal Thapar <vthapar>
QA Contact: Itzik Brown <itbrown>
CC: jluhrsen, mkolesni, nyechiel, wznoinsk
Doc Type: If docs needed, set a value
Environment: N/A
Last Closed: 2018-05-21 21:36:56 UTC
Attachments:
  nova, neutron, odl logs
  file 1 of split robot's log.html
  file 2 of split robot's log.html
  file 3 of split robot's log.html
  file 4 of split robot's log.html

Description jamo luhrsen 2018-05-03 22:49:34 UTC
Description of problem:

In an HA setup, sometimes instances do not receive their dhcp lease.


Version-Release number of selected component (if applicable):
opendaylight-8.0.0-9.el7ost.noarch.rpm

How reproducible:
sporadically

Steps to Reproduce:
1. run this job:
   https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/

2.
3.

Actual results:

some instances (not all) do not receive their dhcp lease

Expected results:

instances should get their address.

Additional info:

the same sporadic issues are also seen when these same tests run in upstream (u/s) ODL CSIT.

here's a direct link to the robot test logs where you can see an
instance's console log and the 20 attempts to get a dhcp lease before
finally giving up:

https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/
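
For reference, one quick way to see the symptom by hand, assuming a cirros guest
image and openstack CLI access to the overcloud (the instance name below is just
a placeholder):

$ openstack console log show <instance-name-or-id> | grep -i "sending discover"

A healthy boot shows one or two "Sending discover..." lines followed by a lease;
the failing instances show the 20 retries and then give up.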

Comment 1 Mike Kolesnik 2018-05-07 06:54:48 UTC
Please attach logs from neutron (server & dhcp) & ODL.

Comment 3 jamo luhrsen 2018-05-12 21:41:43 UTC
(In reply to jamo luhrsen from comment #0)
> Description of problem:
> 
> In an HA setup, sometimes instances do not receive their dhcp lease.
> 
> 
> Version-Release number of selected component (if applicable):
> opendaylight-8.0.0-9.el7ost.noarch.rpm
> 
> How reproducible:
> sporadically
> 
> Steps to Reproduce:
> 1. run this job:
>   
> https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-
> odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/
> 
> 2.
> 3.
> 
> Actual results:
> 
> some instances (not all) do not receive their dhcp lease
> 
> Expected results:
> 
> instances should get their address.
> 
> Additional info:
> 
> same sporadic issues are also seen in these same tests run in u/s ODL CSIT.
> 
> here's a direct link to the robot test logs where you can see an
> instance's console log and the 20 attempts to get a dhcp lease before
> finally giving up:
> 
> https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-
> odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/

I forgot to point at a specific job run as an example of this problem, but it
happens often enough that we don't need to worry about any one job. Either way,
here is one:

https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/27

Comment 5 jamo luhrsen 2018-05-12 21:44:48 UTC
Created attachment 1435528 [details]
nova, neutron, odl logs

The .xz file should contain:

$ tree -s
.
├── [       4096]  compute-0
│   └── [    4718163]  nova.log
├── [       4096]  compute-1
│   └── [    7257573]  nova.log
├── [       4096]  controller-0
│   ├── [    1508876]  neutron-dhcp.log
│   ├── [    6726385]  neutron.log
│   └── [   16900312]  odl.log
├── [       4096]  controller-1
│   ├── [    1522104]  neutron-dhcp.log
│   ├── [    7276265]  neutron.log
│   └── [   21092528]  odl.log
├── [       4096]  controller-2
│   ├── [    1530288]  neutron-dhcp.log
│   ├── [    6518581]  neutron.log
│   └── [   15316198]  odl.log
└── [    2125828]  logs.tar.xz

Comment 8 Waldemar Znoinski 2018-05-16 10:13:06 UTC
We have the same problem (VM doesn't get its fixed IP from DHCP) in non-HA jobs, reported in https://bugzilla.redhat.com/show_bug.cgi?id=1573224.
Referencing it here because it may be easier to troubleshoot the non-HA logs.

Comment 9 Vishal Thapar 2018-05-17 06:18:59 UTC
There are the following exceptions in the neutron logs:

2018-05-01 23:05:32.436 27 INFO neutron.wsgi [-] 172.17.1.21 "OPTIONS / HTTP/1.0" status: 200  len: 248 time: 0.0015409
2018-05-01 23:05:33.251 34 ERROR networking_odl.common.client [req-8cdb564b-7dce-4026-996c-4fa6efab8442 - - - - -] REST request ( post ) to url ( restconf/operations/sal-remote:create-data-change-event-subscription ) is failed. Request body : [{'input': {'path': '/neutron:neutron/neutron:ports', 'sal-remote-augment:notification-output-type': 'JSON', 'sal-remote-augment:datastore': 'OPERATIONAL', 'sal-remote-augment:scope': 'SUBTREE'}}] service: ReadTimeout: HTTPConnectionPool(host='172.17.1.13', port=8081): Read timed out. (read timeout=10)

and 

018-05-01 23:28:18.922 27 ERROR networking_odl.common.client [req-b7b2a77f-e889-47c5-8f40-ba43e67fc06f - - - - -] REST request ( post ) to url ( routers ) is failed. Request body : [{u'router': {'status': 'ACTIVE', 'external_gateway_info': None, 'availability_zone_hints': [], 'ha': False, 'ha_vr_id': 0, 'description': '', 'gw_port_id': None, 'tags': [], 'tenant_id': u'37ed46f5d05049dca077948ef7aa2fa5', 'created_at': '2018-05-01T22:28:08Z', 'admin_state_up': True, 'distributed': False, 'updated_at': '2018-05-01T22:28:08Z', 'revision_number': 0, 'routes': [], 'project_id': u'37ed46f5d05049dca077948ef7aa2fa5', 'id': '1c1a2c43-fbaa-43e0-a6be-c4e2562c5e2a', 'name': u'cl3_bp_router_3'}}] service: ReadTimeout: HTTPConnectionPool(host='172.17.1.13', port=8081): Read timed out. (read timeout=10)
2018-05-01 23:28:18.923 27 ERROR networking_odl.journal.journal [req-b7b2a77f-e889-47c5-8f40-ba43e67fc06f - - - - -] Error while processing (Entry ID: 1505) - create router 1c1a2c43-fbaa-43e0-a6be-c4e2562c5e2a (Time stamp: 63660814098.9): ReadTimeout: HTTPConnectionPool(host='172.17.1.13', port=8081): Read timed out. (read timeout=10)
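
Both failures are 10s read timeouts from networking-odl against the ODL northbound
on 172.17.1.13:8081. A quick way to sanity-check that endpoint by hand is any
restconf GET, e.g. (adjust credentials to whatever this deployment uses):

$ curl -u admin:admin -s -o /dev/null -w "%{http_code} %{time_total}s\n" -m 15 http://172.17.1.13:8081/restconf/modules

If that also hangs or takes close to the 10s timeout, the problem is ODL being
slow or unresponsive rather than anything on the neutron side.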

In the karaf logs there are these exceptions:

2018-05-01T21:36:40,323 | INFO  | pipe-log:log "ROBOT MESSAGE: Starting test L2 & L3 & Ha L2 & Ha L3 & Ha L3 Block Port & Neutron Security Group.Ha L2.Connectivity Tests From Vm Instance2 In net_1" | core                             | 123 - org.apache.karaf.log.core - 4.1.3 | ROBOT MESSAGE: Starting test L2 & L3 & Ha L2 & Ha L3 & Ha L3 Block Port & Neutron Security Group.Ha L2.Bring Up ODL3
2018-05-01T21:36:43,168 | WARN  | opendaylight-cluster-data-akka.actor.default-dispatcher-48 | NettyTransport                   | 46 - com.typesafe.akka.slf4j - 2.5.4 | Remote connection to [null] failed with java.net.ConnectException: Connection refused: /172.17.1.14:2550

This implies the node is unable to talk to its cluster peers. A couple of such entries are expected in the logs during bring-up, while a node tries to find its peers, but not once things are up.
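
One way to confirm whether the members actually see each other after bring-up is
the shard manager MBean exposed over Jolokia, e.g. (port and credentials here are
assumptions, adjust to the deployment):

$ curl -u admin:admin -s "http://172.17.1.13:8081/jolokia/read/org.opendaylight.controller:type=DistributedOperationalDatastore,Category=ShardManager,name=shard-manager-operational"

SyncStatus should report true on all three controllers once the cluster has settled.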

Then there are ModifiedNodeDoesntExist exceptions on opendaylight-inventory, which suggest either the port or the device went missing [disconnect, or again cluster issues].

Before we go looking at flows, we need to check whether ODL has all the data. Someone [Michael???] can take a look at the logs and confirm whether the cluster is actually healthy during the run. We need ODL datastore dumps for the failure scenarios to check whether the data is present in the datastore. If it is, then the cluster logs are just a red herring. If not, then it is an issue with core MD-SAL.
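
For the dumps, something like the following from each controller should be enough
to compare (port per this deployment; the neutron northbound data lives under
neutron:neutron in config, and the switch/flow view under opendaylight-inventory:nodes):

$ curl -u admin:admin -s http://172.17.1.13:8081/restconf/config/neutron:neutron -o neutron-config.json
$ curl -u admin:admin -s http://172.17.1.13:8081/restconf/config/opendaylight-inventory:nodes -o inventory-config.json
$ curl -u admin:admin -s http://172.17.1.13:8081/restconf/operational/opendaylight-inventory:nodes -o inventory-operational.json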

Comment 10 Waldemar Znoinski 2018-05-17 09:58:03 UTC
Hey Vishal,
any chance you can cross-compare with the non-HA version of this problem in https://bugzilla.redhat.com/show_bug.cgi?id=1573224 ?

Comment 11 jamo luhrsen 2018-05-21 19:06:04 UTC
Created attachment 1439743 [details]
file 1 of split robot's log.html

run this to get the full log.html back:

cat xaa xab xac xad > log.html

Comment 12 jamo luhrsen 2018-05-21 19:07:30 UTC
Created attachment 1439744 [details]
file 2 of split robot's log.html

run this to get the full log.html back:

cat xaa xab xac xad > log.html

Comment 13 jamo luhrsen 2018-05-21 19:08:57 UTC
Created attachment 1439745 [details]
file 3 of split robot's log.html

run this to get the full log.html back:

cat xaa xab xac xad > log.html

Comment 14 jamo luhrsen 2018-05-21 19:09:29 UTC
Created attachment 1439746 [details]
file 4 of split robot's log.html

run this to get the full log.html back:

cat xaa xab xac xad > log.html

Comment 15 jamo luhrsen 2018-05-21 21:23:03 UTC
Replies inline below, but first I want to point out that the job I used
to file this bug has failures straight away, before any ODL nodes are
stopped, blocked, etc. So I think we should skip worrying about the
points in the tests that deal with all of that mess, which will just
confuse things. I say we dig in on that first L2 suite's failure
where... oh shoot. I think this specific job is actually failing
because of that missing table-48 bug. I can see that flow is missing
on the 2nd compute node.

this bug:

https://bugzilla.redhat.com/show_bug.cgi?id=1568989
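
For reference, the missing flow is easy to spot by dumping table 48 from br-int
on each compute (assuming the default integration bridge name):

$ sudo ovs-ofctl -O OpenFlow13 dump-flows br-int table=48

On a healthy compute the dump shows the table-48 flow(s); on the 2nd compute it
comes back empty.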

I left my comments inline below as I wrote them before realizing this
was failing due to the missing table-48 flow. I'll go dig through other
jobs to find any cases that wouldn't be explained by this. If there
are none, I think we can DUP this.


(In reply to Vishal Thapar from comment #9)
> There are following exceptions in neutron logs:
> 
> 2018-05-01 23:05:32.436 27 INFO neutron.wsgi [-] 172.17.1.21 "OPTIONS /
> HTTP/1.0" status: 200  len: 248 time: 0.0015409
> 2018-05-01 23:05:33.251 34 ERROR networking_odl.common.client
> [req-8cdb564b-7dce-4026-996c-4fa6efab8442 - - - - -] REST request ( post )
> to url (
> restconf/operations/sal-remote:create-data-change-event-subscription ) is
> failed. Request body : [{'input': {'path': '/neutron:neutron/neutron:ports',
> 'sal-remote-augment:notification-output-type': 'JSON',
> 'sal-remote-augment:datastore': 'OPERATIONAL', 'sal-remote-augment:scope':
> 'SUBTREE'}}] service: ReadTimeout: HTTPConnectionPool(host='172.17.1.13',
> port=8081): Read timed out. (read timeout=10)
> 
> and 
> 
> 018-05-01 23:28:18.922 27 ERROR networking_odl.common.client
> [req-b7b2a77f-e889-47c5-8f40-ba43e67fc06f - - - - -] REST request ( post )
> to url ( routers ) is failed. Request body : [{u'router': {'status':
> 'ACTIVE', 'external_gateway_info': None, 'availability_zone_hints': [],
> 'ha': False, 'ha_vr_id': 0, 'description': '', 'gw_port_id': None, 'tags':
> [], 'tenant_id': u'37ed46f5d05049dca077948ef7aa2fa5', 'created_at':
> '2018-05-01T22:28:08Z', 'admin_state_up': True, 'distributed': False,
> 'updated_at': '2018-05-01T22:28:08Z', 'revision_number': 0, 'routes': [],
> 'project_id': u'37ed46f5d05049dca077948ef7aa2fa5', 'id':
> '1c1a2c43-fbaa-43e0-a6be-c4e2562c5e2a', 'name': u'cl3_bp_router_3'}}]
> service: ReadTimeout: HTTPConnectionPool(host='172.17.1.13', port=8081):
> Read timed out. (read timeout=10)
> 2018-05-01 23:28:18.923 27 ERROR networking_odl.journal.journal
> [req-b7b2a77f-e889-47c5-8f40-ba43e67fc06f - - - - -] Error while processing
> (Entry ID: 1505) - create router 1c1a2c43-fbaa-43e0-a6be-c4e2562c5e2a (Time
> stamp: 63660814098.9): ReadTimeout: HTTPConnectionPool(host='172.17.1.13',
> port=8081): Read timed out. (read timeout=10)


I was trying to match up these exceptions with what's going on in the
robot logs, but the timestamps don't line up. The robot suite seems to
have finished at 22:47 and both of those entries are after that (23:05
and 23:28). That is more than 15 minutes after the robot CSIT is done,
and the jenkins job finishes log collection well before that (afaict).

The timestamps in the karaf log seem to sync up, so I'm not sure how
to explain it.


> In karaf logs there are exceptions:
> 
> 2018-05-01T21:36:40,323 | INFO  | pipe-log:log "ROBOT MESSAGE: Starting test
> L2 & L3 & Ha L2 & Ha L3 & Ha L3 Block Port & Neutron Security Group.Ha
> L2.Connectivity Tests From Vm Instance2 In net_1" | core                    
> | 123 - org.apache.karaf.log.core - 4.1.3 | ROBOT MESSAGE: Starting test L2
> & L3 & Ha L2 & Ha L3 & Ha L3 Block Port & Neutron Security Group.Ha L2.Bring
> Up ODL3
> 2018-05-01T21:36:43,168 | WARN  |
> opendaylight-cluster-data-akka.actor.default-dispatcher-48 | NettyTransport 
> | 46 - com.typesafe.akka.slf4j - 2.5.4 | Remote connection to [null] failed
> with java.net.ConnectException: Connection refused: /172.17.1.14:2550
> 
> This implies unable to talk to peer clusters. A couple of such entries are
> expected in logs during bring up as node tries to find peer nodes, but not
> once things are up.


Right, looking at the robot logs, I would think this ConnectException
is OK at that timestamp (21:36:40), as that's when ODL3 is still coming
up. It's up and data sync is set to True at 21:37:22, so anything
suspicious after that is what we should look for. But the very next
test case drops ODL1 and ODL2, so again we'd expect to see more of
these kinds of issues.

> Then there are exceptions for ModifiedNodeDoesntExist on
> opendaylight-inventory which suggest either the port or device went missing
> [disconnect or again cluster issues].
> 
> Before we go looking at flows, need to check if ODL has all the data or not.
> Someone [Michael???] can take a look at logs and confirm if cluster is
> actually healthy during run or not. Need ODL Datastore dumps for fail
> scenarios to check if data is present in datastore or not. If it is, then
> cluster logs are just a red herring. If not, then it is issue with core
> mdsal.

I can help dig some of this up, as we collect the info in the robot
test teardowns.

Comment 16 jamo luhrsen 2018-05-21 21:36:56 UTC
I dug through as many CSIT reports as were available to find this
high-level symptom of "instances not getting their dhcp lease" and
I am fairly confident they all come from two other bugs we can use
to track this:

missing table 48 on a compute:
  https://bugzilla.redhat.com/show_bug.cgi?id=1568989

agents think ODL is down, so instances go to ERROR state:
  https://bugzilla.redhat.com/show_bug.cgi?id=1575150


I'm going to DUP this bug to 1575150 because that was the more
frequent reason for the problem.

Fixing the above two bugs will go a long way toward making our 3-node
CSIT jobs more stable and removing a lot of failure noise.

*** This bug has been marked as a duplicate of bug 1575150 ***