1474025 – No connectivity to an instance's floating IP after restarting the compute node

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	opendaylight
Sub Component:
Version:	12.0 (Pike)
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	12.0 (Pike)
Assignee:	Sridhar Gaddam
QA Contact:	Itzik Brown
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2017-07-23 08:29 UTC by Itzik Brown
Modified:	2018-10-24 12:37 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:	N/A
Last Closed:	2017-12-14 09:45:33 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
OpenDaylight Bug	8877	0	None	None	None	2017-07-23 08:29:48 UTC

Description Itzik Brown 2017-07-23 08:29:49 UTC

Description of problem:
Deployment: OpenStack Pike with OpenDaylight Carbon SR1.
After restarting a compute node and launching an instance on it, the instance doesn't get an IP address from DHCP.
After restarting Open vSwitch and running the DHCP client again, the instance gets an IP address.

It seems that the following flow is missing:
table=0, n_packets=0, n_bytes=0, priority=4,in_port=2,vlan_tci=0x0000/0x1fff actions=write_metadata:0x180000000000/0xffffff0000000001,goto_table:17
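
A check for this flow can be scripted. The following is a minimal sketch, not part of the original report: it assumes the flow dump has already been captured as text (for example from `ovs-ofctl -O OpenFlow13 dump-flows br-int`, where the bridge name is an assumption) and that `in_port` is printed as a number, as in the flow above. The function name `has_ingress_flow` is hypothetical.

```python
import re

# The flow quoted in the report, used here as sample input.
SAMPLE_FLOW = (
    "table=0, n_packets=0, n_bytes=0, priority=4,in_port=2,"
    "vlan_tci=0x0000/0x1fff "
    "actions=write_metadata:0x180000000000/0xffffff0000000001,goto_table:17"
)


def has_ingress_flow(dump_flows_output, in_port, next_table=17):
    """Return True if a table 0 flow matches in_port and jumps to next_table.

    Hypothetical helper: it only inspects text, it does not talk to OVS.
    """
    for line in dump_flows_output.splitlines():
        if not re.search(r"\btable=0\b", line):
            continue  # only table 0 is of interest
        if not re.search(rf"\bin_port={in_port}\b", line):
            continue  # flow belongs to a different port
        if re.search(rf"\bgoto_table:{next_table}\b", line):
            return True
    return False


if __name__ == "__main__":
    print(has_ingress_flow(SAMPLE_FLOW, in_port=2))
```

Running the check before and after restarting Open vSwitch would show whether the flow for the instance's port only appears after the restart.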

Version-Release number of selected component (if applicable):
OpenStack Pike
opendaylight-6.1.0-1.el7.noarch
Open vSwitch 2.7

How reproducible:


Steps to Reproduce:
1. Restart the compute node
2. Launch an instance on that compute node
3. Verify that the instance is not getting an IP
4. Restart Open vSwitch on the compute node
5. Restart the DHCP client and verify that the instance gets an IP

Actual results:


Expected results:


Additional info:

Comment 1 Sridhar Gaddam 2017-08-01 15:37:06 UTC

On closer debugging, this appears to be a race condition in the ODL controller.

Steps to Reproduce:
1. Restart the compute node
2. Launch an instance on the compute node
3. Observe that the instance initially stays in the "spawning" state and then transitions to the "error" state.
4. Restart Open vSwitch on the compute node
5. Launch a new instance; it boots successfully.

When we reboot the compute node, ODL detects that the node is idle and triggers the disconnection chain.
However, if the compute node comes back up while this is still in progress, there is a race between the cleanup events and the node reconciliation events.

As a result, the compute node ends up deleted from the operational store [#] even though it is connected to the controller.
Since the node info is deleted from the datastore, port binding fails as a side effect, and we cannot spawn new VMs until we restart OVS on the compute node.
The gist at [@] contains a snippet of the karaf logs showing this sequence.

Some additional notes:
If the compute node comes up after some delay (i.e., after the cleanup has completed in ODL), this issue (i.e., step 3 above) is not seen.
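
The ordering problem described above can be illustrated with a toy model. This is only a sketch of the event interleaving, under the assumption that the cleanup step does not recheck whether the node has reconnected. It is not ODL code, and the names `OperationalStore` and `replay` are hypothetical.

```python
class OperationalStore:
    """Toy stand-in for the operational datastore (not the ODL API)."""

    def __init__(self):
        self.nodes = set()      # nodes present in the operational store
        self.connected = set()  # nodes with a live connection

    def connect(self, node):
        # Reconciliation: the node connects and is written to the store.
        self.connected.add(node)
        self.nodes.add(node)

    def disconnect(self, node):
        # The controller decides the node is idle and drops the connection.
        self.connected.discard(node)

    def cleanup(self, node):
        # Assumed behaviour: delete the node without rechecking its
        # current connection state.
        self.nodes.discard(node)


def replay(events, node="compute-0"):
    """Apply events in order; return (is_connected, is_in_store)."""
    store = OperationalStore()
    store.connect(node)  # node is healthy before the reboot
    for event in events:
        getattr(store, event)(node)
    return node in store.connected, node in store.nodes


if __name__ == "__main__":
    # Node returns before cleanup finishes: connected but missing.
    print(replay(["disconnect", "connect", "cleanup"]))
    # Node returns after cleanup finishes: connected and present.
    print(replay(["disconnect", "cleanup", "connect"]))
```

In the first ordering the node is connected but absent from the store, which matches the failing case. In the second ordering, which corresponds to the delayed boot, the node is present.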

[#] 2017-08-01 07:48:16,660 | INFO  | lt-dispatcher-49 | OvsdbConnectionManager           | 289 - org.opendaylight.ovsdb.southbound-impl - 1.4.1.Carbon-redhat-1 | Entity{type='ovsdb', id=/(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)network-topology/topology/topology[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)topology-id=ovsdb:1}]/node/node[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)node-id=ovsdb://uuid/e9806896-8dc2-4f17-83ea-c1c957608915}]} has no owner, cleaning up the operational data store
[@] https://gist.github.com/sridhargaddam/3761ef080e11f2dd2429c8d7016ae6d0
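
To find which OVSDB nodes were affected, the karaf log can be scanned for the cleanup message shown in [#]. The following is a rough sketch that assumes the message text and the `node-id=ovsdb://uuid/...` format stay as in that log line. The function name `cleaned_up_nodes` is hypothetical.

```python
import re

# Marker text copied from the log line in [#].
CLEANUP_MARKER = "has no owner, cleaning up the operational data store"
# Matches the node-id as it appears inside the Entity id.
NODE_ID_RE = re.compile(r"node-id=(ovsdb://uuid/[0-9a-fA-F-]{36})")

# Shortened version of the log line in [#], used as sample input.
SAMPLE_LINE = (
    "2017-08-01 07:48:16,660 | INFO  | lt-dispatcher-49 | "
    "OvsdbConnectionManager | Entity{type='ovsdb', "
    "id=...node-id=ovsdb://uuid/e9806896-8dc2-4f17-83ea-c1c957608915}]} "
    "has no owner, cleaning up the operational data store"
)


def cleaned_up_nodes(log_text):
    """Return node-ids from lines that report an ownerless-entity cleanup."""
    nodes = []
    for line in log_text.splitlines():
        if CLEANUP_MARKER not in line:
            continue
        match = NODE_ID_RE.search(line)
        if match:
            nodes.append(match.group(1))
    return nodes


if __name__ == "__main__":
    print(cleaned_up_nodes(SAMPLE_LINE))
```

A node-id that shows up here while the same node is still connected to the controller would point to the race described in this comment.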

Comment 4 Itzik Brown 2017-09-26 10:03:33 UTC

Checked with opendaylight-6.2.0-0.1.20170921snap729.el7.noarch.

After restarting the compute node and launching the instance again, the instance gets an IP.
The problem now is that there is no connectivity to the instance's FIP.

Comment 6 Itzik Brown 2017-12-14 09:45:33 UTC

Opening a new bug for the FIP issue.
