Bug 1578312
Summary: | OVN metadata server is not reachable after resetting controllers with OVN servers | |
---|---|---|---
Product: | Red Hat OpenStack | Reporter: | Eran Kuris <ekuris>
Component: | python-networking-ovn | Assignee: | Daniel Alvarez Sanchez <dalvarez>
Status: | CLOSED ERRATA | QA Contact: | Eran Kuris <ekuris>
Severity: | urgent | Docs Contact: |
Priority: | urgent | |
Version: | 13.0 (Queens) | CC: | apevec, dalvarez, lhh, majopela, mkrcmari, nusiddiq, nyechiel
Target Milestone: | z1 | Keywords: | Triaged, ZStream
Target Release: | 13.0 (Queens) | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | python-networking-ovn-4.0.1-0.20180420150812.c7c16d4.el7ost | Doc Type: | Release Note
Doc Text: | When the OVSDB server fails over to a different controller node, neutron-server and metadata-agent do not detect this condition, so they never reconnect. As a result, booting VMs may fail because metadata-agent does not provision new metadata namespaces, and the clustering does not behave as expected. A possible workaround is to restart the ovn_metadata_agent container on all compute nodes after a new controller has been promoted as master for the OVN databases, and to increase ovsdb_probe_interval in plugin.ini to 600000 milliseconds (see the sketch after this table). | |
Story Points: | --- | |
Clone Of: | | Environment: |
Last Closed: | 2018-07-19 13:53:05 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
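A hedged sketch of the Doc Text workaround follows. It assumes an OSP 13 containerized deployment with the docker runtime; the plugin.ini path under /var/lib/config-data, the neutron_api container name, the [ovn] section name, and the use of crudini are assumptions and may need adjusting for your environment.

```bash
# Workaround sketch (assumptions: OSP 13 containerized nodes, docker runtime,
# crudini installed; the plugin.ini path and the neutron_api container name
# below may differ in your deployment).

# On the controller nodes: raise the OVSDB probe interval used by the
# ML2/OVN plugin (600000 ms, the value suggested in the Doc Text), then
# restart neutron-server so the new value is picked up.
sudo crudini --set \
    /var/lib/config-data/puppet-generated/neutron/etc/neutron/plugin.ini \
    ovn ovsdb_probe_interval 600000
sudo docker restart neutron_api

# On every compute node, after a new controller has been promoted as master
# for the OVN databases: restart the metadata agent container so it
# reconnects to the new OVSDB server and provisions metadata namespaces.
sudo docker restart ovn_metadata_agent
```

Restarting neutron_api is only needed if the probe interval is changed; restarting the containers on the compute nodes is what forces the metadata agent to reconnect to the newly promoted OVSDB server.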
Description (Eran Kuris, 2018-05-15 09:11:10 UTC)
The issue is that when the OVN southbound DB server goes down or gets restarted, the OVN metadata agents do not detect it, so they never reconnect. The reason is that the following option is not set in networking-ovn-metadata-agent.ini under the [ovn] section: ovsdb_connection_timeout=180. The fix is required in puppet-neutron here: https://github.com/openstack/puppet-neutron/blob/master/manifests/agents/ovn_metadata.pp#L146

I have checked, and the value we are getting for ovsdb_connection_timeout is 180; I verified this by adding traces to the OVN metadata agent code and restarting the container. This is because the code carries a default value: https://github.com/openstack/networking-ovn/blob/stable/queens/networking_ovn/common/config.py#L73 (the options get registered at L152 of the same file). @Numan, I have verified this in both devstack and TripleO setups, so I don't think this is the root cause. I reported the bug here: https://bugs.launchpad.net/networking-ovn/+bug/1772656

The issue is not specific to metadata-agent; it also affects neutron-server. neutron-server does not react to the failover; instead, when a new API request reaches a worker, the worker times out and reconnects after ovsdb_connection_timeout seconds.

@Daniel - do you want to mention the workaround in the doc text, i.e. that restarting the containers on each compute node works around the issue?

Done, thanks! Wouldn't another, more permanent workaround be increasing ovsdb_probe_interval in the plugin.ini config file to 60000?

Sorry, this is on dev (upstream patch, no downstream patch yet).

Fix verified with python-networking-ovn-4.0.1-0.20180420150812.c7c16d4.el7ost.noarch (2018-07-06.1):
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-13_director-rhel-virthost-3cont_2comp-ipv4-geneve-sts/28/testReport/.home.stack.openstack-sts.tests.smoke/03_HARD_RESET_CONTROLLER_MAIN_VIP/

Also verified manually (a post-failover check sketch appears at the end of this report):

    [root@vm-net-64-1 ~]# curl http://169.254.169.254/latest/meta-data/
    ami-id
    ami-launch-index
    ami-manifest-path
    block-device-mapping/
    hostname
    instance-action
    instance-id
    instance-type
    local-hostname
    local-ipv4

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2215
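For reference, a minimal post-failover check based on the discussion above. This is a sketch only: the docker runtime, the config path inside the ovn_metadata_agent container, and the default OVN southbound port 6642 are assumptions for an OSP 13 containerized compute node, not values confirmed in this report.

```bash
# Post-failover check sketch (assumptions: OSP 13 containerized compute node,
# docker runtime, config file at the path below, OVN southbound DB on the
# default TCP port 6642).

# 1) Confirm the effective [ovn] settings the metadata agent is running with.
sudo docker exec ovn_metadata_agent \
    grep -A 10 '^\[ovn\]' /etc/neutron/networking_ovn_metadata_agent.ini

# 2) Check that the agent holds an established connection to the currently
#    promoted OVN southbound database.
sudo ss -tnp | grep 6642

# 3) From a guest VM, confirm the metadata service responds again.
curl http://169.254.169.254/latest/meta-data/
```

If the agent is still attached to the demoted controller, restarting the ovn_metadata_agent container (see the workaround sketch after the summary table) forces it to reconnect.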