Bug 1578312
Summary: | OVN metadata server is not reachable after resetting controllers with OVN servers | |
---|---|---|---
Product: | Red Hat OpenStack | Reporter: | Eran Kuris <ekuris>
Component: | python-networking-ovn | Assignee: | Daniel Alvarez Sanchez <dalvarez>
Status: | CLOSED ERRATA | QA Contact: | Eran Kuris <ekuris>
Severity: | urgent | Docs Contact: |
Priority: | urgent | |
Version: | 13.0 (Queens) | CC: | apevec, dalvarez, lhh, majopela, mkrcmari, nusiddiq, nyechiel
Target Milestone: | z1 | Keywords: | Triaged, ZStream
Target Release: | 13.0 (Queens) | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | python-networking-ovn-4.0.1-0.20180420150812.c7c16d4.el7ost | Doc Type: | Release Note
Doc Text: | When the OVSDB server fails over to a different controller node, neutron-server and metadata-agent do not detect this condition, so they never reconnect. As a result, booting VMs may fail because metadata-agent does not provision new metadata namespaces, and the clustering does not behave as expected. A possible workaround is to restart the ovn_metadata_agent container on all compute nodes after a new controller has been promoted as master for the OVN databases, and to increase ovsdb_probe_interval in plugin.ini to 600000 milliseconds (see the sketch after this table). | |
Story Points: | --- | |
Clone Of: | | Environment: |
Last Closed: | 2018-07-19 13:53:05 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
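A hedged sketch of the Doc Text workaround follows. It assumes an OSP 13 containerized deployment with the docker runtime; the plugin.ini path under /var/lib/config-data, the neutron_api container name, the [ovn] section name, and the use of crudini are assumptions and may need adjusting for your environment.

```bash
# Workaround sketch (assumptions: OSP 13 containerized nodes, docker runtime,
# crudini installed; the plugin.ini path and the neutron_api container name
# below may differ in your deployment).

# On the controller nodes: raise the OVSDB probe interval used by the
# ML2/OVN plugin (600000 ms, the value suggested in the Doc Text), then
# restart neutron-server so the new value is picked up.
sudo crudini --set \
    /var/lib/config-data/puppet-generated/neutron/etc/neutron/plugin.ini \
    ovn ovsdb_probe_interval 600000
sudo docker restart neutron_api

# On every compute node, after a new controller has been promoted as master
# for the OVN databases: restart the metadata agent container so it
# reconnects to the new OVSDB server and provisions metadata namespaces.
sudo docker restart ovn_metadata_agent
```

Restarting neutron_api is only needed if the probe interval is changed; restarting the containers on the compute nodes is what forces the metadata agent to reconnect to the newly promoted OVSDB server.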
Description (Eran Kuris, 2018-05-15 09:11:10 UTC)
The issue is that when the OVN southbound DB server goes down or gets restarted, the OVN metadata agents do not detect it, so they never reconnect. The reason is that the following option is not set in networking-ovn-metadata-agent.ini under the [ovn] section: ovsdb_connection_timeout=180. The fix is required in puppet-neutron here: https://github.com/openstack/puppet-neutron/blob/master/manifests/agents/ovn_metadata.pp#L146

I have checked, and the value we are getting for ovsdb_connection_timeout is 180; I verified this by adding traces to the OVN metadata agent code and restarting the container. This is because the code carries a default value: https://github.com/openstack/networking-ovn/blob/stable/queens/networking_ovn/common/config.py#L73 (the options get registered at L152 of the same file). @Numan, I have verified this in both devstack and TripleO setups, so I don't think this is the root cause. I reported the bug here: https://bugs.launchpad.net/networking-ovn/+bug/1772656

The issue is not specific to metadata-agent; it also affects neutron-server. neutron-server does not react to the failover; instead, when a new API request reaches a worker, the worker times out and reconnects after ovsdb_connection_timeout seconds.

@Daniel - do you want to mention the workaround in the doc text, i.e. that restarting the containers on each compute node works around the issue?

Done, thanks! Wouldn't another, more permanent workaround be increasing ovsdb_probe_interval in the plugin.ini config file to 60000?

Sorry, this is on dev (upstream patch, no downstream patch yet).

Fix verified with python-networking-ovn-4.0.1-0.20180420150812.c7c16d4.el7ost.noarch (2018-07-06.1):
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-13_director-rhel-virthost-3cont_2comp-ipv4-geneve-sts/28/testReport/.home.stack.openstack-sts.tests.smoke/03_HARD_RESET_CONTROLLER_MAIN_VIP/

Also verified manually (a post-failover check sketch appears at the end of this report):

    [root@vm-net-64-1 ~]# curl http://169.254.169.254/latest/meta-data/
    ami-id
    ami-launch-index
    ami-manifest-path
    block-device-mapping/
    hostname
    instance-action
    instance-id
    instance-type
    local-hostname
    local-ipv4

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2215
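For reference, a minimal post-failover check based on the discussion above. This is a sketch only: the docker runtime, the config path inside the ovn_metadata_agent container, and the default OVN southbound port 6642 are assumptions for an OSP 13 containerized compute node, not values confirmed in this report.

```bash
# Post-failover check sketch (assumptions: OSP 13 containerized compute node,
# docker runtime, config file at the path below, OVN southbound DB on the
# default TCP port 6642).

# 1) Confirm the effective [ovn] settings the metadata agent is running with.
sudo docker exec ovn_metadata_agent \
    grep -A 10 '^\[ovn\]' /etc/neutron/networking_ovn_metadata_agent.ini

# 2) Check that the agent holds an established connection to the currently
#    promoted OVN southbound database.
sudo ss -tnp | grep 6642

# 3) From a guest VM, confirm the metadata service responds again.
curl http://169.254.169.254/latest/meta-data/
```

If the agent is still attached to the demoted controller, restarting the ovn_metadata_agent container (see the workaround sketch after the summary table) forces it to reconnect.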