Description of problem:
Deployed OSP13 + OVN, then ran a Rally scenario that creates an instance with a FIP, pings the instance and, if successful, tries to SSH into it. The test passed on a fresh deployment. However, once I ungracefully reset the overcloud node hosting the ovn-server master, things started to go sideways, even though pacemaker reported that a different slave node was successfully promoted to master. If I rerun the Rally scenario after such an ovn-server master node reset, the test fails on FIP pinging, or sometimes I even see neutron exceptions coming from the neutron OpenStack client used by Rally.

Digging further: it is not that the instance is unpingable. The Rally scenario failed on an SSH timeout while the instance was pingable, and it is possible to SSH into the instance later as well. The instance seems to hang during boot; from the console log:

  Starting network...
  udhcpc (v1.20.1) started
  Sending discover...
  Sending select for 10.2.0.11...
  Lease of 10.2.0.11 obtained, lease time 43200
  route: SIOCADDRT: File exists
  WARN: failed: route add -net "0.0.0.0/0" gw "10.2.0.1"
  cirros-ds 'net' up at 3.90
  checking http://169.254.169.254/2009-04-04/instance-id
  failed 1/20: up 3.92. request failed
  failed 2/20: up 15.99. request failed
  failed 3/20: up 28.02. request failed
  failed 4/20: up 40.05. request failed
  failed 5/20: up 52.07. request failed
  failed 6/20: up 64.10. request failed
  failed 7/20: up 76.12. request failed
  failed 8/20: up 88.15. request failed
  failed 9/20: up 100.17. request failed
  failed 10/20: up 112.20. request failed
  failed 11/20: up 124.22. request failed
  failed 12/20: up 136.25. request failed
  failed 13/20: up 148.28. request failed
  failed 14/20: up 160.30. request failed
  failed 15/20: up 172.33. request failed
  failed 16/20: up 184.36. request failed
  failed 17/20: up 196.38. request failed
  failed 18/20: up 208.41. request failed
  failed 19/20: up 220.43. request failed
  failed 20/20: up 232.46. request failed
  failed to read iid from metadata. tried 20
  no results found for mode=net. up 244.49. searched: nocloud configdrive ec2
  failed to get instance-id of datasource
  Starting dropbear sshd: generating rsa key... generating dsa key... OK

It seems that the metadata server is not reachable. This does not happen on non-OVN setups, and it did not happen on the fresh OVN OSP setup before the controllers hosting ovn-servers were reset.

Dev debugging:
The OVN metadata docker service (openstack-neutron-metadata-agent-ovn) is not creating the namespace and not starting the haproxy for metadata. I created another VM and observed the same behaviour. I restarted the OVN metadata agent on compute-0 and it started working fine: any VM scheduled on compute-0 is now able to access the metadata service. I haven't restarted the metadata service on compute-1 yet; in case you need the setup and want to carry on with your work, please restart the OVN metadata docker service on compute-1. I can see from the logs that the metadata was working earlier and stopped working at some point.
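For reference, a minimal diagnostic sketch of the checks described above. The ovnmeta- namespace prefix and the haproxy process match are assumptions based on how the networking-ovn metadata agent typically provisions metadata; adjust to your deployment:

  # On the affected compute node: the agent normally creates one
  # ovnmeta-<datapath-uuid> namespace per network it serves metadata for.
  # No matching namespace means the agent did not provision metadata.
  ip netns list | grep ovnmeta

  # Check whether a haproxy instance is actually serving metadata
  # for one of those namespaces.
  ps aux | grep haproxy | grep ovnmeta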
Version-Release number of selected component (if applicable):
OSP 13 -p 2018-05-10.3

[root@controller-0 ~]# rpm -qa | grep -i ovn
openvswitch-ovn-central-2.9.0-20.el7fdp.x86_64
openvswitch-ovn-common-2.9.0-20.el7fdp.x86_64
openstack-nova-novncproxy-17.0.3-0.20180420001138.el7ost.noarch
openvswitch-ovn-host-2.9.0-20.el7fdp.x86_64
python-networking-ovn-4.0.1-0.20180420150809.c7c16d4.el7ost.noarch
puppet-ovn-12.4.0-0.20180329043503.36ff219.el7ost.noarch
python-networking-ovn-metadata-agent-4.0.1-0.20180420150809.c7c16d4.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Run sts+run job (link will be added)
2.
3.

Actual results:

Expected results:

Additional info:
The issue is that when the OVN southbound DB server goes down or gets restarted, the OVN metadata agents don't detect it, so they never reconnect. The reason is that we don't set the option below in networking-ovn-metadata-agent.ini under the [ovn] section:

  ovsdb_connection_timeout = 180

The fix is required in puppet-neutron, here: https://github.com/openstack/puppet-neutron/blob/master/manifests/agents/ovn_metadata.pp#L146
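For illustration, the intended end state in the agent configuration would look like the snippet below. The container config path is an assumption based on a typical OSP13 containerized deployment:

  # /var/lib/config-data/puppet-generated/ovn_metadata_agent/etc/neutron/networking_ovn_metadata_agent.ini
  # (path assumed; adjust per deployment)
  [ovn]
  # Without a connection timeout, a dead southbound connection can go
  # unnoticed and the agent never reconnects after a failover.
  ovsdb_connection_timeout = 180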
I have checked, and the value we're getting for ovsdb_connection_timeout is 180. I checked it by adding traces to the OVN metadata agent code and restarting the container. This is because we have a default value in the code: https://github.com/openstack/networking-ovn/blob/stable/queens/networking_ovn/common/config.py#L73 (the options get registered at L152 of that file). @Numan: I've verified this in both devstack and TripleO setups, so I don't think this is the root cause.
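A quicker way to confirm the default without adding traces is to read it from the registered options. A sketch, assuming the container is named ovn_metadata_agent and that importing networking_ovn.common.config registers the [ovn] group at module load (as the L152 reference above suggests):

  docker exec ovn_metadata_agent python -c "
  from oslo_config import cfg
  import networking_ovn.common.config  # registering the [ovn] opts is a side effect of the import
  # Prints the coded default (180) since no config files are parsed here.
  print(cfg.CONF.ovn.ovsdb_connection_timeout)"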
I reported the bug here: https://bugs.launchpad.net/networking-ovn/+bug/1772656

The issue is not specific to the metadata agent; it also affects neutron-server. The thing is that neutron-server does not react to the failover itself, but when a new API request comes in to a worker, the worker will time out and reconnect after ovsdb_connection_timeout seconds.
@Daniel - Do you want to mention the workaround in the doc text? I.e., restarting the containers on each compute node fixes the issue.
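For the doc text, the workaround amounts to something like the following, run on every compute node; the container name is an assumption from a typical OSP13 deployment and may differ per release:

  # Restart the OVN metadata agent container so it reconnects to the
  # (new) OVN southbound master and reprovisions metadata namespaces.
  docker restart ovn_metadata_agent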
Done, thanks!
Wouldn't another more permanent workaround be increasing the ovsdb_probe_interval in the plugin.ini config file to 60000?
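That would translate to something like the snippet below in plugin.ini. Whether this option is honoured depends on the networking-ovn version in use, so treat it as a sketch of the proposal rather than a confirmed knob:

  [ovn]
  # Probe interval in milliseconds: probe the OVSDB connection every 60s
  # so a dead southbound connection is detected and re-established.
  ovsdb_probe_interval = 60000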
Sorry, this is on dev (upstream patch, no downstream patch yet)
Fix verified:
python-networking-ovn-4.0.1-0.20180420150812.c7c16d4.el7ost.noarch
2018-07-06.1
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-13_director-rhel-virthost-3cont_2comp-ipv4-geneve-sts/28/testReport/.home.stack.openstack-sts.tests.smoke/03_HARD_RESET_CONTROLLER_MAIN_VIP/

Verified manually as well:

[root@vm-net-64-1 ~]# curl http://169.254.169.254/latest/meta-data/
ami-id
ami-launch-index
ami-manifest-path
block-device-mapping/
hostname
instance-action
instance-id
instance-type
local-hostname
local-ipv4
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2215