Bug 1729007

Summary: Metadata proxy not ready when vm spawns when using DVR
Product: Red Hat OpenStack
Reporter: Slawek Kaplonski <skaplons>
Component: openstack-neutron
Assignee: Slawek Kaplonski <skaplons>
Status: CLOSED CURRENTRELEASE
QA Contact: Eran Kuris <ekuris>
Severity: medium
Priority: medium
Version: 15.0 (Stein)
CC: amuller, bcafarel, chrisw, njohnston, ralonsoh, scohen
Target Milestone: ---
Target Release: ---
Keywords: Triaged, ZStream
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2020-02-26 07:57:48 UTC

Comment 1 Luigi Toscano 2019-07-11 08:44:58 UTC
I've observed this locally as well. It looks like one of the involved containers (neutron-haproxy-ovnmeta-<UUID>) does not start properly on the compute node.
Not sure how much this is related to https://bugzilla.redhat.com/show_bug.cgi?id=1720947 - even after restarting tripleo_ovn_controller.service and tripleo_ovn_metadata_agent.service, that specific container is still down. But it may be a slightly different issue.

Comment 2 Slawek Kaplonski 2019-07-11 09:04:23 UTC
Hi Luigi,

It is definitely a different issue, as this BZ is related to an ML2/OVS setup. It also happens in upstream CI, where we are using Devstack and there are no containers at all.

Comment 3 Slawek Kaplonski 2019-07-11 09:28:29 UTC
Once again, a description of the bug:

It sometimes happens in our CI that, when DVR is used and a VM is spawned on a host, the VM is unpaused and boots before the L3 agent has prepared the router namespace and the metadata proxy for that router. That causes a connectivity problem between the VM and the metadata service, so e.g. the public key is not configured on the instance, SSH to it is not possible and the test fails.

Sending select for 10.100.0.13...
Lease of 10.100.0.13 obtained, lease time 86400
route: SIOCADDRT: File exists
WARN: failed: route add -net "0.0.0.0/0" gw "10.100.0.1"
checking http://169.254.169.254/2009-04-04/instance-id
failed 1/20: up 3.43. request failed
failed 2/20: up 15.44. request failed
failed 3/20: up 27.46. request failed
failed 4/20: up 39.47. request failed
failed 5/20: up 51.48. request failed
failed 6/20: up 63.50. request failed
failed 7/20: up 75.52. request failed
failed 8/20: up 87.53. request failed
failed 9/20: up 99.54. request failed
failed 10/20: up 111.55. request failed
failed 11/20: up 123.57. request failed
failed 12/20: up 135.58. request failed
failed 13/20: up 147.59. request failed
failed 14/20: up 159.61. request failed
failed 15/20: up 171.62. request failed
failed 16/20: up 183.63. request failed
failed 17/20: up 195.64. request failed
[  205.660296] random: nonblocking pool is initialized

failed 18/20: up 207.67. request failed
failed 19/20: up 219.68. request failed
failed 20/20: up 231.69. request failed
failed to read iid from metadata. tried 20
failed to get instance-id of datasource
Top of dropbear init script
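
For reference, the failing requests above come from the instance's init polling the EC2-style metadata endpoint in a retry loop. Below is a minimal Python sketch of what that loop effectively does; the real cirros init is a shell script, and the retry count, delay and helper name here are illustrative assumptions:

# Minimal sketch of the instance-side metadata retry loop seen in the log above.
import time
import urllib.request

METADATA_URL = "http://169.254.169.254/2009-04-04/instance-id"

def fetch_instance_id(attempts=20, delay=12):
    for attempt in range(1, attempts + 1):
        try:
            with urllib.request.urlopen(METADATA_URL, timeout=10) as resp:
                return resp.read().decode()
        except OSError:
            # The metadata proxy for the router namespace is not reachable yet.
            print("failed %d/%d: request failed" % (attempt, attempts))
            time.sleep(delay)
    raise RuntimeError("failed to read iid from metadata. tried %d" % attempts)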

It also happens quite often in upstream CI.

Comment 6 Nate Johnston 2019-07-18 15:07:56 UTC
Going to track mlavalle's efforts upstream to debug this issue.

Comment 7 Slawek Kaplonski 2019-07-26 13:37:43 UTC
I think I found the reason.
It is a race condition when two routers are created within a short time and configured on the same SNAT node. When both routers are configuring their external gateway, it may happen that one router adds the external network to the subscribers list in https://github.com/openstack/neutron/blob/master/neutron/agent/l3/dvr_fip_ns.py#L129, so the second router gets the information that it is not the "first" one and goes on to update the gateway port instead of creating it.
But if the gateway was in fact not created yet, this causes an exception in: https://github.com/openstack/neutron/blob/master/neutron/agent/l3/dvr_fip_ns.py#L332
And when this happens, one of the routers will not have properly configured iptables rules to allow requests to 169.254.169.254, so metadata will not work for instances on that router.
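
For illustration only, here is a simplified sketch of that race; the class, method and attribute names are stand-ins and do not mirror the real dvr_fip_ns.py API:

# Simplified, illustrative model of the "first subscriber" race described above.
class FipNamespace:
    def __init__(self):
        self._subscribers = set()
        self.agent_gateway_port = None   # created only by the "first" subscriber

    def subscribe(self, router_id):
        # Returns True only for the first router that registers itself.
        is_first = not self._subscribers
        self._subscribers.add(router_id)
        return is_first

    def create_or_update_gateway(self, router_id):
        if self.subscribe(router_id):
            # First subscriber: create the agent gateway port.
            self.agent_gateway_port = {"owner": router_id}
        else:
            # Not first: assume the port already exists and only update it.
            # If the "first" router has subscribed but not yet finished
            # creating the port, this raises, and this router's iptables
            # rules for 169.254.169.254 are never set up.
            self.agent_gateway_port["owner"] = router_id

In the failing scenario, router A subscribes but has not yet finished creating the gateway port when router B runs its gateway configuration; B takes the "update" branch, hits the exception, and never completes the iptables setup for the metadata address.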