Bug 1729007 - Metadata proxy not ready when vm spawns when using DVR
Summary: Metadata proxy not ready when vm spawns when using DVR
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 15.0 (Stein)
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Slawek Kaplonski
QA Contact: Eran Kuris
Depends On:
Reported: 2019-07-11 07:57 UTC by Slawek Kaplonski
Modified: 2020-02-26 07:57 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2020-02-26 07:57:48 UTC
Target Upstream Version:


System ID Private Priority Status Summary Last Updated
Launchpad 1830763 0 None None None 2019-07-26 13:39:07 UTC

Comment 1 Luigi Toscano 2019-07-11 08:44:58 UTC
I've observed this locally as well. It looks like one of the involved containers does not start properly on the compute node (neutron-haproxy-ovnmeta-<UUID>).
Not sure how much this is related to https://bugzilla.redhat.com/show_bug.cgi?id=1720947 - even after restarting tripleo_ovn_controller.service and tripleo_ovn_metadata_agent.service, that specific container is still down. But it may be a slightly different issue.

Comment 2 Slawek Kaplonski 2019-07-11 09:04:23 UTC
Hi Luigi,

It is definitely a different issue, as this BZ is related to an ML2/OVS setup. It also happens in U/S CI, where we use Devstack and there are no containers at all.

Comment 3 Slawek Kaplonski 2019-07-11 09:28:29 UTC
Once again, a description of the bug:

In our CI it sometimes happens that, when DVR is used and a VM is spawned on a host, the VM is unpaused and starts booting before the L3 agent has prepared the router namespace and the metadata proxy for this router. That breaks connectivity from the VM to the metadata service, so e.g. the public key is not configured on the instance, SSH to it is not possible, and the test fails.

Sending select for
Lease of obtained, lease time 86400
route: SIOCADDRT: File exists
WARN: failed: route add -net "" gw ""
failed 1/20: up 3.43. request failed
failed 2/20: up 15.44. request failed
failed 3/20: up 27.46. request failed
failed 4/20: up 39.47. request failed
failed 5/20: up 51.48. request failed
failed 6/20: up 63.50. request failed
failed 7/20: up 75.52. request failed
failed 8/20: up 87.53. request failed
failed 9/20: up 99.54. request failed
failed 10/20: up 111.55. request failed
failed 11/20: up 123.57. request failed
failed 12/20: up 135.58. request failed
failed 13/20: up 147.59. request failed
failed 14/20: up 159.61. request failed
failed 15/20: up 171.62. request failed
failed 16/20: up 183.63. request failed
failed 17/20: up 195.64. request failed[  205.660296] random: nonblocking pool is initialized

failed 18/20: up 207.67. request failed
failed 19/20: up 219.68. request failed
failed 20/20: up 231.69. request failed
failed to read iid from metadata. tried 20
failed to get instance-id of datasource
Top of dropbear init script

It also happens quite often in U/S CI.

Comment 6 Nate Johnston 2019-07-18 15:07:56 UTC
Going to track mlavalle's efforts upstream to debug this issue.

Comment 7 Slawek Kaplonski 2019-07-26 13:37:43 UTC
I think I found the reason.
It is a race condition that occurs when 2 routers are created within a short time and configured on the same SNAT node. When both routers are configuring their external gateway, one of them may add the external net to the subscribers list in https://github.com/openstack/neutron/blob/master/neutron/agent/l3/dvr_fip_ns.py#L129, so the second router is told that it is not "first" and goes on to update the gateway port instead of creating it.
But if the gateway has not actually been created yet, this causes an exception in: https://github.com/openstack/neutron/blob/master/neutron/agent/l3/dvr_fip_ns.py#L332
If this happens, one of the routers will not have the iptables rules properly configured to allow requests to the metadata service, so metadata will not work for this instance.
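
To make the ordering problem above easier to follow, here is a minimal sketch that replays the interleaving step by step. It is not the neutron code: FipNamespaceSketch, GatewayNotReady and replay_race are hypothetical names invented for this illustration, and the only behaviour modelled is the "first subscriber creates, everyone else updates" decision described in this comment.

```python
class GatewayNotReady(Exception):
    """Raised when an update is attempted before the gateway port exists."""


class FipNamespaceSketch:
    """Hypothetical stand-in for the shared per-node namespace state."""

    def __init__(self):
        self._subscribers = set()
        self.gateway_created = False

    def subscribe(self, external_net_id):
        """Record the external net; return True only for the first caller."""
        is_first = not self._subscribers
        self._subscribers.add(external_net_id)
        return is_first

    def create_gateway_port(self):
        self.gateway_created = True

    def update_gateway_port(self):
        # Updating only makes sense once the port has been created.
        if not self.gateway_created:
            raise GatewayNotReady("gateway port has not been created yet")


def replay_race():
    """Replay the problematic ordering deterministically (no threads)."""
    ns = FipNamespaceSketch()
    events = []

    # Router A subscribes first and is told it must create the gateway ...
    a_is_first = ns.subscribe("ext-net")
    # ... but before it does so, router B subscribes and is told it is
    # not "first", so it takes the update path.
    b_is_first = ns.subscribe("ext-net")

    try:
        ns.update_gateway_port()
        events.append("router-b: updated")
    except GatewayNotReady:
        events.append("router-b: update failed")

    # Router A creates the gateway only now, too late for router B.
    if a_is_first:
        ns.create_gateway_port()
        events.append("router-a: created")

    return a_is_first, b_is_first, events
```

In this sketch the "is first" answer is handed out before the gateway actually exists, which is the window the second router falls into. Any fix would need to close that window, for example by making the subscribe-and-create step atomic; that is an assumption of this illustration, not a statement about how it was resolved upstream.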
