I've observed this locally as well. It looks like one of the involved containers (neutron-haproxy-ovnmeta-<UUID>) does not start properly on the compute node.
I'm not sure how much this is related to https://bugzilla.redhat.com/show_bug.cgi?id=1720947 - even after restarting tripleo_ovn_controller.service and tripleo_ovn_metadata_agent.service, that specific container is still down. It may be a slightly different issue.
It is definitely a different issue, as that BZ is related to an ML2/OVS setup. This one also happens in upstream CI, where we use Devstack and there are no containers at all.
Once again, a description of the bug:
It sometimes happens in our CI that, when DVR is used and a VM is spawned on a host, the VM is unpaused and boots before the L3 agent has prepared the router namespace and metadata proxy for that router. That causes connectivity problems from the VM to the metadata service, so e.g. the public key is not configured on the instance, SSH to it is not possible, and the test fails. The guest console log looks like this:
Sending select for 10.100.0.13...
Lease of 10.100.0.13 obtained, lease time 86400
route: SIOCADDRT: File exists
WARN: failed: route add -net "0.0.0.0/0" gw "10.100.0.1"
failed 1/20: up 3.43. request failed
failed 2/20: up 15.44. request failed
failed 3/20: up 27.46. request failed
failed 4/20: up 39.47. request failed
failed 5/20: up 51.48. request failed
failed 6/20: up 63.50. request failed
failed 7/20: up 75.52. request failed
failed 8/20: up 87.53. request failed
failed 9/20: up 99.54. request failed
failed 10/20: up 111.55. request failed
failed 11/20: up 123.57. request failed
failed 12/20: up 135.58. request failed
failed 13/20: up 147.59. request failed
failed 14/20: up 159.61. request failed
failed 15/20: up 171.62. request failed
failed 16/20: up 183.63. request failed
failed 17/20: up 195.64. request failed
[ 205.660296] random: nonblocking pool is initialized
failed 18/20: up 207.67. request failed
failed 19/20: up 219.68. request failed
failed 20/20: up 231.69. request failed
failed to read iid from metadata. tried 20
failed to get instance-id of datasource
Top of dropbear init script
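The guest-side behavior in the log above is essentially a bounded retry loop against the metadata address. A minimal sketch, not cirros's actual init code (the fetch_instance_id helper and its injected fetch callable are hypothetical, used so the loop can be shown without real networking):

```python
import time

# The well-known metadata address the guest polls at boot.
METADATA_URL = 'http://169.254.169.254/latest/meta-data/instance-id'

def fetch_instance_id(fetch, retries=20, delay=12.0):
    """Retry fetching the instance-id, mirroring the 'failed N/20' log lines.

    `fetch` is injected so the sketch stays testable; on a real instance
    it would be an HTTP GET against METADATA_URL.
    """
    for attempt in range(1, retries + 1):
        try:
            return fetch(METADATA_URL)
        except OSError:
            print('failed %d/%d: request failed' % (attempt, retries))
            time.sleep(delay)
    # Corresponds to "failed to read iid from metadata. tried 20"
    return None
```

If the metadata proxy for the router is not up yet, every attempt fails and the loop gives up after 20 tries, which is exactly the pattern in the console log.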
It also happens quite often in upstream CI.
Going to track mlavalle's efforts upstream to debug this issue.
I think I found the reason.
It is a race condition when two routers are created within a short time and configured on the same SNAT node. When both routers configure their external gateway, it may happen that one router adds the external network to the subscribers list in https://github.com/openstack/neutron/blob/master/neutron/agent/l3/dvr_fip_ns.py#L129, so the second router is told that it is not "first" and goes on to update the gateway port instead of creating it.
But if the gateway has in fact not been created yet, that causes an exception in: https://github.com/openstack/neutron/blob/master/neutron/agent/l3/dvr_fip_ns.py#L332
And when that happens, one of the routers will not have the iptables rules that allow requests to 169.254.169.254 properly configured, so metadata will not work for instances on that router.
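The bad interleaving can be sketched as follows. This is a deliberately simplified illustration, not the actual neutron code: FipNamespaceSketch and configure_gateway are hypothetical stand-ins for the subscribe/create-or-update logic in dvr_fip_ns.py:

```python
class FipNamespaceSketch:
    """Hypothetical stand-in for the FIP namespace subscriber bookkeeping."""

    def __init__(self):
        self._subscribers = set()
        self.gateway_created = False

    def subscribe(self, router_id):
        # Returns True only for the very first subscriber; every later
        # router is told it is not "first".
        is_first = not self._subscribers
        self._subscribers.add(router_id)
        return is_first

def configure_gateway(ns, router_id, log):
    if ns.subscribe(router_id):
        # "First" router: create the gateway port.
        ns.gateway_created = True
        log.append((router_id, 'create'))
    elif not ns.gateway_created:
        # The race: we were told we are not first, but the first router
        # has not actually finished creating the gateway yet, so the
        # update path hits a gateway that does not exist.
        log.append((router_id, 'update-before-create'))
    else:
        log.append((router_id, 'update'))
```

With the interleaving from the bug, router A subscribes (and is "first") but has not created the gateway yet when router B runs; B takes the update path against a nonexistent gateway, which is where the real code raises and leaves the 169.254.169.254 iptables rules unconfigured.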