I've observed this locally as well. It looks like one of the involved containers (neutron-haproxy-ovnmeta-<UUID>) does not start properly on the compute node. I'm not sure how much this is related to https://bugzilla.redhat.com/show_bug.cgi?id=1720947 - even after restarting tripleo_ovn_controller.service and tripleo_ovn_metadata_agent.service, that specific container is still down. But it may be a slightly different issue.
Hi Luigi, it is definitely a different issue, as that BZ is related to an ML2/OVS setup. It also happens in upstream CI, where we use Devstack and there are no containers at all.
Once again, a description of the bug: it sometimes happens in our CI that when DVR is used and a VM is spawned on a host, the VM is unpaused and boots before the L3 agent has prepared the router namespace and the metadata proxy for that router. This breaks connectivity from the VM to the metadata service, so e.g. the public key is not configured on the instance, SSH to it is not possible, and the test fails. Console log from such an instance:

Sending select for 10.100.0.13...
Lease of 10.100.0.13 obtained, lease time 86400
route: SIOCADDRT: File exists
WARN: failed: route add -net "0.0.0.0/0" gw "10.100.0.1"
checking http://169.254.169.254/2009-04-04/instance-id
failed 1/20: up 3.43. request failed
failed 2/20: up 15.44. request failed
failed 3/20: up 27.46. request failed
failed 4/20: up 39.47. request failed
failed 5/20: up 51.48. request failed
failed 6/20: up 63.50. request failed
failed 7/20: up 75.52. request failed
failed 8/20: up 87.53. request failed
failed 9/20: up 99.54. request failed
failed 10/20: up 111.55. request failed
failed 11/20: up 123.57. request failed
failed 12/20: up 135.58. request failed
failed 13/20: up 147.59. request failed
failed 14/20: up 159.61. request failed
failed 15/20: up 171.62. request failed
failed 16/20: up 183.63. request failed
failed 17/20: up 195.64. request failed
[ 205.660296] random: nonblocking pool is initialized
failed 18/20: up 207.67. request failed
failed 19/20: up 219.68. request failed
failed 20/20: up 231.69. request failed
failed to read iid from metadata. tried 20
failed to get instance-id of datasource
Top of dropbear init script

This also happens quite often in upstream CI.
Going to track mlavalle's efforts upstream to debug this issue.
I think I found the reason. It is a race condition when two routers are created within a short time and configured on the same SNAT node. When both routers are configuring their external gateway, it may happen that one router adds the external network to the subscribers list in https://github.com/openstack/neutron/blob/master/neutron/agent/l3/dvr_fip_ns.py#L129, so the second router gets the information that it is not the "first" and goes on to update the gateway port instead of creating it. But if the gateway has in fact not been created yet, this causes an exception in https://github.com/openstack/neutron/blob/master/neutron/agent/l3/dvr_fip_ns.py#L332. When this happens, one of the routers ends up without the iptables rules that allow requests to 169.254.169.254, so metadata does not work for instances behind it.
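To make the race easier to see, here is a minimal, hypothetical Python sketch. The class and method names (`FipNamespace`, `subscribe`, `create_or_update_gateway_port`) are simplified stand-ins for the actual dvr_fip_ns.py code, and the sketch assumes the "am I the first subscriber?" check is not atomic with the gateway port creation that follows it:

```python
class FipNamespace:
    """Simplified stand-in for neutron.agent.l3.dvr_fip_ns.FipNamespace."""

    def __init__(self):
        self._subscribers = set()
        self.gateway_port = None  # set once the gateway port is fully created

    def subscribe(self, router_id):
        # Mirrors the "is first subscriber" check (dvr_fip_ns.py#L129):
        # returns True only for the first router to subscribe.
        is_first = not self._subscribers
        self._subscribers.add(router_id)
        return is_first

    def create_or_update_gateway_port(self, router_id):
        if self.subscribe(router_id):
            # First router: create the gateway port (a slow operation
            # in reality, which is what opens the race window).
            self.gateway_port = "gw-port"
        else:
            # Later routers: expect the port to already exist and
            # only update it.
            if self.gateway_port is None:
                # Race hit: the first router subscribed but has not
                # finished creating the port yet (dvr_fip_ns.py#L332).
                raise RuntimeError("gateway port not ready")
```

If router A subscribes and is then delayed before creating the port, router B sees `is_first == False`, takes the update path, finds no port, and raises; router B's iptables rules for 169.254.169.254 are then never set up.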