Bug 2077944 - [Openstack 16.2.1] After heat creation, randomly instances fail to connect to metadata server.
Summary: [Openstack 16.2.1] After heat creation, randomly instances fail to connect to...
Keywords:
Status: RELEASE_PENDING
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 16.2 (Train)
Hardware: All
OS: All
Priority: high
Severity: high
Target Milestone: z5
Target Release: 16.2 (Train on RHEL 8.4)
Assignee: Miro Tomaska
QA Contact: Fiorella Yanac
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-04-22 16:19 UTC by Luca Davidde
Modified: 2023-05-10 19:36 UTC (History)
10 users

Fixed In Version: openstack-neutron-18.6.1-1.20230131164109.8487877.el8ost
Doc Type: Bug Fix
Doc Text:
Before this update, provisioning a network namespace with thousands of subnets took a very long time. This delay prevented the metadata haproxy service from being ready for the first VM started on the hypervisor. As a result, the VM was not properly initialized by the cloud-init process. With this update, improved metadata agent logic provisions network namespaces faster, which resolves the issue.
Clone Of:
Environment:
Last Closed:
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1981113 0 None None None 2022-07-09 03:21:26 UTC
OpenStack gerrit 870786 0 None MERGED Improve agent provision performance for large networks 2023-04-03 15:52:58 UTC
Red Hat Issue Tracker OSP-14838 0 None None None 2022-04-22 16:27:47 UTC
Red Hat Knowledge Base (Solution) 6963335 0 None None None 2022-06-15 14:55:24 UTC

Description Luca Davidde 2022-04-22 16:19:53 UTC
Description of problem:
Hello,
our customer is experiencing problems reaching the metadata server from 1-2 instances after deploying a Heat stack of 10-12.
Looking at the cloud-init logs inside the affected instances after the deploy, you can see:

2022-04-18 12:10:07,957 - url_helper.py[DEBUG]: Calling 'http://169.254.169.254/openstack' failed [3/-1s]: request error [('Connection aborted.', error(111, 'Connection refused'))] 

but once you log in to them via the console, you can successfully reach the http://169.254.169.254/openstack URL.
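To tell a transient "Connection refused" apart from a persistent one, it can help to reproduce from the console the retry loop that cloud-init's url_helper performs. The helper below is a hypothetical sketch (the function name, retry count, and timeout are my own, not from the customer environment):

```shell
# Retry the metadata endpoint a few times, the way cloud-init does,
# printing one line per attempt so you can see when it starts answering.
check_metadata() {
  local url=${1:-http://169.254.169.254/openstack}
  local tries=${2:-5}
  local i
  for i in $(seq 1 "$tries"); do
    if curl -sf --max-time 3 "$url" >/dev/null; then
      echo "attempt $i: reachable"
      return 0
    fi
    echo "attempt $i: failed"
    sleep 1
  done
  return 1
}
```

If this succeeds from the console while cloud-init failed during boot, the haproxy serving 169.254.169.254 most likely became ready only after cloud-init had given up.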

The workaround to fix an affected instance differs:

- if it is on an otherwise empty hypervisor, a reboot from the OS or a ctrl+alt+del from the console is needed
- if it is on a hypervisor with other VMs, a soft reboot also triggers the fix (as does a reboot from the OS or ctrl+alt+del from the console)

All the instances use the same image/flavor/network (provider)/security groups.
I also checked whether the output ports in the OpenFlow rules for metadata are updated after an instance reboot: they are not, so they appear to be correct already (otherwise you would not reach the metadata server after the reboot either).
Attached to the case there are sosreports from the compute node and the complete cloud-init log.



Version-Release number of selected component (if applicable):
ovn-2021-21.09.1-23.el8fdp.x86_64

How reproducible:
Every time.

Steps to Reproduce:
1. Launch a Heat stack with several VMs

Actual results:
1-2 instances fail to connect to the metadata server.

Expected results:
All instances connect to metadata server.

Additional info:

Comment 1 Miro Tomaska 2022-04-26 19:45:22 UTC
Hi Luca,

Thanks for the information and the logs. I reviewed the sos logs attached to the customer case. It seems the cloud-init outputs and sosreports were taken at different times, so I am not able to see the full picture.
For example, vm_boot_failed.log shows a cloud-init failure which occurred on April 20 at 09:26:39. However, the provided sosreport logs from the compute nodes were collected on April 19 and April 21. The critical logs I wanted to see, from April 20 around 9 o'clock, are not there. Would it be possible to collect logs again exactly when a cloud-init initialization failure occurs?

I am mostly interested in these logs:
From the compute node running a VM instance that fails cloud-init initialization:
- /var/log/containers/neutron, nova  # I am assuming the services run in containers

From the "HA master" controller
- /var/log/containers/neutron,nova,haproxy 

If possible, get access to the VM instance and run "cloud-init collect-logs".

It would also be very useful to raise the logging level with debug=true. Run "podman inspect <name of container>" to see where the configuration files are bind-mounted from, but I am assuming it is probably:
/var/lib/config-data/puppet-generated/neutron/etc/neutron/plugins/networking-ovn/networking-ovn-metadata-agent.ini  --> for the neutron metadata agent
/var/lib/config-data/puppet-generated/nova_metadata/etc/nova/nova.conf  --> for nova services (including the nova metadata service)

Don't forget to set it back to debug=false after you are done with log collection; otherwise debug will keep dumping a lot of logs.
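As a sketch of that debug toggle: the helper below flips the DEFAULT-section debug option of an oslo-style ini file in place. The function name is my own and it assumes GNU sed; on a real deployment you would point it at whichever bind-mounted paths "podman inspect" reports and then restart the affected containers:

```shell
# Set "debug = <value>" in the [DEFAULT] section of an oslo-style config file.
# Replaces an existing uncommented "debug" line, or appends one right after
# the [DEFAULT] header if none exists. Requires GNU sed (the "a" command).
set_debug() {
  local conf=$1 value=$2
  if grep -q '^debug' "$conf"; then
    sed -i "s/^debug.*/debug = $value/" "$conf"
  else
    sed -i "/^\[DEFAULT\]/a debug = $value" "$conf"
  fi
}
```

Usage would be, for example, `set_debug /var/lib/config-data/puppet-generated/nova_metadata/etc/nova/nova.conf true`, followed by the same call with `false` once the logs are collected.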

Lastly, collecting sos reports from the nodes right after a failure occurs would be helpful as well, in case I need to look into the OVS services.

Comment 2 Luca Davidde 2022-04-27 07:38:09 UTC
(In reply to Miro Tomaska from comment #1)

Hello Miro, 
there is already a sosreport from the compute node and the cloud-init log from when the issue occurred:

0110-sosreport-ie2-prd-compute-shared-g9dl360v4-25-03199974-2022-04-21-gzqkirv.tar.xz
0120-cloud-init-ie2-ppwpt05a-prd.tar.gz

cloud-init.log
2022-04-21 13:59:16,335 - url_helper.py[DEBUG]: Calling 'http://169.254.169.254/openstack' failed [7/-1s]: request error [('Connection aborted.', error(111, 'Connection refused'))]
nova-compute.log
2022-04-21 13:58:55.102 7 WARNING nova.compute.manager [req-49572cec-33b7-46cf-9aa1-fdbc85e90b52 4df274a9261842a69303d81d78c6a4f6 9192e8f2654143e99caada0581befd90 - default default] [instance: 09257c39-24c6-476a-9e49-f1dc7b0019a3] Received unexpected event network-vif-plugged-21910b3a-1e7b-4f8e-8b61-723b5fd14e3b for instance with vm_state active and task_state None.
2022-04-21 13:58:55.103 7 WARNING nova.compute.manager [req-49572cec-33b7-46cf-9aa1-fdbc85e90b52 4df274a9261842a69303d81d78c6a4f6 9192e8f2654143e99caada0581befd90 - default default] [instance: 09257c39-24c6-476a-9e49-f1dc7b0019a3] Received unexpected event network-vif-plugged-21910b3a-1e7b-4f8e-8b61-723b5fd14e3b for instance with vm_state active and task_state None.
2022-04-21 13:58:55.104 7 WARNING nova.compute.manager [req-49572cec-33b7-46cf-9aa1-fdbc85e90b52 4df274a9261842a69303d81d78c6a4f6 9192e8f2654143e99caada0581befd90 - default default] [instance: 09257c39-24c6-476a-9e49-f1dc7b0019a3] Received unexpected event network-vif-plugged-21910b3a-1e7b-4f8e-8b61-723b5fd14e3b for instance with vm_state active and task_state None.

ovn-metadata.log

2022-04-21 13:58:50.978 643723 INFO networking_ovn.agent.metadata.agent [-] Port 21910b3a-1e7b-4f8e-8b61-723b5fd14e3b in datapath ff200762-8a1d-4b5b-87bb-b945459875c9 unbound from our chassis
2022-04-21 13:58:50.991 643723 INFO networking_ovn.agent.metadata.agent [-] Cleaning up ovnmeta-ff200762-8a1d-4b5b-87bb-b945459875c9 namespace which is not needed anymore
2022-04-21 13:58:52.473 643723 INFO networking_ovn.agent.metadata.agent [-] Port 21910b3a-1e7b-4f8e-8b61-723b5fd14e3b in datapath ff200762-8a1d-4b5b-87bb-b945459875c9 bound to our chassis
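
The unbound/cleanup/bound churn in this snippet is easy to miss in a large ovn-metadata log. A hypothetical triage helper (the function name is my own) that lists datapaths which were unbound, making their ovnmeta namespace eligible for cleanup, and then bound again shortly after:

```shell
# Scan an ovn-metadata agent log for datapaths that were first unbound from
# the chassis and later bound again: candidates for the metadata race.
churned_datapaths() {
  awk '
    /unbound from our chassis/ { for (i = 1; i <= NF; i++) if ($i == "datapath") unbound[$(i+1)] = 1 }
    /bound to our chassis/     { for (i = 1; i <= NF; i++) if ($i == "datapath" && unbound[$(i+1)]) print $(i+1) }
  ' "$1" | sort -u
}
```

Run against the log above, this would print ff200762-8a1d-4b5b-87bb-b945459875c9, the datapath whose namespace was torn down roughly two seconds before the port was re-bound.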

In the meantime, I'll ask the customer whether they can reproduce the issue once again.

Comment 10 Luca Davidde 2022-05-03 09:44:29 UTC
They have cloud-init-20.3-10.el8_4.5.noarch, by the way.

Comment 14 Luca Davidde 2022-05-06 14:39:59 UTC
Hello Miro, 
as per the customer's latest comments, applying the combination of

- ovs-vsctl set open . external_ids:ovn-monitor-all=true (as suggested by Dave Hill)
- setting the Power Regulator on the bare metal server to OS Control

seems to mitigate the issue, or at least with this configuration he cannot reproduce it.
Unfortunately he didn't provide any logs from the tests, but he says:

"I am able to reproduce this issue using a shared provider network with many subnets (>350 currently)
I am not able to reproduce this issue using a project network with a router/FIPs/etc.
So there is likely something different between these two types of networks that makes the problem appear on only one. Unfortunately we currently have only one type of provider network, so we cannot test this against an unused provider network (to see whether scale is related or not).
Another test I ran was removing the cloud_config entirely from the VMs on the shared provider network, leaving only a SSH keypair to be installed by cloud init.
I could reproduce the issue as before without a cloud_config, so its contents are not relevant to this problem.
Hopefully these extra tests go some way towards helping identify where the issue lies. Our hope is that with those other changes the problem no longer occurs for us, but there is likely still some kind of race condition that causes this to happen when using slower CPUs"

He also said that at this point they want to wait for the update to 16.2.2 with the tweaks that appear to solve/mitigate the problem.

Comment 16 Luca Davidde 2022-05-23 09:35:06 UTC
Hello Miro, thank you, and sorry for the late reply. Once the customer updates us, I'll let you know.

