Bug 2077944
| Summary: | [OpenStack 16.2.1] After Heat stack creation, instances randomly fail to connect to the metadata server. | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Luca Davidde <ldavidde> |
| Component: | openstack-neutron | Assignee: | Miro Tomaska <mtomaska> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Fiorella Yanac <fyanac> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 16.2 (Train) | CC: | apevec, astupnik, averdagu, chrisw, dhill, erpeters, froyo, jamsmith, lhh, majopela, scohen |
| Target Milestone: | z5 | Keywords: | TestOnly, Triaged |
| Target Release: | 16.2 (Train on RHEL 8.4) | | |
| Hardware: | All | | |
| OS: | All | | |
| Whiteboard: | | | |
| Fixed In Version: | openstack-neutron-18.6.1-1.20230131164109.8487877.el8ost | Doc Type: | Bug Fix |
| Doc Text: | Before this update, provisioning a network namespace with thousands of subnets took a very long time. This delay prevented the metadata haproxy service from being ready for the first VM started on the hypervisor. As a result, the VM was not properly initialized by the cloud-init process. With this update, improved metadata agent logic provisions network namespaces faster, which resolves the issue. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2024-02-01 14:45:08 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Luca Davidde
2022-04-22 16:19:53 UTC
Hi Luca,

Thanks for the information and the logs. I reviewed the sos logs attached to the customer case. It seems the cloud-init outputs and the sosreports were taken at different times, so I am not able to see the full picture. For example, vm_boot_failed.log shows a cloud-init failure that occurred on April 20 at 09:26:39, but the provided sosreport logs from the compute nodes were collected on April 19 and April 21. The critical logs I wanted to see, from April 20 around 09:00, are not there. Would it be possible to collect logs again exactly when a cloud-init initialization failure occurs?

I am mostly interested in these logs:

From the compute node running a VM instance that fails cloud-init initialization:
- /var/log/containers/neutron, nova  (I am assuming the services run in containers)

From the "HA master" controller:
- /var/log/containers/neutron, nova, haproxy

If possible, get access to the VM instance and run "cloud-init collect-logs".

It would be super useful to raise the logging level with debug=true. Run "podman inspect <name of container>" to see where the configuration files are bind-mounted from, but I am assuming it is probably:
- /var/lib/config-data/puppet-generated/neutron/etc/neutron/plugins/networking-ovn/networking-ovn-metadata-agent.ini --> for the neutron metadata agent
- /var/lib/config-data/puppet-generated/nova_metadata/etc/nova/nova.conf --> for the nova services (including the nova metadata service)

Don't forget to set it back to debug=false once you are done with log collection; otherwise debug will keep dumping a lot of logs.

Lastly, collecting sos reports from the nodes right after a failure occurs would be helpful as well, in case I need to look into the ovs services.

(In reply to Miro Tomaska from comment #1)
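The collection steps above can be sketched as a small shell helper. The config paths are the ones named in the comment; the sed-based toggle, the container names, and the sos invocation are deployment-specific assumptions, so verify them with `podman inspect` before use.

```shell
#!/bin/sh
# Sketch of the debug-and-collect workflow (assumptions: GNU sed, oslo-style
# "debug = ..." lines in the ini files, container names as guessed below).
set -eu

# Flip the `debug` option in an oslo-style config file.
set_debug() {  # usage: set_debug <conf-file> <true|false>
    sed -i "s/^debug *=.*/debug = $2/" "$1"
}

# On the compute node, as root, the flow would be roughly:
#   set_debug /var/lib/config-data/puppet-generated/neutron/etc/neutron/plugins/networking-ovn/networking-ovn-metadata-agent.ini true
#   set_debug /var/lib/config-data/puppet-generated/nova_metadata/etc/nova/nova.conf true
#   podman restart ovn_metadata_agent nova_compute   # assumed container names
#   ... reproduce the cloud-init failure ...
#   sos report --batch                               # collect right after the failure
#   set_debug <each file above> false                # don't leave debug on
```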
Hello Miro,

there is already a sosreport from the compute node and a cloud-init log from when the issue occurred:

- 0110-sosreport-ie2-prd-compute-shared-g9dl360v4-25-03199974-2022-04-21-gzqkirv.tar.xz
- 0120-cloud-init-ie2-ppwpt05a-prd.tar.gz

cloud-init.log:

    2022-04-21 13:59:16,335 - url_helper.py[DEBUG]: Calling 'http://169.254.169.254/openstack' failed [7/-1s]: request error [('Connection aborted.', error(111, 'Connection refused'))]

nova-compute.log:

    2022-04-21 13:58:55.102 7 WARNING nova.compute.manager [req-49572cec-33b7-46cf-9aa1-fdbc85e90b52 4df274a9261842a69303d81d78c6a4f6 9192e8f2654143e99caada0581befd90 - default default] [instance: 09257c39-24c6-476a-9e49-f1dc7b0019a3] Received unexpected event network-vif-plugged-21910b3a-1e7b-4f8e-8b61-723b5fd14e3b for instance with vm_state active and task_state None.
    2022-04-21 13:58:55.103 7 WARNING nova.compute.manager [req-49572cec-33b7-46cf-9aa1-fdbc85e90b52 4df274a9261842a69303d81d78c6a4f6 9192e8f2654143e99caada0581befd90 - default default] [instance: 09257c39-24c6-476a-9e49-f1dc7b0019a3] Received unexpected event network-vif-plugged-21910b3a-1e7b-4f8e-8b61-723b5fd14e3b for instance with vm_state active and task_state None.
    2022-04-21 13:58:55.104 7 WARNING nova.compute.manager [req-49572cec-33b7-46cf-9aa1-fdbc85e90b52 4df274a9261842a69303d81d78c6a4f6 9192e8f2654143e99caada0581befd90 - default default] [instance: 09257c39-24c6-476a-9e49-f1dc7b0019a3] Received unexpected event network-vif-plugged-21910b3a-1e7b-4f8e-8b61-723b5fd14e3b for instance with vm_state active and task_state None.
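Correlating the cloud-init failure with the agent logs comes down to matching timestamps. A generic grep/awk sketch (not an official tool) that pulls the metadata-URL failure times out of a cloud-init.log, for matching against the ovn-metadata agent log:

```shell
# Print the timestamps of failed calls to the 169.254.169.254 metadata URL.
# Assumes the standard cloud-init log line format shown in this comment.
meta_failures() {  # usage: meta_failures <cloud-init.log>
    grep "Calling 'http://169.254.169.254" "$1" | grep 'request error' | awk '{print $1, $2}'
}
```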
ovn-metadata.log:

    2022-04-21 13:58:50.978 643723 INFO networking_ovn.agent.metadata.agent [-] Port 21910b3a-1e7b-4f8e-8b61-723b5fd14e3b in datapath ff200762-8a1d-4b5b-87bb-b945459875c9 unbound from our chassis
    2022-04-21 13:58:50.991 643723 INFO networking_ovn.agent.metadata.agent [-] Cleaning up ovnmeta-ff200762-8a1d-4b5b-87bb-b945459875c9 namespace which is not needed anymore
    2022-04-21 13:58:52.473 643723 INFO networking_ovn.agent.metadata.agent [-] Port 21910b3a-1e7b-4f8e-8b61-723b5fd14e3b in datapath ff200762-8a1d-4b5b-87bb-b945459875c9 bound to our chassis

In the meantime, I'll ask the customer whether they can reproduce the issue once more. They have cloud-init-20.3-10.el8_4.5.noarch, by the way.

Hello Miro,

as per the customer's latest comments, applying the combination of:
- ovs-vsctl set open . external_ids:ovn-monitor-all=true (as suggested by Dave Hill)
- setting the Power Regulator on the bare metal to OS Control

seems to mitigate the issue; at least with this configuration he cannot reproduce it. Unfortunately he did not provide any logs from the tests, but he says:

"I am able to reproduce this issue using a shared provider network with many subnets (>350 currently). I am not able to reproduce this issue using a project network with a router/FIPs/etc. So there is likely something different between these two types of networks that makes the problem appear on only one. Unfortunately we currently have only one type of provider network, so we cannot test this against an unused provider network (to see whether the scale is related or not). Another test I ran was removing the cloud_config entirely from the VMs on the shared provider network, leaving only an SSH keypair to be installed by cloud-init. I could reproduce the issue as before without a cloud_config, so its contents are not relevant for this problem. Hopefully these extra tests go some way towards helping identify where the issue lies.
Our hope is that with those other changes the problem no longer occurs for us, but there is likely still some kind of race condition that causes this to happen when using slower CPUs."

He also said that at this point they want to wait for the update to 16.2.2 with the tweaks that appear to solve/mitigate the problem.

Hello Miro,

thank you, and sorry for the late reply. Once the customer updates us, I'll let you know.
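For the record, the OVS side of the mitigation is a per-hypervisor setting. A tiny helper that only builds the exact ovs-vsctl command string (the command itself is verbatim from the comments above; actually running it requires root and Open vSwitch on each compute node, and the key name is the one Dave Hill suggested):

```shell
# Build the ovs-vsctl command that sets an external_ids key on the local
# Open vSwitch instance, so it can be reviewed before being run as root.
ovn_set_cmd() {  # usage: ovn_set_cmd <key> <value>
    printf 'ovs-vsctl set open . external_ids:%s=%s\n' "$1" "$2"
}

# Mitigation from this bug:
#   ovn_set_cmd ovn-monitor-all true | sudo sh
# Verify afterwards with:
#   ovs-vsctl get open . external_ids:ovn-monitor-all
```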