Stepping into a deployed environment, I deleted the dcn3 overcloud and ran overcloud_deploy_dcn3.sh again. As in the job log, I see the step 5 container-start loop running (and it will probably fail):

  WAITING | Wait for containers to start for step 5 using paunch | dcn3-compute3-0 | 1099 retries left

I do not think anything is wrong network-wise: the controllers can access the compute nodes on this new leaf (and vice versa), and the same is true from the undercloud. I am not sure what the condition is for step 5 to complete, but after logging in on a compute node I saw these running containers:

  5d3584c9e4a3  site-undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-libvirt:16.2_20210616.1               kolla_start           10 minutes ago  Up 10 minutes ago  nova_virtlogd
  090d993d1f6f  site-undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-libvirt:16.2_20210616.1               kolla_start           10 minutes ago  Up 10 minutes ago  nova_libvirt
  ea6dbf89e343  site-undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-iscsid:16.2_20210616.1                     kolla_start           10 minutes ago  Up 10 minutes ago  iscsid
  bd6d59baeeff  site-undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-cron:16.2_20210616.1                       kolla_start           9 minutes ago   Up 9 minutes ago   logrotate_crond
  578925166e2a  site-undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-compute:16.2_20210616.1               kolla_start           9 minutes ago   Up 9 minutes ago   nova_migration_target
  e2bf10938251  site-undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-dhcp-agent:16.2_20210616.1         kolla_start           9 minutes ago   Up 9 minutes ago   neutron_dhcp
  ff9ecd4ab79f  site-undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent:16.2_20210616.1     kolla_start           9 minutes ago   Up 9 minutes ago   neutron_metadata_agent
  a82da4358910  site-undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-openvswitch-agent:16.2_20210616.1  kolla_start           9 minutes ago   Up 9 minutes ago   neutron_ovs_agent
  da29f30fd3e9  site-undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-compute:16.2_20210616.1               kolla_start           9 minutes ago   Up 9 minutes ago   nova_compute
  a07e5cd29460  site-undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-compute:16.2_20210616.1               /container-config...  9 minutes ago   Up 9 minutes ago   nova_wait_for_compute_service

Is the wait actually caused by this nova_wait_for_compute_service container?
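The "1099 retries left" counter suggests a bounded polling loop: the step keeps re-checking a condition until it becomes true or the retries run out. As a rough illustration only (the `probe` callable below is hypothetical, standing in for whatever the real "is the compute service up?" check is), the pattern looks like this in Python:

```python
import time

def wait_for(check, retries=1200, delay=0.0):
    """Poll check() until it returns True or retries are exhausted."""
    for _ in range(retries):
        if check():
            return True
        time.sleep(delay)  # the real deploy step waits between attempts
    return False

# Hypothetical probe that only succeeds on its third call.
state = {"calls": 0}
def probe():
    state["calls"] += 1
    return state["calls"] >= 3

print(wait_for(probe))  # → True after three polls
```

If the probed condition can never become true (for example, the service cannot register itself), the loop burns through all retries and the deploy step fails.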
Did you set "NetworkDeploymentActions: ['CREATE','UPDATE']" when updating the deployment?

See: https://access.redhat.com/solutions/2213711

Due to a bug, the network config on the nodes has been re-applied on every update since OSP 16. We recently backported the fixes to close that gap, see: https://bugzilla.redhat.com/show_bug.cgi?id=1958293. That bug fix is likely the reason you are now seeing your job fail.
Bernard: Yes, the wait is caused by nova_wait_for_compute_service. It blocks until the compute service starts, which is failing because it cannot connect to rabbitmq:

  2021-06-22 10:00:06.558 8 ERROR oslo.messaging._drivers.impl_rabbit [req-8a38c79c-ff6d-4540-a206-a81c2fbe603d - - - - -] Connection failed: timed out (retrying in 0 seconds): socket.timeout: timed out

The dcn3 compute can't ping the controllers on internalapi:

  [root@dcn3-compute3-0 heat-admin]# ping central-controller0-0.internalapi.redhat.local
  PING central-controller0-0.internalapi.redhat.local (172.25.1.164) 56(84) bytes of data.
  ^C
  --- central-controller0-0.internalapi.redhat.local ping statistics ---
  13 packets transmitted, 0 received, 100% packet loss, time 12285ms

The controller can ping the compute, but that goes through the default gateway, not the spine routers:

  [heat-admin@central-controller0-0 ~]$ ping 172.25.4.169
  PING 172.25.4.169 (172.25.4.169) 56(84) bytes of data.
  64 bytes from 172.25.4.169: icmp_seq=1 ttl=63 time=51.1 ms
  64 bytes from 172.25.4.169: icmp_seq=2 ttl=63 time=50.4 ms
  ^C
  --- 172.25.4.169 ping statistics ---
  2 packets transmitted, 2 received, 0% packet loss, time 1001ms
  rtt min/avg/max/mdev = 50.426/50.741/51.056/0.315 ms

  [heat-admin@central-controller0-0 ~]$ traceroute 172.25.4.169
  traceroute to 172.25.4.169 (172.25.4.169), 30 hops max, 60 byte packets
   1  _gateway (10.0.10.1)  0.320 ms  0.287 ms  0.277 ms

The route works for a dcn2 compute:

  [heat-admin@central-controller0-0 ~]$ traceroute 172.25.3.169
  traceroute to 172.25.3.169 (172.25.3.169), 30 hops max, 60 byte packets
   1  172.25.1.254 (172.25.1.254)  0.294 ms  0.258 ms  0.245 ms
   2  172.25.1.254 (172.25.1.254)  3005.468 ms !H  3005.456 ms !H  3005.444 ms !H

Harald: ack thanks, I'll check that.
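The rabbitmq error above is a plain TCP-level `socket.timeout`, so the symptom can be reproduced outside the container with a raw socket probe. A minimal sketch (the hostname and the AMQP default port 5672 in the comment are illustrative, not taken from this deployment's config):

```python
import socket

def tcp_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers socket.timeout, refused, and host-unreachable
        return False

# e.g. tcp_reachable("central-controller0-0.internalapi.redhat.local", 5672)
```

A `False` here from the compute node, combined with the 100% ping loss on internalapi, points at routing rather than at the nova or rabbitmq services themselves.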
(In reply to Harald Jensås from comment #2)
> Did you set "NetworkDeploymentActions: ['CREATE','UPDATE'] when updating the
> deployment?
>
> See: https://access.redhat.com/solutions/2213711
>
>
> Due to a bug, the network config on the nodes has been updating on every
> update since OSP16. We recently backported the fixes to close that gap, see:
> https://bugzilla.redhat.com/show_bug.cgi?id=1958293. Fixing this bug is
> likely the reason you are now seeing your job failing.

Looks like the cause alright: os-net-config isn't being run on the controllers on a stack update. However, adding UPDATE to NetworkDeploymentActions didn't work for me. The actions are set correctly AFAICT:

  [root@site-undercloud-0 central]# pwd
  /var/lib/mistral/central
  [root@site-undercloud-0 central]# grep -r tripleo_network_config_network_deployment_actions *
  deploy_steps_playbook.yaml: tripleo_network_config_network_deployment_actions: "{{ network_deployment_actions }}"
  [root@site-undercloud-0 central]# grep -A2 network_deployment_actions group_vars/Controller0
  network_deployment_actions:
  - - CREATE
    - UPDATE

But the tasks are still skipped:

  [root@site-undercloud-0 central]# grep 'Run NetworkConfig script' ansible.log
  2021-06-22 11:09:08,623 p=56749 u=mistral n=ansible | 2021-06-22 11:09:08.622924 | 5254007e-ba68-f825-908b-000000000053 | TASK | Run NetworkConfig script
  2021-06-22 11:09:08,670 p=56749 u=mistral n=ansible | 2021-06-22 11:09:08.670745 | 5254007e-ba68-f825-908b-000000000053 | SKIPPED | Run NetworkConfig script | central-controller0-0
  2021-06-22 11:09:08,865 p=56749 u=mistral n=ansible | 2021-06-22 11:09:08.864752 | 5254007e-ba68-f825-908b-000000000053 | TASK | Run NetworkConfig script
  2021-06-22 11:09:08,936 p=56749 u=mistral n=ansible | 2021-06-22 11:09:08.935907 | 5254007e-ba68-f825-908b-000000000053 | TASK | Run NetworkConfig script
  2021-06-22 11:09:08,945 p=56749 u=mistral n=ansible | 2021-06-22 11:09:08.945319 | 5254007e-ba68-f825-908b-000000000053 | SKIPPED | Run NetworkConfig script | central-controller0-1
  2021-06-22 11:09:09,057 p=56749 u=mistral n=ansible | 2021-06-22 11:09:09.056948 | 5254007e-ba68-f825-908b-000000000053 | SKIPPED | Run NetworkConfig script | central-controller0-2
  2021-06-22 11:09:09,488 p=56749 u=mistral n=ansible | 2021-06-22 11:09:09.488562 | 5254007e-ba68-f825-908b-000000000053 | TASK | Run NetworkConfig script
  2021-06-22 11:09:09,575 p=56749 u=mistral n=ansible | 2021-06-22 11:09:09.575345 | 5254007e-ba68-f825-908b-000000000053 | SKIPPED | Run NetworkConfig script | central-compute0-0
  2021-06-22 11:09:09,725 p=56749 u=mistral n=ansible | 2021-06-22 11:09:09.725638 | 5254007e-ba68-f825-908b-000000000053 | TASK | Run NetworkConfig script
  2021-06-22 11:09:09,801 p=56749 u=mistral n=ansible | 2021-06-22 11:09:09.800612 | 5254007e-ba68-f825-908b-000000000053 | SKIPPED | Run NetworkConfig script | central-compute0-1
Adding some debug tasks:

  - name: Debug
    fail:
      msg: "{{ 'UPDATE' in tripleo_network_config_network_deployment_actions }}"

  2021-06-22 12:02:02.278372 | 5254007e-ba68-7bde-1e69-000000000050 | FATAL | Debug | central-controller0-0 | error={"changed": false, "msg": false}

  - name: Debug
    fail:
      msg: "{{ tripleo_network_config_network_deployment_actions }}"

  2021-06-22 12:06:37.220712 | 5254007e-ba68-b3ef-d670-000000000050 | FATAL | Debug | central-controller0-2 | error={"changed": false, "msg": [["CREATE", "UPDATE"]]}

This is the problem:

  network_deployment_actions:
  - - CREATE
    - UPDATE

It's a list of lists; it should be a flat list.
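That explains the skip: the task condition is a plain membership test, and 'UPDATE' is not an element of a list whose only element is another list. The effect is easy to reproduce in plain Python (the two values mirror the broken and the intended group_vars content):

```python
# What Ansible actually saw in group_vars vs. what was intended.
nested = [["CREATE", "UPDATE"]]
flat = ["CREATE", "UPDATE"]

# The task's when-condition is effectively a membership test like this:
print("UPDATE" in nested)  # → False: the only element is the inner list
print("UPDATE" in flat)    # → True
```

So with the nested value, the condition evaluates to false on every node and "Run NetworkConfig script" is skipped everywhere.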
I see the problem, testing a fix now
  network_deployment_actions:
  - - CREATE
    - UPDATE

^^ The list is nested; it should just be a list of strings. What you want is:

  network_deployment_actions:
  - CREATE
  - UPDATE

Not sure how you ended up with a nested list.
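One plausible way such nesting arises (purely an illustration; the thread doesn't establish what caused it here) is when a value that is already a list gets wrapped in another list while the variable is rendered, instead of being passed through as-is:

```python
actions = ["CREATE", "UPDATE"]

wrapped = [actions]      # wrapping -> [["CREATE", "UPDATE"]], the shape seen in group_vars
copied = list(actions)   # copying the elements -> ["CREATE", "UPDATE"], the intended shape

print(wrapped)  # → [['CREATE', 'UPDATE']]
print(copied)   # → ['CREATE', 'UPDATE']
```

In YAML the same distinction shows up as `- - CREATE` (a sequence whose first item is itself a sequence) versus `- CREATE` (a sequence of plain strings).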
Confirmed it works with the t-h-t fix.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2021:3483