Bug 1973952 - Network config is not updated on stack update
Summary: Network config is not updated on stack update
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 16.2 (Train)
Hardware: Unspecified
OS: Linux
Priority: medium
Severity: urgent
Target Milestone: ---
Assignee: Harald Jensås
QA Contact: Joe H. Rahme
URL:
Whiteboard:
Depends On:
Blocks: 1958293 1975084 1975346
 
Reported: 2021-06-19 11:33 UTC by Sergey Bekkerman
Modified: 2021-09-15 07:17 UTC
CC: 12 users

Fixed In Version: openstack-tripleo-heat-templates-11.5.1-2.20210603174816.el8ost.4
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1975084 (view as bug list)
Environment:
Last Closed: 2021-09-15 07:16:23 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1933228 0 None None None 2021-06-22 12:24:35 UTC
OpenStack gerrit 797441 0 None NEW Fix NetworkDeploymentActionValue format 2021-06-22 12:24:03 UTC
Red Hat Product Errata RHEA-2021:3483 0 None None None 2021-09-15 07:17:00 UTC

Comment 1 Bernard Cafarelli 2021-06-22 10:09:34 UTC
Stepping into a deployed env, I deleted the dcn3 overcloud and ran overcloud_deploy_dcn3.sh again.
Similar to the log in the job, I see the step 5 container-start loop running (and it will probably fail):
WAITING | Wait for containers to start for step 5 using paunch | dcn3-compute3-0 | 1099 retries left

But I do not think anything is wrong network-wise: controllers can access compute nodes on this new leaf (and vice versa), and the same from the undercloud.

I am not sure what the condition is for step 5 to complete, but after logging in on a compute node I saw the following running containers:
5d3584c9e4a3  site-undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-libvirt:16.2_20210616.1               kolla_start           10 minutes ago  Up 10 minutes ago          nova_virtlogd
090d993d1f6f  site-undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-libvirt:16.2_20210616.1               kolla_start           10 minutes ago  Up 10 minutes ago          nova_libvirt
ea6dbf89e343  site-undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-iscsid:16.2_20210616.1                     kolla_start           10 minutes ago  Up 10 minutes ago          iscsid
bd6d59baeeff  site-undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-cron:16.2_20210616.1                       kolla_start           9 minutes ago   Up 9 minutes ago           logrotate_crond
578925166e2a  site-undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-compute:16.2_20210616.1               kolla_start           9 minutes ago   Up 9 minutes ago           nova_migration_target
e2bf10938251  site-undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-dhcp-agent:16.2_20210616.1         kolla_start           9 minutes ago   Up 9 minutes ago           neutron_dhcp
ff9ecd4ab79f  site-undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent:16.2_20210616.1     kolla_start           9 minutes ago   Up 9 minutes ago           neutron_metadata_agent
a82da4358910  site-undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-openvswitch-agent:16.2_20210616.1  kolla_start           9 minutes ago   Up 9 minutes ago           neutron_ovs_agent
da29f30fd3e9  site-undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-compute:16.2_20210616.1               kolla_start           9 minutes ago   Up 9 minutes ago           nova_compute
a07e5cd29460  site-undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-compute:16.2_20210616.1               /container-config...  9 minutes ago   Up 9 minutes ago           nova_wait_for_compute_service


Is the wait actually caused by this nova_wait_for_compute_service container?

Comment 2 Harald Jensås 2021-06-22 10:38:46 UTC
Did you set "NetworkDeploymentActions: ['CREATE','UPDATE']" when updating the deployment?

See: https://access.redhat.com/solutions/2213711


Due to a bug, the network config on the nodes had been updated on every stack update since OSP 16. We recently backported the fixes to close that gap; see https://bugzilla.redhat.com/show_bug.cgi?id=1958293. Fixing that bug is likely the reason you are now seeing your job fail.
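For reference, NetworkDeploymentActions is a Heat parameter set under parameter_defaults in an environment file passed to the deploy command; a minimal sketch (the filename is hypothetical):

```yaml
# Hypothetical environment file, e.g. network-actions.yaml, passed with:
#   openstack overcloud deploy ... -e network-actions.yaml
parameter_defaults:
  # Run os-net-config on the initial stack CREATE and on every stack UPDATE.
  NetworkDeploymentActions: ['CREATE', 'UPDATE']
```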

Comment 3 Ollie Walsh 2021-06-22 10:48:55 UTC
Bernard: Yes, the wait is caused by nova_wait_for_compute_service.  It blocks until the compute service starts, which is failing because it cannot connect to rabbitmq:
    2021-06-22 10:00:06.558 8 ERROR oslo.messaging._drivers.impl_rabbit [req-8a38c79c-ff6d-4540-a206-a81c2fbe603d - - - - -] Connection failed: timed out (retrying in 0 seconds): socket.timeout: timed out

The dcn3 compute can't ping the controllers on internalapi:
    [root@dcn3-compute3-0 heat-admin]# ping central-controller0-0.internalapi.redhat.local
    PING central-controller0-0.internalapi.redhat.local (172.25.1.164) 56(84) bytes of data.
    ^C
    --- central-controller0-0.internalapi.redhat.local ping statistics ---
    13 packets transmitted, 0 received, 100% packet loss, time 12285ms

The controller can ping the compute, but that is through the default gateway, not the spine routers:
    [heat-admin@central-controller0-0 ~]$ ping 172.25.4.169
    PING 172.25.4.169 (172.25.4.169) 56(84) bytes of data.
    64 bytes from 172.25.4.169: icmp_seq=1 ttl=63 time=51.1 ms
    64 bytes from 172.25.4.169: icmp_seq=2 ttl=63 time=50.4 ms
    ^C
    --- 172.25.4.169 ping statistics ---
    2 packets transmitted, 2 received, 0% packet loss, time 1001ms
    rtt min/avg/max/mdev = 50.426/50.741/51.056/0.315 ms
    [heat-admin@central-controller0-0 ~]$ traceroute 172.25.4.169
    traceroute to 172.25.4.169 (172.25.4.169), 30 hops max, 60 byte packets
     1  _gateway (10.0.10.1)  0.320 ms  0.287 ms  0.277 ms

Route works for a dcn2 compute:
    [heat-admin@central-controller0-0 ~]$ traceroute 172.25.3.169
    traceroute to 172.25.3.169 (172.25.3.169), 30 hops max, 60 byte packets
     1  172.25.1.254 (172.25.1.254)  0.294 ms  0.258 ms  0.245 ms
     2  172.25.1.254 (172.25.1.254)  3005.468 ms !H  3005.456 ms !H  3005.444 ms !H

Harald: ack thanks, I'll check that.

Comment 4 Ollie Walsh 2021-06-22 11:47:10 UTC
(In reply to Harald Jensås from comment #2)
> Did you set "NetworkDeploymentActions: ['CREATE','UPDATE'] when updating the
> deployment?
> 
> See: https://access.redhat.com/solutions/2213711
> 
> 
> Due to a bug, the network config on the nodes has been updating on every
> update since OSP16. We recently backported the fixes to close that gap, see:
> https://bugzilla.redhat.com/show_bug.cgi?id=1958293. Fixing this bug is
> likely the reason you are now seeing your job failing.

Looks like the cause, all right: os-net-config isn't being run on the controllers on a stack update. However, adding UPDATE to NetworkDeploymentActions didn't work for me.


Actions are set correctly AFAICT:

 [root@site-undercloud-0 central]# pwd
/var/lib/mistral/central
 [root@site-undercloud-0 central]# grep -r tripleo_network_config_network_deployment_actions *
deploy_steps_playbook.yaml:        tripleo_network_config_network_deployment_actions: "{{ network_deployment_actions }}"
 [root@site-undercloud-0 central]# grep -A2 network_deployment_actions group_vars/Controller0 
network_deployment_actions:
- - CREATE
  - UPDATE

But the tasks are still skipped:

 [root@site-undercloud-0 central]# grep 'Run NetworkConfig script' ansible.log
2021-06-22 11:09:08,623 p=56749 u=mistral n=ansible | 2021-06-22 11:09:08.622924 | 5254007e-ba68-f825-908b-000000000053 |       TASK | Run NetworkConfig script
2021-06-22 11:09:08,670 p=56749 u=mistral n=ansible | 2021-06-22 11:09:08.670745 | 5254007e-ba68-f825-908b-000000000053 |    SKIPPED | Run NetworkConfig script | central-controller0-0
2021-06-22 11:09:08,865 p=56749 u=mistral n=ansible | 2021-06-22 11:09:08.864752 | 5254007e-ba68-f825-908b-000000000053 |       TASK | Run NetworkConfig script
2021-06-22 11:09:08,936 p=56749 u=mistral n=ansible | 2021-06-22 11:09:08.935907 | 5254007e-ba68-f825-908b-000000000053 |       TASK | Run NetworkConfig script
2021-06-22 11:09:08,945 p=56749 u=mistral n=ansible | 2021-06-22 11:09:08.945319 | 5254007e-ba68-f825-908b-000000000053 |    SKIPPED | Run NetworkConfig script | central-controller0-1
2021-06-22 11:09:09,057 p=56749 u=mistral n=ansible | 2021-06-22 11:09:09.056948 | 5254007e-ba68-f825-908b-000000000053 |    SKIPPED | Run NetworkConfig script | central-controller0-2
2021-06-22 11:09:09,488 p=56749 u=mistral n=ansible | 2021-06-22 11:09:09.488562 | 5254007e-ba68-f825-908b-000000000053 |       TASK | Run NetworkConfig script
2021-06-22 11:09:09,575 p=56749 u=mistral n=ansible | 2021-06-22 11:09:09.575345 | 5254007e-ba68-f825-908b-000000000053 |    SKIPPED | Run NetworkConfig script | central-compute0-0
2021-06-22 11:09:09,725 p=56749 u=mistral n=ansible | 2021-06-22 11:09:09.725638 | 5254007e-ba68-f825-908b-000000000053 |       TASK | Run NetworkConfig script
2021-06-22 11:09:09,801 p=56749 u=mistral n=ansible | 2021-06-22 11:09:09.800612 | 5254007e-ba68-f825-908b-000000000053 |    SKIPPED | Run NetworkConfig script | central-compute0-1

Comment 5 Ollie Walsh 2021-06-22 12:09:25 UTC
Adding some debug tasks:

- name: Debug
  fail:
    msg: "{{ 'UPDATE' in tripleo_network_config_network_deployment_actions }}"

2021-06-22 12:02:02.278372 | 5254007e-ba68-7bde-1e69-000000000050 |      FATAL | Debug | central-controller0-0 | error={"changed": false, "msg": false}

- name: Debug
  fail:
    msg: "{{ tripleo_network_config_network_deployment_actions }}"

2021-06-22 12:06:37.220712 | 5254007e-ba68-b3ef-d670-000000000050 |      FATAL | Debug | central-controller0-2 | error={"changed": false, "msg": [["CREATE", "UPDATE"]]}

This is the problem:
network_deployment_actions:
- - CREATE
  - UPDATE

It's a list of lists; it should be a flat list.
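The failing `when` check is easy to reproduce outside Ansible; a minimal Python sketch (the variable names only mirror the Ansible fact for illustration):

```python
# What group_vars actually contained: the actions list wrapped in an
# extra sequence, so the rendered value is a list of lists.
nested = [["CREATE", "UPDATE"]]   # broken network_deployment_actions

# What was intended: a flat list of strings.
flat = ["CREATE", "UPDATE"]

# The task condition tests membership of the string 'UPDATE'. Against the
# nested list, 'UPDATE' is compared to the inner *list*, never the strings,
# so the check is False and "Run NetworkConfig script" is skipped.
print("UPDATE" in nested)  # False
print("UPDATE" in flat)    # True
```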

Comment 6 Ollie Walsh 2021-06-22 12:13:56 UTC
I see the problem; testing a fix now.

Comment 7 Harald Jensås 2021-06-22 12:15:19 UTC
network_deployment_actions:
- - CREATE
  - UPDATE

 ^^ The list is nested; it should just be a flat list of strings.

What you want is:

  network_deployment_actions:
  - CREATE
  - UPDATE

Not sure how you ended up with a nested list.

Comment 8 Ollie Walsh 2021-06-22 12:33:25 UTC
Confirmed it works with the t-h-t fix.
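The actual t-h-t patch (gerrit 797441) fixes how the NetworkDeploymentActionValue is rendered; a generic normalization of an accidentally nested value would look something like this (illustrative only, not the actual patch):

```python
def normalize_actions(value):
    """Flatten an accidentally nested actions list into a flat list of strings.

    Illustrative sketch only; not the tripleo-heat-templates fix itself.
    """
    if len(value) == 1 and isinstance(value[0], list):
        return list(value[0])
    return list(value)

# Broken group_vars rendering vs. the intended flat form:
print(normalize_actions([["CREATE", "UPDATE"]]))  # ['CREATE', 'UPDATE']
print(normalize_actions(["CREATE", "UPDATE"]))    # ['CREATE', 'UPDATE']
```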

Comment 17 errata-xmlrpc 2021-09-15 07:16:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:3483

