Description of problem: Since the introduction of this commit [a], live migration is broken on cell computes [1]. Apparently the cell1-config firewall rules are wrong: they use the ctlplane subnet, whereas they should use the internal API subnet just like overcloud-config does [2].

[a] https://review.opendev.org/c/openstack/tripleo-heat-templates/+/786576/

[1]
~~~
./nova-compute.log:2021-05-18 15:26:03.068 7 ERROR oslo_messaging.rpc.server [req-6351f9be-c200-4995-abb3-ece487bc54cc 9b51178989ab4c57817c7a79b37354b9 dcc474ee5bd042e3be158b87290a6a0b - default default] Exception during message handling: nova.exception.ResizeError: Resize error: not able to execute ssh command: Unexpected error while running command.
./nova-compute.log:2021-05-18 15:26:03.068 7 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
./nova-compute.log:2021-05-18 15:26:03.068 7 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 10119, in migrate_disk_and_power_off
./nova-compute.log:2021-05-18 15:26:03.068 7 ERROR oslo_messaging.rpc.server     self._remotefs.create_dir(dest, inst_base)
./nova-compute.log:2021-05-18 15:26:03.068 7 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/volume/remotefs.py", line 95, in create_dir
./nova-compute.log:2021-05-18 15:26:03.068 7 ERROR oslo_messaging.rpc.server     on_completion=on_completion)
./nova-compute.log:2021-05-18 15:26:03.068 7 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/volume/remotefs.py", line 185, in create_dir
./nova-compute.log:2021-05-18 15:26:03.068 7 ERROR oslo_messaging.rpc.server     on_execute=on_execute, on_completion=on_completion)
./nova-compute.log:2021-05-18 15:26:03.068 7 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.6/site-packages/nova/utils.py", line 117, in ssh_execute
./nova-compute.log:2021-05-18 15:26:03.068 7 ERROR oslo_messaging.rpc.server     return processutils.execute(*ssh_cmd, **kwargs)
./nova-compute.log:2021-05-18 15:26:03.068 7 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py", line 431, in execute
./nova-compute.log:2021-05-18 15:26:03.068 7 ERROR oslo_messaging.rpc.server     cmd=sanitized_cmd)
./nova-compute.log:2021-05-18 15:26:03.068 7 ERROR oslo_messaging.rpc.server oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
./nova-compute.log:2021-05-18 15:26:03.068 7 ERROR oslo_messaging.rpc.server Command: ssh -o BatchMode=yes 172.17.1.18 mkdir -p /var/lib/nova/instances/3058ffe2-e4a2-461e-86a8-d2d5a0f42b48
./nova-compute.log:2021-05-18 15:26:03.068 7 ERROR oslo_messaging.rpc.server Exit code: 255
./nova-compute.log:2021-05-18 15:26:03.068 7 ERROR oslo_messaging.rpc.server Stdout: ''
./nova-compute.log:2021-05-18 15:26:03.068 7 ERROR oslo_messaging.rpc.server Stderr: 'ssh: connect to host 172.17.1.18 port 2022: Connection timed out\r\n'
~~~

[2]
~~~
(undercloud) [stack@undercloud-0 plans]$ grep -A10 -R nova_migration_target::firewall_rules
cell1-config/Compute/config_settings.yaml:tripleo::nova_migration_target::firewall_rules:
cell1-config/Compute/config_settings.yaml-  113 nova_migration_target accept api subnet 192.168.24.0/24:
cell1-config/Compute/config_settings.yaml-    dport: 2022
cell1-config/Compute/config_settings.yaml-    proto: tcp
cell1-config/Compute/config_settings.yaml-    source: 192.168.24.0/24
cell1-config/Compute/config_settings.yaml-  113 nova_migration_target accept libvirt subnet 192.168.24.0/24:
cell1-config/Compute/config_settings.yaml-    dport: 2022
cell1-config/Compute/config_settings.yaml-    proto: tcp
cell1-config/Compute/config_settings.yaml-    source: 192.168.24.0/24
cell1-config/Compute/config_settings.yaml-tripleo::ovn_controller::firewall_rules:
cell1-config/Compute/config_settings.yaml-  118 neutron vxlan networks:
--
cell1-config/group_vars/Compute: tripleo::nova_migration_target::firewall_rules:
cell1-config/group_vars/Compute-   113 nova_migration_target accept api subnet 192.168.24.0/24:
cell1-config/group_vars/Compute-     dport: 2022
cell1-config/group_vars/Compute-     proto: tcp
cell1-config/group_vars/Compute-     source: 192.168.24.0/24
cell1-config/group_vars/Compute-   113 nova_migration_target accept libvirt subnet 192.168.24.0/24:
cell1-config/group_vars/Compute-     dport: 2022
cell1-config/group_vars/Compute-     proto: tcp
cell1-config/group_vars/Compute-     source: 192.168.24.0/24
cell1-config/group_vars/Compute-  tripleo::ovn_controller::firewall_rules:
cell1-config/group_vars/Compute-   118 neutron vxlan networks:
overcloud-config/Compute/config_settings.yaml:tripleo::nova_migration_target::firewall_rules:
overcloud-config/Compute/config_settings.yaml-  113 nova_migration_target accept api subnet 172.17.1.0/24:
overcloud-config/Compute/config_settings.yaml-    dport: 2022
overcloud-config/Compute/config_settings.yaml-    proto: tcp
overcloud-config/Compute/config_settings.yaml-    source: 172.17.1.0/24
overcloud-config/Compute/config_settings.yaml-  113 nova_migration_target accept libvirt subnet 172.17.1.0/24:
overcloud-config/Compute/config_settings.yaml-    dport: 2022
overcloud-config/Compute/config_settings.yaml-    proto: tcp
overcloud-config/Compute/config_settings.yaml-    source: 172.17.1.0/24
overcloud-config/Compute/config_settings.yaml-tripleo::ovn_controller::firewall_rules:
overcloud-config/Compute/config_settings.yaml-  118 neutron vxlan networks:
--
overcloud-config/group_vars/Compute: tripleo::nova_migration_target::firewall_rules:
overcloud-config/group_vars/Compute-   113 nova_migration_target accept api subnet 172.17.1.0/24:
overcloud-config/group_vars/Compute-     dport: 2022
overcloud-config/group_vars/Compute-     proto: tcp
overcloud-config/group_vars/Compute-     source: 172.17.1.0/24
overcloud-config/group_vars/Compute-   113 nova_migration_target accept libvirt subnet 172.17.1.0/24:
overcloud-config/group_vars/Compute-     dport: 2022
overcloud-config/group_vars/Compute-     proto: tcp
overcloud-config/group_vars/Compute-     source: 172.17.1.0/24
overcloud-config/group_vars/Compute-  tripleo::ovn_controller::firewall_rules:
overcloud-config/group_vars/Compute-   118 neutron vxlan networks:
~~~
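To make the symptom above concrete, here is a hypothetical sketch (not the actual puppet-tripleo code) of how the nova_migration_target firewall rules derive their source subnet: the source is simply the CIDR of whichever network the service resolves to, so a net_cidr_map in which every network collapses to the ctlplane CIDR produces the wrong 192.168.24.0/24 rules seen in cell1-config, while the correct internal_api CIDR yields the overcloud-config rules.

```python
# Hypothetical simplification: the rule names and shape mirror the hiera
# output above, but this function is illustrative only.
def migration_firewall_rules(net_cidr_map, migration_network="internal_api"):
    # The source subnet comes from the network the service is mapped to.
    source = net_cidr_map[migration_network][0]
    rule = {"dport": 2022, "proto": "tcp", "source": source}
    return {
        f"113 nova_migration_target accept api subnet {source}": dict(rule),
        f"113 nova_migration_target accept libvirt subnet {source}": dict(rule),
    }

# Broken cell1 case: internal_api has collapsed to the ctlplane CIDR.
print(migration_firewall_rules({"internal_api": ["192.168.24.0/24"]}))
# Healthy overcloud case: internal_api carries its real CIDR.
print(migration_firewall_rules({"internal_api": ["172.17.1.0/24"]}))
```

This is why the grep output differs only in the subnet: the rule template is identical on both stacks, but the resolved CIDR is not.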
After removing these lines [1] from the cell1.yaml file, the migration target firewall rules are populated differently this time, but they are still incorrect [2]: the subnet used is the tenant subnet, whereas it should be the internal_api network (172.17.1.0/24). So this really looks like issue [b], which was unexpectedly "caused" by commit [c]. That commit removes all references to NovaVncProxyNetwork and replaces it with NovaLibvirtNetwork; it also removes the mapping in service_net_map. Once applied, the haproxy configuration for the nova_vnc_proxy farm was erroneous and bound to the ctlplane subnet instead of internal_api (aliased by NovaLibvirtNetwork). Re-adding the nova_vnc_proxy_network reference in service_net_map inexplicably solved the issue, even though it is not referenced anywhere.

So I redeployed the cell with the normal settings, including the None mappings [1] for the networks and with ManageNetworks: False, and as expected it set the migration subnet to ctlplane. I updated the service_net_map to include nova_migration_target_network, but it didn't work; we still get the same ctlplane subnet. I believe this issue is more in the realm of the deployment framework than compute related.
[a] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html/configuring_the_compute_service_for_instance_creation/scaling-deployments-with-compute-cells#deploying-multicell-overcloud-osp
[b] https://bugzilla.redhat.com/show_bug.cgi?id=1961205
[c] https://review.opendev.org/c/openstack/tripleo-heat-templates/+/781614

[1]
~~~
OS::TripleO::Network::External: OS::Heat::None
OS::TripleO::Network::InternalApi: OS::Heat::None
OS::TripleO::Network::Storage: OS::Heat::None
OS::TripleO::Network::StorageMgmt: OS::Heat::None
OS::TripleO::Network::Tenant: OS::Heat::None
OS::TripleO::Network::Management: OS::Heat::None
~~~

[2]
~~~
"tripleo::nova_migration_target::firewall_rules": {
    "113 nova_migration_target accept api subnet 172.16.2.0/24": {
        "dport": 2022,
        "proto": "tcp",
        "source": "172.16.2.0/24"
    },
    "113 nova_migration_target accept libvirt subnet 172.16.2.0/24": {
        "dport": 2022,
        "proto": "tcp",
        "source": "172.16.2.0/24"
    }
},
~~~
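For reference, the service_net_map change attempted above would normally be expressed as a ServiceNetMap override in an environment file, roughly like this. The exact key names are an assumption based on the nova_migration_target_network reference in this report, and as noted the override alone did not fix the cell deployment, because net_cidr_map itself was wrong:

~~~
# Illustrative only: pin the migration/libvirt services to internal_api.
parameter_defaults:
  ServiceNetMap:
    NovaMigrationTargetNetwork: internal_api
    NovaLibvirtNetwork: internal_api
~~~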
I dug a bit further, and the data is corrupted during heat stack creation or update [1]. The net_vip_map is fine, but the net_cidr_map is not, and it is the same for all services [2], only on the cell stack. Now I'm wondering whether the net_cidr_map was ever populated correctly for multi-cell deployments, since it's only rarely used in templates AFAIK.

More info on tracing back this issue:
- ServiceData.net_cidr_map is populated here [a] with NetCidrMapValue
- which brings us here [b]
- That apparently comes from here [c]; see how, if we don't have network_cidrs, we default to ctlplane
- But if we look at the network unit template, if manage_networks is undefined, no CIDR is associated with the network [d]

Martin told me earlier that ManageNetworks is set to false in DCN. I believe a solution would be to enable it, but I'm not sure what the side effects are. I just tried to enable it with a deploy update, but it failed complaining about the management network, which isn't even configured. I tried to undefine it, but it keeps complaining. I'll delete the cell stack and redeploy it from scratch with ManageNetworks enabled.

[a] https://opendev.org/openstack/tripleo-heat-templates/src/branch/stable/train/overcloud.j2.yaml#L565
[b] https://opendev.org/openstack/tripleo-heat-templates/src/branch/stable/train/overcloud.j2.yaml#L478
[c] https://opendev.org/openstack/tripleo-heat-templates/src/branch/stable/train/network/networks.j2.yaml
[d] https://opendev.org/openstack/tripleo-heat-templates/src/branch/stable/train/network/network.j2#L228-L252

[1]
~~~
mysql -D heat -e "select a.*,b.* from resource a left join stack b on b.id = a.stack_id where a.id = 1701\G"
*************************** 1.
row ***************************
id: 1701
uuid: d78f6b91-5f6f-47b0-8eca-5ce942605eb0
nova_instance: 5f1dd1a4-ece4-4a19-88bf-aa67e7eee102
name: 22
created_at: 2021-05-19 01:28:37
updated_at: 2021-05-19 01:46:20
action: UPDATE
status: COMPLETE
status_reason: state changed
stack_id: 2ac42e04-baf3-4bac-b0cb-fb3c9305224c
rsrc_metadata: {}
properties_data: null
engine_id: NULL
atomic_key: 6
needed_by: []
requires: [1702]
replaces: NULL
replaced_by: NULL
current_template_id: 2035
properties_data_encrypted: NULL
root_stack_id: 08f0f37e-953b-4081-b6d6-4683a7d5177f
rsrc_prop_data_id: 4187
attr_data_id: 4537
id: 2ac42e04-baf3-4bac-b0cb-fb3c9305224c
created_at: 2021-05-19 01:28:36
updated_at: 2021-05-19 01:45:43
deleted_at: NULL
name: cell1-CellControllerServiceChain-t7ggnuk2yyps-ServiceChain-bqzwl2sb7n4e
raw_template_id: 2035
prev_raw_template_id: NULL
user_creds_id: 3
username: NULL
owner_id: 5fb272f8-9237-4402-ba57-e8649a90ba79
action: UPDATE
status: COMPLETE
status_reason: Stack UPDATE completed successfully
timeout: 100
tenant: 26de13d422154d9790143d326e78a669
disable_rollback: 1
stack_user_project_id: 983bef81ae04459cbd41e04d240373bf
backup: 0
nested_depth: 2
convergence: 1
current_traversal: 9bfb0a2e-ab31-4993-a6b4-fb2f71de8e1b
current_deps: {"edges": [[[1704, true], [1705, true]], [[1708, true], [1709, true]], [[1697, true], [1708, true]], [[1696, true], [1697, true]], [[1710, true], [1711, true]], [[1709, true], [1710, true]], [[1700, true], [1703, true]], [[1699, true], [1700, true]], [[1698, true], [1699, true]], [[1703, true], [1704, true]], [[1706, true], [1707, true]], [[1702, true], [1706, true]], [[1701, true], [1702, true]], [[1712, true], [1713, true]], [[1707, true], [1712, true]], [[1717, true], [1718, true]], [[1716, true], [1717, true]], [[1715, true], [1716, true]], [[1714, true], [1715, true]], [[1696, false], [1696, true]], [[1697, false], [1696, false]], [[1697, false], [1697, true]], [[1708, false], [1697, false]], [[1708, false], [1708, true]], [[1698, false], [1698, true]], [[1699, false], [1698, false]], [[1699, false], [1699, true]], [[1700, false], [1699, false]], [[1700, false], [1700, true]], [[1703, false], [1703, true]], [[1703, false], [1700, false]], [[1701, false], [1701, true]], [[1702, false], [1702, true]], [[1702, false], [1701, false]], [[1706, false], [1702, false]], [[1706, false], [1706, true]], [[1704, false], [1704, true]], [[1704, false], [1703, false]], [[1705, false], [1704, false]], [[1705, false], [1705, true]], [[1707, false], [1706, false]], [[1707, false], [1707, true]], [[1712, false], [1707, false]], [[1712, false], [1712, true]], [[1709, false], [1709, true]], [[1709, false], [1708, false]], [[1710, false], [1710, true]], [[1710, false], [1709, false]], [[1711, false], [1710, false]], [[1711, false], [1711, true]], [[1713, false], [1712, false]], [[1713, false], [1713, true]], [[1714, false], [1714, true]], [[1715, false], [1714, false]], [[1715, false], [1715, true]], [[1716, false], [1715, false]], [[1716, false], [1716, true]], [[1717, false], [1717, true]], [[1717, false], [1716, false]], [[1718, false], [1718, true]], [[1718, false], [1717, false]]]}
parent_resource_name: ServiceChain

mysql -N -s -D heat -e "select data from resource_properties_data where id = 4187;" | jq -C '.'
"ServiceData": {
    "net_cidr_map": {
        "storage": ["192.168.24.0/24"],
        "storage_mgmt": ["192.168.24.0/24"],
        "internal_api": ["192.168.24.0/24"],
        "tenant": ["192.168.24.0/24"],
        "external": ["192.168.24.0/24"],
        "management": ["10.0.1.0/24"],
        "ctlplane": ["192.168.24.0/24"]
    },
~~~

[2]
~~~
[
    "OS::TripleO::Services::Podman",
    {
        "net_cidr_map": {
            "storage": ["192.168.24.0/24"],
            "storage_mgmt": ["192.168.24.0/24"],
            "internal_api": ["192.168.24.0/24"],
            "tenant": ["192.168.24.0/24"],
            "external": ["192.168.24.0/24"],
            "management": ["10.0.1.0/24"],
            "ctlplane": ["192.168.24.0/24"]
        },
        "net_vip_map": {
            "ctlplane": "192.168.24.47",
            "ctlplane_subnet": "192.168.24.47/24",
            "ctlplane_uri": "192.168.24.47",
            "storage": "172.17.3.43",
            "storage_subnet": "",
            "storage_uri": "172.17.3.43",
            "storage_mgmt": "172.17.4.45",
            "storage_mgmt_subnet": "",
            "storage_mgmt_uri": "172.17.4.45",
            "internal_api": "172.17.1.126",
            "internal_api_subnet": "",
            "internal_api_uri": "172.17.1.126",
            "tenant": "",
            "tenant_subnet": "",
            "tenant_uri": "",
            "external": "10.0.0.6",
            "external_subnet": "",
            "external_uri": "10.0.0.6",
            "management": "",
            "management_subnet": "",
            "management_uri": ""
        }
    }
]
~~~
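The fallback traced above can be sketched as follows. This is a hypothetical simplification of the train-era overcloud.j2.yaml / network.j2 behaviour, not the actual template code: when ManageNetworks is false the network resources expose no CIDR, so every entry in net_cidr_map silently collapses to the ctlplane subnet, which matches the dump.

```python
# Illustrative sketch of the suspected net_cidr_map construction.
def build_net_cidr_map(network_cidrs, ctlplane_cidr, manage_networks):
    cidr_map = {}
    for net, cidrs in network_cidrs.items():
        if manage_networks and cidrs:
            cidr_map[net] = cidrs
        else:
            # With ManageNetworks: False the network exposes no CIDR,
            # so we fall back to ctlplane -- the corruption seen above.
            cidr_map[net] = [ctlplane_cidr]
    cidr_map["ctlplane"] = [ctlplane_cidr]
    return cidr_map

nets = {"internal_api": ["172.17.1.0/24"], "tenant": ["172.17.2.0/24"]}
print(build_net_cidr_map(nets, "192.168.24.0/24", manage_networks=False))
print(build_net_cidr_map(nets, "192.168.24.0/24", manage_networks=True))
```

Under this model the per-network VIPs (net_vip_map) can still be correct while every CIDR is wrong, which is exactly the asymmetry observed in [2].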
Ollie recommended that we add UUIDs to network_data.yaml [1], make sure that ManageNetworks is set to false, and remove the network undefinitions at the top of the cell1.yaml template [2]. This seems to work so far [3][4].

[1]
~~~
- name: Storage
  vip: true
  vlan: 30
  name_lower: storage
  ip_subnet: '172.17.3.0/24'
  allocation_pools: [{'start': '172.17.3.10', 'end': '172.17.3.149'}]
  mtu: 1500
  external_resource_network_id: fe06f73d-cb5e-4395-9c9f-c2ae264ea808
  external_resource_subnet_id: e8640dbc-a1dd-44e4-83ec-e1d6967825ad
  external_resource_vip_id: d1a1d710-fdbd-42da-929e-4976fc1060ba
- name: StorageMgmt
  name_lower: storage_mgmt
  vip: true
  vlan: 40
  ip_subnet: '172.17.4.0/24'
  allocation_pools: [{'start': '172.17.4.10', 'end': '172.17.4.149'}]
  mtu: 1500
  external_resource_network_id: b9750198-eacf-413a-ae91-a3fa99adbca3
  external_resource_subnet_id: 7d70f013-57bf-4ef2-ab69-35a130b3212e
  external_resource_vip_id: 476e98f9-6b7d-4176-a11a-63001d2ef510
- name: InternalApi
  name_lower: internal_api
  vip: true
  vlan: 20
  ip_subnet: '172.17.1.0/24'
  allocation_pools: [{'start': '172.17.1.10', 'end': '172.17.1.149'}]
  mtu: 1500
  external_resource_network_id: 0c5a3a61-e99b-44e4-bd50-a7268e228195
  external_resource_subnet_id: e0de5ff1-90c6-4821-9df2-1e4fbbc951e3
  external_resource_vip_id: 6bb648e7-eb8e-4262-91c8-247e30f54ca7
- name: Tenant
  vip: false  # Tenant network does not use VIPs
  name_lower: tenant
  vlan: 50
  ip_subnet: '172.17.2.0/24'
  allocation_pools: [{'start': '172.17.2.10', 'end': '172.17.2.149'}]
  mtu: 1500
  external_resource_network_id: 53bc65ae-62d3-4cde-b3b4-2ec8190633bb
  external_resource_subnet_id: 1037a300-04f1-4696-87fa-213c1c2501e4
- name: External
  vip: true
  name_lower: external
  vlan: 10
  ip_subnet: '10.0.0.0/24'
  allocation_pools: [{'start': '10.0.0.101', 'end': '10.0.0.149'}]
  gateway_ip: '10.0.0.1'
  mtu: 1500
  external_resource_network_id: 6e0d993a-12c9-431b-9ba5-05b81ed30a2a
  external_resource_subnet_id: 379d2c41-aea5-42d7-bf19-456cf0cfd822
  external_resource_vip_id: 7c81f93e-2e5b-46ff-8e12-6ebb7eb79e06
- name: Management
  # Management network is enabled by default for backwards-compatibility, but
  # is not included in any roles by default. Add to role definitions to use.
  enabled: true
  vip: false  # Management network does not use VIPs
  name_lower: management
  vlan: 60
  ip_subnet: '10.0.1.0/24'
  allocation_pools: [{'start': '10.0.1.4', 'end': '10.0.1.250'}]
  gateway_ip: '10.0.1.1'
  mtu: 1500
  external_resource_network_id: a1b672dd-124a-4fe4-ad41-6075fd0ff89c
  external_resource_subnet_id: 856d1a93-ebd3-4d54-a5f5-9c7b15094e0d
~~~

[2]
~~~
resource_registry:
  OS::TripleO::Network::Ports::OVNDBsVipPort: /usr/share/openstack-tripleo-heat-templates/network/ports/noop.yaml
  OS::TripleO::Network::Ports::RedisVipPort: /usr/share/openstack-tripleo-heat-templates/network/ports/noop.yaml
  OS::TripleO::OVNMacAddressNetwork: OS::Heat::None

parameter_defaults:
  # new CELL Parameter to reflect that this is an additional CELL
  # enable local meta data api per cell
  NovaAdditionalCell: True
  NovaLocalMetadataPerCell: True
  ManageNetworks: False

  # The DNS names for the VIPs for the cell
  CloudName: cell1.redhat.local
  CloudNameInternal: cell1.internalapi.redhat.local
  CloudNameStorage: cell1.storage.redhat.local
  CloudNameStorageManagement: cell1.storagemgmt.redhat.local
  CloudNameCtlplane: cell1.ctlplane.redhat.local

  # Flavors used for the cell controller and computes
  OvercloudCellControllerFlavor: compute
  OvercloudComputeFlavor: compute

  # number of controllers/computes in the cell
  CellControllerCount: 1
  ComputeCount: 2

  # set the compute hostname to cellname-compute-X
  ComputeHostnameFormat: 'cell1-compute-%index%'

  # since we set the PublicVirtualFixedIPs in the overcloud env file,
  # we have to set a pub vip for the cell controller as well
  PublicVirtualFixedIPs:
    - ip_address: 10.0.0.6
~~~

[3]
~~~
(undercloud) [stack@undercloud-0 cell1]$ openstack stack resource show cell1 NetCidrMapValue -f value -c attributes
{'value': {'storage': ['172.17.3.0/24'], 'storage_mgmt': ['172.17.4.0/24'], 'internal_api': ['172.17.1.0/24'], 'tenant': ['172.17.2.0/24'], 'external': ['10.0.0.0/24'], 'management': ['10.0.1.0/24'], 'ctlplane': ['192.168.24.0/24']}}
~~~

[4]
~~~
[root@cell1-compute-0 puppet]# hiera tripleo::nova_migration_target::firewall_rules
{"113 nova_migration_target accept api subnet 172.17.1.0/24"=>
  {"dport"=>2022, "proto"=>"tcp", "source"=>"172.17.1.0/24"},
 "113 nova_migration_target accept libvirt subnet 172.17.1.0/24"=>
  {"dport"=>2022, "proto"=>"tcp", "source"=>"172.17.1.0/24"}}
~~~
*** Bug 1960439 has been marked as a duplicate of this bug. ***
Created attachment 1790012 [details] Volume tempest results for verification
Created attachment 1790013 [details] Compute tempest results for verification
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2021:3483