Bug 1964699 - [16.2] Cold migration occasionally fails in a TLS-E multi-cell environment
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-tripleo
Version: 16.2 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: beta
Target Release: 16.2 (Train on RHEL 8.4)
Assignee: David Vallee Delisle
QA Contact: James Parker
URL:
Whiteboard:
Depends On: 1959627
Blocks:
Reported: 2021-05-25 21:15 UTC by James Parker
Modified: 2021-09-15 07:15 UTC (History)

Fixed In Version: puppet-tripleo-11.6.2-2.20210528134152.e8f473f.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1959627
Environment:
Last Closed: 2021-09-15 07:15:18 UTC
Target Upstream Version: Train
Embargoed:


Attachments
Compute tempest tests with cold migration tests for verification (532.34 KB, application/xhtml+xml)
2021-06-10 20:01 UTC, James Parker


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 793180 0 None NEW Adding nova::network::neutron to nova-conductor 2021-05-27 07:19:50 UTC
Red Hat Product Errata RHEA-2021:3483 0 None None None 2021-09-15 07:15:40 UTC

Description James Parker 2021-05-25 21:15:59 UTC
+++ This bug was initially created as a clone of Bug #1959627 +++

Description of problem:
Cold migration intermittently fails in a TLS-E multi-cell environment.  The specific error [1] comes from the cell conductor, which does not have auth_type set in the [neutron] section of its nova.conf.  The primary controller and the cell computes do have auth_type set correctly.  It is not clear whether TripleO is failing to propagate the configuration to the cell controller, or whether nova should be reading these parameters from the primary controller instead.  As noted, this is not 100% reproducible: so far it happens about 50% of the time when running the same suite of tests on the same environment.  When it fails, only the cold migration revert test [2] fails; all other migration tests pass consistently.  Note that although the revert test is the one that always fails, the failure happens before the revert itself.


Relevant Config parameters:
[heat-admin@cell1-compute-0 ~]$ sudo crudini --get /var/lib/config-data/puppet-generated/nova_libvirt/etc/nova/nova.conf neutron auth_type
v3password

[heat-admin@cell1-cellcontrol-0 ~]$ sudo crudini --get /var/lib/config-data/puppet-generated/nova/etc/nova/nova.conf neutron
[heat-admin@cell1-cellcontrol-0 ~]$ 

[heat-admin@controller-0 ~]$ sudo crudini --get /var/lib/config-data/puppet-generated/nova/etc/nova/nova.conf neutron auth_type
v3password
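
The same check can be scripted when several nodes need to be compared. The following is a minimal sketch in Python, assuming the nova.conf contents have already been read from each node. The function name neutron_auth_type and the trimmed sample contents are hypothetical and only stand in for the real files queried above.

```python
import configparser


def neutron_auth_type(conf_text):
    """Return the [neutron]/auth_type value from nova.conf text, or None if unset."""
    # strict=False tolerates repeated options; interpolation=None keeps '%' literal.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(conf_text)
    # fallback=None covers both a missing [neutron] section and a missing option.
    return parser.get("neutron", "auth_type", fallback=None)


# Hypothetical, heavily trimmed file contents for illustration only.
controller_conf = "[neutron]\nauth_type=v3password\n"
cellcontrol_conf = "[DEFAULT]\ndebug=true\n"

for node, text in [("controller-0", controller_conf),
                   ("cell1-cellcontrol-0", cellcontrol_conf)]:
    print(node, neutron_auth_type(text))
```

A result of None corresponds to the empty crudini output seen on cell1-cellcontrol-0.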



Example output below when migration fails:

2021-05-11 16:45:27.998 [nova-cell-conductor.log] 23 ERROR nova.network.neutronv2.api [req-c002f919-623f-4917-b3a4-e54fc60c063d 637886e2f55840e6a3381c535ba7ec4f 20dd17d014fd442fbac7fdd6b9c006b6 - default default] The [neutron] section of your nova configuration file must be configured for authentication with the networking service endpoint. See the networking service install guide for details: https://docs.openstack.org/neutron/latest/install/                                                                                                      
2021-05-11 16:45:27.999 [nova-cell-conductor.log] 23 WARNING nova.scheduler.utils [req-c002f919-623f-4917-b3a4-e54fc60c063d 637886e2f55840e6a3381c535ba7ec4f 20dd17d014fd442fbac7fdd6b9c006b6 - default default] Failed to compute_task_migrate_server: Unknown auth type: None: neutronclient.common.exceptions.Unauthorized: Unknown auth type: None
2021-05-11 16:45:28.004 [nova-cell-conductor.log] 23 WARNING nova.scheduler.utils [req-c002f919-623f-4917-b3a4-e54fc60c063d 637886e2f55840e6a3381c535ba7ec4f 20dd17d014fd442fbac7fdd6b9c006b6 - default default] [instance: 709ed075-6419-4e0b-928f-27050b8910ba] Setting instance to ACTIVE state.: neutronclient.common.exceptions.Unauthorized: Unknown auth type: None
2021-05-11 16:45:28.062 [nova-cell-conductor.log] 23 DEBUG nova.objects.instance [req-c002f919-623f-4917-b3a4-e54fc60c063d 637886e2f55840e6a3381c535ba7ec4f 20dd17d014fd442fbac7fdd6b9c006b6 - default default] Lazy-loading 'flavor' on Instance uuid 709ed075-6419-4e0b-928f-27050b8910ba obj_load_attr /usr/lib/python3.6/site-packages/nova/objects/instance.py:1091
2021-05-11 16:45:28.098 [nova-cell-conductor.log] 23 DEBUG nova.objects.instance [req-c002f919-623f-4917-b3a4-e54fc60c063d 637886e2f55840e6a3381c535ba7ec4f 20dd17d014fd442fbac7fdd6b9c006b6 - default default] Lazy-loading 'metadata' on Instance uuid 709ed075-6419-4e0b-928f-27050b8910ba obj_load_attr /usr/lib/python3.6/site-packages/nova/objects/instance.py:1091
2021-05-11 16:45:28.136 [nova-cell-conductor.log] 23 DEBUG nova.objects.instance [req-c002f919-623f-4917-b3a4-e54fc60c063d 637886e2f55840e6a3381c535ba7ec4f 20dd17d014fd442fbac7fdd6b9c006b6 - default default] Lazy-loading 'info_cache' on Instance uuid 709ed075-6419-4e0b-928f-27050b8910ba obj_load_attr /usr/lib/python3.6/site-packages/nova/objects/instance.py:1091
2021-05-11 16:45:28.223 [nova-cell-conductor.log] 23 ERROR oslo_messaging.rpc.server [req-c002f919-623f-4917-b3a4-e54fc60c063d 637886e2f55840e6a3381c535ba7ec4f 20dd17d014fd442fbac7fdd6b9c006b6 - default default] Exception during message handling: neutronclient.common.exceptions.Unauthorized: Unknown auth type: None
2021-05-11 16:45:28.223 [nova-cell-conductor.log] 23 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2021-05-11 16:45:28.223 [nova-cell-conductor.log] 23 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming


Version-Release number of selected component (if applicable):
[stack@undercloud-0 ~]$ cat core_puddle_version 
RHOS-16.1-RHEL-8-20210506.n.1
[stack@undercloud-0 ~]$ cat /etc/rhosp-release 
Red Hat OpenStack Platform release 16.1.6 GA (Train)


How reproducible:
~50% running the same suite of tests on the same environment

Steps to Reproduce:
1. Set up a multi-cell environment with TLS-E
2. Create an instance in the cell1 cluster of the deployment
3. Using the admin client, cold migrate the server and wait until status VERIFY_RESIZE is reached
4. (The failure might be triggered by migration tests running in parallel)

Actual results:
Instance fails to cold migrate to new host

Expected results:
Instance should migrate to a new host

Additional info:
Test logs can be found here [3]

Relevant nova logs are merged and attached.  The instance ID involved in the failing scenario is 709ed075-6419-4e0b-928f-27050b8910ba and the request ID associated with the migration is req-c002f919-623f-4917-b3a4-e54fc60c063d
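
To pull the relevant entries out of the merged logs, the request ID can be used as a filter. The following is a minimal sketch in Python, assuming the merged log has been loaded as a list of lines. The function name lines_for_request is hypothetical, and the sample lines are shortened copies of log lines from this report, with "..." marking trimmed text.

```python
def lines_for_request(log_lines, request_id, levels=("ERROR", "WARNING")):
    """Return log lines that mention request_id at one of the given levels."""
    matches = []
    for line in log_lines:
        if request_id not in line:
            continue
        # Levels appear as space-delimited tokens in the nova log format.
        if any(" %s " % level in line for level in levels):
            matches.append(line)
    return matches


REQUEST_ID = "req-c002f919-623f-4917-b3a4-e54fc60c063d"

# Shortened sample lines; "..." marks text trimmed for this illustration.
sample = [
    "2021-05-11 16:45:27.998 [nova-cell-conductor.log] 23 ERROR "
    "nova.network.neutronv2.api [%s ...] ..." % REQUEST_ID,
    "2021-05-11 16:45:27.999 [nova-cell-conductor.log] 23 WARNING "
    "nova.scheduler.utils [%s ...] ..." % REQUEST_ID,
    "2021-05-11 16:45:28.062 [nova-cell-conductor.log] 23 DEBUG "
    "nova.objects.instance [%s ...] ..." % REQUEST_ID,
    "2021-05-11 16:02:08.801 [nova-scheduler.log] 7 DEBUG "
    "oslo_service.service [req-a383177d-fd3b-49e4-bfef-e844cf46dd83 ...] ...",
]

for line in lines_for_request(sample, REQUEST_ID):
    print(line)
```

Passing levels=("ERROR",) narrows the result to the error entries only.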


[1] https://github.com/openstack/nova/blob/stable/train/nova/network/neutronv2/api.py#L82
[2] https://github.com/openstack/tempest/blob/master/tempest/api/compute/admin/test_migrations.py#L171
[3] https://rhos-ci-staging-jenkins.lab.eng.tlv2.redhat.com/job/DFG-compute-nova-16.1_director-rhel-virthost-1cont_1comp_1cellcont_2cellcomp_1ipa-ipv4-geneve-multi-cell-tls-everywhere-phase3/5/testReport/tempest.api.compute.admin.test_migrations/MigrationsAdminTest/test_revert_cold_migration_id_caa1aa8b_f4ef_4374_be0d_95f001c2ac2d_/

--- Additional comment from  on 2021-05-14 13:42:39 UTC ---

Looking at the merged logs, I can see the conductor does not have the neutron admin credentials:

2021-05-11 16:02:05.847 [nova-conductor.log] 7 DEBUG oslo_service.service [-] neutron.auth_section           = None log_opt_values /usr/lib/python3.6/site-packages/oslo_config/cfg.py:2589
2021-05-11 16:02:05.847 [nova-conductor.log] 7 DEBUG oslo_service.service [-] neutron.auth_type              = v3password log_opt_values /usr/lib/python3.6/site-packages/oslo_config/cfg.py:2589

So I suspect that this is why it's failing.


This is the relevant log: http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/staging/DFG-compute-nova-16.1_director-rhel-virthost-1cont_1comp_1cellcont_2cellcomp_1ipa-ipv4-geneve-multi-cell-tls-everywhere-phase3/5/controller-0/var/lib/config-data/nova/etc/nova/nova.conf.gz

# Authentication type to load (string value)
# Deprecated group;name - [neutron]/auth_plugin
#auth_type=<None>
auth_type=v3password

# Config Section from which to load plugin specific options (string value)
#auth_section=<None>

# Authentication URL (string value)
#auth_url=<None>
auth_url=https://overcloud.internalapi.redhat.local:5000/v3

# Scope for system operations (string value)
#system_scope=<None>

# Domain ID to scope to (string value)
#domain_id=<None>

# Domain name to scope to (string value)
#domain_name=<None>

# Project ID to scope to (string value)
#project_id=<None>

# Project name to scope to (string value)
#project_name=<None>
project_name=service

# Domain ID containing project (string value)
#project_domain_id=<None>

# Domain name containing project (string value)
#project_domain_name=<None>
project_domain_name=Default

# Trust ID (string value)
#trust_id=<None>

# Optional domain ID to use with v3 and v2 parameters. It will be used for both
# the user and project domain in v3 and ignored in v2 authentication (string
# value)
#default_domain_id=<None>

# Optional domain name to use with v3 API and v2 parameters. It will be used for
# both the user and project domain in v3 and ignored in v2 authentication
# (string value)
#default_domain_name=<None>

# User ID (string value)
#user_id=<None>

# Username (string value)
# Deprecated group;name - [neutron]/user_name
#username=<None>
username=neutron

# User's domain id (string value)
#user_domain_id=<None>

# User's domain name (string value)
#user_domain_name=<None>
user_domain_name=Default

# User's password (string value)
#password=<None>
password=qPe6rrMXgW2sFlD314F0K91v1

So it should be able to connect. The scheduler shares the same nova.conf, I believe, but it seems to think the auth type is None:

2021-05-11 16:45:27.998 [nova-cell-conductor.log] 23 ERROR nova.network.neutronv2.api [req-c002f919-623f-4917-b3a4-e54fc60c063d 637886e2f55840e6a3381c535ba7ec4f 20dd17d014fd442fbac7fdd6b9c006b6 - default default] The [neutron] section of your nova configuration file must be configured for authentication with the networking service endpoint. See the networking service install guide for details: https://docs.openstack.org/neutron/latest/install/                                                                                                      
2021-05-11 16:45:27.999 [nova-cell-conductor.log] 23 WARNING nova.scheduler.utils [req-c002f919-623f-4917-b3a4-e54fc60c063d 637886e2f55840e6a3381c535ba7ec4f 20dd17d014fd442fbac7fdd6b9c006b6 - default default] Failed to compute_task_migrate_server: Unknown auth type: None: neutronclient.common.exceptions.Unauthorized: Unknown auth type: None


Yet at startup it was detected correctly as v3password:

2021-05-11 16:02:08.801 [nova-scheduler.log] 7 DEBUG oslo_service.service [req-a383177d-fd3b-49e4-bfef-e844cf46dd83 - - - - -] neutron.auth_type              = v3password


So there is obviously something wrong here that is causing the construction of the neutron client to fail in nova.network.neutronv2.api.
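
The log messages above are consistent with a client constructor that refuses to proceed when no auth_type is visible to the process. The following Python sketch is a simplified illustration of that failure mode only, not nova's actual code. The Unauthorized class is a local stand-in for neutronclient.common.exceptions.Unauthorized, and load_auth_plugin, try_load, and the dict argument are hypothetical.

```python
class Unauthorized(Exception):
    """Local stand-in for neutronclient.common.exceptions.Unauthorized."""


def load_auth_plugin(neutron_opts):
    """Return a fake auth plugin description, or raise if auth_type is unset.

    neutron_opts is a plain dict standing in for the [neutron] options
    that the running process has loaded (hypothetical structure).
    """
    auth_type = neutron_opts.get("auth_type")
    if auth_type is None:
        # Mirrors the message text seen in the cell conductor log.
        raise Unauthorized("Unknown auth type: %s" % auth_type)
    return {"plugin": auth_type}


def try_load(neutron_opts):
    """Return the plugin description, or the error text if loading fails."""
    try:
        return load_auth_plugin(neutron_opts)
    except Unauthorized as exc:
        return str(exc)


print(try_load({"auth_type": "v3password"}))  # options present, as on controller-0
print(try_load({}))                           # empty [neutron] section, as on the cell controller
```

Under this reading, a process that loads a nova.conf without the [neutron] options fails exactly at client construction, regardless of what other services on other nodes have loaded.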

Comment 7 James Parker 2021-06-10 20:01:12 UTC
Created attachment 1790014 [details]
Compute tempest tests with cold migration tests for verification

Comment 10 errata-xmlrpc 2021-09-15 07:15:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:3483

