Bug 1878492 - openstack overcloud deploy always fails after OSP13->OSP16.1 upgrade
Summary: openstack overcloud deploy always fails after OSP13->OSP16.1 upgrade
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: urgent
Target Milestone: z3
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: Alan Bishop
QA Contact: Tzach Shefi
URL:
Whiteboard:
Duplicates: 1853281 1856906 1882110
Depends On:
Blocks:
 
Reported: 2020-09-13 12:09 UTC by Takashi Kajinami
Modified: 2024-03-25 16:40 UTC
CC: 14 users

Fixed In Version: openstack-tripleo-heat-templates-11.3.2-1.20200914170164.el8ost
Doc Type: Bug Fix
Doc Text:
Before this update, director maintained Identity service (keystone) catalog entries for Block Storage service's (cinder) deprecated v1 API volume service, and the legacy Identity service endpoints were not compatible with recent enhancements to director's endpoint validations. As a result, stack updates failed if a legacy volume service was present in the Identity service catalog. With this update, director automatically removes the legacy volume service and its associated endpoints. Stack updates no longer fail Identity service endpoint validation.
Clone Of:
Environment:
Last Closed: 2020-12-15 18:36:32 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Launchpad 1897761 0 None None None 2020-09-29 15:51:05 UTC
OpenStack gerrit 757414 0 None MERGED [FFU] Remove cinder's v1 keystone service 2021-05-26 13:26:19 UTC
Red Hat Issue Tracker OSP-1668 0 None None None 2024-03-25 16:40:07 UTC
Red Hat Product Errata RHEA-2020:5413 0 None None None 2020-12-15 18:37:03 UTC

Description Takashi Kajinami 2020-09-13 12:09:27 UTC
Description of problem:

After an OSP13->OSP16.1 upgrade, "openstack overcloud deploy" always fails because of an error in the "Check Keystone public endpoint status" task.
By removing no_log: true in /usr/share/ansible/roles/tripleo-keystone-resources/tasks/endpoints.yml, I captured the following error from tripleo-keystone-resources.

~~~
...

TASK [tripleo-keystone-resources : Check Keystone public endpoint status] ******
Sunday 13 September 2020  11:14:39 +0000 (0:00:05.000)       0:25:13.826 ****** 
...
failed: [undercloud] (item={'started': 1, 'finished': 0, 'ansible_job_id': '668760418547.976504', 'results_file': '/root/.ansible_async/668760418547.976504', 'changed': True, 'failed': False, 'tripleo_keystone_resources_data': {'key': 'cinderv3', 'value': {'endpoints': {'admin': 'http://172.17.1.25:8776/v3/%(tenant_id)s', 'internal': 'http://172.17.1.25:8776/v3/%(tenant_id)s', 'public': 'http://10.0.0.145:8776/v3/%(tenant_id)s'}, 'region': 'regionOne', 'service': 'volumev3', 'users': {'cinderv3': {'password': 'NAR9UexZ7u28rCGRuEEArtt7v', 'roles': ['admin', 'service']}}}}, 'ansible_loop_var': 'tripleo_keystone_resources_data'}) => {"ansible_job_id": "668760418547.976504", "ansible_loop_var": "tripleo_keystone_resources_endpoint_async_result_item", "attempts": 1, "changed": false, "finished": 1, "msg": "Multiple matches found for cinderv3", "tripleo_keystone_resources_endpoint_async_result_item": {"ansible_job_id": "668760418547.976504", "ansible_loop_var": "tripleo_keystone_resources_data", "changed": true, "failed": false, "finished": 0, "results_file": "/root/.ansible_async/668760418547.976504", "started": 1, "tripleo_keystone_resources_data": {"key": "cinderv3", "value": {"endpoints": {"admin": "http://172.17.1.25:8776/v3/%(tenant_id)s", "internal": "http://172.17.1.25:8776/v3/%(tenant_id)s", "public": "http://10.0.0.145:8776/v3/%(tenant_id)s"}, "region": "regionOne", "service": "volumev3", "users": {"cinderv3": {"password": "NAR9UexZ7u28rCGRuEEArtt7v", "roles": ["admin", "service"]}}}}}}
...

NO MORE HOSTS LEFT *************************************************************

PLAY RECAP *********************************************************************
compute-0                  : ok=256  changed=103  unreachable=0    failed=0    skipped=118  rescued=0    ignored=0   
compute-1                  : ok=252  changed=103  unreachable=0    failed=0    skipped=118  rescued=0    ignored=0   
controller-0               : ok=317  changed=138  unreachable=0    failed=0    skipped=127  rescued=0    ignored=0   
controller-1               : ok=306  changed=138  unreachable=0    failed=0    skipped=128  rescued=0    ignored=0   
controller-2               : ok=306  changed=138  unreachable=0    failed=0    skipped=128  rescued=0    ignored=0   
undercloud                 : ok=51   changed=16   unreachable=0    failed=1    skipped=44   rescued=0    ignored=0   

Sunday 13 September 2020  11:14:42 +0000 (0:00:02.533)       0:25:16.360 ****** 
=============================================================================== 

Ansible failed, check log at /var/lib/mistral/overcloud/ansible.log.
Overcloud configuration failed.
~~~

A Cinder v1 ("volume" type) endpoint is created during OSP13 deployment, but it is no longer created in an OSP16.1 deployment.
The problem is that it is not purged during the upgrade and is left behind in the OSP16.1 deployment,
which then results in the deployment failure.

Fresh RHOSP13 deployment
~~~
(overcloud) [stack@undercloud-0 ~]$ openstack endpoint list --service cinderv3
Multiple service matches found for 'cinderv3', use an ID to be more specific.
(overcloud) [stack@undercloud-0 ~]$ openstack endpoint list | grep cinderv3
| 0232351c41fc41eb8dc281a01f27e20b | regionOne | cinderv3     | volumev3       | True    | internal  | http://172.17.1.103:8776/v3/%(tenant_id)s     |
| 1cfa53ca39184f94a55a91ed6d3ae8af | regionOne | cinderv3     | volume         | True    | admin     | http://172.17.1.103:8776/v3/%(tenant_id)s     |
| 5a6b1149cc4142d08e411e10743b8356 | regionOne | cinderv3     | volume         | True    | public    | http://10.0.0.113:8776/v3/%(tenant_id)s       |
| 7aea0836ca984a6fb4dd2c55318dd8af | regionOne | cinderv3     | volumev3       | True    | public    | http://10.0.0.113:8776/v3/%(tenant_id)s       |
| c04ae1bae20b45e0b127749e75911907 | regionOne | cinderv3     | volume         | True    | internal  | http://172.17.1.103:8776/v3/%(tenant_id)s     |
| e097455080374c5fad4ab96725e8ec8c | regionOne | cinderv3     | volumev3       | True    | admin     | http://172.17.1.103:8776/v3/%(tenant_id)s     |
~~~

Fresh RHOSP16.1 deployment
~~~
(overcloud) [stack@undercloud-0 ~]$ openstack endpoint list --service cinderv3
+----------------------------------+-----------+--------------+--------------+---------+-----------+-------------------------------------------+
| ID                               | Region    | Service Name | Service Type | Enabled | Interface | URL                                       |
+----------------------------------+-----------+--------------+--------------+---------+-----------+-------------------------------------------+
| 691ce1bdc9b343ebb8fee891bd2b9733 | regionOne | cinderv3     | volumev3     | True    | admin     | http://172.17.1.49:8776/v3/%(tenant_id)s  |
| 905b33cf92454c7881c8c222b5c30059 | regionOne | cinderv3     | volumev3     | True    | public    | https://10.0.0.101:13776/v3/%(tenant_id)s |
| e133d65395874e5fb2dd0ae53abd268e | regionOne | cinderv3     | volumev3     | True    | internal  | http://172.17.1.49:8776/v3/%(tenant_id)s  |
+----------------------------------+-----------+--------------+--------------+---------+-----------+-------------------------------------------+
~~~

RHOSP16.1 upgraded from RHOSP13
~~~
(overcloud) [stack@undercloud-0 ~]$ openstack endpoint list --service cinderv3
Multiple service matches found for 'cinderv3', use an ID to be more specific.
(overcloud) [stack@undercloud-0 ~]$ openstack endpoint list | grep cinderv3
| 596146ea98914c69ab77b023ff3ade3e | regionOne | cinderv3     | volume         | True    | public    | http://10.0.0.145:8776/v3/%(tenant_id)s       |
| 5e2429696b2143c8a9dbe7553fde9e1e | regionOne | cinderv3     | volume         | True    | admin     | http://172.17.1.25:8776/v3/%(tenant_id)s      |
| 8eeb1f1780ba46769cd1112ac1899a46 | regionOne | cinderv3     | volume         | True    | internal  | http://172.17.1.25:8776/v3/%(tenant_id)s      |
| 95dbaf739c3348cf8016ee50b53eea9a | regionOne | cinderv3     | volumev3       | True    | public    | http://10.0.0.145:8776/v3/%(tenant_id)s       |
| a5ca4c12a3f04ac5b11bd1c7334fc094 | regionOne | cinderv3     | volumev3       | True    | internal  | http://172.17.1.25:8776/v3/%(tenant_id)s      |
| d98b138eba27410097ce333d6fb5a7cb | regionOne | cinderv3     | volumev3       | True    | admin     | http://172.17.1.25:8776/v3/%(tenant_id)s      |
~~~


Version-Release number of selected component (if applicable):

(overcloud) [stack@undercloud-0 ~]$ rpm -qa | grep tripleo | sort
ansible-role-tripleo-modify-image-1.2.1-0.20200527233426.bc21900.el8ost.noarch
ansible-tripleo-ipa-0.2.1-0.20200611104546.c22fc8d.el8ost.noarch
ansible-tripleo-ipsec-9.2.1-0.20200311073016.0c8693c.el8ost.noarch
openstack-tripleo-common-11.3.3-0.20200611110657.f7715be.el8ost.noarch
openstack-tripleo-common-containers-11.3.3-0.20200611110657.f7715be.el8ost.noarch
openstack-tripleo-heat-templates-11.3.2-0.20200616081539.396affd.el8ost.noarch
openstack-tripleo-image-elements-10.6.2-0.20200528043425.7dc0fa1.el8ost.noarch
openstack-tripleo-puppet-elements-11.2.2-0.20200527003426.226ce95.el8ost.noarch
openstack-tripleo-validations-11.3.2-0.20200611115253.08f469d.el8ost.noarch
puppet-tripleo-11.5.0-0.20200616033428.8ff1c6a.el8ost.noarch
python3-tripleoclient-12.3.2-0.20200615103427.6f877f6.el8ost.noarch
python3-tripleoclient-heat-installer-12.3.2-0.20200615103427.6f877f6.el8ost.noarch
python3-tripleo-common-11.3.3-0.20200611110657.f7715be.el8ost.noarch
tripleo-ansible-0.5.1-0.20200611113659.34b8fcc.el8ost.noarch
(overcloud) [stack@undercloud-0 ~]$ cat /etc/rhosp-release 
Red Hat OpenStack Platform release 16.1.1 GA (Train)

How reproducible:
Always

Steps to Reproduce:
1. Upgrade RHOSP13 deployment to RHOSP16.1
2. run "openstack overcloud deploy" to update overcloud stack

Actual results:
deploy fails at "Check Keystone public endpoint status" task

Expected results:
deploy completes without failures

Additional info:

Comment 1 Takashi Kajinami 2020-09-13 13:41:06 UTC
I deleted the Cinder v1 API endpoints, but the deploy still fails.
~~~
(overcloud) [stack@undercloud-0 ~]$ openstack endpoint delete 596146ea98914c69ab77b023ff3ade3e
(overcloud) [stack@undercloud-0 ~]$ openstack endpoint delete 5e2429696b2143c8a9dbe7553fde9e1e
(overcloud) [stack@undercloud-0 ~]$ openstack endpoint delete 8eeb1f1780ba46769cd1112ac1899a46
(overcloud) [stack@undercloud-0 ~]$ openstack endpoint list | grep cinderv3
| 95dbaf739c3348cf8016ee50b53eea9a | regionOne | cinderv3     | volumev3       | True    | public    | http://10.0.0.145:8776/v3/%(tenant_id)s       |
| a5ca4c12a3f04ac5b11bd1c7334fc094 | regionOne | cinderv3     | volumev3       | True    | internal  | http://172.17.1.25:8776/v3/%(tenant_id)s      |
| d98b138eba27410097ce333d6fb5a7cb | regionOne | cinderv3     | volumev3       | True    | admin     | http://172.17.1.25:8776/v3/%(tenant_id)s      |
~~~
~~~
TASK [tripleo-keystone-resources : Check Keystone public endpoint status] ******
Sunday 13 September 2020  12:45:24 +0000 (0:00:04.946)       0:25:56.754 ****** 
...
failed: [undercloud] (item={'started': 1, 'finished': 0, 'ansible_job_id': '737602304891.91428', 'results_file': '/root/.ansible_async/737602304891.91428', 'changed': True, 'failed': False, 'tripleo_keystone_resources_data': {'key': 'cinderv3', 'value': {'endpoints': {'admin': 'http://172.17.1.25:8776/v3/%(tenant_id)s', 'internal': 'http://172.17.1.25:8776/v3/%(tenant_id)s', 'public': 'http://10.0.0.145:8776/v3/%(tenant_id)s'}, 'region': 'regionOne', 'service': 'volumev3', 'users': {'cinderv3': {'password': 'NAR9UexZ7u28rCGRuEEArtt7v', 'roles': ['admin', 'service']}}}}, 'ansible_loop_var': 'tripleo_keystone_resources_data'}) => {"ansible_job_id": "737602304891.91428", "ansible_loop_var": "tripleo_keystone_resources_endpoint_async_result_item", "attempts": 1, "changed": false, "finished": 1, "msg": "Multiple matches found for cinderv3", "tripleo_keystone_resources_endpoint_async_result_item": {"ansible_job_id": "737602304891.91428", "ansible_loop_var": "tripleo_keystone_resources_data", "changed": true, "failed": false, "finished": 0, "results_file": "/root/.ansible_async/737602304891.91428", "started": 1, "tripleo_keystone_resources_data": {"key": "cinderv3", "value": {"endpoints": {"admin": "http://172.17.1.25:8776/v3/%(tenant_id)s", "internal": "http://172.17.1.25:8776/v3/%(tenant_id)s", "public": "http://10.0.0.145:8776/v3/%(tenant_id)s"}, "region": "regionOne", "service": "volumev3", "users": {"cinderv3": {"password": "NAR9UexZ7u28rCGRuEEArtt7v", "roles": ["admin", "service"]}}}}}}

NO MORE HOSTS LEFT *************************************************************

PLAY RECAP *********************************************************************
compute-0                  : ok=254  changed=103  unreachable=0    failed=0    skipped=119  rescued=0    ignored=0   
compute-1                  : ok=250  changed=103  unreachable=0    failed=0    skipped=119  rescued=0    ignored=0   
controller-0               : ok=315  changed=138  unreachable=0    failed=0    skipped=128  rescued=0    ignored=0   
controller-1               : ok=304  changed=138  unreachable=0    failed=0    skipped=129  rescued=0    ignored=0   
controller-2               : ok=304  changed=138  unreachable=0    failed=0    skipped=129  rescued=0    ignored=0   
undercloud                 : ok=50   changed=16   unreachable=0    failed=1    skipped=44   rescued=0    ignored=0   

Sunday 13 September 2020  12:45:27 +0000 (0:00:02.534)       0:25:59.288 ****** 
=============================================================================== 

Ansible failed, check log at /var/lib/mistral/overcloud/ansible.log.
Overcloud configuration failed.
~~~

I reran the deploy after deleting the cinderv3 service with the "volume" type, and this time the deployment completed without failures.
~~~
(overcloud) [stack@undercloud-0 ~]$ openstack service list
+----------------------------------+------------+----------------+
| ID                               | Name       | Type           |
+----------------------------------+------------+----------------+
| 0712f3e20dfc40fb8d9b60aba776dee8 | heat-cfn   | cloudformation |
| 381864758ab54695a8ed42c7c25d91a2 | swift      | object-store   |
| 52413dad9d374cc9a071de8291664f7b | keystone   | identity       |
| 577b55c6e5394960b0b30497f97c012e | ceilometer | metering       |
| 634bb6d13e2440369f5c86082e2db3b0 | cinderv2   | volumev2       |
| 8dff2d6971eb4e4ba8b346cae4aba181 | heat       | orchestration  |
| 9238a83ed4aa4cf9a599c0f762e84946 | aodh       | alarming       |
| a3d3f1f6992f4c1ca6f3271bfab1e71f | cinderv3   | volume         |
| aa0853b342b34f7585f33307ab8f4c66 | panko      | event          |
| acd1aa6c6dd9475495dfafefdb69f106 | nova       | compute        |
| b18244980c9541aea9ceb0714af43b35 | placement  | placement      |
| c9183c0d5af148a89ddb79b8a7f5f54c | gnocchi    | metric         |
| cecc0e92b7984e58b4b8cfd984bb9431 | neutron    | network        |
| ec1b6b2da9f34e3181a9bc63689a5bb4 | cinderv3   | volumev3       |
| f88125bfa1ae4bbc8db65007f506416b | glance     | image          |
+----------------------------------+------------+----------------+
(overcloud) [stack@undercloud-0 ~]$ openstack service delete a3d3f1f6992f4c1ca6f3271bfab1e71f
(overcloud) [stack@undercloud-0 ~]$ openstack service list
+----------------------------------+------------+----------------+
| ID                               | Name       | Type           |
+----------------------------------+------------+----------------+
| 0712f3e20dfc40fb8d9b60aba776dee8 | heat-cfn   | cloudformation |
| 381864758ab54695a8ed42c7c25d91a2 | swift      | object-store   |
| 52413dad9d374cc9a071de8291664f7b | keystone   | identity       |
| 577b55c6e5394960b0b30497f97c012e | ceilometer | metering       |
| 634bb6d13e2440369f5c86082e2db3b0 | cinderv2   | volumev2       |
| 8dff2d6971eb4e4ba8b346cae4aba181 | heat       | orchestration  |
| 9238a83ed4aa4cf9a599c0f762e84946 | aodh       | alarming       |
| aa0853b342b34f7585f33307ab8f4c66 | panko      | event          |
| acd1aa6c6dd9475495dfafefdb69f106 | nova       | compute        |
| b18244980c9541aea9ceb0714af43b35 | placement  | placement      |
| c9183c0d5af148a89ddb79b8a7f5f54c | gnocchi    | metric         |
| cecc0e92b7984e58b4b8cfd984bb9431 | neutron    | network        |
| ec1b6b2da9f34e3181a9bc63689a5bb4 | cinderv3   | volumev3       |
| f88125bfa1ae4bbc8db65007f506416b | glance     | image          |
+----------------------------------+------------+----------------+
~~~

Comment 3 Ade Lee 2020-09-22 19:11:36 UTC
Takashi,

When the system is broken, can you confirm whether or not cinder is actually running?  Is it running on the public interface behind TLS maybe?
Do you have any logs - or a system we can look at?

Comment 4 Takashi Kajinami 2020-09-23 05:41:30 UTC
Hi Ade,

> When the system is broken, can you confirm whether or not cinder is actually running?
Unfortunately I didn't perform any volume operations at that time,
but I confirmed that the cinder containers are all running without any failures.

> Is it running on the public interface behind TLS maybe?
No. As you can see in the endpoint list output, TLS is not enabled for the public endpoints.
I don't use TLS for the internal endpoints, either.

> Do you have any logs - or a system we can look at?
I still have the deployment where I tested the upgrade, but unfortunately most of the logs have been rotated,
and the issue has already been fixed there by removing the endpoints and service with the "volume" type.

To reproduce the situation itself, I think we can manually create the service and endpoints with the "volume" type
in a fresh RHOSP16.1 deployment, matching what RHOSP13 created, for example:
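
A hedged sketch of that manual reproduction (not part of the original report): the service name and "volume" type match the tables above, but the region and URLs are examples from this environment, so substitute your own values.
~~~
# Hedged sketch: recreate the legacy OSP13-style "volume" catalog entries on a
# fresh RHOSP16.1 overcloud to reproduce the failure.
source ~/overcloudrc
openstack service create --name cinderv3 volume
openstack endpoint create --region regionOne volume admin    'http://172.17.1.25:8776/v3/%(tenant_id)s'
openstack endpoint create --region regionOne volume internal 'http://172.17.1.25:8776/v3/%(tenant_id)s'
openstack endpoint create --region regionOne volume public   'http://10.0.0.145:8776/v3/%(tenant_id)s'
~~~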

Comment 5 Alan Bishop 2020-09-25 17:45:56 UTC
All of this is a side effect of something we had to introduce in OSP-13/queens. In order to work around a bug in tempest (the version pinned to test queens) we had to create a "volume" service using the cinderv3 endpoints (see bug #1472859, and patches [1] and [2]).

[1] https://review.opendev.org/644550
[2] https://review.opendev.org/649084

Now that OSP-16/train is using a version of tempest that doesn't assume the presence of cinder's V1 API (the "volume" service), we end up with a stale "volume" service with corresponding endpoints. As Takashi-san noted, the fix involves deleting the stale catalog entries.

I just became aware of this issue, and will start working on a fix so that this is handled automatically by the FFU process.
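
For reference, a hedged sketch of what the automated cleanup amounts to (this is not the actual tripleo-heat-templates patch, which implements the equivalent as part of director's deploy tasks):
~~~
# Hedged sketch only: delete the stale legacy "volume" service if it exists.
# Keystone removes the associated endpoints when the service is deleted.
source ~/overcloudrc
if openstack service show volume >/dev/null 2>&1; then
    openstack service delete volume
fi
~~~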

Comment 6 Alan Bishop 2020-09-25 18:09:36 UTC
*** Bug 1856906 has been marked as a duplicate of this bug. ***

Comment 7 Alan Bishop 2020-09-29 14:57:03 UTC
*** Bug 1882110 has been marked as a duplicate of this bug. ***

Comment 8 Luigi Toscano 2020-09-29 16:09:14 UTC
*** Bug 1853281 has been marked as a duplicate of this bug. ***

Comment 9 Jakub Libosvar 2020-10-08 08:01:59 UTC
(In reply to Takashi Kajinami from comment #1)
> I deleted cinder v1 api, but deploy still fails.
> ~~~
> (overcloud) [stack@undercloud-0 ~]$ openstack endpoint delete
> 596146ea98914c69ab77b023ff3ade3e
> (overcloud) [stack@undercloud-0 ~]$ openstack endpoint delete
> 5e2429696b2143c8a9dbe7553fde9e1e
> (overcloud) [stack@undercloud-0 ~]$ openstack endpoint delete
> 8eeb1f1780ba46769cd1112ac1899a46
> (overcloud) [stack@undercloud-0 ~]$ openstack endpoint list | grep cinderv3
> | 95dbaf739c3348cf8016ee50b53eea9a | regionOne | cinderv3     | volumev3    
> | True    | public    | http://10.0.0.145:8776/v3/%(tenant_id)s       |
> | a5ca4c12a3f04ac5b11bd1c7334fc094 | regionOne | cinderv3     | volumev3    
> | True    | internal  | http://172.17.1.25:8776/v3/%(tenant_id)s      |
> | d98b138eba27410097ce333d6fb5a7cb | regionOne | cinderv3     | volumev3    
> | True    | admin     | http://172.17.1.25:8776/v3/%(tenant_id)s      |
> ~~~


Does this mean we have no workaround? We have a migration procedure that requires manual intervention, and if there is a workaround, we'd like to recommend it in case someone requests FFU + migration.

Comment 10 Takashi Kajinami 2020-10-08 09:08:03 UTC
(In reply to Jakub Libosvar from comment #9)
> (In reply to Takashi Kajinami from comment #1)
> > I deleted cinder v1 api, but deploy still fails.
> > ~~~
> > (overcloud) [stack@undercloud-0 ~]$ openstack endpoint delete
> > 596146ea98914c69ab77b023ff3ade3e
> > (overcloud) [stack@undercloud-0 ~]$ openstack endpoint delete
> > 5e2429696b2143c8a9dbe7553fde9e1e
> > (overcloud) [stack@undercloud-0 ~]$ openstack endpoint delete
> > 8eeb1f1780ba46769cd1112ac1899a46
> > (overcloud) [stack@undercloud-0 ~]$ openstack endpoint list | grep cinderv3
> > | 95dbaf739c3348cf8016ee50b53eea9a | regionOne | cinderv3     | volumev3    
> > | True    | public    | http://10.0.0.145:8776/v3/%(tenant_id)s       |
> > | a5ca4c12a3f04ac5b11bd1c7334fc094 | regionOne | cinderv3     | volumev3    
> > | True    | internal  | http://172.17.1.25:8776/v3/%(tenant_id)s      |
> > | d98b138eba27410097ce333d6fb5a7cb | regionOne | cinderv3     | volumev3    
> > | True    | admin     | http://172.17.1.25:8776/v3/%(tenant_id)s      |
> > ~~~
> 
> 
> Does it mean we have no workaround? We have a migration procedure that
> requires manual intervention and in case there is a workaround, we'd like to
> recommend it in case there is someone requesting FFU + migration.

There is an available workaround, which is to manually remove the following items from the overcloud keystone after the upgrade:
 1. all endpoints for the "volume" service type
 2. the service with the "volume" type

According to the patch Alan proposed, I guess we only need to remove (2) and (1) will be removed automatically,
but I haven't confirmed that removing only the volume service fixes the issue, since in comment 1 I removed the endpoints first (a sketch of the two-step removal follows).
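
A hedged sketch of that two-step manual removal (standard openstack CLI only; as the next comment confirms, deleting the service alone may be sufficient):
~~~
# Hedged sketch: remove the legacy "volume" endpoints, then the "volume" service.
source ~/overcloudrc
for ep in $(openstack endpoint list --service volume -f value -c ID); do
    openstack endpoint delete "$ep"
done
openstack service delete volume
~~~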

Comment 11 Alan Bishop 2020-10-08 12:52:11 UTC
That's correct. In fact, while developing the patch I discovered it's sufficient to just delete the "volume" service (step 2). Its endpoints are automatically deleted. Executing just this one command should be sufficient:

(overcloud) [stack@undercloud-0 ~]$ openstack service delete volume

Then run this and you should see the associated endpoints are gone:

[stack@rhos-undercloud ~]$ openstack endpoint list | grep cinder
| 435ef6e5b7064ac09d52f79012ace4c0 | regionOne | cinderv2     | volumev2       | True    | public    | http://192.168.24.14:8776/v2/%(tenant_id)s      |
| 45dd4494cd5d4f4bb8558ef3aee15710 | regionOne | cinderv3     | volumev3       | True    | public    | http://192.168.24.14:8776/v3/%(tenant_id)s      |
| 609d5234f6b74d719c916b497cbd3d0f | regionOne | cinderv2     | volumev2       | True    | internal  | http://192.168.24.14:8776/v2/%(tenant_id)s      |
| 654a9466ba5e41fd8a6b84dc033bb2d3 | regionOne | cinderv2     | volumev2       | True    | admin     | http://192.168.24.14:8776/v2/%(tenant_id)s      |
| a0385feccf254f84a798093b9ed6393a | regionOne | cinderv3     | volumev3       | True    | internal  | http://192.168.24.14:8776/v3/%(tenant_id)s      |
| dbf3c261a7e748d097dc9119f75b475c | regionOne | cinderv3     | volumev3       | True    | admin     | http://192.168.24.14:8776/v3/%(tenant_id)s      |

Comment 24 Tzach Shefi 2020-11-22 14:45:09 UTC
Alan, 

Just to confirm, the steps I should take to verify are:

1. Deploy OSP13 and check that I start off with three Cinder services and their endpoints, similar to this:
(overcloud) [stack@undercloud-0 ~]$ openstack endpoint list | grep  cinder
| 4a8d29e364094703b7da86215bcb138a | regionOne | cinderv2     | volumev2       | True    | internal  | http://172.17.1.147:8776/v2/%(tenant_id)s      |
| 4ee73c2d15a14561a1bd4e57f0925dc4 | regionOne | cinderv3     | volume         | True    | public    | http://10.0.0.108:8776/v3/%(tenant_id)s        |
| 63133b700dca466eae944271f7a3f825 | regionOne | cinderv3     | volume         | True    | internal  | http://172.17.1.147:8776/v3/%(tenant_id)s      |
| 689255e4cbb0424f9fed31bda12cbe6d | regionOne | cinderv3     | volume         | True    | admin     | http://172.17.1.147:8776/v3/%(tenant_id)s      |
| 68dc5070a9dd45ad9a37ae4c74523dd5 | regionOne | cinderv3     | volumev3       | True    | internal  | http://172.17.1.147:8776/v3/%(tenant_id)s      |
| 943e91972d9943e190f4c5401f0eecbe | regionOne | cinderv2     | volumev2       | True    | admin     | http://172.17.1.147:8776/v2/%(tenant_id)s      |
| d8cdcccff4e548e7bb8d9cae3e48c4fa | regionOne | cinderv2     | volumev2       | True    | public    | http://10.0.0.108:8776/v2/%(tenant_id)s        |
| d9121c4b166a43159634c7e37073d309 | regionOne | cinderv3     | volumev3       | True    | admin     | http://172.17.1.147:8776/v3/%(tenant_id)s      |
| ef746699f18f4f3eba6e4e62cf0e37eb | regionOne | cinderv3     | volumev3       | True    | public    | http://10.0.0.108:8776/v3/%(tenant_id)s        |


2. Upgrade to OSP16.1.
Here is where I'm a bit unsure, which is why I'm asking.
After the upgrade, I'm assuming I should still see the "volume" endpoints, correct?
Thus I should manually issue "openstack service delete volume", right?

If I now re-list the endpoints, I shouldn't see any "volume" endpoints, as they were removed.

3. And my final verification step, the crucial one, is that after manually removing the "volume" endpoints on the upgraded 16.1,
I should be able to complete some sort of overcloud update on 16.1 without errors?
Say, change the debug setting from false to true or vice versa.


If this is right I'll get working on it;
if I misunderstood something or got a step wrong, please correct me.
Thanks

Comment 25 Alan Bishop 2020-11-22 16:29:45 UTC
(In reply to Tzach Shefi from comment #24)
> Just to confirm, steps I should take to verify are:
> 
> 1. Deploy OSP13, check that I start off with 3 versions of Cinder endpoints,
> similar to say this:

Correct. You should see 3 cinder services and 9 endpoints (3 for each).

> 2. Upgrade to OSP16.1 
> Here is where I'm a bit unsure, why I'm asking. 
> After the upgrade I'm assuming I should still get "volume" endpoints
> correct? 
> Thus I should manually issue the "openstack service delete volume" right? 
> 
> If I now re-list endpoints I shouldn't get any "volume" endpoints as they
> were removed. 

No. What you described is the original (bad) behavior and the workaround. With the fix, after upgrading to 16.1 the "volume" service and its three associated endpoints should no longer be present. The BZ fix automatically removes them.

> 3. And my final verification step, the crucial one than is,
> pending manual removal of "volume" endpoints after upgrading to 16.1 
> I should be able to complete, without errors, some sort of overcloud update
> on 16.1?
> Say change debug's setting from false to true or vise versa.

Sort of. After confirming the "volume" service and its endpoints were removed, you should be able to perform any stack update (even one that doesn't change any parameters).

Here's an alternative sequence:

1. Install 13
2. FFU to 16.1.2  <=== z2! Confirm the "volume" service and its endpoints are still present
3. Perform a stack update and confirm it fails (this reproduces the bug). 
4. Upgrade undercloud to 16.1.3
5. Repeat step 3, and this time it should succeed. Also confirm the "volume" service and its endpoints were removed (a quick check is sketched below).
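
A hedged sketch of that post-update check (not part of the original comment), assuming the overcloud credentials are sourced:
~~~
# The legacy "volume" service should be gone; only volumev2/volumev3 remain.
source ~/overcloudrc
if openstack service list | grep -qw volume; then
    echo "stale legacy 'volume' service is still present"
else
    echo "legacy 'volume' service was removed as expected"
fi
openstack endpoint list | grep cinder
~~~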

Comment 26 Tzach Shefi 2020-11-23 14:45:12 UTC
Thanks Alan, 

During an attempted FFU verification I ran into a leapp bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1900667

Awaiting updates on the leapp BZ; I also emailed the rhos-upgrade/QE team.

Comment 27 Tzach Shefi 2020-11-30 09:07:16 UTC
Ignore bz1900667; that was my fault.

Now stuck on a new upgrade bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1902628

Again emailed the upgrades DFG, awaiting assistance.

Comment 28 Tzach Shefi 2020-12-03 07:30:25 UTC
Verified on:
openstack-tripleo-heat-templates-11.3.2-1.20200914170176.el8ost.noarch

Started with an OSP13 deployment (13 -p 2020-11-13.1).

(overcloud) [stack@undercloud-0 ~]$ openstack endpoint list | grep cinder
| 1ba6d3dcd6904c669ce46a4e839fbd02 | regionOne | cinderv2     | volumev2       | True    | admin     | http://172.17.1.98:8776/v2/%(tenant_id)s       |
| 1da848b4449d489fb36b1421e1e08f4a | regionOne | cinderv2     | volumev2       | True    | public    | http://10.0.0.107:8776/v2/%(tenant_id)s        |
| 213e3e6f702444a393400a7bab2409e3 | regionOne | cinderv3     | volume         | True    | internal  | http://172.17.1.98:8776/v3/%(tenant_id)s       |
| 2f60909af1a64a78a22074663c0fc588 | regionOne | cinderv3     | volumev3       | True    | public    | http://10.0.0.107:8776/v3/%(tenant_id)s        |
| 3815901813fe40fa874cff57702c1db0 | regionOne | cinderv3     | volume         | True    | public    | http://10.0.0.107:8776/v3/%(tenant_id)s        |
| 521895b698a6434a8cfbb2435917bba3 | regionOne | cinderv3     | volume         | True    | admin     | http://172.17.1.98:8776/v3/%(tenant_id)s       |
| 74a8a9208cba4c9badf1158e20fe3847 | regionOne | cinderv3     | volumev3       | True    | internal  | http://172.17.1.98:8776/v3/%(tenant_id)s       |
| d4648559b5ea4014805620c4c66481c5 | regionOne | cinderv2     | volumev2       | True    | internal  | http://172.17.1.98:8776/v2/%(tenant_id)s       |
| ed3689f6909a430d86c9d749ee0d634e | regionOne | cinderv3     | volumev3       | True    | admin     | http://172.17.1.98:8776/v3/%(tenant_id)s       |

FFU to 16.1z3 passed without an issue this time; it turns out that having Barbican deployed caused bz1902628.
Anyway, after the FFU we still have the same list of Cinder endpoints, including the legacy "volume" entries:

 (overcloud) [stack@undercloud-0 ~]$ openstack endpoint list | grep cinder
| 1ba6d3dcd6904c669ce46a4e839fbd02 | regionOne | cinderv2     | volumev2       | True    | admin     | http://172.17.1.98:8776/v2/%(tenant_id)s       |
| 1da848b4449d489fb36b1421e1e08f4a | regionOne | cinderv2     | volumev2       | True    | public    | http://10.0.0.107:8776/v2/%(tenant_id)s        |
| 213e3e6f702444a393400a7bab2409e3 | regionOne | cinderv3     | volume         | True    | internal  | http://172.17.1.98:8776/v3/%(tenant_id)s       |
| 2f60909af1a64a78a22074663c0fc588 | regionOne | cinderv3     | volumev3       | True    | public    | http://10.0.0.107:8776/v3/%(tenant_id)s        |
| 3815901813fe40fa874cff57702c1db0 | regionOne | cinderv3     | volume         | True    | public    | http://10.0.0.107:8776/v3/%(tenant_id)s        |
| 521895b698a6434a8cfbb2435917bba3 | regionOne | cinderv3     | volume         | True    | admin     | http://172.17.1.98:8776/v3/%(tenant_id)s       |
| 74a8a9208cba4c9badf1158e20fe3847 | regionOne | cinderv3     | volumev3       | True    | internal  | http://172.17.1.98:8776/v3/%(tenant_id)s       |
| d4648559b5ea4014805620c4c66481c5 | regionOne | cinderv2     | volumev2       | True    | internal  | http://172.17.1.98:8776/v2/%(tenant_id)s       |
| ed3689f6909a430d86c9d749ee0d634e | regionOne | cinderv3     | volumev3       | True    | admin     | http://172.17.1.98:8776/v3/%(tenant_id)s       |

But as Alan pointed out in a chat session, his fix resides in a section of THT called "external_deploy_tasks",
which isn't executed during the FFU process, but rather during an overcloud update.

I cloned overcloud_upgrade_prepare.sh to overcloud_deploy16.1.sh and swapped the overcloud "upgrade" command for "deploy"
(a sketch of this step follows), then ran it without any other changes, and the expected result was achieved (output below):
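
A hedged sketch of that clone-and-edit step (the file names are taken from this comment; the exact command string inside the script is an assumption):
~~~
# Hedged sketch: turn the upgrade-prepare wrapper into a plain stack update.
cp overcloud_upgrade_prepare.sh overcloud_deploy16.1.sh
# Assumes the script invokes "openstack overcloud upgrade prepare ..."
sed -i 's/overcloud upgrade prepare/overcloud deploy/' overcloud_deploy16.1.sh
./overcloud_deploy16.1.sh
~~~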


(overcloud) [stack@undercloud-0 ~]$ openstack endpoint list | grep cinder
| 1ba6d3dcd6904c669ce46a4e839fbd02 | regionOne | cinderv2     | volumev2       | True    | admin     | http://172.17.1.98:8776/v2/%(tenant_id)s       |
| 1da848b4449d489fb36b1421e1e08f4a | regionOne | cinderv2     | volumev2       | True    | public    | http://10.0.0.107:8776/v2/%(tenant_id)s        |
| 2f60909af1a64a78a22074663c0fc588 | regionOne | cinderv3     | volumev3       | True    | public    | http://10.0.0.107:8776/v3/%(tenant_id)s        |
| 74a8a9208cba4c9badf1158e20fe3847 | regionOne | cinderv3     | volumev3       | True    | internal  | http://172.17.1.98:8776/v3/%(tenant_id)s       |
| d4648559b5ea4014805620c4c66481c5 | regionOne | cinderv2     | volumev2       | True    | internal  | http://172.17.1.98:8776/v2/%(tenant_id)s       |
| ed3689f6909a430d86c9d749ee0d634e | regionOne | cinderv3     | volumev3       | True    | admin     | http://172.17.1.98:8776/v3/%(tenant_id)s       |

This time, as expected, no service or endpoints for Cinder v1 (the "volume" type) remain following a stack update of the upgraded OSP16.1z3.

Comment 35 errata-xmlrpc 2020-12-15 18:36:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.3 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:5413

