Bug 1477770

Summary: OSP11 -> OSP12 upgrade: post upgrade 'nova service-list' reports duplicate services
Product: Red Hat OpenStack Reporter: Marius Cornea <mcornea>
Component: openstack-tripleo-heat-templates Assignee: Ollie Walsh <owalsh>
Status: CLOSED ERRATA QA Contact: Marius Cornea <mcornea>
Severity: urgent Docs Contact:
Priority: high    
Version: 12.0 (Pike) CC: aschultz, ccamacho, dbecker, jschluet, mandreou, mbooth, mbultel, mburns, morazi, owalsh, rhel-osp-director-maint, sgordon
Target Milestone: rc Keywords: Triaged
Target Release: 12.0 (Pike)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-7.0.3-6.el7ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-12-13 21:48:30 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1477962    
Bug Blocks: 1399762    

Description Marius Cornea 2017-08-02 21:35:13 UTC
Description of problem:
OSP11 -> OSP12 upgrade: post upgrade 'nova service-list' reports duplicate services:

(overcloud) [stack@undercloud-0 ~]$ nova service-list
+-----+------------------+--------------------------+----------+----------+-------+----------------------------+------------------------------------------------------------------------------+
| Id  | Binary           | Host                     | Zone     | Status   | State | Updated_at                 | Disabled Reason                                                              |
+-----+------------------+--------------------------+----------+----------+-------+----------------------------+------------------------------------------------------------------------------+
| 29  | nova-conductor   | controller-1.localdomain | internal | enabled  | up    | 2017-08-02T21:29:55.000000 | -                                                                            |
| 35  | nova-conductor   | controller-2.localdomain | internal | enabled  | up    | 2017-08-02T21:29:49.000000 | -                                                                            |
| 44  | nova-compute     | compute-1.localdomain    | nova     | disabled | up    | 2017-08-02T21:29:49.000000 | AUTO: Failed to connect to libvirt: Failed to find user record for uid '162' |
| 77  | nova-scheduler   | controller-1.localdomain | internal | enabled  | up    | 2017-08-02T21:29:55.000000 | -                                                                            |
| 80  | nova-compute     | compute-0.localdomain    | nova     | enabled  | up    | 2017-08-02T21:29:56.000000 | -                                                                            |
| 83  | nova-scheduler   | controller-2.localdomain | internal | enabled  | up    | 2017-08-02T21:29:50.000000 | -                                                                            |
| 86  | nova-consoleauth | controller-1.localdomain | internal | enabled  | up    | 2017-08-02T21:29:54.000000 | -                                                                            |
| 89  | nova-consoleauth | controller-2.localdomain | internal | enabled  | up    | 2017-08-02T21:29:58.000000 | -                                                                            |
| 92  | nova-conductor   | controller-0.localdomain | internal | enabled  | up    | 2017-08-02T21:29:56.000000 | -                                                                            |
| 98  | nova-scheduler   | controller-0.localdomain | internal | enabled  | up    | 2017-08-02T21:29:56.000000 | -                                                                            |
| 101 | nova-consoleauth | controller-0.localdomain | internal | enabled  | up    | 2017-08-02T21:29:55.000000 | -                                                                            |
| 29  | nova-conductor   | controller-1.localdomain | internal | enabled  | up    | 2017-08-02T21:29:55.000000 | -                                                                            |
| 35  | nova-conductor   | controller-2.localdomain | internal | enabled  | up    | 2017-08-02T21:29:49.000000 | -                                                                            |
| 44  | nova-compute     | compute-1.localdomain    | nova     | disabled | up    | 2017-08-02T21:29:49.000000 | AUTO: Failed to connect to libvirt: Failed to find user record for uid '162' |
| 77  | nova-scheduler   | controller-1.localdomain | internal | enabled  | up    | 2017-08-02T21:29:55.000000 | -                                                                            |
| 80  | nova-compute     | compute-0.localdomain    | nova     | enabled  | up    | 2017-08-02T21:29:56.000000 | -                                                                            |
| 83  | nova-scheduler   | controller-2.localdomain | internal | enabled  | up    | 2017-08-02T21:29:50.000000 | -                                                                            |
| 86  | nova-consoleauth | controller-1.localdomain | internal | enabled  | up    | 2017-08-02T21:29:54.000000 | -                                                                            |
| 89  | nova-consoleauth | controller-2.localdomain | internal | enabled  | up    | 2017-08-02T21:29:58.000000 | -                                                                            |
| 92  | nova-conductor   | controller-0.localdomain | internal | enabled  | up    | 2017-08-02T21:29:56.000000 | -                                                                            |
| 98  | nova-scheduler   | controller-0.localdomain | internal | enabled  | up    | 2017-08-02T21:29:56.000000 | -                                                                            |
| 101 | nova-consoleauth | controller-0.localdomain | internal | enabled  | up    | 2017-08-02T21:29:55.000000 | -                                                                            |
+-----+------------------+--------------------------+----------+----------+-------+----------------------------+------------------------------------------------------------------------------+


Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-7.0.0-0.20170721174554.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP11
2. Upgrade to OSP12
3. After the upgrade process is completed(major-upgrade-converge-docker.yaml) check nova service-list

Actual results:
We can see duplicate services reported by nova service-list:

(overcloud) [stack@undercloud-0 ~]$ nova service-list
+-----+------------------+--------------------------+----------+----------+-------+----------------------------+------------------------------------------------------------------------------+
| Id  | Binary           | Host                     | Zone     | Status   | State | Updated_at                 | Disabled Reason                                                              |
+-----+------------------+--------------------------+----------+----------+-------+----------------------------+------------------------------------------------------------------------------+
| 29  | nova-conductor   | controller-1.localdomain | internal | enabled  | up    | 2017-08-02T21:29:55.000000 | -                                                                            |
| 35  | nova-conductor   | controller-2.localdomain | internal | enabled  | up    | 2017-08-02T21:29:49.000000 | -                                                                            |
| 44  | nova-compute     | compute-1.localdomain    | nova     | disabled | up    | 2017-08-02T21:29:49.000000 | AUTO: Failed to connect to libvirt: Failed to find user record for uid '162' |
| 77  | nova-scheduler   | controller-1.localdomain | internal | enabled  | up    | 2017-08-02T21:29:55.000000 | -                                                                            |
| 80  | nova-compute     | compute-0.localdomain    | nova     | enabled  | up    | 2017-08-02T21:29:56.000000 | -                                                                            |
| 83  | nova-scheduler   | controller-2.localdomain | internal | enabled  | up    | 2017-08-02T21:29:50.000000 | -                                                                            |
| 86  | nova-consoleauth | controller-1.localdomain | internal | enabled  | up    | 2017-08-02T21:29:54.000000 | -                                                                            |
| 89  | nova-consoleauth | controller-2.localdomain | internal | enabled  | up    | 2017-08-02T21:29:58.000000 | -                                                                            |
| 92  | nova-conductor   | controller-0.localdomain | internal | enabled  | up    | 2017-08-02T21:29:56.000000 | -                                                                            |
| 98  | nova-scheduler   | controller-0.localdomain | internal | enabled  | up    | 2017-08-02T21:29:56.000000 | -                                                                            |
| 101 | nova-consoleauth | controller-0.localdomain | internal | enabled  | up    | 2017-08-02T21:29:55.000000 | -                                                                            |
| 29  | nova-conductor   | controller-1.localdomain | internal | enabled  | up    | 2017-08-02T21:29:55.000000 | -                                                                            |
| 35  | nova-conductor   | controller-2.localdomain | internal | enabled  | up    | 2017-08-02T21:29:49.000000 | -                                                                            |
| 44  | nova-compute     | compute-1.localdomain    | nova     | disabled | up    | 2017-08-02T21:29:49.000000 | AUTO: Failed to connect to libvirt: Failed to find user record for uid '162' |
| 77  | nova-scheduler   | controller-1.localdomain | internal | enabled  | up    | 2017-08-02T21:29:55.000000 | -                                                                            |
| 80  | nova-compute     | compute-0.localdomain    | nova     | enabled  | up    | 2017-08-02T21:29:56.000000 | -                                                                            |
| 83  | nova-scheduler   | controller-2.localdomain | internal | enabled  | up    | 2017-08-02T21:29:50.000000 | -                                                                            |
| 86  | nova-consoleauth | controller-1.localdomain | internal | enabled  | up    | 2017-08-02T21:29:54.000000 | -                                                                            |
| 89  | nova-consoleauth | controller-2.localdomain | internal | enabled  | up    | 2017-08-02T21:29:58.000000 | -                                                                            |
| 92  | nova-conductor   | controller-0.localdomain | internal | enabled  | up    | 2017-08-02T21:29:56.000000 | -                                                                            |
| 98  | nova-scheduler   | controller-0.localdomain | internal | enabled  | up    | 2017-08-02T21:29:56.000000 | -                                                                            |
| 101 | nova-consoleauth | controller-0.localdomain | internal | enabled  | up    | 2017-08-02T21:29:55.000000 | -                                                                            |
+-----+------------------+--------------------------+----------+----------+-------+----------------------------+------------------------------------------------------------------------------+


Expected results:
We don't get any duplicate services.

Additional info:
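A quick way to confirm the symptom from saved CLI output is to count how often each Id appears. The sketch below is a hypothetical helper (not part of nova or TripleO); it assumes the table layout printed by `nova service-list` above, with the Id in the first column, and uses an abbreviated sample rather than the full listing.

```python
from collections import Counter


def duplicate_ids(table_text):
    """Return the sorted list of Ids that appear more than once in the table."""
    ids = []
    for line in table_text.splitlines():
        line = line.strip()
        # Data and header rows start with '|'; border rows start with '+'.
        if not line.startswith("|"):
            continue
        first_cell = line.strip("|").split("|")[0].strip()
        # Skip the header row ("Id" or "ID") and empty cells.
        if not first_cell or first_cell.lower() == "id":
            continue
        ids.append(first_cell)
    counts = Counter(ids)
    return sorted(i for i, n in counts.items() if n > 1)


# Abbreviated sample in the same layout as the output above.
sample = """
+----+----------------+--------------------------+
| Id | Binary         | Host                     |
+----+----------------+--------------------------+
| 29 | nova-conductor | controller-1.localdomain |
| 35 | nova-conductor | controller-2.localdomain |
| 29 | nova-conductor | controller-1.localdomain |
+----+----------------+--------------------------+
"""
print(duplicate_ids(sample))  # ['29']
```

An empty list would mean every service is reported exactly once.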

Comment 1 Marius Cornea 2017-08-02 21:40:43 UTC
The same goes for hypervisor-list, hypervisor-stats:

(overcloud) [stack@undercloud-0 ~]$ nova hypervisor-list
+----+-----------------------+-------+---------+
| ID | Hypervisor hostname   | State | Status  |
+----+-----------------------+-------+---------+
| 2  | compute-1.localdomain | up    | enabled |
| 5  | compute-0.localdomain | up    | enabled |
| 2  | compute-1.localdomain | up    | enabled |
| 5  | compute-0.localdomain | up    | enabled |
+----+-----------------------+-------+---------+


(overcloud) [stack@undercloud-0 ~]$ nova hypervisor-stats
+----------------------+-------+
| Property             | Value |
+----------------------+-------+
| count                | 4     |
| current_workload     | 0     |
| disk_available_least | 118   |
| free_disk_gb         | 156   |
| free_ram_mb          | 16380 |
| local_gb             | 156   |
| local_gb_used        | 0     |
| memory_mb            | 32764 |
| memory_mb_used       | 16384 |
| running_vms          | 0     |
| vcpus                | 16    |
| vcpus_used           | 0     |
+----------------------+-------+

Comment 2 Carlos Camacho 2017-08-07 12:48:04 UTC
Hey!!

Just looking through the code:

In a non-controller upgrade, I think we are somehow skipping the upgrade_tasks step that actually stops the services running under systemd, e.g. nova-conductor.

Here we are stopping nova-conductor:
https://github.com/openstack/tripleo-heat-templates/blob/master/docker/services/nova-conductor.yaml#L110

But we are not getting to:
https://github.com/openstack/tripleo-heat-templates/blob/master/docker/docker-steps.j2#L167

Comment 3 Carlos Camacho 2017-08-14 13:12:55 UTC
Marios, at the beginning I believed this bug was related to something like https://review.openstack.org/#/c/484711/

But I think there is something else there.

Comment 4 Marios Andreou 2017-08-17 11:58:27 UTC
This is not a valid bug yet. It is blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1477962 for the non-controller upgrade. That is, the workflow for the non-controllers (computes, in this BZ) is still being finished in BZ 1477962. Once that is fixed, we will get the execution of the upgrade_tasks from the nova-compute service, which includes a stop of the existing (non-dockerized) service in https://github.com/openstack/tripleo-heat-templates/blob/0cb45d65c607cf4eb9a4096c7cc3f1c8a5ca58b4/docker/services/nova-compute.yaml#L145 .

I think BZ 1477768 is related to, or a duplicate (?) of, this one, if indeed the root cause is that we are not stopping nova-compute (and nova-*) running on the compute node before these are brought up in containers. I know you landed the fix in tripleo_upgrade_node.sh @ https://review.openstack.org/#/c/490226/ for BZ 1477768, but as in the paragraph above, the workflow has changed, so we will no longer rely on that file (it *is* still wired in but we may remove it altogether).

So, do you agree that this is now blocked/needs re-testing once we get BZ 1477962 fixed?

Comment 5 Marius Cornea 2017-08-17 12:02:39 UTC
(In reply to marios from comment #4)
> this is not a valid bug, yet. It is blocked by
> https://bugzilla.redhat.com/show_bug.cgi?id=1477962 for the non controller
> upgrade. That is, the workflow for the non controllers (in this BZ
> computes), is still being finished by BZ 1477962 . Once that is fixed, we
> will get the execution of the upgrade_tasks from the nova-compute service,
> which includes a stop on the existing (non dockerized) service in
> https://github.com/openstack/tripleo-heat-templates/blob/
> 0cb45d65c607cf4eb9a4096c7cc3f1c8a5ca58b4/docker/services/nova-compute.
> yaml#L145 .
> 
> I think BZ 1477768 is related/duplicate (?) of this , if indeed the root
> cause is that we are not stopping the nova-compute (and nova-*) running on
> the compute node, before these are brought up in containers. I know you
> landed the fix into the tripleo_upgrade_node.sh @
> https://review.openstack.org/#/c/490226/ for BZ 1477768, but as in the
> paragraph above, the workflow is changed now so we will no longer rely on
> that file (it *is* still wired in but we may remove it alltogether). 
> 
> So, do you agree that this is now blocked/needs re-testing once we get BZ
> 1477962

Agree, we need to test the fix for BZ#1477962 and see if the issue reported in this ticket is still valid.

Comment 6 Marios Andreou 2017-09-18 09:26:43 UTC
> 
> Agree, we need to test the fix for BZ#1477962 and see if the issue reported
> in this ticket is still valid.


o/ can we add this to the list again please - trying to clear BZ - looks like BZ#1477962 is done based on latest comment #16 ... i'll catch up with you about it later on the phone too

Comment 7 Marius Cornea 2017-09-18 14:05:20 UTC
(In reply to marios from comment #6)
> > 
> > Agree, we need to test the fix for BZ#1477962 and see if the issue reported
> > in this ticket is still valid.
> 
> 
> o/ can we add this to the list again please - trying to clear BZ - looks
> like BZ#1477962 is done based on latest comment #16 ... i'll catch up with
> you about it later on the phone too

This is still an issue on an environment which includes fixes for bug 1477962:

(overcloud) [stack@undercloud-0 ~]$ nova service-list
+--------------------------------------+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+-------------+
| Id                                   | Binary           | Host                     | Zone     | Status  | State | Updated_at                 | Disabled Reason | Forced down |
+--------------------------------------+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+-------------+
| 453e2c46-f476-4fbc-905c-0e54c68aadaf | nova-conductor   | controller-1.localdomain | internal | enabled | up    | 2017-09-18T13:57:53.000000 | -               | False       |
| a514f4a9-8e40-4a42-b92b-37d57d299570 | nova-conductor   | controller-2.localdomain | internal | enabled | up    | 2017-09-18T13:57:53.000000 | -               | False       |
| fc58e9ef-8b21-49f8-93f4-0663ca051b8a | nova-compute     | compute-0.localdomain    | nova     | enabled | up    | 2017-09-18T13:57:53.000000 | -               | False       |
| a35f9f91-9116-4f69-822b-fc576ad9f6f5 | nova-scheduler   | controller-1.localdomain | internal | enabled | up    | 2017-09-18T13:57:48.000000 | -               | False       |
| e1490acc-765d-46fe-9114-a7bb4eb7a2d2 | nova-scheduler   | controller-2.localdomain | internal | enabled | up    | 2017-09-18T13:57:47.000000 | -               | False       |
| f184bcaf-3dc8-4d8d-b59d-cc666a6cc0bd | nova-consoleauth | controller-1.localdomain | internal | enabled | up    | 2017-09-18T13:57:52.000000 | -               | False       |
| ca403e7e-1e33-40bd-b95e-7d60fb560a5a | nova-consoleauth | controller-2.localdomain | internal | enabled | up    | 2017-09-18T13:57:54.000000 | -               | False       |
| 0a1fdca5-84d7-4c4d-894a-9f1ea2d434c0 | nova-compute     | compute-1.localdomain    | nova     | enabled | up    | 2017-09-18T13:57:48.000000 | -               | False       |
| 5d91e538-2a9d-4186-a5f9-0055a78cafb9 | nova-conductor   | controller-0.localdomain | internal | enabled | up    | 2017-09-18T13:57:48.000000 | -               | False       |
| e52b5b29-18b7-478a-891b-a677e7d24d19 | nova-scheduler   | controller-0.localdomain | internal | enabled | up    | 2017-09-18T13:57:51.000000 | -               | False       |
| 2d32736a-5fe4-4797-a3f2-afb304c0a0f3 | nova-consoleauth | controller-0.localdomain | internal | enabled | up    | 2017-09-18T13:57:50.000000 | -               | False       |
| 453e2c46-f476-4fbc-905c-0e54c68aadaf | nova-conductor   | controller-1.localdomain | internal | enabled | up    | 2017-09-18T13:57:53.000000 | -               | False       |
| a514f4a9-8e40-4a42-b92b-37d57d299570 | nova-conductor   | controller-2.localdomain | internal | enabled | up    | 2017-09-18T13:57:53.000000 | -               | False       |
| fc58e9ef-8b21-49f8-93f4-0663ca051b8a | nova-compute     | compute-0.localdomain    | nova     | enabled | up    | 2017-09-18T13:57:53.000000 | -               | False       |
| a35f9f91-9116-4f69-822b-fc576ad9f6f5 | nova-scheduler   | controller-1.localdomain | internal | enabled | up    | 2017-09-18T13:57:48.000000 | -               | False       |
| e1490acc-765d-46fe-9114-a7bb4eb7a2d2 | nova-scheduler   | controller-2.localdomain | internal | enabled | up    | 2017-09-18T13:57:47.000000 | -               | False       |
| f184bcaf-3dc8-4d8d-b59d-cc666a6cc0bd | nova-consoleauth | controller-1.localdomain | internal | enabled | up    | 2017-09-18T13:57:52.000000 | -               | False       |
| ca403e7e-1e33-40bd-b95e-7d60fb560a5a | nova-consoleauth | controller-2.localdomain | internal | enabled | up    | 2017-09-18T13:57:54.000000 | -               | False       |
| 0a1fdca5-84d7-4c4d-894a-9f1ea2d434c0 | nova-compute     | compute-1.localdomain    | nova     | enabled | up    | 2017-09-18T13:57:48.000000 | -               | False       |
| 5d91e538-2a9d-4186-a5f9-0055a78cafb9 | nova-conductor   | controller-0.localdomain | internal | enabled | up    | 2017-09-18T13:57:48.000000 | -               | False       |
| e52b5b29-18b7-478a-891b-a677e7d24d19 | nova-scheduler   | controller-0.localdomain | internal | enabled | up    | 2017-09-18T13:57:51.000000 | -               | False       |
| 2d32736a-5fe4-4797-a3f2-afb304c0a0f3 | nova-consoleauth | controller-0.localdomain | internal | enabled | up    | 2017-09-18T13:57:50.000000 | -               | False       |
+--------------------------------------+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+-------------+

Comment 8 Ollie Walsh 2017-09-21 16:52:37 UTC
Duplicate cell_v2 mapping is the culprit:

[root@controller-0 heat-admin]# nova-manage cell_v2 list_cells
Option "rabbit_use_ssl" from group "oslo_messaging_rabbit" is deprecated. Use option "ssl" from group "oslo_messaging_rabbit".
+---------+--------------------------------------+----------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------+
|   Name  |                 UUID                 |                            Transport URL                             |                                                   Database Connection                                                   |
+---------+--------------------------------------+----------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------+
|  cell0  | 00000000-0000-0000-0000-000000000000 |                                none:/                                | mysql+pymysql://nova:****@172.17.1.11/nova_cell0?read_default_file=/etc/my.cnf.d/tripleo.cnf&read_default_group=tripleo |
| default | 1f4fa8fd-966c-4e46-b90b-164aa8b7e49b | rabbit://guest:****@controller-2.internalapi.localdomain:5672/?ssl=0 |    mysql+pymysql://nova:****@172.17.1.11/nova?read_default_file=/etc/my.cnf.d/tripleo.cnf&read_default_group=tripleo    |
| default | 87002684-89e6-4227-8a8d-8c501dcf3a92 | rabbit://guest:****@controller-2.internalapi.localdomain:5672/?ssl=0 |    mysql+pymysql://nova:****@172.17.1.11/nova?read_default_group=tripleo&read_default_file=/etc/my.cnf.d/tripleo.cnf    |
+---------+--------------------------------------+----------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------+

Comment 9 Marius Cornea 2017-09-21 17:27:11 UTC
Comparing the database_connection on fresh deployments: 

OSP11:

mysql+pymysql://nova:HnK9XA7e8wwJh9A6NFNpfAzgZ.1.14/nova?read_default_file=/etc/my.cnf.d/tripleo.cnf&read_default_group=tripleo

OSP12:

mysql+pymysql://nova:j6E3FpBMQF69mUeQFkkYqT2Mq@[fd00:fd00:fd00:2000::1a]/nova?read_default_group=tripleo&read_default_file=/etc/my.cnf.d/tripleo.cnf

Looks like read_default_file and read_default_group swapped positions in the query string between OSP11 and OSP12.
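To illustrate why the two "default" rows in comment 8 count as different connections, here is a minimal sketch using only the Python standard library. The `normalized` helper is hypothetical and is not how nova compares cell mappings; it only shows that the two URLs differ as strings while carrying the same parameters.

```python
from urllib.parse import parse_qsl, urlsplit


def normalized(url):
    """Return a comparison key that ignores the order of query parameters."""
    parts = urlsplit(url)
    return (parts.scheme, parts.netloc, parts.path,
            tuple(sorted(parse_qsl(parts.query))))


# Database connections of the two "default" cells from comment 8
# (passwords masked as in the original output).
old = ("mysql+pymysql://nova:****@172.17.1.11/nova"
       "?read_default_file=/etc/my.cnf.d/tripleo.cnf&read_default_group=tripleo")
new = ("mysql+pymysql://nova:****@172.17.1.11/nova"
       "?read_default_group=tripleo&read_default_file=/etc/my.cnf.d/tripleo.cnf")

print(old == new)                          # False: plain string comparison
print(normalized(old) == normalized(new))  # True: same parameters, different order
```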

Comment 10 Ollie Walsh 2017-09-21 18:28:25 UTC
Yeah, and nova-manage cell_v2 create_cell is only idempotent if the transport_url and database_connection are identical.

However, we now have a cell_v2 update command, so we can find the cell uuid and ensure the name/mq/db are correct.
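The find-then-update idea could look roughly like the sketch below. It is an illustration only, not the change that was eventually merged: `plan_cell_action` and the dict shape are hypothetical, and it assumes the cell to reconcile can be identified by its name.

```python
def plan_cell_action(existing_cells, name, transport_url, database_connection):
    """Decide how to reach the desired mapping without adding a duplicate.

    existing_cells: list of dicts with 'name', 'uuid', 'transport_url' and
    'database_connection' keys (hypothetical shape, e.g. parsed from
    `nova-manage cell_v2 list_cells`).
    Returns ('create', None), ('noop', uuid) or ('update', uuid).
    """
    # Assumption: the first cell with a matching name is the one to reconcile.
    matches = [c for c in existing_cells if c["name"] == name]
    if not matches:
        return ("create", None)
    cell = matches[0]
    if (cell["transport_url"] == transport_url
            and cell["database_connection"] == database_connection):
        return ("noop", cell["uuid"])
    # Same cell, different URLs: update in place instead of creating again.
    return ("update", cell["uuid"])


mq = "rabbit://guest:****@controller-2.internalapi.localdomain:5672/?ssl=0"
existing = [{
    "name": "default",
    "uuid": "1f4fa8fd-966c-4e46-b90b-164aa8b7e49b",
    "transport_url": mq,
    "database_connection": "mysql+pymysql://nova:****@172.17.1.11/nova"
                           "?read_default_file=/etc/my.cnf.d/tripleo.cnf"
                           "&read_default_group=tripleo",
}]
desired_db = ("mysql+pymysql://nova:****@172.17.1.11/nova"
              "?read_default_group=tripleo"
              "&read_default_file=/etc/my.cnf.d/tripleo.cnf")

print(plan_cell_action(existing, "default", mq, desired_db))
```

With the values from comment 8 this selects an update of the existing cell rather than a second create.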

Comment 12 Alex Schultz 2017-09-22 15:17:26 UTC
So the URL changes seem to be due to our switch to the make_url function from heat. https://review.openstack.org/#/c/446704/

Comment 13 Marius Cornea 2017-10-06 14:41:48 UTC
*** Bug 1491611 has been marked as a duplicate of this bug. ***

Comment 15 Carlos Camacho 2017-10-31 15:39:21 UTC
Still waiting for this to be merged: https://review.openstack.org/#/q/topic:bug/1718912+(status:open+OR+status:merged)

Comment 16 Ollie Walsh 2017-11-11 12:55:03 UTC
https://review.openstack.org/513383 has merged

Comment 23 errata-xmlrpc 2017-12-13 21:48:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462