Bug 1477770 - OSP11 -> OSP12 upgrade: post upgrade 'nova service-list' reports duplicate services
Status: CLOSED ERRATA
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: rc
Target Release: 12.0 (Pike)
Assigned To: Ollie Walsh
QA Contact: Marius Cornea
Keywords: Triaged
Duplicates: 1491611
Depends On: 1477962
Blocks: 1399762
Reported: 2017-08-02 17:35 EDT by Marius Cornea
Modified: 2018-02-05 14:10 EST (History)
Fixed In Version: openstack-tripleo-heat-templates-7.0.3-6.el7ost
Last Closed: 2017-12-13 16:48:30 EST
Type: Bug


External Trackers
Tracker                | ID             | Priority | Status              | Summary                                                                        | Last Updated
Launchpad              | 1718912        | None     | None                | None                                                                           | 2017-09-22 06:51 EDT
Launchpad              | 1718914        | None     | None                | None                                                                           | 2017-09-22 06:54 EDT
OpenStack gerrit       | 513383         | None     | stable/pike: MERGED | tripleo-heat-templates: Update default cell_v2 cell when it already exists     | 2017-11-27 23:00 EST
Red Hat Product Errata | RHEA-2017:3462 | normal   | SHIPPED_LIVE        | Red Hat OpenStack Platform 12.0 Enhancement Advisory                           | 2018-02-15 20:43:25 EST

Description Marius Cornea 2017-08-02 17:35:13 EDT
Description of problem:
OSP11 -> OSP12 upgrade: post upgrade 'nova service-list' reports duplicate services:

(overcloud) [stack@undercloud-0 ~]$ nova service-list
+-----+------------------+--------------------------+----------+----------+-------+----------------------------+------------------------------------------------------------------------------+
| Id  | Binary           | Host                     | Zone     | Status   | State | Updated_at                 | Disabled Reason                                                              |
+-----+------------------+--------------------------+----------+----------+-------+----------------------------+------------------------------------------------------------------------------+
| 29  | nova-conductor   | controller-1.localdomain | internal | enabled  | up    | 2017-08-02T21:29:55.000000 | -                                                                            |
| 35  | nova-conductor   | controller-2.localdomain | internal | enabled  | up    | 2017-08-02T21:29:49.000000 | -                                                                            |
| 44  | nova-compute     | compute-1.localdomain    | nova     | disabled | up    | 2017-08-02T21:29:49.000000 | AUTO: Failed to connect to libvirt: Failed to find user record for uid '162' |
| 77  | nova-scheduler   | controller-1.localdomain | internal | enabled  | up    | 2017-08-02T21:29:55.000000 | -                                                                            |
| 80  | nova-compute     | compute-0.localdomain    | nova     | enabled  | up    | 2017-08-02T21:29:56.000000 | -                                                                            |
| 83  | nova-scheduler   | controller-2.localdomain | internal | enabled  | up    | 2017-08-02T21:29:50.000000 | -                                                                            |
| 86  | nova-consoleauth | controller-1.localdomain | internal | enabled  | up    | 2017-08-02T21:29:54.000000 | -                                                                            |
| 89  | nova-consoleauth | controller-2.localdomain | internal | enabled  | up    | 2017-08-02T21:29:58.000000 | -                                                                            |
| 92  | nova-conductor   | controller-0.localdomain | internal | enabled  | up    | 2017-08-02T21:29:56.000000 | -                                                                            |
| 98  | nova-scheduler   | controller-0.localdomain | internal | enabled  | up    | 2017-08-02T21:29:56.000000 | -                                                                            |
| 101 | nova-consoleauth | controller-0.localdomain | internal | enabled  | up    | 2017-08-02T21:29:55.000000 | -                                                                            |
| 29  | nova-conductor   | controller-1.localdomain | internal | enabled  | up    | 2017-08-02T21:29:55.000000 | -                                                                            |
| 35  | nova-conductor   | controller-2.localdomain | internal | enabled  | up    | 2017-08-02T21:29:49.000000 | -                                                                            |
| 44  | nova-compute     | compute-1.localdomain    | nova     | disabled | up    | 2017-08-02T21:29:49.000000 | AUTO: Failed to connect to libvirt: Failed to find user record for uid '162' |
| 77  | nova-scheduler   | controller-1.localdomain | internal | enabled  | up    | 2017-08-02T21:29:55.000000 | -                                                                            |
| 80  | nova-compute     | compute-0.localdomain    | nova     | enabled  | up    | 2017-08-02T21:29:56.000000 | -                                                                            |
| 83  | nova-scheduler   | controller-2.localdomain | internal | enabled  | up    | 2017-08-02T21:29:50.000000 | -                                                                            |
| 86  | nova-consoleauth | controller-1.localdomain | internal | enabled  | up    | 2017-08-02T21:29:54.000000 | -                                                                            |
| 89  | nova-consoleauth | controller-2.localdomain | internal | enabled  | up    | 2017-08-02T21:29:58.000000 | -                                                                            |
| 92  | nova-conductor   | controller-0.localdomain | internal | enabled  | up    | 2017-08-02T21:29:56.000000 | -                                                                            |
| 98  | nova-scheduler   | controller-0.localdomain | internal | enabled  | up    | 2017-08-02T21:29:56.000000 | -                                                                            |
| 101 | nova-consoleauth | controller-0.localdomain | internal | enabled  | up    | 2017-08-02T21:29:55.000000 | -                                                                            |
+-----+------------------+--------------------------+----------+----------+-------+----------------------------+------------------------------------------------------------------------------+


Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-7.0.0-0.20170721174554.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP11
2. Upgrade to OSP12
3. After the upgrade process is completed (major-upgrade-converge-docker.yaml), check nova service-list

Actual results:
We can see duplicate services reported by nova service-list:

(overcloud) [stack@undercloud-0 ~]$ nova service-list
+-----+------------------+--------------------------+----------+----------+-------+----------------------------+------------------------------------------------------------------------------+
| Id  | Binary           | Host                     | Zone     | Status   | State | Updated_at                 | Disabled Reason                                                              |
+-----+------------------+--------------------------+----------+----------+-------+----------------------------+------------------------------------------------------------------------------+
| 29  | nova-conductor   | controller-1.localdomain | internal | enabled  | up    | 2017-08-02T21:29:55.000000 | -                                                                            |
| 35  | nova-conductor   | controller-2.localdomain | internal | enabled  | up    | 2017-08-02T21:29:49.000000 | -                                                                            |
| 44  | nova-compute     | compute-1.localdomain    | nova     | disabled | up    | 2017-08-02T21:29:49.000000 | AUTO: Failed to connect to libvirt: Failed to find user record for uid '162' |
| 77  | nova-scheduler   | controller-1.localdomain | internal | enabled  | up    | 2017-08-02T21:29:55.000000 | -                                                                            |
| 80  | nova-compute     | compute-0.localdomain    | nova     | enabled  | up    | 2017-08-02T21:29:56.000000 | -                                                                            |
| 83  | nova-scheduler   | controller-2.localdomain | internal | enabled  | up    | 2017-08-02T21:29:50.000000 | -                                                                            |
| 86  | nova-consoleauth | controller-1.localdomain | internal | enabled  | up    | 2017-08-02T21:29:54.000000 | -                                                                            |
| 89  | nova-consoleauth | controller-2.localdomain | internal | enabled  | up    | 2017-08-02T21:29:58.000000 | -                                                                            |
| 92  | nova-conductor   | controller-0.localdomain | internal | enabled  | up    | 2017-08-02T21:29:56.000000 | -                                                                            |
| 98  | nova-scheduler   | controller-0.localdomain | internal | enabled  | up    | 2017-08-02T21:29:56.000000 | -                                                                            |
| 101 | nova-consoleauth | controller-0.localdomain | internal | enabled  | up    | 2017-08-02T21:29:55.000000 | -                                                                            |
| 29  | nova-conductor   | controller-1.localdomain | internal | enabled  | up    | 2017-08-02T21:29:55.000000 | -                                                                            |
| 35  | nova-conductor   | controller-2.localdomain | internal | enabled  | up    | 2017-08-02T21:29:49.000000 | -                                                                            |
| 44  | nova-compute     | compute-1.localdomain    | nova     | disabled | up    | 2017-08-02T21:29:49.000000 | AUTO: Failed to connect to libvirt: Failed to find user record for uid '162' |
| 77  | nova-scheduler   | controller-1.localdomain | internal | enabled  | up    | 2017-08-02T21:29:55.000000 | -                                                                            |
| 80  | nova-compute     | compute-0.localdomain    | nova     | enabled  | up    | 2017-08-02T21:29:56.000000 | -                                                                            |
| 83  | nova-scheduler   | controller-2.localdomain | internal | enabled  | up    | 2017-08-02T21:29:50.000000 | -                                                                            |
| 86  | nova-consoleauth | controller-1.localdomain | internal | enabled  | up    | 2017-08-02T21:29:54.000000 | -                                                                            |
| 89  | nova-consoleauth | controller-2.localdomain | internal | enabled  | up    | 2017-08-02T21:29:58.000000 | -                                                                            |
| 92  | nova-conductor   | controller-0.localdomain | internal | enabled  | up    | 2017-08-02T21:29:56.000000 | -                                                                            |
| 98  | nova-scheduler   | controller-0.localdomain | internal | enabled  | up    | 2017-08-02T21:29:56.000000 | -                                                                            |
| 101 | nova-consoleauth | controller-0.localdomain | internal | enabled  | up    | 2017-08-02T21:29:55.000000 | -                                                                            |
+-----+------------------+--------------------------+----------+----------+-------+----------------------------+------------------------------------------------------------------------------+


Expected results:
We don't get any duplicate services.
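
A check for this condition can be scripted. The following is a minimal sketch, assuming the table layout printed by the nova CLI above; the helper names parse_cli_table and duplicate_ids and the shortened SAMPLE table are hypothetical and exist only for illustration.

```python
from collections import Counter


def parse_cli_table(text):
    """Parse an ASCII table (as printed by the nova CLI) into a list of dicts."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in text.splitlines()
        if line.strip().startswith("|")  # skip the +----+ border lines
    ]
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


def duplicate_ids(rows, key="Id"):
    """Return the values of `key` that appear in more than one row."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


# Shortened, hypothetical excerpt in the same layout as the output above.
SAMPLE = """\
+----+----------------+--------------------------+
| Id | Binary         | Host                     |
+----+----------------+--------------------------+
| 29 | nova-conductor | controller-1.localdomain |
| 80 | nova-compute   | compute-0.localdomain    |
| 29 | nova-conductor | controller-1.localdomain |
+----+----------------+--------------------------+
"""

if __name__ == "__main__":
    print(duplicate_ids(parse_cli_table(SAMPLE)))
```

An empty result would match the expected behaviour; any returned Id is a service listed more than once.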

Additional info:
Comment 1 Marius Cornea 2017-08-02 17:40:43 EDT
The same goes for hypervisor-list, hypervisor-stats:

(overcloud) [stack@undercloud-0 ~]$ nova hypervisor-list
+----+-----------------------+-------+---------+
| ID | Hypervisor hostname   | State | Status  |
+----+-----------------------+-------+---------+
| 2  | compute-1.localdomain | up    | enabled |
| 5  | compute-0.localdomain | up    | enabled |
| 2  | compute-1.localdomain | up    | enabled |
| 5  | compute-0.localdomain | up    | enabled |
+----+-----------------------+-------+---------+


(overcloud) [stack@undercloud-0 ~]$ nova hypervisor-stats
+----------------------+-------+
| Property             | Value |
+----------------------+-------+
| count                | 4     |
| current_workload     | 0     |
| disk_available_least | 118   |
| free_disk_gb         | 156   |
| free_ram_mb          | 16380 |
| local_gb             | 156   |
| local_gb_used        | 0     |
| memory_mb            | 32764 |
| memory_mb_used       | 16384 |
| running_vms          | 0     |
| vcpus                | 16    |
| vcpus_used           | 0     |
+----------------------+-------+
Comment 2 Carlos Camacho 2017-08-07 08:48:04 EDT
Hey!!

Just looking into the code:

In a non-controller upgrade, I think we are somehow missing the upgrade_tasks step that actually stops the services running under systemd, e.g. nova-conductor.

Here we are stopping nova-conductor:
https://github.com/openstack/tripleo-heat-templates/blob/master/docker/services/nova-conductor.yaml#L110

But we are not getting in:
https://github.com/openstack/tripleo-heat-templates/blob/master/docker/docker-steps.j2#L167
Comment 3 Carlos Camacho 2017-08-14 09:12:55 EDT
Marios, at the beginning I believed this bug was related to something like https://review.openstack.org/#/c/484711/

But I think there is something else there.
Comment 4 Marios Andreou 2017-08-17 07:58:27 EDT
This is not a valid bug yet. It is blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1477962 for the non-controller upgrade. That is, the workflow for the non-controllers (in this BZ, the computes) is still being finished in BZ 1477962. Once that is fixed, we will get the execution of the upgrade_tasks from the nova-compute service, which includes a stop of the existing (non-dockerized) service: https://github.com/openstack/tripleo-heat-templates/blob/0cb45d65c607cf4eb9a4096c7cc3f1c8a5ca58b4/docker/services/nova-compute.yaml#L145

I think BZ 1477768 is related to (or a duplicate of?) this one, if indeed the root cause is that we are not stopping nova-compute (and the other nova-* services) running on the compute node before they are brought up in containers. I know you landed the fix in tripleo_upgrade_node.sh at https://review.openstack.org/#/c/490226/ for BZ 1477768 but, as in the paragraph above, the workflow has changed, so we will no longer rely on that file (it *is* still wired in, but we may remove it altogether).

So, do you agree that this is now blocked and needs re-testing once we get the fix for BZ 1477962?
Comment 5 Marius Cornea 2017-08-17 08:02:39 EDT
(In reply to marios from comment #4)
> this is not a valid bug, yet. It is blocked by
> https://bugzilla.redhat.com/show_bug.cgi?id=1477962 for the non controller
> upgrade. That is, the workflow for the non controllers (in this BZ
> computes), is still being finished by BZ 1477962 . Once that is fixed, we
> will get the execution of the upgrade_tasks from the nova-compute service,
> which includes a stop on the existing (non dockerized) service in
> https://github.com/openstack/tripleo-heat-templates/blob/
> 0cb45d65c607cf4eb9a4096c7cc3f1c8a5ca58b4/docker/services/nova-compute.
> yaml#L145 .
> 
> I think BZ 1477768 is related/duplicate (?) of this , if indeed the root
> cause is that we are not stopping the nova-compute (and nova-*) running on
> the compute node, before these are brought up in containers. I know you
> landed the fix into the tripleo_upgrade_node.sh @
> https://review.openstack.org/#/c/490226/ for BZ 1477768, but as in the
> paragraph above, the workflow is changed now so we will no longer rely on
> that file (it *is* still wired in but we may remove it altogether). 
> 
> So, do you agree that this is now blocked/needs re-testing once we get BZ
> 1477962

Agree, we need to test the fix for BZ#1477962 and see if the issue reported in this ticket is still valid.
Comment 6 Marios Andreou 2017-09-18 05:26:43 EDT
> 
> Agree, we need to test the fix for BZ#1477962 and see if the issue reported
> in this ticket is still valid.


o/ can we add this to the list again please - trying to clear BZ - looks like BZ#1477962 is done based on latest comment #16 ... i'll catch up with you about it later on the phone too
Comment 7 Marius Cornea 2017-09-18 10:05:20 EDT
(In reply to marios from comment #6)
> > 
> > Agree, we need to test the fix for BZ#1477962 and see if the issue reported
> > in this ticket is still valid.
> 
> 
> o/ can we add this to the list again please - trying to clear BZ - looks
> like BZ#1477962 is done based on latest comment #16 ... i'll catch up with
> you about it later on the phone too

This is still an issue on an environment which includes fixes for bug 1477962:

(overcloud) [stack@undercloud-0 ~]$ nova service-list
+--------------------------------------+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+-------------+
| Id                                   | Binary           | Host                     | Zone     | Status  | State | Updated_at                 | Disabled Reason | Forced down |
+--------------------------------------+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+-------------+
| 453e2c46-f476-4fbc-905c-0e54c68aadaf | nova-conductor   | controller-1.localdomain | internal | enabled | up    | 2017-09-18T13:57:53.000000 | -               | False       |
| a514f4a9-8e40-4a42-b92b-37d57d299570 | nova-conductor   | controller-2.localdomain | internal | enabled | up    | 2017-09-18T13:57:53.000000 | -               | False       |
| fc58e9ef-8b21-49f8-93f4-0663ca051b8a | nova-compute     | compute-0.localdomain    | nova     | enabled | up    | 2017-09-18T13:57:53.000000 | -               | False       |
| a35f9f91-9116-4f69-822b-fc576ad9f6f5 | nova-scheduler   | controller-1.localdomain | internal | enabled | up    | 2017-09-18T13:57:48.000000 | -               | False       |
| e1490acc-765d-46fe-9114-a7bb4eb7a2d2 | nova-scheduler   | controller-2.localdomain | internal | enabled | up    | 2017-09-18T13:57:47.000000 | -               | False       |
| f184bcaf-3dc8-4d8d-b59d-cc666a6cc0bd | nova-consoleauth | controller-1.localdomain | internal | enabled | up    | 2017-09-18T13:57:52.000000 | -               | False       |
| ca403e7e-1e33-40bd-b95e-7d60fb560a5a | nova-consoleauth | controller-2.localdomain | internal | enabled | up    | 2017-09-18T13:57:54.000000 | -               | False       |
| 0a1fdca5-84d7-4c4d-894a-9f1ea2d434c0 | nova-compute     | compute-1.localdomain    | nova     | enabled | up    | 2017-09-18T13:57:48.000000 | -               | False       |
| 5d91e538-2a9d-4186-a5f9-0055a78cafb9 | nova-conductor   | controller-0.localdomain | internal | enabled | up    | 2017-09-18T13:57:48.000000 | -               | False       |
| e52b5b29-18b7-478a-891b-a677e7d24d19 | nova-scheduler   | controller-0.localdomain | internal | enabled | up    | 2017-09-18T13:57:51.000000 | -               | False       |
| 2d32736a-5fe4-4797-a3f2-afb304c0a0f3 | nova-consoleauth | controller-0.localdomain | internal | enabled | up    | 2017-09-18T13:57:50.000000 | -               | False       |
| 453e2c46-f476-4fbc-905c-0e54c68aadaf | nova-conductor   | controller-1.localdomain | internal | enabled | up    | 2017-09-18T13:57:53.000000 | -               | False       |
| a514f4a9-8e40-4a42-b92b-37d57d299570 | nova-conductor   | controller-2.localdomain | internal | enabled | up    | 2017-09-18T13:57:53.000000 | -               | False       |
| fc58e9ef-8b21-49f8-93f4-0663ca051b8a | nova-compute     | compute-0.localdomain    | nova     | enabled | up    | 2017-09-18T13:57:53.000000 | -               | False       |
| a35f9f91-9116-4f69-822b-fc576ad9f6f5 | nova-scheduler   | controller-1.localdomain | internal | enabled | up    | 2017-09-18T13:57:48.000000 | -               | False       |
| e1490acc-765d-46fe-9114-a7bb4eb7a2d2 | nova-scheduler   | controller-2.localdomain | internal | enabled | up    | 2017-09-18T13:57:47.000000 | -               | False       |
| f184bcaf-3dc8-4d8d-b59d-cc666a6cc0bd | nova-consoleauth | controller-1.localdomain | internal | enabled | up    | 2017-09-18T13:57:52.000000 | -               | False       |
| ca403e7e-1e33-40bd-b95e-7d60fb560a5a | nova-consoleauth | controller-2.localdomain | internal | enabled | up    | 2017-09-18T13:57:54.000000 | -               | False       |
| 0a1fdca5-84d7-4c4d-894a-9f1ea2d434c0 | nova-compute     | compute-1.localdomain    | nova     | enabled | up    | 2017-09-18T13:57:48.000000 | -               | False       |
| 5d91e538-2a9d-4186-a5f9-0055a78cafb9 | nova-conductor   | controller-0.localdomain | internal | enabled | up    | 2017-09-18T13:57:48.000000 | -               | False       |
| e52b5b29-18b7-478a-891b-a677e7d24d19 | nova-scheduler   | controller-0.localdomain | internal | enabled | up    | 2017-09-18T13:57:51.000000 | -               | False       |
| 2d32736a-5fe4-4797-a3f2-afb304c0a0f3 | nova-consoleauth | controller-0.localdomain | internal | enabled | up    | 2017-09-18T13:57:50.000000 | -               | False       |
+--------------------------------------+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+-------------+
Comment 8 Ollie Walsh 2017-09-21 12:52:37 EDT
Duplicate cell_v2 mapping is the culprit:

[root@controller-0 heat-admin]# nova-manage cell_v2 list_cells
Option "rabbit_use_ssl" from group "oslo_messaging_rabbit" is deprecated. Use option "ssl" from group "oslo_messaging_rabbit".
+---------+--------------------------------------+----------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------+
|   Name  |                 UUID                 |                            Transport URL                             |                                                   Database Connection                                                   |
+---------+--------------------------------------+----------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------+
|  cell0  | 00000000-0000-0000-0000-000000000000 |                                none:/                                | mysql+pymysql://nova:****@172.17.1.11/nova_cell0?read_default_file=/etc/my.cnf.d/tripleo.cnf&read_default_group=tripleo |
| default | 1f4fa8fd-966c-4e46-b90b-164aa8b7e49b | rabbit://guest:****@controller-2.internalapi.localdomain:5672/?ssl=0 |    mysql+pymysql://nova:****@172.17.1.11/nova?read_default_file=/etc/my.cnf.d/tripleo.cnf&read_default_group=tripleo    |
| default | 87002684-89e6-4227-8a8d-8c501dcf3a92 | rabbit://guest:****@controller-2.internalapi.localdomain:5672/?ssl=0 |    mysql+pymysql://nova:****@172.17.1.11/nova?read_default_group=tripleo&read_default_file=/etc/my.cnf.d/tripleo.cnf    |
+---------+--------------------------------------+----------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------+
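
To illustrate why a duplicate mapping would double every row, here is a minimal sketch. It assumes that listings are gathered once per cell mapping and concatenated, which is consistent with the symptom but is not spelled out in this report. DATABASES, CELL_MAPPINGS and list_services are hypothetical in-memory stand-ins, not nova code; the UUIDs are the ones from the list_cells output above.

```python
# Hypothetical stand-ins: each "database" is a list of service rows, and each
# cell mapping points at one database by name.
DATABASES = {
    "nova_cell0": [],
    "nova": [
        {"id": 29, "binary": "nova-conductor", "host": "controller-1.localdomain"},
        {"id": 80, "binary": "nova-compute", "host": "compute-0.localdomain"},
    ],
}

CELL_MAPPINGS = [
    {"name": "cell0", "uuid": "00000000-0000-0000-0000-000000000000", "db": "nova_cell0"},
    {"name": "default", "uuid": "1f4fa8fd-966c-4e46-b90b-164aa8b7e49b", "db": "nova"},
    {"name": "default", "uuid": "87002684-89e6-4227-8a8d-8c501dcf3a92", "db": "nova"},
]


def list_services(cell_mappings, databases):
    """Collect service rows from every mapped cell, one lookup per mapping."""
    results = []
    for cell in cell_mappings:
        results.extend(databases[cell["db"]])
    return results


if __name__ == "__main__":
    with_duplicate = list_services(CELL_MAPPINGS, DATABASES)
    without_duplicate = list_services(CELL_MAPPINGS[:2], DATABASES)
    print(len(with_duplicate), len(without_duplicate))
```

With both "default" mappings pointing at the same database, every service row is collected twice, in the same order, which matches the repeated block in the service-list output.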
Comment 9 Marius Cornea 2017-09-21 13:27:11 EDT
Comparing the database_connection on fresh deployments: 

OSP11:

mysql+pymysql://nova:HnK9XA7e8wwJh9A6NFNpfAzgZ@172.17.1.14/nova?read_default_file=/etc/my.cnf.d/tripleo.cnf&read_default_group=tripleo

OSP12:

mysql+pymysql://nova:j6E3FpBMQF69mUeQFkkYqT2Mq@[fd00:fd00:fd00:2000::1a]/nova?read_default_group=tripleo&read_default_file=/etc/my.cnf.d/tripleo.cnf

Looks like read_default_file and read_default_group swapped positions in the query string between OSP11 and OSP12.
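
The difference can be confirmed with a short sketch using only the Python standard library. The two strings below are the masked database connections from the list_cells output in comment 8; the function name normalized is hypothetical.

```python
from urllib.parse import parse_qsl, urlsplit

# Masked connection strings as shown by `nova-manage cell_v2 list_cells`.
OSP11_STYLE = (
    "mysql+pymysql://nova:****@172.17.1.11/nova"
    "?read_default_file=/etc/my.cnf.d/tripleo.cnf&read_default_group=tripleo"
)
OSP12_STYLE = (
    "mysql+pymysql://nova:****@172.17.1.11/nova"
    "?read_default_group=tripleo&read_default_file=/etc/my.cnf.d/tripleo.cnf"
)


def normalized(url):
    """Return a comparison key that ignores the order of query parameters."""
    parts = urlsplit(url)
    return (parts.scheme, parts.netloc, parts.path, sorted(parse_qsl(parts.query)))


if __name__ == "__main__":
    print("identical strings:", OSP11_STYLE == OSP12_STYLE)
    print("same after sorting parameters:", normalized(OSP11_STYLE) == normalized(OSP12_STYLE))
```

The two URLs describe the same connection, but a plain string comparison treats them as different.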
Comment 10 Ollie Walsh 2017-09-21 14:28:25 EDT
Yea, and nova-manage cell_v2 create_cell is only idempotent if the transport_url and database_connection are identical.

However, we now have a cell_v2 update command, so we can find the cell UUID and ensure the name/mq/db are correct.
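
The difference between the two approaches can be modelled in a few lines. This is only an in-memory sketch of the idea described above, not the nova-manage implementation and not the change that was eventually merged; create_cell, ensure_cell, the dict layout and the "new-uuid" placeholder are all hypothetical.

```python
def create_cell(cells, name, transport_url, database_connection, new_uuid):
    """Model of a create that is idempotent only on an exact string match."""
    for cell in cells:
        if (cell["transport_url"], cell["database_connection"]) == (
            transport_url,
            database_connection,
        ):
            return cell  # identical mapping already present, nothing to do
    cell = {
        "uuid": new_uuid,
        "name": name,
        "transport_url": transport_url,
        "database_connection": database_connection,
    }
    cells.append(cell)
    return cell


def ensure_cell(cells, name, transport_url, database_connection, new_uuid):
    """Find the cell by name and update it in place; create only if absent."""
    for cell in cells:
        if cell["name"] == name:
            cell["transport_url"] = transport_url
            cell["database_connection"] = database_connection
            return cell
    return create_cell(cells, name, transport_url, database_connection, new_uuid)


MQ = "rabbit://guest:****@controller-2.internalapi.localdomain:5672/?ssl=0"
OLD_DB = "mysql+pymysql://nova:****@172.17.1.11/nova?read_default_file=/etc/my.cnf.d/tripleo.cnf&read_default_group=tripleo"
NEW_DB = "mysql+pymysql://nova:****@172.17.1.11/nova?read_default_group=tripleo&read_default_file=/etc/my.cnf.d/tripleo.cnf"


def existing_cells():
    """State before the upgrade: one default cell with the old URL ordering."""
    return [{"uuid": "1f4fa8fd-966c-4e46-b90b-164aa8b7e49b", "name": "default",
             "transport_url": MQ, "database_connection": OLD_DB}]


if __name__ == "__main__":
    naive = existing_cells()
    create_cell(naive, "default", MQ, NEW_DB, "new-uuid")
    fixed = existing_cells()
    ensure_cell(fixed, "default", MQ, NEW_DB, "new-uuid")
    print(len(naive), len(fixed))
```

In this model, the create path adds a second "default" cell as soon as the connection string is reordered, while the lookup-then-update path keeps a single cell and its original UUID.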
Comment 12 Alex Schultz 2017-09-22 11:17:26 EDT
So the URL changes seem to be due to our switching to the make_url function from heat: https://review.openstack.org/#/c/446704/
Comment 13 Marius Cornea 2017-10-06 10:41:48 EDT
*** Bug 1491611 has been marked as a duplicate of this bug. ***
Comment 15 Carlos Camacho 2017-10-31 11:39:21 EDT
Still waiting for this to be merged: https://review.openstack.org/#/q/topic:bug/1718912+(status:open+OR+status:merged)
Comment 16 Ollie Walsh 2017-11-11 07:55:03 EST
https://review.openstack.org/513383 has merged
Comment 23 errata-xmlrpc 2017-12-13 16:48:30 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462
