Description of problem:

OSP11 -> OSP12 upgrade: post upgrade, 'nova service-list' reports duplicate services:

(overcloud) [stack@undercloud-0 ~]$ nova service-list
+-----+------------------+--------------------------+----------+----------+-------+----------------------------+------------------------------------------------------------------------------+
| Id | Binary | Host | Zone | Status | State | Updated_at | Disabled Reason |
+-----+------------------+--------------------------+----------+----------+-------+----------------------------+------------------------------------------------------------------------------+
| 29 | nova-conductor | controller-1.localdomain | internal | enabled | up | 2017-08-02T21:29:55.000000 | - |
| 35 | nova-conductor | controller-2.localdomain | internal | enabled | up | 2017-08-02T21:29:49.000000 | - |
| 44 | nova-compute | compute-1.localdomain | nova | disabled | up | 2017-08-02T21:29:49.000000 | AUTO: Failed to connect to libvirt: Failed to find user record for uid '162' |
| 77 | nova-scheduler | controller-1.localdomain | internal | enabled | up | 2017-08-02T21:29:55.000000 | - |
| 80 | nova-compute | compute-0.localdomain | nova | enabled | up | 2017-08-02T21:29:56.000000 | - |
| 83 | nova-scheduler | controller-2.localdomain | internal | enabled | up | 2017-08-02T21:29:50.000000 | - |
| 86 | nova-consoleauth | controller-1.localdomain | internal | enabled | up | 2017-08-02T21:29:54.000000 | - |
| 89 | nova-consoleauth | controller-2.localdomain | internal | enabled | up | 2017-08-02T21:29:58.000000 | - |
| 92 | nova-conductor | controller-0.localdomain | internal | enabled | up | 2017-08-02T21:29:56.000000 | - |
| 98 | nova-scheduler | controller-0.localdomain | internal | enabled | up | 2017-08-02T21:29:56.000000 | - |
| 101 | nova-consoleauth | controller-0.localdomain | internal | enabled | up | 2017-08-02T21:29:55.000000 | - |
| 29 | nova-conductor | controller-1.localdomain | internal | enabled | up | 2017-08-02T21:29:55.000000 | - |
| 35 | nova-conductor | controller-2.localdomain | internal | enabled | up | 2017-08-02T21:29:49.000000 | - |
| 44 | nova-compute | compute-1.localdomain | nova | disabled | up | 2017-08-02T21:29:49.000000 | AUTO: Failed to connect to libvirt: Failed to find user record for uid '162' |
| 77 | nova-scheduler | controller-1.localdomain | internal | enabled | up | 2017-08-02T21:29:55.000000 | - |
| 80 | nova-compute | compute-0.localdomain | nova | enabled | up | 2017-08-02T21:29:56.000000 | - |
| 83 | nova-scheduler | controller-2.localdomain | internal | enabled | up | 2017-08-02T21:29:50.000000 | - |
| 86 | nova-consoleauth | controller-1.localdomain | internal | enabled | up | 2017-08-02T21:29:54.000000 | - |
| 89 | nova-consoleauth | controller-2.localdomain | internal | enabled | up | 2017-08-02T21:29:58.000000 | - |
| 92 | nova-conductor | controller-0.localdomain | internal | enabled | up | 2017-08-02T21:29:56.000000 | - |
| 98 | nova-scheduler | controller-0.localdomain | internal | enabled | up | 2017-08-02T21:29:56.000000 | - |
| 101 | nova-consoleauth | controller-0.localdomain | internal | enabled | up | 2017-08-02T21:29:55.000000 | - |
+-----+------------------+--------------------------+----------+----------+-------+----------------------------+------------------------------------------------------------------------------+

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-7.0.0-0.20170721174554.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP11
2. Upgrade to OSP12
3. After the upgrade process has completed (major-upgrade-converge-docker.yaml), check nova service-list

Actual results:
We can see duplicate services reported by nova service-list (the output is identical to the one pasted in the description above).

Expected results:
We don't get any duplicate services.

Additional info:
The same goes for hypervisor-list and hypervisor-stats:

(overcloud) [stack@undercloud-0 ~]$ nova hypervisor-list
+----+-----------------------+-------+---------+
| ID | Hypervisor hostname | State | Status |
+----+-----------------------+-------+---------+
| 2 | compute-1.localdomain | up | enabled |
| 5 | compute-0.localdomain | up | enabled |
| 2 | compute-1.localdomain | up | enabled |
| 5 | compute-0.localdomain | up | enabled |
+----+-----------------------+-------+---------+

(overcloud) [stack@undercloud-0 ~]$ nova hypervisor-stats
+----------------------+-------+
| Property | Value |
+----------------------+-------+
| count | 4 |
| current_workload | 0 |
| disk_available_least | 118 |
| free_disk_gb | 156 |
| free_ram_mb | 16380 |
| local_gb | 156 |
| local_gb_used | 0 |
| memory_mb | 32764 |
| memory_mb_used | 16384 |
| running_vms | 0 |
| vcpus | 16 |
| vcpus_used | 0 |
Hey!! Just looking into the code: in a non-controller upgrade, I think we are somehow missing the upgrade_tasks step that actually stops the services under systemd, i.e. nova-conductor.

Here we are stopping nova-conductor:
https://github.com/openstack/tripleo-heat-templates/blob/master/docker/services/nova-conductor.yaml#L110

But it does not look like we ever reach it from:
https://github.com/openstack/tripleo-heat-templates/blob/master/docker/docker-steps.j2#L167
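For reference, what that upgrade_tasks step boils down to on the node is stopping and disabling the pre-upgrade systemd unit before the containerized service is started. A minimal sketch of the equivalent manual commands, assuming the pre-containerization unit name openstack-nova-conductor (this is only an illustration of the step, not the actual template code):

  # Stop and disable the bare-metal service so it cannot come back up
  # alongside its containerized replacement after the upgrade.
  sudo systemctl stop openstack-nova-conductor
  sudo systemctl disable openstack-nova-conductor
  # Confirm no bare-metal nova units are still active under systemd.
  systemctl list-units 'openstack-nova-*'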
Marios, at the beginning I believed this bug was related to something like https://review.openstack.org/#/c/484711/ but I think there is something else going on here.
This is not a valid bug yet. It is blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1477962 for the non-controller upgrade. That is, the workflow for the non-controllers (in this BZ, the computes) is still being finished under BZ 1477962. Once that is fixed, we will get the execution of the upgrade_tasks from the nova-compute service, which include a stop of the existing (non-dockerized) service in https://github.com/openstack/tripleo-heat-templates/blob/0cb45d65c607cf4eb9a4096c7cc3f1c8a5ca58b4/docker/services/nova-compute.yaml#L145 .

I think BZ 1477768 is related to (or a duplicate of?) this one, if indeed the root cause is that we are not stopping the nova-compute (and nova-*) services running on the compute node before they are brought up in containers. I know you landed the fix into tripleo_upgrade_node.sh at https://review.openstack.org/#/c/490226/ for BZ 1477768, but as per the paragraph above, the workflow has changed and we will no longer rely on that file (it *is* still wired in, but we may remove it altogether).

So, do you agree that this is now blocked/needs re-testing once we get BZ 1477962?
(In reply to marios from comment #4)
> So, do you agree that this is now blocked/needs re-testing once we get BZ
> 1477962?

Agree, we need to test the fix for BZ#1477962 and see if the issue reported in this ticket is still valid.
> Agree, we need to test the fix for BZ#1477962 and see if the issue reported
> in this ticket is still valid.

o/ Can we add this to the list again please - trying to clear the BZ. It looks like BZ#1477962 is done based on latest comment #16 ... I'll catch up with you about it later on the phone too.
(In reply to marios from comment #6)
> o/ Can we add this to the list again please - trying to clear the BZ. It looks
> like BZ#1477962 is done based on latest comment #16 ... I'll catch up with
> you about it later on the phone too.

This is still an issue on an environment which includes the fixes for bug 1477962:

(overcloud) [stack@undercloud-0 ~]$ nova service-list
+--------------------------------------+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+-------------+
| Id | Binary | Host | Zone | Status | State | Updated_at | Disabled Reason | Forced down |
+--------------------------------------+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+-------------+
| 453e2c46-f476-4fbc-905c-0e54c68aadaf | nova-conductor | controller-1.localdomain | internal | enabled | up | 2017-09-18T13:57:53.000000 | - | False |
| a514f4a9-8e40-4a42-b92b-37d57d299570 | nova-conductor | controller-2.localdomain | internal | enabled | up | 2017-09-18T13:57:53.000000 | - | False |
| fc58e9ef-8b21-49f8-93f4-0663ca051b8a | nova-compute | compute-0.localdomain | nova | enabled | up | 2017-09-18T13:57:53.000000 | - | False |
| a35f9f91-9116-4f69-822b-fc576ad9f6f5 | nova-scheduler | controller-1.localdomain | internal | enabled | up | 2017-09-18T13:57:48.000000 | - | False |
| e1490acc-765d-46fe-9114-a7bb4eb7a2d2 | nova-scheduler | controller-2.localdomain | internal | enabled | up | 2017-09-18T13:57:47.000000 | - | False |
| f184bcaf-3dc8-4d8d-b59d-cc666a6cc0bd | nova-consoleauth | controller-1.localdomain | internal | enabled | up | 2017-09-18T13:57:52.000000 | - | False |
| ca403e7e-1e33-40bd-b95e-7d60fb560a5a | nova-consoleauth | controller-2.localdomain | internal | enabled | up | 2017-09-18T13:57:54.000000 | - | False |
| 0a1fdca5-84d7-4c4d-894a-9f1ea2d434c0 | nova-compute | compute-1.localdomain | nova | enabled | up | 2017-09-18T13:57:48.000000 | - | False |
| 5d91e538-2a9d-4186-a5f9-0055a78cafb9 | nova-conductor | controller-0.localdomain | internal | enabled | up | 2017-09-18T13:57:48.000000 | - | False |
| e52b5b29-18b7-478a-891b-a677e7d24d19 | nova-scheduler | controller-0.localdomain | internal | enabled | up | 2017-09-18T13:57:51.000000 | - | False |
| 2d32736a-5fe4-4797-a3f2-afb304c0a0f3 | nova-consoleauth | controller-0.localdomain | internal | enabled | up | 2017-09-18T13:57:50.000000 | - | False |
| 453e2c46-f476-4fbc-905c-0e54c68aadaf | nova-conductor | controller-1.localdomain | internal | enabled | up | 2017-09-18T13:57:53.000000 | - | False |
| a514f4a9-8e40-4a42-b92b-37d57d299570 | nova-conductor | controller-2.localdomain | internal | enabled | up | 2017-09-18T13:57:53.000000 | - | False |
| fc58e9ef-8b21-49f8-93f4-0663ca051b8a | nova-compute | compute-0.localdomain | nova | enabled | up | 2017-09-18T13:57:53.000000 | - | False |
| a35f9f91-9116-4f69-822b-fc576ad9f6f5 | nova-scheduler | controller-1.localdomain | internal | enabled | up | 2017-09-18T13:57:48.000000 | - | False |
| e1490acc-765d-46fe-9114-a7bb4eb7a2d2 | nova-scheduler | controller-2.localdomain | internal | enabled | up | 2017-09-18T13:57:47.000000 | - | False |
| f184bcaf-3dc8-4d8d-b59d-cc666a6cc0bd | nova-consoleauth | controller-1.localdomain | internal | enabled | up | 2017-09-18T13:57:52.000000 | - | False |
| ca403e7e-1e33-40bd-b95e-7d60fb560a5a | nova-consoleauth | controller-2.localdomain | internal | enabled | up | 2017-09-18T13:57:54.000000 | - | False |
| 0a1fdca5-84d7-4c4d-894a-9f1ea2d434c0 | nova-compute | compute-1.localdomain | nova | enabled | up | 2017-09-18T13:57:48.000000 | - | False |
| 5d91e538-2a9d-4186-a5f9-0055a78cafb9 | nova-conductor | controller-0.localdomain | internal | enabled | up | 2017-09-18T13:57:48.000000 | - | False |
| e52b5b29-18b7-478a-891b-a677e7d24d19 | nova-scheduler | controller-0.localdomain | internal | enabled | up | 2017-09-18T13:57:51.000000 | - | False |
| 2d32736a-5fe4-4797-a3f2-afb304c0a0f3 | nova-consoleauth | controller-0.localdomain | internal | enabled | up | 2017-09-18T13:57:50.000000 | - | False |
+--------------------------------------+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+-------------+
Duplicate cell_v2 mapping is the culprit:

[root@controller-0 heat-admin]# nova-manage cell_v2 list_cells
Option "rabbit_use_ssl" from group "oslo_messaging_rabbit" is deprecated. Use option "ssl" from group "oslo_messaging_rabbit".
+---------+--------------------------------------+----------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------+
| Name | UUID | Transport URL | Database Connection |
+---------+--------------------------------------+----------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------+
| cell0 | 00000000-0000-0000-0000-000000000000 | none:/ | mysql+pymysql://nova:****@172.17.1.11/nova_cell0?read_default_file=/etc/my.cnf.d/tripleo.cnf&read_default_group=tripleo |
| default | 1f4fa8fd-966c-4e46-b90b-164aa8b7e49b | rabbit://guest:****@controller-2.internalapi.localdomain:5672/?ssl=0 | mysql+pymysql://nova:****@172.17.1.11/nova?read_default_file=/etc/my.cnf.d/tripleo.cnf&read_default_group=tripleo |
| default | 87002684-89e6-4227-8a8d-8c501dcf3a92 | rabbit://guest:****@controller-2.internalapi.localdomain:5672/?ssl=0 | mysql+pymysql://nova:****@172.17.1.11/nova?read_default_group=tripleo&read_default_file=/etc/my.cnf.d/tripleo.cnf |
+---------+--------------------------------------+----------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------+
Comparing the database_connection on fresh deployments:

OSP11: mysql+pymysql://nova:HnK9XA7e8wwJh9A6NFNpfAzgZ.1.14/nova?read_default_file=/etc/my.cnf.d/tripleo.cnf&read_default_group=tripleo
OSP12: mysql+pymysql://nova:j6E3FpBMQF69mUeQFkkYqT2Mq@[fd00:fd00:fd00:2000::1a]/nova?read_default_group=tripleo&read_default_file=/etc/my.cnf.d/tripleo.cnf

It looks like the position of read_default_file in OSP11 is swapped with read_default_group in OSP12.
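To make that concrete: the two URLs carry the same query parameters in a different order, so they are semantically equivalent but do not compare equal as plain strings. A quick illustration with placeholder values (password masked, host borrowed from the list_cells output above):

  # Same parameters, different order: equivalent URLs, unequal strings.
  OSP11_DB='mysql+pymysql://nova:****@172.17.1.11/nova?read_default_file=/etc/my.cnf.d/tripleo.cnf&read_default_group=tripleo'
  OSP12_DB='mysql+pymysql://nova:****@172.17.1.11/nova?read_default_group=tripleo&read_default_file=/etc/my.cnf.d/tripleo.cnf'
  [ "$OSP11_DB" = "$OSP12_DB" ] && echo identical || echo different   # prints: different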
Yea, and nova-manage cell_v2 create_cell is only idempotent if the transport_url and database_connection are identical. However, we now have a cell_v2 update_cell command, so we can find the cell uuid and ensure the name/MQ/DB are correct.
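A rough sketch of how that could be used to reconcile the mappings by hand on a controller, assuming (as the list_cells output above suggests) that the second 'default' entry is the stale one; this is only an illustration, not the proposed fix:

  # Show the current mappings, including full transport/database URLs.
  nova-manage cell_v2 list_cells --verbose
  # Refresh the surviving cell so its transport_url/database_connection
  # match the current configuration (values are read from nova.conf when
  # not passed explicitly).
  nova-manage cell_v2 update_cell --cell_uuid 1f4fa8fd-966c-4e46-b90b-164aa8b7e49b
  # Drop the stale duplicate mapping (delete_cell refuses if hosts or
  # instances are still mapped to that cell).
  nova-manage cell_v2 delete_cell --cell_uuid 87002684-89e6-4227-8a8d-8c501dcf3a92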
So the URL changes seem to be due to our swapping over to the make_url function from Heat: https://review.openstack.org/#/c/446704/
*** Bug 1491611 has been marked as a duplicate of this bug. ***
Still waiting for this to be merged: https://review.openstack.org/#/q/topic:bug/1718912+(status:open+OR+status:merged)
https://review.openstack.org/513383 has merged
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:3462