Description of problem:
On a deployment that was upgraded from OSP8 to OSP9, after scaling out with an additional compute node the nova-compute service shows as down on the newly added node.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-2.0.0-14.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Do the initial deployment:
source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates \
  -e $THT/environments/network-isolation.yaml \
  -e $THT/environments/network-management.yaml \
  -e ~/templates/network-environment.yaml \
  -e $THT/environments/storage-environment.yaml \
  -e ~/templates/disk-layout.yaml \
  -e ~/templates/wipe-disk-env.yaml \
  -e ~/templates/enable-tls.yaml \
  -e ~/templates/inject-trust-anchor.yaml \
  --control-scale 3 \
  --control-flavor controller \
  --compute-scale 1 \
  --compute-flavor compute \
  --ceph-storage-scale 3 \
  --ceph-storage-flavor ceph \
  --ntp-server clock.redhat.com \
  --libvirt-type qemu

2. Upgrade the undercloud:
sudo yum update -y
openstack undercloud upgrade

3. Update the images:
openstack overcloud image upload --update-existing
openstack baremetal configure boot

4. Add the OSP8 repos on the overcloud nodes.

5. major-upgrade-aodh.yaml:
source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates \
  -e $THT/environments/network-isolation.yaml \
  -e $THT/environments/network-management.yaml \
  -e ~/templates/network-environment.yaml \
  -e $THT/environments/storage-environment.yaml \
  -e ~/templates/disk-layout.yaml \
  -e ~/templates/wipe-disk-env.yaml \
  -e ~/templates/enable-tls.yaml \
  -e ~/templates/inject-trust-anchor.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-aodh.yaml \
  --control-scale 3 \
  --control-flavor controller \
  --compute-scale 1 \
  --compute-flavor compute \
  --ceph-storage-scale 3 \
  --ceph-storage-flavor ceph \
  --ntp-server clock.redhat.com \
  --libvirt-type qemu

6. major-upgrade-keystone-liberty-mitaka.yaml:
source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates \
  -e $THT/environments/network-isolation.yaml \
  -e $THT/environments/network-management.yaml \
  -e ~/templates/network-environment.yaml \
  -e $THT/environments/storage-environment.yaml \
  -e ~/templates/disk-layout.yaml \
  -e ~/templates/wipe-disk-env.yaml \
  -e ~/templates/enable-tls.yaml \
  -e ~/templates/inject-trust-anchor.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-keystone-liberty-mitaka.yaml \
  --control-scale 3 \
  --control-flavor controller \
  --compute-scale 1 \
  --compute-flavor compute \
  --ceph-storage-scale 3 \
  --ceph-storage-flavor ceph \
  --ntp-server clock.redhat.com \
  --libvirt-type qemu

7. Add the OSP9 repos on the overcloud nodes.

8. major-upgrade-pacemaker-init.yaml:
source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates \
  -e $THT/environments/network-isolation.yaml \
  -e $THT/environments/network-management.yaml \
  -e ~/templates/network-environment.yaml \
  -e $THT/environments/storage-environment.yaml \
  -e ~/templates/disk-layout.yaml \
  -e ~/templates/wipe-disk-env.yaml \
  -e ~/templates/enable-tls.yaml \
  -e ~/templates/inject-trust-anchor.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker-init.yaml \
  --control-scale 3 \
  --control-flavor controller \
  --compute-scale 1 \
  --compute-flavor compute \
  --ceph-storage-scale 3 \
  --ceph-storage-flavor ceph \
  --ntp-server clock.redhat.com \
  --libvirt-type qemu

9. Update os-collect-config and resource-agents on the overcloud nodes.

10. major-upgrade-pacemaker.yaml:
source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates \
  -e $THT/environments/network-isolation.yaml \
  -e $THT/environments/network-management.yaml \
  -e ~/templates/network-environment.yaml \
  -e $THT/environments/storage-environment.yaml \
  -e ~/templates/disk-layout.yaml \
  -e ~/templates/wipe-disk-env.yaml \
  -e ~/templates/enable-tls.yaml \
  -e ~/templates/inject-trust-anchor.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker.yaml \
  --control-scale 3 \
  --control-flavor controller \
  --compute-scale 1 \
  --compute-flavor compute \
  --ceph-storage-scale 3 \
  --ceph-storage-flavor ceph \
  --ntp-server clock.redhat.com \
  --libvirt-type qemu

11. Start rabbitmq on controller-1 and controller-2:
systemctl start rabbitmq-server.service
pcs resource cleanup

12. upgrade-non-controller.sh --upgrade overcloud-novacompute-0

13. upgrade-non-controller.sh --upgrade overcloud-cephstorage-0
upgrade-non-controller.sh --upgrade overcloud-cephstorage-1
upgrade-non-controller.sh --upgrade overcloud-cephstorage-2

14. Converge:
source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates \
  -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml \
  -e $THT/environments/network-isolation.yaml \
  -e $THT/environments/network-management.yaml \
  -e ~/templates/network-environment.yaml \
  -e $THT/environments/storage-environment.yaml \
  -e ~/templates/disk-layout.yaml \
  -e ~/templates/wipe-disk-env.yaml \
  -e ~/templates/enable-tls.yaml \
  -e ~/templates/inject-trust-anchor.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker-converge.yaml \
  --control-scale 3 \
  --control-flavor controller \
  --compute-scale 1 \
  --compute-flavor compute \
  --ceph-storage-scale 3 \
  --ceph-storage-flavor ceph \
  --ntp-server clock.redhat.com \
  --libvirt-type qemu

15. Add an additional compute node:
source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates \
  -e $THT/environments/network-isolation.yaml \
  -e $THT/environments/network-management.yaml \
  -e ~/templates/network-environment.yaml \
  -e $THT/environments/storage-environment.yaml \
  -e ~/templates/disk-layout.yaml \
  -e ~/templates/wipe-disk-env.yaml \
  -e ~/templates/enable-tls.yaml \
  -e ~/templates/inject-trust-anchor.yaml \
  --control-scale 3 \
  --control-flavor controller \
  --compute-scale 2 \
  --compute-flavor compute \
  --ceph-storage-scale 3 \
  --ceph-storage-flavor ceph \
  --ntp-server clock.redhat.com \
  --libvirt-type qemu

Actual results:
[stack@undercloud ~]$ . overcloudrc
[stack@undercloud ~]$ nova service-list
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:303: SubjectAltNameWarning: Certificate for 172.16.18.25 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:303: SubjectAltNameWarning: Certificate for 172.16.18.25 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
+----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+
| Id | Binary           | Host                               | Zone     | Status  | State | Updated_at                 | Disabled Reason |
+----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+
| 3  | nova-scheduler   | overcloud-controller-0.localdomain | internal | enabled | up    | 2016-07-13T12:19:10.000000 | -               |
| 6  | nova-scheduler   | overcloud-controller-2.localdomain | internal | enabled | up    | 2016-07-13T12:19:09.000000 | -               |
| 9  | nova-scheduler   | overcloud-controller-1.localdomain | internal | enabled | up    | 2016-07-13T12:19:09.000000 | -               |
| 12 | nova-consoleauth | overcloud-controller-0.localdomain | internal | enabled | up    | 2016-07-13T12:19:08.000000 | -               |
| 15 | nova-consoleauth | overcloud-controller-1.localdomain | internal | enabled | up    | 2016-07-13T12:19:08.000000 | -               |
| 18 | nova-consoleauth | overcloud-controller-2.localdomain | internal | enabled | up    | 2016-07-13T12:19:07.000000 | -               |
| 21 | nova-conductor   | overcloud-controller-1.localdomain | internal | enabled | up    | 2016-07-13T12:19:09.000000 | -               |
| 27 | nova-compute     | overcloud-compute-0.localdomain    | nova     | enabled | up    | 2016-07-13T12:19:06.000000 | -               |
| 30 | nova-conductor   | overcloud-controller-2.localdomain | internal | enabled | up    | 2016-07-13T12:19:06.000000 | -               |
| 33 | nova-conductor   | overcloud-controller-0.localdomain | internal | enabled | up    | 2016-07-13T12:19:09.000000 | -               |
| 60 | nova-compute     | overcloud-compute-1.localdomain    | nova     | enabled | down  | -                          | -               |
+----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+

Expected results:
nova-compute | overcloud-compute-1.localdomain | nova | enabled | up

Additional info:
On the new compute node:

[root@overcloud-compute-1 heat-admin]# systemctl status openstack-nova-compute
● openstack-nova-compute.service - OpenStack Nova Compute Server
   Loaded: loaded (/usr/lib/systemd/system/openstack-nova-compute.service; enabled; vendor preset: disabled)
   Active: activating (start) since Wed 2016-07-13 12:20:07 UTC; 2s ago
 Main PID: 14276 (nova-compute)
   CGroup: /system.slice/openstack-nova-compute.service
           └─14276 /usr/bin/python2 /usr/bin/nova-compute

Jul 13 12:20:07 overcloud-compute-1.localdomain systemd[1]: Starting OpenStack Nova Compute Server...

/var/log/nova/nova-compute.log keeps showing this error:
ERROR nova.compute.manager [req-a57e050b-0eeb-47ad-b405-019eda8c772a - - - - -] No compute node record for host overcloud-compute-1.localdomain
WARNING nova.compute.monitors [req-a57e050b-0eeb-47ad-b405-019eda8c772a - - - - -] Excluding nova.compute.monitors.cpu monitor virt_driver. Not in the list of enabled monitors (CONF.compute_monitors).
INFO nova.compute.resource_tracker [req-a57e050b-0eeb-47ad-b405-019eda8c772a - - - - -] Auditing locally available compute resources for node overcloud-compute-1.localdomain
So, how was the health state of the underlying Ceph storage when the problem was reproduced? Comment 3 suggests that this could be the underlying issue. Is this more than a gut feeling?
During my tests, before starting the upgrade process I successfully launched an instance on the OSP8 overcloud, and it is still accessible post-upgrade, so I believe the Ceph storage isn't the issue here. Unfortunately the environment is currently blocked for another BZ investigation, but once it's available I can use it to reproduce the issue if the sosreports don't help.
Any update here?
Not from my side.
Are you still going to reproduce this again in your environment?
Yes, I'm preparing an environment today. I'll get back with the credentials once it's ready.
Checked that the issue doesn't reproduce on a clean deployment of 9 + scale out.
(In reply to Alexander Chuzhoy from comment #11)
> Checked that the issue doesn't reproduce on a clean deployment of 9 + scale
> out.

@sasha: can you clarify what you mean by "clean deployment of 9"?

Specifically, did you mean a non-upgraded deployment of OSP9?
(In reply to Eoghan Glynn from comment #12)
> (In reply to Alexander Chuzhoy from comment #11)
> > Checked that the issue doesn't reproduce on a clean deployment of 9 + scale
> > out.
>
> @sasha: can you clarify what you mean by "clean deployment of 9"?
>
> Specifically, did you mean a non-upgraded deployment of OSP9?

Exactly.
It looks like nova-compute on the new node keeps trying to start but fails (segfault in librados). /var/log/messages shows the following:

Jul 27 07:16:54 localhost systemd: Started OpenStack Nova Compute Server.
Jul 27 07:16:54 localhost kernel: nova-compute[18765]: segfault at 0 ip 00007f1510d8217a sp 00007f150ab093b0 error 4 in librados.so.2.0.0[7f1510a54000+504000]
Jul 27 07:16:54 localhost journal: End of file while reading data: Input/output error
Jul 27 07:16:54 localhost systemd: openstack-nova-compute.service: main process exited, code=killed, status=11/SEGV
Jul 27 07:16:54 localhost systemd: Unit openstack-nova-compute.service entered failed state.
Jul 27 07:16:54 localhost systemd: openstack-nova-compute.service failed.
Jul 27 07:16:54 localhost systemd: openstack-nova-compute.service holdoff time over, scheduling restart.
Jul 27 07:16:54 localhost systemd: Starting OpenStack Nova Compute Server...
The /etc/ceph/ceph.client.openstack.keyring on the upgraded compute node had the wrong format. It was like so:

[client.openstack]
        key = AAAAAAAAAAAAAAAA

whereas a valid keyring looks like:

[client.openstack]
        key = AQBHOJdXAAAAABAAod6tL8beRSB1IasVq0FywQ==

After putting the proper key on the compute node, the process starts normally. So I'm closing this, since this is not a bug.
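For what it's worth, the bad key above can be caught mechanically: a CephX key is base64-encoded binary data. A minimal sanity-check sketch (illustration only, not part of any tooling; the 28-byte figure is what the valid key from this report decodes to, consistent with my understanding of the CryptoKey encoding):

```python
import base64

def looks_like_cephx_key(key):
    """Heuristic check that a string could be a CephX secret.

    Strict base64 decoding alone is not enough: the broken key from
    this report ("AAAAAAAAAAAAAAAA") is valid base64 but decodes to
    only 12 bytes. A real CephX key (2-byte type + 8-byte creation
    time + 2-byte length + 16-byte AES secret) decodes to 28 bytes.
    """
    try:
        raw = base64.b64decode(key, validate=True)
    except (ValueError, TypeError):
        return False
    return len(raw) == 28

print(looks_like_cephx_key("AAAAAAAAAAAAAAAA"))                          # False
print(looks_like_cephx_key("AQBHOJdXAAAAABAAod6tL8beRSB1IasVq0FywQ=="))  # True
```

Running such a check against /etc/ceph/ceph.client.openstack.keyring on a freshly scaled-out node would have flagged the problem before nova-compute ever segfaulted in librados.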
I'm reopening it, as we need to figure out how the ceph.client.openstack.keyring got generated for the new node; probably it changed between releases.
After passing an environment file containing:

parameter_defaults:
  CephClientKey: 'AQBFZqRXGHWdIRAAqHuCLQgXHQYHVl3jQSFqWg=='

to the overcloud deploy command, I was able to scale out with an additional compute node on which the nova-compute service started OK:

[stack@undercloud ~]$ nova service-list
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:303: SubjectAltNameWarning: Certificate for 172.16.18.25 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:303: SubjectAltNameWarning: Certificate for 172.16.18.25 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
+-----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+
| Id  | Binary           | Host                               | Zone     | Status  | State | Updated_at                 | Disabled Reason |
+-----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+
| 2   | nova-scheduler   | overcloud-controller-0.localdomain | internal | enabled | up    | 2016-08-05T15:47:28.000000 | -               |
| 5   | nova-scheduler   | overcloud-controller-2.localdomain | internal | enabled | up    | 2016-08-05T15:47:27.000000 | -               |
| 8   | nova-scheduler   | overcloud-controller-1.localdomain | internal | enabled | up    | 2016-08-05T15:47:29.000000 | -               |
| 11  | nova-consoleauth | overcloud-controller-0.localdomain | internal | enabled | up    | 2016-08-05T15:47:32.000000 | -               |
| 14  | nova-conductor   | overcloud-controller-1.localdomain | internal | enabled | up    | 2016-08-05T15:47:31.000000 | -               |
| 23  | nova-consoleauth | overcloud-controller-2.localdomain | internal | enabled | up    | 2016-08-05T15:47:33.000000 | -               |
| 26  | nova-consoleauth | overcloud-controller-1.localdomain | internal | enabled | up    | 2016-08-05T15:47:33.000000 | -               |
| 29  | nova-conductor   | overcloud-controller-0.localdomain | internal | enabled | up    | 2016-08-05T15:47:31.000000 | -               |
| 41  | nova-conductor   | overcloud-controller-2.localdomain | internal | enabled | up    | 2016-08-05T15:47:30.000000 | -               |
| 44  | nova-compute     | overcloud-compute-0.localdomain    | nova     | enabled | up    | 2016-08-05T15:47:29.000000 | -               |
| 103 | nova-compute     | overcloud-compute-1.localdomain    | nova     | enabled | up    | 2016-08-05T15:47:29.000000 | -               |
+-----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+
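The workaround can be scripted rather than done by hand. A minimal sketch (the file name ceph-client-key.yaml is hypothetical; the sample key is the valid-format example from earlier in this report, and on a real deployment the keyring path would be /etc/ceph/ceph.client.openstack.keyring on an already-working node):

```shell
# Build a demo keyring in the valid format (sample key from this report).
keyring=$(mktemp)
cat > "$keyring" <<'EOF'
[client.openstack]
        key = AQBHOJdXAAAAABAAod6tL8beRSB1IasVq0FywQ==
EOF

# On a real system: keyring=/etc/ceph/ceph.client.openstack.keyring
# Extract the key (third field of the "key = <value>" line).
key=$(awk '$1 == "key" {print $3}' "$keyring")

# Write the environment file that pins CephClientKey, so that newly
# scaled-out nodes receive the same key as the upgraded nodes.
cat > ceph-client-key.yaml <<EOF
parameter_defaults:
  CephClientKey: '${key}'
EOF

cat ceph-client-key.yaml
```

The resulting file would then be passed to the scale-out deploy command with an extra `-e ceph-client-key.yaml`.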
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-1599.html