Bug 1356107

Summary: nova-compute service is down on compute nodes added post 8->9 upgrade (tripleO Heat Template)
Product: Red Hat OpenStack Reporter: Marius Cornea <mcornea>
Component: openstack-tripleo-heat-templates Assignee: Giulio Fidente <gfidente>
Status: CLOSED ERRATA QA Contact: Marius Cornea <mcornea>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 9.0 (Mitaka) CC: dbecker, dmacpher, eglynn, gfidente, jefbrown, jjoyce, mburns, mcornea, morazi, rhel-osp-director-maint, sasha, scohen, seb, sgordon, svanders, tvignaud
Target Milestone: ga Keywords: Reopened
Target Release: 9.0 (Mitaka)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-2.0.0-29.el7ost Doc Type: Bug Fix
Doc Text:
OpenStack Platform 9 deployments require an additional CephX key for the "client.openstack" user. However, the director's command line client does not generate this key for existing deployments and updates the "ceph.openstack" keyring with an empty secret. Before upgrading, generate a new CephX key and pass it to the deployment using the CephClientKey parameter in an environment file. For example:

parameter_defaults:
  CephClientKey: 'my_cephx_key'

Generate the new key with the following command:

$ ceph-authtool --gen-print-key
Story Points: ---
Clone Of:
: 1363645 1363650 Environment:
Last Closed: 2016-08-11 11:36:10 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1353971, 1363645    
Bug Blocks:    
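
A minimal sketch of the workaround described in the Doc Text, assuming Python is available on the undercloud. The helper names (generate_cephx_key, render_environment) are hypothetical, and the key layout (a 12-byte little-endian header followed by 16 random bytes) is an assumption inferred from the valid key quoted in comment 18. `ceph-authtool --gen-print-key` remains the documented way to generate the key.

```python
import base64
import os
import struct
import time


def generate_cephx_key():
    """Return a base64 CephX-style secret (assumed layout, see note above)."""
    secret = os.urandom(16)
    # Assumed header: key type 1, creation time (sec, nsec), secret length.
    header = struct.pack("<HIIH", 1, int(time.time()), 0, len(secret))
    return base64.b64encode(header + secret).decode("ascii")


def render_environment(key):
    """Render Heat environment file content that sets CephClientKey."""
    return "parameter_defaults:\n  CephClientKey: '{}'\n".format(key)


if __name__ == "__main__":
    # Redirect the output to a file and pass it to the deploy command with -e.
    print(render_environment(generate_cephx_key()), end="")
```

The same environment file would need to be passed to every subsequent overcloud deploy call, so that the key does not change between runs.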

Description Marius Cornea 2016-07-13 12:21:23 UTC
Description of problem:
On a deployment that was upgraded from 8->9, after scaling out with an additional compute node the nova-compute service shows as down on the newly added node.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-2.0.0-14.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:

1. Do initial deployment
source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates \
-e $THT/environments/network-isolation.yaml \
-e $THT/environments/network-management.yaml \
-e ~/templates/network-environment.yaml \
-e $THT/environments/storage-environment.yaml \
-e ~/templates/disk-layout.yaml \
-e ~/templates/wipe-disk-env.yaml \
-e ~/templates/enable-tls.yaml \
-e ~/templates/inject-trust-anchor.yaml \
--control-scale 3 \
--control-flavor controller \
--compute-scale 1 \
--compute-flavor compute \
--ceph-storage-scale 3 \
--ceph-storage-flavor ceph \
--ntp-server clock.redhat.com \
--libvirt-type qemu

2. Upgrade undercloud 
sudo yum update -y
openstack undercloud upgrade

3. Update images 
openstack overcloud image upload --update-existing
openstack baremetal configure boot

4. Add osp8 repos on overcloud nodes

5. major-upgrade-aodh.yaml

source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates \
-e $THT/environments/network-isolation.yaml \
-e $THT/environments/network-management.yaml \
-e ~/templates/network-environment.yaml \
-e $THT/environments/storage-environment.yaml \
-e ~/templates/disk-layout.yaml \
-e ~/templates/wipe-disk-env.yaml \
-e ~/templates/enable-tls.yaml \
-e ~/templates/inject-trust-anchor.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-aodh.yaml \
--control-scale 3 \
--control-flavor controller \
--compute-scale 1 \
--compute-flavor compute \
--ceph-storage-scale 3 \
--ceph-storage-flavor ceph \
--ntp-server clock.redhat.com \
--libvirt-type qemu

6. major-upgrade-keystone-liberty-mitaka.yaml 

source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates \
-e $THT/environments/network-isolation.yaml \
-e $THT/environments/network-management.yaml \
-e ~/templates/network-environment.yaml \
-e $THT/environments/storage-environment.yaml \
-e ~/templates/disk-layout.yaml \
-e ~/templates/wipe-disk-env.yaml \
-e ~/templates/enable-tls.yaml \
-e ~/templates/inject-trust-anchor.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-keystone-liberty-mitaka.yaml \
--control-scale 3 \
--control-flavor controller \
--compute-scale 1 \
--compute-flavor compute \
--ceph-storage-scale 3 \
--ceph-storage-flavor ceph \
--ntp-server clock.redhat.com \
--libvirt-type qemu

7. Add OSP9 repos on overcloud nodes

8. major-upgrade-pacemaker-init.yaml
source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates \
-e $THT/environments/network-isolation.yaml \
-e $THT/environments/network-management.yaml \
-e ~/templates/network-environment.yaml \
-e $THT/environments/storage-environment.yaml \
-e ~/templates/disk-layout.yaml \
-e ~/templates/wipe-disk-env.yaml \
-e ~/templates/enable-tls.yaml \
-e ~/templates/inject-trust-anchor.yaml \
-e  /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker-init.yaml \
--control-scale 3 \
--control-flavor controller \
--compute-scale 1 \
--compute-flavor compute \
--ceph-storage-scale 3 \
--ceph-storage-flavor ceph \
--ntp-server clock.redhat.com \
--libvirt-type qemu

9. Update os-collect-config and resource-agents on overcloud nodes

10. major-upgrade-pacemaker.yaml
source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates \
-e $THT/environments/network-isolation.yaml \
-e $THT/environments/network-management.yaml \
-e ~/templates/network-environment.yaml \
-e $THT/environments/storage-environment.yaml \
-e ~/templates/disk-layout.yaml \
-e ~/templates/wipe-disk-env.yaml \
-e ~/templates/enable-tls.yaml \
-e ~/templates/inject-trust-anchor.yaml \
-e  /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker.yaml \
--control-scale 3 \
--control-flavor controller \
--compute-scale 1 \
--compute-flavor compute \
--ceph-storage-scale 3 \
--ceph-storage-flavor ceph \
--ntp-server clock.redhat.com \
--libvirt-type qemu

11. Start rabbitmq on controller-1 and controller-2
systemctl start rabbitmq-server.service
pcs resource cleanup

12. Upgrade the compute node
upgrade-non-controller.sh --upgrade overcloud-novacompute-0

13. Upgrade the Ceph storage nodes
upgrade-non-controller.sh --upgrade overcloud-cephstorage-0
upgrade-non-controller.sh --upgrade overcloud-cephstorage-1
upgrade-non-controller.sh --upgrade overcloud-cephstorage-2

14. converge:

source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates \
-e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml \
-e $THT/environments/network-isolation.yaml \
-e $THT/environments/network-management.yaml \
-e ~/templates/network-environment.yaml \
-e $THT/environments/storage-environment.yaml \
-e ~/templates/disk-layout.yaml \
-e ~/templates/wipe-disk-env.yaml \
-e ~/templates/enable-tls.yaml \
-e ~/templates/inject-trust-anchor.yaml \
-e  /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker-converge.yaml \
--control-scale 3 \
--control-flavor controller \
--compute-scale 1 \
--compute-flavor compute \
--ceph-storage-scale 3 \
--ceph-storage-flavor ceph \
--ntp-server clock.redhat.com \
--libvirt-type qemu

15. Add an additional compute node:
source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates \
-e $THT/environments/network-isolation.yaml \
-e $THT/environments/network-management.yaml \
-e ~/templates/network-environment.yaml \
-e $THT/environments/storage-environment.yaml \
-e ~/templates/disk-layout.yaml \
-e ~/templates/wipe-disk-env.yaml \
-e ~/templates/enable-tls.yaml \
-e ~/templates/inject-trust-anchor.yaml \
--control-scale 3 \
--control-flavor controller \
--compute-scale 2 \
--compute-flavor compute \
--ceph-storage-scale 3 \
--ceph-storage-flavor ceph \
--ntp-server clock.redhat.com \
--libvirt-type qemu 
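
The deploy calls in the steps above differ only in the extra upgrade environment file and in the compute scale. As a rough sketch (the helper name deploy_command is hypothetical and not part of the director tooling), the shared arguments could be assembled once. Step 14 also passes overcloud-resource-registry-puppet.yaml first, which this sketch does not model.

```python
import os

THT = "/usr/share/openstack-tripleo-heat-templates"

# Environment files passed on every deploy call in the steps above.
COMMON_ENVIRONMENTS = [
    THT + "/environments/network-isolation.yaml",
    THT + "/environments/network-management.yaml",
    "~/templates/network-environment.yaml",
    THT + "/environments/storage-environment.yaml",
    "~/templates/disk-layout.yaml",
    "~/templates/wipe-disk-env.yaml",
    "~/templates/enable-tls.yaml",
    "~/templates/inject-trust-anchor.yaml",
]


def deploy_command(extra_environments=(), compute_scale=1):
    """Build the argument list for one 'openstack overcloud deploy' call."""
    cmd = ["openstack", "overcloud", "deploy", "--templates"]
    for env in COMMON_ENVIRONMENTS + list(extra_environments):
        # Expand "~" here because no shell is involved in an argument list.
        cmd += ["-e", os.path.expanduser(env)]
    cmd += [
        "--control-scale", "3",
        "--control-flavor", "controller",
        "--compute-scale", str(compute_scale),
        "--compute-flavor", "compute",
        "--ceph-storage-scale", "3",
        "--ceph-storage-flavor", "ceph",
        "--ntp-server", "clock.redhat.com",
        "--libvirt-type", "qemu",
    ]
    return cmd
```

For example, deploy_command([THT + "/environments/major-upgrade-aodh.yaml"]) corresponds to step 5 and deploy_command(compute_scale=2) to step 15. The resulting list can be handed to subprocess.run after sourcing stackrc.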

Actual results:
[stack@undercloud ~]$ . overcloudrc 
[stack@undercloud ~]$ nova service-list
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:303: SubjectAltNameWarning: Certificate for 172.16.18.25 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:303: SubjectAltNameWarning: Certificate for 172.16.18.25 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
+----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+
| Id | Binary           | Host                               | Zone     | Status  | State | Updated_at                 | Disabled Reason |
+----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+
| 3  | nova-scheduler   | overcloud-controller-0.localdomain | internal | enabled | up    | 2016-07-13T12:19:10.000000 | -               |
| 6  | nova-scheduler   | overcloud-controller-2.localdomain | internal | enabled | up    | 2016-07-13T12:19:09.000000 | -               |
| 9  | nova-scheduler   | overcloud-controller-1.localdomain | internal | enabled | up    | 2016-07-13T12:19:09.000000 | -               |
| 12 | nova-consoleauth | overcloud-controller-0.localdomain | internal | enabled | up    | 2016-07-13T12:19:08.000000 | -               |
| 15 | nova-consoleauth | overcloud-controller-1.localdomain | internal | enabled | up    | 2016-07-13T12:19:08.000000 | -               |
| 18 | nova-consoleauth | overcloud-controller-2.localdomain | internal | enabled | up    | 2016-07-13T12:19:07.000000 | -               |
| 21 | nova-conductor   | overcloud-controller-1.localdomain | internal | enabled | up    | 2016-07-13T12:19:09.000000 | -               |
| 27 | nova-compute     | overcloud-compute-0.localdomain    | nova     | enabled | up    | 2016-07-13T12:19:06.000000 | -               |
| 30 | nova-conductor   | overcloud-controller-2.localdomain | internal | enabled | up    | 2016-07-13T12:19:06.000000 | -               |
| 33 | nova-conductor   | overcloud-controller-0.localdomain | internal | enabled | up    | 2016-07-13T12:19:09.000000 | -               |
| 60 | nova-compute     | overcloud-compute-1.localdomain    | nova     | enabled | down  | -                          | -               |
+----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+


Expected results:
 nova-compute     | overcloud-compute-1.localdomain    | nova     | enabled | up
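
To check for this condition without reading the table by eye, the output of nova service-list can be filtered for rows whose State is "down". This is only a sketch: the function name down_services is hypothetical, and it assumes the table layout shown above.

```python
def down_services(table_text):
    """Return (binary, host) pairs whose State column is 'down'."""
    header = None
    down = []
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip borders, warnings, and blank lines
        cells = [c.strip() for c in line.strip("|").split("|")]
        if header is None:
            header = cells  # first table row holds the column names
            continue
        row = dict(zip(header, cells))
        if row.get("State") == "down":
            down.append((row["Binary"], row["Host"]))
    return down


sample = """\
| Id | Binary       | Host                            | Zone | Status  | State | Updated_at                 | Disabled Reason |
| 27 | nova-compute | overcloud-compute-0.localdomain | nova | enabled | up    | 2016-07-13T12:19:06.000000 | -               |
| 60 | nova-compute | overcloud-compute-1.localdomain | nova | enabled | down  | -                          | -               |
"""
print(down_services(sample))
# prints [('nova-compute', 'overcloud-compute-1.localdomain')]
```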

Additional info:
On the new compute node:

[root@overcloud-compute-1 heat-admin]# systemctl status openstack-nova-compute
● openstack-nova-compute.service - OpenStack Nova Compute Server
   Loaded: loaded (/usr/lib/systemd/system/openstack-nova-compute.service; enabled; vendor preset: disabled)
   Active: activating (start) since Wed 2016-07-13 12:20:07 UTC; 2s ago
 Main PID: 14276 (nova-compute)
   CGroup: /system.slice/openstack-nova-compute.service
           └─14276 /usr/bin/python2 /usr/bin/nova-compute

Jul 13 12:20:07 overcloud-compute-1.localdomain systemd[1]: Starting OpenStack Nova Compute Server...


/var/log/nova/nova-compute.log keeps showing this error:

ERROR nova.compute.manager [req-a57e050b-0eeb-47ad-b405-019eda8c772a - - - - -] No compute node record for host overcloud-compute-1.localdomain
WARNING nova.compute.monitors [req-a57e050b-0eeb-47ad-b405-019eda8c772a - - - - -] Excluding nova.compute.monitors.cpu monitor virt_driver. Not in the list of enabled monitors (CONF.compute_monitors).
INFO nova.compute.resource_tracker [req-a57e050b-0eeb-47ad-b405-019eda8c772a - - - - -] Auditing locally available compute resources for node overcloud-compute-1.localdomain

Comment 4 Sven Anderson 2016-07-14 13:50:18 UTC
So, what was the health state of the underlying Ceph storage when the problem was reproduced? Comment 3 suggests that this could be the underlying issue. Is this more than a gut feeling?

Comment 5 Marius Cornea 2016-07-14 14:11:54 UTC
During my tests, before starting the upgrade process, I successfully launched an instance on the OSP8 overcloud. It is still accessible post-upgrade, so I believe the Ceph storage isn't an issue here.

Unfortunately the environment is currently blocked for another BZ investigation but once it's available I can use it to reproduce the issue if the sosreports don't help.

Comment 6 Stephen Gordon 2016-07-21 14:46:23 UTC
Any update here?

Comment 7 Marius Cornea 2016-07-21 14:48:43 UTC
Not from my side.

Comment 8 Sven Anderson 2016-07-25 12:34:54 UTC
Are you still going to reproduce this again in your environment?

Comment 9 Marius Cornea 2016-07-25 12:36:21 UTC
Yes, I'm preparing an environment today. I'll get back with the credentials once it's ready.

Comment 11 Alexander Chuzhoy 2016-07-25 20:18:00 UTC
Checked that the issue doesn't reproduce on clean deployment of 9 + scale out.

Comment 12 Eoghan Glynn 2016-07-26 11:53:44 UTC
(In reply to Alexander Chuzhoy from comment #11)
> Checked that the issue doesn't reproduce on clean deployment of 9 + scale
> out.

@sasha: can you clarify what you mean by "clean deployment of 9"?

Specifically, did you mean a non-upgraded deployment of OSP9?

Comment 14 Alexander Chuzhoy 2016-07-26 13:04:00 UTC
(In reply to Eoghan Glynn from comment #12)
> (In reply to Alexander Chuzhoy from comment #11)
> > Checked that the issue doesn't reproduce on clean deployment of 9 + scale
> > out.
> 
> @sasha: can you clarify what you mean by "clean deployment of 9"?
> 
> Specifically, did you mean a non-upgraded deployment of OSP9?

Exactly.

Comment 16 Marius Cornea 2016-07-27 11:18:36 UTC
It looks like nova-compute on the new node keeps trying to start but fails each time (segfault in librados):

/var/log/messages shows the following:

Jul 27 07:16:54 localhost systemd: Started OpenStack Nova Compute Server.
Jul 27 07:16:54 localhost kernel: nova-compute[18765]: segfault at 0 ip 00007f1510d8217a sp 00007f150ab093b0 error 4 in librados.so.2.0.0[7f1510a54000+504000]
Jul 27 07:16:54 localhost journal: End of file while reading data: Input/output error
Jul 27 07:16:54 localhost systemd: openstack-nova-compute.service: main process exited, code=killed, status=11/SEGV
Jul 27 07:16:54 localhost systemd: Unit openstack-nova-compute.service entered failed state.
Jul 27 07:16:54 localhost systemd: openstack-nova-compute.service failed.
Jul 27 07:16:54 localhost systemd: openstack-nova-compute.service holdoff time over, scheduling restart.
Jul 27 07:16:54 localhost systemd: Starting OpenStack Nova Compute Server...
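
A restart loop like this one can be confirmed by pulling the segfault lines out of /var/log/messages and noting which library they point at. The following is a sketch only: the function name find_segfaults is hypothetical, and the regular expression assumes the kernel message format shown above.

```python
import re

# Matches e.g. "kernel: nova-compute[18765]: segfault at 0 ... in librados.so.2.0.0[...]"
SEGFAULT_RE = re.compile(
    r"kernel: (?P<proc>[\w.-]+)\[(?P<pid>\d+)\]: segfault at .* "
    r"in (?P<lib>[^\[\s]+)\["
)


def find_segfaults(log_text):
    """Return (process, pid, library) for each kernel segfault line."""
    return [
        (m.group("proc"), int(m.group("pid")), m.group("lib"))
        for m in SEGFAULT_RE.finditer(log_text)
    ]


line = (
    "Jul 27 07:16:54 localhost kernel: nova-compute[18765]: segfault at 0 "
    "ip 00007f1510d8217a sp 00007f150ab093b0 error 4 in "
    "librados.so.2.0.0[7f1510a54000+504000]"
)
print(find_segfaults(line))
# prints [('nova-compute', 18765, 'librados.so.2.0.0')]
```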

Comment 18 seb 2016-07-27 14:47:10 UTC
The /etc/ceph/ceph.client.openstack.keyring on the upgraded compute node had the wrong format.
It looked like this:

[client.openstack]
        key = AAAAAAAAAAAAAAAA

Where a valid key is:

[client.openstack]
        key = AQBHOJdXAAAAABAAod6tL8beRSB1IasVq0FywQ==

With the proper key on the compute node, the process starts normally.
So I'm closing this since this is not a bug.
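
The difference between the two keyrings above can be checked mechanically. This is a heuristic sketch, not Ceph's own validation: the function names are hypothetical, and the expected layout (key type 1, two timestamp fields, a length field, then the secret) is an assumption inferred from decoding the valid key shown above.

```python
import base64
import binascii
import struct


def keyring_key(keyring_text, section="client.openstack"):
    """Extract the 'key' value for `section` from keyring file content."""
    current = None
    for line in keyring_text.splitlines():
        line = line.strip()
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            continue
        name, sep, value = line.partition("=")
        if current == section and sep and name.strip() == "key":
            return value.strip()
    return None


def looks_like_cephx_key(key):
    """Heuristic: does `key` decode to the assumed CephX secret layout?"""
    try:
        raw = base64.b64decode(key, validate=True)
    except (binascii.Error, ValueError):
        return False
    if len(raw) <= 12:
        return False  # header only, no secret bytes
    key_type, _sec, _nsec, length = struct.unpack("<HIIH", raw[:12])
    return key_type == 1 and length == len(raw) - 12


bad = "[client.openstack]\n        key = AAAAAAAAAAAAAAAA\n"
good = "[client.openstack]\n        key = AQBHOJdXAAAAABAAod6tL8beRSB1IasVq0FywQ==\n"
print(looks_like_cephx_key(keyring_key(bad)))   # prints False
print(looks_like_cephx_key(keyring_key(good)))  # prints True
```

The keyring is parsed by hand rather than with configparser because the key line is indented, which configparser would treat as a continuation line.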

Comment 19 Marius Cornea 2016-07-27 14:54:28 UTC
I'm reopening it, as we need to figure out how the ceph.client.openstack.keyring got generated for the new node; the way it is generated probably changed between releases.

Comment 27 Marius Cornea 2016-08-05 15:47:58 UTC
After passing an environment file containing:

parameter_defaults:
  CephClientKey: 'AQBFZqRXGHWdIRAAqHuCLQgXHQYHVl3jQSFqWg=='

to the overcloud deploy commands, I was able to scale out with an additional compute node, on which the nova-compute service started OK:

[stack@undercloud ~]$ nova service-list
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:303: SubjectAltNameWarning: Certificate for 172.16.18.25 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:303: SubjectAltNameWarning: Certificate for 172.16.18.25 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
+-----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+
| Id  | Binary           | Host                               | Zone     | Status  | State | Updated_at                 | Disabled Reason |
+-----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+
| 2   | nova-scheduler   | overcloud-controller-0.localdomain | internal | enabled | up    | 2016-08-05T15:47:28.000000 | -               |
| 5   | nova-scheduler   | overcloud-controller-2.localdomain | internal | enabled | up    | 2016-08-05T15:47:27.000000 | -               |
| 8   | nova-scheduler   | overcloud-controller-1.localdomain | internal | enabled | up    | 2016-08-05T15:47:29.000000 | -               |
| 11  | nova-consoleauth | overcloud-controller-0.localdomain | internal | enabled | up    | 2016-08-05T15:47:32.000000 | -               |
| 14  | nova-conductor   | overcloud-controller-1.localdomain | internal | enabled | up    | 2016-08-05T15:47:31.000000 | -               |
| 23  | nova-consoleauth | overcloud-controller-2.localdomain | internal | enabled | up    | 2016-08-05T15:47:33.000000 | -               |
| 26  | nova-consoleauth | overcloud-controller-1.localdomain | internal | enabled | up    | 2016-08-05T15:47:33.000000 | -               |
| 29  | nova-conductor   | overcloud-controller-0.localdomain | internal | enabled | up    | 2016-08-05T15:47:31.000000 | -               |
| 41  | nova-conductor   | overcloud-controller-2.localdomain | internal | enabled | up    | 2016-08-05T15:47:30.000000 | -               |
| 44  | nova-compute     | overcloud-compute-0.localdomain    | nova     | enabled | up    | 2016-08-05T15:47:29.000000 | -               |
| 103 | nova-compute     | overcloud-compute-1.localdomain    | nova     | enabled | up    | 2016-08-05T15:47:29.000000 | -               |
+-----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+

Comment 30 errata-xmlrpc 2016-08-11 11:36:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1599.html