Description of problem:
On a deployment that was upgraded from OSP8 to OSP9, after scaling out with an additional compute node the nova-compute service shows as down on the newly added node.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-2.0.0-14.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Do the initial deployment:
source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates \
  -e $THT/environments/network-isolation.yaml \
  -e $THT/environments/network-management.yaml \
  -e ~/templates/network-environment.yaml \
  -e $THT/environments/storage-environment.yaml \
  -e ~/templates/disk-layout.yaml \
  -e ~/templates/wipe-disk-env.yaml \
  -e ~/templates/enable-tls.yaml \
  -e ~/templates/inject-trust-anchor.yaml \
  --control-scale 3 \
  --control-flavor controller \
  --compute-scale 1 \
  --compute-flavor compute \
  --ceph-storage-scale 3 \
  --ceph-storage-flavor ceph \
  --ntp-server clock.redhat.com \
  --libvirt-type qemu

2. Upgrade the undercloud:
sudo yum update -y
openstack undercloud upgrade

3. Update the images:
openstack overcloud image upload --update-existing
openstack baremetal configure boot

4. Add the OSP8 repos on the overcloud nodes.

5. major-upgrade-aodh.yaml:
source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates \
  -e $THT/environments/network-isolation.yaml \
  -e $THT/environments/network-management.yaml \
  -e ~/templates/network-environment.yaml \
  -e $THT/environments/storage-environment.yaml \
  -e ~/templates/disk-layout.yaml \
  -e ~/templates/wipe-disk-env.yaml \
  -e ~/templates/enable-tls.yaml \
  -e ~/templates/inject-trust-anchor.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-aodh.yaml \
  --control-scale 3 \
  --control-flavor controller \
  --compute-scale 1 \
  --compute-flavor compute \
  --ceph-storage-scale 3 \
  --ceph-storage-flavor ceph \
  --ntp-server clock.redhat.com \
  --libvirt-type qemu

6. major-upgrade-keystone-liberty-mitaka.yaml:
source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates \
  -e $THT/environments/network-isolation.yaml \
  -e $THT/environments/network-management.yaml \
  -e ~/templates/network-environment.yaml \
  -e $THT/environments/storage-environment.yaml \
  -e ~/templates/disk-layout.yaml \
  -e ~/templates/wipe-disk-env.yaml \
  -e ~/templates/enable-tls.yaml \
  -e ~/templates/inject-trust-anchor.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-keystone-liberty-mitaka.yaml \
  --control-scale 3 \
  --control-flavor controller \
  --compute-scale 1 \
  --compute-flavor compute \
  --ceph-storage-scale 3 \
  --ceph-storage-flavor ceph \
  --ntp-server clock.redhat.com \
  --libvirt-type qemu

7. Add the OSP9 repos on the overcloud nodes.

8. major-upgrade-pacemaker-init.yaml:
source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates \
  -e $THT/environments/network-isolation.yaml \
  -e $THT/environments/network-management.yaml \
  -e ~/templates/network-environment.yaml \
  -e $THT/environments/storage-environment.yaml \
  -e ~/templates/disk-layout.yaml \
  -e ~/templates/wipe-disk-env.yaml \
  -e ~/templates/enable-tls.yaml \
  -e ~/templates/inject-trust-anchor.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker-init.yaml \
  --control-scale 3 \
  --control-flavor controller \
  --compute-scale 1 \
  --compute-flavor compute \
  --ceph-storage-scale 3 \
  --ceph-storage-flavor ceph \
  --ntp-server clock.redhat.com \
  --libvirt-type qemu

9. Update os-collect-config and resource-agents on the overcloud nodes.

10. major-upgrade-pacemaker.yaml:
source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates \
  -e $THT/environments/network-isolation.yaml \
  -e $THT/environments/network-management.yaml \
  -e ~/templates/network-environment.yaml \
  -e $THT/environments/storage-environment.yaml \
  -e ~/templates/disk-layout.yaml \
  -e ~/templates/wipe-disk-env.yaml \
  -e ~/templates/enable-tls.yaml \
  -e ~/templates/inject-trust-anchor.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker.yaml \
  --control-scale 3 \
  --control-flavor controller \
  --compute-scale 1 \
  --compute-flavor compute \
  --ceph-storage-scale 3 \
  --ceph-storage-flavor ceph \
  --ntp-server clock.redhat.com \
  --libvirt-type qemu

11. Start rabbitmq on controller-1 and controller-2:
systemctl start rabbitmq-server.service
pcs resource cleanup

12. upgrade-non-controller.sh --upgrade overcloud-novacompute-0

13. upgrade-non-controller.sh --upgrade overcloud-cephstorage-0
upgrade-non-controller.sh --upgrade overcloud-cephstorage-1
upgrade-non-controller.sh --upgrade overcloud-cephstorage-2

14. Converge:
source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates \
  -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml \
  -e $THT/environments/network-isolation.yaml \
  -e $THT/environments/network-management.yaml \
  -e ~/templates/network-environment.yaml \
  -e $THT/environments/storage-environment.yaml \
  -e ~/templates/disk-layout.yaml \
  -e ~/templates/wipe-disk-env.yaml \
  -e ~/templates/enable-tls.yaml \
  -e ~/templates/inject-trust-anchor.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker-converge.yaml \
  --control-scale 3 \
  --control-flavor controller \
  --compute-scale 1 \
  --compute-flavor compute \
  --ceph-storage-scale 3 \
  --ceph-storage-flavor ceph \
  --ntp-server clock.redhat.com \
  --libvirt-type qemu

15. Add an additional compute node:
source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates \
  -e $THT/environments/network-isolation.yaml \
  -e $THT/environments/network-management.yaml \
  -e ~/templates/network-environment.yaml \
  -e $THT/environments/storage-environment.yaml \
  -e ~/templates/disk-layout.yaml \
  -e ~/templates/wipe-disk-env.yaml \
  -e ~/templates/enable-tls.yaml \
  -e ~/templates/inject-trust-anchor.yaml \
  --control-scale 3 \
  --control-flavor controller \
  --compute-scale 2 \
  --compute-flavor compute \
  --ceph-storage-scale 3 \
  --ceph-storage-flavor ceph \
  --ntp-server clock.redhat.com \
  --libvirt-type qemu

Actual results:
[stack@undercloud ~]$ . overcloudrc
[stack@undercloud ~]$ nova service-list
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:303: SubjectAltNameWarning: Certificate for 172.16.18.25 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:303: SubjectAltNameWarning: Certificate for 172.16.18.25 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
+----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+
| Id | Binary           | Host                               | Zone     | Status  | State | Updated_at                 | Disabled Reason |
+----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+
| 3  | nova-scheduler   | overcloud-controller-0.localdomain | internal | enabled | up    | 2016-07-13T12:19:10.000000 | -               |
| 6  | nova-scheduler   | overcloud-controller-2.localdomain | internal | enabled | up    | 2016-07-13T12:19:09.000000 | -               |
| 9  | nova-scheduler   | overcloud-controller-1.localdomain | internal | enabled | up    | 2016-07-13T12:19:09.000000 | -               |
| 12 | nova-consoleauth | overcloud-controller-0.localdomain | internal | enabled | up    | 2016-07-13T12:19:08.000000 | -               |
| 15 | nova-consoleauth | overcloud-controller-1.localdomain | internal | enabled | up    | 2016-07-13T12:19:08.000000 | -               |
| 18 | nova-consoleauth | overcloud-controller-2.localdomain | internal | enabled | up    | 2016-07-13T12:19:07.000000 | -               |
| 21 | nova-conductor   | overcloud-controller-1.localdomain | internal | enabled | up    | 2016-07-13T12:19:09.000000 | -               |
| 27 | nova-compute     | overcloud-compute-0.localdomain    | nova     | enabled | up    | 2016-07-13T12:19:06.000000 | -               |
| 30 | nova-conductor   | overcloud-controller-2.localdomain | internal | enabled | up    | 2016-07-13T12:19:06.000000 | -               |
| 33 | nova-conductor   | overcloud-controller-0.localdomain | internal | enabled | up    | 2016-07-13T12:19:09.000000 | -               |
| 60 | nova-compute     | overcloud-compute-1.localdomain    | nova     | enabled | down  | -                          | -               |
+----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+

Expected results:
nova-compute | overcloud-compute-1.localdomain | nova | enabled | up

Additional info:
On the new compute node:

[root@overcloud-compute-1 heat-admin]# systemctl status openstack-nova-compute
● openstack-nova-compute.service - OpenStack Nova Compute Server
   Loaded: loaded (/usr/lib/systemd/system/openstack-nova-compute.service; enabled; vendor preset: disabled)
   Active: activating (start) since Wed 2016-07-13 12:20:07 UTC; 2s ago
 Main PID: 14276 (nova-compute)
   CGroup: /system.slice/openstack-nova-compute.service
           └─14276 /usr/bin/python2 /usr/bin/nova-compute

Jul 13 12:20:07 overcloud-compute-1.localdomain systemd[1]: Starting OpenStack Nova Compute Server...

/var/log/nova/nova-compute.log keeps showing this error:
ERROR nova.compute.manager [req-a57e050b-0eeb-47ad-b405-019eda8c772a - - - - -] No compute node record for host overcloud-compute-1.localdomain
WARNING nova.compute.monitors [req-a57e050b-0eeb-47ad-b405-019eda8c772a - - - - -] Excluding nova.compute.monitors.cpu monitor virt_driver. Not in the list of enabled monitors (CONF.compute_monitors).
INFO nova.compute.resource_tracker [req-a57e050b-0eeb-47ad-b405-019eda8c772a - - - - -] Auditing locally available compute resources for node overcloud-compute-1.localdomain
So, how was the health state of the underlying Ceph storage when the problem was reproduced? Comment 3 suggests that this could be the underlying issue. Is this more than a gut feeling?
During my tests, before starting the upgrade process I successfully launched an instance on the OSP8 overcloud, and it is still accessible post-upgrade, so I believe the Ceph storage isn't the issue here. Unfortunately the environment is currently blocked for another BZ investigation, but once it's available I can use it to reproduce the issue if the sosreports don't help.
Any update here?
Not from my side.
Are you still going to reproduce this again in your environment?
Yes, I'm preparing an environment today. I'll get back with the credentials once it's ready.
Checked that the issue doesn't reproduce on a clean deployment of 9 + scale out.
(In reply to Alexander Chuzhoy from comment #11)
> Checked that the issue doesn't reproduce on a clean deployment of 9 + scale
> out.

@sasha: can you clarify what you mean by "clean deployment of 9"?

Specifically, did you mean a non-upgraded deployment of OSP9?
(In reply to Eoghan Glynn from comment #12)
> (In reply to Alexander Chuzhoy from comment #11)
> > Checked that the issue doesn't reproduce on a clean deployment of 9 + scale
> > out.
>
> @sasha: can you clarify what you mean by "clean deployment of 9"?
>
> Specifically, did you mean a non-upgraded deployment of OSP9?

Exactly.
It looks like nova-compute on the new node keeps trying to start but fails (segfault in librados). /var/log/messages shows the following:

Jul 27 07:16:54 localhost systemd: Started OpenStack Nova Compute Server.
Jul 27 07:16:54 localhost kernel: nova-compute[18765]: segfault at 0 ip 00007f1510d8217a sp 00007f150ab093b0 error 4 in librados.so.2.0.0[7f1510a54000+504000]
Jul 27 07:16:54 localhost journal: End of file while reading data: Input/output error
Jul 27 07:16:54 localhost systemd: openstack-nova-compute.service: main process exited, code=killed, status=11/SEGV
Jul 27 07:16:54 localhost systemd: Unit openstack-nova-compute.service entered failed state.
Jul 27 07:16:54 localhost systemd: openstack-nova-compute.service failed.
Jul 27 07:16:54 localhost systemd: openstack-nova-compute.service holdoff time over, scheduling restart.
Jul 27 07:16:54 localhost systemd: Starting OpenStack Nova Compute Server...
The /etc/ceph/ceph.client.openstack.keyring on the upgraded compute node had the wrong format. It was like so:

[client.openstack]
        key = AAAAAAAAAAAAAAAA

whereas a valid keyring looks like:

[client.openstack]
        key = AQBHOJdXAAAAABAAod6tL8beRSB1IasVq0FywQ==

After putting the proper key on the compute node, the process starts normally. So I'm closing this, since this is not a bug.
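For what it's worth, the bad key above can be caught mechanically: a CephX key is base64-encoded binary data. A minimal sanity-check sketch (illustration only, not part of any tooling; the 28-byte figure is what the valid key from this report decodes to, consistent with my understanding of the CryptoKey encoding):

```python
import base64

def looks_like_cephx_key(key):
    """Heuristic check that a string could be a CephX secret.

    Strict base64 decoding alone is not enough: the broken key from
    this report ("AAAAAAAAAAAAAAAA") is valid base64 but decodes to
    only 12 bytes. A real CephX key (2-byte type + 8-byte creation
    time + 2-byte length + 16-byte AES secret) decodes to 28 bytes.
    """
    try:
        raw = base64.b64decode(key, validate=True)
    except (ValueError, TypeError):
        return False
    return len(raw) == 28

print(looks_like_cephx_key("AAAAAAAAAAAAAAAA"))                          # False
print(looks_like_cephx_key("AQBHOJdXAAAAABAAod6tL8beRSB1IasVq0FywQ=="))  # True
```

Running such a check against /etc/ceph/ceph.client.openstack.keyring on a freshly scaled-out node would have flagged the problem before nova-compute ever segfaulted in librados.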
I'm reopening it, as we need to figure out how the ceph.client.openstack.keyring got generated for the new node; probably it changed between releases.
After passing an environment file containing:

parameter_defaults:
  CephClientKey: 'AQBFZqRXGHWdIRAAqHuCLQgXHQYHVl3jQSFqWg=='

to the overcloud deploy command, I was able to scale out with an additional compute node on which the nova-compute service started OK:

[stack@undercloud ~]$ nova service-list
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:303: SubjectAltNameWarning: Certificate for 172.16.18.25 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:303: SubjectAltNameWarning: Certificate for 172.16.18.25 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
+-----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+
| Id  | Binary           | Host                               | Zone     | Status  | State | Updated_at                 | Disabled Reason |
+-----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+
| 2   | nova-scheduler   | overcloud-controller-0.localdomain | internal | enabled | up    | 2016-08-05T15:47:28.000000 | -               |
| 5   | nova-scheduler   | overcloud-controller-2.localdomain | internal | enabled | up    | 2016-08-05T15:47:27.000000 | -               |
| 8   | nova-scheduler   | overcloud-controller-1.localdomain | internal | enabled | up    | 2016-08-05T15:47:29.000000 | -               |
| 11  | nova-consoleauth | overcloud-controller-0.localdomain | internal | enabled | up    | 2016-08-05T15:47:32.000000 | -               |
| 14  | nova-conductor   | overcloud-controller-1.localdomain | internal | enabled | up    | 2016-08-05T15:47:31.000000 | -               |
| 23  | nova-consoleauth | overcloud-controller-2.localdomain | internal | enabled | up    | 2016-08-05T15:47:33.000000 | -               |
| 26  | nova-consoleauth | overcloud-controller-1.localdomain | internal | enabled | up    | 2016-08-05T15:47:33.000000 | -               |
| 29  | nova-conductor   | overcloud-controller-0.localdomain | internal | enabled | up    | 2016-08-05T15:47:31.000000 | -               |
| 41  | nova-conductor   | overcloud-controller-2.localdomain | internal | enabled | up    | 2016-08-05T15:47:30.000000 | -               |
| 44  | nova-compute     | overcloud-compute-0.localdomain    | nova     | enabled | up    | 2016-08-05T15:47:29.000000 | -               |
| 103 | nova-compute     | overcloud-compute-1.localdomain    | nova     | enabled | up    | 2016-08-05T15:47:29.000000 | -               |
+-----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+
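The workaround can be scripted rather than done by hand. A minimal sketch (the file name ceph-client-key.yaml is hypothetical; the sample key is the valid-format example from earlier in this report, and on a real deployment the keyring path would be /etc/ceph/ceph.client.openstack.keyring on an already-working node):

```shell
# Build a demo keyring in the valid format (sample key from this report).
keyring=$(mktemp)
cat > "$keyring" <<'EOF'
[client.openstack]
        key = AQBHOJdXAAAAABAAod6tL8beRSB1IasVq0FywQ==
EOF

# On a real system: keyring=/etc/ceph/ceph.client.openstack.keyring
# Extract the key (third field of the "key = <value>" line).
key=$(awk '$1 == "key" {print $3}' "$keyring")

# Write the environment file that pins CephClientKey, so that newly
# scaled-out nodes receive the same key as the upgraded nodes.
cat > ceph-client-key.yaml <<EOF
parameter_defaults:
  CephClientKey: '${key}'
EOF

cat ceph-client-key.yaml
```

The resulting file would then be passed to the scale-out deploy command with an extra `-e ceph-client-key.yaml`.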
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-1599.html