Bug 1536753
Summary: live migration fails when hostnames are configured with "_" (underscore) due to inconsistent naming in /etc/hosts and /etc/ssh/ssh_known_hosts
Product: Red Hat OpenStack
Component: openstack-tripleo-heat-templates
Version: 10.0 (Newton)
Status: CLOSED WONTFIX
Severity: high
Priority: high
Reporter: Andreas Karis <akaris>
Assignee: Emilien Macchi <emacchi>
QA Contact: Gurenko Alex <agurenko>
CC: akaris, aschultz, mburns, mschuppe, owalsh, rhel-osp-director-maint
Target Milestone: Upstream M1
Keywords: Triaged, ZStream
Hardware: All
OS: All
Last Closed: 2018-07-23 15:20:08 UTC
Type: Bug
Description
Andreas Karis
2018-01-20 17:32:54 UTC
Hello,

I applied the following:
~~~
ComputeHostnameFormat: '%stackname%-compute-v1-%index%'
~~~

Then, I reran `openstack overcloud deploy`. This leads to:
~~~
[root@overcloud-compute-v1-0 ~]# grep '_v1' !$
grep '_v1' /etc/ssh/ssh_known_hosts
[root@overcloud-compute-v1-0 ~]# grep '_v1' /etc/hosts
[root@overcloud-compute-v1-0 ~]#
~~~

And to a hostname change:
~~~
2018-01-20 18:07:34.066 59047 ERROR nova.virt.libvirt.host [req-57ec6d8c-80fc-414e-8a6f-96e57d593499 - - - - -] Hostname has changed from overcloud-compute-v1-1 to overcloud-compute-v1-1.localdomain. A restart is required to take effect.
~~~

After a restart on all computes:
~~~
[root@overcloud-compute-v1-1 ~]# systemctl restart openstack-nova-compute
[root@overcloud-compute-v1-1 ~]#
~~~

This will lead to another rename of compute services (note the .localdomain):
~~~
[stack@undercloud-7 ~]$ nova service-list
+----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+
| Id | Binary           | Host                               | Zone     | Status  | State | Updated_at                 | Disabled Reason |
+----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+
| 3  | nova-consoleauth | overcloud-controller-0.localdomain | internal | enabled | up    | 2018-01-20T19:00:40.000000 | -               |
| 4  | nova-scheduler   | overcloud-controller-0.localdomain | internal | enabled | up    | 2018-01-20T19:00:37.000000 | -               |
| 5  | nova-conductor   | overcloud-controller-0.localdomain | internal | enabled | up    | 2018-01-20T19:00:34.000000 | -               |
| 6  | nova-compute     | overcloud-compute-v1-0             | nova     | enabled | down  | 2018-01-20T18:07:31.000000 | -               |
| 7  | nova-compute     | overcloud-compute-v1-1             | nova     | enabled | down  | 2018-01-20T18:07:37.000000 | -               |
| 8  | nova-compute     | overcloud-compute-v1-0.localdomain | nova     | enabled | down  | 2018-01-20T18:58:45.000000 | -               |
| 9  | nova-compute     | overcloud-compute-v1-1.localdomain | nova     | enabled | up    | 2018-01-20T19:00:39.000000 | -               |
+----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+
~~~

Live migration then fails, because a) in my env, v1-0 is down. But more importantly, the rename to .localdomain messed up other things:
~~~
/var/log/nova/nova-conductor.log:2018-01-20 19:01:31.363 92122 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/nova/conductor/tasks/live_migrate.py", line 49, in _execute
/var/log/nova/nova-conductor.log:2018-01-20 19:01:31.363 92122 ERROR oslo_messaging.rpc.server     self._check_host_is_up(self.source)
/var/log/nova/nova-conductor.log:2018-01-20 19:01:31.363 92122 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/nova/conductor/tasks/live_migrate.py", line 89, in _check_host_is_up
/var/log/nova/nova-conductor.log:2018-01-20 19:01:31.363 92122 ERROR oslo_messaging.rpc.server     raise exception.ComputeServiceUnavailable(host=host)
/var/log/nova/nova-conductor.log:2018-01-20 19:01:31.363 92122 ERROR oslo_messaging.rpc.server ComputeServiceUnavailable: Compute service of overcloud-compute-v1-1 is unavailable at this time.
/var/log/nova/nova-conductor.log:2018-01-20 19:01:31.363 92122 ERROR oslo_messaging.rpc.server
~~~

- Andreas

Correction: it's not authorized_keys, it's /etc/ssh/ssh_known_hosts

The hostname has not been set (by cloud-init) because underscore is not a valid hostname character (see https://tools.ietf.org/html/rfc952). The hostname command will not accept this as a hostname, e.g.:

[root@undercloud stack]# hostname foo_
hostname: the specified hostname is invalid

Everything else has been configured expecting this invalid hostname to have been set. SSH public/private key authentication failing is the most obvious side effect of this, but there are likely to be other issues, e.g. the hostname being reported by nova-compute is also incorrect.
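The hostname rule mentioned above can be checked before deployment. The following is only a minimal sketch, assuming the RFC 952 character set as relaxed by RFC 1123 (letters, digits and hyphens per label, no leading or trailing hyphen). The function name `is_valid_hostname` is made up for illustration and is not part of cloud-init, nova or the `hostname` command.

```python
import re

# One dot-separated label: letters, digits and hyphens only, 1-63 characters,
# not starting or ending with a hyphen. Underscore is deliberately excluded.
_LABEL_RE = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def is_valid_hostname(name: str) -> bool:
    """Return True if every label of `name` matches the rule above.

    Hypothetical helper for illustration; it does not reproduce the exact
    checks performed by the `hostname` command or by cloud-init.
    """
    if not name or len(name) > 253:
        return False
    return all(_LABEL_RE.fullmatch(label) is not None for label in name.split("."))


if __name__ == "__main__":
    # "foo_" is the example rejected by the hostname command above.
    for candidate in ("foo_", "overcloud-compute-v1-1.localdomain"):
        print(candidate, is_valid_hostname(candidate))
```

A check like this would reject `foo_` while accepting the hyphenated names that appear in the `nova service-list` output.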
> oslo_messaging.rpc.server ComputeServiceUnavailable: Compute service of overcloud-compute-v1-1 is unavailable at this time.

What is the expectation here? That looks correct. It's now overcloud-compute-v1-1.localdomain. The overcloud-compute-v1-0 and overcloud-compute-v1-1 services should be deleted.

Hi,

The hostname change does not work. Check the .localdomain. With the wrong settings, we don't get the .localdomain suffix:
~~~
| 3  | nova-consoleauth | overcloud-controller-0.localdomain | internal | enabled | up    | 2018-01-20T19:00:40.000000 | -               |
| 4  | nova-scheduler   | overcloud-controller-0.localdomain | internal | enabled | up    | 2018-01-20T19:00:37.000000 | -               |
| 5  | nova-conductor   | overcloud-controller-0.localdomain | internal | enabled | up    | 2018-01-20T19:00:34.000000 | -               |
| 6  | nova-compute     | overcloud-compute-v1-0             | nova     | enabled | down  | 2018-01-20T18:07:31.000000 | -               |
| 7  | nova-compute     | overcloud-compute-v1-1             | nova     | enabled | down  | 2018-01-20T18:07:37.000000 | -               |
| 8  | nova-compute     | overcloud-compute-v1-0.localdomain | nova     | enabled | down  | 2018-01-20T18:58:45.000000 | -               |
| 9  | nova-compute     | overcloud-compute-v1-1.localdomain | nova     | enabled |
~~~
~~~
/var/log/nova/nova-conductor.log:2018-01-20 19:01:31.363 92122 ERROR oslo_messaging.rpc.server ComputeServiceUnavailable: Compute service of overcloud-compute-v1-1 is unavailable at this time.
~~~

The database does not contain entries for `overcloud-compute-v1-1.localdomain`, but only for `overcloud-compute-v1-1`. That means that even after a rename via Director, due to this issue, the instances cannot be migrated off unless someone goes into the database and fixes this manually.

Actually, I'm not asking for a mitigation here. I'm asking that we do not let customers set flavor names or ComputeHostnameFormat that contain "_". Or, alternatively, that we correctly convert all of them from "_" to "-".
Overall, this is a product bug: we accept invalid input in our templates and/or do not convert "_" to "-" everywhere we should.

We added a validation in OSP11 to prevent the use of underscores in stack names, which is where this originally snuck in. I believe we have a validation in place for FFU as well. The RHEL documentation has some additional details on valid hostnames:
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/networking_guide/ch-configure_host_names#sec_Understanding_Host_Names

At this point I'm not sure there's much to do in 10 without possibly breaking existing deployments. If a user has an existing stack deployed, they'll probably need to update the role hostname format to not have a '_' in it and update the node if possible.
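For anyone wanting to check a role hostname format before redeploying, a rough sketch follows. It assumes only the `%stackname%` and `%index%` placeholders seen in this report. The helper names and the underscore-containing sample format are hypothetical, and this is not the validation that was added in OSP11.

```python
from typing import List


def expand_hostname_format(fmt: str, stackname: str, index: int) -> str:
    """Substitute the %stackname% and %index% placeholders (illustrative only)."""
    return fmt.replace("%stackname%", stackname).replace("%index%", str(index))


def check_hostname_format(fmt: str, stackname: str = "overcloud") -> List[str]:
    """Return a list of problems; an empty list means no underscore was found.

    Hypothetical helper: it only looks for '_' in one expanded sample name
    and suggests the same format with '_' replaced by '-'.
    """
    problems = []
    sample = expand_hostname_format(fmt, stackname, 0)
    if "_" in sample:
        suggestion = fmt.replace("_", "-")
        problems.append(f"{sample!r} contains '_'; consider {suggestion!r} instead")
    return problems


if __name__ == "__main__":
    # Hypothetical format containing an underscore, used only as sample input.
    for problem in check_hostname_format("%stackname%-compute_v1-%index%"):
        print(problem)
```

Note that an underscore in the stack name itself would also be flagged, since the check runs on the expanded sample rather than on the raw format string.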