Bug 1638398 - overcloud nodes have incorrect hostname - OC deploy fails on pacemaker/corosync
Summary: overcloud nodes have incorrect hostname - OC deploy fails on pacemaker/corosync
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: diskimage-builder
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 10.0 (Newton)
Assignee: Bob Fournier
QA Contact: Arik Chernetsky
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-10-11 13:18 UTC by Pavel Sedlák
Modified: 2018-11-26 18:01 UTC (History)

Fixed In Version: diskimage-builder-1.26.1-4.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-11-26 18:00:40 UTC
Target Upstream Version:
Embargoed:


Links
System                  ID              Private  Priority  Status  Summary  Last Updated
Red Hat Product Errata  RHBA-2018:3674  0        None      None    None     2018-11-26 18:01:27 UTC

Description Pavel Sedlák 2018-10-11 13:18:28 UTC
TripleO deployment of OSP10 on RHEL 7.6 fails in overcloud.AllNodesDeploySteps.ControllerDeployment_Step1.0
with:
>    Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns: Exec try 360/360
>    Debug: Exec[wait-for-settle](provider=posix): Executing '/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1'
>    Debug: Executing '/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1'
>    Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns: Sleeping for 10.0 seconds between tries
>    Notice: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns: Error: cluster is not currently running on this node
>    Notice: /Stage[main]/Pacemaker::Stonith/Pacemaker::Property[Disable STONITH]/Pcmk_property[property--stonith-enabled]: Dependency Exec[wait-for-settle] has failures: true
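For readers unfamiliar with the retry seen in the log above (360 tries, 10 seconds between tries, so roughly an hour before the step gives up), here is a minimal Python sketch of the same pattern. It is not the Puppet implementation; `wait_for_settle`, `cluster_has_quorum` and their parameters are hypothetical names chosen for illustration, and the check is passed in so it can be replaced in a test.

```python
import subprocess
import time
from typing import Callable


def cluster_has_quorum() -> bool:
    """Rough equivalent of: /sbin/pcs status | grep -q 'partition with quorum'.

    Assumes /sbin/pcs exists on the node; returns False if it cannot be run.
    """
    try:
        result = subprocess.run(
            ["/sbin/pcs", "status"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
    except OSError:
        return False
    # Like the shell pipeline, only the grep-style match decides the result.
    return "partition with quorum" in result.stdout


def wait_for_settle(
    check: Callable[[], bool],
    tries: int = 360,
    try_sleep: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check() up to `tries` times, sleeping `try_sleep` seconds between tries."""
    for attempt in range(1, tries + 1):
        if check():
            return True
        if attempt < tries:
            sleep(try_sleep)
    return False
```

On an affected node, `wait_for_settle(cluster_has_quorum)` would be expected to return False after all tries, since corosync never starts there.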

Pacemaker failed because of corosync:
> Oct 10 16:38:24 host-192-168-24-10 corosync[155641]: Can't read file /etc/corosync/corosync.conf reason = (No such file or directory)

This config file really is missing.

It seems to be missing because the hostname of this node is not controller-0
as expected, but host-192-168-24-10 instead. (The same applies to the other overcloud nodes, e.g. computes.)

Looking in /var/log/messages, the hostname recorded in all entries is
'host-192...'; there are a few entries about the hostname being changed to
controller-0 (from NetworkManager and such).

hostnamectl shows that only the static hostname was changed:
>    Static hostname: controller-0
> Transient hostname: host-192-168-24-10
(After a restart of the OC node, the correct 'controller-0' is used.)
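
The static/transient mismatch shown above can be checked mechanically. The following is a small sketch, assuming the plain-text `hostnamectl` output format quoted in this report; `parse_hostnamectl` and `hostname_mismatch` are hypothetical helper names, not part of any existing tool. hostnamectl only prints the "Transient hostname" line when it differs from the static one, so a missing line is treated as "no mismatch".

```python
def parse_hostnamectl(text: str) -> dict:
    """Turn 'Key: value' lines of hostnamectl output into a dict with lower-case keys."""
    fields = {}
    for line in text.splitlines():
        # Tolerate the '> ' quoting used in bug comments and hostnamectl's indentation.
        stripped = line.lstrip("> \t").rstrip()
        if ":" not in stripped:
            continue
        key, _, value = stripped.partition(":")
        fields[key.strip().lower()] = value.strip()
    return fields


def hostname_mismatch(text: str) -> bool:
    """Return True if the transient hostname differs from the static hostname."""
    fields = parse_hostnamectl(text)
    static = fields.get("static hostname")
    # No 'Transient hostname' line means it equals the static hostname.
    transient = fields.get("transient hostname", static)
    return static != transient
```

Feeding it the two lines quoted above reports a mismatch, which is the state the affected nodes were in before a reboot.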

With OSP10 on RHEL 7.5 (not 7.6) there was no such issue:
controller-0 is used in all entries from the very beginning
of /var/log/messages (so none of the NetworkManager hostname changes or restarts were needed
there). So I would assume that is the expected behaviour.

Logs related to DHCP/cloud-init/... seem a bit different between the 7.5 and 7.6
setups, but it's not clear to me whether they indicate an issue.
(If anything, 7.6 seems to mention changing the hostname more often,
 but the transient hostname never seems to actually change.)

From the undercloud (UC):
> openstack-tripleo-0.0.8-0.3.4de13b3git.el7ost.noarch
> python-tripleoclient-5.4.6-1.el7ost.noarch
> 
> openstack-tripleo-image-elements-5.3.3-1.el7ost.noarch
> rhosp-director-images-ipa-10.0-20181005.1.donotshipel7_6.el7ost.noarch
> rhosp-director-images-10.0-20181005.1.donotshipel7_6.el7ost.noarch
> 
> dnsmasq-2.76-7.el7.x86_64
> openstack-neutron-9.4.1-28.el7ost.noarch
> openstack-neutron-ml2-9.4.1-28.el7ost.noarch
> python-neutron-9.4.1-28.el7ost.noarch
> python-neutron-lib-0.4.0-1.el7ost.noarch
> python-neutronclient-6.0.1-1.el7ost.noarch
> openstack-neutron-openvswitch-9.4.1-28.el7ost.noarch
> openstack-neutron-common-9.4.1-28.el7ost.noarch
> puppet-neutron-9.5.0-5.el7ost.noarch
> python-ironic-lib-2.1.3-2.el7ost.noarch
> puppet-ironic-9.5.0-3.el7ost.noarch
> openstack-ironic-inspector-4.2.2-4.el7ost.noarch
> python-ironic-inspector-client-1.10.0-1.el7ost.noarch
> openstack-ironic-common-6.2.4-3.el7ost.noarch
> openstack-ironic-conductor-6.2.4-3.el7ost.noarch
> openstack-ironic-api-6.2.4-3.el7ost.noarch
> python-ironicclient-1.7.1-1.el7ost.noarch

From an overcloud (OC) node:
> cloud-init-18.2-1.el7.x86_64
> kernel-3.10.0-956.el7.x86_64
> os-collect-config-5.2.1-2.el7ost.noarch
> os-net-config-5.2.3-2.el7ost.noarch
> NetworkManager-1.12.0-6.el7.x86_64
> systemd-219-62.el7.x86_64

Comment 4 Pavel Sedlák 2018-10-11 13:41:28 UTC
On compute-0 (which has the same issue; I've already modified the controller so cannot inspect it anymore), there are a few AVC denials that are possibly related:

> type=AVC msg=audit(1539106775.354:36): avc:  denied  { read } for  pid=3599 comm="NetworkManager" name="dhclient-eth0.pid" dev="tmpfs" ino=30701 scontext=system_u:system_r:NetworkManager_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=file permissive=0
> type=AVC msg=audit(1539106775.361:37): avc:  denied  { read } for  pid=3599 comm="NetworkManager" name="dhclient-eth1.pid" dev="tmpfs" ino=33456 scontext=system_u:system_r:NetworkManager_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=file permissive=0
> type=AVC msg=audit(1539106775.365:38): avc:  denied  { read } for  pid=3599 comm="NetworkManager" name="dhclient-eth2.pid" dev="tmpfs" ino=31671 scontext=system_u:system_r:NetworkManager_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=file permissive=0
> type=AVC msg=audit(1539106775.373:41): avc:  denied  { read } for  pid=3670 comm="dhclient" name="dhclient-eth0.pid" dev="tmpfs" ino=30701 scontext=system_u:system_r:dhcpc_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=file permissive=0
> type=AVC msg=audit(1539106775.373:42): avc:  denied  { read } for  pid=3674 comm="dhclient" name="dhclient-eth1.pid" dev="tmpfs" ino=33456 scontext=system_u:system_r:dhcpc_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=file permissive=0
> type=AVC msg=audit(1539106775.373:43): avc:  denied  { write } for  pid=3674 comm="dhclient" name="dhclient-eth1.pid" dev="tmpfs" ino=33456 scontext=system_u:system_r:dhcpc_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=file permissive=0
> type=AVC msg=audit(1539106775.373:44): avc:  denied  { write } for  pid=3670 comm="dhclient" name="dhclient-eth0.pid" dev="tmpfs" ino=30701 scontext=system_u:system_r:dhcpc_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=file permissive=0
> type=AVC msg=audit(1539106775.380:45): avc:  denied  { read } for  pid=3678 comm="dhclient" name="dhclient-eth2.pid" dev="tmpfs" ino=31671 scontext=system_u:system_r:dhcpc_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=file permissive=0
> type=AVC msg=audit(1539106775.380:46): avc:  denied  { write } for  pid=3678 comm="dhclient" name="dhclient-eth2.pid" dev="tmpfs" ino=31671 scontext=system_u:system_r:dhcpc_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=file permissive=0
> type=AVC msg=audit(1539106775.391:47): avc:  denied  { write } for  pid=3670 comm="dhclient" name="dhclient-eth0.pid" dev="tmpfs" ino=30701 scontext=system_u:system_r:dhcpc_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=file permissive=0
> type=AVC msg=audit(1539106775.391:48): avc:  denied  { write } for  pid=3674 comm="dhclient" name="dhclient-eth1.pid" dev="tmpfs" ino=33456 scontext=system_u:system_r:dhcpc_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=file permissive=0
> type=AVC msg=audit(1539106775.398:49): avc:  denied  { write } for  pid=3678 comm="dhclient" name="dhclient-eth2.pid" dev="tmpfs" ino=31671 scontext=system_u:system_r:dhcpc_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=file permissive=0
> type=AVC msg=audit(1539106775.413:50): avc:  denied  { write } for  pid=3674 comm="dhclient" name="dhclient-eth1.pid" dev="tmpfs" ino=33456 scontext=system_u:system_r:dhcpc_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=file permissive=0
> type=AVC msg=audit(1539106775.416:51): avc:  denied  { write } for  pid=3670 comm="dhclient" name="dhclient-eth0.pid" dev="tmpfs" ino=30701 scontext=system_u:system_r:dhcpc_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=file permissive=0
> type=AVC msg=audit(1539106775.433:52): avc:  denied  { write } for  pid=3678 comm="dhclient" name="dhclient-eth2.pid" dev="tmpfs" ino=31671 scontext=system_u:system_r:dhcpc_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=file permissive=0
> type=AVC msg=audit(1539106792.039:112): avc:  denied  { unlink } for  pid=3599 comm="NetworkManager" name="dhclient-eth2.pid" dev="tmpfs" ino=31671 scontext=system_u:system_r:NetworkManager_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=file permissive=0
> type=AVC msg=audit(1539106792.171:114): avc:  denied  { unlink } for  pid=3599 comm="NetworkManager" name="dhclient-eth1.pid" dev="tmpfs" ino=33456 scontext=system_u:system_r:NetworkManager_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=file permissive=0
> type=AVC msg=audit(1539106792.293:115): avc:  denied  { unlink } for  pid=3599 comm="NetworkManager" name="dhclient-eth0.pid" dev="tmpfs" ino=30701 scontext=system_u:system_r:NetworkManager_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=file permissive=0
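
A long run of near-identical denials like the one above is easier to read when grouped. This is a minimal sketch that counts denials per (process, permission, file name), assuming the audit line layout quoted in this comment; `AVC_RE` and `summarize_denials` are hypothetical names, and the pattern is not a general audit log parser.

```python
import re
from collections import Counter

# Matches the fields used in the denials quoted above:
#   avc:  denied  { read } for  pid=3599 comm="NetworkManager" name="dhclient-eth0.pid"
AVC_RE = re.compile(
    r'avc:\s+denied\s+\{\s*(?P<perms>[^}]+?)\s*\}\s+for\s+'
    r'pid=(?P<pid>\d+)\s+comm="(?P<comm>[^"]+)"\s+name="(?P<name>[^"]+)"'
)


def summarize_denials(lines):
    """Count AVC denials keyed by (comm, permission, file name)."""
    counts = Counter()
    for line in lines:
        match = AVC_RE.search(line)
        if match is None:
            continue  # not an AVC denial in the expected layout
        # A single record may list several permissions inside the braces.
        for perm in match.group("perms").split():
            counts[(match.group("comm"), perm, match.group("name"))] += 1
    return counts
```

Calling `summarize_denials(open("/var/log/audit/audit.log"))` on a node (the path is the usual default and an assumption here) would give one counter entry per process, permission and pid file.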

Comment 5 Pavel Sedlák 2018-10-13 11:54:03 UTC
The SELinux issue may not be related; something similar was already seen once in the past without affecting functionality, as reported in bug 1390011.

Comment 24 Bob Fournier 2018-10-23 16:44:39 UTC
So the recommended fix is to backport https://review.openstack.org/#/c/416664/ to OSP-10, is that correct? 
https://review.openstack.org/#/c/490962/ does not need to be backported?

Has anyone done the backport yet? If not, I will do it.

Comment 25 Martin Schuppert 2018-10-23 18:06:20 UTC
(In reply to Bob Fournier from comment #24)
> So the recommended fix is to backport
> https://review.openstack.org/#/c/416664/ to OSP-10, is that correct? 
> https://review.openstack.org/#/c/490962/ does not need to be backported?

We have only tested with both patches:

diff -urN diskimage-builder-osp10/elements/dhcp-all-interfaces/install.d/dhcp-interface@.service diskimage-builder-osp13/diskimage_builder/elements/dhcp-all-interfaces/install.d/dhcp-interface@.service
--- diskimage-builder-osp10/elements/dhcp-all-interfaces/install.d/dhcp-interface@.service      2018-10-23 14:04:13.351476911 +0200
+++ diskimage-builder-osp13/diskimage_builder/elements/dhcp-all-interfaces/install.d/dhcp-interface@.service    2018-10-23 14:04:34.473701409 +0200
@@ -1,7 +1,15 @@
 [Unit]
 Description=DHCP interface %i
-Before=network-pre.target
-Wants=network-pre.target
+# We want to run after network.target so it doesn't try to bring
+# up the interfaces a second time, but network-online should not
+# be reached until after we've brought up the interfaces.
+# We also need to break the default dependencies which prevents
+# this from operating on Ubuntu/Debian as the unit gets stuck
+# into a cyclical dependency loop.
+DefaultDependencies=no
+After=network.target
+Before=network-online.target
+Wants=network-online.target
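
To compare the ordering directives of the old and new unit without reading the diff by eye, a sketch like the following may help. It uses Python's standard `configparser` as a rough stand-in for systemd's own parser, which is an assumption: it keeps only the last value of a repeated key and does not understand drop-ins. `unit_ordering` is a hypothetical helper name.

```python
import configparser


def unit_ordering(unit_text: str) -> dict:
    """Return the After/Before/Wants lists from the [Unit] section of a unit file."""
    # interpolation=None so specifiers such as %i are left alone;
    # strict=False so a repeated key does not raise (the last value wins).
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # systemd key names are case-sensitive
    parser.read_string(unit_text)
    unit = parser["Unit"] if parser.has_section("Unit") else {}
    return {key: unit.get(key, "").split() for key in ("After", "Before", "Wants")}
```

Applied to the two versions in the diff, the old unit yields `Before` and `Wants` of network-pre.target with an empty `After`, while the new one yields `After` network.target with `Before` and `Wants` network-online.target.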

Comment 26 Bob Fournier 2018-10-23 18:28:55 UTC
OK. As they are patches in the same area of code, I will backport the first (https://review.openstack.org/#/c/416664/) and then the second (https://review.openstack.org/#/c/490962/).

Comment 35 Bob Fournier 2018-10-25 20:43:18 UTC
In /var/log/messages on controller-0 and compute-0 it looks like the same issue, with hostnames derived from the IP, e.g. on the controller:
Oct 25 14:57:24 host-192-168-24-18 kernel: Initializing cgroup subsys cpuset
Oct 25 14:57:24 host-192-168-24-18 kernel: Initializing cgroup subsys cpu
Oct 25 14:57:24 host-192-168-24-18 kernel: Initializing cgroup subsys cpuacct

and on compute:
Oct 25 14:57:23 host-192-168-24-11 kernel: Initializing cgroup subsys cpuset
Oct 25 14:57:23 host-192-168-24-11 kernel: Initializing cgroup subsys cpu
Oct 25 14:57:23 host-192-168-24-11 kernel: Initializing cgroup subsys cpuacct
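
Checking which hostnames a node logged under can be scripted. This is a minimal sketch, assuming the traditional syslog line layout shown above (month, day, time, hostname, tag); the function names and the `host-<IP with dashes>` pattern are assumptions taken from the examples in this report.

```python
import re
from collections import Counter

# e.g. "Oct 25 14:57:24 host-192-168-24-18 kernel: Initializing cgroup subsys cpuset"
SYSLOG_RE = re.compile(
    r"^[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+(?P<host>\S+)\s"
)
# Hostnames of the form seen in this bug, e.g. host-192-168-24-18.
IP_HOST_RE = re.compile(r"^host-\d{1,3}(?:-\d{1,3}){3}$")


def hostnames_in_log(lines):
    """Count how many log lines were recorded under each hostname."""
    counts = Counter()
    for line in lines:
        match = SYSLOG_RE.match(line)
        if match:
            counts[match.group("host")] += 1
    return counts


def ip_derived_hostnames(lines):
    """Return the sorted hostnames that look like host-<IP with dashes>."""
    return sorted(h for h in hostnames_in_log(lines) if IP_HOST_RE.match(h))
```

A non-empty result from `ip_derived_hostnames(open("/var/log/messages"))` on an overcloud node would indicate that the node logged under an IP-style hostname at some point.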

Comment 42 errata-xmlrpc 2018-11-26 18:00:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3674

