A TripleO deployment of OSP10 on rhel-7.6 fails in overcloud.AllNodesDeploySteps.ControllerDeployment_Step1.0 on:

> Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns: Exec try 360/360
> Debug: Exec[wait-for-settle](provider=posix): Executing '/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1'
> Debug: Executing '/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1'
> Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns: Sleeping for 10.0 seconds between tries
> Notice: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns: Error: cluster is not currently running on this node
> Notice: /Stage[main]/Pacemaker::Stonith/Pacemaker::Property[Disable STONITH]/Pcmk_property[property--stonith-enabled]: Dependency Exec[wait-for-settle] has failures: true

Pacemaker failed because of corosync:

> Oct 10 16:38:24 host-192-168-24-10 corosync[155641]: Can't read file /etc/corosync/corosync.conf reason = (No such file or directory)

This config file really is missing. It seems to be missing because the hostname of this node is not the expected controller-0 but host-192-168-24-10 instead. (The same applies to the other overcloud nodes, e.g. computes.)

Looking in /var/log/messages, the hostname recorded in all entries is this 'host-192...'; there are a few entries about the hostname being changed to controller-0 (from NetworkManager and such). hostnamectl shows that only the static hostname was changed:

> Static hostname: controller-0
> Transient hostname: host-192-168-24-10

(After a restart of the overcloud node, the correct 'controller-0' gets used.)

With OSP10 on rhel-7.5 (not 7.6) there was no such issue, and controller-0 seems to be used in all entries from the very beginning of /var/log/messages (so neither the NetworkManager hostname changes nor a restart were needed there). So I would assume that is the expected behaviour.

Logs related to dhcp/cloud-init/... seem a bit different between the 7.5 and 7.6 setups, but it's not clear to me whether they indicate any issue. (If anything, 7.6 seems to mention changing the hostname more times, but the transient hostname never really seems to be affected.)

From the undercloud (UC):

> openstack-tripleo-0.0.8-0.3.4de13b3git.el7ost.noarch
> python-tripleoclient-5.4.6-1.el7ost.noarch
>
> openstack-tripleo-image-elements-5.3.3-1.el7ost.noarch
> rhosp-director-images-ipa-10.0-20181005.1.donotshipel7_6.el7ost.noarch
> rhosp-director-images-10.0-20181005.1.donotshipel7_6.el7ost.noarch
>
> dnsmasq-2.76-7.el7.x86_64
> openstack-neutron-9.4.1-28.el7ost.noarch
> openstack-neutron-ml2-9.4.1-28.el7ost.noarch
> python-neutron-9.4.1-28.el7ost.noarch
> python-neutron-lib-0.4.0-1.el7ost.noarch
> python-neutronclient-6.0.1-1.el7ost.noarch
> openstack-neutron-openvswitch-9.4.1-28.el7ost.noarch
> openstack-neutron-common-9.4.1-28.el7ost.noarch
> puppet-neutron-9.5.0-5.el7ost.noarch
> python-ironic-lib-2.1.3-2.el7ost.noarch
> puppet-ironic-9.5.0-3.el7ost.noarch
> openstack-ironic-inspector-4.2.2-4.el7ost.noarch
> python-ironic-inspector-client-1.10.0-1.el7ost.noarch
> openstack-ironic-common-6.2.4-3.el7ost.noarch
> openstack-ironic-conductor-6.2.4-3.el7ost.noarch
> openstack-ironic-api-6.2.4-3.el7ost.noarch
> python-ironicclient-1.7.1-1.el7ost.noarch

From an overcloud (OC) node:

> cloud-init-18.2-1.el7.x86_64
> kernel-3.10.0-956.el7.x86_64
> os-collect-config-5.2.1-2.el7ost.noarch
> os-net-config-5.2.3-2.el7ost.noarch
> NetworkManager-1.12.0-6.el7.x86_64
> systemd-219-62.el7.x86_64
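
For anyone who wants to check other nodes for the same static/transient mismatch, here is a minimal sketch. It assumes the `hostnamectl` output has the "Static hostname:" / "Transient hostname:" layout quoted in this report; the function names are hypothetical and not part of any TripleO tooling.

```python
def parse_hostnamectl(text):
    """Return a dict of the 'Key: value' fields found in hostnamectl output."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip().lower()] = value.strip()
    return fields


def hostname_mismatch(text):
    """True when a transient hostname is reported and differs from the static one."""
    fields = parse_hostnamectl(text)
    static = fields.get("static hostname")
    transient = fields.get("transient hostname")
    return transient is not None and transient != static


# Sample taken from the hostnamectl excerpt in this report.
SAMPLE = """\
   Static hostname: controller-0
Transient hostname: host-192-168-24-10
"""

if __name__ == "__main__":
    print(hostname_mismatch(SAMPLE))
```

On a live node the text could be fed in from `subprocess.run(["hostnamectl"], capture_output=True, text=True).stdout`.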
On compute-0 (which has the same issue; I've already modified the controller, so I cannot inspect it anymore), there are a few AVC denials that are possibly related:

> type=AVC msg=audit(1539106775.354:36): avc: denied { read } for pid=3599 comm="NetworkManager" name="dhclient-eth0.pid" dev="tmpfs" ino=30701 scontext=system_u:system_r:NetworkManager_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=file permissive=0
> type=AVC msg=audit(1539106775.361:37): avc: denied { read } for pid=3599 comm="NetworkManager" name="dhclient-eth1.pid" dev="tmpfs" ino=33456 scontext=system_u:system_r:NetworkManager_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=file permissive=0
> type=AVC msg=audit(1539106775.365:38): avc: denied { read } for pid=3599 comm="NetworkManager" name="dhclient-eth2.pid" dev="tmpfs" ino=31671 scontext=system_u:system_r:NetworkManager_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=file permissive=0
> type=AVC msg=audit(1539106775.373:41): avc: denied { read } for pid=3670 comm="dhclient" name="dhclient-eth0.pid" dev="tmpfs" ino=30701 scontext=system_u:system_r:dhcpc_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=file permissive=0
> type=AVC msg=audit(1539106775.373:42): avc: denied { read } for pid=3674 comm="dhclient" name="dhclient-eth1.pid" dev="tmpfs" ino=33456 scontext=system_u:system_r:dhcpc_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=file permissive=0
> type=AVC msg=audit(1539106775.373:43): avc: denied { write } for pid=3674 comm="dhclient" name="dhclient-eth1.pid" dev="tmpfs" ino=33456 scontext=system_u:system_r:dhcpc_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=file permissive=0
> type=AVC msg=audit(1539106775.373:44): avc: denied { write } for pid=3670 comm="dhclient" name="dhclient-eth0.pid" dev="tmpfs" ino=30701 scontext=system_u:system_r:dhcpc_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=file permissive=0
> type=AVC msg=audit(1539106775.380:45): avc: denied { read } for pid=3678 comm="dhclient" name="dhclient-eth2.pid" dev="tmpfs" ino=31671 scontext=system_u:system_r:dhcpc_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=file permissive=0
> type=AVC msg=audit(1539106775.380:46): avc: denied { write } for pid=3678 comm="dhclient" name="dhclient-eth2.pid" dev="tmpfs" ino=31671 scontext=system_u:system_r:dhcpc_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=file permissive=0
> type=AVC msg=audit(1539106775.391:47): avc: denied { write } for pid=3670 comm="dhclient" name="dhclient-eth0.pid" dev="tmpfs" ino=30701 scontext=system_u:system_r:dhcpc_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=file permissive=0
> type=AVC msg=audit(1539106775.391:48): avc: denied { write } for pid=3674 comm="dhclient" name="dhclient-eth1.pid" dev="tmpfs" ino=33456 scontext=system_u:system_r:dhcpc_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=file permissive=0
> type=AVC msg=audit(1539106775.398:49): avc: denied { write } for pid=3678 comm="dhclient" name="dhclient-eth2.pid" dev="tmpfs" ino=31671 scontext=system_u:system_r:dhcpc_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=file permissive=0
> type=AVC msg=audit(1539106775.413:50): avc: denied { write } for pid=3674 comm="dhclient" name="dhclient-eth1.pid" dev="tmpfs" ino=33456 scontext=system_u:system_r:dhcpc_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=file permissive=0
> type=AVC msg=audit(1539106775.416:51): avc: denied { write } for pid=3670 comm="dhclient" name="dhclient-eth0.pid" dev="tmpfs" ino=30701 scontext=system_u:system_r:dhcpc_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=file permissive=0
> type=AVC msg=audit(1539106775.433:52): avc: denied { write } for pid=3678 comm="dhclient" name="dhclient-eth2.pid" dev="tmpfs" ino=31671 scontext=system_u:system_r:dhcpc_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=file permissive=0
> type=AVC msg=audit(1539106792.039:112): avc: denied { unlink } for pid=3599 comm="NetworkManager" name="dhclient-eth2.pid" dev="tmpfs" ino=31671 scontext=system_u:system_r:NetworkManager_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=file permissive=0
> type=AVC msg=audit(1539106792.171:114): avc: denied { unlink } for pid=3599 comm="NetworkManager" name="dhclient-eth1.pid" dev="tmpfs" ino=33456 scontext=system_u:system_r:NetworkManager_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=file permissive=0
> type=AVC msg=audit(1539106792.293:115): avc: denied { unlink } for pid=3599 comm="NetworkManager" name="dhclient-eth0.pid" dev="tmpfs" ino=30701 scontext=system_u:system_r:NetworkManager_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=file permissive=0
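
A listing like this is easier to read when grouped by process, permission and file. The following is only a rough sketch: the regular expression is an assumption based on the AVC record layout quoted in this comment, and `summarize_avc` is a hypothetical helper, not an existing audit tool.

```python
import re
from collections import Counter

# Matches the 'avc: denied { perm } ... comm="..." name="..."' part of a record.
AVC_RE = re.compile(
    r'avc:\s+denied\s+\{\s*(?P<perms>[^}]*?)\s*\}.*?'
    r'comm="(?P<comm>[^"]+)".*?name="(?P<name>[^"]+)"'
)


def summarize_avc(lines):
    """Count denials per (comm, permission, file name); skip non-matching lines."""
    counts = Counter()
    for line in lines:
        match = AVC_RE.search(line)
        if not match:
            continue
        for perm in match.group("perms").split():
            counts[(match.group("comm"), perm, match.group("name"))] += 1
    return counts


# Two records copied from the audit excerpt in this report.
SAMPLE = [
    'type=AVC msg=audit(1539106775.354:36): avc: denied { read } for pid=3599 '
    'comm="NetworkManager" name="dhclient-eth0.pid" dev="tmpfs" ino=30701 '
    'scontext=system_u:system_r:NetworkManager_t:s0 '
    'tcontext=system_u:object_r:var_run_t:s0 tclass=file permissive=0',
    'type=AVC msg=audit(1539106775.373:43): avc: denied { write } for pid=3674 '
    'comm="dhclient" name="dhclient-eth1.pid" dev="tmpfs" ino=33456 '
    'scontext=system_u:system_r:dhcpc_t:s0 '
    'tcontext=system_u:object_r:var_run_t:s0 tclass=file permissive=0',
]

if __name__ == "__main__":
    for key, count in sorted(summarize_avc(SAMPLE).items()):
        print(key, count)
```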
The SELinux issue may not be related; similar denials were already seen once in the past without affecting functionality, as reported in bug 1390011.
So the recommended fix is to backport https://review.openstack.org/#/c/416664/ to OSP-10, is that correct? https://review.openstack.org/#/c/490962/ does not need to be backported? Has anyone done the backport yet? If not, I will do it.
(In reply to Bob Fournier from comment #24)
> So the recommended fix is to backport
> https://review.openstack.org/#/c/416664/ to OSP-10, is that correct?
> https://review.openstack.org/#/c/490962/ does not need to be backported?

We have only tested with both patches:

diff -urN diskimage-builder-osp10/elements/dhcp-all-interfaces/install.d/dhcp-interface@.service diskimage-builder-osp13/diskimage_builder/elements/dhcp-all-interfaces/install.d/dhcp-interface@.service
--- diskimage-builder-osp10/elements/dhcp-all-interfaces/install.d/dhcp-interface@.service 2018-10-23 14:04:13.351476911 +0200
+++ diskimage-builder-osp13/diskimage_builder/elements/dhcp-all-interfaces/install.d/dhcp-interface@.service 2018-10-23 14:04:34.473701409 +0200
@@ -1,7 +1,15 @@
 [Unit]
 Description=DHCP interface %i
-Before=network-pre.target
-Wants=network-pre.target
+# We want to run after network.target so it doesn't try to bring
+# up the interfaces a second time, but network-online should not
+# be reached until after we've brought up the interfaces.
+# We also need to break the default dependencies which prevents
+# this from operating on Ubuntu/Debian as the unit gets stuck
+# into a cyclical dependency loop.
+DefaultDependencies=no
+After=network.target
+Before=network-online.target
+Wants=network-online.target
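
To check whether a given image already carries the new ordering, something like the sketch below could be run against the installed dhcp-interface@.service. It is a simplified line parser rather than a full systemd unit parser, the helper names are hypothetical, and the two samples contain only the [Unit] lines visible in the diff above.

```python
def unit_directives(unit_text, section="Unit"):
    """Collect Key=Value directives from one section of a systemd unit file."""
    directives = {}
    current = None
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            continue
        if current == section and "=" in line:
            key, value = line.split("=", 1)
            directives.setdefault(key.strip(), []).extend(value.split())
    return directives


def has_backported_ordering(unit_text):
    """True when the unit is ordered after network.target and before network-online.target."""
    unit = unit_directives(unit_text)
    return (
        "network.target" in unit.get("After", [])
        and "network-online.target" in unit.get("Before", [])
        and "network-pre.target" not in unit.get("Before", [])
    )


# [Unit] sections as they appear on each side of the diff in this comment.
OLD_UNIT = """\
[Unit]
Description=DHCP interface %i
Before=network-pre.target
Wants=network-pre.target
"""

NEW_UNIT = """\
[Unit]
Description=DHCP interface %i
DefaultDependencies=no
After=network.target
Before=network-online.target
Wants=network-online.target
"""

if __name__ == "__main__":
    print(has_backported_ordering(OLD_UNIT), has_backported_ordering(NEW_UNIT))
```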
OK. As they are patches in the same area of code, I will backport the first (https://review.openstack.org/#/c/416664/) and then the second (https://review.openstack.org/#/c/490962/).
In /var/log/messages on controller-0 and compute-0 it looks like the same issue with hostnames using the IP, e.g. on controller:

Oct 25 14:57:24 host-192-168-24-18 kernel: Initializing cgroup subsys cpuset
Oct 25 14:57:24 host-192-168-24-18 kernel: Initializing cgroup subsys cpu
Oct 25 14:57:24 host-192-168-24-18 kernel: Initializing cgroup subsys cpuacct

and on compute:

Oct 25 14:57:23 host-192-168-24-11 kernel: Initializing cgroup subsys cpuset
Oct 25 14:57:23 host-192-168-24-11 kernel: Initializing cgroup subsys cpu
Oct 25 14:57:23 host-192-168-24-11 kernel: Initializing cgroup subsys cpuacct
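
A quick way to scan a messages file for these IP-derived names is sketched below. The two patterns are assumptions based on the syslog lines quoted in this comment (traditional "Mon DD HH:MM:SS host ..." prefix, "host-a-b-c-d" names), and `ip_derived_hosts` is a hypothetical helper.

```python
import re

# Traditional syslog prefix: month, day, time, then the hostname field.
SYSLOG_RE = re.compile(r"^\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}\s+(?P<host>\S+)\s")
# Hostnames of the form host-192-168-24-18, i.e. built from an IPv4 address.
IP_HOST_RE = re.compile(r"^host-\d{1,3}(?:-\d{1,3}){3}$")


def ip_derived_hosts(lines):
    """Return the set of IP-derived hostnames seen in the given syslog lines."""
    hosts = set()
    for line in lines:
        match = SYSLOG_RE.match(line)
        if match and IP_HOST_RE.match(match.group("host")):
            hosts.add(match.group("host"))
    return hosts


# First line copied from this comment; the second is a made-up healthy entry.
SAMPLE = [
    "Oct 25 14:57:24 host-192-168-24-18 kernel: Initializing cgroup subsys cpuset",
    "Oct 25 14:57:24 controller-0 kernel: Initializing cgroup subsys cpuset",
]

if __name__ == "__main__":
    print(sorted(ip_derived_hosts(SAMPLE)))
```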
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:3674