Bug 1275814

Summary: Update OSP-D from 7.0 to 7.1 failed: systemd stops functioning on the controller node (Failed to get D-Bus connection)
Product: Red Hat OpenStack Reporter: Omri Hochman <ohochman>
Component: rhosp-director Assignee: James Slagle <jslagle>
Status: CLOSED ERRATA QA Contact: Alexander Chuzhoy <sasha>
Severity: high Docs Contact:
Priority: high    
Version: 7.0 (Kilo) CC: calfonso, dmacpher, jprovazn, jslagle, kbasil, mburns, rhel-osp-director-maint, ukalifon, yeylon
Target Milestone: y2 Keywords: TestOnly, Triaged
Target Release: 7.0 (Kilo)   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Old Puppet manifests were reapplied during the update process when they should not have been, which could take the cluster services down in the Overcloud. The agent on the Overcloud nodes reapplied the old manifests because their deployed state was saved in a tmpfs-mounted directory under /var/run/, which is lost on reboot. This update moves the directory from /var/run/heat-config/deployed to /var/lib/heat-config/deployed, which allows the deployed state to persist across reboots.
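
A minimal sketch of the mechanism described in the Doc Text, assuming the agent records each applied deployment as a marker file named after the deployment id and skips any deployment whose marker already exists. The function names, the marker layout, and the sample deployment are hypothetical and only illustrate why losing the state directory leads to a reapply; this is not the agent's actual code.

```python
import json
import os
import tempfile


def deployed_marker(deployed_dir, deployment_id):
    """Path of the (hypothetical) marker file for an applied deployment."""
    return os.path.join(deployed_dir, deployment_id + ".json")


def apply_once(deployed_dir, deployment, apply):
    """Run apply(deployment) unless a marker for its id already exists.

    Returns True if the deployment was applied, False if it was skipped.
    """
    marker = deployed_marker(deployed_dir, deployment["id"])
    if os.path.exists(marker):
        return False
    apply(deployment)
    os.makedirs(deployed_dir, exist_ok=True)
    with open(marker, "w") as f:
        json.dump(deployment, f)
    return True


if __name__ == "__main__":
    applied = []
    deployment = {"id": "example-deployment", "group": "puppet"}
    with tempfile.TemporaryDirectory() as state_dir:
        # First run: no marker yet, so the deployment is applied.
        print(apply_once(state_dir, deployment, applied.append))
        # Second run: the marker is present, so the deployment is skipped.
        print(apply_once(state_dir, deployment, applied.append))
        # Simulate a reboot with the state kept on tmpfs: the marker is gone.
        os.remove(deployed_marker(state_dir, deployment["id"]))
        # Third run: the old deployment is applied again.
        print(apply_once(state_dir, deployment, applied.append))
```

Under these assumptions, keeping the marker directory on persistent storage such as /var/lib/ is what prevents the third call from reapplying an old deployment after a reboot.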
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-12-21 16:57:27 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1272254    
Attachments:
Description Flags
messages none

Description Omri Hochman 2015-10-27 20:06:07 UTC
Update OSP-D from 7.0 to 7.1 failed: systemd stops functioning on the controller node (Failed to get D-Bus connection)

Environment :
--------------

Controller: 
-------------
dbus-1.6.12-11.el7.x86_64
dbus-glib-0.100-7.el7.x86_64
dbus-python-1.1.1-9.el7.x86_64
dbus-libs-1.6.12-11.el7.x86_64
python-slip-dbus-0.4.0-2.el7.noarch


Undercloud: 
------------
instack-undercloud-2.1.2-29.el7ost.noarch
instack-0.0.7-1.el7ost.noarch
openstack-heat-templates-0-0.6.20150605git.el7ost.noarch
openstack-tripleo-heat-templates-0.8.6-45.el7ost.noarch
openstack-heat-api-2015.1.0-4.el7ost.noarch
openstack-heat-api-cfn-2015.1.1-6.el7ost.noarch
heat-cfntools-1.2.8-2.el7.noarch
openstack-heat-common-2015.1.0-4.el7ost.noarch
openstack-heat-api-cloudwatch-2015.1.1-6.el7ost.noarch
openstack-heat-api-cfn-2015.1.0-4.el7ost.noarch
python-heatclient-0.6.0-1.el7ost.noarch
openstack-heat-api-cloudwatch-2015.1.0-4.el7ost.noarch
openstack-heat-common-2015.1.1-6.el7ost.noarch
openstack-heat-api-2015.1.1-6.el7ost.noarch
openstack-tripleo-heat-templates-0.8.6-71.el7ost.noarch
openstack-heat-engine-2015.1.1-6.el7ost.noarch
openstack-heat-engine-2015.1.0-4.el7ost.noarch


Description : 
-------------
This happened after applying the patch https://review.openstack.org/#/c/239368/ to work around https://bugzilla.redhat.com/show_bug.cgi?id=1274859
and then attempting to update OSP-D (undercloud and overcloud) from 7.0 to 7.1.

Steps:
-------
(1) Install the undercloud and overcloud at 7.0 (with 7.0 images).
(2) Update the undercloud to 7.1 (using rhos-release).
(3) Make sure the 7.1 repos are enabled on the overcloud nodes.
(4) Attempt to run the overcloud update command:

(More details: http://etherpad.corp.redhat.com/update-ospd-7-0-to-7-1  )

openstack overcloud update stack overcloud -i --templates  -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e /home/stack/update.yaml 

Results: 
---------

(1) It looks like one package failed to update cleanly during the yum update running on the controller:

   59/363 \nFailed to get D-Bus connect
 /run/systemd/private: No such file or directory\nwarning: %post(glusterfs-3.7.1-16.el7.x86_64) scriptlet failed, exit status 1\n 


(2) Then, during the update, systemctl stopped functioning on the controller machine:

[root@overcloud-controller-0 ~]# systemctl 
Failed to get D-Bus connection: Failed to connect to socket /run/systemd/private: No such file or directory

(3) Overcloud 'Update failed' 

----------------------------------------------------------
[root@overcloud-controller-0 ~]# ps auxf|grep systemd
root         1  0.3  0.0  51260  2340 ?        Ss   Oct22  28:11 /usr/lib/systemd/systemd --system --deserialize 27
root       346  0.2  0.5  80496 20016 ?        Ss   Oct22  20:04 /usr/lib/systemd/systemd-journald
root       437  0.0  0.0      0     0 ?        Zs   Oct22   3:17 [systemd-logind] <defunct>
dbus       438  0.1  0.0 100492  2024 ?        Ssl  Oct22   7:06 /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
root      3444  0.0  0.0 112640   928 pts/0    S+   15:43   0:00                      \_ grep --color=auto systemd
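
The two symptoms above (the missing /run/systemd/private socket and the defunct systemd-logind process) can be checked for with a short script. This is only a sketch for Linux hosts: the function names are made up for illustration, and it reads /proc directly rather than calling systemctl, since systemctl is the thing that is broken.

```python
import os
import stat

SYSTEMD_PRIVATE_SOCKET = "/run/systemd/private"


def is_socket(path):
    """Return True if path exists and is a UNIX socket."""
    try:
        return stat.S_ISSOCK(os.stat(path).st_mode)
    except OSError:
        return False


def parse_proc_stat(data):
    """Return (name, state) from the contents of /proc/<pid>/stat.

    The layout is "pid (comm) state ..."; comm may itself contain spaces
    or parentheses, so the last ")" marks the end of the name.
    """
    name = data[data.index("(") + 1:data.rindex(")")]
    state = data[data.rindex(")") + 1:].split()[0]
    return name, state


def defunct_processes(proc_root="/proc"):
    """Return (pid, name) for each process in state 'Z' (defunct)."""
    found = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return found  # no procfs available at this path
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                name, state = parse_proc_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or stat was unreadable
        if state == "Z":
            found.append((int(entry), name))
    return found


if __name__ == "__main__":
    if not is_socket(SYSTEMD_PRIVATE_SOCKET):
        print("missing systemd private socket:", SYSTEMD_PRIVATE_SOCKET)
    for pid, name in defunct_processes():
        print("defunct process:", pid, name)
```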


/var/log/messages (from controller) 
------------------------------------
                                \n  tzdata.noarch 0:2015g-1.el7                                                   \n  util-linux.x86_64 0:2.23.2-22.el7_1.1   
      \n\nComplete!\nyum return code: 0\nStarting cluster node\nStarting Cluster...\nRedirecting to /bin/systemctl start  corosync.service\nFailed to get D-Bu
to socket /run/systemd/private: No such file or directory\n\nERROR overcloud-controller-0 failed to join cluster in 360 seconds\n", "deploy_stderr": "Non-fata
m package glusterfs-3.7.1-16.el7.x86_64\nNon-fatal POSTUN scriptlet failure in rpm package glusterfs-3.6.0.29-2.el7.x86_64\nError: unable to start corosync\nE
unning on this node\nError: cluster is not currently running on this node\nError: cluster is not currently running on this node\nError: cluster is not current
cluster is not currently running on this node\nError: cluster is not currently running on this node\nError: cluster is not currently running on this node\nErr
ning on this node\nError: cluster is not currently running on this node\nError: cluster is not currently running on this node\nError: cluster is not currently
uster is not currently running on this node\nError: cluster is not currently running on this node\nError: cluster is not currently running on this node\nError
ng on this node\nError: cluster is not currently running on this node\nError: cluster is not currently running on this node\nError: cluster is not currently r
ter is not currently running on this node\nError: cluster is not currently running on this node\nError: cluster is not currently running on this node\nError: 
 on this no
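
The excerpt above is hard to read because the deploy output is JSON-escaped and the same error repeats many times. A small sketch like the following collapses it into distinct lines with counts. It assumes the raw text uses literal backslash-n sequences as line separators, as in the excerpt; the function name and the sample string are illustrative only.

```python
from collections import Counter


def summarize_deploy_output(raw):
    """Count how often each non-empty line occurs in escaped deploy output.

    Lines are separated by the two-character sequence backslash + "n",
    not by real newlines.
    """
    lines = (line.strip() for line in raw.split("\\n"))
    return Counter(line for line in lines if line)


if __name__ == "__main__":
    # Sample assembled from messages that appear in the excerpt above.
    sample = (
        r"Error: unable to start corosync\n"
        r"Error: cluster is not currently running on this node\n"
        r"Error: cluster is not currently running on this node\n"
    )
    for line, count in summarize_deploy_output(sample).most_common():
        print(count, line)
```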

Comment 1 Jan Provaznik 2015-10-27 20:13:52 UTC
I think systemd got into a broken state on the controller node:
[root@overcloud-controller-0 ~]# systemctl 
Failed to get D-Bus connection: Failed to connect to socket /run/systemd/private: No such file or directory

Because systemctl doesn't work, no services can be started either.

Comment 2 Omri Hochman 2015-10-27 20:25:33 UTC
Created attachment 1087011 [details]
messages

Adding messages file from controller

Comment 3 James Slagle 2015-11-06 18:08:19 UTC
I also saw some cluster-related errors during an update attempt:
https://bugzilla.redhat.com/show_bug.cgi?id=1278004

The Puppet reapply is happening due to:
https://bugzilla.redhat.com/show_bug.cgi?id=1278181

It's still unclear, though, why reapplying the Puppet manifests causes these errors.

Comment 5 Udi Kalifon 2015-12-15 11:01:17 UTC
Update from 7.0 to 7.2 is working. Verified.

Comment 7 errata-xmlrpc 2015-12-21 16:57:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2015:2651