Bug 1635662 - [AIO] standalone deployment does not survive a machine reboot
Summary: [AIO] standalone deployment does not survive a machine reboot
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-tripleo
Version: 14.0 (Rocky)
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: beta
Target Release: 14.0 (Rocky)
Assignee: Alex Schultz
QA Contact: Marius Cornea
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-10-03 12:58 UTC by Fabio Massimo Di Nitto
Modified: 2019-01-11 11:53 UTC
CC List: 13 users

Fixed In Version: puppet-tripleo-9.3.1-0.20180831202651.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-01-11 11:53:35 UTC
Target Upstream Version:
Embargoed:




Links
System ID                                 Last Updated
Launchpad 1795986                         2018-10-03 22:03:47 UTC
OpenStack gerrit 607727                   2018-10-03 22:03:24 UTC
Red Hat Product Errata RHEA-2019:0045     2019-01-11 11:53:44 UTC

Description Fabio Massimo Di Nitto 2018-10-03 12:58:45 UTC
Description of problem:

deploy all-in-one standalone on baremetal, verify that the deployment is functional by running an instance.

reboot the baremetal machine
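
For reference, a minimal sketch of the reproducer; the deploy options, IP and environment file names below are illustrative, not taken from this environment:

# all-in-one standalone deploy (options and file paths are illustrative)
sudo openstack tripleo deploy \
  --templates \
  --standalone \
  -r /usr/share/openstack-tripleo-heat-templates/roles/Standalone.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/standalone/standalone-tripleo.yaml \
  -e $HOME/standalone_parameters.yaml \
  --local-ip 192.168.24.2/24 \
  --output-dir $HOME

# verify the cloud by booting a test instance, then:
sudo reboot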

Version-Release number of selected component (if applicable):

openstack-tripleo-heat-templates-9.0.0-0.20180919080941.0rc1.0rc1.el7ost.noarch

How reproducible:

always

Steps to Reproduce:
1. see above

Actual results:

Leaving aside that it took over 20 minutes from issuing the reboot until the host stopped answering ping (i.e. until the reboot actually kicked in), which might be a side effect of the number of services running on the machine: after the machine is back online, not all containers / services are running.

At first glance, at least all the pacemaker-driven services are offline.
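
For reference, the kind of checks that show the broken state (illustrative commands, not a transcript from hab-07):

# after the machine comes back up
sudo docker ps                 # several expected containers are missing
sudo pcs status                # fails or shows resources stopped, since pacemaker never started
sudo systemctl --failed        # lists units that did not come up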

Expected results:

A fully functional cloud.

Additional info:

I am happy to provide access to the environment where I am testing AIO. Please let me know, as I am still testing other bits and pieces.

Comment 1 Fabio Massimo Di Nitto 2018-10-03 13:04:59 UTC
It looks like pacemaker does not start at boot for some reason.

Starting pacemaker manually makes the cloud work again, but instances are not running (perhaps expected).

[root@hab-07 ~]# systemctl status pacemaker
● pacemaker.service - Pacemaker High Availability Cluster Manager
   Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; enabled; vendor preset: disabled)

Comment 2 Fabio Massimo Di Nitto 2018-10-03 13:10:37 UTC
Moving to PIDONE, keeping DF FYI.

The environment to reproduce is hab-07.

Michele, I won't touch the env until you give me the ok. Feel free to reboot at will; it doesn't take more than a couple of hours to redeploy if needed.

Comment 3 Chris Jones 2018-10-03 15:40:43 UTC
@Fabio: FYI, pacemaker does start on boot, but it takes about 10-11 minutes after the machine starts responding to pings for pacemaker to actually start.

I believe the culprit is that chrony is installed and configured to start, but it has no NTP server configured, so the chrony-wait service takes 10 minutes to time out:

[root@hab-07 ~]# grep server /etc/chrony.conf | grep -v '^#'
[root@hab-07 ~]# 
[root@hab-07 ~]# systemctl show chrony-wait.service | grep -E 'ExecMain.*Timestamp='
ExecMainStartTimestamp=Wed 2018-10-03 11:07:58 UTC
ExecMainExitTimestamp=Wed 2018-10-03 11:17:58 UTC  [Editor's Note: here's our 10 minute delay]
[root@hab-07 ~]#

I think ntpd is supposed to be used here, since that has an NTP server configured, but I'm not up to date on what the expected interactions are between TripleO and chrony/ntpd.

I hope that's useful info - I think it justifies moving this back to DF as it doesn't seem like Pacemaker has any problems starting when systemd is ready to start it.
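
One way to confirm the delay comes from chrony-wait rather than from pacemaker itself (illustrative; this assumes pacemaker.service is ordered after time-sync.target, which chrony-wait gates):

systemd-analyze blame | head -n 5                  # chrony-wait.service should dominate the list
systemd-analyze critical-chain pacemaker.service   # shows what pacemaker waited on at boot
systemctl cat chrony-wait.service                  # shows the unit and its timeout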

Comment 4 Alex Schultz 2018-10-03 15:54:13 UTC
From the ansible.log

        "Notice: /Stage[main]/Tripleo::Profile::Base::Time::Ntp/Service[chronyd]/ensure: ensure changed 'running' to 'stopped'", 
        "Notice: /Stage[main]/Ntp::Install/Package[ntp]/ensure: created", 
        "Notice: /Stage[main]/Ntp::Config/File[/etc/ntp.conf]/content: content changed '{md5}913c85f0fde85f83c2d6c030ecf259e9' to '{md5}56184b875f6e3aeb59cbf8f52a60a70a'", 
        "Notice: /Stage[main]/Ntp::Service/Service[ntp]/ensure: ensure changed 'stopped' to 'running'",

We stopped chrony and started ntp. It sounds like NTP might not have started on reboot, but I'll investigate.
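
A quick way to check what the deploy left enabled for the next boot (illustrative; the systemd unit behind the puppet Service[ntp] resource is ntpd on RHEL):

systemctl is-enabled chronyd.service chrony-wait.service ntpd.service
systemctl is-active chronyd.service ntpd.service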

Comment 5 Alex Schultz 2018-10-03 21:42:25 UTC
Turns out there's another chrony service (chrony-wait) that prevents ntp from starting. We'll need to account for that as well somehow. This service does not seem to exist on CentOS. This likely has a larger impact and affects all of our OSP versions, since it can prevent ntp from running on reboot.
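
The manual equivalent of what needs to happen is roughly the following (workaround sketch only; the actual fix belongs in puppet-tripleo, cf. the gerrit change linked above):

# make sure no chrony unit can block or conflict with ntpd on the next boot
sudo systemctl stop chronyd.service chrony-wait.service
sudo systemctl disable chronyd.service chrony-wait.service
sudo systemctl enable ntpd.service
sudo systemctl start ntpd.service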

Comment 11 errata-xmlrpc 2019-01-11 11:53:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0045

