Bug 1635662

Summary: [AIO] standalone deployment does not survive a machine reboot
Product: Red Hat OpenStack
Reporter: Fabio Massimo Di Nitto <fdinitto>
Component: puppet-tripleo
Assignee: Alex Schultz <aschultz>
Status: CLOSED ERRATA
QA Contact: Marius Cornea <mcornea>
Severity: high
Priority: high
Version: 14.0 (Rocky)
Target Milestone: beta
Target Release: 14.0 (Rocky)
Hardware: All
OS: Linux
Keywords: Triaged
CC: apevec, aschultz, chjones, dbecker, emacchi, jjoyce, jschluet, lhh, mburns, morazi, rhos-maint, slinaber, tvignaud
Fixed In Version: puppet-tripleo-9.3.1-0.20180831202651.el7ost
Last Closed: 2019-01-11 11:53:35 UTC
Type: Bug

Description Fabio Massimo Di Nitto 2018-10-03 12:58:45 UTC
Description of problem:

Deploy an all-in-one (AIO) standalone cloud on bare metal and verify that the deployment is functional by booting an instance.

Then reboot the bare-metal machine (a sketch of these steps follows below).
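
[Editor's Note: a rough sketch of the reproduction, assuming a Rocky-era python-tripleoclient; the flags, paths, and image/flavor names below are illustrative rather than taken from this report:]

sudo openstack tripleo deploy --templates --standalone \
  -e /usr/share/openstack-tripleo-heat-templates/environments/standalone.yaml \
  --local-ip 192.168.24.2/24 \
  --output-dir ~/standalone-ansible
# verify the deployment by booting an instance
openstack server create --flavor m1.tiny --image cirros --network private test-vm
# then reboot the machine
sudo reboot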

Version-Release number of selected component (if applicable):

openstack-tripleo-heat-templates-9.0.0-0.20180919080941.0rc1.el7ost.noarch

How reproducible:

always

Steps to Reproduce:
1. see above

Actual results:

Leaving aside that it took over 20 minutes from issuing the reboot until pings to the host stopped (i.e. for the reboot to actually kick in), which may be a side effect of the number of services running on the machine, not all containers/services are running once the machine is back online.

At first glance, at least all of the pacemaker-managed services are offline.
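
[Editor's Note: on OSP 14 the services run as docker containers plus a set of pacemaker-managed resources, so one quick way to enumerate what failed to come back would be something like the following (commands illustrative):]

sudo docker ps --all --filter status=exited
sudo pcs status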

Expected results:

A fully functional cloud.

Additional info:

I am happy to provide access to the environment where I am testing AIO. Please let me know, as I am continuing to test other bits and pieces on it.

Comment 1 Fabio Massimo Di Nitto 2018-10-03 13:04:59 UTC
It looks like pacemaker does not start at boot for some reason.

Starting pacemaker manually brings the cloud back to a working state, but instances are not running (perhaps expected).

[root@hab-07 ~]# systemctl status pacemaker
● pacemaker.service - Pacemaker High Availability Cluster Manager
   Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; enabled; vendor preset: disabled)
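
[Editor's Note: the manual recovery mentioned above presumably amounts to something like the following, with pcs status used to confirm the resources come back:]

sudo systemctl start pacemaker
sudo pcs status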

Comment 2 Fabio Massimo Di Nitto 2018-10-03 13:10:37 UTC
Moving to PIDONE, keeping DF FYI.

The environment to reproduce is hab-07.

Michele, I won't touch the env until you give me the OK. Feel free to reboot at will; it doesn't take more than a couple of hours to redeploy if needed.

Comment 3 Chris Jones 2018-10-03 15:40:43 UTC
@Fabio: FYI, pacemaker does start on boot, but it takes about 10-11 minutes after the machine starts responding to pings for pacemaker to actually start.

I believe the culprit is that chrony is installed and configured to start, but it has no NTP server configured, so the chrony-wait service takes 10 minutes to time out:

[root@hab-07 ~]# grep server /etc/chrony.conf | grep -v '^#'
[root@hab-07 ~]# 
[root@hab-07 ~]# systemctl show chrony-wait.service | grep -E 'ExecMain.*Timestamp='
ExecMainStartTimestamp=Wed 2018-10-03 11:07:58 UTC
ExecMainExitTimestamp=Wed 2018-10-03 11:17:58 UTC  [Editor's Note: here's our 10 minute delay]
[root@hab-07 ~]#

I think ntpd is supposed to be used here, since that has an NTP server configured, but I'm not up to date on what the expected interactions are between TripleO and chrony/ntpd.

I hope that's useful info. I think it justifies moving this back to DF, as it doesn't seem like Pacemaker has any problem starting once systemd is ready to start it.
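
[Editor's Note: chrony-wait waits for chronyd to synchronize before letting boot ordering proceed, so with no servers configured it can only exit by timing out. To confirm the unit's behavior and its timeout on an affected node, one could inspect it directly:]

sudo systemctl cat chrony-wait.service
sudo journalctl -b -u chrony-wait.service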

Comment 4 Alex Schultz 2018-10-03 15:54:13 UTC
From the ansible.log

        "Notice: /Stage[main]/Tripleo::Profile::Base::Time::Ntp/Service[chronyd]/ensure: ensure changed 'running' to 'stopped'", 
        "Notice: /Stage[main]/Ntp::Install/Package[ntp]/ensure: created", 
        "Notice: /Stage[main]/Ntp::Config/File[/etc/ntp.conf]/content: content changed '{md5}913c85f0fde85f83c2d6c030ecf259e9' to '{md5}56184b875f6e3aeb59cbf8f52a60a70a'", 
        "Notice: /Stage[main]/Ntp::Service/Service[ntp]/ensure: ensure changed 'stopped' to 'running'",

We stopped chronyd and started ntp. It sounds like ntp might not have started on reboot, but I'll investigate.

Comment 5 Alex Schultz 2018-10-03 21:42:25 UTC
It turns out there is another chrony service (chrony-wait) that prevents ntp from starting. We'll need to account for that as well somehow. This service does not seem to exist on CentOS. This likely has a larger impact and affects all of our OSP versions, since it can prevent ntp from running on reboot.
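
[Editor's Note: per the Fixed In Version above, the fix landed in puppet-tripleo; as a manual workaround on an already-deployed node, assuming ntpd is the intended time service, one could disable both chrony units so they cannot block or race ntpd at boot:]

sudo systemctl disable --now chronyd.service chrony-wait.service
sudo systemctl enable --now ntpd.service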

Comment 11 errata-xmlrpc 2019-01-11 11:53:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0045