Description of problem:
Deploy all-in-one standalone on baremetal, verify that the deployment is functional by running an instance, then reboot the baremetal machine.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-9.0.0-0.20180919080941.0rc1.0rc1.el7ost.noarch

How reproducible:
Always

Steps to Reproduce:
1. See above

Actual results:
Leaving aside that it took over 20 minutes from issuing the reboot until the host stopped answering pings (the reboot actually kicking in), which might be a side effect of the number of services running on the machine: after the machine is back online, not all containers/services are running. At first glance, at least all the pacemaker-driven services are offline.

Expected results:
A fully functional cloud.

Additional info:
I am happy to provide access to the environment where I am testing AIO. Please let me know, as I keep testing other bits and pieces.
It looks like pacemaker does not start at boot for some reason. Starting pacemaker manually makes the cloud work again, but the instances are not running (perhaps expected).

[root@hab-07 ~]# systemctl status pacemaker
● pacemaker.service - Pacemaker High Availability Cluster Manager
   Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; enabled; vendor preset: disabled)
Moving to PIDONE, keeping DF FYI. The environment to reproduce is hab-07. Michele, I won't touch the env until you give me the OK. Feel free to reboot at will; it doesn't take more than a couple of hours to redeploy if needed.
@Fabio: FYI, pacemaker does start on boot, but only about 10-11 minutes after the machine starts responding to pings. I believe the culprit is that chrony is installed and configured to start but has no NTP server configured, so the chrony-wait service takes 10 minutes to time out:

[root@hab-07 ~]# grep server /etc/chrony.conf | grep -v '^#'
[root@hab-07 ~]#
[root@hab-07 ~]# systemctl show chrony-wait.service | grep -E 'ExecMain.*Timestamp='
ExecMainStartTimestamp=Wed 2018-10-03 11:07:58 UTC
ExecMainExitTimestamp=Wed 2018-10-03 11:17:58 UTC    [Editor's Note: here's our 10 minute delay]
[root@hab-07 ~]#

I think ntpd is supposed to be used here, since it has an NTP server configured, but I'm not up to date on the expected interactions between TripleO and chrony/ntpd.

I hope that's useful info. I think it justifies moving this back to DF, as Pacemaker doesn't seem to have any problem starting once systemd is ready to start it.
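The 10-minute figure in the comment above follows directly from the two ExecMain timestamps. As a minimal sketch, assuming the `systemctl show` output has been captured as text, the gap can be computed as below. The function name `exec_main_duration` is hypothetical, and the parser only handles UTC timestamps in the exact format quoted above.

```python
from datetime import datetime, timedelta


def exec_main_duration(show_output: str) -> timedelta:
    """Return ExecMainExitTimestamp minus ExecMainStartTimestamp.

    `show_output` is text in the `Key=Value` format printed by
    `systemctl show <unit>`.
    """
    stamps = {}
    for line in show_output.splitlines():
        key, sep, value = line.strip().partition("=")
        # Exact key match, so the *Monotonic variants are ignored.
        if sep and key in ("ExecMainStartTimestamp", "ExecMainExitTimestamp"):
            # Value looks like "Wed 2018-10-03 11:07:58 UTC". Drop the weekday
            # and zone name so parsing does not depend on the locale.
            _weekday, date_part, time_part, _zone = value.split()
            stamps[key] = datetime.strptime(
                f"{date_part} {time_part}", "%Y-%m-%d %H:%M:%S"
            )
    return stamps["ExecMainExitTimestamp"] - stamps["ExecMainStartTimestamp"]


# Timestamps copied from the chrony-wait.service output above.
chrony_wait_show = """\
ExecMainStartTimestamp=Wed 2018-10-03 11:07:58 UTC
ExecMainExitTimestamp=Wed 2018-10-03 11:17:58 UTC
"""
print(exec_main_duration(chrony_wait_show))  # 0:10:00
```

A result of exactly ten minutes is consistent with chrony-wait hitting a timeout rather than finishing on its own.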
From the ansible.log:

"Notice: /Stage[main]/Tripleo::Profile::Base::Time::Ntp/Service[chronyd]/ensure: ensure changed 'running' to 'stopped'",
"Notice: /Stage[main]/Ntp::Install/Package[ntp]/ensure: created",
"Notice: /Stage[main]/Ntp::Config/File[/etc/ntp.conf]/content: content changed '{md5}913c85f0fde85f83c2d6c030ecf259e9' to '{md5}56184b875f6e3aeb59cbf8f52a60a70a'",
"Notice: /Stage[main]/Ntp::Service/Service[ntp]/ensure: ensure changed 'stopped' to 'running'",

We stopped chrony and started ntp. It sounds like NTP might not have started on reboot, but I'll investigate.
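To see at a glance which services Puppet touched, the notices can be reduced to a small mapping. This is only a sketch: the regular expression and the helper name `service_changes` are assumptions written for the excerpt quoted above, not part of any TripleO tooling, and the result says nothing about lines outside that excerpt.

```python
import re

# Matches e.g. "Service[chronyd]/ensure: ensure changed 'running' to 'stopped'"
SERVICE_CHANGE = re.compile(
    r"Service\[(?P<name>[^\]]+)\]/ensure: "
    r"ensure changed '(?P<old>\w+)' to '(?P<new>\w+)'"
)


def service_changes(log_text: str) -> dict:
    """Map each service name to its (old, new) ensure state."""
    return {
        m["name"]: (m["old"], m["new"])
        for m in SERVICE_CHANGE.finditer(log_text)
    }


# Notices copied from the ansible.log excerpt above.
puppet_log = (
    "Notice: /Stage[main]/Tripleo::Profile::Base::Time::Ntp/Service[chronyd]"
    "/ensure: ensure changed 'running' to 'stopped'\n"
    "Notice: /Stage[main]/Ntp::Install/Package[ntp]/ensure: created\n"
    "Notice: /Stage[main]/Ntp::Service/Service[ntp]/ensure: "
    "ensure changed 'stopped' to 'running'\n"
)
changes = service_changes(puppet_log)
print(changes)
print("chrony-wait" in changes)
```

In this excerpt only chronyd and ntp change state; no chrony-wait service appears among the changes.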
Turns out there's another chrony service (chrony-wait) that prevents ntp from starting. We'll need to account for that as well somehow. This service does not seem to exist on CentOS. This likely has a larger impact and affects all of our OSP versions, as it can prevent ntp from running after a reboot.
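One way to spot this condition on a host is to look for chrony units that are still enabled alongside ntpd. The sketch below assumes the text of `systemctl list-unit-files` is available as a string; the helper name `enabled_chrony_units` is hypothetical, and the sample input is made up for illustration, not captured from hab-07.

```python
def enabled_chrony_units(list_unit_files_output: str) -> list:
    """Return the chrony* units reported as enabled.

    Expects `systemctl list-unit-files` text, where each data line
    starts with "<unit> <state>".
    """
    enabled = []
    for line in list_unit_files_output.splitlines():
        fields = line.split()
        # Skip the header, the footer and blank lines.
        if (
            len(fields) >= 2
            and fields[0].startswith("chrony")
            and fields[1] == "enabled"
        ):
            enabled.append(fields[0])
    return enabled


# Hypothetical sample; the states shown here are illustrative only.
unit_files = """\
UNIT FILE            STATE
chrony-wait.service  enabled
chronyd.service      disabled
ntpd.service         enabled
"""
print(enabled_chrony_units(unit_files))
```

A non-empty result on a node that is meant to use ntpd would point at a leftover chrony unit such as chrony-wait.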
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:0045