Bug 1635662
| Summary: | [AIO] standalone deployment does not survive a machine reboot | ||
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Fabio Massimo Di Nitto <fdinitto> |
| Component: | puppet-tripleo | Assignee: | Alex Schultz <aschultz> |
| Status: | CLOSED ERRATA | QA Contact: | Marius Cornea <mcornea> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 14.0 (Rocky) | CC: | apevec, aschultz, chjones, dbecker, emacchi, jjoyce, jschluet, lhh, mburns, morazi, rhos-maint, slinaber, tvignaud |
| Target Milestone: | beta | Keywords: | Triaged |
| Target Release: | 14.0 (Rocky) | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | puppet-tripleo-9.3.1-0.20180831202651.el7ost | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2019-01-11 11:53:35 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
Description
Fabio Massimo Di Nitto
2018-10-03 12:58:45 UTC
It looks like pacemaker does not start at boot for some reason. Starting pacemaker manually gets the cloud working again, but instances are not running (perhaps expected).

[root@hab-07 ~]# systemctl status pacemaker
● pacemaker.service - Pacemaker High Availability Cluster Manager
Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; enabled; vendor preset: disabled)

Moving to PIDONE, keeping DF FYI. The environment to reproduce is hab-07. Michele, I won't touch the env till you give me the ok. Feel free to reboot at will; it doesn't take more than a couple of hours to redeploy if needed.

@Fabio: FYI, pacemaker does start on boot, but it only starts about 10-11 minutes after the machine begins responding to pings. I believe the culprit is that chrony is installed and configured to start, but it has no NTP server configured, so the chrony-wait service takes 10 minutes to time out:

[root@hab-07 ~]# grep server /etc/chrony.conf | grep -v '^#'
[root@hab-07 ~]#
[root@hab-07 ~]# systemctl show chrony-wait.service | grep -E 'ExecMain.*Timestamp='
ExecMainStartTimestamp=Wed 2018-10-03 11:07:58 UTC
ExecMainExitTimestamp=Wed 2018-10-03 11:17:58 UTC [Editor's Note: here's our 10 minute delay]
[root@hab-07 ~]#

I think ntpd is supposed to be used here, since that has an NTP server configured, but I'm not up to date on what the expected interactions are between TripleO and chrony/ntpd. I hope that's useful info. I think it justifies moving this back to DF, as Pacemaker doesn't seem to have any problems starting once systemd is ready to start it.

From the ansible.log:
"Notice: /Stage[main]/Tripleo::Profile::Base::Time::Ntp/Service[chronyd]/ensure: ensure changed 'running' to 'stopped'",
"Notice: /Stage[main]/Ntp::Install/Package[ntp]/ensure: created",
"Notice: /Stage[main]/Ntp::Config/File[/etc/ntp.conf]/content: content changed '{md5}913c85f0fde85f83c2d6c030ecf259e9' to '{md5}56184b875f6e3aeb59cbf8f52a60a70a'",
"Notice: /Stage[main]/Ntp::Service/Service[ntp]/ensure: ensure changed 'stopped' to 'running'",
We stopped chrony and started ntp. It sounds like NTP might not have started on reboot, but I'll investigate.
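
The service state changes in the ansible.log excerpt above can also be pulled out mechanically, which helps when the log is long. The following is a minimal sketch, assuming the Puppet notice lines keep the exact shape quoted in this report; the function name `service_transitions` and the regular expression are illustrative and not part of TripleO or Puppet.

```python
import re

# The four notice lines quoted from ansible.log in this report.
LOG_LINES = [
    "Notice: /Stage[main]/Tripleo::Profile::Base::Time::Ntp/Service[chronyd]/ensure: ensure changed 'running' to 'stopped'",
    "Notice: /Stage[main]/Ntp::Install/Package[ntp]/ensure: created",
    "Notice: /Stage[main]/Ntp::Config/File[/etc/ntp.conf]/content: content changed '{md5}913c85f0fde85f83c2d6c030ecf259e9' to '{md5}56184b875f6e3aeb59cbf8f52a60a70a'",
    "Notice: /Stage[main]/Ntp::Service/Service[ntp]/ensure: ensure changed 'stopped' to 'running'",
]

# Hypothetical pattern: matches only Service[...] resources whose
# "ensure" property changed from one state to another.
SERVICE_CHANGE = re.compile(
    r"/Service\[(?P<name>[^\]]+)\]/ensure: "
    r"ensure changed '(?P<old>[^']+)' to '(?P<new>[^']+)'"
)


def service_transitions(lines):
    """Return (service, old_state, new_state) tuples found in the lines."""
    transitions = []
    for line in lines:
        match = SERVICE_CHANGE.search(line)
        if match:
            transitions.append(
                (match.group("name"), match.group("old"), match.group("new"))
            )
    return transitions


for name, old, new in service_transitions(LOG_LINES):
    print(f"{name}: {old} -> {new}")
```

On the quoted lines this reports only chronyd and ntp; no chrony-wait resource appears in the excerpt.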
It turns out there's another chrony service (chrony-wait) that prevents ntp from starting. We'll need to account for that as well somehow. This service does not seem to exist on CentOS. This likely has a larger impact and affects all of our versions of OSP, as it can prevent ntp from running on reboot.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:0045 |
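
The 10-minute delay attributed to chrony-wait earlier in this report can be read off the two `ExecMain*Timestamp` values printed by `systemctl show`. The sketch below is a rough illustration, assuming the timestamps are in the `Wed 2018-10-03 11:07:58 UTC` form quoted above; the helper names are hypothetical, and the sample text is copied from the report rather than fetched from a live system.

```python
from datetime import datetime

# Output of `systemctl show chrony-wait.service | grep -E 'ExecMain.*Timestamp='`
# as quoted in this report.
SAMPLE = """ExecMainStartTimestamp=Wed 2018-10-03 11:07:58 UTC
ExecMainExitTimestamp=Wed 2018-10-03 11:17:58 UTC"""


def parse_timestamps(show_output):
    """Map each *Timestamp property to a datetime (weekday and zone dropped)."""
    stamps = {}
    for line in show_output.splitlines():
        key, sep, value = line.partition("=")
        parts = value.split()
        # Skip lines that are not KEY=VALUE timestamps or have an empty value.
        if not sep or not key.endswith("Timestamp") or len(parts) < 3:
            continue
        # parts looks like ["Wed", "2018-10-03", "11:07:58", "UTC"].
        stamps[key] = datetime.strptime(
            " ".join(parts[1:3]), "%Y-%m-%d %H:%M:%S"
        )
    return stamps


def exec_main_duration_seconds(show_output):
    """Seconds between the unit's main process starting and exiting."""
    stamps = parse_timestamps(show_output)
    delta = stamps["ExecMainExitTimestamp"] - stamps["ExecMainStartTimestamp"]
    return delta.total_seconds()


print(exec_main_duration_seconds(SAMPLE))  # 600.0
```

A result of 600 seconds matches the 10-minute chrony-wait timeout described in the report.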