Description of problem:
When a puppet run gets stuck, we have no way of knowing what happened. This is caused by two things:
1. Puppet run output is printed into the os-collect-config logs only when puppet has finished. If it doesn't finish, there is no output we could use to guess why it didn't finish.
2. Even if we fix point 1, by default puppet itself only prints the steps that it has finished performing, so a stuck step won't get printed at all. This can be changed by running puppet in debug mode, so that it prints every command it attempts to run on the system.
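Point 1 comes down to when the child process's output gets written out. A minimal Python sketch of the idea, assuming a hypothetical helper named `run_and_stream` and a plain log file (an illustration only, not the code from the patch under review): each line is written and flushed as soon as it arrives, so whatever was printed before a hang is already on disk.

```python
import subprocess


def run_and_stream(cmd, log_path):
    """Run cmd and append each output line to log_path as soon as it arrives.

    Hypothetical helper for illustration; not taken from heat-templates.
    """
    with open(log_path, "a") as log:
        with subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr so ordering is preserved
            text=True,
            bufsize=1,  # line-buffered reads on our side of the pipe
        ) as proc:
            for line in proc.stdout:
                log.write(line)
                log.flush()  # lines already seen survive a later hang
        return proc.returncode
```

Collecting the output only after the process exits is exactly what leaves nothing behind when the process never exits. Note that this only helps if the child does not buffer its own output.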
Martin Mágr has an initial patch for it, which hasn't passed upstream CI yet: https://review.openstack.org/#/c/188737/
Requested blocker because this is going to be a major usability problem when we're trying to debug deployment failures in the field.
The CI jobs are failing upstream because the CI is broken. Trying to fix the CI by removing py26 jobs (which have been deprecated for some time). Pending feedback from Heat / OS infra folks. https://review.openstack.org/#/c/201105/
Removing the infra patch from tracking. We don't need it downstream.
I posted another patch which is needed to allow enabling puppet debug mode. It will only do its job together with mmagr's patch.
@jistr -- do we get logs from puppet if it completes? Is this only an issue when puppet gets stuck or crashes or gets into a loop?
Yes we get logs from puppet when it completes, the output is in os-collect-config log. We also get logs when puppet fails on some step and skips the rest. We'd probably also get logs from a crashed puppet run, although that hasn't happened for me. This is only an issue when a puppet run gets stuck (e.g. in a retry loop inside puppet, or a shell command executed via puppet gets stuck) -- then we don't know which action it was trying to perform.
@jistr hey man, I tried to verify today. I applied the 2 heat-templates & tht changes and it failed in Compute/Controller post deployment. The failure might be because of up/downstream issues (e.g. cloned and cherry picked to upstream tht and used that for roles), but 2 questions for when I revisit:
1. Can you think of an easy repro to validate (e.g. I have to induce a puppet fail, right? Then check what, will it just be obvious in journalctl?)
2. After applying the heat-templates change I shouldn't need to rebuild overcloud-full or anything, right?
Thanks!
Not sure if you tested the latest heat-templates patch, as it was uploaded by mmagr just about 1 hour before you posted this comment. It will need to be changed anyway though, because it hasn't passed some of the CI jobs yet, so maybe now is not the right time to test.
1. You should see puppet run log files in /var/run/heat-config/deployed. When you enable debugging (via the ConfigDebug variable, or the environment file included in the t-h-t patch), the logs will be more verbose. If you want to test the "getting stuck" part, you can add an exec [1] to one of the puppet manifests in t-h-t which would loop and fail, e.g. just "false" as the command with e.g. a 10 sec try_sleep and 360 tries. If you have debugging enabled, you should see in the log how puppet repeatedly tries to execute the "false" command.
2. You need to rebuild the image or replace the affected files in the image manually. (For t-h-t it's not necessary; for h-t it is, because we use image elements from there.)
[1] https://docs.puppetlabs.com/references/4.2.latest/type.html#exec
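When checking a run that looks stuck, the interesting part is the end of whichever log file was written to last. A small Python sketch for that, assuming only the directory named above; the helper names `newest_file`, `tail` and `show_latest` are made up for illustration, and nothing is assumed about how the files in that directory are named:

```python
import os
from collections import deque


def newest_file(directory):
    """Return the path of the most recently modified regular file, or None."""
    paths = [os.path.join(directory, name) for name in os.listdir(directory)]
    files = [p for p in paths if os.path.isfile(p)]
    return max(files, key=os.path.getmtime) if files else None


def tail(path, count=20):
    """Return the last `count` lines of a text file as a list."""
    with open(path, errors="replace") as handle:
        return list(deque(handle, maxlen=count))


def show_latest(directory="/var/run/heat-config/deployed", count=20):
    """Print the path and the tail of the newest file in `directory`."""
    path = newest_file(directory)
    if path is None:
        print("no files in", directory)
        return
    print(path)
    print("".join(tail(path, count)), end="")
```

Calling `show_latest()` a few times on the node while the looping exec is in place should, with debugging enabled, keep showing the same command being retried at the end of the newest file.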
Dropping the heat-templates patch from this bug. It will be included in bug 1243884.
Should be already part of OSP8 since it was merged in early October, correct?
Verified:
Environment: openstack-tripleo-heat-templates-0.8.8-2.el7ost.noarch
During a deployment, I see that the puppet logs are being appended under /var/run/heat-config/deployed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-0604.html