Bug 1157340
| Field | Value |
|---|---|
| Summary | need to rerun puppet to deploy heat on HA Controller (sometimes) |
| Product | Red Hat OpenStack |
| Reporter | Crag Wolfe <cwolfe> |
| Component | rubygem-staypuft |
| Assignee | Jiri Stransky <jstransk> |
| Status | CLOSED ERRATA |
| QA Contact | Alexander Chuzhoy <sasha> |
| Severity | unspecified |
| Docs Contact | |
| Priority | unspecified |
| Version | 5.0 (RHEL 7) |
| CC | cwolfe, ddomingo, jguiditt, jstransk, mburns, morazi, rhos-maint, sasha, schuzhoy, sclewis, yeylon |
| Target Milestone | ga |
| Keywords | ZStream |
| Target Release | Installer |
| Hardware | Unspecified |
| OS | Unspecified |
| Whiteboard | |
| Fixed In Version | ruby193-rubygem-staypuft-0.5.4-1.el7ost |
| Doc Type | Release Note |
| Story Points | --- |
| Clone Of | |
| | 1183832 (view as bug list) |
| Environment | |
| Last Closed | 2015-02-09 15:17:34 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | --- |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
| Bug Depends On | |
| Bug Blocks | 1183832 |
| Attachments | |

Doc Text (Release Note):

A race condition in puppet sometimes caused the following error during a host's first puppet run:

```
change from notrun to 0 failed: /usr/sbin/pcs constraint order start openstack-nova-api-clone
then start openstack-nova-scheduler-clone returned 1 instead of one of [0]
```

When a deployment stopped or paused with this error in the puppet log, users had to resume the deployment manually. To fix this, the deployment now runs puppet three times on each host; only the last puppet run is checked for success, which avoids this race condition error.
Created attachment 952612 [details]
logs
Verified: FailedQA

Environment:

```
rhel-osp-installer-0.4.7-1.el6ost.noarch
ruby193-rubygem-staypuft-0.4.13-1.el6ost.noarch
ruby193-rubygem-foreman_openstack_simplify-0.0.6-8.el6ost.noarch
openstack-foreman-installer-2.0.32-1.el6ost.noarch
openstack-puppet-modules-2014.1-24.el6ost.noarch
```

The logs are attached. Got the puppet error:

```
change from notrun to 0 failed: /usr/sbin/pcs constraint order start openstack-nova-api-clone then start openstack-nova-scheduler-clone returned 1 instead of one of [0]
```

It looks like this is not a puppet ordering problem, because a previous constraint on nova-api-clone succeeded:

```
Oct 31 14:37:18 mac000af779931a puppet-agent[4278]: (/Stage[main]/Quickstack::Pacemaker::Nova/Quickstack::Pacemaker::Constraint::Colocation[nova-vnc-api-colo]/Pacemaker::Constraint::Colocation[nova-vnc-api-colo]/Pcmk_constraint[colo-openstack-nova-api-clone-openstack-nova-novncproxy-clone]/ensure) created
```

and openstack-nova-scheduler-clone was created earlier:

```
Oct 31 14:37:17 mac000af779931a puppet-agent[4278]: (/Stage[main]/Quickstack::Pacemaker::Nova/Quickstack::Pacemaker::Resource::Service[openstack-nova-scheduler]/Pacemaker::Resource::Service[openstack-nova-scheduler]/Pacemaker::Resource::Systemd[openstack-nova-scheduler]/Pcmk_resource[openstack-nova-scheduler]/ensure) created
...
Oct 31 14:37:19 mac000af779931a crmd[12811]: notice: process_lrm_event: LRM operation openstack-nova-scheduler_start_0 (call=152, rc=0, cib-update=153, confirmed=true) ok
```

So it seems that the pacemaker cluster was not ready to handle the added constraint.
From pacemaker.log:

```
Oct 31 14:37:19 [12804] mac000af779931a.example.com cib: info: crm_client_new: Connecting 0xad0b20 for uid=0 gid=0 pid=29140 id=18b0d84f-21df-4714-a156-cb88ee4077ce
Oct 31 14:37:19 [12804] mac000af779931a.example.com cib: info: cib_process_request: Forwarding cib_replace operation for section constraints to master (origin=local/cibadmin/2)
Oct 31 14:37:19 [12804] mac000af779931a.example.com cib: info: cib_process_diff: Diff 0.86.3 -> 0.87.1 from mac000af7799318.example.com not applied to 0.86.4: current "num_updates" is greater than required
Oct 31 14:37:19 [12804] mac000af779931a.example.com cib: info: crm_client_destroy: Destroying 0 events
```

Adding a retry to the pcmk_resource puppet type would likely allow this to succeed. Clarifying the previous comment: the puppet code [1] that defines pcmk_resource could be modified to accept a "retry" parameter.

[1] https://github.com/radez/puppet-pacemaker/blob/master/lib/puppet/provider/pcmk_resource/default.rb

Moving to the staypuft component to investigate whether we can automatically re-run puppet if it fails.

Run puppet 3 times in Staypuft: https://github.com/theforeman/staypuft/pull/387

This doesn't fix the root cause, but gives us 3 puppet runs to let things converge. The last puppet run's report status is checked; it must be successful for the deployment to continue.

Verified. Environment:

```
ruby193-rubygem-staypuft-0.5.12-1.el7ost.noarch
openstack-puppet-modules-2014.2.8-1.el7ost.noarch
ruby193-rubygem-foreman_openstack_simplify-0.0.6-8.el7ost.noarch
openstack-foreman-installer-3.0.10-2.el7ost.noarch
rhel-osp-installer-0.5.5-1.el7ost.noarch
rhel-osp-installer-client-0.5.5-1.el7ost.noarch
```

The reported issue doesn't reproduce.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0156.html
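The "retry" parameter suggested for pcmk_resource could work roughly as sketched below. This is a hypothetical illustration in Python, not the actual Ruby provider code; the `run_with_retry` name, the attempt count, and the delay are all assumptions. The idea is simply that a transient CIB sync race makes the first `pcs` call fail, while a later attempt succeeds once the cluster has converged.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=3, delay=2, runner=subprocess.run):
    """Run a command, retrying while it exits nonzero.

    `runner` is injected so the retry logic can be exercised without a
    real pacemaker cluster; by default it shells out via subprocess.run.
    Raises RuntimeError once all attempts are exhausted.
    """
    for attempt in range(1, attempts + 1):
        result = runner(cmd)
        if result.returncode == 0:
            return result
        if attempt < attempts:
            time.sleep(delay)  # give the CIB time to converge
    raise RuntimeError(
        f"{cmd!r} failed after {attempts} attempts "
        f"(last rc={result.returncode})"
    )
```

For example, the failing constraint from this bug would be wrapped as `run_with_retry(["pcs", "constraint", "order", "start", "openstack-nova-api-clone", "then", "start", "openstack-nova-scheduler-clone"])`.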
Description of problem:

QE encountered a case where puppet needed to be rerun on a controller to deploy heat. The first time, there was a puppet error due to a missing heat pacemaker resource. Running puppet again allowed the heat setup to complete with no errors.

Version-Release number of selected component (if applicable): osp5 / rhel7 clients / A2

How reproducible: Only in QA so far, not in development.

Additional info: Relevant entries from /var/log/messages for the host with $cluster_control_ip:

```
Oct 24 12:36:33 mac000af779931a puppet-agent[4081]: (/Stage[main]/Quickstack::Pacemaker::Heat/Quickstack::Pacemaker::Constraint::Base[heat-api-cfn-constr]/Pacemaker::Constraint::Base[heat-api-cfn-constr]/Exec[Creating order constraint heat-api-cfn-constr]/returns) executed successfully
Oct 24 12:36:34 mac000af779931a crmd[13334]: notice: do_state_transition: State transition S_NOT_DC -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]
Oct 24 12:36:34 mac000af779931a puppet-agent[4081]: (/Stage[main]/Quickstack::Pacemaker::Heat/Quickstack::Pacemaker::Constraint::Colocation[heat-api-cfn-colo]/Pacemaker::Constraint::Colocation[heat-api-cfn-colo]/Pcmk_constraint[colo-openstack-heat-api-cfn-clone-openstack-heat-api-clone]/ensure) created
Oct 24 12:36:34 mac000af779931a crmd[13334]: notice: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
Oct 24 12:36:34 mac000af779931a crmd[13334]: notice: do_state_transition: State transition S_NOT_DC -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]
Oct 24 12:36:34 mac000af779931a cib[13329]: warning: cib_process_diff: Diff 0.156.2 -> 0.156.1 from mac000af7799318.osqe.lab.eng.bos.redhat.com not applied to 0.156.2: Failed application of an update diff
Oct 24 12:36:34 mac000af779931a attrd[13332]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Oct 24 12:36:34 mac000af779931a attrd[13332]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Oct 24 12:36:34 mac000af779931a cib[13329]: warning: cib_process_replace: Replacement 0.156.1 from mac000af7799318.osqe.lab.eng.bos.redhat.com not applied to 0.156.2: current num_updates is greater than the replacement
Oct 24 12:36:34 mac000af779931a cib[13329]: notice: cib_server_process_diff: Not applying diff 0.156.1 -> 0.157.1 (sync in progress)
Oct 24 12:36:34 mac000af779931a attrd[13332]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Oct 24 12:36:34 mac000af779931a attrd[13332]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Oct 24 12:36:34 mac000af779931a crmd[13334]: notice: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
Oct 24 12:36:35 mac000af779931a stonith-ng[13330]: warning: cib_process_diff: Diff 0.156.2 -> 0.157.1 from local not applied to 0.156.2: Failed application of an update diff
Oct 24 12:36:35 mac000af779931a stonith-ng[13330]: notice: update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of an update diff failed (-206)
############## error on next line
Oct 24 12:36:35 mac000af779931a start-puppet-agent: Error: unable to find group or resource: openstack-heat-api-cloudwatch
Oct 24 12:36:35 mac000af779931a cib[13329]: warning: cib_notify_send_one: Notification of client crmd/513438cc-67ec-41a4-b319-502faaa028e5 failed
Oct 24 12:36:35 mac000af779931a puppet-agent[4081]: (/Stage[main]/Quickstack::Pacemaker::Heat/Quickstack::Pacemaker::Resource::Service[openstack-heat-api-cloudwatch]/Pacemaker::Resource::Service[openstack-heat-api-cloudwatch]/Pacemaker::Resource::Systemd[openstack-heat-api-cloudwatch]/Pcmk_resource[openstack-heat-api-cloudwatch]/ensure) created
```
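The Staypuft workaround discussed in the comments (run puppet three times per host, check only the final run's report status) can be sketched as follows. This is a hypothetical Python illustration, not the actual Staypuft orchestration code; `deploy_host` and its parameters are invented names, and the puppet run is stubbed out as an injected callable.

```python
def deploy_host(run_puppet, runs=3):
    """Run puppet `runs` times; only the final run's status decides success.

    Early runs are allowed to fail: they give transient pacemaker/CIB
    races (like the one in this bug) time to converge. `run_puppet` is a
    callable returning an exit status, where 0 means success.
    """
    status = None
    for _ in range(runs):
        status = run_puppet()  # earlier failures are intentionally ignored
    return status == 0

# In a real deployment, run_puppet might shell out, e.g.:
#   lambda: subprocess.run(
#       ["puppet", "agent", "--onetime", "--no-daemonize"]).returncode
```

The design trade-off is stated in the bug itself: this does not fix the root cause, it only gives the cluster extra convergence time before the deployment's pass/fail decision is made.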