Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1157340

Summary: need to rerun puppet to deploy heat on HA Controller (sometimes)
Product: Red Hat OpenStack Reporter: Crag Wolfe <cwolfe>
Component: rubygem-staypuft Assignee: Jiri Stransky <jstransk>
Status: CLOSED ERRATA QA Contact: Alexander Chuzhoy <sasha>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 5.0 (RHEL 7) CC: cwolfe, ddomingo, jguiditt, jstransk, mburns, morazi, rhos-maint, sasha, schuzhoy, sclewis, yeylon
Target Milestone: ga Keywords: ZStream
Target Release: Installer   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: ruby193-rubygem-staypuft-0.5.4-1.el7ost Doc Type: Release Note
Doc Text:
A race condition in puppet sometimes caused the following error during a host's first puppet run: "change from notrun to 0 failed: /usr/sbin/pcs constraint order start openstack-nova-api-clone then start openstack-nova-scheduler-clone returned 1 instead of one of [0]". When a deployment stopped or paused with this error in the puppet log, users had to manually resume the deployment. To fix this, the deployment now runs puppet 3 times on each host, and only the last puppet run is checked for success. This avoids the race condition error.
Story Points: ---
Clone Of:
Clones: 1183832 (view as bug list) Environment:
Last Closed: 2015-02-09 15:17:34 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1183832    
Attachments:
Description Flags
logs none

Description Crag Wolfe 2014-10-27 04:23:16 UTC
Description of problem:

QE encountered a case where puppet needed to be rerun on a controller to deploy heat.  The first run produced a puppet error due to a missing heat pacemaker resource.  Running puppet again allowed the heat setup to complete with no errors.

Version-Release number of selected component (if applicable):
osp5 / rhel7 clients / A2

How reproducible:
Only in QA so far, not in development.

Additional info:
Relevant info from /var/log/messages for the host with $cluster_control_ip:


    Oct 24 12:36:33 mac000af779931a puppet-agent[4081]: (/Stage[main]/Quickstack::Pacemaker::Heat/Quickstack::Pacemaker::Constraint::Base[heat-api-cfn-constr]/Pacemaker::Constraint::Base[heat-api-cfn-constr]/Exec[Creating order constraint heat-api-cfn-constr]/returns) executed successfully
    Oct 24 12:36:34 mac000af779931a crmd[13334]: notice: do_state_transition: State transition S_NOT_DC -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]
    Oct 24 12:36:34 mac000af779931a puppet-agent[4081]: (/Stage[main]/Quickstack::Pacemaker::Heat/Quickstack::Pacemaker::Constraint::Colocation[heat-api-cfn-colo]/Pacemaker::Constraint::Colocation[heat-api-cfn-colo]/Pcmk_constraint[colo-openstack-heat-api-cfn-clone-openstack-heat-api-clone]/ensure) created
    Oct 24 12:36:34 mac000af779931a crmd[13334]: notice: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
    Oct 24 12:36:34 mac000af779931a crmd[13334]: notice: do_state_transition: State transition S_NOT_DC -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]
    Oct 24 12:36:34 mac000af779931a cib[13329]: warning: cib_process_diff: Diff 0.156.2 -> 0.156.1 from mac000af7799318.osqe.lab.eng.bos.redhat.com not applied to 0.156.2: Failed application of an update diff
    Oct 24 12:36:34 mac000af779931a attrd[13332]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
    Oct 24 12:36:34 mac000af779931a attrd[13332]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
    Oct 24 12:36:34 mac000af779931a cib[13329]: warning: cib_process_replace: Replacement 0.156.1 from mac000af7799318.osqe.lab.eng.bos.redhat.com not applied to 0.156.2: current num_updates is greater than the replacement
    Oct 24 12:36:34 mac000af779931a cib[13329]: notice: cib_server_process_diff: Not applying diff 0.156.1 -> 0.157.1 (sync in progress)
    Oct 24 12:36:34 mac000af779931a attrd[13332]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
    Oct 24 12:36:34 mac000af779931a attrd[13332]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
    Oct 24 12:36:34 mac000af779931a crmd[13334]: notice: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
    Oct 24 12:36:35 mac000af779931a stonith-ng[13330]: warning: cib_process_diff: Diff 0.156.2 -> 0.157.1 from local not applied to 0.156.2: Failed application of an update diff
    Oct 24 12:36:35 mac000af779931a stonith-ng[13330]: notice: update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of an update diff failed (-206)
     
    ############## error on next line
    Oct 24 12:36:35 mac000af779931a start-puppet-agent: Error: unable to find group or resource: openstack-heat-api-cloudwatch
     
    Oct 24 12:36:35 mac000af779931a cib[13329]: warning: cib_notify_send_one: Notification of client crmd/513438cc-67ec-41a4-b319-502faaa028e5 failed
    Oct 24 12:36:35 mac000af779931a puppet-agent[4081]: (/Stage[main]/Quickstack::Pacemaker::Heat/Quickstack::Pacemaker::Resource::Service[openstack-heat-api-cloudwatch]/Pacemaker::Resource::Service[openstack-heat-api-cloudwatch]/Pacemaker::Resource::Systemd[openstack-heat-api-cloudwatch]/Pcmk_resource[openstack-heat-api-cloudwatch]/ensure) created

Comment 5 Alexander Chuzhoy 2014-10-31 20:44:05 UTC
Created attachment 952612 [details]
logs

Comment 6 Alexander Chuzhoy 2014-10-31 20:45:13 UTC
Verified: FailedQA
rhel-osp-installer-0.4.7-1.el6ost.noarch
ruby193-rubygem-staypuft-0.4.13-1.el6ost.noarch
ruby193-rubygem-foreman_openstack_simplify-0.0.6-8.el6ost.noarch
openstack-foreman-installer-2.0.32-1.el6ost.noarch
openstack-puppet-modules-2014.1-24.el6ost.noarch

The logs are attached.


Got the puppet error:
change from notrun to 0 failed: /usr/sbin/pcs constraint order start openstack-nova-api-clone then start openstack-nova-scheduler-clone returned 1 instead of one of [0]

Comment 7 Crag Wolfe 2014-10-31 23:48:55 UTC
It looks like this is not a puppet ordering problem based on:

a previous constraint on nova-api-clone succeeded:

Oct 31 14:37:18 mac000af779931a puppet-agent[4278]: (/Stage[main]/Quickstack::Pacemaker::Nova/Quickstack::Pacemaker::Constraint::Colocation[nova-vnc-api-colo]/Pacemaker::Constraint::Colocation[nova-vnc-api-colo]/Pcmk_constraint[colo-openstack-nova-api-clone-openstack-nova-novncproxy-clone]/ensure) created

and openstack-nova-scheduler-clone was created earlier:

Oct 31 14:37:17 mac000af779931a puppet-agent[4278]: (/Stage[main]/Quickstack::Pacemaker::Nova/Quickstack::Pacemaker::Resource::Service[openstack-nova-scheduler]/Pacemaker::Resource::Service[openstack-nova-scheduler]/Pacemaker::Resource::Systemd[openstack-nova-scheduler]/Pcmk_resource[openstack-nova-scheduler]/ensure) created
...
Oct 31 14:37:19 mac000af779931a crmd[12811]: notice: process_lrm_event: LRM operation openstack-nova-scheduler_start_0 (call=152, rc=0, cib-update=153, confirmed=true) ok

So, it seems that the pacemaker cluster was not ready to handle the added constraint.  From pacemaker.log:

Oct 31 14:37:19 [12804] mac000af779931a.example.com        cib:     info: crm_client_new:       Connecting 0xad0b20 for uid=0 gid=0 pid=29140 id=18b0d84f-21df-4714-a156-cb88ee4077ce
Oct 31 14:37:19 [12804] mac000af779931a.example.com        cib:     info: cib_process_request:  Forwarding cib_replace operation for section constraints to master (origin=local/cibadmin/2)
Oct 31 14:37:19 [12804] mac000af779931a.example.com        cib:     info: cib_process_diff:     Diff 0.86.3 -> 0.87.1 from mac000af7799318.example.com not applied to 0.86.4: current "num_updates" is greater than required
Oct 31 14:37:19 [12804] mac000af779931a.example.com        cib:     info: crm_client_destroy:   Destroying 0 events

Adding a retry to the pcmk_resource puppet provider would likely allow this to succeed.

Comment 8 Crag Wolfe 2014-11-01 00:02:04 UTC
Clarifying previous comment, the puppet code[1] that defines pcmk_resource could be modified to accept a "retry" parameter.
[1] https://github.com/radez/puppet-pacemaker/blob/master/lib/puppet/provider/pcmk_resource/default.rb
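To illustrate the suggestion, here is a minimal sketch of what a "retry" behavior could look like. This is hypothetical code, not taken from the puppet-pacemaker provider linked above; the method name, parameter names, and backoff values are assumptions for illustration only.

```ruby
# Hypothetical retry wrapper for a pcs invocation inside a provider.
# The CIB may still be syncing across the cluster when the command
# first runs, so a failed attempt is retried after a short wait.
def run_with_retry(cmd, tries: 3, wait: 5)
  last_status = nil
  tries.times do |attempt|
    output = `#{cmd} 2>&1`
    last_status = $?.exitstatus
    # Success: return the command output immediately.
    return output if last_status.zero?
    # Back off before every attempt except the final one.
    sleep(wait) unless attempt == tries - 1
  end
  raise "command failed after #{tries} attempts (exit #{last_status}): #{cmd}"
end
```

A transient "unable to find group or resource" failure like the one in the log would then be absorbed by a later attempt, once the CIB sync completes.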

Comment 12 Mike Burns 2014-11-26 14:10:36 UTC
Moving to the staypuft component to investigate whether we can automatically re-run puppet if it fails.

Comment 14 Jiri Stransky 2014-12-08 17:07:15 UTC
Run puppet 3 times in Staypuft:

https://github.com/theforeman/staypuft/pull/387

This doesn't fix the root cause, but gives us 3 puppet runs to let things converge. The last puppet run's report status is checked; it must be successful for the deployment to continue.
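The "run N times, judge only the final run" idea can be sketched as follows. This is an illustrative outline, not the actual Staypuft orchestration code from the PR; the method name, command string, and run count are assumptions (the run count of 3 mirrors the comment above).

```ruby
# Illustrative sketch: run puppet several times on a host and gate
# the deployment only on the exit status of the final run.
def converge_host(cmd = "puppet agent --test --detailed-exitcodes", runs: 3)
  last = nil
  runs.times do
    system(cmd)
    last = $?.exitstatus
  end
  # With --detailed-exitcodes, puppet agent exits 2 when changes were
  # applied successfully, so both 0 and 2 count as a clean final run.
  [0, 2].include?(last)
end
```

Earlier runs are allowed to fail on transient race conditions such as the pcs constraint error; only a failure on the last run would stop the deployment.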

Comment 17 Alexander Chuzhoy 2015-01-16 22:03:26 UTC
Verified:
Environment:
ruby193-rubygem-staypuft-0.5.12-1.el7ost.noarch
openstack-puppet-modules-2014.2.8-1.el7ost.noarch
ruby193-rubygem-foreman_openstack_simplify-0.0.6-8.el7ost.noarch
openstack-foreman-installer-3.0.10-2.el7ost.noarch
rhel-osp-installer-0.5.5-1.el7ost.noarch
rhel-osp-installer-client-0.5.5-1.el7ost.noarch


The reported issue doesn't reproduce.

Comment 20 errata-xmlrpc 2015-02-09 15:17:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0156.html