Bug 1118067

Summary: Rubygem-Staypuft: Deployment gets paused with error upon installing second compute node.
Product: Red Hat OpenStack
Reporter: Alexander Chuzhoy <sasha>
Component: openstack-foreman-installer
Assignee: Jason Guiditta <jguiditt>
Status: CLOSED ERRATA
QA Contact: Alexander Chuzhoy <sasha>
Severity: high
Priority: high
Version: 5.0 (RHEL 7)
CC: aberezin, ajeain, hbrock, jguiditt, mburns, morazi, rhos-maint, sasha, sengork, yeylon
Target Milestone: ga
Target Release: Installer
Hardware: x86_64
OS: Linux
Fixed In Version: openstack-foreman-installer-2.0.15-1.el6ost
Doc Type: Bug Fix
Type: Bug
Last Closed: 2014-08-21 18:05:10 UTC
Attachments:
  production.log from the staypuft host
  /var/log/messages file from a compute node where puppet had to be re-executed
  /var/log/messages file from the controller
  /var/log/messages file from the compute node

Description Alexander Chuzhoy 2014-07-09 23:18:38 UTC
Created attachment 916961 [details]
production.log from the staypuft host.

Rubygem-Staypuft:  Deployment gets paused with error upon installing second compute node.

Environment: rhel-osp-installer-0.1.0-2.el6ost.noarch


Steps to reproduce (see the sketch after this list):
1. Install rhel-osp-installer.
2. Create a deployment of 3 hosts (1 Nova controller + 2 compute nodes).
3. Start the deployment.
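For reference, a rough command-level sketch of step 1, assuming the rhel-osp-installer package provides an interactive rhel-osp-installer command (steps 2 and 3 are performed in the Staypuft web UI, so they have no CLI equivalent here):

    # Install and launch the installer on the provisioning host (package and
    # command name taken from the Environment line above; prompts are interactive).
    yum install -y rhel-osp-installer
    rhel-osp-installer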



Result:
The deployment gets paused with an error after successfully installing the controller and one compute node.
The error is: ERF42-0943 [Staypuft::Exception]: Latest Puppet Run Contains Failures for Host: 5

Comment 3 Alexander Chuzhoy 2014-07-10 21:02:09 UTC
Reproduced with poodle 2014-07-10.4

Running puppet manually on the second node, where the deployment paused, I see only these changes:

Notice: /Stage[main]/Quickstack::Compute_common/Firewall[001 nova compute incoming]/ensure: created
Notice: /File[/etc/sysconfig/iptables]/seluser: seluser changed 'unconfined_u' to 'system_u'
Notice: /File[/usr/bin/qemu-system-x86_64]/seltype: seltype changed 'qemu_exec_t' to 'bin_t'
Info: /Stage[main]/Quickstack::Compute::Qemu/File[/usr/bin/qemu-system-x86_64]: Scheduling refresh of Service[nova-compute]
Notice: /Stage[main]/Nova::Compute/Nova::Generic_service[compute]/Service[nova-compute]: Triggered 'refresh' from 1 events



Hope this helps.

Comment 4 Hugh Brock 2014-07-10 21:03:32 UTC
Moving back to on_dev, since it is still not fixed.

Comment 5 Jason Guiditta 2014-07-11 15:12:28 UTC
Is there any way we could get some output from /var/log/messages on the problematic node?  The first run would be ideal, as that is more likely to show us something useful.  The production log from foreman doesn't really give me any information with which I can attempt to debug.  Also, if there are any actual error reports in the foreman UI, those would be of help.  They can be found from the host list: click on the host, then click on the Reports button on the left-hand side.
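(As an aside, the same reports can usually be pulled over Foreman's REST API; a hedged sketch, assuming API v2 is enabled, with placeholder credentials, hostname, and host FQDN, since the exact endpoint can vary by Foreman version:)

    # List the most recent Puppet reports for the problematic host
    # (all values below are placeholders, not taken from this deployment).
    curl -k -u admin:PASSWORD \
        "https://staypuft.example.com/api/hosts/compute-2.example.com/reports?per_page=5"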

Comment 6 Alexander Chuzhoy 2014-07-14 19:36:40 UTC
Created attachment 917977 [details]
/var/log/messages file from a compute node where puppet had to be re-executed.

Comment 7 Alexander Chuzhoy 2014-07-14 19:37:44 UTC
The issue was reproduced with poodle 2014-07-14.2
Both compute nodes got stuck. Rerunning puppet fixes the issue.
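For readers following along, the manual re-run mentioned here is just a one-off foreground agent run on the stuck node, e.g.:

    # Run the Puppet agent once in the foreground against the Foreman/Puppet master
    # (--test implies --onetime, --verbose, and --no-daemonize).
    puppet agent --test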

Comment 8 Jason Guiditta 2014-07-14 20:59:34 UTC
Thanks, that log shows me that ceilometer::compute attempts to manage a user belonging to the nova group, which may or may not have been created already.  I think that by adding an ordering dependency between ceilometer::compute and nova::init, we can alleviate this.
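To illustrate the kind of ordering dependency being described (a minimal sketch only, using the class names from the logs above; the actual change was merged as a variation of this in the pull request linked in comment 11):

    # Make sure the nova class (which creates the nova user/group) is applied
    # before ceilometer::compute adds the ceilometer user to the 'nova' group.
    Class['nova'] -> Class['ceilometer::compute']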

Comment 9 Alexander Chuzhoy 2014-07-14 22:36:02 UTC
Tried to run 2 deployments simultaneously after the above was applied.
The deployment doesn't fail, nor does it finish.

Running strace on the puppet process, I see repetitions of this: http://pastebin.test.redhat.com/221451

Perhaps running the two deployments at the same time is what's causing this.

Comment 10 Alexander Chuzhoy 2014-07-15 18:01:25 UTC
Reproduced running a single deployment of Nova network (1 controller + 1 compute).
The compute node got stuck at 60%.
Running puppet manually from the CLI shows this:

Info: Caching catalog for <nodename>
Warning: The package type's allow_virtual parameter will be changing its default value from false to true in a future release. If you do not want to allow virtual packages, please explicitly set allow_virtual to false.
   (at /usr/share/ruby/vendor_ruby/puppet/type.rb:816:in `set_default')
Info: Applying configuration version '1405442586'
Notice: /Stage[main]/Ceilometer/User[ceilometer]/groups: groups changed 'nobody,ceilometer' to 'ceilometer,nobody,nova'
Notice: /Stage[main]/Ceilometer/File[/etc/ceilometer/]/owner: owner changed 'root' to 'ceilometer'
Notice: /Stage[main]/Ceilometer/File[/etc/ceilometer/]/group: group changed 'root' to 'ceilometer'
Notice: /Stage[main]/Ceilometer/File[/etc/ceilometer/]/mode: mode changed '0755' to '0750'
Notice: /Stage[main]/Nova::Compute::Libvirt/Service[messagebus]/ensure: ensure changed 'stopped' to 'running'
Info: /Stage[main]/Nova::Compute::Libvirt/Service[messagebus]: Unscheduling refresh on Service[messagebus]
Notice: /Stage[main]/Ceilometer/File[/etc/ceilometer/ceilometer.conf]/owner: owner changed 'root' to 'ceilometer'
Notice: Finished catalog run in 4.43 seconds


Resuming the deployment at this point allows it to complete.

Comment 11 Jason Guiditta 2014-07-15 19:16:46 UTC
OK, cwolfe has added a variation of the fix which works for me:

https://github.com/redhat-openstack/astapor/pull/316

Comment 14 Alexander Chuzhoy 2014-07-21 16:55:37 UTC
Failed-qa - with rhel-osp-installer-0.1.1-1.el6ost.noarch

Verified: FailedQA



Had to re-run puppet on the second compute node and got the following:
Warning: The package type's allow_virtual parameter will be changing its default value from false to true in a future release. If you do not want to allow virtual packages, please explicitly set allow_virtual to false.
   (at /usr/share/ruby/vendor_ruby/puppet/type.rb:816:in `set_default')
Info: Applying configuration version '1405959075'
Notice: /Stage[main]/Nova::Compute::Libvirt/Service[messagebus]/ensure: ensure changed 'stopped' to 'running'
Info: /Stage[main]/Nova::Compute::Libvirt/Service[messagebus]: Unscheduling refresh on Service[messagebus]
Notice: Finished catalog run in 6.14 seconds


Then, after resuming the deployment, the installation completed.
Attaching the messages file.

Comment 15 Alexander Chuzhoy 2014-07-21 16:56:17 UTC
Created attachment 919698 [details]
/var/log/messages file from the controller.

Comment 16 Alexander Chuzhoy 2014-07-21 16:57:06 UTC
Created attachment 919700 [details]
/var/log/messages file from the compute node.

Comment 17 Jason Guiditta 2014-07-21 18:42:24 UTC
I'm not sure what to tell you here.  I am unable to reproduce, and the output above contains no errors, so I don't see a reason staypuft would get 'stuck' here.

Comment 18 Hugh Brock 2014-07-21 19:18:20 UTC
Sasha, can you re-run and get us a system that is "stuck" but on which puppet has not been re-run? I think we need to see the initial failure, if there is one.

Thanks...

Comment 19 Jason Guiditta 2014-07-22 21:12:05 UTC
This happens before puppet is really involved.  It looked to me like one of the machines involved was on IDT while the rest were on UTC, which results in the foreman report page saying the hosts had time drift.  Also, the puppet errors were of the following type:

Execution of '/usr/bin/yum -d 0 -e 0 -y install openstack-nova-compute' returned 1: One of the configured repositories failed (RHOS-5.0 el7), and yum doesn't have enough cached data to continue. At this point the only safe thing yum can do is fail. There are a few ways to work "fix" this:

 1. Contact the upstream for the repository and get them to fix the problem.

 2. Reconfigure the baseurl/etc. for the repository, to point to a working upstream. This is most often useful if you are using a newer distribution release than is supported by the repository (and the packages for the previous distribution release still work).

 3. Disable the repository, so yum won't use it by default. Yum will then just ignore the repository until you permanently enable it again or use --enablerepo for temporary usage:

        yum-config-manager --disable rhelosp-5.0-el7

 4. Configure the failing repository to be skipped, if it is unavailable. Note that yum will try to contact the repo. when it runs most commands, so will have to try and fail each time (and thus. yum will be be much slower). If it is a very temporary problem though, this is often a nice compromise:

        yum-config-manager --save --setopt=rhelosp-5.0-el7.skip_if_unavailable=true

failure: repodata/52771ac1b6c333e8f628419f67eeaad75cf5a670-filelists.sqlite.bz2 from rhelosp-5.0-el7: [Errno 256] No more mirrors to try.
http://<some-url>/latest/RH7-RHOS-5.0/x86_64/os/repodata/52771ac1b6c333e8f628419f67eeaad75cf5a670-filelists.sqlite.bz2: [Errno 14] HTTP Error 404 - Not Found


Note that <some-url> is an intentional substitution by me; the URL that was actually there was correct.
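For completeness, a quick way to check whether that repository's metadata is actually reachable from the failing node (a sketch, using the repo id quoted in the yum error above):

    # Drop any cached metadata for the suspect repo and try to rebuild it.
    yum --disablerepo='*' --enablerepo=rhelosp-5.0-el7 clean metadata
    yum --disablerepo='*' --enablerepo=rhelosp-5.0-el7 makecache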

Comment 21 Jason Guiditta 2014-08-04 18:56:44 UTC
Please retest this with the latest release to make sure it was not a bad puddle or an environment issue.  If it resurfaces, we can look at which component to assign it to, as it does not appear related to openstack-foreman-installer.

Comment 26 Alexander Chuzhoy 2014-08-07 16:55:05 UTC
Verified: rhel-osp-installer-0.1.6-5.el6ost.noarch


This particular error doesn't reproduce.

Comment 27 errata-xmlrpc 2014-08-21 18:05:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1090.html