Bug 1118067
Summary: | Rubygem-Staypuft: Deployment gets paused with error upon installing second compute node. | |
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Alexander Chuzhoy <sasha> |
Component: | openstack-foreman-installer | Assignee: | Jason Guiditta <jguiditt> |
Status: | CLOSED ERRATA | QA Contact: | Alexander Chuzhoy <sasha> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 5.0 (RHEL 7) | CC: | aberezin, ajeain, hbrock, jguiditt, mburns, morazi, rhos-maint, sasha, sengork, yeylon |
Target Milestone: | ga | ||
Target Release: | Installer | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | openstack-foreman-installer-2.0.15-1.el6ost | Doc Type: | Bug Fix |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2014-08-21 18:05:10 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Attachments: |
Reproduced with poodle 2014-07-10.4.

Running puppet manually on the second node, where the deployment paused, I see only these changes:

    Notice: /Stage[main]/Quickstack::Compute_common/Firewall[001 nova compute incoming]/ensure: created
    Notice: /File[/etc/sysconfig/iptables]/seluser: seluser changed 'unconfined_u' to 'system_u'
    Notice: /File[/usr/bin/qemu-system-x86_64]/seltype: seltype changed 'qemu_exec_t' to 'bin_t'
    Info: /Stage[main]/Quickstack::Compute::Qemu/File[/usr/bin/qemu-system-x86_64]: Scheduling refresh of Service[nova-compute]
    Notice: /Stage[main]/Nova::Compute/Nova::Generic_service[compute]/Service[nova-compute]: Triggered 'refresh' from 1 events

Hope this helps.

Moving back on_dev since it is still not fixed.

Is there any way we could get some output from /var/log/messages on the problematic node? The first run would be ideal, as that is the most likely to show us something useful. The production log from Foreman doesn't really give me any information with which I can attempt to debug. Also, if there are any actual error reports in the Foreman UI, those would be of help. They can be found from the host list: click on the host, then click the Reports button on the left-hand side.

Created attachment 917977 [details]
/var/log/messages file from a compute node where puppet had to be re-executed.
The issue was reproduced with poodle 2014-07-14.2. Both compute nodes got stuck. Re-running puppet fixes the issue.

Thanks, that log shows me that ceilometer::compute attempts to set a user with a group of nova, which may or may not have been created already. I think that by adding an ordering dependency between ceilometer::compute and nova::init we can alleviate this.

Tried to run 2 deployments simultaneously after the above was applied. The deployment doesn't fail, nor does it finish. Running strace on the puppet process, I see repetitions of this: http://pastebin.test.redhat.com/221451 Perhaps running the 2 deployments at the same time is what's causing this.

Reproduced running a single deployment of Nova network (1 controller + 1 compute). The compute got stuck at 60%. Running puppet manually from the CLI shows this:

    Info: Caching catalog for <nodename>
    Warning: The package type's allow_virtual parameter will be changing its default value from false to true in a future release. If you do not want to allow virtual packages, please explicitly set allow_virtual to false.
    (at /usr/share/ruby/vendor_ruby/puppet/type.rb:816:in `set_default')
    Info: Applying configuration version '1405442586'
    Notice: /Stage[main]/Ceilometer/User[ceilometer]/groups: groups changed 'nobody,ceilometer' to 'ceilometer,nobody,nova'
    Notice: /Stage[main]/Ceilometer/File[/etc/ceilometer/]/owner: owner changed 'root' to 'ceilometer'
    Notice: /Stage[main]/Ceilometer/File[/etc/ceilometer/]/group: group changed 'root' to 'ceilometer'
    Notice: /Stage[main]/Ceilometer/File[/etc/ceilometer/]/mode: mode changed '0755' to '0750'
    Notice: /Stage[main]/Nova::Compute::Libvirt/Service[messagebus]/ensure: ensure changed 'stopped' to 'running'
    Info: /Stage[main]/Nova::Compute::Libvirt/Service[messagebus]: Unscheduling refresh on Service[messagebus]
    Notice: /Stage[main]/Ceilometer/File[/etc/ceilometer/ceilometer.conf]/owner: owner changed 'root' to 'ceilometer'
    Notice: Finished catalog run in 4.43 seconds

Resuming the deployment at this point makes it complete.

OK, cwolfe has added a variation of the fix which works for me: https://github.com/redhat-openstack/astapor/pull/316

Failed QA with rhel-osp-installer-0.1.1-1.el6ost.noarch.

Verified: FailedQA

Had to re-run puppet on the second compute node, and got the following:

    Warning: The package type's allow_virtual parameter will be changing its default value from false to true in a future release. If you do not want to allow virtual packages, please explicitly set allow_virtual to false.
    (at /usr/share/ruby/vendor_ruby/puppet/type.rb:816:in `set_default')
    Info: Applying configuration version '1405959075'
    Notice: /Stage[main]/Nova::Compute::Libvirt/Service[messagebus]/ensure: ensure changed 'stopped' to 'running'
    Info: /Stage[main]/Nova::Compute::Libvirt/Service[messagebus]: Unscheduling refresh on Service[messagebus]
    Notice: Finished catalog run in 6.14 seconds

Then, after resuming the deployment, the installation completed. Attaching the messages file.

Created attachment 919698 [details]
/var/log/messages file from the controller.
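For context, the ordering dependency suggested earlier in this bug (apply nova's init class before ceilometer::compute, so the 'nova' group exists when User[ceilometer] is managed) could be sketched in Puppet roughly as below. This is a minimal illustration only, not the exact patch merged in the astapor pull request; "nova" here is the class defined in puppet-nova's init.pp.

```puppet
# Sketch only, assuming both classes are declared by the compute
# manifest. puppet-nova's 'nova' class creates the 'nova' user and
# group; ceilometer::compute adds the 'ceilometer' user to that group.
include ::nova
include ::ceilometer::compute

# Chaining arrow: make sure Class['nova'] is applied before
# Class['ceilometer::compute'], so the 'nova' group already exists
# when the ceilometer user's groups are set.
Class['::nova'] -> Class['::ceilometer::compute']
```

Without such an ordering, catalog evaluation order is not guaranteed, which matches the intermittent "stuck on first run, fine on re-run" behavior reported above.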
Created attachment 919700 [details]
/var/log/messages file from the compute node.
I'm not sure what to tell you here. I am unable to reproduce, and the output above contains no errors, so I don't see a reason Staypuft would get 'stuck' here. Sasha, can you re-run and get us a system that is "stuck" but has not re-run puppet? I think we need to see the initial failure, if there is one. Thanks...

This happens before puppet is really involved. It looked to me like one of the machines involved was on IDT and the rest were on UTC, which results in the Foreman report page saying the hosts had time drift. Also, the puppet errors were of the following type:

    Execution of '/usr/bin/yum -d 0 -e 0 -y install openstack-nova-compute' returned 1: One of the configured repositories failed (RHOS-5.0 el7), and yum doesn't have enough cached data to continue. At this point the only safe thing yum can do is fail. There are a few ways to work "fix" this:
    1. Contact the upstream for the repository and get them to fix the problem.
    2. Reconfigure the baseurl/etc. for the repository, to point to a working upstream. This is most often useful if you are using a newer distribution release than is supported by the repository (and the packages for the previous distribution release still work).
    3. Disable the repository, so yum won't use it by default. Yum will then just ignore the repository until you permanently enable it again or use --enablerepo for temporary usage:
        yum-config-manager --disable rhelosp-5.0-el7
    4. Configure the failing repository to be skipped, if it is unavailable. Note that yum will try to contact the repo. when it runs most commands, so will have to try and fail each time (and thus. yum will be be much slower). If it is a very temporary problem though, this is often a nice compromise:
        yum-config-manager --save --setopt=rhelosp-5.0-el7.skip_if_unavailable=true
    failure: repodata/52771ac1b6c333e8f628419f67eeaad75cf5a670-filelists.sqlite.bz2 from rhelosp-5.0-el7: [Errno 256] No more mirrors to try.
    http://<some-url>/latest/RH7-RHOS-5.0/x86_64/os/repodata/52771ac1b6c333e8f628419f67eeaad75cf5a670-filelists.sqlite.bz2: [Errno 14] HTTP Error 404 - Not Found

Note that <some-url> is an intentional change by me; the URL that WAS there was correct.

Please retest this with the latest release to make sure it was not a bad puddle or environment issue. If it resurfaces, we can look at which component to assign it to, as it does not appear related to ofi.

Verified: rhel-osp-installer-0.1.6-5.el6ost.noarch. This particular error doesn't reproduce.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2014-1090.html
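As an aside, the skip_if_unavailable workaround suggested in the yum output earlier in this bug can also be expressed declaratively with Puppet's built-in yumrepo type. This is a sketch only; the repo id is taken from the quoted yum output, and whether you want this behavior permanently is a policy decision:

```puppet
# Sketch: mark the repo as skippable so a transient outage makes yum
# skip it instead of aborting the package install mid-catalog-run.
# Mirrors option 4 from the quoted yum error message.
yumrepo { 'rhelosp-5.0-el7':
  skip_if_unavailable => '1',
}
```

Note this only masks repo outages; it would not have helped with the ordering bug that this report was ultimately fixed for.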
Created attachment 916961 [details] production.log from the staypuft host.

Rubygem-Staypuft: Deployment gets paused with error upon installing second compute node.

Environment: rhel-osp-installer-0.1.0-2.el6ost.noarch

Steps to reproduce:
1. Install rhel-osp-installer.
2. Create a deployment of 3 hosts (1 Nova controller + 2 compute).
3. Start the deployment.

Result: The deployment gets paused with error after successfully installing the controller + one compute node. The error is:

    ERF42-0943 [Staypuft::Exception]: Latest Puppet Run Contains Failures for Host: 5