Bug 1356683
Summary: | rhel-osp-director: 8.0->9.0 upgrade, major-upgrade-pacemaker-converge.yaml step fails with Error: /Stage[main]/Cinder::Setup_test_volume/Exec[pvcreate /dev/loop2]: pvcreate /dev/loop2 returned 5 instead of one of [0] | |||
---|---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Alexander Chuzhoy <sasha> | |
Component: | puppet-cinder | Assignee: | Jiri Stransky <jstransk> | |
Status: | CLOSED ERRATA | QA Contact: | Alexander Chuzhoy <sasha> | |
Severity: | low | Docs Contact: | ||
Priority: | low | |||
Version: | 9.0 (Mitaka) | CC: | akaris, cpaquin, dbecker, dhill, dmacpher, ipilcher, jcoufal, jjoyce, jschluet, jslagle, jstransk, mburns, morazi, ohochman, rhel-osp-director-maint, rlopez, sasha, scohen, slinaber, srevivo, tshefi, tvignaud | |
Target Milestone: | beta | Keywords: | Reopened, Triaged | |
Target Release: | 10.0 (Newton) | Flags: | tshefi: automate_bug- | |
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | puppet-cinder-9.2.0-0.20160901081001.f657f9b.el7ost | Doc Type: | Bug Fix | |
Doc Text: |
A race condition existed between loop device configuration and a check for LVM physical volumes on block storage nodes. This caused the major upgrade convergence step to fail because Puppet failed to detect the existing LVM physical volume and attempted to recreate it. This fix waits for udev events to complete after setting up the loop device, which means Puppet waits for the loop device configuration to finish before checking for an existing LVM physical volume. Block storage nodes with LVM backends now upgrade successfully.
|
Story Points: | --- | |
Clone Of: | ||||
: | 1419187 (view as bug list) | Environment: | ||
Last Closed: | 2017-02-03 17:42:12 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1419187, 1523939 |
Description
Alexander Chuzhoy
2016-07-14 17:21:09 UTC
I reran the same command after failure (openstack overcloud deploy --templates --control-scale 3 --compute-scale 2 --ceph-storage-scale 3 --neutron-network-type vxlan --neutron-tunnel-types vxlan --ntp-server 10.5.26.10 --timeout 90 -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e network-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker-converge.yaml) and it completed successfully.

2016-06-29 22:04:23 [NetworkDeployment]: SIGNAL_COMPLETE Unknown
Stack overcloud UPDATE_COMPLETE

Hmm, this seems like quite a random/strange failure to me. I wasn't able to reproduce it. Sasha, did you hit it more than once, or did someone else hit it too?

I only hit it once.

Reproduced it, but noticed that we continued to the major-upgrade-pacemaker-converge step despite the failed major-upgrade-pacemaker.yaml step.

TL;DR: This may not be worth fixing if there's no cleaner way to do it besides `sleep`ing a few seconds.

It might be a race in the puppet module, which will only appear after overcloud nodes are restarted (Sasha mentioned the restart happened in this deployment) and we have lost the /dev/loop2 device that way, so we need Puppet to recreate it. The device actually *is* restored fine, but Puppet attempts to re-run pvcreate even though it's not necessary, probably due to a race on its `unless` check for that pvcreate. Furthermore, within the Newton codebase it would only appear when the LVM storage backend for Cinder is actually being used.

-----

The long version:

The problem is here:
https://github.com/openstack/puppet-cinder/blob/a97128fb2b8c1b6d1fe8cf999c01e0a56403475c/manifests/setup_test_volume.pp#L40-L50

Losetup will successfully revive the /dev/loop2 loopback device, which will also revive the physical volume on it.
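The suspect section of the manifest can be sketched roughly as follows. This is a hedged paraphrase of the linked setup_test_volume.pp code, not a verbatim copy; the `$volume_path`/`$volume_name` variables and the backing-file path are assumptions for illustration:

```puppet
# Sketch (not verbatim) of the racy logic in cinder::setup_test_volume.
exec { 'losetup /dev/loop2':
  path    => ['/bin', '/usr/bin', '/sbin', '/usr/sbin'],
  # Re-attach the loopback device; this also revives the PV on it.
  command => "losetup /dev/loop2 ${volume_path}/${volume_name}",
  unless  => 'losetup /dev/loop2',
}
~> exec { 'pvcreate /dev/loop2':
  path        => ['/bin', '/usr/bin', '/sbin', '/usr/sbin'],
  command     => 'pvcreate /dev/loop2',
  # Racy guard: immediately after losetup, udev may not have finished
  # processing the loop device, so pvdisplay can miss the revived PV;
  # pvcreate then runs against an existing PV and fails with exit 5.
  unless      => "pvdisplay | grep ${volume_name}",
  refreshonly => true,
}
```

The `~>` chain means the pvcreate exec is notified as soon as the losetup exec finishes, so its `unless` check can run before udev has made the revived physical volume visible.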
However, the `unless` condition of `pvdisplay | grep ${volume_name}` on the pvcreate doesn't succeed for some reason, most probably because it runs too early. This means the pvcreate actually runs, but it fails because the physical volume already exists on /dev/loop2:

Setup_test_volume/Exec[pvcreate /dev/loop2]/returns: Can't initialize physical volume "/dev/loop2" of volume group "cinder-volumes" without -ff

When I run the `unless` command manually, I see it succeed:

[root@overcloud-controller-1 ~]# pvdisplay | grep cinder-volumes
  VG Name               cinder-volumes

I even tried to check Puppet's behavior with a minimal reproducing template, in case the refresh would *always* run regardless of the `unless` condition, but that doesn't seem to be the problem:

exec { 'this runs':
  path    => ['/bin','/usr/bin','/sbin','/usr/sbin'],
  command => "echo 'first ran'",
}
~> exec { 'this does not run even though it is notified, because `unless` is true':
  path        => ['/bin','/usr/bin','/sbin','/usr/sbin'],
  command     => "echo 'second ran'",
  unless      => "true",
  refreshonly => true,
}

The second exec never got executed. So this is most likely a race condition indeed.

Re-opening as it reproduced during upgrade 7.3->8.0

(In reply to Alexander Chuzhoy from comment #9)
> Re-opening as it reproduced during upgrade 7.3->8.0

7.3 to 8 is not the same issue. It should be filed as a new bug in OSP 8.

Just a bunch of follow-up info. Giulio alerted me about the existence of `udevadm settle`, which could potentially solve the waiting problem more elegantly than `sleep $a_few_seconds`, so I submitted a patch for it to upstream puppet-cinder:

https://review.openstack.org/#/c/357082

I wasn't able to test it because I haven't hit the issue, so it's a best-effort fix rather than something guaranteed. Still, given the nature of the issue, it is more likely to impact testing environments than cause trouble in production.

*** Bug 1371628 has been marked as a duplicate of this bug. ***
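Sketched in manifest form, the idea of the `udevadm settle` patch is to chain a settle step between the loop-device setup and the pvcreate guard. The resource title below is hypothetical, and this is a sketch of the approach rather than the merged code:

```puppet
# Hedged sketch of the udevadm-settle idea from the review linked above;
# the resource title is hypothetical, not the merged manifest verbatim.
# Chained (~>) between the losetup exec and the pvcreate exec:
exec { 'udev settle for cinder loop device':
  path        => ['/bin', '/usr/bin', '/sbin', '/usr/sbin'],
  # Block until udev has processed all queued events, so the
  # `unless => "pvdisplay | grep ..."` check on pvcreate sees the
  # revived physical volume instead of racing ahead of udev.
  command     => 'udevadm settle',
  refreshonly => true,
}
```

Unlike a fixed `sleep`, `udevadm settle` returns as soon as the udev event queue is empty, so it only waits as long as actually needed.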
Reopening as folks seem to still hit this in testing envs. The fix landed upstream in time to make it into Newton / OSP 10. Given that the bug is expected to affect testing/PoC envs, I'm adjusting the severity/priority and target release; please amend if needed.

The workaround should be to run the converge step a second time using the same command.

This has been built downstream.

Unable to reproduce in the OSP 9 scenario.

(In reply to Jiri Stransky from comment #13)
> Reopening as folks seem to still hit this in testing envs. The fix landed
> upstream in time to make it into Newton / OSP 10. Given that the bug is
> expected to affect testing/PoC envs, i'm adjusting the severity/priority and
> target release, please amend if needed.
>
> The workaround should be to run the converge step for 2nd time using the
> same command.

Verified with: puppet-cinder-9.4.1-2.el7ost.noarch

We cannot reproduce in the OSP 9 to OSP 10 upgrade scenario.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2948.html

I just ran into this same issue with the OSP 8 to OSP 9 upgrade. It failed on the final step of the upgrade:

-e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker-converge.yaml

It failed with the exact error:

/Stage[main]/Cinder::Setup_test_volume/Exec[pvcreate /dev/loop2]: pvcreate /dev/loop2 returned 5 instead of one of [0]

Per comment three (https://bugzilla.redhat.com/show_bug.cgi?id=1356683#c3), after attempting the deploy again, deployment is successful:

Overcloud Deployed
clean_up DeployOvercloud:
END return value: 0

That error message makes me think that this is the LVM backend, which is unsupported AFAIK (and I can't imagine that Verizon is using it).
Am I missing something?

(In reply to Ian Pilcher from comment #23)
> That error message makes me think that this is the LVM backend, which is
> unsupported AFAIK (and I can't imagine that Verizon is using it). Am I
> missing something?

We ran into this error after deploying a test OSP 8 environment (and attempting to upgrade to OSP 9) without any storage configuration -- accepting the default config -- as we are at this point testing the generic procedure as documented. If this is not something that we are going to run into in REAL environments, that is good. However, that being said, it's still an error that we ran into.

We're hitting this issue now.

I am reopening this bug. We ran into it in a customer deployment in the OSP 8 to 9 upgrade. Feel free to defer this to another BZ, but this needs to be addressed by a patch or by documentation.

(In reply to David Hill from comment #26)
> We're hitting this issue now.

What Cinder backend are you using?

This bug is against OSP 10. Seeing the issue in OSP 8/OSP 9 should not result in re-opening this bug. Please clone the bug to OSP 8 and/or 9 to track an issue in that release. This bug is being re-closed.

As a separate note: in general, please don't ever reopen a bug that has been closed Errata. Due to certain internal process constraints, a bug that has been Closed Errata cannot be reused to fix an additional bug or to reopen an issue that might not be fixed or was incompletely fixed. The correct path is to clone the bug and use the new bug to track the issue. Thanks