rhel-osp-director: 8.0 -> 9.0 upgrade, major-upgrade-pacemaker-converge.yaml step fails with:

Error: /Stage[main]/Cinder::Setup_test_volume/Exec[pvcreate /dev/loop2]: pvcreate /dev/loop2 returned 5 instead of one of [0]

Environment:
[stack@undercloud72 ~]$ rpm -qa | grep -e undercloud -e openstack-puppet -e openstack-tripleo-heat
openstack-tripleo-heat-templates-2.0.0-15.el7ost.noarch
openstack-tripleo-heat-templates-liberty-2.0.0-15.el7ost.noarch
openstack-tripleo-heat-templates-kilo-2.0.0-15.el7ost.noarch
instack-undercloud-4.0.0-7.el7ost.noarch
openstack-puppet-modules-8.1.2-1.el7ost.noarch

Steps to reproduce:
1. Deploy 8.0 with:
   openstack overcloud deploy --templates --control-scale 3 --compute-scale 2 --ceph-storage-scale 3 --neutron-network-type vxlan --neutron-tunnel-types vxlan --ntp-server 10.5.26.10 --timeout 90 -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e network-environment.yaml
2. Upgrade the undercloud.
3. Upgrade the overcloud and reach the converge step:
   openstack overcloud deploy --templates --control-scale 3 --compute-scale 2 --ceph-storage-scale 3 --neutron-network-type vxlan --neutron-tunnel-types vxlan --ntp-server 10.5.26.10 --timeout 90 -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e network-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker-converge.yaml

Result:
2016-06-29 20:03:51 [ControllerNodesPostDeployment]: CREATE_IN_PROGRESS state changed
2016-06-29 20:04:05 [ObjectStorageNodesPostDeployment]: CREATE_COMPLETE state changed
2016-06-29 20:05:48 [ComputeNodesPostDeployment]: CREATE_COMPLETE state changed
2016-06-29 20:15:11 [ControllerNodesPostDeployment]: CREATE_FAILED Error: resources.ControllerNodesPostDeployment.resources.ControllerOvercloudServicesDeployment_Step4.resources[1]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 6
2016-06-29 20:15:13 [overcloud]: UPDATE_FAILED Error: resources.ControllerNodesPostDeployment.resources.ControllerOvercloudServicesDeployment_Step4.resources[1]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 6

Debugging the failed deployment shows:

Notice: Finished catalog run in 78.15 seconds
", "deploy_stderr": "
Warning: Scope(Class[Mongodb::Server]): Replset specified, but no replset_members or replset_config provided.
Warning: Scope(Class[Swift]): swift_hash_suffix has been deprecated and should be replaced with swift_hash_path_suffix, this will be removed as part of the N-cycle
Warning: Scope(Class[Keystone]): Execution of db_sync does not depend on $enabled anymore. Please use sync_db instead.
Warning: Scope(Class[Glance::Api]): The known_stores parameter is deprecated, use stores instead
Warning: Scope(Class[Glance::Api]): default_store not provided, it will be automatically set to glance.store.http.Store
Warning: Scope(Class[Glance::Registry]): Execution of db_sync does not depend on $manage_service or $enabled anymore. Please use sync_db instead.
Warning: Scope(Class[Nova::Api]): ec2_listen_port, ec2_workers and keystone_ec2_url are deprecated and have no effect. Deploy openstack/ec2-api instead.
Warning: Scope(Class[Nova::Vncproxy::Common]): Could not look up qualified variable '::nova::compute::vncproxy_host'; class ::nova::compute has not been evaluated
Warning: Scope(Class[Nova::Vncproxy::Common]): Could not look up qualified variable '::nova::compute::vncproxy_protocol'; class ::nova::compute has not been evaluated
Warning: Scope(Class[Nova::Vncproxy::Common]): Could not look up qualified variable '::nova::compute::vncproxy_port'; class ::nova::compute has not been evaluated
Warning: Scope(Class[Nova::Vncproxy::Common]): Could not look up qualified variable '::nova::compute::vncproxy_path'; class ::nova::compute has not been evaluated
Warning: Scope(Class[Neutron]): The neutron::network_device_mtu parameter is deprecated, use neutron::global_physnet_mtu instead.
Warning: Scope(Class[Neutron::Server]): identity_uri, auth_tenant, auth_user, auth_password, auth_region configuration options are deprecated in favor of auth_plugin and related options
Warning: Scope(Class[Neutron::Agents::Dhcp]): The dhcp_delete_namespaces parameter was removed in Mitaka, it does not take any affect
Warning: Scope(Class[Neutron::Agents::L3]): parameter external_network_bridge is deprecated
Warning: Scope(Class[Neutron::Agents::L3]): parameter router_delete_namespaces was removed in Mitaka, it does not take any affect
Warning: Scope(Class[Neutron::Agents::Metadata]): The auth_password parameter is deprecated and was removed in Mitaka release.
Warning: Scope(Class[Neutron::Agents::Metadata]): The auth_tenant parameter is deprecated and was removed in Mitaka release.
Warning: Scope(Class[Neutron::Agents::Metadata]): The auth_url parameter is deprecated and was removed in Mitaka release.
Warning: Scope(Class[Ceilometer::Api]): The keystone_auth_uri parameter is deprecated. Please use auth_uri instead.
Warning: Scope(Class[Ceilometer::Api]): The keystone_identity_uri parameter is deprecated. Please use identity_uri instead.
Warning: Scope(Class[Heat]): \"admin_user\", \"admin_password\", \"admin_tenant_name\" configuration options are deprecated in favor of auth_plugin and related options
Warning: You cannot collect exported resources without storeconfigs being set; the collection will be ignored on line 123 in file /etc/puppet/modules/gnocchi/manifests/api.pp
Warning: Not collecting exported resources without storeconfigs
Warning: Not collecting exported resources without storeconfigs
Warning: Scope(Haproxy::Config[haproxy]): haproxy: The $merge_options parameter will default to true in the next major release. Please review the documentation regarding the implications.
Warning: Not collecting exported resources without storeconfigs
Warning: Not collecting exported resources without storeconfigs
Warning: Not collecting exported resources without storeconfigs
Error: /Stage[main]/Cinder::Setup_test_volume/Exec[pvcreate /dev/loop2]: Failed to call refresh: pvcreate /dev/loop2 returned 5 instead of one of [0]
Error: /Stage[main]/Cinder::Setup_test_volume/Exec[pvcreate /dev/loop2]: pvcreate /dev/loop2 returned 5 instead of one of [0]
",
"deploy_status_code": 6
},
"creation_time": "2016-06-29T20:11:09",
"updated_time": "2016-06-29T20:14:13",
"input_values": {
  "step": 3,
  "update_identifier": {
    "deployment_identifier": 1467229708,
    "controller_config": {
      "1": "os-apply-config deployment 1a6b5a78-4d70-4297-ae0d-0fd01ee41f46 completed,Root CA cert injection not enabled.,TLS not enabled.,None,",
      "0": "os-apply-config deployment 07e7ca58-8d2d-41f6-982d-75d81cc7dd2c completed,Root CA cert injection not enabled.,TLS not enabled.,None,",
      "2": "os-apply-config deployment 260f9a0b-8700-4d0a-9220-1ff8940fffaf completed,Root CA cert injection not enabled.,TLS not enabled.,None,"
    },
    "allnodes_extra": "none"
  }
},
"action": "CREATE",
"status_reason": "deploy_status_code : Deployment exited with non-zero status code: 6",
"id": "7ea51ee9-9e03-4e14-9da6-5ca9f9cf50ef"
}
I reran the same command after the failure:

openstack overcloud deploy --templates --control-scale 3 --compute-scale 2 --ceph-storage-scale 3 --neutron-network-type vxlan --neutron-tunnel-types vxlan --ntp-server 10.5.26.10 --timeout 90 -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e network-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker-converge.yaml

and it completed successfully:

2016-06-29 22:04:23 [NetworkDeployment]: SIGNAL_COMPLETE Unknown
Stack overcloud UPDATE_COMPLETE
Hmm this seems like quite a random/strange failure to me. I wasn't able to reproduce this. Sasha, did you hit it more than once, or did someone else hit it too?
I only hit it once.
Reproduced it, but noticed that we continued to the major-upgrade-pacemaker-converge step despite the failed major-upgrade-pacemaker.yaml step.
TL;DR: This may not be worth fixing if there's no cleaner way to do it than `sleep`ing a few seconds. It looks like a race in the puppet module, which only appears after overcloud nodes are restarted (Sasha mentioned a restart happened in this deployment) and the /dev/loop2 device is lost that way, so we need Puppet to recreate it. The device actually *is* restored fine, but Puppet attempts to re-run pvcreate even though it isn't necessary, probably due to a race on its `unless` check for that pvcreate. Furthermore, within the Newton codebase this would only appear when the LVM storage backend for Cinder is actually being used.

-----

The long version:

The problem is here:
https://github.com/openstack/puppet-cinder/blob/a97128fb2b8c1b6d1fe8cf999c01e0a56403475c/manifests/setup_test_volume.pp#L40-L50

losetup will successfully revive the /dev/loop2 loopback device, which also revives the physical volume on it. However, the `unless` condition of `pvdisplay | grep ${volume_name}` on the pvcreate doesn't succeed for some reason, most probably because it runs too early. That means the pvcreate actually runs, and it fails because the physical volume already exists on /dev/loop2.
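The chain of Execs described above can be paraphrased in shell (a hedged sketch: the real resources live in the Puppet manifest linked above, and the `guard` helper plus the exact command strings and backing-file path here are illustrative, not copied from it):

```shell
# Stand-in for Puppet's `unless` semantics: run the probe first and only
# execute the command when the probe fails.
guard() {
  probe=$1; cmd=$2
  if eval "$probe"; then
    return 0        # `unless` is true -> command is skipped
  fi
  eval "$cmd"       # `unless` is false -> command runs
}

# Hedged paraphrase of the setup_test_volume resource chain.
setup_test_volume() {
  guard "losetup /dev/loop2 >/dev/null 2>&1" \
        "losetup /dev/loop2 /var/lib/cinder/cinder-volumes"
  # The race: this probe fires immediately after losetup revives the
  # device; if the PV metadata is not visible yet, the guard falls
  # through to pvcreate, which exits 5 because the PV already exists.
  guard "pvdisplay | grep -q cinder-volumes" \
        "pvcreate /dev/loop2"
}
```

With a probe that momentarily misses the revived PV, the second guard runs pvcreate against an existing physical volume, which is exactly the exit-code-5 failure in the report.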
Setup_test_volume/Exec[pvcreate /dev/loop2]/returns: Can't initialize physical volume "/dev/loop2" of volume group "cinder-volumes" without -ff

When I run the `unless` command manually, I see it succeed:

[root@overcloud-controller-1 ~]# pvdisplay | grep cinder-volumes
  VG Name               cinder-volumes

I even tried to check Puppet's behavior with a minimal reproducing template, in case the refresh would *always* run regardless of the `unless` condition, but that doesn't seem to be the problem:

exec { 'this runs':
  path    => ['/bin','/usr/bin','/sbin','/usr/sbin'],
  command => "echo 'first ran'",
}
~> exec { 'this does not run even though it is notified, because `unless` is true':
  path        => ['/bin','/usr/bin','/sbin','/usr/sbin'],
  command     => "echo 'second ran'",
  unless      => "true",
  refreshonly => true,
}

The second exec never got executed, so this is most likely a race condition indeed.
Re-opening as it reproduced during upgrade 7.3->8.0
(In reply to Alexander Chuzhoy from comment #9)
> Re-opening as it reproduced during upgrade 7.3->8.0

7.3 to 8 is not the same issue. It should be filed as a new bug against OSP 8.
Just a bunch of follow-up info. Giulio alerted me to the existence of `udevadm settle`, which could potentially solve the waiting problem more elegantly than `sleep $a_few_seconds`, so I submitted a patch for it to upstream puppet-cinder:

https://review.openstack.org/#/c/357082

I wasn't able to test it because I haven't hit the issue, so it's a best-effort fix rather than something guaranteed. Still, given the nature of the issue, it is more likely to impact testing environments than cause trouble in production.
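In shell terms, the ordering the patch aims for looks roughly like this (a best-effort sketch of the idea, not the literal Puppet change under review; the helper name and defaults are mine):

```shell
# Revive the Cinder loopback PV, but drain the udev event queue before
# probing, so the `unless`-style check sees the freshly restored device.
revive_cinder_pv() {
  dev=${1:-/dev/loop2}
  vg=${2:-cinder-volumes}
  udevadm settle                  # block until pending udev events are processed
  if pvdisplay | grep -q "$vg"; then
    echo "PV for $vg already present; skipping pvcreate"
  else
    pvcreate "$dev"
  fi
}
```

With the settle in place, the pvdisplay probe should no longer race the device's reappearance, and pvcreate only runs when the PV is genuinely absent.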
*** Bug 1371628 has been marked as a duplicate of this bug. ***
Reopening as folks seem to still hit this in testing envs. The fix landed upstream in time to make it into Newton / OSP 10. Given that the bug is expected to affect testing/PoC envs, I'm adjusting the severity/priority and target release; please amend if needed.

The workaround should be to run the converge step a second time using the same command.
This has been built downstream.
Unable to reproduce in the OSP 9 scenario.

(In reply to Jiri Stransky from comment #13)
> Reopening as folks seem to still hit this in testing envs. The fix landed
> upstream in time to make it into Newton / OSP 10. Given that the bug is
> expected to affect testing/PoC envs, i'm adjusting the severity/priority and
> target release, please amend if needed.
>
> The workaround should be to run the converge step for 2nd time using the
> same command.

Verified with: puppet-cinder-9.4.1-2.el7ost.noarch

We cannot reproduce this in the OSP 9 to OSP 10 upgrade scenario.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-2948.html
I just ran into this same issue with the OSP 8 to OSP 9 upgrade. It failed on the final step of the upgrade (-e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker-converge.yaml) with the exact error:

/Stage[main]/Cinder::Setup_test_volume/Exec[pvcreate /dev/loop2]: pvcreate /dev/loop2 returned 5 instead of one of [0]

Per comment 3 (https://bugzilla.redhat.com/show_bug.cgi?id=1356683#c3), after attempting the deploy again, the deployment is successful:

Overcloud Deployed
clean_up DeployOvercloud:
END return value: 0
That error message makes me think that this is the LVM backend, which is unsupported AFAIK (and I can't imagine that Verizon is using it). Am I missing something?
(In reply to Ian Pilcher from comment #23)
> That error message makes me think that this is the LVM backend, which is
> unsupported AFAIK (and I can't imagine that Verizon is using it). Am I
> missing something?

We ran into this error after deploying a test OSP 8 environment (and attempting to upgrade it to OSP 9) without any storage configuration, i.e. accepting the default config, as we are at this point testing the generic procedure as documented. If this is not something we are going to run into in real environments, that is good. That said, it is still an error that we ran into.
We're hitting this issue now.
I am reopening this bug. We ran into it in a customer deployment in the OSP 8 to 9 upgrade. Feel free to defer this to another BZ, but this needs to be addressed by a patch or by documentation.
(In reply to David Hill from comment #26)
> We're hitting this issue now.

What Cinder backend are you using?
This bug is against OSP 10. Seeing the issue in OSP 8/OSP 9 should not result in re-opening this bug. Please clone the bug to OSP 8 and/or 9 to track the issue in that release. This bug is being re-closed.

As a separate note: in general, please don't ever reopen a bug that has been closed Errata. Due to certain internal process constraints, a bug that has been Closed Errata cannot be reused to fix an additional bug or reopen an issue that might not be fixed or incompletely fixed. The correct path is to clone the bug and use the new bug to track the issue. Thanks