Bug 1356683
Summary: | rhel-osp-director: 8.0->9.0 upgrade, major-upgrade-pacemaker-converge.yaml step fails with Error: /Stage[main]/Cinder::Setup_test_volume/Exec[pvcreate /dev/loop2]: pvcreate /dev/loop2 returned 5 instead of one of [0] | |||
---|---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Alexander Chuzhoy <sasha> | |
Component: | puppet-cinder | Assignee: | Jiri Stransky <jstransk> | |
Status: | CLOSED ERRATA | QA Contact: | Alexander Chuzhoy <sasha> | |
Severity: | low | Docs Contact: | ||
Priority: | low | |||
Version: | 9.0 (Mitaka) | CC: | akaris, cpaquin, dbecker, dhill, dmacpher, ipilcher, jcoufal, jjoyce, jschluet, jslagle, jstransk, mburns, morazi, ohochman, rhel-osp-director-maint, rlopez, sasha, scohen, slinaber, srevivo, tshefi, tvignaud | |
Target Milestone: | beta | Keywords: | Reopened, Triaged | |
Target Release: | 10.0 (Newton) | Flags: | tshefi: automate_bug- | |
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | puppet-cinder-9.2.0-0.20160901081001.f657f9b.el7ost | Doc Type: | Bug Fix | |
Doc Text: |
A race condition existed between loop device configuration and a check for LVM physical volumes on block storage nodes. This caused the major upgrade convergence step to fail because Puppet failed to detect the existing LVM physical volume and attempted to recreate it. This fix waits for udev events to complete after setting up the loop device, which means Puppet waits for the loop device configuration to finish before checking for an existing LVM physical volume. Block storage nodes with LVM backends now upgrade successfully.
|
Story Points: | --- | |
Clone Of: | ||||
: | 1419187 (view as bug list) | Environment: | ||
Last Closed: | 2017-02-03 17:42:12 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1419187, 1523939 |
Description
Alexander Chuzhoy
2016-07-14 17:21:09 UTC
I reran the same command after failure (openstack overcloud deploy --templates --control-scale 3 --compute-scale 2 --ceph-storage-scale 3 --neutron-network-type vxlan --neutron-tunnel-types vxlan --ntp-server 10.5.26.10 --timeout 90 -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e network-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker-converge.yaml) and it completed successfully.

2016-06-29 22:04:23 [NetworkDeployment]: SIGNAL_COMPLETE Unknown
Stack overcloud UPDATE_COMPLETE

Hmm, this seems like quite a random/strange failure to me. I wasn't able to reproduce it. Sasha, did you hit it more than once, or did someone else hit it too?

I only hit it once.

Reproduced it, but noticed that we continued to the major-upgrade-pacemaker-converge step despite the failed major-upgrade-pacemaker.yaml step.

TL;DR: This may not be worth fixing if there's no cleaner way to do it besides `sleep`ing a few seconds.

It might be a race in the puppet module, which will only appear after overcloud nodes are restarted (Sasha mentioned the restart happened in this deployment) and we have lost the /dev/loop2 device that way, so we need Puppet to recreate it. The device actually *is* restored fine, but Puppet attempts to re-run pvcreate even though it's not necessary, probably due to a race on its `unless` check for that pvcreate. Furthermore, within the Newton codebase it would only appear when the LVM storage backend for Cinder is actually being used.

-----

The long version:

The problem is here:
https://github.com/openstack/puppet-cinder/blob/a97128fb2b8c1b6d1fe8cf999c01e0a56403475c/manifests/setup_test_volume.pp#L40-L50

Losetup will successfully revive the /dev/loop2 loopback device, which will also revive the physical volume on it.
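The suspect section of the manifest can be sketched roughly as follows. This is a hedged paraphrase of the linked setup_test_volume.pp code, not a verbatim copy; the `$volume_path`/`$volume_name` variables and the backing-file path are assumptions for illustration:

```puppet
# Sketch (not verbatim) of the racy logic in cinder::setup_test_volume.
exec { 'losetup /dev/loop2':
  path    => ['/bin', '/usr/bin', '/sbin', '/usr/sbin'],
  # Re-attach the loopback device; this also revives the PV on it.
  command => "losetup /dev/loop2 ${volume_path}/${volume_name}",
  unless  => 'losetup /dev/loop2',
}
~> exec { 'pvcreate /dev/loop2':
  path        => ['/bin', '/usr/bin', '/sbin', '/usr/sbin'],
  command     => 'pvcreate /dev/loop2',
  # Racy guard: immediately after losetup, udev may not have finished
  # processing the loop device, so pvdisplay can miss the revived PV;
  # pvcreate then runs against an existing PV and fails with exit 5.
  unless      => "pvdisplay | grep ${volume_name}",
  refreshonly => true,
}
```

The `~>` chain means the pvcreate exec is notified as soon as the losetup exec finishes, so its `unless` check can run before udev has made the revived physical volume visible.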
However, the `unless` condition of `pvdisplay | grep ${volume_name}` on the pvcreate doesn't succeed for some reason, most probably because it runs too early. This means the pvcreate actually runs, but it fails because the physical volume already exists on /dev/loop2:

Setup_test_volume/Exec[pvcreate /dev/loop2]/returns: Can't initialize physical volume "/dev/loop2" of volume group "cinder-volumes" without -ff

When I run the `unless` command manually, I see it succeed:

[root@overcloud-controller-1 ~]# pvdisplay | grep cinder-volumes
  VG Name               cinder-volumes

I even tried to check Puppet's behavior with a minimal reproducing template, in case the refresh would *always* run regardless of the `unless` condition, but that doesn't seem to be the problem:

exec { 'this runs':
  path    => ['/bin','/usr/bin','/sbin','/usr/sbin'],
  command => "echo 'first ran'",
}
~> exec { 'this does not run even though it is notified, because `unless` is true':
  path        => ['/bin','/usr/bin','/sbin','/usr/sbin'],
  command     => "echo 'second ran'",
  unless      => "true",
  refreshonly => true,
}

The second exec never got executed. So this is most likely a race condition indeed.

Re-opening as it reproduced during upgrade 7.3->8.0

(In reply to Alexander Chuzhoy from comment #9)
> Re-opening as it reproduced during upgrade 7.3->8.0

7.3 to 8 is not the same issue. It should be filed as a new bug in OSP 8.

Just a bunch of follow-up info. Giulio alerted me about the existence of `udevadm settle`, which could potentially solve the waiting problem more elegantly than `sleep $a_few_seconds`, so I submitted a patch for it to upstream puppet-cinder:

https://review.openstack.org/#/c/357082

I wasn't able to test it because I haven't hit the issue, so it's a best-effort fix rather than something guaranteed. Still, given the nature of the issue, it is more likely to impact testing environments than cause trouble in production.

*** Bug 1371628 has been marked as a duplicate of this bug. ***
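Sketched in manifest form, the idea of the `udevadm settle` patch is to chain a settle step between the loop-device setup and the pvcreate guard. The resource title below is hypothetical, and this is a sketch of the approach rather than the merged code:

```puppet
# Hedged sketch of the udevadm-settle idea from the review linked above;
# the resource title is hypothetical, not the merged manifest verbatim.
# Chained (~>) between the losetup exec and the pvcreate exec:
exec { 'udev settle for cinder loop device':
  path        => ['/bin', '/usr/bin', '/sbin', '/usr/sbin'],
  # Block until udev has processed all queued events, so the
  # `unless => "pvdisplay | grep ..."` check on pvcreate sees the
  # revived physical volume instead of racing ahead of udev.
  command     => 'udevadm settle',
  refreshonly => true,
}
```

Unlike a fixed `sleep`, `udevadm settle` returns as soon as the udev event queue is empty, so it only waits as long as actually needed.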
Reopening as folks seem to still hit this in testing envs. The fix landed upstream in time to make it into Newton / OSP 10. Given that the bug is expected to affect testing/PoC envs, I'm adjusting the severity/priority and target release; please amend if needed.

The workaround should be to run the converge step a second time using the same command.

This has been built downstream.

Unable to reproduce in the OSP 9 scenario.

(In reply to Jiri Stransky from comment #13)
> Reopening as folks seem to still hit this in testing envs. The fix landed
> upstream in time to make it into Newton / OSP 10. Given that the bug is
> expected to affect testing/PoC envs, i'm adjusting the severity/priority and
> target release, please amend if needed.
>
> The workaround should be to run the converge step for 2nd time using the
> same command.

Verified with: puppet-cinder-9.4.1-2.el7ost.noarch

We cannot reproduce in the OSP 9 to OSP 10 upgrade scenario.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2948.html

I just ran into this same issue with the OSP 8 to OSP 9 upgrade. It failed on the final step of the upgrade:

-e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker-converge.yaml

It failed with the exact error:

/Stage[main]/Cinder::Setup_test_volume/Exec[pvcreate /dev/loop2]: pvcreate /dev/loop2 returned 5 instead of one of [0]

Per comment three (https://bugzilla.redhat.com/show_bug.cgi?id=1356683#c3), after attempting the deploy again, deployment is successful:

Overcloud Deployed
clean_up DeployOvercloud:
END return value: 0

That error message makes me think that this is the LVM backend, which is unsupported AFAIK (and I can't imagine that Verizon is using it).
Am I missing something?

(In reply to Ian Pilcher from comment #23)
> That error message makes me think that this is the LVM backend, which is
> unsupported AFAIK (and I can't imagine that Verizon is using it). Am I
> missing something?

We ran into this error after deploying a test OSP 8 environment (and attempting to upgrade to OSP 9) without any storage configuration -- accepting the default config -- as we are at this point testing the generic procedure as documented. If this is not something that we are going to run into in REAL environments, that is good. However, that being said, it's still an error that we ran into.

We're hitting this issue now.

I am reopening this bug. We ran into it in a customer deployment in the OSP 8 to 9 upgrade. Feel free to defer this to another BZ, but this needs to be addressed by a patch or by documentation.

(In reply to David Hill from comment #26)
> We're hitting this issue now.

What Cinder backend are you using?

This bug is against OSP 10. Seeing the issue in OSP 8/OSP 9 should not result in re-opening this bug. Please clone the bug to OSP 8 and/or 9 to track an issue in that release. This bug is being re-closed.

As a separate note: in general, please don't ever reopen a bug that has been closed Errata. Due to certain internal process constraints, a bug that has been Closed Errata cannot be reused to fix an additional bug or to reopen an issue that might not be fixed or was incompletely fixed. The correct path is to clone the bug and use the new bug to track the issue. Thanks