Bug 1941954

Summary: SolidFire driver can fail to clone due to timeout
Product: Red Hat OpenStack
Component: openstack-cinder
Version: 16.2 (Train)
Reporter: Pablo Caruana <pcaruana>
Assignee: Pablo Caruana <pcaruana>
QA Contact: Tzach Shefi <tshefi>
Docs Contact: RHOS Documentation Team <rhos-docs>
CC: gfidente, jamsmith, ltoscano, pcaruana, sfernand, tshefi
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Target Milestone: ga
Target Release: 16.2 (Train on RHEL 8.4)
Keywords: OtherQA, Triaged, ZStream
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-cinder-15.5.0-2.20210409044947.a75f863.el8ost
Clone Of: 1888469
Last Closed: 2021-09-15 07:13:08 UTC
Bug Blocks: 1888469, 1939394, 1941957

Description Pablo Caruana 2021-03-23 09:45:27 UTC
+++ This bug was initially created as a clone of Bug #1888469 +++

Hi folks, we have some customers experiencing timeout issues during cloning operations, and occasionally during API calls, when using the SolidFire driver in OSP 13 (Queens), mostly when dealing with very large volumes or poor network performance. The current timeout for API calls is too small for certain environments, and the cloning operation also has a hardcoded timeout that doesn't work for all customers.

I've submitted a patch upstream (not merged yet) to address this issue by adding two parameters to cinder.conf that allow users to configure timeout values appropriately for their environment.

I expect to have this fix backported to stable/queens soon, and my understanding is that the safest approach for customers to get this fix is through an OSP 13 update. Could you folks evaluate the possibility of including this fix in the next release cycle?
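For illustration, with the two new parameters an operator would tune the timeouts in the SolidFire backend section of cinder.conf roughly like this (option names are taken from the upstream review and may differ in the final backport; the values shown are examples, not recommendations):

```ini
[solidfire-backend]
volume_driver = cinder.volume.drivers.solidfire.SolidFireDriver
# Example: raise the per-request API timeout for slow networks
# (assumed option name from the upstream patch)
sf_api_request_timeout = 60
# Example: allow more time for cloning multi-terabyte volumes
# (assumed option name from the upstream patch)
sf_volume_clone_timeout = 1200
```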

See below the Launchpad bug description:

When cloning a volume in solidfire.py, the "_get_model_info" method uses a hardcoded retry count of 600. Customers are facing timeout issues when volumes are too big (i.e., multi-terabyte volumes), due to poor networks, or due to upgrade issues involving the ElementOS cluster. A viable solution is to make this value configurable in cinder.conf, allowing users to configure it appropriately for their environment.
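The shape of the fix can be sketched as follows. This is a minimal illustration of replacing a hardcoded retry count with a configurable one, not the actual SolidFire driver code; the function and parameter names here are hypothetical.

```python
import time


def wait_for_clone(check_ready, retry_count=600, interval=0.0):
    """Poll check_ready() up to retry_count times.

    retry_count mirrors the hardcoded 600 in _get_model_info; the patch's
    idea is to source an equivalent value from a cinder.conf option so
    operators can raise it for multi-terabyte clones or slow networks.
    Returns True if the clone became ready within the retry budget.
    """
    for _attempt in range(retry_count):
        if check_ready():
            return True
        time.sleep(interval)  # back off between polls
    return False
```

With the count exposed as a config option, an environment that times out at 600 retries can simply raise the value instead of patching the driver.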


Upstream patch:
https://review.opendev.org/#/c/756130/


Launchpad issue:
https://bugs.launchpad.net/cinder/+bug/1898587

Comment 9 errata-xmlrpc 2021-09-15 07:13:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:3483