Bug 1888469

Summary: SolidFire driver can fail to clone due to timeout
Product: Red Hat OpenStack
Reporter: Fernando Ferraz <sfernand>
Component: openstack-cinder
Assignee: Pablo Caruana <pcaruana>
Status: CLOSED ERRATA
QA Contact: Tzach Shefi <tshefi>
Severity: medium
Docs Contact: Andy Stillman <astillma>
Priority: medium
Version: 13.0 (Queens)
CC: abishop, gfidente, gregraka, pcaruana, slinaber, spower
Target Milestone: z16
Keywords: OtherQA, Triaged, ZStream
Target Release: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-cinder-12.0.10-23.el7ost
Doc Type: Bug Fix
Doc Text:
Before this update, users experienced timeouts in certain environments, mostly when volumes were very large. These multi-terabyte volumes often experienced poor network performance or upgrade issues involving the SolidFire cluster. With this update, two timeout settings have been added to the SolidFire driver so that users can set timeouts appropriate to their environment.
Clones: 1941954 (view as bug list)
Last Closed: 2021-06-16 10:58:54 UTC
Type: Bug
Bug Depends On: 1941954, 1941957

Description Fernando Ferraz 2020-10-14 23:13:20 UTC
Hi folks, we have some customers experiencing timeout issues in cloning operations, and occasionally during API calls, when using the SolidFire driver in OSP 13 (Queens), mostly when dealing with very large volumes or poor network performance. The current timeout for API calls is too short for certain environments, and the cloning operation also has a hardcoded timeout that doesn't work for all customers.

I've submitted a patch upstream (not merged yet) to address this issue by adding two parameters to cinder.conf that let users configure timeout values appropriate to their environment.
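For illustration, here is a sketch of how the new options might look in cinder.conf once the patch lands. The option names and defaults below follow the proposed patch but should be treated as tentative until it merges:

[solidfire]
volume_driver = cinder.volume.drivers.solidfire.SolidFireDriver
# Tentative option names from the proposed (unmerged) patch.
# Seconds to wait for a SolidFire API request to complete.
sf_api_request_timeout = 30
# Seconds to wait for a volume or snapshot clone to complete
# (replaces the hardcoded retry count of 600).
sf_volume_clone_timeout = 600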

I expect this fix to be backported to stable/queens soon, and my understanding is that the safest way for customers to get it is through an OSP 13 update. Could you folks evaluate the possibility of including this fix in the next release cycle?

See the Launchpad bug description below:

When cloning a volume, the _get_model_info method in solidfire.py has a hardcoded retry count of 600. Customers are facing timeout issues when volumes are very large (i.e., multi-terabyte volumes), due to poor networks or upgrade issues involving the ElementOS cluster. A viable solution is to make this value configurable in cinder.conf, allowing users to configure it appropriately for their environment.
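As a rough sketch (not the actual driver code) of what the fix looks like: the hardcoded count becomes a value read from configuration. The option name and the lookup helper below are assumptions for illustration only:

import time

from oslo_config import cfg

CONF = cfg.CONF
CONF.register_opts([
    cfg.IntOpt('sf_volume_clone_timeout',  # assumed option name
               default=600,
               help='Seconds to wait for a clone to complete '
                    '(previously a hardcoded retry count of 600).'),
])


def _get_model_info(get_cloned_volume, sfaccount, sf_volume_id):
    # Poll the cluster once per second until the cloned volume
    # appears, giving up after the configured number of retries.
    # get_cloned_volume is a hypothetical stand-in for the driver's
    # real lookup call.
    retries = CONF.sf_volume_clone_timeout
    while retries > 0:
        volume = get_cloned_volume(sfaccount, sf_volume_id)
        if volume is not None:
            return volume
        time.sleep(1)
        retries -= 1
    return None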


Upstream patch:
https://review.opendev.org/#/c/756130/


Launchpad issue:
https://bugs.launchpad.net/cinder/+bug/1898587

Comment 1 Luigi Toscano 2020-10-15 15:49:27 UTC
Thanks, please continue working on the fix and the upstream backport, and please see my answer in https://bugzilla.redhat.com/show_bug.cgi?id=1888417#c1 .

Comment 23 errata-xmlrpc 2021-06-16 10:58:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 13.0 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2385