Bug 1888469 - SolidFire driver can fail to clone due to timeout
Summary: SolidFire driver can fail to clone due to timeout
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-cinder
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: z16
Target Release: 13.0 (Queens)
Assignee: Pablo Caruana
QA Contact: Tzach Shefi
Docs Contact: Andy Stillman
URL:
Whiteboard:
Depends On: 1941954 1941957
Blocks:
 
Reported: 2020-10-14 23:13 UTC by Fernando Ferraz
Modified: 2022-09-05 13:31 UTC
6 users

Fixed In Version: openstack-cinder-12.0.10-23.el7ost
Doc Type: Bug Fix
Doc Text:
Before this update, users experienced timeouts in certain environments, mostly with very large (multi-terabyte) volumes, poor network performance, or upgrade issues involving the SolidFire cluster. With this update, two timeout settings have been added to the SolidFire driver so that users can set timeouts appropriate for their environment.
Clone Of:
: 1941954
Environment:
Last Closed: 2021-06-16 10:58:54 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Launchpad 1898587 0 None None None 2020-10-14 23:13:19 UTC
OpenStack gerrit 756130 0 None MERGED NetApp SolidFire: Fix clone and request timeout issues 2021-03-30 10:00:27 UTC
Red Hat Issue Tracker OSP-459 0 None None None 2022-09-05 13:31:50 UTC
Red Hat Product Errata RHBA-2021:2385 0 None None None 2021-06-16 10:59:26 UTC

Description Fernando Ferraz 2020-10-14 23:13:20 UTC
Hi folks, we have some customers experiencing timeout issues in cloning operations, and occasionally during API calls, when using the SolidFire driver in OSP 13 (Queens), mostly when dealing with significantly large volumes or poor network performance. The current timeout for API calls is too small for certain environments, and the cloning operation also has a hard-coded timeout that doesn't work for all customers.

I've submitted a patch upstream (not merged yet) to address this issue by adding two parameters to cinder.conf that allow users to properly configure timeout values according to their environment.
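For illustration, the two options would be set in the SolidFire backend section of cinder.conf along the lines of the sketch below. The option names match the upstream patch linked from this bug; the values shown here are examples for a slow environment, not recommended defaults, and the backend section name is hypothetical.

```ini
[solidfire-backend]
volume_driver = cinder.volume.drivers.solidfire.SolidFireDriver
# Timeout (in seconds) for SolidFire API requests; raise this on
# slow or congested networks where the previous fixed timeout was
# too small.
sf_api_request_timeout = 120
# Timeout (in seconds) to wait for a clone operation to complete;
# multi-terabyte volumes may need a much larger value than the
# old hard-coded retry limit allowed.
sf_volume_clone_timeout = 1800
```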

I expect to have this fix backported to stable/queens soon, and my understanding is that the safest approach for customers to get this fix is through an OSP 13 update. Could you folks evaluate the possibility of including this fix in the next release cycle?

See below the Launchpad bug description:

When cloning a volume in solidfire.py, the method "_get_model_info" contains a hardcoded retry count of 600. Customers are facing timeout issues when volumes are too big (i.e., multi-terabyte volumes), due to poor networks or upgrade issues that revolve around the ElementOS cluster. A viable solution is to make this value configurable in cinder.conf, allowing users to properly configure it according to their environment.
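The polling pattern described above can be sketched as follows. This is a simplified, self-contained illustration of a retry loop whose count is configurable instead of hardcoded; the function name, signature, and status values are hypothetical and are not the driver's actual API.

```python
import time


def wait_for_active(get_status, retry_count=600, interval=0.0):
    """Poll get_status() until it reports 'active'.

    retry_count mirrors the previously hardcoded limit of 600 in
    _get_model_info; making it a parameter (fed from cinder.conf)
    lets operators with multi-terabyte volumes or slow networks
    wait longer instead of failing with a timeout.
    """
    for attempt in range(retry_count):
        if get_status() == 'active':
            return attempt + 1  # number of polls it took
        time.sleep(interval)
    raise TimeoutError(
        'volume never became active after %d retries' % retry_count)
```

With the limit exposed as a parameter, an operator-facing config option can simply be passed through, rather than patching the driver to change the constant.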


Upstream patch:
https://review.opendev.org/#/c/756130/


Launchpad issue:
https://bugs.launchpad.net/cinder/+bug/1898587

Comment 1 Luigi Toscano 2020-10-15 15:49:27 UTC
Thanks, please continue working on the fix and the upstream backport, and please see my answer in https://bugzilla.redhat.com/show_bug.cgi?id=1888417#c1 .

Comment 23 errata-xmlrpc 2021-06-16 10:58:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 13.0 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2385

