968384 – libstoragemgmt does not roll back FS size when LUN resize fails due time out


Summary: libstoragemgmt does not roll back FS size when LUN resize fails due time out

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	libstoragemgmt
Sub Component:
Version:	7.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Tony Asleson
QA Contact:	Bruno Goncalves
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	754967

Reported:	2013-05-29 15:22 UTC by Bruno Goncalves
Modified:	2023-03-08 07:25 UTC (History)
CC List:	1 user (show)
Fixed In Version:	libstoragemgmt-0.0.22-2.el7
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2014-06-13 11:51:22 UTC
Target Upstream Version:
Embargoed:


Description Bruno Goncalves 2013-05-29 15:22:48 UTC

Description of problem:

If the command to resize the LUN exits due to a timeout, the FS size still grows.
The FS size should roll back to the size it had before the attempt to resize the LUN.

lsmcli -u ontap+ssl://target --resize-volume=2FiJA+BOAezA --size=2T -f
error: 303 msg: Unhandled exception in plug-in data: Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/lsm/pluginrunner.py", line 92, in run
    result = getattr(self.plugin, method)(**msg['params'])
  File "/usr/lib/python2.7/site-packages/lsm/ontap.py", line 64, in na_wrapper
    return method(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/lsm/ontap.py", line 302, in volume_resize
    self.f.volume_resize(na_vol, diff)
  File "/usr/lib/python2.7/site-packages/lsm/na.py", line 288, in volume_resize
    self._invoke('volume-size', params)
  File "/usr/lib/python2.7/site-packages/lsm/na.py", line 160, in _invoke
    command, parameters, self.ssl)
  File "/usr/lib/python2.7/site-packages/lsm/na.py", line 100, in netapp_filer
    rc = netapp_filer_parse_response(handler.read())
  File "/usr/lib64/python2.7/socket.py", line 351, in read
    data = self._sock.recv(rbufsize)
  File "/usr/lib64/python2.7/httplib.py", line 561, in read
    s = self.fp.read(amt)
  File "/usr/lib64/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
  File "/usr/lib64/python2.7/ssl.py", line 241, in recv
    return self.read(buflen)
  File "/usr/lib64/python2.7/ssl.py", line 160, in read
    return self._sslobj.read(len)
SSLError: The read operation timed out


lsmcli -u ontap+ssl://target -l FS | grep ' lsm_lun_container_storageqe ' | cut -d'|' -f3
2715793723392

Before the resize attempt, the size was 2652147712.


The size of the LUN did not change.


Version-Release number of selected component (if applicable):
libstoragemgmt-0.0.16-1.el7.x86_64

How reproducible:
100%

Comment 1 Tony Asleson 2013-05-29 21:11:33 UTC

The code assumed that if we got an exception while resizing the NetApp volume that contains the logical units, the resize had not actually occurred. However, it appears that in some cases, such as a timeout, the NetApp volume does get resized and is left that way on exit.

I am adding a try/except around the volume resize so that we put the volume back to the size it had before.
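
The try/except approach described in this comment can be sketched roughly as follows. This is only an illustration, not the plug-in's actual code: `FakeFiler`, its methods, and `grow_container_then_lun` are hypothetical names invented here, and the fake array deliberately applies the change before raising a timeout to mimic the behaviour reported in this bug.

```python
class FakeFiler:
    """Hypothetical stand-in for the array; this is not the real lsm.na API."""

    def __init__(self, size, fail_after_apply=False):
        self.size = size
        self.fail_after_apply = fail_after_apply

    def volume_size(self):
        return self.size

    def volume_resize(self, diff):
        # The array applies the change, then the reply is lost once.
        self.size += diff
        if self.fail_after_apply:
            self.fail_after_apply = False
            raise TimeoutError("timed out")


def grow_container_then_lun(filer, diff, resize_lun):
    """Grow the containing volume, resize the LUN, undo the growth on failure."""
    original = filer.volume_size()
    try:
        filer.volume_resize(diff)
        resize_lun()
    except Exception:
        # Best effort: compare against the recorded size instead of
        # assuming that the failed call changed nothing.
        try:
            current = filer.volume_size()
            if current != original:
                filer.volume_resize(original - current)
        except Exception:
            pass  # the array may be unreachable; the restore cannot complete
        raise
```

The restore step is itself subject to the same timeout, so it can only ever be a best effort; the original error is always re-raised to the caller.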

Comment 2 Bruno Goncalves 2013-07-31 13:43:33 UTC

The problem still seems to occur.


libstoragemgmt-0.0.21-1.el7.x86_64


1. lsmcli  -u ontap://root.bos.redhat.com -l FS | grep ' lsm_lun_container_storageqe ' | cut -d'|' -f3
    5304291328

2. lsmcli -w 300  -u ontap://root.bos.redhat.com --resize-volume=2FiJA+BOAf0/ --size=2T -f
error: 303 msg: Unhandled exception in plug-in data: Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/lsm/pluginrunner.py", line 94, in run
    **msg['params'])
  File "/usr/lib/python2.7/site-packages/lsm/ontap.py", line 67, in na_wrapper
    return method(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/lsm/ontap.py", line 315, in volume_resize
    self.f.volume_resize(na_vol, diff)
  File "/usr/lib/python2.7/site-packages/lsm/na.py", line 310, in volume_resize
    self._invoke('volume-size', params)
  File "/usr/lib/python2.7/site-packages/lsm/na.py", line 167, in _invoke
    command, parameters, self.ssl)
  File "/usr/lib/python2.7/site-packages/lsm/na.py", line 105, in netapp_filer
    rc = netapp_filer_parse_response(handler.read())
  File "/usr/lib64/python2.7/socket.py", line 351, in read
    data = self._sock.recv(rbufsize)
  File "/usr/lib64/python2.7/httplib.py", line 567, in read
    s = self.fp.read(amt)
  File "/usr/lib64/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
timeout: timed out


3. lsmcli  -u ontap://root.bos.redhat.com -l FS | grep ' lsm_lun_container_storageqe ' | cut -d'|' -f3
 2718445867008

Comment 3 Bruno Goncalves 2013-08-20 09:56:59 UTC

The traceback is resolved, but the volume size still increases.

libstoragemgmt-0.0.22-2.el7.x86_64
libstoragemgmt-netapp-plugin-0.0.22-2.el7.noarch

1. lsmcli  -u ontap://root.bos.redhat.com --create-volume=lsm-timeout-test-lun --size=2G --pool=6e3dc350-811e-11e0-a8a4-00a09827d1ba --provisioning=DEFAULT

2FiJA+BOAf17 | /vol/lsm_lun_container_storageqe/lsm-timeout-test-lun | 60a980003246694a412b424f41663137 | 512 | 4194304 | OK     | 2147483648 | 0151753773 | 6e3dc350-811e-11e0-a8a4-00a09827d1ba

2. lsmcli  -u ontap://root.bos.redhat.com -l FS | grep ' lsm_lun_container_storageqe ' | cut -d'|' -f3
    5304291328

lsmcli -w 300  -u ontap://root.bos.redhat.com --resize-volume=2FiJA+BOAf17 --size=2T -f
error: 310 msg: Connection timeout

3. lsmcli  -u ontap://root.bos.redhat.com -l FS | grep ' lsm_lun_container_storageqe ' | cut -d'|' -f3
 2718445867008

Comment 5 Tony Asleson 2013-08-28 20:36:04 UTC

The libStorageMgmt NetApp plug-in typically performs one or more NetApp operations to complete a single libStorageMgmt operation. The NetApp API does not offer transactions that would guarantee that either everything completes or no state changes. The library makes a best effort to put things back as expected on failure, but depending on the error this may not be possible. For example, if a resize succeeds but the array then becomes unreachable on the network, we cannot put it back. In addition, NetApp API calls are not idempotent, so we cannot blindly retry an operation that timed out, because it may have completed.
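
To illustrate why a blind retry is unsafe for a relative resize, here is a minimal sketch. `FakeArray` and both helper functions are hypothetical and exist only for this example; the fake array applies the change and then drops the first reply, which is the situation described in this bug.

```python
class FakeArray:
    """Hypothetical array whose first reply is lost after the change is applied."""

    def __init__(self, size):
        self.size = size
        self.drop_next_reply = True

    def grow_by(self, diff):
        self.size += diff
        if self.drop_next_reply:
            self.drop_next_reply = False
            raise TimeoutError("reply lost")

    def current_size(self):
        return self.size


def blind_retry(array, diff):
    # Unsafe: if the first call was applied, the growth happens twice.
    try:
        array.grow_by(diff)
    except TimeoutError:
        array.grow_by(diff)


def verify_then_retry(array, diff):
    # Safer: work out the intended target first, then only apply what is missing.
    target = array.current_size() + diff
    try:
        array.grow_by(diff)
    except TimeoutError:
        remaining = target - array.current_size()
        if remaining:
            array.grow_by(remaining)
```

The second variant still depends on the array being reachable after the timeout, which is exactly the case that cannot be guaranteed.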

The code is written in such a way that, over time, things left at an incorrect size etc. will be corrected.

In my opinion, setting the timeout to very low values in order to induce timeouts will produce cases where we do not put things back where they should be, and cannot, because the timeout is so low that the restore operation is prevented from completing as well.

The timeout option was added for cases where the default value does not suffice and the user needs to increase it to prevent operations from timing out, for example because of network latency.

I may add code that prevents the user from setting a timeout smaller than the default of 30 seconds, to keep these artificially induced timeouts from causing errors. Otherwise the plug-in complexity will increase considerably to handle every case where the plug-in is prevented from waiting a reasonable amount of time for an operation to complete.
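
A lower bound like the one proposed here could look roughly like this. The function and constant names are hypothetical, and expressing the value in milliseconds is an assumption made for the sketch; only the 30-second default comes from this comment.

```python
# 30-second default from the comment above; the millisecond unit is an assumption.
DEFAULT_TIMEOUT_MS = 30_000


def effective_timeout_ms(requested_ms=None):
    """Return the timeout to use, never going below the default."""
    if requested_ms is None:
        return DEFAULT_TIMEOUT_MS
    return max(requested_ms, DEFAULT_TIMEOUT_MS)
```

With this clamp, a request for a very small value is silently raised to the default, while larger values for high-latency networks pass through unchanged.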

In my opinion we should move forward and only open bugs on timeouts if the default value is insufficient.

Comment 6 Ludek Smid 2014-06-13 11:51:22 UTC

This request was resolved in Red Hat Enterprise Linux 7.0.

Contact your manager or support representative in case you have further questions about the request.
