Bug 968384

Summary: libstoragemgmt does not roll back FS size when LUN resize fails due to a timeout
Product: Red Hat Enterprise Linux 7 Reporter: Bruno Goncalves <bgoncalv>
Component: libstoragemgmt Assignee: Tony Asleson <tasleson>
Status: CLOSED CURRENTRELEASE QA Contact: Bruno Goncalves <bgoncalv>
Severity: medium Docs Contact:
Priority: medium    
Version: 7.0 CC: tasleson
Target Milestone: rc Keywords: Triaged
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: libstoragemgmt-0.0.22-2.el7 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-06-13 11:51:22 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 754967    

Description Bruno Goncalves 2013-05-29 15:22:48 UTC
Description of problem:

If the command to resize the LUN exits due to a timeout, the FS size still grows.
The FS size should roll back to what it was before the attempt to resize the LUN.

lsmcli -u ontap+ssl://target --resize-volume=2FiJA+BOAezA --size=2T -f
error: 303 msg: Unhandled exception in plug-in data: Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/lsm/pluginrunner.py", line 92, in run
    result = getattr(self.plugin, method)(**msg['params'])
  File "/usr/lib/python2.7/site-packages/lsm/ontap.py", line 64, in na_wrapper
    return method(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/lsm/ontap.py", line 302, in volume_resize
    self.f.volume_resize(na_vol, diff)
  File "/usr/lib/python2.7/site-packages/lsm/na.py", line 288, in volume_resize
    self._invoke('volume-size', params)
  File "/usr/lib/python2.7/site-packages/lsm/na.py", line 160, in _invoke
    command, parameters, self.ssl)
  File "/usr/lib/python2.7/site-packages/lsm/na.py", line 100, in netapp_filer
    rc = netapp_filer_parse_response(handler.read())
  File "/usr/lib64/python2.7/socket.py", line 351, in read
    data = self._sock.recv(rbufsize)
  File "/usr/lib64/python2.7/httplib.py", line 561, in read
    s = self.fp.read(amt)
  File "/usr/lib64/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
  File "/usr/lib64/python2.7/ssl.py", line 241, in recv
    return self.read(buflen)
  File "/usr/lib64/python2.7/ssl.py", line 160, in read
    return self._sslobj.read(len)
SSLError: The read operation timed out


lsmcli -u ontap+ssl://target -l FS | grep ' lsm_lun_container_storageqe ' | cut -d'|' -f3
2715793723392

Before the resize attempt, the FS size was 2652147712.


The size of the LUN did not change.


Version-Release number of selected component (if applicable):
libstoragemgmt-0.0.16-1.el7.x86_64

How reproducible:
100%

Comment 1 Tony Asleson 2013-05-29 21:11:33 UTC
The code assumed that if an exception occurred while resizing the NetApp volume that contains the logical units, the resize did not actually happen.  However, it appears that in some cases, such as a timeout, the NetApp volume does get resized and is left that way on exit.

Adding a try/except around the volume resize so that we put the volume back to the size it had before the attempt.
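
A minimal sketch of that try/except approach, not the actual plug-in code. The `FakeFiler` class and its `volume_size_get`/`volume_resize` methods are hypothetical stand-ins for the filer wrapper, written only to make the example runnable:

```python
import socket


class FakeFiler(object):
    """Hypothetical stand-in for the filer wrapper (not the real lsm API)."""

    def __init__(self, size, fail_after_apply=False):
        self.size = size
        self.fail_after_apply = fail_after_apply

    def volume_size_get(self, name):
        return self.size

    def volume_resize(self, name, new_size):
        # Simulate the failure mode from this bug: the array applies the
        # change, but the client times out before it sees the reply.
        self.size = new_size
        if self.fail_after_apply:
            self.fail_after_apply = False
            raise socket.timeout("timed out")


def resize_with_rollback(filer, name, new_size):
    """Resize a volume; on any error, try to restore the original size."""
    original = filer.volume_size_get(name)
    try:
        filer.volume_resize(name, new_size)
    except Exception:
        # Best effort: the restore can fail too (for example, if the
        # array is still unreachable), in which case its error propagates.
        if filer.volume_size_get(name) != original:
            filer.volume_resize(name, original)
        raise
```

The original exception is re-raised after the restore, so the caller still sees that the resize failed.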

Comment 2 Bruno Goncalves 2013-07-31 13:43:33 UTC
The problem still seems to occur.


libstoragemgmt-0.0.21-1.el7.x86_64


1. lsmcli  -u ontap://root.bos.redhat.com -l FS | grep ' lsm_lun_container_storageqe ' | cut -d'|' -f3
    5304291328

2. lsmcli -w 300  -u ontap://root.bos.redhat.com --resize-volume=2FiJA+BOAf0/ --size=2T -f
error: 303 msg: Unhandled exception in plug-in data: Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/lsm/pluginrunner.py", line 94, in run
    **msg['params'])
  File "/usr/lib/python2.7/site-packages/lsm/ontap.py", line 67, in na_wrapper
    return method(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/lsm/ontap.py", line 315, in volume_resize
    self.f.volume_resize(na_vol, diff)
  File "/usr/lib/python2.7/site-packages/lsm/na.py", line 310, in volume_resize
    self._invoke('volume-size', params)
  File "/usr/lib/python2.7/site-packages/lsm/na.py", line 167, in _invoke
    command, parameters, self.ssl)
  File "/usr/lib/python2.7/site-packages/lsm/na.py", line 105, in netapp_filer
    rc = netapp_filer_parse_response(handler.read())
  File "/usr/lib64/python2.7/socket.py", line 351, in read
    data = self._sock.recv(rbufsize)
  File "/usr/lib64/python2.7/httplib.py", line 567, in read
    s = self.fp.read(amt)
  File "/usr/lib64/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
timeout: timed out


3. lsmcli  -u ontap://root.bos.redhat.com -l FS | grep ' lsm_lun_container_storageqe ' | cut -d'|' -f3
 2718445867008

Comment 3 Bruno Goncalves 2013-08-20 09:56:59 UTC
The traceback is resolved, but the volume size still increases.

libstoragemgmt-0.0.22-2.el7.x86_64
libstoragemgmt-netapp-plugin-0.0.22-2.el7.noarch

1. lsmcli  -u ontap://root.bos.redhat.com --create-volume=lsm-timeout-test-lun --size=2G --pool=6e3dc350-811e-11e0-a8a4-00a09827d1ba --provisioning=DEFAULT

2FiJA+BOAf17 | /vol/lsm_lun_container_storageqe/lsm-timeout-test-lun | 60a980003246694a412b424f41663137 | 512 | 4194304 | OK     | 2147483648 | 0151753773 | 6e3dc350-811e-11e0-a8a4-00a09827d1ba

2. lsmcli  -u ontap://root.bos.redhat.com -l FS | grep ' lsm_lun_container_storageqe ' | cut -d'|' -f3
    5304291328

lsmcli -w 300  -u ontap://root.bos.redhat.com --resize-volume=2FiJA+BOAf17 --size=2T -f
error: 310 msg: Connection timeout

3. lsmcli  -u ontap://root.bos.redhat.com -l FS | grep ' lsm_lun_container_storageqe ' | cut -d'|' -f3
 2718445867008

Comment 5 Tony Asleson 2013-08-28 20:36:04 UTC
The libStorageMgmt NetApp plug-in typically performs one or more NetApp operations to complete a single libStorageMgmt operation.  The NetApp API does not provide transactions that would guarantee that either everything completes or no state is changed.  The library makes a best effort to put things back as expected on failure, but depending on the error this may not be possible.  For example, if you resize something and that succeeds, but the array then becomes unreachable on the network, we cannot put it back.  In addition, NetApp API calls are not idempotent, so we cannot blindly retry an operation that timed out, because it may have completed.
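
To illustrate why a blind retry is unsafe for a non-idempotent call, here is a minimal sketch that re-reads the size after a timeout instead of re-sending the request. This is not how the plug-in is implemented; `get_size` and `grow_by` are hypothetical callables standing in for the array API:

```python
import socket


def grow_once(get_size, grow_by, delta):
    """Apply a non-idempotent 'grow by delta' call at most once.

    get_size and grow_by are hypothetical callables, not real lsm
    functions. Returns True if the grow took effect, False if it did not.
    """
    before = get_size()
    try:
        grow_by(delta)
        return True
    except socket.timeout:
        # The request may or may not have been applied. Re-read the size
        # to find out instead of re-sending, which could grow it twice.
        # If the array is unreachable, get_size() raises and we cannot tell.
        return get_size() == before + delta
```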

The code is written in such a way that, over time, things left at an incorrect size and similar inconsistencies will be corrected.

In my opinion, setting the timeout to very low values in order to induce timeouts will yield cases where we do not put things back where they should be.  We cannot, because the timeout is so low that the restore operation is prevented from completing as well.

The timeout option was added for cases where the default value is insufficient and the user needs to increase it, for example because of network latency, to keep operations from timing out.

I may add code that prevents the user from setting a timeout below the default of 30 seconds, so that these artificially induced timeouts cannot cause errors.  Otherwise the plug-in complexity will increase considerably to handle every case where the plug-in is prevented from waiting a reasonable amount of time for an operation to complete.
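
One possible shape for such a check, as a sketch only. The function name, the choice to reject rather than silently raise the value, and the use of seconds as the unit are assumptions for illustration, not the library's actual behavior:

```python
DEFAULT_TIMEOUT_SECONDS = 30  # the default mentioned in this comment


def validate_timeout(requested_seconds):
    """Reject timeouts below the default (hypothetical helper)."""
    if requested_seconds < DEFAULT_TIMEOUT_SECONDS:
        raise ValueError(
            "timeout must be at least %d seconds, got %r"
            % (DEFAULT_TIMEOUT_SECONDS, requested_seconds))
    return requested_seconds
```

Rejecting the value makes the limit visible to the user. Clamping to the default with `max()` would be the quieter alternative.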

In my opinion we should move forward and only open bugs on timeouts if the default value is insufficient.

Comment 6 Ludek Smid 2014-06-13 11:51:22 UTC
This request was resolved in Red Hat Enterprise Linux 7.0.

Contact your manager or support representative in case you have further questions about the request.