Bug 968384
Summary: | libstoragemgmt does not roll back FS size when LUN resize fails due to time out | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Bruno Goncalves <bgoncalv> |
Component: | libstoragemgmt | Assignee: | Tony Asleson <tasleson> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Bruno Goncalves <bgoncalv> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 7.0 | CC: | tasleson |
Target Milestone: | rc | Keywords: | Triaged |
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | libstoragemgmt-0.0.22-2.el7 | Doc Type: | Bug Fix |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2014-06-13 11:51:22 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 754967 |
Description
Bruno Goncalves
2013-05-29 15:22:48 UTC
The code assumed that if we got an exception during the re-size of the NetApp volume which contains the logical units, the re-size didn't actually occur. However, it appears that in some cases, such as when we time out, the NetApp volume does get re-sized and is left that way on exit. Adding a try/except around the volume resize so that we place the volume back to the same size as it was before.

It seems the problem continues to occur.

libstoragemgmt-0.0.21-1.el7.x86_64

1. lsmcli -u ontap://root.bos.redhat.com -l FS | grep ' lsm_lun_container_storageqe ' | cut -d'|' -f3
5304291328

2. lsmcli -w 300 -u ontap://root.bos.redhat.com --resize-volume=2FiJA+BOAf0/ --size=2T -f
error: 303 msg: Unhandled exception in plug-in data: Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/lsm/pluginrunner.py", line 94, in run
    **msg['params'])
  File "/usr/lib/python2.7/site-packages/lsm/ontap.py", line 67, in na_wrapper
    return method(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/lsm/ontap.py", line 315, in volume_resize
    self.f.volume_resize(na_vol, diff)
  File "/usr/lib/python2.7/site-packages/lsm/na.py", line 310, in volume_resize
    self._invoke('volume-size', params)
  File "/usr/lib/python2.7/site-packages/lsm/na.py", line 167, in _invoke
    command, parameters, self.ssl)
  File "/usr/lib/python2.7/site-packages/lsm/na.py", line 105, in netapp_filer
    rc = netapp_filer_parse_response(handler.read())
  File "/usr/lib64/python2.7/socket.py", line 351, in read
    data = self._sock.recv(rbufsize)
  File "/usr/lib64/python2.7/httplib.py", line 567, in read
    s = self.fp.read(amt)
  File "/usr/lib64/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
timeout: timed out

3. lsmcli -u ontap://root.bos.redhat.com -l FS | grep ' lsm_lun_container_storageqe ' | cut -d'|' -f3
2718445867008

The traceback is resolved, but the volume size still increases.

libstoragemgmt-0.0.22-2.el7.x86_64
libstoragemgmt-netapp-plugin-0.0.22-2.el7.noarch

1. lsmcli -u ontap://root.bos.redhat.com --create-volume=lsm-timeout-test-lun --size=2G --pool=6e3dc350-811e-11e0-a8a4-00a09827d1ba --provisioning=DEFAULT
2FiJA+BOAf17 | /vol/lsm_lun_container_storageqe/lsm-timeout-test-lun | 60a980003246694a412b424f41663137 | 512 | 4194304 | OK | 2147483648 | 0151753773 | 6e3dc350-811e-11e0-a8a4-00a09827d1ba

2. lsmcli -u ontap://root.bos.redhat.com -l FS | grep ' lsm_lun_container_storageqe ' | cut -d'|' -f3
5304291328
lsmcli -w 300 -u ontap://root.bos.redhat.com --resize-volume=2FiJA+BOAf17 --size=2T -f
error: 310 msg: Connection timeout

3. lsmcli -u ontap://root.bos.redhat.com -l FS | grep ' lsm_lun_container_storageqe ' | cut -d'|' -f3
2718445867008

The libStorageMgmt NetApp plug-in typically performs one or more operations to complete a single libStorageMgmt operation. The NetApp API does not provide transactions that would guarantee either that everything completes or that no state is changed. The library makes a best effort to put things back as expected on failure, but depending on the error this may not be possible. For example, if you resize something and that succeeds, but the array then becomes unreachable on the network, we cannot put it back. In addition, NetApp API calls are not idempotent, so we cannot blindly retry an operation that timed out, because it may have completed. The code is written in such a way that, over time, things that are left at the incorrect size etc. will be corrected.

In my opinion, setting the timeout to very low values to induce timeouts will yield cases where we don't put things back to where they should be, and we can't, because the timeout is so low that the restore operation is prevented from completing as well. The timeout option was added for cases where the default value wouldn't suffice and the user needed to increase it to prevent operations from timing out because of network latency etc.
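To make the best-effort restore discussed here concrete, the following is a minimal sketch, not the plug-in's actual code. `FakeFiler`, `get_size`, `set_size`, and `resize_with_rollback` are hypothetical names invented for illustration; the fake array simulates a resize that is applied but whose reply is lost to a timeout.

```python
class FakeFiler:
    """Hypothetical stand-in for a storage array; not the NetApp API."""

    def __init__(self, size, fail_after_apply=False):
        self.size = size
        self.fail_after_apply = fail_after_apply

    def get_size(self):
        return self.size

    def set_size(self, new_size):
        # The change is applied first, mirroring the case in this report
        # where the array resized the volume even though the call timed out.
        self.size = new_size
        if self.fail_after_apply:
            self.fail_after_apply = False
            raise TimeoutError("timed out")


def resize_with_rollback(filer, new_size):
    """Resize and, on any error, try to restore the original size.

    The restore is best effort: if it fails too, the original error is
    still raised and the size may remain changed.
    """
    original = filer.get_size()
    try:
        filer.set_size(new_size)
    except Exception:
        try:
            # Only restore if the failed call actually changed the size.
            if filer.get_size() != original:
                filer.set_size(original)
        except Exception:
            pass  # Array unreachable; nothing more can be done here.
        raise  # Re-raise the original resize error.
```

Note that the restore itself needs working calls to the array. With a timeout low enough to break the resize, the restore calls can time out as well, which is the situation described in this comment.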
I may add code to prevent the user from setting a timeout smaller than the default of 30 seconds, so that these artificially induced timeouts cannot cause errors. Otherwise the plug-in complexity will increase considerably to handle every case where the plug-in is prevented from waiting a reasonable amount of time to complete an operation. In my opinion we should move forward and only open bugs on timeouts if the default value is insufficient.

This request was resolved in Red Hat Enterprise Linux 7.0. Contact your manager or support representative in case you have further questions about the request.
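The proposal to refuse timeouts below the 30 second default could look roughly like the sketch below. This is only an illustration under assumptions: `clamp_timeout` and `DEFAULT_TIMEOUT_S` are hypothetical names, the value is taken to be in seconds, and nothing here reflects how lsmcli actually parses or interprets its `-w` option.

```python
DEFAULT_TIMEOUT_S = 30  # default of 30 seconds, as stated in the comment


def clamp_timeout(requested_s, floor_s=DEFAULT_TIMEOUT_S):
    """Return the timeout to use, never lower than the floor.

    Hypothetical helper: a missing value falls back to the default, and
    a value below the floor is raised to the floor rather than rejected.
    """
    if requested_s is None:
        return floor_s
    return max(requested_s, floor_s)
```

Raising the value to the floor is one possible design choice; rejecting a too-small value with an error would be another, and the comment does not say which was intended.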