Bug 955622 - [engine-backend] failure in extend storage domain [with force=True] causes any other asynchronous task to fail with resource timeout. [FC]
Summary: [engine-backend] failure in extend storage domain [with force=True] causes an...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: unspecified
Hardware: x86_64
OS: Unspecified
unspecified
high
Target Milestone: ---
: 3.2.0
Assignee: Maor
QA Contact: Elad
URL:
Whiteboard: storage
Depends On:
Blocks: 948448
TreeView+ depends on / blocked
 
Reported: 2013-04-23 12:18 UTC by Elad
Modified: 2016-02-10 20:28 UTC (History)
11 users (show)

Fixed In Version: sf16
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
logs (20.73 MB, application/x-gzip)
2013-04-23 12:18 UTC, Elad
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 11648 0 None None None Never

Description Elad 2013-04-23 12:18:38 UTC
Created attachment 738974 [details]
logs

Description of problem:

After vdsm responses to engine about failed extendStorageDomain, engine throws:


2013-04-23 13:42:52,219 ERROR [org.ovirt.engine.core.bll.storage.ExtendSANStorageDomainCommand] (ajp-/127.0.0.1:8702-6) [2f41e0fd] Command org.ovirt.engine.core.bll.storage.ExtendSANStorageDomainCommand throw Vdc Bll exception. With error message VdcBLLException: org.springframework.jdbc.CannotGetJdbcConnectionException: Could not get JDBC Connection; nested exception is java.sql.SQLException: javax.resource.ResourceException: IJ000460: Error checking for a transaction



After that, in the next 3 minutes, engine will fail any other asynchronous tasks (whether if they'll success or fail by vdsm). 



Version-Release number of selected component (if applicable):

RHEVM - 3.2 - SF13.1

rhevm-backend-3.2.0-10.19.beta2.el6ev.noarch
libvirt-0.10.2-18.el6_4.4.x86_64
vdsm-4.10.2-15.0.el6ev.x86_64
sanlock-2.6-2.el6.x86_64


How reproducible:

100% (will have to get a failure in extendStorageDomain in order to reproduce)


Steps to Reproduce: 

in a 2 hosts FC env.:

1) Trigger a failure in extend storage domain (I got one by a failure in remove mappings holding device)  vdsm will update the engine about the failure  
 
2) Try to do another action that will cause an asynchronous task by vdsm.
  
Actual results:
Any other asynchronous tasks that will be performed by vdsm in the next 3 minutes, will get fail by engine due to a timeout.




Expected results:

There should not be an exception in engine after failure in ExtendStorageDomain.


Additional info:
See logs attached.

Comment 1 Vered Volansky 2013-04-25 11:53:04 UTC
(In reply to comment #0) 
> Actual results:
> Any other asynchronous tasks that will be performed by vdsm in the next 3
> minutes, will get fail by engine due to a timeout.

This implies there are exceptions in the engine log for the next three minutes.
I don't see any in engine_elad_rhevm32.log.
All I see is the ExtedStorageDomainVDS and related errors, and only for the following minute.
Where did you see these exceptions? In the UI only?
Were you using webadmin?
Am I looking at the correct log? If not, where should I be looking?
If yes - which other exceptions am I missing?

> Expected results:
> There should not be an exception in engine after failure in
> ExtendStorageDomain.
You mean for the other operations I seem to be missing, right?
You're ok with the ExtedStorageDomainVDS etc. exceptions, correct?

Comment 2 Elad 2013-04-27 18:23:41 UTC

(In reply to comment #1)
> (In reply to comment #0) 


The first ExtendStorageDomainVDSCommand sent to vdsm at 13:32:48,751.
This operation failed. After that (about 2 minuted) (13:34:26,173), I tried again to extend the domain. This operation was send by engine to vdsm only after 10 minutes (13:42:48,797). The operation failed by vdsm but even if it was successful, engine would fail it due to timeout.

I've indeed used the webadmin.

Comment 3 Maor 2013-05-08 14:18:57 UTC
Already fixed upstream

Comment 4 Elad 2013-05-15 19:38:06 UTC
engine is not failing other asynchronous tasks after failure in ExtendStorageDomain:


vdsm failed to extend the domain:

on host:

Thread-991::INFO::2013-05-15 22:28:41,094::logUtils::40::dispatcher::(wrapper) Run and protect: extendStorageDomain(sdUUID='5191a994-536b-49fc-bbbf-fa0429fade12', spUUID='b6dba6fd-e23c-4
474-b2ef-6d3fa985987d', devlist=['1elad21367847', '1elad3113678542'], force=True, options=None)
Thread-991::ERROR::2013-05-15 22:28:41,136::task::850::TaskManager.Task::(_setError) Task=`35b0e6b0-366d-4fe6-8d1d-280e359202d0`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 41, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 684, in extendStorageDomain
    pool.extendSD(sdUUID, devlist, force)
  File "/usr/share/vdsm/storage/securable.py2013-05-15 22:29:12,332 INFO  [org.ovirt.engine.core.bll.storage.ExtendSANStorageDomainCommand] (ajp-/127.0.0.1:8702-11) [1d36254b] Running command: ExtendSANStorageDomainCommand internal: false. Entities affected :  ID: 5191a994-536b-49fc-bbbf-fa0429fade12 Type: Storage
2013-05-15 22:29:12,337 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ExtendStorageDomainVDSCommand] (ajp-/127.0.0.1:8702-11) [1d36254b] START, ExtendStorageDomainVDSCommand( storagePoolId = b6dba6fd-e23c-4474-b2ef-6d3fa985987d, ignoreFailoverLimit = false, compatabilityVersion = null, storageDomainId = 5191a994-536b-49fc-bbbf-fa0429fade12, deviceList = [1elad3313678542]), log id: 3b5ab952
2013-05-15 22:29:15,179 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ExtendStorageDomainVDSCommand] (ajp-/127.0.0.1:8702-11) [1d36254b] FINISH, ExtendStorageDomainVDSCommand, log id: 3b5ab952
", line 68, in wrapper
    return f(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 2018, in extendSD
    sdCache.produce(sdUUID).extend(devlist, force)
  File "/usr/share/vdsm/storage/blockSD.py", line 695, in extend
    lvm.extendVG(self.sdUUID, devices, force)
  File "/usr/share/vdsm/storage/lvm.py", line 914, in extendVG
    _initpvs(pvs, int(vg.vg_mda_size) / 2 ** 20, force)
  File "/usr/share/vdsm/storage/lvm.py", line 677, in _initpvs
    _initpvs_removeHolders()
  File "/usr/share/vdsm/storage/lvm.py", line 667, in _initpvs_removeHolders
    os.path.basename(device))
  File "/usr/share/vdsm/storage/devicemapper.py", line 166, in removeMappingsHoldingDevice
    removeMapping(getDevName(holder))
  File "/usr/share/vdsm/storage/devicemapper.py", line 137, in removeMapping
    return getProxy().removeDeviceMapping(deviceName)
  File "/usr/share/vdsm/supervdsm.py", line 76, in __call__
    return callMethod()
  File "/usr/share/vdsm/supervdsm.py", line 67, in <lambda>
    **kwargs)
  File "<string>", line 2, in removeDeviceMapping
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 740, in _callmethod
    raise convert_to_error(kind, result)
Exception: Could not remove mapping `5191a994--536b--49fc--bbbf--fa0429fade12-ids`



on engine:

2013-05-15 22:28:37,871 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (ajp-/127.0.0.1:8702-5) [379d390e] Failed in ExtendStorageDomainVDS method

2013-05-15 22:28:37,901 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ExtendStorageDomainVDSCommand] (ajp-/127.0.0.1:8702-5) [379d390e] FINISH, ExtendStorageDomainVDSCommand, log id:




right after, I tried to extend the domain again and succeed:

on host:

 
Thread-1018::INFO::2013-05-15 22:29:15,630::logUtils::40::dispatcher::(wrapper) Run and protect: extendStorageDomain(sdUUID='5191a994-536b-49fc-bbbf-fa0429fade12', spUUID='b6dba6fd-e23c-
4474-b2ef-6d3fa985987d', devlist=['1elad3313678542'], force=True, options=None)


2013-05-15 22:29:12,332 INFO  [org.ovirt.engine.core.bll.storage.ExtendSANStorageDomainCommand] (ajp-/127.0.0.1:8702-11) [1d36254b] Running command: ExtendSANStorageDomainCommand internal: false. Entities affected :  ID: 5191a994-536b-49fc-bbbf-fa0429fade12 Type: Storage
2013-05-15 22:29:12,337 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ExtendStorageDomainVDSCommand] (ajp-/127.0.0.1:8702-11) [1d36254b] START, ExtendStorageDomainVDSCommand( storagePoolId = b6dba6fd-e23c-4474-b2ef-6d3fa985987d, ignoreFailoverLimit = false, compatabilityVersion = null, storageDomainId = 5191a994-536b-49fc-bbbf-fa0429fade12, deviceList = [1elad3313678542]), log id: 3b5ab952
2013-05-15 22:29:15,179 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ExtendStorageDomainVDSCommand] (ajp-/127.0.0.1:8702-11) [1d36254b] FINISH, ExtendStorageDomainVDSCommand, log id: 3b5ab952



verified on RHEVM-3.2 - SF16:

rhevm-3.2.0-10.25.beta3.el6ev.noarch
vdsm-4.10.2-18.0.el6ev.x86_64

Comment 5 Itamar Heim 2013-06-11 08:49:11 UTC
3.2 has been released

Comment 6 Itamar Heim 2013-06-11 08:49:11 UTC
3.2 has been released

Comment 7 Itamar Heim 2013-06-11 08:49:19 UTC
3.2 has been released

Comment 8 Itamar Heim 2013-06-11 08:53:50 UTC
3.2 has been released

Comment 9 Itamar Heim 2013-06-11 09:24:28 UTC
3.2 has been released


Note You need to log in before you can comment on or make changes to this bug.