Red Hat Bugzilla – Bug 802759
3.1 - deadlock after activateStorageDomain ran
Last modified: 2012-12-04 13:55:09 EST
When trying to create a VM, I saw an error in the engine log: "Please handle Storage Domain issues and retry the operation".
I executed getDeviceList, which never returned any results.
Looking in the old vdsm log, you can see the last thread that took a lock on a resource; see Thread-114297 in the attached vdsm.log.2.xz.
Other threads that request the lock never get it.
This is actually not a revenge of
According to Saggi, this bug was fixed "ages ago".
Well, apparently it wasn't. I put up new patches that are a bit more robust. Please test with the patches applied.
Created attachment 571126 [details]
repro on March 19
This log reproduces the issue on March 19.
(In reply to comment #7)
> This log reproduces the issue on March 19.
But with which vdsm? Saggi's patch was taken upstream only yesterday.
I've managed to reproduce this while testing the patches.
It happens randomly when forking/execing in Python.
Some days it doesn't happen at all, and some days it happens all the time.
Take into account that I've been doing about 10,000 forks per test run (because of testing), running a test run after every code change, and it still happened only rarely for me.
The origin is a deadlock in Python whose root cause I can't quite nail down.
I know where it's stuck; I don't know why it's stuck.
If you are interested: Python deadlocks trying to acquire the thread-local context lock after a fork(), in order to reinit the GIL.
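The general hazard behind this kind of fork() deadlock can be shown in a few lines. The sketch below is illustrative only, not vdsm's code: it demonstrates that a lock held by another thread at fork() time is copied into the child in its locked state, while the thread that would release it is not copied, so any acquire in the child would hang forever. To stay runnable, the child only *checks* the lock state instead of acquiring it.

```python
import os
import threading
import time

lock = threading.Lock()

def holder():
    # Hold the lock long enough for the main thread to fork.
    with lock:
        time.sleep(2)

t = threading.Thread(target=holder)
t.start()
time.sleep(0.2)  # give the thread time to take the lock

pid = os.fork()
if pid == 0:
    # Child: the lock was duplicated in its locked state, but the
    # thread holding it was not duplicated, so nothing will ever
    # release it; calling lock.acquire() here would block forever.
    os._exit(1 if lock.locked() else 0)
else:
    _, status = os.waitpid(pid, 0)
    print("lock copied locked into child:", os.WEXITSTATUS(status) == 1)
```

Python's internal state (the logging module's locks, the GIL machinery) is subject to the same copy-while-held problem, which is why fork() from a multithreaded process can hang intermittently.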
In any case, it should be fixed with this patch stack, which avoids the problem by avoiding forking altogether.
It solves problems with forking/execing/process pools.
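One common way to "avoid forking altogether" from a multithreaded process is to fork a helper process at startup, before any threads exist, and route all command execution through it. The sketch below shows that general pattern with a one-process multiprocessing pool; the names (`_helper`, `safe_call`) are hypothetical and this is not the actual patch stack, just an illustration of the technique, assuming a POSIX system.

```python
import multiprocessing
import subprocess

def _run(args):
    # Runs inside the helper process. Because that process was forked
    # before the parent started any threads, no locks were copied in
    # a held state, so it can fork/exec safely.
    return subprocess.call(args)

# Create the helper before any threads exist in this process.
_helper = multiprocessing.Pool(processes=1)

def safe_call(args):
    """Run a command via the pre-forked helper instead of forking
    directly from a (possibly multithreaded) process."""
    return _helper.apply(_run, (args,))
```

For example, `safe_call(["true"])` returns the command's exit status (0 here), with the actual fork/exec happening in the single-threaded helper.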
So, if there is no way to reproduce it, since we do not know exactly why it happens, and we are adding new code that avoids forking altogether, how can we verify this bug?
Removing QA_ACK until we get instructions on what needs to be tested here.
Setting QE conditional NACK on Reproducer and Requirements.
According to Saggi's comment 11, it is a nasty, not completely clear, race condition in Python's subprocess.Popen. We do not have a clear reproducer for this, or a simple way to verify the bug.
The best I can see for QE is to stress-test Vdsm with multiple (as many as possible) block storage domains.
Saggi assumes that the problem noticed by Avihai shall be gone when his
We didn't encounter it in our various automation runs or in manual storage sanity testing, nor in scalability testing (I tried it myself with a domain built from 100 PVs).
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.