Bug 802759 - 3.1 - deadlock after activateStorageDomain ran
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: vdsm
Version: 6.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: rc
Target Release: ---
Assigned To: Saggi Mizrahi
QA Contact: Jakub Libosvar
Whiteboard: storage
Keywords: Regression
Depends On:
Blocks:
Reported: 2012-03-13 08:58 EDT by Avihai Shoham
Modified: 2012-12-04 13:55 EST

See Also:
Fixed In Version: vdsm-4.9.6-18.0
Doc Type: Bug Fix
Doc Text:
Previously, a race condition in Python's subprocess Popen caused virtual machine creation to fail. A patch to VDSM prevents virtual machine failure when this race condition is present.
Last Closed: 2012-12-04 13:55:09 EST


Attachments
repro on March19 (763.47 KB, application/x-xz)
2012-03-19 09:49 EDT, Avihai Shoham

Description Avihai Shoham 2012-03-13 08:58:23 EDT
When trying to create a VM, I saw an error in the engine log: "Please handle Storage Domain issues and retry the operation".
I then executed getDeviceList, which never returned any results.

In the old vdsm log, the last thread to take a lock on a resource is Thread-114297 (see the attached vdsm.log.2.xz).

Other lock requests on that resource are never granted.
Comment 1 Saggi Mizrahi 2012-03-15 13:06:59 EDT
This is actually not a revenge of
Comment 2 Saggi Mizrahi 2012-03-15 13:07:40 EDT
http://gerrit.ovirt.org/#change,2844
Comment 4 Avihai Shoham 2012-03-18 12:09:31 EDT
According to Saggi, this bug was fixed "ages ago".
Comment 6 Saggi Mizrahi 2012-03-19 09:26:50 EDT
Well, apparently it isn't. I put up new patches that are a bit more robust. Please test with the patches applied.
Comment 7 Avihai Shoham 2012-03-19 09:49:20 EDT
Created attachment 571126
repro on March19

This log reproduces the issue on March 19.
Comment 8 Dan Kenigsberg 2012-04-18 15:04:10 EDT
(In reply to comment #7)
> 
> this log repro this issue on March19

But with which vdsm version? Saggi's patch was taken upstream only yesterday.
Comment 11 Saggi Mizrahi 2012-04-29 06:44:56 EDT
I've managed to reproduce this while testing patches.
It happens randomly when forking/execing in Python.

Some days it doesn't happen at all, and some days it happens all the time.
Take into account that, because of testing, I've been doing about 10,000 forks per test run, with a test run after every code change, and it still happened only rarely for me.

The origin is a deadlock in Python whose root cause I can't quite nail down.
I know where it's stuck; I don't know why it's stuck.
If you are interested, Python deadlocks while trying to take the local thread context lock after a fork() in order to reinit the GIL.

In any case, it should be fixed by this patch stack (which avoids the problem by avoiding forking altogether):
http://gerrit.ovirt.org/#q,status:open+project:vdsm+branch:master+topic:coop,n,z

It solves the problems with forking/execing/the process pool.
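
For readers wondering why forking from a multithreaded Python process is fragile at all, here is a minimal sketch of the general hazard: a lock held by another thread at fork() time is copied into the child in its locked state, while the thread that would release it does not exist there. This is only an illustration of that class of problem, not a reproducer for this bug, whose root cause is unknown as stated above. The function name `lock_is_stuck_in_child` and the timeout value are made up for the example, and it assumes Python 3 on a POSIX system where `os.fork` is available.

```python
import os
import threading


def lock_is_stuck_in_child(timeout=1.0):
    """Illustrative helper (hypothetical name): fork while another thread
    holds a lock and report whether the child failed to acquire it."""
    lock = threading.Lock()
    holding = threading.Event()
    release = threading.Event()

    def holder():
        # Take the lock and keep it until the parent tells us to let go.
        with lock:
            holding.set()
            release.wait()

    thread = threading.Thread(target=holder)
    thread.start()
    holding.wait()  # make sure the lock is held before forking

    # Newer Python versions may warn about fork() in a multithreaded process.
    pid = os.fork()
    if pid == 0:
        # Child: only the forking thread exists here, so nobody will ever
        # release the copied lock. Use a timeout instead of blocking forever.
        code = 2
        try:
            code = 0 if lock.acquire(timeout=timeout) else 1
        finally:
            os._exit(code)

    # Parent: collect the child's verdict, then let the holder thread finish.
    _, status = os.waitpid(pid, 0)
    release.set()
    thread.join()
    return os.WEXITSTATUS(status) == 1


if __name__ == "__main__":
    print("lock stuck in child:", lock_is_stuck_in_child())
```

Without the timeout the child would block indefinitely, which is what a hang after fork looks like from the outside. Not forking from the threaded process, as the patch stack above does, sidesteps this whole class of problem.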
Comment 12 Dafna Ron 2012-04-30 03:47:33 EDT
So if there is no way of reproducing this, since we do not know exactly why it happens, and we are adding new code that avoids forking altogether, how can we verify this bug?
Comment 13 Yaniv Kaul 2012-05-01 02:51:07 EDT
Removing QA_ACK until we get instructions on what needs to be tested here.
Setting QE conditional NACK on Reproducer and Requirements.
Comment 15 Dan Kenigsberg 2012-05-03 04:41:36 EDT
According to Saggi's comment 11, it is a nasty, not completely clear, race condition in Python's subprocess.Popen. We do not have a clear reproducer for this, or a simple way to verify the bug.

The best I can see for QE is to stress-test Vdsm with multiple (as many as possible) block storage domains.

Saggi assumes that the problem noticed by Avihai shall be gone when his

http://gerrit.ovirt.org/3944

is in.
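
Since there is no clear reproducer, one generic way to probe for this kind of hang outside Vdsm is to spawn many short-lived children from several threads at once and check that every thread finishes before a deadline. The sketch below does only that. It does not touch Vdsm or storage domains, the name `stress_popen` and its parameters are hypothetical, and a clean run proves nothing about a race this rare; it merely gives a hang a chance to show up.

```python
import subprocess
import sys
import threading
import time


def stress_popen(workers=4, runs_per_worker=25, timeout=60.0):
    """Hypothetical harness: call subprocess.Popen concurrently from several
    threads and return (children_completed, threads_still_stuck)."""
    completed = []

    def worker():
        for _ in range(runs_per_worker):
            # A trivial child process; the interesting part is the spawn.
            proc = subprocess.Popen([sys.executable, "-c", "pass"])
            proc.wait()
            completed.append(proc.returncode)

    # Daemon threads, so a stuck worker cannot keep the harness alive.
    threads = [threading.Thread(target=worker, daemon=True)
               for _ in range(workers)]
    for t in threads:
        t.start()

    # Share a single deadline across all joins.
    deadline = time.monotonic() + timeout
    for t in threads:
        t.join(max(0.0, deadline - time.monotonic()))

    stuck = sum(1 for t in threads if t.is_alive())
    return len(completed), stuck


if __name__ == "__main__":
    done, stuck = stress_popen()
    print("children completed:", done, "threads stuck:", stuck)
```

A non-zero count of stuck threads would be the signal worth investigating. A zero count only means the race did not trigger during that run.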
Comment 20 Haim 2012-07-16 08:17:28 EDT
We did not encounter it in our various automation runs, in manual storage sanity testing, or in scalability testing (I tried it myself with a domain constructed from 100 PVs).

vdsm-4.9.6-21.0.el6_3.x86_64
Comment 25 errata-xmlrpc 2012-12-04 13:55:09 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-1508.html
