Bug 802759

Summary: 3.1 - deadlock after activateStorageDomain ran
Product: Red Hat Enterprise Linux 6
Reporter: Avihai Shoham <ashoham>
Component: vdsm
Assignee: Saggi Mizrahi <smizrahi>
Status: CLOSED ERRATA
QA Contact: Jakub Libosvar <jlibosva>
Severity: urgent
Priority: unspecified
Version: 6.3
CC: abaron, acathrow, bazulay, cpelland, danken, dron, hateya, iheim, ilvovsky, jbiddle, jlibosva, smizrahi, syeghiay, yeylon, ykaul, zdover
Target Milestone: rc
Keywords: Regression
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard: storage
Fixed In Version: vdsm-4.9.6-18.0
Doc Type: Bug Fix
Doc Text:
Previously, a race condition in Python's subprocess.Popen caused virtual machine creation to fail. A patch to VDSM prevents virtual machine failure when this race condition is present.
Last Closed: 2012-12-04 18:55:09 UTC
Attachments: repro on March19

Description Avihai Shoham 2012-03-13 12:58:23 UTC
When trying to create a VM, I saw an error in the engine log: "Please handle Storage Domain issues and retry the operation".
I executed getDeviceList, which never returned any results.

Looking in the old vdsm log, you can see the last thread that took a lock on the resource; see Thread-114297 in the attached vdsm.log.2.xz.

Other threads that request a lock on the resource never get it.

Comment 1 Saggi Mizrahi 2012-03-15 17:06:59 UTC
This is actually not a revenge of

Comment 2 Saggi Mizrahi 2012-03-15 17:07:40 UTC
http://gerrit.ovirt.org/#change,2844

Comment 4 Avihai Shoham 2012-03-18 16:09:31 UTC
According to Saggi, this bug was fixed "ages ago".

Comment 6 Saggi Mizrahi 2012-03-19 13:26:50 UTC
Well, apparently it isn't. I put up new patches that are a bit more robust. Please test with the patches applied.

Comment 7 Avihai Shoham 2012-03-19 13:49:20 UTC
Created attachment 571126 [details]
repro on March19

This log reproduces the issue, captured on March 19.

Comment 8 Dan Kenigsberg 2012-04-18 19:04:10 UTC
(In reply to comment #7)
> 
> this log repro this issue on March19

but with which vdsm? Saggi's patch was taken upstream only yesterday.

Comment 11 Saggi Mizrahi 2012-04-29 10:44:56 UTC
I've managed to reproduce this while testing patches.
It happens randomly when forking/execing in Python.

Some days it doesn't happen at all and some days it happens all the time.
Take into account that I've been doing about 10,000 forks per test run (because of testing), running a test run on every code change, and still it happened only rarely for me.

The origin is a deadlock in Python whose root cause I can't quite nail down.
I know where it's stuck; I don't know why it's stuck.
If you are interested: Python deadlocks trying to take the local thread context lock after a fork() in order to reinit the GIL.
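
The lock in question is internal CPython state, but the same failure mode can be shown with an ordinary threading.Lock: whatever is held locked at the instant of fork() stays locked forever in the child. A minimal sketch of the hazard (an illustration only, not vdsm code; Python 3 syntax, and the acquire timeout exists just so the demo terminates):

import os
import threading
import time

lock = threading.Lock()

def hold_lock():
    # Stand-in for an unrelated thread (e.g. a logger) that happens
    # to own a lock at the instant another thread calls fork().
    with lock:
        time.sleep(30)

t = threading.Thread(target=hold_lock)
t.daemon = True
t.start()
time.sleep(0.5)   # make sure the thread owns the lock first

pid = os.fork()   # POSIX only
if pid == 0:
    # The child inherits the lock in its locked state, but the thread
    # owning it does not exist here, so the acquire can never succeed.
    if not lock.acquire(timeout=5):
        print("child: deadlocked on inherited lock, giving up")
    os._exit(0)
os.waitpid(pid, 0)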

In any case, it should be fixed by this patch stack (which avoids the problem by avoiding forking altogether):
http://gerrit.ovirt.org/#q,status:open+project:vdsm+branch:master+topic:coop,n,z

It solves problems with forking/execing/the process pool.
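
One common shape of "avoid forking from a threaded process" is to fork a single-threaded helper early and route all later command execution through it. A rough sketch of that idea (assumed for illustration only; the actual coop patches linked above are structured differently):

import multiprocessing
import subprocess

def run_cmd(cmd):
    # Runs inside the helper process, which has no competing threads,
    # so its fork()/exec() cannot race against anything.
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = p.communicate()
    return p.returncode, out, err

if __name__ == "__main__":
    # Fork the helper while this process is still single-threaded;
    # every later command goes through it instead of a local fork().
    helper = multiprocessing.Pool(processes=1)
    rc, out, err = helper.apply(run_cmd, (["/bin/echo", "hello"],))
    print(rc, out)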

Comment 12 Dafna Ron 2012-04-30 07:47:33 UTC
So if there is no way of reproducing it, since we do not know exactly why it happens, and we are adding new code that avoids forking altogether, how can we verify this bug?

Comment 13 Yaniv Kaul 2012-05-01 06:51:07 UTC
Removing QA_ACK until we get instructions on what needs to be tested here.
Setting QE conditional NACK on Reproducer and Requirements.

Comment 15 Dan Kenigsberg 2012-05-03 08:41:36 UTC
According to Saggi's comment 11, it is a nasty, not completely understood race condition in Python's subprocess.Popen. We do not have a clear reproducer for this, or a simple way to verify the bug.

The best I can see for QE is to stress-test Vdsm with multiple (as many as possible) block storage domains.
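
Since the underlying race is in concurrent fork/exec rather than in storage code as such, the same kind of load can also be approximated generically. A crude sketch of such a stress loop (thread and iteration counts are arbitrary placeholders, not a QE plan):

import subprocess
import threading

THREADS = 20   # placeholder values, not a test plan
ROUNDS = 500

def worker():
    for _ in range(ROUNDS):
        # Any short-lived command exercises the concurrent fork/exec path.
        subprocess.Popen(["true"]).wait()

workers = [threading.Thread(target=worker) for _ in range(THREADS)]
for w in workers:
    w.start()
for w in workers:
    # If the fork race bites, one of these joins never returns.
    w.join()
print("survived %d concurrent fork/execs" % (THREADS * ROUNDS))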

Saggi assumes that the problem noticed by Avihai will be gone once his

http://gerrit.ovirt.org/3944

is in.

Comment 20 Haim 2012-07-16 12:17:28 UTC
We didn't encounter it in our various automation runs, nor in manual storage sanity testing, nor in scalability testing (I tried it myself with a domain constructed from 100 PVs).

vdsm-4.9.6-21.0.el6_3.x86_64

Comment 25 errata-xmlrpc 2012-12-04 18:55:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-1508.html