802759 – 3.1 - deadlock after activateStorageDomain ran

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	vdsm
Sub Component:
Version:	6.3
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	rc
Target Release:	---
Assignee:	Saggi Mizrahi
QA Contact:	Jakub Libosvar
Docs Contact:
URL:
Whiteboard:	storage
Depends On:
Blocks:

Reported:	2012-03-13 12:58 UTC by Avihai Shoham
Modified:	2022-07-09 05:34 UTC
CC List:	16 users
Fixed In Version:	vdsm-4.9.6-18.0
Doc Type:	Bug Fix
Doc Text:	Previously, a race condition in Python's subprocess Popen caused virtual machine creation to fail. A patch to VDSM prevents virtual machine failure when this race condition is present.
Clone Of:
Environment:
Last Closed:	2012-12-04 18:55:09 UTC
Target Upstream Version:
Embargoed:

Attachments
repro on March19 (763.47 KB, application/x-xz) 2012-03-19 13:49 UTC, Avihai Shoham

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2012:1508	0	normal	SHIPPED_LIVE	Important: rhev-3.1.0 vdsm security, bug fix, and enhancement update	2012-12-04 23:48:05 UTC

Description Avihai Shoham 2012-03-13 12:58:23 UTC

When trying to create a VM, I saw an error in the engine log: "Please handle Storage Domain issues and retry the operation".
I executed getDeviceList, which never returned any results.

Looking in the old vdsm log, you can see the last thread that took a lock on a resource: see Thread-114297 in the attached vdsm.log.2.xz.

Other requesters that ask for a lock never get it.

Comment 1 Saggi Mizrahi 2012-03-15 17:06:59 UTC

This is actually not a revenge of

Comment 2 Saggi Mizrahi 2012-03-15 17:07:40 UTC

http://gerrit.ovirt.org/#change,2844

Comment 4 Avihai Shoham 2012-03-18 16:09:31 UTC

According to Saggi, this bug was fixed "ages ago".

Comment 6 Saggi Mizrahi 2012-03-19 13:26:50 UTC

Well, apparently it isn't. I put up new patches that are a bit more robust. Please test with the patches applied.

Comment 7 Avihai Shoham 2012-03-19 13:49:20 UTC

Created attachment 571126
repro on March19

This log reproduces the issue on March 19.

Comment 8 Dan Kenigsberg 2012-04-18 19:04:10 UTC

(In reply to comment #7)
> 
> this log repro this issue on March19

But with which vdsm version? Saggi's patch was taken upstream only yesterday.

Comment 11 Saggi Mizrahi 2012-04-29 10:44:56 UTC

I've managed to reproduce this when testing patches.
It happens randomly when forking/exec'ing in Python.

Some days it doesn't happen at all and some days it happens all the time.
Take into account that, because of testing, I've been doing about 10,000 forks per test run, with a test run on every code change, and it still happened only rarely for me.

The origin is a deadlock in Python whose root cause I can't quite nail down.
I know where it's stuck; I don't know why it's stuck.
If you are interested, Python deadlocks while trying to get the local thread context lock after a fork() in order to reinit the GIL.

In any case, it should be fixed with this stack (which avoids the problem by avoiding forking altogether):
http://gerrit.ovirt.org/#q,status:open+project:vdsm+branch:master+topic:coop,n,z

It solves the problems with forking/exec'ing/the process pool.
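
A minimal sketch of a fork/exec stress loop of the kind described in this comment, for anyone trying to reproduce the hang. It is not taken from the vdsm patches or test suite: the names spawn_once and stress_fork_exec, the default counts, and the timeout are all assumptions chosen for illustration. It starts many trivial child processes from several threads at once and counts any call that does not finish in time, which is how a deadlock in the fork path would show up.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def spawn_once(_index):
    """Fork/exec one trivial child process and return its exit code."""
    proc = subprocess.Popen([sys.executable, "-c", "pass"])
    return proc.wait()


def stress_fork_exec(total=200, workers=8, per_call_timeout=30.0):
    """Spawn `total` children from `workers` threads (hypothetical helper).

    Returns (completed, failed, stuck). "stuck" counts calls that did not
    finish within the timeout, including calls queued behind a hung worker.
    """
    completed = failed = stuck = 0
    pool = ThreadPoolExecutor(max_workers=workers)
    futures = [pool.submit(spawn_once, i) for i in range(total)]
    for future in futures:
        try:
            if future.result(timeout=per_call_timeout) == 0:
                completed += 1
            else:
                failed += 1
        except FutureTimeout:
            stuck += 1
    # Do not block here on worker threads that may be deadlocked.
    pool.shutdown(wait=(stuck == 0))
    return completed, failed, stuck
```

Calling stress_fork_exec() with a large total approximates the thousands of forks per run mentioned above. Since the hang is reported as rare and random, a clean run does not prove the problem is absent, and a process that does contain a deadlocked worker thread may need to be killed rather than exiting on its own.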

Comment 12 Dafna Ron 2012-04-30 07:47:33 UTC

So if there is no way of reproducing this, since we do not know exactly why it happens, and we are adding new code that avoids forking altogether, how can we verify this bug?

Comment 13 Yaniv Kaul 2012-05-01 06:51:07 UTC

Removing QA_ACK until we get instructions on what needs to be tested here.
Setting QE conditional NACK on Reproducer and Requirements.

Comment 15 Dan Kenigsberg 2012-05-03 08:41:36 UTC

According to Saggi's comment 11, it is a nasty, not completely clear, race condition in Python's subprocess.Popen. We do not have a clear reproducer for this, or a simple way to verify the bug.

The best I can see for QE is to stress-test Vdsm with multiple (as many as possible) block storage domains.

Saggi assumes that the problem noticed by Avihai will be gone once his patch http://gerrit.ovirt.org/3944 is in.

Comment 20 Haim 2012-07-16 12:17:28 UTC

We didn't encounter it in our various automation runs, in manual storage sanity testing, or in scalability testing (I tried it myself with a domain constructed from 100 PVs).

vdsm-4.9.6-21.0.el6_3.x86_64

Comment 25 errata-xmlrpc 2012-12-04 18:55:09 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-1508.html
