747631 – Race condition: Satellite-sync hangs up forever

Summary: Race condition: Satellite-sync hangs up forever

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Satellite 5
Classification:	Red Hat
Component:	Satellite Synchronization
Sub Component:
Version:	541
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Assignee:	Michael Mráka
QA Contact:	Jan Hutař
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	sat541-triage

Reported:	2011-10-20 14:47 UTC by Šimon Lukašík
Modified:	2012-03-08 09:06 UTC
CC List:	4 users
Fixed In Version:	spacewalk-backend-1.2.13-59
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2011-12-22 13:11:57 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	765952	0	unspecified	CLOSED	satellite-sync hangs	2021-02-22 00:41:40 UTC
Red Hat Product Errata	RHBA-2011:1848	0	normal	SHIPPED_LIVE	Red Hat Network Satellite server spacewalk-backend bug fix update	2011-12-22 18:10:29 UTC

Internal Links: 765952

Description Šimon Lukašík 2011-10-20 14:47:41 UTC

Description of problem:
There is a race condition in the multi-threaded satellite-sync. Approximately
1 in 200 or more satellite-sync runs hangs forever. The hang occurs at the very
end of the 'Downloading rpm packages' phase, before 'Processing rpm packages
complete'.

Strace of the satellite-sync process shows only:

    select(0, NULL, NULL, NULL, {0, 100000}) = 0 (Timeout)

The process is probably waiting for some resource.


Version-Release number of selected component (if applicable):
RHN Satellite 5.4.1
spacewalk-backend-1.2.13-55.el6sat.noarch

How reproducible:
The issue is not deterministic and it is very rare. However, I am able
to reproduce it when I run the reproducer in a loop.


Steps to Reproduce:
1. Run satellite-sync on some channel (I used a dump with 9 packages).
Actual results:
The process occasionally hangs.

Expected results:
The process never hangs.

Additional info:

Comment 6 Michael Mráka 2011-12-12 16:32:43 UTC

That
    select(0, NULL, NULL, NULL, {0, 100000}) = 0 (Timeout)
is called from out_queue.get_nowait() in:

    while list(itertools.ifilter(lambda x: x.isAlive(), all_threads)) or out_queue.qsize() > 0:
        try:
            (rpmManip, package, is_done) = out_queue.get_nowait()
        except Queue.Empty:
            time.sleep(0.1)
            continue

The race condition is probably this: although out_queue.qsize() > 0 is true, the subsequent out_queue.get_nowait() raises Queue.Empty. The documentation says Queue.qsize() returns only the approximate size of the queue.
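
For illustration only, here is a minimal sketch of a consumer loop that does not depend on qsize() at all. It is not the code from the actual fix. It is written for Python 3 (the original code is Python 2), and the names worker, drain, and sync are hypothetical. The assumption is that the number of expected results is known up front, so the loop can stop once that many have arrived, and can fail loudly instead of spinning if the workers exit early.

```python
import queue
import threading


def worker(packages, out_queue):
    # Hypothetical worker: pretend to download each package, then report it.
    for package in packages:
        out_queue.put((package, True))


def drain(out_queue, workers, expected, timeout=0.1):
    """Collect results until `expected` packages have been reported as done."""
    finished = []
    while len(finished) < expected:
        # Sample liveness *before* reading. If every worker had already
        # exited, the queue cannot grow any more, so an empty queue after
        # this point means results are missing rather than merely late.
        workers_alive = any(t.is_alive() for t in workers)
        try:
            package, is_done = out_queue.get(timeout=timeout)
        except queue.Empty:
            if not workers_alive:
                missing = expected - len(finished)
                raise RuntimeError("workers exited with %d results missing" % missing)
            continue
        if is_done:
            finished.append(package)
    return finished


def sync(packages, n_threads=4):
    out_queue = queue.Queue()
    # Split the package list round-robin between the worker threads.
    chunks = [packages[i::n_threads] for i in range(n_threads)]
    workers = [threading.Thread(target=worker, args=(chunk, out_queue))
               for chunk in chunks]
    for t in workers:
        t.start()
    done = drain(out_queue, workers, expected=len(packages))
    for t in workers:
        t.join()
    return done
```

The loop condition here is a count of received results rather than the queue size, so an approximate qsize() can no longer keep the loop alive. Whether this matches the structure of the real satellite-sync code is an assumption.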

Comment 8 Michael Mráka 2011-12-13 13:35:49 UTC

Bug has been fixed in spacewalk master by
commit 970a100b43b9bafc34655915923a1d6b11408ab1
    747631 - exit loop when all packages are finished

Spacewalk package: spacewalk-backend-1.6.59-1

Comment 9 Michael Mráka 2011-12-14 08:08:46 UTC

Backported to SATELLITE-5.4 as
commit fe292fc954ca7809f42e9640d48fa809f61437a4
    747631 - exit loop when all packages are finished

Comment 12 errata-xmlrpc 2011-12-22 13:11:57 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1848.html
