Bug 747631 - Race condition: Satellite-sync hangs up forever
Summary: Race condition: Satellite-sync hangs up forever
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite 5
Classification: Red Hat
Component: Satellite Synchronization
Version: 541
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Michael Mráka
QA Contact: Jan Hutař
URL:
Whiteboard:
Depends On:
Blocks: sat541-triage
Reported: 2011-10-20 14:47 UTC by Šimon Lukašík
Modified: 2012-03-08 09:06 UTC

Fixed In Version: spacewalk-backend-1.2.13-59
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-12-22 13:11:57 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Bugzilla | 765952 | 0 | unspecified | CLOSED | satellite-sync hangs | 2021-02-22 00:41:40 UTC
Red Hat Product Errata | RHBA-2011:1848 | 0 | normal | SHIPPED_LIVE | Red Hat Network Satellite server spacewalk-backend bug fix update | 2011-12-22 18:10:29 UTC

Internal Links: 765952

Description Šimon Lukašík 2011-10-20 14:47:41 UTC
Description of problem:
There is a race condition in the multi-threaded satellite-sync. Approximately
1 in 200+ satellite-sync runs hangs forever. The hang appears at the very
end of the 'Downloading rpm packages' phase, before 'Processing rpm packages
complete'.

Strace of the satellite-sync process shows only:

    select(0, NULL, NULL, NULL, {0, 100000}) = 0 (Timeout)

Probably waiting for some resources.


Version-Release number of selected component (if applicable):
RHN Satellite 5.4.1
spacewalk-backend-1.2.13-55.el6sat.noarch

How reproducible:
The issue is not deterministic and it's very rare. However, I am able
to reproduce it when I run the reproducer in a loop.
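
Running a reproducer in a loop can be sketched roughly as below. This is
only an illustration, not the reporter's actual reproducer (which is not
included in this report): the helper name run_until_hang and the timeout
values are assumptions, and the demo uses a harmless stand-in command where
the real satellite-sync invocation would go.

```python
import subprocess
import sys


def run_until_hang(command, attempts, timeout_seconds):
    """Run `command` up to `attempts` times.

    Returns the 1-based number of the first attempt that exceeded
    `timeout_seconds`, or None if every attempt finished in time.
    """
    for attempt in range(1, attempts + 1):
        try:
            subprocess.run(
                command,
                timeout=timeout_seconds,
                check=False,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run() kills the child when the timeout expires,
            # so a hung process is not left behind for later inspection.
            return attempt
    return None


# Demo with a stand-in command; substitute the real sync invocation here.
hung_on = run_until_hang([sys.executable, "-c", "pass"],
                         attempts=3, timeout_seconds=30)
print("hung on attempt:", hung_on)
```

Because subprocess.run() kills the timed-out child, this form only counts
hangs. To attach strace to a hung process, the loop would have to leave the
process running instead.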


Steps to Reproduce:
1. Run satellite-sync on some channel (I used a dump with 9 packages).
  
Actual results:
On rare occasions the process hangs.

Expected results:
The process never hangs.

Additional info:

Comment 6 Michael Mráka 2011-12-12 16:32:43 UTC
That
    select(0, NULL, NULL, NULL, {0, 100000}) = 0 (Timeout)
is called from out_queue.get_nowait() in:

    while list(itertools.ifilter(lambda x: x.isAlive(), all_threads)) or out_queue.qsize() > 0:
        try:
            (rpmManip, package, is_done) = out_queue.get_nowait()
        except Queue.Empty:
            time.sleep(0.1)
            continue

The race condition is probably this: although out_queue.qsize() > 0 is true, the subsequent out_queue.get_nowait() raises Queue.Empty. The documentation says Queue.qsize() returns only the approximate size of the queue.
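
The suspected failure mode can be simulated as follows. This does not prove
the hypothesis; it only shows what the loop does if the hypothesis holds.
The StaleSizeQueue class, the poll() helper, and the max_polls cap are all
hypothetical, and the sketch uses the Python 3 queue module rather than the
Python 2 Queue module from the quoted code.

```python
import queue


class StaleSizeQueue(queue.Queue):
    """Hypothetical queue whose qsize() reports one item that cannot be fetched."""

    def qsize(self):
        return 1


def poll(out_queue, any_worker_alive, max_polls):
    """Same loop shape as the quoted code, but capped at max_polls empty polls."""
    empty_polls = 0
    while any_worker_alive() or out_queue.qsize() > 0:
        try:
            out_queue.get_nowait()
        except queue.Empty:
            empty_polls += 1
            if empty_polls >= max_polls:
                # The uncapped original would sleep 0.1 s and retry forever.
                return "still polling"
            continue
    return "finished"
```

With all workers finished, poll(StaleSizeQueue(), lambda: False, 50) never
leaves the loop by itself and stops only because of the cap, while the same
call with a plain queue.Queue() returns "finished" immediately.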

Comment 8 Michael Mráka 2011-12-13 13:35:49 UTC
Bug has been fixed in spacewalk master by
commit 970a100b43b9bafc34655915923a1d6b11408ab1
    747631 - exit loop when all packages are finished
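
The idea named in the commit subject, exiting the loop once all packages are
finished, could look roughly like the sketch below. This is not the code
from the commit. The worker() and collect_results() functions, the thread
count, and the tuple layout are assumptions made for illustration, and the
sketch uses the Python 3 queue module.

```python
import queue
import threading


def worker(in_queue, out_queue):
    # Hypothetical stand-in for a download thread: posts one result per package.
    while True:
        try:
            package = in_queue.get_nowait()
        except queue.Empty:
            return
        out_queue.put((package, True))


def collect_results(packages, num_threads=4, poll_interval=0.1):
    in_queue = queue.Queue()
    out_queue = queue.Queue()
    for package in packages:
        in_queue.put(package)

    threads = [threading.Thread(target=worker, args=(in_queue, out_queue))
               for _ in range(num_threads)]
    for thread in threads:
        thread.start()

    finished = []
    # The exit condition is a plain count of finished packages; it does not
    # consult qsize() or thread liveness.
    while len(finished) < len(packages):
        try:
            package, is_done = out_queue.get(timeout=poll_interval)
        except queue.Empty:
            continue
        if is_done:
            finished.append(package)

    for thread in threads:
        thread.join()
    return finished
```

A count-based loop like this still waits forever if a worker dies without
posting a result, so real code would need every worker to report failures
through the queue as well.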

Spacewalk package: spacewalk-backend-1.6.59-1

Comment 9 Michael Mráka 2011-12-14 08:08:46 UTC
Backported to SATELLITE-5.4 as
commit fe292fc954ca7809f42e9640d48fa809f61437a4
    747631 - exit loop when all packages are finished

Comment 12 errata-xmlrpc 2011-12-22 13:11:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1848.html

