Bug 747631 - Race condition: Satellite-sync hangs up forever
Status: CLOSED ERRATA
Product: Red Hat Satellite 5
Classification: Red Hat
Component: Satellite Synchronization
Version: 541
Hardware: Unspecified
Priority: high
Severity: high
Assigned To: Michael Mráka
QA Contact: Jan Hutař
Keywords: Regression
Depends On:
Blocks: sat541-triage
Reported: 2011-10-20 10:47 EDT by Šimon Lukašík
Modified: 2012-03-08 04:06 EST

See Also:
Fixed In Version: spacewalk-backend-1.2.13-59
Doc Type: Bug Fix
Last Closed: 2011-12-22 08:11:57 EST
Description Šimon Lukašík 2011-10-20 10:47:41 EDT
Description of problem:
There is a race condition in the multi-threaded satellite-sync. Roughly
1 in 200 or more runs of satellite-sync hangs forever. The hang occurs at the
very end of the 'Downloading rpm packages' phase, before 'Processing rpm
packages complete' is reported.

Strace of the satellite-sync process shows only:

    select(0, NULL, NULL, NULL, {0, 100000}) = 0 (Timeout)

The process is probably waiting for some resource.


Version-Release number of selected component (if applicable):
RHN Satellite 5.4.1
spacewalk-backend-1.2.13-55.el6sat.noarch

How reproducible:
The issue is not deterministic and it's very rare. However, I am able
to reproduce it when I run the reproducer in a loop.
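Running a reproducer in a loop can be sketched roughly as follows. This is only an illustration, not the reproducer used for this report: the function name `run_until_hang`, the attempt count, and the timeout are assumptions, and the command to run is passed in by the caller rather than assumed here.

```python
import subprocess


def run_until_hang(command, attempts=500, timeout_seconds=600):
    """Run `command` repeatedly and report the first attempt that hangs.

    Returns the 1-based number of the attempt that exceeded the timeout,
    or None if every attempt finished in time. All names and defaults
    here are hypothetical.
    """
    for attempt in range(1, attempts + 1):
        try:
            subprocess.run(
                command,
                timeout=timeout_seconds,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run() kills the child before raising this.
            return attempt
    return None
```

Note that `subprocess.run()` kills the child when the timeout expires, so a variant that leaves the hung process alive would be needed in order to attach strace to it.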


Steps to Reproduce:
1. Run satellite-sync on a channel (I used a dump with 9 packages).
Actual results:
The process occasionally hangs.

Expected results:
The process never hangs.

Additional info:
Comment 6 Michael Mráka 2011-12-12 11:32:43 EST
That
    select(0, NULL, NULL, NULL, {0, 100000}) = 0 (Timeout)
comes from the polling loop around out_queue.get_nowait() in:

    while list(itertools.ifilter(lambda x: x.isAlive(), all_threads)) or out_queue.qsize() > 0:
        try:
            (rpmManip, package, is_done) = out_queue.get_nowait()
        except Queue.Empty:
            time.sleep(0.1)
            continue

The race condition is probably this: although out_queue.qsize() > 0 is true, the subsequent out_queue.get_nowait() raises Queue.Empty. The documentation says that Queue.qsize() returns only the approximate size of the queue.
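One way to avoid depending on qsize() or on thread liveness is to count results and leave the loop once every package has been reported. The following is a minimal Python 3 sketch of that idea, not the code from the actual fix: `download_all`, `fetch`, and `num_threads` are hypothetical names, and packages are assumed to be unique and hashable.

```python
import queue
import threading


def download_all(packages, fetch, num_threads=4):
    """Run fetch(package) in worker threads and collect one result per package.

    The collector loop ends when it has received as many results as there
    are packages, so it never consults qsize() or checks thread liveness.
    """
    in_queue = queue.Queue()
    out_queue = queue.Queue()
    for package in packages:
        in_queue.put(package)

    def worker():
        while True:
            try:
                package = in_queue.get_nowait()
            except queue.Empty:
                return  # all work was queued up front, so empty means done
            try:
                out_queue.put((package, fetch(package), None))
            except Exception as exc:
                # Always report a result so the collector's count stays correct.
                out_queue.put((package, None, exc))

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()

    results = {}
    remaining = len(packages)
    while remaining > 0:
        package, value, error = out_queue.get()  # blocks until a result arrives
        results[package] = (value, error)
        remaining -= 1

    for thread in threads:
        thread.join()
    return results
```

Because each worker puts exactly one tuple on out_queue per package, even when fetch raises, the blocking get() is called exactly len(packages) times and the loop has a definite end.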
Comment 8 Michael Mráka 2011-12-13 08:35:49 EST
Bug has been fixed in spacewalk master by
commit 970a100b43b9bafc34655915923a1d6b11408ab1
    747631 - exit loop when all packages are finished

Spacewalk package: spacewalk-backend-1.6.59-1
Comment 9 Michael Mráka 2011-12-14 03:08:46 EST
Backported to SATELLITE-5.4 as
commit fe292fc954ca7809f42e9640d48fa809f61437a4
    747631 - exit loop when all packages are finished
Comment 12 errata-xmlrpc 2011-12-22 08:11:57 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1848.html
