Bug 747631

Summary: Race condition: Satellite-sync hangs up forever
Product: Red Hat Satellite 5
Component: Satellite Synchronization
Version: 541
Status: CLOSED ERRATA
Severity: high
Priority: high
Keywords: Regression
Hardware: Unspecified
OS: Unspecified
Reporter: Šimon Lukašík <slukasik>
Assignee: Michael Mráka <mmraka>
QA Contact: Jan Hutař <jhutar>
CC: cperry, jhutar, mmraka, msuchy
Fixed In Version: spacewalk-backend-1.2.13-59
Doc Type: Bug Fix
Last Closed: 2011-12-22 13:11:57 UTC
Bug Blocks: 677498

Description Šimon Lukašík 2011-10-20 14:47:41 UTC
Description of problem:
There is a race condition in multi-threaded satellite-sync. Roughly 1 in
200 or more satellite-sync runs hangs forever. The hang occurs at the very
end of the 'Downloading rpm packages' phase, before 'Processing rpm packages
complete'.

Strace of the satellite-sync process shows only:

    select(0, NULL, NULL, NULL, {0, 100000}) = 0 (Timeout)

Probably waiting for some resources.


Version-Release number of selected component (if applicable):
RHN Satellite 5.4.1
spacewalk-backend-1.2.13-55.el6sat.noarch

How reproducible:
The issue is not deterministic and it's very rare. However, I am able
to reproduce it when I run the reproducer in a loop.


Steps to Reproduce:
1. Run satellite-sync for some channel (I used a dump with 9 packages).
2. Repeat step 1 in a loop until a run hangs.
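
The reproducer loop could look roughly like the following. This is a minimal sketch, not the script actually used for this report: the helper name run_until_hang, the attempt limit and the timeout are all assumptions, and the command line is left to the caller. A run that exceeds the timeout is treated as a hang.

```python
import subprocess


def run_until_hang(cmd, max_attempts=500, timeout=600):
    """Run cmd repeatedly; return the 1-based attempt that timed out, or None.

    cmd is the full command line as a list of strings (hypothetical here:
    the exact satellite-sync invocation is not given in this report).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            # subprocess.run() kills the child and raises TimeoutExpired
            # if the command does not finish within `timeout` seconds.
            subprocess.run(
                cmd,
                timeout=timeout,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        except subprocess.TimeoutExpired:
            return attempt
    return None
```

Pass the same satellite-sync command line that was used for the channel dump. Note that subprocess.run() kills the timed-out process, so to attach strace to a hung run the loop would have to be changed to leave the process alive.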
  
Actual results:
The process occasionally hangs.

Expected results:
The process never hangs.

Additional info:

Comment 6 Michael Mráka 2011-12-12 16:32:43 UTC
The
    select(0, NULL, NULL, NULL, {0, 100000}) = 0 (Timeout)
call comes from the polling loop around out_queue.get_nowait() (most likely from the time.sleep(0.1) in the except branch):

    while list(itertools.ifilter(lambda x: x.isAlive(), all_threads)) or out_queue.qsize() > 0:
        try:
            (rpmManip, package, is_done) = out_queue.get_nowait()
        except Queue.Empty:
            time.sleep(0.1)
            continue

The race condition is probably this: although out_queue.qsize() > 0 is true, the subsequent out_queue.get_nowait() raises Queue.Empty, so the loop keeps polling. The documentation says Queue.qsize() returns only the approximate size of the queue.
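
One way to keep the loop condition independent of qsize() is to count finished packages and exit once every package has reported back. The sketch below illustrates that idea only; it is not the code from the actual fix. It uses Python 3 names (queue, is_alive-style threading) rather than the Python 2 code quoted above, and worker, process_all and the (package, is_done) tuple layout are hypothetical.

```python
import queue
import threading


def worker(in_queue, out_queue):
    """Take packages from in_queue and report each one on out_queue."""
    while True:
        try:
            package = in_queue.get_nowait()
        except queue.Empty:
            return  # no more work for this thread
        # Real code would download/process the package here.
        out_queue.put((package, True))  # (package, is_done)


def process_all(packages, n_threads=4):
    """Process packages in worker threads; return them in completion order."""
    in_queue = queue.Queue()
    out_queue = queue.Queue()
    for package in packages:
        in_queue.put(package)

    threads = [
        threading.Thread(target=worker, args=(in_queue, out_queue))
        for _ in range(n_threads)
    ]
    for thread in threads:
        thread.start()

    finished = []
    # Exit condition: every package has reported back. Neither qsize()
    # nor thread liveness is consulted.
    while len(finished) < len(packages):
        try:
            package, is_done = out_queue.get(timeout=0.1)
        except queue.Empty:
            continue
        if is_done:
            finished.append(package)

    for thread in threads:
        thread.join()
    return finished
```

This sketch assumes each package produces exactly one result. A worker that can fail would need to report the failure on out_queue as well, otherwise the counting loop would wait forever for the missing result.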

Comment 8 Michael Mráka 2011-12-13 13:35:49 UTC
Bug has been fixed in spacewalk master by
commit 970a100b43b9bafc34655915923a1d6b11408ab1
    747631 - exit loop when all packages are finished

Spacewalk package: spacewalk-backend-1.6.59-1

Comment 9 Michael Mráka 2011-12-14 08:08:46 UTC
Backported to SATELLITE-5.4 as
commit fe292fc954ca7809f42e9640d48fa809f61437a4
    747631 - exit loop when all packages are finished

Comment 12 errata-xmlrpc 2011-12-22 13:11:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1848.html