Description of problem:
There is a race condition in the multi-threaded satellite-sync. Approximately 1 in 200+ satellite-sync runs hangs forever. The hang occurs at the very end of the 'Downloading rpm packages' phase, before 'Processing rpm packages complete'. An strace of the satellite-sync process shows only:

    select(0, NULL, NULL, NULL, {0, 100000}) = 0 (Timeout)

so the process is probably waiting for some resource.

Version-Release number of selected component (if applicable):
RHN Satellite 5.4.1
spacewalk-backend-1.2.13-55.el6sat.noarch

How reproducible:
The issue is not deterministic and it is very rare. However, I am able to reproduce it when I run the reproducer in a loop.

Steps to Reproduce:
1. satellite-sync some channel (I used a dump with 9 packages).

Actual results:
The process occasionally hangs.

Expected results:
The process never hangs.

Additional info:
That select(0, NULL, NULL, NULL, {0, 100000}) = 0 (Timeout) comes from the polling loop around out_queue.get_nowait() (source lines 1758-1763):

    while list(itertools.ifilter(lambda x: x.isAlive(), all_threads)) or out_queue.qsize() > 0:
        try:
            (rpmManip, package, is_done) = out_queue.get_nowait()
        except Queue.Empty:
            time.sleep(0.1)
            continue

The 0.1-second timeout matches the time.sleep(0.1) taken whenever get_nowait() raises Queue.Empty. The race condition is probably this: although out_queue.qsize() > 0 is true, the subsequent out_queue.get_nowait() raises Queue.Empty. The documentation says Queue.qsize() returns only the approximate size of the queue.
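For illustration only, here is a minimal sketch of a consumer loop that does not depend on qsize() or on thread liveness. It is written for Python 3 (the `queue` module) rather than the Python 2 code above, and it is not the actual change made in spacewalk-backend. The names worker, drain_results, run_demo, and the deadline parameter are hypothetical. The idea is to count finished packages and stop once the expected number has arrived, using a blocking get() with a timeout instead of get_nowait() plus sleep.

```python
import queue
import threading
import time


def worker(packages, out_queue):
    """Hypothetical worker: reports each package as done on the queue."""
    for name in packages:
        out_queue.put((name, True))


def drain_results(out_queue, expected, poll_interval=0.1, deadline=30.0):
    """Collect results until `expected` packages are reported finished.

    Termination depends only on the count of finished packages, not on
    qsize() (which is approximate) or on whether threads are still alive.
    The deadline is an added safety net so the loop cannot spin forever.
    """
    finished = []
    give_up_at = time.monotonic() + deadline
    while len(finished) < expected:
        if time.monotonic() > give_up_at:
            raise RuntimeError("timed out waiting for packages")
        try:
            # Blocks for at most poll_interval seconds, then raises Empty.
            name, is_done = out_queue.get(timeout=poll_interval)
        except queue.Empty:
            continue
        if is_done:
            finished.append(name)
    return finished


def run_demo():
    """Run three workers over nine hypothetical packages and collect them."""
    packages = ["pkg-%d" % i for i in range(9)]
    out_queue = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(packages[i::3], out_queue))
        for i in range(3)
    ]
    for t in threads:
        t.start()
    done = drain_results(out_queue, expected=len(packages))
    for t in threads:
        t.join()
    return sorted(done)


if __name__ == "__main__":
    print(run_demo())
```

Because the loop ends when the expected count is reached, a momentarily wrong qsize() value or a slow-to-exit thread cannot keep it polling.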
The bug has been fixed in Spacewalk master by commit 970a100b43b9bafc34655915923a1d6b11408ab1:

    747631 - exit loop when all packages are finished

Spacewalk package: spacewalk-backend-1.6.59-1
Backported to SATELLITE-5.4 as commit fe292fc954ca7809f42e9640d48fa809f61437a4:

    747631 - exit loop when all packages are finished
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2011-1848.html