Bug 1298061

Summary: APPARENT DEADLOCKs, failed registrations and systems registered without package profile when registering lots of systems in parallel
Product: Red Hat Satellite 5
Component: Server
Version: 570
Status: CLOSED NOTABUG
Severity: medium
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Reporter: Jan Hutař <jhutar>
Assignee: Grant Gainey <ggainey>
QA Contact: Red Hat Satellite QA List <satqe-list>
CC: dyordano, ggainey, mmello, mtaru, risantam, shughes, tlestach
Doc Type: Bug Fix
Type: Bug
Last Closed: 2018-05-04 14:03:47 UTC

Description Jan Hutař 2016-01-13 06:02:12 UTC
Description of problem:
When registering a large number of systems in parallel, I have seen APPARENT DEADLOCK warnings in the logs, failed registrations, and systems registered without a package profile.

My setup was:
 * 10 docker hosts running 50 containers each (these containers were the ones
   registering) => ~500 registrations in roughly parallel (some systems were a
   bit faster, so registration took a different amount of time on each system)
 * Satellite had an Oracle DB backend, with Oracle hosted on the same machine:
   * IBM BladeCenter HS21 XM
   * Intel(R) Xeon(R) CPU E5405 @ 2.00GHz
   * 3830 MB of memory

Although 500 containers attempted to register, I ended up with only 326 profiles, and 181 of those had an empty package profile (their applicable errata count was 0, while the rest of the systems had 45 applicable errata).


Version-Release number of selected component (if applicable):
Satellite 5.7.0 with Oracle DB:
  satellite-schema-5.7.0.20-1.el6sat.noarch
  spacewalk-backend-2.3.3-33.el6sat.noarch
  spacewalk-java-2.3.8-120.el6sat.noarch


How reproducible:
always


Steps to Reproduce:
1. register 500 docker containers in parallel (a rough sketch of one way to
   drive this is included after these steps)
2. watch the logs and ensure all the systems were registered with a filled
   package profile
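
For reference, here is a minimal sketch of how one docker host's worth of registrations could be driven in parallel. This is an illustration only: the image name "sat5-client", the activation key, and the Satellite URL are placeholders, not taken from the original setup; the image is assumed to ship rhnreg_ks.

===
#!/usr/bin/env python
# Sketch of the parallel-registration reproducer: start N containers at
# roughly the same time, each running rhnreg_ks against the Satellite.
import subprocess
from concurrent.futures import ThreadPoolExecutor

SATELLITE_URL = "https://satellite.example.com/XMLRPC"   # placeholder
ACTIVATION_KEY = "1-examplekey"                          # placeholder
CONTAINERS_PER_HOST = 50                                 # 10 hosts x 50 = 500

def register(i):
    # One container that registers immediately; "sat5-client" is a
    # hypothetical image containing rhnreg_ks.
    cmd = ["docker", "run", "--rm", "--name", "client-%d" % i, "sat5-client",
           "rhnreg_ks", "--force",
           "--serverUrl=%s" % SATELLITE_URL,
           "--activationkey=%s" % ACTIVATION_KEY]
    return i, subprocess.call(cmd)

# Fire off all registrations on this docker host at roughly the same time.
with ThreadPoolExecutor(max_workers=CONTAINERS_PER_HOST) as pool:
    for i, rc in pool.map(register, range(CONTAINERS_PER_HOST)):
        print("container %d exited with %d" % (i, rc))
===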


Actual results:
APPARENT DEADLOCK warnings in the logs, failed registrations on 174 systems, and 181 systems registered without a package profile


Expected results:
No errors in the logs; every system is either registered or denied, but those which do register should have a full package profile

Comment 6 Jan Hutař 2016-01-13 13:21:21 UTC
Note this is not related to Docker at all. Docker is used here only as a QE tool to simulate a large number of systems.

Comment 7 Tomas Lestach 2016-01-14 09:55:19 UTC
Note: This bug describes behavior against Sat5 with an Oracle DB backend. The same reproducer against Sat5 with a PostgreSQL backend is covered by Bug 1298202.

Comment 8 Grant Gainey 2016-01-19 19:46:41 UTC
The "APPARENT DEADLOCK" warnings in the logs are not database deadlocks - they're the c3po library deciding that some task(s) have taken/are taking "too long". The warning *assumes* it's due to a deadlock - but all it really knows is that threads 'appear' to be hung, so it kills them and tries to recover.

See here for more detail:

http://www.mchange.com/projects/c3p0/#other_ds_configuration
http://www.mchange.com/projects/c3p0/#maxAdministrativeTaskTime

The general drift of the research I've done so far is that this mostly means "your DB is under heavy CPU- and/or memory-load, and it's taking too long for new threads to acquire DB connections". There are various settings one can adjust to try to work around this - but sooner or later the Sat5 instance just runs out of hardware, and then things like this happen.

As an example - in the attached logs, at 10:19 on 2016/01/07 we see taskomatic logging APPARENT DEADLOCK! warnings from c3p0:

rhn-logs/rhn/rhn_taskomatic_daemon.log:
===
INFO   | jvm 1    | 2016/01/07 10:19:44 | Jan 7, 2016 10:19:44 AM com.mchange.v2.async.ThreadPoolAsynchronousRunner$DeadlockDetector run
INFO   | jvm 1    | 2016/01/07 10:19:44 | WARNING: com.mchange.v2.async.ThreadPoolAsynchronousRunner$DeadlockDetector@5c885c88 -- APPARENT DEADLOCK!!! Complete Status: 
===

And at the same time, Oracle is complaining about processes being killed:

database/alert_rhnsat.log
===
Thu Jan  7 10:19:44 2016                                                                     
Process q003 died, see its trace file                                                        
Thu Jan  7 10:19:44 2016                                                                     
ksvcreate: Process(q003) creation failed                                                     
===

This is looking more and more like "the Sat5 instance is just overloaded - Do Less".  Experimentation continues.
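
For anyone repeating this correlation, here is a minimal sketch (assuming the log paths shown above) that pulls the timestamps of the c3p0 warnings out of the taskomatic log and lists the Oracle process failures from the alert log, so the two can be compared side by side:

===
#!/usr/bin/env python
# Extract "APPARENT DEADLOCK" timestamps from the taskomatic log and the
# process-failure lines from the Oracle alert log, for manual comparison.
import re

TASKO_LOG = "rhn-logs/rhn/rhn_taskomatic_daemon.log"
ALERT_LOG = "database/alert_rhnsat.log"

# Matches the "2016/01/07 10:19:44" timestamp in the wrapper-formatted log.
stamp = re.compile(r"\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}")

print("c3p0 APPARENT DEADLOCK warnings in taskomatic:")
for line in open(TASKO_LOG):
    if "APPARENT DEADLOCK" in line:
        m = stamp.search(line)
        if m:
            print("  " + m.group(0))

print("Oracle process failures in the alert log:")
for line in open(ALERT_LOG):
    if "died" in line or "creation failed" in line:
        print("  " + line.rstrip())
===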

Comment 9 Grant Gainey 2016-01-29 18:47:45 UTC
*** Bug 1298202 has been marked as a duplicate of this bug. ***

Comment 10 Grant Gainey 2016-01-29 20:48:43 UTC
Stress-testing with the 'stock' Sat5 configuration overloads Oracle's ability to keep up. Connections to the DB are refused, and occasionally existing connections are killed as a result. Results become unreliable.

The 'fix' is to tweak Sat5 configs in order to guarantee that the database can service the max-allowed connections from Apache, rhn-search, taskomatic, *and* tomcat, simultaneously. This pushes the failures out to the incoming requests, instead of letting the database stutters/failures cause random incoherence for satellite.

When the value for "max-db-connections" is less than (2*max-httpd-workers) + (3*hibernate.c3p0.max_size), Sat5 will allow more connections to itself than its connections to the DB can support. Behavior at that point becomes unpredictable.

The (2*max-httpd-workers) value exists because it is not uncommon for a given HTTPD connection to acquire 2 connections to the DB, to support autonomous transactions.

The (3*hibernate.c3p0.max_size) exists because taskomatic, tomcat, and rhn-search can all open connections to the DB, in addition to whatever is happening as a result of incoming HTTPD connections.

With the current default settings, we have

rhn_hibernate.conf: hibernate.c3p0.max_size=20

httpd.conf: MaxClients 300

postgresql.conf: max_connections = 600

(2*300) + (3*20) = 660 > 600 - as a result, we experience the rash of problems shown in the attached logs.

Lowering MaxClients (in /etc/httpd.conf) to 250 results in a much better-behaved Sat5 instance, and none of the painful DB-related errors noted in the attached logs.
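
As a quick sanity check of that rule of thumb - a sketch only, restating the arithmetic above with the stock and lowered MaxClients values:

===
#!/usr/bin/env python
# Connection-budget check: the DB must allow at least
# 2*MaxClients + 3*hibernate.c3p0.max_size connections.
def db_connections_needed(max_clients, c3p0_max_size):
    return 2 * max_clients + 3 * c3p0_max_size

MAX_CONNECTIONS = 600                       # DB-side limit

stock = db_connections_needed(300, 20)      # stock MaxClients
tuned = db_connections_needed(250, 20)      # MaxClients lowered to 250

print(stock, stock <= MAX_CONNECTIONS)      # 660 False -> DB oversubscribed
print(tuned, tuned <= MAX_CONNECTIONS)      # 560 True  -> within the budget
===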

In actual production use, the specific settings for these variables will depend on the actual hardware and the expected workflows for the installation.