Bug 1298061 - APPARENT DEADLOCKs, failed registrations and systems registered without package profile when registering lots of systems in parallel
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Satellite 5
Classification: Red Hat
Component: Server
Version: 570
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Grant Gainey
QA Contact: Red Hat Satellite QA List
URL:
Whiteboard:
Duplicates: 1298202
Depends On:
Blocks:
 
Reported: 2016-01-13 06:02 UTC by Jan Hutař
Modified: 2020-07-16 08:40 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-05-04 14:03:47 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID: Red Hat Knowledge Base (Solution) 2211431
Last Updated: 2016-03-28 20:02:41 UTC

Description Jan Hutař 2016-01-13 06:02:12 UTC
Description of problem:
I have seen APPARENT DEADLOCKs in the logs, failed registrations, and systems registered without a package profile when registering lots of systems in parallel.

My setup was:
 * 10 docker hosts running 50 containers each (these containers were
   registering) => 500 registrations in roughly parallel (some systems were a
   bit faster, so registration took a different amount of time on each system)
 * Satellite had an Oracle DB backend with Oracle hosted on the same machine
   * IBM BladeCenter HS21 XM
   * Intel(R) Xeon(R) CPU E5405 @ 2.00GHz
   * 3830 MB memory

Although 500 containers attempted to register, I ended up with only 326 profiles, and (judging by the number of applicable errata, which was 0 while the rest of the systems had 45 applicable errata) 181 of these had an empty package profile.


Version-Release number of selected component (if applicable):
Satellite 5.7.0 with Oracle DB:
  satellite-schema-5.7.0.20-1.el6sat.noarch
  spacewalk-backend-2.3.3-33.el6sat.noarch
  spacewalk-java-2.3.8-120.el6sat.noarch


How reproducible:
always


Steps to Reproduce:
1. register 500 docker containers in parallel (one possible way to drive this
   is sketched below)
2. watch the logs and ensure all the systems were registered with a filled
   package profile
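
(For reference, a minimal sketch of one way the parallel registrations could be driven from each docker host. The image name, activation key, and Satellite URL are hypothetical placeholders, it assumes rhnreg_ks is available inside the containers, and it illustrates the load pattern rather than the exact tooling used for this report.)
===
#!/usr/bin/python
# Rough sketch: start 50 container registrations in parallel from one docker
# host and wait for them all. IMAGE, ACTIVATION_KEY and SATELLITE_URL are
# hypothetical placeholders, not values taken from this bug.
import subprocess

IMAGE = "rhel6-client"                      # hypothetical container image
ACTIVATION_KEY = "1-example-key"            # hypothetical activation key
SATELLITE_URL = "https://satellite.example.com/XMLRPC"
CONTAINERS_PER_HOST = 50

procs = []
for i in range(CONTAINERS_PER_HOST):
    # Each container registers itself against the Satellite and exits.
    cmd = ["docker", "run", "--rm", IMAGE,
           "rhnreg_ks", "--serverUrl", SATELLITE_URL,
           "--activationkey", ACTIVATION_KEY, "--force"]
    procs.append((i, subprocess.Popen(cmd)))

for i, p in procs:
    print("container %d exited with %d" % (i, p.wait()))
===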


Actual results:
APPARENT DEADLOCKs in the logs, failed registrations on 174 systems, and 181 systems registered without a package profile


Expected results:
No errors in the logs; all systems either registered or denied, but those which did register should have a full package profile

Comment 6 Jan Hutař 2016-01-13 13:21:21 UTC
Note this is not related to Docker at all. Docker here is just a QE tool used to simulate lots of systems.

Comment 7 Tomas Lestach 2016-01-14 09:55:19 UTC
Note: This bug describes behavior against Sat5 with an Oracle DB backend. The same reproducer against Sat5 with a PostgreSQL backend is described in Bug 1298202.

Comment 8 Grant Gainey 2016-01-19 19:46:41 UTC
The "APPARENT DEADLOCK" warnings in the logs are not database deadlocks - they're the c3po library deciding that some task(s) have taken/are taking "too long". The warning *assumes* it's due to a deadlock - but all it really knows is that threads 'appear' to be hung, so it kills them and tries to recover.

See here for more detail:

http://www.mchange.com/projects/c3p0/#other_ds_configuration
http://www.mchange.com/projects/c3p0/#maxAdministrativeTaskTime

The general drift of the research I've done so far is that this mostly means "your DB is under heavy CPU and/or memory load, and it's taking too long for new threads to acquire DB connections". There are various settings one can adjust to try and get around this - but sooner or later, the Sat5 instance just runs out of hardware, and then things like this happen.
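
For illustration, the kind of c3p0 knobs involved: maxAdministrativeTaskTime and numHelperThreads are documented c3p0 parameters (see the links above), but whether Sat5 honors them when set in rhn_hibernate.conf is an assumption here, and the values are placeholders rather than recommendations.

rhn_hibernate.conf (hypothetical additions):
===
# seconds after which c3p0 will try to interrupt an apparently hung
# administrative task (the default, 0, means "never interrupt")
hibernate.c3p0.maxAdministrativeTaskTime=120
# size of the helper thread pool that runs c3p0's administrative tasks
hibernate.c3p0.numHelperThreads=6
===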

As an example - in the attached logs, at 10:19 on 2016/01/07 we see taskomatic logging APPARENT DEADLOCK! warnings from c3p0:

rhn-logs/rhn/rhn_taskomatic_daemon.log:
===
INFO   | jvm 1    | 2016/01/07 10:19:44 | Jan 7, 2016 10:19:44 AM com.mchange.v2.async.ThreadPoolAsynchronousRunner$DeadlockDetector run
INFO   | jvm 1    | 2016/01/07 10:19:44 | WARNING: com.mchange.v2.async.ThreadPoolAsynchronousRunner$DeadlockDetector@5c885c88 -- APPARENT DEADLOCK!!! Complete Status: 
===

And at the same time, Oracle is complaining about processes being killed:

database/alert_rhnsat.log
===
Thu Jan  7 10:19:44 2016                                                                     
Process q003 died, see its trace file                                                        
Thu Jan  7 10:19:44 2016                                                                     
ksvcreate: Process(q003) creation failed                                                     
===

This is looking more and more like "the Sat5 instance is just overloaded - Do Less".  Experimentation continues.

Comment 9 Grant Gainey 2016-01-29 18:47:45 UTC
*** Bug 1298202 has been marked as a duplicate of this bug. ***

Comment 10 Grant Gainey 2016-01-29 20:48:43 UTC
Stress-testing with the 'stock' Sat5 configuration overloads Oracle's ability to keep up. Connections to the DB are refused, and occasionally existing connections are killed as a result. Results become unreliable.

The 'fix' is to tweak the Sat5 configs in order to guarantee that the database can service the max-allowed connections from Apache, rhn-search, taskomatic, *and* tomcat simultaneously. This pushes the failures out to the incoming requests, instead of letting database stutters/failures cause random incoherence for Satellite.

When the value of "max-db-connections" is less than (2*max-httpd-workers) + (3*hibernate.c3p0.max_size), Sat5 will allow more connections to itself than its connections to the DB can support. Behavior at this point becomes unpredictable.

The (2*max-httpd-workers) value exists because it is not uncommon for a given HTTPD connection to acquire 2 connections to the DB, to support autonomous transactions.

The (3*hibernate.c3p0.max_size) exists because taskomatic, tomcat, and rhn-search can all open connections to the DB, in addition to whatever is happening as a result of incoming HTTPD connections.

With the current default settings, we have

rhn_hibernate.conf: hibernate.c3p0.max_size=20

httpd.conf: MaxClients 300

postgresql.conf: max_connections = 600

(2*300) + (3*20) = 660 > 600 - as a result, we experience the rash of problems shown in the attached logs.
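
A quick way to sanity-check that budget against a given install's settings (the values below mirror the stock defaults quoted above; this is just an illustration of the arithmetic, not a supported tool):
===
#!/usr/bin/python
# Connection-budget check from the rule of thumb above:
#   max-db-connections >= 2*max-httpd-workers + 3*hibernate.c3p0.max_size
max_httpd_workers = 300     # httpd.conf: MaxClients
c3p0_max_size = 20          # rhn_hibernate.conf: hibernate.c3p0.max_size
max_db_connections = 600    # DB-side limit, e.g. postgresql.conf max_connections

needed = 2 * max_httpd_workers + 3 * c3p0_max_size
print("DB connections needed: %d, allowed: %d" % (needed, max_db_connections))
if needed > max_db_connections:
    # With the stock defaults this reports an overshoot of 60 connections.
    print("over budget by %d - expect refused/killed DB connections under load"
          % (needed - max_db_connections))
===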

Decreasing MaxClients (in /etc/httpd.conf) to 250 results in a much better-behaved Sat5 instance, and none of the painful DB-related errors noted in the attached logs.

In actual production use, the specific settings for these variables will depend on the actual hardware and the expected workflows for the installation.

