Bug 1298061
| Summary: | APPARENT DEADLOCKs, failed registrations and systems registered without package profile when registering lots of systems in parallel | | |
|---|---|---|---|
| Product: | Red Hat Satellite 5 | Reporter: | Jan Hutař <jhutar> |
| Component: | Server | Assignee: | Grant Gainey <ggainey> |
| Status: | CLOSED NOTABUG | QA Contact: | Red Hat Satellite QA List <satqe-list> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 570 | CC: | dyordano, ggainey, mmello, mtaru, risantam, shughes, tlestach |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-05-04 14:03:47 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Jan Hutař
2016-01-13 06:02:12 UTC
Note: this is not related to Docker at all; Docker here is only a QE tool used to simulate a large number of systems. This bug describes behavior against Sat5 with an Oracle DB; the same reproducer against Sat5 with a PostgreSQL backend is described in Bug 1298202.

The "APPARENT DEADLOCK" warnings in the logs are not database deadlocks - they are the c3p0 library deciding that some task(s) have taken, or are taking, "too long". The warning *assumes* a deadlock, but all c3p0 really knows is that threads 'appear' to be hung, so it kills them and tries to recover. See here for more detail:

http://www.mchange.com/projects/c3p0/#other_ds_configuration
http://www.mchange.com/projects/c3p0/#maxAdministrativeTaskTime

The general drift of the research done so far is that this mostly means "your DB is under heavy CPU and/or memory load, and it is taking too long for new threads to acquire DB connections". There are various settings one can adjust to try to get around this, but sooner or later the Sat5 instance simply runs out of hardware, and then things like this happen.

As an example, in the attached logs at 10:19 on 2016/01/07 we see taskomatic logging APPARENT DEADLOCK! warnings from c3p0:

rhn-logs/rhn/rhn_taskomatic_daemon.log:
===
INFO | jvm 1 | 2016/01/07 10:19:44 | Jan 7, 2016 10:19:44 AM com.mchange.v2.async.ThreadPoolAsynchronousRunner$DeadlockDetector run
INFO | jvm 1 | 2016/01/07 10:19:44 | WARNING: com.mchange.v2.async.ThreadPoolAsynchronousRunner$DeadlockDetector@5c885c88 -- APPARENT DEADLOCK!!! Complete Status:
===

And at the same time, Oracle is complaining about processes being killed:

database/alert_rhnsat.log:
===
Thu Jan 7 10:19:44 2016
Process q003 died, see its trace file
Thu Jan 7 10:19:44 2016
ksvcreate: Process(q003) creation failed
===

This is looking more and more like "the Sat5 instance is just overloaded - Do Less". Experimentation continues.

*** Bug 1298202 has been marked as a duplicate of this bug. ***

Stress-testing with the 'stock' Sat5 configuration overloads Oracle's ability to keep up. Connections to the DB are refused, and occasionally existing connections are killed as a result, so results become unreliable.

The 'fix' is to tweak the Sat5 configs so that the database is guaranteed to be able to service the maximum allowed connections from Apache, rhn-search, taskomatic, *and* tomcat simultaneously. This pushes failures out to the incoming requests, instead of letting database stutters/failures cause random incoherence for the Satellite.

When the value of max-db-connections is less than (2*max-httpd-workers) + (3*hibernate.c3p0.max_size), Sat5 will allow more connections to itself than its connection to the DB can support, and behavior at that point becomes unpredictable.

The (2*max-httpd-workers) term exists because it is not uncommon for a given HTTPD connection to acquire two connections to the DB, in order to support autonomous transactions. The (3*hibernate.c3p0.max_size) term exists because taskomatic, tomcat, and rhn-search can each open a pool of connections to the DB, in addition to whatever is happening as a result of incoming HTTPD connections.

With the current default settings we have:

rhn_hibernate.conf: hibernate.c3p0.max_size=20
httpd.conf: MaxClients 300
postgresql.conf: max_connections = 600

Since (2*300) + (3*20) = 660 > 600, we experience the rash of problems shown in the attached logs. Reducing MaxClients (in /etc/httpd.conf) to 250 brings the worst-case demand down to (2*250) + (3*20) = 560 <= 600, and results in a much better-behaved Sat5 instance with none of the painful DB-related errors noted in the attached logs.
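As a quick sanity check of the arithmetic above, here is a minimal sketch in Python. It is not part of Satellite; the function name and the hard-coded values are illustrative assumptions that mirror the defaults quoted in this bug:

```python
# Hypothetical helper, not Satellite code: checks whether the DB can
# service the worst-case connection demand described in this bug.

def db_connection_budget_ok(max_httpd_workers: int,
                            c3p0_max_size: int,
                            max_db_connections: int) -> bool:
    """True when the worst-case demand fits within max_db_connections.

    Worst case = 2 DB connections per HTTPD worker (autonomous
    transactions) + one c3p0 pool each for taskomatic, tomcat,
    and rhn-search.
    """
    demand = 2 * max_httpd_workers + 3 * c3p0_max_size
    return demand <= max_db_connections


# Stock defaults: 2*300 + 3*20 = 660 > 600 -> over budget
print(db_connection_budget_ok(300, 20, 600))  # False

# MaxClients reduced to 250: 2*250 + 3*20 = 560 <= 600 -> OK
print(db_connection_budget_ok(250, 20, 600))  # True
```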
In actual production use, the specific settings for these variables will depend on the actual hardware and the expected workflows for the installation.