Bug 1280346 - "stack depth limit exceeded" when submitting 600 ESXi hypervisors/6500 VMs via virt-who
"stack depth limit exceeded" when submitting 600ESXi hypervisors/6500VMs via ...
Status: CLOSED ERRATA
Product: Red Hat Satellite 6
Classification: Red Hat
Component: Candlepin
Version: 6.1.1
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Release: 6.1.10
Assigned To: satellite6-bugs
QA Contact: jcallaha
Keywords: Triaged
Depends On: 1327224
Blocks: 1296845 1321630 1351644
Reported: 2015-11-11 09:24 EST by Christian Horn
Modified: 2017-02-23 14:21 EST
CC: 15 users

Fixed In Version: candlepin-0.9.54-5
Doc Type: Bug Fix
Clones: 1321630 1327220 1327224
Last Closed: 2016-09-27 05:01:46 EDT
Type: Bug


Attachments:
candlepin_stack (9.43 KB, text/plain), 2016-06-02 09:17 EDT, Jonathan Gibert
verification screenshot (208.94 KB, image/png), 2016-09-23 16:13 EDT, jcallaha
output.txt (1.63 MB, text/plain), 2016-09-23 16:14 EDT, jcallaha


External Trackers:
Red Hat Knowledge Base (Solution) 2048593

Description Christian Horn 2015-11-11 09:24:47 EST
Description of problem:
We are running Satellite 6.1.1 plus a hotfix and have now configured virt-who (virt-who-0.14-1.el7sat.noarch).

The initial test was with one cluster of a vCenter, containing 300 ESXi hypervisors and 3500 VMs. That task completed OK in ~1.5 min (virt-who logged a timeout while talking to subscription-manager, but the data was received correctly by the Satellite).

Now we want to submit the whole vCenter, which contains ~600 ESXi hypervisors and 6500 VMs. Doing so, we get the following error from virt-who:
~~~
Error in communication with subscription manager:
        Runtime Error ERROR: stack depth limit exceeded
  Hint: Increase the configuration parameter "max_stack_depth" (currently 2048kB), after ensuring the platform's stack depth limit is adequate. at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse:2,157
~~~

This is also seen on the Satellite in foreman/production.log and the PostgreSQL log. The PostgreSQL log also contains the query that was aborted:
~~~
ERROR:  stack depth limit exceeded
HINT:  Increase the configuration parameter "max_stack_depth" (currently 2048kB), after ensuring the platform's stack depth limit is adequate.
STATEMENT:  select this_.consumer_id as y0_ from cp_consumer_guests this_ inner join cp_consumer gconsumer1_ on this_.consumer_id=gconsumer1_.id inner join cp_guest_ids_checkin checkins2_ on gconsumer1_.id=checkins2_.consumer_id where gconsumer1_.owner_id=$1 and (lower(this_.guest_id)=$2 or ... or lower(this_.guest_id)=$13169) order by checkins2_.updated desc
~~~

The ellipsis stands for a total of 13168 OR'd conditions, making the single query about 400 KB long, which trips PostgreSQL's stack depth check.
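
For illustration only (this is not necessarily the fix that shipped in candlepin-0.9.54-5): PostgreSQL parses a long chain of ORs as a deeply nested expression tree and checks stack depth while recursing over it, whereas an IN list / = ANY(array) comparison is a single flat node no matter how many values it carries. A minimal sketch with hypothetical guest IDs:
~~~
-- Generated OR chain: each additional condition deepens the expression
-- tree that PostgreSQL recurses over, eventually hitting max_stack_depth.
SELECT consumer_id
  FROM cp_consumer_guests
 WHERE lower(guest_id) = 'guest-0001'
    OR lower(guest_id) = 'guest-0002'
    OR lower(guest_id) = 'guest-0003';   -- ...repeated ~13000 times

-- Flat equivalent: one array comparison node regardless of value count,
-- so the stack depth stays constant.
SELECT consumer_id
  FROM cp_consumer_guests
 WHERE lower(guest_id) = ANY (ARRAY['guest-0001', 'guest-0002', 'guest-0003']);
~~~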


Version-Release number of selected component (if applicable):
satellite 6.1.1
virt-who-0.14-1.el7sat.noarch

How reproducible:
always

Steps to Reproduce:
1. Set up 600 ESXi hypervisors / 6500 VMs
2. Run virt-who

Actual results:
Runtime Error ERROR: stack depth limit exceeded

Expected results:
no error; the operation succeeds

Additional info:
- after increasing PostgreSQL's max_stack_depth to 3MB, the above works
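
A minimal sketch of that workaround, assuming the default RHEL data directory /var/lib/pgsql/data (adjust the path for your installation):
~~~
# Raise the per-backend stack limit from the 2MB default to 3MB.
echo "max_stack_depth = 3MB" >> /var/lib/pgsql/data/postgresql.conf

# max_stack_depth is superuser-settable, so a config reload is enough;
# no full PostgreSQL restart is needed.
su - postgres -c "psql -c 'SELECT pg_reload_conf();'"

# Confirm the new value is in effect.
su - postgres -c "psql -c 'SHOW max_stack_depth;'"
~~~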
Comment 4 Christian Horn 2016-01-11 08:27:18 EST
Any comments? Will this be addressed in 6.2?

Increasing max_stack_depth to 3MB might be a simple thing to do? Or are there other suggestions for workarounds?
Alternatively, could the query be structured differently?
Comment 5 Christian Horn 2016-01-11 08:35:01 EST
max_stack_depth (integer)

    Specifies the maximum safe depth of the server's execution stack. The ideal setting for this parameter is the actual stack size limit enforced by the kernel (as set by ulimit -s or local equivalent), less a safety margin of a megabyte or so. The safety margin is needed because the stack depth is not checked in every routine in the server, but only in key potentially-recursive routines such as expression evaluation. The default setting is two megabytes (2MB), which is conservatively small and unlikely to risk crashes. However, it might be too small to allow execution of complex functions. Only superusers can change this setting.

    Setting max_stack_depth higher than the actual kernel limit will mean that a runaway recursive function can crash an individual backend process. On platforms where PostgreSQL can determine the kernel limit, the server will not allow this variable to be set to an unsafe value. However, not all platforms provide the information, so caution is recommended in selecting a value.
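
Following the quoted guidance, the available headroom can be checked before raising the value; the commands below are a sketch for a default RHEL install (the values in the comments are assumptions):
~~~
# Kernel stack limit for the postgres user, in kB (e.g. 8192 on RHEL),
# which should exceed max_stack_depth by a safety margin of ~1MB.
su - postgres -c "ulimit -s"

# As a superuser, the parameter can also be raised per session for testing,
# without touching postgresql.conf.
su - postgres -c "psql -c \"SET max_stack_depth = '3MB'; SHOW max_stack_depth;\""
~~~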
Comment 7 Barnaby Court 2016-03-24 08:58:44 EDT
Hi, can you provide the Candlepin logs from the time of this error? It would help in tracking down the exact code path if we had the stack trace on the Candlepin side. Thanks!
Comment 14 Jonathan Gibert 2016-06-02 09:17 EDT
Created attachment 1164087 [details]
candlepin_stack
Comment 22 jcallaha 2016-09-23 16:11:43 EDT
Verified in Satellite 6.1.10 Snap 3.

1. Downloaded the file located here: http://file.rdu.redhat.com/csnyder/test_vodaphone.json
2. Registered the satellite to itself
3. Ran this command
     curl -k -X POST --cert /etc/pki/consumer/cert.pem --key /etc/pki/consumer/key.pem https://rhsm-qe-1.rhq.lab.eng.bos.redhat.com/rhsm/hypervisors -H "Content-Type: application/json" -d @"test_vodaphone.json"

The command completed successfully and all content hosts were added correctly (see attached). The output was captured and will be attached as well.
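
For context, step 2 is what produces the consumer certificate and key that the curl call presents; a typical self-registration looks roughly like this (the org name and credentials here are placeholders):
~~~
# Registering the Satellite host to itself writes the consumer
# certificate/key pair under /etc/pki/consumer/ used by the curl call above.
subscription-manager register --org="Default_Organization" \
    --username=admin --password=changeme
~~~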
Comment 23 jcallaha 2016-09-23 16:13 EDT
Created attachment 1204262 [details]
verification screenshot
Comment 24 jcallaha 2016-09-23 16:14 EDT
Created attachment 1204263 [details]
output.txt
Comment 26 errata-xmlrpc 2016-09-27 05:01:46 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1938
