Bug 1280346 - "stack depth limit exceeded" when submitting 600 ESXi hypervisors/6500 VMs via virt-who
Summary: "stack depth limit exceeded" when submitting 600 ESXi hypervisors/6500 VMs via virt-who
Status: CLOSED ERRATA
Product: Red Hat Satellite
Classification: Red Hat
Component: Candlepin
Version: 6.1.1
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: Unspecified
Assignee: satellite6-bugs
QA Contact: jcallaha
Depends On: 1327224
Blocks: 1296845 1321630 1351644
 
Reported: 2015-11-11 14:24 UTC by Christian Horn
Modified: 2021-08-30 11:42 UTC
CC: 15 users

Fixed In Version: candlepin-0.9.54-5
Doc Type: Bug Fix
Clones: 1321630 1327220 1327224
Last Closed: 2016-09-27 09:01:46 UTC


Attachments
candlepin_stack (9.43 KB, text/plain), 2016-06-02 13:17 UTC, Jonathan Gibert
verification screenshot (208.94 KB, image/png), 2016-09-23 20:13 UTC, jcallaha
output.txt (1.63 MB, text/plain), 2016-09-23 20:14 UTC, jcallaha


Links
Red Hat Knowledge Base (Solution) 2048593
Red Hat Product Errata RHBA-2016:1938 (SHIPPED_LIVE): Satellite 6.1.10 bug fix update, last updated 2016-09-27 12:56:10 UTC

Description Christian Horn 2015-11-11 14:24:47 UTC
Description of problem:
We are running Satellite 6.1.1 plus a hotfix and have now configured virt-who (virt-who-0.14-1.el7sat.noarch).

The initial test was with one cluster of a vCenter which contained 300 ESXi hosts and 3500 VMs. That task went OK in ~1.5 min (virt-who logged a timeout while talking to subscription-manager, but the data was received by the Satellite).

Now we wanted to submit the whole vCenter, which contains ~600 ESXi hosts and 6500 VMs. Doing so, we get the following error from virt-who:
~~~
Error in communication with subscription manager:
        Runtime Error ERROR: stack depth limit exceeded
  Hint: Increase the configuration parameter "max_stack_depth" (currently 2048kB), after ensuring the platform's stack depth limit is adequate. at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse:2,157
~~~

This is also seen on the Satellite in foreman/production.log and the PostgreSQL log. The PostgreSQL log also contains the query that was aborted:
~~~
ERROR:  stack depth limit exceeded
HINT:  Increase the configuration parameter "max_stack_depth" (currently 2048kB), after ensuring the platform's stack depth limit is adequate.
STATEMENT:  select this_.consumer_id as y0_ from cp_consumer_guests this_ inner join cp_consumer gconsumer1_ on this_.consumer_id=gconsumer1_.id inner join cp_guest_ids_checkin checkins2_ on gconsumer1_.id=checkins2_.consumer_id where gconsumer1_.owner_id=$1 and (lower(this_.guest_id)=$2 or ... or lower(this_.guest_id)=$13169) order by checkins2_.updated desc
~~~

The ellipsis stands for a total of 13168 "OR" terms, making this single query roughly 400 KB, which is what makes the PostgreSQL stack-depth checker unhappy.
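
For illustration only, and not necessarily how the eventual fix in candlepin-0.9.54-5 was implemented: binding all guest IDs as a single array parameter keeps the statement small and the expression tree flat, because "= ANY(array)" is one node regardless of list length, whereas a chain of thousands of OR terms is what drives the deep recursion here. A minimal sketch of the same query in that form:
~~~
-- Sketch only: $2 is bound as a text[] holding all (lower-cased)
-- guest IDs, so statement size and parser recursion stay constant.
select this_.consumer_id as y0_
from cp_consumer_guests this_
inner join cp_consumer gconsumer1_
        on this_.consumer_id = gconsumer1_.id
inner join cp_guest_ids_checkin checkins2_
        on gconsumer1_.id = checkins2_.consumer_id
where gconsumer1_.owner_id = $1
  and lower(this_.guest_id) = any($2)
order by checkins2_.updated desc;
~~~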


Version-Release number of selected component (if applicable):
satellite 6.1.1
virt-who-0.14-1.el7sat.noarch

How reproducible:
always

Steps to Reproduce:
1. set up 600 ESXi hypervisors/6500 VMs
2. run virt-who

Actual results:
Runtime Error ERROR: stack depth limit exceeded

Expected results:
no error, succeeding operation

Additional info:
- after increasing PostgreSQL's max_stack_depth to 3MB, the above works
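
As a sketch of that workaround (the data directory path below is an assumption; adjust it for your installation):
~~~
# in /var/lib/pgsql/data/postgresql.conf (path assumed):
max_stack_depth = 3MB   # default is 2MB; keep it below "ulimit -s" minus ~1MB

# then reload PostgreSQL so new sessions pick up the value, e.g.:
#   service postgresql reload
~~~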

Comment 4 Christian Horn 2016-01-11 13:27:18 UTC
Any comments? Will this be addressed in 6.2?

Increasing max_stack_depth to 3MB might be a simple thing to do? Or are there other suggestions for workarounds?
Alternatively, the query could be structured differently, maybe?
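
One hedged sketch of such a restructuring, reusing the table and column names from the logged statement (illustrative only, not the fix that eventually shipped): stage the guest IDs in a temporary table and join against it, so the statement text no longer grows with the number of VMs.
~~~
-- Sketch: bulk-load the ~6500 guest IDs once (e.g. via COPY or
-- batched INSERTs), then join instead of enumerating OR terms.
create temporary table tmp_guest_ids (guest_id text);

select this_.consumer_id
from cp_consumer_guests this_
inner join tmp_guest_ids t
        on lower(this_.guest_id) = t.guest_id;
~~~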

Comment 5 Christian Horn 2016-01-11 13:35:01 UTC
max_stack_depth (integer)

    Specifies the maximum safe depth of the server's execution stack. The ideal setting for this parameter is the actual stack size limit enforced by the kernel (as set by ulimit -s or local equivalent), less a safety margin of a megabyte or so. The safety margin is needed because the stack depth is not checked in every routine in the server, but only in key potentially-recursive routines such as expression evaluation. The default setting is two megabytes (2MB), which is conservatively small and unlikely to risk crashes. However, it might be too small to allow execution of complex functions. Only superusers can change this setting.

    Setting max_stack_depth higher than the actual kernel limit will mean that a runaway recursive function can crash an individual backend process. On platforms where PostgreSQL can determine the kernel limit, the server will not allow this variable to be set to an unsafe value. However, not all platforms provide the information, so caution is recommended in selecting a value.
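
A worked example of that sizing rule (the 8192 kB figure is assumed for illustration; check the actual limit on your system):
~~~
$ ulimit -s
8192
# 8192 kB = 8MB kernel stack limit; minus the ~1MB safety margin, a
# max_stack_depth of up to ~7MB would be safe, so the 3MB used as a
# workaround in this report is well within bounds.
~~~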

Comment 7 Barnaby Court 2016-03-24 12:58:44 UTC
Hi, can you provide the Candlepin logs from the time of this error? It would help in tracking down the exact code path if we had the stack trace on the Candlepin side. Thanks!

Comment 14 Jonathan Gibert 2016-06-02 13:17:42 UTC
Created attachment 1164087 [details]
candlepin_stack

Comment 22 jcallaha 2016-09-23 20:11:43 UTC
Verified in Satellite 6.1.10 Snap 3.

1. downloaded the file located here: http://file.rdu.redhat.com/csnyder/test_vodaphone.json
2. Registered the satellite to itself
3. Ran this command
     curl -k -X POST \
          --cert /etc/pki/consumer/cert.pem \
          --key /etc/pki/consumer/key.pem \
          https://rhsm-qe-1.rhq.lab.eng.bos.redhat.com/rhsm/hypervisors \
          -H "Content-Type: application/json" \
          -d @"test_vodaphone.json"

The command completed successfully and all content hosts were added correctly (see attached). The output was captured and will be attached as well.

Comment 23 jcallaha 2016-09-23 20:13:01 UTC
Created attachment 1204262 [details]
verification screenshot

Comment 24 jcallaha 2016-09-23 20:14:00 UTC
Created attachment 1204263 [details]
output.txt

Comment 26 errata-xmlrpc 2016-09-27 09:01:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1938

