Bug 1280346 - "stack depth limit exceeded" when submitting 600 ESXi hypervisors/6500 VMs via virt-who
"stack depth limit exceeded" when submitting 600ESXi hypervisors/6500VMs via ...
Status: CLOSED ERRATA
Product: Red Hat Satellite 6
Classification: Red Hat
Component: Candlepin
Version: 6.1.1
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Release: 6.1.10
Assigned To: satellite6-bugs
QA Contact: jcallaha
Keywords: Triaged
Depends On: 1327224
Blocks: 1296845 1321630 1351644
Reported: 2015-11-11 09:24 EST by Christian Horn
Modified: 2017-02-23 14:21 EST
CC: 15 users

Fixed In Version: candlepin-0.9.54-5
Doc Type: Bug Fix
Clones: 1321630 1327220 1327224
Last Closed: 2016-09-27 05:01:46 EDT
Type: Bug


Attachments:
candlepin_stack (9.43 KB, text/plain), 2016-06-02 09:17 EDT, Jonathan Gibert
verification screenshot (208.94 KB, image/png), 2016-09-23 16:13 EDT, jcallaha
output.txt (1.63 MB, text/plain), 2016-09-23 16:14 EDT, jcallaha


External Trackers:
Red Hat Knowledge Base (Solution) 2048593

Description Christian Horn 2015-11-11 09:24:47 EST
Description of problem:
We are running Satellite 6.1.1 plus a hotfix and have now configured virt-who (virt-who-0.14-1.el7sat.noarch).

The initial test was with one cluster of a vCenter, containing 300 ESXi hypervisors and 3500 VMs. That task completed OK in ~1.5 min (virt-who logged a timeout while talking to subscription-manager, but the data was received correctly by the Satellite).

Now we want to submit the whole vCenter, which contains ~600 ESXi hypervisors and 6500 VMs. Doing so, we get the following error from virt-who:
~~~
Error in communication with subscription manager:
        Runtime Error ERROR: stack depth limit exceeded
  Hint: Increase the configuration parameter "max_stack_depth" (currently 2048kB), after ensuring the platform's stack depth limit is adequate. at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse:2,157
~~~

This is also seen on the Satellite in foreman/production.log and the PostgreSQL log. The PostgreSQL log also contains the query that was aborted:
~~~
ERROR:  stack depth limit exceeded
HINT:  Increase the configuration parameter "max_stack_depth" (currently 2048kB), after ensuring the platform's stack depth limit is adequate.
STATEMENT:  select this_.consumer_id as y0_ from cp_consumer_guests this_ inner join cp_consumer gconsumer1_ on this_.consumer_id=gconsumer1_.id inner join cp_guest_ids_checkin checkins2_ on gconsumer1_.id=checkins2_.consumer_id where gconsumer1_.owner_id=$1 and (lower(this_.guest_id)=$2 or ... or lower(this_.guest_id)=$13169) order by checkins2_.updated desc
~~~

The ellipsis stands for a total of 13168 OR'd conditions, making the single query about 400 KB long, which trips PostgreSQL's stack depth check.
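
For illustration only (this is not necessarily the fix that shipped in candlepin-0.9.54-5): PostgreSQL parses a long chain of ORs as a deeply nested expression tree and checks stack depth while recursing over it, whereas an IN list / = ANY(array) comparison is a single flat node no matter how many values it carries. A minimal sketch with hypothetical guest IDs:
~~~
-- Generated OR chain: each additional condition deepens the expression
-- tree that PostgreSQL recurses over, eventually hitting max_stack_depth.
SELECT consumer_id
  FROM cp_consumer_guests
 WHERE lower(guest_id) = 'guest-0001'
    OR lower(guest_id) = 'guest-0002'
    OR lower(guest_id) = 'guest-0003';   -- ...repeated ~13000 times

-- Flat equivalent: one array comparison node regardless of value count,
-- so the stack depth stays constant.
SELECT consumer_id
  FROM cp_consumer_guests
 WHERE lower(guest_id) = ANY (ARRAY['guest-0001', 'guest-0002', 'guest-0003']);
~~~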


Version-Release number of selected component (if applicable):
satellite 6.1.1
virt-who-0.14-1.el7sat.noarch

How reproducible:
always

Steps to Reproduce:
1. Set up 600 ESXi hypervisors / 6500 VMs
2. Run virt-who

Actual results:
Runtime Error ERROR: stack depth limit exceeded

Expected results:
no error; the operation succeeds

Additional info:
- after increasing PostgreSQL's max_stack_depth to 3MB, the above works
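
A minimal sketch of that workaround, assuming the default RHEL data directory /var/lib/pgsql/data (adjust the path for your installation):
~~~
# Raise the per-backend stack limit from the 2MB default to 3MB.
echo "max_stack_depth = 3MB" >> /var/lib/pgsql/data/postgresql.conf

# max_stack_depth is superuser-settable, so a config reload is enough;
# no full PostgreSQL restart is needed.
su - postgres -c "psql -c 'SELECT pg_reload_conf();'"

# Confirm the new value is in effect.
su - postgres -c "psql -c 'SHOW max_stack_depth;'"
~~~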
Comment 4 Christian Horn 2016-01-11 08:27:18 EST
Any comments? Will this be addressed in 6.2?

Increasing max_stack_depth to 3MB might be a simple thing to do? Or are there other suggestions for workarounds?
Alternatively, could the query be structured differently?
Comment 5 Christian Horn 2016-01-11 08:35:01 EST
max_stack_depth (integer)

    Specifies the maximum safe depth of the server's execution stack. The ideal setting for this parameter is the actual stack size limit enforced by the kernel (as set by ulimit -s or local equivalent), less a safety margin of a megabyte or so. The safety margin is needed because the stack depth is not checked in every routine in the server, but only in key potentially-recursive routines such as expression evaluation. The default setting is two megabytes (2MB), which is conservatively small and unlikely to risk crashes. However, it might be too small to allow execution of complex functions. Only superusers can change this setting.

    Setting max_stack_depth higher than the actual kernel limit will mean that a runaway recursive function can crash an individual backend process. On platforms where PostgreSQL can determine the kernel limit, the server will not allow this variable to be set to an unsafe value. However, not all platforms provide the information, so caution is recommended in selecting a value.
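
Following the quoted guidance, the available headroom can be checked before raising the value; the commands below are a sketch for a default RHEL install (the values in the comments are assumptions):
~~~
# Kernel stack limit for the postgres user, in kB (e.g. 8192 on RHEL),
# which should exceed max_stack_depth by a safety margin of ~1MB.
su - postgres -c "ulimit -s"

# As a superuser, the parameter can also be raised per session for testing,
# without touching postgresql.conf.
su - postgres -c "psql -c \"SET max_stack_depth = '3MB'; SHOW max_stack_depth;\""
~~~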
Comment 7 Barnaby Court 2016-03-24 08:58:44 EDT
Hi, can you provide the Candlepin logs from the time of this error? It would help in tracking down the exact code path if we had the stack trace on the Candlepin side. Thanks!
Comment 14 Jonathan Gibert 2016-06-02 09:17 EDT
Created attachment 1164087 [details]
candlepin_stack
Comment 22 jcallaha 2016-09-23 16:11:43 EDT
Verified in Satellite 6.1.10 Snap 3.

1. Downloaded the file located here: http://file.rdu.redhat.com/csnyder/test_vodaphone.json
2. Registered the satellite to itself
3. Ran this command
     curl -k -X POST --cert /etc/pki/consumer/cert.pem --key /etc/pki/consumer/key.pem https://rhsm-qe-1.rhq.lab.eng.bos.redhat.com/rhsm/hypervisors -H "Content-Type: application/json" -d @"test_vodaphone.json"

The command completed successfully and all content hosts were added correctly (see attached). The output was captured and will be attached as well.
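
For context, step 2 is what produces the consumer certificate and key that the curl call presents; a typical self-registration looks roughly like this (the org name and credentials here are placeholders):
~~~
# Registering the Satellite host to itself writes the consumer
# certificate/key pair under /etc/pki/consumer/ used by the curl call above.
subscription-manager register --org="Default_Organization" \
    --username=admin --password=changeme
~~~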
Comment 23 jcallaha 2016-09-23 16:13 EDT
Created attachment 1204262 [details]
verification screenshot
Comment 24 jcallaha 2016-09-23 16:14 EDT
Created attachment 1204263 [details]
output.txt
Comment 26 errata-xmlrpc 2016-09-27 05:01:46 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1938
