Bug 1187774 - smart proxy puppet environment list errors under load
Summary: smart proxy puppet environment list errors under load
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Satellite 6
Classification: Red Hat
Component: Configuration Management
Version: 6.0.7
Hardware: Unspecified
OS: Unspecified
high
high vote
Target Milestone: Unspecified
Assignee: Dmitri Dolguikh
QA Contact: Katello QA List
Russell Dickenson
URL:
Whiteboard:
Depends On:
Blocks: 1122832 sat61-release-notes
TreeView+ depends on / blocked
 
Reported: 2015-01-30 20:24 UTC by Peter Vreman
Modified: 2018-02-01 12:36 UTC (History)
10 users (show)

Fixed In Version:
Doc Type: Known Issue
Doc Text:
Issue: The smart proxy listing of Puppet environments or Puppet classes fails when multiple content views (2 or more parallel) are published. When this occurs, the content view publishing process fails and prevents any further builds until ??? Workaround: Unknown.
Clone Of:
Environment:
Last Closed: 2016-10-14 15:27:00 UTC


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Red Hat Bugzilla 1250083 None CLOSED Smart proxy puppet classes REST timeout of 60 seconds too small 2019-03-21 09:42:08 UTC
Red Hat Bugzilla 1540974 None CLOSED foreman-poxy raises "Failed to show puppet classes: Caught recursion on hostname" on listing classes for environments 2019-03-21 09:42:08 UTC

Internal Links: 1250083 1540974

Description Peter Vreman 2015-01-30 20:24:53 UTC
Description of problem:
The smart proxy listing of puppet environments or puppet classes fails under load.
This happens e.g. when multiple content views (3-4 in parallel) are published. This then pauses the content view publishing process creating stale locks breaking further (composite) content view publishing.


Reproducing script:

Create 30 puppet environments. Run the following command that will schedule concurrent requests for each environment:

for env in $(ls -1 /etc/puppet/environments); do
  curl --cert /etc/foreman/client_cert.pem --key /etc/foreman/client_key.pem --cacert /etc/foreman/proxy_ca.pem -k -H "Accept: application/json" https://localhost:9090/puppet/environments/$env &
done
tail -f /var/log/foreman-proxy/proxy.log | grep ERROR




Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Create 30 puppet environments
2. See shell commands above
3.

Actual results:
# tail -f /var/log/foreman-proxy/proxy.log | grep ERROR
E, [2015-01-30T20:16:13.553930 #10426] ERROR -- : Failed to show puppet environment: Error converting value for param 'basemodulepath': Could not find value for $confdir
E, [2015-01-30T20:16:14.075168 #10426] ERROR -- : Failed to show puppet environment: Error converting value for param 'basemodulepath': Could not find value for $confdir
E, [2015-01-30T20:16:14.317551 #10426] ERROR -- : Failed to show puppet environment: Error converting value for param 'basemodulepath': Could not find value for $confdir


Expected results:
Concurrent requests handled without errors.


Additional info:

Comment 1 Peter Vreman 2015-01-30 20:32:17 UTC
Example error of a failed Content View Publishing task in Katello:

Action:

Actions::Katello::Foreman::ContentUpdate

Input:

{"environment_id"=>2,
 "content_view_id"=>49,
 "remote_user"=>"hoici-b11965af",
 "locale"=>"en"}

Output:

{}

Exception:

ProxyAPI::ProxyException: ERF12-4115 [ProxyAPI::ProxyException]: Unable to get classes from Puppet for KT_Hilti_Library_product_HOIPRODUCTS_1_1_0_ci_19 ([RestClient::NotAcceptable]: 406 Not Acceptable) for proxy https://li-lc-1017.hag.hilti.com:9090/puppet

Comment 2 RHEL Product and Program Management 2015-01-30 20:33:44 UTC
Since this issue was entered in Red Hat Bugzilla, the release flag has been
set to ? to ensure that it is properly evaluated for this release.

Comment 4 Tom Caspy 2015-02-18 12:58:09 UTC
is this affecting customers? this seems to be something which shouldn't happen in real life use cases...
If this is only a theoretical bug which doesn't affect real life use cases, I suggest setting priority to low or medium.

Comment 7 Peter Vreman 2015-02-18 13:39:19 UTC
This bug was discovered in real life use case when you publish/promote composite content views after a component content view was updated.

Another use case is with Continuous Integration systems, where the long publishing times of 15+ minutes, parallel actions on multiple component views and composite content views can easily be triggered. 

The reproduction is an easy command that demonstrates the error more quickly.

Comment 10 Peter Vreman 2015-04-07 09:14:59 UTC
What is the status of this issue. I still have this issue every day. It prevents me from enabling the full automation in our Continuous Integration environment. Because when CI is enabled it will build multiple releases and content views at the same time. When 2+ CVs are building rebuild there is a 80% chance that it will fail with this concurrency error and blocking all other future builds.

Comment 11 Tom Caspy 2015-04-07 11:45:57 UTC
The reason for this issue is that the capsule is using Webrick directly instead of going through Httpd+passenger.
We can patch it around and have a probable quick fix which will fail under higher load (simply put the capsule behind passenger, which will cost far greater in memory), or we can deliver a good solution which gracefully fails and retries later on if failures occur.

I prefer the second approach, but it will take a while to deliver. I'm consulting with our manager and will decide which approach to take right now. We can, if this is critical (and it sounds so), add the quick patch to the 6.1 version, and have the proper implementation done in 6.2

Comment 12 Peter Vreman 2015-04-07 12:37:40 UTC
The time it takes for the /classes call is between 2-4 seconds. The total time to retrieve for each environment all classes is then multiplied with the number of environments. We have 32 environments, taking about 90 seconds. During those 90 seconds no other call is allowed

See trace below where 1 process was retrieving all the classes and two other requests came in to retrieve the environments and those were failing:

1.1.1.1 - - [31/Mar/2015 12:41:43] "GET /puppet/environments HTTP/1.1" 200 1389 0.8387
1.1.1.1 - - [31/Mar/2015 12:41:47] "GET /puppet/environments/KT_Hilti_Library_hg_dev_17/classes HTTP/1.1" 200 10080 3.1246
1.1.1.1 - - [31/Mar/2015 12:41:50] "GET /puppet/environments HTTP/1.1" 200 1389 0.8812
1.1.1.1 - - [31/Mar/2015 12:41:50] "GET /puppet/environments/KT_Hilti_Library_product_HOIPRODUCTS_1_0_0_sp18_22/classes HTTP/1.1" 200 4862 2.5014
E, [2015-03-31T12:41:52.138672 #10899] ERROR -- : Failed to show puppet classes: Error converting value for param 'basemodulepath': Could not find value for $confdir
1.1.1.1 - - [31/Mar/2015 12:41:52] "GET /puppet/environments/KT_Hilti_Library_hg_dev_17/classes HTTP/1.1" 406 115 0.6211
1.1.1.1 - - [31/Mar/2015 12:41:52] "GET /puppet/environments/KT_Hilti_Library_product_HOICI_1_0_0_sp20_29/classes HTTP/1.1" 200 378 0.8968
1.1.1.1 - - [31/Mar/2015 12:41:56] "GET /puppet/environments/KT_Hilti_Library_product_HOIOS_6_5_0_sp20_30/classes HTTP/1.1" 200 5219 2.8761
1.1.1.1 - - [31/Mar/2015 12:42:01] "GET /puppet/environments/KT_Hilti_Library_hg_dev__Sat6_20/classes HTTP/1.1" 200 10353 3.9652
1.1.1.1 - - [31/Mar/2015 12:42:05] "GET /puppet/environments/KT_Hilti_Library_product_HOIOS_6_5_0_sp19_27/classes HTTP/1.1" 200 5115 2.3924
1.1.1.1 - - [31/Mar/2015 12:42:08] "GET /puppet/environments/KT_Hilti_Library_product_HOIOS_6_6_0_ci_7/classes HTTP/1.1" 200 5155 2.4159
1.1.1.1 - - [31/Mar/2015 12:42:11] "GET /puppet/environments/KT_Hilti_Library_hg_crash__IPS__Test_6_6_0_ci_13/classes HTTP/1.1" 200 10016 2.8175
1.1.1.1 - - [31/Mar/2015 12:42:14] "GET /puppet/environments/KT_Hilti_Library_product_HOIPRODUCTS_1_1_0_ci_10/classes HTTP/1.1" 200 4862 1.5934
1.1.1.1 - - [31/Mar/2015 12:42:16] "GET /puppet/environments/KT_Hilti_Library_product_HOICI_1_0_0_sp19_25/classes HTTP/1.1" 200 378 1.2671
1.1.1.1 - - [31/Mar/2015 12:42:19] "GET /puppet/environments/KT_Hilti_Library_product_HOIPRODUCTS_1_0_0_sp19_26/classes HTTP/1.1" 200 4862 2.2700
1.1.1.1 - - [31/Mar/2015 12:42:25] "GET /puppet/environments/KT_Hilti_Library_hg_crash__Sat6CI__Satellite6_CITest_6_5_0_15/classes HTTP/1.1" 200 10457 4.1300
1.1.1.1 - - [31/Mar/2015 12:42:27] "GET /puppet/environments/KT_Hilti_Library_product_HOICI_1_0_0_ci_3/classes HTTP/1.1" 200 378 1.0535
1.1.1.1 - - [31/Mar/2015 12:42:31] "GET /puppet/environments/KT_Hilti_Library_product_HOIOS_6_5_0_sp17_6/classes HTTP/1.1" 200 5116 3.1356
1.1.1.1 - - [31/Mar/2015 12:42:33] "GET /puppet/environments/KT_Hilti_Library_product_HOICI_1_0_0_sp18_24/classes HTTP/1.1" 200 378 1.4640
1.1.1.1 - - [31/Mar/2015 12:42:35] "GET /puppet/environments/KT_Hilti_Library_product_HOIPRODUCTS_1_0_0_sp17_9/classes HTTP/1.1" 200 4862 2.0812
1.1.1.1 - - [31/Mar/2015 12:42:38] "GET /puppet/environments/KT_Hilti_Library_product_HOIPRODUCTS_1_0_0_sp20_28/classes HTTP/1.1" 200 4862 2.1670
1.1.1.1 - - [31/Mar/2015 12:42:44] "GET /puppet/environments/KT_Hilti_Library_hg_crash_11/classes HTTP/1.1" 200 10080 3.5882
1.1.1.1 - - [31/Mar/2015 12:42:50] "GET /puppet/environments/KT_Hilti_Library_hg_crash__Sat6CI_14/classes HTTP/1.1" 200 10457 4.0122
1.1.1.1 - - [31/Mar/2015 12:42:56] "GET /puppet/environments/KT_Hilti_Library_hg_crash__Sat6CI__Satellite6_CITest_6_6_0_16/classes HTTP/1.1" 200 10393 4.2957
1.1.1.1 - - [31/Mar/2015 12:43:02] "GET /puppet/environments HTTP/1.1" 200 1389 0.5430
1.1.1.1 - - [31/Mar/2015 12:43:02] "GET /puppet/environments/KT_Hilti_Library_hg_qa_21/classes HTTP/1.1" 200 10080 4.8794
E, [2015-03-31T12:43:05.984226 #10899] ERROR -- : Failed to show puppet classes: Error converting value for param 'basemodulepath': Could not find value for $confdir
1.1.1.1 - - [31/Mar/2015 12:43:06] "GET /puppet/environments/KT_Hilti_Library_hg_dev_17/classes HTTP/1.1" 406 115 0.8545
1.1.1.1 - - [31/Mar/2015 12:43:06] "GET /puppet/environments/KT_Hilti_Library_product_HOICI_1_0_0_sp17_4/classes HTTP/1.1" 200 378 1.8405
1.1.1.1 - - [31/Mar/2015 12:43:08] "GET /puppet/environments/example_env/classes HTTP/1.1" 200 2 1.4511
1.1.1.1 - - [31/Mar/2015 12:43:11] "GET /puppet/environments/KT_Hilti_Library_product_HOIPRODUCTS_1_0_0_ci_8/classes HTTP/1.1" 200 4862 2.3122
1.1.1.1 - - [31/Mar/2015 12:43:17] "GET /puppet/environments/KT_Hilti_Library_hg_dev__IPS__Test_6_6_0_ci_19/classes HTTP/1.1" 200 10016 4.0574
1.1.1.1 - - [31/Mar/2015 12:43:22] "GET /puppet/environments/KT_Hilti_Library_product_HOIOS_6_5_0_sp18_23/classes HTTP/1.1" 200 5115 3.7882
1.1.1.1 - - [31/Mar/2015 12:43:27] "GET /puppet/environments/KT_Hilti_Library_hg_crash__IPS__Test_6_5_0_ci_12/classes HTTP/1.1" 200 10080 3.8703
1.1.1.1 - - [31/Mar/2015 12:43:33] "GET /puppet/environments/KT_Hilti_Library_hg_prod_31/classes HTTP/1.1" 200 10080 3.9834
1.1.1.1 - - [31/Mar/2015 12:43:38] "GET /puppet/environments/KT_Hilti_Library_hg_dev__IPS__Test_6_5_0_ci_18/classes HTTP/1.1" 200 10080 4.0939
1.1.1.1 - - [31/Mar/2015 12:43:43] "GET /puppet/environments/KT_Hilti_Library_product_HOIOS_6_5_0_ci_5/classes HTTP/1.1" 200 5219 3.1980
1.1.1.1 - - [31/Mar/2015 12:43:49] "GET /puppet/environments/KT_Hilti_Library_hg_prod__Sat6_32/classes HTTP/1.1" 200 10457 3.8847

Alternative might be to have a single REST call that retrieve classes for all environments. That reduces the amount of HTTP requests.

A second improvement is to reduce the publishing of Content Views when the content view is only be used as part of Composite Content Views and shall not be used for direct assignment, see BZ1186419. In our environment that will save 50+% of currently unused directories in /etc/puppet/environments.

Comment 13 Ohad Levy 2015-05-07 14:15:28 UTC
as a side note, foreman 1.8 introduced a caching layer to the proxy puppet class parser, which can also help speed up this. (http://theforeman.org/manuals/1.8/index.html#Headlinefeatures)

Comment 14 Peter Vreman 2015-08-04 11:52:41 UTC
Verified on Sat6.1.0 that the foreman proxy does now work without an error.
But initial response time without caching is 77 seconds, this is still too slow that the REST from Katello call times out. A second call with 3 seconds looks like the caching is working:


# zgrep hg_crash_ foreman-proxy/proxy.log-20150804.gz
10.92.14.128 - - [03/Aug/2015 12:54:19] "GET /puppet/environments/KT_Hilti_Library_hg_crash_18/classes HTTP/1.1" 200 11417 76.6783
10.92.14.128 - - [03/Aug/2015 12:56:51] "GET /puppet/environments/KT_Hilti_Library_hg_crash_18/classes HTTP/1.1" 200 11417 3.5445


The correspond log from katello, that shows it times out 16 seconds before the smart proxy finished processing the request, that is matches nicely with a REST timeout setting of 60 seconds:

2015-08-03 12:54:03 [E] ERF12-4115 [ProxyAPI::ProxyException]: Impossible d'obtenir les classes Puppet pour KT_Hilti_Library_hg_crash_18 ([RestClient::RequestTimeout]: Request Timeout) pour le proxy https://li-lc-1578.hag.hilti.com:9090/puppet (ProxyAPI::ProxyException)
/usr/share/foreman/lib/proxy_api/puppet.rb:27:in `rescue in classes'
/usr/share/foreman/lib/proxy_api/puppet.rb:21:in `classes'
/usr/share/foreman/app/services/puppet_class_importer.rb:155:in `actual_classes'
/usr/share/foreman/app/services/puppet_class_importer.rb:78:in `new_classes_for'
/usr/share/foreman/app/services/puppet_class_importer.rb:39:in `changes'
/opt/rh/ruby193/root/usr/share/gems/gems/katello-2.2.0.51/app/services/katello/puppet_class_importer_extensions.rb:22:in `update_environment'
/opt/rh/ruby193/root/usr/share/gems/gems/katello-2.2.0.51/app/lib/katello/foreman.rb:36:in `update_puppet_environment'
/opt/rh/ruby193/root/usr/share/gems/gems/katello-2.2.0.51/app/lib/actions/katello/foreman/content_update.rb:32:in `finalize'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.7/lib/dynflow/action.rb:503:in `block (2 levels) in execute_finalize'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.7/lib/dynflow/middleware/stack.rb:26:in `call'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.7/lib/dynflow/middleware/stack.rb:26:in `pass'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.7/lib/dynflow/middleware.rb:16:in `pass'

Comment 15 Peter Vreman 2015-08-04 13:35:52 UTC
Created BZ1250083 for the timeout issue mentioned in the previous comment.
The too small timeout value of 60 seconds is defined in Katello and not the smart proxy.

So for this BZ i confirm that it is fixed also on my system with Sat6.1.0

Comment 21 Peter Vreman 2015-10-26 08:12:27 UTC
In Sat6.1.3 there is still an issue with the smart proxy receiving multiple concurrent request:

I, [2015-10-26T08:06:50.267352 #31202]  INFO -- : Initializing from Puppet config file: /etc/puppet/puppet.conf
I, [2015-10-26T08:06:50.390772 #31202]  INFO -- : Initializing from Puppet config file: /etc/puppet/puppet.conf
I, [2015-10-26T08:06:50.563708 #31202]  INFO -- : Initializing from Puppet config file: /etc/puppet/puppet.conf
E, [2015-10-26T08:06:50.586021 #31202] ERROR -- : Failed to show puppet classes: Attempting to initialize global default settings more than once!
I, [2015-10-26T08:06:50.586520 #31202]  INFO -- : Initializing from Puppet config file: /etc/puppet/puppet.conf
10.92.14.36 - - [26/Oct/2015 08:06:50] "GET /puppet/environments/KT_Hilti_Library_product_HOIRHEL_6_6_d20151025_49/classes HTTP/1.1" 406 95 2.2388

The error above shall Never happen, when it is triggered it means the code is still not 100% stable and reliable for concurrency

Comment 27 Dmitri Dolguikh 2016-06-14 11:57:46 UTC
I would suggest 
 - switching to environments api instead of config-file based environments (for puppet version > 3.2, set use_environment_api to true). 
 - use the most modern ruby vm as possible (Don't know what rhel version is used to run smart-proxy. 2.2.x is faster than 1.9.3 which is faster than 1.8.7)
 - Not sure if this is an option, but switching to puppet version > 4.0 and using an upstream version of smart-proxy should eliminate any issues with environments or classes -- it relies on puppet api only.

Comment 28 Peter Vreman 2016-10-14 15:27:00 UTC
Not issues seen anymore on 6.2.x

Comment 29 Andrew Dahms 2017-05-23 13:38:21 UTC
Changing the 'requires_doc_text' flag to '-'.


Note You need to log in before you can comment on or make changes to this bug.