Bug 1581655 - 502 Proxy error when ordering service catalog
Summary: 502 Proxy error when ordering service catalog
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Appliance
Version: 5.9.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: GA
Target Release: 5.9.4
Assignee: Brandon Dunne
QA Contact: Dave Johnson
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-05-23 10:32 UTC by Chen
Modified: 2021-09-09 14:12 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-06-13 13:41:21 UTC
Category: ---
Cloudforms Team: CFME Core
Target Upstream Version:
Embargoed:


Attachments
The error messages showing on the UI when ordering the service catalog (50.43 KB, image/png)
2018-05-23 10:33 UTC, Chen

Description Chen 2018-05-23 10:32:20 UTC
Description of problem:

502 Proxy error when ordering service catalog

The production.log shows the GET API request took over 2 minutes to finish:

[----] I, [2018-05-18T18:33:48.724983 #3510:14595b0]  INFO -- : Started GET "/api/service_dialogs/1000000000009?resource_action_id=1000000000140&target_id=1000000000009&target_type=service_template" for 127.0.0.1 at 2018-05-18 18:33:48 +0900
[----] I, [2018-05-18T18:33:48.727524 #3510:14595b0]  INFO -- : Processing by Api::ServiceDialogsController#show as JSON
[----] I, [2018-05-18T18:33:48.727601 #3510:14595b0]  INFO -- :   Parameters: {"resource_action_id"=>"1000000000140", "target_id"=>"1000000000009", "target_type"=>"service_template", "c_id"=>"1000000000009"}
[----] I, [2018-05-18T18:36:30.322966 #3510:14595b0]  INFO -- : Completed 200 OK in 161595ms (Views: 0.1ms | ActiveRecord: 38.2ms)
Version-Release number of selected component (if applicable):

In apache/ssl_error.log we have:

[Fri May 18 18:35:48.812855 2018] [proxy_http:error] [pid 5542] (70007)The timeout specified has expired: [client 10.42.219.110:62010] AH01102: error reading status line from remote server 0.0.0.0:4000, referer: https://10.42.224.46/catalog/explorer
[Fri May 18 18:35:48.812891 2018] [proxy:error] [pid 5542] [client 10.42.219.110:62010] AH00898: Error reading from remote server returned by /api/service_dialogs/1000000000009, referer: https://10.42.224.46/catalog/explorer

But the puma CPU usage is not high at all.
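The AH01102/AH00898 errors indicate that Apache's mod_proxy gave up waiting for the Puma backend before the slow request completed. A possible workaround (not a fix for the underlying slow automate method) is to raise the proxy timeout; the file path and the 300-second value below are assumptions, not a documented setting for this appliance:

```apache
# Hypothetical mod_proxy tweak: allow slow /api requests to finish instead
# of returning a 502. File path and timeout value are examples only.
# /etc/httpd/conf.d/manageiq-external-ssl.conf
ProxyTimeout 300
```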

How reproducible:

100% at the customer's site

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Logs are in collab-shell://cases/02102081/evm_current_hudccf21_20180518_184649.tar.xz/

Comment 2 Chen 2018-05-23 10:33:01 UTC
Rebooting the appliance didn't solve the problem. Attaching the errors in the UI as well.

Best Regards,
Chen

Comment 3 Chen 2018-05-23 10:33:56 UTC
Created attachment 1440544 [details]
The error messages showing on the UI when ordering the service catalog

Comment 6 Brandon Dunne 2018-05-23 18:46:52 UTC
The dialog load synchronously calls an automate method at /ManageIQ/Cloud/Orchestration/Operations/Methods/Available_Tenants, and it is taking a long time (over a minute) to complete in their environment:
[----] I, [2018-05-18T18:33:48.880914 #3510:14595b0]  INFO -- : <AEMethod [/ManageIQ/Cloud/Orchestration/Operations/Methods/Available_Tenants]> Starting 
[----] I, [2018-05-18T18:35:09.506019 #3510:14595b0]  INFO -- : <AEMethod [/ManageIQ/Cloud/Orchestration/Operations/Methods/Available_Tenants]> Ending

Can they override the built-in method and add some logging inside the fetch_list_data method, particularly before and after the following line:
  av_tenants = service.try(:orchestration_manager).try(:cloud_tenants)
It is possible that they have a lot of cloud tenants or that it takes a long time to find them in the database.
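A minimal sketch of such an instrumented override, assuming the method is copied into a new, higher-priority domain as Brandon describes below. The log wording is an example, and safe navigation (`&.`) stands in for ActiveSupport's `try` used by the built-in method:

```ruby
# Sketch of the suggested timing instrumentation around the expensive lookup
# inside fetch_list_data. $evm is the automate service object available to
# ManageIQ automate methods.
def fetch_list_data(service)
  $evm.log(:info, "Available_Tenants: fetching cloud tenants")
  started = Time.now

  # The line to instrument (the built-in method uses .try from ActiveSupport;
  # safe navigation is equivalent here):
  av_tenants = service.orchestration_manager&.cloud_tenants

  $evm.log(:info, "Available_Tenants: fetched #{av_tenants.to_a.size} tenants " \
                  "in #{(Time.now - started).round(2)}s")
  av_tenants
end
```

Comparing the two timestamps in automation.log should show whether the cloud tenant lookup itself accounts for the ~80 seconds, or whether the time is spent elsewhere in the method.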

Comment 7 Chen 2018-05-24 01:17:13 UTC
Hi Brandon,

Thank you very much for your help.

I confirmed with the customer: they created a custom domain to override the Available_Tenants method, but deleted that custom domain after they saw the problem. However, the issue still persists. If it took a long time to get all the tenants from the DB, then I should be able to reproduce their problem after importing their DB...

Regarding overriding the built-in method, my understanding is that we cannot edit the ManageIQ domain's methods. Do you want us to create a new domain and add the changes there?

Best Regards,
Chen

Comment 8 Chen 2018-05-24 13:32:49 UTC
Hi Brandon,

The issue was resolved after resetting the ManageIQ domain via Automation -> Automate -> Import/Export.

I will further confirm with the customer how exactly they edited the Available_Tenants method. But if their steps were correct, could it be that the custom code corrupted the ManageIQ domain even though the custom domain was deleted?

Best Regards,
Chen

Comment 9 Brandon Dunne 2018-05-24 14:09:46 UTC
Hi Chen,

The ManageIQ domain is locked by default and cannot be edited by customers. The log we were given shows that the ManageIQ domain was being used, so I don't think any custom code was getting in the way. Yes, to override the method, create a new domain, copy the method from the ManageIQ domain, and add the logging mentioned.

In regards to being able to reproduce the problem in-house... it depends on the type of database backup. Restoring a pg_dump backup re-inserts the data into the database in an efficient manner, while a pg_basebackup restores the database exactly as it was at the customer site. Which type did you receive?

I logged in to the environment mentioned in the comment above and was able to load all of the cloud tenants through the relations the automate method follows; the longest took ~3 seconds (much faster than the ~1 minute 20 seconds the customer-site log shows).

The issue is resolved after resetting the ManageIQ domain?  Did they upgrade from an older version and not upgrade the automate domain in the process?

Regards,
Brandon

Comment 11 Brandon Dunne 2018-06-04 15:40:41 UTC
Hi Chen, Any updates on this bug?

