Description of problem:
502 Proxy error when ordering a service catalog.

The production.log shows the GET API call took 2+ minutes to finish:

[----] I, [2018-05-18T18:33:48.724983 #3510:14595b0] INFO -- : Started GET "/api/service_dialogs/1000000000009?resource_action_id=1000000000140&target_id=1000000000009&target_type=service_template" for 127.0.0.1 at 2018-05-18 18:33:48 +0900
[----] I, [2018-05-18T18:33:48.727524 #3510:14595b0] INFO -- : Processing by Api::ServiceDialogsController#show as JSON
[----] I, [2018-05-18T18:33:48.727601 #3510:14595b0] INFO -- : Parameters: {"resource_action_id"=>"1000000000140", "target_id"=>"1000000000009", "target_type"=>"service_template", "c_id"=>"1000000000009"}
[----] I, [2018-05-18T18:36:30.322966 #3510:14595b0] INFO -- : Completed 200 OK in 161595ms (Views: 0.1ms | ActiveRecord: 38.2ms)

Version-Release number of selected component (if applicable):

In apache/ssl_error.log we have:

[Fri May 18 18:35:48.812855 2018] [proxy_http:error] [pid 5542] (70007)The timeout specified has expired: [client 10.42.219.110:62010] AH01102: error reading status line from remote server 0.0.0.0:4000, referer: https://10.42.224.46/catalog/explorer
[Fri May 18 18:35:48.812891 2018] [proxy:error] [pid 5542] [client 10.42.219.110:62010] AH00898: Error reading from remote server returned by /api/service_dialogs/1000000000009, referer: https://10.42.224.46/catalog/explorer

However, the puma CPU usage is not high at all.

How reproducible:
100% at the customer's site

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Logs are in collab-shell://cases/02102081/evm_current_hudccf21_20180518_184649.tar.xz/
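For reference, the timings implied by the log lines above can be worked out with a small Ruby sketch. It assumes the Apache timestamp is in the same +09:00 local time as production.log, and `elapsed_seconds` is a hypothetical helper name used only for illustration.

```ruby
require "time"

# Hypothetical helper: seconds between two ISO 8601 timestamps.
def elapsed_seconds(from, to)
  Time.iso8601(to) - Time.iso8601(from)
end

# Timestamps copied from the log excerpts; the +09:00 offset on the
# Apache entry is an assumption (ssl_error.log does not print one).
request_started  = "2018-05-18T18:33:48.724983+09:00"
request_finished = "2018-05-18T18:36:30.322966+09:00"
proxy_error      = "2018-05-18T18:35:48.812855+09:00"

# Total time Rails spent on the request.
puts elapsed_seconds(request_started, request_finished).round(1)
# Time from request start until Apache reported the proxy error.
puts elapsed_seconds(request_started, proxy_error).round(1)
```

The request ran for roughly 161.6 seconds, while Apache reported the error about 120.1 seconds after the request started. That would be consistent with a proxy timeout of around 120 seconds, although the configured value is not shown in the logs quoted here.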
Rebooting the appliance didn't solve the problem. Attaching the errors in the UI as well. Best Regards, Chen
Created attachment 1440544
The error messages shown in the UI when ordering the service catalog
That API request synchronously calls an automate method at /ManageIQ/Cloud/Orchestration/Operations/Methods/Available_Tenants, and the method is taking a long time (over a minute) to complete in their environment:

[----] I, [2018-05-18T18:33:48.880914 #3510:14595b0] INFO -- : <AEMethod [/ManageIQ/Cloud/Orchestration/Operations/Methods/Available_Tenants]> Starting
[----] I, [2018-05-18T18:35:09.506019 #3510:14595b0] INFO -- : <AEMethod [/ManageIQ/Cloud/Orchestration/Operations/Methods/Available_Tenants]> Ending

Can they override the built-in method and add some logging inside the fetch_list_data method, particularly before and after the following line:

av_tenants = service.try(:orchestration_manager).try(:cloud_tenants)

It is possible that they have a lot of cloud tenants, or that it takes a long time to find them in the database.
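As a rough illustration of the kind of logging meant here, the sketch below wraps the tenant lookup in a timing helper. It is only a sketch under assumptions: `timed`, `Manager`, `Service`, and `demo_lookup` are hypothetical names, the Struct stand-ins exist only so the example runs outside the appliance, and `&.` is used in place of `try` because ActiveSupport is not loaded here. Inside the copied automate method, the `log` callable would presumably be something like `->(msg) { $evm.log(:info, msg) }`.

```ruby
# Hypothetical helper: logs before and after a block and reports the
# wall-clock time the block took. `log` is any callable taking a string.
def timed(label, log)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  log.call("#{label}: starting")
  result = yield
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  log.call(format("%s: finished in %.3fs", label, elapsed))
  result
end

# Stand-ins for the real service and orchestration manager objects.
Manager = Struct.new(:cloud_tenants)
Service = Struct.new(:orchestration_manager)

# Runs the timed lookup against the stand-ins and returns the tenants
# together with the collected log messages.
def demo_lookup
  messages = []
  log = ->(msg) { messages << msg }
  service = Service.new(Manager.new(%w[tenant_a tenant_b]))

  av_tenants = timed("cloud_tenants lookup", log) do
    service&.orchestration_manager&.cloud_tenants
  end
  log.call("cloud_tenants count: #{av_tenants ? av_tenants.size : 'nil'}")

  [av_tenants, messages]
end

tenants, messages = demo_lookup
puts messages
```

Logging the tenant count alongside the elapsed time should help tell apart the two possibilities: a very large number of tenants versus a slow lookup for a modest number of them.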
Hi Brandon,

Thank you very much for your help.

I confirmed with the customer that they had created a custom domain to override the Available_Tenants method. After they saw the problem, they deleted the custom domain, but the issue still persists.

If it takes a long time to get all the tenants from the DB, then I should be able to reproduce their problem after importing their DB...

Regarding overriding the built-in method, my understanding is that we cannot edit methods in the ManageIQ domain. Do you want us to create a new domain and make the changes there?

Best Regards,
Chen
Hi Brandon,

The issue was resolved after resetting the ManageIQ domain by navigating to Automation -> Automate -> Import/Export.

I will confirm with the customer exactly how they edited the Available_Tenants method. But if their steps were correct, could the customized code have contaminated the ManageIQ domain even though the custom domain was deleted?

Best Regards,
Chen
Hi Chen,

The ManageIQ domain is locked by default and cannot be edited by customers. The log that we were given says that the ManageIQ domain was being used, so I don't think any custom code was getting in the way.

Yes, to override the method, create a new domain, copy the method from the ManageIQ domain, and add the logging mentioned.

In regards to being able to reproduce the problem in-house... It depends on the type of database backup. Restoring a pg_dump backup re-inserts the data into the database in an efficient manner, whereas a pg_basebackup restores the database as it was at the customer site. Which type did you receive?

I logged in to the environment mentioned in the comment above and was able to load all of the cloud tenants through the relations that the automate method was following, with the longest taking ~3 seconds (much faster than the ~1 minute 20 seconds that the log shows at the customer site).

The issue is resolved after resetting the ManageIQ domain? Did they upgrade from an older version and not upgrade the automate domain in the process?

Regards,
Brandon
Hi Chen, Any updates on this bug?