Created attachment 1415141 [details]
Custom datastore for update_serviceprovision_status

Description of problem:

During a retry of the check_provisioned method, the following code is failing:

$evm.root.attributes.sort.each { |k, v| $evm.log("info", "\t#{k}: #{v}") }

throwing this error:

[----] I, [2018-03-26T10:05:24.101625 #95950:e2f130] INFO -- : Q-task_id([service_template_provision_task_20000000004753]) <AutomationEngine> <AEMethod [/ManageIQ/Service/Provisioning/StateMachines/Methods/check_provisioned]> Starting
[----] I, [2018-03-26T10:05:24.945073 #95950:ee4904] INFO -- : Q-task_id([service_template_provision_task_20000000004753]) <AutomationEngine> <AEMethod check_provisioned> Listing Root Object Attributes:
[----] E, [2018-03-26T10:05:24.967530 #95950:1553bf0] ERROR -- : Q-task_id([service_template_provision_task_20000000004753]) <AutomationEngine> <AEMethod check_provisioned> The following error occurred during method evaluation:
[----] E, [2018-03-26T10:05:24.968903 #95950:1553bf0] ERROR -- : Q-task_id([service_template_provision_task_20000000004753]) <AutomationEngine> <AEMethod check_provisioned> NoMethodError: undefined method `call' for #<DRb::DRbUnknown:0x00000002a97520>
[----] E, [2018-03-26T10:05:24.970318 #95950:1553bf0] ERROR -- : Q-task_id([service_template_provision_task_20000000004753]) <AutomationEngine> <AEMethod check_provisioned> (druby://127.0.0.1:38067) /opt/rh/rh-ruby23/root/usr/share/ruby/drb/invokemethod.rb:11:in `block_yield'
[----] E, [2018-03-26T10:05:24.975087 #95950:1553bf0] ERROR -- : Q-task_id([service_template_provision_task_20000000004753]) <AutomationEngine> Method STDERR: (druby://127.0.0.1:38067) /opt/rh/rh-ruby23/root/usr/share/ruby/drb/invokemethod.rb:11:in `block_yield': undefined method `call' for #<DRb::DRbUnknown:0x00000002a97520> (NoMethodError)
[----] E, [2018-03-26T10:05:24.975786 #95950:1553bf0] ERROR -- : Q-task_id([service_template_provision_task_20000000004753]) <AutomationEngine> Method STDERR: from (druby://127.0.0.1:38067) /opt/rh/rh-ruby23/root/usr/share/ruby/drb/invokemethod.rb:18:in `block in perform_with_block'
[----] E, [2018-03-26T10:05:24.976501 #95950:1553bf0] ERROR -- : Q-task_id([service_template_provision_task_20000000004753]) <AutomationEngine> Method STDERR: from (druby://127.0.0.1:38067) /opt/rh/rh-ruby23/root/usr/share/ruby/drb/invokemethod.rb:15:in `each'
[----] E, [2018-03-26T10:05:24.977179 #95950:1553bf0] ERROR -- : Q-task_id([service_template_provision_task_20000000004753]) <AutomationEngine> Method STDERR: from (druby://127.0.0.1:38067) /opt/rh/rh-ruby23/root/usr/share/ruby/drb/invokemethod.rb:15:in `perform_with_block'
[----] E, [2018-03-26T10:05:24.977936 #95950:1553bf0] ERROR -- : Q-task_id([service_template_provision_task_20000000004753]) <AutomationEngine> Method STDERR: from (druby://127.0.0.1:38067) /opt/rh/rh-ruby23/root/usr/share/ruby/drb/drb.rb:1582:in `perform'
[----] E, [2018-03-26T10:05:24.978672 #95950:1553bf0] ERROR -- : Q-task_id([service_template_provision_task_20000000004753]) <AutomationEngine> Method STDERR: from (druby://127.0.0.1:38067) /opt/rh/rh-ruby23/root/usr/share/ruby/drb/drb.rb:1657:in `block (2 levels) in main_loop'
[----] E, [2018-03-26T10:05:24.979656 #95950:1553bf0] ERROR -- : Q-task_id([service_template_provision_task_20000000004753]) <AutomationEngine> Method STDERR: from (druby://127.0.0.1:38067) /opt/rh/rh-ruby23/root/usr/share/ruby/drb/drb.rb:1653:in `loop'
[----] E, [2018-03-26T10:05:24.980516 #95950:1553bf0] ERROR -- : Q-task_id([service_template_provision_task_20000000004753]) <AutomationEngine> Method STDERR: from (druby://127.0.0.1:38067) /opt/rh/rh-ruby23/root/usr/share/ruby/drb/drb.rb:1653:in `block in main_loop'
[----] E, [2018-03-26T10:05:24.981589 #95950:1553bf0] ERROR -- : Q-task_id([service_template_provision_task_20000000004753]) <AutomationEngine> Method STDERR: from /ManageIQ/Service/Provisioning/StateMachines/Methods/check_provisioned:6:in `<main>'
[----] I, [2018-03-26T10:05:25.005862 #95950:e2f130] INFO -- : Q-task_id([service_template_provision_task_20000000004753]) <AutomationEngine> <AEMethod [/ManageIQ/Service/Provisioning/StateMachines/Methods/check_provisioned]> Ending
[----] E, [2018-03-26T10:05:25.006550 #95950:e2f130] ERROR -- : Q-task_id([service_template_provision_task_20000000004753]) <AutomationEngine> Aborting instantiation (unknown method return code) because [Method exited with rc=Unknown RC: [1]]
[----] E, [2018-03-26T10:05:25.006950 #95950:e2f130] ERROR -- : Q-task_id([service_template_provision_task_20000000004753]) <AutomationEngine> State=<checkprovisioned> running raised exception: <Method exited with rc=Unknown RC: [1]>
[----] W, [2018-03-26T10:05:25.007141 #95950:e2f130] WARN -- : Q-task_id([service_template_provision_task_20000000004753]) <AutomationEngine> Error in State=[checkprovisioned]

Uploading logs from 2 failed provisions: one failed on the first retry, the other failed on the 15th retry. There is also custom code involved for the update method update_serviceprovision_status from the RedHatConsulting_Utilities datastore, but I'm not sure this would affect the $evm.root object. The provisions themselves are successful, but the service is failing due to the DRbUnknown error.

Version-Release number of selected component (if applicable):
CFME 5.8.3.4

How reproducible:
Intermittent by customer

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Hi Yoder,

Could you ask the customer to copy the update_serviceprovision_status method to a writable domain and add these 2 lines:

$evm.root.attributes.keys.each { |k| $evm.log("info", " Root key: - #{k}") }
$evm.root.attributes.keys.sort.each { |k| $evm.log("info", " Sorted Root key: - #{k}") }

before this line, where the root object is dumped:

$evm.root.attributes.sort.each { |k, v| $evm.log("info", "\t#{k}: #{v}") }

And ask them to supply the logs when they encounter the DRbUnknown error?

Thanks,
Tina
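The reason for dumping only the keys first can be simulated outside of ManageIQ. This is a hedged sketch, not the real DRb setup: `BrokenValue` stands in for a `DRb::DRbUnknown` proxy by raising on any method call, while the hash keys stay plain strings, so key-only logging succeeds even when interpolating the values fails.

```ruby
# Stand-in for a value that came back over DRb as DRb::DRbUnknown:
# any method call on it (including to_s during string interpolation) raises.
class BrokenValue
  undef_method :to_s

  def method_missing(name, *_args)
    raise NoMethodError, "undefined method `#{name}' for #<DRbUnknown stand-in>"
  end
end

# A toy version of $evm.root.attributes with one poisoned value.
attrs = { 'service_name' => 'my_service', 'ldap_entry' => BrokenValue.new }

# Key-only dump (the suggested diagnostic) never touches the values:
safe = attrs.keys.sort.map { |k| "Sorted Root key: - #{k}" }

# The original dump interpolates every value and blows up on the bad one:
unsafe =
  begin
    attrs.sort.map { |k, v| "#{k}: #{v}" }
  rescue NoMethodError => e
    "dump failed: #{e.message}"
  end
```

The key dump pinpoints which attribute names are present; comparing that list against the values that fail to print narrows down which object is coming across as DRbUnknown.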
Hi Yoder,

The customer states the problem started after applying the most recent 4.5 errata.

1. What version did they have prior to the upgrade?

The customer stated initially: "Currently only in non-prod, which notably is our only environment with an OpenShift provider." Your comment: they are now hitting this provisioning error 90% of the time in non-prod and prod environments.

2. Is the problem still limited to the OpenShift provider?

3. Can you ask the customer for a set of logs, including the yum.log, from an appliance where the error occurs?

We're setting up an environment here to try to reproduce the issue.

Thanks,
Tina
Hi Yoder,

Some additional questions for the customer:

There are 3 failed requests in the supplied logs:

service_template_provision_request_20000000003396
service_template_provision_request_20000000003397
service_template_provision_request_20000000003398

They call CBP_Business_Logic/Service/Provisioning/StateMachines/Satellite_CatalogItemInitialization.

Does that Service provision successfully at all? If so, can they send us recent logs from a successful Service provision?
Does Lifecycle VM provisioning work successfully?
Does standard ManageIQ Service provisioning work properly?

Thanks,
Tina
Hi Yoder,

A recap of last night's call:

We had a call with the customer last night to discuss the issue. The customer has 3 environments: sandbox, non_prod, and prod. The sandbox environment is working properly; non_prod and prod are failing consistently. The only thing that is new is that the customer has been tuning appliances recently.

sandbox  - 20 VMs, 3 appliances - Services provision successfully
non_prod - 2500 VMs, 15 appliances, 9 with Automate on - Service Provision fails
prod     - 2500 VMs, 15 appliances, 12 with Automate on - Service Provision fails

C&U was turned off and the appliances were restarted on non_prod or prod, and Service Provisioning still failed. The queue had a very large number of entries (I don't remember the actual size), and Service Provisioning appeared to be working properly once the queue was cleaned up. The Service Provision had not completed when we ended the call, but the customer felt confident that it would be successful since they had not seen it progress that far previously.

Can you ask the customer if the Service Provision completed successfully?

Thanks,
Tina
Created attachment 1418258 [details]
debug_drb file - Monkey Patch for DRbUnknown
Hi Yoder,

Comment 12 contains a monkey patch for DRbUnknown which logs output whenever a DRbUnknown object is created. It creates a file called debug_drb.log in the current directory (vmdb) and appends to it.

The latest update in the case states that a single appliance with the Automate role appears to provision the Services successfully.

Could you ask the customer to:
1. Enable one more appliance with the Automate role, for a total of 2 appliances.
2. Copy the attached debug_drb.rb patch to the vmdb/config/initializers folder on both appliances.
3. Restart both servers.
4. Provision the Services.
5. Collect and supply the logs and vmdb/debug_drb.log.

Let me know if you have any questions.

Thanks,
Tina
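The attached debug_drb.rb is not reproduced in this ticket; a minimal sketch of such an initializer could look like the following. The log format and file handling here are assumptions, not the actual attachment:

```ruby
require 'drb/drb'

# Hypothetical debug initializer: wrap DRb::DRbUnknown's constructor so that
# every time DRb falls back to a DRbUnknown (i.e. it could not unmarshal an
# object coming over the wire), we append the unresolved class name and a
# backtrace to debug_drb.log in the current working directory.
module DRb
  class DRbUnknown
    alias_method :debug_original_initialize, :initialize

    def initialize(err, buf)
      debug_original_initialize(err, buf)
      File.open('debug_drb.log', 'a') do |f|
        f.puts "#{Time.now} DRbUnknown created for class #{name.inspect}"
        f.puts caller.join("\n")
        f.puts '-' * 60
      end
    end
  end
end
```

`DRb::DRbUnknown#initialize` parses the marshal error message to recover the class name (exposed via the `name` reader), so the log entry tells you which class failed to load on the receiving side, which is exactly the clue needed here.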
Hi Yoder,

We discussed introducing new 5.8.3.4 appliance(s) into one of the environments yesterday during our call with the customer. Can you ask them if any/all of the 3 appliances with the Automate role they used for testing today are new appliances?

Thanks,
Tina
Hi Yoder,

I saw the close errors in the debug logs. Is it safe to say that the only issue now is the close issue? I've asked Adam from the provider team to look at the logs, and we should open a separate ticket on that issue.

We're still looking into why the behavior is different between the 2 versions, but want to verify that the customer is okay not using the code in question.

Let me know if you have any questions.

Thanks,
Tina
https://github.com/ManageIQ/manageiq/pull/17267
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/76309b91e41725be72c5abadb19c0731b07733f7

commit 76309b91e41725be72c5abadb19c0731b07733f7
Author:     Adam Grare <agrare>
AuthorDate: Mon Apr 9 14:37:53 2018 -0400
Commit:     Adam Grare <agrare>
CommitDate: Mon Apr 9 14:37:53 2018 -0400

    Set DRb conn pool to [] after closing connections

    After forking and closing open DRb pooled connections set the pool
    to [] to prevent other calls calling `conn.close` on already closed
    connection.

    This prevents the following:

    ```
    [NoMethodError]: undefined method `close' for nil:NilClass  Did you mean?  clone
    /opt/rh/rh-ruby23/root/usr/share/ruby/drb/drb.rb:1258:in `close'
    /opt/rh/rh-ruby23/root/usr/share/ruby/drb/drb.rb:1237:in `block in open'
    /opt/rh/rh-ruby23/root/usr/share/ruby/sync.rb:234:in `block in sync_synchronize'
    /opt/rh/rh-ruby23/root/usr/share/ruby/sync.rb:231:in `handle_interrupt'
    /opt/rh/rh-ruby23/root/usr/share/ruby/sync.rb:231:in `sync_synchronize'
    /opt/rh/rh-ruby23/root/usr/share/ruby/drb/drb.rb:1235:in `open'
    /opt/rh/rh-ruby23/root/usr/share/ruby/drb/drb.rb:1141:in `block in method_missing'
    /opt/rh/rh-ruby23/root/usr/share/ruby/drb/drb.rb:1160:in `with_friend'
    /opt/rh/rh-ruby23/root/usr/share/ruby/drb/drb.rb:1140:in `method_missing'
    lib/gems/pending/VMwareWebService/MiqVimBroker.rb:419:in `getMiqVim'
    ```

    https://bugzilla.redhat.com/show_bug.cgi?id=1562401

 app/models/miq_worker.rb | 1 +
 1 file changed, 1 insertion(+)
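The pattern the commit describes can be sketched with a toy pool. This is an illustration only, not the actual miq_worker.rb code; the class and method names are invented:

```ruby
# A connection that, like a real DRb socket, must not be closed twice.
class FakeConn
  def initialize
    @closed = false
  end

  def close
    raise "close called on an already closed connection" if @closed
    @closed = true
  end
end

class ToyConnPool
  def initialize
    @pool = [FakeConn.new, FakeConn.new]
  end

  def size
    @pool.size
  end

  # The buggy version closed the connections but left them in @pool, so a
  # second cleanup pass re-closed the stale entries and raised.
  def close_pool
    @pool.each(&:close)
    @pool = [] # the one-line fix: forget the closed connections
  end
end

pool = ToyConnPool.new
pool.close_pool
pool.close_pool # safe now: the pool is empty, nothing is closed twice
```

Without the `@pool = []` reset, the second `close_pool` call would raise on the already-closed connections, which mirrors the post-fork failure mode the commit fixes.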
Hi Yoder,

I should have been more specific when asking the question about not using Net::LDAP. My question was in response to this: "There were some instances of Net::LDAP errors in the debug_drb logs too, however Ian says they have resolved this issue."

1. Strictly for the purposes of debugging the DRbUnknown error, does removing the Net::LDAP code result in no more DRbUnknown errors?

2. Yes, I believe Adam's PR will resolve the close issue.

3. As an alternative for the VM naming code, we recently added the ability to call the naming method from an Automate method. More detail here: http://talk.manageiq.org/t/vm-naming-change/3195

4. I'm not sure why putting the LDAP objects in $evm.object worked before, but they should not be putting Net::LDAP objects in $evm.object. Maybe they can refactor that code to store the LDAP information instead of the objects.

Let me know if you have any questions.

Thanks,
Tina
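The refactor suggested in point 4 could look roughly like this. It is a hedged sketch: `OpenStruct` stands in for a Net::LDAP::Entry, the attribute names are assumptions, and the hash mimics what would be stored via $evm.object instead of the live entry object:

```ruby
require 'ostruct'

# Stand-in for a Net::LDAP::Entry; the real entry wraps connection-backed
# state that does not reliably survive Marshal, which is one way an object
# can later surface on the automate side as DRb::DRbUnknown.
ldap_entry = OpenStruct.new(
  dn:   'cn=jdoe,ou=users,dc=example,dc=com',
  mail: ['jdoe@example.com'],
  cn:   ['John Doe']
)

# Instead of stashing the entry itself (e.g. $evm.object['ldap_entry'] = ...),
# copy out only the plain, marshal-safe string values later states need:
ldap_info = {
  'dn'   => ldap_entry.dn.to_s,
  'mail' => Array(ldap_entry.mail).first.to_s,
  'cn'   => Array(ldap_entry.cn).first.to_s
}

# A simple hash of strings round-trips cleanly through Marshal, so it is
# safe to pass across the DRb boundary.
restored = Marshal.load(Marshal.dump(ldap_info))
```

Storing plain values also makes the stored data independent of the Net::LDAP gem version, so automate retries that deserialize old state cannot break when the gem changes.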
Created a new ticket to track the 'close' issue: https://bugzilla.redhat.com/show_bug.cgi?id=1566198
Hi Yoder,

Since we've created the new ticket to track the close issue, can we close this ticket?

Thanks,
Tina