Bug 1562401 - DRb Unknown Error Thrown Trying to List Root Object Attributes on Checkprovision Retry
Summary: DRb Unknown Error Thrown Trying to List Root Object Attributes on Checkprovis...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Automate
Version: 5.8.0
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: GA
: 5.9.2
Assignee: Tina Fitzgerald
QA Contact: Dave Johnson
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-03-30 14:28 UTC by myoder
Modified: 2018-05-01 21:13 UTC (History)
5 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-04-11 20:50:01 UTC
Category: Bug
Cloudforms Team: CFME Core
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Custom datastore for update_serviceprovision_status (144.73 KB, application/zip)
2018-03-30 14:28 UTC, myoder
no flags Details
debug_drb file (484 bytes, text/plain)
2018-04-06 18:29 UTC, Tina Fitzgerald
no flags Details

Description myoder 2018-03-30 14:28:51 UTC
Created attachment 1415141 [details]
Custom datastore for update_serviceprovision_status

Description of problem:

During a retry of the checkprovision method, the following code is failing:

$evm.root.attributes.sort.each { |k, v| $evm.log("info", "\t#{k}: #{v}") }

Throwing this error:

[----] I, [2018-03-26T10:05:24.101625 #95950:e2f130]  INFO -- : Q-task_id([service_template_provision_task_20000000004753]) <AutomationEngine> <AEMethod [/ManageIQ/Service/Provisioning/StateMachines/Methods/check_provisioned]> Starting 
[----] I, [2018-03-26T10:05:24.945073 #95950:ee4904]  INFO -- : Q-task_id([service_template_provision_task_20000000004753]) <AutomationEngine> <AEMethod check_provisioned> Listing Root Object Attributes:
[----] E, [2018-03-26T10:05:24.967530 #95950:1553bf0] ERROR -- : Q-task_id([service_template_provision_task_20000000004753]) <AutomationEngine> <AEMethod check_provisioned> The following error occurred during method evaluation:
[----] E, [2018-03-26T10:05:24.968903 #95950:1553bf0] ERROR -- : Q-task_id([service_template_provision_task_20000000004753]) <AutomationEngine> <AEMethod check_provisioned>   NoMethodError: undefined method `call' for #<DRb::DRbUnknown:0x00000002a97520>
[----] E, [2018-03-26T10:05:24.970318 #95950:1553bf0] ERROR -- : Q-task_id([service_template_provision_task_20000000004753]) <AutomationEngine> <AEMethod check_provisioned>   (druby://127.0.0.1:38067) /opt/rh/rh-ruby23/root/usr/share/ruby/drb/invokemethod.rb:11:in `block_yield'
[----] E, [2018-03-26T10:05:24.975087 #95950:1553bf0] ERROR -- : Q-task_id([service_template_provision_task_20000000004753]) <AutomationEngine> Method STDERR: (druby://127.0.0.1:38067) /opt/rh/rh-ruby23/root/usr/share/ruby/drb/invokemethod.rb:11:in `block_yield': undefined method `call' for #<DRb::DRbUnknown:0x00000002a97520> (NoMethodError)
[----] E, [2018-03-26T10:05:24.975786 #95950:1553bf0] ERROR -- : Q-task_id([service_template_provision_task_20000000004753]) <AutomationEngine> Method STDERR:  from (druby://127.0.0.1:38067) /opt/rh/rh-ruby23/root/usr/share/ruby/drb/invokemethod.rb:18:in `block in perform_with_block'
[----] E, [2018-03-26T10:05:24.976501 #95950:1553bf0] ERROR -- : Q-task_id([service_template_provision_task_20000000004753]) <AutomationEngine> Method STDERR:  from (druby://127.0.0.1:38067) /opt/rh/rh-ruby23/root/usr/share/ruby/drb/invokemethod.rb:15:in `each'
[----] E, [2018-03-26T10:05:24.977179 #95950:1553bf0] ERROR -- : Q-task_id([service_template_provision_task_20000000004753]) <AutomationEngine> Method STDERR:  from (druby://127.0.0.1:38067) /opt/rh/rh-ruby23/root/usr/share/ruby/drb/invokemethod.rb:15:in `perform_with_block'
[----] E, [2018-03-26T10:05:24.977936 #95950:1553bf0] ERROR -- : Q-task_id([service_template_provision_task_20000000004753]) <AutomationEngine> Method STDERR:  from (druby://127.0.0.1:38067) /opt/rh/rh-ruby23/root/usr/share/ruby/drb/drb.rb:1582:in `perform'
[----] E, [2018-03-26T10:05:24.978672 #95950:1553bf0] ERROR -- : Q-task_id([service_template_provision_task_20000000004753]) <AutomationEngine> Method STDERR:  from (druby://127.0.0.1:38067) /opt/rh/rh-ruby23/root/usr/share/ruby/drb/drb.rb:1657:in `block (2 levels) in main_loop'
[----] E, [2018-03-26T10:05:24.979656 #95950:1553bf0] ERROR -- : Q-task_id([service_template_provision_task_20000000004753]) <AutomationEngine> Method STDERR:  from (druby://127.0.0.1:38067) /opt/rh/rh-ruby23/root/usr/share/ruby/drb/drb.rb:1653:in `loop'
[----] E, [2018-03-26T10:05:24.980516 #95950:1553bf0] ERROR -- : Q-task_id([service_template_provision_task_20000000004753]) <AutomationEngine> Method STDERR:  from (druby://127.0.0.1:38067) /opt/rh/rh-ruby23/root/usr/share/ruby/drb/drb.rb:1653:in `block in main_loop'
[----] E, [2018-03-26T10:05:24.981589 #95950:1553bf0] ERROR -- : Q-task_id([service_template_provision_task_20000000004753]) <AutomationEngine> Method STDERR:  from /ManageIQ/Service/Provisioning/StateMachines/Methods/check_provisioned:6:in `<main>'
[----] I, [2018-03-26T10:05:25.005862 #95950:e2f130]  INFO -- : Q-task_id([service_template_provision_task_20000000004753]) <AutomationEngine> <AEMethod [/ManageIQ/Service/Provisioning/StateMachines/Methods/check_provisioned]> Ending
[----] E, [2018-03-26T10:05:25.006550 #95950:e2f130] ERROR -- : Q-task_id([service_template_provision_task_20000000004753]) <AutomationEngine> Aborting instantiation (unknown method return code) because [Method exited with rc=Unknown RC: [1]]
[----] E, [2018-03-26T10:05:25.006950 #95950:e2f130] ERROR -- : Q-task_id([service_template_provision_task_20000000004753]) <AutomationEngine> State=<checkprovisioned> running  raised exception: <Method exited with rc=Unknown RC: [1]>
[----] W, [2018-03-26T10:05:25.007141 #95950:e2f130]  WARN -- : Q-task_id([service_template_provision_task_20000000004753]) <AutomationEngine> Error in State=[checkprovisioned]



Uploading logs from 2 failed provisions, 1 failed on the first retry the other fails on the 15th retry.



Also there is custom code involved for the update method update_serviceprovision_status from the RedHatConsulting_Utilities datastore, but I'm not sure this would affect the $evm.root object.


The provisions are successful, but the service is failing due to the DRb Unknown error.


Version-Release number of selected component (if applicable):
CFME 5.8.3.4

How reproducible:
intermittent by customer

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 5 Tina Fitzgerald 2018-04-02 12:28:31 UTC
Hi Yoder,

Could you ask the customer to copy the update_serviceprovision_status method to a writable domain and add these 2 lines before the root object is dumped:

Add these 2 lines:
$evm.root.attributes.keys.each { |k| $evm.log("info", " Root key: - #{k}") } 
$evm.root.attributes.keys.sort.each { |k| $evm.log("info", " Sorted Root key: - #{k}") } 

before this line:
$evm.root.attributes.sort.each { |k, v| $evm.log("info", "\t#{k}: #{v}") }

And ask them to supply the logs when they encounter the DRbUnknown error?

Thanks,
Tina

Comment 8 Tina Fitzgerald 2018-04-03 21:44:42 UTC
Hi Yoder,


The customer states the problem starting after applying the most recent 4.5 errata. 

1. What version did they have prior to the upgrade?

The customer states initially that:
Currently only in non-prod, which notably is our only environment with an OpenShift provider.

Your comment: They are now hitting this provisioning error 90% of the time in non-prod and prod environments. 

2. Is the problem still only limited to the OpenShift provider? 

3. Can you ask the customer for a set of logs including the yum.log from an appliance where the error occurs?

We're setting up an environment here to try to reproduce the issue.

Thanks,
Tina

Comment 9 Tina Fitzgerald 2018-04-04 20:49:25 UTC
Hi Yoder,

Some additional questions for the customer:

There are 3 failed requests in supplied logs:
service_template_provision_request_20000000003396 
service_template_provision_request_20000000003397 
service_template_provision_request_20000000003398 
They call CBP_Business_Logic/Service/Provisioning/StateMachines/
 Satellite_CatalogItemInitialization

Does that Service provision successfully at all?
If so, can they send us recent logs from a successful Service provision?

Does Lifecycle VM provisioning work successfully?
Does standard ManageIQ Service provisioning work properly?

Thanks,
Tina

Comment 10 Tina Fitzgerald 2018-04-05 16:04:20 UTC
Hi Yoder,

A recap for the call last night..................

We had a call with the customer last night to discuss the issue. Customer has 3 environments: sandbox, non_prod, and prod.

The sandbox environment is working properly, non_prod and prod are failing consistently. 

The only thing that is new in the environment is that the customer has been tuning appliances recently.

sandbox - 20 VMs, 3 appliances - Services Provision successfully
non-prod - 2500 VMs, 15 appliances, 9 Automate on - Service Provision fails 
prod - 2500 VMs, 15 appliances, 12 Automate on - Service Provision fails 

C&U was turned off and the Appliances were restarted on non_prod or prod and Service Provisioning still failed.

The queue had a very large number of entries(don't remember actual size) and Service Provisioning appeared to be working properly once the queue was cleaned up. The Service Provision had not completed when we ended the call, but the customer felt confident that it would be successful since they had not see it progress that far previously.

Can you ask the customer if the Service Provision completed successfully?

Thanks,
Tina

Comment 12 Tina Fitzgerald 2018-04-06 18:29:16 UTC
Created attachment 1418258 [details]
debug_drb file

Monkey Patch for DrbUnknown

Comment 13 Tina Fitzgerald 2018-04-06 18:40:53 UTC
Hi Yoder,

Comment 12 contains a monkey patch for DrbUnknown which will log the output when a DrbUnknown object is created.
It creates a file called debug_drb.log in the current directory(vmdb) and appends to it. 

The latest update in the case states that a single appliance with the Automate role appears to provision the Services successfully. 

Could you ask the customer to:
1. Enable one more appliance with the Automate role for a total of 2 appliances.
2. Copy the attached debug_drb.rb patch to both appliance vmdb/config/initializers folder.
3. Restart both servers
4. Provision the Services
5. Collect and supply the logs and the vmdb/debug_drb.log

Let me know if you have any questions.

Thanks,
Tina

Comment 14 Tina Fitzgerald 2018-04-06 19:51:01 UTC
Hi Yoder,

We discussed introducing a new 5.8.3.4 appliance(s) into one of the environments yesterday during our call with the customer. Can you ask them if any/all of the 3 appliances with the Automate role they used for testing today 
are new appliances?

Thanks,
Tina

Comment 16 Tina Fitzgerald 2018-04-09 18:19:35 UTC
Hi Yoder,

I saw the close errors in the debug logs.  Is it safe to say that the only issue now is the close issue?  I've asked Adam from the provider team to look at the logs and we should open a separate ticket on that issue.

We're still looking into why the behavior is different between the 2 versions, but want to verify that the customer is okay not using the code in question.

Let me know if you have any questions.

Thanks,
Tina

Comment 18 CFME Bot 2018-04-09 20:41:02 UTC
New commit detected on ManageIQ/manageiq/master:

https://github.com/ManageIQ/manageiq/commit/76309b91e41725be72c5abadb19c0731b07733f7
commit 76309b91e41725be72c5abadb19c0731b07733f7
Author:     Adam Grare <agrare>
AuthorDate: Mon Apr  9 14:37:53 2018 -0400
Commit:     Adam Grare <agrare>
CommitDate: Mon Apr  9 14:37:53 2018 -0400

    Set DRb conn pool to [] after closing connections

    After forking and closing open DRb pooled connections set the pool to []
    to prevent other calls calling `conn.close` on already closed
    connection.  This prevents the following:

    ```
    [NoMethodError]: undefined method `close' for nil:NilClass
    Did you mean?  clone
    /opt/rh/rh-ruby23/root/usr/share/ruby/drb/drb.rb:1258:in `close'
    /opt/rh/rh-ruby23/root/usr/share/ruby/drb/drb.rb:1237:in `block in open'
    /opt/rh/rh-ruby23/root/usr/share/ruby/sync.rb:234:in `block in sync_synchronize'
    /opt/rh/rh-ruby23/root/usr/share/ruby/sync.rb:231:in `handle_interrupt'
    /opt/rh/rh-ruby23/root/usr/share/ruby/sync.rb:231:in `sync_synchronize'
    /opt/rh/rh-ruby23/root/usr/share/ruby/drb/drb.rb:1235:in `open'
    /opt/rh/rh-ruby23/root/usr/share/ruby/drb/drb.rb:1141:in `block in method_missing'
    /opt/rh/rh-ruby23/root/usr/share/ruby/drb/drb.rb:1160:in `with_friend'
    /opt/rh/rh-ruby23/root/usr/share/ruby/drb/drb.rb:1140:in `method_missing'
    lib/gems/pending/VMwareWebService/MiqVimBroker.rb:419:in `getMiqVim'
    ```

    https://bugzilla.redhat.com/show_bug.cgi?id=1562401

 app/models/miq_worker.rb | 1 +
 1 file changed, 1 insertion(+)

Comment 20 Tina Fitzgerald 2018-04-10 14:15:39 UTC
Hi Yoder,

I should have been more specific when asking the question about not using Net::LDAP. My question was in response to this:
 "There were some instances of Net::LDAP errors in the debug_drb logs too, however Ian says they have resolved this issue:"

1. Strictly for the purposes of debugging the DRbUnknown error, does removing the Net::LDAP code result in no more DrbUnknown errors?

2. Yes, I believe Adams PR will resolve the close issue.   

3. As far as an alternative for the the VM naming code, we recently added code to be able to call the naming method from an Automate method. 
More detail here:
http://talk.manageiq.org/t/vm-naming-change/3195

4. I'm not sure why the putting the LDAP objects in $evm.object worked before, but they should not be putting the `Net:LDAP object in the $evm.object. Maybe they can refactor that code to store the LDAP information instead of the objects.

Let me know if you have any questions.

Thanks,
Tina

Comment 22 Tina Fitzgerald 2018-04-11 18:25:34 UTC
Created a new ticket to track the 'close' issue:
https://bugzilla.redhat.com/show_bug.cgi?id=1566198

Comment 23 Tina Fitzgerald 2018-04-11 18:48:22 UTC
Hi Yoder,

Since we've created the new ticket to track the close issue.
Can we close this ticket?

Thanks,
Tina


Note You need to log in before you can comment on or make changes to this bug.