Bug 1597653 - DRb::DRbUnknown exception after update from 4.5 to 4.6
Summary: DRb::DRbUnknown exception after update from 4.5 to 4.6
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Automate
Version: 5.9.0
Hardware: All
OS: All
high
high
Target Milestone: GA
: cfme-future
Assignee: Tina Fitzgerald
QA Contact: Dave Johnson
URL:
Whiteboard:
Depends On:
Blocks: 1595269
TreeView+ depends on / blocked
 
Reported: 2018-07-03 11:10 UTC by Neha Chugh
Modified: 2021-12-10 16:32 UTC (History)
4 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-09-18 05:18:34 UTC
Category: ---
Cloudforms Team: ---
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
DRB debug log (14.36 KB, text/plain)
2018-07-04 13:12 UTC, Neha Chugh
no flags Details
Sample Automate Method (773 bytes, text/plain)
2018-08-29 15:32 UTC, mkanoor
no flags Details

Description Neha Chugh 2018-07-03 11:10:09 UTC
Description of problem:
DRb::DRbUnknown exception observed after provisioning via Service catalog after upgrading the appliance from 4.5 to 4.6

Version-Release number of selected component (if applicable):
5.9.2.4

How reproducible:
Reproducible at customer's environment

Steps to Reproduce:

Drb:DRbUnknown exception has been observed at check_provisioned state after ordering the service so to provision vmware's vm.

[WRK] VM [#<DRb::DRbUnknown:0x0000000002e4e858>] Step [CheckProvisioned] Status [[TypeError]: can] Message [[TypeError]: can't cast DRb::DRbUnknown] 

Customer has prepared custom methods which are invoked as part of provisioning state machine.

This issue has specifically observed after upgrading cloudforms from 4.5 to 4.6.


Actual results:

Type cast exception observed after invoking vmware's provisioning state machine after upgrading the appliance from 4.5 to 4.6.

Expected results:

It should work as it worked before upgrading the appliances.

Additional info:

In order to access the latest logs, datastore, please use collab-shell :

 [yank] complete - access files in /cases/02108810
 [browse] the files here: http://collab-shell.usersys.redhat.com/02108810/
 [images] are available here: http://collab-shell.usersys.redhat.com/02108810/x-image

Comment 3 Greg McCullough 2018-07-03 12:07:28 UTC
I believe this is a dup of https://bugzilla.redhat.com/show_bug.cgi?id=1562401

Please review that ticket. Specifically comments https://bugzilla.redhat.com/show_bug.cgi?id=1562401#c20 and after.

Comment 6 Neha Chugh 2018-07-04 13:12:58 UTC
Created attachment 1456500 [details]
DRB debug log

Comment 8 Tina Fitzgerald 2018-08-17 11:34:14 UTC
Hi Neha,

I'm sorry for the delayed response.

The DRbUnknown exception is caused by netssh objects being stored in a state_var, but I don't know which Automate method(s) are setting the invalid value.  


Thanks,
Tina

Comment 10 mkanoor 2018-08-22 14:01:41 UTC
Hello Neha,
Its not customary to store "connection objects" or "other complex objects"  in the state var, that is a big no no.
The state_var is only meant to store simple scalar fields like 

An integer
A String

If you connect to a server (e.g. SSH Server) and store the connection object in state_var or options hash that object will not be available in subsequent runs.

The whole ManageIQ workflows are highly asynchronous and run on different servers at different times based on how the workflow is sliced. So it doesn't make sense to store any connection objects in the state var, if you were initially running on Appliance1 and the work then gets queued and dispatched on Appliance 2, that connection object has no relevance.

I would advice please look at the customer methods and ensure that they are not storing complex objects anywhere in the options hash or in state vars.

Thanks,
Madhu

Comment 12 mkanoor 2018-08-28 13:48:33 UTC
Hi Neha,
Do you have their Automate Models. the collab shell has been cleaned up.

Heva they redone the Custom/Integration/IPAM/Methods.class/__methods__/create_reservation.rb method to not have the Net:SSH reference.

If it would help we can have a debugging session with the customer.

Thanks,
Madhu

Comment 14 mkanoor 2018-08-28 21:36:02 UTC
Hello Neha,
In their Automate Method they have this line of code

129   $evm.log(:info, "SSH Command: #{command}")
130   Net::SSH.start(server, username, {keys: ssh_key}) { |ssh| output = ssh.exec!(command) }
131   $evm.log(:info, "SSH Output: #{output}")


Line # 130 sets the output of the ssh.exec! command. The output object is of type

Net::SSH::Connection::Session::StringWithExitstatus

Which is what we are seeing in the logs and also in the DRb debug output

Net::SSH::Connection::Session::StringWithExitstatus behaves like a string but its not really a string since it has to be serialized across DRb as a string.

So I would suggest that on line #133 they convert this to a string using something like

output = output.to_s

Then they can do the parsing and peel out the Mac address and the rest of the stuff.

In Line #168 when they do this
$evm.set_state_var('MAC Address', mac_address)

what we see is this 

!ruby/object:DRb::DRbUnknown\n  
         name: Net::SSH\n  buf: \"\\x04\\bIC:8Net::SSH::Connection::Session::StringWithExitstatus\\\"\\x160a:00:c3:c8:01:2c\\x06:\\x06ET\"\n


Also they might want to cleanup their MiqQueue entries for any stale provisions stuck in the queue.

If you start with a clean set of logs you should not see any StringWithExitStatus in the evm.log and in automation.log.

If you were to grep for StringWithExitStatus in the current set of logs you would see numerous errors.

Also as a sanity check we should be checking if the ssh command was executed properly.
Currently the script doesn't have any checks for failure.

Net::SSH::Connection::Session::StringWithExitstatus object has a method called exitstatus which you can use to check if the command failed.


You could put this after the exec is called or better yet write a small function that encapsulates this.

if output.exitstatus == 0
   output = output.to_s
else
   $evm.log(:error, "ssh exec command failed #{output.exitstatus} : #{output}")
   exit(MIQ_ABORT)
end

Comment 15 Neha Chugh 2018-08-29 03:04:03 UTC
Hello Madhu,

Thanks for your update its really helpful. That was really keen observation.

I have updated the  case with your remarks and waiting for his response.

Meanwhile can you please help me how to cleanup their MiqQueue entries for any stale provisions stuck in the queue.

I have checked it via kill_provision.rb script but seems like its not working as expected. Any suggestion would be really helpful.

Regards,
Neha chugh

Comment 17 mkanoor 2018-08-29 15:31:39 UTC
Digging more into this issue why it started failing after migration. The crux of the issue is that the net ssh gem has been upgraded in our latest release

 
Net SSH version is 4.2.0 in CFME version 5.9.4.0 
Net SSH version is 3.2.0 in CFME version 5.8.4.5


Net::SSH::Connection::Session::StringWithExitstatus was introduced in the newer version of Net::SSH. And this object is not serializable via Drb and we get the DRb errors like it was reported by the customer

So if you are in the newer version please convert the output of exec! command to a string

e.g.

res = ssh.exec!(@cmd).to_s


Where previously you would have had


res = ssh.exec!(@cmd)

We might want to let the Support folks know about this change so customers with the Net::SSH are aware of it and we add it to our migration notes.

I have attached a sample Automate Method that you can run via Automate Simulation and see the output

Comment 18 mkanoor 2018-08-29 15:32:53 UTC
Created attachment 1479523 [details]
Sample Automate Method

If you have some ssh servers you can provide the IP address, userid password and test the automate script

Comment 19 mkanoor 2018-08-29 15:41:46 UTC
NetSSH added the StringWithExitstatus in 4.0.0.beta4
 
https://github.com/net-ssh/net-ssh/blob/80230544df15f8c36c06a55b50bb3d4476ae5107/lib/net/ssh/connection/session.rb#L338

Comment 20 Neha Chugh 2018-08-30 08:26:16 UTC
Thanks Madhu for all your help.

I will raise a documentation bug for the correction and suggest the required changes.

I have shared all the details to the customer and looking forward for his response.

Will update you once will get the confirmation from customer.

Regarding clearing stale queue entries, can we use kill_provision.rb , you can find this script under tools directory.

Let me know if you have ever tried using above script. For me, it didn't work.

Thank you.

Regards,
Neha Chugh

Comment 21 mkanoor 2018-08-30 14:20:31 UTC
Hello Neha,
I have not used the kill_provision, in the customers environment its possible these provisions have failed and there are no stale queue entries. I was just being cautious when I asked you to check their queue because I wanted to have them test after they have made the .to_s changes and not get bogged down with old entries stuck in queue.
Is the customer in production or dev environment?

Josh and others might have some cleanup scripts if the customer has stale entries. I hope they dont.
Thanks,
Madhu

Comment 22 Tina Fitzgerald 2018-08-30 14:23:57 UTC
Hi Neha,

I'm not sure what type of problem you're experiencing with the kill_provision script, but there's an open ticket open on it that we're currently looking into:
https://bugzilla.redhat.com/show_bug.cgi?id=1614086

Regards,
Tina

Comment 23 Tina Fitzgerald 2018-09-17 19:26:11 UTC
Hi Neha,

Can this ticket be closed?

Thanks,
Tina

Comment 24 Neha Chugh 2018-09-18 05:18:34 UTC
Hello Tina,

Yes, we are good to close the BZ. Thanks to you and Madhu for your prompt responses and for an appropriate solution.

Just an heads up, I have raised documentation BZ for appropriate corrections in Migrating and Upgrading document[1].

[1]. https://bugzilla.redhat.com/show_bug.cgi?id=1623801

Thanks and Regards,
Neha Chugh


Note You need to log in before you can comment on or make changes to this bug.