Bug 1597653
Summary: | DRb::DRbUnknown exception after update from 4.5 to 4.6 | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat CloudForms Management Engine | Reporter: | Neha Chugh <nchugh> | ||||||
Component: | Automate | Assignee: | Tina Fitzgerald <tfitzger> | ||||||
Status: | CLOSED NOTABUG | QA Contact: | Dave Johnson <dajohnso> | ||||||
Severity: | high | Docs Contact: | |||||||
Priority: | high | ||||||||
Version: | 5.9.0 | CC: | mkanoor, nchugh, obarenbo, tfitzger | ||||||
Target Milestone: | GA | ||||||||
Target Release: | cfme-future | ||||||||
Hardware: | All | ||||||||
OS: | All | ||||||||
Whiteboard: | |||||||||
Fixed In Version: | Doc Type: | If docs needed, set a value | |||||||
Doc Text: | Story Points: | --- | |||||||
Clone Of: | Environment: | ||||||||
Last Closed: | 2018-09-18 05:18:34 UTC | Type: | Bug | ||||||
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Embargoed: | |||||||||
Bug Depends On: | |||||||||
Bug Blocks: | 1595269 | ||||||||
Attachments: |
|
Description
Neha Chugh
2018-07-03 11:10:09 UTC
I believe this is a dup of https://bugzilla.redhat.com/show_bug.cgi?id=1562401 Please review that ticket. Specifically comments https://bugzilla.redhat.com/show_bug.cgi?id=1562401#c20 and after. Created attachment 1456500 [details]
DRB debug log
Hi Neha, I'm sorry for the delayed response. The DRbUnknown exception is caused by netssh objects being stored in a state_var, but I don't know which Automate method(s) are setting the invalid value. Thanks, Tina Hello Neha, Its not customary to store "connection objects" or "other complex objects" in the state var, that is a big no no. The state_var is only meant to store simple scalar fields like An integer A String If you connect to a server (e.g. SSH Server) and store the connection object in state_var or options hash that object will not be available in subsequent runs. The whole ManageIQ workflows are highly asynchronous and run on different servers at different times based on how the workflow is sliced. So it doesn't make sense to store any connection objects in the state var, if you were initially running on Appliance1 and the work then gets queued and dispatched on Appliance 2, that connection object has no relevance. I would advice please look at the customer methods and ensure that they are not storing complex objects anywhere in the options hash or in state vars. Thanks, Madhu Hi Neha, Do you have their Automate Models. the collab shell has been cleaned up. Heva they redone the Custom/Integration/IPAM/Methods.class/__methods__/create_reservation.rb method to not have the Net:SSH reference. If it would help we can have a debugging session with the customer. Thanks, Madhu Hello Neha, In their Automate Method they have this line of code 129 $evm.log(:info, "SSH Command: #{command}") 130 Net::SSH.start(server, username, {keys: ssh_key}) { |ssh| output = ssh.exec!(command) } 131 $evm.log(:info, "SSH Output: #{output}") Line # 130 sets the output of the ssh.exec! command. The output object is of type Net::SSH::Connection::Session::StringWithExitstatus Which is what we are seeing in the logs and also in the DRb debug output Net::SSH::Connection::Session::StringWithExitstatus behaves like a string but its not really a string since it has to be serialized across DRb as a string. So I would suggest that on line #133 they convert this to a string using something like output = output.to_s Then they can do the parsing and peel out the Mac address and the rest of the stuff. In Line #168 when they do this $evm.set_state_var('MAC Address', mac_address) what we see is this !ruby/object:DRb::DRbUnknown\n name: Net::SSH\n buf: \"\\x04\\bIC:8Net::SSH::Connection::Session::StringWithExitstatus\\\"\\x160a:00:c3:c8:01:2c\\x06:\\x06ET\"\n Also they might want to cleanup their MiqQueue entries for any stale provisions stuck in the queue. If you start with a clean set of logs you should not see any StringWithExitStatus in the evm.log and in automation.log. If you were to grep for StringWithExitStatus in the current set of logs you would see numerous errors. Also as a sanity check we should be checking if the ssh command was executed properly. Currently the script doesn't have any checks for failure. Net::SSH::Connection::Session::StringWithExitstatus object has a method called exitstatus which you can use to check if the command failed. You could put this after the exec is called or better yet write a small function that encapsulates this. if output.exitstatus == 0 output = output.to_s else $evm.log(:error, "ssh exec command failed #{output.exitstatus} : #{output}") exit(MIQ_ABORT) end Hello Madhu, Thanks for your update its really helpful. That was really keen observation. I have updated the case with your remarks and waiting for his response. Meanwhile can you please help me how to cleanup their MiqQueue entries for any stale provisions stuck in the queue. I have checked it via kill_provision.rb script but seems like its not working as expected. Any suggestion would be really helpful. Regards, Neha chugh Digging more into this issue why it started failing after migration. The crux of the issue is that the net ssh gem has been upgraded in our latest release Net SSH version is 4.2.0 in CFME version 5.9.4.0 Net SSH version is 3.2.0 in CFME version 5.8.4.5 Net::SSH::Connection::Session::StringWithExitstatus was introduced in the newer version of Net::SSH. And this object is not serializable via Drb and we get the DRb errors like it was reported by the customer So if you are in the newer version please convert the output of exec! command to a string e.g. res = ssh.exec!(@cmd).to_s Where previously you would have had res = ssh.exec!(@cmd) We might want to let the Support folks know about this change so customers with the Net::SSH are aware of it and we add it to our migration notes. I have attached a sample Automate Method that you can run via Automate Simulation and see the output Created attachment 1479523 [details]
Sample Automate Method
If you have some ssh servers you can provide the IP address, userid password and test the automate script
NetSSH added the StringWithExitstatus in 4.0.0.beta4 https://github.com/net-ssh/net-ssh/blob/80230544df15f8c36c06a55b50bb3d4476ae5107/lib/net/ssh/connection/session.rb#L338 Thanks Madhu for all your help. I will raise a documentation bug for the correction and suggest the required changes. I have shared all the details to the customer and looking forward for his response. Will update you once will get the confirmation from customer. Regarding clearing stale queue entries, can we use kill_provision.rb , you can find this script under tools directory. Let me know if you have ever tried using above script. For me, it didn't work. Thank you. Regards, Neha Chugh Hello Neha, I have not used the kill_provision, in the customers environment its possible these provisions have failed and there are no stale queue entries. I was just being cautious when I asked you to check their queue because I wanted to have them test after they have made the .to_s changes and not get bogged down with old entries stuck in queue. Is the customer in production or dev environment? Josh and others might have some cleanup scripts if the customer has stale entries. I hope they dont. Thanks, Madhu Hi Neha, I'm not sure what type of problem you're experiencing with the kill_provision script, but there's an open ticket open on it that we're currently looking into: https://bugzilla.redhat.com/show_bug.cgi?id=1614086 Regards, Tina Hi Neha, Can this ticket be closed? Thanks, Tina Hello Tina, Yes, we are good to close the BZ. Thanks to you and Madhu for your prompt responses and for an appropriate solution. Just an heads up, I have raised documentation BZ for appropriate corrections in Migrating and Upgrading document[1]. [1]. https://bugzilla.redhat.com/show_bug.cgi?id=1623801 Thanks and Regards, Neha Chugh |