Description of problem: DRb::DRbUnknown exception observed after provisioning via Service catalog after upgrading the appliance from 4.5 to 4.6 Version-Release number of selected component (if applicable): 5.9.2.4 How reproducible: Reproducible at customer's environment Steps to Reproduce: Drb:DRbUnknown exception has been observed at check_provisioned state after ordering the service so to provision vmware's vm. [WRK] VM [#<DRb::DRbUnknown:0x0000000002e4e858>] Step [CheckProvisioned] Status [[TypeError]: can] Message [[TypeError]: can't cast DRb::DRbUnknown] Customer has prepared custom methods which are invoked as part of provisioning state machine. This issue has specifically observed after upgrading cloudforms from 4.5 to 4.6. Actual results: Type cast exception observed after invoking vmware's provisioning state machine after upgrading the appliance from 4.5 to 4.6. Expected results: It should work as it worked before upgrading the appliances. Additional info: In order to access the latest logs, datastore, please use collab-shell : [yank] complete - access files in /cases/02108810 [browse] the files here: http://collab-shell.usersys.redhat.com/02108810/ [images] are available here: http://collab-shell.usersys.redhat.com/02108810/x-image
I believe this is a dup of https://bugzilla.redhat.com/show_bug.cgi?id=1562401 Please review that ticket. Specifically comments https://bugzilla.redhat.com/show_bug.cgi?id=1562401#c20 and after.
Created attachment 1456500 [details] DRB debug log
Hi Neha, I'm sorry for the delayed response. The DRbUnknown exception is caused by netssh objects being stored in a state_var, but I don't know which Automate method(s) are setting the invalid value. Thanks, Tina
Hello Neha, Its not customary to store "connection objects" or "other complex objects" in the state var, that is a big no no. The state_var is only meant to store simple scalar fields like An integer A String If you connect to a server (e.g. SSH Server) and store the connection object in state_var or options hash that object will not be available in subsequent runs. The whole ManageIQ workflows are highly asynchronous and run on different servers at different times based on how the workflow is sliced. So it doesn't make sense to store any connection objects in the state var, if you were initially running on Appliance1 and the work then gets queued and dispatched on Appliance 2, that connection object has no relevance. I would advice please look at the customer methods and ensure that they are not storing complex objects anywhere in the options hash or in state vars. Thanks, Madhu
Hi Neha, Do you have their Automate Models. the collab shell has been cleaned up. Heva they redone the Custom/Integration/IPAM/Methods.class/__methods__/create_reservation.rb method to not have the Net:SSH reference. If it would help we can have a debugging session with the customer. Thanks, Madhu
Hello Neha, In their Automate Method they have this line of code 129 $evm.log(:info, "SSH Command: #{command}") 130 Net::SSH.start(server, username, {keys: ssh_key}) { |ssh| output = ssh.exec!(command) } 131 $evm.log(:info, "SSH Output: #{output}") Line # 130 sets the output of the ssh.exec! command. The output object is of type Net::SSH::Connection::Session::StringWithExitstatus Which is what we are seeing in the logs and also in the DRb debug output Net::SSH::Connection::Session::StringWithExitstatus behaves like a string but its not really a string since it has to be serialized across DRb as a string. So I would suggest that on line #133 they convert this to a string using something like output = output.to_s Then they can do the parsing and peel out the Mac address and the rest of the stuff. In Line #168 when they do this $evm.set_state_var('MAC Address', mac_address) what we see is this !ruby/object:DRb::DRbUnknown\n name: Net::SSH\n buf: \"\\x04\\bIC:8Net::SSH::Connection::Session::StringWithExitstatus\\\"\\x160a:00:c3:c8:01:2c\\x06:\\x06ET\"\n Also they might want to cleanup their MiqQueue entries for any stale provisions stuck in the queue. If you start with a clean set of logs you should not see any StringWithExitStatus in the evm.log and in automation.log. If you were to grep for StringWithExitStatus in the current set of logs you would see numerous errors. Also as a sanity check we should be checking if the ssh command was executed properly. Currently the script doesn't have any checks for failure. Net::SSH::Connection::Session::StringWithExitstatus object has a method called exitstatus which you can use to check if the command failed. You could put this after the exec is called or better yet write a small function that encapsulates this. if output.exitstatus == 0 output = output.to_s else $evm.log(:error, "ssh exec command failed #{output.exitstatus} : #{output}") exit(MIQ_ABORT) end
Hello Madhu, Thanks for your update its really helpful. That was really keen observation. I have updated the case with your remarks and waiting for his response. Meanwhile can you please help me how to cleanup their MiqQueue entries for any stale provisions stuck in the queue. I have checked it via kill_provision.rb script but seems like its not working as expected. Any suggestion would be really helpful. Regards, Neha chugh
Digging more into this issue why it started failing after migration. The crux of the issue is that the net ssh gem has been upgraded in our latest release Net SSH version is 4.2.0 in CFME version 5.9.4.0 Net SSH version is 3.2.0 in CFME version 5.8.4.5 Net::SSH::Connection::Session::StringWithExitstatus was introduced in the newer version of Net::SSH. And this object is not serializable via Drb and we get the DRb errors like it was reported by the customer So if you are in the newer version please convert the output of exec! command to a string e.g. res = ssh.exec!(@cmd).to_s Where previously you would have had res = ssh.exec!(@cmd) We might want to let the Support folks know about this change so customers with the Net::SSH are aware of it and we add it to our migration notes. I have attached a sample Automate Method that you can run via Automate Simulation and see the output
Created attachment 1479523 [details] Sample Automate Method If you have some ssh servers you can provide the IP address, userid password and test the automate script
NetSSH added the StringWithExitstatus in 4.0.0.beta4 https://github.com/net-ssh/net-ssh/blob/80230544df15f8c36c06a55b50bb3d4476ae5107/lib/net/ssh/connection/session.rb#L338
Thanks Madhu for all your help. I will raise a documentation bug for the correction and suggest the required changes. I have shared all the details to the customer and looking forward for his response. Will update you once will get the confirmation from customer. Regarding clearing stale queue entries, can we use kill_provision.rb , you can find this script under tools directory. Let me know if you have ever tried using above script. For me, it didn't work. Thank you. Regards, Neha Chugh
Hello Neha, I have not used the kill_provision, in the customers environment its possible these provisions have failed and there are no stale queue entries. I was just being cautious when I asked you to check their queue because I wanted to have them test after they have made the .to_s changes and not get bogged down with old entries stuck in queue. Is the customer in production or dev environment? Josh and others might have some cleanup scripts if the customer has stale entries. I hope they dont. Thanks, Madhu
Hi Neha, I'm not sure what type of problem you're experiencing with the kill_provision script, but there's an open ticket open on it that we're currently looking into: https://bugzilla.redhat.com/show_bug.cgi?id=1614086 Regards, Tina
Hi Neha, Can this ticket be closed? Thanks, Tina
Hello Tina, Yes, we are good to close the BZ. Thanks to you and Madhu for your prompt responses and for an appropriate solution. Just an heads up, I have raised documentation BZ for appropriate corrections in Migrating and Upgrading document[1]. [1]. https://bugzilla.redhat.com/show_bug.cgi?id=1623801 Thanks and Regards, Neha Chugh