Description of problem:
When provisioning, VM clones successfully from template but it never powers on. Cloudforms shows the error:

Error: incompatible marshal file format (can't be read) format version 4.8 required; 0.0 given

Version-Release number of selected component (if applicable):
5.6.1.2

How reproducible:
Only happens every few provision requests

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Please provide additional details, such as which provider and provider version are being used and, if possible, any related logging. Can you elaborate on the statement "Only happens every few provision requests"? Does the same template show the issue randomly, or does it occur randomly across all templates?
The provider is vCenter 6 update 2. I have inquired about the statement "Only happens every few provision requests" and will report once I hear back.
There are a lot of DRb connection issues that manifest in different ways: I see provision errors, refresh errors, and metrics collection errors. A small summary of errors:

ERROR -- : [DRb::DRbConnError]: too large packet 67651907 Method:[rescue in perf_collect_metrics]
ERROR -- : [TypeError]: incompatible marshal file format (can't be read)
[Danbury] [Broker] is VC, API version: true
Error: [undefined method `queryProviderSummary' for "6.0":VimString]

I don't see any errors from the broker worker in these logs, but could you enable vim debug and upload the vim.log so we can see whether the broker is hiccuping while these DRb errors are happening or whether it is just DRb? Another case (https://bugzilla.redhat.com/show_bug.cgi?id=1384113) mentions broker errors likely caused by memory issues. Is that something that could be causing this as well?
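A note on the "too large packet 67651907" value above. DRb frames each message with a 4-byte big-endian length followed by the Marshal payload, so it can help to look at that number as raw bytes. The sketch below does this; the helper names and the `ExampleString` class are hypothetical, and the interpretation (a client reading the start of a Marshal payload where it expected a length prefix) is one possible reading, not something confirmed by the logs in this report.

```ruby
# Hypothetical helpers for inspecting a DRb "too large packet" size.
# Only Array#pack, String#unpack and the Marshal constants are Ruby's own.

# Turn the reported size back into the 4 bytes that were read as a length prefix.
def length_prefix_bytes(size)
  [size].pack("N") # "N" = 32-bit unsigned, big-endian
end

# True if the bytes begin with the current Marshal major/minor version header.
def marshal_header?(bytes)
  major, minor = bytes.unpack("CC")
  major == Marshal::MAJOR_VERSION && minor == Marshal::MINOR_VERSION
end

# Hypothetical String subclass, standing in for any String subclass sent over DRb.
class ExampleString < String; end

prefix = length_prefix_bytes(67_651_907)
sample = Marshal.dump(ExampleString.new("6.0")).byteslice(0, 4)

puts prefix.bytes.inspect          # bytes of the bogus "length"
puts marshal_header?(prefix)       # does it look like a Marshal header?
puts sample.bytes == prefix.bytes  # same first 4 bytes as a dumped String subclass?
```

If the bogus length really is the head of a Marshal payload, the reader and writer on that connection were out of step by at least one message, which would also explain a reply landing on an unrelated call.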
What is the provider's reply during this issue? What is the status of the provisioning (clone) task?
Can you have the customer apply the debug hotfix from https://bugzilla.redhat.com/show_bug.cgi?id=1392047 ?
*** Bug 1392047 has been marked as a duplicate of this bug. ***
Hey Michael, can you find out if the customer has hit this again yet?
Let me know if you want CU to reproduce the error. I can definitely ask CU to do so.
New commit detected on ManageIQ/vmware_web_service/master:
https://github.com/ManageIQ/vmware_web_service/commit/b4548c269d09a9b621b411197e9791770cf669e8

commit b4548c269d09a9b621b411197e9791770cf669e8
Author:     Adam Grare <agrare>
AuthorDate: Tue Jul 18 15:10:58 2017 -0400
Commit:     Adam Grare <agrare>
CommitDate: Tue Jul 18 15:13:34 2017 -0400

    Add marshal version and drb message size checks

    Add diagnostic code to dump out the buffer and object sent from the
    broker if either the DRbMessage size or Marshal version are invalid.

    https://bugzilla.redhat.com/show_bug.cgi?id=1385038

 lib/VMwareWebService/DMiqVimSync.rb | 29 +++++++++++++++++++++++++++--
 1 file changed, 27 insertions(+), 2 deletions(-)
Added additional logging on the server for the case where either the DRbMessage size or the Marshal version is invalid. This lets us catch the error on the server side and dump the buffer and the object, which will give us more information about what happened. This is still diagnostic code and is not expected to fix the issue.
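For readers following along, the kind of check described here can be sketched roughly as follows. This is not the code from DMiqVimSync.rb; `marshal_buffer_problem` and `ASSUMED_LOAD_LIMIT` are hypothetical names, and the 25 MB limit is an assumption that may not match the limit configured in a given DRb version.

```ruby
# Hypothetical diagnostic helper, NOT the actual ManageIQ implementation.
ASSUMED_LOAD_LIMIT = 25 * 1024 * 1024 # assumption; the real DRb limit may differ

# Returns nil when the buffer looks loadable, otherwise a description of the problem.
def marshal_buffer_problem(buf, load_limit: ASSUMED_LOAD_LIMIT)
  return "too large packet #{buf.bytesize}" if buf.bytesize > load_limit
  return "buffer too short (#{buf.bytesize} bytes)" if buf.bytesize < 2

  major, minor = buf.unpack("CC") # first two bytes are the Marshal version
  return nil if major == Marshal::MAJOR_VERSION && minor <= Marshal::MINOR_VERSION

  hex = buf.byteslice(0, 16).unpack("H*").first # dump the head of the buffer
  "marshal version #{major}.#{minor} given, first bytes: #{hex}"
end

good = Marshal.dump("6.0")
bad  = "\x00\x00".b + good # two stray zero bytes ahead of a valid payload

puts marshal_buffer_problem(good).inspect
puts marshal_buffer_problem(bad)

# Marshal.load rejects the same buffer with the TypeError seen in this report.
begin
  Marshal.load(bad)
rescue TypeError => e
  load_error = e.message
  puts load_error
end
```

Checking before calling `Marshal.load` makes it possible to log the raw bytes next to the failure instead of only the exception text.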
https://github.com/ManageIQ/manageiq-content/pull/154
New commit detected on ManageIQ/manageiq-content/master:
https://github.com/ManageIQ/manageiq-content/commit/4bda33b6359f9564894d8988772d6f02d5f242ff

commit 4bda33b6359f9564894d8988772d6f02d5f242ff
Author:     mkanoor <mkanoor>
AuthorDate: Fri Jul 28 10:07:38 2017 -0400
Commit:     mkanoor <mkanoor>
CommitDate: Fri Jul 28 10:07:38 2017 -0400

    Log the full error message from the provider

    https://bugzilla.redhat.com/show_bug.cgi?id=1385038

    There was some confusion in this ticket if the error was being
    generated by Automate or if it came from the provider. If we log
    the full error message from the provider it would help debugging.

 .../StateMachines/Methods.class/__methods__/check_provisioned.rb | 1 +
 .../StateMachines/Methods.class/__methods__/check_provisioned.rb | 1 +
 2 files changed, 2 insertions(+)
https://github.com/ManageIQ/vmware_web_service/pull/20
New commit detected on ManageIQ/vmware_web_service/master:
https://github.com/ManageIQ/vmware_web_service/commit/760374d5a76caa2d53ba99ff314004b446eb36fa

commit 760374d5a76caa2d53ba99ff314004b446eb36fa
Author:     Adam Grare <agrare>
AuthorDate: Thu Oct 12 09:51:37 2017 -0400
Commit:     Adam Grare <agrare>
CommitDate: Mon Oct 23 15:54:16 2017 -0400

    Add client side DrbMessage#load checks

    Add checks for TypeError and packet too large and print out the raw
    buffer from the socket.

    https://bugzilla.redhat.com/show_bug.cgi?id=1385038

 lib/VMwareWebService/MiqVimBroker.rb   | 2 ++
 lib/VMwareWebService/MiqVimDrbDebug.rb | 2 ++
 2 files changed, 4 insertions(+)
https://github.com/ManageIQ/manageiq/pull/16953
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/271ae4023aa67b5d6a4e57d9decebf14479151a2

commit 271ae4023aa67b5d6a4e57d9decebf14479151a2
Author:     Adam Grare <agrare>
AuthorDate: Mon Feb 5 17:03:01 2018 -0500
Commit:     Adam Grare <agrare>
CommitDate: Tue Feb 6 11:04:17 2018 -0500

    Close open connections from parent after fork

    DRb::DRbConn keeps a global pool of open connections which is shared by
    child processes when they are forked from a parent. If this parent
    executes a DRb call prior to forking a child process the child picks up
    this open connection and uses it which can cause replies from the server
    to go to the wrong DRb client.

    There is a long standing ruby bug https://bugs.ruby-lang.org/issues/2718
    which describes the issue and has reproducer code attached.

    Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1385038

 app/models/miq_worker.rb | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)
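To illustrate the pattern named in this commit, here is a minimal, generic sketch of closing inherited connections in a child right after fork. It deliberately does not touch DRb internals: `ConnectionPool` is a hypothetical stand-in for a process-wide pool such as the one DRb::DRbConn keeps, and a UNIX socket pair stands in for a connection to the broker. It is not the code added to miq_worker.rb.

```ruby
require "socket"

# Hypothetical stand-in for a process-wide connection pool.
class ConnectionPool
  def initialize
    @mutex = Mutex.new
    @conns = []
  end

  def add(io)
    @mutex.synchronize { @conns << io }
  end

  def size
    @mutex.synchronize { @conns.size }
  end

  # Close and forget every pooled connection. Intended to run in a child
  # process immediately after fork, before it makes any remote calls.
  def close_all
    @mutex.synchronize do
      @conns.each { |io| io.close unless io.closed? }
      @conns.clear
    end
  end
end

POOL = ConnectionPool.new
client, _server = UNIXSocket.pair # pretend this is an open broker connection
POOL.add(client)

reader, writer = IO.pipe
pid = fork do
  reader.close
  POOL.close_all        # drop the connection inherited from the parent
  writer.puts POOL.size # report the child's pool size back to the parent
  writer.close
  exit!(0)              # skip at_exit handlers inherited from the parent
end
writer.close
child_pool_size = reader.read.to_i
reader.close
Process.wait(pid)

puts child_pool_size # pool size seen by the child after cleanup
puts client.closed?  # state of the parent's own connection
```

Closing in the child only releases the child's copy of the file descriptor, so the parent keeps its connection while the child is forced to open a fresh one for its own calls.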