Bug 1392047

Summary: marshal error causes ems refresh to fail for VMware provider
Product: Red Hat CloudForms Management Engine Reporter: Thomas Hennessy <thenness>
Component: Providers Assignee: Adam Grare <agrare>
Status: CLOSED DUPLICATE QA Contact: Dave Johnson <dajohnso>
Severity: high Docs Contact:
Priority: high    
Version: 5.6.0 CC: agrare, jdeubel, jfrey, jhardy, jocarter, myoder, obarenbo, saali, thenness
Target Milestone: GA   
Target Release: cfme-future   
Hardware: x86_64   
OS: Linux   
Whiteboard: vsphere:ems_refresh:provider
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-11-21 14:05:37 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: VMware Target Upstream Version:
Attachments:
Description Flags
VMWARE EMs Refresh worker exhibiting issue none
most recent incident of error none
Broker Debug Patch none

Description Thomas Hennessy 2016-11-04 16:06:42 UTC
Created attachment 1217448 [details]
VMWARE EMs Refresh worker exhibiting issue

Description of problem: EMS refresh fails with error:
=====
[----] E, [2016-10-27T23:22:52.767614 #18397:e5798c] ERROR -- : MIQ(ManageIQ::Providers::Vmware::InfraManager::Refresher#refresh) EMS: [Sacramento WLS VCenter 1], id: [50000000000004] Refresh failed
[----] E, [2016-10-27T23:22:52.768265 #18397:e5798c] ERROR -- : [TypeError]: incompatible marshal file format (can't be read)
	format version 4.8 required; 58.12 given  Method:[rescue in block in refresh]
======
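
For reference, the version numbers in this message come from the first two bytes of the data handed to Marshal.load, where Ruby expects 4 and 8. The sketch below reproduces the same TypeError from a hand-built byte string. The `corrupt` payload is made up for illustration and is not taken from the attached logs.

```ruby
# A well-formed Marshal stream starts with the version bytes 4 and 8.
valid = Marshal.dump("refresh data")
header = valid.bytes.first(2)

# Hypothetical payload: any data whose first two bytes are 58 and 12
# (":" followed by a form feed) is rejected before anything is decoded.
corrupt = [58, 12].pack("C*") + "arbitrary trailing bytes"

message = begin
  Marshal.load(corrupt)
  nil
rescue TypeError => e
  e.message
end

puts header.inspect
puts message
```

Byte 58 is the ASCII colon, which Marshal uses to introduce a symbol, so the data reaching the loader may have started partway into an otherwise valid stream rather than being random garbage.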

Version-Release number of selected component (if applicable):5.6.1.2


How reproducible: unknown


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Thomas Hennessy 2016-11-04 16:12:46 UTC
Created attachment 1217450 [details]
most recent incident of error

Comment 3 Adam Grare 2016-11-07 18:52:46 UTC
This looks like it could be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1385038

It is interesting that both of the errors you attached were hit after a successful get_vc_data, when connecting back to the broker for get_vc_data_host_scsi or get_vc_data_ems_customization_specs. I am not sure what this means yet; I'll look through the full logs and see if I can find anything.

Comment 4 Adam Grare 2016-11-08 20:16:26 UTC
This might be related to the workers being killed and having their broker sessions cleaned up.

The MiqScheduleWorker is killed:
[2016-11-03T07:20:41.438335 #18428:e5798c]  WARN -- : MIQ(MiqScheduleWorker#kill) Worker ID [50000001663114] PID [24231] GUID [3cb6663e-a1cf-11e6-b77e-005056827e39] has been killed

There is a DRb failure when cleaning up broker connections:
[2016-11-03T07:21:47.103457 #18428:e5798c]  INFO -- : MIQ(MiqVimBrokerWorker.cleanup_for_pid) Releasing any broker connections for pid: [24231], ERROR: too large packet 67651907

Then at essentially the same time we hit the corrupt message error:
[2016-11-03T07:21:47.118597 #22879:e5798c] ERROR -- : MIQ(ManageIQ::Providers::Vmware::InfraManager::Refresher#refresh) EMS: [Sacramento WLS VCenter 1], id: [50000000000004] Refresh failed
[2016-11-03T07:21:47.120601 #22879:e5798c] ERROR -- : [TypeError]: incompatible marshal file format (can't be read)
        format version 4.8 required; 58.13 given  Method:[rescue in block in refresh]
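
One detail worth noting about the two messages above: DRb frames each message part as a 4-byte big-endian length followed by the Marshal payload, and the reported size 67651907 is exactly what the bytes 04 08 49 43 give when read as such a length. Those bytes look like the start of a Marshal stream (version 4.8, then "I" and "C"). The sketch below only does the arithmetic. It is consistent with a reader that lost its place in the stream, but it does not prove that this is what happened here.

```ruby
# Size reported by "too large packet", decoded as a 4-byte big-endian
# integer (the "N" pack format, which DRb uses for its length prefix).
reported_size = 67_651_907
size_bytes = [reported_size].pack("N").bytes

# First bytes of an ordinary Marshal stream, for comparison.
marshal_prefix = Marshal.dump("any string".encode("UTF-8")).bytes.first(3)

# Version bytes from the two TypeErrors: 58 is ":" (Marshal's symbol
# marker). If the next byte were a symbol length, Marshal would have
# stored it as n + 5, so 12 and 13 would mean lengths 7 and 8.
symbol_lengths = [12, 13].map { |b| b - 5 }

puts size_bytes.inspect
puts marshal_prefix.inspect
puts symbol_lengths.inspect
```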

Comment 6 Adam Grare 2016-11-14 14:05:00 UTC
Created attachment 1220456 [details]
Broker Debug Patch

With this patch the broker server will log drb message sizes and checksums, and the broker client will print the original message when it hits a marshal error.
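
As a rough illustration of the kind of logging described (not the contents of the attached patch), a sender could record the size and a checksum of each marshaled message so that the receiver's view can be compared against it. The `describe_message` helper, the sample hash, and the choice of MD5 are assumptions made for this sketch.

```ruby
require "digest/md5"

# Hypothetical helper: marshal an object and report the payload size and
# an MD5 checksum, the two values a debug log line would need.
def describe_message(obj)
  payload = Marshal.dump(obj)
  { size: payload.bytesize, md5: Digest::MD5.hexdigest(payload), payload: payload }
end

sent = describe_message({ "name" => "vm-01", "power_state" => "on" })
puts "message size=#{sent[:size]} md5=#{sent[:md5]}"

# On the receiving side, the same checksum over the bytes actually read
# shows whether the payload arrived intact.
received = sent[:payload].dup
intact = Digest::MD5.hexdigest(received) == sent[:md5]

# Dropping bytes from the front (as a misaligned read might) changes it.
truncated = received.byteslice(4..-1)
damaged = Digest::MD5.hexdigest(truncated) != sent[:md5]

puts "intact=#{intact} damaged=#{damaged}"
```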

Comment 7 Thomas Hennessy 2016-11-14 18:28:44 UTC
Adam,
What instructions can you provide for applying this patch?

Tom Hennessy

Comment 8 Adam Grare 2016-11-14 18:48:11 UTC
Tom, scp the file to /var/www/miq/vmdb and un-tar it, then restart evmserverd.

How much gets logged depends on the vim log level. With level_vim set to debug, the size and checksum of every outgoing message are printed (this is quite verbose). With level_vim left at warn, the extra information appears only when a marshaling issue is hit.

Comment 11 Thomas Hennessy 2016-11-17 13:17:45 UTC
From the customer, who has received two hotfixes and is testing them:
=====
Most recent comment: On 2016-11-17 04:01:54, Trieu, Daniel commented:
"Hello RedHat Team,

Updating the ticket:

The hotfix to address reported heartbeat failure issues associated with pglogical has been pushed to 3 (out of 8) regions. I confirmed within 30 minutes that the heartbeat issue was resolved.

Out of caution, the plan is to push the hotfix to another 2 regions today and another 3 regions the day after.

To be clear, there is another hotfix on this ticket for marshal errors, which is in test/dev/uat right now and has not been pushed to any production region.


Daniel"
=====

Comment 12 Adam Grare 2016-11-21 14:05:37 UTC
This is the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=1385038

*** This bug has been marked as a duplicate of bug 1385038 ***