Bug 1392047 - marshal error causes ems refresh to fail for VMware provider
Summary: marshal error causes ems refresh to fail for VMware provider
Keywords:
Status: CLOSED DUPLICATE of bug 1385038
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Providers
Version: 5.6.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: GA
Target Release: cfme-future
Assignee: Adam Grare
QA Contact: Dave Johnson
URL:
Whiteboard: vsphere:ems_refresh:provider
Depends On:
Blocks:
Reported: 2016-11-04 16:06 UTC by Thomas Hennessy
Modified: 2020-08-13 08:40 UTC (History)
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-11-21 14:05:37 UTC
Category: ---
Cloudforms Team: VMware
Target Upstream Version:
Embargoed:


Attachments
VMWARE EMs Refresh worker exhibiting issue (68.49 KB, application/x-gzip)
2016-11-04 16:06 UTC, Thomas Hennessy
most recent incident of error (43.21 KB, application/x-gzip)
2016-11-04 16:12 UTC, Thomas Hennessy
Broker Debug Patch (4.76 KB, application/x-gzip)
2016-11-14 14:05 UTC, Adam Grare

Description Thomas Hennessy 2016-11-04 16:06:42 UTC
Created attachment 1217448
VMWARE EMs Refresh worker exhibiting issue

Description of problem: EMS refresh fails with the following error:
=====
[----] E, [2016-10-27T23:22:52.767614 #18397:e5798c] ERROR -- : MIQ(ManageIQ::Providers::Vmware::InfraManager::Refresher#refresh) EMS: [Sacramento WLS VCenter 1], id: [50000000000004] Refresh failed
[----] E, [2016-10-27T23:22:52.768265 #18397:e5798c] ERROR -- : [TypeError]: incompatible marshal file format (can't be read)
	format version 4.8 required; 58.12 given  Method:[rescue in block in refresh]
======
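
For context, the "format version 4.8 required; 58.12 given" wording is what Ruby's Marshal.load reports when the first two bytes of its input are not the expected format version. A minimal sketch of that behaviour, assuming a made-up payload whose first two bytes are 58 and 12 (the helper name and the sample data are hypothetical and are not taken from the broker code):

```ruby
# Every Marshal.dump result starts with two version bytes: major 4, minor 8.
def marshal_header(obj)
  Marshal.dump(obj).bytes.first(2)
end

# Hypothetical helper: returns the TypeError message raised by Marshal.load,
# or nil if the payload loads cleanly.
def marshal_error_for(payload)
  Marshal.load(payload)
  nil
rescue TypeError => e
  e.message
end

valid = Marshal.dump("refresh data")

# Assumed corrupt input: two stray bytes (58, 12) ahead of otherwise valid data,
# so Marshal.load reads them as "version 58.12".
corrupt = [58, 12].pack("CC") + valid

p marshal_header("refresh data")
puts marshal_error_for(corrupt)
```

This only shows how the message arises; it says nothing about why the refresh worker received such bytes.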

Version-Release number of selected component (if applicable):5.6.1.2


How reproducible: unknown


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Thomas Hennessy 2016-11-04 16:12:46 UTC
Created attachment 1217450
most recent incident of error

Comment 3 Adam Grare 2016-11-07 18:52:46 UTC
This looks like it could be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1385038

It is interesting that both of the errors you attached were hit after a successful get_vc_data, when connecting back to the broker to call get_vc_data_host_scsi or get_vc_data_ems_customization_specs. I'm not sure what this means yet; I'll look through the full logs and see if I can find anything.

Comment 4 Adam Grare 2016-11-08 20:16:26 UTC
This might be related to the workers being killed and having their broker sessions cleaned up.

The MiqScheduleWorker is killed:
[2016-11-03T07:20:41.438335 #18428:e5798c]  WARN -- : MIQ(MiqScheduleWorker#kill) Worker ID [50000001663114] PID [24231] GUID [3cb6663e-a1cf-11e6-b77e-005056827e39] has been killed

There is a DRb failure when cleaning up broker connections:
[2016-11-03T07:21:47.103457 #18428:e5798c]  INFO -- : MIQ(MiqVimBrokerWorker.cleanup_for_pid) Releasing any broker connections for pid: [24231], ERROR: too large packet 67651907

Then at essentially the same time we hit the corrupt message error:
[2016-11-03T07:21:47.118597 #22879:e5798c] ERROR -- : MIQ(ManageIQ::Providers::Vmware::InfraManager::Refresher#refresh) EMS: [Sacramento WLS VCenter 1], id: [50000000000004] Refresh failed
[2016-11-03T07:21:47.120601 #22879:e5798c] ERROR -- : [TypeError]: incompatible marshal file format (can't be read)
        format version 4.8 required; 58.13 given  Method:[rescue in block in refresh]
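
One observation on the numbers above, offered as a hypothesis rather than a diagnosis: DRb frames each marshaled object with a 4-byte big-endian length and raises "too large packet" when that length exceeds its load limit. The reported size 67651907 is 0x04084943, and its first two bytes, 4 and 8, match the Marshal format version. That would be consistent with the reader picking up the start of a marshaled payload where it expected a length prefix. The sketch below only checks the arithmetic; the helper name is hypothetical.

```ruby
# Hypothetical helper: the four bytes that encode +size+ as a
# big-endian 32-bit length prefix.
def length_prefix_bytes(size)
  [size].pack("N").bytes
end

# The two version bytes that begin every Marshal dump (4 and 8).
MARSHAL_VERSION_BYTES = [Marshal::MAJOR_VERSION, Marshal::MINOR_VERSION].freeze

reported_size = 67_651_907 # from the "too large packet" log line
prefix = length_prefix_bytes(reported_size)

puts prefix.map { |b| format("0x%02x", b) }.join(" ") # 0x04 0x08 0x49 0x43
puts prefix.first(2) == MARSHAL_VERSION_BYTES         # true
```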

Comment 6 Adam Grare 2016-11-14 14:05:00 UTC
Created attachment 1220456
Broker Debug Patch

With this patch the broker server will log drb message sizes and checksums, and the broker client will print the original message when it hits a marshal error.
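
A rough sketch of the kind of information the patch is described as logging, assuming an MD5 checksum over the marshaled bytes. The actual patch contents are not shown in this report, so the method name, the choice of MD5, and the output format are all assumptions:

```ruby
require "digest"

# Hypothetical helper: summarise an object as it would go over the wire,
# by size in bytes and MD5 checksum of its marshaled form.
def drb_message_summary(obj)
  payload = Marshal.dump(obj)
  { size: payload.bytesize, checksum: Digest::MD5.hexdigest(payload) }
end

# Illustrative message only; real broker messages will look different.
sent = drb_message_summary(["get_vc_data", { "ems_id" => 50_000_000_000_004 }])
puts "size=#{sent[:size]} md5=#{sent[:checksum]}"
```

Comparing the size and checksum logged on the sending side with what the receiver actually read is one way to tell whether a message was altered or truncated in transit.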

Comment 7 Thomas Hennessy 2016-11-14 18:28:44 UTC
Adam,
What instructions can you provide for applying this patch?

Tom Hennessy

Comment 8 Adam Grare 2016-11-14 18:48:11 UTC
Tom, scp the file to /var/www/miq/vmdb and un-tar it, then restart evmserverd.

How much gets logged depends on the vim log level. With level_vim set to debug, the size and checksum of every outgoing message are printed (this will be quite verbose). With the vim log level left at warn, the extra information is shown only when a marshaling issue is hit.

Comment 11 Thomas Hennessy 2016-11-17 13:17:45 UTC
From the customer, who has received two hotfixes and is testing them:
=====
Most recent comment: On 2016-11-17 04:01:54, Trieu, Daniel commented:
"Hello RedHat Team,

Updating the ticket:

The hotfix to address reported heartbeat failure issues associated with pglogical has been pushed to 3 (out of 8) regions. I confirmed within 30 minutes that the heartbeat issue was resolved.

Out of caution, the plan is to push the hotfix to another 2 regions today and another 3 regions the day after.

To be clear, there is another hotfix on this ticket for marshal errors, which is in test/dev/uat right now and has not been pushed to any production region.


Daniel"
=====

Comment 12 Adam Grare 2016-11-21 14:05:37 UTC
This is the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=1385038

*** This bug has been marked as a duplicate of bug 1385038 ***

