Bug 2054209 - ERROR (mailbox-spm) [storage.MailBox.SpmMailMonitor] mailbox 65 checksum failed, not clearing mailbox, clearing new mail
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: vdsm
Classification: oVirt
Component: SuperVDSM
Version: 4.40.100.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: Nir Soffer
QA Contact: Avihai
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-02-14 12:50 UTC by petr.kyselak
Modified: 2022-05-24 13:26 UTC (History)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2022-05-24 13:26:07 UTC
oVirt Team: Storage
Embargoed:
pm-rhel: ovirt-4.5?




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Issue Tracker | RHV-44692 | 0 | None | None | None | 2022-02-14 12:52:39 UTC

Description petr.kyselak 2022-02-14 12:50:34 UTC
Description of problem:
Hi,
I see a lot of errors in vdsm.log like:

2022-02-14 08:42:52,086+0100 ERROR (mailbox-spm) [storage.MailBox.SpmMailMonitor] mailbox
65 checksum failed, not clearing mailbox, clearing new mail (data=b'\xff\xff\xff\xff\
<lot of data> \x00\x00', checksum=<function checksum at 0x7f2454712b70>,
expected=b'\xbfG\x00\x00') (mailbox:602)
2022-02-14 08:42:52,087+0100 ERROR (mailbox-spm) [storage.MailBox.SpmMailMonitor] mailbox
66 checksum failed, not clearing mailbox, clearing new mail (data=b'\x00\x00\x00\x00\
<lot of data> \xff\xff', checksum=<function checksum at 0x7f2454712b70>,
expected=b'\x04\xf0\x0b\x00') (mailbox:602)

We have 3 hosts and 8 iSCSI domains.


Version-Release number of selected component (if applicable):
We are running the latest oVirt engine and hosts:
Hosts: ovirt-node-ng-installer-4.4.10-2022020214.el8.iso
engine: ovirt-engine-4.4.10.6-1.el8.noarch

Because of the attachment file size limit, I am sharing the logs via these links:
https://www.oslavany.net/userdata/publicdoc/ovirt/server1-vdsm-logs.tar
https://www.oslavany.net/userdata/publicdoc/ovirt/server2-vdsm-logs.tar
https://www.oslavany.net/userdata/publicdoc/ovirt/server4-vdsm-logs.tar

Current SPM is "server4".



How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:
checksum failed

Expected results:
No checksum failures

Additional info:
I observed the issue before we upgraded to v4.4 (I hoped the upgrade would fix it, but it
did not).

Comment 1 RHEL Program Management 2022-02-14 13:21:55 UTC
The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

Comment 2 Arik 2022-02-14 15:12:28 UTC
Nir, it reminded us of bz 1426762
Do we need more logs or more details on the implication of this issue?

Comment 3 Nir Soffer 2022-02-14 15:30:51 UTC
(In reply to Arik from comment #2)
> Nir, it reminded us of bz 1426762
> Do we need more logs or more details on the implication of this issue?

The fixes for bug 1426762 mention that we don't have a way to prevent the race between
the hosts writing messages to the mailbox and the SPM reading them. The checksum
is our way to tell that what we read is not consistent and that we need to read it again.

Maybe we need to improve the way this is handled: instead of logging warnings and
dropping the message, read the relevant messages again.
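
The re-read idea above could be sketched roughly as follows. This is an illustration only and not vdsm's mailbox code: the checksum function (byte sum packed as a little-endian 32-bit value), the payload-plus-trailer layout, and the names read_consistent and read_mailbox are all assumptions made for the example. The only detail taken from the logs is that the expected checksum is 4 bytes long.

```python
import struct
from typing import Callable, Optional

# Assumption: the checksum is a 4-byte trailer, matching the length of the
# expected=b'...' values in the logged errors. The real layout may differ.
CHECKSUM_SIZE = 4


def checksum(payload: bytes) -> bytes:
    """Hypothetical checksum: sum of all bytes, packed as little-endian uint32."""
    return struct.pack("<I", sum(payload) & 0xFFFFFFFF)


def read_consistent(
    read_mailbox: Callable[[], bytes], retries: int = 3
) -> Optional[bytes]:
    """Re-read a mailbox until its payload and checksum agree.

    read_mailbox is a hypothetical callable returning payload + checksum.
    Returns the payload, or None if every attempt was inconsistent.
    """
    for _ in range(retries):
        raw = read_mailbox()
        payload, expected = raw[:-CHECKSUM_SIZE], raw[-CHECKSUM_SIZE:]
        if checksum(payload) == expected:
            return payload
        # Mismatch: the writer may have been mid-update, so read again
        # instead of dropping the message.
    return None
```

With this shape, a read that races with a host's write is retried a bounded number of times, and the message is given up on only if the data stays inconsistent across all attempts.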

The attached logs should be enough to start investigating this issue. When we do
this we may request more logs.

Comment 4 Arik 2022-02-14 16:57:56 UTC
Ack, thanks.
So setting low severity, as this should have no functional impact (but there is some room for improvement).

Comment 6 Arik 2022-05-24 13:26:07 UTC
Moved to GitHub: https://github.com/oVirt/vdsm/issues/205

