Bug 2054209

Summary: ERROR (mailbox-spm) [storage.MailBox.SpmMailMonitor] mailbox 65 checksum failed, not clearing mailbox, clearing new mail
Product: [oVirt] vdsm Reporter: petr.kyselak
Component: SuperVDSM Assignee: Nir Soffer <nsoffer>
Status: CLOSED DEFERRED QA Contact: Avihai <aefrat>
Severity: low Docs Contact:
Priority: unspecified    
Version: 4.40.100.2 CC: ahadas, bugs, bzlotnik
Target Milestone: --- Flags: pm-rhel: ovirt-4.5?
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-05-24 13:26:07 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description petr.kyselak 2022-02-14 12:50:34 UTC
Description of problem:
Hi,
I see a lot of errors in vdsm.log like:

2022-02-14 08:42:52,086+0100 ERROR (mailbox-spm) [storage.MailBox.SpmMailMonitor] mailbox
65 checksum failed, not clearing mailbox, clearing new mail (data=b'\xff\xff\xff\xff\
<lot of data> \x00\x00', checksum=<function checksum at 0x7f2454712b70>,
expected=b'\xbfG\x00\x00') (mailbox:602)
2022-02-14 08:42:52,087+0100 ERROR (mailbox-spm) [storage.MailBox.SpmMailMonitor] mailbox
66 checksum failed, not clearing mailbox, clearing new mail (data=b'\x00\x00\x00\x00\
<lot of data> \xff\xff', checksum=<function checksum at 0x7f2454712b70>,
expected=b'\x04\xf0\x0b\x00') (mailbox:602)
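
To gauge how often each mailbox is affected, the failures can be counted per mailbox number. The following is a minimal sketch, assuming the message format matches the excerpt above; the function name count_checksum_failures is hypothetical and not part of vdsm.

```python
import re
from collections import Counter

# Matches the message text quoted above. \s+ tolerates the line wrap
# between "mailbox" and the mailbox number in the pasted excerpt.
CHECKSUM_FAILED = re.compile(
    r"\[storage\.MailBox\.SpmMailMonitor\]\s+mailbox\s+(\d+)\s+checksum failed"
)


def count_checksum_failures(log_text):
    """Return a Counter mapping mailbox number to number of failures."""
    return Counter(int(m) for m in CHECKSUM_FAILED.findall(log_text))


# Sample input modelled on the two log entries in this report.
sample = (
    "2022-02-14 08:42:52,086+0100 ERROR (mailbox-spm) "
    "[storage.MailBox.SpmMailMonitor] mailbox\n"
    "65 checksum failed, not clearing mailbox, clearing new mail\n"
    "2022-02-14 08:42:52,087+0100 ERROR (mailbox-spm) "
    "[storage.MailBox.SpmMailMonitor] mailbox\n"
    "66 checksum failed, not clearing mailbox, clearing new mail\n"
)
failures = count_checksum_failures(sample)
```

Running the same function over the full text of vdsm.log from each host would show whether the failures concentrate on a few mailboxes or are spread evenly.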

We have 3 hosts and 8 iSCSI domains.


Version-Release number of selected component (if applicable):
We are running the latest oVirt engine and hosts:
Hosts: ovirt-node-ng-installer-4.4.10-2022020214.el8.iso
engine: ovirt-engine-4.4.10.6-1.el8.noarch

Due to the attachment file size limit, I am sharing the logs this way:
https://www.oslavany.net/userdata/publicdoc/ovirt/server1-vdsm-logs.tar
https://www.oslavany.net/userdata/publicdoc/ovirt/server2-vdsm-logs.tar
https://www.oslavany.net/userdata/publicdoc/ovirt/server4-vdsm-logs.tar

Current SPM is "server4".



How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:
checksum failed

Expected results:
0 checksum failed

Additional info:
I observed the issue before we upgraded to v4.4 (I hoped the upgrade would fix it, but it
did not).

Comment 1 RHEL Program Management 2022-02-14 13:21:55 UTC
The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

Comment 2 Arik 2022-02-14 15:12:28 UTC
Nir, it reminded us of bz 1426762
Do we need more logs or more details on the implication of this issue?

Comment 3 Nir Soffer 2022-02-14 15:30:51 UTC
(In reply to Arik from comment #2)
> Nir, it reminded us of bz 1426762
> Do we need more logs or more details on the implication of this issue?

The fixes for bug 1426762 mention that we don't have a way to prevent the race between
the hosts writing messages to the mailbox and the SPM reading them. The checksum
is our way to tell that what we read is not consistent and we need to read it again.

Maybe we need to improve the way this is handled - instead of logging warnings and
dropping the message, read the relevant messages again.

The attached logs should be enough to start investigating this issue. When we do
this we may request more logs.
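
The "read it again" idea from the comment above can be sketched roughly as follows. This is only an illustration, not vdsm's implementation: the byte-sum checksum, the 4-byte trailer layout, the retry count, and the names read_consistent and read_mailbox are all assumptions made for the example.

```python
import struct

CHECKSUM_BYTES = 4  # assumption: checksum stored in the last 4 bytes


def checksum(data):
    """Illustrative checksum: byte sum packed as 4 little-endian bytes."""
    return struct.pack("<L", sum(data) & 0xFFFFFFFF)


def read_consistent(read_mailbox, retries=3):
    """Read a mailbox, re-reading when the checksum does not match.

    read_mailbox is a hypothetical callable returning payload + checksum.
    Returns the payload, or None if every attempt was inconsistent.
    """
    for _ in range(retries):
        raw = read_mailbox()
        payload, expected = raw[:-CHECKSUM_BYTES], raw[-CHECKSUM_BYTES:]
        if checksum(payload) == expected:
            return payload
    return None


# Simulate one torn read (payload changed mid-read) followed by a
# consistent one.
good = b"\x01\x02\x03"
reads = iter([b"\x01\xff\x03" + checksum(good), good + checksum(good)])
result = read_consistent(lambda: next(reads))
```

With a bounded retry, a read that races with a host's write is simply repeated, and the message is dropped only if every attempt is inconsistent.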

Comment 4 Arik 2022-02-14 16:57:56 UTC
Ack thanks.
So setting low severity as this should have no functional impact (but there is some room for improvement)

Comment 6 Arik 2022-05-24 13:26:07 UTC
Moved to GitHub: https://github.com/oVirt/vdsm/issues/205