Description of problem:
Hi, I see a lot of errors in vdsm.log like:

2022-02-14 08:42:52,086+0100 ERROR (mailbox-spm) [storage.MailBox.SpmMailMonitor] mailbox 65 checksum failed, not clearing mailbox, clearing new mail (data=b'\xff\xff\xff\xff\ <lot of data> \x00\x00', checksum=<function checksum at 0x7f2454712b70>, expected=b'\xbfG\x00\x00') (mailbox:602)
2022-02-14 08:42:52,087+0100 ERROR (mailbox-spm) [storage.MailBox.SpmMailMonitor] mailbox 66 checksum failed, not clearing mailbox, clearing new mail (data=b'\x00\x00\x00\x00\ <lot of data> \xff\xff', checksum=<function checksum at 0x7f2454712b70>, expected=b'\x04\xf0\x0b\x00') (mailbox:602)

We have 3 hosts and 8 iSCSI domains.

Version-Release number of selected component (if applicable):
We are running the latest oVirt engine and hosts:
Hosts: ovirt-node-ng-installer-4.4.10-2022020214.el8.iso
engine: ovirt-engine-4.4.10.6-1.el8.noarch

Due to the attachment file size limit, I am sharing the logs this way:
https://www.oslavany.net/userdata/publicdoc/ovirt/server1-vdsm-logs.tar
https://www.oslavany.net/userdata/publicdoc/ovirt/server2-vdsm-logs.tar
https://www.oslavany.net/userdata/publicdoc/ovirt/server4-vdsm-logs.tar

Current SPM is "server4".

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
checksum failed

Expected results:
0 checksum failed

Additional info:
I observed the issue before we upgraded to v4.4 (I hoped the upgrade would fix it, but it did not).
Nir, this reminded us of bz 1426762.
Do we need more logs or more details on the implications of this issue?
(In reply to Arik from comment #2)
> Nir, it reminded us of bz 1426762
> Do we need more logs or more details on the implication of this issue?

The fixes for bug 1426762 mention that we have no way to prevent the race between the hosts writing messages to the mailbox and the SPM reading them. The checksum is how we detect that what we read is inconsistent and needs to be read again.

Maybe we need to improve the way this is handled: instead of logging warnings and dropping the message, read the relevant messages again.

The attached logs should be enough to start investigating this issue. Once we start, we may request more logs.
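The "read the relevant messages again" idea could look roughly like the minimal Python sketch below. It is not vdsm's implementation: the 4-byte trailer layout, the byte-sum checksum format, and the names `read_once` and `read_mailbox` are assumptions made only for illustration.

```python
import struct

CHECKSUM_BYTES = 4  # assumed size of the checksum trailer


def checksum(payload: bytes) -> bytes:
    """Sum of the payload bytes, packed as a little-endian uint32 (assumed format)."""
    return struct.pack("<L", sum(payload) & 0xFFFFFFFF)


def read_mailbox(read_once, retries=3):
    """Return a consistent payload, or None if every attempt fails the check.

    read_once is a hypothetical callable that returns payload + checksum trailer.
    """
    for _ in range(retries):
        raw = read_once()
        payload, expected = raw[:-CHECKSUM_BYTES], raw[-CHECKSUM_BYTES:]
        if checksum(payload) == expected:
            return payload
        # Mismatch: a host was probably mid-write, so read the mailbox again.
    return None


# Simulate one torn read followed by a consistent read.
good = b"\x01\x02\x03"
frames = iter([b"\xff\x02\x03" + checksum(good), good + checksum(good)])
result = read_mailbox(lambda: next(frames))
```

Here the first read fails the check and is retried instead of being dropped. Giving up after a bounded number of attempts keeps the monitor loop from stalling on a mailbox that never becomes consistent.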
Ack, thanks. Setting low severity, since this should have no functional impact (though there is some room for improvement).
Moved to GitHub: https://github.com/oVirt/vdsm/issues/205