Bug 1426762 - Storage mailbox checksum failure when reading mail - no checksum on storage
Summary: Storage mailbox checksum failure when reading mail - no checksum on storage
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.20.0
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: ovirt-4.2.0
: 4.20.9
Assignee: Nir Soffer
QA Contact: Raz Tamir
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2017-02-24 19:16 UTC by Nir Soffer
Modified: 2017-12-20 10:49 UTC (History)
6 users (show)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2017-12-20 10:49:40 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-4.2+


Attachments (Terms of Use)
Vdsm tests log (587.67 KB, text/plain)
2017-02-24 19:16 UTC, Nir Soffer
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 73074 0 master MERGED mailbox: Improve log when checksum fails 2020-02-25 03:28:14 UTC
oVirt gerrit 73077 0 master MERGED mailbox: Add tests for mailbox validation 2020-02-25 03:28:14 UTC
oVirt gerrit 73078 0 master MERGED mailbox: Decrease the chance of checksum failures 2020-02-25 03:28:14 UTC

Description Nir Soffer 2017-02-24 19:16:09 UTC
Created attachment 1257402 [details]
Vdsm tests log

Description of problem:

We have a random checksum failure in the storage mailbox, when reading new mail.
The error looks like:

2017-02-24 18:05:57,438 ERROR (mailbox-spm) [storage.MailBox.SpmMailMonitor] SPM_MailMonitor: mailbox 7 checksum failed, not clearing mailbox, clearing newMail

When it fails, we see that there is no checksum on storage:
expected='9\x19\x00\x00'
actual='\x00\x00\x00\x00'

When this happens, we drop the mailbox content, ignoring any messages in it. The 
sender host typically will send another message again, so this is expected to 
cause a delay of 2 seconds in receiving and handling the message.

Delay in receiving extend message may lead to pausing a VM.

Version-Release number of selected component (if applicable):
Seen on master and various older versions. 

How reproducible:
Random and rare failure

Steps to Reproduce:
- Run the mailbox tests 1000 times in jenkins CI, may fail after some runs
- We don't know how to reproduce this in real system.

Actual results:
No checksum on storage

Expected results:
Checksum on storage matches computed checksum

Comment 1 Nir Soffer 2017-02-24 19:41:26 UTC
Looking at the logs we see:

HSM thread sending mail to spm:

2017-02-24 18:05:57,263 DEBUG (mailbox-hsm) [storage.MailBox.HsmMailMonitor] HSM_MailMonitor - start: 64, end: 128, len: 4096, message(1/63): '1xtnd\xe1_\xfeeT\x8a\x18\xb3\xe0JT\xe5^\xc8\xdb\x8a_Z%\xd8\xfcs.\xa4\xc3C\xbb>\xc6\xf1r\xd7000000000000006400000000000' (mailbox:428)
2017-02-24 18:05:57,263 DEBUG (mailbox-hsm) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-1 /usr/bin/dd if=/var/tmp/tmpPsWFaW iflag=direct,fullblock bs=512 count=8 skip=56 (cwd None) (commands:69)
2017-02-24 18:05:57,285 DEBUG (mailbox-hsm) [storage.Misc.excCmd] SUCCESS: <err> = '8+0 records in\n8+0 records out\n4096 bytes (4.1 kB) copied, 0.00417743 s, 981 kB/s\n'; <rc> = 0 (commands:93)
2017-02-24 18:05:57,285 INFO  (mailbox-hsm) [storage.MailBox.HsmMailMonitor] HSM_MailMonitor sending mail to SPM - ['/usr/bin/dd', 'of=/var/tmp/tmpUN0jK3', 'iflag=fullblock', 'oflag=direct', 'conv=notrunc', 'bs=512', 'seek=56'] (mailbox:387)

HSM started write here:

2017-02-24 18:05:57,286 DEBUG (mailbox-hsm) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-1 /usr/bin/dd of=/var/tmp/tmpUN0jK3 iflag=fullblock oflag=direct conv=notrunc bs=512 seek=56 (cwd None) (commands:69)

Write ended after 1.21 seconds

2017-02-24 18:05:58,511 DEBUG (mailbox-hsm) [storage.Misc.excCmd] SUCCESS: <err> = '8+0 records in\n8+0 records out\n4096 bytes (4.1 kB) copied, 1.21678 s, 3.4 kB/s\n'; <rc> = 0 (commands:93)

HSM mailer wrote 8 blocks of 512 bytes.

While HSM was writing, SPM mailer was reading the mailbox several times:

2017-02-24 18:05:57,384 DEBUG (mailbox-spm) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-1 dd if=/var/tmp/tmpUN0jK3 iflag=direct,fullblock count=1 bs=40960 (cwd None) (commands:69)
2017-02-24 18:05:57,436 DEBUG (mailbox-spm) [storage.Misc.excCmd] SUCCESS: <err> = '1+0 records in\n1+0 records out\n40960 bytes (41 kB) copied, 0.0408719 s, 1.0 MB/s\n'; <rc> = 0 (commands:93)
2017-02-24 18:05:57,438 ERROR (mailbox-spm) [storage.MailBox.SpmMailMonitor] SPM_MailMonitor: mailbox 7 checksum failed, not clearing mailbox, clearing newMail (expected='9\x19\x00\x00', actual='\x00\x00\x00\x00') (mailbox:625)

No checksum on storage...

2017-02-24 18:05:57,540 DEBUG (mailbox-spm) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-1 dd if=/var/tmp/tmpUN0jK3 iflag=direct,fullblock count=1 bs=40960 (cwd None) (commands:69)
2017-02-24 18:05:57,743 DEBUG (mailbox-spm) [storage.Misc.excCmd] SUCCESS: <err> = '1+0 records in\n1+0 records out\n40960 bytes (41 kB) copied, 0.192273 s, 213 kB/s\n'; <rc> = 0 (commands:93)
2017-02-24 18:05:57,744 ERROR (mailbox-spm) [storage.MailBox.SpmMailMonitor] SPM_MailMonitor: mailbox 7 checksum failed, not clearing mailbox, clearing newMail (expected='9\x19\x00\x00', actual='\x00\x00\x00\x00') (mailbox:625)

Still no checksum on storage...

2017-02-24 18:05:57,845 DEBUG (mailbox-spm) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-1 dd if=/var/tmp/tmpUN0jK3 iflag=direct,fullblock count=1 bs=40960 (cwd None) (commands:69)
2017-02-24 18:05:57,960 DEBUG (mailbox-spm) [storage.Misc.excCmd] SUCCESS: <err> = '1+0 records in\n1+0 records out\n40960 bytes (41 kB) copied, 0.103274 s, 397 kB/s\n'; <rc> = 0 (commands:93)
2017-02-24 18:05:57,961 ERROR (mailbox-spm) [storage.MailBox.SpmMailMonitor] SPM_MailMonitor: mailbox 7 checksum failed, not clearing mailbox, clearing newMail (expected='9\x19\x00\x00', actual='\x00\x00\x00\x00') (mailbox:625)

Still no checksum on storage...

2017-02-24 18:05:58,062 DEBUG (mailbox-spm) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-1 dd if=/var/tmp/tmpUN0jK3 iflag=direct,fullblock count=1 bs=40960 (cwd None) (commands:69)
2017-02-24 18:05:58,231 DEBUG (mailbox-spm) [storage.Misc.excCmd] SUCCESS: <err> = '1+0 records in\n1+0 records out\n40960 bytes (41 kB) copied, 0.159696 s, 256 kB/s\n'; <rc> = 0 (commands:93)
2017-02-24 18:05:58,232 ERROR (mailbox-spm) [storage.MailBox.SpmMailMonitor] SPM_MailMonitor: mailbox 7 checksum failed, not clearing mailbox, clearing newMail (expected='9\x19\x00\x00', actual='\x00\x00\x00\x00') (mailbox:625)

Still no checksum on storage.

At this point the test timed out.

I think whats going on is:

1. HSM is writing a mailbox of 4096 bytes. The last 64 bytes of this
   blob contain the checksum of the mailbox contents.

2. HSM is using 512 bytes block size, so it is doing 8 writes.

3. The checksum is written only in the last write.

4. Until all the blocks were written, the mailbox is in invalid state,
   and any read will see a missing checksum, or a wrong checksum from
   a previous write.

5. Because write in the jenkins is extremely slow (3.4 kB/s), this error
   is more likely to happen in jenkins CI.

Comment 2 Tal Nisan 2017-02-26 12:53:01 UTC
Nir is this a regression or was this always the behavior?
Targeting to 4.1.1 for now

Comment 3 Nir Soffer 2017-02-26 12:59:46 UTC
(In reply to Tal Nisan from comment #2)
> Nir is this a regression or was this always the behavior?
> Targeting to 4.1.1 for now

This is old as vdsm, no regression.

Comment 4 Raz Tamir 2017-05-25 18:24:15 UTC
Verifying according to "We don't know how to reproduce this in real system.
" in comment #0

Comment 5 Sandro Bonazzola 2017-12-20 10:49:40 UTC
This bugzilla is included in oVirt 4.2.0 release, published on Dec 20th 2017.

Since the problem described in this bug report should be
resolved in oVirt 4.2.0 release, published on Dec 20th 2017, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.


Note You need to log in before you can comment on or make changes to this bug.