Bug 1426762

Summary: Storage mailbox checksum failure when reading mail - no checksum on storage
Product: [oVirt] vdsm Reporter: Nir Soffer <nsoffer>
Component: CoreAssignee: Nir Soffer <nsoffer>
Status: CLOSED CURRENTRELEASE QA Contact: Raz Tamir <ratamir>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.20.0CC: alitke, amureini, bugs, danken, nsoffer, tnisan
Target Milestone: ovirt-4.2.0Flags: rule-engine: ovirt-4.2+
Target Release: 4.20.9   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
undefined
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-12-20 10:49:40 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Vdsm tests log none

Description Nir Soffer 2017-02-24 19:16:09 UTC
Created attachment 1257402 [details]
Vdsm tests log

Description of problem:

We have a random checksum failure in the storage mailbox, when reading new mail.
The error looks like:

2017-02-24 18:05:57,438 ERROR (mailbox-spm) [storage.MailBox.SpmMailMonitor] SPM_MailMonitor: mailbox 7 checksum failed, not clearing mailbox, clearing newMail

When it fails, we see that there is no checksum on storage:
expected='9\x19\x00\x00'
actual='\x00\x00\x00\x00'

When this happens, we drop the mailbox content, ignoring any messages in it. The 
sender host typically will send another message again, so this is expected to 
cause a delay of 2 seconds in receiving and handling the message.

Delay in receiving extend message may lead to pausing a VM.

Version-Release number of selected component (if applicable):
Seen on master and various older versions. 

How reproducible:
Random and rare failure

Steps to Reproduce:
- Run the mailbox tests 1000 times in jenkins CI, may fail after some runs
- We don't know how to reproduce this in real system.

Actual results:
No checksum on storage

Expected results:
Checksum on storage matches computed checksum

Comment 1 Nir Soffer 2017-02-24 19:41:26 UTC
Looking at the logs we see:

HSM thread sending mail to spm:

2017-02-24 18:05:57,263 DEBUG (mailbox-hsm) [storage.MailBox.HsmMailMonitor] HSM_MailMonitor - start: 64, end: 128, len: 4096, message(1/63): '1xtnd\xe1_\xfeeT\x8a\x18\xb3\xe0JT\xe5^\xc8\xdb\x8a_Z%\xd8\xfcs.\xa4\xc3C\xbb>\xc6\xf1r\xd7000000000000006400000000000' (mailbox:428)
2017-02-24 18:05:57,263 DEBUG (mailbox-hsm) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-1 /usr/bin/dd if=/var/tmp/tmpPsWFaW iflag=direct,fullblock bs=512 count=8 skip=56 (cwd None) (commands:69)
2017-02-24 18:05:57,285 DEBUG (mailbox-hsm) [storage.Misc.excCmd] SUCCESS: <err> = '8+0 records in\n8+0 records out\n4096 bytes (4.1 kB) copied, 0.00417743 s, 981 kB/s\n'; <rc> = 0 (commands:93)
2017-02-24 18:05:57,285 INFO  (mailbox-hsm) [storage.MailBox.HsmMailMonitor] HSM_MailMonitor sending mail to SPM - ['/usr/bin/dd', 'of=/var/tmp/tmpUN0jK3', 'iflag=fullblock', 'oflag=direct', 'conv=notrunc', 'bs=512', 'seek=56'] (mailbox:387)

HSM started write here:

2017-02-24 18:05:57,286 DEBUG (mailbox-hsm) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-1 /usr/bin/dd of=/var/tmp/tmpUN0jK3 iflag=fullblock oflag=direct conv=notrunc bs=512 seek=56 (cwd None) (commands:69)

Write ended after 1.21 seconds

2017-02-24 18:05:58,511 DEBUG (mailbox-hsm) [storage.Misc.excCmd] SUCCESS: <err> = '8+0 records in\n8+0 records out\n4096 bytes (4.1 kB) copied, 1.21678 s, 3.4 kB/s\n'; <rc> = 0 (commands:93)

HSM mailer wrote 8 blocks of 512 bytes.

While HSM was writing, SPM mailer was reading the mailbox several times:

2017-02-24 18:05:57,384 DEBUG (mailbox-spm) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-1 dd if=/var/tmp/tmpUN0jK3 iflag=direct,fullblock count=1 bs=40960 (cwd None) (commands:69)
2017-02-24 18:05:57,436 DEBUG (mailbox-spm) [storage.Misc.excCmd] SUCCESS: <err> = '1+0 records in\n1+0 records out\n40960 bytes (41 kB) copied, 0.0408719 s, 1.0 MB/s\n'; <rc> = 0 (commands:93)
2017-02-24 18:05:57,438 ERROR (mailbox-spm) [storage.MailBox.SpmMailMonitor] SPM_MailMonitor: mailbox 7 checksum failed, not clearing mailbox, clearing newMail (expected='9\x19\x00\x00', actual='\x00\x00\x00\x00') (mailbox:625)

No checksum on storage...

2017-02-24 18:05:57,540 DEBUG (mailbox-spm) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-1 dd if=/var/tmp/tmpUN0jK3 iflag=direct,fullblock count=1 bs=40960 (cwd None) (commands:69)
2017-02-24 18:05:57,743 DEBUG (mailbox-spm) [storage.Misc.excCmd] SUCCESS: <err> = '1+0 records in\n1+0 records out\n40960 bytes (41 kB) copied, 0.192273 s, 213 kB/s\n'; <rc> = 0 (commands:93)
2017-02-24 18:05:57,744 ERROR (mailbox-spm) [storage.MailBox.SpmMailMonitor] SPM_MailMonitor: mailbox 7 checksum failed, not clearing mailbox, clearing newMail (expected='9\x19\x00\x00', actual='\x00\x00\x00\x00') (mailbox:625)

Still no checksum on storage...

2017-02-24 18:05:57,845 DEBUG (mailbox-spm) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-1 dd if=/var/tmp/tmpUN0jK3 iflag=direct,fullblock count=1 bs=40960 (cwd None) (commands:69)
2017-02-24 18:05:57,960 DEBUG (mailbox-spm) [storage.Misc.excCmd] SUCCESS: <err> = '1+0 records in\n1+0 records out\n40960 bytes (41 kB) copied, 0.103274 s, 397 kB/s\n'; <rc> = 0 (commands:93)
2017-02-24 18:05:57,961 ERROR (mailbox-spm) [storage.MailBox.SpmMailMonitor] SPM_MailMonitor: mailbox 7 checksum failed, not clearing mailbox, clearing newMail (expected='9\x19\x00\x00', actual='\x00\x00\x00\x00') (mailbox:625)

Still no checksum on storage...

2017-02-24 18:05:58,062 DEBUG (mailbox-spm) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-1 dd if=/var/tmp/tmpUN0jK3 iflag=direct,fullblock count=1 bs=40960 (cwd None) (commands:69)
2017-02-24 18:05:58,231 DEBUG (mailbox-spm) [storage.Misc.excCmd] SUCCESS: <err> = '1+0 records in\n1+0 records out\n40960 bytes (41 kB) copied, 0.159696 s, 256 kB/s\n'; <rc> = 0 (commands:93)
2017-02-24 18:05:58,232 ERROR (mailbox-spm) [storage.MailBox.SpmMailMonitor] SPM_MailMonitor: mailbox 7 checksum failed, not clearing mailbox, clearing newMail (expected='9\x19\x00\x00', actual='\x00\x00\x00\x00') (mailbox:625)

Still no checksum on storage.

At this point the test timed out.

I think whats going on is:

1. HSM is writing a mailbox of 4096 bytes. The last 64 bytes of this
   blob contain the checksum of the mailbox contents.

2. HSM is using 512 bytes block size, so it is doing 8 writes.

3. The checksum is written only in the last write.

4. Until all the blocks were written, the mailbox is in invalid state,
   and any read will see a missing checksum, or a wrong checksum from
   a previous write.

5. Because write in the jenkins is extremely slow (3.4 kB/s), this error
   is more likely to happen in jenkins CI.

Comment 2 Tal Nisan 2017-02-26 12:53:01 UTC
Nir is this a regression or was this always the behavior?
Targeting to 4.1.1 for now

Comment 3 Nir Soffer 2017-02-26 12:59:46 UTC
(In reply to Tal Nisan from comment #2)
> Nir is this a regression or was this always the behavior?
> Targeting to 4.1.1 for now

This is old as vdsm, no regression.

Comment 4 Raz Tamir 2017-05-25 18:24:15 UTC
Verifying according to "We don't know how to reproduce this in real system.
" in comment #0

Comment 5 Sandro Bonazzola 2017-12-20 10:49:40 UTC
This bugzilla is included in oVirt 4.2.0 release, published on Dec 20th 2017.

Since the problem described in this bug report should be
resolved in oVirt 4.2.0 release, published on Dec 20th 2017, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.