Created attachment 1257402 [details]
Vdsm tests log

Description of problem:

We have a random checksum failure in the storage mailbox when reading new mail. The error looks like:

2017-02-24 18:05:57,438 ERROR (mailbox-spm) [storage.MailBox.SpmMailMonitor] SPM_MailMonitor: mailbox 7 checksum failed, not clearing mailbox, clearing newMail

When it fails, we see that there is no checksum on storage:

    expected='9\x19\x00\x00' actual='\x00\x00\x00\x00'

When this happens, we drop the mailbox content, ignoring any messages in it. The sender host will typically send the message again, so this is expected to cause a delay of about 2 seconds in receiving and handling the message. A delay in receiving an extend message may lead to pausing a VM.

Version-Release number of selected component (if applicable):
Seen on master and various older versions.

How reproducible:
Random and rare failure.

Steps to Reproduce:
- Run the mailbox tests 1000 times in Jenkins CI; it may fail after some runs.
- We don't know how to reproduce this on a real system.

Actual results:
No checksum on storage.

Expected results:
Checksum on storage matches the computed checksum.
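The SPM-side check described above can be sketched as follows. This is a minimal illustration, not the actual vdsm code: the function names are hypothetical, and a 32-bit additive checksum stored at the end of the mailbox is an assumed scheme chosen to match the 4-byte expected/actual values in the log (where "expected" is the computed value and "actual" is what was read from storage).

```python
CSUM_SIZE = 4

def checksum(data: bytes) -> bytes:
    # Assumed scheme: 32-bit additive checksum, little-endian.
    return (sum(data) & 0xFFFFFFFF).to_bytes(4, "little")

def verify_mailbox(mailbox: bytes) -> None:
    """Hypothetical sketch of the verification that logs the error above."""
    payload, stored = mailbox[:-CSUM_SIZE], mailbox[-CSUM_SIZE:]
    expected = checksum(payload)
    if stored != expected:
        # On mismatch vdsm drops the mailbox content ("clearing newMail")
        # and waits for the sender to retransmit.
        raise ValueError("checksum failed (expected=%r, actual=%r)"
                         % (expected, stored))
```

With a zeroed trailing checksum, as seen on storage in the failure, this raises immediately; a fully written mailbox passes.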
Looking at the logs we see:

The HSM thread sending mail to the SPM:

2017-02-24 18:05:57,263 DEBUG (mailbox-hsm) [storage.MailBox.HsmMailMonitor] HSM_MailMonitor - start: 64, end: 128, len: 4096, message(1/63): '1xtnd\xe1_\xfeeT\x8a\x18\xb3\xe0JT\xe5^\xc8\xdb\x8a_Z%\xd8\xfcs.\xa4\xc3C\xbb>\xc6\xf1r\xd7000000000000006400000000000' (mailbox:428)
2017-02-24 18:05:57,263 DEBUG (mailbox-hsm) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-1 /usr/bin/dd if=/var/tmp/tmpPsWFaW iflag=direct,fullblock bs=512 count=8 skip=56 (cwd None) (commands:69)
2017-02-24 18:05:57,285 DEBUG (mailbox-hsm) [storage.Misc.excCmd] SUCCESS: <err> = '8+0 records in\n8+0 records out\n4096 bytes (4.1 kB) copied, 0.00417743 s, 981 kB/s\n'; <rc> = 0 (commands:93)
2017-02-24 18:05:57,285 INFO (mailbox-hsm) [storage.MailBox.HsmMailMonitor] HSM_MailMonitor sending mail to SPM - ['/usr/bin/dd', 'of=/var/tmp/tmpUN0jK3', 'iflag=fullblock', 'oflag=direct', 'conv=notrunc', 'bs=512', 'seek=56'] (mailbox:387)

The HSM started the write here:

2017-02-24 18:05:57,286 DEBUG (mailbox-hsm) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-1 /usr/bin/dd of=/var/tmp/tmpUN0jK3 iflag=fullblock oflag=direct conv=notrunc bs=512 seek=56 (cwd None) (commands:69)

The write ended after 1.21 seconds:

2017-02-24 18:05:58,511 DEBUG (mailbox-hsm) [storage.Misc.excCmd] SUCCESS: <err> = '8+0 records in\n8+0 records out\n4096 bytes (4.1 kB) copied, 1.21678 s, 3.4 kB/s\n'; <rc> = 0 (commands:93)

The HSM mailer wrote 8 blocks of 512 bytes.
While the HSM was writing, the SPM mailer read the mailbox several times:

2017-02-24 18:05:57,384 DEBUG (mailbox-spm) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-1 dd if=/var/tmp/tmpUN0jK3 iflag=direct,fullblock count=1 bs=40960 (cwd None) (commands:69)
2017-02-24 18:05:57,436 DEBUG (mailbox-spm) [storage.Misc.excCmd] SUCCESS: <err> = '1+0 records in\n1+0 records out\n40960 bytes (41 kB) copied, 0.0408719 s, 1.0 MB/s\n'; <rc> = 0 (commands:93)
2017-02-24 18:05:57,438 ERROR (mailbox-spm) [storage.MailBox.SpmMailMonitor] SPM_MailMonitor: mailbox 7 checksum failed, not clearing mailbox, clearing newMail (expected='9\x19\x00\x00', actual='\x00\x00\x00\x00') (mailbox:625)

No checksum on storage...

2017-02-24 18:05:57,540 DEBUG (mailbox-spm) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-1 dd if=/var/tmp/tmpUN0jK3 iflag=direct,fullblock count=1 bs=40960 (cwd None) (commands:69)
2017-02-24 18:05:57,743 DEBUG (mailbox-spm) [storage.Misc.excCmd] SUCCESS: <err> = '1+0 records in\n1+0 records out\n40960 bytes (41 kB) copied, 0.192273 s, 213 kB/s\n'; <rc> = 0 (commands:93)
2017-02-24 18:05:57,744 ERROR (mailbox-spm) [storage.MailBox.SpmMailMonitor] SPM_MailMonitor: mailbox 7 checksum failed, not clearing mailbox, clearing newMail (expected='9\x19\x00\x00', actual='\x00\x00\x00\x00') (mailbox:625)

Still no checksum on storage...

2017-02-24 18:05:57,845 DEBUG (mailbox-spm) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-1 dd if=/var/tmp/tmpUN0jK3 iflag=direct,fullblock count=1 bs=40960 (cwd None) (commands:69)
2017-02-24 18:05:57,960 DEBUG (mailbox-spm) [storage.Misc.excCmd] SUCCESS: <err> = '1+0 records in\n1+0 records out\n40960 bytes (41 kB) copied, 0.103274 s, 397 kB/s\n'; <rc> = 0 (commands:93)
2017-02-24 18:05:57,961 ERROR (mailbox-spm) [storage.MailBox.SpmMailMonitor] SPM_MailMonitor: mailbox 7 checksum failed, not clearing mailbox, clearing newMail (expected='9\x19\x00\x00', actual='\x00\x00\x00\x00') (mailbox:625)

Still no checksum on storage...
2017-02-24 18:05:58,062 DEBUG (mailbox-spm) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-1 dd if=/var/tmp/tmpUN0jK3 iflag=direct,fullblock count=1 bs=40960 (cwd None) (commands:69)
2017-02-24 18:05:58,231 DEBUG (mailbox-spm) [storage.Misc.excCmd] SUCCESS: <err> = '1+0 records in\n1+0 records out\n40960 bytes (41 kB) copied, 0.159696 s, 256 kB/s\n'; <rc> = 0 (commands:93)
2017-02-24 18:05:58,232 ERROR (mailbox-spm) [storage.MailBox.SpmMailMonitor] SPM_MailMonitor: mailbox 7 checksum failed, not clearing mailbox, clearing newMail (expected='9\x19\x00\x00', actual='\x00\x00\x00\x00') (mailbox:625)

Still no checksum on storage. At this point the test timed out.

I think what's going on is:

1. The HSM is writing a mailbox of 4096 bytes. The last 64 bytes of this blob contain the checksum of the mailbox contents.
2. The HSM is using a 512-byte block size, so it is doing 8 writes.
3. The checksum is written only in the last write.
4. Until all the blocks have been written, the mailbox is in an invalid state, and any read will see a missing checksum, or a stale checksum from a previous write.
5. Because writes in Jenkins are extremely slow (3.4 kB/s), this error is much more likely to happen in Jenkins CI.
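The race in steps 1-5 can be simulated as follows. This is a minimal sketch, not vdsm code: the additive checksum and its placement in the final 512-byte block are assumptions chosen to match the description above (the real layout reserves the last 64 bytes for the checksum area).

```python
BLOCK = 512        # dd bs=512, as in the HSM logs
MAILBOX = 4096     # dd count=8, 8 blocks total
CSUM_SIZE = 4      # 4-byte checksum values appear in the error logs

def checksum(data: bytes) -> bytes:
    # Assumed scheme: 32-bit additive checksum, little-endian.
    return (sum(data) & 0xFFFFFFFF).to_bytes(4, "little")

def write_in_blocks(payload: bytes):
    """Yield the on-storage mailbox state after each 512-byte write,
    mimicking the 8 sequential dd block writes."""
    blob = payload.ljust(MAILBOX - CSUM_SIZE, b"\0") + checksum(payload)
    mailbox = bytearray(MAILBOX)  # storage starts zeroed
    for off in range(0, MAILBOX, BLOCK):
        mailbox[off:off + BLOCK] = blob[off:off + BLOCK]
        yield bytes(mailbox)

payload = b"1xtnd example extend message"  # illustrative payload
states = list(write_in_blocks(payload))

# A reader sampling before the 8th write sees a zero checksum,
# exactly as in the SPM logs above:
for state in states[:-1]:
    assert state[-CSUM_SIZE:] == b"\0\0\0\0"

# Only after the final block lands is the checksum valid:
assert states[-1][-CSUM_SIZE:] == checksum(payload)
```

With the 1.21-second write observed in Jenkins, the SPM's 2-second polling loop has a good chance of sampling one of the first seven states, which explains why the failure is rare locally but reproducible in slow CI.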
Nir, is this a regression or was this always the behavior? Targeting to 4.1.1 for now.
(In reply to Tal Nisan from comment #2)
> Nir is this a regression or was this always the behavior?
> Targeting to 4.1.1 for now

This is as old as vdsm, no regression.
Verifying according to "We don't know how to reproduce this in real system" in comment #0.
This bugzilla is included in oVirt 4.2.0 release, published on Dec 20th 2017. Since the problem described in this bug report should be resolved in that release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.