Bug 673144

Summary: [vdsm] [storage] mailbox checksum errors running lvextend - extend fails and vm pause
Product: Red Hat Enterprise Linux 6 Reporter: Haim <hateya>
Component: vdsm Assignee: Eduardo Warszawski <ewarszaw>
Status: CLOSED CURRENTRELEASE QA Contact: yeylon <yeylon>
Severity: high Docs Contact:
Priority: unspecified    
Version: 6.1 CC: abaron, bazulay, danken, dnaori, ewarszaw, iheim, ilvovsky, mgoldboi, smizrahi, srevivo, yeylon
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: All   
Whiteboard: Storage
Fixed In Version: vdsm-4.9-47.el6 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-08-19 15:24:02 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Attachments:
Description Flags
SPM log. none

Description Haim 2011-01-27 14:36:58 UTC
Created attachment 475612
SPM log.

Description of problem:

When running a VM on the SPM that extends its LV rapidly (requests to extend the logical volume, using FCP storage), I get checksum errors on the mailbox LV, and the VM pauses.
The theory is that we have a race. The SPM and HSM exchange messages as follows:

- When the HSM wants to deliver a message to the SPM, it writes to a certain
  location in its outbox, which is the SPM inbox.
- In this scenario, where the VM runs on the SPM, the vdsm machine has 2 roles,
  one as SPM and one as HSM (this is our code). A race might occur when one
  thread writes a message (as HSM) to extend the LV: the 'dd' command is
  initiated and the thread goes to sleep. During that time, another thread
  reads (as SPM) its inbox, and because 'dd' hasn't finished, there is a
  checksum error.
- Afterwards, another thread deletes the mailbox, which ruins the whole
  thing.
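
The suspected race can be illustrated with a minimal Python sketch. It is not vdsm code: the slot size, the CRC32 checksum, and the names `pack_message` and `is_valid` are all assumptions made for illustration. It only shows why a reader that polls in the middle of a write sees a checksum mismatch, even though both the old and the new message are valid on their own.

```python
import zlib

BODY_SIZE = 64  # hypothetical slot size, not vdsm's real mailbox layout


def pack_message(payload: bytes) -> bytes:
    """Pad the payload to a fixed size and append a 4-byte CRC32 (stand-in checksum)."""
    body = payload.ljust(BODY_SIZE, b"\0")
    return body + zlib.crc32(body).to_bytes(4, "little")


def is_valid(slot: bytes) -> bool:
    """Recompute the checksum over the body and compare it with the stored one."""
    body, stored = slot[:BODY_SIZE], slot[BODY_SIZE:]
    return zlib.crc32(body).to_bytes(4, "little") == stored


# The shared slot starts out holding an older, valid message.
slot = bytearray(pack_message(b"extend request #28"))
new = pack_message(b"extend request #29")

# Writer (HSM role) is interrupted halfway through writing the new message.
half = len(new) // 2
slot[:half] = new[:half]

# Reader (SPM role) polls now: new body bytes, but still the old checksum.
torn_read_ok = is_valid(bytes(slot))

# Writer resumes and finishes; the same slot now validates.
slot[half:] = new[half:]
complete_read_ok = is_valid(bytes(slot))

print(torn_read_ok, complete_read_ok)
```

In this sketch the mismatch is transient: nothing is wrong with the message itself, so a reader that reacts to the failure by deleting the mailbox destroys a request that was about to become valid.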

The above issue has reproduced twice so far, and happens specifically on my setup, using FCP storage.

repro steps:

1) create a VM with an OS installed (or a live CD).
2) add a new thinly provisioned disk (100G).
3) force the LV to extend by writing to the disk with the 'dd' command.
4) the failure happens after approximately 30G (29 extends).

Comment 1 Haim 2011-02-08 09:44:13 UTC
Verified on vdsm-4.9-47.el6.x86_64.

Managed to repro and saw the error message, but the VM continued to extend.


Dummy-122::ERROR::2011-02-08 11:38:16,385::storage_mailbox::538::Storage.MailBox.SpmMailMonitor::(_validateMailbox) SPM_MailMonitor: mailbox 1 checksum failed, not clearing mailbox, clearing newMail.
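
The log message suggests that on a checksum failure the monitor now leaves the mailbox alone and discards only what it just read. The Python sketch below illustrates that idea under stated assumptions. It is not the vdsm implementation: the `Monitor` class, the slot layout, and the CRC32 checksum are hypothetical, and reading "newMail" as an in-memory copy of the slot is a guess based on the log text.

```python
import zlib

BODY_SIZE = 64  # hypothetical slot size, not vdsm's real mailbox layout


def pack(payload: bytes) -> bytes:
    """Pad the payload and append a 4-byte CRC32 (stand-in checksum)."""
    body = payload.ljust(BODY_SIZE, b"\0")
    return body + zlib.crc32(body).to_bytes(4, "little")


def valid(slot: bytes) -> bool:
    return zlib.crc32(slot[:BODY_SIZE]).to_bytes(4, "little") == slot[BODY_SIZE:]


class Monitor:
    """Hypothetical reader that polls a single mailbox slot."""

    def __init__(self, storage: bytearray):
        self.storage = storage   # stands in for the shared mailbox LV
        self.new_mail = None     # in-memory copy of the last read (assumed meaning)
        self.handled = []        # payloads accepted so far

    def poll(self) -> bool:
        self.new_mail = bytes(self.storage)
        if not valid(self.new_mail):
            # Checksum failed: drop only the in-memory copy and leave the
            # storage untouched, so an in-flight write can still complete.
            self.new_mail = None
            return False
        self.handled.append(self.new_mail[:BODY_SIZE].rstrip(b"\0"))
        return True


old = pack(b"extend request #28")
new = pack(b"extend request #29")

storage = bytearray(old)
storage[:34] = new[:34]            # writer interrupted mid-write
snapshot = bytes(storage)

monitor = Monitor(storage)
first = monitor.poll()             # torn read: rejected
untouched = bytes(storage) == snapshot

storage[34:] = new[34:]            # writer finishes
second = monitor.poll()            # next poll picks up the complete message
```

Under these assumptions the failed poll costs only a delay: the request is handled on the next pass, which matches the observation that the error is logged while the VM continues to extend.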