| Summary: | [vdsm] [storage] mailbox checksum errors running lvextend - extend fails and vm pause | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Haim <hateya> | ||||
| Component: | vdsm | Assignee: | Eduardo Warszawski <ewarszaw> | ||||
| Status: | CLOSED CURRENTRELEASE | QA Contact: | yeylon <yeylon> | ||||
| Severity: | high | Docs Contact: | |||||
| Priority: | unspecified | ||||||
| Version: | 6.1 | CC: | abaron, bazulay, danken, dnaori, ewarszaw, iheim, ilvovsky, mgoldboi, smizrahi, srevivo, yeylon | ||||
| Target Milestone: | rc | ||||||
| Target Release: | --- | ||||||
| Hardware: | x86_64 | ||||||
| OS: | All | ||||||
| Whiteboard: | Storage | ||||||
| Fixed In Version: | vdsm-4.9-47.el6 | Doc Type: | Bug Fix | ||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2011-08-19 15:24:02 UTC | Type: | --- | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Attachments: |
Verified on vdsm-4.9-47.el6.x86_64. Managed to reproduce: the error message below is still logged, but the VM continues to extend.

Dummy-122::ERROR::2011-02-08 11:38:16,385::storage_mailbox::538::Storage.MailBox.SpmMailMonitor::(_validateMailbox) SPM_MailMonitor: mailbox 1 checksum failed, not clearing mailbox, clearing newMail.
Created attachment 475612: SPM log.

Description of problem:
When running a VM on the SPM host that extends its LV rapidly (requests to extend a logical volume, using FCP storage), I get checksum errors on the mailbox LV, and the VM pauses.

The theory is that we have a race. The SPM and HSM exchange messages as follows:
- When the HSM wants to deliver a message to the SPM, it writes to a certain location in its outbox, which is the SPM's inbox.
- In this scenario, where the VM runs on the SPM, the vdsm machine has two roles, one as SPM and one as HSM (this is our code). A race might occur when one thread writes a message (as HSM) to extend the LV: the 'dd' command is initiated and the thread goes to sleep. During that time, another thread reads its inbox (as SPM), and because 'dd' has not finished, there is a checksum error.
- Afterwards, another thread deletes the mailbox, which ruins the whole thing.

The above issue has reproduced twice so far, and happens specifically on my setup, using FCP storage.

Repro steps:
1) Create a VM with an OS installed (or a live CD).
2) Add a new thinly provisioned disk (100G).
3) Extend the LV by using the 'dd' command.
4) The failure happens after approximately 30G (29 extends).
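The suspected race can be illustrated with a small, deterministic sketch. It is not vdsm's code: the slot layout, the sizes, the sum-of-bytes checksum, and the names `pack_message`, `is_valid` and `simulate_torn_write` are all assumptions made for illustration. It only shows why a reader that validates a slot while the writer is halfway through sees a checksum failure, even though the slot is valid both before and after the write.

```python
import struct

# Hypothetical slot layout (NOT vdsm's real format): a fixed-size payload
# followed by a 4-byte little-endian checksum of that payload.
PAYLOAD_SIZE = 60
SLOT_SIZE = PAYLOAD_SIZE + 4


def checksum(body: bytes) -> int:
    """Assumed checksum: sum of the payload bytes, truncated to 32 bits."""
    return sum(body) & 0xFFFFFFFF


def pack_message(payload: bytes) -> bytes:
    """Pad the payload to the slot size and append its checksum."""
    body = payload.ljust(PAYLOAD_SIZE, b"\0")
    return body + struct.pack("<I", checksum(body))


def is_valid(slot: bytes) -> bool:
    """Reader-side validation: recompute the checksum and compare."""
    body, stored = slot[:PAYLOAD_SIZE], slot[PAYLOAD_SIZE:]
    return struct.unpack("<I", stored)[0] == checksum(body)


def simulate_torn_write():
    """Validate the slot before, in the middle of, and after a write."""
    slot = bytearray(pack_message(b""))  # empty slot, consistent checksum
    new = pack_message(b"extend request for volume 0123456789abcdef")

    before = is_valid(bytes(slot))

    # The writer has copied only the first half of the slot and is then
    # descheduled; the new checksum at the end has not been written yet.
    half = SLOT_SIZE // 2
    slot[:half] = new[:half]
    during = is_valid(bytes(slot))

    # The writer resumes and finishes the copy.
    slot[half:] = new[half:]
    after = is_valid(bytes(slot))

    return before, during, after


if __name__ == "__main__":
    print(simulate_torn_write())
```

In this sketch the failure in the middle is transient: a reader that rechecks after the write completes sees a valid slot. That is consistent with the theory above that the damage comes from acting on the partial read, for example by clearing the mailbox.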