Bug 829731 - [scalability] Mailbox messages are not cleaned right after reading them [NEEDINFO]
Status: CLOSED WONTFIX
Product: oVirt
Classification: Community
Component: vdsm
Version: 3.3
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 3.5.0
Assigned To: Federico Simoncelli
QA Contact: Leonid Natapov
Whiteboard: storage
Depends On:
Blocks: rhev_scalability
Reported: 2012-06-07 08:54 EDT by Rami Vaknin
Modified: 2016-02-10 11:39 EST

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-12-01 08:21:35 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
fsimonce: needinfo? (scohen)


Attachments
hsm logs (5.47 MB, application/x-compressed-tar)
2012-06-07 08:54 EDT, Rami Vaknin
spm logs (12.24 MB, application/x-compressed-tar)
2012-06-07 08:55 EDT, Rami Vaknin
## Logs rhevm, vdsm, libvirt, thread dump, superVdsm (7.78 MB, application/x-gzip)
2013-08-19 16:55 EDT, vvyazmin@redhat.com

Description Rami Vaknin 2012-06-07 08:54:31 EDT
Created attachment 590187 [details]
hsm logs

Version:
RHEVM SI4, vdsm-4.9.6-10.el6.x86_64

Scenario:
The environment contains hosts that each run ~300-500 VMs.
The mailbox fills up frequently, so many lvextend requests are never read by the SPM.
The mailbox has 63 slots per host, and the SPM does not clear messages right after it reads them, so when lvextends fail the mailbox becomes full.
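
To make the failure mode concrete, here is a toy model of a fixed-size mailbox. It is only a sketch and not the vdsm code: the class `ToyMailbox` and the helper `fill_until_full` are hypothetical names, and the only values taken from this report are the 63-slot limit and the wording of the error. It shows that when slots are not freed after a message is read, the 64th distinct request overflows.

```python
# Illustrative model only; NOT the vdsm implementation.
SLOTS_PER_HOST = 63  # slot count quoted in this report


class ToyMailbox:
    """Hypothetical fixed-size list of active messages for one host."""

    def __init__(self, slots=SLOTS_PER_HOST):
        self.slots = slots
        self.active = []

    def add(self, message):
        # Duplicates are skipped, loosely mirroring the
        # "ignoring duplicate message" lines in the HSM log.
        if message in self.active:
            return False
        if len(self.active) >= self.slots:
            raise RuntimeError(
                "Active messages list full, cannot add new message")
        self.active.append(message)
        return True

    def clear(self, message):
        # Frees the slot held by a message that has been handled.
        self.active.remove(message)


def fill_until_full(mailbox, requests, clear_after_read):
    """Return how many requests were accepted before an overflow."""
    accepted = 0
    for req in requests:
        try:
            mailbox.add(req)
        except RuntimeError:
            break
        accepted += 1
        if clear_after_read:
            mailbox.clear(req)
    return accepted


if __name__ == "__main__":
    # Slots never freed: the mailbox overflows after 63 requests.
    print(fill_until_full(ToyMailbox(), range(100), clear_after_read=False))
    # Slots freed right after reading: all 100 requests fit.
    print(fill_until_full(ToyMailbox(), range(100), clear_after_read=True))
```

In this model, clearing each message right after it is read keeps the list from ever filling; whether that is safe to do in the real SPM/HSM protocol is a separate question that the sketch does not answer.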

From the HSM:
Thread-215::DEBUG::2012-06-07 10:04:08,263::storage_mailbox::361::Storage.MailBox.HsmMailMonitor::(_handleMessage) HSM_MailMonitor - ignoring duplicate message <storage.storage_mailbox.SPM_Extend_Message instance at 0x7f033c949ef0>
Thread-215::ERROR::2012-06-07 10:04:08,266::storage_mailbox::444::Storage.MailBox.HsmMailMonitor::(run) HSM_MailboxMonitor - Incoming mailmonitoring thread caught exception; will try to recover
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 412, in run
    self._handleMessage(message)
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 364, in _handleMessage
    raise RuntimeError("HSM_MailMonitor - Active messages list full, cannot add new message")
RuntimeError: HSM_MailMonitor - Active messages list full, cannot add new message
Thread-215::DEBUG::2012-06-07 10:04:08,275::storage_mailbox::361::Storage.MailBox.HsmMailMonitor::(_handleMessage) HSM_MailMonitor - ignoring duplicate message <storage.storage_mailbox.SPM_Extend_Message instance at 0x2a74680>
Thread-215::DEBUG::2012-06-07 10:04:08,277::storage_mailbox::361::Storage.MailBox.HsmMailMonitor::(_handleMessage) HSM_MailMonitor - ignoring duplicate message <storage.storage_mailbox.SPM_Extend_Message instance at 0x7f03408585a8>
Thread-215::ERROR::2012-06-07 10:04:08,280::storage_mailbox::444::Storage.MailBox.HsmMailMonitor::(run) HSM_MailboxMonitor - Incoming mailmonitoring thread caught exception; will try to recover
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 412, in run
    self._handleMessage(message)
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 364, in _handleMessage
    raise RuntimeError("HSM_MailMonitor - Active messages list full, cannot add new message")
RuntimeError: HSM_MailMonitor - Active messages list full, cannot add new message
Thread-215::DEBUG::2012-06-07 10:04:08,282::storage_mailbox::361::Storage.MailBox.HsmMailMonitor::(_handleMessage) HSM_MailMonitor - ignoring duplicate message <storage.storage_mailbox.SPM_Extend_Message instance at 0x7f0384f1a878>
Thread-215::ERROR::2012-06-07 10:04:08,285::storage_mailbox::444::Storage.MailBox.HsmMailMonitor::(run) HSM_MailboxMonitor - Incoming mailmonitoring thread caught exception; will try to recover
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 412, in run
    self._handleMessage(message)
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 364, in _handleMessage
    raise RuntimeError("HSM_MailMonitor - Active messages list full, cannot add new message")
RuntimeError: HSM_MailMonitor - Active messages list full, cannot add new message
Thread-215::DEBUG::2012-06-07 10:04:08,288::storage_mailbox::361::Storage.MailBox.HsmMailMonitor::(_handleMessage) HSM_MailMonitor - ignoring duplicate message <storage.storage_mailbox.SPM_Extend_Message instance at 0x7f0348e23368>
Thread-215::ERROR::2012-06-07 10:04:08,290::storage_mailbox::444::Storage.MailBox.HsmMailMonitor::(run) HSM_MailboxMonitor - Incoming mailmonitoring thread caught exception; will try to recover
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 412, in run
    self._handleMessage(message)
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 364, in _handleMessage
    raise RuntimeError("HSM_MailMonitor - Active messages list full, cannot add new message")
RuntimeError: HSM_MailMonitor - Active messages list full, cannot add new message
Thread-215::DEBUG::2012-06-07 10:04:08,293::storage_mailbox::361::Storage.MailBox.HsmMailMonitor::(_handleMessage) HSM_MailMonitor - ignoring duplicate message <storage.storage_mailbox.SPM_Extend_Message instance at 0x2980b90>
Thread-215::DEBUG::2012-06-07 10:04:08,295::storage_mailbox::361::Storage.MailBox.HsmMailMonitor::(_handleMessage) HSM_MailMonitor - ignoring duplicate message <storage.storage_mailbox.SPM_Extend_Message instance at 0x7f03407c3f80>
Thread-215::DEBUG::2012-06-07 10:04:08,302::storage_mailbox::361::Storage.MailBox.HsmMailMonitor::(_handleMessage) HSM_MailMonitor - ignoring duplicate message <storage.storage_mailbox.SPM_Extend_Message instance at 0x7f02ecd80cb0>
Thread-215::DEBUG::2012-06-07 10:04:08,305::storage_mailbox::361::Storage.MailBox.HsmMailMonitor::(_handleMessage) HSM_MailMonitor - ignoring duplicate message <storage.storage_mailbox.SPM_Extend_Message instance at 0x22607a0>
Thread-215::DEBUG::2012-06-07 10:04:08,310::storage_mailbox::361::Storage.MailBox.HsmMailMonitor::(_handleMessage) HSM_MailMonitor - ignoring duplicate message <storage.storage_mailbox.SPM_Extend_Message instance at 0x1c8c290>
Thread-215::ERROR::2012-06-07 10:04:08,315::storage_mailbox::444::Storage.MailBox.HsmMailMonitor::(run) HSM_MailboxMonitor - Incoming mailmonitoring thread caught exception; will try to recover
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 412, in run
    self._handleMessage(message)
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 364, in _handleMessage
    raise RuntimeError("HSM_MailMonitor - Active messages list full, cannot add new message")
Comment 1 Rami Vaknin 2012-06-07 08:55:43 EDT
Created attachment 590188 [details]
spm logs
Comment 3 Eduardo Warszawski 2012-07-09 13:18:36 EDT
http://gerrit.ovirt.org/#/c/6083/
Comment 4 Eduardo Warszawski 2012-07-09 13:19:41 EDT
Comment #3 is a mistake.
Comment 5 RHEL Product and Program Management 2012-12-14 02:19:01 EST
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.
Comment 9 Eduardo Warszawski 2013-02-17 04:07:08 EST
As stated in comment #4, the change in comment #3 is a mistake and it is not related at all to the present BZ.
Therefore we should not change BZ status based on this.
Comment 10 vvyazmin@redhat.com 2013-08-19 16:51:31 EDT
While adding 150 hosts to the Data Center, we get the same errors:


Thread-31019::ERROR::2013-08-19 21:41:41,627::storage_mailbox::503::Storage.MailBox.HsmMailMonitor::(run) HSM_MailboxMonitor - Incoming mailmonitoring thread caught exception; will try to recover
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 468, in run
    self._handleMessage(message)
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 408, in _handleMessage
    raise RuntimeError("HSM_MailMonitor - Active messages list full, "
RuntimeError: HSM_MailMonitor - Active messages list full, cannot add new message
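
For anyone who wants to check how often this occurs in their own logs, a small sketch follows. It assumes the `::`-separated field layout seen in the lines quoted in this bug (thread, level, timestamp, module, line number, logger, message); the function name `summarize_mailbox_log` is made up for illustration and is not part of vdsm.

```python
from collections import Counter

FULL_MARKER = "Active messages list full"


def summarize_mailbox_log(lines):
    """Count log records per level and 'list full' RuntimeError lines.

    Assumes records look like:
    thread::LEVEL::timestamp::module::lineno::logger::(func) message
    """
    levels = Counter()
    full_errors = 0
    for line in lines:
        parts = line.split("::", 6)
        if len(parts) == 7 and parts[0].startswith("Thread-"):
            # parts[1] is the log level (DEBUG, ERROR, ...).
            levels[parts[1]] += 1
        elif line.startswith("RuntimeError:") and FULL_MARKER in line:
            # Final line of the traceback quoted in this report.
            full_errors += 1
    return levels, full_errors


if __name__ == "__main__":
    import sys

    # Usage: python summarize.py < vdsm.log
    levels, full_errors = summarize_mailbox_log(sys.stdin)
    print(dict(levels), full_errors)
```

Traceback lines contain no `::` separator, so they are skipped by the record check and only the closing `RuntimeError:` line is counted.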


RHEVM 3.3 - IS10 environment:

RHEVM:  rhevm-3.3.0-0.15.master.el6ev.noarch
PythonSDK:  rhevm-sdk-python-3.3.0.10-1.el6ev.noarch
VDSM:  vdsm-4.12.0-61.git8178ec2.el6ev.x86_64
LIBVIRT:  libvirt-0.10.2-18.el6_4.9.x86_64
QEMU & KVM:  qemu-kvm-rhev-0.12.1.2-2.355.el6_4.5.x86_64
SANLOCK:  sanlock-2.8-1.el6.x86_64
Comment 11 vvyazmin@redhat.com 2013-08-19 16:55:46 EDT
Created attachment 788200 [details]
## Logs rhevm, vdsm, libvirt, thread dump, superVdsm
Comment 14 Allon Mureinik 2014-06-24 05:05:11 EDT
Fede, is this still interesting?
Comment 15 Allon Mureinik 2014-08-27 12:58:21 EDT
(In reply to Allon Mureinik from comment #14)
> Fede, is this still interesting?
Fede, your input please?
Comment 16 Federico Simoncelli 2014-08-29 14:33:21 EDT
(In reply to Allon Mureinik from comment #15)
> (In reply to Allon Mureinik from comment #14)
> > Fede, is this still interesting?
> Fede, your input please?

We could try to reproduce the issue but there are too many variables here: the number of VMs, how fast the storage is, how many hosts are accessing it, how many disks are thinly provisioned, etc.
In the end the only feedback we'll have is about the environment we tested.

We know that this is a limit of the technology we're using (like others, e.g. the max number of LVs in a domain) and we'll try to resolve it going forward (removing the SPM, a metadata lock per domain, etc.).

I think we need feedback from a PM (needinfo on Sean) to understand how pressing this issue is. If it's critical enough we can check whether there's anything we can do right now.
Comment 17 Allon Mureinik 2014-08-31 03:39:57 EDT
(In reply to Federico Simoncelli from comment #16)
> I think we need the feedback from a pm (needinfo on Sean) to understand how
> much this is a pressing issue. If it's critical enough we can look and check
> if there's anything that we can do right now.
This was reported by QA against 3.1 and was never escalated from the field - it's not pressing.
Pushing out to 3.6 to be reexamined after the SPM is removed.
Comment 18 Allon Mureinik 2014-12-01 08:21:35 EST
(In reply to Allon Mureinik from comment #17)
> (In reply to Federico Simoncelli from comment #16)
> > I think we need the feedback from a pm (needinfo on Sean) to understand how
> > much this is a pressing issue. If it's critical enough we can look and check
> > if there's anything that we can do right now.
> This was reported by QA against 3.1 and was never escalated from the field -
> it's not pressing.
> Pushing out to 3.6 to be reexamined after the SPM is removed.

Closing old bugs.
Since this was never encountered, and 3.6.0 will change the mechanism anyway, I don't think this is interesting.
If anyone disagrees, feel free to explain and reopen.
