Bug 829731

Summary: [scalability] Mailbox messages are not cleaned right after reading them
Product: [Retired] oVirt
Component: vdsm
Status: CLOSED WONTFIX
Severity: high
Priority: medium
Version: 3.3
Reporter: Rami Vaknin <rvaknin>
Assignee: Federico Simoncelli <fsimonce>
QA Contact: Leonid Natapov <lnatapov>
Docs Contact:
CC: amureini, bazulay, bugs, fsimonce, gklein, iheim, jkt, lpeer, mgoldboi, rbalakri, scohen, yeylon
Target Milestone: ---
Target Release: 3.5.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: storage
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-12-01 13:21:35 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 882647
Attachments:
hsm logs (flags: none)
spm logs (flags: none)
## Logs rhevm, vdsm, libvirt, thread dump, superVdsm (flags: none)

Description Rami Vaknin 2012-06-07 12:54:31 UTC
Created attachment 590187 [details]
hsm logs

Version:
RHEVM SI4, vdsm-4.9.6-10.el6.x86_64

Scenario:
The environment contains hosts that run 300-500 VMs each.
The mailbox frequently fills up, so many lvextend requests are never read by the SPM.
The mailbox has 63 slots per host, and the SPM does not clear messages right after reading them, so when lvextend requests fail the mailbox becomes full.
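The failure mode described above can be sketched in a few lines. This is a hypothetical toy model (names, message format, and logic are illustrative, not actual vdsm code); only the 63-slots-per-host figure comes from the bug description:

```python
# Toy model of a per-host mailbox with a fixed number of slots, where the
# reader (SPM) does not free slots after reading. Since failed lvextend
# requests never vacate their slots, the mailbox eventually fills up even
# though every message has been read.

SLOTS_PER_HOST = 63  # slot count taken from the bug description


class Mailbox:
    def __init__(self):
        self.slots = [None] * SLOTS_PER_HOST

    def post(self, message):
        """HSM side: write a message into the first free slot."""
        for i, slot in enumerate(self.slots):
            if slot is None:
                self.slots[i] = message
                return i
        raise RuntimeError("mailbox full, cannot post message")

    def read_all(self, clear=False):
        """SPM side: read all pending messages.

        With clear=False (the buggy behavior) slots stay occupied after
        reading; with clear=True they are freed for reuse.
        """
        messages = [m for m in self.slots if m is not None]
        if clear:
            self.slots = [None] * SLOTS_PER_HOST
        return messages


mbox = Mailbox()
# Post and read 63 requests without clearing: every slot stays occupied.
for n in range(SLOTS_PER_HOST):
    mbox.post(("lvextend", n))
    mbox.read_all(clear=False)

# The 64th request cannot be posted, although all messages were read.
try:
    mbox.post(("lvextend", SLOTS_PER_HOST))
    overflowed = False
except RuntimeError:
    overflowed = True
print(overflowed)  # True
```

With clear=True the same workload runs indefinitely, which is the behavior the bug title asks for: clean the messages right after reading them.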

From the HSM:
Thread-215::DEBUG::2012-06-07 10:04:08,263::storage_mailbox::361::Storage.MailBox.HsmMailMonitor::(_handleMessage) HSM_MailMonitor - ignoring duplicate message <storage.storage_mailbox.SPM_Extend_Message instance at 0x7f033c949ef0>
Thread-215::ERROR::2012-06-07 10:04:08,266::storage_mailbox::444::Storage.MailBox.HsmMailMonitor::(run) HSM_MailboxMonitor - Incoming mailmonitoring thread caught exception; will try to recover
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 412, in run
    self._handleMessage(message)
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 364, in _handleMessage
    raise RuntimeError("HSM_MailMonitor - Active messages list full, cannot add new message")
RuntimeError: HSM_MailMonitor - Active messages list full, cannot add new message
Thread-215::DEBUG::2012-06-07 10:04:08,275::storage_mailbox::361::Storage.MailBox.HsmMailMonitor::(_handleMessage) HSM_MailMonitor - ignoring duplicate message <storage.storage_mailbox.SPM_Extend_Message instance at 0x2a74680>
Thread-215::DEBUG::2012-06-07 10:04:08,277::storage_mailbox::361::Storage.MailBox.HsmMailMonitor::(_handleMessage) HSM_MailMonitor - ignoring duplicate message <storage.storage_mailbox.SPM_Extend_Message instance at 0x7f03408585a8>
Thread-215::ERROR::2012-06-07 10:04:08,280::storage_mailbox::444::Storage.MailBox.HsmMailMonitor::(run) HSM_MailboxMonitor - Incoming mailmonitoring thread caught exception; will try to recover
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 412, in run
    self._handleMessage(message)
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 364, in _handleMessage
    raise RuntimeError("HSM_MailMonitor - Active messages list full, cannot add new message")
RuntimeError: HSM_MailMonitor - Active messages list full, cannot add new message
Thread-215::DEBUG::2012-06-07 10:04:08,282::storage_mailbox::361::Storage.MailBox.HsmMailMonitor::(_handleMessage) HSM_MailMonitor - ignoring duplicate message <storage.storage_mailbox.SPM_Extend_Message instance at 0x7f0384f1a878>
Thread-215::ERROR::2012-06-07 10:04:08,285::storage_mailbox::444::Storage.MailBox.HsmMailMonitor::(run) HSM_MailboxMonitor - Incoming mailmonitoring thread caught exception; will try to recover
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 412, in run
    self._handleMessage(message)
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 364, in _handleMessage
    raise RuntimeError("HSM_MailMonitor - Active messages list full, cannot add new message")
RuntimeError: HSM_MailMonitor - Active messages list full, cannot add new message
Thread-215::DEBUG::2012-06-07 10:04:08,288::storage_mailbox::361::Storage.MailBox.HsmMailMonitor::(_handleMessage) HSM_MailMonitor - ignoring duplicate message <storage.storage_mailbox.SPM_Extend_Message instance at 0x7f0348e23368>
Thread-215::ERROR::2012-06-07 10:04:08,290::storage_mailbox::444::Storage.MailBox.HsmMailMonitor::(run) HSM_MailboxMonitor - Incoming mailmonitoring thread caught exception; will try to recover
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 412, in run
    self._handleMessage(message)
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 364, in _handleMessage
    raise RuntimeError("HSM_MailMonitor - Active messages list full, cannot add new message")
RuntimeError: HSM_MailMonitor - Active messages list full, cannot add new message
Thread-215::DEBUG::2012-06-07 10:04:08,293::storage_mailbox::361::Storage.MailBox.HsmMailMonitor::(_handleMessage) HSM_MailMonitor - ignoring duplicate message <storage.storage_mailbox.SPM_Extend_Message instance at 0x2980b90>
Thread-215::DEBUG::2012-06-07 10:04:08,295::storage_mailbox::361::Storage.MailBox.HsmMailMonitor::(_handleMessage) HSM_MailMonitor - ignoring duplicate message <storage.storage_mailbox.SPM_Extend_Message instance at 0x7f03407c3f80>
Thread-215::DEBUG::2012-06-07 10:04:08,302::storage_mailbox::361::Storage.MailBox.HsmMailMonitor::(_handleMessage) HSM_MailMonitor - ignoring duplicate message <storage.storage_mailbox.SPM_Extend_Message instance at 0x7f02ecd80cb0>
Thread-215::DEBUG::2012-06-07 10:04:08,305::storage_mailbox::361::Storage.MailBox.HsmMailMonitor::(_handleMessage) HSM_MailMonitor - ignoring duplicate message <storage.storage_mailbox.SPM_Extend_Message instance at 0x22607a0>
Thread-215::DEBUG::2012-06-07 10:04:08,310::storage_mailbox::361::Storage.MailBox.HsmMailMonitor::(_handleMessage) HSM_MailMonitor - ignoring duplicate message <storage.storage_mailbox.SPM_Extend_Message instance at 0x1c8c290>
Thread-215::ERROR::2012-06-07 10:04:08,315::storage_mailbox::444::Storage.MailBox.HsmMailMonitor::(run) HSM_MailboxMonitor - Incoming mailmonitoring thread caught exception; will try to recover
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 412, in run
    self._handleMessage(message)
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 364, in _handleMessage
    raise RuntimeError("HSM_MailMonitor - Active messages list full, cannot add new message")
RuntimeError: HSM_MailMonitor - Active messages list full, cannot add new message

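The interleaved DEBUG and ERROR lines above come from the same handling loop: duplicate extend requests are dropped, while any genuinely new request hits the full list and raises. A toy model of that behavior (hypothetical names and an illustrative cap, not vdsm's actual implementation):

```python
MAX_ACTIVE_MESSAGES = 10  # illustrative; not the real vdsm limit


class HsmMailMonitor:
    """Models the two outcomes visible in the log: duplicates are ignored,
    and a new message fails once the bounded active list is full."""

    def __init__(self):
        self.active = []

    def handle_message(self, message):
        if message in self.active:
            # corresponds to the "ignoring duplicate message" DEBUG lines
            return "ignored duplicate"
        if len(self.active) >= MAX_ACTIVE_MESSAGES:
            # corresponds to the RuntimeError in the tracebacks
            raise RuntimeError("HSM_MailMonitor - Active messages list full, "
                               "cannot add new message")
        self.active.append(message)
        return "added"


monitor = HsmMailMonitor()
for n in range(MAX_ACTIVE_MESSAGES):
    monitor.handle_message(("extend", n))   # fills the active list
monitor.handle_message(("extend", 0))       # duplicate: silently ignored
try:
    monitor.handle_message(("extend", 999)) # new message: list is full
except RuntimeError as e:
    print(e)
```

Because retried requests count as duplicates, the list never drains once failed extends occupy it, matching the repeated recover/fail cycle in the log.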
Comment 1 Rami Vaknin 2012-06-07 12:55:43 UTC
Created attachment 590188 [details]
spm logs

Comment 3 Eduardo Warszawski 2012-07-09 17:18:36 UTC
http://gerrit.ovirt.org/#/c/6083/

Comment 4 Eduardo Warszawski 2012-07-09 17:19:41 UTC
Comment #3 is a mistake.

Comment 5 RHEL Program Management 2012-12-14 07:19:01 UTC
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.

Comment 9 Eduardo Warszawski 2013-02-17 09:07:08 UTC
As stated in comment #4, the change in comment #3 is a mistake and it is not related at all to the present BZ.
Therefore we should not change BZ status based on this.

Comment 10 vvyazmin@redhat.com 2013-08-19 20:51:31 UTC
While adding 150 hosts to a Data Center, I get the same errors:


Thread-31019::ERROR::2013-08-19 21:41:41,627::storage_mailbox::503::Storage.MailBox.HsmMailMonitor::(run) HSM_MailboxMonitor - Incoming mailmonitoring thread caught exception; will try to recover
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 468, in run
    self._handleMessage(message)
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 408, in _handleMessage
    raise RuntimeError("HSM_MailMonitor - Active messages list full, "
RuntimeError: HSM_MailMonitor - Active messages list full, cannot add new message


RHEVM 3.3 - IS10 environment:

RHEVM:  rhevm-3.3.0-0.15.master.el6ev.noarch
PythonSDK:  rhevm-sdk-python-3.3.0.10-1.el6ev.noarch
VDSM:  vdsm-4.12.0-61.git8178ec2.el6ev.x86_64
LIBVIRT:  libvirt-0.10.2-18.el6_4.9.x86_64
QEMU & KVM:  qemu-kvm-rhev-0.12.1.2-2.355.el6_4.5.x86_64
SANLOCK:  sanlock-2.8-1.el6.x86_64

Comment 11 vvyazmin@redhat.com 2013-08-19 20:55:46 UTC
Created attachment 788200 [details]
## Logs rhevm, vdsm, libvirt, thread dump, superVdsm

Comment 14 Allon Mureinik 2014-06-24 09:05:11 UTC
Fede, is this still interesting?

Comment 15 Allon Mureinik 2014-08-27 16:58:21 UTC
(In reply to Allon Mureinik from comment #14)
> Fede, is this still interesting?
Fede, your input please?

Comment 16 Federico Simoncelli 2014-08-29 18:33:21 UTC
(In reply to Allon Mureinik from comment #15)
> (In reply to Allon Mureinik from comment #14)
> > Fede, is this still interesting?
> Fede, your input please?

We could try to reproduce the issue, but there are too many variables here: the number of VMs, how fast the storage is, how many hosts are accessing it, how many disks are thinly provisioned, etc.
In the end the only feedback we'll have is about the specific environment we tested.

We know that this is a limit of the technology we're using (like other such limits, e.g. the maximum number of LVs in a domain) and we'll try to resolve it going forward (removing the SPM, a metadata lock per domain, etc.).

I think we need the feedback from a pm (needinfo on Sean) to understand how much this is a pressing issue. If it's critical enough we can look and check if there's anything that we can do right now.

Comment 17 Allon Mureinik 2014-08-31 07:39:57 UTC
(In reply to Federico Simoncelli from comment #16)
> I think we need the feedback from a pm (needinfo on Sean) to understand how
> much this is a pressing issue. If it's critical enough we can look and check
> if there's anything that we can do right now.
This was reported by QA against 3.1 and was never escalated from the field - it's not pressing.
Pushing out to 3.6 to be reexamined after the SPM is removed.

Comment 18 Allon Mureinik 2014-12-01 13:21:35 UTC
(In reply to Allon Mureinik from comment #17)
> (In reply to Federico Simoncelli from comment #16)
> > I think we need the feedback from a pm (needinfo on Sean) to understand how
> > much this is a pressing issue. If it's critical enough we can look and check
> > if there's anything that we can do right now.
> This was reported by QA against 3.1 and was never escalated from the field -
> it's not pressing.
> Pushing out to 3.6 to be reexamined after the SPM is removed.

Closing old bugs.
Since this was never encountered, and 3.6.0 will change the mechanism anyway, I don't think this is interesting.
If anyone disagrees, feel free to explain and reopen.

Comment 19 Red Hat Bugzilla 2023-09-14 01:29:45 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days