Bug 854018 - Warn if ksm (Memory Sharing) cannot be enabled on a host in a cluster with memory overcommit set to > 100 %
Summary: Warn if ksm (Memory Sharing) cannot be enabled on a host in a cluster with memory overcommit set to > 100 %
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine-webadmin-portal
Version: 3.1.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Doron Fediuck
QA Contact: Pavel Stehlik
URL:
Whiteboard: sla
Depends On:
Blocks:
 
Reported: 2012-09-03 15:04 UTC by David Jaša
Modified: 2016-02-10 20:19 UTC (History)
CC List: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-03-11 21:50:50 UTC
oVirt Team: SLA
Target Upstream Version:
Embargoed:



Description David Jaša 2012-09-03 15:04:28 UTC
Description of problem:
Warn if ksm (Memory Sharing) cannot be enabled on a host in a cluster with memory overcommit set to > 100 %.

Version-Release number of selected component (if applicable):
RHEV 3.0.5

How reproducible:
always

Steps to Reproduce:
1. Disable the ksmtuned service on the host and restart vdsm (a check that KSM is really off is sketched after the steps):
service ksmtuned stop
chkconfig --del ksmtuned
service vdsmd restart
2. Put the host under memory load (over 80 % of RAM in use)
3.
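
(To confirm that KSM is really off after step 1, the generic kernel sysfs interface can be checked; this is a plain kernel-level check, nothing RHEV-specific:)
cat /sys/kernel/mm/ksm/run            # 1 = ksm thread running, 0 = stopped
cat /sys/kernel/mm/ksm/pages_sharing  # stays at 0 while no pages are being merged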
  
Actual results:
Memory is not shared, but the admin is not notified about this condition.

Expected results:
A warning is raised in the UI, similar to the one about missing fencing.

Additional info:
  * proposing as z-stream as this might silently harm our current customers
  * tested on 3.0.5; more recent vdsm may already handle the "missing ksmtuned" condition
    better. In that case, you could probably reproduce by disabling ksm in the kernel directly (a rough sketch follows below)
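
    (Rough sketch of that alternative; this assumes the usual kernel knob under
    /sys/kernel/mm/ksm rather than an actual sysctl, and is untested here:)
    echo 0 > /sys/kernel/mm/ksm/run   # stop the ksm thread; echo 2 would also unmerge already-shared pages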

Comment 3 David Jaša 2012-09-03 15:57:09 UTC
Few clarifications:

1) messing up with the system (deliberate stop & disable of ksmtuned on a host) is there just to make sure that conditions of the test machine match those of my _production_ machine where I spotted the bug

2) see also bug 854027 that would be discovered much sooner (and probably with cases attached to it) had this one been in 3.0 from the start

3) given that KSM Just Worked in 2.2, this could be actually considered a 2.2 -> 3.0 regression

Comment 4 Doron Fediuck 2012-09-05 11:39:33 UTC
(In reply to comment #3)
> Few clarifications:
> 
> 1) messing up with the system (deliberate stop & disable of ksmtuned on a
> host) is there just to make sure that conditions of the test machine match
> those of my _production_ machine where I spotted the bug
> 
> 2) see also bug 854027 that would be discovered much sooner (and probably
> with cases attached to it) had this one been in 3.0 from the start
> 
> 3) given that KSM Just Worked in 2.2, this could be actually considered a
> 2.2 -> 3.0 regression

David,
By default, vdsm is installed by the bootstrap, and pulls-in qemu-kvm.
Current defaults of both vdsm and qemu-kvm were set for it to work. So by
default vdsm will start ksm and ksmtuned, and the ksm will work as expected.
IIUC, this BZ requires a deliberate change of the working defaults. 
So this is basically unsupported. Think of a similar scenario: have a working
host. Now SSH into the host and block port 54321 with iptables.
So in any case, if a user changes the default the application sets, he should
handle it as he sees fit, and we shouldn't educate him, nor should we try and
handle unknowns.
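
(For completeness, the analogous breakage would be something like the following
on the host, 54321 being the vdsm management port:)
iptables -I INPUT -p tcp --dport 54321 -j DROP   # silently cuts off engine <-> vdsm traffic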

Comment 5 David Jaša 2012-09-05 12:30:51 UTC
(In reply to comment #4)
> (In reply to comment #3)
> > Few clarifications:
> > 
> > 1) messing up with the system (deliberate stop & disable of ksmtuned on a
> > host) is there just to make sure that conditions of the test machine match
> > those of my _production_ machine where I spotted the bug
> > 
> > 2) see also bug 854027 that would be discovered much sooner (and probably
> > with cases attached to it) had this one been in 3.0 from the start
> > 
> > 3) given that KSM Just Worked in 2.2, this could be actually considered a
> > 2.2 -> 3.0 regression
> 
> David,
> By default, vdsm is installed by the bootstrap, and pulls-in qemu-kvm.
> Current defaults of both vdsm and qemu-kvm were set for it to work. So by
> default vdsm will start ksm and ksmtuned, and the ksm will work as expected.
> IIUC, this BZ requires a deliberate change of the working defaults. 

Well, maybe I wasn't clear yet in comment 2 and comment 3: nobody did touch "ksm" and "ksmtuned" services since host install! So neither bootstrap, nor vdsm turned them on and RHEV-M didn't beep out a thing after 150/200 % memory overcommit was set up on the cluster.

> So this is basically unsupported. Think of a similar scenario: have a working
> host. Now SSH into the host and block port 54321 with iptables.

If RHEV-M continues reporting the host as "Up" after a timeout, it will be a similar sort of bug to this one.

> So in any case, if a user changes the default the application sets, he should
> handle it as he sees fit, and we shouldn't educate him, nor should we try and
> handle unknowns.

OK, let's assume a deliberate reconfiguration scenario: the admin configures 200 % memory overcommit _and_ disables the ksm* services on the hosts. Shouldn't RHEV-M warn him that overcommit won't work?

Comment 6 Doron Fediuck 2012-09-05 15:24:13 UTC
(In reply to comment #5)
> 
> Well, maybe I wasn't clear yet in comment 2 and comment 3: nobody did touch
> "ksm" and "ksmtuned" services since host install! So neither bootstrap, nor
> vdsm turned them on and RHEV-M didn't beep out a thing after 150/200 %
> memory overcommit was set up on the cluster.
> 
In this case this would be a qemu-kvm RPM issue, as ksm* belongs there and
is installed in this context. Can you please verify this is the case?

> If RHEV-M continues reporting the host as "Up" after a timeout, it will be
> a similar sort of bug to this one.
> 
Not relevant if this wasn't done manually.

> > So in any case, if a user changes the default the application sets, he should
> > handle it as he sees fit, and we shouldn't educate him, nor should we try and
> > handle unknowns.
> 
> OK, let's assume a deliberate reconfiguration scenario: the admin configures 200 %
> memory overcommit _and_ disables the ksm* services on the hosts. Shouldn't RHEV-M
> warn him that overcommit won't work?
No, since the admin may decide to do that even after setting this configuration.

Comment 7 David Jaša 2012-09-05 15:45:04 UTC
(In reply to comment #6)
> (In reply to comment #5)
> > 
> > Well, maybe I wasn't clear yet in comment 2 and comment 3: nobody did touch
> > "ksm" and "ksmtuned" services since host install! So neither bootstrap, nor
> > vdsm turned them on and RHEV-M didn't beep out a thing after 150/200 %
> > memory overcommit was set up on the cluster.
> > 
> In this case this would be a qemu-kvm RPM issue, as ksm* belongs there and
> is installed in this context. Can you please verify this is the case?
> 

What can I do now? I just found out that the ksm and ksmtuned services were not registered with chkconfig/service.
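
(For reference, one way to check that on a RHEL 6 style host; the exact messages vary, so take this as a sketch:)
chkconfig --list ksm        # errors out or shows no runlevels if the service is not registered
chkconfig --list ksmtuned
service ksmtuned status     # reports "unrecognized service" if the init script itself is missing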

I've opened separate bug 854027 for vdsm to enable the service, so I'd move this discussion there. There is also the complete yum history of vdsm and qemu-kvm for the affected host.

<snip>

> No, since the admin may decide to do that even after setting this configuration.

So you are fine when there are no warnings about mutually exclusive settings in the setup? 

IMO because we trust in admin competence, we should just warn them in the engine instead of forcing some behavior upon them. That said, after discussion at this bug and bug 854027, I'm not really convinced that 854027 should be implemented but this one seems essential to have memory sharing actually working without hours of debugging the issue.

Comment 8 Doron Fediuck 2012-09-06 16:27:00 UTC
(In reply to comment #7)
> 
> I've opened separate bug 854027 for vdsm to enable the service, so I'd move
> this discussion there. There is also the complete yum history of vdsm and
> qemu-kvm for the affected host.
> 
I'm monitoring this bz.

> So you are fine when there are no warnings about mutually exclusive settings
> in the setup? 
> 
> IMO because we trust in admin competence, we should just warn them in the
> engine instead of forcing some behavior upon them. That said, after
> discussion at this bug and bug 854027, I'm not really convinced that 854027
> should be implemented but this one seems essential to have memory sharing
> actually working without hours of debugging the issue.

Currently, at host level, you have a good indication of KSM status;
in the General sub-tab of every host you can see the MemoryPageSharing state.
So technically this can become a KB or release note.
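
(The host-side counterpart of that indication, assuming the stock sysfs interface, would be something like:)
grep . /sys/kernel/mm/ksm/run /sys/kernel/mm/ksm/pages_sharing   # non-zero pages_sharing means KSM is actively merging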

We can think of a feature for next versions to warn during cluster policy editing.

Comment 9 David Jaša 2012-09-06 17:19:42 UTC
(In reply to comment #8)
> Currently, at host level, you have a good indication of KSM status;
> in the General sub-tab of every host you can see the MemoryPageSharing state.

It's hardly any indication, given the combination of bug 732464 (and its 3.1 clone, bug 831256) and bug 855018.

> So technically this can become a KB or release note.

I just reported bug 855103, which is IMO the proper place for this piece of information.

> 
> We can think of a feature for next versions to warn during cluster policy
> editing.

That would be nice. If the VDSM and documentation bugs are implemented soon, I think the severity of this one will drop from a thing-that-prevents-debugging to just a nice-to-have (and vice versa: if the documentation bug and this one are implemented, the vdsm part will get less severe, and so on ;)).

Comment 10 Itamar Heim 2013-03-11 21:50:50 UTC
Closing old bugs. If this issue is still relevant/important in the current version, please re-open the bug.

Comment 11 David Jaša 2013-03-12 09:27:30 UTC
The fix for bug 854027 actually means that this is no longer a problem.

