Bug 979257
Summary: | [Hyper-V][RHEL6.5][RFE]in-kernel online support for memory hot-add | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | jason wang <jasowang> | ||||||
Component: | kernel | Assignee: | Vitaly Kuznetsov <vkuznets> | ||||||
kernel sub component: | Hyper-V | QA Contact: | Virtualization Bugs <virt-bugs> | ||||||
Status: | CLOSED DEFERRED | Docs Contact: | |||||||
Severity: | unspecified | ||||||||
Priority: | unspecified | CC: | abgupta, ariscop, decui, hhuang, jasowang, jingli, kay, knoel, kraxel, kyliel, kys, lcapitulino, leiwang, marlinw, shwang, udev-maint-list, v-chvale, vyadav, yacao | ||||||
Version: | 6.5 | Keywords: | FutureFeature | ||||||
Target Milestone: | rc | ||||||||
Target Release: | --- | ||||||||
Hardware: | Unspecified | ||||||||
OS: | Unspecified | ||||||||
Whiteboard: | |||||||||
Fixed In Version: | Doc Type: | Enhancement | |||||||
Doc Text: | Story Points: | --- | |||||||
Clone Of: | |||||||||
: | 1038941 1091679 1091680 (view as bug list) | Environment: | |||||||
Last Closed: | 2015-12-01 16:21:33 UTC | Type: | Bug | ||||||
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Embargoed: | |||||||||
Bug Depends On: | |||||||||
Bug Blocks: | 923342, 927417, 1091679, 1091680 | ||||||||
Attachments: |
|
Description
jason wang
2013-06-28 05:23:41 UTC
What is the reason that the kernel does not add these "devices"? The kernel has such defaults for all other devices, what makes memory so special here? Such unconditional defaults almost never belong into userspace tools. We usually avoid to ship such defaults and require to kernel to carry that out, if such a default is needed. (In reply to Kay Sievers from comment #1) > What is the reason that the kernel does not add these "devices"? The kernel > has such defaults for all other devices, what makes memory so special here? > > Such unconditional defaults almost never belong into userspace tools. We > usually > avoid to ship such defaults and require to kernel to carry that out, if such > a > default is needed. CC K.Y. for the answer. Kay, As far as I know, the hot added memory blocks need to be brought online and the preferred way is to write the appropriate sysfs variable from user land and hence the udev rules. I would prefer not to involve the user land code in getting the memory regions "online". Let me know how I can accomplish this. K.Y.: Is it possible to online the page automatically by: - export memory_block_change_state() - schedule a workqueue to online the memory explicitly by calling memory_block_change_state()? Or let's just remove the final waiting for the online in hv_balloon? Thanks If we can do the onlining of the memory synchronously that would be ideal. We would not need the delay in that case. The hot add operation is already handled in a work context and so we can do this without scheduling another work for. When I did this work initially upstream, I got the sense that mm maintainers did not want to export the API that would allow for in-kernel onlining of memory. This is obviously the ideal solution. If you can, please submit the patch. Kay, According to comment #5. Upstream does not want to provides the mechanism that doing the online synchronously. (Even if we want to do some changes, it's not a short-term task). The udev involvement were also documented in kernel Documentation/memory-hotplug.txt. Maybe we can reconsider this? Upstream udev does not want to provide unconditional, always enabled kernel policy in udev. :) It is just the wrong way to do things that way. Defaults should be set where they are coming from, not in tools running later switching them around. Udev is just not meant to carry out unconditional defaults. If udev rules always enable something in the kernel, it can be enabled right away in the kernel itself. Jason, I am ok with getting rid of the user level (udev) involvement here. I am fine with what you had suggested earlier (onlining the page synchronously after hot adding the page). Could you send such a patch out (on top of the patches that are currently circulating on the balloon driver). (In reply to K. Y. Srinivasan from comment #8) > Jason, > > I am ok with getting rid of the user level (udev) involvement here. I am > fine with what you had suggested earlier (onlining the page synchronously > after hot adding the page). Could you send such a patch out (on top of the > patches that are currently circulating on the balloon driver). K.Y: I wish I can help, but I'm not an expert on mm. I believe you are more suitable than me to do this since most of the driver were wrote by you. Thanks Ok; I will put these patches together. However, I am not sure if the upstream guys will accept it or not. So, I would recommend having the udev rules in RHEL 6.5 as our default position. Please pick up the other fixes I have submitted for the balloon driver. K.Y: Since we don't need the check of memory onling in balloon driver. And consider the in-kernel memory online may still need some time to finish. I wonder how about just document this behaviour and notice that it's the user/admin's responsibility to do the memory online manually or let them to add a default udev rules if they wish. (Looks like this is just what xen currently does). Thanks No; that would not work. I dropped the check for onlining in the balloon driver only because we expect onlining to occur albeit after some delay. We cannot have admin do the onlining as is done in Xen. The reason for this is that Xen does not have a DM policy engine running on the host as is the case on Windows. Since the host expects that hot-added memory is available in the guest, it will include the hot-added memory in its policy decisions for memory management. So, if we just require human intervention for onlining, the scheme will not work. Minimally we need the udev rules if we cannot get the in-kernel onlining patch accepted. Oh get it. Kay, I believe you've notice the upstream discussion of in-kernel memory online, looks like it could not be achieved for a short of period (lack of infrastructure support or need redesign). And As K.Y. states in comment #14, we could not let the admin to do this instead. We'd better re-consider the udev rules for this? What's your opinion? Thanks Kay, any more thought on this? Thanks (In reply to K. Y. Srinivasan from comment #14) > No; that would not work. I dropped the check for onlining in the balloon > driver only because we expect onlining to occur albeit after some delay. err... and if udev if faster than your expected "delay"?? Userspace is not the right tool to solve kernel-internal issues. Please contain that change in the kernel and do not rely userspace-delay-loops to work around kernel driver behavior. Udev is not meant to carry-out global kernel policy. This rule cannot be carried in the upstream package, because it is the wrong way of doing things. The kernel sends an event for a device and expects userspace to *always* change a flag on the very *same* device? That's just not how things should work. Such a change should not also not be done in RHEL6's udev, because we should not make such global unconditional, always-on changes to the behavior in a base package. Please change the kernel instead, and also do the QA for this change for the kernel, instead of relying userspace to carry it out; just because it seems simpler to do it that way, does not make it correct to it there. Doing this from inside the kernel will also allow to limit the change to drivers which actually expect that behavior, while doing it unconditionally for all setups from userspace might cause unforeseen side-effects. Jason, I had submitted a patch to the kernel that would bring memory online without involving the userspace. As you know the mm folks were not in favor of accepting it. Kay was involved in this discussion as well. We have that as the fallback position. For what it is worth, this patch improves the user experience significantly. The system is really responsive even under heavy memory load. If you don't have that patch; I can resubmit it to RedHat. (In reply to K. Y. Srinivasan from comment #19) > Jason, > > I had submitted a patch to the kernel that would bring memory online without > involving the userspace. As you know the mm folks were not in favor of > accepting it. Kay was involved in this discussion as well. We have that as > the fallback position. For what it is worth, this patch improves the user > experience significantly. The system is really responsive even under heavy > memory load. If you don't have that patch; I can resubmit it to RedHat. Sure, please resubmit. Generally, if the patch was not accepted upstream and not RHEL specific, it may get very little possibility to be merged into RHEL. But it the patch does very small change, I can try to submit. Thanks Created attachment 796988 [details]
Patch to support in-kernel onlining
The current machinery for hot-adding memory requires having user
level to bring the memory segments online. Export the necessary functionality
to bring the memory segment online without involving user space code.
Created attachment 796989 [details]
patch to online memory in the balloon driver
Leverage the newly exported functionality to bring memory online
without involving user level code.
K.Y: Some of our engineer does not like this since it was not upstream. Is there any chance that the patch can be accepted in the near future? If it could, since we still have time for 6.6, let's make it for 6.6 then. Jason, I don't know if (much less when) this patch will be accepted. The current maintainers of the memory hot add infrastructure seem to think what they have now is the way it should be. Given, how important this feature is to our customers, I suggest we package the udev rules for hot-adding memory as has been done by all other Distros today. (In reply to K. Y. Srinivasan from comment #19) > Jason, > > I had submitted a patch to the kernel that would bring memory online without > involving the userspace. As you know the mm folks were not in favor of > accepting it. Kay was involved in this discussion as well. Do you have a pointer to this discussion? Google finds me only this ... http://comments.gmane.org/gmane.linux.kernel.mm/103912 ... which doesn't look like the discussion you are referring to. Gerd, I forwarded you some emails on this thread that capture the gist of where we are - Kay does not want udev rules to control these kind of events and Dave and others do not want the in-kernel mechanism for onlining. This is just a mechanism and can be used at will - hyper-V balloon driver would use it. In some previous email with RredHat it appeared that you were ok with udev rules and only wanted it to be Hyper-V specific. Has Kay signed off on that. If that is the case it should be easy enough to structure such a rule: We could invoke a Hyper-V specific onlining script that gets invoked from the standard udev rule and this hyper-v specific udev rule would check to see if the hyperv balloon driver is loaded before onlining the memory. I am sure there are other ways we could structure this as well. If Kay is ok with this, we can work on such a script. > I forwarded you some emails on this thread that capture the gist of where we > are Thanks, I'll have a look. > - Kay does not want udev rules to control these kind of events and Dave > and others do not want the in-kernel mechanism for onlining. This is just a > mechanism and can be used at will - hyper-V balloon driver would use it. I agree with Kay here. It's pretty pointless to do the round-trip to userspace if userspace simply onlines memory unconditionally. It makes things complicated for no good reason. > In some previous email with RedHat it appeared that you were ok with udev > rules and only wanted it to be Hyper-V specific. Tested dynamic memory meanwhile. Problem is there is no easy way for udev to see that the memory was hotplugged by the hyperv-balloon driver and online it only in that case. The memory nodes in sysfs have neither some name/id attribute nor a symlink to the balloon driver. I think we should continue to push for the in-kernel solution. I think the udev rule should only used as temporary stopgap, and of course it need to be limited to hyperv. A somewhat crude way to do it would be this: SUBSYSTEM=="memory", TEST=="/sys/bus/vmbus/drivers/hv_balloon", ACTION=="add", ATTR{state}="online" That tests for the hyperv balloon driver to be loaded. Not exactly the same as whenever the memory has been hotplugged by hyperv balloon driver. But as the driver is autoloaded in hyperv guests only and there is only this one way to hotplug memory in hyperv it should be close enough in practice. > Has Kay signed off on that. Kay? As far as the hyper-V specific udev rule, I was thinking on similar lines (to what you have) here. We will wait for Kay's input. > I forwarded you some emails on this thread that capture the gist of where we
> are
So it boils down to these requests from Dave:
<quote>
1. Come up with an interface that specifies a default policy for
newly-added memory sections. Currently, added memory gets "added",
but not "onlined", and the default should stay doing that.
2. Make sure that we at least WARN_ONCE() if someone tries to online an
already-kernel-onlined memory section. That way, if someone trips
over this new policy, we have a _chance_ of explaining to them what
is going on.
3. Think about what we do in the failure case where we are able to
"add", but fail to "online" in the kernel. Do we tear the
newly-added structures down and back out of the operation, or do
we leave the memory added, but offline (what happens in the normal
case now)?
</quote>
There is also the problem that you must allocate memory (so you can manage the new pages) before the new memory becomes available, which might be a problem under memory pressure.
On (1): I think online_memory_block() is fine. I suggest to submit it again.
On (2): Adding a patch which does WARN_ON on a double online is easily done ;)
On (3): Error handling is a valid concern. I think this belongs into the hotplug driver, not in the kernel memory hotplug core. Different drivers probably want do different things here. For hyper-v (and other hypervisors) it makes sense to remove_memory() and signal the failure back to the hypervisor so it can unmap the memory as the guest isn't going to use it anyway. When it comes to physical dimms: those are not going to go away just because onlining them failed, so the hotplug driver probably should not remove_memory() then. So, I'd suggest to fix patch #2 to catch+handle online_memory_block() failures + resubmit upstream.
On the memory allocation issue: I don't think that'll be a big issue in practice as (a) the balloon driver monitors memory usage and signals the need for additional memory early enough to the hypervisor and (b) memory is hotplugged in small (128 MB) chunks, so the amount of memory needed to manage the new memory isn't *that* big. I suggest to explain that in the commit message.
Feel free to Cc: me when submitting the patches so I can help in the discussions.
Thanks Gerd; I will resubmit them with some changes in a while. Are we now settled on having the udev rule as an intermediate solution? (In reply to K. Y. Srinivasan from comment #32) > Thanks Gerd; I will resubmit them with some changes in a while. Are we now > settled on having the udev rule as an intermediate solution? I think it's ok. No answer from Kay yet. It's not urgent though, RHEL 6.6 is still in early development phase so we have a fair chance to get the in-kernel bits accepted upstream in time. My understanding is that this is still pending an upstream solution, setting conditional NACK. (In reply to Luiz Capitulino from comment #34) > My understanding is that this is still pending an upstream solution, setting > conditional NACK. Condnack was cleared because of the devel_ack+. Fixed. Has there been any movement on this bug? Just ran across it on a fedora install, manually added the udev entry This is still a missing feature upstream and it won't probably make it to RHEL6 ever. Closing the bug. It's still present in Fedora 23 but no bug seems to be open against it This is a missing kernel feature and I'm not sure that creating a Fedora bug will help but I think it's possible to add a udev rule (the same we actually have in RHEL) as a workaround: /lib/udev/rules.d/40-redhat.rules # CPU hotadd request SUBSYSTEM=="cpu", ACTION=="add", TEST=="online", ATTR{online}=="0", ATTR{online}="1" # Memory hotadd request SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online" |