Bug 452650
Summary: [RHEL5.2]: Blktap is limited to 100 disks total
Product: Red Hat Enterprise Linux 5
Component: kernel-xen
Version: 5.2
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: low
Priority: low
Target Milestone: rc
Reporter: Chris Lalancette <clalance>
Assignee: Laszlo Ersek <lersek>
QA Contact: Virtualization Bugs <virt-bugs>
CC: ddutile, drjones, lersek, mvattakk, pbonzini, qcai, samuel.kielek, sputhenp, srostedt, xen-maint, yuzhang, yuzhou
Doc Type: Bug Fix
Last Closed: 2011-07-21 10:26:32 UTC
Bug Blocks: 514490, 680407
Description
Chris Lalancette
2008-06-24 10:09:34 UTC
Note that this is closely related to BZ 442723, where we are expanding the maximum number of disks that you can attach to a guest.

Chris Lalancette

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

(In reply to comment #0)
> Description of problem:
> Currently in the RHEL-5.2 kernel, the blktap driver has a hardcoded limit of
> 100 devices. This means that the dom0 kernel can only service 100 disks
> total, across all guests. Upstream has kicked this up to 256; however, there
> really seems to be no reason that this couldn't be made dynamic in the
> drivers/xen/blktap/blktapmain.c code, so you could have basically unlimited
> disks (pending memory requirements, of course).
>
> The basic thing to fix is changing the statically allocated tapfds and
> domid_translate arrays to dynamically allocated, and then fixing all of the
> fallout from that. It should be relatively straightforward and confined to
> the blktap driver. Of course, we should definitely work with upstream on
> this one.

Minor devices for /dev/xen/blktap* are already dynamically allocated and reused from a "pool"; it's only that the pool (the array) holding the pointers has static storage duration (and so a fixed size). A dynamically resized pool would require kmalloc() + memcpy() + kfree(), then resetting the pointer pointing to the base of the pool. (Naturally, the array object would be replaced by a pointer object.) This reassignment would require locking or atomic operations, because lots of places in "blktap.c" query "tapfds[minor]". I think this would introduce unnecessary risk.
Another data structure to consider would be a list, but that's even worse -- not only does it involve continuous pointer manipulation, it also doesn't support O(1) access by device minor id. But a structure more complicated than a vector or a list is unwarranted, in my opinion.

I think the best we can do, and the most we should do, is to introduce a module parameter that specifies the pool size, and replace the static array with a dynamically *pre*allocated array (blkif_init()). (I'm talking about a single array to replace, but I realize in RHEL-5 we also have the "translate_domid" array.)

Considering that this bug is two and a half years old and has not been linked with business cases here, and that upstream still has a fixed limit of 256 as of rev 1070:2994d2997f9d, I think we can safely assume that a fixed but tuneable pool size would cover all real demands. It seems that the real bottleneck was bug 442723 (number of guest disks). Now that a single guest can handle up to 26 + 26 * 26 = 702 disks, and considering a 1 TB host with ~1000 1 GB guests, I think our new module parameter should be clamped between the round limits of "64" and "512 * 1024", and it should default to 256 (to adjust ourselves to upstream). Whoever needs more than the default 256 does qualify, after 2.5 years, as a special case, and should be willing to set the module parameter. Minor device numbers have 20 bits of information, so this should fit.

Two arguments for lowering the upper limit to something actually sensible:
- There really shouldn't be hundreds of thousands of /dev/xen/blktap* nodes.
- Many functions search these arrays in a linear fashion. For example, a sequence of device allocations (via N invocations of get_next_free_dev(), each called from a separate BLKTAP_IOCTL_NEWINTF / _EXT ioctl) takes quadratic time in total.

BTW, the "MAX_DEV_NAME" macro is superfluous; there is not a single use of it in the entire tree, not even upstream.
Proposition:
a) Brainstorm an upper limit for the pool size.
b) Delete MAX_DEV_NAME.
c) Make the pool size a module parameter.
d) Send patch to upstream.
e) Send (potentially modified) patch to rhkernel-list.

I'm willing to do (b)-(e); please comment on (a), and of course on the idea itself. Thank you very much!

(In reply to comment #5)
> A dynamically resized pool would require kmalloc() + memcpy() + kfree(), then
> resetting the pointer pointing to the base of the pool. (Naturally, the array
> object would be replaced by a pointer object.) This reassignment would require
> locking or atomic operations, because lots of places in "blktap.c" query
> "tapfds[minor]". I think this would introduce unnecessary risk.
>
> Another data structure to consider would be a list, but that's even worse --
> not only does it involve continuous pointer manipulation, it also doesn't
> support O(1) access by device minor id. But a structure more complicated than a
> vector or a list is unwarranted, in my opinion.
>
> I think the best we can do, and the most we should do, is to introduce a module
> parameter that specifies the pool size, and replace the static array with a
> dynamically *pre*allocated array (blkif_init()).
>
> (I'm talking about a single array to replace, but I realize in RHEL-5 we also
> have the "translate_domid" array.)
>
> Considering this bug is two and a half years old and was not linked with
> business cases here; what's more, that upstream still has a fixed limit of 256
> as of rev 1070:2994d2997f9d, I think we can safely assume that a fixed but
> tuneable pool size would cover all real demands. It seems that the real
> bottleneck was bug 442723 (number of guest disks).

I disagree with this point, though not strongly. In those 2.5 years, I have had multiple people ask why this limitation is in place, and for a way to work around it. The workaround is typically easy: use phy: devices instead.
Still, it is something that has come up frequently enough that it is annoying (and very non-obvious). My preferred solution (and the one that I would have tried to implement, given enough time) is the kmalloc() + memcpy() + kfree() one with the locking. I don't think it will be that hard to do.

All of that being said, I do agree that it adds some risk to the whole thing, and we are getting awfully late in RHEL-5 Xen's lifetime to start doing risky things. Additionally, hasn't blktap mostly been replaced by blktap2 in upstream Xen? Assuming this is the case, it would be code that wouldn't get much testing by the community, so we would be going it alone here. So...

> Now that a single guest can handle up to 26 + 26 * 26 = 702 disks, considering
> a 1 TB host with ~1000 1 GB guests, I think our new module parameter should be
> clamped between the round limits of "64" and "512 * 1024", and it should
> default to 256 (to adjust ourselves to upstream). Whoever needs more than the
> default 256 does qualify, after 2.5 years, as a special case, and should be
> willing to set the module parameter. Minor device numbers have 20 bits of
> information, so this should fit.
>
> Two arguments for lowering the upper limit to something actually sensible:
> - There really shouldn't be hundreds of thousands of /dev/xen/blktap* nodes.
> - Many functions search these arrays in a linear fashion. For example, a
>   sequence of device allocations (via N invocations of get_next_free_dev(),
>   each called from a separate BLKTAP_IOCTL_NEWINTF / _EXT ioctl) takes
>   quadratic time in total.
>
> BTW. the "MAX_DEV_NAME" macro is superfluous; there's no single application of
> it in the entire tree, not even in upstream.
>
> Proposition:
> a) Brainstorm an upper limit for the pool size.
> b) Delete MAX_DEV_NAME.
> c) Make the pool size a module parameter.
> d) Send patch to upstream.
> e) Send (potentially modified) patch to rhkernel-list.
>
> I'm willing to do (b)-(e); please comment on (a), and of course on the idea
> itself. Thank you very much!

I think that we should increase the limit to 256 like you said. The addition of a module parameter is a nice-to-have; if you do add it, I wouldn't artificially restrict it. As we like to say, we'll give you all the rope you want; it's up to you whether you choose to strangle yourself with it.

So that means my counter-proposition would be:
a) Delete MAX_DEV_NAME
b) Increase the limit to 256 to match upstream
c) Make the pool size a module parameter
d) Send patch to upstream
e) Send patch to rhkernel-list

Does that work for you?

Chris Lalancette

(In reply to comment #6)
> (In reply to comment #5)
> My preferred solution (and the one that I would have tried to
> implement, given enough time) is the kmalloc() + memcpy() + kfree()
> one with the locking. I don't think it will be that hard to do.

I'm generally not afraid (just very conscious) of locking. Being a novice in kernel space, I can't claim to be completely aware of my surroundings all the time, so I try to constrain my modifications to as narrow a scope as possible.

Perhaps a spinlock would be appropriate here. I'm not sure a normal spinlock ensures visibility (cache flushes, "publication safety") -- I might have to force acquire-release semantics with memory barriers after storing the new pointer value. Even protecting only the base pointer doesn't seem enough: the accesses to tapfds, tapfds[i], and *tapfds[i] do not form a pure producer/consumer pattern. Protecting each entry separately is overkill and not very easy to get right IMHO; protecting all those 1 + n + n objects with a single spinlock might prove a bottleneck.

> All of that being said, I do agree that it adds some risk to the
> whole thing, and we are getting awfully late in RHEL-5 Xen's lifetime
> to start doing risky things. Additionally, hasn't blktap mostly been
> replaced by blktap2 in upstream Xen?
I couldn't say; I was surprised they have blktap2 at all.

> I think that we should increase the limit to 256 like you said. The
> addition of a module parameter is a nice-to-have; if you do add it, I
> wouldn't artificially restrict it. As we like to say, we'll give you
> all the rope you want; it's up to you whether you choose to strangle
> yourself with it.
>
> So that means my counter-proposition would be:
> a) Delete MAX_DEV_NAME
> b) Increase the limit to 256 to match upstream
> c) Make the pool size a module parameter
> d) Send patch to upstream
> e) Send patch to rhkernel-list
>
> Does that work for you?

It's less work than what I proposed, so obviously it is :) Though there is one constraint we certainly must put on the upper value: the array size (after multiplication) must not overflow a size_t. That is, supposing the module parameter must be a signed integer,

    0 < num_tapfds && (size_t)num_tapfds <= (size_t)-1 / sizeof tapfds[0]

must hold. (Otherwise, the allocation might "succeed" with an overflown size_t, and the subsequent initialization loop would trample on whatever follows the truncated array.) On 64-bit, the second condition can't yield false; taken stand-alone, that condition is sometimes the source of a useless gcc warning.

Thank you!
lacos

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Created attachment 479561 [details]
Remove unused MAX_DEV_NAME macro.
Created attachment 479562 [details]
Make the maximum number of blktap devices a module parameter.

Testing in progress: https://brewweb.devel.redhat.com/taskinfo?taskID=3123631

Created attachment 479935 [details]
Remove unused MAX_DEV_NAME macro. Make the maximum number of blktap devices a module parameter.
Make limit checks stricter.
Rely on correct implicit conversions instead of explicit casts.
Collapse patches into one.
Comment on attachment 479935 [details]
Remove unused MAX_DEV_NAME macro. Make the maximum number of blktap devices a module parameter.

http://www.kernel.org/doc/Documentation/CodingStyle

The static asserts are unnecessary. The additional context block for the 'new' is unnecessary. Using uint32_t (so the limit is the same for both 32-bit and 64-bit builds), or even something smaller, would be better than using (size_t)-1.

The "As of kernel-2.6.18-245.el5" comment is confusing. It sounds like those leaks were introduced with -245, but really you mean that those leaks have been there a long time, up to and including -245. I would just drop it from the comment. If someone fixes the leaks, they'd remove the comment anyway.

Also, if you really want static asserts, the idiom is BUILD_BUG_ON.

Bugzilla ate my first comment. :) I would just set a relatively low LIM_TAP_DEV, like 1024 (without any static check: 1024 times any sensible struct size is less than 2^32 or 2^64!). Don't forget that the limit on the number of event channels in dom0 would be hit before that: only Windows guests are able to tear down block devices and free their event channels.

(In reply to comment #14)
> Also, if you really want static asserts the idiom is BUILD_BUG_ON.

Thanks; I was searching for variations on "static_assert" and couldn't remember BUILD_BUG_ON.

(In reply to comment #15)
> I would just set a relatively low
> LIM_TAP_DEV, like 1024 (without any static check: 1024 times any sensible
> struct size is less than 2^32 or 2^64!).

Thank you; this is what I asked for in comment 5 ("Brainstorm an upper limit for the pool size"). I wrote the current patch (attachment 479935 [details]) for the case recommended by Chris in comment 6 ("we'll give you all the rope you want"). Unsigned overflow is even more rope than that, so I had to make sure that the product can be represented in a size_t (= the type that kmalloc() takes).
> Don't forget that the limit on the number of event channels in dom0 would be
> hit before that: only Windows guests are able to tear down block devices and
> free their event channels.

Doesn't this defeat the entire purpose of this bug? Since guests can now support up to 600-700 virtual block devices, I would expect a host-side limit at least an order of (decimal) magnitude higher, like 8192.

(In reply to comment #13)
> The "As of kernel-2.6.18-245.el5" comment is confusing. It sounds like those
> leaks were introduced with -245, but really you mean that those leaks have
> been there a long time, up to and including -245. I would just drop it from
> the comment. If someone fixes the leaks, they'd remove the comment anyway.

I will correct the comment, thanks.

Created attachment 479965 [details]
Remove unused MAX_DEV_NAME macro. Make the maximum number of blktap devices a module parameter.

Define LIM_MAX_TAP_DEV as 1024. Consequently:
- Update the description of the "max_tap_dev" modparam to display LIM_MAX_TAP_DEV as well.
- Remove static asserts.
- Use the <include/linux/kernel.h> clamp() macro for clamping.
Remove the RHEL kernel tag reference from the leak comment.

Task info (in progress): http://brewweb.devel.redhat.com/brew/taskinfo?taskID=3128347

I put a link in my first review to the CodingStyle because you have a few things that violate it.
1) You have dyslexic conditions, e.g. "0 != ret". I know the reasoning, but the kernel doesn't do it like that; kernel guys just don't confuse = and ==.
2) You're missing the () for sizeof. It's not required, even by the CodingStyle, but it's the norm.
3) You have {} braces wrapping a single statement in an if-block; they should never be used for single statements.
4) Not your fault, but there's a multi-statement line that you touch:
       for (i = 0; i < max_tap_dev; i++) translate_domid[i].domid = 0xFFFF;
   It would be nice to drop that second statement down where it belongs.
Sorry for the nitpicking...
The Xen code has these:

<include/xen/sched.h>

    #define MAX_EVTCHNS(d) (BITS_PER_EVTCHN_WORD(d) * BITS_PER_EVTCHN_WORD(d) * 64)

and <include/public/xen.h>

    /*
     * Event channel endpoints per domain:
     *  1024 if a long is 32 bits; 4096 if a long is 64 bits.
     */
    #define NR_EVENT_CHANNELS (sizeof(unsigned long) * sizeof(unsigned long) * 64)

The latter is present in the kernel tree as well (<include/xen/interface/xen.h>); I'll change the patch to use NR_EVENT_CHANNELS as the upper limit. Thanks Paolo for the hint.

I'll also fix the problems listed in comment 18, thanks.

Created attachment 480123 [details]
max_tap_dev module parameter; v4
Difference to previous version:
- Use NR_EVENT_CHANNELS instead of LIM_MAX_TAP_DEV.
- Since NR_EVENT_CHANNELS is unsuitable for stringification by the preprocessor,
change modparam desc to display only default value.
- Fix extra spaces inside parentheses on touched lines (CodingStyle section 3.1)
and break out loop body in a separate line.
- Switch to clamp_t() from clamp(), as NR_EVENT_CHANNELS is of type "size_t".
- Remove "0 != " where it means success.
- Replace "0 != " with " != 0" where it means failure.
- Remove spurious braces around single statement under "if".
- Use parentheses with sizeof all the time.
(In reply to comment #0)
> Currently in the RHEL-5.2 kernel, the blktap driver has a hardcoded limit of
> 100 devices. This means that the dom0 kernel can only service 100 disks
> total, across all guests. Upstream has kicked this up to 256; however, there
> really seems to be no reason that this couldn't be made dynamic in the
> drivers/xen/blktap/blktapmain.c code, so you could have basically unlimited
> disks (pending memory requirements, of course).

http://lists.xensource.com/archives/html/xen-devel/2011-02/msg01332.html

> Without replacing the call to register_chrdev() with one to
> __register_chrdev() (available only with 2.6.32 and newer) I
> can't see how you would get beyond 256 devices with the
> changes you propose.

I assumed this was already covered, based on the related bug 442723.

So, what should I do? I could simply bump the original macro from 100 to 256, or I could reimplement register_chrdev() in blktap in a way that I can pass in max_tap_dev:
- alloc_chrdev_region()
- cdev_alloc()
- cdev_init()
- cdev_add()

On the error path:
- cdev_del()
- unregister_chrdev_region()

However, some settings would have to be done by hand:
- setting the cdev->owner,
- setting the name of &cdev->kobject.

Is this bug worth that?

(In reply to comment #21)
> (In reply to comment #0)
>
> > Currently in the RHEL-5.2 kernel, the blktap driver has a hardcoded limit of
> > 100 devices. This means that the dom0 kernel can only service 100 disks
> > total, across all guests. Upstream has kicked this up to 256; however, there
> > really seems to be no reason that this couldn't be made dynamic in the
> > drivers/xen/blktap/blktapmain.c code, so you could have basically unlimited
> > disks (pending memory requirements, of course).
> > http://lists.xensource.com/archives/html/xen-devel/2011-02/msg01332.html
> >
> > > Without replacing the call to register_chrdev() with one to
> > > __register_chrdev() (available only with 2.6.32 and newer) I
> > > can't see how you would get beyond 256 devices with the
> > > changes you propose.
> >
> > I assumed this was already covered, based on the related bug 442723.
> >
> > So, what should I do? I could simply bump the original macro from 100 to 256,
> > or I could reimplement register_chrdev() in blktap in a way that I can pass in
> > max_tap_dev:
> >
> > - alloc_chrdev_region()
> > - cdev_alloc()
> > - cdev_init()
> > - cdev_add()
> >
> > On the error path:
> > - cdev_del()
> > - unregister_chrdev_region()
> >
> > However, some settings would have to be done by hand:
> > - setting the cdev->owner,
> > - setting the name of &cdev->kobject.
> >
> > Is this bug worth that?

NO! ... RHEL5.7! ... wire it to 256 & move on.

It's unlikely that another bz would change what should be init'd and/or added to the cdev structure -- and thus what else this code (which would sit hidden, unknown to any modification of the common cdev code) would need to init -- but it could happen, and this (rotten) code could then cause mysterious cdev events/failures/secondary-effects due to some element not being init'd properly. If this were RHEL 6.1, and we were able to modify/use the common register_chrdev() functions, that'd be one (out-of-band) step; but mimicking the operation of the existing register_chrdev(), with caveats, hoping the common, replicated code doesn't change in a way that buries a potential issue in blktap... I think there is too much risk for this late in the RHEL5 schedule. Unfortunately, someone didn't do their scaling homework and solve this problem earlier, upstream first and then in RHEL5. Yet another selling reason for RHEL6! ;-)

The upstream commit that changed MAX_TAP_DEV from 100 to 256 is xen-unstable c/s 11868:80b296ec93dc, written by (if I may) our own Steven Rostedt.
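For context, the manual registration sequence enumerated in the comment above would look roughly like this. This is an untested kernel-side sketch only, never written against the actual 2.6.18 blktap code: max_tap_dev and blktap_fops are assumed to exist in blktap.c, and error handling is abbreviated.

```c
#include <linux/cdev.h>
#include <linux/fs.h>
#include <linux/module.h>

static dev_t blktap_devt;
static struct cdev *blktap_cdev;

static int blktap_register(void)
{
    /* Reserve a dynamically assigned major with max_tap_dev minors,
     * instead of register_chrdev()'s fixed 0..255 minor range. */
    int err = alloc_chrdev_region(&blktap_devt, 0, max_tap_dev, "blktap");
    if (err)
        return err;

    blktap_cdev = cdev_alloc();
    if (!blktap_cdev) {
        unregister_chrdev_region(blktap_devt, max_tap_dev);
        return -ENOMEM;
    }
    /* The settings that register_chrdev() would otherwise do by hand. */
    blktap_cdev->owner = THIS_MODULE;
    blktap_cdev->ops = &blktap_fops;
    kobject_set_name(&blktap_cdev->kobj, "blktap");

    err = cdev_add(blktap_cdev, blktap_devt, max_tap_dev);
    if (err) {
        cdev_del(blktap_cdev);
        unregister_chrdev_region(blktap_devt, max_tap_dev);
    }
    return err;
}
```

As the thread concluded, duplicating this common-code behavior inside blktap was judged too risky for late RHEL 5, which is why the patch simply bumps MAX_TAP_DEV instead.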
The patch raises the limit to 256, and solves the dynamic allocation of tapfd descriptors. It merges the translate_domid array into the tapfds array (by embedding parallel elements of the former in elements of the latter), and adds a wmb() to get_next_free_dev() (see comment 7), besides rewriting most of it.

The upstream patch delays the allocation of *tapfds[i], but the array of pointers (tapfds) itself remains statically allocated and cannot be resized. I think the only thing that we need to backport from it is the MAX_TAP_DEV change. (I wrote this comment purely to point out the upstream origin of the MAX_TAP_DEV increase.)

Created attachment 480437 [details]
Bump MAX_TAP_DEV from 100 to 256. Remove superfluous MAX_DEV_NAME.
(In reply to comment #22)
> (In reply to comment #21)
> > (In reply to comment #0)
> >
> > > Currently in the RHEL-5.2 kernel, the blktap driver has a hardcoded limit of
> > > 100 devices. This means that the dom0 kernel can only service 100 disks
> > > total, across all guests. Upstream has kicked this up to 256; however, there
> > > really seems to be no reason that this couldn't be made dynamic in the
> > > drivers/xen/blktap/blktapmain.c code, so you could have basically unlimited
> > > disks (pending memory requirements, of course).
> >
> > http://lists.xensource.com/archives/html/xen-devel/2011-02/msg01332.html
> >
> > > Without replacing the call to register_chrdev() with one to
> > > __register_chrdev() (available only with 2.6.32 and newer) I
> > > can't see how you would get beyond 256 devices with the
> > > changes you propose.
> >
> > I assumed this was already covered, based on the related bug 442723.
> >
> > So, what should I do? I could simply bump the original macro from 100 to 256,
> > or I could reimplement register_chrdev() in blktap in a way that I can pass in
> > max_tap_dev:
> >
> > - alloc_chrdev_region()
> > - cdev_alloc()
> > - cdev_init()
> > - cdev_add()
> >
> > On the error path:
> > - cdev_del()
> > - unregister_chrdev_region()
> >
> > However, some settings would have to be done by hand:
> > - setting the cdev->owner,
> > - setting the name of &cdev->kobject.
> >
> > Is this bug worth that?
>
> NO! ... RHEL5.7! ... wire it to 256 & move on.

As that thread pointed out, more than 256 minors is possible in 2.6.18, but it is not worth the churn. You've done due diligence to try to solve this the right way, and we know what the right way would take. That's always the situation we want to be in: know what the right answer is, and then we can decide whether it is worth it to go down that route. In this case, I agree that that is more churn than we want in 5.7. So wire it to 256 and be done with it.
Chris Lalancette

in kernel-2.6.18-246.el5

You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Verified the bug on kernel-xen-2.6.18-257.el5, xen-3.0.3-129.el5.

Steps to verify:
1. Add "fs.aio-max-nr = 262144" to "/etc/sysctl.conf" in the host, and execute "sysctl -p".
2. Create a PV guest and create 256 image files in dom0.
3. Attach the 256 images to the guest one by one by script.

Results: Attaching returns an error from the 235th image on; all 234 attached images can be seen and work well in the guest. No other limits were hit. So I am changing the Status to VERIFIED.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html