Bug 1389503
| Summary: | director should increase libvirtd FD limits on ceph backed compute nodes | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Tim Wilkinson <twilkins> |
| Component: | openstack-tripleo-puppet-elements | Assignee: | Giulio Fidente <gfidente> |
| Status: | CLOSED DUPLICATE | QA Contact: | Yogev Rabl <yrabl> |
| Severity: | high | Docs Contact: | Derek <dcadzow> |
| Priority: | high | | |
| Version: | 10.0 (Newton) | CC: | abond, bengland, berrange, dcain, dwilson, eglynn, gfidente, hbrock, jdurgin, jefbrown, jharriga, johfulto, jomurphy, jslagle, kbader, mburns, pmyers, rhel-osp-director-maint, rsussman, twilkins |
| Target Milestone: | --- | Keywords: | FutureFeature, Triaged |
| Target Release: | 11.0 (Ocata) | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-11-23 10:57:09 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Tim Wilkinson
2016-10-27 18:24:08 UTC
This impacts scalability of OSP 10 on Ceph storage, and limits use of the scale lab to test OpenStack with Ceph storage. It does not happen on small configurations.

Daniel Berrange (comment #3):

There are several things here. QEMU should not hang when the number of files is too low. If Ceph can't open a socket/file during QEMU startup, QEMU should fail to start with an appropriate error message. If it happens during later runtime, then the guest OS should be paused. QEMU itself should never hang. Maybe the guest OS pausing was mistaken for a hang?

I don't think we should need to raise the limit of libvirtd here. If it is QEMU having the problem, then we should raise the QEMU limit in /etc/libvirt/qemu.conf via the 'max_files' parameter. There's probably a reasonable argument to be made for libvirt to ship with a higher default limit: the Linux kernel default ulimits haven't changed in decades, and are very pessimistically low IMHO given current hardware scale.

(In reply to Daniel Berrange from comment #3)
> QEMU itself should never hang. Maybe the guest OS pausing was mistaken for a hang?

Yes, that could have been the case; I should have been more specific. I observed all I/O to the Ceph-backed Cinder volumes stop, even though many guests were still running fio jobs. Correction: I/O stops on each of the guests affected by the problem.

> QEMU should not hang when the number of files is too low. If ceph can't open a
> socket/file during qemu startup, QEMU should fail to start with an appropriate
> error message. If it happens during later runtime, then the guest OS should be
> paused. QEMU itself should never hang. Maybe the guest OS pausing was mistaken
> for a hang?
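As comment 3 suggests, the per-QEMU FD limit lives in /etc/libvirt/qemu.conf. A minimal sketch of that setting follows; the value 32768 is an illustrative assumption, not a tuned recommendation from this report:

```ini
# /etc/libvirt/qemu.conf (fragment)
# max_files sets the open-file limit for each QEMU process that
# libvirt spawns. 32768 is an illustrative value, not a recommendation.
max_files = 32768
```

libvirtd must be restarted for the setting to take effect, and already-running guests keep their old limit until they are stopped and started again.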
Ceph librbd doesn't actually open a TCP socket to an OSD until it needs to talk to it. You can see the number of sockets grow as the application accesses more of the volume, because it is then hitting new OSDs (block devices on Ceph servers) that it didn't need before. To see this, create an RBD volume, pre-populate it with data, and then run:
fio --ioengine=rbd --clientname=admin --pool=ben --rbdname=v3 --rw=randread --ramp_time=30 --size=16g --bs=4k --runtime=50 --rate_iops=100 --name=foo > /tmp/fio.log 2>&1 &
In this example, we'll see librbd gradually create a socket for every OSD containing data associated with the RBD volume. For large enough RBD volumes, it could be *all* of them - since OSDs are chosen at random for replication, it's hard to predict when the threshold of 1024 file descriptors will be crossed. For small Ceph clusters you might never cross this threshold. But if you want OpenStack to be scalable with Ceph-backed storage, you don't want to hit this threshold.
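The arithmetic above can be sketched as a back-of-envelope check. The helper name and the 200-descriptor headroom for QEMU's non-RBD files are hypothetical, illustrative choices, not values from this report:

```shell
# Hypothetical check: librbd can end up holding one TCP socket per OSD
# that stores the volume's objects, so the process FD limit must exceed
# the OSD count plus headroom for QEMU's other descriptors.
fd_budget_ok() {
    osds=$1
    limit=$2
    headroom=200   # rough, assumed allowance for non-RBD descriptors
    [ $((osds + headroom)) -lt "$limit" ]
}

fd_budget_ok 500 1024  && echo "500 OSDs fit under a 1024 FD limit"
fd_budget_ok 1000 1024 || echo "1000 OSDs do not fit under a 1024 FD limit"
```

With the default 1024-descriptor limit, a 500-OSD cluster already sits uncomfortably close, and a 1000-OSD cluster is guaranteed to cross it.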
(root@c07-h01-6048r) - (18:28) - (~)
-=>>while [ 1 ] ; do netstat -anp | grep fio | wc -l ; sleep 2 ; done
138
192
242
286
325
364
397
434
462
488
517
538
560
[1]+ Done fio --ioengine=rbd --clientname=admin --pool=ben --rbdname=v3 --rw=randread --size=16g --bs=4k --runtime=50 --rate_iops=20 --name=foo > /tmp/fio.log 2>&1
0
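Instead of grepping netstat system-wide as in the transcript above, a process's descriptor count and limits can also be read directly from /proc. This sketch runs against the current shell (`$$`); on a compute node you would substitute a qemu-kvm PID, e.g. from `pgrep`:

```shell
# Inspect one process's FD usage and its soft/hard limits via /proc.
# pid=$$ is for demonstration; use a qemu-kvm PID on a compute node.
pid=$$
echo "open fds: $(ls /proc/$pid/fd | wc -l)"
grep -i 'open files' /proc/$pid/limits
```

Watching the first number climb toward the "Max open files" soft limit is an easy way to spot a guest that is about to hit the wall.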
Any chance of doing this fix in OSP 10 to improve its scalability?

At the bottom of the initial post, we provided an example fix for the problem, and in comment 6 I show why this fix is necessary for librbd. On re-reading comment 3 I see that there is more than one way to adjust the file descriptor limit, as Daniel Berrange suggested, and I defer to the developers on which way is best. All that is needed is for OpenStack to automate deployment of that file descriptor limit adjustment, so that every guest can create enough FDs to access Ceph storage. For example, one customer we work with routinely runs OpenStack+Ceph clusters with 500 OSDs, and they want to go to 1000-OSD deployments but are afraid to because of problems with guest response times or hangs.

Hangs are exactly what I encountered when exceeding the FD limit; see http://tracker.ceph.com/issues/17573. Guests encounter them too after accessing enough of their Cinder volumes to hit the FD limit, but they may not hit the problem until long after they have started running the application, and it may not occur consistently, which adds to the difficulty of diagnosis. We lowered the ulimit in the report only to make the problem easy to reproduce; reducing ulimit -n is not necessary to encounter it, as long as OSD count >= file descriptor limit.

Ideally, Ceph should fix the hang in librbd resulting from insufficient FDs, but it's *far* easier to prevent the problem than to diagnose and fix it. Without automation, you'd have to implement the fix documented above by hand on every compute host, restart libvirtd on all of them, and then stop and start your guests, am I right? For some users that can be very disruptive.
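The automation this comment asks for could take several forms. One hypothetical approach for the libvirtd daemon itself is a systemd drop-in; the path and value below are illustrative assumptions, not what the director fix actually ships:

```ini
# /etc/systemd/system/libvirtd.service.d/limits.conf (hypothetical drop-in)
# Raises the open-file limit for libvirtd itself; the value is illustrative.
[Service]
LimitNOFILE=65536
```

After `systemctl daemon-reload && systemctl restart libvirtd`, libvirtd runs with the raised limit. Note that per-guest QEMU processes are governed separately, via max_files in /etc/libvirt/qemu.conf as described in comment 3.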
With this fix and the other fix, to kernel.pid_max in bz 1389502, which is being worked on for OSP 10 right now, Tim and I are now running a wide variety of I/O tests with 500 guests (and soon 1000 guests) on this configuration; we could not have done that without these two adjustments. Assuming these tests complete without finding more scaling limitations, we can say that OSP 10 with these two fixes supports much greater scalability with Ceph than previous releases.

Giulio Fidente (comment #8):

is this a duplicate of BZ 1372589?

(In reply to Giulio Fidente from comment #8)
> is this a duplicate of BZ 1372589?

It sounds like it would accomplish the same overall goal of increasing the FD limits automatically, without user intervention.

Thanks Tim, marking this as a duplicate since the other bug is a bit older. Hopefully we can get it fixed quickly.

*** This bug has been marked as a duplicate of bug 1372589 ***