Description of problem:
In our OSP10d deployed env, we had to increase the file descriptor limit for libvirtd in order to avoid problems with the computes communicating with the Ceph monitors, which resulted in qemu-kvm process hangs.
[root@overcloud-novacompute-0 qemu]# ps -lef|grep libvirtd |grep -v grep
4 S root 104929 1 0 80 0 - 628325 poll_s 14:24 ? 00:00:55 /usr/sbin/libvirtd --listen
[root@overcloud-novacompute-0 qemu]# grep "open files" /proc/104929/limits
Max open files 1024 4096 files
This env has 21 computes backed by 1043 Ceph OSDs. Ceph tracker http://tracker.ceph.com/issues/17573 provides a detailed description of the KVM guest hang that we can reproduce and Ceph development has confirmed that Ceph librbd will open a TCP socket to every OSD and keep it open.
If left unchanged, RHOSP 10 will fail to scale on RHCS 2.0.
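The arithmetic makes the failure inevitable at this scale. A quick sanity check, using the numbers from this environment:

```shell
# librbd keeps one TCP socket open per OSD it has touched. With 1043 OSDs,
# the default soft limit of 1024 is exceeded before counting qemu's other
# descriptors (disk images, vnc, monitor sockets, ...).
osds=1043
soft_limit=1024
echo $(( osds - soft_limit ))   # 19 short even with zero other fds open
```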
Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux Server release 7.3 Beta (Maipo)
ceph-*.x86_64 1:10.2.2-41.el7cp @rhos-10.0-ceph-2.0-mon-signed
openstack-selinux.noarch 0.7.9-1.el7ost @rhos-10.0-puddle
puppet-ceph.noarch 2.2.0-1.el7ost @rhos-10.0-puddle
puppet-openstack_extras.noarch 9.4.0-1.el7ost @rhos-10.0-puddle
puppet-vswitch.noarch 5.4.0-1.el7ost @rhos-10.0-puddle
python-rados.x86_64 1:10.2.2-41.el7cp @rhos-10.0-ceph-2.0-mon-signed
python-rbd.x86_64 1:10.2.2-41.el7cp @rhos-10.0-ceph-2.0-mon-signed
Steps to Reproduce:
1. set nofile limit low ... ulimit -n 512
2. start sequential write (16G w/4M transfer size) using librbd
3. observe I/O start then crawl to a stop/hang
Actual results:
'dd' writes and fio tests to the rbd volume hang.
Expected results:
All I/O to Ceph storage completes without hang or error.
This issue was resolved by increasing the nofile limit for libvirtd as follows on every compute node and restarting all KVM guests:
mkdir -p /etc/systemd/system/libvirtd.service.d
echo -e "[Service]\nLimitNOFILE=16384" > /etc/systemd/system/libvirtd.service.d/limits.conf
systemctl daemon-reload
systemctl stop libvirtd
systemctl start libvirtd
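After the restart it is worth confirming the new limit actually took effect, using the same /proc check as at the top of this report (shown here against the current shell; on a compute node substitute libvirtd's PID, e.g. pid=$(pgrep -f /usr/sbin/libvirtd)):

```shell
# The drop-in is in effect when the soft-limit column reads 16384.
grep "open files" /proc/self/limits
```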
This impacts scalability of OSP 10 on Ceph storage, and limits use of the scale lab to test OpenStack with Ceph storage. It does not happen on small configs.
There are several things to note here:
QEMU should not hang when the number of files is too low. If ceph can't open a socket/file during qemu startup, QEMU should fail to start with an appropriate error message. If it happens during later runtime, then the guest OS should be paused. QEMU itself should never hang. Maybe the guest OS pausing was mistaken for a hang?
I don't think we should need to raise the limit of libvirtd here - if it is QEMU having the problem, then we should raise the QEMU limit in /etc/libvirt/qemu.conf via the 'max_files' parameter.
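For reference, the qemu.conf route would look roughly like this (a sketch against a throwaway copy of the file; the real file is /etc/libvirt/qemu.conf and libvirtd must be restarted afterwards):

```shell
# max_files in qemu.conf sets the nofile limit libvirt applies to each
# QEMU process it spawns, which is where the rbd sockets actually live.
conf=$(mktemp)
printf '#max_files = 1024\n' > "$conf"   # stand-in for /etc/libvirt/qemu.conf
# Uncomment and raise max_files:
sed -i 's/^#\{0,1\}max_files.*/max_files = 16384/' "$conf"
grep '^max_files' "$conf"
rm -f "$conf"
```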
There's probably a reasonable argument to be made for libvirt to ship with a higher default limit. The Linux kernel default ulimits haven't changed in decades, and are very pessimistically low IMHO given current hardware scale.
(In reply to Daniel Berrange from comment #3)
> QEMU itself should never hang. Maybe the guest OS
> pausing was mistaken for a hang ?
Yes, that could have been the case. I should have been more specific. I observed all I/O to the ceph backed cinder volumes stop, even though many guests were still running FIO jobs.
Correction: I/O stops on each of the guests affected by the problem.
> QEMU should not hang when the number of files is too low. If ceph can't open a socket/file during qemu startup, QEMU should fail to start with an appropriate error message. If it happens during later runtime, then the guest OS should be paused. QEMU itself should never hang. Maybe the guest OS pausing was mistaken for a hang?
Ceph librbd doesn't actually open the TCP socket to the OSD until it needs to talk to it. You can see the number of sockets growing as the application accesses more of the volume, because it is then hitting new OSDs (block devices on Ceph servers) that it didn't need before. To see this, try creating a fio volume, pre-populate it with data, and then:
fio --ioengine=rbd --clientname=admin --pool=ben --rbdname=v3 --rw=randread --ramp_time=30 --size=16g --bs=4k --runtime=50 --rate_iops=100 --name=foo > /tmp/fio.log 2>&1 &
In this example, we'll see librbd gradually create a socket for every OSD containing data associated with the RBD volume. For large enough RBD volumes, it could be *all* of them - since OSDs are chosen at random for replication, it's hard to predict when the threshold of 1024 file descriptors will be crossed. For small Ceph clusters you might never cross this threshold. But if you want OpenStack to be scalable with Ceph-backed storage, you don't want to hit this threshold.
(root@c07-h01-6048r) - (18:28) - (~)
-=>>while [ 1 ] ; do netstat -anp | grep fio | wc -l ; sleep 2 ; done
138
+  Done    fio --ioengine=rbd --clientname=admin --pool=ben --rbdname=v3 --rw=randread --size=16g --bs=4k --runtime=50 --rate_iops=20 --name=foo > /tmp/fio.log 2>&1
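An equivalent check that doesn't need netstat is to count a process's open descriptors straight out of /proc (shown here against the current shell; point it at the qemu-kvm or fio PID instead to watch the socket count grow):

```shell
# Number of open fds for a given pid; compare against the nofile soft limit.
pid=$$
ls /proc/$pid/fd | wc -l
```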
Any chance of doing this fix in OSP 10 to improve its scalability?
At the bottom of the initial post we provided an example fix for the problem, and in comment 6 I show why this fix is necessary for librbd.
On re-reading comment 3, I see that there is more than one way to adjust the file descriptor limit, as Daniel Berrange suggested, and I defer to the developers on which way is best. All that is needed is for OpenStack to automate deployment of that adjustment so that every guest can create enough fds to access Ceph storage.
For example, one customer we work with routinely runs many OpenStack+Ceph clusters with 500 OSDs each, and they want to move to 1000-OSD deployments but are afraid to because of problems with guest response times or hangs. Hangs are exactly what I encountered when exceeding the FD limit; see the steps to reproduce above.
Guests encounter them too, after accessing enough of their Cinder volumes to hit the fd limit. They may not hit the problem until long after they have started running the application, and it may not occur consistently, which adds to the difficulty of diagnosis. We reduced the ulimit in this report only to make the problem easy to reproduce; lowering ulimit -n is not necessary to encounter it, as long as the OSD count is greater than or equal to the file descriptor limit.
Ideally, Ceph should fix the hang in librbd that results from insufficient fds, but it's *far* easier to prevent the problem than to diagnose and fix it after the fact. Without automation, you would have to apply the fix documented above by hand on every compute host, restart libvirtd on all of them, and then stop and start your guests, am I right? For some users that can be very disruptive.
With this fix and the kernel.pid_max fix in bz 1389502, which is being worked on for OSP10 right now, Tim and I are running a wide variety of I/O tests with 500 guests (and soon 1000 guests) on this configuration; we could not have done that without these two adjustments. Assuming these tests complete without uncovering more scaling limitations, we can say that OSP10 with these two fixes supports much greater scalability with Ceph than previous releases.
Is this a duplicate of BZ 1372589?
(In reply to Giulio Fidente from comment #8)
> is this a duplicate of BZ 1372589?
It sounds like it would accomplish the same overall goal of increasing the FD limits automatically without user intervention.
Thanks Tim, marking this as duplicate as the other one is a bit older. Hopefully we can get it fixed quickly.
*** This bug has been marked as a duplicate of bug 1372589 ***