Bug 1321570
Summary: | [RFE] LIBRADOS modify Pipe::connect() to return the error code | | |
---|---|---|---|
Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Vikhyat Umrao <vumrao> |
Component: | RADOS | Assignee: | Vikhyat Umrao <vumrao> |
Status: | CLOSED WONTFIX | QA Contact: | Manohar Murthy <mmurthy> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 1.3.2 | CC: | anharris, assingh, bhubbard, ceph-eng-bugs, dzafman, flucifre, gfarnum, gveitmic, hnallurv, jdurgin, kchai, kdreyer, nojha, uboppana |
Target Milestone: | rc | Keywords: | FutureFeature |
Target Release: | 4.* | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Enhancement | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2019-09-27 23:24:29 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Vikhyat Umrao
2016-03-28 12:32:03 UTC
- Actually this started from checking whether librbd logs a proper error message when it runs out of FDs. Below is the test case I used to hit the run-time FD count limit:

- Created an instance:

```
# nova list
+--------------------------------------+---------+--------+------------+-------------+-------------------------+
| ID                                   | Name    | Status | Task State | Power State | Networks                |
+--------------------------------------+---------+--------+------------+-------------+-------------------------+
| 048834b3-a49f-4eed-b2f0-6ef1694cbb0e | testvm2 | ACTIVE | -          | Running     | internal=192.168.122.65 |
+--------------------------------------+---------+--------+------------+-------------+-------------------------+

# ps -ef | grep kvm
root      1004     2  0 Feb03 ?  00:00:00 [kvm-irqfd-clean]
qemu     55757     1 15 10:29 ?  00:00:02 /usr/libexec/qemu-kvm -name instance-00000053 -S -machine pc-i440fx-rhel7.2.0,accel=kvm,usb=off -cpu Haswell-

# cat /proc/55757/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             62505                62505                processes
Max open files            1024                 4096                 files

# echo /proc/55757/fd/* | wc -w
32
```

- Then I changed Max open files at run time with prlimit:

```
$ prlimit --pid 55757 --nofile=36:40

# cat /proc/55757/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             62505                62505                processes
Max open files            36                   40                   files

# echo /proc/55757/fd/* | wc -w
36

# virsh list
 Id    Name                           State
----------------------------------------------------
 2     instance-00000053              running
```

- As soon as I attach an RBD device, qemu opens a couple of FDs, the count reaches 36, and the rbd device is stuck in the attaching state. We then start getting the following in the ceph client-side logs:

```
7f0b769e7700 -1 -- 192.168.128.30:0/2021513 >> 192.168.128.35:6800/24374 pipe(0x7f0bcabc0000 sd=-1 :0 s=1 pgs=0 cs=0 l=1 c=0x7f0bc55e1ce0).connect couldn't created socket (24) Too many open files
```

I have also tested this with the soft and hard limit set to the same value (`prlimit --pid 55757 --nofile=36:36`), with the same results.

- But the qemu logs have no trace of "Too many open files" when the qemu process runs out of FDs at run time.

- I then asked one of our QEMU/KVM colleagues to look into this from the QEMU/KVM side; below are his findings. Thanks Germano Veit Michel.

----------------------------------------------------------------------------

In qemu there is a function called bdrv_open_common, which is responsible for opening disks (opening an FD to the disk) that are later attached to the VM. This is the part we are interested in:

```c
    /* Open the image, either directly or using a protocol */
    if (drv->bdrv_file_open) {
        assert(file == NULL);
        assert(!drv->bdrv_needs_filename || filename != NULL);
        ret = drv->bdrv_file_open(bs, options, open_flags, &local_err);  /* [1] calls the driver/format-specific
                                                                                function for each image type
                                                                                (qcow2, raw, ...) */
    } else {
        if (file == NULL) {
            error_setg(errp, "Can't use '%s' as a block driver for the "
                       "protocol level", drv->format_name);
            ret = -EINVAL;
            goto free_and_fail;
        }
        bs->file = file;
        ret = drv->bdrv_open(bs, options, open_flags, &local_err);  /* [2] slightly different entry point (protocol) */
    }

    if (ret < 0) {
        if (local_err) {
            error_propagate(errp, local_err);
        } else if (bs->filename[0]) {
            error_setg_errno(errp, -ret, "Could not open '%s'", bs->filename);  /* [3] where the -ENFILE is printed in qemu */
        } else {
            error_setg_errno(errp, -ret, "Could not open image");
        }
        goto free_and_fail;
    }
```

And it works for qcow2/raw with no need to add debugging options. [3] produces the following log for qcow2 images:

> 2016-03-28T04:35:47.843808Z Could not open '/rhev/data-center/00000001-0001-0001-0001-000000000227/fa020229-5f5f-4ba8-993d-4ac212a13217/images/ab6d3ee8-1bfc-43e1-af8c-c3400439f923/c300ecfe-3e22-4840-827c-478765a4e7ec': Too many open files

For ceph/rbd there are a bunch of other functions involved (librbd) and the backtrace is a bit different, but it still gets to the point above, provided the qemu process has enough FDs to reach it. So let's see what really happens when we are using ceph/rbd:

```
# ps -ef | grep qemu
qemu     47428     1  3 Mar22 ?  05:06:17 /usr/libexec/qemu-kvm -name instance-00000058 [...]

# echo /proc/47428/fd/* | wc -w
32

# prlimit --pid 47428 --nofile=32:32

# cat /proc/47428/limits | egrep "Limit|Max open files"
Limit                     Soft Limit           Hard Limit           Units
Max open files            32                   32                   files

# gdb /usr/libexec/qemu-kvm 47428
```

First of all, I can see it takes ~5 FDs for an RBD image to be successfully attached to the VM.
If we limit it at 32, so that no additional FD can be opened, and then attach a disk:

```
# nova volume-attach 13759d04-5323-4824-be05-5afc6ee883f6 61c9acac-4b7f-4d4c-bd8e-0ef0886b2951
```

it fails here, when trying to open the first FD in qemu_rbd_open:

```c
    r = rados_connect(s->cluster);             /* <---- hangs here for a couple of minutes */
    if (r < 0) {
        error_setg(errp, "error connecting");  /* <---- error printed in the qemu logs */
        goto failed_shutdown;
    }
```

And it returns -110 and goes to failed_shutdown:

```
(gdb) p r
$3 = -110
```

And we see this in the qemu logs, which is exactly the error in the function above:

> 2016-03-28T05:51:10.208565Z error connecting

But -110 is ETIMEDOUT. So we know rados_connect failed because it did not have enough FDs, but it returned -110 to qemu. At this point there is no way for qemu to log that it failed due to "too many open files", since it was handed -ETIMEDOUT. So we have TWO problems here:

1) qemu does not print the error code (ETIMEDOUT): use error_setg_errno with r instead of error_setg?
2) librbd does not return the proper error code (EMFILE)

Moving on, if we allow it to open 3 FDs, it fails slightly later in the same function, here:

```c
    r = rbd_open(s->io_ctx, s->name, &s->image, s->snap);
    if (r < 0) {
        error_setg(errp, "error reading header from %s", s->name);
        goto failed_open;
    }
```

This also hangs for quite a few minutes. I might not have waited long enough (~15 minutes now), but I assume it will also eventually fail with some unrelated error and qemu will print (or not) another vague message. I assumed that if we got the number of allowed FDs exactly right, we might hit this error within the qemu-kvm code and, depending on which open call fails, it would log the appropriate error. But when these librbd functions return -ETIMEDOUT, I can't see how qemu can log the correct error. So we don't see the "Too many open files".
We may need to check all the cases for 0, 1, 2, 3, 4, and 5 FDs, as the qemu-kvm process fails in different functions depending on how many FDs it is allowed to open.

TODO:
1) Check all points where qemu<->rbd may fail due to lack of FDs (prlimit on 0, 1, 2, 3, 4, 5 and 6 FDs)
2) Check all return codes from the librbd functions that hang/fail due to lack of FDs
3) Make qemu print all error codes from librbd

--------------------------------------------------------------------------------------------------------------------

As per Germano's findings, we have TWO problems here:

> 1) qemu does not print the error code (ETIMEDOUT) use error_setg_errno with r instead of error_setg?

I think this needs to be handled at the qemu layer.

> 2) librbd does not return the proper error code (EMFILE)

For this, as per my analysis in comment#0, the failure path in Pipe::connect() looks like:

```cpp
fail:
  if (conf->ms_inject_internal_delays) {
    ldout(msgr->cct, 10) << " sleep for " << msgr->cct->_conf->ms_inject_internal_delays << dendl;
    utime_t t;
    t.set_from_double(msgr->cct->_conf->ms_inject_internal_delays);
    t.sleep();
  }

  pipe_lock.Lock();
fail_locked:
  if (state == STATE_CONNECTING)
    fault();
  else
    ldout(msgr->cct,3) << "connect fault, but state = " << get_state_name()
                       << " != connecting, stopping" << dendl;

stop_locked:
  delete authorizer;
  return -1;
```

Here we should `return -errno` instead of `return -1`.

Josh, this is the same issue which I discussed with you some time back: we need to check whether librbd logs anything for *EMFILE*.

If the specific error codes here are something we want to see, it will require a lot more than returning the errno from this stop_locked label. We invoke this code path from a lot of different locations with a lot of different statuses. Some of them will have the real error code in errno; some in the return value (which is thrown away in a conditional check!); some are because of protocol errors of various kinds that don't really have matching return codes.

If this requires significant refactoring, I'm inclined to close it as 'Won't Fix'. Does the async messenger have this problem?

I'm not sure it requires refactoring, but it needs some actual thought. Probably specifying a function-local error code and making sure that gets set everywhere before jumping through to the failure handling.

Thanks Greg and Sam for your inputs.
Another issue here is that even if we fix this in the Ceph layer, qemu still has its own limitation:
> 1) qemu does not print the error code (ETIMEDOUT) use error_setg_errno with r instead of error_setg?
Given that, instead of closing this as 'Won't Fix', would it be okay to take it as an RFE?
Sure. The kind of work involved, while detailed, also doesn't require much of anything in the way of background knowledge — this is probably something sustaining engineering can do?

RFE it is.

Thanks Greg and Sam. Sure, I will look into it. For now I am assigning it to myself; if I come across any difficulties I will discuss them with you.

Created upstream ceph tracker: http://tracker.ceph.com/issues/15308

> Does async messenger have this problem?

The async messenger has this problem as well; see AsyncConnection::_process_connection() and search for "sd = net.nonblock_connect(get_peer_addr());".

@vikhyat, as we are *posting* messages, the Pipe and Messengers work asynchronously. Normally, when it comes to rbd_open(), we start an operation and wait on it; see ImageState<I>::open(). As Greg put it:

> Probably specifying a function-local error code and making sure that gets set everywhere before jumping through to the failure handling.

Yeah, we need to save the errno somewhere and assign it to the rval of the context of the request.

Thanks Kefu for your inputs and the nice discussion.

- So we need to fix the connect function definitions in both messengers: simple messenger and async messenger.
- Kefu and I then discussed this in more detail and concluded that we also need to fix all the callers of these functions to use proper return codes.
- I checked the parent functions of AsyncConnection::_process_connection() and they look good: _process_connection() calls nonblock_connect(), which calls generic_connect(), and the generic_connect() definition is proper.
```cpp
int NetHandler::generic_connect(const entity_addr_t& addr, bool nonblock)
{
  int ret;
  int s = create_socket(addr.get_family());
  if (s < 0)
    return s;

  if (nonblock) {
    ret = set_nonblock(s);
    if (ret < 0) {
      close(s);
      return ret;
    }
  }

  set_socket_options(s);

  ret = ::connect(s, (sockaddr*)&addr.addr, addr.addr_size());
  if (ret < 0) {
    if (errno == EINPROGRESS && nonblock)
      return s;

    ldout(cct, 10) << __func__ << " connect: " << strerror(errno) << dendl;
    close(s);
    return -errno;
  }

  return s;
}
```

Hello Josh, thanks for your help and the nice discussion. As discussed, I have sent the first part of the Pipe::connect() fix: https://github.com/ceph/ceph/pull/8416. I will work on the qemu patch for the error number fix.

Qemu bug for this issue: https://bugzilla.redhat.com/show_bug.cgi?id=1329641

Part 2 tracker: http://tracker.ceph.com/issues/16310

Included in 10.2.4 upstream.

Hi Josh, I think we should move this to the next release because only the first part is fixed, in 10.2.4. The second part of the issue is still pending in this upstream tracker: http://tracker.ceph.com/issues/16310

Thank you, Vikhyat

Good point Vikhyat. Moving to 3.0 since there's no target for 2.4. http://tracker.ceph.com/issues/16310 still needs to be done upstream.