Bug 1321570

Summary: [RFE] LIBRADOS modify Pipe::connect() to return the error code
Product: Red Hat Ceph Storage (Red Hat Storage)
Component: RADOS
Version: 1.3.2
Hardware: x86_64
OS: Linux
Severity: medium
Priority: medium
Status: CLOSED WONTFIX
Reporter: Vikhyat Umrao <vumrao>
Assignee: Vikhyat Umrao <vumrao>
QA Contact: Manohar Murthy <mmurthy>
CC: anharris, assingh, bhubbard, ceph-eng-bugs, dzafman, flucifre, gfarnum, gveitmic, hnallurv, jdurgin, kchai, kdreyer, nojha, uboppana
Keywords: FutureFeature
Target Milestone: rc
Target Release: 4.*
Type: Bug
Doc Type: Enhancement
Last Closed: 2019-09-27 23:24:29 UTC

Description Vikhyat Umrao 2016-03-28 12:32:03 UTC
Description of problem:
The librados Pipe::connect() function does not return the correct error code.


Version-Release number of selected component (if applicable):
ceph-common-0.94.3-3.el7cp.x86_64
ceph master branch 

How reproducible:
Always 


Steps to Reproduce:
1. Create a qemu instance.
2. Reduce the FD limits with the prlimit command.
3. Try to attach an RBD image to this qemu instance.
4. The Ceph client-side log file will contain messages like the following:

7f0b769e7700 -1 -- 192.168.128.30:0/2021513 >> 192.168.128.35:6800/24374 pipe(0x7f0bcabc0000 sd=-1 :0 s=1 pgs=0 cs=0 l=1 c=0x7f0bc55e1ce0).connect couldn't created socket (24) Too many open files

5. The QEMU log, however, will contain nothing about this error, because the rados_connect() call in QEMU does not receive the proper return code from librados Pipe::connect(), as the sketch below illustrates.
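
For illustration, here is a minimal sketch (not taken from QEMU or from this bug) of a librados caller: it can only report whatever rados_connect() hands back, so if librados collapses the socket failure into -1 or -ETIMEDOUT instead of -EMFILE, the "Too many open files" cause is lost at this layer. The configuration path and client name below are just examples.

/* Sketch only: a librados caller (QEMU behaves similarly) logging whatever
 * error code rados_connect() returns. */
#include <rados/librados.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    rados_t cluster;
    int r = rados_create(&cluster, "admin");               /* client.admin, for example */
    if (r < 0) {
        fprintf(stderr, "rados_create failed: %s\n", strerror(-r));
        return 1;
    }
    rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");  /* example config path */

    r = rados_connect(cluster);
    if (r < 0) {
        /* With the bug described above, this prints the wrong reason
         * (e.g. "Connection timed out") instead of "Too many open files". */
        fprintf(stderr, "rados_connect failed: %s\n", strerror(-r));
        return 1;
    }
    rados_shutdown(cluster);
    return 0;
}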


Ceph source file : src/msg/simple/Pipe.cc

 870 int Pipe::connect()
 871 {
 872   bool got_bad_auth = false;
 873 
 874   ldout(msgr->cct,10) << "connect " << connect_seq << dendl;
 875   assert(pipe_lock.is_locked());

....................
.....................

 901   // create socket?
 902   sd = ::socket(peer_addr.get_family(), SOCK_STREAM, 0);
 903   if (sd < 0) {
 904     lderr(msgr->cct) << "connect couldn't created socket " << cpp_strerror(errno) << dendl;
 905     goto fail;
 906   }

- "connect couldn't created socket (24) Too many open files"
- This error comes from the socket() call inside the Pipe::connect() function.

- Here we jump to *fail*, but the *fail* path just returns -1, when it should return *-errno*.

1215  fail:
1216   if (conf->ms_inject_internal_delays) {
1217     ldout(msgr->cct, 10) << " sleep for " << msgr->cct->_conf->ms_inject_internal_delays << dendl;
1218     utime_t t;
1219     t.set_from_double(msgr->cct->_conf->ms_inject_internal_delays);
1220     t.sleep();
1221   }
1222 
1223   pipe_lock.Lock();
1224  fail_locked:
1225   if (state == STATE_CONNECTING)
1226     fault();
1227   else
1228     ldout(msgr->cct,3) << "connect fault, but state = " << get_state_name()
1229                        << " != connecting, stopping" << dendl;
1230 
1231  stop_locked:
1232   delete authorizer;
1233   return -1;
1234 }
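
One possible direction, sketched below as a small standalone example (not the actual Ceph code and not the eventual patch), is a function-local error code: set a local variable to -errno at each failure site and return it from the shared failure label instead of the hard-coded -1.

/* Standalone sketch of the proposed pattern; the real connect logic is elided. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int connect_sketch(void)
{
    int ret = -1;                          /* function-local error code */
    int sd = socket(AF_INET, SOCK_STREAM, 0);
    if (sd < 0) {
        ret = -errno;                      /* e.g. -EMFILE: "Too many open files" */
        goto fail;
    }

    /* ... the rest of the connect logic would go here ... */
    close(sd);
    return 0;

fail:
    /* shared failure path: clean up, then propagate the saved error code */
    return ret;
}

int main(void)
{
    int r = connect_sketch();
    if (r < 0)
        fprintf(stderr, "connect failed: %s\n", strerror(-r));
    return 0;
}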

Comment 1 Vikhyat Umrao 2016-03-28 12:37:27 UTC
- This actually started from checking whether librbd logs a proper error message when it runs out of FDs.

- To test hitting the FD limit at run time, I used the following steps:

- Created an instance:

# nova list
+--------------------------------------+---------+--------+------------+-------------+-------------------------+
| ID                                   | Name    | Status | Task State | Power State | Networks                |
+--------------------------------------+---------+--------+------------+-------------+-------------------------+
| 048834b3-a49f-4eed-b2f0-6ef1694cbb0e | testvm2 | ACTIVE | -          | Running     | internal=192.168.122.65 |
+--------------------------------------+---------+--------+------------+-------------+-------------------------+

# ps -ef | grep kvm
root       1004      2  0 Feb03 ?        00:00:00 [kvm-irqfd-clean]
qemu      55757      1 15 10:29 ?        00:00:02 /usr/libexec/qemu-kvm -name instance-00000053 -S -machine pc-i440fx-rhel7.2.0,accel=kvm,usb=off -cpu Haswell-

# cat /proc/55757/limits 
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             62505                62505                processes 
Max open files            1024                 4096                 files     
      
# echo /proc/55757/fd/* | wc -w
32

- Then I changed Max open files at run time with prlimit:

# prlimit --pid 55757 --nofile=36:40

# cat /proc/55757/limits 
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             62505                62505                processes 
Max open files            36                   40                   files     


# echo /proc/55757/fd/* | wc -w
36

# virsh list
 Id    Name                           State
 ----------------------------------------------------
 2     instance-00000053              running

- As soon as I attach an RBD device, QEMU opens a couple of FDs, the count reaches 36, and the RBD device remains stuck in the attaching state.

- And the following messages start appearing in the Ceph client-side log:

7f0b769e7700 -1 -- 192.168.128.30:0/2021513 >> 192.168.128.35:6800/24374 pipe(0x7f0bcabc0000 sd=-1 :0 s=1 pgs=0 cs=0 l=1 c=0x7f0bc55e1ce0).connect couldn't created socket (24) Too many open files

I also tested with the soft and hard limits set to the same value, with the same result:

# prlimit --pid 55757 --nofile=36:36

- But the QEMU log has no trace of "Too many open files" when the QEMU process runs out of FDs at run time.

Comment 2 Vikhyat Umrao 2016-03-28 12:40:06 UTC
- I then asked one of our QEMU/KVM colleagues to look into this from the QEMU/KVM side; his findings are below. Thanks, Germano Veit Michel.

----------------------------------------------------------------------------
In QEMU, a function called bdrv_open_common is responsible for opening disks (opening an FD to the disk image), which are later attached to the VM.

This is the part that we are interested in:

    /* Open the image, either directly or using a protocol */
    if (drv->bdrv_file_open) {
        assert(file == NULL);
        assert(!drv->bdrv_needs_filename || filename != NULL);
        ret = drv->bdrv_file_open(bs, options, open_flags, &local_err);  <--------------- [1] Calls driver/format specific function for each image type (qcow2, raw...)
    } else { 
        if (file == NULL) {
            error_setg(errp, "Can't use '%s' as a block driver for the "
                       "protocol level", drv->format_name);
            ret = -EINVAL;
            goto free_and_fail;
        }
        bs->file = file;
        ret = drv->bdrv_open(bs, options, open_flags, &local_err);      <--------------- [2] Slightly different entry point (protocol)
    }

    if (ret < 0) {
        if (local_err) {
            error_propagate(errp, local_err);
        } else if (bs->filename[0]) {
            error_setg_errno(errp, -ret, "Could not open '%s'", bs->filename); <------------ [3] This is where the -ENFILE is printed in qemu
        } else {
            error_setg_errno(errp, -ret, "Could not open image");
        }
        goto free_and_fail;
    }

And it works for qcow2/raw, no need to add debugging options. [3] produces the following log for qcow2 images:
> 2016-03-28T04:35:47.843808Z Could not open '/rhev/data-center/00000001-0001-0001-0001-000000000227/fa020229-5f5f-4ba8-993d-4ac212a13217/images/ab6d3ee8-1bfc-43e1-af8c-
> c3400439f923/c300ecfe-3e22-4840-827c-478765a4e7ec': Too many open files


Well, for ceph/rbd there are a bunch of other functions involved (librbd) and the backtrace is a bit different. But still, it gets to the point above provided the qemu process has enough FDs to reach that point.

So, let's see what really happens when we are using ceph/rbd:

# ps -ef | grep qemu
qemu      47428      1  3 Mar22 ?        05:06:17 /usr/libexec/qemu-kvm -name instance-00000058 [...]

# echo /proc/47428/fd/* | wc -w
32

# prlimit --pid 47428 --nofile=32:32

# cat /proc/47428/limits | egrep "Limit|Max open files"
Limit                     Soft Limit           Hard Limit           Units     
Max open files            32                   32                   files

# gdb /usr/libexec/qemu-kvm 47428

First of all, I can see it takes ~5 FDs for an RBD image to be successfully attached to the VM. If we limit it to 32, that is, no additional FD can be opened, and then attach a disk, as below:

# nova volume-attach 13759d04-5323-4824-be05-5afc6ee883f6 61c9acac-4b7f-4d4c-bd8e-0ef0886b2951

Then it fails here, when trying to open the first FD in qemu_rbd_open:
   
    r = rados_connect(s->cluster);  <---- it hangs here for a couple minutes
    if (r < 0) {
        error_setg(errp, "error connecting");  <---------------- error printed in qemu logs
        goto failed_shutdown;
    }

    And it returns -110, goes to failed_shutdown...

(gdb) p	r
$3 = -110

And we see this in qemu logs, which is the exact error in the function above.

> 2016-03-28T05:51:10.208565Z error connecting

But -110 is ETIMEDOUT. So we know rados_connect failed because it did not have enough FDs, but it returned -110 to qemu. At this point there is no way for qemu to log that it failed due to "too many open files", since all it received was -ETIMEDOUT.

So we have TWO problems here:
1) qemu does not print the error code (ETIMEDOUT); should it use error_setg_errno with r instead of error_setg? (see the sketch after this list)
2) librbd does not return the proper error code (EMFILE)
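
For problem 1, a possible qemu-side change (just a sketch, not a submitted patch) would be to report the returned value with error_setg_errno(), so whatever code librbd/librados hands back shows up in the log. Reusing the fragment of qemu_rbd_open quoted above:

    r = rados_connect(s->cluster);
    if (r < 0) {
        /* sketch: appends the errno text, e.g. "error connecting: Connection
         * timed out" today, and "error connecting: Too many open files" once
         * librados returns the real code */
        error_setg_errno(errp, -r, "error connecting");
        goto failed_shutdown;
    }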

Moving on, if we allow it to open 3 FDs, it fails slightly later in the same function, here:

    r = rbd_open(s->io_ctx, s->name, &s->image, s->snap);
    if (r < 0) {
        error_setg(errp, "error reading header from %s", s->name);
        goto failed_open;
    }

Also hangs for quite a few minutes. I might not have waited long enough (~15 minutes now), but I assume it will also eventually fail with some unrelated ERROR and qemu will print (or not) another vague message.

I assumed that if we get the number of allowed FDs to open right, we may hit this error within qemu-kvm code and, depending on the open call that fails, it will log the appropriate error. But when these librbd functions return -ETIMEDOUT I can't see how qemu can log the correct error. So we don't see the "Too many open files".

We may need to check all the cases, for 0, 1, 2, 3, 4, 5 FDs, since depending on how many FDs the qemu-kvm process is allowed to open, it fails in different functions.

TODO:
1) Check all points qemu<->rbd may fail due to lack of FDs (prlimit on 0, 1, 2, 3, 4, 5 and 6 FDs)
2) Check all return codes from librbd in the functions that hung/failed due to lack of FDs
3) Make qemu print all error codes from librbd
--------------------------------------------------------------------------------------------------------------------

Comment 3 Vikhyat Umrao 2016-03-28 12:44:02 UTC
As per Germano's findings, we have TWO problems here:

> 1) qemu does not print the error code (ETIMEDOUT) use error_setg_errno with r instead of error_setg?

- I think this needs to be handled at the qemu layer.

> 2) librbd does not return the proper error code (EMFILE)

- and for this, my analysis is in comment#0:

1215  fail:
1216   if (conf->ms_inject_internal_delays) {
1217     ldout(msgr->cct, 10) << " sleep for " << msgr->cct->_conf->ms_inject_internal_delays << dendl;
1218     utime_t t;
1219     t.set_from_double(msgr->cct->_conf->ms_inject_internal_delays);
1220     t.sleep();
1221   }
1222 
1223   pipe_lock.Lock();
1224  fail_locked:
1225   if (state == STATE_CONNECTING)
1226     fault();
1227   else
1228     ldout(msgr->cct,3) << "connect fault, but state = " << get_state_name()
1229                        << " != connecting, stopping" << dendl;
1230 
1231  stop_locked:
1232   delete authorizer;
1233   return -1;
1234 }

Here we should *return -errno* instead of *return -1*.

Comment 4 Vikhyat Umrao 2016-03-28 12:57:58 UTC
Josh, this is the same issue I discussed with you some time back: we need to check whether librbd logs properly for *EMFILE*.

Comment 5 Greg Farnum 2016-03-28 14:00:43 UTC
If the specific error codes here are something we want to see, it'll require a lot more than returning the errno from this stop_locked label. We invoke this code path from a lot of different locations with a lot of different statuses. Some of them will have the real error code in errno; some in the return value (which is thrown away in a conditional check!); some are because of protocol errors of various kinds that don't really have matching return codes.

Comment 6 Samuel Just 2016-03-28 14:43:33 UTC
If this requires significant refactoring, I'm inclined to close it with 'Won't Fix'.  Does async messenger have this problem?

Comment 7 Greg Farnum 2016-03-28 16:06:19 UTC
I'm not sure it requires refactoring, but some actual thought. Probably specifying a function-local error code and making sure that gets set everywhere before jumping through to the failure handling.

Comment 8 Vikhyat Umrao 2016-03-29 08:58:02 UTC
Thanks Greg and Sam for your inputs. 

Another issue: even if we fix this in the Ceph layer, QEMU still has the limitation that:

> 1) qemu does not print the error code (ETIMEDOUT) use error_setg_errno with r instead of error_setg?

Given that, instead of 'Won't Fix', would it be okay to take this as an RFE?

Comment 9 Greg Farnum 2016-03-29 14:03:25 UTC
Sure. The kind of work involved, while detailed, also doesn't require much of anything in the way of background knowledge — this is probably something sustaining engineering can do?

Comment 11 Samuel Just 2016-03-29 14:46:34 UTC
RFE it is.

Comment 12 Vikhyat Umrao 2016-03-29 15:07:13 UTC
Thanks Greg and Sam. Sure, I will look into it; for now I am assigning it to myself. If I come across any difficulties I will discuss them with you.

Created upstream ceph tracker : http://tracker.ceph.com/issues/15308

Comment 13 Kefu Chai 2016-03-30 10:57:44 UTC
> Does async messenger have this problem?

async has this problem as well, see AsyncConnection::_process_connection(), and search for "sd = net.nonblock_connect(get_peer_addr());".

@vikhyat, since we are *posting* messages, the Pipe and Messengers work asynchronously.

normally, when it comes to rbd_open(), we start an operation, and wait on it, see ImageState<I>::open().

as Greg put it:

> Probably specifying a function-local error code and making sure that gets set everywhere before jumping through to the failure handling.

yeah, we need to save the errno somewhere and assign it to rval of the context of the request.
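
To illustrate the idea with a plain-C sketch (not Ceph code; the names only mirror the Context/rval terminology): the connection saves the -errno from the failed connect, and fault delivery copies it into the rval of the waiting request, so the waiter sees -EMFILE rather than a generic failure.

#include <errno.h>
#include <stdio.h>
#include <string.h>

struct request_ctx {
    int rval;                            /* result delivered to the waiter */
};

struct connection {
    int last_connect_error;              /* saved -errno from the connect path */
    struct request_ctx *pending;         /* request waiting on this connection */
};

/* failure site: save the real error instead of collapsing it to -1 */
static void on_connect_failure(struct connection *c, int saved_errno)
{
    c->last_connect_error = -saved_errno;
}

/* fault delivery: hand the saved error to the waiting request */
static void deliver_fault(struct connection *c)
{
    if (c->pending)
        c->pending->rval = c->last_connect_error;
}

int main(void)
{
    struct request_ctx req = { 0 };
    struct connection conn = { 0, &req };

    on_connect_failure(&conn, EMFILE);   /* socket() failed with EMFILE */
    deliver_fault(&conn);

    printf("request completed with %d (%s)\n", req.rval, strerror(-req.rval));
    return 0;
}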

Comment 14 Vikhyat Umrao 2016-03-30 13:45:51 UTC
Thanks Kefu for your inputs and nice discussion.

- Now we need to fix the connect function in both messengers: the simple messenger and the async messenger.

- Kefu and I then discussed this in more detail and concluded that we also need to fix all the callers of these functions so they propagate proper return codes.

- I checked the call chain under AsyncConnection::_process_connection() and it looks good.
  
   For example: _process_connection() calls nonblock_connect(), which calls generic_connect(), and the generic_connect() definition is proper:

int NetHandler::generic_connect(const entity_addr_t& addr, bool nonblock)
{
  int ret;
  int s = create_socket(addr.get_family());
  if (s < 0)
    return s;

  if (nonblock) {
    ret = set_nonblock(s);
    if (ret < 0) {
      close(s);
      return ret;
    }
  }

  set_socket_options(s);

  ret = ::connect(s, (sockaddr*)&addr.addr, addr.addr_size());
  if (ret < 0) {
    if (errno == EINPROGRESS && nonblock)
      return s;

    ldout(cct, 10) << __func__ << " connect: " << strerror(errno) << dendl;
    close(s);
    return -errno;
  }

  return s;
}

Comment 16 Vikhyat Umrao 2016-04-01 15:08:06 UTC
Hello Josh,

Thanks for your help and nice discussion. 

As discussed, I have sent the first part of the Pipe::connect() fix: https://github.com/ceph/ceph/pull/8416

I will work on the QEMU patch to fix the error-number reporting.

Comment 17 Vikhyat Umrao 2016-04-23 06:40:32 UTC
For the QEMU fix for this issue, see the QEMU bug: https://bugzilla.redhat.com/show_bug.cgi?id=1329641

Comment 18 Vikhyat Umrao 2016-06-15 07:13:55 UTC
Part 2 tracker : http://tracker.ceph.com/issues/16310

Comment 20 Josh Durgin 2017-04-01 01:23:51 UTC
Included in 10.2.4 upstream.

Comment 21 Vikhyat Umrao 2017-04-05 14:12:30 UTC
Hi Josh,

I think we should move this to the next release because only the first part is fixed, in 10.2.4.

The second part of the issue is still pending in this upstream tracker: http://tracker.ceph.com/issues/16310

Thank you,
Vikhyat

Comment 22 Josh Durgin 2017-04-05 16:16:53 UTC
Good point Vikhyat. Moving to 3.0 since there's no target for 2.4.

Comment 24 Josh Durgin 2017-08-30 18:07:11 UTC
http://tracker.ceph.com/issues/16310 still needs to be done upstream.