Bug 1956963 - fio is unusable with ioengine=libaio (does I/O, segfault, no output)
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: fio
Version: 34
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Eric Sandeen
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-05-04 18:46 UTC by Alexey Dobriyan
Modified: 2021-09-09 20:05 UTC

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2021-09-09 20:05:48 UTC
Type: Bug
Embargoed:


Description Alexey Dobriyan 2021-05-04 18:46:16 UTC
Description of problem:
ioengine=libaio is broken with dlopen'ed fio-engine-* packages


Version-Release number of selected component (if applicable):
fio.x86_64                    3.25-3.fc34
fio-engine-libaio.x86_64      3.25-3.fc34


How reproducible:
100%


Steps to Reproduce (run fio with this job file):

[global]
filename=1.dd
direct=1
thread

ioengine=libaio
iodepth=1
bs=4096

[j]
rw=read
size=64M


Actual results:
fio[1521]: segfault at 7f90314458a0 ip 0000558dfaccaa45 sp 00007f9030c2fd40 error 6 in fio[558dfacbb000+76000]
Code: 00 48 c7 83 18 10 00 00 00 00 00 00 48 8b 78 20 48 85 ff 74 1d f6 05 51 81 16 00 04 75 27 e8 52 1f ff ff 48 8b 83 40 42 04 00 <48> c7 40 20 00 00 00 00 48 c7 83 40 42 04 00 00 00 00 00 5b c3 66


Additional info:



00000000000279d0 <free_ioengine@@Base>:
    ...
   27a39:       e8 52 1f ff ff          call   19990 <dlclose@plt>
   27a3e:       48 8b 83 40 42 04 00    mov    rax,QWORD PTR [rbx+0x44240]
   27a45: ***** 48 c7 40 20 00 00 00    mov    QWORD PTR [rax+0x20],0x0
   27a4c:       00
   27a4d:       48 c7 83 40 42 04 00    mov    QWORD PTR [rbx+0x44240],0x0
   27a54:       00 00 00 00
   27a58:       5b                      pop    rbx
   27a59:       c3                      ret

RIP points at the store to td->io_ops->dlhandle right after the dlclose() call, so td->io_ops is bogus at that point.

void free_ioengine(struct thread_data *td)
{
        dprint(FD_IO, "free ioengine %s\n", td->io_ops->name);

        if (td->eo && td->io_ops->options) {
                options_free(td->io_ops->options, td->eo);
                free(td->eo);
                td->eo = NULL;
        }

        if (td->io_ops->dlhandle) {
                dprint(FD_IO, "dlclose ioengine %s\n", td->io_ops->name);
                dlclose(td->io_ops->dlhandle);
     ======>    td->io_ops->dlhandle = NULL;
        }

        td->io_ops = NULL;
}


ioengine=psync works because the sync I/O engine is built in rather than dlopen'ed.

Comment 1 Alexey Dobriyan 2021-05-04 18:49:10 UTC
I wonder if the libaio ioengine should be built in, given the importance of async I/O.

Comment 2 Eric Sandeen 2021-05-05 17:41:55 UTC
Unfortunately the ioengine selection is all or nothing upstream, at least for now.

Would you be able to check/test 3.26 from rawhide?  I think this is probably fixed by this upstream commit:

commit 48ff7df9daea86c82a572b0a840bb8371b6b1a29
Author: Eric Sandeen <sandeen>
Date:   Mon Jan 25 13:23:48 2021 -0600

    fio: fix dlopen refcounting of dynamic engines
    
    ioengine_load() will dlclose the dynamic library if it matches one
    that we've already got open, but this defeats the built-in refcounting
    done by dlopen/dlclose.  As each thread exits, it calls free_ioengine(),
    and this may do a final dlclose on a dynamic ioengine that is still
    in use if we don't have the proper reference count.
    
    Fix this by dropping the explicit dlclose of a "matching" dlopened
    dynamic engine library, and let each dlclose decrement the refcount
    on the engine library as is normal.
    
    This also adds/modifies a couple of debug messages to help track this.
    
    Signed-off-by: Eric Sandeen <sandeen>
    Signed-off-by: Jens Axboe <axboe>


Sorry for not getting that pushed to F34. If you can't test it, I'll just push it; it should resolve this issue. I guess I didn't realize that this was never fixed in F34.
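
As a rough illustration of the refcounting problem described in the commit message above, here is a toy model in C. It does not call libdl and is not fio code: struct toy_lib, toy_dlopen(), toy_dlclose() and toy_engine_mapped_after() are all made-up names that only mimic the open/close counting, under the assumption that every thread loads the same engine library once.

```c
#include <stdbool.h>

/* Toy stand-in for one shared library as tracked by the dynamic loader. */
struct toy_lib {
    int refs;     /* number of outstanding "dlopen" references */
    bool mapped;  /* true while the library would still be in memory */
};

static void toy_dlopen(struct toy_lib *lib)
{
    lib->refs++;
    lib->mapped = true;
}

static void toy_dlclose(struct toy_lib *lib)
{
    /* The last close drops the library from memory. */
    if (lib->refs > 0 && --lib->refs == 0)
        lib->mapped = false;
}

/*
 * nthreads threads each load the engine, then `exited` of them exit and
 * close it. With drop_duplicate set, the loader also closes the library
 * whenever it was already open (the behavior the commit removes).
 * Returns whether the library is still mapped afterwards.
 */
bool toy_engine_mapped_after(int nthreads, int exited, bool drop_duplicate)
{
    struct toy_lib lib = { 0, false };

    for (int i = 0; i < nthreads; i++) {
        bool already_open = lib.refs > 0;

        toy_dlopen(&lib);
        if (drop_duplicate && already_open)
            toy_dlclose(&lib);  /* undoes this thread's reference */
    }

    for (int i = 0; i < exited; i++)
        toy_dlclose(&lib);      /* what each exiting thread does */

    return lib.mapped;
}
```

In this model, two threads with the extra close leave only one reference, so the first thread to exit unmaps the library while the other is still using it. Without the extra close, the library stays mapped until the second thread exits.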

Comment 3 Alexey Dobriyan 2021-05-06 21:53:04 UTC
I rebuilt fio-3.26-1.fc35.src.rpm and installed it on F34.

It does NOT work; it segfaults in the same place:

    td->io_ops->dlhandle = NULL;

Comment 4 Eric Sandeen 2021-05-06 23:03:05 UTC
Hrm, thank you for testing.  I'll dig into this, I guess it's yet another, different problem w/ the dlopen'd ioengines...

Comment 5 Eric Sandeen 2021-05-07 01:35:04 UTC
I have trouble tracking what's going on with these dynamic engines.

I /think/

                dlclose(td->io_ops->dlhandle);
                td->io_ops->dlhandle = NULL;

segfaults because the dlclose actually unmaps the library that holds the structure io_ops points to, and therefore we can no longer reference io_ops->dlhandle.
It seems that we should simply not try to set dlhandle to NULL after the dlclose.
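
A minimal sketch of that idea, assuming heavily trimmed stand-ins for fio's structures (struct ioengine_ops_sketch, struct thread_data_sketch and free_ioengine_sketch() are hypothetical names, not the upstream code):

```c
#include <dlfcn.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-in for fio's ioengine ops structure.
 * For a dynamic engine this object would live inside the loaded library. */
struct ioengine_ops_sketch {
    const char *name;
    void *dlhandle;  /* NULL for built-in engines */
};

/* Hypothetical stand-in for the per-thread state. */
struct thread_data_sketch {
    struct ioengine_ops_sketch *io_ops;
};

void free_ioengine_sketch(struct thread_data_sketch *td)
{
    if (td->io_ops->dlhandle) {
        dlclose(td->io_ops->dlhandle);
        /*
         * No store to td->io_ops->dlhandle here: if this was the last
         * reference, the memory behind td->io_ops may already be unmapped.
         */
    }

    /* Only the thread's own pointer is cleared; it lives in td, not in
     * the library. */
    td->io_ops = NULL;
}
```

The only change from the function quoted in the description is that nothing is written through td->io_ops once dlclose() has returned.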

Comment 6 Alexey Dobriyan 2021-05-08 16:19:31 UTC
Deleting "td->io_ops->dlhandle = NULL" line seems to work.

Comment 7 Eric Sandeen 2021-05-08 21:26:34 UTC
Yup, thanks for testing it.  I sent that upstream.

Comment 8 Alexey Dobriyan 2021-06-10 12:15:38 UTC
fixed in 3.26-2.fc34

