Bug 1956963

Summary: fio is unusable with ioengine=libaio (does I/O, segfault, no output)
Product: [Fedora] Fedora
Reporter: Alexey Dobriyan <adobriyan>
Component: fio
Assignee: Eric Sandeen <esandeen>
Status: NEW
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Priority: unspecified
Version: 34
CC: esandeen, pportant, ykorman
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Linux   

Description Alexey Dobriyan 2021-05-04 18:46:16 UTC
Description of problem:
ioengine=libaio is broken with dlopen'ed fio-engine-* packages

Version-Release number of selected component (if applicable):
fio.x86_64                    3.25-3.fc34
fio-engine-libaio.x86_64      3.25-3.fc34

How reproducible:

Steps to Reproduce:




Actual results:
fio[1521]: segfault at 7f90314458a0 ip 0000558dfaccaa45 sp 00007f9030c2fd40 error 6 in fio[558dfacbb000+76000]
Code: 00 48 c7 83 18 10 00 00 00 00 00 00 48 8b 78 20 48 85 ff 74 1d f6 05 51 81 16 00 04 75 27 e8 52 1f ff ff 48 8b 83 40 42 04 00 <48> c7 40 20 00 00 00 00 48 c7 83 40 42 04 00 00 00 00 00 5b c3 66

Additional info:

00000000000279d0 <free_ioengine@@Base>:
   27a39:       e8 52 1f ff ff          call   19990 <dlclose@plt>
   27a3e:       48 8b 83 40 42 04 00    mov    rax,QWORD PTR [rbx+0x44240]
   27a45: ***** 48 c7 40 20 00 00 00    mov    QWORD PTR [rax+0x20],0x0
   27a4c:       00
   27a4d:       48 c7 83 40 42 04 00    mov    QWORD PTR [rbx+0x44240],0x0
   27a54:       00 00 00 00
   27a58:       5b                      pop    rbx
   27a59:       c3                      ret

RIP points at the store to td->io_ops->dlhandle immediately after the dlclose() call, so td->io_ops is bogus at that point.

void free_ioengine(struct thread_data *td)
{
        dprint(FD_IO, "free ioengine %s\n", td->io_ops->name);

        if (td->eo && td->io_ops->options) {
                options_free(td->io_ops->options, td->eo);
                td->eo = NULL;
        }

        if (td->io_ops->dlhandle) {
                dprint(FD_IO, "dlclose ioengine %s\n", td->io_ops->name);
                dlclose(td->io_ops->dlhandle);
     ======>    td->io_ops->dlhandle = NULL;
        }

        td->io_ops = NULL;
}

ioengine=psync works because sync I/O is builtin.

Comment 1 Alexey Dobriyan 2021-05-04 18:49:10 UTC
I wonder if the libaio ioengine should be built in, given the importance of async I/O.

Comment 2 Eric Sandeen 2021-05-05 17:41:55 UTC
Unfortunately the ioengine selection is all or nothing upstream, at least for now.

Would you be able to check/test 3.26 from rawhide?  I think this is probably fixed by this upstream commit:

commit 48ff7df9daea86c82a572b0a840bb8371b6b1a29
Author: Eric Sandeen <sandeen@redhat.com>
Date:   Mon Jan 25 13:23:48 2021 -0600

    fio: fix dlopen refcounting of dynamic engines

    ioengine_load() will dlclose the dynamic library if it matches one
    that we've already got open, but this defeats the built-in refcounting
    done by dlopen/dlclose.  As each thread exits, it calls free_ioengine(),
    and this may do a final dlclose on a dynamic ioengine that is still
    in use if we don't have the proper reference count.

    Fix this by dropping the explicit dlclose of a "matching" dlopened
    dynamic engine library, and let each dlclose decrement the refcount
    on the engine library as is normal.

    This also adds/modifies a couple of debug messages to help track this.

    Signed-off-by: Eric Sandeen <sandeen@redhat.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Sorry for not getting that pushed to F34; I guess I didn't realize that this was never fixed there. If you can't test it I'll just push it, since it should resolve this issue.
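
For readers following the commit message above, here is a minimal sketch of the refcounting problem it describes. It is a simplified model, not fio's code: struct fake_lib, fake_open(), fake_close() and engine_mapped_after_first_exit() are hypothetical names that only mimic how dlopen/dlclose count references.

```c
#include <assert.h>

/* Hypothetical stand-in for a shared library tracked by the dynamic loader. */
struct fake_lib {
    int refcount;   /* number of outstanding "dlopen" references */
    int mapped;     /* 1 while the library's code and data are usable */
};

/* Models dlopen(): every successful open takes one reference. */
static void fake_open(struct fake_lib *lib)
{
    lib->refcount++;
    lib->mapped = 1;
}

/* Models dlclose(): drops one reference and unmaps on the last one. */
static void fake_close(struct fake_lib *lib)
{
    assert(lib->refcount > 0);
    if (--lib->refcount == 0)
        lib->mapped = 0;
}

/*
 * Simulates njobs jobs that each open the same engine, then lets the
 * first job exit. Returns 1 if the engine is still mapped afterwards.
 * extra_close = 1 models the old behavior, where the loader path closed
 * a handle that "matched" an already open engine.
 */
static int engine_mapped_after_first_exit(int njobs, int extra_close)
{
    struct fake_lib lib = { 0, 0 };

    for (int i = 0; i < njobs; i++) {
        fake_open(&lib);
        if (extra_close && i > 0)
            fake_close(&lib);   /* duplicate handle dropped early */
    }
    fake_close(&lib);           /* first job exits and closes its handle */
    return lib.mapped;
}
```

In this model, two jobs with the extra close leave the engine unmapped as soon as the first job exits, while the second job is still using it. Without the extra close, each job holds its own reference and the engine stays mapped until the last one exits.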

Comment 3 Alexey Dobriyan 2021-05-06 21:53:04 UTC
I rebuilt fio-3.26-1.fc35.src.rpm and installed it on F34.

It does NOT work; it segfaults in the same place:

    td->io_ops->dlhandle = NULL;

Comment 4 Eric Sandeen 2021-05-06 23:03:05 UTC
Hrm, thank you for testing.  I'll dig into this, I guess it's yet another, different problem w/ the dlopen'd ioengines...

Comment 5 Eric Sandeen 2021-05-07 01:35:04 UTC
I have trouble tracking what's going on with these dynamic engines.

I /think/

                td->io_ops->dlhandle = NULL;

segfaults because the dlclose actually unloads the library that holds the symbol io_ops points to, and therefore we can no longer reference io_ops->dlhandle.
It seems that we should simply not try to set dlhandle to NULL after the dlclose.
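
A rough sketch of the pattern suspected here, using mmap/munmap as a stand-in for dlopen/dlclose so that it builds without a separate engine library. Everything in it (struct fake_ops, fake_load(), fake_free()) is hypothetical and only illustrates a structure that lives inside the memory its own handle releases; it is not the fio fix itself.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical ops table that lives inside the mapping it describes,
 * the way an engine's ops structure lives inside its own shared object.
 */
struct fake_ops {
    void  *handle;   /* start of the mapping, plays the role of dlhandle */
    size_t len;      /* length of the mapping */
};

/* Plays the role of loading an engine: map memory and place the ops in it. */
static struct fake_ops *fake_load(void)
{
    size_t len = 4096;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    struct fake_ops *ops = p;
    ops->handle = p;
    ops->len = len;
    return ops;
}

/*
 * Plays the role of free_ioengine(): copy out what is needed, release the
 * mapping, then clear only the caller's pointer. Writing
 * (*opsp)->handle = NULL after munmap() would touch unmapped memory,
 * which is the analogue of the reported crash.
 */
static int fake_free(struct fake_ops **opsp)
{
    void *handle = (*opsp)->handle;
    size_t len = (*opsp)->len;
    int ret = munmap(handle, len);

    *opsp = NULL;
    return ret;
}
```

The point of the sketch is the ordering in fake_free(): once the handle is released, the only pointer that is still safe to write is the caller's own, which matches the idea of clearing td->io_ops but leaving dlhandle alone.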

Comment 6 Alexey Dobriyan 2021-05-08 16:19:31 UTC
Deleting the "td->io_ops->dlhandle = NULL" line seems to work.

Comment 7 Eric Sandeen 2021-05-08 21:26:34 UTC
Yup, thanks for testing it.  I sent that upstream.

Comment 8 Alexey Dobriyan 2021-06-10 12:15:38 UTC
Fixed in 3.26-2.fc34.