RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.
Bug 1991652 - delay-close option doesn't work and there is assertion failure at the end of nbdkit conversion
Summary: delay-close option doesn't work and there is assertion failure at the end of ...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: nbdkit
Version: 9.0
Hardware: x86_64
OS: Unspecified
medium
medium
Target Milestone: beta
: ---
Assignee: Virtualization Maintenance
QA Contact: mxie@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-08-09 15:29 UTC by mxie@redhat.com
Modified: 2021-12-07 21:37 UTC (History)
11 users (show)

Fixed In Version: nbdkit-1.26.3-4.el9
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-12-07 21:35:16 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
nbdkit-delay-close-option.log (1.31 MB, text/plain)
2021-08-09 15:29 UTC, mxie@redhat.com
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHELPLAN-92927 0 None None None 2021-08-09 15:30:39 UTC

Description mxie@redhat.com 2021-08-09 15:29:43 UTC
Created attachment 1812477 [details]
nbdkit-delay-close-option.log

Description of problem:
delay-close option doesn't work and there is assertion failure at the end of nbdkit conversion 

Version-Release number of selected component (if applicable):
nbdkit-1.26.3-1.el9.x86_64

How reproducible:
100%

Steps to Reproduce:
1.Convert a disk from VMware via vddk plugin without delay filter
#time nbdkit -rfv -U - --exportname --filter=retry  vddk server=10.73.198.169 user=root password=+/home/passwd vm=moref=vm-6104  file='[datastore1] esx7.0-win2022-preview-x86_64/esx7.0-win2022-preview-x86_64.vmdk' libdir=/home/vddk7.0.2 thumbprint='B5:52:1F:B4:21:09:45:24:51:32:56:F6:63:6A:93:5D:54:08:2D:78' transports=nbdssl  compression=skipz --run 'qemu-img convert $nbd /home/esx7.0-win2022-preview-skipz'
....
real	1m44.349s
user	0m5.733s
sys	0m36.490s


2. Convert a disk from VMware via vddk plugin with delay filter and delay-close=40000ms 
# time nbdkit -rfv -U - --exportname --filter=retry  vddk server=10.73.198.169 user=root password=+/home/passwd vm=moref=vm-6104  file='[datastore1] esx7.0-win2022-preview-x86_64/esx7.0-win2022-preview-x86_64.vmdk' libdir=/home/vddk7.0.2 thumbprint='B5:52:1F:B4:21:09:45:24:51:32:56:F6:63:6A:93:5D:54:08:2D:78' transports=nbdssl  compression=skipz --run 'qemu-img convert $nbd /home/esx7.0-win2022-preview-skipz' --filter=delay delay-close=40000ms  
....
nbdkit: public.c:725: nbdkit_nanosleep: Assertion `quit || (conn && conn->nworkers > 0 && connection_get_status () < 1) || (conn && (fds[2].revents & (POLLRDHUP | POLLHUP | POLLERR)))' failed.

real	1m44.214s
user	0m6.040s
sys	0m39.352s


Actual results:
As above description

Expected results:
delay-close option works well in nbdkit conversion 


Additional info:
1. delay-open option works well in nbdkit conversion 

#time nbdkit -rfv -U - --exportname --filter=retry  vddk server=10.73.198.169 user=root password=+/home/passwd vm=moref=vm-6104  file='[datastore1] esx7.0-win2022-preview-x86_64/esx7.0-win2022-preview-x86_64.vmdk' libdir=/home/vddk7.0.2 thumbprint='B5:52:1F:B4:21:09:45:24:51:32:56:F6:63:6A:93:5D:54:08:2D:78' transports=nbdssl  compression=skipz --run 'qemu-img convert $nbd /home/esx7.0-win2022-preview-skipz' --filter=delay delay-open=40000ms 
.....
real	2m22.986s
user	0m6.012s
sys	0m37.128s

Comment 1 Richard W.M. Jones 2021-08-09 17:00:21 UTC
Slightly simpler reproducer:

$ cd /var/tmp
$ virt-builder fedora-33 --format=vmdk
$ nbdkit -fv --filter=delay vddk file=/var/tmp/fedora-33.vmdk libdir=$HOME/tmp/vddk/vmware-vix-disklib-distrib delay-close=40 --run 'nbdcopy $uri null:'
...
nbdkit: public.c:725: nbdkit_nanosleep: Assertion `quit || (conn && conn->nworkers > 0 && connection_get_status () < 1) || (conn && (fds[2].revents & (POLLRDHUP | POLLHUP | POLLERR)))' failed.

The stack trace is:

Thread 1 (Thread 0x7f04bffff640 (LWP 514728)):
#0  __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:45
#1  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at pthread_kill.c:62
#2  0x00007f04cafc86b6 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007f04cafb27d3 in __GI_abort () at abort.c:79
#4  0x00007f04cafb26fb in __assert_fail_base (fmt=<optimized out>, assertion=<optimized out>, file=<optimized out>, line=<optimized out>, function=<optimized out>) at assert.c:92
#5  0x00007f04cafc13a6 in __GI___assert_fail (assertion=assertion@entry=0x55c1d11e3f30 "quit || (conn && conn->nworkers > 0 && connection_get_status () < 1) || (conn && (fds[2].revents & (POLLRDHUP | POLLHUP | POLLERR)))", file=file@entry=0x55c1d11e388a "public.c", line=line@entry=725, function=function@entry=0x55c1d11e4d30 <__PRETTY_FUNCTION__.1.lto_priv.11> "nbdkit_nanosleep") at assert.c:101
#6  0x000055c1d11dc37f in nbdkit_nanosleep (sec=<optimized out>, nsec=<optimized out>) at /home/rjones/d/nbdkit/server/public.c:725
#7  0x000055c1d11d0dc7 in backend_close (c=0x7f04940013c0) at /home/rjones/d/nbdkit/server/backend.c:369
#8  0x000055c1d11d5ace in free_connection (conn=0x7f0494000dd0) at /home/rjones/d/nbdkit/server/connections.c:357
#9  handle_single_connection (sockin=<optimized out>, sockout=<optimized out>) at /home/rjones/d/nbdkit/server/connections.c:233
#10 0x000055c1d11dd540 in start_thread (datav=0x55c1dad5e250) at /home/rjones/d/nbdkit/server/sockets.c:354
#11 0x00007f04cb013aaf in start_thread (arg=<optimized out>) at pthread_create.c:434
#12 0x00007f04cb098300 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

(gdb) print quit
$1 = 0
(gdb) print *conn
value has been optimized out
(gdb) print fds[2]
$2 = {fd = 13, events = 8192, revents = 32}

I'll see if it can be reproduced with -g -O0 and try to get more debugging.
Also I have a feeling that this probably doesn't depend on VDDK so it'd be
good to have a simpler reproducer.

Comment 2 Richard W.M. Jones 2021-08-09 19:17:28 UTC
(gdb) print conn->nworkers
$3 = 0
(gdb) print conn->status
$4 = 0
(gdb) print fds[0]
$9 = {fd = 5, events = 1, revents = 0}
(gdb) print fds[1]
$10 = {fd = -1, events = 1, revents = 0}
(gdb) print fds[2]
$11 = {fd = 13, events = 8192, revents = 32}

So it's the last conjunction of the assertion which is failing:

conn && (fds[2].revents & (POLLRDHUP | POLLHUP | POLLERR))

is not true because revents == POLLNVAL.

       POLLNVAL
              Invalid request: fd not open (only returned in revents;  ignored
              in events).

I think if the socket has been closed already we should probably
return from nbdkit_nanosleep early, since it's similar to the
other case where we return if the same socket has an error.

https://listman.redhat.com/archives/libguestfs/2021-August/msg00054.html

Comment 3 Richard W.M. Jones 2021-08-09 19:33:17 UTC
Upstream fix:
https://gitlab.com/nbdkit/nbdkit/-/commit/87a88f8c52a0d2fd392c35d37f8b048bcede1382

mxie: I suspect that the delay-close flag actually can never work
because we deliberately end the sleep operation early when closing
the socket.  We should probably remove it.

Comment 4 Richard W.M. Jones 2021-08-10 08:21:03 UTC
I posted this 3 part patch series which fixes delay-close:

https://listman.redhat.com/archives/libguestfs/2021-August/thread.html#00056

Comment 7 mxie@redhat.com 2021-08-13 13:25:17 UTC
Reproduce the bug with nbdkit-1.26.3-1.el9.x86_64

Steps to reproduce:
1.Check bdkit-delay-filter man page about delay-close option
# man nbdkit-delay-filter |grep delay-close -A 6
.....
       delay-close=SECS
       delay-close=NNms
           (nbdkit ≥ 1.28)

           Delay open and close operations by "SECS" seconds or "NN" milliseconds.  Open corresponds to client
           connection.  Close may not be visible to clients if they abruptly disconnect.

2.Check if nbdkit_nanosleep gives an correct error message when it aborts early.
# nbdkit --filter=delay null delay-close=3 --run 'nbdinfo --size $uri; nbdinfo --size $uri'
0
nbdkit: null[1]: error: aborting sleep to shut down
nbdkit: null[2]0
: error: aborting sleep to shut down

3. Check if deplay-close works when clients use NBD_CMD_DISC (libnbd nbd_shutdown) or clients which drop the connection

3.1 # nbdkit --filter=delay null delay-close=3  --run 'time nbdsh -u $uri -c "h.shutdown()"'
nbdkit: null[1]: error: aborting sleep to shut down

real	0m0.131s   -----> # Client used shutdown, was not delayed
user	0m0.107s
sys	0m0.020s

3.2 # nbdkit --filter=delay null delay-close=3  --run 'time nbdsh -u $uri -c "pass"'
nbdkit: null[1]: error: aborting sleep to shut down

real	0m0.129s   -----> # Client disconnected, was not delayed
user	0m0.109s
sys	0m0.017s



Verify the bug with nbdkit-1.26.3-4.el9.x86_64

Steps 
1.Check nbdkit-delay-filter man page about delay-close option, 
# man nbdkit-delay-filter |grep delay-close -A 6
.....
       delay-close=SECS
       delay-close=NNms
           (nbdkit ≥ 1.28)

           Delay close (client disconnection) by "SECS" seconds or "NN" milliseconds.  This can also cause
           server shutdown to be delayed if clients are connected at the time.  This only affects clients that
           gracefully disconnect (using "NBD_CMD_DISC" / libnbd function nbd_shutdown(3)).  Clients that
           abruptly disconnect from the server cannot be delayed.

Result of step1: the description of delay-close option has updated in nbdkit-delay-filter man page

2.Check if nbdkit_nanosleep gives an correct error message when it aborts early.
# nbdkit --filter=delay null delay-close=3 --run 'nbdinfo --size $uri; nbdinfo --size $uri'
0
0

Result of step2: there is no error message when nbdkit_nanosleep aborts early

3.Check if deplay-close works when clients use NBD_CMD_DISC (libnbd nbd_shutdown) or clients which drop the connection

3.1 # nbdkit --filter=delay null delay-close=3  --run 'time nbdsh -u $uri -c "h.shutdown()"'

real	0m3.133s   ----->  # Client used shutdown, was delayed
user	0m0.103s   
sys	0m0.024s

3.2 # nbdkit --filter=delay null delay-close=3  --run 'time nbdsh -u $uri -c "pass"'

real	0m0.128s   -----> # Client disconnected, was not delayed
user	0m0.100s
sys	0m0.026s

Result of step3: deplay-close only delays clients which use NBD_CMD_DISC (libnbd nbd_shutdown(3)).  Clients which drop the connection obviously cannot be delayed. the result is expected


4.Check if delay-close option works in nbdkit vddk conversion

4.1 Convert a disk from VMware via vddk plugin without delay filter
#time nbdkit -rfv -U - --exportname --filter=retry  vddk server=10.73.198.169 user=root password=+/home/passwd vm=moref=vm-6104  file='[datastore1] esx7.0-win2022-preview-x86_64/esx7.0-win2022-preview-x86_64.vmdk' libdir=/home/vddk7.0.2 thumbprint='B5:52:1F:B4:21:09:45:24:51:32:56:F6:63:6A:93:5D:54:08:2D:78' transports=nbdssl  compression=skipz --run 'qemu-img convert $nbd /home/esx7.0-win2022-preview-skipz'
....
real	1m41.378s
user	0m32.250s
sys	0m37.698s


4.2.Convert a disk from VMware via vddk plugin with delay filter and delay-close=40000ms 
# time nbdkit -rfv -U - --exportname --filter=retry  vddk server=10.73.198.169 user=root password=+/home/passwd vm=moref=vm-6104  file='[datastore1] esx7.0-win2022-preview-x86_64/esx7.0-win2022-preview-x86_64.vmdk' libdir=/home/vddk7.0.2 thumbprint='B5:52:1F:B4:21:09:45:24:51:32:56:F6:63:6A:93:5D:54:08:2D:78' transports=nbdssl  compression=skipz --run 'qemu-img convert $nbd /home/esx7.0-win2022-preview-skipz' --filter=delay delay-close=40000ms  
....
nbdkit: vddk[1]: debug: VixDiskLib: VixDiskLib_QueryAllocatedBlocks: Query allocated blocks.
nbdkit: vddk[1]: debug: VDDK call: VixDiskLib_FreeBlockList (block_list)
nbdkit: vddk[1]: debug: client sent NBD_CMD_DISC, closing connection
nbdkit: vddk[1]: debug: delay: finalize
nbdkit: vddk[1]: debug: vddk: finalize
nbdkit: vddk[1]: debug: vddk: finalize
nbdkit: debug: delay: cleanup
nbdkit: debug: vddk: cleanup
nbdkit: debug: vddk: unload plugin
nbdkit: debug: VDDK call: VixDiskLib_Exit ()
nbdkit: debug: VDDK_PhoneHome: VDDK PhoneHome succeed to exit
nbdkit: debug: VixDiskLib: VixDiskLib_Exit: Unmatched Init calls so far: 1.
nbdkit: debug: OBJLIB-LIB: ObjLib cleanup done.
nbdkit: debug: delay: unload filter

real	2m22.787s
user	0m31.289s
sys	0m39.454s

Result of step4: delay-close option works in nbdkit vddk conversion and there is no assertion failure 


Hi rjones,

    Please help to check the result of step2: there is no error message when nbdkit_nanosleep aborts early when verify the bug with nbdkit-1.26.3-4.el9.x86_64, is it expected?

Comment 8 Richard W.M. Jones 2021-08-14 09:18:09 UTC
> Please help to check the result of step2: there is no error message when nbdkit_nanosleep aborts early when verify the bug with nbdkit-1.26.3-4.el9.x86_64, is it expected?

Yes it's right.  nbdinfo gracefully shuts down the connection
so there is a 3 second delay, and (because nbdkit_nanosleep isn't
interrupted) there is no error message at all.

Comment 9 mxie@redhat.com 2021-08-16 02:42:14 UTC
(In reply to Richard W.M. Jones from comment #8)
> > Please help to check the result of step2: there is no error message when nbdkit_nanosleep aborts early when verify the bug with nbdkit-1.26.3-4.el9.x86_64, is it expected?
> 
> Yes it's right.  nbdinfo gracefully shuts down the connection
> so there is a 3 second delay, and (because nbdkit_nanosleep isn't
> interrupted) there is no error message at all.

I saw you have improved the error message about aborting sleep to shut down in https://listman.redhat.com/archives/libguestfs/2021-August/msg00056.html, how to verify the improved error message?

Comment 10 Richard W.M. Jones 2021-08-19 08:48:42 UTC
It's tricky to check this without using a custom plugin, so I don't
think there is an easy way.

Comment 11 mxie@redhat.com 2021-08-19 08:57:57 UTC
Thanks rjones, move the bug from ON_QA to VERIFIED according to comment7 ~ comment10.


Note You need to log in before you can comment on or make changes to this bug.