Bug 1991652
| Summary: | delay-close option doesn't work and there is assertion failure at the end of nbdkit conversion | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | mxie <mxie> | ||||
| Component: | nbdkit | Assignee: | Virtualization Maintenance <virt-maint> | ||||
| Status: | CLOSED CURRENTRELEASE | QA Contact: | mxie <mxie> | ||||
| Severity: | medium | Docs Contact: | |||||
| Priority: | medium | ||||||
| Version: | 9.0 | CC: | chhu, eblake, juzhou, kkiwi, mzhan, rjones, tyan, tzheng, virt-maint, vwu, xiaodwan | ||||
| Target Milestone: | beta | Keywords: | Triaged | ||||
| Target Release: | --- | Flags: | pm-rhel:
mirror+
|
||||
| Hardware: | x86_64 | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | nbdkit-1.26.3-4.el9 | Doc Type: | If docs needed, set a value | ||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2021-12-07 21:35:16 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Attachments: |
|
||||||
Slightly simpler reproducer:
$ cd /var/tmp
$ virt-builder fedora-33 --format=vmdk
$ nbdkit -fv --filter=delay vddk file=/var/tmp/fedora-33.vmdk libdir=$HOME/tmp/vddk/vmware-vix-disklib-distrib delay-close=40 --run 'nbdcopy $uri null:'
...
nbdkit: public.c:725: nbdkit_nanosleep: Assertion `quit || (conn && conn->nworkers > 0 && connection_get_status () < 1) || (conn && (fds[2].revents & (POLLRDHUP | POLLHUP | POLLERR)))' failed.
The stack trace is:
Thread 1 (Thread 0x7f04bffff640 (LWP 514728)):
#0 __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:45
#1 __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at pthread_kill.c:62
#2 0x00007f04cafc86b6 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3 0x00007f04cafb27d3 in __GI_abort () at abort.c:79
#4 0x00007f04cafb26fb in __assert_fail_base (fmt=<optimized out>, assertion=<optimized out>, file=<optimized out>, line=<optimized out>, function=<optimized out>) at assert.c:92
#5 0x00007f04cafc13a6 in __GI___assert_fail (assertion=assertion@entry=0x55c1d11e3f30 "quit || (conn && conn->nworkers > 0 && connection_get_status () < 1) || (conn && (fds[2].revents & (POLLRDHUP | POLLHUP | POLLERR)))", file=file@entry=0x55c1d11e388a "public.c", line=line@entry=725, function=function@entry=0x55c1d11e4d30 <__PRETTY_FUNCTION__.1.lto_priv.11> "nbdkit_nanosleep") at assert.c:101
#6 0x000055c1d11dc37f in nbdkit_nanosleep (sec=<optimized out>, nsec=<optimized out>) at /home/rjones/d/nbdkit/server/public.c:725
#7 0x000055c1d11d0dc7 in backend_close (c=0x7f04940013c0) at /home/rjones/d/nbdkit/server/backend.c:369
#8 0x000055c1d11d5ace in free_connection (conn=0x7f0494000dd0) at /home/rjones/d/nbdkit/server/connections.c:357
#9 handle_single_connection (sockin=<optimized out>, sockout=<optimized out>) at /home/rjones/d/nbdkit/server/connections.c:233
#10 0x000055c1d11dd540 in start_thread (datav=0x55c1dad5e250) at /home/rjones/d/nbdkit/server/sockets.c:354
#11 0x00007f04cb013aaf in start_thread (arg=<optimized out>) at pthread_create.c:434
#12 0x00007f04cb098300 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb) print quit
$1 = 0
(gdb) print *conn
value has been optimized out
(gdb) print fds[2]
$2 = {fd = 13, events = 8192, revents = 32}
I'll see if it can be reproduced with -g -O0 and try to get more debugging.
Also I have a feeling that this probably doesn't depend on VDDK so it'd be
good to have a simpler reproducer.
(gdb) print conn->nworkers
$3 = 0
(gdb) print conn->status
$4 = 0
(gdb) print fds[0]
$9 = {fd = 5, events = 1, revents = 0}
(gdb) print fds[1]
$10 = {fd = -1, events = 1, revents = 0}
(gdb) print fds[2]
$11 = {fd = 13, events = 8192, revents = 32}
So it's the last conjunction of the assertion which is failing:
conn && (fds[2].revents & (POLLRDHUP | POLLHUP | POLLERR))
is not true because revents == POLLNVAL.
POLLNVAL
Invalid request: fd not open (only returned in revents; ignored
in events).
I think if the socket has been closed already we should probably
return from nbdkit_nanosleep early, since it's similar to the
other case where we return if the same socket has an error.
https://listman.redhat.com/archives/libguestfs/2021-August/msg00054.html
Upstream fix: https://gitlab.com/nbdkit/nbdkit/-/commit/87a88f8c52a0d2fd392c35d37f8b048bcede1382 mxie: I suspect that the delay-close flag actually can never work because we deliberately end the sleep operation early when closing the socket. We should probably remove it. I posted this 3 part patch series which fixes delay-close: https://listman.redhat.com/archives/libguestfs/2021-August/thread.html#00056 Reproduce the bug with nbdkit-1.26.3-1.el9.x86_64
Steps to reproduce:
1.Check bdkit-delay-filter man page about delay-close option
# man nbdkit-delay-filter |grep delay-close -A 6
.....
delay-close=SECS
delay-close=NNms
(nbdkit ≥ 1.28)
Delay open and close operations by "SECS" seconds or "NN" milliseconds. Open corresponds to client
connection. Close may not be visible to clients if they abruptly disconnect.
2.Check if nbdkit_nanosleep gives an correct error message when it aborts early.
# nbdkit --filter=delay null delay-close=3 --run 'nbdinfo --size $uri; nbdinfo --size $uri'
0
nbdkit: null[1]: error: aborting sleep to shut down
nbdkit: null[2]0
: error: aborting sleep to shut down
3. Check if deplay-close works when clients use NBD_CMD_DISC (libnbd nbd_shutdown) or clients which drop the connection
3.1 # nbdkit --filter=delay null delay-close=3 --run 'time nbdsh -u $uri -c "h.shutdown()"'
nbdkit: null[1]: error: aborting sleep to shut down
real 0m0.131s -----> # Client used shutdown, was not delayed
user 0m0.107s
sys 0m0.020s
3.2 # nbdkit --filter=delay null delay-close=3 --run 'time nbdsh -u $uri -c "pass"'
nbdkit: null[1]: error: aborting sleep to shut down
real 0m0.129s -----> # Client disconnected, was not delayed
user 0m0.109s
sys 0m0.017s
Verify the bug with nbdkit-1.26.3-4.el9.x86_64
Steps
1.Check nbdkit-delay-filter man page about delay-close option,
# man nbdkit-delay-filter |grep delay-close -A 6
.....
delay-close=SECS
delay-close=NNms
(nbdkit ≥ 1.28)
Delay close (client disconnection) by "SECS" seconds or "NN" milliseconds. This can also cause
server shutdown to be delayed if clients are connected at the time. This only affects clients that
gracefully disconnect (using "NBD_CMD_DISC" / libnbd function nbd_shutdown(3)). Clients that
abruptly disconnect from the server cannot be delayed.
Result of step1: the description of delay-close option has updated in nbdkit-delay-filter man page
2.Check if nbdkit_nanosleep gives an correct error message when it aborts early.
# nbdkit --filter=delay null delay-close=3 --run 'nbdinfo --size $uri; nbdinfo --size $uri'
0
0
Result of step2: there is no error message when nbdkit_nanosleep aborts early
3.Check if deplay-close works when clients use NBD_CMD_DISC (libnbd nbd_shutdown) or clients which drop the connection
3.1 # nbdkit --filter=delay null delay-close=3 --run 'time nbdsh -u $uri -c "h.shutdown()"'
real 0m3.133s -----> # Client used shutdown, was delayed
user 0m0.103s
sys 0m0.024s
3.2 # nbdkit --filter=delay null delay-close=3 --run 'time nbdsh -u $uri -c "pass"'
real 0m0.128s -----> # Client disconnected, was not delayed
user 0m0.100s
sys 0m0.026s
Result of step3: deplay-close only delays clients which use NBD_CMD_DISC (libnbd nbd_shutdown(3)). Clients which drop the connection obviously cannot be delayed. the result is expected
4.Check if delay-close option works in nbdkit vddk conversion
4.1 Convert a disk from VMware via vddk plugin without delay filter
#time nbdkit -rfv -U - --exportname --filter=retry vddk server=10.73.198.169 user=root password=+/home/passwd vm=moref=vm-6104 file='[datastore1] esx7.0-win2022-preview-x86_64/esx7.0-win2022-preview-x86_64.vmdk' libdir=/home/vddk7.0.2 thumbprint='B5:52:1F:B4:21:09:45:24:51:32:56:F6:63:6A:93:5D:54:08:2D:78' transports=nbdssl compression=skipz --run 'qemu-img convert $nbd /home/esx7.0-win2022-preview-skipz'
....
real 1m41.378s
user 0m32.250s
sys 0m37.698s
4.2.Convert a disk from VMware via vddk plugin with delay filter and delay-close=40000ms
# time nbdkit -rfv -U - --exportname --filter=retry vddk server=10.73.198.169 user=root password=+/home/passwd vm=moref=vm-6104 file='[datastore1] esx7.0-win2022-preview-x86_64/esx7.0-win2022-preview-x86_64.vmdk' libdir=/home/vddk7.0.2 thumbprint='B5:52:1F:B4:21:09:45:24:51:32:56:F6:63:6A:93:5D:54:08:2D:78' transports=nbdssl compression=skipz --run 'qemu-img convert $nbd /home/esx7.0-win2022-preview-skipz' --filter=delay delay-close=40000ms
....
nbdkit: vddk[1]: debug: VixDiskLib: VixDiskLib_QueryAllocatedBlocks: Query allocated blocks.
nbdkit: vddk[1]: debug: VDDK call: VixDiskLib_FreeBlockList (block_list)
nbdkit: vddk[1]: debug: client sent NBD_CMD_DISC, closing connection
nbdkit: vddk[1]: debug: delay: finalize
nbdkit: vddk[1]: debug: vddk: finalize
nbdkit: vddk[1]: debug: vddk: finalize
nbdkit: debug: delay: cleanup
nbdkit: debug: vddk: cleanup
nbdkit: debug: vddk: unload plugin
nbdkit: debug: VDDK call: VixDiskLib_Exit ()
nbdkit: debug: VDDK_PhoneHome: VDDK PhoneHome succeed to exit
nbdkit: debug: VixDiskLib: VixDiskLib_Exit: Unmatched Init calls so far: 1.
nbdkit: debug: OBJLIB-LIB: ObjLib cleanup done.
nbdkit: debug: delay: unload filter
real 2m22.787s
user 0m31.289s
sys 0m39.454s
Result of step4: delay-close option works in nbdkit vddk conversion and there is no assertion failure
Hi rjones,
Please help to check the result of step2: there is no error message when nbdkit_nanosleep aborts early when verify the bug with nbdkit-1.26.3-4.el9.x86_64, is it expected?
> Please help to check the result of step2: there is no error message when nbdkit_nanosleep aborts early when verify the bug with nbdkit-1.26.3-4.el9.x86_64, is it expected?
Yes it's right. nbdinfo gracefully shuts down the connection
so there is a 3 second delay, and (because nbdkit_nanosleep isn't
interrupted) there is no error message at all.
(In reply to Richard W.M. Jones from comment #8) > > Please help to check the result of step2: there is no error message when nbdkit_nanosleep aborts early when verify the bug with nbdkit-1.26.3-4.el9.x86_64, is it expected? > > Yes it's right. nbdinfo gracefully shuts down the connection > so there is a 3 second delay, and (because nbdkit_nanosleep isn't > interrupted) there is no error message at all. I saw you have improved the error message about aborting sleep to shut down in https://listman.redhat.com/archives/libguestfs/2021-August/msg00056.html, how to verify the improved error message? It's tricky to check this without using a custom plugin, so I don't think there is an easy way. |
Created attachment 1812477 [details] nbdkit-delay-close-option.log Description of problem: delay-close option doesn't work and there is assertion failure at the end of nbdkit conversion Version-Release number of selected component (if applicable): nbdkit-1.26.3-1.el9.x86_64 How reproducible: 100% Steps to Reproduce: 1.Convert a disk from VMware via vddk plugin without delay filter #time nbdkit -rfv -U - --exportname --filter=retry vddk server=10.73.198.169 user=root password=+/home/passwd vm=moref=vm-6104 file='[datastore1] esx7.0-win2022-preview-x86_64/esx7.0-win2022-preview-x86_64.vmdk' libdir=/home/vddk7.0.2 thumbprint='B5:52:1F:B4:21:09:45:24:51:32:56:F6:63:6A:93:5D:54:08:2D:78' transports=nbdssl compression=skipz --run 'qemu-img convert $nbd /home/esx7.0-win2022-preview-skipz' .... real 1m44.349s user 0m5.733s sys 0m36.490s 2. Convert a disk from VMware via vddk plugin with delay filter and delay-close=40000ms # time nbdkit -rfv -U - --exportname --filter=retry vddk server=10.73.198.169 user=root password=+/home/passwd vm=moref=vm-6104 file='[datastore1] esx7.0-win2022-preview-x86_64/esx7.0-win2022-preview-x86_64.vmdk' libdir=/home/vddk7.0.2 thumbprint='B5:52:1F:B4:21:09:45:24:51:32:56:F6:63:6A:93:5D:54:08:2D:78' transports=nbdssl compression=skipz --run 'qemu-img convert $nbd /home/esx7.0-win2022-preview-skipz' --filter=delay delay-close=40000ms .... nbdkit: public.c:725: nbdkit_nanosleep: Assertion `quit || (conn && conn->nworkers > 0 && connection_get_status () < 1) || (conn && (fds[2].revents & (POLLRDHUP | POLLHUP | POLLERR)))' failed. real 1m44.214s user 0m6.040s sys 0m39.352s Actual results: As above description Expected results: delay-close option works well in nbdkit conversion Additional info: 1. delay-open option works well in nbdkit conversion #time nbdkit -rfv -U - --exportname --filter=retry vddk server=10.73.198.169 user=root password=+/home/passwd vm=moref=vm-6104 file='[datastore1] esx7.0-win2022-preview-x86_64/esx7.0-win2022-preview-x86_64.vmdk' libdir=/home/vddk7.0.2 thumbprint='B5:52:1F:B4:21:09:45:24:51:32:56:F6:63:6A:93:5D:54:08:2D:78' transports=nbdssl compression=skipz --run 'qemu-img convert $nbd /home/esx7.0-win2022-preview-skipz' --filter=delay delay-open=40000ms ..... real 2m22.986s user 0m6.012s sys 0m37.128s