Bug 2173054

Summary: [abrt] nbdkit-server: raw_send_socket(): nbdkit killed by SIGABRT
Product: Red Hat Enterprise Linux 9
Reporter: Eric Blake <eblake>
Component: nbdkit
Assignee: Eric Blake <eblake>
Status: CLOSED ERRATA
QA Contact: mxie <mxie>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 9.2
CC: eblake, extras-qa, lersek, mxie, rjones, tzheng, virt-maint, vwu, xiaodwan
Target Milestone: rc
Keywords: Triaged
Target Release: ---
Hardware: x86_64
OS: Unspecified
URL: https://retrace.fedoraproject.org/faf/reports/bthash/7ab51c0d92763b4d5af6b7989e05708c45a535e8
Whiteboard: abrt_hash:363ab534cc1daec2aa5c7dd72faf2da433da482f;VARIANT_ID=workstation;
Fixed In Version: nbdkit-1.33.11-1.el9
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 2173047
Environment:
Last Closed: 2023-11-07 08:28:48 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 2168629
Bug Blocks:

Description Eric Blake 2023-02-23 20:00:55 UTC
+++ This bug was initially created as a clone of Bug #2173047 +++

Description of problem:
Ran 'make check' in the libnbd project against the system-installed nbdkit.

Version-Release number of selected component:
nbdkit-server-1.32.5-1.fc37

Additional info:
reporter:       libreport-2.17.4
backtrace_rating: 4
cgroup:         0::/user.slice/user-14986.slice/user/app.slice/app-org.gnome.Terminal.slice/vte-spawn-d6fe51c3-4e81-4e8f-8775-2b8c418b0fcc.scope
cmdline:        nbdkit --exit-with-parent -v --filter=error pattern 5M error-pread-rate=0.5
crash_function: raw_send_socket
executable:     /usr/sbin/nbdkit
journald_cursor: s=1be4dd20d4854712bab1191d895af0dc;i=4a3ad5;b=c162492451d944938c784f4627ff93a6;m=693923ae92;t=5f5626ed38091;x=72b7f4e687039a6c
kernel:         6.1.11-200.fc37.x86_64
rootdir:        /
runlevel:       N 5
type:           CCpp
uid:            14986

--- Additional comment from Eric Blake on 2023-02-23 12:59:17 MST ---

Reproduced with:
$ nbdcopy -- [ nbdkit --exit-with-parent -v --filter=error pattern 5M error-pread-rate=0.5 ] null:
...
nbdkit: connections.c:402: raw_send_socket: Assertion `sock >= 0' failed.

using libnbd-1.15.9-2.fc38.x86_64, nbdkit-1.33.8-1.fc38.x86_64

The libnbd testsuite is silently continuing in spite of the nbdkit assertion failure.

The failure itself is in raw_send_socket(), in an assertion added in commit daef505e.
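
On the point that the test suite continues silently: a minimal sketch of how a harness could notice that a child process died from a signal such as SIGABRT. This assumes the harness itself spawns the process and can read its exit status, which is not how the libnbd tests start nbdkit here; `run_and_check` and `crasher` are hypothetical names used only for illustration.

```python
import signal
import subprocess
import sys


def run_and_check(argv):
    """Run a child process and report whether it ended cleanly.

    Returns (ok, reason). On POSIX, subprocess reports death by
    signal N as a return code of -N.
    """
    proc = subprocess.run(argv)
    if proc.returncode < 0:
        sig = signal.Signals(-proc.returncode)
        return False, f"killed by {sig.name}"
    if proc.returncode != 0:
        return False, f"exited with status {proc.returncode}"
    return True, "ok"


# Hypothetical stand-in for a server that hits a failed assertion:
# a child interpreter that calls abort().
crasher = [sys.executable, "-c", "import os; os.abort()"]
```

A harness that treats any negative return code as a test failure would not let an assertion failure like the one above pass unnoticed.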

Comment 1 Richard W.M. Jones 2023-02-23 20:04:25 UTC
Unfortunately Monday is the beginning of the exception phase
for RHEL 9.2.  If we had exception+ then we could fix this.

It probably does not affect virt-v2v or any layered products.

But we're still waiting on Eric's analysis of the bug, and
that might change if he thinks it is more serious.

Comment 2 Eric Blake 2023-02-23 20:08:44 UTC
nbdkit: pattern.1: debug: error-inject: pread count=262144 offset=4194304
nbdkit: pattern.1: error: injecting EIO error into pread
nbdkit: pattern.1: debug: sending error reply: Input/output error
nbdkit: pattern.0: debug: pattern: pread count=262144 offset=4456448
nbdkit: pattern.2: error: write data: NBD_CMD_READ: Broken pipe
nbdkit: pattern.2: debug: exiting worker thread pattern.2
nbdkit: connections.c:402: raw_send_socket: Assertion `sock >= 0' failed.

It looks like nbdcopy is peppering the server with multiple requests, but hanging up as soon as one request hits EIO.  Other pending requests that do succeed then get EPIPE because the client is already gone, and set sock to -1 to reflect this, even before we can detect a clean shutdown.  Perhaps libnbd could be nicer and send NBD_CMD_DISC after read errors rather than abruptly hanging up, but the server should NOT be crashing.  Fortunately, the crash is only on exit (the use of --exit-with-parent shows that no other client will be trying to connect), rather than during the data-serving phase.  I'm playing with ideas for how to patch upstream...
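
The failure mode described above can be sketched in a few lines. This is an illustration only, not nbdkit's code or the upstream patch: `Connection` and `demo` are hypothetical names, and a Python socket pair stands in for the client connection. The idea is that a worker which finds the socket already marked dead returns an error instead of asserting.

```python
import socket
import threading


class Connection:
    """Hypothetical connection object shared by several worker threads."""

    def __init__(self, sock):
        self._lock = threading.Lock()
        self._sock = sock  # becomes None once the connection is dead

    def send(self, data):
        """Return True on success, False if the connection is dead."""
        with self._lock:
            if self._sock is None:
                # Another worker already saw the hangup. Report an
                # error to the caller rather than asserting.
                return False
            try:
                self._sock.sendall(data)
                return True
            except OSError:
                # The peer hung up (typically EPIPE): mark the
                # connection dead so later senders fail cleanly.
                self._sock.close()
                self._sock = None
                return False


def demo(workers=4):
    """The client hangs up while several replies are still pending."""
    server_end, client_end = socket.socketpair()
    conn = Connection(server_end)
    client_end.close()  # abrupt hangup, no clean disconnect
    results = []

    def worker():
        results.append(conn.send(b"x" * 4096))

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

In this sketch every worker gets a failure result and can exit its loop; none of them aborts the process.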

Comment 3 Eric Blake 2023-02-23 20:30:28 UTC
The crash is more serious when --exit-with-parent is not in use.  If a single server allows parallel clients, any one of the clients can trigger the EPIPE/SIGABRT scenario by hanging up early with large in-flight read requests.  That should only tear down that one connection, but the SIGABRT instead tears down the entire nbdkit process and denies service to all other currently-connected clients.  I'm not sure if that ranks as a CVE, though: either you are using TLS (so the only client that can trigger the problem already has the same privileges as all other clients that were able to connect, so no privilege escalation boundary is crossed), or you are not (at which point there are plenty of other ways for one client to starve others, whether or not we patch this SIGABRT).

Comment 4 Eric Blake 2023-02-23 20:41:42 UTC
Upstream patch proposed:
https://listman.redhat.com/archives/libguestfs/2023-February/030855.html

Comment 5 Eric Blake 2023-02-23 20:44:19 UTC
It may also be worth cloning this bug to libnbd to have nbdcopy gracefully consume ALL pending requests and issue a clean NBD_CMD_DISC, rather than abruptly hanging up on the server on the first EIO, since not all NBD servers might be as graceful as we intend for nbdkit to behave.
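
A rough simulation of the client behavior suggested here, under stated assumptions: this is not libnbd or nbdcopy code, and `copy_with_graceful_shutdown`, `replies`, and `max_in_flight` are hypothetical names. After the first error the client stops issuing new requests, still retires every reply that is already in flight, and only then disconnects.

```python
from collections import deque


def copy_with_graceful_shutdown(replies, max_in_flight=4):
    """Simulate a client that drains in-flight requests after an error.

    replies[i] is True if request i would succeed and False if the
    server would answer it with an error. Returns (failed, log).
    """
    pending = deque()
    log = []
    next_req = 0
    failed = False
    while True:
        # Issue new requests only while no error has been seen.
        while (not failed and next_req < len(replies)
               and len(pending) < max_in_flight):
            pending.append(next_req)
            log.append(("send", next_req))
            next_req += 1
        if not pending:
            break
        # Retire the oldest in-flight request, even after an error.
        req = pending.popleft()
        ok = replies[req]
        log.append(("reply", req, ok))
        if not ok:
            failed = True
    # Nothing is left in flight, so a clean disconnect is possible.
    log.append(("disconnect",))
    return failed, log
```

The copy still fails overall, but the server never sees the connection disappear while replies are outstanding.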

Comment 6 Eric Blake 2023-02-24 17:15:04 UTC
Laszlo asked me a question which led me to find a potential data corruption bug introduced at the same time as the assertion failure.  The race is: thread 1 of the first client checks the connection status, thread 2 of the first client kills the connection, a second client connects in that window, and then thread 1 tries to flush its pending output buffer on the stale fd, which now points to the socket allocated for the second client.
https://listman.redhat.com/archives/libguestfs/2023-February/030871.html

The window is rather narrow, so it is hard to argue whether a client could actually intentionally abuse it to the point of corrupting data of a peer client rather than crashing nbdkit with an assertion failure, but this race should be fixed at the same time.
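
The stale fd hazard comes from descriptor reuse: POSIX hands out the lowest free descriptor number, so a number that was just closed is likely to be given to the next connection. The sketch below shows that effect in a single thread, with socket pairs standing in for the two clients; `stale_fd_demo` is a hypothetical name and this is not nbdkit code.

```python
import os
import socket


def stale_fd_demo():
    """Write through a remembered fd number after it has been reused.

    Returns (reused, received). 'received' is None if the kernel did
    not hand the old number to the new connection.
    """
    # First client: the server side remembers the raw fd number.
    first_server, first_client = socket.socketpair()
    stale_fd = first_server.fileno()

    # "Thread 2" kills the first connection.
    first_server.close()

    # A second client connects; the freed number is typically reused.
    second_server, second_client = socket.socketpair()
    reused = second_server.fileno() == stale_fd

    received = None
    if reused:
        # "Thread 1" flushes its buffer on the stale number: the
        # bytes land in the second client's connection.
        os.write(stale_fd, b"stale reply")
        received = second_client.recv(64)

    for s in (first_client, second_server, second_client):
        s.close()
    return reused, received
```

When the number is reused, the second client receives data that was meant for the first one, which is why the status check and the write need to be protected against a concurrent close.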

Comment 9 mxie@redhat.com 2023-03-22 14:43:05 UTC
Reproduced the bug with nbdkit-1.32.5-4.el9.x86_64 and libnbd-1.14.2-1.el9.x86_64

Steps to reproduce:
1. # nbdcopy -- [ nbdkit --exit-with-parent -v --filter=error pattern 5M error-pread-rate=0.5 ] null:
.....
nbdkit: pattern.1: error: injecting EIO error into pread
nbdkit: pattern.14: debug: error-inject: pread count=262144 offset=4718592
nbdkit: pattern.14: debug: pattern: pread count=262144 offset=4718592
nbdkit: pattern.1: debug: sending error reply: Input/output error
nbdkit: pattern.4: debug: error-inject: pread count=262144 offset=4980736
nbdkit: pattern.4: debug: pattern: pread count=262144 offset=4980736
nbdkit: pattern.3: error: write data: NBD_CMD_READ: Broken pipe
nbdkit: pattern.3: debug: exiting worker thread pattern.3
nbdkit: pattern.0: debug: exiting worker thread pattern.0
nbdkit: connections.c:402: raw_send_socket: Assertion `sock >= 0' failed.
nbdkit: pattern.14: debug: exiting worker thread pattern.14

Result: A third worker thread triggers the assertion failure when the nbdkit client hangs up abruptly

Verified the bug with nbdkit-server-1.33.11-1.el9.x86_64 and libnbd-1.15.12-1.el9.x86_64

Steps:
1. # nbdcopy -- [ nbdkit --exit-with-parent -v --filter=error pattern 5M error-pread-rate=0.5 ] null:
nbdkit: pattern.5: error: write error reply: Bad file descriptor
nbdkit: pattern.5: debug: exiting worker thread pattern.5
nbdkit: pattern.9: error: write reply: NBD_CMD_READ: Bad file descriptor
nbdkit: pattern.9: debug: exiting worker thread pattern.9
nbdkit: pattern.10: error: write error reply: Bad file descriptor
nbdkit: pattern.15: debug: exiting worker thread pattern.15
nbdkit: pattern.10: debug: exiting worker thread pattern.10
nbdkit: pattern.11: error: write reply: NBD_CMD_READ: Bad file descriptor
nbdkit: pattern.12: error: write reply: NBD_CMD_READ: Bad file descriptor
nbdkit: pattern.11: debug: exiting worker thread pattern.11
nbdkit: pattern.12: debug: exiting worker thread pattern.12
nbdkit: pattern[1]: debug: error-inject: finalize
nbdkit: pattern[1]: debug: pattern: finalize
nbdkit: debug: error-inject: cleanup
nbdkit: debug: pattern: cleanup
nbdkit: debug: pattern: unload plugin
nbdkit: debug: error-inject: unload filter

Result: nbdkit can exit gracefully

Comment 11 Richard W.M. Jones 2023-04-12 16:46:38 UTC
This bug is in a confusing state.  Shouldn't it be added to an erratum (automatically)?

Comment 14 mxie@redhat.com 2023-05-05 03:52:19 UTC
Verified the bug with nbdkit-server-1.34.1-1.el9.x86_64 and libnbd-1.16.0-1.el9.x86_64

Steps:
1. # nbdcopy -- [ nbdkit --exit-with-parent -v --filter=error pattern 5M error-pread-rate=0.5 ] null:
....
nbdkit: pattern.10: error: write reply: NBD_CMD_READ: Bad file descriptor
nbdkit: pattern.10: debug: exiting worker thread pattern.10
nbdkit: pattern.12: error: write error reply: Bad file descriptor
nbdkit: pattern.12: debug: exiting worker thread pattern.12
nbdkit: pattern.13: error: write reply: NBD_CMD_READ: Bad file descriptor
nbdkit: pattern.13: debug: exiting worker thread pattern.13
nbdkit: pattern.7: error: write error reply: Bad file descriptor
nbdkit: pattern.7: debug: exiting worker thread pattern.7
nbdkit: pattern.8: error: write error reply: Bad file descriptor
nbdkit: pattern.8: debug: exiting worker thread pattern.8
nbdkit: pattern.9: error: write error reply: Bad file descriptor
nbdkit: pattern.9: debug: exiting worker thread pattern.9
nbdkit: pattern[1]: debug: error-inject: finalize
nbdkit: pattern[1]: debug: pattern: finalize
nbdkit: debug: error-inject: cleanup
nbdkit: debug: pattern: cleanup
nbdkit: debug: pattern: unload plugin
nbdkit: debug: error-inject: unload filter


Result: nbdkit can exit gracefully; moving the bug from ON_QA to VERIFIED

Comment 16 errata-xmlrpc 2023-11-07 08:28:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (nbdkit bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:6374