Bug 1925345 - qemu-nbd needs larger backlog for Unix socket listen()
Summary: qemu-nbd needs larger backlog for Unix socket listen()
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: qemu-kvm
Version: 8.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: 8.4
Assignee: Eric Blake
QA Contact: zixchen
URL:
Whiteboard:
Depends On:
Blocks: 1901441
 
Reported: 2021-02-04 22:27 UTC by Eric Blake
Modified: 2021-05-25 06:47 UTC (History)
CC List: 15 users

Fixed In Version: qemu-kvm-5.2.0-9.module+el8.4.0+10182+4161bd91
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1925045
Environment:
Last Closed: 2021-05-25 06:47:28 UTC
Type: Bug
Target Upstream Version:
Embargoed:



Description Eric Blake 2021-02-04 22:27:14 UTC
TL;DR summary: qemu-nbd uses listen(fd, 1) instead of listen(fd, SOMAXCONN), which has negative impacts visible to clients, as described in the rest of this comment:

+++ This bug was initially created as a clone of Bug #1925045 +++

Description of problem:

(Originally reported and analysed by Xin Long and Lukas Doktor)

For Unix domain sockets on Linux, connect(2) can return EAGAIN
to mean that the pending-connection queue (the listen(2) backlog) on
the server side has run out of space.  qemu-nbd sets this backlog to
something very small (2 entries, I think), so it is very easy to hit
this problem.
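
To make the server side concrete, here is a minimal sketch of a Unix-socket listener and the backlog argument in question (it reuses the /tmp/sock path from the reproducer below; this is not qemu-nbd's actual code):

-----------------------------------
#include <sys/socket.h>
#include <sys/un.h>
#include <string.h>
#include <unistd.h>
#include <assert.h>

int main(void)
{
  int s = socket(AF_UNIX, SOCK_STREAM, 0);
  struct sockaddr_un un = { .sun_family = AF_UNIX };
  const char *path = "/tmp/sock";

  assert(s >= 0);
  memcpy(un.sun_path, path, strlen(path));
  assert(bind(s, (struct sockaddr *)&un, sizeof(un)) == 0);

  /* listen(s, 1) is the behaviour this bug is about; SOMAXCONN lets the
   * kernel queue many pending connect()s instead of only a couple. */
  assert(listen(s, SOMAXCONN) == 0);

  pause();   /* never accept(), so pending clients stay queued */
  return 0;
}
-----------------------------------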

$ touch /tmp/disk.img
$ truncate -s 10M /tmp/disk.img
$ rm /tmp/sock
$ qemu-nbd -t -k /tmp/sock -f raw /tmp/disk.img 

Press ^Z so that qemu-nbd is stopped:

^Z
[1]+  Stopped                 qemu-nbd -t -k /tmp/sock -f raw /tmp/disk.img

Now run nbdsh a few times:

$ nbdsh -c 'h.connect_unix("/tmp/sock")' &
[2] 1099128
$ nbdsh -c 'h.connect_unix("/tmp/sock")' &
[3] 1099134
$ nbdsh -c 'h.connect_unix("/tmp/sock")' &
[4] 1099140
Traceback (most recent call last):
  File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/lib64/python3.6/site-packages/nbd.py", line 2099, in <module>
    nbdsh.shell()
  File "/usr/lib64/python3.6/site-packages/nbdsh.py", line 100, in shell
    exec (c, d, d)
  File "<string>", line 1, in <module>
  File "/usr/lib64/python3.6/site-packages/nbd.py", line 881, in connect_unix
    return libnbdmod.connect_unix (self._o, unixsocket)
nbd.Error: nbd_connect_unix: connect: Resource temporarily unavailable (EAGAIN)

[4]-  Exit 1                  nbdsh -c 'h.connect_unix("/tmp/sock")'

Version-Release number of selected component (if applicable):

libnbd-1.4.0-1.module+el8.3.0+7821+2cebf880.x86_64
qemu-img-5.2.0-3.scrmod+el8.4.0+9533+93b6ae37.wrb210120.x86_64

How reproducible:

100%

Additional info:
https://bugzilla.redhat.com/show_bug.cgi?id=1901441#c31

--- Additional comment from Richard W.M. Jones on 2021-02-04 05:57:20 MST ---

I did a bit of experimentation here and I'm not sure this bug is solvable.
It's really a buggy kernel API in Linux.

The socket(2) call succeeds, and connect(2) fails with EAGAIN.

After this, the socket is writable (POLLOUT|POLLHUP), but connecting
fails again immediately.  We just end up in a loop doing:

poll([{fd=3, events=POLLIN|POLLOUT}], 1, -1) = 1 ([{fd=3, revents=POLLOUT|POLLHUP}])
connect(3, {sa_family=AF_UNIX, sun_path="/tmp/sock"}, 110) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=3, events=POLLIN|POLLOUT}], 1, -1) = 1 ([{fd=3, revents=POLLOUT|POLLHUP}])
connect(3, {sa_family=AF_UNIX, sun_path="/tmp/sock"}, 110) = -1 EAGAIN (Resource temporarily unavailable)

which of course uses 100% CPU.

Since the socket on its own does not "remember" which socket we wanted to
connect to, there's no way that poll could do the right thing here.
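
To illustrate, here is a minimal client-side sketch of that loop (assuming the /tmp/sock path from above; this is not libnbd's actual code):

-----------------------------------
#include <sys/socket.h>
#include <sys/un.h>
#include <poll.h>
#include <string.h>
#include <errno.h>
#include <assert.h>

int main(void)
{
  int s = socket(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK, 0);
  struct sockaddr_un un = { .sun_family = AF_UNIX };
  const char *path = "/tmp/sock";
  struct pollfd pfd = { .fd = s, .events = POLLIN | POLLOUT };

  assert(s >= 0);
  memcpy(un.sun_path, path, strlen(path));

  /* The backlog is full, so connect() keeps failing with EAGAIN; poll()
   * returns immediately with POLLOUT|POLLHUP, so this spins at 100% CPU. */
  while (connect(s, (struct sockaddr *)&un, sizeof(un)) == -1 &&
         errno == EAGAIN) {
    poll(&pfd, 1, -1);
  }
  return 0;
}
-----------------------------------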

--- Additional comment from Richard W.M. Jones on 2021-02-04 06:28:22 MST ---

Here's a simpler reproducer which doesn't require NBD at all
(suggested by Daniel Berrange).

-----------------------------------
#include <sys/socket.h>
#include <sys/un.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <assert.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  /* SOCK_NONBLOCK makes connect() fail immediately with EAGAIN when the
   * listener's backlog is full, instead of blocking. */
  int s = socket(AF_UNIX, SOCK_STREAM|SOCK_NONBLOCK|SOCK_CLOEXEC, 0);
  struct sockaddr_un un = {};
  const char *path = "/tmp/sock";
  int ret;

  assert(s >= 0);

  un.sun_family = AF_UNIX;
  memcpy(un.sun_path, path, strlen(path));

  ret = connect(s, (struct sockaddr *)&un, sizeof(un));
  if (ret < 0) {
    perror("connect");
  }

  /* Hold the fd open so a pending connection keeps occupying a slot in
   * the server's backlog while further clients try to connect. */
  sleep(30);

  return 0;
}
-----------------------------------

$ gcc sock.c -o sock

$ socat  UNIX-LISTEN:/tmp/sock,backlog=1 STDOUT
^Z
[1]+  Stopped                 socat UNIX-LISTEN:/tmp/sock,backlog=1 STDOUT

In another window, run the following command three times:

$ strace ./sock

The first two will connect successfully, and the third will fail with:

connect(3, {sa_family=AF_UNIX, sun_path="/tmp/sock"}, 110) = -1 EAGAIN (Resource temporarily unavailable)

--- Additional comment from Richard W.M. Jones on 2021-02-04 10:25:51 MST ---

http://post-office.corp.redhat.com/archives/tech-list/2021-February/msg00024.html

--- Additional comment from Richard W.M. Jones on 2021-02-04 11:11:59 MST ---

Patch posted:
https://www.redhat.com/archives/libguestfs/2021-February/msg00015.html

Comment 1 Eric Blake 2021-02-04 22:35:04 UTC
Upstream patch proposed:
https://lists.gnu.org/archive/html/qemu-devel/2021-02/msg01843.html

Comment 2 zixchen 2021-02-05 10:17:43 UTC
Checked with qemu-img-4.2.0-42.module+el8.4.0+9559+d618a2c8.x86_64 as well; I can reproduce this issue. Do we need a clone for the slow train?

Comment 3 Richard W.M. Jones 2021-02-05 11:01:38 UTC
Should this be RHEL AV, not RHEL?  I'm not sure how much we care
about this issue in RHEL since the layered products that might
hit it are all using RHEL AV.

Comment 4 Eric Blake 2021-02-05 13:45:17 UTC
Definitely RHEL AV; and probably worth consideration for 8.3 z-stream rather than just waiting for 8.4, as it is easy to demonstrate its impact on external clients.  Whether we also want RHEL is a judgment call (the problem has been present for a long time, but it has only recently become apparent as we finally have more NBD clients willing to try parallel connections and running into the server issue).  And it wouldn't be the first time that I've opened a bug with the wrong product selected, so moving it to whatever we decide is the right one is fine by me.

Comment 5 Eric Blake 2021-02-09 14:38:24 UTC
(In reply to Eric Blake from comment #0)

> $ touch /tmp/disk.img
> $ truncate -s 10M /tmp/disk.img
> $ rm /tmp/sock
> $ qemu-nbd -t -k /tmp/sock -f raw /tmp/disk.img 

Note that by default, the NBD server created by the nbd-server-start QMP command in qemu allows unlimited clients, while the server created by qemu-nbd defaults to allowing only the first successful client unless you use -e (--shared) with 0 for unlimited or a value larger than 1.

There is an interesting trade-off to be made: if the server is going to reject all subsequent clients anyways, specifying a larger SOMAXCONN only means that rogue unsuccessful clients that attempt to connect() first will not cause EAGAIN errors on the actual client that we expect to connect(); but if you have rogue clients competing for the connection, you already have other problems to worry about.  On the other hand, if you run:

qemu-nbd -t -e 100 -k /tmp/sock -f raw /tmp/disk.img

then you ABSOLUTELY want to be able to connect 100 simultaneous clients without EAGAIN stalls.  But the other factor here is -t: when -t is not present, qemu-nbd will be going away after the first client.  But when -t is used even without -e, later clients are permitted (only one at once, but the later ones will eventually get their chance), so in that scenario we also want SOMAXCONN rather than paying attention to -e.
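
To make that trade-off concrete, here is a hypothetical sketch of the policy just described (the function name and parameters are invented for illustration; this is not the code that went upstream):

-----------------------------------
#include <sys/socket.h>
#include <stdbool.h>
#include <stdio.h>

/* "persistent" corresponds to -t and "shared" to -e (0 = unlimited). */
static int choose_backlog(bool persistent, unsigned shared)
{
  /* With -t, or with more than one permitted client, let the kernel
   * queue as many pending connections as it allows. */
  if (persistent || shared != 1) {
    return SOMAXCONN;
  }
  return 1;   /* one-shot, single-client server */
}

int main(void)
{
  printf("-t -e 100: %d\n", choose_backlog(true, 100));
  printf("plain single client: %d\n", choose_backlog(false, 1));
  return 0;
}
-----------------------------------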

> 
> Patch posted:
> https://www.redhat.com/archives/libguestfs/2021-February/msg00015.html

So another version of this patch will be going upstream soon based on those observations.

Comment 8 zixchen 2021-02-24 01:33:57 UTC
Tested with qemu-kvm-5.2.0-8.el8.eblake202102221522.x86_64, no issue found.

Test steps:
1. Export with # qemu-nbd -t -k /tmp/sock -f raw /tmp/disk.img
Repeated # nbdsh -c 'h.connect_unix("/tmp/sock")' & 10 times; no issue found.

2. Export with # qemu-nbd -e 3 -k /tmp/sock -f raw /tmp/disk.img
After executing "nbdsh -c 'h.connect_unix("/tmp/sock")' &" 4 times, the 6th execution failed with:
nbdsh: command line script failed: nbd_connect_unix: connect: server backlog overflowed, see https://bugzilla.redhat.com/1925045: Resource temporarily unavailable

Comment 15 zixchen 2021-03-02 02:54:03 UTC
Tested with qemu-kvm-5.2.0-9.module+el8.4.0+10182+4161bd91.x86_64, no issue found, so changing status to VERIFIED.

Version:
qemu-kvm-5.2.0-9.module+el8.4.0+10182+4161bd91.x86_64
kernel-4.18.0-291.el8.x86_64

Steps and results:
1.# touch /tmp/disk.img
2.# truncate -s 10M /tmp/disk.img
3.# rm /tmp/sock
4.# qemu-nbd -e 3 -k /tmp/sock -f raw /tmp/disk.img
^Z
[4]+  Stopped                 qemu-nbd -e 3 -k /tmp/sock -f raw /tmp/disk.img
5. # nbdsh -c 'h.connect_unix("/tmp/sock")' &
[5] 88578
# nbdsh -c 'h.connect_unix("/tmp/sock")' &
[6] 88580
# nbdsh -c 'h.connect_unix("/tmp/sock")' &
[7] 88582
# nbdsh -c 'h.connect_unix("/tmp/sock")' &
[8] 88584
# nbdsh -c 'h.connect_unix("/tmp/sock")' &
[9] 88586
# nbdsh: command line script failed: nbd_connect_unix: connect: server backlog overflowed, see https://bugzilla.redhat.com/1925045: Resource temporarily unavailable

6. Kill the export, then re-export the image:
# qemu-nbd -t -k /tmp/sock -f raw /tmp/disk.img
^Z
[1]+  Stopped                 qemu-nbd -t -k /tmp/sock -f raw /tmp/disk.img
7. Repeat connecting to the image 10 times:
# nbdsh -c 'h.connect_unix("/tmp/sock")' &
[4] 88533
...
# nbdsh -c 'h.connect_unix("/tmp/sock")' &
[14] 88553

Expected result:
Same as actual result.

Comment 17 errata-xmlrpc 2021-05-25 06:47:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (virt:av bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2098

