Bug 2088125

Summary: [s390x] unit test lib/isc/tests/netmgr_test fails reliably
Product: [Fedora] Fedora Reporter: Petr Menšík <pemensik>
Component: bindAssignee: Petr Menšík <pemensik>
Status: CLOSED ERRATA QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: rawhideCC: aegorenk, anon.amish, dns-sig, mruprich, pemensik, sgallagh, vonsch, zdohnal
Target Milestone: ---   
Target Release: ---   
Hardware: s390x   
OS: Unspecified   
Whiteboard:
Fixed In Version: bind-9.16.29-1.fc36 bind-9.16.29-1.fc35 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-06-01 01:24:28 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
test s390x pcap
none
test x86_64 pcap none

Description Petr Menšík 2022-05-18 18:37:10 UTC
Description of problem:
I tried reenabling disabled unit tests on rawhide. It seems kyua is not a problem anymore, but one of tests fails often on s390x platform.

Version-Release number of selected component (if applicable):
bind-9.16.28-2.fc37

How reproducible:
~90% reliable

Steps to Reproduce:
1. use s390x to compile
2. fedpkg clone --anonymous bind && cd bind
3. fedpkg local --with UNITTESTS

Actual results:
lib/isc/tests/netmgr fails.

Expected results:
all tests pass just as on other platforms.

Additional info:
I have seen errors with s390x also before. But were able to debug it and find interesting parameters. Compared with build on x86_64. Only on s390x uv library calls nrecv == 0 and flags == 8.

(gdb) break udp_recv_cb if nrecv == 0 && addr != 0 && flags  == 8

This breaks a lot just on s390x. Then it fails.

(gdb) bt
#0  __GI_exit (status=status@entry=-1) at exit.c:143
#1  0x000003fffdd8318e in exit_test (quit_application=1) at /usr/src/debug/cmocka-1.1.5-11.fc36.s390x/src/cmocka.c:408
#2  0x000003fffdd8329a in _fail (file=file@entry=0x2aa000171cc "../../../../lib/isc/tests/netmgr_test.c", 
    line=<optimized out>) at /usr/src/debug/cmocka-1.1.5-11.fc36.s390x/src/cmocka.c:2196
#3  0x000003fffdd8335c in _assert_true (result=0, line=<optimized out>, 
    file=0x2aa000171cc "../../../../lib/isc/tests/netmgr_test.c", 
    expression=0x2aa00017390 "region->length >= sizeof(magic)")
    at /usr/src/debug/cmocka-1.1.5-11.fc36.s390x/src/cmocka.c:1730
#4  0x000002aa00008074 in listen_read_cb (handle=<optimized out>, eresult=<optimized out>, 
    region=region@entry=0x3fffc6f6c18, cbarg=0x0) at ../../../../lib/isc/tests/netmgr_test.c:523
#5  0x000003fffdeba746 in isc__nm_async_readcb (worker=worker@entry=0x0, ev0=ev0@entry=0x3fffc6f6cd0)
    at netmgr/../../../../lib/isc/netmgr/netmgr.c:2619
#6  0x000003fffdebaec8 in isc__nm_readcb (sock=0x2aa006084a0, uvreq=<optimized out>, eresult=<optimized out>)
    at netmgr/../../../../lib/isc/netmgr/netmgr.c:2592
#7  0x000002aa00006740 in udp_recv_cb (handle=<optimized out>, nrecv=0, buf=0x3fffc6f6f18, addr=<optimized out>, 
    flags=<optimized out>) at ../../../../lib/isc/tests/../netmgr/udp.c:455
#8  0x000003fffdd24f3a in uv__udp_recvmmsg (handle=handle@entry=0x2aa006086d8, buf=buf@entry=0x3fffc6f7878)
    at src/unix/udp.c:231
#9  0x000003fffdd250e0 in uv__udp_recvmsg (handle=0x2aa006086d8) at src/unix/udp.c:273
#10 uv__udp_io (loop=<optimized out>, w=0x2aa00608758, revents=<optimized out>) at src/unix/udp.c:178
#11 0x000003fffdd265c8 in uv__io_poll (loop=0x2aa000db6d0, timeout=6326104) at src/unix/epoll.c:374
#12 0x000003fffdd0ffc2 in uv__io_poll (timeout=<optimized out>, loop=0x2aa000db6d0) at src/unix/epoll.c:136
#13 uv_run (loop=loop@entry=0x2aa000db6d0, mode=mode@entry=UV_RUN_DEFAULT) at src/unix/core.c:391
#14 0x000003fffdeb6eb2 in nm_thread (worker0=0x2aa000db6c0) at netmgr/../../../../lib/isc/netmgr/netmgr.c:749
#15 0x000003fffded4594 in isc__trampoline_run (arg=0x2aa000b9a40) at ../../../lib/isc/trampoline.c:198
#16 0x000003fffdb96238 in start_thread (arg=0x3fffc6fc840) at pthread_create.c:442
#17 0x000003fffdc116ee in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:66

The same breakpoint on x86 never stopped.

Comment 1 Petr Menšík 2022-05-18 18:41:02 UTC
# kyua test
aes_test:main  ->  passed  [0.026s]
buffer_test:main  ->  passed  [0.072s]
counter_test:main  ->  passed  [0.037s]
errno_test:main  ->  passed  [0.024s]
file_test:main  ->  passed  [0.021s]
hash_test:main  ->  passed  [0.022s]
heap_test:main  ->  passed  [0.033s]
hmac_test:main  ->  passed  [0.026s]
ht_test:main  ->  passed  [5.119s]
lex_test:main  ->  passed  [0.030s]
md_test:main  ->  passed  [0.069s]
mem_test:main  ->  passed  [4.485s]
netaddr_test:main  ->  passed  [0.022s]
netmgr_test:main  ->  broken: TAP test program yielded invalid data: Load of '/tmp/kyua.0eiQlP/28/stdout.txt' failed: Reported plan differs from actual executed tests  [0.362s]
parse_test:main  ->  passed  [0.038s]
pool_test:main  ->  passed  [0.052s]
radix_test:main  ->  passed  [0.032s]
quota_test:main  ->  passed  [0.126s]
regex_test:main  ->  passed  [0.753s]
result_test:main  ->  passed  [0.031s]
safe_test:main  ->  passed  [0.028s]
siphash_test:main  ->  passed  [0.040s]
sockaddr_test:main  ->  passed  [0.032s]
socket_test:main  ->  passed  [0.143s]
symtab_test:main  ->  passed  [0.044s]
task_test:main  ->  passed  [12.238s]
taskpool_test:main  ->  passed  [0.066s]
time_test:main  ->  passed  [0.022s]
timer_test:main  ->  passed  [13.582s]

Comment 2 Petr Menšík 2022-05-18 19:02:07 UTC
Reported it also upstream, but I guess I would have to debug on RH machines. I doubt upstream has required machine available.

Comment 3 Petr Menšík 2022-05-18 19:15:33 UTC
Thread 39 "isc-net-0001" hit Breakpoint 4, udp_recv_cb (handle=0x2aa000cda18, nrecv=0, buf=0x3fffd038f18, addr=0x3fffd039068, flags=8) at ../../../../lib/isc/tests/../netmgr/udp.c:375
375		    const struct sockaddr *addr, unsigned flags) {
(gdb) bt
#0  udp_recv_cb (handle=0x2aa000cda18, nrecv=0, buf=0x3fffd038f18, addr=0x3fffd039068, flags=8)
    at ../../../../lib/isc/tests/../netmgr/udp.c:375
#1  0x000003fffdd24f3a in uv__udp_recvmmsg (handle=handle@entry=0x2aa000cda18, buf=buf@entry=0x3fffd039878)
    at src/unix/udp.c:231
#2  0x000003fffdd250e0 in uv__udp_recvmsg (handle=0x2aa000cda18) at src/unix/udp.c:273
#3  uv__udp_io (loop=<optimized out>, w=0x2aa000cda98, revents=<optimized out>) at src/unix/udp.c:178
#4  0x000003fffdd265c8 in uv__io_poll (loop=0x2aa000dbb80, timeout=842392) at src/unix/epoll.c:374
#5  0x000003fffdd0ffc2 in uv__io_poll (timeout=<optimized out>, loop=0x2aa000dbb80) at src/unix/epoll.c:136
#6  uv_run (loop=loop@entry=0x2aa000dbb80, mode=mode@entry=UV_RUN_DEFAULT) at src/unix/core.c:391
#7  0x000003fffdeb6eb2 in nm_thread (worker0=0x2aa000dbb70) at netmgr/../../../../lib/isc/netmgr/netmgr.c:749
#8  0x000003fffded4594 in isc__trampoline_run (arg=0x2aa000b9a80) at ../../../lib/isc/trampoline.c:198
#9  0x000003fffdb96238 in start_thread (arg=0x3fffd03e840) at pthread_create.c:442
#10 0x000003fffdc116ee in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:66

Such calls happens multiple times. Then if fails with above assertion failure. It seems upstream expects nrecv should be always non-zero when flags=8. And it is on x86_64. But it does not work on this platform.

When it failed, this were hopefully original handle:
(gdb) set $h = (struct uv_udp_s *) 0x2aa000ddf48
(gdb) p $h
$1 = (struct uv_udp_s *) 0x2aa000ddf48
(gdb) p *$h
$2 = {data = 0x2aa000ddd10, loop = 0x2aa000db6d0, type = UV_UDP, close_cb = 0x0, handle_queue = {0x2aa000ddde0, 
    0x2aa000dba40}, u = {fd = 0, reserved = {0x0, 0x0, 0x0, 0x0}}, next_closing = 0x0, flags = 71311372, 
  send_queue_size = 0, send_queue_count = 0, alloc_cb = 0x3fffdeb8dd0 <isc__nm_alloc_cb>, 
  recv_cb = 0x2aa00006610 <udp_recv_cb>, io_watcher = {cb = 0x3fffdd25010 <uv__udp_io>, pending_queue = {0x2aa000ddfd0, 
      0x2aa000ddfd0}, watcher_queue = {0x2aa000ddfe0, 0x2aa000ddfe0}, pevents = 1, events = 1, fd = 22}, write_queue = {
    0x2aa000de000, 0x2aa000de000}, write_completed_queue = {0x2aa000de010, 0x2aa000de010}}
(gdb) p /x 71311372
$3 = 0x440200c

Comment 4 Petr Menšík 2022-05-18 19:28:28 UTC
Hello Steven, I found you as Red Hat owner of libuv. Would you know, what might be reason libuv sends only on s390x platform callbacks with nrecv == 0 and flags == UV_UDP_MMSG_CHUNK? ISC upstream thinks it should be always non-zero. Comment in udp.h header suggests it should be zero with UV_UDP_MMSG_FREE.
I am not able to provide short reproducer. This unit test runs several threads and does complicated stuff.

Would you have any idea how can I narrow a reason of this difference? All other platforms pass fine. The test is in build/lib/isc/tests once built with UNITTESTS enabled. Can be run manually:

# CMOCKA_MESSAGE_OUTPUT= ./netmgr_test 
[==========] Running 49 test(s).
[ RUN      ] mock_listenudp_uv_udp_open
[       OK ] mock_listenudp_uv_udp_open
[ RUN      ] mock_listenudp_uv_udp_bind
[       OK ] mock_listenudp_uv_udp_bind
[ RUN      ] mock_listenudp_uv_udp_recv_start
[       OK ] mock_listenudp_uv_udp_recv_start
[ RUN      ] mock_udpconnect_uv_udp_open
[       OK ] mock_udpconnect_uv_udp_open
[ RUN      ] mock_udpconnect_uv_udp_bind
[       OK ] mock_udpconnect_uv_udp_bind
[ RUN      ] mock_udpconnect_uv_udp_connect
[       OK ] mock_udpconnect_uv_udp_connect
[ RUN      ] mock_udpconnect_uv_recv_buffer_size
[       OK ] mock_udpconnect_uv_recv_buffer_size
[ RUN      ] mock_udpconnect_uv_send_buffer_size
[       OK ] mock_udpconnect_uv_send_buffer_size
[ RUN      ] udp_noop
[       OK ] udp_noop
[ RUN      ] udp_noresponse
[       OK ] udp_noresponse
[ RUN      ] udp_timeout_recovery
[       OK ] udp_timeout_recovery
[ RUN      ] udp_recv_one
# echo $?
255

On passing platform it should run much more tests.

Comment 5 Stephen Gallagher 2022-05-18 21:33:14 UTC
I actually only package this for Fedora. I don't have much knowledge of its inner workings. That said, I fell down a rabbit hole and as best I can tell, the issue is most likely in the kernel. Here's the implementation of the function that returns the variable that stores nrecv and flags:

int uv__recvmmsg(int fd, struct uv__mmsghdr* mmsg, unsigned int vlen) {
#if defined(__i386__)
  unsigned long args[5];
  int rc;

  args[0] = (unsigned long) fd;
  args[1] = (unsigned long) mmsg;
  args[2] = (unsigned long) vlen;
  args[3] = /* flags */ 0;
  args[4] = /* timeout */ 0;

  /* socketcall() raises EINVAL when SYS_RECVMMSG is not supported. */
  rc = syscall(/* __NR_socketcall */ 102, 19 /* SYS_RECVMMSG */, args);
  if (rc == -1)
    if (errno == EINVAL)
      errno = ENOSYS;

  return rc;
#elif defined(__NR_recvmmsg)
  return syscall(__NR_recvmmsg, fd, mmsg, vlen, /* flags */ 0, /* timeout */ 0);
#else
  return errno = ENOSYS, -1;
#endif
}

As you can see, all it does is pass the request on to a syscall and then return the result. It appears that the libuv code makes the assumption that any messages it receives must be non-zero in length, which may not be a safe assumption on s390x.

I'll report this issue upstream to the libuv maintainers.

Comment 6 Stephen Gallagher 2022-05-18 21:49:38 UTC
Actually, I just found the relevant documentation:
http://docs.libuv.org/en/v1.x/udp.html?highlight=uv__recvmmsg#c.uv_udp_recv_cb

See in particular: "Note The receive callback will be called with ... nread == 0 and addr != NULL when an empty UDP packet is received."

So I think ISC's assumption that nread must always be nonzero is incorrect.

Comment 7 Petr Menšík 2022-05-24 16:19:53 UTC
Created attachment 1882838 [details]
test s390x pcap

Test pcap done on s390x machine. Indeed all packets sent are of size 0.

Comment 8 Petr Menšík 2022-05-24 17:00:24 UTC
Created attachment 1882845 [details]
test x86_64 pcap

Similar dump of communication on x86_64, bind-9.16.28 on f34

Comment 9 Petr Menšík 2022-05-24 17:19:28 UTC
Think I found something in connect_send() function.

(gdb) 
423			isc_nm_send(sendhandle, (isc_region_t *)&send_msg,
(gdb) p send_msg
$3 = {base = 0x2aa0001cb48 <send_magic> "fP\256\071P\321\\\020]A:\376\005\204\256", <incomplete sequence \311>, len = 8}
(gdb) step
isc_nm_send (handle=0x2aa00608e20, region=0x2aa0001c7b0 <send_msg>, cb=0x2aa00007c20 <connect_send_cb>, cbarg=0x0) at netmgr/../../../../lib/isc/netmgr/netmgr.c:2427
2427		REQUIRE(VALID_NMHANDLE(handle));
(gdb) next
2429		switch (handle->sock->type) {
(gdb) 
2432			isc__nm_udp_send(handle, region, cb, cbarg);
(gdb) p region
$4 = (isc_region_t *) 0x2aa0001c7b0 <send_msg>
(gdb) p *region
$5 = {base = 0x2aa0001cb48 <send_magic> "fP\256\071P\321\\\020]A:\376\005\204\256", <incomplete sequence \311>, length = 0}
(gdb) frame
#0  isc_nm_send (handle=0x2aa00608e20, region=0x2aa0001c7b0 <send_msg>, cb=0x2aa00007c20 <connect_send_cb>, cbarg=0x0)
    at netmgr/../../../../lib/isc/netmgr/netmgr.c:2432
2432			isc__nm_udp_send(handle, region, cb, cbarg);

It seems changing type from struct uv_buf_t to isc_region_t does not work as expected on this platform.

(gdb) ptype send_msg
type = struct uv_buf_t {
    char *base;
    size_t len;
}
(gdb) ptype region
type = struct isc_region {
    unsigned char *base;
    unsigned int length;
} *
(gdb) p sizeof(region)
$6 = 8
(gdb) p sizeof(send_msg)
$7 = 16

This should not work and has to be done manual way. size_t and unsigned are not same size on other platforms. It seems to be just luck it worked on other platforms.

Comment 10 Petr Menšík 2022-05-24 17:54:37 UTC
Reported to upstream on MR https://gitlab.isc.org/isc-projects/bind9/-/merge_requests/6352

Should affect any big endian architecture and affected would be any 9.16.x or higher version using libuv.

Comment 11 Fedora Update System 2022-05-27 20:19:16 UTC
FEDORA-2022-d1c57e7e41 has been submitted as an update to Fedora 36. https://bodhi.fedoraproject.org/updates/FEDORA-2022-d1c57e7e41

Comment 12 Fedora Update System 2022-05-27 22:16:13 UTC
FEDORA-2022-48eb6235a4 has been submitted as an update to Fedora 35. https://bodhi.fedoraproject.org/updates/FEDORA-2022-48eb6235a4

Comment 13 Fedora Update System 2022-05-28 02:06:51 UTC
FEDORA-2022-d1c57e7e41 has been pushed to the Fedora 36 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2022-d1c57e7e41`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2022-d1c57e7e41

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 14 Fedora Update System 2022-05-28 02:23:42 UTC
FEDORA-2022-48eb6235a4 has been pushed to the Fedora 35 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2022-48eb6235a4`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2022-48eb6235a4

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 15 Fedora Update System 2022-06-01 01:24:28 UTC
FEDORA-2022-d1c57e7e41 has been pushed to the Fedora 36 stable repository.
If problem still persists, please make note of it in this bug report.

Comment 16 Fedora Update System 2022-06-05 01:27:27 UTC
FEDORA-2022-48eb6235a4 has been pushed to the Fedora 35 stable repository.
If problem still persists, please make note of it in this bug report.