Description of problem: I tried reenabling disabled unit tests on rawhide. It seems kyua is not a problem anymore, but one of tests fails often on s390x platform. Version-Release number of selected component (if applicable): bind-9.16.28-2.fc37 How reproducible: ~90% reliable Steps to Reproduce: 1. use s390x to compile 2. fedpkg clone --anonymous bind && cd bind 3. fedpkg local --with UNITTESTS Actual results: lib/isc/tests/netmgr fails. Expected results: all tests pass just as on other platforms. Additional info: I have seen errors with s390x also before. But were able to debug it and find interesting parameters. Compared with build on x86_64. Only on s390x uv library calls nrecv == 0 and flags == 8. (gdb) break udp_recv_cb if nrecv == 0 && addr != 0 && flags == 8 This breaks a lot just on s390x. Then it fails. (gdb) bt #0 __GI_exit (status=status@entry=-1) at exit.c:143 #1 0x000003fffdd8318e in exit_test (quit_application=1) at /usr/src/debug/cmocka-1.1.5-11.fc36.s390x/src/cmocka.c:408 #2 0x000003fffdd8329a in _fail (file=file@entry=0x2aa000171cc "../../../../lib/isc/tests/netmgr_test.c", line=<optimized out>) at /usr/src/debug/cmocka-1.1.5-11.fc36.s390x/src/cmocka.c:2196 #3 0x000003fffdd8335c in _assert_true (result=0, line=<optimized out>, file=0x2aa000171cc "../../../../lib/isc/tests/netmgr_test.c", expression=0x2aa00017390 "region->length >= sizeof(magic)") at /usr/src/debug/cmocka-1.1.5-11.fc36.s390x/src/cmocka.c:1730 #4 0x000002aa00008074 in listen_read_cb (handle=<optimized out>, eresult=<optimized out>, region=region@entry=0x3fffc6f6c18, cbarg=0x0) at ../../../../lib/isc/tests/netmgr_test.c:523 #5 0x000003fffdeba746 in isc__nm_async_readcb (worker=worker@entry=0x0, ev0=ev0@entry=0x3fffc6f6cd0) at netmgr/../../../../lib/isc/netmgr/netmgr.c:2619 #6 0x000003fffdebaec8 in isc__nm_readcb (sock=0x2aa006084a0, uvreq=<optimized out>, eresult=<optimized out>) at netmgr/../../../../lib/isc/netmgr/netmgr.c:2592 #7 0x000002aa00006740 in udp_recv_cb (handle=<optimized out>, nrecv=0, buf=0x3fffc6f6f18, addr=<optimized out>, flags=<optimized out>) at ../../../../lib/isc/tests/../netmgr/udp.c:455 #8 0x000003fffdd24f3a in uv__udp_recvmmsg (handle=handle@entry=0x2aa006086d8, buf=buf@entry=0x3fffc6f7878) at src/unix/udp.c:231 #9 0x000003fffdd250e0 in uv__udp_recvmsg (handle=0x2aa006086d8) at src/unix/udp.c:273 #10 uv__udp_io (loop=<optimized out>, w=0x2aa00608758, revents=<optimized out>) at src/unix/udp.c:178 #11 0x000003fffdd265c8 in uv__io_poll (loop=0x2aa000db6d0, timeout=6326104) at src/unix/epoll.c:374 #12 0x000003fffdd0ffc2 in uv__io_poll (timeout=<optimized out>, loop=0x2aa000db6d0) at src/unix/epoll.c:136 #13 uv_run (loop=loop@entry=0x2aa000db6d0, mode=mode@entry=UV_RUN_DEFAULT) at src/unix/core.c:391 #14 0x000003fffdeb6eb2 in nm_thread (worker0=0x2aa000db6c0) at netmgr/../../../../lib/isc/netmgr/netmgr.c:749 #15 0x000003fffded4594 in isc__trampoline_run (arg=0x2aa000b9a40) at ../../../lib/isc/trampoline.c:198 #16 0x000003fffdb96238 in start_thread (arg=0x3fffc6fc840) at pthread_create.c:442 #17 0x000003fffdc116ee in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:66 The same breakpoint on x86 never stopped.
# kyua test aes_test:main -> passed [0.026s] buffer_test:main -> passed [0.072s] counter_test:main -> passed [0.037s] errno_test:main -> passed [0.024s] file_test:main -> passed [0.021s] hash_test:main -> passed [0.022s] heap_test:main -> passed [0.033s] hmac_test:main -> passed [0.026s] ht_test:main -> passed [5.119s] lex_test:main -> passed [0.030s] md_test:main -> passed [0.069s] mem_test:main -> passed [4.485s] netaddr_test:main -> passed [0.022s] netmgr_test:main -> broken: TAP test program yielded invalid data: Load of '/tmp/kyua.0eiQlP/28/stdout.txt' failed: Reported plan differs from actual executed tests [0.362s] parse_test:main -> passed [0.038s] pool_test:main -> passed [0.052s] radix_test:main -> passed [0.032s] quota_test:main -> passed [0.126s] regex_test:main -> passed [0.753s] result_test:main -> passed [0.031s] safe_test:main -> passed [0.028s] siphash_test:main -> passed [0.040s] sockaddr_test:main -> passed [0.032s] socket_test:main -> passed [0.143s] symtab_test:main -> passed [0.044s] task_test:main -> passed [12.238s] taskpool_test:main -> passed [0.066s] time_test:main -> passed [0.022s] timer_test:main -> passed [13.582s]
Reported it also upstream, but I guess I would have to debug on RH machines. I doubt upstream has required machine available.
Thread 39 "isc-net-0001" hit Breakpoint 4, udp_recv_cb (handle=0x2aa000cda18, nrecv=0, buf=0x3fffd038f18, addr=0x3fffd039068, flags=8) at ../../../../lib/isc/tests/../netmgr/udp.c:375 375 const struct sockaddr *addr, unsigned flags) { (gdb) bt #0 udp_recv_cb (handle=0x2aa000cda18, nrecv=0, buf=0x3fffd038f18, addr=0x3fffd039068, flags=8) at ../../../../lib/isc/tests/../netmgr/udp.c:375 #1 0x000003fffdd24f3a in uv__udp_recvmmsg (handle=handle@entry=0x2aa000cda18, buf=buf@entry=0x3fffd039878) at src/unix/udp.c:231 #2 0x000003fffdd250e0 in uv__udp_recvmsg (handle=0x2aa000cda18) at src/unix/udp.c:273 #3 uv__udp_io (loop=<optimized out>, w=0x2aa000cda98, revents=<optimized out>) at src/unix/udp.c:178 #4 0x000003fffdd265c8 in uv__io_poll (loop=0x2aa000dbb80, timeout=842392) at src/unix/epoll.c:374 #5 0x000003fffdd0ffc2 in uv__io_poll (timeout=<optimized out>, loop=0x2aa000dbb80) at src/unix/epoll.c:136 #6 uv_run (loop=loop@entry=0x2aa000dbb80, mode=mode@entry=UV_RUN_DEFAULT) at src/unix/core.c:391 #7 0x000003fffdeb6eb2 in nm_thread (worker0=0x2aa000dbb70) at netmgr/../../../../lib/isc/netmgr/netmgr.c:749 #8 0x000003fffded4594 in isc__trampoline_run (arg=0x2aa000b9a80) at ../../../lib/isc/trampoline.c:198 #9 0x000003fffdb96238 in start_thread (arg=0x3fffd03e840) at pthread_create.c:442 #10 0x000003fffdc116ee in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:66 Such calls happens multiple times. Then if fails with above assertion failure. It seems upstream expects nrecv should be always non-zero when flags=8. And it is on x86_64. But it does not work on this platform. When it failed, this were hopefully original handle: (gdb) set $h = (struct uv_udp_s *) 0x2aa000ddf48 (gdb) p $h $1 = (struct uv_udp_s *) 0x2aa000ddf48 (gdb) p *$h $2 = {data = 0x2aa000ddd10, loop = 0x2aa000db6d0, type = UV_UDP, close_cb = 0x0, handle_queue = {0x2aa000ddde0, 0x2aa000dba40}, u = {fd = 0, reserved = {0x0, 0x0, 0x0, 0x0}}, next_closing = 0x0, flags = 71311372, send_queue_size = 0, send_queue_count = 0, alloc_cb = 0x3fffdeb8dd0 <isc__nm_alloc_cb>, recv_cb = 0x2aa00006610 <udp_recv_cb>, io_watcher = {cb = 0x3fffdd25010 <uv__udp_io>, pending_queue = {0x2aa000ddfd0, 0x2aa000ddfd0}, watcher_queue = {0x2aa000ddfe0, 0x2aa000ddfe0}, pevents = 1, events = 1, fd = 22}, write_queue = { 0x2aa000de000, 0x2aa000de000}, write_completed_queue = {0x2aa000de010, 0x2aa000de010}} (gdb) p /x 71311372 $3 = 0x440200c
Hello Steven, I found you as Red Hat owner of libuv. Would you know, what might be reason libuv sends only on s390x platform callbacks with nrecv == 0 and flags == UV_UDP_MMSG_CHUNK? ISC upstream thinks it should be always non-zero. Comment in udp.h header suggests it should be zero with UV_UDP_MMSG_FREE. I am not able to provide short reproducer. This unit test runs several threads and does complicated stuff. Would you have any idea how can I narrow a reason of this difference? All other platforms pass fine. The test is in build/lib/isc/tests once built with UNITTESTS enabled. Can be run manually: # CMOCKA_MESSAGE_OUTPUT= ./netmgr_test [==========] Running 49 test(s). [ RUN ] mock_listenudp_uv_udp_open [ OK ] mock_listenudp_uv_udp_open [ RUN ] mock_listenudp_uv_udp_bind [ OK ] mock_listenudp_uv_udp_bind [ RUN ] mock_listenudp_uv_udp_recv_start [ OK ] mock_listenudp_uv_udp_recv_start [ RUN ] mock_udpconnect_uv_udp_open [ OK ] mock_udpconnect_uv_udp_open [ RUN ] mock_udpconnect_uv_udp_bind [ OK ] mock_udpconnect_uv_udp_bind [ RUN ] mock_udpconnect_uv_udp_connect [ OK ] mock_udpconnect_uv_udp_connect [ RUN ] mock_udpconnect_uv_recv_buffer_size [ OK ] mock_udpconnect_uv_recv_buffer_size [ RUN ] mock_udpconnect_uv_send_buffer_size [ OK ] mock_udpconnect_uv_send_buffer_size [ RUN ] udp_noop [ OK ] udp_noop [ RUN ] udp_noresponse [ OK ] udp_noresponse [ RUN ] udp_timeout_recovery [ OK ] udp_timeout_recovery [ RUN ] udp_recv_one # echo $? 255 On passing platform it should run much more tests.
I actually only package this for Fedora. I don't have much knowledge of its inner workings. That said, I fell down a rabbit hole and as best I can tell, the issue is most likely in the kernel. Here's the implementation of the function that returns the variable that stores nrecv and flags: int uv__recvmmsg(int fd, struct uv__mmsghdr* mmsg, unsigned int vlen) { #if defined(__i386__) unsigned long args[5]; int rc; args[0] = (unsigned long) fd; args[1] = (unsigned long) mmsg; args[2] = (unsigned long) vlen; args[3] = /* flags */ 0; args[4] = /* timeout */ 0; /* socketcall() raises EINVAL when SYS_RECVMMSG is not supported. */ rc = syscall(/* __NR_socketcall */ 102, 19 /* SYS_RECVMMSG */, args); if (rc == -1) if (errno == EINVAL) errno = ENOSYS; return rc; #elif defined(__NR_recvmmsg) return syscall(__NR_recvmmsg, fd, mmsg, vlen, /* flags */ 0, /* timeout */ 0); #else return errno = ENOSYS, -1; #endif } As you can see, all it does is pass the request on to a syscall and then return the result. It appears that the libuv code makes the assumption that any messages it receives must be non-zero in length, which may not be a safe assumption on s390x. I'll report this issue upstream to the libuv maintainers.
Actually, I just found the relevant documentation: http://docs.libuv.org/en/v1.x/udp.html?highlight=uv__recvmmsg#c.uv_udp_recv_cb See in particular: "Note The receive callback will be called with ... nread == 0 and addr != NULL when an empty UDP packet is received." So I think ISC's assumption that nread must always be nonzero is incorrect.
Created attachment 1882838 [details] test s390x pcap Test pcap done on s390x machine. Indeed all packets sent are of size 0.
Created attachment 1882845 [details] test x86_64 pcap Similar dump of communication on x86_64, bind-9.16.28 on f34
Think I found something in connect_send() function. (gdb) 423 isc_nm_send(sendhandle, (isc_region_t *)&send_msg, (gdb) p send_msg $3 = {base = 0x2aa0001cb48 <send_magic> "fP\256\071P\321\\\020]A:\376\005\204\256", <incomplete sequence \311>, len = 8} (gdb) step isc_nm_send (handle=0x2aa00608e20, region=0x2aa0001c7b0 <send_msg>, cb=0x2aa00007c20 <connect_send_cb>, cbarg=0x0) at netmgr/../../../../lib/isc/netmgr/netmgr.c:2427 2427 REQUIRE(VALID_NMHANDLE(handle)); (gdb) next 2429 switch (handle->sock->type) { (gdb) 2432 isc__nm_udp_send(handle, region, cb, cbarg); (gdb) p region $4 = (isc_region_t *) 0x2aa0001c7b0 <send_msg> (gdb) p *region $5 = {base = 0x2aa0001cb48 <send_magic> "fP\256\071P\321\\\020]A:\376\005\204\256", <incomplete sequence \311>, length = 0} (gdb) frame #0 isc_nm_send (handle=0x2aa00608e20, region=0x2aa0001c7b0 <send_msg>, cb=0x2aa00007c20 <connect_send_cb>, cbarg=0x0) at netmgr/../../../../lib/isc/netmgr/netmgr.c:2432 2432 isc__nm_udp_send(handle, region, cb, cbarg); It seems changing type from struct uv_buf_t to isc_region_t does not work as expected on this platform. (gdb) ptype send_msg type = struct uv_buf_t { char *base; size_t len; } (gdb) ptype region type = struct isc_region { unsigned char *base; unsigned int length; } * (gdb) p sizeof(region) $6 = 8 (gdb) p sizeof(send_msg) $7 = 16 This should not work and has to be done manual way. size_t and unsigned are not same size on other platforms. It seems to be just luck it worked on other platforms.
Reported to upstream on MR https://gitlab.isc.org/isc-projects/bind9/-/merge_requests/6352 Should affect any big endian architecture and affected would be any 9.16.x or higher version using libuv.
FEDORA-2022-d1c57e7e41 has been submitted as an update to Fedora 36. https://bodhi.fedoraproject.org/updates/FEDORA-2022-d1c57e7e41
FEDORA-2022-48eb6235a4 has been submitted as an update to Fedora 35. https://bodhi.fedoraproject.org/updates/FEDORA-2022-48eb6235a4
FEDORA-2022-d1c57e7e41 has been pushed to the Fedora 36 testing repository. Soon you'll be able to install the update with the following command: `sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2022-d1c57e7e41` You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2022-d1c57e7e41 See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.
FEDORA-2022-48eb6235a4 has been pushed to the Fedora 35 testing repository. Soon you'll be able to install the update with the following command: `sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2022-48eb6235a4` You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2022-48eb6235a4 See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.
FEDORA-2022-d1c57e7e41 has been pushed to the Fedora 36 stable repository. If problem still persists, please make note of it in this bug report.
FEDORA-2022-48eb6235a4 has been pushed to the Fedora 35 stable repository. If problem still persists, please make note of it in this bug report.