Bug 2088125
| Summary: | [s390x] unit test lib/isc/tests/netmgr_test fails reliably | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Petr Menšík <pemensik> | ||||||
| Component: | bind | Assignee: | Petr Menšík <pemensik> | ||||||
| Status: | CLOSED ERRATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa> | ||||||
| Severity: | unspecified | Docs Contact: | |||||||
| Priority: | unspecified | ||||||||
| Version: | rawhide | CC: | aegorenk, anon.amish, dns-sig, mruprich, pemensik, sgallagh, vonsch, zdohnal | ||||||
| Target Milestone: | --- | ||||||||
| Target Release: | --- | ||||||||
| Hardware: | s390x | ||||||||
| OS: | Unspecified | ||||||||
| Whiteboard: | |||||||||
| Fixed In Version: | bind-9.16.29-1.fc36 bind-9.16.29-1.fc35 | Doc Type: | If docs needed, set a value | ||||||
| Doc Text: | Story Points: | --- | |||||||
| Clone Of: | Environment: | ||||||||
| Last Closed: | 2022-06-01 01:24:28 UTC | Type: | Bug | ||||||
| Regression: | --- | Mount Type: | --- | ||||||
| Documentation: | --- | CRM: | |||||||
| Verified Versions: | Category: | --- | |||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||
| Embargoed: | |||||||||
| Attachments: |
|
||||||||
# kyua test aes_test:main -> passed [0.026s] buffer_test:main -> passed [0.072s] counter_test:main -> passed [0.037s] errno_test:main -> passed [0.024s] file_test:main -> passed [0.021s] hash_test:main -> passed [0.022s] heap_test:main -> passed [0.033s] hmac_test:main -> passed [0.026s] ht_test:main -> passed [5.119s] lex_test:main -> passed [0.030s] md_test:main -> passed [0.069s] mem_test:main -> passed [4.485s] netaddr_test:main -> passed [0.022s] netmgr_test:main -> broken: TAP test program yielded invalid data: Load of '/tmp/kyua.0eiQlP/28/stdout.txt' failed: Reported plan differs from actual executed tests [0.362s] parse_test:main -> passed [0.038s] pool_test:main -> passed [0.052s] radix_test:main -> passed [0.032s] quota_test:main -> passed [0.126s] regex_test:main -> passed [0.753s] result_test:main -> passed [0.031s] safe_test:main -> passed [0.028s] siphash_test:main -> passed [0.040s] sockaddr_test:main -> passed [0.032s] socket_test:main -> passed [0.143s] symtab_test:main -> passed [0.044s] task_test:main -> passed [12.238s] taskpool_test:main -> passed [0.066s] time_test:main -> passed [0.022s] timer_test:main -> passed [13.582s] Reported it also upstream, but I guess I would have to debug on RH machines. I doubt upstream has required machine available. Thread 39 "isc-net-0001" hit Breakpoint 4, udp_recv_cb (handle=0x2aa000cda18, nrecv=0, buf=0x3fffd038f18, addr=0x3fffd039068, flags=8) at ../../../../lib/isc/tests/../netmgr/udp.c:375
375 const struct sockaddr *addr, unsigned flags) {
(gdb) bt
#0 udp_recv_cb (handle=0x2aa000cda18, nrecv=0, buf=0x3fffd038f18, addr=0x3fffd039068, flags=8)
at ../../../../lib/isc/tests/../netmgr/udp.c:375
#1 0x000003fffdd24f3a in uv__udp_recvmmsg (handle=handle@entry=0x2aa000cda18, buf=buf@entry=0x3fffd039878)
at src/unix/udp.c:231
#2 0x000003fffdd250e0 in uv__udp_recvmsg (handle=0x2aa000cda18) at src/unix/udp.c:273
#3 uv__udp_io (loop=<optimized out>, w=0x2aa000cda98, revents=<optimized out>) at src/unix/udp.c:178
#4 0x000003fffdd265c8 in uv__io_poll (loop=0x2aa000dbb80, timeout=842392) at src/unix/epoll.c:374
#5 0x000003fffdd0ffc2 in uv__io_poll (timeout=<optimized out>, loop=0x2aa000dbb80) at src/unix/epoll.c:136
#6 uv_run (loop=loop@entry=0x2aa000dbb80, mode=mode@entry=UV_RUN_DEFAULT) at src/unix/core.c:391
#7 0x000003fffdeb6eb2 in nm_thread (worker0=0x2aa000dbb70) at netmgr/../../../../lib/isc/netmgr/netmgr.c:749
#8 0x000003fffded4594 in isc__trampoline_run (arg=0x2aa000b9a80) at ../../../lib/isc/trampoline.c:198
#9 0x000003fffdb96238 in start_thread (arg=0x3fffd03e840) at pthread_create.c:442
#10 0x000003fffdc116ee in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:66
Such calls happens multiple times. Then if fails with above assertion failure. It seems upstream expects nrecv should be always non-zero when flags=8. And it is on x86_64. But it does not work on this platform.
When it failed, this were hopefully original handle:
(gdb) set $h = (struct uv_udp_s *) 0x2aa000ddf48
(gdb) p $h
$1 = (struct uv_udp_s *) 0x2aa000ddf48
(gdb) p *$h
$2 = {data = 0x2aa000ddd10, loop = 0x2aa000db6d0, type = UV_UDP, close_cb = 0x0, handle_queue = {0x2aa000ddde0,
0x2aa000dba40}, u = {fd = 0, reserved = {0x0, 0x0, 0x0, 0x0}}, next_closing = 0x0, flags = 71311372,
send_queue_size = 0, send_queue_count = 0, alloc_cb = 0x3fffdeb8dd0 <isc__nm_alloc_cb>,
recv_cb = 0x2aa00006610 <udp_recv_cb>, io_watcher = {cb = 0x3fffdd25010 <uv__udp_io>, pending_queue = {0x2aa000ddfd0,
0x2aa000ddfd0}, watcher_queue = {0x2aa000ddfe0, 0x2aa000ddfe0}, pevents = 1, events = 1, fd = 22}, write_queue = {
0x2aa000de000, 0x2aa000de000}, write_completed_queue = {0x2aa000de010, 0x2aa000de010}}
(gdb) p /x 71311372
$3 = 0x440200c
Hello Steven, I found you as Red Hat owner of libuv. Would you know, what might be reason libuv sends only on s390x platform callbacks with nrecv == 0 and flags == UV_UDP_MMSG_CHUNK? ISC upstream thinks it should be always non-zero. Comment in udp.h header suggests it should be zero with UV_UDP_MMSG_FREE. I am not able to provide short reproducer. This unit test runs several threads and does complicated stuff. Would you have any idea how can I narrow a reason of this difference? All other platforms pass fine. The test is in build/lib/isc/tests once built with UNITTESTS enabled. Can be run manually: # CMOCKA_MESSAGE_OUTPUT= ./netmgr_test [==========] Running 49 test(s). [ RUN ] mock_listenudp_uv_udp_open [ OK ] mock_listenudp_uv_udp_open [ RUN ] mock_listenudp_uv_udp_bind [ OK ] mock_listenudp_uv_udp_bind [ RUN ] mock_listenudp_uv_udp_recv_start [ OK ] mock_listenudp_uv_udp_recv_start [ RUN ] mock_udpconnect_uv_udp_open [ OK ] mock_udpconnect_uv_udp_open [ RUN ] mock_udpconnect_uv_udp_bind [ OK ] mock_udpconnect_uv_udp_bind [ RUN ] mock_udpconnect_uv_udp_connect [ OK ] mock_udpconnect_uv_udp_connect [ RUN ] mock_udpconnect_uv_recv_buffer_size [ OK ] mock_udpconnect_uv_recv_buffer_size [ RUN ] mock_udpconnect_uv_send_buffer_size [ OK ] mock_udpconnect_uv_send_buffer_size [ RUN ] udp_noop [ OK ] udp_noop [ RUN ] udp_noresponse [ OK ] udp_noresponse [ RUN ] udp_timeout_recovery [ OK ] udp_timeout_recovery [ RUN ] udp_recv_one # echo $? 255 On passing platform it should run much more tests. I actually only package this for Fedora. I don't have much knowledge of its inner workings. That said, I fell down a rabbit hole and as best I can tell, the issue is most likely in the kernel. Here's the implementation of the function that returns the variable that stores nrecv and flags:
int uv__recvmmsg(int fd, struct uv__mmsghdr* mmsg, unsigned int vlen) {
#if defined(__i386__)
unsigned long args[5];
int rc;
args[0] = (unsigned long) fd;
args[1] = (unsigned long) mmsg;
args[2] = (unsigned long) vlen;
args[3] = /* flags */ 0;
args[4] = /* timeout */ 0;
/* socketcall() raises EINVAL when SYS_RECVMMSG is not supported. */
rc = syscall(/* __NR_socketcall */ 102, 19 /* SYS_RECVMMSG */, args);
if (rc == -1)
if (errno == EINVAL)
errno = ENOSYS;
return rc;
#elif defined(__NR_recvmmsg)
return syscall(__NR_recvmmsg, fd, mmsg, vlen, /* flags */ 0, /* timeout */ 0);
#else
return errno = ENOSYS, -1;
#endif
}
As you can see, all it does is pass the request on to a syscall and then return the result. It appears that the libuv code makes the assumption that any messages it receives must be non-zero in length, which may not be a safe assumption on s390x.
I'll report this issue upstream to the libuv maintainers.
Actually, I just found the relevant documentation: http://docs.libuv.org/en/v1.x/udp.html?highlight=uv__recvmmsg#c.uv_udp_recv_cb See in particular: "Note The receive callback will be called with ... nread == 0 and addr != NULL when an empty UDP packet is received." So I think ISC's assumption that nread must always be nonzero is incorrect. Created attachment 1882838 [details]
test s390x pcap
Test pcap done on s390x machine. Indeed all packets sent are of size 0.
Created attachment 1882845 [details]
test x86_64 pcap
Similar dump of communication on x86_64, bind-9.16.28 on f34
Think I found something in connect_send() function.
(gdb)
423 isc_nm_send(sendhandle, (isc_region_t *)&send_msg,
(gdb) p send_msg
$3 = {base = 0x2aa0001cb48 <send_magic> "fP\256\071P\321\\\020]A:\376\005\204\256", <incomplete sequence \311>, len = 8}
(gdb) step
isc_nm_send (handle=0x2aa00608e20, region=0x2aa0001c7b0 <send_msg>, cb=0x2aa00007c20 <connect_send_cb>, cbarg=0x0) at netmgr/../../../../lib/isc/netmgr/netmgr.c:2427
2427 REQUIRE(VALID_NMHANDLE(handle));
(gdb) next
2429 switch (handle->sock->type) {
(gdb)
2432 isc__nm_udp_send(handle, region, cb, cbarg);
(gdb) p region
$4 = (isc_region_t *) 0x2aa0001c7b0 <send_msg>
(gdb) p *region
$5 = {base = 0x2aa0001cb48 <send_magic> "fP\256\071P\321\\\020]A:\376\005\204\256", <incomplete sequence \311>, length = 0}
(gdb) frame
#0 isc_nm_send (handle=0x2aa00608e20, region=0x2aa0001c7b0 <send_msg>, cb=0x2aa00007c20 <connect_send_cb>, cbarg=0x0)
at netmgr/../../../../lib/isc/netmgr/netmgr.c:2432
2432 isc__nm_udp_send(handle, region, cb, cbarg);
It seems changing type from struct uv_buf_t to isc_region_t does not work as expected on this platform.
(gdb) ptype send_msg
type = struct uv_buf_t {
char *base;
size_t len;
}
(gdb) ptype region
type = struct isc_region {
unsigned char *base;
unsigned int length;
} *
(gdb) p sizeof(region)
$6 = 8
(gdb) p sizeof(send_msg)
$7 = 16
This should not work and has to be done manual way. size_t and unsigned are not same size on other platforms. It seems to be just luck it worked on other platforms.
Reported to upstream on MR https://gitlab.isc.org/isc-projects/bind9/-/merge_requests/6352 Should affect any big endian architecture and affected would be any 9.16.x or higher version using libuv. FEDORA-2022-d1c57e7e41 has been submitted as an update to Fedora 36. https://bodhi.fedoraproject.org/updates/FEDORA-2022-d1c57e7e41 FEDORA-2022-48eb6235a4 has been submitted as an update to Fedora 35. https://bodhi.fedoraproject.org/updates/FEDORA-2022-48eb6235a4 FEDORA-2022-d1c57e7e41 has been pushed to the Fedora 36 testing repository. Soon you'll be able to install the update with the following command: `sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2022-d1c57e7e41` You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2022-d1c57e7e41 See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates. FEDORA-2022-48eb6235a4 has been pushed to the Fedora 35 testing repository. Soon you'll be able to install the update with the following command: `sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2022-48eb6235a4` You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2022-48eb6235a4 See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates. FEDORA-2022-d1c57e7e41 has been pushed to the Fedora 36 stable repository. If problem still persists, please make note of it in this bug report. FEDORA-2022-48eb6235a4 has been pushed to the Fedora 35 stable repository. If problem still persists, please make note of it in this bug report. |
Description of problem: I tried reenabling disabled unit tests on rawhide. It seems kyua is not a problem anymore, but one of tests fails often on s390x platform. Version-Release number of selected component (if applicable): bind-9.16.28-2.fc37 How reproducible: ~90% reliable Steps to Reproduce: 1. use s390x to compile 2. fedpkg clone --anonymous bind && cd bind 3. fedpkg local --with UNITTESTS Actual results: lib/isc/tests/netmgr fails. Expected results: all tests pass just as on other platforms. Additional info: I have seen errors with s390x also before. But were able to debug it and find interesting parameters. Compared with build on x86_64. Only on s390x uv library calls nrecv == 0 and flags == 8. (gdb) break udp_recv_cb if nrecv == 0 && addr != 0 && flags == 8 This breaks a lot just on s390x. Then it fails. (gdb) bt #0 __GI_exit (status=status@entry=-1) at exit.c:143 #1 0x000003fffdd8318e in exit_test (quit_application=1) at /usr/src/debug/cmocka-1.1.5-11.fc36.s390x/src/cmocka.c:408 #2 0x000003fffdd8329a in _fail (file=file@entry=0x2aa000171cc "../../../../lib/isc/tests/netmgr_test.c", line=<optimized out>) at /usr/src/debug/cmocka-1.1.5-11.fc36.s390x/src/cmocka.c:2196 #3 0x000003fffdd8335c in _assert_true (result=0, line=<optimized out>, file=0x2aa000171cc "../../../../lib/isc/tests/netmgr_test.c", expression=0x2aa00017390 "region->length >= sizeof(magic)") at /usr/src/debug/cmocka-1.1.5-11.fc36.s390x/src/cmocka.c:1730 #4 0x000002aa00008074 in listen_read_cb (handle=<optimized out>, eresult=<optimized out>, region=region@entry=0x3fffc6f6c18, cbarg=0x0) at ../../../../lib/isc/tests/netmgr_test.c:523 #5 0x000003fffdeba746 in isc__nm_async_readcb (worker=worker@entry=0x0, ev0=ev0@entry=0x3fffc6f6cd0) at netmgr/../../../../lib/isc/netmgr/netmgr.c:2619 #6 0x000003fffdebaec8 in isc__nm_readcb (sock=0x2aa006084a0, uvreq=<optimized out>, eresult=<optimized out>) at netmgr/../../../../lib/isc/netmgr/netmgr.c:2592 #7 0x000002aa00006740 in udp_recv_cb (handle=<optimized out>, nrecv=0, buf=0x3fffc6f6f18, addr=<optimized out>, flags=<optimized out>) at ../../../../lib/isc/tests/../netmgr/udp.c:455 #8 0x000003fffdd24f3a in uv__udp_recvmmsg (handle=handle@entry=0x2aa006086d8, buf=buf@entry=0x3fffc6f7878) at src/unix/udp.c:231 #9 0x000003fffdd250e0 in uv__udp_recvmsg (handle=0x2aa006086d8) at src/unix/udp.c:273 #10 uv__udp_io (loop=<optimized out>, w=0x2aa00608758, revents=<optimized out>) at src/unix/udp.c:178 #11 0x000003fffdd265c8 in uv__io_poll (loop=0x2aa000db6d0, timeout=6326104) at src/unix/epoll.c:374 #12 0x000003fffdd0ffc2 in uv__io_poll (timeout=<optimized out>, loop=0x2aa000db6d0) at src/unix/epoll.c:136 #13 uv_run (loop=loop@entry=0x2aa000db6d0, mode=mode@entry=UV_RUN_DEFAULT) at src/unix/core.c:391 #14 0x000003fffdeb6eb2 in nm_thread (worker0=0x2aa000db6c0) at netmgr/../../../../lib/isc/netmgr/netmgr.c:749 #15 0x000003fffded4594 in isc__trampoline_run (arg=0x2aa000b9a40) at ../../../lib/isc/trampoline.c:198 #16 0x000003fffdb96238 in start_thread (arg=0x3fffc6fc840) at pthread_create.c:442 #17 0x000003fffdc116ee in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:66 The same breakpoint on x86 never stopped.