Bug 2088125 - [s390x] unit test lib/isc/tests/netmgr_test fails reliably
Summary: [s390x] unit test lib/isc/tests/netmgr_test fails reliably
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: bind
Version: rawhide
Hardware: s390x
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Petr Menšík
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2022-05-18 18:37 UTC by Petr Menšík
Modified: 2022-06-05 01:27 UTC (History)
8 users (show)

Fixed In Version: bind-9.16.29-1.fc36 bind-9.16.29-1.fc35
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-06-01 01:24:28 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
test s390x pcap (1.62 KB, application/vnd.tcpdump.pcap)
2022-05-24 16:19 UTC, Petr Menšík
no flags Details
test x86_64 pcap (959.12 KB, application/vnd.tcpdump.pcap)
2022-05-24 17:00 UTC, Petr Menšík
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Internet Systems Consortium (ISC) isc-projects bind9 issues 2801 0 None opened unit test netmgr_test of 9.17.15 fails reliably on s390x 2022-05-18 19:02:06 UTC
Internet Systems Consortium (ISC) isc-projects bind9 merge_requests 6352 0 None opened Fix failures in isc netmgr_test on big endian machines 2022-05-24 17:54:37 UTC

Description Petr Menšík 2022-05-18 18:37:10 UTC
Description of problem:
I tried reenabling disabled unit tests on rawhide. It seems kyua is not a problem anymore, but one of tests fails often on s390x platform.

Version-Release number of selected component (if applicable):
bind-9.16.28-2.fc37

How reproducible:
~90% reliable

Steps to Reproduce:
1. use s390x to compile
2. fedpkg clone --anonymous bind && cd bind
3. fedpkg local --with UNITTESTS

Actual results:
lib/isc/tests/netmgr fails.

Expected results:
all tests pass just as on other platforms.

Additional info:
I have seen errors with s390x also before. But were able to debug it and find interesting parameters. Compared with build on x86_64. Only on s390x uv library calls nrecv == 0 and flags == 8.

(gdb) break udp_recv_cb if nrecv == 0 && addr != 0 && flags  == 8

This breaks a lot just on s390x. Then it fails.

(gdb) bt
#0  __GI_exit (status=status@entry=-1) at exit.c:143
#1  0x000003fffdd8318e in exit_test (quit_application=1) at /usr/src/debug/cmocka-1.1.5-11.fc36.s390x/src/cmocka.c:408
#2  0x000003fffdd8329a in _fail (file=file@entry=0x2aa000171cc "../../../../lib/isc/tests/netmgr_test.c", 
    line=<optimized out>) at /usr/src/debug/cmocka-1.1.5-11.fc36.s390x/src/cmocka.c:2196
#3  0x000003fffdd8335c in _assert_true (result=0, line=<optimized out>, 
    file=0x2aa000171cc "../../../../lib/isc/tests/netmgr_test.c", 
    expression=0x2aa00017390 "region->length >= sizeof(magic)")
    at /usr/src/debug/cmocka-1.1.5-11.fc36.s390x/src/cmocka.c:1730
#4  0x000002aa00008074 in listen_read_cb (handle=<optimized out>, eresult=<optimized out>, 
    region=region@entry=0x3fffc6f6c18, cbarg=0x0) at ../../../../lib/isc/tests/netmgr_test.c:523
#5  0x000003fffdeba746 in isc__nm_async_readcb (worker=worker@entry=0x0, ev0=ev0@entry=0x3fffc6f6cd0)
    at netmgr/../../../../lib/isc/netmgr/netmgr.c:2619
#6  0x000003fffdebaec8 in isc__nm_readcb (sock=0x2aa006084a0, uvreq=<optimized out>, eresult=<optimized out>)
    at netmgr/../../../../lib/isc/netmgr/netmgr.c:2592
#7  0x000002aa00006740 in udp_recv_cb (handle=<optimized out>, nrecv=0, buf=0x3fffc6f6f18, addr=<optimized out>, 
    flags=<optimized out>) at ../../../../lib/isc/tests/../netmgr/udp.c:455
#8  0x000003fffdd24f3a in uv__udp_recvmmsg (handle=handle@entry=0x2aa006086d8, buf=buf@entry=0x3fffc6f7878)
    at src/unix/udp.c:231
#9  0x000003fffdd250e0 in uv__udp_recvmsg (handle=0x2aa006086d8) at src/unix/udp.c:273
#10 uv__udp_io (loop=<optimized out>, w=0x2aa00608758, revents=<optimized out>) at src/unix/udp.c:178
#11 0x000003fffdd265c8 in uv__io_poll (loop=0x2aa000db6d0, timeout=6326104) at src/unix/epoll.c:374
#12 0x000003fffdd0ffc2 in uv__io_poll (timeout=<optimized out>, loop=0x2aa000db6d0) at src/unix/epoll.c:136
#13 uv_run (loop=loop@entry=0x2aa000db6d0, mode=mode@entry=UV_RUN_DEFAULT) at src/unix/core.c:391
#14 0x000003fffdeb6eb2 in nm_thread (worker0=0x2aa000db6c0) at netmgr/../../../../lib/isc/netmgr/netmgr.c:749
#15 0x000003fffded4594 in isc__trampoline_run (arg=0x2aa000b9a40) at ../../../lib/isc/trampoline.c:198
#16 0x000003fffdb96238 in start_thread (arg=0x3fffc6fc840) at pthread_create.c:442
#17 0x000003fffdc116ee in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:66

The same breakpoint on x86 never stopped.

Comment 1 Petr Menšík 2022-05-18 18:41:02 UTC
# kyua test
aes_test:main  ->  passed  [0.026s]
buffer_test:main  ->  passed  [0.072s]
counter_test:main  ->  passed  [0.037s]
errno_test:main  ->  passed  [0.024s]
file_test:main  ->  passed  [0.021s]
hash_test:main  ->  passed  [0.022s]
heap_test:main  ->  passed  [0.033s]
hmac_test:main  ->  passed  [0.026s]
ht_test:main  ->  passed  [5.119s]
lex_test:main  ->  passed  [0.030s]
md_test:main  ->  passed  [0.069s]
mem_test:main  ->  passed  [4.485s]
netaddr_test:main  ->  passed  [0.022s]
netmgr_test:main  ->  broken: TAP test program yielded invalid data: Load of '/tmp/kyua.0eiQlP/28/stdout.txt' failed: Reported plan differs from actual executed tests  [0.362s]
parse_test:main  ->  passed  [0.038s]
pool_test:main  ->  passed  [0.052s]
radix_test:main  ->  passed  [0.032s]
quota_test:main  ->  passed  [0.126s]
regex_test:main  ->  passed  [0.753s]
result_test:main  ->  passed  [0.031s]
safe_test:main  ->  passed  [0.028s]
siphash_test:main  ->  passed  [0.040s]
sockaddr_test:main  ->  passed  [0.032s]
socket_test:main  ->  passed  [0.143s]
symtab_test:main  ->  passed  [0.044s]
task_test:main  ->  passed  [12.238s]
taskpool_test:main  ->  passed  [0.066s]
time_test:main  ->  passed  [0.022s]
timer_test:main  ->  passed  [13.582s]

Comment 2 Petr Menšík 2022-05-18 19:02:07 UTC
Reported it also upstream, but I guess I would have to debug on RH machines. I doubt upstream has required machine available.

Comment 3 Petr Menšík 2022-05-18 19:15:33 UTC
Thread 39 "isc-net-0001" hit Breakpoint 4, udp_recv_cb (handle=0x2aa000cda18, nrecv=0, buf=0x3fffd038f18, addr=0x3fffd039068, flags=8) at ../../../../lib/isc/tests/../netmgr/udp.c:375
375		    const struct sockaddr *addr, unsigned flags) {
(gdb) bt
#0  udp_recv_cb (handle=0x2aa000cda18, nrecv=0, buf=0x3fffd038f18, addr=0x3fffd039068, flags=8)
    at ../../../../lib/isc/tests/../netmgr/udp.c:375
#1  0x000003fffdd24f3a in uv__udp_recvmmsg (handle=handle@entry=0x2aa000cda18, buf=buf@entry=0x3fffd039878)
    at src/unix/udp.c:231
#2  0x000003fffdd250e0 in uv__udp_recvmsg (handle=0x2aa000cda18) at src/unix/udp.c:273
#3  uv__udp_io (loop=<optimized out>, w=0x2aa000cda98, revents=<optimized out>) at src/unix/udp.c:178
#4  0x000003fffdd265c8 in uv__io_poll (loop=0x2aa000dbb80, timeout=842392) at src/unix/epoll.c:374
#5  0x000003fffdd0ffc2 in uv__io_poll (timeout=<optimized out>, loop=0x2aa000dbb80) at src/unix/epoll.c:136
#6  uv_run (loop=loop@entry=0x2aa000dbb80, mode=mode@entry=UV_RUN_DEFAULT) at src/unix/core.c:391
#7  0x000003fffdeb6eb2 in nm_thread (worker0=0x2aa000dbb70) at netmgr/../../../../lib/isc/netmgr/netmgr.c:749
#8  0x000003fffded4594 in isc__trampoline_run (arg=0x2aa000b9a80) at ../../../lib/isc/trampoline.c:198
#9  0x000003fffdb96238 in start_thread (arg=0x3fffd03e840) at pthread_create.c:442
#10 0x000003fffdc116ee in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:66

Such calls happens multiple times. Then if fails with above assertion failure. It seems upstream expects nrecv should be always non-zero when flags=8. And it is on x86_64. But it does not work on this platform.

When it failed, this were hopefully original handle:
(gdb) set $h = (struct uv_udp_s *) 0x2aa000ddf48
(gdb) p $h
$1 = (struct uv_udp_s *) 0x2aa000ddf48
(gdb) p *$h
$2 = {data = 0x2aa000ddd10, loop = 0x2aa000db6d0, type = UV_UDP, close_cb = 0x0, handle_queue = {0x2aa000ddde0, 
    0x2aa000dba40}, u = {fd = 0, reserved = {0x0, 0x0, 0x0, 0x0}}, next_closing = 0x0, flags = 71311372, 
  send_queue_size = 0, send_queue_count = 0, alloc_cb = 0x3fffdeb8dd0 <isc__nm_alloc_cb>, 
  recv_cb = 0x2aa00006610 <udp_recv_cb>, io_watcher = {cb = 0x3fffdd25010 <uv__udp_io>, pending_queue = {0x2aa000ddfd0, 
      0x2aa000ddfd0}, watcher_queue = {0x2aa000ddfe0, 0x2aa000ddfe0}, pevents = 1, events = 1, fd = 22}, write_queue = {
    0x2aa000de000, 0x2aa000de000}, write_completed_queue = {0x2aa000de010, 0x2aa000de010}}
(gdb) p /x 71311372
$3 = 0x440200c

Comment 4 Petr Menšík 2022-05-18 19:28:28 UTC
Hello Steven, I found you as Red Hat owner of libuv. Would you know, what might be reason libuv sends only on s390x platform callbacks with nrecv == 0 and flags == UV_UDP_MMSG_CHUNK? ISC upstream thinks it should be always non-zero. Comment in udp.h header suggests it should be zero with UV_UDP_MMSG_FREE.
I am not able to provide short reproducer. This unit test runs several threads and does complicated stuff.

Would you have any idea how can I narrow a reason of this difference? All other platforms pass fine. The test is in build/lib/isc/tests once built with UNITTESTS enabled. Can be run manually:

# CMOCKA_MESSAGE_OUTPUT= ./netmgr_test 
[==========] Running 49 test(s).
[ RUN      ] mock_listenudp_uv_udp_open
[       OK ] mock_listenudp_uv_udp_open
[ RUN      ] mock_listenudp_uv_udp_bind
[       OK ] mock_listenudp_uv_udp_bind
[ RUN      ] mock_listenudp_uv_udp_recv_start
[       OK ] mock_listenudp_uv_udp_recv_start
[ RUN      ] mock_udpconnect_uv_udp_open
[       OK ] mock_udpconnect_uv_udp_open
[ RUN      ] mock_udpconnect_uv_udp_bind
[       OK ] mock_udpconnect_uv_udp_bind
[ RUN      ] mock_udpconnect_uv_udp_connect
[       OK ] mock_udpconnect_uv_udp_connect
[ RUN      ] mock_udpconnect_uv_recv_buffer_size
[       OK ] mock_udpconnect_uv_recv_buffer_size
[ RUN      ] mock_udpconnect_uv_send_buffer_size
[       OK ] mock_udpconnect_uv_send_buffer_size
[ RUN      ] udp_noop
[       OK ] udp_noop
[ RUN      ] udp_noresponse
[       OK ] udp_noresponse
[ RUN      ] udp_timeout_recovery
[       OK ] udp_timeout_recovery
[ RUN      ] udp_recv_one
# echo $?
255

On passing platform it should run much more tests.

Comment 5 Stephen Gallagher 2022-05-18 21:33:14 UTC
I actually only package this for Fedora. I don't have much knowledge of its inner workings. That said, I fell down a rabbit hole and as best I can tell, the issue is most likely in the kernel. Here's the implementation of the function that returns the variable that stores nrecv and flags:

int uv__recvmmsg(int fd, struct uv__mmsghdr* mmsg, unsigned int vlen) {
#if defined(__i386__)
  unsigned long args[5];
  int rc;

  args[0] = (unsigned long) fd;
  args[1] = (unsigned long) mmsg;
  args[2] = (unsigned long) vlen;
  args[3] = /* flags */ 0;
  args[4] = /* timeout */ 0;

  /* socketcall() raises EINVAL when SYS_RECVMMSG is not supported. */
  rc = syscall(/* __NR_socketcall */ 102, 19 /* SYS_RECVMMSG */, args);
  if (rc == -1)
    if (errno == EINVAL)
      errno = ENOSYS;

  return rc;
#elif defined(__NR_recvmmsg)
  return syscall(__NR_recvmmsg, fd, mmsg, vlen, /* flags */ 0, /* timeout */ 0);
#else
  return errno = ENOSYS, -1;
#endif
}

As you can see, all it does is pass the request on to a syscall and then return the result. It appears that the libuv code makes the assumption that any messages it receives must be non-zero in length, which may not be a safe assumption on s390x.

I'll report this issue upstream to the libuv maintainers.

Comment 6 Stephen Gallagher 2022-05-18 21:49:38 UTC
Actually, I just found the relevant documentation:
http://docs.libuv.org/en/v1.x/udp.html?highlight=uv__recvmmsg#c.uv_udp_recv_cb

See in particular: "Note The receive callback will be called with ... nread == 0 and addr != NULL when an empty UDP packet is received."

So I think ISC's assumption that nread must always be nonzero is incorrect.

Comment 7 Petr Menšík 2022-05-24 16:19:53 UTC
Created attachment 1882838 [details]
test s390x pcap

Test pcap done on s390x machine. Indeed all packets sent are of size 0.

Comment 8 Petr Menšík 2022-05-24 17:00:24 UTC
Created attachment 1882845 [details]
test x86_64 pcap

Similar dump of communication on x86_64, bind-9.16.28 on f34

Comment 9 Petr Menšík 2022-05-24 17:19:28 UTC
Think I found something in connect_send() function.

(gdb) 
423			isc_nm_send(sendhandle, (isc_region_t *)&send_msg,
(gdb) p send_msg
$3 = {base = 0x2aa0001cb48 <send_magic> "fP\256\071P\321\\\020]A:\376\005\204\256", <incomplete sequence \311>, len = 8}
(gdb) step
isc_nm_send (handle=0x2aa00608e20, region=0x2aa0001c7b0 <send_msg>, cb=0x2aa00007c20 <connect_send_cb>, cbarg=0x0) at netmgr/../../../../lib/isc/netmgr/netmgr.c:2427
2427		REQUIRE(VALID_NMHANDLE(handle));
(gdb) next
2429		switch (handle->sock->type) {
(gdb) 
2432			isc__nm_udp_send(handle, region, cb, cbarg);
(gdb) p region
$4 = (isc_region_t *) 0x2aa0001c7b0 <send_msg>
(gdb) p *region
$5 = {base = 0x2aa0001cb48 <send_magic> "fP\256\071P\321\\\020]A:\376\005\204\256", <incomplete sequence \311>, length = 0}
(gdb) frame
#0  isc_nm_send (handle=0x2aa00608e20, region=0x2aa0001c7b0 <send_msg>, cb=0x2aa00007c20 <connect_send_cb>, cbarg=0x0)
    at netmgr/../../../../lib/isc/netmgr/netmgr.c:2432
2432			isc__nm_udp_send(handle, region, cb, cbarg);

It seems changing type from struct uv_buf_t to isc_region_t does not work as expected on this platform.

(gdb) ptype send_msg
type = struct uv_buf_t {
    char *base;
    size_t len;
}
(gdb) ptype region
type = struct isc_region {
    unsigned char *base;
    unsigned int length;
} *
(gdb) p sizeof(region)
$6 = 8
(gdb) p sizeof(send_msg)
$7 = 16

This should not work and has to be done manual way. size_t and unsigned are not same size on other platforms. It seems to be just luck it worked on other platforms.

Comment 10 Petr Menšík 2022-05-24 17:54:37 UTC
Reported to upstream on MR https://gitlab.isc.org/isc-projects/bind9/-/merge_requests/6352

Should affect any big endian architecture and affected would be any 9.16.x or higher version using libuv.

Comment 11 Fedora Update System 2022-05-27 20:19:16 UTC
FEDORA-2022-d1c57e7e41 has been submitted as an update to Fedora 36. https://bodhi.fedoraproject.org/updates/FEDORA-2022-d1c57e7e41

Comment 12 Fedora Update System 2022-05-27 22:16:13 UTC
FEDORA-2022-48eb6235a4 has been submitted as an update to Fedora 35. https://bodhi.fedoraproject.org/updates/FEDORA-2022-48eb6235a4

Comment 13 Fedora Update System 2022-05-28 02:06:51 UTC
FEDORA-2022-d1c57e7e41 has been pushed to the Fedora 36 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2022-d1c57e7e41`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2022-d1c57e7e41

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 14 Fedora Update System 2022-05-28 02:23:42 UTC
FEDORA-2022-48eb6235a4 has been pushed to the Fedora 35 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2022-48eb6235a4`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2022-48eb6235a4

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 15 Fedora Update System 2022-06-01 01:24:28 UTC
FEDORA-2022-d1c57e7e41 has been pushed to the Fedora 36 stable repository.
If problem still persists, please make note of it in this bug report.

Comment 16 Fedora Update System 2022-06-05 01:27:27 UTC
FEDORA-2022-48eb6235a4 has been pushed to the Fedora 35 stable repository.
If problem still persists, please make note of it in this bug report.


Note You need to log in before you can comment on or make changes to this bug.