Hide Forgot
I see netmgr big-endian fixes in 9.18.5 pkg source, 4122 commit 5e9c9e3539f29fdb812bca553da7e76e5053d9b3 4123 Merge: 44c4ff2d2d d074386ef1 4124 Author: Ond�<99>ej Surý <ondrej> 4125 Date: Tue May 24 18:28:09 2022 +0000 4126 4127 Merge branch '2801-unittest-isc-netmgr-s390x-v9_18' into 'v9_18' 4128 4129 Fix failures in isc netmgr_test on big endian machines 4130 4131 See merge request isc-projects/bind9!6353 4132 4133 commit d074386ef1fbc0fd8e3c867cd51004061b390c42 4134 Author: Petr Menšík <pemensik> 4135 Date: Tue May 24 19:42:41 2022 +0200 4136 4137 Fix failures in isc netmgr_test on big endian machines 4138 4139 Typing from libuv structure to isc_region_t is not possible, because 4140 their sizes differ on 64 bit architectures. Little endian machines seems 4141 to be lucky and still result in test passed. But big endian machine such 4142 as s390x fails the test reliably. 4143 4144 Fix by directly creating the buffer as isc_region_t and skipping the 4145 type conversion. More readable and still more correct. 4146 4147 (cherry picked from commit 057438cb45f2f02615dc309e3822f23c0ca70a0a) building bind9 from src.fed sources for f36/f37/rawhide on COPR, for x86_64, I'm seeing intermittent build FAILs @ netmgt_test ... **without** changes between builds, it FAILs differently for subsequent builds. Unclear yet if this is same issue, or new? E.g., builds @ https://copr.fedorainfracloud.org/coprs/pgfed/bind/build/4772089/ & https://copr.fedorainfracloud.org/coprs/pgfed/bind/build/4772683/ In one case, F36 & F37 builds are OK, but rawhide FAILs In the other, rawhide is OK, but F36 & F37 are FAILs. Always at "netmgr_test" ... Possibly this same issue still? A COPR buildsys artifact? Something else?
editing spec %if %{with UNITTEST} pushd build - CPUS=$(lscpu -p=cpu,core | grep -v '^#' | wc -l) - THREADS="$CPUS" + CPUS=1 + THREADS=1 seems (?) to do the trick. COPR builds for all F36/F37/rawhide are OK, @ https://copr.fedorainfracloud.org/coprs/pgfed/bind-FORK/build/4772848/
ugh ... spoke too soon: https://download.copr.fedorainfracloud.org/results/pgfed/bind-FORK/fedora-36-x86_64/04772853-bind/builder-live.log.gz netmgr_test. again.
looking at netmgr test source, my as-yet-unfounded suspicion is that these timeouts https://github.com/isc-projects/bind9/blob/v9_18_5/tests/isc/netmgr_test.c#L116 are intermittently hitting limits -- only in COPR/online. perhaps for specific transport? I also note that -- in main, upstream, 3 days ago -- netmgr tests are being split up, into separate per-transport tests, https://github.com/isc-projects/bind9/commit/37a1be5acc32244cec03cedc1bd46bc4aa0fbc18 I'm not clear what specific problem is being solved by that split, but imagine that it might well have an effect on builds @ COPR. I've not been able to get detailed test FAIL logs from COPR builds (local builds do not FAIL). currently, @ #fedora-buildsys, did manage to get a reproducer of the build FAIL; &, working on getting access to those FAIL logs via a manual COPR build.
With no changes to any source/spec, simply rebuilding @ COPR, builds FAIL randomly: Time F36 F37 Rawhide URL -------------------- ---- ---- ------- ------ 2022-08-29 15:58 EDT OK FAIL OK https://copr.fedorainfracloud.org/coprs/pgfed/bind/build/4784469/ 2022-08-29 14:23 EDT FAIL OK OK https://copr.fedorainfracloud.org/coprs/pgfed/bind/build/4784210/ 2022-08-29 11:49 EDT OK OK OK https://copr.fedorainfracloud.org/coprs/pgfed/bind/build/4776394/ again, always/only @ bind's netmgr_test ...
on bind-users ML, from dev, > The netmgr unit tests are not meant to run fully in the CI as some of it are time sensitive. > > You might want to set the CI=true environment variable to reduce the set of the netmgr unit tests to just the more reliable subset. limiting UNITTESTs by adding, export CI=true in .spec @ https://src.fedoraproject.org/fork/pgfed/rpms/bind/blob/rawhide/f/bind.spec#_357 similarly random/intermittent FAILs, 3x OK, 0x FAIL https://copr.fedorainfracloud.org/coprs/pgfed/bind-FORK/build/4784746/ 1x OK, 2x FAIL https://copr.fedorainfracloud.org/coprs/pgfed/bind-FORK/build/4784745/ 0x OK, 3x FAIL https://copr.fedorainfracloud.org/coprs/pgfed/bind-FORK/build/4784744/ 1x OK, 2x FAIL https://copr.fedorainfracloud.org/coprs/pgfed/bind-FORK/build/4784743/ subsequent reply @ ML, > Then run only the system tests by running make check only in the bin/tests/system directory instead of the top level. Or don’t run the tests at all - these are mostly meant for development purposes where we have better control over the build environment. editing in .spec, to eliminate exec of all UNITTESTs %bcond_with SYSTEMTEST ... - %bcond_without UNITTEST + %bcond_with UNITTEST results in 0 build fails, 3x OK, Nx FAIL https://copr.fedorainfracloud.org/coprs/pgfed/bind-FORK/build/4786692 3x OK, Nx FAIL https://copr.fedorainfracloud.org/coprs/pgfed/bind-FORK/build/4786691 3x OK, Nx FAIL https://copr.fedorainfracloud.org/coprs/pgfed/bind-FORK/build/4786690 3x OK, 0x FAIL https://copr.fedorainfracloud.org/coprs/pgfed/bind-FORK/build/4786675 expected, as the FAILs were occurring in the unittests ...
well, yes. Some tests are not very reliable and sometime depend on count or cores on building machine. I have to sometime retry the build without any change to pass tests. But I think it is better to have at least most of tests enabled. They catch many bugs and they are run on every different platform, which is not case for normal. There are also some other tests, which fail from time to time. ISC upstream tests have advantage that they are able to rebuild just platform failing. Our tests on Fedora do not run on any other platforms than x86. That is why I have them enabled even when they sometime fail for no obvious reason. Oh, interesting. I have not known unit tests can use some environment variables to configure them. I admit I haven't found enough time to make serious testing of reliability of tests. I would like to fix unreliable test instead of turning off them at all.
FrostyX @ #fedora-buildsys helpfully provided the FAIL'd netmgr_test log from a manual COPR build, cat ./var/lib/mock/fedora-36-x86_64-1661812240.003689/root/builddir/build/BUILD/bind-9.18.5/build/tests/isc/netmgr_test.log [==========] Running 83 test(s). [ RUN ] mock_listenudp_uv_udp_open [ OK ] mock_listenudp_uv_udp_open [ RUN ] mock_listenudp_uv_udp_bind [ OK ] mock_listenudp_uv_udp_bind [ RUN ] mock_listenudp_uv_udp_recv_start [ OK ] mock_listenudp_uv_udp_recv_start [ RUN ] mock_udpconnect_uv_udp_open [ OK ] mock_udpconnect_uv_udp_open [ RUN ] mock_udpconnect_uv_udp_bind [ OK ] mock_udpconnect_uv_udp_bind [ RUN ] mock_udpconnect_uv_udp_connect [ OK ] mock_udpconnect_uv_udp_connect [ RUN ] mock_udpconnect_uv_recv_buffer_size [ OK ] mock_udpconnect_uv_recv_buffer_size [ RUN ] mock_udpconnect_uv_send_buffer_size [ OK ] mock_udpconnect_uv_send_buffer_size [ RUN ] udp_noop [ OK ] udp_noop [ RUN ] udp_noresponse [ OK ] udp_noresponse [ RUN ] udp_timeout_recovery [ OK ] udp_timeout_recovery [ RUN ] udp_recv_one [ OK ] udp_recv_one [ RUN ] udp_recv_two [ OK ] udp_recv_two [ RUN ] udp_recv_send [ OK ] udp_recv_send [ RUN ] udp_recv_half_send [ OK ] udp_recv_half_send [ RUN ] udp_half_recv_send [ OK ] udp_half_recv_send [ RUN ] udp_half_recv_half_send [ OK ] udp_half_recv_half_send [ RUN ] tcp_noop [ OK ] tcp_noop [ RUN ] tcp_noresponse [ OK ] tcp_noresponse [ RUN ] tcp_timeout_recovery [ OK ] tcp_timeout_recovery [ RUN ] tcp_recv_one [ OK ] tcp_recv_one [ RUN ] tcp_recv_two [ OK ] tcp_recv_two [ RUN ] tcp_recv_send [ OK ] tcp_recv_send [ RUN ] tcp_recv_half_send [ OK ] tcp_recv_half_send [ RUN ] tcp_half_recv_send [ OK ] tcp_half_recv_send [ RUN ] tcp_half_recv_half_send [ OK ] tcp_half_recv_half_send [ RUN ] tcp_recv_send_sendback [ OK ] tcp_recv_send_sendback [ RUN ] tcp_recv_half_send_sendback [ OK ] tcp_recv_half_send_sendback [ RUN ] tcp_half_recv_send_sendback [ OK ] tcp_half_recv_send_sendback [ RUN ] tcp_half_recv_half_send_sendback [ OK ] tcp_half_recv_half_send_sendback [ RUN ] tcp_recv_one_quota [ OK ] tcp_recv_one_quota [ RUN ] tcp_recv_two_quota [ OK ] tcp_recv_two_quota [ RUN ] tcp_recv_send_quota [ OK ] tcp_recv_send_quota [ RUN ] tcp_recv_half_send_quota [ OK ] tcp_recv_half_send_quota [ RUN ] tcp_half_recv_send_quota [ OK ] tcp_half_recv_send_quota [ RUN ] tcp_half_recv_half_send_quota [ OK ] tcp_half_recv_half_send_quota [ RUN ] tcp_recv_send_quota_sendback [ OK ] tcp_recv_send_quota_sendback [ RUN ] tcp_recv_half_send_quota_sendback [ OK ] tcp_recv_half_send_quota_sendback [ RUN ] tcp_half_recv_send_quota_sendback [ OK ] tcp_half_recv_send_quota_sendback [ RUN ] tcp_half_recv_half_send_quota_sendback [ OK ] tcp_half_recv_half_send_quota_sendback [ RUN ] tcpdns_recv_one [ OK ] tcpdns_recv_one [ RUN ] tcpdns_recv_two [ OK ] tcpdns_recv_two [ RUN ] tcpdns_noop [ OK ] tcpdns_noop [ RUN ] tcpdns_noresponse [ OK ] tcpdns_noresponse [ RUN ] tcpdns_timeout_recovery [ OK ] tcpdns_timeout_recovery [ RUN ] tcpdns_recv_send [ OK ] tcpdns_recv_send [ RUN ] tcpdns_recv_half_send [ OK ] tcpdns_recv_half_send [ RUN ] tcpdns_half_recv_send [ OK ] tcpdns_half_recv_send [ RUN ] tcpdns_half_recv_half_send [ OK ] tcpdns_half_recv_half_send [ RUN ] tls_noop [ OK ] tls_noop [ RUN ] tls_noresponse [ OK ] tls_noresponse [ RUN ] tls_timeout_recovery [ OK ] tls_timeout_recovery [ RUN ] tls_recv_one [ OK ] tls_recv_one [ RUN ] tls_recv_two [ OK ] tls_recv_two [ RUN ] tls_recv_send [ OK ] tls_recv_send [ RUN ] tls_recv_half_send [ OK ] tls_recv_half_send [ RUN ] tls_half_recv_send [ OK ] tls_half_recv_send [ RUN ] tls_half_recv_half_send [ OK ] tls_half_recv_half_send [ RUN ] tls_recv_send_sendback [ OK ] tls_recv_send_sendback [ RUN ] tls_recv_half_send_sendback [ OK ] tls_recv_half_send_sendback [ RUN ] tls_half_recv_send_sendback [ OK ] tls_half_recv_send_sendback [ RUN ] tls_half_recv_half_send_sendback [ OK ] tls_half_recv_half_send_sendback [ RUN ] tls_recv_one_quota [ OK ] tls_recv_one_quota [ RUN ] tls_recv_two_quota [ OK ] tls_recv_two_quota [ RUN ] tls_recv_send_quota [ OK ] tls_recv_send_quota [ RUN ] tls_recv_half_send_quota ../../../lib/isc/netmgr/tlsstream.c:910: REQUIRE(((mgr) != ((void *)0) && ((const isc__magic_t *)(mgr))->magic == ((('N') << 24 | ('E') << 16 | ('T') << 8 | ('M'))))) failed, back trace /builddir/build/BUILD/bind-9.18.5/build/lib/isc/.libs/libisc-9.18.5.so(+0x3a734)[0x7f00740fa734] /builddir/build/BUILD/bind-9.18.5/build/lib/isc/.libs/libisc-9.18.5.so(isc_assertion_failed+0x10)[0x7f00740f9ed0] /builddir/build/BUILD/bind-9.18.5/build/lib/isc/.libs/libisc-9.18.5.so(isc_nm_tlsconnect+0x139)[0x7f007412cf49] /builddir/build/BUILD/bind-9.18.5/build/tests/isc/.libs/netmgr_test(+0xcf39)[0x55f50bf7df39] /builddir/build/BUILD/bind-9.18.5/build/tests/isc/.libs/netmgr_test(+0x96e9)[0x55f50bf7a6e9] /builddir/build/BUILD/bind-9.18.5/build/lib/isc/.libs/libisc-9.18.5.so(isc__trampoline_run+0x19)[0x7f007411ca69] /lib64/libc.so.6(+0x8ce2d)[0x7f0073f05e2d] /lib64/libc.so.6(+0x1121b0)[0x7f0073f8b1b0] ../../tests/unit-test-driver.sh: line 36: 36467 Aborted (core dumped) "${TEST_PROGRAM}" FAIL netmgr_test (exit status: 134) where, cat lib/isc/netmgr/tlsstream.c ... void isc_nm_tlsconnect(isc_nm_t *mgr, isc_sockaddr_t *local, isc_sockaddr_t *peer, isc_nm_cb_t cb, void *cbarg, isc_tlsctx_t *ctx, isc_tlsctx_client_session_cache_t *client_sess_cache, unsigned int timeout, size_t extrahandlesize) { isc_nmsocket_t *nsock = NULL; #if defined(NETMGR_TRACE) && defined(NETMGR_TRACE_VERBOSE) fprintf(stderr, "TLS: isc_nm_tlsconnect(): in net thread: %s\n", isc__nm_in_netthread() ? "yes" : "no"); #endif /* NETMGR_TRACE */ 910 REQUIRE(VALID_NM(mgr)); nsock = isc_mem_get(mgr->mctx, sizeof(*nsock)); isc__nmsocket_init(nsock, mgr, isc_nm_tlssocket, local); nsock->extrahandlesize = extrahandlesize; nsock->result = ISC_R_UNSET; nsock->connect_cb = cb; nsock->connect_cbarg = cbarg; nsock->connect_timeout = timeout; isc_tlsctx_attach(ctx, &nsock->tlsstream.ctx); atomic_init(&nsock->client, true); if (client_sess_cache != NULL) { INSIST(isc_tlsctx_client_session_cache_getctx( client_sess_cache) == ctx); isc_tlsctx_client_session_cache_attach( client_sess_cache, &nsock->tlsstream.client_sess_cache); } isc_nm_tcpconnect(mgr, local, peer, tcp_connected, nsock, nsock->connect_timeout, 0); } ... At first glance, that looks (?) like DOH/DOT code. I don't see why it's necessarily a timing issue, as upstream commented. That's only from one test instance, of course. I can't yet confirm that that's the same cause fail each time; tho, seems likely. agreed re: getting to a full(ish) working set of tests that run/work in Koji/COPR CI, rather than just disabling generally
I have pushed improvement in reporting errors to my fork, producing copr builds: https://copr.fedorainfracloud.org/coprs/pemensik/bind/packages/ Skipping some tests by defining CI=true environment helped a bit. But I have not seen full crash yet in my builds. I think this would be most common case so far: [ RUN ] tcp_recv_two_quota PID 36471 exceeded run time limit, sending SIGKILL ../../tests/unit-test-driver.sh: line 43: 36471 Killed
@pemensik Repeated (re)builds @ COPR, with your fork'd src = https://src.fedoraproject.org/forks/pemensik/rpms/bind.git return, again, random results OK F36,F37,Rawhide https://copr.fedorainfracloud.org/coprs/pgfed/bind/build/4803700/ OK F36,F37 FAIL Rawhide https://copr.fedorainfracloud.org/coprs/pgfed/bind/build/4803701/ OK F36,Rawhide FAIL F37 https://copr.fedorainfracloud.org/coprs/pgfed/bind/build/4803703/ OK F36,F37 FAIL Rawhide https://copr.fedorainfracloud.org/coprs/pgfed/bind/build/4803704/
random_test f37 fail: https://copr.fedorainfracloud.org/coprs/pemensik/bind/build/4830585/ netmgr_test: f35 fail: https://copr.fedorainfracloud.org/coprs/pemensik/bind/build/4829490/ f35 fail: https://copr.fedorainfracloud.org/coprs/pemensik/bind/build/4829365/ f38 fail: https://copr.fedorainfracloud.org/coprs/pemensik/bind/build/4813161/ [ RUN ] tcp_recv_two_quota PID 32099 exceeded run time limit, sending SIGKILL ../../tests/unit-test-driver.sh: line 41: 32099 Killed "${TEST_PROGRAM}" FAIL netmgr_test (exit status: 124) f35 fail: https://copr.fedorainfracloud.org/coprs/pemensik/bind/build/4813161/ f38 fail: https://copr.fedorainfracloud.org/coprs/pemensik/bind/build/4803637/ [ RUN ] tcp_noresponse PID 32090 exceeded run time limit, sending SIGKILL ../../tests/unit-test-driver.sh: line 43: 32090 Killed "${TEST_PROGRAM}" FAIL netmgr_test (exit status: 124) I admit it is hard to pass all builds when building for 4 releases. Passing all is rare now. The hard part would be retrying those failures under debugger and reproducing those issues. So I could find somehow the real cause of those failures. Both most often seen issues are handling tcp responses, but that is still very wide area.
@pemensik > I admit it is hard to pass all builds when building for 4 releases. Passing all is rare now. The hard part would be retrying those failures under debugger and reproducing those issues. So I could find somehow the real cause of those failures. Both most often seen issues are handling tcp responses, but that is still very wide area. until that's debugged, does it make sense to disable JUST the netmgr tests only if/when used in CI, e.g at COPR? that's the only test that seems to be failing, in that specific env. atm, the only failure-free all-releases build option seems to be to disable ALL unit tests ... which seems unwise. since the issue appears to be COPR-specific, the buildsys team should be able to spin up a COPR builder instance for you to test -- including debugging -- if interested.
FEDORA-2022-0fea8abd6e has been submitted as an update to Fedora 38. https://bodhi.fedoraproject.org/updates/FEDORA-2022-0fea8abd6e
FEDORA-2022-0fea8abd6e has been pushed to the Fedora 38 stable repository. If problem still persists, please make note of it in this bug report.
FEDORA-2022-cbcb55d5c7 has been submitted as an update to Fedora 37. https://bodhi.fedoraproject.org/updates/FEDORA-2022-cbcb55d5c7
FEDORA-2022-cbcb55d5c7 has been pushed to the Fedora 37 testing repository. Soon you'll be able to install the update with the following command: `sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2022-cbcb55d5c7` You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2022-cbcb55d5c7 See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.
Reopening again. It seems it helped in some cases, but it is still failing on more similar. It seems the problem is seen in all tests calling stream_recv_two(). Not sure why only those tests are affected, but it seems they are.
FEDORA-2022-cbcb55d5c7 has been pushed to the Fedora 37 stable repository. If problem still persists, please make note of it in this bug report.