Bug 2122010
Summary: | v9.18.5 intermittent build FAILs @ netmgr_test ? | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | pgnet.dev |
Component: | bind | Assignee: | Petr Menšík <pemensik> |
Status: | CLOSED ERRATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
Severity: | medium | Docs Contact: | |
Priority: | unspecified | ||
Version: | 37 | CC: | aegorenkov.91, anon.amish, dns-sig, jkadlcik, mruprich, pemensik, vonsch, zdohnal |
Target Milestone: | --- | Keywords: | Reopened |
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | bind-9.18.6-3.fc38 bind-9.18.6-3.fc37 | Doc Type: | If docs needed, set a value |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2022-09-18 00:17:29 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
pgnet.dev
2022-08-28 11:05:20 UTC
editing spec %if %{with UNITTEST} pushd build - CPUS=$(lscpu -p=cpu,core | grep -v '^#' | wc -l) - THREADS="$CPUS" + CPUS=1 + THREADS=1 seems (?) to do the trick. COPR builds for all F36/F37/rawhide are OK, @ https://copr.fedorainfracloud.org/coprs/pgfed/bind-FORK/build/4772848/ ugh ... spoke too soon: https://download.copr.fedorainfracloud.org/results/pgfed/bind-FORK/fedora-36-x86_64/04772853-bind/builder-live.log.gz netmgr_test. again. looking at netmgr test source, my as-yet-unfounded suspicion is that these timeouts https://github.com/isc-projects/bind9/blob/v9_18_5/tests/isc/netmgr_test.c#L116 are intermittently hitting limits -- only in COPR/online. perhaps for specific transport? I also note that -- in main, upstream, 3 days ago -- netmgr tests are being split up, into separate per-transport tests, https://github.com/isc-projects/bind9/commit/37a1be5acc32244cec03cedc1bd46bc4aa0fbc18 I'm not clear what specific problem is being solved by that split, but imagine that it might well have an effect on builds @ COPR. I've not been able to get detailed test FAIL logs from COPR builds (local builds do not FAIL). currently, @ #fedora-buildsys, did manage to get a reproducer of the build FAIL; &, working on getting access to those FAIL logs via a manual COPR build. With no changes to any source/spec, simply rebuilding @ COPR, builds FAIL randomly: Time F36 F37 Rawhide URL -------------------- ---- ---- ------- ------ 2022-08-29 15:58 EDT OK FAIL OK https://copr.fedorainfracloud.org/coprs/pgfed/bind/build/4784469/ 2022-08-29 14:23 EDT FAIL OK OK https://copr.fedorainfracloud.org/coprs/pgfed/bind/build/4784210/ 2022-08-29 11:49 EDT OK OK OK https://copr.fedorainfracloud.org/coprs/pgfed/bind/build/4776394/ again, always/only @ bind's netmgr_test ... on bind-users ML, from dev, > The netmgr unit tests are not meant to run fully in the CI as some of it are time sensitive. > > You might want to set the CI=true environment variable to reduce the set of the netmgr unit tests to just the more reliable subset. limiting UNITTESTs by adding, export CI=true in .spec @ https://src.fedoraproject.org/fork/pgfed/rpms/bind/blob/rawhide/f/bind.spec#_357 similarly random/intermittent FAILs, 3x OK, 0x FAIL https://copr.fedorainfracloud.org/coprs/pgfed/bind-FORK/build/4784746/ 1x OK, 2x FAIL https://copr.fedorainfracloud.org/coprs/pgfed/bind-FORK/build/4784745/ 0x OK, 3x FAIL https://copr.fedorainfracloud.org/coprs/pgfed/bind-FORK/build/4784744/ 1x OK, 2x FAIL https://copr.fedorainfracloud.org/coprs/pgfed/bind-FORK/build/4784743/ subsequent reply @ ML, > Then run only the system tests by running make check only in the bin/tests/system directory instead of the top level. Or don’t run the tests at all - these are mostly meant for development purposes where we have better control over the build environment. editing in .spec, to eliminate exec of all UNITTESTs %bcond_with SYSTEMTEST ... - %bcond_without UNITTEST + %bcond_with UNITTEST results in 0 build fails, 3x OK, Nx FAIL https://copr.fedorainfracloud.org/coprs/pgfed/bind-FORK/build/4786692 3x OK, Nx FAIL https://copr.fedorainfracloud.org/coprs/pgfed/bind-FORK/build/4786691 3x OK, Nx FAIL https://copr.fedorainfracloud.org/coprs/pgfed/bind-FORK/build/4786690 3x OK, 0x FAIL https://copr.fedorainfracloud.org/coprs/pgfed/bind-FORK/build/4786675 expected, as the FAILs were occurring in the unittests ... well, yes. Some tests are not very reliable and sometime depend on count or cores on building machine. I have to sometime retry the build without any change to pass tests. But I think it is better to have at least most of tests enabled. They catch many bugs and they are run on every different platform, which is not case for normal. There are also some other tests, which fail from time to time. ISC upstream tests have advantage that they are able to rebuild just platform failing. Our tests on Fedora do not run on any other platforms than x86. That is why I have them enabled even when they sometime fail for no obvious reason. Oh, interesting. I have not known unit tests can use some environment variables to configure them. I admit I haven't found enough time to make serious testing of reliability of tests. I would like to fix unreliable test instead of turning off them at all. FrostyX @ #fedora-buildsys helpfully provided the FAIL'd netmgr_test log from a manual COPR build, cat ./var/lib/mock/fedora-36-x86_64-1661812240.003689/root/builddir/build/BUILD/bind-9.18.5/build/tests/isc/netmgr_test.log [==========] Running 83 test(s). [ RUN ] mock_listenudp_uv_udp_open [ OK ] mock_listenudp_uv_udp_open [ RUN ] mock_listenudp_uv_udp_bind [ OK ] mock_listenudp_uv_udp_bind [ RUN ] mock_listenudp_uv_udp_recv_start [ OK ] mock_listenudp_uv_udp_recv_start [ RUN ] mock_udpconnect_uv_udp_open [ OK ] mock_udpconnect_uv_udp_open [ RUN ] mock_udpconnect_uv_udp_bind [ OK ] mock_udpconnect_uv_udp_bind [ RUN ] mock_udpconnect_uv_udp_connect [ OK ] mock_udpconnect_uv_udp_connect [ RUN ] mock_udpconnect_uv_recv_buffer_size [ OK ] mock_udpconnect_uv_recv_buffer_size [ RUN ] mock_udpconnect_uv_send_buffer_size [ OK ] mock_udpconnect_uv_send_buffer_size [ RUN ] udp_noop [ OK ] udp_noop [ RUN ] udp_noresponse [ OK ] udp_noresponse [ RUN ] udp_timeout_recovery [ OK ] udp_timeout_recovery [ RUN ] udp_recv_one [ OK ] udp_recv_one [ RUN ] udp_recv_two [ OK ] udp_recv_two [ RUN ] udp_recv_send [ OK ] udp_recv_send [ RUN ] udp_recv_half_send [ OK ] udp_recv_half_send [ RUN ] udp_half_recv_send [ OK ] udp_half_recv_send [ RUN ] udp_half_recv_half_send [ OK ] udp_half_recv_half_send [ RUN ] tcp_noop [ OK ] tcp_noop [ RUN ] tcp_noresponse [ OK ] tcp_noresponse [ RUN ] tcp_timeout_recovery [ OK ] tcp_timeout_recovery [ RUN ] tcp_recv_one [ OK ] tcp_recv_one [ RUN ] tcp_recv_two [ OK ] tcp_recv_two [ RUN ] tcp_recv_send [ OK ] tcp_recv_send [ RUN ] tcp_recv_half_send [ OK ] tcp_recv_half_send [ RUN ] tcp_half_recv_send [ OK ] tcp_half_recv_send [ RUN ] tcp_half_recv_half_send [ OK ] tcp_half_recv_half_send [ RUN ] tcp_recv_send_sendback [ OK ] tcp_recv_send_sendback [ RUN ] tcp_recv_half_send_sendback [ OK ] tcp_recv_half_send_sendback [ RUN ] tcp_half_recv_send_sendback [ OK ] tcp_half_recv_send_sendback [ RUN ] tcp_half_recv_half_send_sendback [ OK ] tcp_half_recv_half_send_sendback [ RUN ] tcp_recv_one_quota [ OK ] tcp_recv_one_quota [ RUN ] tcp_recv_two_quota [ OK ] tcp_recv_two_quota [ RUN ] tcp_recv_send_quota [ OK ] tcp_recv_send_quota [ RUN ] tcp_recv_half_send_quota [ OK ] tcp_recv_half_send_quota [ RUN ] tcp_half_recv_send_quota [ OK ] tcp_half_recv_send_quota [ RUN ] tcp_half_recv_half_send_quota [ OK ] tcp_half_recv_half_send_quota [ RUN ] tcp_recv_send_quota_sendback [ OK ] tcp_recv_send_quota_sendback [ RUN ] tcp_recv_half_send_quota_sendback [ OK ] tcp_recv_half_send_quota_sendback [ RUN ] tcp_half_recv_send_quota_sendback [ OK ] tcp_half_recv_send_quota_sendback [ RUN ] tcp_half_recv_half_send_quota_sendback [ OK ] tcp_half_recv_half_send_quota_sendback [ RUN ] tcpdns_recv_one [ OK ] tcpdns_recv_one [ RUN ] tcpdns_recv_two [ OK ] tcpdns_recv_two [ RUN ] tcpdns_noop [ OK ] tcpdns_noop [ RUN ] tcpdns_noresponse [ OK ] tcpdns_noresponse [ RUN ] tcpdns_timeout_recovery [ OK ] tcpdns_timeout_recovery [ RUN ] tcpdns_recv_send [ OK ] tcpdns_recv_send [ RUN ] tcpdns_recv_half_send [ OK ] tcpdns_recv_half_send [ RUN ] tcpdns_half_recv_send [ OK ] tcpdns_half_recv_send [ RUN ] tcpdns_half_recv_half_send [ OK ] tcpdns_half_recv_half_send [ RUN ] tls_noop [ OK ] tls_noop [ RUN ] tls_noresponse [ OK ] tls_noresponse [ RUN ] tls_timeout_recovery [ OK ] tls_timeout_recovery [ RUN ] tls_recv_one [ OK ] tls_recv_one [ RUN ] tls_recv_two [ OK ] tls_recv_two [ RUN ] tls_recv_send [ OK ] tls_recv_send [ RUN ] tls_recv_half_send [ OK ] tls_recv_half_send [ RUN ] tls_half_recv_send [ OK ] tls_half_recv_send [ RUN ] tls_half_recv_half_send [ OK ] tls_half_recv_half_send [ RUN ] tls_recv_send_sendback [ OK ] tls_recv_send_sendback [ RUN ] tls_recv_half_send_sendback [ OK ] tls_recv_half_send_sendback [ RUN ] tls_half_recv_send_sendback [ OK ] tls_half_recv_send_sendback [ RUN ] tls_half_recv_half_send_sendback [ OK ] tls_half_recv_half_send_sendback [ RUN ] tls_recv_one_quota [ OK ] tls_recv_one_quota [ RUN ] tls_recv_two_quota [ OK ] tls_recv_two_quota [ RUN ] tls_recv_send_quota [ OK ] tls_recv_send_quota [ RUN ] tls_recv_half_send_quota ../../../lib/isc/netmgr/tlsstream.c:910: REQUIRE(((mgr) != ((void *)0) && ((const isc__magic_t *)(mgr))->magic == ((('N') << 24 | ('E') << 16 | ('T') << 8 | ('M'))))) failed, back trace /builddir/build/BUILD/bind-9.18.5/build/lib/isc/.libs/libisc-9.18.5.so(+0x3a734)[0x7f00740fa734] /builddir/build/BUILD/bind-9.18.5/build/lib/isc/.libs/libisc-9.18.5.so(isc_assertion_failed+0x10)[0x7f00740f9ed0] /builddir/build/BUILD/bind-9.18.5/build/lib/isc/.libs/libisc-9.18.5.so(isc_nm_tlsconnect+0x139)[0x7f007412cf49] /builddir/build/BUILD/bind-9.18.5/build/tests/isc/.libs/netmgr_test(+0xcf39)[0x55f50bf7df39] /builddir/build/BUILD/bind-9.18.5/build/tests/isc/.libs/netmgr_test(+0x96e9)[0x55f50bf7a6e9] /builddir/build/BUILD/bind-9.18.5/build/lib/isc/.libs/libisc-9.18.5.so(isc__trampoline_run+0x19)[0x7f007411ca69] /lib64/libc.so.6(+0x8ce2d)[0x7f0073f05e2d] /lib64/libc.so.6(+0x1121b0)[0x7f0073f8b1b0] ../../tests/unit-test-driver.sh: line 36: 36467 Aborted (core dumped) "${TEST_PROGRAM}" FAIL netmgr_test (exit status: 134) where, cat lib/isc/netmgr/tlsstream.c ... void isc_nm_tlsconnect(isc_nm_t *mgr, isc_sockaddr_t *local, isc_sockaddr_t *peer, isc_nm_cb_t cb, void *cbarg, isc_tlsctx_t *ctx, isc_tlsctx_client_session_cache_t *client_sess_cache, unsigned int timeout, size_t extrahandlesize) { isc_nmsocket_t *nsock = NULL; #if defined(NETMGR_TRACE) && defined(NETMGR_TRACE_VERBOSE) fprintf(stderr, "TLS: isc_nm_tlsconnect(): in net thread: %s\n", isc__nm_in_netthread() ? "yes" : "no"); #endif /* NETMGR_TRACE */ 910 REQUIRE(VALID_NM(mgr)); nsock = isc_mem_get(mgr->mctx, sizeof(*nsock)); isc__nmsocket_init(nsock, mgr, isc_nm_tlssocket, local); nsock->extrahandlesize = extrahandlesize; nsock->result = ISC_R_UNSET; nsock->connect_cb = cb; nsock->connect_cbarg = cbarg; nsock->connect_timeout = timeout; isc_tlsctx_attach(ctx, &nsock->tlsstream.ctx); atomic_init(&nsock->client, true); if (client_sess_cache != NULL) { INSIST(isc_tlsctx_client_session_cache_getctx( client_sess_cache) == ctx); isc_tlsctx_client_session_cache_attach( client_sess_cache, &nsock->tlsstream.client_sess_cache); } isc_nm_tcpconnect(mgr, local, peer, tcp_connected, nsock, nsock->connect_timeout, 0); } ... At first glance, that looks (?) like DOH/DOT code. I don't see why it's necessarily a timing issue, as upstream commented. That's only from one test instance, of course. I can't yet confirm that that's the same cause fail each time; tho, seems likely. agreed re: getting to a full(ish) working set of tests that run/work in Koji/COPR CI, rather than just disabling generally I have pushed improvement in reporting errors to my fork, producing copr builds: https://copr.fedorainfracloud.org/coprs/pemensik/bind/packages/ Skipping some tests by defining CI=true environment helped a bit. But I have not seen full crash yet in my builds. I think this would be most common case so far: [ RUN ] tcp_recv_two_quota PID 36471 exceeded run time limit, sending SIGKILL ../../tests/unit-test-driver.sh: line 43: 36471 Killed @pemensik Repeated (re)builds @ COPR, with your fork'd src = https://src.fedoraproject.org/forks/pemensik/rpms/bind.git return, again, random results OK F36,F37,Rawhide https://copr.fedorainfracloud.org/coprs/pgfed/bind/build/4803700/ OK F36,F37 FAIL Rawhide https://copr.fedorainfracloud.org/coprs/pgfed/bind/build/4803701/ OK F36,Rawhide FAIL F37 https://copr.fedorainfracloud.org/coprs/pgfed/bind/build/4803703/ OK F36,F37 FAIL Rawhide https://copr.fedorainfracloud.org/coprs/pgfed/bind/build/4803704/ random_test f37 fail: https://copr.fedorainfracloud.org/coprs/pemensik/bind/build/4830585/ netmgr_test: f35 fail: https://copr.fedorainfracloud.org/coprs/pemensik/bind/build/4829490/ f35 fail: https://copr.fedorainfracloud.org/coprs/pemensik/bind/build/4829365/ f38 fail: https://copr.fedorainfracloud.org/coprs/pemensik/bind/build/4813161/ [ RUN ] tcp_recv_two_quota PID 32099 exceeded run time limit, sending SIGKILL ../../tests/unit-test-driver.sh: line 41: 32099 Killed "${TEST_PROGRAM}" FAIL netmgr_test (exit status: 124) f35 fail: https://copr.fedorainfracloud.org/coprs/pemensik/bind/build/4813161/ f38 fail: https://copr.fedorainfracloud.org/coprs/pemensik/bind/build/4803637/ [ RUN ] tcp_noresponse PID 32090 exceeded run time limit, sending SIGKILL ../../tests/unit-test-driver.sh: line 43: 32090 Killed "${TEST_PROGRAM}" FAIL netmgr_test (exit status: 124) I admit it is hard to pass all builds when building for 4 releases. Passing all is rare now. The hard part would be retrying those failures under debugger and reproducing those issues. So I could find somehow the real cause of those failures. Both most often seen issues are handling tcp responses, but that is still very wide area. @pemensik
> I admit it is hard to pass all builds when building for 4 releases. Passing all is rare now. The hard part would be retrying those failures under debugger and reproducing those issues. So I could find somehow the real cause of those failures. Both most often seen issues are handling tcp responses, but that is still very wide area.
until that's debugged, does it make sense to disable JUST the netmgr tests only if/when used in CI, e.g at COPR?
that's the only test that seems to be failing, in that specific env.
atm, the only failure-free all-releases build option seems to be to disable ALL unit tests ... which seems unwise.
since the issue appears to be COPR-specific, the buildsys team should be able to spin up a COPR builder instance for you to test -- including debugging -- if interested.
FEDORA-2022-0fea8abd6e has been submitted as an update to Fedora 38. https://bodhi.fedoraproject.org/updates/FEDORA-2022-0fea8abd6e FEDORA-2022-0fea8abd6e has been pushed to the Fedora 38 stable repository. If problem still persists, please make note of it in this bug report. FEDORA-2022-cbcb55d5c7 has been submitted as an update to Fedora 37. https://bodhi.fedoraproject.org/updates/FEDORA-2022-cbcb55d5c7 FEDORA-2022-cbcb55d5c7 has been pushed to the Fedora 37 testing repository. Soon you'll be able to install the update with the following command: `sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2022-cbcb55d5c7` You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2022-cbcb55d5c7 See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates. Reopening again. It seems it helped in some cases, but it is still failing on more similar. It seems the problem is seen in all tests calling stream_recv_two(). Not sure why only those tests are affected, but it seems they are. FEDORA-2022-cbcb55d5c7 has been pushed to the Fedora 37 stable repository. If problem still persists, please make note of it in this bug report. |