Bug 2122010

Summary: v9.18.5 intermittent build FAILs @ netmgr_test ?
Product: [Fedora] Fedora Reporter: pgnet.dev
Component: bindAssignee: Petr Menšík <pemensik>
Status: CLOSED ERRATA QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 37CC: aegorenkov.91, anon.amish, dns-sig, jkadlcik, mruprich, pemensik, vonsch, zdohnal
Target Milestone: ---Keywords: Reopened
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: bind-9.18.6-3.fc38 bind-9.18.6-3.fc37 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-09-18 00:17:29 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description pgnet.dev 2022-08-28 11:05:20 UTC
I see netmgr big-endian fixes in 9.18.5 pkg source,


   4122 commit 5e9c9e3539f29fdb812bca553da7e76e5053d9b3
   4123 Merge: 44c4ff2d2d d074386ef1
   4124 Author: Ond�<99>ej Surý <ondrej>
   4125 Date:   Tue May 24 18:28:09 2022 +0000
   4126  
   4127     Merge branch '2801-unittest-isc-netmgr-s390x-v9_18' into 'v9_18'
   4128     
   4129     Fix failures in isc netmgr_test on big endian machines
   4130     
   4131     See merge request isc-projects/bind9!6353
   4132  
   4133 commit d074386ef1fbc0fd8e3c867cd51004061b390c42
   4134 Author: Petr Menšík <pemensik>
   4135 Date:   Tue May 24 19:42:41 2022 +0200
   4136  
   4137     Fix failures in isc netmgr_test on big endian machines
   4138     
   4139     Typing from libuv structure to isc_region_t is not possible, because
   4140     their sizes differ on 64 bit architectures. Little endian machines seems
   4141     to be lucky and still result in test passed. But big endian machine such
   4142     as s390x fails the test reliably.
   4143     
   4144     Fix by directly creating the buffer as isc_region_t and skipping the
   4145     type conversion. More readable and still more correct.
   4146     
   4147     (cherry picked from commit 057438cb45f2f02615dc309e3822f23c0ca70a0a)


building bind9 from src.fed sources for f36/f37/rawhide on COPR, for x86_64,

I'm seeing intermittent build FAILs @ netmgt_test ... **without** changes between builds, it FAILs differently for subsequent builds.

Unclear yet if this is same issue, or new?

E.g., builds @

  https://copr.fedorainfracloud.org/coprs/pgfed/bind/build/4772089/

&

  https://copr.fedorainfracloud.org/coprs/pgfed/bind/build/4772683/

In one case, F36 & F37 builds are OK, but rawhide FAILs
In the other, rawhide is OK, but F36 & F37 are FAILs.

Always at "netmgr_test" ...

Possibly this same issue still?
A COPR buildsys artifact?
Something else?

Comment 1 pgnet.dev 2022-08-28 12:43:45 UTC
editing spec

	%if %{with UNITTEST}
	  pushd build
-	  CPUS=$(lscpu -p=cpu,core | grep -v '^#' | wc -l)
-	  THREADS="$CPUS"
+	  CPUS=1
+	  THREADS=1

seems (?) to do the trick.

COPR builds for all F36/F37/rawhide are OK, @

  https://copr.fedorainfracloud.org/coprs/pgfed/bind-FORK/build/4772848/

Comment 3 pgnet.dev 2022-08-29 15:10:31 UTC
looking at netmgr test source, my as-yet-unfounded suspicion is that these timeouts

  https://github.com/isc-projects/bind9/blob/v9_18_5/tests/isc/netmgr_test.c#L116

are intermittently hitting limits -- only in COPR/online.  perhaps for specific transport?

I also note that -- in main, upstream, 3 days ago -- netmgr tests are being split up, into separate per-transport tests,

  https://github.com/isc-projects/bind9/commit/37a1be5acc32244cec03cedc1bd46bc4aa0fbc18

I'm not clear what specific problem is being solved by that split, but imagine that it might well have an effect on builds @ COPR.

I've not been able to get detailed test FAIL logs from COPR builds (local builds do not FAIL).  currently, @ #fedora-buildsys, did manage to get a reproducer of the build FAIL; &, working on getting access to those FAIL logs via a manual COPR build.

Comment 4 pgnet.dev 2022-08-29 20:23:21 UTC
With no changes to any source/spec, simply rebuilding @ COPR, builds FAIL randomly:

Time                   F36   F37   Rawhide  URL
--------------------   ----  ----  -------  ------
2022-08-29 15:58 EDT   OK    FAIL  OK       https://copr.fedorainfracloud.org/coprs/pgfed/bind/build/4784469/

2022-08-29 14:23 EDT   FAIL  OK    OK       https://copr.fedorainfracloud.org/coprs/pgfed/bind/build/4784210/

2022-08-29 11:49 EDT   OK    OK    OK       https://copr.fedorainfracloud.org/coprs/pgfed/bind/build/4776394/

again, always/only @ bind's netmgr_test ...

Comment 5 pgnet.dev 2022-08-30 11:41:40 UTC
on bind-users ML, from dev,

> The netmgr unit tests are not meant to run fully in the CI as some of it are time sensitive.
> 
> You might want to set the CI=true environment variable to reduce the set of the netmgr unit tests to just the more reliable subset.


limiting UNITTESTs by adding,

   export CI=true

in .spec @

   https://src.fedoraproject.org/fork/pgfed/rpms/bind/blob/rawhide/f/bind.spec#_357

similarly random/intermittent FAILs,

	3x OK, 0x FAIL  https://copr.fedorainfracloud.org/coprs/pgfed/bind-FORK/build/4784746/
	1x OK, 2x FAIL  https://copr.fedorainfracloud.org/coprs/pgfed/bind-FORK/build/4784745/
	0x OK, 3x FAIL  https://copr.fedorainfracloud.org/coprs/pgfed/bind-FORK/build/4784744/
	1x OK, 2x FAIL  https://copr.fedorainfracloud.org/coprs/pgfed/bind-FORK/build/4784743/


subsequent reply @ ML,

> Then run only the system tests by running make check only in the bin/tests/system directory instead of the top level. Or don’t run the tests at all - these are mostly meant for development purposes where we have better control over the build environment.

editing in .spec, to eliminate exec of all UNITTESTs

	%bcond_with    SYSTEMTEST
	...
-	%bcond_without UNITTEST
+	%bcond_with    UNITTEST

results in 0 build fails,

	3x OK, Nx FAIL  https://copr.fedorainfracloud.org/coprs/pgfed/bind-FORK/build/4786692
	3x OK, Nx FAIL  https://copr.fedorainfracloud.org/coprs/pgfed/bind-FORK/build/4786691
	3x OK, Nx FAIL  https://copr.fedorainfracloud.org/coprs/pgfed/bind-FORK/build/4786690
	3x OK, 0x FAIL  https://copr.fedorainfracloud.org/coprs/pgfed/bind-FORK/build/4786675

expected, as the FAILs were occurring in the unittests ...

Comment 6 Petr Menšík 2022-08-30 17:01:37 UTC
well, yes. Some tests are not very reliable and sometime depend on count or cores on building machine. I have to sometime retry the build without any change to pass tests. But I think it is better to have at least most of tests enabled. They catch many bugs and they are run on every different platform, which is not case for normal.

There are also some other tests, which fail from time to time. ISC upstream tests have advantage that they are able to rebuild just platform failing. Our tests on Fedora do not run on any other platforms than x86. That is why I have them enabled even when they sometime fail for no obvious reason.

Oh, interesting. I have not known unit tests can use some environment variables to configure them. I admit I haven't found enough time to make serious testing of reliability of tests. I would like to fix unreliable test instead of turning off them at all.

Comment 7 pgnet.dev 2022-08-30 17:40:13 UTC
FrostyX @ #fedora-buildsys helpfully provided the FAIL'd netmgr_test log from a manual COPR build,

	cat ./var/lib/mock/fedora-36-x86_64-1661812240.003689/root/builddir/build/BUILD/bind-9.18.5/build/tests/isc/netmgr_test.log
		[==========] Running 83 test(s).
		[ RUN      ] mock_listenudp_uv_udp_open
		[       OK ] mock_listenudp_uv_udp_open
		[ RUN      ] mock_listenudp_uv_udp_bind
		[       OK ] mock_listenudp_uv_udp_bind
		[ RUN      ] mock_listenudp_uv_udp_recv_start
		[       OK ] mock_listenudp_uv_udp_recv_start
		[ RUN      ] mock_udpconnect_uv_udp_open
		[       OK ] mock_udpconnect_uv_udp_open
		[ RUN      ] mock_udpconnect_uv_udp_bind
		[       OK ] mock_udpconnect_uv_udp_bind
		[ RUN      ] mock_udpconnect_uv_udp_connect
		[       OK ] mock_udpconnect_uv_udp_connect
		[ RUN      ] mock_udpconnect_uv_recv_buffer_size
		[       OK ] mock_udpconnect_uv_recv_buffer_size
		[ RUN      ] mock_udpconnect_uv_send_buffer_size
		[       OK ] mock_udpconnect_uv_send_buffer_size
		[ RUN      ] udp_noop
		[       OK ] udp_noop
		[ RUN      ] udp_noresponse
		[       OK ] udp_noresponse
		[ RUN      ] udp_timeout_recovery
		[       OK ] udp_timeout_recovery
		[ RUN      ] udp_recv_one
		[       OK ] udp_recv_one
		[ RUN      ] udp_recv_two
		[       OK ] udp_recv_two
		[ RUN      ] udp_recv_send
		[       OK ] udp_recv_send
		[ RUN      ] udp_recv_half_send
		[       OK ] udp_recv_half_send
		[ RUN      ] udp_half_recv_send
		[       OK ] udp_half_recv_send
		[ RUN      ] udp_half_recv_half_send
		[       OK ] udp_half_recv_half_send
		[ RUN      ] tcp_noop
		[       OK ] tcp_noop
		[ RUN      ] tcp_noresponse
		[       OK ] tcp_noresponse
		[ RUN      ] tcp_timeout_recovery
		[       OK ] tcp_timeout_recovery
		[ RUN      ] tcp_recv_one
		[       OK ] tcp_recv_one
		[ RUN      ] tcp_recv_two
		[       OK ] tcp_recv_two
		[ RUN      ] tcp_recv_send
		[       OK ] tcp_recv_send
		[ RUN      ] tcp_recv_half_send
		[       OK ] tcp_recv_half_send
		[ RUN      ] tcp_half_recv_send
		[       OK ] tcp_half_recv_send
		[ RUN      ] tcp_half_recv_half_send
		[       OK ] tcp_half_recv_half_send
		[ RUN      ] tcp_recv_send_sendback
		[       OK ] tcp_recv_send_sendback
		[ RUN      ] tcp_recv_half_send_sendback
		[       OK ] tcp_recv_half_send_sendback
		[ RUN      ] tcp_half_recv_send_sendback
		[       OK ] tcp_half_recv_send_sendback
		[ RUN      ] tcp_half_recv_half_send_sendback
		[       OK ] tcp_half_recv_half_send_sendback
		[ RUN      ] tcp_recv_one_quota
		[       OK ] tcp_recv_one_quota
		[ RUN      ] tcp_recv_two_quota
		[       OK ] tcp_recv_two_quota
		[ RUN      ] tcp_recv_send_quota
		[       OK ] tcp_recv_send_quota
		[ RUN      ] tcp_recv_half_send_quota
		[       OK ] tcp_recv_half_send_quota
		[ RUN      ] tcp_half_recv_send_quota
		[       OK ] tcp_half_recv_send_quota
		[ RUN      ] tcp_half_recv_half_send_quota
		[       OK ] tcp_half_recv_half_send_quota
		[ RUN      ] tcp_recv_send_quota_sendback
		[       OK ] tcp_recv_send_quota_sendback
		[ RUN      ] tcp_recv_half_send_quota_sendback
		[       OK ] tcp_recv_half_send_quota_sendback
		[ RUN      ] tcp_half_recv_send_quota_sendback
		[       OK ] tcp_half_recv_send_quota_sendback
		[ RUN      ] tcp_half_recv_half_send_quota_sendback
		[       OK ] tcp_half_recv_half_send_quota_sendback
		[ RUN      ] tcpdns_recv_one
		[       OK ] tcpdns_recv_one
		[ RUN      ] tcpdns_recv_two
		[       OK ] tcpdns_recv_two
		[ RUN      ] tcpdns_noop
		[       OK ] tcpdns_noop
		[ RUN      ] tcpdns_noresponse
		[       OK ] tcpdns_noresponse
		[ RUN      ] tcpdns_timeout_recovery
		[       OK ] tcpdns_timeout_recovery
		[ RUN      ] tcpdns_recv_send
		[       OK ] tcpdns_recv_send
		[ RUN      ] tcpdns_recv_half_send
		[       OK ] tcpdns_recv_half_send
		[ RUN      ] tcpdns_half_recv_send
		[       OK ] tcpdns_half_recv_send
		[ RUN      ] tcpdns_half_recv_half_send
		[       OK ] tcpdns_half_recv_half_send
		[ RUN      ] tls_noop
		[       OK ] tls_noop
		[ RUN      ] tls_noresponse
		[       OK ] tls_noresponse
		[ RUN      ] tls_timeout_recovery
		[       OK ] tls_timeout_recovery
		[ RUN      ] tls_recv_one
		[       OK ] tls_recv_one
		[ RUN      ] tls_recv_two
		[       OK ] tls_recv_two
		[ RUN      ] tls_recv_send
		[       OK ] tls_recv_send
		[ RUN      ] tls_recv_half_send
		[       OK ] tls_recv_half_send
		[ RUN      ] tls_half_recv_send
		[       OK ] tls_half_recv_send
		[ RUN      ] tls_half_recv_half_send
		[       OK ] tls_half_recv_half_send
		[ RUN      ] tls_recv_send_sendback
		[       OK ] tls_recv_send_sendback
		[ RUN      ] tls_recv_half_send_sendback
		[       OK ] tls_recv_half_send_sendback
		[ RUN      ] tls_half_recv_send_sendback
		[       OK ] tls_half_recv_send_sendback
		[ RUN      ] tls_half_recv_half_send_sendback
		[       OK ] tls_half_recv_half_send_sendback
		[ RUN      ] tls_recv_one_quota
		[       OK ] tls_recv_one_quota
		[ RUN      ] tls_recv_two_quota
		[       OK ] tls_recv_two_quota
		[ RUN      ] tls_recv_send_quota
		[       OK ] tls_recv_send_quota
		[ RUN      ] tls_recv_half_send_quota
		../../../lib/isc/netmgr/tlsstream.c:910: REQUIRE(((mgr) != ((void *)0) && ((const isc__magic_t *)(mgr))->magic == ((('N') << 24 | ('E') << 16 | ('T') << 8 | ('M'))))) failed, back trace
		/builddir/build/BUILD/bind-9.18.5/build/lib/isc/.libs/libisc-9.18.5.so(+0x3a734)[0x7f00740fa734]
		/builddir/build/BUILD/bind-9.18.5/build/lib/isc/.libs/libisc-9.18.5.so(isc_assertion_failed+0x10)[0x7f00740f9ed0]
		/builddir/build/BUILD/bind-9.18.5/build/lib/isc/.libs/libisc-9.18.5.so(isc_nm_tlsconnect+0x139)[0x7f007412cf49]
		/builddir/build/BUILD/bind-9.18.5/build/tests/isc/.libs/netmgr_test(+0xcf39)[0x55f50bf7df39]
		/builddir/build/BUILD/bind-9.18.5/build/tests/isc/.libs/netmgr_test(+0x96e9)[0x55f50bf7a6e9]
		/builddir/build/BUILD/bind-9.18.5/build/lib/isc/.libs/libisc-9.18.5.so(isc__trampoline_run+0x19)[0x7f007411ca69]
		/lib64/libc.so.6(+0x8ce2d)[0x7f0073f05e2d]
		/lib64/libc.so.6(+0x1121b0)[0x7f0073f8b1b0]
		../../tests/unit-test-driver.sh: line 36: 36467 Aborted                 (core dumped) "${TEST_PROGRAM}"
		FAIL netmgr_test (exit status: 134)

where,

	cat lib/isc/netmgr/tlsstream.c 
		...
		void
		isc_nm_tlsconnect(isc_nm_t *mgr, isc_sockaddr_t *local, isc_sockaddr_t *peer,
		          isc_nm_cb_t cb, void *cbarg, isc_tlsctx_t *ctx,
		          isc_tlsctx_client_session_cache_t *client_sess_cache,
		          unsigned int timeout, size_t extrahandlesize) {
		    isc_nmsocket_t *nsock = NULL;
		#if defined(NETMGR_TRACE) && defined(NETMGR_TRACE_VERBOSE)
		    fprintf(stderr, "TLS: isc_nm_tlsconnect(): in net thread: %s\n",
		        isc__nm_in_netthread() ? "yes" : "no");
		#endif /* NETMGR_TRACE */

910		    REQUIRE(VALID_NM(mgr));

		    nsock = isc_mem_get(mgr->mctx, sizeof(*nsock));
		    isc__nmsocket_init(nsock, mgr, isc_nm_tlssocket, local);
		    nsock->extrahandlesize = extrahandlesize;
		    nsock->result = ISC_R_UNSET;
		    nsock->connect_cb = cb;
		    nsock->connect_cbarg = cbarg;
		    nsock->connect_timeout = timeout;
		    isc_tlsctx_attach(ctx, &nsock->tlsstream.ctx);
		    atomic_init(&nsock->client, true);
		    if (client_sess_cache != NULL) {
		        INSIST(isc_tlsctx_client_session_cache_getctx(
		                   client_sess_cache) == ctx);
		        isc_tlsctx_client_session_cache_attach(
		            client_sess_cache, &nsock->tlsstream.client_sess_cache);
		    }

		    isc_nm_tcpconnect(mgr, local, peer, tcp_connected, nsock,
		              nsock->connect_timeout, 0);
		}
		...


At first glance, that looks (?) like DOH/DOT code.  I don't see why it's necessarily a timing issue, as upstream commented.

That's only from one test instance, of course.  I can't yet confirm that that's the same cause fail each time; tho, seems likely.

agreed re: getting to a full(ish) working set of tests that run/work in Koji/COPR CI, rather than just disabling generally

Comment 8 Petr Menšík 2022-09-01 14:57:48 UTC
I have pushed improvement in reporting errors to my fork, producing copr builds:
https://copr.fedorainfracloud.org/coprs/pemensik/bind/packages/

Skipping some tests by defining CI=true environment helped a bit. But I have not seen full crash yet in my builds.

I think this would be most common case so far:

[ RUN      ] tcp_recv_two_quota
PID 36471 exceeded run time limit, sending SIGKILL
../../tests/unit-test-driver.sh: line 43: 36471 Killed

Comment 9 pgnet.dev 2022-09-01 16:11:33 UTC
@pemensik

Repeated (re)builds @ COPR, with your fork'd src = https://src.fedoraproject.org/forks/pemensik/rpms/bind.git

return, again, random results

 OK F36,F37,Rawhide               https://copr.fedorainfracloud.org/coprs/pgfed/bind/build/4803700/
 OK F36,F37         FAIL Rawhide  https://copr.fedorainfracloud.org/coprs/pgfed/bind/build/4803701/
 OK F36,Rawhide     FAIL F37      https://copr.fedorainfracloud.org/coprs/pgfed/bind/build/4803703/
 OK F36,F37         FAIL Rawhide  https://copr.fedorainfracloud.org/coprs/pgfed/bind/build/4803704/

Comment 10 Petr Menšík 2022-09-11 22:28:16 UTC
random_test f37 fail: https://copr.fedorainfracloud.org/coprs/pemensik/bind/build/4830585/ 

netmgr_test:
f35 fail: https://copr.fedorainfracloud.org/coprs/pemensik/bind/build/4829490/
f35 fail: https://copr.fedorainfracloud.org/coprs/pemensik/bind/build/4829365/
f38 fail: https://copr.fedorainfracloud.org/coprs/pemensik/bind/build/4813161/
[ RUN      ] tcp_recv_two_quota
PID 32099 exceeded run time limit, sending SIGKILL
../../tests/unit-test-driver.sh: line 41: 32099 Killed                  "${TEST_PROGRAM}"
FAIL netmgr_test (exit status: 124)

f35 fail: https://copr.fedorainfracloud.org/coprs/pemensik/bind/build/4813161/
f38 fail: https://copr.fedorainfracloud.org/coprs/pemensik/bind/build/4803637/
[ RUN      ] tcp_noresponse
PID 32090 exceeded run time limit, sending SIGKILL
../../tests/unit-test-driver.sh: line 43: 32090 Killed                  "${TEST_PROGRAM}"
FAIL netmgr_test (exit status: 124)

I admit it is hard to pass all builds when building for 4 releases. Passing all is rare now. The hard part would be retrying those failures under debugger and reproducing those issues. So I could find somehow the real cause of those failures. Both most often seen issues are handling tcp responses, but that is still very wide area.

Comment 11 pgnet.dev 2022-09-11 23:14:54 UTC
@pemensik

> I admit it is hard to pass all builds when building for 4 releases. Passing all is rare now. The hard part would be retrying those failures under debugger and reproducing those issues. So I could find somehow the real cause of those failures. Both most often seen issues are handling tcp responses, but that is still very wide area.

until that's debugged, does it make sense to disable JUST the netmgr tests only if/when used in CI, e.g at COPR?

that's the only test that seems to be failing, in that specific env.

atm, the only failure-free all-releases build option seems to be to disable ALL unit tests ... which seems unwise.

since the issue appears to be COPR-specific, the buildsys team should be able to spin up a COPR builder instance for you to test -- including debugging -- if interested.

Comment 12 Fedora Update System 2022-09-13 12:12:04 UTC
FEDORA-2022-0fea8abd6e has been submitted as an update to Fedora 38. https://bodhi.fedoraproject.org/updates/FEDORA-2022-0fea8abd6e

Comment 13 Fedora Update System 2022-09-13 12:25:17 UTC
FEDORA-2022-0fea8abd6e has been pushed to the Fedora 38 stable repository.
If problem still persists, please make note of it in this bug report.

Comment 14 Fedora Update System 2022-09-13 13:49:11 UTC
FEDORA-2022-cbcb55d5c7 has been submitted as an update to Fedora 37. https://bodhi.fedoraproject.org/updates/FEDORA-2022-cbcb55d5c7

Comment 15 Fedora Update System 2022-09-14 01:52:26 UTC
FEDORA-2022-cbcb55d5c7 has been pushed to the Fedora 37 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2022-cbcb55d5c7`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2022-cbcb55d5c7

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 16 Petr Menšík 2022-09-14 14:21:39 UTC
Reopening again. It seems it helped in some cases, but it is still failing on more similar.

It seems the problem is seen in all tests calling stream_recv_two(). Not sure why only those tests are affected, but it seems they are.

Comment 17 Fedora Update System 2022-09-18 00:17:29 UTC
FEDORA-2022-cbcb55d5c7 has been pushed to the Fedora 37 stable repository.
If problem still persists, please make note of it in this bug report.