Bug 2114531 - git: FTBFS in Fedora rawhide/f37
Summary: git: FTBFS in Fedora rawhide/f37
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Fedora
Classification: Fedora
Component: git
Version: rawhide
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Todd Zullinger
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: F37FTBFS
 
Reported: 2022-08-02 19:39 UTC by Fedora Release Engineering
Modified: 2022-08-26 12:30 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-02 20:43:23 UTC
Type: ---
Embargoed:


Attachments
build.log (32.00 KB, text/plain)
2022-08-02 19:39 UTC, Fedora Release Engineering
root.log (32.00 KB, text/plain)
2022-08-02 19:39 UTC, Fedora Release Engineering
state.log (957 bytes, text/plain)
2022-08-02 19:39 UTC, Fedora Release Engineering
use random ports for tests which start daemons (1.96 KB, patch)
2022-08-18 19:00 UTC, Todd Zullinger

Description Fedora Release Engineering 2022-08-02 19:39:23 UTC
git failed to build from source in Fedora rawhide/f37

https://koji.fedoraproject.org/koji/taskinfo?taskID=89780626


For details on the mass rebuild see:

https://fedoraproject.org/wiki/Fedora_37_Mass_Rebuild
Please fix git at your earliest convenience and set the bug's status to
ASSIGNED when you start fixing it. If the bug remains in NEW state for 8 weeks,
git will be orphaned. Before Fedora 38 is branched, git will be retired if it
still fails to build.

For more details on the FTBFS policy, please visit:
https://docs.fedoraproject.org/en-US/fesco/Fails_to_build_from_source_Fails_to_install/

Comment 1 Fedora Release Engineering 2022-08-02 19:39:26 UTC
Created attachment 1903005 [details]
build.log

file build.log too big, will only attach last 32768 bytes

Comment 2 Fedora Release Engineering 2022-08-02 19:39:28 UTC
Created attachment 1903006 [details]
root.log

file root.log too big, will only attach last 32768 bytes

Comment 3 Fedora Release Engineering 2022-08-02 19:39:29 UTC
Created attachment 1903007 [details]
state.log

Comment 4 Todd Zullinger 2022-08-02 20:43:23 UTC
This was a transient failure due to issues in the test suite, where tests which start a temporary httpd on a high port failed because the port was already in use by something else on the build host.  If there are ways to ensure the build environment has access to a range of free ports, I'd be interested to know.  This comes up occasionally in the test suite and I'd love to be able to fix it and make the test suite more reliable in our infrastructure.

Here's a successful scratch build:

  https://koji.fedoraproject.org/koji/taskinfo?taskID=90408676

Comment 5 Ondřej Pohořelský 2022-08-03 14:34:34 UTC
The same thing happens to me occasionally on RHEL builders, and I'm also interested in fixing it.

I took a brief look at this issue.
The ports are derived from the test number, unless the port is defined beforehand. There are 23 tests that start a test server on a port chosen this way.
What I don't know is whether the git build is colliding with some random other build that is also starting its own server, or whether it is colliding with another git build running at the same time.
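For reference, the default port selection in test_set_port (t/test-lib-functions.sh) looks roughly like this (a sketch, the exact upstream code may differ slightly):

```
case "$port" in
"")
	# No port set in the environment; derive one from the test
	# script name, e.g. t5551-http-fetch-smart.sh -> port 5551.
	port=${this_test#t}
	port=${port%%-*}
	if test "$port" -lt 1024
	then
		# privileged range, bump it into the unprivileged range
		port=$(($port + 10000))
	fi
	;;
*[!0-9]*|0*)
	error >&7 "invalid port number: $port"
	;;
*)
	# The port was specified explicitly; use it as-is.
	;;
esac
```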

Looking at the failed build, it seems we used the same host for the x86_64 and i686 builds, which might be why the ports were blocked.

Anyway, I'm not sure how to deal with this issue. I don't think we can ensure the ports are available in the build environment.
I also don't think that we should do anything on the spec side, as it would only add confusion.
IMO, the best idea is to improve the git test suite. I'm thinking about randomizing the port numbers for each test.

What's your view on this issue?

Comment 6 Todd Zullinger 2022-08-03 17:23:02 UTC
Hi Ondřej,

Perhaps we should try setting those ports randomly, possibly in a different and/or higher range?  Then multiple builds would be less likely to conflict, if that's the issue?

We skip t9115-git-svn-dcommit-funky-renames on power64 arches because it frequently runs into the port-already-in-use problem.  There, we don't share the build host with other architectures, like the i686/x86_64 case you mentioned.  I'd always assumed it was something on the host, but I never dug into it.  So it's just baseless speculation on my part. :)

I haven't looked in some time, but do all the tests which run a server use test_set_port now?  If so, our list of ports to set is relatively small:

  $ git grep -P '^\s*test_set_port'
  t/lib-git-daemon.sh:test_set_port LIB_GIT_DAEMON_PORT
  t/lib-git-p4.sh:test_set_port P4DPORT
  t/lib-git-svn.sh:test_set_port SVNSERVE_PORT
  t/lib-httpd.sh:test_set_port LIB_HTTPD_PORT
  t/t5512-ls-remote.sh:   test_set_port JGIT_DAEMON_PORT &&
  t/test-lib-functions.sh:test_set_port () {

If the problem is multiple tests from the same build tripping over each other, then this won't help.  But if that is the case, we should be able to work upstream to improve things.

Comment 7 Ondřej Pohořelský 2022-08-17 13:16:38 UTC
I'm trying to make the randomization of ports work, but I'm running into multiple issues.

First, I tried to set the port to 0, hoping that the kernel would assign a free port on its own.
This doesn't work, as neither apache httpd nor lighttpd supports this behavior.

After looking at the solution the rubygem-pg package[0] uses, I added these lines to the spec file to randomize all the ports.

```
export LIB_GIT_DAEMON_PORT="$((${RANDOM} % (65535 - 1024) + 1024))"
export P4DPORT="$((${RANDOM} % (65535 - 1024) + 1024))"
export SVNSERVE_PORT="$((${RANDOM} % (65535 - 1024) + 1024))"
export LIB_HTTPD_PORT="$((${RANDOM} % (65535 - 1024) + 1024))"
export JGIT_DAEMON_PORT="$((${RANDOM} % (65535 - 1024) + 1024))"
```

Now the test suite fails all the time[1]. Strangely, it happens because of LIB_HTTPD_PORT. No matter which port I set through this environment variable, the tests always fail.

When not setting LIB_HTTPD_PORT, the test suite succeeds in maybe half of the builds.
This is really strange, as I would not expect to run into these issues when the ports are randomized.

I'll try to dig some more into this issue and possibly contact upstream about it.

[0]https://src.fedoraproject.org/rpms/rubygem-pg/pull-request/3#request_diff
[1]https://koji.fedoraproject.org/koji/taskinfo?taskID=90919637

Comment 8 Todd Zullinger 2022-08-17 22:35:03 UTC
Ahh, I see (at least one of the reasons) why that fails so badly.  Setting the *_PORT variables leaves us with a single port for all the tests which use each service.  So t/t5541-http-push-smart.sh, t/t5551-http-fetch-smart.sh, and t/t5703-upload-pack-ref-in-want.sh all try to start httpd on the same port, which isn't an improvement.

I wonder if we could adjust test_set_port to provide a randomized range by adding an arm to the `case "$port" in` block?  Something like setting LIB_GIT_DAEMON_PORT=random (and perhaps, if needed, another var like GIT_TEST_PORT_RANGE_START to set the starting point), e.g.:

```
diff --git i/t/test-lib-functions.sh w/t/test-lib-functions.sh
index c6479f24eb..62629f8c20 100644
--- i/t/test-lib-functions.sh
+++ w/t/test-lib-functions.sh
@@ -1731,6 +1731,10 @@ test_set_port () {
 			port=$(($port + 10000))
 		fi
 		;;
+	random)
+		# Pick a random port
+		port=$(( RANDOM % (65535 - 1024) + 1024))
+		;;
 	*[!0-9]*|0*)
 		error >&7 "invalid port number: $port"
 		;;
```

The two instances of '1024' in the above could be replaced with $GIT_TEST_PORT_RANGE_START if such a variable were added.  This is just for testing, as using RANDOM is bash-specific. If it works out, one possible POSIX-compliant way to do it might be:

    port=$( awk -v min=1024 -v max=65535 'BEGIN{srand(); print int(min+rand()*(max-min+1))}')

(courtesy of Stéphane Chazelas at https://unix.stackexchange.com/a/140756)

I ran a scratch build with that patch applied and the ports set to random:

```
diff --git i/git.spec w/git.spec
index 35651f6..768c0df 100644
--- i/git.spec
+++ w/git.spec
@@ -109,6 +109,10 @@ Source99:       print-failed-test-output
 # https://bugzilla.redhat.com/490602
 Patch0:         git-cvsimport-Ignore-cvsps-2.2b1-Branches-output.patch
 
+# allow setting the test port to "random"
+# https://bugzilla.redhat.com/2114531#c8
+Patch1:         git-test_set_port-random.patch
+
 %if %{with docs}
 # pod2man is needed to build Git.3pm
 BuildRequires:  %{_bindir}/pod2man
@@ -849,6 +853,11 @@ export GIT_TEST_HTTPD=true
 export GIT_TEST_SVNSERVE=true
 export GIT_TEST_SVN_HTTPD=true
 
+# Use randomized ports for tests which start a daemon
+export LIB_GIT_DAEMON_PORT=random
+export SVNSERVE_PORT=random
+export LIB_HTTPD_PORT=random
+
 # Create tmpdir for test output and update GIT_TEST_OPTS
 # Also update GIT-BUILD-OPTIONS to keep make from any needless rebuilding
 testdir=$(mktemp -d -p /tmp git-t.XXXX)
```

The scratch build:

https://koji.fedoraproject.org/koji/taskinfo?taskID=90940126

Since this doesn't come up all the time, it might need a few more scratch builds to see if it actually helps resolve the issue.  I should probably drop the t9115 skipping on ppc64le since that was a common source of this issue.  I kicked off another scratch build with that change: https://koji.fedoraproject.org/koji/taskinfo?taskID=90941617

I also didn't set the JGIT_DAEMON_PORT or P4DPORT variables since the Fedora builds don't test either of them.  We skip jgit tests because I got tired of adding and removing it every other release when it was about to be taken out of the distro by java dependency issues.  We skip the p4d tests because they require downloading non-free binaries to run.  I wouldn't do that on my own systems so I don't do it on the Fedora infrastructure. :)

Let me know if you think this is a workable idea or if you can poke any holes in it.

Comment 9 Ondřej Pohořelský 2022-08-18 11:39:21 UTC
Now I see what I missed, thanks for pointing it out.
This patch looks good, and the POSIX-compliant way might be the ideal solution for upstream.

One problem is described in the Stack Exchange comments: when two tests try to get a random port number within the same second, they are likely to end up with the same port.
I'm not sure if this is something we are going to run into, but if so, it can be fixed by giving srand() a unique seed, such as the current time in milliseconds.
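A hypothetical variant of the awk one-liner, with an explicit seed so that two invocations in the same second don't pick the same port (note that `date +%N` is a GNU extension):

```
port=$(awk -v min=1024 -v max=65535 -v seed="$(date +%N)" \
	'BEGIN{srand(seed); print int(min + rand() * (max - min + 1))}')
```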

Comment 10 Todd Zullinger 2022-08-18 15:19:36 UTC
I tried to use the awk method last night, but I got a lot of failures.  I didn't verify that they were due to the duplicate-port issue, but that seems likely.  It would be worth trying it with a microsecond seed to srand() (assuming that's actually still portable?).

Another thing I noticed while looking is that RANDOM is limited to the range 0 to 32767.  SRANDOM would be better in that case, but it is only available in bash 5.1+, so it wouldn't help for RHEL 8.
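For systems that do have bash 5.1+, something like this would cover the full port range (a sketch):

```
# SRANDOM is a 32-bit random value in bash 5.1+, so unlike RANDOM (capped
# at 32767) the modulo covers the whole unprivileged port range
port=$(( SRANDOM % (65535 - 1024) + 1024 ))
```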

I haven't tried to put this into code, but for upstream, I wonder if we could add a variable to pass in the method for obtaining the random port in config.mak.  Then if a portable solution wasn't quite perfect for every system, it could be overridden easily, with defaults for various systems set in config.mak.uname.  (That could be gross overkill though, if the awk or similar method works reliably across all/most systems.)

Comment 11 Todd Zullinger 2022-08-18 19:00:18 UTC
Created attachment 1906392 [details]
use random ports for tests which start daemons

I tested again with awk using the attached patch, initializing srand() with the current nanoseconds, and the tests all passed.  That's encouraging.

I ran builds for f{35..38} in parallel and then builds for el{7..9} in parallel.  They had the ppc64le t9115 test enabled, which usually would fail with a number of builds running at the same time.  Here are the scratch builds:

f35 https://koji.fedoraproject.org/koji/taskinfo?taskID=90977280
f36 https://koji.fedoraproject.org/koji/taskinfo?taskID=90977281
f37 https://koji.fedoraproject.org/koji/taskinfo?taskID=90977282
f38 https://koji.fedoraproject.org/koji/taskinfo?taskID=90977287
el7 https://koji.fedoraproject.org/koji/taskinfo?taskID=90978304
el8 https://koji.fedoraproject.org/koji/taskinfo?taskID=90978306
el9 https://koji.fedoraproject.org/koji/taskinfo?taskID=90978308

A remaining downside is that `date +%N` isn't portable either.  Some other method to initialize srand() would be needed to upstream this.  Maybe the easiest thing is to add a test helper to do it, similar to test-genrandom.c.  But I am surely overlooking some reasonably obvious method of generating a simple semi-random number.  It's not like it needs to be cryptographically secure. :)

Comment 12 Todd Zullinger 2022-08-22 07:24:27 UTC
I played a bit more over the past several days.  I did generate a test helper in C to create the random ports.

That was simpler than trying to figure out some portable way to do it, or even whether it was OK to depend on perl, which could do this pretty easily:

  (min=1024; max=65535; perl -e "print int($min+rand()*($max-$min+1))")

The C helper was only 16 lines.  That work is on the `tmz/tests-use-random-ports` branch, here:

  https://src.fedoraproject.org/fork/tmz/rpms/git/commits/tmz/tests-use-random-ports

I ran a good number of builds, f35-f38 and epel7-epel9 in parallel, without many failures.  But the number of failures wasn't zero; even random ports occasionally trigger the problem.

For that, we really need to retry on a different port when we run into this issue.  I added a retry to the `start_httpd()` function in `t/lib-httpd.sh` which makes 3 attempts, incrementing the port number each time.  That could be combined with the random ports, but there may not be a need for both if the retry approach works well.  That work is on the `tmz/tests-retry-start_httpd` branch, here:

  https://src.fedoraproject.org/fork/tmz/rpms/git/commits/tmz/tests-retry-start_httpd
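Roughly, the retry loop looks like this (a sketch of the idea; the actual patch on the branch may differ, and try_start_httpd is just a hypothetical stand-in for the existing apache invocation in start_httpd):

```
i=0
while test $i -lt ${GIT_TEST_START_HTTPD_TRIES:-3}
do
	# try_start_httpd stands in for the existing apache start command
	try_start_httpd && break
	# the port was likely busy; bump it and try the next one
	LIB_HTTPD_PORT=$(($LIB_HTTPD_PORT + 1))
	i=$(($i + 1))
done
```

A real version still needs to fail the test setup if all attempts are exhausted, of course.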

I didn't get any failures with that method -- but it's also somewhat rare to get the failures even without it.  If you have a way to kick off a bunch of builds in brew that might trigger the bug more reliably, it would be great to know if the retry branch fixes it.  That only covers `httpd` but I think that's the primary source of failures.

The jgit, git daemon, and svnserve setups are all slightly different since they're run in the background rather than as real daemons like httpd.  That might make it a little more tedious to use the same retry method, I don't know.  If so, then the random port method might be worth adding as well.

Any testing, suggestions, or holes in either or both of these methods are very welcome. :)

Comment 13 Ondřej Pohořelský 2022-08-22 15:37:56 UTC
This looks very good, and I think it can solve or at least minimize the test failures.
I wonder if it would be useful to provide a variable that defines the number of retries when opening a random port fails. Or maybe, when it fails, go straight to another random port?
Anyway, that would probably be overkill, and we are fine with the patch as it is.

To be fair, I think the best option for dealing with the failures would be to skip the randomization part completely and go straight for a known-free port.
Sadly, that's not easily doable: the httpd server doesn't provide any function that returns a free port, netstat isn't POSIX compliant, and other ways of getting a list of free ports look like a waste of computing time.
Going for a random port is probably the best we can do here.

I'm not sure how to trigger the bug more reliably; I'll try to look into it.

Thanks for writing this patch!

Comment 14 Todd Zullinger 2022-08-22 22:17:58 UTC
The retry patch uses: `while test $i -lt ${GIT_TEST_START_HTTPD_TRIES:-3}` to make it easy to override the number of retry attempts, if necessary. :)

If you manage to trigger the failures more reliably, I'll be quite interested.  It's funny how it seems to work consistently when you hope it fails and frequently fails when you hope it works.

Thank you for helping to work out these improvements and for testing them!

I'll try to work on retries for the other few tests which need a daemon sometime soon.

Comment 15 Ondřej Pohořelský 2022-08-26 12:30:18 UTC
Oops, I overlooked that variable, sorry.

I tried to make the builds fail, but I was unsuccessful. I also tried to ask around and find a way to trigger that failure consistently, but I didn't find anything useful.
So basically, my testing consisted of spamming builds at the same time, which previously resulted in at least one failure in ten builds. Now all the builds passed, even the ones that ran on the same host.

I think you can push this into rawhide, and we will see if any failures show up. But I think this is going to be fine :)

