Created attachment 867956 (build.log)

Description of problem:
ruby fails to build on PPC, one of the self checks is failing:

    Finished tests in 429.660844s, 28.2991 tests/s, 5947.7610 assertions/s.

      1) Skipped:
    test_capture_subprocess_io(TestMiniTestUnitTestCase) [/builddir/build/BUILD/ruby-2.0.0-p353/test/minitest/test_minitest_unit.rb:1339]:
    Dunno why but the parallel run of this fails

      2) Skipped:
    test_completion_encoding(TestReadline) [/builddir/build/BUILD/ruby-2.0.0-p353/test/readline/test_readline.rb:294]:
    missing test for locale US-ASCII

      3) Skipped:
    test_input_metachar_multibyte(TestReadline) [/builddir/build/BUILD/ruby-2.0.0-p353/test/readline/test_readline.rb:420]:
    this test needs UTF-8 locale

      4) Error:
    test_autobind(TestSocket_UNIXSocket):
    Errno::ECONNREFUSED: Connection refused - connect(2)
        /builddir/build/BUILD/ruby-2.0.0-p353/.ext/common/socket.rb:65:in `connect'
        /builddir/build/BUILD/ruby-2.0.0-p353/.ext/common/socket.rb:65:in `connect_internal'
        /builddir/build/BUILD/ruby-2.0.0-p353/.ext/common/socket.rb:135:in `connect'
        /builddir/build/BUILD/ruby-2.0.0-p353/.ext/common/socket.rb:765:in `unix'
        /builddir/build/BUILD/ruby-2.0.0-p353/test/socket/test_unix.rb:589:in `block in test_autobind'
        /builddir/build/BUILD/ruby-2.0.0-p353/.ext/common/socket.rb:806:in `unix_server_socket'
        /builddir/build/BUILD/ruby-2.0.0-p353/test/socket/test_unix.rb:585:in `test_autobind'

    12159 tests, 2555520 assertions, 0 failures, 1 errors, 31 skips

    ruby -v: ruby 2.0.0p353 (2013-11-22) [powerpc64-linux]
    make: *** [yes-test-all] Error 1

Version-Release number of selected component (if applicable):
ruby-2.0.0.353-17.fc21

How reproducible:

Steps to Reproduce:
1. ppc-koji build --scratch f21 ruby-2.0.0.353-17.fc21.src.rpm

Actual results:
http://ppc.koji.fedoraproject.org/koji/taskinfo?taskID=1679470

Expected results:

Additional info:
Yes, this is a known issue [1]. The test is quite unstable on PPC, although it passes from time to time. You might be the most qualified person to provide more insight so that this issue can be resolved. This is the offending line [2].

[1] https://bugs.ruby-lang.org/issues/9373
[2] https://github.com/ruby/ruby/blob/trunk/ext/socket/lib/socket.rb#L65
I got a reproducer for this issue. It is not obvious to me what is wrong, though I'm not very familiar with abstract unix socket addressing. As I won't be around for the next few days, anyone willing to jump in and help is very welcome.

Here is the reproducer:

    #include <sys/socket.h>
    #include <stdio.h>

    int main()
    {
        /* Listening socket. */
        int fd = socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC, 0);
        int i = 1;
        setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &i, sizeof(i));

        /* Bind with only the address family (2 bytes): the kernel
           picks an abstract name for us (autobind). */
        struct sockaddr addr;
        addr.sa_family = AF_LOCAL;
        bind(fd, &addr, sizeof(addr.sa_family));
        listen(fd, 128);

        /* Read back the name the kernel assigned. */
        struct sockaddr_storage ss;
        socklen_t sslen = (socklen_t)sizeof(ss);
        getsockname(fd, (struct sockaddr*)&ss, &sslen);

        /* Connect a second socket to that name; this fails on PPC. */
        fd = socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC, 0);
        if (connect(fd, (struct sockaddr*)&ss, sslen) == -1) {
            perror(NULL);
            return 1;
        }
        printf("OK\n");
        return 0;
    }
(In reply to Gustavo Luiz Duarte from comment #2)
Thanks for the reproducer. I linked it in the upstream issue.
I looked at this some more. I don't think this is a glibc bug. It looks like something goes wrong with the kernel autobind operation.

I played around with this a bit and added a loop to the reproducer if the connection failed. The connection failure is persistent. Connection attempts from a separate process also fail. I can even bind to the same name from a different process (without setting any SO_* options):

    c000000277fee400: 00000002 00000000 00010000 0001 01 54909 @00033
    c0000002e5c2b400: 00000002 00000000 00010000 0001 01 64438 @00033

All this is consistent with a miscomputed hash over the name during autobind, which causes the name to be added to the wrong hash bucket. I stared a bit at af_unix.c:unix_autobind(), but I couldn't find anything wrong with it. Maybe all this points in the wrong direction.
Hmm, this might be a bug in the ppc64 csum_partial implementation that only appears with short lengths and suitable (mis)alignment.
Reproducible with kernel-3.14.0-0.rc4.git0.1.fc21.ppc64 and kernel-2.6.32-431.3.1.el6.ppc64
Is it just PPC64? I vaguely remember hitting this issue on PPC from time to time as well.
Thanks for the testcase Gustavo! It is indeed a kernel issue, I've sent out a possible fix at: http://patchwork.ozlabs.org/patch/326203/
http://patchwork.ozlabs.org/patch/326572/ is the latest revision. We'll grab it as soon as it's ACKed.
It's now upstream: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=0a13404dd3bf4ea870e3d96270b5a382edca85c0
Yep. Should be in the build I did last Friday. Thanks Anton!
So when will this propagate to PPC? It should also propagate into the builders' kernel ... I don't think CLOSED is the appropriate state, although the component might be reconsidered.
I have no idea on either of those. The fix is in the kernel and there's nothing more I think we can do from that aspect. If there is, someone let me know. We can move this to the distribution component if we need to, but for getting the builders updated that might be better served by opening a rel-eng ticket.
@Vit: The PPC builders are running RHEL-6.5. As I don't think that we'll get an official RHEL-6 kernel with this fix anytime soon, could you exclude that particular test from being run on PPC?
I reported bug 1077296 against RHEL and I am reassigning this back to Ruby to disable the test until the fix is available in the builders' kernel.
@Karsten: This was fixed in RHEL 6.6. How does it look with Fedora's builders? Are they updated already? Can we close this? I don't remember seeing this issue on Fedora recently.
I haven't seen this issue on the builders for quite a while. The PPC koji hub is still running an older EL6 kernel until we manage to schedule some downtime. But there's a fixed kernel installed already, just waiting for a reboot. I think this can be closed.