Bug 1070206 - FTBFS: Error: test_autobind(TestSocket_UNIXSocket): Errno::ECONNREFUSED: Connection refused - connect(2)
Summary: FTBFS: Error: test_autobind(TestSocket_UNIXSocket): Errno::ECONNREFUSED: Conn...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: ruby
Version: rawhide
Hardware: powerpc
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Jeroen van Meeuwen
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On: 1077296 1106402
Blocks: PPCTracker
Reported: 2014-02-26 12:06 UTC by Karsten Hopp
Modified: 2014-11-18 09:47 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1077225 (view as bug list)
Environment:
Last Closed: 2014-11-18 09:47:30 UTC
Type: Bug
Embargoed:


Attachments
build.log (919.01 KB, text/plain)
2014-02-26 12:06 UTC, Karsten Hopp


Links
System ID Private Priority Status Summary Last Updated
Ruby 9373 0 None None None Never

Description Karsten Hopp 2014-02-26 12:06:54 UTC
Created attachment 867956
build.log

Description of problem:
ruby fails to build on PPC; one of the self-checks is failing:

Finished tests in 429.660844s, 28.2991 tests/s, 5947.7610 assertions/s.
  1) Skipped:
test_capture_subprocess_io(TestMiniTestUnitTestCase) [/builddir/build/BUILD/ruby-2.0.0-p353/test/minitest/test_minitest_unit.rb:1339]:
Dunno why but the parallel run of this fails
  2) Skipped:
test_completion_encoding(TestReadline) [/builddir/build/BUILD/ruby-2.0.0-p353/test/readline/test_readline.rb:294]:
missing test for locale US-ASCII
  3) Skipped:
test_input_metachar_multibyte(TestReadline) [/builddir/build/BUILD/ruby-2.0.0-p353/test/readline/test_readline.rb:420]:
this test needs UTF-8 locale
  4) Error:
test_autobind(TestSocket_UNIXSocket):
Errno::ECONNREFUSED: Connection refused - connect(2)
    /builddir/build/BUILD/ruby-2.0.0-p353/.ext/common/socket.rb:65:in `connect'
    /builddir/build/BUILD/ruby-2.0.0-p353/.ext/common/socket.rb:65:in `connect_internal'
    /builddir/build/BUILD/ruby-2.0.0-p353/.ext/common/socket.rb:135:in `connect'
    /builddir/build/BUILD/ruby-2.0.0-p353/.ext/common/socket.rb:765:in `unix'
    /builddir/build/BUILD/ruby-2.0.0-p353/test/socket/test_unix.rb:589:in `block in test_autobind'
    /builddir/build/BUILD/ruby-2.0.0-p353/.ext/common/socket.rb:806:in `unix_server_socket'
    /builddir/build/BUILD/ruby-2.0.0-p353/test/socket/test_unix.rb:585:in `test_autobind'
12159 tests, 2555520 assertions, 0 failures, 1 errors, 31 skips
ruby -v: ruby 2.0.0p353 (2013-11-22) [powerpc64-linux]
make: *** [yes-test-all] Error 1

Version-Release number of selected component (if applicable):
ruby-2.0.0.353-17.fc21

How reproducible:


Steps to Reproduce:
1. ppc-koji build --scratch f21 ruby-2.0.0.353-17.fc21.src.rpm

Actual results:
http://ppc.koji.fedoraproject.org/koji/taskinfo?taskID=1679470

Expected results:


Additional info:

Comment 1 Vít Ondruch 2014-02-26 12:58:17 UTC
Yes, this is a known issue [1]. The test is quite unstable on PPC, although it passes from time to time. You might be the most qualified person to provide more insight so that this issue can be resolved. This is the offending line [2].


[1] https://bugs.ruby-lang.org/issues/9373
[2] https://github.com/ruby/ruby/blob/trunk/ext/socket/lib/socket.rb#L65

Comment 2 Gustavo Luiz Duarte 2014-02-28 22:05:39 UTC
I got a reproducer for this issue. It is not obvious to me what is wrong, though I'm not very familiar with abstract unix socket addressing. As I won't be around for the next few days, anyone willing to jump in and help is very welcome.

Here is the reproducer:

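#include <sys/socket.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
        /* Listening socket in the local (UNIX) domain. */
        int srv = socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC, 0);
        if (srv == -1) {
                perror("socket");
                return 1;
        }

        int i = 1;
        setsockopt(srv, SOL_SOCKET, SO_REUSEADDR, &i, sizeof(i));

        /* Passing only the address family (length 2) asks the kernel
           to autobind the socket to an abstract name. */
        struct sockaddr addr;
        memset(&addr, 0, sizeof(addr));
        addr.sa_family = AF_LOCAL;
        if (bind(srv, &addr, sizeof(addr.sa_family)) == -1) {
                perror("bind");
                return 1;
        }

        if (listen(srv, 128) == -1) {
                perror("listen");
                return 1;
        }

        /* Read back the name the kernel picked. */
        struct sockaddr_storage ss;
        socklen_t sslen = (socklen_t)sizeof(ss);
        if (getsockname(srv, (struct sockaddr*)&ss, &sslen) == -1) {
                perror("getsockname");
                return 1;
        }

        /* Connect a second socket to that name while srv is still open.
           This is the call that fails with ECONNREFUSED. */
        int cli = socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC, 0);
        if (cli == -1) {
                perror("socket");
                return 1;
        }

        if (connect(cli, (struct sockaddr*)&ss, sslen) == -1){
                perror("connect");
                return 1;
        }
        printf("OK\n");
        return 0;
}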
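
For readers who, like the commenter, are not familiar with abstract UNIX socket addressing: on Linux, an autobound name is a NUL byte followed by five characters from [0-9a-f], and it is not NUL-terminated, so its length has to be taken from the address length that getsockname() returns. The following is a minimal sketch of reading that name back. The helper name autobind_name and the "@xxxxx" output notation are assumptions made for illustration and are not part of the reproducer.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical helper: autobind a fresh AF_UNIX socket and write the
 * abstract name chosen by the kernel into out as "@xxxxx".
 * Returns the socket fd, or -1 on error. Linux-specific. */
int autobind_name(char *out, size_t outlen)
{
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (fd == -1)
                return -1;

        struct sockaddr_un addr;
        memset(&addr, 0, sizeof(addr));
        addr.sun_family = AF_UNIX;

        /* Only the family is passed, so the kernel picks the name. */
        if (bind(fd, (struct sockaddr *)&addr, sizeof(sa_family_t)) == -1) {
                close(fd);
                return -1;
        }

        socklen_t len = sizeof(addr);
        if (getsockname(fd, (struct sockaddr *)&addr, &len) == -1) {
                close(fd);
                return -1;
        }

        /* Bytes of sun_path that are actually in use, leading NUL included. */
        size_t namelen = len - offsetof(struct sockaddr_un, sun_path);
        if (namelen < 1 || addr.sun_path[0] != '\0' || namelen + 1 > outlen) {
                close(fd);
                return -1;
        }

        out[0] = '@';                           /* stands in for the NUL byte */
        memcpy(out + 1, addr.sun_path + 1, namelen - 1);
        out[namelen] = '\0';
        return fd;
}
```

A caller would pass a buffer of at least 7 bytes, print the result, and keep the returned fd open for as long as the name should stay bound.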

Comment 3 Vít Ondruch 2014-03-03 08:44:16 UTC
(In reply to Gustavo Luiz Duarte from comment #2)
Thanks for the reproducer. I linked it in the upstream issue.

Comment 4 Florian Weimer 2014-03-03 14:52:08 UTC
I looked at this some more.  I don't think this is a glibc bug.  It looks like something goes wrong with the kernel autobind operation.

I played around with this a bit and added a retry loop to the reproducer for the case where the connection fails.  The connection failure is persistent.  Connection attempts from a separate process also fail.  I can even bind to the same name from a different process (without setting any SO_* options):

c000000277fee400: 00000002 00000000 00010000 0001 01 54909 @00033
c0000002e5c2b400: 00000002 00000000 00010000 0001 01 64438 @00033

All this is consistent with a miscomputed hash over the name during autobind, which causes the name to be added to the wrong hash bucket.

I stared a bit at af_unix.c:unix_autobind(), but I couldn't find anything wrong with it.  Maybe all this points in the wrong direction.
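
The two checks described in this comment (repeated connection attempts, and a second bind to the same name) can be sketched roughly as follows. This is not the code that was actually run: the function names are hypothetical, and everything happens in one process, whereas the comment used separate processes. The two hex-prefixed lines quoted above appear to be /proc/net/unix entries that both carry the abstract name @00033.

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Autobind a listening socket and store the address the kernel chose.
 * Returns the listener fd, or -1 on error. */
int autobind_listener(struct sockaddr_un *addr, socklen_t *len)
{
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (fd == -1)
                return -1;

        memset(addr, 0, sizeof(*addr));
        addr->sun_family = AF_UNIX;
        *len = sizeof(*addr);
        if (bind(fd, (struct sockaddr *)addr, sizeof(sa_family_t)) == -1 ||
            listen(fd, 128) == -1 ||
            getsockname(fd, (struct sockaddr *)addr, len) == -1) {
                close(fd);
                return -1;
        }
        return fd;
}

/* Try to connect `attempts` times; return how many attempts succeeded. */
int count_connects(const struct sockaddr_un *addr, socklen_t len, int attempts)
{
        int ok = 0;
        for (int i = 0; i < attempts; i++) {
                int c = socket(AF_UNIX, SOCK_STREAM, 0);
                if (c == -1)
                        continue;
                if (connect(c, (const struct sockaddr *)addr, len) == 0)
                        ok++;
                close(c);
        }
        return ok;
}

/* Explicitly bind a second socket to the same name.
 * Returns 0 if the bind succeeded, otherwise the errno value. */
int rebind_errno(const struct sockaddr_un *addr, socklen_t len)
{
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (fd == -1)
                return errno;

        int err = 0;
        if (bind(fd, (const struct sockaddr *)addr, len) == -1)
                err = errno;
        close(fd);
        return err;
}
```

On a kernel without the bug, every connect should succeed and the second bind should fail with EADDRINUSE. The behaviour reported in this comment corresponds to count_connects() returning 0 and rebind_errno() returning 0.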

Comment 5 Florian Weimer 2014-03-03 15:07:52 UTC
Hmm, this might be a bug in the ppc64 csum_partial implementation that only appears with short lengths and suitable (mis)alignment.
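
One way to probe a hypothesis like this is to compare an optimized checksum routine against a simple reference at every short length and every starting alignment. The sketch below is a userspace illustration only: ref_csum16 is a plain RFC 1071 style ones'-complement sum, not the kernel's csum_partial, and the names ref_csum16, csum_fn and csum_agrees are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reference 16-bit ones'-complement sum (RFC 1071 style, not inverted).
 * It reads byte by byte, so the result cannot depend on alignment. */
uint16_t ref_csum16(const unsigned char *p, size_t len)
{
        uint32_t sum = 0;
        size_t i;

        for (i = 0; i + 1 < len; i += 2)
                sum += ((uint32_t)p[i] << 8) | p[i + 1];
        if (i < len)
                sum += (uint32_t)p[i] << 8;     /* odd trailing byte, zero-padded */
        while (sum >> 16)
                sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)sum;
}

typedef uint16_t (*csum_fn)(const unsigned char *, size_t);

/* Compare a candidate implementation with the reference for every
 * length 0..max_len at every start offset 0..7.
 * Returns 1 if all results agree, 0 otherwise. */
int csum_agrees(csum_fn candidate, size_t max_len)
{
        unsigned char ref[64];
        unsigned char buf[64 + 8];

        if (max_len > sizeof(ref))
                return 0;
        for (size_t i = 0; i < sizeof(ref); i++)
                ref[i] = (unsigned char)(i * 37 + 11);  /* arbitrary pattern */

        for (size_t off = 0; off < 8; off++) {
                for (size_t len = 0; len <= max_len; len++) {
                        memcpy(buf + off, ref, len);    /* same bytes, shifted start */
                        if (candidate(buf + off, len) != ref_csum16(ref, len))
                                return 0;
                }
        }
        return 1;
}
```

If the hash bucket is derived from such a checksum, a routine that returned different values for the same bytes at different alignments would insert a name into one bucket and look it up in another, which matches the symptoms described in comment 4.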

Comment 6 Karsten Hopp 2014-03-03 16:27:18 UTC
Reproducible with kernel-3.14.0-0.rc4.git0.1.fc21.ppc64 and kernel-2.6.32-431.3.1.el6.ppc64.

Comment 7 Vít Ondruch 2014-03-03 16:36:57 UTC
Is it just PPC64? I vaguely recall running into this issue on PPC as well from time to time.

Comment 8 Anton Blanchard 2014-03-04 10:02:53 UTC
Thanks for the test case, Gustavo! It is indeed a kernel issue; I've sent out a possible fix at:

http://patchwork.ozlabs.org/patch/326203/

Comment 9 Josh Boyer 2014-03-06 14:42:46 UTC
http://patchwork.ozlabs.org/patch/326572/ is the latest revision.  We'll grab it as soon as it's ACKed.

Comment 11 Josh Boyer 2014-03-17 12:58:03 UTC
Yep.  Should be in the build I did last Friday.  Thanks Anton!

Comment 12 Vít Ondruch 2014-03-17 15:01:18 UTC
So when will this propagate to PPC? It should also propagate to the builders' kernel ... I don't think CLOSED is the appropriate state, although the component might be reconsidered.

Comment 13 Josh Boyer 2014-03-17 15:10:39 UTC
I have no idea on either of those.  The fix is in the kernel and there's nothing more I think we can do from that aspect.  If there is, someone let me know.  We can move this to the distribution component if we need to, but for getting the builders updated that might be better served by opening a rel-eng ticket.

Comment 14 Karsten Hopp 2014-03-17 15:33:48 UTC
@Vit: The PPC builders are running RHEL-6.5. As I don't think that we'll get an official RHEL-6 kernel with this fix anytime soon, could you exclude that particular test from being run on PPC?

Comment 15 Vít Ondruch 2014-03-17 16:34:54 UTC
I reported bug 1077296 against RHEL, and I am reassigning this back to Ruby so that the test can be disabled until the fix is available in the builders' kernel.

Comment 16 Vít Ondruch 2014-11-18 08:04:39 UTC
@Karsten: This was fixed in RHEL 6.6. How does it look with Fedora's builders? Are they updated already? Can we close this? I don't remember seeing this issue on Fedora recently.

Comment 17 Karsten Hopp 2014-11-18 09:47:30 UTC
I haven't seen this issue on the builders for quite a while.
The PPC koji hub is still running an older EL6 kernel until we manage to schedule some downtime. But there's a fixed kernel installed already, just waiting for a reboot.
I think this can be closed.

