1070206 – FTBFS: Error: test_autobind(TestSocket_UNIXSocket): Errno::ECONNREFUSED: Connection refused - connect(2)


Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	ruby
Sub Component:
Version:	rawhide
Hardware:	powerpc
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Jeroen van Meeuwen
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:	1077296 1106402
Blocks:	PPCTracker

Reported:	2014-02-26 12:06 UTC by Karsten Hopp
Modified:	2014-11-18 09:47 UTC
CC List:	17 users
Fixed In Version:
Clone Of:
Clones:	1077225
Environment:
Last Closed:	2014-11-18 09:47:30 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments
build.log (919.01 KB, text/plain) 2014-02-26 12:06 UTC, Karsten Hopp

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Ruby	9373	0	None	None	None	Never

Description Karsten Hopp 2014-02-26 12:06:54 UTC

Created attachment 867956
build.log

Description of problem:
ruby fails to build on PPC; one of the self checks is failing:

Finished tests in 429.660844s, 28.2991 tests/s, 5947.7610 assertions/s.
  1) Skipped:
test_capture_subprocess_io(TestMiniTestUnitTestCase) [/builddir/build/BUILD/ruby-2.0.0-p353/test/minitest/test_minitest_unit.rb:1339]:
Dunno why but the parallel run of this fails
  2) Skipped:
test_completion_encoding(TestReadline) [/builddir/build/BUILD/ruby-2.0.0-p353/test/readline/test_readline.rb:294]:
missing test for locale US-ASCII
  3) Skipped:
test_input_metachar_multibyte(TestReadline) [/builddir/build/BUILD/ruby-2.0.0-p353/test/readline/test_readline.rb:420]:
this test needs UTF-8 locale
  4) Error:
test_autobind(TestSocket_UNIXSocket):
Errno::ECONNREFUSED: Connection refused - connect(2)
    /builddir/build/BUILD/ruby-2.0.0-p353/.ext/common/socket.rb:65:in `connect'
    /builddir/build/BUILD/ruby-2.0.0-p353/.ext/common/socket.rb:65:in `connect_internal'
    /builddir/build/BUILD/ruby-2.0.0-p353/.ext/common/socket.rb:135:in `connect'
    /builddir/build/BUILD/ruby-2.0.0-p353/.ext/common/socket.rb:765:in `unix'
    /builddir/build/BUILD/ruby-2.0.0-p353/test/socket/test_unix.rb:589:in `block in test_autobind'
    /builddir/build/BUILD/ruby-2.0.0-p353/.ext/common/socket.rb:806:in `unix_server_socket'
    /builddir/build/BUILD/ruby-2.0.0-p353/test/socket/test_unix.rb:585:in `test_autobind'
12159 tests, 2555520 assertions, 0 failures, 1 errors, 31 skips
ruby -v: ruby 2.0.0p353 (2013-11-22) [powerpc64-linux]
make: *** [yes-test-all] Error 1

Version-Release number of selected component (if applicable):
ruby-2.0.0.353-17.fc21

How reproducible:


Steps to Reproduce:
1. ppc-koji build --scratch f21 ruby-2.0.0.353-17.fc21.src.rpm

Actual results:
http://ppc.koji.fedoraproject.org/koji/taskinfo?taskID=1679470

Expected results:


Additional info:

Comment 1 Vít Ondruch 2014-02-26 12:58:17 UTC

Yes, this is a known issue [1]. The test is quite unstable on PPC, although it passes from time to time. You might be the most qualified person to provide more insight and help get this issue resolved. This is the offending line [2].


[1] https://bugs.ruby-lang.org/issues/9373
[2] https://github.com/ruby/ruby/blob/trunk/ext/socket/lib/socket.rb#L65

Comment 2 Gustavo Luiz Duarte 2014-02-28 22:05:39 UTC

I got a reproducer for this issue. It is not obvious to me what is wrong, though I'm not very familiar with abstract unix socket addressing. As I won't be around for the next few days, anyone willing to jump in and help is very welcome.

Here is the reproducer:

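#include <sys/socket.h>
#include <stdio.h>

int main(void)
{
        /* Server socket in the local (UNIX) domain. */
        int fd = socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC, 0);

        int i = 1;
        setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &i, sizeof(i));

        /* Pass only the family field (2 bytes) to bind() so that the
           kernel autobinds the socket to an abstract address. */
        struct sockaddr addr;
        addr.sa_family = AF_LOCAL;
        bind(fd, &addr, sizeof(addr.sa_family));

        listen(fd, 128);

        /* Read back the address the kernel chose. */
        struct sockaddr_storage ss;
        socklen_t sslen = (socklen_t)sizeof(ss);
        getsockname(fd, (struct sockaddr*)&ss, &sslen);

        /* Client socket: connect to the autobound address. */
        fd = socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC, 0);

        if (connect(fd, (struct sockaddr*)&ss, sslen) == -1){
                perror(NULL);
                return 1;
        }
        printf("OK\n");
        return 0;
}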

Comment 3 Vít Ondruch 2014-03-03 08:44:16 UTC

(In reply to Gustavo Luiz Duarte from comment #2)
Thanks for the reproducer. I linked it in the upstream issue.

Comment 4 Florian Weimer 2014-03-03 14:52:08 UTC

I looked at this some more.  I don't think this is a glibc bug.  It looks like something goes wrong with the kernel autobind operation.

I played around with this a bit and added a retry loop to the reproducer for the case where the connection fails.  The connection failure is persistent.  Connection attempts from a separate process also fail.  I can even bind to the same name from a different process (without setting any SO_* options):

c000000277fee400: 00000002 00000000 00010000 0001 01 54909 @00033
c0000002e5c2b400: 00000002 00000000 00010000 0001 01 64438 @00033

All this is consistent with a miscomputed hash over the name during autobind, which causes the name to be added to the wrong hash bucket.

I stared a bit at af_unix.c:unix_autobind(), but I couldn't find anything wrong with it.  Maybe all this points in the wrong direction.

Comment 5 Florian Weimer 2014-03-03 15:07:52 UTC

Hmm, this might be a bug in the ppc64 csum_partial implementation that only appears with short lengths and suitable (mis)alignment.

Comment 6 Karsten Hopp 2014-03-03 16:27:18 UTC

Reproducible with kernel-3.14.0-0.rc4.git0.1.fc21.ppc64 and kernel-2.6.32-431.3.1.el6.ppc64.

Comment 7 Vít Ondruch 2014-03-03 16:36:57 UTC

Is it just PPC64? I vaguely remember hitting this issue on PPC from time to time as well.

Comment 8 Anton Blanchard 2014-03-04 10:02:53 UTC

Thanks for the testcase Gustavo! It is indeed a kernel issue, I've sent out a possible fix at:

http://patchwork.ozlabs.org/patch/326203/

Comment 9 Josh Boyer 2014-03-06 14:42:46 UTC

http://patchwork.ozlabs.org/patch/326572/ is the latest revision.  We'll grab it as soon as it's ACKed.

Comment 10 Anton Blanchard 2014-03-15 06:20:23 UTC

It's now upstream:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=0a13404dd3bf4ea870e3d96270b5a382edca85c0

Comment 11 Josh Boyer 2014-03-17 12:58:03 UTC

Yep.  Should be in the build I did last Friday.  Thanks Anton!

Comment 12 Vít Ondruch 2014-03-17 15:01:18 UTC

So when will this propagate into PPC? It should also propagate into the builders' kernel. I don't think that closed is the appropriate state, although the component might be reconsidered.

Comment 13 Josh Boyer 2014-03-17 15:10:39 UTC

I have no idea on either of those.  The fix is in the kernel and there's nothing more I think we can do from that aspect.  If there is, someone let me know.  We can move this to the distribution component if we need to, but for getting the builders updated that might be better served by opening a rel-eng ticket.

Comment 14 Karsten Hopp 2014-03-17 15:33:48 UTC

@Vit: The PPC builders are running RHEL-6.5. As I don't think that we'll get an official RHEL-6 kernel with this fix anytime soon, could you exclude that particular test from being run on PPC?

Comment 15 Vít Ondruch 2014-03-17 16:34:54 UTC

I reported bug 1077296 against RHEL, and I am reassigning this one back to Ruby to disable the test until the fix is available in the builders' kernel.

Comment 16 Vít Ondruch 2014-11-18 08:04:39 UTC

@Karsten: This was fixed in RHEL 6.6. How does it look with Fedora's builders? Are they updated already? Can we close this? I don't remember seeing this issue on Fedora recently.

Comment 17 Karsten Hopp 2014-11-18 09:47:30 UTC

I haven't seen this issue on the builders for quite a while.
The PPC koji hub is still running an older EL6 kernel until we manage to schedule some downtime. But there's a fixed kernel installed already, just waiting for a reboot.
I think this can be closed.
