1164221 – test fails on ppc64

Summary: test fails on ppc64

Keywords:
Status:	CLOSED EOL
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	glibc
Sub Component:
Version:	21
Hardware:	ppc64
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Carlos O'Donell
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	PPCTracker

Reported:	2014-11-14 11:33 UTC by Dan Horák
Modified:	2016-11-24 12:11 UTC
CC List:	17 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2015-12-02 04:58:41 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments
Back traces (9.16 KB, text/plain) 2014-12-10 15:08 UTC, Jakub Čajka
valgrind (4.85 KB, text/plain) 2015-01-13 18:25 UTC, Jakub Čajka
valgring output (3.74 KB, text/plain) 2015-01-14 15:10 UTC, Jakub Čajka

Description Dan Horák 2014-11-14 11:33:59 UTC

a test is failing in ruby-2.1.4-24.fc21

...

Finished tests in 824.375963s, 18.3933 tests/s, 3383.3750 assertions/s.
  1) Skipped:
TestGemExtBuilder#test_build_extensions_extconf_bad [/builddir/build/BUILD/ruby-2.1.4/test/rubygems/test_gem_ext_builder.rb:236]:
Gem.ruby is not the name of the binary being run in the end
  2) Skipped:
TestRequire#test_require_nonascii_path [/builddir/build/BUILD/ruby-2.1.4/test/ruby/test_require.rb:60]:
cannot convert path encoding to filesystem
  3) Skipped:
TestMiniTestUnitTestCase#test_capture_subprocess_io [/builddir/build/BUILD/ruby-2.1.4/test/minitest/test_minitest_unit.rb:1396]:
Dunno why but the parallel run of this fails
  4) Error:
TestM17NComb#test_str_crypt:
Errno::EINVAL: Invalid argument - crypt
    /builddir/build/BUILD/ruby-2.1.4/test/ruby/test_m17n_comb.rb:749:in `crypt'
    /builddir/build/BUILD/ruby-2.1.4/test/ruby/test_m17n_comb.rb:749:in `block in test_str_crypt'
    /builddir/build/BUILD/ruby-2.1.4/test/ruby/allpairs.rb:82:in `block in each'
    /builddir/build/BUILD/ruby-2.1.4/test/ruby/allpairs.rb:74:in `block in each_index'
    /builddir/build/BUILD/ruby-2.1.4/test/ruby/allpairs.rb:46:in `block in make_large_block'
    /builddir/build/BUILD/ruby-2.1.4/test/ruby/allpairs.rb:26:in `block (2 levels) in make_basic_block'
    /builddir/build/BUILD/ruby-2.1.4/test/ruby/allpairs.rb:21:in `times'
    /builddir/build/BUILD/ruby-2.1.4/test/ruby/allpairs.rb:21:in `block in make_basic_block'
    /builddir/build/BUILD/ruby-2.1.4/test/ruby/allpairs.rb:20:in `times'
    /builddir/build/BUILD/ruby-2.1.4/test/ruby/allpairs.rb:20:in `make_basic_block'
    /builddir/build/BUILD/ruby-2.1.4/test/ruby/allpairs.rb:45:in `make_large_block'
    /builddir/build/BUILD/ruby-2.1.4/test/ruby/allpairs.rb:70:in `each_index'
    /builddir/build/BUILD/ruby-2.1.4/test/ruby/allpairs.rb:81:in `each'
    /builddir/build/BUILD/ruby-2.1.4/test/ruby/test_m17n_comb.rb:63:in `combination'
    /builddir/build/BUILD/ruby-2.1.4/test/ruby/test_m17n_comb.rb:741:in `test_str_crypt'
15163 tests, 2789173 assertions, 0 failures, 1 errors, 32 skips
ruby -v: ruby 2.1.4p265 (2014-10-27 revision 48166) [powerpc64-linux]
uncommon.mk:528: recipe for target 'yes-test-all' failed
make: *** [yes-test-all] Error 1
RPM build errors:

for full logs please see http://ppc.koji.fedoraproject.org/koji/taskinfo?taskID=2181989

Version-Release number of selected component (if applicable):
ruby-2.1.4-24.fc21

Comment 1 Vít Ondruch 2014-11-20 10:45:03 UTC

There used to be such an error:

https://bugs.ruby-lang.org/issues/7312

Is there a chance that glibc detection fails? Otherwise this might be the offending commit:

https://github.com/ruby/ruby/commit/def5eab9e10120de90ec2d2dac1828b1cffb31c1

Comment 2 Jakub Čajka 2014-11-20 13:34:02 UTC

Actually, glibc detection does fail, but in a somewhat unexpected way: at `#{glibcpath}` in the above-mentioned test case. It returns an empty string, although the path to libc (/lib64/libc.so.6) is correct (even when passed explicitly as `/lib64...`). Running it outside ruby returns the expected result. Running 'ordinary' executables (`/bin/ls`) in the test case also yields the expected results. Replacing the version detection with the expected version ('2.20'.split(... ) in the test makes it finish successfully.
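
For illustration, the detection step described above can be sketched in Ruby as follows. This is a minimal sketch, not the code from test_m17n_comb.rb: the helper name glibc_version_from_banner and the regular expression are assumptions. It shows why an empty string from the libc invocation leaves the test without a usable version.

```ruby
# Hypothetical helper (not taken from the Ruby test suite): extract the
# glibc version from the banner that libc.so.6 prints when it is executed.
# Returns an array of integers such as [2, 20], or nil if no version is found.
def glibc_version_from_banner(banner)
  version = banner[/version ([0-9]+(?:\.[0-9]+)*)/, 1]
  return nil if version.nil?

  version.split('.').map(&:to_i)
end

# A banner line in the form glibc prints, followed by the empty output
# seen in the failing test run.
banner = "GNU C Library (GNU libc) stable release version 2.20, by Roland McGrath et al.\n"
p glibc_version_from_banner(banner) # version parsed from the banner
p glibc_version_from_banner("")     # nil: nothing to parse
```

Any caller that goes on to split or compare the result has to handle the nil case explicitly, otherwise the empty output surfaces later as an unrelated-looking error.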

I tried a scratch build of ruby-2.1.4 for f20; surprisingly, this test case passes there (the build fails elsewhere on ppc(64)).

http://ppc.koji.fedoraproject.org/koji/taskinfo?taskID=2190978

I noticed the glibc version difference, 2.18 vs. 2.20 (the latter contains ppc-related changes, and the failure above affects only ppc64...).

Comment 3 Vít Ondruch 2014-11-20 14:12:42 UTC

Oh my, I was not aware that they again reverted [2] my patches proposed in [1]


(In reply to Jakub Čajka from comment #2)
> I have tried to scratch build ruby-2.1.4 for f20, surprisingly this test
> case passes(build fails elsewhere on ppc(64)).
> 
> http://ppc.koji.fedoraproject.org/koji/taskinfo?taskID=2190978
> 
> I have noted glibc version difference, 2.18 x 2.20 (contains ppc related
> changes(as ^^ is affecting only ppc64...)).


Do I read it correctly, that this test is passing with current glibc?


[1] https://bugs.ruby-lang.org/issues/7312
[2] https://bugs.ruby-lang.org/issues/7828

Comment 4 Jakub Čajka 2014-11-20 14:41:58 UTC

(In reply to Vít Ondruch from comment #3)

> Do I read it correctly, that this test is passing with current glibc?

Nope, the test passes with the old one (f20 with 2.18 passes, f21 with 2.20 fails). There are more differences (gcc, ...); this is the one that stood out (the bug is related to glibc detection..., and it fails only on ppc64...).

A bit unrelated:

The build of ruby recently failed on ppc64le with the following (it passed last time, see OP):

  5) Failure:
OpenSSL::TestPKCS7#test_signed [/builddir/build/BUILD/ruby-2.1.4/test/openssl/test_pkcs7.rb:47]:
Failed assertion, no message given.

http://ppc.koji.fedoraproject.org/koji/taskinfo?taskID=2189041

There was no change in the buildroot; it seems to be a timing issue.

Comment 5 Jakub Čajka 2014-12-10 15:08:43 UTC

Created attachment 966856 [details]
Back traces

To start again: my previous comments were a bit confusing. Sorry for that.

The actual problem seems to be caused by the invocation of `/lib64/libc.so.6` in test/ruby/test_m17n_comb.rb:736, which leads to a segmentation fault (which is not reported by ruby) and a consequent crypt failure with EINVAL (empty string as salt, line 749).
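
The silent failure described above can be illustrated with a short Ruby sketch. It is an illustration only, assuming the test captures the command's output the way backticks do; run_and_report is a hypothetical helper, and the child here is a shell that sends itself SIGSEGV rather than the actual libc binary. A child killed by a signal yields empty output and raises no exception; only the exit status reveals the crash.

```ruby
require 'open3'

# Hypothetical helper: run a command directly (no intermediate shell),
# capture its stdout, and report whether it was killed by a signal.
def run_and_report(*argv)
  output, status = Open3.capture2(*argv)
  { output: output, signaled: status.signaled?, termsig: status.termsig }
end

# Stand-in for the crashing libc invocation: a shell that sends SIGSEGV to itself.
result = run_and_report('sh', '-c', 'kill -s SEGV $$')
puts "output: #{result[:output].inspect}"
puts "killed by signal: #{result[:signaled]} (#{result[:termsig].inspect})"
```

With backticks, the same information is available in `$?` after the command returns, so a test could check `$?.signaled?` before trusting an empty result.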

It seems that the build fails only on ppc64 with glibc-2.19 and newer (tried 2.19.90-36, 2.20.90-10, and f21 2.20-5), although I was unable to reproduce it outside the ruby build tests. Using the older glibc-2.18.90-27, or building for f20 (glibc-2.18), results in a successful build.

(2.20.90-10 was built with the valgrind tests disabled.)

Running other "normal" executables, for example `/bin/ls`, doesn't result in a segfault.

Please see the attachment for backtraces from gdb.

Comment 6 Jakub Čajka 2014-12-15 15:05:06 UTC

Could someone from glibc please check whether this is a glibc bug?
Thanks!

Comment 7 Carlos O'Donell 2014-12-17 02:31:34 UTC

(In reply to Jakub Čajka from comment #6)
> Could someone from glibc please check this bug if it's glibc bug?
> Thanks!

We need a cut down reproducer please.

Please provide as small a test case as possible that shows the problem and that can be run outside of the ruby testsuite.

Comment 8 Vít Ondruch 2014-12-17 10:41:46 UTC

A single test case can be run simply, for example:

make test-all TESTS="-v -n test_str_crypt test/ruby/test_m17n_comb.rb"

Comment 9 Jakub Čajka 2015-01-09 13:54:23 UTC

Hello, thank you for checking this bug.

I looked into it further and finally managed to find a reproducer outside ruby.

Steps to reproduce:

Assuming a ppc64 machine with Fedora 21 and a glibc newer than 2.19.90-19 (the last successful build of ruby used 2.19.90-19; it also doesn't crash in f20 with glibc-2.18).

packages used:

glibc-2.20-5.fc21.ppc64p7
ruby-libs-2.1.2-23.fc21.ppc64

steps to reproduce:

<mock-chroot>[root@power /]# LD_PRELOAD=/lib64/libruby.so.2.1.0
<mock-chroot>[root@power /]# export LD_PRELOAD
<mock-chroot>[root@power /]# /lib64/libc.so.6 
Segmentation fault (core dumped)

Without LD_PRELOAD:

<mock-chroot>[root@power /]# /lib64/libc.so.6 
GNU C Library (GNU libc) stable release version 2.20, by Roland McGrath et al.
Copyright (C) 2014 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 4.9.1 20140912 (Red Hat 4.9.1-9).
Available extensions:
	The C stubs add-on version 2.1.2.
	crypt add-on version 2.1 by Michael Glad and others
	GNU Libidn by Simon Josefsson
	Native POSIX Threads Library by Ulrich Drepper et al
	BIND-8.2.3-T5B
	RT using linux kernel aio
libc ABIs: UNIQUE IFUNC
For bug reporting instructions, please see:
<http://www.gnu.org/software/libc/bugs.html>.


The segfault seems to be triggered by LD_PRELOAD-ing a library (in the case of the ruby build, it is libruby). It can be reproduced by preloading some libraries (libruby, libssl, libssh, ...), but not others (libselinux, libsepol, libc, libbz, ...), all chosen at random.
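
The manual check above can be repeated over several candidate libraries with a small Ruby script. This is a sketch under stated assumptions: crashes_with_preload? is a hypothetical helper, the library paths are supplied on the command line (only /lib64/libruby.so.2.1.0 and /lib64/libc.so.6 appear in this report), and a run counts as a crash when the process is terminated by a signal.

```ruby
require 'open3'

# Hypothetical helper: run argv with LD_PRELOAD set to `preload` and return
# true if the process was terminated by a signal (for example SIGSEGV).
def crashes_with_preload?(preload, *argv)
  _output, status = Open3.capture2e({ 'LD_PRELOAD' => preload }, *argv)
  status.signaled?
end

# Usage (the script name is arbitrary):
#   ruby preload_check.rb /lib64/libruby.so.2.1.0 [more library paths...]
# Each path is preloaded in turn while executing /lib64/libc.so.6.
if __FILE__ == $PROGRAM_NAME
  ARGV.each do |library|
    crashed = crashes_with_preload?(library, '/lib64/libc.so.6')
    puts "#{library}: #{crashed ? 'crash' : 'ok'}"
  end
end
```

Passing the environment as a hash keeps LD_PRELOAD scoped to the child process, so the script itself is not affected by the preloaded library.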

Comment 10 Carlos O'Donell 2015-01-12 15:39:40 UTC

(In reply to Jakub Čajka from comment #9)
> <mock-chroot>[root@power /]# LD_PRELOAD=/lib64/libruby.so.2.1.0
> <mock-chroot>[root@power /]# export LD_PRELOAD
> <mock-chroot>[root@power /]# /lib64/libc.so.6 
> Segmentation fault (core dumped)

Can you run this through gdb and valgrind and get a stack trace at the point of the failure please?

Comment 11 Jakub Čajka 2015-01-13 18:25:42 UTC

Created attachment 979709 [details]
valgrind

(gdb) run
Starting program: /usr/lib64/libc.so.6 

Program received signal SIGSEGV, Segmentation fault.
allocate_dtv (result=0x3fffffffe810) at dl-tls.c:327
327	      dtv[0].counter = dtv_length;
(gdb) bt
#0  allocate_dtv (result=0x3fffffffe810) at dl-tls.c:327
#1  _dl_allocate_tls_storage () at dl-tls.c:391
#2  0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb)

Comment 12 Carlos O'Donell 2015-01-13 20:29:22 UTC

(In reply to Jakub Čajka from comment #11)
> Created attachment 979709 [details]
> valgrind
> 
> (gdb) run
> Starting program: /usr/lib64/libc.so.6 
> 
> Program received signal SIGSEGV, Segmentation fault.
> allocate_dtv (result=0x3fffffffe810) at dl-tls.c:327
> 327	      dtv[0].counter = dtv_length;
> (gdb) bt
> #0  allocate_dtv (result=0x3fffffffe810) at dl-tls.c:327
> #1  _dl_allocate_tls_storage () at dl-tls.c:391
> #2  0x0000000000000000 in ?? ()
> Backtrace stopped: previous frame inner to this frame (corrupt stack?)
> (gdb)

Excellent work. That's exactly the kind of backtrace I need to pinpoint a possible solution.

This looks like one of 2 upstream problems:

Bug 13862 - Reuse of cached stack can cause bounds overrun of thread DTV
https://sourceware.org/bugzilla/show_bug.cgi?id=13862

Bug 17621 - DTV update for Static TLS dlopened modules is racy
https://sourceware.org/bugzilla/show_bug.cgi?id=17621

I think you are seeing bug 13862.

Can you try upstream commit d8dd00805b8f3a011735d7a407097fb1c408d867 and see if it fixes the issue for you?

One is fixed, the other Alex is fixing (or has a patch already).
Alex won't be back for several weeks, so this is going to have to wait (unless it's bug 13862 and you verify it fixes your issue).

Comment 13 Jakub Čajka 2015-01-14 15:06:24 UTC

I used the latest glibc (glibc-2.20.90-18) from Fedora 22, which should include the fix for Bug 13862; it seems to help a bit.

The suggested upstream commit doesn't seem to make a difference.

Without upstream commit:

(gdb) run
Starting program: /usr/lib64/libc.so.6 

Program received signal SIGSEGV, Segmentation fault.
0x00003fffffffec10 in ?? ()
(gdb) bt
#0  0x00003fffffffec10 in ?? ()
#1  0x00003fffb7fb8e4c in resolve_ifunc (sym_map=0x3fffb7ff1f08, map=0x3fffb7ff3de8, value=<optimized out>) at ../sysdeps/powerpc/powerpc64/dl-machine.h:630
#2  elf_machine_rela (skip_ifunc=<optimized out>, reloc_addr_arg=0x3fffb7e5d900, version=<optimized out>, sym=<optimized out>, reloc=0x3fffb7c6d4c0, map=0x3fffb7ff3de8)
    at ../sysdeps/powerpc/powerpc64/dl-machine.h:672
#3  elf_dynamic_do_Rela (skip_ifunc=<optimized out>, lazy=<optimized out>, nrelative=<optimized out>, relsize=<optimized out>, reladdr=<optimized out>, map=0x3fffb7ff3de8) at do-rel.h:137
#4  _dl_relocate_object (scope=0x3fffb7ff4160, reloc_mode=<optimized out>, consider_profiling=<optimized out>) at dl-reloc.c:264
#5  0x00003fffb7fa6e68 in dl_main (phdr=<optimized out>, phnum=<optimized out>, user_entry=<optimized out>, auxv=<optimized out>) at rtld.c:2070
#6  0x00003fffb7fc9dd4 in _dl_sysdep_start (start_argptr=<optimized out>, dl_main=@0x3fffb7ff0120: 0x3fffb7fa47f0 <dl_main>) at ../elf/dl-sysdep.c:249
#7  0x00003fffb7fa4038 in _dl_start_final (arg=arg@entry=0x3ffffffff640, info=info@entry=0x3ffffffff0c0) at rtld.c:306
#8  0x00003fffb7fa84b0 in _dl_start (arg=0x3ffffffff640) at rtld.c:414
#9  0x00003fffb7fa37f0 in ._start () from /lib64/ld64.so.1

With upstream commit:

(gdb) run
Starting program: /usr/lib64/libc.so.6 

Program received signal SIGSEGV, Segmentation fault.
0x00003fffffffec10 in ?? ()
(gdb) bt
#0  0x00003fffffffec10 in ?? ()
#1  0x00003fffb7fb8e4c in resolve_ifunc (sym_map=0x3fffb7ff1f08, map=0x3fffb7ff3de8, value=<optimized out>) at ../sysdeps/powerpc/powerpc64/dl-machine.h:630
#2  elf_machine_rela (skip_ifunc=<optimized out>, reloc_addr_arg=0x3fffb7e5d900, version=<optimized out>, sym=<optimized out>, reloc=0x3fffb7c6d4c0, map=0x3fffb7ff3de8)
    at ../sysdeps/powerpc/powerpc64/dl-machine.h:672
#3  elf_dynamic_do_Rela (skip_ifunc=<optimized out>, lazy=<optimized out>, nrelative=<optimized out>, relsize=<optimized out>, reladdr=<optimized out>, map=0x3fffb7ff3de8) at do-rel.h:137
#4  _dl_relocate_object (scope=0x3fffb7ff4160, reloc_mode=<optimized out>, consider_profiling=<optimized out>) at dl-reloc.c:264
#5  0x00003fffb7fa6e68 in dl_main (phdr=<optimized out>, phnum=<optimized out>, user_entry=<optimized out>, auxv=<optimized out>) at rtld.c:2070
#6  0x00003fffb7fc9dd4 in _dl_sysdep_start (start_argptr=<optimized out>, dl_main=@0x3fffb7ff0120: 0x3fffb7fa47f0 <dl_main>) at ../elf/dl-sysdep.c:249
#7  0x00003fffb7fa4038 in _dl_start_final (arg=arg@entry=0x3ffffffff640, info=info@entry=0x3ffffffff0c0) at rtld.c:306
#8  0x00003fffb7fa84b0 in _dl_start (arg=0x3ffffffff640) at rtld.c:414
#9  0x00003fffb7fa37f0 in ._start () from /lib64/ld64.so.1

Comment 14 Jakub Čajka 2015-01-14 15:10:16 UTC

Created attachment 980049 [details]
valgring output

glibc-2.20.90-18 with suggested commit

Comment 15 Carlos O'Donell 2015-01-14 15:28:38 UTC

(In reply to Jakub Čajka from comment #13)
> Program received signal SIGSEGV, Segmentation fault.
> 0x00003fffffffec10 in ?? ()
> (gdb) bt
> #0  0x00003fffffffec10 in ?? ()
> #1  0x00003fffb7fb8e4c in resolve_ifunc (sym_map=0x3fffb7ff1f08,
> map=0x3fffb7ff3de8, value=<optimized out>) at
> ../sysdeps/powerpc/powerpc64/dl-machine.h:630
> #2  elf_machine_rela (skip_ifunc=<optimized out>,
> reloc_addr_arg=0x3fffb7e5d900, version=<optimized out>, sym=<optimized out>,
> reloc=0x3fffb7c6d4c0, map=0x3fffb7ff3de8)
>     at ../sysdeps/powerpc/powerpc64/dl-machine.h:672
> #3  elf_dynamic_do_Rela (skip_ifunc=<optimized out>, lazy=<optimized out>,
> nrelative=<optimized out>, relsize=<optimized out>, reladdr=<optimized out>,
> map=0x3fffb7ff3de8) at do-rel.h:137
> #4  _dl_relocate_object (scope=0x3fffb7ff4160, reloc_mode=<optimized out>,
> consider_profiling=<optimized out>) at dl-reloc.c:264
> #5  0x00003fffb7fa6e68 in dl_main (phdr=<optimized out>, phnum=<optimized
> out>, user_entry=<optimized out>, auxv=<optimized out>) at rtld.c:2070
> #6  0x00003fffb7fc9dd4 in _dl_sysdep_start (start_argptr=<optimized out>,
> dl_main=@0x3fffb7ff0120: 0x3fffb7fa47f0 <dl_main>) at ../elf/dl-sysdep.c:249
> #7  0x00003fffb7fa4038 in _dl_start_final (arg=arg@entry=0x3ffffffff640,
> info=info@entry=0x3ffffffff0c0) at rtld.c:306
> #8  0x00003fffb7fa84b0 in _dl_start (arg=0x3ffffffff640) at rtld.c:414
> #9  0x00003fffb7fa37f0 in ._start () from /lib64/ld64.so.1

This is a GNU indirect function (IFUNC) resolution failure: either a bug in glibc or a corrupt library. This is very different from the original backtrace you presented.

Exactly what hardware are you using on ppc64? Is it stable? Well cooled?

Comment 16 Jakub Čajka 2015-01-16 12:37:19 UTC

I originally reproduced it on a power7+ KVM guest and have just tested it on power8 bare metal with the same results.

The hardware should be OK; I just checked both machines.

Hope this answers your question.

Please note that this still fails the same way with different preloaded libraries. It fails with libssl, libssh, or libruby preloaded (choose one), but not with libbz, for example.

Comment 17 Fedora End Of Life 2015-11-04 10:32:08 UTC

This message is a reminder that Fedora 21 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 21. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora  'version'
of '21'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 21 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 18 Fedora End Of Life 2015-12-02 04:58:53 UTC

Fedora 21 changed to end-of-life (EOL) status on 2015-12-01. Fedora 21 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.
