Bug 2259845 - ruby: enum_sort_by modifies array while it is being sorted by qsort_r
Status: NEW
Alias: None
Product: Fedora
Classification: Fedora
Component: ruby
Version: rawhide
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Vít Ondruch
QA Contact: Fedora Extras Quality Assurance
Reported: 2024-01-23 10:08 UTC by Vít Ondruch
Modified: 2025-04-28 08:22 UTC (History)

Links
System  ID     Private  Priority  Status  Summary  Last Updated
Ruby    20203  0        None      None    None     2024-01-24 11:26:08 UTC

Description Vít Ondruch 2024-01-23 10:08:34 UTC
I first reported this upstream [1], but I now realize it might be a GCC issue, because everything was fine with GCC 13.

The issue is that during the mass rebuild, Ruby started to fail its test suite:

https://koji.fedoraproject.org/koji/taskinfo?taskID=112176941

Unfortunately, there is not much detail yet. I'll try to dig deeper, but I thought I'd report it early in case it also impacts other packages.


[1] https://bugs.ruby-lang.org/issues/20203


Reproducible: Always

Actual Results:  
Test suite fails with errors such as:

~~~
[ 3000/26419] TestEnumerable#test_transient_heap_sort_bymalloc_consolidate(): unaligned fastbin chunk detected
~~~

~~~
[ 2455/26535] TestEnumerable#test_transient_heap_sort_bycorrupted size vs. prev_size in fastbins
~~~

~~~
[ 9716/26532] TestEnumerable#test_any_with_unused_blockdouble free or corruption (fasttop)
~~~

Expected Results:  
Test suite passes as it did with GCC 13

Comment 1 Florian Weimer 2024-01-23 11:15:02 UTC
Could you narrow this down to one of the tests? How do we invoke that test from the command line? Thanks.

Comment 2 Jakub Jelinek 2024-01-23 11:32:42 UTC
Another interesting question is whether it is reproducible with %define _lto_cflags %{nil}, but I guess I can try that myself.  Knowing the exact test case, and how to run just that single test rather than all of them, would be really helpful.

Comment 3 Vít Ondruch 2024-01-23 11:48:55 UTC
This is the backtrace I was able to get:

~~~
[72/83] TestEnumerable#test_inject_array_op_redefined[Detaching after vfork from child process 94]
 = 0.00 s
 10) Error:
TestEnumerable#test_inject_array_op_redefined:
Errno::ENOENT: No such file or directory - /usr/bin/ruby
    /builddir/build/BUILD/ruby-3.3.0/tool/lib/envutil.rb:161:in `spawn'
    /builddir/build/BUILD/ruby-3.3.0/tool/lib/envutil.rb:161:in `invoke_ruby'

malloc_consolidate(): unaligned fastbin chunk detected                  

Thread 1 "ruby" received signal SIGABRT, Aborted.
0x00007ffff7723184 in __pthread_kill_implementation () from /lib64/libc.so.6
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.38.9000-33.fc40.x86_64 gmp-6.2.1-5.fc39.x86_64 libgcc-14.0.1-0.2.fc40.x86_64 libxcrypt-4.4.36-4.fc40.x86_64 zlib-ng-compat-2.1.6-1.fc40.x86_64
(gdb) bt
#0  0x00007ffff7723184 in __pthread_kill_implementation () from /lib64/libc.so.6
#1  0x00007ffff76cb65e in raise () from /lib64/libc.so.6
#2  0x00007ffff76b3902 in abort () from /lib64/libc.so.6
#3  0x00007ffff76b4767 in __libc_message_impl.cold () from /lib64/libc.so.6
#4  0x00007ffff772d1b5 in malloc_printerr () from /lib64/libc.so.6
#5  0x00007ffff772dd7c in malloc_consolidate () from /lib64/libc.so.6
#6  0x00007ffff772ef90 in _int_free_maybe_consolidate.part.0 () from /lib64/libc.so.6
#7  0x00007ffff772f5fa in _int_free () from /lib64/libc.so.6
#8  0x00007ffff7731e0e in free () from /lib64/libc.so.6
#9  0x00007ffff7ad52ec in objspace_xfree (old_size=11256, ptr=0x555555930250, objspace=0x55555555dc70) at /builddir/build/BUILD/ruby-3.3.0/gc.c:12823
#10 objspace_xfree (old_size=<optimized out>, ptr=0x555555930250, objspace=0x55555555dc70) at /builddir/build/BUILD/ruby-3.3.0/gc.c:12754
#11 ruby_sized_xfree (x=0x555555930250, size=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/gc.c:12927
#12 0x00007ffff7a91ec1 in cont_free (ptr=0x555555c9f3e0) at /builddir/build/BUILD/ruby-3.3.0/cont.c:1059
#13 0x00007ffff7accd01 in rb_data_free (obj=140736887130000, objspace=0x55555555dc70) at /builddir/build/BUILD/ruby-3.3.0/gc.c:3500
#14 obj_free (objspace=0x55555555dc70, obj=140736887130000) at /builddir/build/BUILD/ruby-3.3.0/gc.c:3659
#15 0x00007ffff7cd9b10 in gc_sweep_plane (heap=0x55555555dce0, ctx=<optimized out>, bitset=<optimized out>, p=140736887130000, objspace=0x55555555dc70) at /builddir/build/BUILD/ruby-3.3.0/gc.c:5680
#16 gc_sweep_page.constprop.0 (objspace=0x55555555dc70, heap=0x55555555dce0, ctx=0x7fffffffca40) at /builddir/build/BUILD/ruby-3.3.0/gc.c:5758
#17 0x00007ffff7acac51 in gc_sweep_step (objspace=objspace@entry=0x55555555dc70, size_pool=size_pool@entry=0x55555555dc90, heap=heap@entry=0x55555555dce0) at /builddir/build/BUILD/ruby-3.3.0/gc.c:6047
#18 0x00007ffff7acfa71 in gc_sweep (objspace=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/gc.c:6272
#19 0x00007ffff7adadce in gc_start (objspace=0x55555555dc70, reason=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/gc.c:9609
#20 0x00007ffff7ad35ab in heap_prepare (heap=0x55555555dce0, size_pool=0x55555555dc90, objspace=0x55555555dc70) at /builddir/build/BUILD/ruby-3.3.0/gc.c:2517
#21 heap_next_free_page (heap=0x55555555dce0, size_pool=0x55555555dc90, objspace=0x55555555dc70) at /builddir/build/BUILD/ruby-3.3.0/gc.c:2725
#22 newobj_alloc (objspace=0x55555555dc70, cr=0x55555555e960, size_pool_idx=0, vm_locked=<optimized out>, vm_locked@entry=false) at /builddir/build/BUILD/ruby-3.3.0/gc.c:2827
#23 0x00007ffff7ad3eb4 in newobj_of0 (alloc_size=<optimized out>, cr=<optimized out>, wb_protected=1, flags=<optimized out>, klass=140736918383680) at /builddir/build/BUILD/ruby-3.3.0/gc.c:2930
#24 newobj_of (alloc_size=<optimized out>, wb_protected=1, v3=0, v2=0, v1=0, flags=<optimized out>, klass=140736918383680, cr=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/gc.c:2947
#25 rb_wb_protected_newobj_of (ec=<optimized out>, klass=140736918383680, flags=<optimized out>, size=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/gc.c:2962
#26 0x00007ffff7bdf0c1 in str_alloc_embed (capa=6, klass=140736918383680) at /builddir/build/BUILD/ruby-3.3.0/vm_core.h:1954
#27 str_new0 (klass=140736918383680, ptr=0x5555555fa945 "vt100", len=5, termlen=1) at /builddir/build/BUILD/ruby-3.3.0/string.c:871
#28 0x00007ffff7bdf9fe in rb_enc_str_new (ptr=<optimized out>, len=<optimized out>, enc=0x555555579650) at /builddir/build/BUILD/ruby-3.3.0/string.c:928
#29 0x00007ffff7ae4ae5 in env_enc_str_new (enc=<optimized out>, len=5, ptr=0x5555555fa945 "vt100") at /builddir/build/BUILD/ruby-3.3.0/hash.c:4810
#30 env_str_new (len=5, ptr=0x5555555fa945 "vt100") at /builddir/build/BUILD/ruby-3.3.0/hash.c:4819
#31 env_str_new2 (ptr=0x5555555fa945 "vt100") at /builddir/build/BUILD/ruby-3.3.0/hash.c:4826
#32 env_to_hash () at /builddir/build/BUILD/ruby-3.3.0/hash.c:6257
#33 0x00007ffff7ae4bf0 in env_to_h (_=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/hash.c:6317
#34 0x00007ffff7c3aa46 in vm_call_cfunc_with_frame_ (ec=0x55555555ec10, reg_cfp=0x7ffff75fca98, calling=<optimized out>, argc=0, argv=0x7ffff74fd4a8, stack_bottom=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/vm_insnhelper.c:3490
#35 0x00007ffff7c4233b in vm_sendish (method_explorer=<optimized out>, block_handler=<optimized out>, cd=<optimized out>, reg_cfp=<optimized out>, ec=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/vm_insnhelper.c:5581
#36 vm_exec_core (ec=0x55555555ec10) at /builddir/build/BUILD/ruby-3.3.0/redhat-linux-build/insns.def:834
#37 0x00007ffff7c5a680 in vm_exec_loop (result=<optimized out>, tag=0x7fffffffd0b0, state=<optimized out>, ec=0x55555555ec10) at /builddir/build/BUILD/ruby-3.3.0/vm.c:2513
#38 rb_vm_exec (ec=0x55555555ec10) at /builddir/build/BUILD/ruby-3.3.0/vm.c:2492
#39 0x00007ffff7c47c67 in vm_yield_with_cref (is_lambda=0, cref=0x0, kw_splat=0, argv=0x7fffffffd1e8, argc=1, ec=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/vm.c:1634
#40 vm_yield (kw_splat=0, argv=0x7fffffffd1e8, argc=1, ec=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/vm.c:1642
#41 rb_yield_0 (argv=0x7fffffffd1e8, argc=1) at /builddir/build/BUILD/ruby-3.3.0/vm_eval.c:1366
#42 rb_yield (val=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/vm_eval.c:1382
#43 0x00007ffff7a4220c in rb_ary_collect (ary=140736886959720) at /builddir/build/BUILD/ruby-3.3.0/array.c:3630
#44 0x00007ffff7c3aa46 in vm_call_cfunc_with_frame_ (ec=0x55555555ec10, reg_cfp=0x7ffff75fcbb0, calling=<optimized out>, argc=0, argv=0x7ffff74fd3a8, stack_bottom=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/vm_insnhelper.c:3490
#45 0x00007ffff7c4527d in vm_sendish (method_explorer=<optimized out>, block_handler=<optimized out>, cd=<optimized out>, reg_cfp=<optimized out>, ec=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/vm_insnhelper.c:5581
#46 vm_exec_core (ec=0x55555555ec10) at /builddir/build/BUILD/ruby-3.3.0/redhat-linux-build/insns.def:814
#47 0x00007ffff7c5a46d in rb_vm_exec (ec=0x55555555ec10) at /builddir/build/BUILD/ruby-3.3.0/vm.c:2486
#48 0x00007ffff7c47c67 in vm_yield_with_cref (is_lambda=0, cref=0x0, kw_splat=0, argv=0x7fffffffd538, argc=1, ec=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/vm.c:1634
#49 vm_yield (kw_splat=0, argv=0x7fffffffd538, argc=1, ec=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/vm.c:1642
#50 rb_yield_0 (argv=0x7fffffffd538, argc=1) at /builddir/build/BUILD/ruby-3.3.0/vm_eval.c:1366
#51 rb_yield (val=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/vm_eval.c:1382
#52 0x00007ffff7a41fc4 in rb_ary_each (ary=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/array.c:2538
#53 rb_ary_each (ary=140736887015160) at /builddir/build/BUILD/ruby-3.3.0/array.c:2532
#54 0x00007ffff7c3aa46 in vm_call_cfunc_with_frame_ (ec=0x55555555ec10, reg_cfp=0x7ffff75fcc90, calling=<optimized out>, argc=0, argv=0x7ffff74fd2c0, stack_bottom=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/vm_insnhelper.c:3490
#55 0x00007ffff7c4527d in vm_sendish (method_explorer=<optimized out>, block_handler=<optimized out>, cd=<optimized out>, reg_cfp=<optimized out>, ec=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/vm_insnhelper.c:5581
#56 vm_exec_core (ec=0x55555555ec10) at /builddir/build/BUILD/ruby-3.3.0/redhat-linux-build/insns.def:814
#57 0x00007ffff7c5a46d in rb_vm_exec (ec=0x55555555ec10) at /builddir/build/BUILD/ruby-3.3.0/vm.c:2486
#58 0x00007ffff7c47c67 in vm_yield_with_cref (is_lambda=0, cref=0x0, kw_splat=0, argv=0x7fffffffd888, argc=1, ec=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/vm.c:1634
#59 vm_yield (kw_splat=0, argv=0x7fffffffd888, argc=1, ec=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/vm.c:1642
#60 rb_yield_0 (argv=0x7fffffffd888, argc=1) at /builddir/build/BUILD/ruby-3.3.0/vm_eval.c:1366
#61 rb_yield (val=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/vm_eval.c:1382
#62 0x00007ffff7a41fc4 in rb_ary_each (ary=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/array.c:2538
#63 rb_ary_each (ary=140736886986000) at /builddir/build/BUILD/ruby-3.3.0/array.c:2532
#64 0x00007ffff7c3aa46 in vm_call_cfunc_with_frame_ (ec=0x55555555ec10, reg_cfp=0x7ffff75fce18, calling=<optimized out>, argc=0, argv=0x7ffff74fd160, stack_bottom=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/vm_insnhelper.c:3490
#65 0x00007ffff7c4527d in vm_sendish (method_explorer=<optimized out>, block_handler=<optimized out>, cd=<optimized out>, reg_cfp=<optimized out>, ec=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/vm_insnhelper.c:5581
#66 vm_exec_core (ec=0x55555555ec10) at /builddir/build/BUILD/ruby-3.3.0/redhat-linux-build/insns.def:814
#67 0x00007ffff7c5a680 in vm_exec_loop (result=<optimized out>, tag=0x7fffffffdaa0, state=<optimized out>, ec=0x55555555ec10) at /builddir/build/BUILD/ruby-3.3.0/vm.c:2513
#68 rb_vm_exec (ec=0x55555555ec10) at /builddir/build/BUILD/ruby-3.3.0/vm.c:2492
#69 0x00007ffff7c5b92e in rb_vm_invoke_proc (ec=<optimized out>, proc=<optimized out>, argc=<optimized out>, argv=<optimized out>, kw_splat=<optimized out>, passed_block_handler=<optimized out>)
    at /builddir/build/BUILD/ruby-3.3.0/vm.c:1728
#70 0x00007ffff7b75d61 in rb_proc_call_kw (self=<optimized out>, args=<optimized out>, kw_splat=0) at /builddir/build/BUILD/ruby-3.3.0/proc.c:978
#71 0x00007ffff7ab8e0a in exec_end_procs_chain (errp=<optimized out>, procs=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/eval_jump.c:105
#72 rb_ec_exec_end_proc (ec=ec@entry=0x55555555ec10) at /builddir/build/BUILD/ruby-3.3.0/eval_jump.c:120
#73 0x00007ffff7ab9d98 in rb_ec_teardown (ec=ec@entry=0x55555555ec10) at /builddir/build/BUILD/ruby-3.3.0/eval.c:159
#74 0x00007ffff7aba32c in rb_ec_cleanup (ec=ec@entry=0x55555555ec10, ex=RUBY_TAG_NONE) at /builddir/build/BUILD/ruby-3.3.0/eval.c:212
#75 0x00007ffff7aba99d in ruby_run_node (n=0x7fffdc315b30) at /builddir/build/BUILD/ruby-3.3.0/eval.c:328
#76 0x0000555555555195 in rb_main (argv=0x7fffffffe1c8, argc=7) at /builddir/build/BUILD/ruby-3.3.0/main.c:39
#77 main (argc=<optimized out>, argv=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/main.c:58
~~~

The steps to reproduce after a failed build are:

~~~
$ mock -r fedora-rawhide-x86_64 shell --unpriv

$ cd /builddir/build/BUILD/ruby-3.3.0/

$ cd redhat-linux-build/

$ LD_LIBRARY_PATH=. gdb -x run.gdb --quiet --args ./ruby -I/builddir/build/BUILD/ruby-3.3.0/lib -I/builddir/build/BUILD/ruby-3.3.0/tool/lib -I. -I.ext/common -I.ext/x86_64-linux ../test/ruby/test_enum.rb
~~~

There are some test errors reported and what not, but hopefully, this is enough to reproduce.

Comment 4 Vít Ondruch 2024-01-23 11:56:01 UTC
(In reply to Vít Ondruch from comment #3)
> There are some test errors reported and what not, but hopefully, this is
> enough to reproduce

Actually, this is the command that should avoid those test errors:

~~~
$ RUBY=./ruby LD_LIBRARY_PATH=. gdb -x run.gdb --quiet --args ./ruby -I/builddir/build/BUILD/ruby-3.3.0/lib -I/builddir/build/BUILD/ruby-3.3.0/tool/lib -I. -I.ext/common -I.ext/x86_64-linux ../test/ruby/test_enum.rb
~~~

I don't think the issue is tied to a specific test case, but running the whole `test/ruby/test_enum.rb` always triggers it for some reason.

Comment 5 Florian Weimer 2024-01-23 12:12:22 UTC
Looks like an issue within glibc qsort:

==86== Invalid write of size 1
==86==    at 0x484FDFD: mempcpy (vg_replace_strmem.c:1696)
==86==    by 0x4FDB972: msort_with_tmp.part.0 (in /usr/lib64/libc.so.6)
==86==    by 0x4FDBC19: qsort_r (in /usr/lib64/libc.so.6)
==86==    by 0x48FE231: enum_sort_by (enum.c:1691)
==86==    by 0x4A90A45: vm_call_cfunc_with_frame_ (vm_insnhelper.c:3490)
==86==    by 0x4A9B27C: UnknownInlinedFun (vm_insnhelper.c:5581)
==86==    by 0x4A9B27C: vm_exec_core.lto_priv.0 (insns.def:814)
==86==    by 0x4AB067F: UnknownInlinedFun (vm.c:2513)
==86==    by 0x4AB067F: rb_vm_exec (vm.c:2492)
==86==    by 0x4A9DC66: UnknownInlinedFun (vm.c:1634)
==86==    by 0x4A9DC66: UnknownInlinedFun (vm.c:1642)
==86==    by 0x4A9DC66: UnknownInlinedFun (vm_eval.c:1366)
==86==    by 0x4A9DC66: rb_yield (vm_eval.c:1382)
==86==    by 0x489820B: rb_ary_collect.lto_priv.0 (array.c:3630)
==86==    by 0x4A90A45: vm_call_cfunc_with_frame_ (vm_insnhelper.c:3490)
==86==    by 0x4A9B27C: UnknownInlinedFun (vm_insnhelper.c:5581)
==86==    by 0x4A9B27C: vm_exec_core.lto_priv.0 (insns.def:814)
==86==    by 0x4AB046C: rb_vm_exec (vm.c:2486)

Comment 6 Florian Weimer 2024-01-23 15:53:06 UTC
Sorry, I took a wrong turn during debugging. Now I no longer think this is another qsort problem. Back to gcc.

The issue reproduces without LTO, too.

Comment 7 Florian Weimer 2024-01-23 17:29:17 UTC
FWIW, just because it's assigned to gcc currently doesn't mean it's a gcc problem, or a problem that impacts anything but the ruby package. Just clarifying that this bug probably is not a reason to do another mass rebuild.

Comment 8 Jakub Jelinek 2024-01-23 17:40:17 UTC
The issue reproduces even when I remove all *.o files from redhat-linux-build, run
for i in `find . -name Makefile | xargs grep -l -- -O2`; do sed -i -e 's/-O2/-O0/g' $i; done
there, and run make again.  Sure, in theory it could be some gcc bug that affects even -O0, but that isn't very likely.
Anyway, will try to build it in f39 mock now.

Comment 9 Jakub Jelinek 2024-01-23 18:12:31 UTC
So, after manually installing rpm-local-generator-support-1-2.fc40.noarch.rpm into f39 mock buildroot (otherwise it doesn't build),
the test succeeds in that buildroot.  But when I copy ruby and libruby.so.3.3.0 from the f39 buildroot where it doesn't fail into the f40 buildroot,
it still fails the same.
So, I really don't understand how this could have anything to do with gcc.
And it fails even if I put libz.so* and libgcc_s.so* from the f39 buildroot into a subdirectory and point LD_LIBRARY_PATH to it.
Additionally, I've checked that none of the other shared libraries ruby links against were built by the buggy binutils-2.41-{27,28}.fc40.

Comment 10 Jakub Jelinek 2024-01-23 18:36:22 UTC
And, copying the gcc 14.0.1 built ruby and libruby.so.3.3.0 from f40 mock buildroot to f39 doesn't result in a failure in the f39 buildroot.
But if I also copy ld-linux-x86-64.so.2, libc.so.6 and libm.so.6 from the f40 buildroot to f39 (newglibc subdirectory), then running
RUBY=./ruby LD_LIBRARY_PATH=. newglibc/ld-linux-x86-64.so.2 --library-path newglibc/:. ./ruby -I/builddir/build/BUILD/ruby-3.3.0/lib -I/builddir/build/BUILD/ruby-3.3.0/tool/lib -I. -I.ext/common -I.ext/x86_64-linux ../test/ruby/test_enum.rb
crashes.

Comment 11 Vít Ondruch 2024-01-24 09:12:49 UTC
(In reply to Jakub Jelinek from comment #9)
> So, after manually installing
> rpm-local-generator-support-1-2.fc40.noarch.rpm into f39 mock buildroot
> (otherwise it doesn't build),

Commenting out the `%global __local_generator` lines would also do the job.

(In reply to Jakub Jelinek from comment #10)
Do I understand correctly that you again suspect that the issue is in glibc?

Comment 12 Vít Ondruch 2024-01-24 09:24:07 UTC
(In reply to Vít Ondruch from comment #11)
> Do I understand correctly that you again suspect that issue is in glibc?

If the issue was in glibc, then there has to be some change between 2.38.9000-30.fc40 and 2.38.9000-33.fc40.

Actually, this is the diff of installed BR components between successful and failing build:

https://koschei.fedoraproject.org/build/17081016

I'll try to selectively downgrade some of them.

Comment 14 Jakub Jelinek 2024-01-24 10:15:06 UTC
(In reply to Vít Ondruch from comment #12)
> (In reply to Vít Ondruch from comment #11)
> > Do I understand correctly that you again suspect that issue is in glibc?
> 
> If the issue was in glibc, then there have to be some change in between 
> 2.38.9000-30.fc40 / 2.38.9000-33.fc40

I bet the most important difference between the 2 is that the latter has been built
with gcc-14.0.1-0.1.fc40, while the former with gcc 13.2.1.  So, miscompilation of glibc
is one of the options.

Comment 15 Florian Weimer 2024-01-24 10:19:36 UTC
Thank you. I think I know what is going on.

In current rawhide glibc (glibc-2.38.9000-33.fc40.x86_64), a buffer allocated with malloc is used for the qsort scratch buffer. This is actually a glibc bug because the array is very short and we should use an on-stack buffer.

I still need to confirm the details, but I think what happens is that the Ruby garbage collector runs during the sort_by callback. I suspect the collector writes to the array, which is quite undefined (“The comparison function shall not alter the contents of the array.” says the C standard). This causes problems subsequently when we copy back previous array contents from the scratch buffer. With a stack-based buffer, the collector pins objects, so the issue is not visible. Sorry, this is all very speculative, but I don't want you to spend more time chasing this.
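
The copy-back hazard described in the previous paragraph can be illustrated with a small, self-contained sketch. It is not glibc's or Ruby's code: `struct demo_ctx`, `demo_cmp`, `demo_msort` and `demo_run` are hypothetical names, the elements are plain ints rather than Ruby values, and the misbehaving comparator merely stands in for whatever writes to the array during the callback. Because the merge sort here belongs to the sketch itself, the outcome is well defined; doing the same thing under the real qsort_r is undefined.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical context handed to the comparator (not a Ruby or glibc type). */
struct demo_ctx {
    int *array;    /* start of the array being sorted */
    int calls;     /* comparator invocations so far */
    int write_at;  /* invocation on which to write to the array; 0 = never */
};

/* Orders two ints.  On one chosen call it also writes to the array,
   standing in for a collector that updates a slot during the callback. */
int demo_cmp(const void *a, const void *b, void *arg)
{
    struct demo_ctx *ctx = arg;
    int x = *(const int *)a;
    int y = *(const int *)b;

    if (++ctx->calls == ctx->write_at)
        ctx->array[0] = 99;
    return (x > y) - (x < y);
}

/* Top-down merge sort of n ints in b, using tmp (room for n ints) as scratch. */
void demo_msort(int *b, size_t n, int *tmp, void *arg)
{
    if (n <= 1)
        return;

    size_t n1 = n / 2, n2 = n - n1;
    int *b1 = b, *b2 = b + n1;
    demo_msort(b1, n1, tmp, arg);
    demo_msort(b2, n2, tmp, arg);

    int *out = tmp;
    while (n1 > 0 && n2 > 0) {
        if (demo_cmp(b1, b2, arg) <= 0) {
            *out++ = *b1++;
            --n1;
        } else {
            *out++ = *b2++;
            --n2;
        }
    }
    if (n1 > 0)
        memcpy(out, b1, n1 * sizeof *b);
    /* Leftover right-half elements are already in their final place.
       Copying the scratch buffer back overwrites anything the comparator
       wrote to slots that had already been merged. */
    memcpy(b, tmp, (n - n2) * sizeof *b);
}

/* Sorts {3, 1, 4, 2} into result and returns the number of comparisons. */
int demo_run(int write_at, int result[4])
{
    int scratch[4];
    struct demo_ctx ctx = { result, 0, write_at };

    result[0] = 3; result[1] = 1; result[2] = 4; result[3] = 2;
    demo_msort(result, 4, scratch, &ctx);
    return ctx.calls;
}
```

With `write_at` set to 4, the comparator writes 99 to `result[0]` after that slot has already been merged into the scratch buffer, so the copy-back replaces it with the stale value and the array ends up as 1, 2, 3, 4, as if the write had never happened. For ints that is harmless; for an array of object references that something else has just updated, restoring stale contents is the kind of corruption suspected here.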

I can reproduce the crash in Fedora 38 (with upstream Ruby sources) if I increase the size of the array being sorted so that qsort_r uses a malloc-based buffer there as well:

diff --git a/test/ruby/test_enum.rb b/test/ruby/test_enum.rb
index f7c8f012d8..23e18cc590 100644
--- a/test/ruby/test_enum.rb
+++ b/test/ruby/test_enum.rb
@@ -871,7 +871,9 @@ class << o; self; end.class_eval do
           0
         end
       end
-      [o, o, o].sort_by {|x| x }
+      l = []
+      (1..100).each {|x| l += [o] }
+      l.sort_by {|x| x }
       c.call
     end

The whole thing is probably quite sensitive to allocation patterns etc., so I have no idea how reliable this is as a trigger for the bug.
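
Why three elements stay on the stack while a hundred do not can be checked with quick arithmetic. The sketch below assumes a 1024-byte cut-off for the on-stack scratch buffer (believed to match what glibc has used, but treat both the value and the exact boundary condition as assumptions) and the 16-byte element size that appears as size=16 in the qsort_r backtrace in this report. `SCRATCH_STACK_LIMIT` and `scratch_needs_heap` are hypothetical names, not glibc identifiers.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed cut-off in bytes for keeping the scratch buffer on the stack. */
#define SCRATCH_STACK_LIMIT 1024u

/* Returns 1 if sorting n elements of the given size would need a
   malloc-based scratch buffer under the assumed limit, 0 otherwise. */
int scratch_needs_heap(size_t n, size_t size)
{
    if (size != 0 && n > SIZE_MAX / size)
        return 1;  /* n * size would overflow; certainly not stack-sized */
    return n * size > SCRATCH_STACK_LIMIT;
}
```

Under these assumptions `scratch_needs_heap(3, 16)` is 0 (48 bytes) and `scratch_needs_heap(100, 16)` is 1 (1600 bytes), which is consistent with the original three-element test passing on Fedora 38 and the enlarged one crashing.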

I will push a fix to glibc which will again paper over this bug.

Comment 16 Mattias Ellert 2024-01-24 10:30:19 UTC
There is this recent commit to glibc upstream:

https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=dfa3394a605c8f6f25e4f827789bc89eca1d206c

Not sure it is relevant, but from "qsort" and "malloc vs. stack allocation" being mentioned here I just wanted to mention it.

Comment 17 Florian Weimer 2024-01-24 10:34:41 UTC
(In reply to Mattias Ellert from comment #16)
> There is this recent commit to glibc upstream:
> 
> https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=dfa3394a605c8f6f25e4f827789bc89eca1d206c
> 
> Not sure it is relevant, but from "qsort" and "malloc vs. stack allocation"
> being mentioned here I just wanted to mention it.

Yes, that's the commit that we don't have in rawhide yet.

Comment 18 Florian Weimer 2024-01-24 10:39:46 UTC
With this instrumentation patch applied to glibc:

diff --git a/stdlib/qsort.c b/stdlib/qsort.c
index 7f5a00fb33..c5263d9f5f 100644
--- a/stdlib/qsort.c
+++ b/stdlib/qsort.c
@@ -25,6 +25,7 @@
 #include <stdlib.h>
 #include <string.h>
 #include <stdbool.h>
+#include <assert.h>
 
 /* Swap SIZE bytes between addresses A and B.  These helpers are provided
    along the generic one as an optimization.  */
@@ -338,9 +339,9 @@ indirect_msort_with_tmp (const struct msort_param *p, void *b, size_t n,
       }
 }
 
-void
-__qsort_r (void *const pbase, size_t total_elems, size_t size,
-	   __compar_d_fn_t cmp, void *arg)
+static void
+__qsort_r_real (void *const pbase, size_t total_elems, size_t size,
+		__compar_d_fn_t cmp, void *arg)
 {
   if (total_elems <= 1)
     return;
@@ -396,6 +397,43 @@ __qsort_r (void *const pbase, size_t total_elems, size_t size,
   if (buf != tmp)
     free (buf);
 }
+
+struct qsort_r_data
+{
+  __compar_d_fn_t cmp;
+  void *arg;
+  void *array;
+  size_t size;
+  void *copy;
+};
+
+static int
+qsort_compare_wrapper (const void *a, const void *b, void *data1)
+{
+  struct qsort_r_data *data = data1;
+  memcpy (data->copy, data->array, data->size);
+  int ret = data->cmp (a, b, data->arg);
+  assert (memcmp (data->array, data->copy, data->size) == 0);
+  return ret;
+}
+
+void
+__qsort_r (void *pbase, size_t total_elems, size_t size,
+	   __compar_d_fn_t cmp, void *arg)
+{
+  struct qsort_r_data data =
+    {
+      .cmp = cmp,
+      .arg = arg,
+      .array = pbase,
+      .size = total_elems * size,
+    };
+  data.copy = malloc (data.size);
+  assert (data.copy != NULL);
+  __qsort_r_real (pbase, total_elems, size, qsort_compare_wrapper, &data);
+  free (data.copy);
+}
+
 libc_hidden_def (__qsort_r)
 weak_alias (__qsort_r, qsort_r)

And using the Fedora rawhide glibc variant with the heap allocation and the unchanged Ruby test case, I get:

[54/83] TestEnumerable#test_callccFatal glibc error: qsort.c:416 (qsort_compare_wrapper): assertion failed: memcmp (data->array, data->copy, data->size) == 0

Thread 1 "ruby" received signal SIGABRT, Aborted.
__pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, 
    no_tid=no_tid@entry=0) at pthread_kill.c:44
44	      return INTERNAL_SYSCALL_ERROR_P (ret) ? INTERNAL_SYSCALL_ERRNO (ret) : 0;
(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, 
    signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1  0x00007ffff7c57423 in __pthread_kill_internal (signo=6, 
    threadid=<optimized out>) at pthread_kill.c:78
#2  0x00007ffff7c0493e in __GI_raise (sig=sig@entry=6)
    at ../sysdeps/posix/raise.c:26
#3  0x00007ffff7bec8ff in __GI_abort () at abort.c:79
#4  0x00007ffff7bed7d5 in __libc_message_impl (
    fmt=fmt@entry=0x7ffff7d6cba0 "Fatal glibc error: %s:%s (%s): assertion failed: %s\n") at ../sysdeps/posix/libc_fatal.c:132
#5  0x00007ffff7bfcaa9 in __libc_assert_fail (
    assertion=assertion@entry=0x7ffff7d6cd70 "memcmp (data->array, data->copy, data->size) == 0", file=file@entry=0x7ffff7d67d51 "qsort.c", 
    line=line@entry=416, 
    function=function@entry=0x7ffff7d71390 <__PRETTY_FUNCTION__.1> "qsort_compare_wrapper") at __libc_assert_fail.c:31
#6  0x00007ffff7c0873c in qsort_compare_wrapper (a=a@entry=0x7fffdc852fe0, 
    b=b@entry=0x7fffdc852ff0, data1=data1@entry=0x7fffffffd520) at qsort.c:416
#7  0x00007ffff7c08923 in msort_with_tmp (p=p@entry=0x7fffffffd0a0, 
    b=b@entry=0x7fffdc852fe0, n=n@entry=2) at qsort.c:276
#8  0x00007ffff7c08ced in msort_with_tmp (n=2, b=0x7fffdc852fe0, 
    p=0x7fffffffd0a0) at qsort.c:202
#9  __qsort_r_real (pbase=pbase@entry=0x7fffdc852fe0, 
    total_elems=total_elems@entry=2, size=size@entry=16, 
    arg=arg@entry=0x7fffffffd520, cmp=0x7ffff7c086c0 <qsort_compare_wrapper>)
    at qsort.c:394
#10 0x00007ffff7c09140 in __GI___qsort_r (pbase=0x7fffdc852fe0, total_elems=2, 
    size=size@entry=16, cmp=cmp@entry=0x5555559709a0 <sort_by_cmp>, 
    arg=arg@entry=0x7fffdc852fd0) at qsort.c:433
#11 0x000055555596f3ad in enum_sort_by (obj=<optimized out>) at enum.c:1691

 
I think that's pretty good evidence that ruby uses qsort_r in an undefined way, so reassigning.

Comment 20 Jonathan Wakely 2024-01-24 11:08:01 UTC
It's just undefined behaviour in ruby, so the fact it appears to work or not work with any given version of gcc or glibc is just different ways the undefined behaviour happens to manifest.

Comment 21 Vít Ondruch 2024-01-24 11:12:05 UTC
(In reply to Florian Weimer from comment #18)
> I think that's pretty good evidence that ruby uses qsort_r in an undefined
> way, so reassigning.

Oh god. I have forwarded the analysis to Ruby upstream. But is there a chance that glibc could do something about that, at least temporarily (given that the issue has not shown up until now)?

Comment 22 Florian Weimer 2024-01-24 11:26:09 UTC
(In reply to Vít Ondruch from comment #21)
> (In reply to Florian Weimer from comment #18)
> > I think that's pretty good evidence that ruby uses qsort_r in an undefined
> > way, so reassigning.
> 
> Oh god. I have forwarded the analysis to Ruby upstream. But is there a
> chance that glibc could do something about that, at least temporary (given
> that up until now, the issue have not exhibited itself)?

Sorry, probably got lost in the chatter: the Ruby test case started revealing this pre-existing bug because we currently have a missed optimization in glibc. I'll restore the optimization, which should make the test case pass again (and generally paper over the problem for small arrays). But the underlying issue will of course remain.

Comment 23 Vít Ondruch 2024-01-24 12:24:36 UTC
(In reply to Florian Weimer from comment #22)
> I'll restore the optimization, which should make the
> test case pass again (and generally paper over the problem for small
> arrays).

Thx. I thought that you dropped the idea, because you reassigned the ticket back to Ruby.

> But the underlying issue will of course remain.

Right. Thx a lot for helping with the analysis!

Comment 24 Vít Ondruch 2024-01-24 12:44:27 UTC
Just FTR, glibc 2.38.9000-32.fc40.x86_64.rpm is the first build exposing the issue.

Comment 25 Florian Weimer 2024-01-24 13:56:59 UTC
A new build, glibc-2.38.9000-35.fc40, is heading towards rawhide, which will stop using malloc for small arrays.

Apparently you can pass ac_cv_func_qsort_r=no to configure to avoid the use of qsort_r. The Linux From Scratch folks are doing that:

https://wiki.linuxfromscratch.org/blfs/changeset/77b455955c17e22d6aa544c766b30850978960ab

Comment 26 Vít Ondruch 2024-01-24 16:26:15 UTC
(In reply to Florian Weimer from comment #25)
> A new build, glibc-2.38.9000-35.fc40, is heading towards rawhide, which will
> stop using malloc for small arrays.

Thank you. I can confirm that -35 helps.

> Apparently you can pass ac_cv_func_qsort_r=no to configure to avoid the use
> of qsort_r. The Linux From Scratch folks are doing that:
> 
> https://wiki.linuxfromscratch.org/blfs/changeset/77b455955c17e22d6aa544c766b30850978960ab

Thx for the tip. Will consider this if the issue becomes more prominent. But I have hope that upstream will fix this.

Comment 27 Aoife Moloney 2024-02-15 23:11:58 UTC
This bug appears to have been reported against 'rawhide' during the Fedora Linux 40 development cycle.
Changing version to 40.

Comment 28 Aoife Moloney 2025-04-25 10:15:06 UTC
This message is a reminder that Fedora Linux 40 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora Linux 40 on 2025-05-13.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
'version' of '40'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, change the 'version' 
to a later Fedora Linux version. Note that the version field may be hidden.
Click the "Show advanced fields" button if you do not see it.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora Linux 40 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora Linux, you are encouraged to change the 'version' to a later version
prior to this bug being closed.

Comment 29 Vít Ondruch 2025-04-28 08:22:53 UTC
Not sure the changes in Ruby were applied. Checking the status:

https://bugs.ruby-lang.org/issues/20203#note-12

