Bug 2336127 - Power9 instruction gets executed on Power8
Summary: Power9 instruction gets executed on Power8
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: numpy
Version: rawhide
Hardware: ppc64le
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Gwyn Ciesla
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2025-01-07 14:28 UTC by Tulio Magno Quites Machado Filho
Modified: 2025-01-20 16:32 UTC
CC List: 6 users

Fixed In Version: numpy-2.2.1-2.fc42
Clone Of:
Environment:
Last Closed: 2025-01-10 17:19:57 UTC
Type: ---
Embargoed:


Links
GitHub numpy/numpy issue 28124 (open): Power9 instruction gets executed on Power8 (last updated 2025-01-07 22:20:23 UTC)

Description Tulio Magno Quites Machado Filho 2025-01-07 14:28:07 UTC
On 2024-12-21, we started to see an MLIR test failure in the LLVM daily snapshots running on Rawhide on Power8.
We can only reproduce this issue on Rawhide.

After investigating, I found a numpy function using a Power9/Power ISA 3.0 instruction (mtvsrws):

Disassembly:
(gdb) disas
Dump of assembler code for function HALF_exp2(char**, npy_intp const*, npy_intp const*, void*):
...
   0x00007fffe3bad210 <+160>:   addis   r9,r9,14336
   0x00007fffe3bad214 <+164>:   add     r7,r7,r9
=> 0x00007fffe3bad218 <+168>:   mtvsrws vs1,r7
   0x00007fffe3bad21c <+172>:   xscvspdpn vs1,vs1
   0x00007fffe3bad220 <+176>:   bl      0x7fffe38b3580 <0000001a.plt_call.exp2f@@GLIBC_2.27>

Backtrace:
(gdb) bt
#0  HALF_exp2 (args=<optimized out>, dimensions=<optimized out>, steps=<optimized out>, __NPY_UNUSED_TAGGEDdata=<optimized out>)
    at ../numpy/_core/src/umath/loops_umath_fp.dispatch.c.src:182
#1  0x00007fffe3af615c in generic_wrapped_legacy_loop (__NPY_UNUSED_TAGGEDcontext=<optimized out>, data=<optimized out>, dimensions=<optimized out>, 
    strides=<optimized out>, auxdata=<optimized out>) at ../numpy/_core/src/umath/legacy_array_method.c:98
#2  0x00007fffe3b0d2f0 in try_trivial_single_output_loop (context=0x7fffffff8410, op=0x7fffffff8b30, order=<optimized out>, 
    errormask=<optimized out>) at ../numpy/_core/src/umath/ufunc_object.c:969
#3  PyUFunc_GenericFunctionInternal (ufunc=<optimized out>, ufuncimpl=<optimized out>, operation_descrs=0x7fffffff8730, op=0x7fffffff8b30, 
    casting=NPY_SAME_KIND_CASTING, order=<optimized out>, wheremask=0x0) at ../numpy/_core/src/umath/ufunc_object.c:2237
#4  ufunc_generic_fastcall (ufunc=<optimized out>, args=<optimized out>, len_args=<optimized out>, kwnames=<optimized out>, outer=<optimized out>)
    at ../numpy/_core/src/umath/ufunc_object.c:4530
#5  0x00007ffff79e9e30 in PyObject_Vectorcall () from /lib64/libpython3.13.so.1.0

Reproducible: Always
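
A minimal reproducer sketch from Python, without MLIR involved, assuming any call that reaches the float16 exp2 inner loop (HALF_exp2 above) hits the same code path:

  # Run on a Power8 machine with the affected Rawhide numpy build.
  # np.exp2 on a float16 array should dispatch into HALF_exp2, the loop in
  # the backtrace above, and die with SIGILL when it reaches mtvsrws.
  import numpy as np
  x = np.arange(8, dtype=np.float16)
  print(np.exp2(x))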

Comment 1 Tulio Magno Quites Machado Filho 2025-01-08 14:20:57 UTC
I proposed a fix here: https://src.fedoraproject.org/rpms/numpy/pull-request/51

Comment 2 Fedora Update System 2025-01-10 17:17:20 UTC
FEDORA-2025-adaf2943f9 (numpy-2.2.1-2.fc42) has been submitted as an update to Fedora 42.
https://bodhi.fedoraproject.org/updates/FEDORA-2025-adaf2943f9

Comment 3 Fedora Update System 2025-01-10 17:19:57 UTC
FEDORA-2025-adaf2943f9 (numpy-2.2.1-2.fc42) has been pushed to the Fedora 42 stable repository.
If problem still persists, please make note of it in this bug report.

Comment 4 Sandro 2025-01-13 08:33:13 UTC
(In reply to Tulio Magno Quites Machado Filho from comment #1)
> I proposed a fix here:
> https://src.fedoraproject.org/rpms/numpy/pull-request/51

This doesn't make a difference. If you grep through `build.log` for `mcpu=power`, you will only find `-mcpu=power8` for the Fedora rawhide ppc64le build and only `-mcpu=power9` for the ELN ppc64le build - both before and after that PR was merged. Or is there more going on behind the scenes?

Interestingly, it does appear to resolve bug 2334097, which makes the question of how all the more intriguing.

Comment 5 Elliott Sales de Andrade 2025-01-13 08:47:45 UTC
The prior build was not using verbose compilation, so you wouldn't see the patched compile arguments. All you are seeing is the default build flags being set before the build.

I also don't think this patching is correct, but that's because it is just a blunt-force sed. NumPy uses CPU dispatching, so forcing a file to Power9 when NumPy was attempting to build it as Power8 doesn't make sense. You just end up with two files built for Power9, with NumPy thinking one of them is the Power8 one (hence the crashing).

If you don't want it to even try to dispatch to Power8, then you should set the cpu-baseline option: https://numpy.org/doc/stable/reference/simd/build-options.html
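
In this build setup that would boil down to something like the following (untested sketch; option names are taken from the linked documentation, and the exact invocation depends on how the spec drives the build):

  # Sketch: raise NumPy's SIMD baseline to Power9/VSX3 through the documented
  # meson build option instead of patching the generated compiler flags.
  pip wheel . -Csetup-args=-Dcpu-baseline="vsx3"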

Comment 6 Sandro 2025-01-13 17:05:24 UTC
Indeed. Enabling verbose output during compilation shows many more occurrences of `mcpu=power`. Looking more closely at the latest ppc64le Rawhide `build.log`[1], I notice six occurrences of `-mcpu=power9` and another five of `-mcpu=power10` still present. I'm not sure where that leaves us.

I'll be running a few test builds with `cpu-baseline` and `cpu-dispatch` and comparing the results to what we have now. At least for Fedora, the default of 'baseline: min+detect' appears to do the right thing: it selects VSX and VSX2 as the baseline and dispatches VSX3 and VSX4. I suppose the power9/power10 occurrences noted above are related to the dispatched optimizations.
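
For reference, which targets actually ended up in a given build, and which of them the running CPU picks up, can be checked from Python, e.g. (a sketch; the module path assumes the NumPy 2.x layout):

  import numpy as np
  # Lists baked into the binary at build time:
  from numpy._core._multiarray_umath import __cpu_baseline__, __cpu_dispatch__
  print("baseline:", __cpu_baseline__)   # expected ['VSX', 'VSX2'] on Fedora ppc64le
  print("dispatch:", __cpu_dispatch__)   # expected ['VSX3', 'VSX4']
  np.show_runtime()                      # also reports which features this CPU enables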

[1] I wasn't looking very closely before, when I stated that nothing had changed. I blame low caffeine levels.

Comment 7 Sandro 2025-01-13 20:16:40 UTC
(In reply to Elliott Sales de Andrade from comment #5)
> I also don't think this patching is correct, but that's because it is just a
> blunt-force sed. NumPy uses CPU dispatching, so forcing a file to Power9 when
> NumPy was attempting to build it as Power8 doesn't make sense. You just end
> up with two files built for Power9, with NumPy thinking one of them is the
> Power8 one (hence the crashing).
> 
> If you don't want it to even try to dispatch to Power8, then you should set
> the cpu-baseline option:
> https://numpy.org/doc/stable/reference/simd/build-options.html

Having run a few test builds and having played with `cpu-baseline` and `cpu-dispatch`, I have come to the conclusion that the applied patch is correct - at least in our build environment and considering the results below.

On RHEL >= 10, `-mcpu=power9 -mtune=power10` is set in the build flags. According to the build options that you linked in comment 5, I should be able to achieve the same by defining `-Csetup-args=-Dcpu-baseline="vsx3"`. I tried just that and it fails for both Fedora and ELN with the same error thrown in two places. Looking at the output of the ELN build, which uses `-mcpu=power9` and `-mtune=power10` by default, you can observe that NumPy throws in a `-mcpu=power8`, which overrules the settings from the build flags, exactly as assumed in bug 2332211 comment 2:

  [91/342] g++ -Inumpy/_core/libhighway_qsort_16bit.dispatch.h_VSX2.a.p -Inumpy/_core -I../numpy/_core -Inumpy/_core/include -I../numpy/_core/include -I../numpy/_core/src/common -I../numpy/_core/src/multiarray -I../numpy/_core/src/npymath -I../numpy/_core/src/umath -I../numpy/_core/src/highway -I/usr/include/python3.12 -I/builddir/build/BUILD/numpy-2.2.0/.mesonpy-wkjt5lbq/meson_cpu -fdiagnostics-color=always -DNDEBUG -Wall -Winvalid-pch -std=c++17 -O3 -mcpu=power9 -DNPY_HAVE_VSX -DNPY_HAVE_VSX_ASM -DNPY_HAVE_VSX3 -DNPY_HAVE_VSX3_HALF_DOUBLE -O2 -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mcpu=power9 -mtune=power10 -fasynchronous-unwind-tables -fstack-clash-protection -fPIC -DNPY_INTERNAL_BUILD -DHAVE_NPY_CONFIG_H -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE=1 -D_LARGEFILE64_SOURCE=1 -fno-exceptions -fno-rtti -O3 -DNPY_HAVE_VSX2 -mcpu=power8 -DNPY_MTARGETS_CURRENT=VSX2 -MD -MQ numpy/_core/libhighway_qsort_16bit.dispatch.h_VSX2.a.p/src_npysort_highway_qsort_16bit.dispatch.cpp.o -MF numpy/_core/libhighway_qsort_16bit.dispatch.h_VSX2.a.p/src_npysort_highway_qsort_16bit.dispatch.cpp.o.d -o numpy/_core/libhighway_qsort_16bit.dispatch.h_VSX2.a.p/src_npysort_highway_qsort_16bit.dispatch.cpp.o -c ../numpy/_core/src/npysort/highway_qsort_16bit.dispatch.cpp
  FAILED: numpy/_core/libhighway_qsort_16bit.dispatch.h_VSX2.a.p/src_npysort_highway_qsort_16bit.dispatch.cpp.o
  g++ -Inumpy/_core/libhighway_qsort_16bit.dispatch.h_VSX2.a.p -Inumpy/_core -I../numpy/_core -Inumpy/_core/include -I../numpy/_core/include -I../numpy/_core/src/common -I../numpy/_core/src/multiarray -I../numpy/_core/src/npymath -I../numpy/_core/src/umath -I../numpy/_core/src/highway -I/usr/include/python3.12 -I/builddir/build/BUILD/numpy-2.2.0/.mesonpy-wkjt5lbq/meson_cpu -fdiagnostics-color=always -DNDEBUG -Wall -Winvalid-pch -std=c++17 -O3 -mcpu=power9 -DNPY_HAVE_VSX -DNPY_HAVE_VSX_ASM -DNPY_HAVE_VSX3 -DNPY_HAVE_VSX3_HALF_DOUBLE -O2 -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mcpu=power9 -mtune=power10 -fasynchronous-unwind-tables -fstack-clash-protection -fPIC -DNPY_INTERNAL_BUILD -DHAVE_NPY_CONFIG_H -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE=1 -D_LARGEFILE64_SOURCE=1 -fno-exceptions -fno-rtti -O3 -DNPY_HAVE_VSX2 -mcpu=power8 -DNPY_MTARGETS_CURRENT=VSX2 -MD -MQ numpy/_core/libhighway_qsort_16bit.dispatch.h_VSX2.a.p/src_npysort_highway_qsort_16bit.dispatch.cpp.o -MF numpy/_core/libhighway_qsort_16bit.dispatch.h_VSX2.a.p/src_npysort_highway_qsort_16bit.dispatch.cpp.o.d -o numpy/_core/libhighway_qsort_16bit.dispatch.h_VSX2.a.p/src_npysort_highway_qsort_16bit.dispatch.cpp.o -c ../numpy/_core/src/npysort/highway_qsort_16bit.dispatch.cpp
  In file included from ../numpy/_core/src/common/common.hpp:10,
                   from ../numpy/_core/src/npysort/highway_qsort.hpp:6,
                   from ../numpy/_core/src/npysort/highway_qsort_16bit.dispatch.cpp:1:
  ../numpy/_core/src/common/half.hpp: In member function ‘np::Half::operator float() const’:
  ../numpy/_core/src/common/half.hpp:95:54: error: ‘__builtin_vsx_vextract_fp_from_shorth’ requires the ‘-mcpu=power9’ and ‘-mvsx’ options
     95 |         return vec_extract(vec_extract_fp_from_shorth(vec_splats(bits_)), 0);
        |                                                      ^
  ../numpy/_core/src/common/half.hpp:95:54: note: overloaded builtin ‘__builtin_vec_vextract_fp_from_shorth’ is implemented by builtin ‘__builtin_vsx_vextract_fp_from_shorth’

The output above is without the RHEL tweak, but with a VSX3 baseline. It fails for Fedora the same way, except that the output shows Fedora's build flags before NumPy's overruling flags.
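
The trailing `-mcpu=power8` matters because with GCC the last -mcpu option on the command line wins. A quick way to see that on a ppc64le box, independent of NumPy (sketch):

  # _ARCH_PWR9 is only predefined when the effective target is Power9 or newer.
  $ g++ -mcpu=power9 -mcpu=power8 -dM -E - </dev/null | grep _ARCH_PWR9
  (no output: power8 won)
  $ g++ -mcpu=power8 -mcpu=power9 -dM -E - </dev/null | grep _ARCH_PWR9
  #define _ARCH_PWR9 1

So the VSX2 dispatch object is effectively compiled for Power8, while the NPY_HAVE_VSX3* defines coming from the baseline presumably still select the Power9-only builtin in half.hpp, hence the error.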

I'd appreciate a second pair of eyes, but it looks to me like this is a bug in NumPy, since NumPy enforces power8 where it needs power9. It should be easy for upstream to reproduce by simply passing `-Csetup-args=-Dcpu-baseline="vsx3"`.

Comment 8 Tulio Magno Quites Machado Filho 2025-01-20 12:56:49 UTC
Sandro, IMHO it's OK for a project to have some of its files built with processor-specific compiler flags.
I could not spot any incorrect usage of those flags yet, but I'm not an expert in numpy.

I have confirmed the patch I proposed did fix the issue we were seeing on LLVM/MLIR.
Log of the build: https://copr.fedorainfracloud.org/coprs/g/fedora-llvm-team/llvm-snapshots-big-merge-20250113/build/8507960/

Could you elaborate on the issue you're seeing?

Comment 9 Sandro 2025-01-20 16:32:47 UTC
(In reply to Tulio Magno Quites Machado Filho from comment #8)

> Could you elaborate on the issue you're seeing?

The only issue I'm seeing is that what upstream suggests doing is not working. 

https://numpy.org/doc/stable/reference/simd/build-options.html

Upstream documents the use of build options for enabling / limiting processor-specific optimizations. Elliott suggested using those instead of overwriting the build flags. I kind of agree. At the same time, I found out that the mechanism upstream suggests does not work. I think the package maintainer, or someone knowledgeable enough, should report that upstream. I'd be willing to do that if someone is able to confirm my findings.

Downstream, the issue is solved for now. We could revisit the solution if upstream's build options offer a working alternative.

