2280347 – Invalid opcode on Xeon Phi 7290

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	zlib-ng
Sub Component:
Version:	40
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Assignee:	Tulio Magno Quites Machado Filho
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2024-05-14 12:42 UTC by Jocelyn Falempe
Modified:	2024-06-13 04:04 UTC
CC List:	4 users
Fixed In Version:	zlib-ng-2.1.6-3.fc41 zlib-ng-2.1.6-5.fc40
Clone Of:
Environment:
Last Closed:	2024-05-17 17:17:09 UTC
Type:	---
Embargoed:
Dependent Products:

Description Jocelyn Falempe 2024-05-14 12:42:16 UTC

Running git clone on a Xeon Phi 7290 CPU crashes, with the following message in dmesg:

[ 2796.536850] traps: git[26394] trap invalid opcode ip:7f6ca9c489f8 sp:7ffd9b131198 error:0 in libz.so.1.3.0.zlib-ng[7f6ca9c37000+16000]
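
The trap line gives both the faulting instruction pointer and the mapped region of the library (`[base+size]`), so the position of the faulting instruction inside that mapping is `ip - base`. The following is a minimal sketch of that arithmetic; `trap_offset` is a hypothetical helper name, not part of any tool mentioned here. The result is relative to the start of the mapped region, which is not necessarily the same as an offset from the start of the library file.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: parse a kernel "traps:" line and compute ip - base.
 * Returns 0 on success and stores the offset in *offset, -1 on parse failure
 * or if ip lies outside [base, base + size). */
int trap_offset(const char *line, uint64_t *offset)
{
    const char *ip_field = strstr(line, " ip:");   /* faulting address */
    const char *mapping = strrchr(line, '[');      /* last "[base+size]" */
    uint64_t ip, base, size;

    if (ip_field == NULL || mapping == NULL)
        return -1;
    if (sscanf(ip_field, " ip:%" SCNx64, &ip) != 1)
        return -1;
    if (sscanf(mapping, "[%" SCNx64 "+%" SCNx64 "]", &base, &size) != 2)
        return -1;
    if (ip < base || ip - base >= size)
        return -1;

    *offset = ip - base;
    return 0;
}
```

For the line above, this yields 0x7f6ca9c489f8 - 0x7f6ca9c37000 = 0x119f8, which falls inside the 0x16000-byte mapping.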


cat /proc/cpuinfo
processor	: 287
vendor_id	: GenuineIntel
cpu family	: 6
model		: 87
model name	: Intel(R) Xeon Phi(TM) CPU 7290 @ 1.50GHz
stepping	: 1
microcode	: 0x1b6
cpu MHz		: 1000.000
cache size	: 1024 KB
physical id	: 0
siblings	: 288
core id		: 73
cpu cores	: 72
apicid		: 295
initial apicid	: 295
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl est tm2 ssse3 fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ring3mwait cpuid_fault epb pti ibrs ibpb fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms avx512f rdseed adx avx512pf avx512er avx512cd xsaveopt dtherm ida arat pln pts
bugs		: cpu_meltdown spectre_v1 spectre_v2 mds msbds_only mmio_unknown
bogomips	: 2996.73
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:
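
The flags line above lists avx512f, avx512pf, avx512er and avx512cd, but not avx512bw. When checking such a line programmatically, a plain substring search is not enough, because `avx512` is a prefix of several flag names. Below is a minimal sketch of a whole-token lookup; `has_cpu_flag` is a hypothetical helper written for illustration, and it assumes the flags are separated by spaces as in /proc/cpuinfo.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return true if `name` appears as a whole
 * space-separated token in `flags` (the text after "flags :"). */
bool has_cpu_flag(const char *flags, const char *name)
{
    size_t len = strlen(name);
    const char *p = flags;

    if (len == 0)
        return false;

    while ((p = strstr(p, name)) != NULL) {
        /* The match must start at a token boundary... */
        bool starts = (p == flags) || p[-1] == ' ' || p[-1] == '\t';
        /* ...and end at one, so "avx512f" does not match "avx512fp16". */
        bool ends = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';
        if (starts && ends)
            return true;
        p++;
    }
    return false;
}
```

Called on the flags line above, `has_cpu_flag(flags, "avx512f")` is true and `has_cpu_flag(flags, "avx512bw")` is false.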



Reproducible: Always

Steps to Reproduce:
1. Install F40 on Xeon Phi 7290 system
2. Install git
3. Run git clone
Actual Results:  
Crashes with
[ 2796.536850] traps: git[26394] trap invalid opcode ip:7f6ca9c489f8 sp:7ffd9b131198 error:0 in libz.so.1.3.0.zlib-ng[7f6ca9c37000+16000]


Expected Results:  
git clone completes and the repository is cloned.

Xeon Phi processors are not common, so that's probably why it wasn't caught before.

Comment 1 Tulio Magno Quites Machado Filho 2024-05-14 13:12:48 UTC

I had a chat with Jocelyn and we managed to collect more data.

Stack trace of thread 28701:
                #0  0x00007f2ee88399f8 adler32_avx512 (libz.so.1 + 0x139f8)
                #1  0x00007f2ee88348a4 inflate (libz.so.1 + 0xe8a4)
                #2  0x0000562f3d0beb3f git_inflate (git + 0x1c6b3f)
                #3  0x0000562f3cf7730b parse_pack_objects.lto_priv.0 (git + 0x7f30b)
                #4  0x0000562f3cf80cb0 cmd_index_pack (git + 0x88cb0)
                #5  0x0000562f3cf04215 handle_builtin.lto_priv.0 (git + 0xc215)
                #6  0x0000562f3cf047d2 run_argv.lto_priv.0 (git + 0xc7d2)
                #7  0x0000562f3ceff6eb main (git + 0x76eb)
                #8  0x00007f2ee8663088 __libc_start_call_main (libc.so.6 + 0x2a088)
                #9  0x00007f2ee866314b __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x2a14b)
                #10 0x0000562f3ceffb55 _start (git + 0x7b55)

GDB identified the correct source code line and invalid instruction:

Program terminated with signal SIGILL, Illegal instruction.
#0  0x00007f2ee88399f8 in _mm512_madd_epi16 (__B=..., __A=...)
    at /usr/lib/gcc/x86_64-redhat-linux/14/include/avx512bwintrin.h:1544

warning: Source file is more recent than executable.
1544      return (__m512i) __builtin_ia32_pmaddwd512_mask ((__v32hi) __A,
(gdb) disas
Dump of assembler code for function adler32_avx512:
   0x00007f2ee88399b0 <+0>:     endbr64
   0x00007f2ee88399b4 <+4>:     mov    %rdx,%rcx
   0x00007f2ee88399b7 <+7>:     test   %rsi,%rsi
   0x00007f2ee88399ba <+10>:    je     0x7f2ee8839b44 <adler32_avx512+404>
   0x00007f2ee88399c0 <+16>:    mov    %edi,%eax
   0x00007f2ee88399c2 <+18>:    test   %rdx,%rdx
   0x00007f2ee88399c5 <+21>:    je     0x7f2ee8839b3f <adler32_avx512+399>
   0x00007f2ee88399cb <+27>:    shr    $0x10,%eax
   0x00007f2ee88399ce <+30>:    movzwl %di,%edx
   0x00007f2ee88399d1 <+33>:    cmp    $0x3f,%rcx
   0x00007f2ee88399d5 <+37>:    jbe    0x7f2ee8839b4a <adler32_avx512+410>
   0x00007f2ee88399db <+43>:    mov    $0x1,%edi
   0x00007f2ee88399e0 <+48>:    vmovdqa64 0x8916(%rip),%zmm7        # 0x7f2ee8842300
   0x00007f2ee88399ea <+58>:    vmovdqa32 0x894c(%rip),%zmm8        # 0x7f2ee8842340
   0x00007f2ee88399f4 <+68>:    vpxor  %xmm6,%xmm6,%xmm6
=> 0x00007f2ee88399f8 <+72>:    vpbroadcastw %edi,%zmm5

I'm writing a bug report upstream.

Comment 2 Ben Beasley 2024-05-14 13:42:01 UTC

Per https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#techs=AVX_512&text=madd_epi16&ig_expand=4203, the _mm512_madd_epi16 intrinsic requires the AVX512BW CPUID flag. Upstream only checks for AVX512F here. https://github.com/zlib-ng/zlib-ng/blob/1007e7a9c74148fe915384d7cc44921559500241/arch/x86/x86_features.c#L102

Changing ebx & 0x00010000 to ebx & 0x40010000 should suffice, but someone should really audit the AVX-512 code to see if any of the other intrinsics are outside AVX512F.
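
For reference, 0x00010000 is bit 16 (AVX512F) and 0x40000000 is bit 30 (AVX512BW) of EBX from CPUID leaf 7, subleaf 0, so 0x40010000 is the two combined. One caveat: if the result of `ebx & 0x40010000` is only tested for being nonzero, it still passes on a CPU that has AVX512F alone, so a check that requires both features has to compare against the full mask. The following is a minimal sketch of such a check, not the fix that went upstream; the macro and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID.(EAX=7,ECX=0):EBX feature bits. Names are illustrative only. */
#define CPUID7_EBX_AVX512F  (UINT32_C(1) << 16)  /* 0x00010000 */
#define CPUID7_EBX_AVX512BW (UINT32_C(1) << 30)  /* 0x40000000 */

/* Return true only if BOTH AVX512F and AVX512BW are reported in ebx. */
bool ebx_has_avx512f_and_bw(uint32_t ebx)
{
    const uint32_t required = CPUID7_EBX_AVX512F | CPUID7_EBX_AVX512BW;

    /* Compare against the mask: a bare (ebx & required) is nonzero
     * as soon as either bit is set. */
    return (ebx & required) == required;
}
```

With this form, an EBX value that has only bit 16 set, as would be expected from the flags listed in the description, is rejected.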

It makes sense that this showed up on Xeon Phi, because those processors have an unusual combination of AVX-512 features compared to the “normal” Xeon server chips that upstream probably tested with.

Comment 3 Tulio Magno Quites Machado Filho 2024-05-14 18:26:33 UTC

I've just created the following pull request upstream: https://github.com/zlib-ng/zlib-ng/pull/1723

Comment 4 Tulio Magno Quites Machado Filho 2024-05-14 18:39:12 UTC

Pull request downstream has just been created: https://src.fedoraproject.org/rpms/zlib-ng/pull-request/13

Comment 5 Fedora Update System 2024-05-15 15:45:51 UTC

FEDORA-2024-1a7f4b98c3 (zlib-ng-2.1.6-3.fc41) has been submitted as an update to Fedora 41.
https://bodhi.fedoraproject.org/updates/FEDORA-2024-1a7f4b98c3

Comment 6 Jocelyn Falempe 2024-05-16 09:41:42 UTC

I can confirm that installing
 zlib-ng-compat x86_64 2.1.6-3.fc41 
 zlib-ng-compat-devel 2.1.6-3.fc41

from the Bodhi link in comment #5 on my F40 system fixes the issue, and git clone now works.

Thanks for the fix.

Comment 7 Fedora Update System 2024-05-17 17:17:09 UTC

FEDORA-2024-1a7f4b98c3 (zlib-ng-2.1.6-3.fc41) has been pushed to the Fedora 41 stable repository.
If the problem still persists, please make note of it in this bug report.

Comment 8 Tulio Magno Quites Machado Filho 2024-05-31 19:08:47 UTC

Jocelyn, upstream asked for some important changes to the fix I had implemented.

Could you check if it still fixes the issue you were seeing, please?

This has now landed upstream and has been backported to rawhide in zlib-ng 2.1.6-4.
I have a scratch build for F40 available here too: https://koji.fedoraproject.org/koji/taskinfo?taskID=118346414

Comment 9 Jocelyn Falempe 2024-05-31 23:16:05 UTC

Sure, I will test your scratch build on Monday, and report here.

Thanks.

Comment 10 Jocelyn Falempe 2024-06-03 08:28:01 UTC

I have tested on the hpe-xl260 machine with F40, and installing zlib-ng-compat-2.1.6-5.fc40.x86_64.rpm does fix the issue.

Thanks for your support.

Comment 11 Fedora Update System 2024-06-04 18:28:24 UTC

FEDORA-2024-241a7a825b (zlib-ng-2.1.6-5.fc40) has been submitted as an update to Fedora 40.
https://bodhi.fedoraproject.org/updates/FEDORA-2024-241a7a825b

Comment 12 Fedora Update System 2024-06-05 02:03:52 UTC

FEDORA-2024-241a7a825b has been pushed to the Fedora 40 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2024-241a7a825b`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2024-241a7a825b

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 13 Fedora Update System 2024-06-13 04:04:54 UTC

FEDORA-2024-241a7a825b (zlib-ng-2.1.6-5.fc40) has been pushed to the Fedora 40 stable repository.
If the problem still persists, please make note of it in this bug report.
