+++ This bug was initially created as a clone of Bug #2100393 +++

Description of problem:

The issue was originally detected in OpenStack upstream CI (https://bugs.launchpad.net/tripleo/+bug/1979276) with ovn-2021-21.12.0-46. It was not seen with the last FDP release, ovn-2021-21.12.0-11, so it is a regression in the new FDP 22.B release.

OVN binaries such as ovn-northd, ovn-controller, ovn-nbctl and ovn-sbctl crash with "Illegal instruction (core dumped)". For example, ovn-nbctl --version crashed as below:

# ovn-nbctl --version
Illegal instruction (core dumped)

# coredumpctl info
PID: 640886 (ovn-nbctl)
UID: 0 (root)
GID: 0 (root)
Signal: 4 (ILL)
Timestamp: Thu 2022-06-23 08:48:35 UTC (3s ago)
Command Line: ovn-nbctl --version
Executable: /usr/bin/ovn-nbctl
Control Group: /machine.slice/libpod-449776acdb3089ad2f92d49b850b234089c5ec549b6f5d0fcfb5414b5f19717a.scope/container
Unit: libpod-449776acdb3089ad2f92d49b850b234089c5ec549b6f5d0fcfb5414b5f19717a.scope
Slice: machine.slice
Boot ID: 4f2c55fc25f34c84a6160468479ece43
Machine ID: c26d255f89064955aa655cf12e74d969
Hostname: standalone.localdomain
Storage: /var/lib/systemd/coredump/core.ovn-nbctl.0.4f2c55fc25f34c84a6160468479ece43.640886.1655974115000000.zst (present)
Disk Size: 160.0K
Message: Process 640886 (ovn-nbctl) of user 0 dumped core.

    Module /usr/bin/ovn-nbctl with build-id 2798d30ce0833d6e0fcabb6d8a0a98cba4da707d
    Module linux-vdso.so.1 with build-id 826a46efc5a1c4a55cc6fdceeb06554eda66067e
    Module libnghttp2.so.14 with build-id 7eadbd56a0e5bcd3d8a6b39b9bab2327e380283a
    Module libpython3.9.so.1.0 with build-id bb4578c381c6d22045835e803bf846e2b5a28502
    Module libevent-2.1.so.7 with build-id af406c254338ff6ceff47360cba92cdcf233cf14
    Module libprotobuf-c.so.1 with build-id 46661ae5d66cbaa2aa82b1b765472bdfa4712a24
    Module ld-linux-x86-64.so.2 with build-id 1d95aae3e4174446d3b885ad234d4f7e573e71db
    Module libz.so.1 with build-id 25486226566596e403da5485fb0ec85deed6b9fa
    Module libc.so.6 with build-id 14830f7e71953d5f0dac317543ac1e3fcdd874f5
    Module libunbound.so.8 with build-id def32d1bb7a7d99c59bf62e00c628af0246afa91
    Module libm.so.6 with build-id 3eb525d2e163793ef2e888d5bb46e104d11a3201
    Module libcap-ng.so.0 with build-id fdca0a301667e15db99d726152b57feeb35e4dbe
    Module libcrypto.so.3 with build-id 12bfb8486a63c1daa0d3b1d901401cd152c09f8e
    Module libssl.so.3 with build-id 4f82a7edeeafe3698ccc5442d011a8cd5aaf4e9d
    Stack trace of thread 96216:
    #0 0x000055d209c3dba8 n/a (/usr/bin/ovn-nbctl + 0x16ba8)
    ELF object binary architecture: AMD x86-64

- Issue seen with the CPU models below:
    Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
    Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
    Intel Xeon E312xx (Sandy Bridge)

- Issue not seen with the CPU models below:
    Intel Core Processor (Haswell, no TSX)
    Intel Xeon Processor (Cascadelake)
    AMD EPYC-Rome Processor

/proc/cpuinfo looks like below on affected systems:

===========================
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 45
model name : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
stepping : 7
microcode : 0x71a
cpu MHz : 2593.881
cache size : 20480 KB
physical id : 0
siblings : 1
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush acpi mmx fxsr sse sse2 syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl cpuid tsc_known_freq pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx hypervisor lahf_lm cpuid_fault pti ssbd ibrs ibpb stibp xsaveopt md_clear flush_l1d
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips : 5187.52
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
===========================
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
stepping : 4
microcode : 0x42e
cpu MHz : 2599.955
cache size : 20480 KB
physical id : 0
siblings : 1
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush acpi mmx fxsr sse sse2 syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl cpuid tsc_known_freq pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cpuid_fault pti ssbd ibrs ibpb stibp fsgsbase smep erms xsaveopt md_clear flush_l1d
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips : 5200.03
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
================================================

Version-Release number of selected component (if applicable):
- ovn-2021-21.12.0-46

How reproducible:
Always on certain CPU models

Steps to Reproduce:
1. Install an affected ovn version on VMs with one of the CPU models below:
    Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
    Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
    Intel Xeon E312xx (Sandy Bridge)
2. Run ovn-nbctl --version

Actual results:
Fails with Illegal instruction (core dumped)

Expected results:
Should succeed

Additional info:

--- Additional comment from Yatin Karel on 2022-06-23 09:17:16 UTC ---

The issue is also seen with downstream rhel9 builds:

First bad build:
- ovn-2021-21.12.0-30.el9fdp
The latest build, ovn-2021-21.12.0-73.el9fdp, is also impacted.

Last good build:
- ovn-2021-21.12.0-11.el9fdp
I also tried a rebuild of ovn-2021-21.12.0-15 (brew only had an el8 build for it), and that is good as well.

--- Additional comment from Ales Musil on 2022-06-23 11:39:57 UTC ---

So the illegal instruction is "shlx", which is part of BMI2, and the affected CPUs do not support it. cpu.c is compiled with the -mbmi2 flag and others that might not really be supported (avx512, bmi1, etc.):

libtool: compile: gcc -DHAVE_CONFIG_H -I. -I ./include -I ./include -I ./lib -I ./lib -mavx512f -mavx512bw -mavx512dq -mbmi -mbmi2 -fPIC -Wstrict-prototypes -Wall -Wextra -Wno-sign-compare -Wpointer-arith -Wformat -Wformat-security -Wswitch-enum -Wunused-parameter -Wbad-function-cast -Wcast-align -Wstrict-prototypes -Wold-style-definition -Wmissing-prototypes -Wmissing-field-initializers -fno-strict-aliasing -Wswitch-bool -Wlogical-not-parentheses -Wsizeof-array-argument -Wbool-compare -Wshift-negative-value -Wduplicated-cond -Wshadow -Wmultistatement-macros -Wcast-align=strict -O2 -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -march=x86-64-v2 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -c lib/cpu.c -o lib/libopenvswitchavx512_la-cpu.o

I am not sure that this really makes sense: this code is supposed to detect the capabilities, so it shouldn't be compiled with assumptions about those capabilities.
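
For triage, the mismatch described in the analysis above can be checked on a host without running any OVN binary. The following is a minimal sketch, not part of OVS or OVN: it compares the ISA options from the quoted compile line (-mavx512f, -mavx512bw, -mavx512dq, -mbmi, -mbmi2) against a /proc/cpuinfo flags line. The helper names (cpu_flags, missing_extensions) and the option-to-flag table are illustrative assumptions; the sample flags line is the one reported for the affected E5-2670 guest.

```python
# Map the GCC ISA options seen in the compile line to the flag names
# Linux prints in /proc/cpuinfo. Assumption for illustration: -mbmi
# corresponds to the cpuinfo flag "bmi1".
GCC_TO_CPUINFO = {
    "-mavx512f": "avx512f",
    "-mavx512bw": "avx512bw",
    "-mavx512dq": "avx512dq",
    "-mbmi": "bmi1",
    "-mbmi2": "bmi2",
}

# Flags line copied from the affected E5-2670 guest in this report.
SAMPLE = (
    "flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov "
    "pat pse36 clflush acpi mmx fxsr sse sse2 syscall nx pdpe1gb rdtscp lm "
    "constant_tsc rep_good nopl cpuid tsc_known_freq pni pclmulqdq ssse3 "
    "cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx "
    "hypervisor lahf_lm cpuid_fault pti ssbd ibrs ibpb stibp xsaveopt "
    "md_clear flush_l1d"
)


def cpu_flags(cpuinfo_text):
    """Return the flag set of the first 'flags' line in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_extensions(cpuinfo_text, gcc_options=GCC_TO_CPUINFO):
    """List the GCC ISA options whose cpuinfo flag the CPU does not report."""
    flags = cpu_flags(cpuinfo_text)
    return sorted(opt for opt, flag in gcc_options.items() if flag not in flags)


print(missing_extensions(SAMPLE))
```

On a live system the same check would be `missing_extensions(open("/proc/cpuinfo").read())`. A non-empty result only means the CPU lacks an extension the object file was allowed to use; whether a crash actually occurs depends on which instructions the compiler emitted.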
--- Additional comment from Ales Musil on 2022-06-23 11:40:34 UTC ---

GDB disassemble

--- Additional comment from Ales Musil on 2022-06-23 11:41:03 UTC ---

cpuid output

--- Additional comment from Ales Musil on 2022-06-23 12:29:44 UTC ---

Moving to openvswitch

--- Additional comment from Ales Musil on 2022-06-23 12:41:37 UTC ---

Patch posted: https://patchwork.ozlabs.org/project/openvswitch/patch/20220623124037.252709-1-amusil@redhat.com/

--- Additional comment from Yatin Karel on 2022-06-23 14:26:43 UTC ---

Just for the record, https://github.com/openvswitch/ovs/commit/b366fa2f4947f7e64154c7656b938b7ef4834ae8 triggered the issue; with that commit reverted, the issue is not seen.

--- Additional comment from David Marchand on 2022-06-24 07:37:45 UTC ---

I posted a more complete fix, following an upstream report of AVX512 breakage and a discussion with Ilya:
https://patchwork.ozlabs.org/project/openvswitch/patch/20220624072959.240183-1-david.marchand@redhat.com/

--- Additional comment from Yatin Karel on 2022-06-27 11:47:42 UTC ---

> I posted a more complete fix, following upstream report of AVX512 breakage, and discussion with Ilya:
> https://patchwork.ozlabs.org/project/openvswitch/patch/20220624072959.240183-1-david.marchand@redhat.com/

I tried a custom build[1] (ovn-2021-21.12.0-46 + the fix above) and it worked (I only checked ovn-nbctl --version) on the affected node (missing avx512 CPU flag).

The patch did not apply cleanly on top of ovs commit 498cedc483f3239c839c55b4d9f2261b61fb6ace, so I had to cherry-pick two more commits[2][3] to get it to build. I only did this to test the fix; the actual backports will likely be done differently.

[1] https://cbs.centos.org/koji/taskinfo?taskID=2872936
[2] https://github.com/openvswitch/ovs/commit/fb85ae4340a51bea26b9a4099448a982834afeff
[3] https://github.com/openvswitch/ovs/commit/cb1c64007734cbaa4b23d3e569a550c0beaa4afd

Regarding the fix, do we need separate bzs for ovn (e.g. for ovn-2021, ovn22.03) to get new OVN builds with the fix included, or will those be taken care of automatically by this bz?

--- Additional comment from OvS team on 2022-06-29 14:03:31 UTC ---

* Wed Jun 29 2022 Open vSwitch CI <ovs-ci> - 2.17.0-28
- Merging upstream branch-2.17 [RH git: f3aee3f437]
    Commit list:
    a77ad9693c dpif-netdev: Refactor AVX512 runtime checks. (#2100393)

--- Additional comment from David Hill on 2022-06-29 15:40:13 UTC ---

The ovs binaries are affected too:

[root@undercloud-0-rhosp17 ~]# ovs-vsctl show
Illegal instruction (core dumped)

--- Additional comment from David Hill on 2022-06-29 16:04:29 UTC ---

I got this issue with:

2022-06-29T10:32:47-0400 SUBDEBUG Installed: openvswitch2.17-2.17.0-18.el9fdp.x86_64

flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 cx16 sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cpuid_fault pti ibrs ibpb tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt arat

and installing the brew package -25 solved this issue.
Posted ovs submodule bumps for main and 22.06: https://patchwork.ozlabs.org/project/ovn/patch/20220630133958.199238-1-amusil@redhat.com/ And for 22.03 and 21.12: https://patchwork.ozlabs.org/project/ovn/patch/20220630140100.288949-1-amusil@redhat.com/
This issue is fixed in ovn-2021-21.12.0-82.el8fdp
This issue has been cloned at https://bugzilla.redhat.com/show_bug.cgi?id=2103307
This issue has been cloned at https://bugzilla.redhat.com/show_bug.cgi?id=2103310
This issue has been cloned at https://bugzilla.redhat.com/show_bug.cgi?id=2103311
This issue has been cloned at https://bugzilla.redhat.com/show_bug.cgi?id=2103314
This issue has been cloned at https://bugzilla.redhat.com/show_bug.cgi?id=2103315
Failed to reproduce the failure on ovn-2021-21.12.0-46:

[zuul@centos-8-stream-rax-iad-0030376224 ovn-2021-82]$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 8
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 45
Model name: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Stepping: 7
CPU MHz: 2593.863
BogoMIPS: 5187.50
Hypervisor vendor: Xen
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0-7
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush acpi mmx fxsr sse sse2 syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl cpuid pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx hypervisor lahf_lm cpuid_fault pti ssbd ibrs ibpb stibp xsaveopt md_clear flush_l1d

[zuul@centos-8-stream-rax-iad-0030376224 bz2102618]$ sudo rpm -Uvh * --nodeps --force
Verifying... ################################# [100%]
Preparing... ################################# [100%]
Updating / installing...
1:ovn-2021-21.12.0-46.el8fdp ################################# [ 25%]
Unit ovn-northd.service could not be found.
2:ovn-2021-central-21.12.0-46.el8fd################################# [ 50%]
Unit ovn-controller.service could not be found.
3:ovn-2021-host-21.12.0-46.el8fdp ################################# [ 75%]
Cleaning up / removing...
4:ovn-2021-21.12.0-46.el8s ################################# [100%]

[zuul@centos-8-stream-rax-iad-0030376224 bz2102618]$ ovn-nbctl --version
ovn-nbctl 21.12.2
Open vSwitch Library 2.17.1
DB Schema 5.35.1

and the failure didn't occur on ovn-2021-21.12.0-82:

[zuul@centos-8-stream-rax-iad-0030376224 ovn-2021-82]$ rpm -qa | grep -E "openvswitch|ovn-2021"
rdo-openvswitch-2.12-1.el8.noarch
ovn-2021-central-21.12.0-82.el8fdp.x86_64
openvswitch-2.12.0-1.1.el8.x86_64
ovn-2021-21.12.0-82.el8fdp.x86_64
ovn-2021-host-21.12.0-82.el8fdp.x86_64
network-scripts-openvswitch-2.12.0-1.1.el8.x86_64
centos-release-nfv-openvswitch-1-3.el8.noarch

[zuul@centos-8-stream-rax-iad-0030376224 ovn-2021-82]$ ovn-nbctl --version
ovn-nbctl 21.12.3
Open vSwitch Library 2.17.3
DB Schema 5.35.1
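
The manual check above (run ovn-nbctl --version and watch for "Illegal instruction") could be scripted so that it can be repeated across nodes or builds. This is a minimal sketch under assumptions, not a tool from OVN or the QE process: the helper name died_with_sigill is hypothetical, and it relies only on the documented POSIX behaviour of Python's subprocess module, which reports a child killed by signal N as return code -N (signal 4 is ILL, as in the coredumpctl output in this report).

```python
import signal
import subprocess


def died_with_sigill(argv):
    """Run argv and report whether the process was killed by SIGILL.

    Raises FileNotFoundError if the binary is not installed.
    """
    result = subprocess.run(
        argv,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    # On POSIX, death by signal N is reported as return code -N.
    return result.returncode == -signal.SIGILL
```

For example, `died_with_sigill(["ovn-nbctl", "--version"])` would be expected to return True on an affected build and CPU combination and False after the fixed packages are installed.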
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (ovn-2021 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:5787