Bug 2102618 - ovn binaries crash with Illegal instruction (core dumped) on certain CPU models
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: ovn-2021
Version: FDP 22.B
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Ales Musil
QA Contact: Jianlin Shi
URL:
Whiteboard:
Depends On: 2100393
Blocks:
Reported: 2022-06-30 11:26 UTC by Ales Musil
Modified: 2022-08-01 14:11 UTC (History)

Fixed In Version: ovn-2021-21.12.0-82.el8fdp
Doc Text:
Clone Of: 2100393
Environment:
Last Closed: 2022-08-01 14:11:03 UTC
Target Upstream Version:
Embargoed:


Links
System                  ID              Private  Priority  Status  Summary  Last Updated
Red Hat Issue Tracker   FD-2071         0        None      None    None     2022-06-30 11:35:47 UTC
Red Hat Product Errata  RHBA-2022:5787  0        None      None    None     2022-08-01 14:11:14 UTC

Description Ales Musil 2022-06-30 11:26:22 UTC
+++ This bug was initially created as a clone of Bug #2100393 +++

Description of problem:
This was originally detected in OpenStack upstream CI (https://bugs.launchpad.net/tripleo/+bug/1979276) with ovn-2021-21.12.0-46. The issue was not seen with the last FDP release, ovn-2021-21.12.0-11, so it is a regression in the new FDP 22.B release.

OVN binaries such as ovn-northd, ovn-controller, ovn-nbctl and ovn-sbctl crash with "Illegal instruction (core dumped)".

For example, ovn-nbctl --version crashes as shown below:
# ovn-nbctl --version
Illegal instruction (core dumped)

# coredumpctl info
           PID: 640886 (ovn-nbctl)
           UID: 0 (root)
           GID: 0 (root)
        Signal: 4 (ILL)
     Timestamp: Thu 2022-06-23 08:48:35 UTC (3s ago)
  Command Line: ovn-nbctl --version
    Executable: /usr/bin/ovn-nbctl
 Control Group: /machine.slice/libpod-449776acdb3089ad2f92d49b850b234089c5ec549b6f5d0fcfb5414b5f19717a.scope/container
          Unit: libpod-449776acdb3089ad2f92d49b850b234089c5ec549b6f5d0fcfb5414b5f19717a.scope
         Slice: machine.slice
       Boot ID: 4f2c55fc25f34c84a6160468479ece43
    Machine ID: c26d255f89064955aa655cf12e74d969
      Hostname: standalone.localdomain
       Storage: /var/lib/systemd/coredump/core.ovn-nbctl.0.4f2c55fc25f34c84a6160468479ece43.640886.1655974115000000.zst (present)
     Disk Size: 160.0K
       Message: Process 640886 (ovn-nbctl) of user 0 dumped core.
                
                Module /usr/bin/ovn-nbctl with build-id 2798d30ce0833d6e0fcabb6d8a0a98cba4da707d
                Module linux-vdso.so.1 with build-id 826a46efc5a1c4a55cc6fdceeb06554eda66067e
                Module libnghttp2.so.14 with build-id 7eadbd56a0e5bcd3d8a6b39b9bab2327e380283a
                Module libpython3.9.so.1.0 with build-id bb4578c381c6d22045835e803bf846e2b5a28502
                Module libevent-2.1.so.7 with build-id af406c254338ff6ceff47360cba92cdcf233cf14
                Module libprotobuf-c.so.1 with build-id 46661ae5d66cbaa2aa82b1b765472bdfa4712a24
                Module ld-linux-x86-64.so.2 with build-id 1d95aae3e4174446d3b885ad234d4f7e573e71db
                Module libz.so.1 with build-id 25486226566596e403da5485fb0ec85deed6b9fa
                Module libc.so.6 with build-id 14830f7e71953d5f0dac317543ac1e3fcdd874f5
                Module libunbound.so.8 with build-id def32d1bb7a7d99c59bf62e00c628af0246afa91
                Module libm.so.6 with build-id 3eb525d2e163793ef2e888d5bb46e104d11a3201
                Module libcap-ng.so.0 with build-id fdca0a301667e15db99d726152b57feeb35e4dbe
                Module libcrypto.so.3 with build-id 12bfb8486a63c1daa0d3b1d901401cd152c09f8e
                Module libssl.so.3 with build-id 4f82a7edeeafe3698ccc5442d011a8cd5aaf4e9d
                Stack trace of thread 96216:
                #0  0x000055d209c3dba8 n/a (/usr/bin/ovn-nbctl + 0x16ba8)
                ELF object binary architecture: AMD x86-64

- The issue is seen with the following CPU models:
Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Intel Xeon E312xx (Sandy Bridge)

- The issue is not seen with the following CPU models:
Intel Core Processor (Haswell, no TSX)
Intel Xeon Processor (Cascadelake)
AMD EPYC-Rome Processor

/proc/cpuinfo on the affected systems looks like this:
===========================
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 45
model name : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
stepping : 7
microcode : 0x71a
cpu MHz : 2593.881
cache size : 20480 KB
physical id : 0
siblings : 1
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush acpi mmx fxsr sse sse2 syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl cpuid tsc_known_freq pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx hypervisor lahf_lm cpuid_fault pti ssbd ibrs ibpb stibp xsaveopt md_clear flush_l1d
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips : 5187.52
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
===========================
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
stepping : 4
microcode : 0x42e
cpu MHz : 2599.955
cache size : 20480 KB
physical id : 0
siblings : 1
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush acpi mmx fxsr sse sse2 syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl cpuid tsc_known_freq pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cpuid_fault pti ssbd ibrs ibpb stibp fsgsbase smep erms xsaveopt md_clear flush_l1d
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips : 5200.03
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
================================================


Version-Release number of selected component (if applicable):
- ovn-2021-21.12.0-46

How reproducible:
Always on certain CPU models

Steps to Reproduce:
1. Install the affected OVN version on VMs with one of the following CPU models:
Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Intel Xeon E312xx (Sandy Bridge)
2. Run ovn-nbctl --version


Actual results:
Fails with Illegal instruction (core dumped)

Expected results:
Should succeed
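
The reproduction steps above can be scripted when several hosts need to be checked. The following is only a sketch, not something used in this report: the helper names classify_exit and check_binary are made up for illustration, and it relies on Python's subprocess module reporting death by signal N as return code -N.

```python
import signal
import subprocess
import sys


def classify_exit(returncode):
    """Map a subprocess return code to a short status string."""
    if returncode == 0:
        return "ok"
    # subprocess reports "killed by signal N" as a return code of -N.
    if returncode == -signal.SIGILL:
        return "illegal instruction"
    return "failed"


def check_binary(argv):
    """Run argv with its output discarded and classify how it exited."""
    result = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return classify_exit(result.returncode)


if __name__ == "__main__":
    # Binaries named in the description; all of them accept --version.
    for tool in ("ovn-nbctl", "ovn-sbctl", "ovn-northd", "ovn-controller"):
        try:
            status = check_binary([tool, "--version"])
        except FileNotFoundError:
            status = "not installed"
        print(f"{tool}: {status}", file=sys.stderr)
```

On an affected host each installed binary would be expected to report "illegal instruction", matching the "Signal: 4 (ILL)" line in the coredumpctl output.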

Additional info:

--- Additional comment from Yatin Karel on 2022-06-23 09:17:16 UTC ---

The issue is also seen with downstream RHEL 9 builds:
First bad build: ovn-2021-21.12.0-30.el9fdp
The latest build, ovn-2021-21.12.0-73.el9fdp, is also affected.

Last good build:
- ovn-2021-21.12.0-11.el9fdp
I also tried a rebuild of ovn-2021-21.12.0-15 (brew only had an el8 build for it), and that is good as well.

--- Additional comment from Ales Musil on 2022-06-23 11:39:57 UTC ---

So the illegal instruction is "shlx", which is part of BMI2, and the affected CPUs do not support it.

cpu.c is compiled with the -mbmi2 flag, along with others that might not actually be supported (avx512, bmi1, etc.):

libtool: compile:  gcc -DHAVE_CONFIG_H -I. -I ./include -I ./include -I ./lib -I ./lib -mavx512f -mavx512bw -mavx512dq -mbmi -mbmi2 -fPIC -Wstrict-prototypes -Wall -Wextra -Wno-sign-compare -Wpointer-arith -Wformat -Wformat-security -Wswitch-enum -Wunused-parameter -Wbad-function-cast -Wcast-align -Wstrict-prototypes -Wold-style-definition -Wmissing-prototypes -Wmissing-field-initializers -fno-strict-aliasing -Wswitch-bool -Wlogical-not-parentheses -Wsizeof-array-argument -Wbool-compare -Wshift-negative-value -Wduplicated-cond -Wshadow -Wmultistatement-macros -Wcast-align=strict -O2 -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -march=x86-64-v2 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -c lib/cpu.c -o lib/libopenvswitchavx512_la-cpu.o


I am not sure that really makes sense: this code is supposed to detect the CPU capabilities, so it shouldn't be compiled with assumptions about those capabilities.
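
To illustrate the mismatch, here is a small Python sketch (not part of OVS; the helper names are hypothetical) that maps the -m options from the compile line above to the flag names Linux reports in /proc/cpuinfo and lists the options the running CPU cannot honour. The option-to-flag mapping is written out by hand and covers only the five options seen in this report.

```python
# GCC -m options from the compile line, mapped to /proc/cpuinfo flag names.
GCC_OPTION_TO_CPUINFO_FLAG = {
    "-mavx512f": "avx512f",
    "-mavx512bw": "avx512bw",
    "-mavx512dq": "avx512dq",
    "-mbmi": "bmi1",
    "-mbmi2": "bmi2",
}


def parse_cpu_flags(cpuinfo_text):
    """Return the flags of the first 'flags' line found in cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def unsupported_options(cpu_flags, options=GCC_OPTION_TO_CPUINFO_FLAG):
    """List the -m options whose CPU flag is absent from cpu_flags."""
    return sorted(opt for opt, flag in options.items() if flag not in cpu_flags)


if __name__ == "__main__":
    with open("/proc/cpuinfo") as cpuinfo:
        flags = parse_cpu_flags(cpuinfo.read())
    missing = unsupported_options(flags)
    if missing:
        print("CPU lacks support for:", " ".join(missing))
    else:
        print("CPU supports every option in the list")
```

With the E5-2670 and E5-2650 v2 flags listed in the description, all five options would be reported as unsupported, which is consistent with the crash on "shlx".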

--- Additional comment from Ales Musil on 2022-06-23 11:40:34 UTC ---

GDB disassemble

--- Additional comment from Ales Musil on 2022-06-23 11:41:03 UTC ---

cpuid output

--- Additional comment from Ales Musil on 2022-06-23 12:29:44 UTC ---

Moving to openvswitch

--- Additional comment from Ales Musil on 2022-06-23 12:41:37 UTC ---

Patch posted: 
https://patchwork.ozlabs.org/project/openvswitch/patch/20220623124037.252709-1-amusil@redhat.com/

--- Additional comment from Yatin Karel on 2022-06-23 14:26:43 UTC ---

Just for the record, https://github.com/openvswitch/ovs/commit/b366fa2f4947f7e64154c7656b938b7ef4834ae8 triggered the issue; with that commit reverted, the issue is not seen.

--- Additional comment from David Marchand on 2022-06-24 07:37:45 UTC ---

I posted a more complete fix, following upstream report of AVX512 breakage, and discussion with Ilya:
https://patchwork.ozlabs.org/project/openvswitch/patch/20220624072959.240183-1-david.marchand@redhat.com/

--- Additional comment from Yatin Karel on 2022-06-27 11:47:42 UTC ---

> I posted a more complete fix, following upstream report of AVX512 breakage, and discussion with Ilya:
> https://patchwork.ozlabs.org/project/openvswitch/patch/20220624072959.240183-1-david.marchand@redhat.com/

I tried a custom build[1] (ovn-2021-21.12.0-46 plus the fix above) and it worked on the affected node, which is missing the avx512 CPU flag (I only checked ovn-nbctl --version).
The patch didn't apply cleanly on top of ovs commit 498cedc483f3239c839c55b4d9f2261b61fb6ace, so I had to cherry-pick two more commits[2][3] to get it to build. I did this just to test the fix; the actual backports will likely be done differently.

[1] https://cbs.centos.org/koji/taskinfo?taskID=2872936
[2] https://github.com/openvswitch/ovs/commit/fb85ae4340a51bea26b9a4099448a982834afeff
[3] https://github.com/openvswitch/ovs/commit/cb1c64007734cbaa4b23d3e569a550c0beaa4afd




Regarding the fix, do we also need separate BZs for OVN (e.g. for ovn-2021, ovn22.03) to get new OVN builds with the fix included, or will those be taken care of automatically by this BZ?

--- Additional comment from OvS team on 2022-06-29 14:03:31 UTC ---

* Wed Jun 29 2022 Open vSwitch CI <ovs-ci> - 2.17.0-28
- Merging upstream branch-2.17 [RH git: f3aee3f437]
    Commit list:
    a77ad9693c dpif-netdev: Refactor AVX512 runtime checks. (#2100393)

--- Additional comment from David Hill on 2022-06-29 15:40:13 UTC ---

The ovs binaries are affected too:

[root@undercloud-0-rhosp17 ~]# ovs-vsctl show
Illegal instruction (core dumped)

--- Additional comment from David Hill on 2022-06-29 16:04:29 UTC ---

I got this issue with

2022-06-29T10:32:47-0400 SUBDEBUG Installed: openvswitch2.17-2.17.0-18.el9fdp.x86_64


flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 cx16 sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cpuid_fault pti ibrs ibpb tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt arat


and installing the brew package -25 solved this issue.

Comment 2 OVN Bot 2022-07-02 04:05:26 UTC
This issue is fixed in ovn-2021-21.12.0-82.el8fdp

Comment 3 OVN Bot 2022-07-02 04:05:33 UTC
This issue has been cloned at https://bugzilla.redhat.com/show_bug.cgi?id=2103307

Comment 4 OVN Bot 2022-07-02 04:05:59 UTC
This issue has been cloned at https://bugzilla.redhat.com/show_bug.cgi?id=2103310

Comment 5 OVN Bot 2022-07-02 04:06:07 UTC
This issue has been cloned at https://bugzilla.redhat.com/show_bug.cgi?id=2103311

Comment 6 OVN Bot 2022-07-02 04:06:30 UTC
This issue has been cloned at https://bugzilla.redhat.com/show_bug.cgi?id=2103314

Comment 7 OVN Bot 2022-07-02 04:06:37 UTC
This issue has been cloned at https://bugzilla.redhat.com/show_bug.cgi?id=2103315

Comment 10 Jianlin Shi 2022-07-13 07:57:08 UTC
Failed to reproduce the failure on ovn-2021-21.12.0-46:

[zuul@centos-8-stream-rax-iad-0030376224 ovn-2021-82]$ lscpu 
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  1
Core(s) per socket:  1
Socket(s):           8
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               45
Model name:          Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Stepping:            7
CPU MHz:             2593.863
BogoMIPS:            5187.50
Hypervisor vendor:   Xen
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            20480K
NUMA node0 CPU(s):   0-7
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush acpi mmx fxsr sse sse2 syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl cpuid pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx hypervisor lahf_lm cpuid_fault pti ssbd ibrs ibpb stibp xsaveopt md_clear flush_l1d

[zuul@centos-8-stream-rax-iad-0030376224 bz2102618]$ sudo rpm -Uvh * --nodeps --force
Verifying...                          ################################# [100%]
Preparing...                          ################################# [100%]                        
Updating / installing...                                                                              
   1:ovn-2021-21.12.0-46.el8fdp       ################################# [ 25%]                        
Unit ovn-northd.service could not be found.
   2:ovn-2021-central-21.12.0-46.el8fd################################# [ 50%]                        
Unit ovn-controller.service could not be found.                                                       
   3:ovn-2021-host-21.12.0-46.el8fdp  ################################# [ 75%]
Cleaning up / removing...                                                                             
   4:ovn-2021-21.12.0-46.el8s         ################################# [100%]
[zuul@centos-8-stream-rax-iad-0030376224 bz2102618]$ ovn-nbctl --version
ovn-nbctl 21.12.2                                  
Open vSwitch Library 2.17.1                                                                           
DB Schema 5.35.1

The failure didn't occur on ovn-2021-21.12.0-82 either:

[zuul@centos-8-stream-rax-iad-0030376224 ovn-2021-82]$ rpm -qa | grep -E "openvswitch|ovn-2021"
rdo-openvswitch-2.12-1.el8.noarch
ovn-2021-central-21.12.0-82.el8fdp.x86_64
openvswitch-2.12.0-1.1.el8.x86_64
ovn-2021-21.12.0-82.el8fdp.x86_64
ovn-2021-host-21.12.0-82.el8fdp.x86_64
network-scripts-openvswitch-2.12.0-1.1.el8.x86_64
centos-release-nfv-openvswitch-1-3.el8.noarch
[zuul@centos-8-stream-rax-iad-0030376224 ovn-2021-82]$ ovn-nbctl --version
ovn-nbctl 21.12.3
Open vSwitch Library 2.17.3
DB Schema 5.35.1

Comment 12 errata-xmlrpc 2022-08-01 14:11:03 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (ovn-2021 bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:5787

