Bug 2173754 - openvswitch provokes kernel warnings [NEEDINFO]
Summary: openvswitch provokes kernel warnings
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: kernel
Version: CentOS Stream
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Michael Petlan
QA Contact: Qiao Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-02-27 21:17 UTC by Colin Walters
Modified: 2023-07-24 08:26 UTC (History)
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug
Target Upstream Version:
Embargoed:
zsun: needinfo? (walters)
zsun: needinfo? (ralongi)
mpetlan: needinfo? (walters)




Links
Red Hat Issue Tracker RHELPLAN-149962 (last updated 2023-02-27 21:18:09 UTC)

Description Colin Walters 2023-02-27 21:17:23 UTC
Description of problem:

systemctl start ovsdb-server gives a ton of kernel warnings.  Digging in a bit, I was very surprised to discover that ovs uses the `perf_event_open` system call, as seen in `perf trace`:

  2873.260 (13.631 ms): ovsdb-server/2354 perf_event_open(attr_uptr: 0x7ffd1c915ab0, cpu: -1, group_fd: -1)     = 16
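
For reference, the call can be observed live with the same tool (a sketch; it assumes the perf package is installed and both commands run as root):

```
# terminal 1: watch perf_event_open calls system-wide
perf trace -a -e perf_event_open

# terminal 2: restart the daemon; the ovsdb-server call should appear in terminal 1
systemctl restart ovsdb-server
```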

Traces look like this:

```
[ 1015.975610] Call Trace:
[ 1015.975611]  <TASK>
[ 1015.975611]  perf_event_context_sched_out+0x1bf/0x460
[ 1015.975613]  __perf_event_task_sched_out+0x50/0x170
[ 1015.975614]  ? put_prev_task_fair+0x1b/0x30
[ 1015.975617]  ? pick_next_task+0x851/0x940
[ 1015.975617]  prepare_task_switch+0xbd/0x2a0
[ 1015.975619]  __schedule+0x1cb/0x620
[ 1015.975621]  schedule+0x5a/0xc0
[ 1015.975622]  schedule_hrtimeout_range_clock+0xbf/0x1b0
[ 1015.975623]  ? __hrtimer_init+0xc0/0xc0
[ 1015.975624]  do_poll.constprop.0+0x225/0x390
[ 1015.975625]  do_sys_poll+0x1c8/0x260
[ 1015.975627]  ? entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 1015.975628]  ? __x64_sys_accept+0x17/0x20
[ 1015.975628]  ? xas_load+0x5/0xa0
[ 1015.975629]  ? xa_load+0x70/0xb0
[ 1015.975630]  ? xas_load+0x5/0xa0
[ 1015.975631]  ? xa_load+0x70/0xb0
[ 1015.975632]  ? memcg_list_lru_alloc+0xa8/0x350
[ 1015.975633]  ? __pollwait+0xe0/0xe0
[ 1015.975634]  ? __pollwait+0xe0/0xe0
[ 1015.975635]  ? __pollwait+0xe0/0xe0
[ 1015.975636]  ? __skb_recv_datagram+0x85/0xc0
[ 1015.975637]  ? skb_recv_datagram+0x3b/0x60
[ 1015.975638]  ? unix_accept+0x93/0x180
[ 1015.975639]  ? rcu_nocb_try_bypass+0x4d/0x440
[ 1015.975640]  ? task_work_add+0x79/0x90
[ 1015.975642]  ? __inode_wait_for_writeback+0x7e/0xf0
[ 1015.975643]  ? rcu_nocb_try_bypass+0x4d/0x440
[ 1015.975644]  ? rseq_get_rseq_cs.isra.0+0x16/0x220
[ 1015.975645]  ? rseq_ip_fixup+0x72/0x1a0
[ 1015.975645]  ? call_rcu+0x114/0x320
[ 1015.975646]  ? kvm_clock_get_cycles+0x14/0x30
[ 1015.975647]  ? ktime_get_ts64+0x49/0xf0
[ 1015.975648]  __x64_sys_poll+0xa6/0x140
[ 1015.975649]  do_syscall_64+0x59/0x90
[ 1015.975650]  ? do_syscall_64+0x69/0x90
[ 1015.975651]  ? do_syscall_64+0x69/0x90
[ 1015.975652]  ? exit_to_user_mode_prepare+0xec/0x100
[ 1015.975653]  ? syscall_exit_to_user_mode+0x12/0x30
[ 1015.975654]  ? do_syscall_64+0x69/0x90
[ 1015.975655]  ? syscall_exit_to_user_mode+0x12/0x30
[ 1015.975656]  ? do_syscall_64+0x69/0x90
[ 1015.975657]  ? do_syscall_64+0x69/0x90
[ 1015.975657]  ? sysvec_apic_timer_interrupt+0x3c/0x90
[ 1015.975659]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 1015.975659] RIP: 0033:0x7f8014ac6937
[ 1015.975660] Code: 00 00 00 5b 49 8b 45 10 5d 41 5c 41 5d 41 5e c3 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 07 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
[ 1015.975660] RSP: 002b:00007ffd1c915a08 EFLAGS: 00000246 ORIG_RAX: 0000000000000007
[ 1015.975661] RAX: ffffffffffffffda RBX: 00000000000f8e31 RCX: 00007f8014ac6937
[ 1015.975662] RDX: 00000000000009c4 RSI: 0000000000000003 RDI: 000055d929cad760
[ 1015.975662] RBP: 0000000000000001 R08: 0000016310adb0be R09: 7fffffffffffffff
[ 1015.975663] R10: 00007ffd1c99d080 R11: 0000000000000246 R12: 0000000000000003
[ 1015.975663] R13: 000055d929cad760 R14: 0000000000000000 R15: 00000000000009c4
[ 1015.975664]  </TASK>
```

Version-Release number of selected component (if applicable):

[root@cosa-devsh ~]# rpm -q kernel openvswitch3.1
kernel-5.14.0-282.el9.x86_64
openvswitch3.1-3.1.0-2.el9fdp.x86_64
[root@cosa-devsh ~]#

How reproducible: Always

Steps to Reproduce:
1. Install above packages
2. systemctl start ovsdb-server
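
A consolidated form of these steps (a sketch; it assumes the FDP repo that ships openvswitch3.1 is already enabled on the host):

```
# sketch: install, trigger, then check the kernel log for the trace
dnf install -y openvswitch3.1
systemctl start ovsdb-server
dmesg | grep -A 40 'Call Trace'
```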

Comment 1 Colin Walters 2023-02-27 21:25:14 UTC
It seems that openvswitch2.17 also reproduces the same warnings, so this apparently isn't new behavior for openvswitch.

Comment 2 Dan Williams 2023-02-27 21:45:13 UTC
(In reply to Colin Walters from comment #1)
> It seems that openvswitch2.17 also reproduces the same warnings, so this
> apparently isn't new behavior for openvswitch.

OVS 2.17 on RHEL9, or OVS 2.17 on RHEL8?

Comment 3 Colin Walters 2023-02-28 00:52:09 UTC
> OVS 2.17 on RHEL9,

This.

Also, this isn't a very recent issue, since it also reproduces with 5.14.0-252.el9.x86_64 from Feb 01 (with ovs 3.1).

Comment 4 Colin Walters 2023-02-28 01:02:53 UTC
But this is new in C9S/9.2; it doesn't reproduce with 5.14.0-70.13.1.el9_0.x86_64 (same userspace).

Comment 5 Flavio Leitner 2023-02-28 14:44:44 UTC
The call trace in the bug seems different from the one reported upstream:
https://github.com/openshift/os/issues/1179

Feb 27 15:05:36.752873 ovs-ctl[1333]: /etc/openvswitch/conf.db does not exist ... (warning).
Feb 27 15:05:36.783800 ovs-ctl[1333]: Creating empty database /etc/openvswitch/conf.db.
Feb 27 15:05:36.810769 ovs-ctl[1400]: 2023-02-27T20:05:36Z|00001|dns_resolve|WARN|Failed to read /etc/resolv.conf: No such file or directory
Feb 27 15:05:36.810863 ovsdb-server[1400]: ovs|00001|dns_resolve|WARN|Failed to read /etc/resolv.conf: No such file or directory
Feb 27 15:05:36.814311 kernel: Call Trace:
Feb 27 15:05:36.814334 kernel:  <TASK>
Feb 27 15:05:36.814339 kernel:  __perf_install_in_context+0x16c/0x220
Feb 27 15:05:36.814347 kernel:  remote_function+0x44/0x50
Feb 27 15:05:36.814351 kernel:  generic_exec_single+0x75/0xb0
Feb 27 15:05:36.814354 kernel:  smp_call_function_single+0xeb/0x130
Feb 27 15:05:36.814357 kernel:  ? sw_perf_event_destroy+0x60/0x60
Feb 27 15:05:36.814361 kernel:  perf_install_in_context+0xcf/0x200
Feb 27 15:05:36.814367 kernel:  ? ctx_resched+0xe0/0xe0
Feb 27 15:05:36.814370 kernel:  __do_sys_perf_event_open+0xa99/0xe30
Feb 27 15:05:36.814374 kernel:  do_syscall_64+0x59/0x90
Feb 27 15:05:36.814377 kernel:  ? do_user_addr_fault+0x1d8/0x690
Feb 27 15:05:36.814381 kernel:  ? exc_page_fault+0x62/0x150
Feb 27 15:05:36.814385 kernel:  entry_SYSCALL_64_after_hwframe+0x63/0xcd
Feb 27 15:05:36.814388 kernel: RIP: 0033:0x7f5b66e3ee5d
Feb 27 15:05:36.814392 kernel: Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 af 1b 00 f7 d8 64 89 01 48
Feb 27 15:05:36.814397 kernel: RSP: 002b:00007ffe7f8fe6a8 EFLAGS: 00000246 ORIG_RAX: 000000000000012a
Feb 27 15:05:36.814401 kernel: RAX: ffffffffffffffda RBX: 00007ffe7f8fe860 RCX: 00007f5b66e3ee5d
Feb 27 15:05:36.814404 kernel: RDX: 00000000ffffffff RSI: 0000000000000000 RDI: 00007ffe7f8fe6b0
Feb 27 15:05:36.814408 kernel: RBP: 0000558b0ba7bb60 R08: 0000000000000000 R09: 0000008000000000
Feb 27 15:05:36.814412 kernel: R10: 00000000ffffffff R11: 0000000000000246 R12: 0000000000000000
Feb 27 15:05:36.814415 kernel: R13: 0000558b0ae8a260 R14: 0000558b0ba7ef60 R15: 0000000000000000
Feb 27 15:05:36.814419 kernel:  </TASK>


That seems to have come from:
https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/blob/main/kernel/events/core.c#L2757

Comment 6 Colin Walters 2023-02-28 15:59:38 UTC
Skimming through MRs, I see https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1540 AKA https://bugzilla.redhat.com/show_bug.cgi?id=2123229 which seems likely to be relevant here.

Comment 7 Rick Alongi 2023-02-28 20:31:21 UTC
I previously ran some tests using openvswitch3.1-3.1.0-0.2.el9fdp.x86_64.rpm with RHEL-9.2.0-20230209.13 (kernel: 5.14.0-252.el9.x86_64) and did not detect any call trace.  The tests don't start ovsdb-server.service individually as it is started as part of openvswitch.service.

However, I just set up a test bed running openvswitch3.1-3.1.0-0.2.el9fdp.x86_64.rpm with RHEL-9.2.0-20230227.0 (kernel: 5.14.0-282.el9.x86_64), stopped openvswitch.service, then ran "systemctl start ovsdb-server".  There were no issues reported via the console or dmesg.  Details are below.

Is there a step that I am missing?  

Thanks,
Rick

[root@netqe25 ~]# rpm -q kernel openvswitch3.1
kernel-5.14.0-282.el9.x86_64
openvswitch3.1-3.1.0-2.el9fdp.x86_64

[root@netqe25 ~]# systemctl stop openvswitch

[root@netqe25 ~]# systemctl status ovsdb-server
○ ovsdb-server.service - Open vSwitch Database Unit
     Loaded: loaded (/usr/lib/systemd/system/ovsdb-server.service; static)
     Active: inactive (dead) since Tue 2023-02-28 15:27:15 EST; 17s ago
   Duration: 33min 3.657s
    Process: 142249 ExecStartPre=/usr/bin/rm -f /run/openvswitch.useropts (code>
    Process: 142250 ExecStartPre=/usr/bin/chown ${OVS_USER_ID} /run/openvswitch>
    Process: 142251 ExecStartPre=/bin/sh -c /usr/bin/echo "OVS_USER_ID=${OVS_US>
    Process: 142253 ExecStartPre=/bin/sh -c if [ "$${OVS_USER_ID/:*/}" != "root>
    Process: 142255 ExecStart=/usr/share/openvswitch/scripts/ovs-ctl --no-ovs-v>
    Process: 143404 ExecStop=/usr/share/openvswitch/scripts/ovs-ctl --no-ovs-vs>
   Main PID: 142298 (code=exited, status=0/SUCCESS)
        CPU: 202ms

Feb 28 14:54:11 netqe25.knqe.lab.eng.bos.redhat.com ovs-vsctl[142299]: ovs|0000>
Feb 28 14:54:11 netqe25.knqe.lab.eng.bos.redhat.com ovs-vsctl[142304]: ovs|0000>
Feb 28 14:54:11 netqe25.knqe.lab.eng.bos.redhat.com ovs-ctl[142255]: Configurin>
Feb 28 14:54:11 netqe25.knqe.lab.eng.bos.redhat.com ovs-ctl[142255]: Enabling r>
Feb 28 14:54:11 netqe25.knqe.lab.eng.bos.redhat.com systemd[1]: Started Open vS>
Feb 28 14:54:11 netqe25.knqe.lab.eng.bos.redhat.com ovs-vsctl[142312]: ovs|0000>
Feb 28 15:27:14 netqe25.knqe.lab.eng.bos.redhat.com systemd[1]: Stopping Open v>
Feb 28 15:27:14 netqe25.knqe.lab.eng.bos.redhat.com ovs-ctl[143404]: Exiting ov>
Feb 28 15:27:15 netqe25.knqe.lab.eng.bos.redhat.com systemd[1]: ovsdb-server.se>
Feb 28 15:27:15 netqe25.knqe.lab.eng.bos.redhat.com systemd[1]: Stopped Open vS>

[root@netqe25 ~]# dmesg
[  836.688899] systemd-rc-local-generator[141502]: /etc/rc.d/rc.local is not marked executable, skipping.
[  874.357212] No such timeout policy "ovs_test_tp"
[  874.362389] Failed to associated timeout policy `ovs_test_tp'
[ 1119.158565] No such timeout policy "ovs_test_tp"
[ 1119.163737] Failed to associated timeout policy `ovs_test_tp'

[root@netqe25 ~]# systemctl start ovsdb-server

[root@netqe25 ~]# systemctl status ovsdb-server
● ovsdb-server.service - Open vSwitch Database Unit
     Loaded: loaded (/usr/lib/systemd/system/ovsdb-server.service; static)
     Active: active (running) since Tue 2023-02-28 15:27:54 EST; 4s ago
    Process: 143451 ExecStartPre=/usr/bin/rm -f /run/openvswitch.useropts (code>
    Process: 143452 ExecStartPre=/usr/bin/chown ${OVS_USER_ID} /run/openvswitch>
    Process: 143453 ExecStartPre=/bin/sh -c /usr/bin/echo "OVS_USER_ID=${OVS_US>
    Process: 143455 ExecStartPre=/bin/sh -c if [ "$${OVS_USER_ID/:*/}" != "root>
    Process: 143457 ExecStart=/usr/share/openvswitch/scripts/ovs-ctl --no-ovs-v>
   Main PID: 143501 (ovsdb-server)
      Tasks: 1 (limit: 407974)
     Memory: 3.2M
        CPU: 106ms
     CGroup: /system.slice/ovsdb-server.service
             └─143501 ovsdb-server /etc/openvswitch/conf.db -vconsole:emer -vsy>

Feb 28 15:27:54 netqe25.knqe.lab.eng.bos.redhat.com systemd[1]: Starting Open v>
Feb 28 15:27:54 netqe25.knqe.lab.eng.bos.redhat.com ovs-ctl[143457]: Starting o>
Feb 28 15:27:54 netqe25.knqe.lab.eng.bos.redhat.com ovs-vsctl[143502]: ovs|0000>
Feb 28 15:27:54 netqe25.knqe.lab.eng.bos.redhat.com ovs-vsctl[143507]: ovs|0000>
Feb 28 15:27:54 netqe25.knqe.lab.eng.bos.redhat.com ovs-ctl[143457]: Configurin>
Feb 28 15:27:54 netqe25.knqe.lab.eng.bos.redhat.com ovs-ctl[143457]: Enabling r>
Feb 28 15:27:54 netqe25.knqe.lab.eng.bos.redhat.com systemd[1]: Started Open vS>
Feb 28 15:27:54 netqe25.knqe.lab.eng.bos.redhat.com ovs-vsctl[143513]: ovs|0000>

[root@netqe25 ~]# dmesg
[  836.688899] systemd-rc-local-generator[141502]: /etc/rc.d/rc.local is not marked executable, skipping.
[  874.357212] No such timeout policy "ovs_test_tp"
[  874.362389] Failed to associated timeout policy `ovs_test_tp'
[ 1119.158565] No such timeout policy "ovs_test_tp"
[ 1119.163737] Failed to associated timeout policy `ovs_test_tp'
[root@netqe25 ~]#

Comment 13 Michael Petlan 2023-04-18 14:40:37 UTC
I'd need someone familiar with openvswitch to help set up a reproduction environment - a machine in Beaker reserved for investigation, where I could try it out with different kernels, etc.
Could you provide such a box? Thanks.

Comment 14 Flavio Leitner 2023-04-18 14:52:18 UTC
Hi,

The messages from comment#5 (copied below) indicate that there is no OVS configuration yet:
Feb 27 15:05:36.752873 ovs-ctl[1333]: /etc/openvswitch/conf.db does not exist ... (warning).
Feb 27 15:05:36.783800 ovs-ctl[1333]: Creating empty database /etc/openvswitch/conf.db.
[...]


So, all you need to do is install openvswitch3.1 and the openvswitch-selinux-extra-policy from here:
http://download-node-02.eng.bos.redhat.com/rhel-9/nightly/updates/FDP/latest-FDP-9-RHEL-9/compose/Server/x86_64/os/Packages/

and enable/start the service using systemctl:
# systemctl enable openvswitch
# systemctl start openvswitch

The above will start the ovs-vswitchd and ovsdb-server daemons. The latter is the one that reproduces the issue.
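
To catch the warning as it fires, it may help to follow the kernel log in a second terminal while starting the service (a minimal sketch; `dmesg -w` needs a reasonably recent util-linux):

```
# terminal 1: follow the kernel log and show traces as they appear
dmesg -w | grep -A 5 'Call Trace'

# terminal 2: start the services as described above
systemctl enable openvswitch
systemctl start openvswitch
```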
HTH,
fbl

Comment 15 Michael Petlan 2023-07-19 10:32:15 UTC
I have tested it on a VM with kernel-5.14.0-339.el9 and I cannot reproduce the problem. I'll have a look at an older kernel and see whether it depends on the kernel version (and is thus already fixed) or on the hardware...

Yes, it seems that OVS uses perf in some way.

Comment 16 Michael Petlan 2023-07-19 14:29:59 UTC
I have tested RHEL-9.2.0 with kernel-5.14.0-284.11.1.el9_2.x86_64 on bare-metal hardware (Skylake CPU) with openvswitch3.1 and again, no failure.

If you experience issues with any of the later kernels, please help me reproduce it. It might be reproducible only on certain CPU models (the CPU vendor, family, model, and stepping might be important to know).
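
A quick way to gather those identifiers on an affected machine (a sketch using lscpu from util-linux; reading /proc/cpuinfo works equally well):

```
# print the CPU identification fields useful for narrowing down the repro
lscpu | grep -E '^(Vendor ID|CPU family|Model( name)?|Stepping):'
```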

