Bug 1936540 - Leverage BPF for more efficient vcpu.<n>.wait
Summary: Leverage BPF for more efficient vcpu.<n>.wait
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: libvirt
Version: 9.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: low
Target Milestone: beta
: ---
Assignee: Pavel Hrdina
QA Contact: Luyao Huang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-03-08 17:38 UTC by Fabian Deutsch
Modified: 2023-05-04 12:25 UTC
CC: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-05-04 12:25:00 UTC
Type: Feature Request
Target Upstream Version:
Embargoed:
pm-rhel: mirror+



Description Fabian Deutsch 2021-03-08 17:38:51 UTC
Description of problem:
For a customer we need to expose the vcpu.<n>.wait metric:
 virtual cpu time spent by virtual CPU <num> waiting on I/O (in microseconds)

This requires running # echo 1 > /proc/sys/kernel/sched_schedstats, which adds a scheduling penalty.
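For context, once sched_schedstats is enabled, the counter behind this metric can be read from the per-thread schedstat file for each vCPU thread; as I understand it, that is where the per-vCPU wait time ultimately comes from. A minimal sketch (plain Python with hypothetical helper names; libvirt itself is C):

```python
def parse_schedstat(text: str) -> dict:
    """Parse one /proc/<pid>/task/<tid>/schedstat line.

    The file holds three space-separated counters: time spent
    on the CPU (ns), time spent waiting on a runqueue (ns),
    and the number of timeslices run on this CPU.
    """
    on_cpu_ns, wait_ns, timeslices = (int(f) for f in text.split())
    return {
        "on_cpu_ns": on_cpu_ns,
        "wait_ns": wait_ns,       # basis of the vcpu.<n>.wait metric
        "timeslices": timeslices,
    }

def vcpu_wait_us(schedstat_line: str) -> int:
    """Convert the wait counter to microseconds, the unit the metric reports."""
    return parse_schedstat(schedstat_line)["wait_ns"] // 1000

# On a live system one would read, per vCPU thread:
#   open(f"/proc/{qemu_pid}/task/{vcpu_tid}/schedstat").read()
# The wait field is only populated when sched_schedstats is enabled.
```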

In
http://www.brendangregg.com/blog/2015-02-26/linux-perf-off-cpu-flame-graph.html
it is recommended (update 2) to use BPF to make this much leaner:

Update 2: Since Linux 4.6 you can use BPF to do this much more efficiently: aggregating the stacks in-kernel context (using the BPF stack trace feature in 4.6), and only passing the summary to user-level. I developed the offcputime tool bcc to do this. I also wrote a post about it, Off-CPU eBPF Flame Graph, although that was written before Linux 4.6's stack trace support, so I was unwinding stacks the hard way.

Can libvirt do the same?
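The "aggregate in-kernel, pass only the summary" idea from the quoted Update 2 can be illustrated with a toy model (plain Python, not actual BPF; the event tuples are invented for illustration): an off-CPU profiler folds each sched-out/sched-in pair into a per-stack running total, so only the small summary map, rather than one record per context switch, crosses into user space.

```python
from collections import defaultdict

def aggregate_offcpu(events):
    """Toy model of in-kernel off-CPU aggregation.

    events: iterable of (tid, timestamp_ns, state, stack), where
    state is "out" (thread scheduled off CPU, stack captured) or
    "in" (thread back on CPU).
    Returns a map from stack trace to total off-CPU nanoseconds.
    """
    sched_out = {}             # tid -> (timestamp, stack at sched-out)
    totals = defaultdict(int)  # stack -> accumulated off-CPU ns
    for tid, ts, state, stack in events:
        if state == "out":
            sched_out[tid] = (ts, stack)
        elif tid in sched_out:
            t0, out_stack = sched_out.pop(tid)
            totals[out_stack] += ts - t0
    return dict(totals)
```

A real implementation (e.g. bcc's offcputime) does this folding inside the kernel with BPF maps and stack-trace IDs; only the final totals are copied out.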

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Pavel Hrdina 2021-05-05 09:47:03 UTC
Hi Fabian,

I was reading the link you mentioned and investigating a bit more about this BZ, but it's still not clear to me what the BPF program should look like. Can you please provide more details here in the BZ?

Thanks,

Pavel

Comment 3 Fabian Deutsch 2021-05-05 10:47:11 UTC
Hey Pavel,

I don't have all the information myself, but IIUIC the current way of collecting the stats adds a measurable penalty (3-5% IIRC).

By using eBPF this overhead can be reduced. The article above indirectly mentions https://github.com/iovisor/bcc/blob/master/tools/runqlat.py and http://www.brendangregg.com/blog/2016-01-20/ebpf-offcpu-flame-graph.html, which explain how eBPF can be used to collect several metrics, including cpu_wait (see the GitHub link earlier in this sentence).
The idea would be that libvirt uses such an eBPF program to provide the metric.

I don't know exactly what the eBPF program should look like, though.

Comment 6 Itamar Holder 2021-09-01 11:40:41 UTC
Hey all,

It seems that since eBPF stack support arrived in Linux 4.8,
BCC has had a tool called "offcputime" that IIUC does exactly what we're looking for.

See their implementation here: https://github.com/iovisor/bcc/blob/master/tools/runqlat.py

WDYT?

Comment 15 RHEL Program Management 2022-09-08 07:27:46 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

Comment 16 Fabian Deutsch 2023-03-24 11:14:10 UTC
Re-opening: we need to improve this in order to get these stats by default without impacting performance.
The reason for gathering these stats is to allow the infra side to determine when guests are not served properly.

Comment 18 Marcelo Tosatti 2023-03-31 14:31:10 UTC
Fabian,

He (Brendan Gregg, in the blog post linked above) is talking about the overhead of recording call graphs at the tracepoints:

perf record -e sched:sched_stat_sleep -e sched:sched_switch -e sched:sched_process_exit -a -g -o perf.data.raw sleep 1

       -g
           Enables call-graph (stack chain/backtrace) recording for both kernel space and user space.

Then he mentions that:

"Warning: scheduler events can be very frequent, and the overhead of dumping them to the file system (perf.data) may be prohibitive in production environments. Test carefully. This is also why I put a "sleep 1" in the perf record (the dummy command that sets the duration), to start with a small amount of trace data. If I had to do this in production, I'd consider other tools that could summarize data in-kernel to reduce overhead, including perf_events once it supports more in-kernel programming (eBPF).

Update 2: Since Linux 4.6 you can use BPF to do this much more efficiently: aggregating the stacks in-kernel context (using the BPF stack trace feature in 4.6), and only passing the summary to user-level. I developed the offcputime tool bcc to do this. I also wrote a post about it, Off-CPU eBPF Flame Graph, although that was written before Linux 4.6's stack trace support, so I was unwinding stacks the hard way."

The overhead of SCHEDSTATS is increasing counters on particular locations in the scheduler:

#define   schedstat_enabled()           static_branch_unlikely(&sched_schedstats)
#define __schedstat_inc(var)            do { var++; } while (0)
#define   schedstat_inc(var)            do { if (schedstat_enabled()) { var++; } } while (0)
#define __schedstat_add(var, amt)       do { var += (amt); } while (0)
#define   schedstat_add(var, amt)       do { if (schedstat_enabled()) { var += (amt); } } while (0)
#define __schedstat_set(var, val)       do { var = (val); } while (0)
#define   schedstat_set(var, val)       do { if (schedstat_enabled()) { var = (val); } } while (0)
#define   schedstat_val(var)            (var)
#define   schedstat_val_or_zero(var)    ((schedstat_enabled()) ? (var) : 0)

For example, when adding/removing tasks to the runqueue. This would be significant with a very high number
of tasks and very frequent scheduler activity. Even then, the stats (only incrementing or setting integers
in memory) are a small portion of that.

So for schedstats overhead to be noticeable, you need a workload for which scheduling activities
are a significant portion of runtime; but in the case of VMs, you'd like to maximize the amount of
time a process spends in guest mode (therefore scheduling should not be involved).

On the BZ you mention:
"I don't have all the informations myself, but IIUIC then the current way of collecting the stats is adding a measureable penalty (3-5% iirc)."

Do you have more details on those overhead numbers?

Comment 19 Fabian Deutsch 2023-05-03 12:00:05 UTC
Marcelo, thanks a lot for looking into this.

And thanks for the additional details.

> Do you have more details for that overhead numbers?

No, but IIUIC, in the context of sched_schedstats it was mentioned that the performance impact is the reason for keeping it off by default.

Now, how can we measure the impact?
It would be nice to run this by the CNV Perf Team, in order to understand the impact.

Comment 20 Marcelo Tosatti 2023-05-03 12:53:35 UTC
(In reply to Fabian Deutsch from comment #0)
> Description of problem:
> For a customer we need to expose the vcpu.<n>.wait metric:
>  virtual cpu time spent by virtual CPU <num> waiting on I/O (in microseconds)
> 
> This requires # echo 1 > /proc/sys/kernel/sched_schedstats which is adding a
> scheduling penalty.
> 
> In
> http://www.brendangregg.com/blog/2015-02-26/linux-perf-off-cpu-flame-graph.
> html
> it is recommended (update 2) to use BPF to make this much leaner:
> 
> Update 2: Since Linux 4.6 you can use BPF to do this much more efficiently:
> aggregating the stacks in-kernel context (using the BPF stack trace feature
> in 4.6), and only passing the summary to user-level. I developed the
> offcputime tool bcc to do this. I also wrote a post about it, Off-CPU eBPF
> Flame Graph, although that was written before Linux 4.6's stack trace
> support, so I was unwinding stacks the hard way.
> 
> Can libvirt do the same?

(In reply to Fabian Deutsch from comment #19)
> Marcello, thanks a lot for looking into this.
> 
> And thanks for the additional details.
> 
> > Do you have more details for that overhead numbers?
> 
> No, but iiuic in the context of sched_schedstats it was mentioned that the
> performance impact is the reason for keeping it off by default.
> 
> Now, how can we measure the impact?
> It would be nice to run this by the CNV Perf Team, in order to understand
> the impact.

Use some benchmark which performs heavy sched switching;
lmbench's lat_pipe benchmark, for example, does
a lot of context switches:

With schedstats disabled:

$ bin/x86_64-linux-gnu/lat_pipe -P 2 -N 5
Pipe latency: 4.8317 microseconds
$ bin/x86_64-linux-gnu/lat_pipe -P 2 -N 5
Pipe latency: 4.9764 microseconds
$ bin/x86_64-linux-gnu/lat_pipe -P 2 -N 5
Pipe latency: 5.0727 microseconds

$ echo 1 > /proc/sys/kernel/sched_schedstats
$ bin/x86_64-linux-gnu/lat_pipe -P 2 -N 5
Pipe latency: 5.1501 microseconds
$ bin/x86_64-linux-gnu/lat_pipe -P 2 -N 5 
Pipe latency: 4.1404 microseconds
$ bin/x86_64-linux-gnu/lat_pipe -P 2 -N 5 
Pipe latency: 4.4736 microseconds
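For what it's worth, summing the three runs of each configuration (the arithmetic done later in the thread) actually shows the schedstats-enabled runs as faster here, which suggests run-to-run noise dominates at this sample size. A quick check in plain Python:

```python
# lat_pipe latencies in microseconds, three runs per configuration (from above)
off = [4.8317, 4.9764, 5.0727]   # sched_schedstats = 0
on  = [5.1501, 4.1404, 4.4736]   # sched_schedstats = 1

# A ratio below 1.0 means the schedstats-enabled runs measured faster.
ratio = sum(on) / sum(off)
print(f"{sum(on):.4f} / {sum(off):.4f} = {ratio:.5f}")
```

With only three short runs per configuration, a difference this size is well within the benchmark's own variance, so no overhead conclusion can be drawn either way.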

Comment 21 Fabian Deutsch 2023-05-03 13:05:33 UTC
Looks like it is an 8% hit in this benchmark.
But as you said, it's on a ctx switch heavy benchmark.

Jenifer, can we create a task to run some workload benchmark against nodes with 0 and 1 in /proc/sys/kernel/sched_schedstats?
In order to get more use-case-specific feedback.

Comment 22 Marcelo Tosatti 2023-05-03 17:42:03 UTC
(In reply to Fabian Deutsch from comment #21)
> Looks like it is an 8% hit in this benchmark.

I don't see that (in fact, the results are better when enabling schedstats).

> But as you said, it's on a ctx switch heavy benchmark.
> 
> Jenifer, can we create a task to run some workload benchmark against nodes
> with 0 and 1 in /proc/sys/kernel/sched_schedstats?
> In order to get more use-case-specific feedback.

Comment 24 Fabian Deutsch 2023-05-03 20:03:57 UTC
Marcelo, thanks.

13.7641 / 14.8808 = 0.92496

Thus indeed, it is faster.
However, Jenifer's results show a different impact.

Comment 26 Fabian Deutsch 2023-05-04 12:25:00 UTC
Marcelo and I had an offline discussion.

As stated above, enabling schedstats has a measurable impact (5%, as noted above by Jenifer) on heavy context-switching synthetic benchmarks.
VM workloads are not expected to be as context-switch heavy.

Because the performance impact is on the lower end, because VM workloads are not context-switch heavy, and because no customer has complained yet, I'm closing this bug.

Please reopen if you disagree, or we get evidence that this area should be improved.

