Bug 2024347

Summary: glibc: Optional sched_getcpu acceleration using rseq
Product: Red Hat Enterprise Linux 9
Reporter: Jeremy Linton (ARM) <jlinton>
Component: glibc
Assignee: Florian Weimer <fweimer>
Status: CLOSED ERRATA
QA Contact: Sergey Kolosov <skolosov>
Severity: unspecified
Docs Contact: mtimar
Priority: unspecified
Version: 9.0
CC: ashankar, codonell, dj, fweimer, gfialova, jeremy.linton, jvaldez, mnewsome, mtimar, pfrankli, sipoyare
Target Milestone: rc
Keywords: FutureFeature, Patch, Triaged
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: glibc-2.34-19.el9
Doc Type: Enhancement
Doc Text:
.`sched_getcpu` implementation can now optionally use `rseq` (restartable sequences) to improve performance on the 64-bit ARM architectures and other architectures

The previous implementation of `sched_getcpu` on the 64-bit ARM architectures used the `getcpu` system call, which is too slow for efficient use in most parallel algorithms. Other architectures use vDSO (virtual dynamic shared object) acceleration to work around this. Implementing `sched_getcpu` using `rseq` greatly improves performance on the 64-bit ARM architectures; other architectures see a slight improvement.

To configure `sched_getcpu` to use `rseq`, set the `GLIBC_TUNABLES=glibc.pthread.rseq=1` environment variable:

----
# GLIBC_TUNABLES=glibc.pthread.rseq=1
# export GLIBC_TUNABLES
----
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-05-17 15:48:51 UTC
Type: Enhancement
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 2030872
Bug Blocks: 1877135

Description Jeremy Linton (ARM) 2021-11-17 21:57:51 UTC
Description of problem: Some applications (mysql, for example) rely heavily on sched_getcpu(), which on arm traps to a full-blown syscall rather than relying on the vDSO or some other fairly fast mechanism. This causes performance problems not present on other architectures.
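
To illustrate the pattern at issue (not code from this report), here is a minimal hypothetical sketch of a per-CPU sharded counter that calls sched_getcpu() on every update; without vDSO or rseq acceleration, each update pays a full syscall on arm:

----
/* Hypothetical sketch only: a per-CPU sharded counter of the kind such
   applications use.  Each update calls sched_getcpu(); when that call
   falls back to the getcpu system call, the hot path pays a syscall
   per update. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdatomic.h>

#define MAX_CPUS 256

static _Atomic unsigned long shard[MAX_CPUS];

void count_event (void)
{
  int cpu = sched_getcpu ();            /* syscall on arm without rseq/vDSO */
  if (cpu < 0 || cpu >= MAX_CPUS)
    cpu = 0;                            /* fall back to a shared shard */
  atomic_fetch_add_explicit (&shard[cpu], 1, memory_order_relaxed);
}
----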

Following conversations between Arm and Red Hat glibc developers, the possibility was raised of fixing rseq() for use in this case and backporting a fix.


Version-Release number of selected component (if applicable): glibc-2.34


Expected results:

Lower overhead in the class of applications that depend on sched_getcpu().

Comment 1 Florian Weimer 2021-11-18 10:18:56 UTC
I've started an upstream discussion:

Bringing rseq back into glibc
https://sourceware.org/pipermail/libc-alpha/2021-November/133221.html

Comment 2 Florian Weimer 2021-12-06 20:41:00 UTC
I've posted patches:

[PATCH 0/5] Extensible rseq support for glibc
https://sourceware.org/pipermail/libc-alpha/2021-December/133656.html

Comment 3 Florian Weimer 2021-12-09 23:14:28 UTC
The upstream patches have been integrated, and I believe we have fixed the regression.

Jeremy, would you be able to arrange for performance tests once we have a test build?

We are not quite ready to backport this because we need to teach valgrind about rseq first (bug 2030872).

Comment 8 Florian Weimer 2022-01-14 19:53:56 UTC
valgrind is fixed, but criu is not, so this has to be opt-in for now, using the glibc.pthread.rseq tunable.

Comment 16 Jeremy Linton 2022-04-25 15:17:31 UTC
Yes, I will test it again; I had a testing setup over Christmas.

Comment 17 mtimar 2022-05-11 11:08:13 UTC
Hi Jeremy,
sorry to bother you, any luck with the testing?
Thanks, Matej

Comment 18 Jeremy Linton 2022-05-11 17:00:25 UTC
Yah, I'm about to post a small benchmark set here. I've been running various sysbench/etc. things over the past couple of days on an Ampere Altra; the Graviton plan hasn't managed to pan out yet (still in progress). Right now the general OLTP results show a small uplift, but I'm now running the exact tests that the AST team used last year, so the results should be more noticeable. I'm planning on calling this done in the next ~day.

Comment 19 Jeremy Linton 2022-05-12 00:30:46 UTC
Well, I guess I continue to fail to identify that peak 20%+ uplift in memory/mysql/OLTP-style workloads with this patch applied. I think the general OLTP uplift with the mysql-specific patch was something like 3% in some tests, and I can see that, or at least something along those lines, since I'm not sure my tests are repeatable enough that 2-3% can be attributed to this patch rather than lucky scheduling/memory placement/whatever. Part of the problem may be some difference between mariadb as shipped with RHEL, which I'm using, and the actual mysql originally used in the test environment. For sure there are system configuration/innodb tuning parameter differences that I can't reconcile, simply because the code bases have diverged around numa tuning/etc.

I can keep banging on it, but really the core request was to fix sched_getcpu(), for which I can report that with a hand-rolled "am I still on the same CPU" loop the uplift is a pedestrian ~47X on an Altra. LoL.

AKA, that much uplift can't help but show up in all sorts of places.
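
For reference, a minimal sketch of the kind of "am I still on the same CPU" loop described above; the actual harness was not posted here, so the loop count and output format are illustrative only:

----
/* Sketch of a hand-rolled "am I still on the same CPU" microbenchmark.
   Run once with default settings and once with
   GLIBC_TUNABLES=glibc.pthread.rseq=1 to compare per-call cost. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <time.h>

int main (void)
{
  const long iterations = 10 * 1000 * 1000;
  long migrations = 0;
  int last = sched_getcpu ();

  struct timespec start, end;
  clock_gettime (CLOCK_MONOTONIC, &start);
  for (long i = 0; i < iterations; i++)
    {
      int cpu = sched_getcpu ();        /* the call being measured */
      if (cpu != last)
        {
          migrations++;
          last = cpu;
        }
    }
  clock_gettime (CLOCK_MONOTONIC, &end);

  double ns = (end.tv_sec - start.tv_sec) * 1e9
              + (end.tv_nsec - start.tv_nsec);
  printf ("%.1f ns/call, %ld migrations\n", ns / iterations, migrations);
  return 0;
}
----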

So, for purposes of closing this, I think the answer is overwhelmingly that it's been fixed. OTOH, I'm not sure that I can say with confidence that there is a double-digit % mariadb/OLTP uplift because of it, at this point.

Comment 20 Florian Weimer 2022-05-13 09:26:57 UTC
Thanks. Was the mysql-specific patch inlining the rseq access, by chance? Avoiding the sched_getcpu function call overhead?

We may have to export the GLIBC_2.35 symbols for interoperability purposes once we turn on rseq by default, and if we do that, mysql could switch back to inlining rseq access.
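
As a sketch of what inlining the rseq access could look like, assuming a glibc that exports the GLIBC_2.35 rseq symbols (__rseq_offset and __rseq_size from <sys/rseq.h>) and a compiler that provides __builtin_thread_pointer() for the target; both are assumptions here, not something this build guarantees:

----
/* Sketch only: read the current CPU directly from the rseq area that
   glibc registered for this thread, avoiding the sched_getcpu() call.
   Assumes the GLIBC_2.35 rseq symbols and __builtin_thread_pointer(). */
#define _GNU_SOURCE
#include <sched.h>
#include <sys/rseq.h>

static inline int current_cpu (void)
{
  if (__rseq_size > 0)                  /* rseq registration succeeded */
    {
      struct rseq *rs = (struct rseq *)
        ((char *) __builtin_thread_pointer () + __rseq_offset);
      int cpu = (int) __atomic_load_n (&rs->cpu_id, __ATOMIC_RELAXED);
      if (cpu >= 0)                     /* negative values mean unregistered */
        return cpu;
    }
  return sched_getcpu ();               /* fallback: regular function call */
}
----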

Comment 23 errata-xmlrpc 2022-05-17 15:48:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (new packages: glibc), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:3917

Comment 24 Jeremy Linton 2022-05-23 15:10:10 UTC
Yah, just as an FYI, the original testing was with a custom mysql patch, but it was also using a very custom test setup, using stored procedures and backing the DB with RAM, etc. So there are lots of variables that individually could be affecting it.