1652930 – libffi: Incomplete cache flushing after code generation on aarch64 with SELinux enabled


Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 8
Classification:	Red Hat
Component:	libffi
Sub Component:
Version:	8.0
Hardware:	aarch64
OS:	Linux
Priority:	low
Severity:	low
Target Milestone:	rc
Target Release:	8.1
Assignee:	DJ Delorie
QA Contact:	Michal Kolar
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1683831

Reported:	2018-11-23 15:38 UTC by Lukáš Zachar
Modified:	2023-07-18 14:30 UTC (History)
CC List:	12 users (show)
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-02-01 07:30:44 UTC
Type:	Bug
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
core.python2.lz4 (4.33 MB, application/octet-stream) 2018-11-23 15:42 UTC, Lukáš Zachar	no flags	Details
0001-aarch64-Flush-code-alias-mapping-after-creating-clos.patch (1.41 KB, patch) 2019-02-18 19:48 UTC, Florian Weimer	no flags	Details \| Diff
0001-aarch64-Flush-code-alias-mapping-after-creating-clos.patch (3.83 KB, patch) 2019-02-18 20:08 UTC, Florian Weimer	no flags	Details \| Diff

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	libffi libffi issues 470	0	None	closed	Closure creation on aarch64 needs flush data and code mappings	2021-02-02 22:16:57 UTC
Python	36024	0	None	None	None	2019-02-18 16:29:42 UTC
Red Hat Bugzilla	1418019	1	None	None	None	2021-01-20 06:05:38 UTC
Red Hat Bugzilla	1721569	1	None	None	None	2023-07-18 14:30:35 UTC

Internal Links: 1418019 1721569

Description Lukáš Zachar 2018-11-23 15:38:43 UTC

Description of problem:

While running test_ctypes.py, python2 dumps core in test_callbacks (ctypes.test.test_as_parameter.AsParamPropertyWrapperTestCase).

Version-Release number of selected component (if applicable):
python2-2.7.15-16.module+el8+2201+95d9d403.aarch64

How reproducible:
always on specific machine

Steps to Reproduce:
1. reserve machine
2. python2 -m test --verbose test_ctypes.py


Actual results:
test_callbacks (ctypes.test.test_as_parameter.AsParamPropertyWrapperTestCase) ... bash: line 1: 45304 Illegal instruction     python2 -m test --verbose test_ctypes.py < empty


Additional info:
I couldn't get a proper bt; with all debuginfo installed, gdb reported:

Core was generated by `python2 -m test --verbose test_ctypes.py'.
Program terminated with signal SIGILL, Illegal instruction.
#0  0x0000ffff9be90058 in ?? ()
Missing separate debuginfos, use: dnf debuginfo-install bzip2-libs-1.0.6-26.el8.aarch64 glibc-2.28-28.el8.aarch64 libffi-3.1-17.el8.aarch64 openssl-libs-1.1.1-7.el8.aarch64 zlib-1.2.11-10.el8.aarch64
(gdb) bt
#0  0x0000ffff9be90058 in ?? ()
#1  0x0000ffffc0bb95e0 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Comment 2 Lukáš Zachar 2018-11-23 15:42:35 UTC

Created attachment 1508293 [details]
core.python2.lz4

core.python2.1001.33b1244a258841b8895a320e545ad21f.63737.1542980871000000.lz4

Comment 3 Victor Stinner 2019-01-23 13:10:40 UTC

I just got a fresh RHEL8/AArch64 VM from Beaker, but test_ctypes doesn't crash on Python 2.7.

@Lukas Zachar: Can you please check if you are still able to reproduce the issue?

--

Python 2.7 is compiled with --with-system-ffi: the ctypes (_ctypes) module is linked to system libffi:

# ldd $(python2 -c 'import _ctypes; print(_ctypes.__file__)')|grep libffi
	libffi.so.6 => /lib64/libffi.so.6 (0x0000ffff86d70000)
# rpm -qf /lib64/libffi.so.6
libffi-3.1-18.el8.aarch64

test_ctypes passes:

# python2 -m test test_ctypes
Run tests sequentially
0:00:00 load avg: 0.05 [1/1] test_ctypes
1 test OK.

Total duration: 525 ms
Tests result: SUCCESS

Platform:

* uname -r: 4.18.0-60.el8.aarch64
* uname -p: aarch64
* rpm -q python2: python2-2.7.15-21.module+el8+2540+b19c9b35.aarch64
* rpm -q python2-test: python2-test-2.7.15-21.module+el8+2540+b19c9b35.aarch64
* rpm -q libffi: libffi-3.1-18.el8.aarch64
* getenforce: Enforcing

Note: "python3 -m test test_ctypes" also passes (python36-3.6.6-18.module+el8+2339+1a6691f8.aarch64, platform-python-3.6.8-1.el8.aarch64).

Comment 4 Victor Stinner 2019-01-30 17:49:47 UTC

I plan to close this issue next week if nobody is able to reproduce the issue on RHEL8 with python2 (python2-2.7.15-21).

Comment 5 Lukáš Zachar 2019-02-11 14:23:51 UTC

Sorry for the late response.

This bug happens on one particular machine (see comment #1) and not on the others. 
Unfortunately that machine is busy in Beaker and I am not able to reserve it - so I can't say whether issue can be still reproduced or not.

Once I get the failing machine I'll keep it (and reserve a passing aarch64 machine) so you can debug and compare them.

Comment 7 Victor Stinner 2019-02-18 12:55:16 UTC

"This bug happens on one particular machine (see comment #1) and not on the others. Unfortunately that machine is busy in Beaker and I am not able to reserve it - so I can't say whether issue can be still reproduced or not."

It sounds quite strange that the bug only occurs on one specific machine. Maybe this one is outdated or has a compiler bug? Python isn't compiled on Beaker machines but on builders, so all Beaker VMs should use the same binaries.

If you fail to reproduce the issue, I will have to close the issue.

Comment 9 Victor Stinner 2019-02-18 16:40:17 UTC

Commands to install required debug symbols:

dnf install gdb
dnf debuginfo-install glibc libffi

Commands to compile Python 2.7 manually:

dnf install libffi-devel
./configure --enable-unicode=ucs4 --with-system-ffi
make

Command to trigger the crash:

./python -m test -F -m ctypes.test.test_as_parameter.AsParamPropertyWrapperTestCase.test_callbacks -v test_ctypes

Comment 10 Victor Stinner 2019-02-18 17:29:52 UTC

See also the previous AArch64 bug fixed in libffi: bz #1174037 (fixed in 2015).


RHEL8 uses libffi 3.1, whereas the latest release is 3.2.1 (released on November 12, 2014). Differences in the src/aarch64/ subdir:

$ diff -u libffi-3.1/src/aarch64/ libffi-3.2.1/src/aarch64/

diff -u libffi-3.1/src/aarch64/ffi.c libffi-3.2.1/src/aarch64/ffi.c
--- libffi-3.1/src/aarch64/ffi.c	2014-04-25 19:45:13.000000000 +0200
+++ libffi-3.2.1/src/aarch64/ffi.c	2014-11-12 12:57:29.000000000 +0100
@@ -146,6 +146,9 @@
   switch (type)
     {
     case FFI_TYPE_FLOAT:
+#if defined (__APPLE__)
+      return sizeof (UINT32);
+#endif
     case FFI_TYPE_DOUBLE:
       return sizeof (UINT64);
 #if FFI_TYPE_DOUBLE != FFI_TYPE_LONGDOUBLE
@@ -779,6 +782,10 @@
           }
     }
 
+#if defined (__APPLE__)
+  cif->aarch64_nfixedargs = 0;
+#endif
+
   return FFI_OK;
 }
 
@@ -789,9 +796,13 @@
 				    unsigned int nfixedargs,
 				    unsigned int ntotalargs)
 {
+  ffi_status status;
+
+  status = ffi_prep_cif_machdep (cif);
+
   cif->aarch64_nfixedargs = nfixedargs;
 
-  return ffi_prep_cif_machdep(cif);
+  return status;
 }
 
 #endif

Comment 13 Peter Robinson 2019-02-18 18:24:20 UTC

Victor: why was I added to this?

Comment 14 Victor Stinner 2019-02-18 18:31:09 UTC

Details on the memory mapping.

selinux_enabled_check() in libffi's src/closures.c returns 1, so libffi's selinux_enabled variable is set to 1.

# mount|grep selinux
selinuxfs on /sys/fs/selinux type selinuxfs (rw,relatime)
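
For illustration only, a check of this kind can be sketched as a scan of a mounts table for the selinuxfs filesystem type. This is not libffi's code: the helper names line_is_selinuxfs() and selinuxfs_mounted() are hypothetical, and the sketch assumes the /proc/mounts line format ("device mountpoint fstype options ..."), which differs from the mount(8) output shown above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of libffi): returns 1 if one line in
   /proc/mounts format names "selinuxfs" as its filesystem type. */
int line_is_selinuxfs(const char *line)
{
    assert(line != NULL);
    const char *p = strchr(line, ' ');   /* end of the device field */
    if (p == NULL)
        return 0;
    p = strchr(p + 1, ' ');              /* end of the mount point field */
    if (p == NULL)
        return 0;
    /* The third field is the filesystem type. */
    return strncmp(p + 1, "selinuxfs ", 10) == 0;
}

/* Hypothetical helper: scans an already opened mounts table and
   reports whether any entry uses the selinuxfs filesystem type.
   Lines longer than the buffer are read in pieces by fgets(). */
int selinuxfs_mounted(FILE *mounts)
{
    char buf[4096];

    assert(mounts != NULL);
    while (fgets(buf, sizeof buf, mounts) != NULL)
        if (line_is_selinuxfs(buf))
            return 1;
    return 0;
}
```

A caller would typically pass the stream returned by fopen("/proc/mounts", "r") and treat a failed fopen() as "not enabled".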

dlmmap() calls dlmmap_locked(), which creates a temporary file, deletes it, and then creates a memory mapping on it using:

  ptr = mmap (NULL, length, (prot & ~PROT_WRITE) | PROT_EXEC,
	      flags, execfd, offset);

Function called with:

   dlmmap_locked (length=length@entry=65536, offset=0, flags=34, prot=3, start=0x0) at ../src/closures.c:434

with:

  flags = MAP_PRIVATE | MAP_ANONYMOUS = 34
  prot = PROT_READ | PROT_WRITE = 3
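
As a small worked example of the protection expression quoted above: starting from prot = PROT_READ | PROT_WRITE, the expression clears the write bit and sets the execute bit, giving PROT_READ | PROT_EXEC. The helper name exec_alias_prot() is hypothetical and only wraps the expression from the mmap() call.

```c
#include <assert.h>
#include <sys/mman.h>

/* Hypothetical helper: computes the protection used for the
   executable mapping from the protection requested by the caller,
   using the same expression as the mmap() call quoted above. */
int exec_alias_prot(int prot)
{
    int result = (prot & ~PROT_WRITE) | PROT_EXEC;

    /* The executable view must never be writable. */
    assert((result & PROT_WRITE) == 0);
    return result;
}
```

With the values from the gdb frame (prot=3), the result is read plus execute, which matches the r-x mapping one would expect for the code view.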

Comment 16 Victor Stinner 2019-02-18 18:34:33 UTC

> Details on the memory mapping.

I looked into that because RHEL8 has SELinux enabled in Enforcing mode, and a similar bug related to this memory mapping has been fixed on Android:

* https://bugs.python.org/issue26942
* https://github.com/libffi/libffi/commit/93d8e7dd17b08ff195af3580584ccd5c2228202f
* https://github.com/libffi/libffi/issues/262

Comment 17 Victor Stinner 2019-02-18 18:39:38 UTC

Oh, in fact the file is mapped *twice* in memory. Let me complete my previous comment:

dlmmap_locked() is called with flags = MAP_PRIVATE | MAP_ANONYMOUS (34) and prot = PROT_READ | PROT_WRITE (3), and maps the deleted temporary file first with:

  ptr = mmap (NULL, length, (prot & ~PROT_WRITE) | PROT_EXEC,
	      flags, execfd, offset);

then with:

  start = mmap (start, length, prot, flags, execfd, offset);

Example of /proc/pid/maps:

# grep /tmp /proc/97196/maps
ffffbde20000-ffffbde30000 rw-s 00000000 fd:00 34943877                   /tmp/ffi0ntJmg (deleted)
ffffbded0000-ffffbdee0000 r-xs 00000000 fd:00 34943877                   /tmp/ffi0ntJmg (deleted)

The first is read+write (no execute), the second is read+execute (no write).
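
The double mapping can be sketched in isolation as follows. This is a minimal sketch, not libffi's implementation: struct dual_map, dual_map_create() and dual_map_destroy() are hypothetical names, the /tmp/ffiXXXXXX template is an assumption modelled on the file name in the maps output, and MAP_SHARED is used because the maps output above shows both views as shared ("s"). It assumes a Linux system where /tmp allows executable mappings.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical type: two views of the same file pages. */
struct dual_map {
    void *data;     /* read+write view, used to write code bytes */
    void *code;     /* read+execute alias of the same pages */
    size_t length;
};

/* Maps a deleted temporary file twice. Returns 0 on success, -1 on error. */
int dual_map_create(struct dual_map *m, size_t length)
{
    char path[] = "/tmp/ffiXXXXXX";   /* assumed template, see lead-in */
    int fd;

    assert(m != NULL && length > 0);
    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                     /* file now lives only through fd */
    if (ftruncate(fd, (off_t) length) != 0) {
        close(fd);
        return -1;
    }
    /* Executable view first, then the writable view, as in the comment. */
    m->code = mmap(NULL, length, PROT_READ | PROT_EXEC, MAP_SHARED, fd, 0);
    if (m->code == MAP_FAILED) {
        close(fd);
        return -1;
    }
    m->data = mmap(NULL, length, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (m->data == MAP_FAILED) {
        munmap(m->code, length);
        close(fd);
        return -1;
    }
    close(fd);                        /* the mappings keep the pages alive */
    m->length = length;
    return 0;
}

/* Releases both views. */
void dual_map_destroy(struct dual_map *m)
{
    assert(m != NULL);
    munmap(m->data, m->length);
    munmap(m->code, m->length);
}
```

Bytes stored through m.data become readable through m.code because both views share the same pages; whether the CPU's instruction fetch sees them at the code address is a separate question on architectures with non-coherent instruction caches.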

Comment 18 Victor Stinner 2019-02-18 19:02:14 UTC

Florian Weimer's theory: this may be a CPU instruction cache issue.

The AArch64 implementation of libffi explicitly flushes the CPU instruction cache:

* ffi_prep_closure_loc() calls ffi_clear_cache(tramp, tramp + FFI_TRAMPOLINE_SIZE);
* ffi_clear_cache() is implemented with __builtin___clear_cache() on GCC
* https://github.com/libffi/libffi/blob/042ef8c314a946ef1cd58c6e10cd74e403ef5bf9/src/aarch64/ffi.c#L71
* https://github.com/libffi/libffi/blob/042ef8c314a946ef1cd58c6e10cd74e403ef5bf9/src/aarch64/ffi.c#L775

ffi_closure_alloc() calls dlmalloc(), which creates two memory mappings of the same deleted temporary file.

Problem: ffi_prep_closure_loc() only clears the CPU cache for one of the two mappings, not for the other.
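
To make the idea concrete, here is a minimal sketch of flushing both views after the trampoline has been written. It is not the patch attached to this bug: flush_closure_caches() is a hypothetical helper, and the caller is assumed to already know the address of the executable alias. Only __builtin___clear_cache() is a real GCC/Clang builtin.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: after code bytes were written through the
   writable view `data`, flush the instruction cache for that range
   and also for the executable alias `code` of the same pages. */
void flush_closure_caches(void *data, void *code, size_t size)
{
    char *d = data;
    char *c = code;

    assert(d != NULL && c != NULL);
    /* Flush the range that was actually written. */
    __builtin___clear_cache(d, d + size);
    /* Flush the alias as well, since execution happens at that address. */
    if (c != d)
        __builtin___clear_cache(c, c + size);
}
```

When there is no alias mapping (for example with SELinux disabled), data and code are the same pointer and the second flush is skipped. On architectures with coherent instruction caches the builtin expands to nothing, so the helper is harmless there.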

Comment 19 Florian Weimer 2019-02-18 19:02:31 UTC

My hypothesis is that this is related to instruction cache flushing in libffi.  libffi invokes __builtin___clear_cache in its aarch64 implementation, but only once, on the writable mapping.

Victor Stinner mentioned the double mapping feature.  There is no flushing on the executable alias mapping in libffi.  I wonder if this is the cause of the problem.

We cannot reproduce the crash if we trick libffi into believing that SELinux is disabled, causing it not to use an alias mapping.

Comment 20 Florian Weimer 2019-02-18 19:48:46 UTC

Created attachment 1536103 [details]
0001-aarch64-Flush-code-alias-mapping-after-creating-clos.patch

Patch which should fix the bug.

Note that segment_holding has a race condition, but ffi_closure_alloc has exactly the same problem, so I consider this a completely separate bug.

I made the new symbol ffi_data_to_code_pointer hidden, and it is not used by the other architectures, so the risk from this patch is quite low.

Comment 23 Florian Weimer 2019-02-18 20:08:30 UTC

Created attachment 1536104 [details]
0001-aarch64-Flush-code-alias-mapping-after-creating-clos.patch

Actual patch was missing, sorry.

Comment 29 Victor Stinner 2019-02-18 22:58:53 UTC

You can try to run https://bugs.python.org/file48150/bug2.py in a loop. In my experience, python is killed within 1 to 30 attempts.

Comment 34 Victor Stinner 2019-02-19 12:13:22 UTC

The PR https://github.com/libffi/libffi/pull/471/files has been merged upstream. Would it be possible to get a backport in RHEL8 and Fedora?

Comment 36 John Feeney 2019-04-03 18:20:56 UTC

Any status on this?

Comment 39 Florian Weimer 2019-07-04 12:54:40 UTC

The patch caused a regression, bug 1721569.

Comment 42 DJ Delorie 2019-08-01 19:21:00 UTC

Reverting until 1721569 can be fixed.

Comment 47 RHEL Program Management 2021-02-01 07:30:44 UTC

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.
