Bug 1652930
| Summary: | libffi: Incomplete cache flushing after code generation on aarch64 with SELinux enabled | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Lukáš Zachar <lzachar> |
| Component: | libffi | Assignee: | DJ Delorie <dj> |
| Status: | CLOSED WONTFIX | QA Contact: | Michal Kolar <mkolar> |
| Severity: | low | Docs Contact: | |
| Priority: | low | | |
| Version: | 8.0 | CC: | codonell, dj, fweimer, jbastian, jfeeney, jlinton, lzachar, mcermak, mnewsome, perobins, pviktori, vstinner |
| Target Milestone: | rc | Keywords: | Patch, Triaged |
| Target Release: | 8.1 | | |
| Hardware: | aarch64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-02-01 07:30:44 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1683831 | | |
| Attachments: | | | |
Description
Lukáš Zachar
2018-11-23 15:38:43 UTC
Created attachment 1508293 [details]
core.python2.lz4
core.python2.1001.33b1244a258841b8895a320e545ad21f.63737.1542980871000000.lz4
I just got a fresh RHEL8/AArch64 VM from Beaker, but test_ctypes doesn't crash on Python 2.7.

@Lukas Zachar: Can you please check if you are still able to reproduce the issue?

--

Python 2.7 is compiled with --with-system-ffi: the ctypes (_ctypes) module is linked to system libffi:

# ldd $(python2 -c 'import _ctypes; print(_ctypes.__file__)')|grep libffi
        libffi.so.6 => /lib64/libffi.so.6 (0x0000ffff86d70000)
# rpm -qf /lib64/libffi.so.6
libffi-3.1-18.el8.aarch64

test_ctypes passes:

# python2 -m test test_ctypes
Run tests sequentially
0:00:00 load avg: 0.05 [1/1] test_ctypes
1 test OK.

Total duration: 525 ms
Tests result: SUCCESS

Platform:

* uname -r: 4.18.0-60.el8.aarch64
* uname -p: aarch64
* rpm -q python2: python2-2.7.15-21.module+el8+2540+b19c9b35.aarch64
* rpm -q python2-test: python2-test-2.7.15-21.module+el8+2540+b19c9b35.aarch64
* rpm -q libffi: libffi-3.1-18.el8.aarch64
* getenforce: Enforcing

Note: "python3 -m test test_ctypes" also passes (python36-3.6.6-18.module+el8+2339+1a6691f8.aarch64, platform-python-3.6.8-1.el8.aarch64).

I plan to close this issue next week if nobody is able to reproduce the issue on RHEL8 with python2 (python2-2.7.15-21).

Sorry for the late response.

This bug happens on one particular machine (see comment #1) and not on the others. Unfortunately that machine is busy in Beaker and I am not able to reserve it - so I can't say whether the issue can still be reproduced or not.

Once I get the failing machine I'll keep it (and reserve a passing aarch64 machine) so you can debug / compare it.

"This bug happens on one particular machine (see comment #1) and not on the others. Unfortunately that machine is busy in Beaker and I am not able to reserve it - so I can't say whether issue can be still reproduced or not."

It sounds quite strange that the bug only occurs on one specific machine. Maybe this one is outdated or has a compiler bug? Python isn't compiled on Beaker machines but on builders; all Beaker VMs should use the same binaries.

If you fail to reproduce the issue, I will have to close the issue.

Commands to install required debug symbols:

dnf install gdb
dnf debuginfo-install glibc libffi

Commands to compile Python 2.7 manually:

dnf install libffi-devel
./configure --enable-unicode=ucs4 --with-system-ffi
make

Command to trigger the crash:

./python -m test -F -m ctypes.test.test_as_parameter.AsParamPropertyWrapperTestCase.test_callbacks -v test_ctypes

See also the previous AArch64 bug fixed in libffi: bz #1174037 (fixed in 2015).

RHEL8 uses libffi 3.1, whereas the latest release is 3.2.1 (released on November 12, 2014). Differences in the src/aarch64/ subdir:

$ diff -u libffi-3.1/src/aarch64/ libffi-3.2.1/src/aarch64/
diff -u libffi-3.1/src/aarch64/ffi.c libffi-3.2.1/src/aarch64/ffi.c
--- libffi-3.1/src/aarch64/ffi.c 2014-04-25 19:45:13.000000000 +0200
+++ libffi-3.2.1/src/aarch64/ffi.c 2014-11-12 12:57:29.000000000 +0100
@@ -146,6 +146,9 @@
   switch (type)
     {
     case FFI_TYPE_FLOAT:
+#if defined (__APPLE__)
+      return sizeof (UINT32);
+#endif
     case FFI_TYPE_DOUBLE:
       return sizeof (UINT64);
 #if FFI_TYPE_DOUBLE != FFI_TYPE_LONGDOUBLE
@@ -779,6 +782,10 @@
         }
     }
 
+#if defined (__APPLE__)
+  cif->aarch64_nfixedargs = 0;
+#endif
+
   return FFI_OK;
 }
 
@@ -789,9 +796,13 @@
                              unsigned int nfixedargs,
                              unsigned int ntotalargs)
 {
+  ffi_status status;
+
+  status = ffi_prep_cif_machdep (cif);
+
   cif->aarch64_nfixedargs = nfixedargs;
 
-  return ffi_prep_cif_machdep(cif);
+  return status;
 }
 #endif

Victor: why was I added to this?

Details on the memory mapping.

selinux_enabled_check() of libffi:src/closures.c returns 1: the libffi selinux_enabled variable is set to 1.

# mount|grep selinux
selinuxfs on /sys/fs/selinux type selinuxfs (rw,relatime)

dlmmap() calls dlmmap_locked() which creates a temporary file, deletes it and then creates a memory mapping on it using:

ptr = mmap (NULL, length, (prot & ~PROT_WRITE) | PROT_EXEC, flags, execfd, offset);

Function called with:

dlmmap_locked (length=length@entry=65536, offset=0, flags=34, prot=3, start=0x0) at ../src/closures.c:434

with:

flags = MAP_PRIVATE | MAP_ANONYMOUS = 34
prot = PROT_READ | PROT_WRITE = 3

> Details on the memory mapping.

I looked into that because RHEL8 has SELinux enabled in Enforcing mode, and a bug similar to this issue has been fixed on Android related to this memory mapping:

* https://bugs.python.org/issue26942
* https://github.com/libffi/libffi/commit/93d8e7dd17b08ff195af3580584ccd5c2228202f
* https://github.com/libffi/libffi/issues/262

Oh, in fact the file is mapped *twice* in memory. Let me complete my previous comment:

The temporary deleted file is mapped with flags = MAP_PRIVATE | MAP_ANONYMOUS (34) and prot = PROT_READ | PROT_WRITE (3), first with:

ptr = mmap (NULL, length, (prot & ~PROT_WRITE) | PROT_EXEC, flags, execfd, offset);

then with:

start = mmap (start, length, prot, flags, execfd, offset);

Example of /proc/pid/maps:

# grep /tmp /proc/97196/maps
ffffbde20000-ffffbde30000 rw-s 00000000 fd:00 34943877 /tmp/ffi0ntJmg (deleted)
ffffbded0000-ffffbdee0000 r-xs 00000000 fd:00 34943877 /tmp/ffi0ntJmg (deleted)

The first is read+write (no execute), the second is read+execute (no write).

Theory of Florian Weimer: maybe it's an issue of CPU instruction cache.

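
The double mapping described above can be sketched in C. This is only an illustration under stated assumptions, not libffi's code: `struct dual_map`, `dual_map_create` and `dual_map_destroy` are hypothetical names, the /tmp template is an assumption, and MAP_SHARED is used because the /proc/pid/maps output shows shared ("s") mappings.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical holder for the two views of one unlinked temporary file.  */
struct dual_map
{
  unsigned char *data;   /* rw- view: where the trampoline is written */
  unsigned char *code;   /* r-x view: where the trampoline is executed */
  size_t length;
};

/* Create an unlinked temporary file and map it twice.
   Returns 0 on success, -1 on failure.  */
static int
dual_map_create (struct dual_map *m, size_t length)
{
  char path[] = "/tmp/ffiXXXXXX";   /* assumed location, for illustration */
  int fd = mkstemp (path);
  if (fd == -1)
    return -1;
  unlink (path);                    /* the file now shows up as "(deleted)" */

  if (ftruncate (fd, (off_t) length) == -1)
    {
      close (fd);
      return -1;
    }

  /* Executable alias: readable and executable, never writable.  */
  void *code = mmap (NULL, length, PROT_READ | PROT_EXEC, MAP_SHARED, fd, 0);
  if (code == MAP_FAILED)
    {
      close (fd);
      return -1;
    }

  /* Writable view of the same file pages.  */
  void *data = mmap (NULL, length, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  close (fd);                       /* the mappings keep the file alive */
  if (data == MAP_FAILED)
    {
      munmap (code, length);
      return -1;
    }

  m->data = data;
  m->code = code;
  m->length = length;
  return 0;
}

/* Release both views.  */
static void
dual_map_destroy (struct dual_map *m)
{
  munmap (m->data, m->length);
  munmap (m->code, m->length);
}
```

Bytes stored through `data` become readable through `code` because both views share the same file pages. That says nothing about the instruction cache, which is the concern raised in this bug.
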
The AArch64 implementation of libffi explicitly flushes the CPU instruction cache:

* ffi_prep_closure_loc() calls ffi_clear_cache(tramp, tramp + FFI_TRAMPOLINE_SIZE);
* ffi_clear_cache() is implemented with __builtin___clear_cache() on GCC
* https://github.com/libffi/libffi/blob/042ef8c314a946ef1cd58c6e10cd74e403ef5bf9/src/aarch64/ffi.c#L71
* https://github.com/libffi/libffi/blob/042ef8c314a946ef1cd58c6e10cd74e403ef5bf9/src/aarch64/ffi.c#L775

ffi_closure_alloc() calls dlmalloc() which creates 2 memory mappings on the same deleted temporary file.

Problem: ffi_prep_closure_loc() only clears the CPU cache in one memory block, not the other.

My hypothesis is that this is related to instruction cache flushing in libffi. libffi invokes __builtin___clear_cache in its aarch64 implementation, but only once, on the writable mapping. Victor Stinner mentioned the double mapping feature. There is no flushing on the executable alias mapping in libffi. I wonder if this is the cause of the problem.

We cannot reproduce the crash if we trick libffi into believing that SELinux is disabled, causing it not to use an alias mapping.

Created attachment 1536103 [details]
0001-aarch64-Flush-code-alias-mapping-after-creating-clos.patch
Patch which should fix the bug.
Note that segment_holding has a race condition, but ffi_closure_alloc has exactly the same problem, so I consider this a completely separate bug.
I made the new symbol ffi_data_to_code_pointer hidden, and it is not used by the other architectures, so the risk from this patch is quite low.
Created attachment 1536104 [details]
0001-aarch64-Flush-code-alias-mapping-after-creating-clos.patch
Actual patch was missing, sorry.
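
The idea behind the patch, as described in this thread, is to translate the writable trampoline address to its executable alias and flush that range as well. The following is a rough sketch of that idea, not the attached patch: `struct alias_segment`, `data_to_code` and `install_trampoline` are hypothetical names, and a single segment is assumed, whereas libffi looks the segment up at run time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical description of one dual-mapped segment.  */
struct alias_segment
{
  char *data_base;   /* start of the writable mapping */
  char *code_base;   /* start of the executable alias of the same file */
  size_t length;     /* size of each mapping in bytes */
};

/* Translate an address in the writable view to the matching address in
   the executable alias.  Returns NULL if DATA lies outside the segment.  */
static char *
data_to_code (const struct alias_segment *seg, char *data)
{
  uintptr_t p = (uintptr_t) data;
  uintptr_t base = (uintptr_t) seg->data_base;

  if (p < base || p - base >= seg->length)
    return NULL;
  return seg->code_base + (p - base);
}

/* Copy SIZE bytes of instructions to DATA, then flush BOTH views so the
   CPU does not fetch stale instructions through the executable alias.
   Returns the address to execute, or NULL if the range does not fit.  */
static char *
install_trampoline (const struct alias_segment *seg, char *data,
                    const void *insns, size_t size)
{
  char *code = data_to_code (seg, data);
  if (code == NULL)
    return NULL;

  size_t offset = (size_t) (code - seg->code_base);
  if (size > seg->length - offset)
    return NULL;

  memcpy (data, insns, size);
  __builtin___clear_cache (data, data + size);   /* writable mapping */
  __builtin___clear_cache (code, code + size);   /* executable alias */
  return code;
}
```

Flushing only the first range matches the behaviour this bug reports. The second call is the extra step under discussion. On architectures with coherent instruction caches both calls are expected to be cheap or no-ops.
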
You can try to run https://bugs.python.org/file48150/bug2.py in a loop. In my experience, python is killed between 1 and 30 attempts.

https://github.com/libffi/libffi/pull/471/files

PR has been merged upstream. Would it be possible to get a backport in RHEL8 and Fedora?

Any status on this?

The patch caused a regression, bug 1721569. Reverting until 1721569 can be fixed.

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.