Bug 2124845 - SIGILL on Power8 systems after recent sync from RHEL
Summary: SIGILL on Power8 systems after recent sync from RHEL
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: openssl
Version: rawhide
Hardware: ppc64le
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Assignee: Clemens Lang
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Duplicates: 2125295
Depends On:
Blocks: PPCTracker
Reported: 2022-09-07 09:14 UTC by Dan Horák
Modified: 2022-09-12 11:02 UTC (History)

Fixed In Version: openssl-3.0.5-5.fc38
Clone Of:
Environment:
Last Closed: 2022-09-12 11:02:05 UTC
Type: Bug
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Fedora Package Sources | openssl pull-request 37 | 0 | None | None | None | 2022-09-09 15:18:20 UTC
Github | openssl openssl issues 19163 | 0 | None | open | test failures on Power8 machine | 2022-09-07 16:35:48 UTC
Github | openssl openssl pull 19182 | 0 | None | open | Fix AES-GCM on Power 8 CPUs | 2022-09-09 14:37:48 UTC
Red Hat Issue Tracker | FC-600 | 0 | None | None | None | 2022-09-07 09:23:06 UTC

Description Dan Horák 2022-09-07 09:14:32 UTC
Description of problem:
Applications (e.g. sshd) abort with SIGILL on Power8 machines after the recent openssl update. I suspect the sync from RHEL brought in a hardware-level expectation that is valid only in RHEL (e.g. RHEL-9 requires a Power9 or newer system).

Version-Release number of selected component (if applicable):
openssl-1:3.0.5-3.fc38

How reproducible:
100%

Steps to Reproduce:
1. ssh to sshd running on power8
or 2. try "dnf update" from local console
or 3. use some other app

Actual results:
[ 3705.137658] sshd[1703]: illegal instruction (4) at 7fff85526aac nip 7fff85526aac lr 7fff854828e0 code 1 in libcrypto.so.3.0.5[7fff85240000+300000]
[ 3705.137866] sshd[1703]: code: 7f4909ce 39290010 7f6909ce 39290010 7f8909ce 39290010 7fa909ce 39290010 
[ 3705.137920] sshd[1703]: code: 7fc909ce 39290010 7fe909ce f8010210 <7c0046d9> 39400020 7c4a4699 39400030 


Expected results:
no SIGILL

Additional info:

Comment 1 Dan Horák 2022-09-07 09:16:14 UTC
a downgrade to openssl-3.0.5-2.fc37 (via rpm) fixes the issue

Comment 2 Dmitry Belyavskiy 2022-09-07 09:43:42 UTC
Could you please check this scratch build? https://koji.fedoraproject.org/koji/taskinfo?taskID=91732215

Comment 3 Clemens Lang 2022-09-07 09:56:12 UTC
The only thing we did in this area is https://bugzilla.redhat.com/show_bug.cgi?id=2051312.

These are performance optimizations also applied upstream in https://github.com/openssl/openssl/commit/44a563dde1584cd9284e80b6e45ee5019be8d36c and https://github.com/openssl/openssl/commit/345c99b6654b8313c792d54f829943068911ddbd for AES-GCM and https://github.com/openssl/openssl/commit/f596bbe4da779b56eea34d96168b557d78e1149 and https://github.com/openssl/openssl/commit/7e1f3ffcc5bc15fb9a12b9e3bb202f544c6ed5aa for ChaCha20.

If these don't work on Power8, we should also notify OpenSSL upstream. I'm Cc'ing the IBM people that worked on this.

Comment 4 Dan Horák 2022-09-07 10:10:52 UTC
(In reply to Dmitry Belyavskiy from comment #2)
> Could you please check this scratch build?
> https://koji.fedoraproject.org/koji/taskinfo?taskID=91732215

seems this build is OK for Power8

Comment 5 Dan Horák 2022-09-07 10:18:50 UTC
backtrace from sshd captured by coredumpctl

                Stack trace of thread 1703:
                #0  0x00007fff85526aac n/a (libcrypto.so.3 + 0x396aac)
                #1  0x00007fff854828e0 aes_p10_gcm_crypt.lto_priv.0 (libcrypto.so.3 + 0x2f28e0)
                #2  0x00007fff85485888 generic_aes_gcm_cipher_update (libcrypto.so.3 + 0x2f5888)
                #3  0x00007fff854de720 gcm_cipher_internal (libcrypto.so.3 + 0x34e720)
                #4  0x00007fff854debd8 ossl_gcm_cipher (libcrypto.so.3 + 0x34ebd8)
                #5  0x00007fff85361cf0 EVP_Cipher (libcrypto.so.3 + 0x1d1cf0)
                #6  0x0000000114961320 cipher_crypt (sshd + 0x71320)
                #7  0x0000000114971890 ssh_packet_send2_wrapped (sshd + 0x81890)
                #8  0x00000001149756f8 sshpkt_send (sshd + 0x856f8)
                #9  0x00000001149837f0 kex_send_newkeys (sshd + 0x937f0)
                #10 0x0000000114987c18 input_kex_gen_init (sshd + 0x97c18)
                #11 0x0000000114979278 ssh_dispatch_run_fatal (sshd + 0x89278)
                #12 0x000000011490b048 do_ssh2_kex (sshd + 0x1b048)
                #13 0x0000000114907f8c main (sshd + 0x17f8c)
                #14 0x00007fff84bd802c __libc_start_call_main (libc.so.6 + 0x3802c)
                #15 0x00007fff84bd826c __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x3826c)
                ELF object binary architecture: PowerPC64

seems like it's using the p10 variant ...

Comment 6 Dmitry Belyavskiy 2022-09-07 10:42:52 UTC
Dan, could you also check against openssl upstream?

Comment 7 Clemens Lang 2022-09-07 10:48:09 UTC
Despite the aes_p10_gcm_crypt name, there does not seem to be anything Power10-specific about this function: https://github.com/openssl/openssl/commit/345c99b6654b8313c792d54f829943068911ddbd#diff-603e722ab30f575238f8b4b59fd4a6c1f6120463db2165e0578975067ff900f4R50. It delegates to ppc_aes_gcm_encrypt, which is implemented in https://github.com/openssl/openssl/commit/44a563dde1584cd9284e80b6e45ee5019be8d36c#diff-4dc4358ced630b88de0636ecb930703c759366f83dcc73302d6ec31c25a14aa7R458, and the commit message claims it should work on Power9 and above.

https://github.com/openssl/openssl/commit/44a563dde1584cd9284e80b6e45ee5019be8d36c#diff-603e722ab30f575238f8b4b59fd4a6c1f6120463db2165e0578975067ff900f4R37 should only select this algorithm if supported by the current architecture, and does that by checking OPENSSL_ppccap_P's PPC_MADD300 bit (1<<4), which according to https://github.com/openssl/openssl/blob/master/crypto/ppccap.c#L193-L194 should be set on POWER9 and later.

Could you share the output of openssl version -c on your machine?

Comment 8 Dan Horák 2022-09-07 11:09:56 UTC
openssl version -c returns "CPUINFO: N/A" for both my Power8 and Power9 systems

Comment 9 Dan Horák 2022-09-07 11:20:50 UTC
for the record, OPENSSL_cpuid_setup() is taking the GETAUXVAL codepath on Fedora

Comment 10 Dan Horák 2022-09-07 12:04:36 UTC
(In reply to Dmitry Belyavskiy from comment #6)
> Dan, could you also check against openssl upstream?

upstream "make test" from master branch sees a bunch of failures with the same symptoms (illegal instruction) when run on my p8 machine

let me verify on another p8 ...

Comment 11 Dmitry Belyavskiy 2022-09-07 12:10:30 UTC
Great! Could you please raise a bug report upstream then?

Comment 12 Clemens Lang 2022-09-07 12:35:13 UTC
For the record:

[root@ibm-p9z-25-lp3 ~]# grep cpu /proc/cpuinfo | sort -u
cpu             : POWER9 (architected), altivec supported
[root@ibm-p9z-25-lp3 ~]# ./test
getauxval(AT_HWCAP):  0xdc0065c2
getauxval(AT_HWCAP2): 0xfff00000

[root@ibm-p8-pvm-09-guest-11 ~]# grep cpu /proc/cpuinfo | sort -u
cpu             : POWER8 (architected), altivec supported
[root@ibm-p8-pvm-09-guest-11 ~]# gcc -o test test.c && ./test
getauxval(AT_HWCAP):  0xdc0065c2
getauxval(AT_HWCAP2): 0xff000000

Comment 13 Clemens Lang 2022-09-07 12:47:26 UTC
From what I can see, a getauxval(AT_HWCAP2) of 0xff000000 should correctly disable this code, since crypto/ppccap.c checks for 0xff000000 & HWCAP_ARCH_3_00, which is 0xff000000 & (1U << 23) = 0x0.

Dan, could you compile and run

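#include <sys/auxv.h>
#include <stdio.h>

/* Print the raw hardware capability words the kernel exposes via the auxiliary vector. */
int main(void) {
    fprintf(stderr, "getauxval(AT_HWCAP):  0x%lx\n", getauxval(AT_HWCAP));
    fprintf(stderr, "getauxval(AT_HWCAP2): 0x%lx\n", getauxval(AT_HWCAP2));
    return 0;
}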

on your machine?

Comment 14 Dan Horák 2022-09-07 16:17:42 UTC
For the record, I reserved a p8 machine (VM/LPAR) from beaker with F-36 and re-ran the upstream build and tests. This setup reproduced the results from my development machine.

Test Summary Report
-------------------
30-test_evp.t                    (Wstat: 256 (exited 1) Tests: 74 Failed: 1)
  Failed test:  2
  Non-zero exit status: 1
70-test_asyncio.t                (Wstat: 256 (exited 1) Tests: 1 Failed: 1)
  Failed test:  1
  Non-zero exit status: 1
70-test_comp.t                   (Wstat: 27392 (exited 107) Tests: 0 Failed: 0)
  Non-zero exit status: 107
  Parse errors: No plan found in TAP output
70-test_key_share.t              (Wstat: 27392 (exited 107) Tests: 0 Failed: 0)
  Non-zero exit status: 107
  Parse errors: No plan found in TAP output
70-test_recordlen.t              (Wstat: 256 (exited 1) Tests: 1 Failed: 1)
  Failed test:  1
  Non-zero exit status: 1
70-test_servername.t             (Wstat: 256 (exited 1) Tests: 1 Failed: 1)
  Failed test:  1
  Non-zero exit status: 1
70-test_sslextension.t           (Wstat: 27392 (exited 107) Tests: 7 Failed: 0)
  Non-zero exit status: 107
  Parse errors: Bad plan.  You planned 8 tests but ran 7.
70-test_sslrecords.t             (Wstat: 27392 (exited 107) Tests: 12 Failed: 0)
  Non-zero exit status: 107
  Parse errors: Bad plan.  You planned 21 tests but ran 12.
70-test_sslsigalgs.t             (Wstat: 27392 (exited 107) Tests: 0 Failed: 0)
  Non-zero exit status: 107
  Parse errors: No plan found in TAP output
70-test_sslsignature.t           (Wstat: 27392 (exited 107) Tests: 0 Failed: 0)
  Non-zero exit status: 107
  Parse errors: No plan found in TAP output
70-test_sslversions.t            (Wstat: 27392 (exited 107) Tests: 4 Failed: 0)
  Non-zero exit status: 107
  Parse errors: Bad plan.  You planned 8 tests but ran 4.
70-test_tls13alerts.t            (Wstat: 27392 (exited 107) Tests: 0 Failed: 0)
  Non-zero exit status: 107
  Parse errors: No plan found in TAP output
70-test_tls13cookie.t            (Wstat: 27392 (exited 107) Tests: 0 Failed: 0)
  Non-zero exit status: 107
  Parse errors: No plan found in TAP output
70-test_tls13downgrade.t         (Wstat: 27392 (exited 107) Tests: 4 Failed: 0)
  Non-zero exit status: 107
  Parse errors: Bad plan.  You planned 6 tests but ran 4.
70-test_tls13hrr.t               (Wstat: 27392 (exited 107) Tests: 0 Failed: 0)
  Non-zero exit status: 107
  Parse errors: No plan found in TAP output
70-test_tls13kexmodes.t          (Wstat: 27392 (exited 107) Tests: 0 Failed: 0)
  Non-zero exit status: 107
  Parse errors: No plan found in TAP output
70-test_tls13messages.t          (Wstat: 28416 (exited 111) Tests: 0 Failed: 0)
  Non-zero exit status: 111
  Parse errors: No plan found in TAP output
70-test_tls13psk.t               (Wstat: 27392 (exited 107) Tests: 0 Failed: 0)
  Non-zero exit status: 107
  Parse errors: No plan found in TAP output
70-test_tlsextms.t               (Wstat: 27392 (exited 107) Tests: 9 Failed: 0)
  Non-zero exit status: 107
  Parse errors: Bad plan.  You planned 10 tests but ran 9.
80-test_dtls_mtu.t               (Wstat: 256 (exited 1) Tests: 1 Failed: 1)
  Failed test:  1
  Non-zero exit status: 1
80-test_ssl_new.t                (Wstat: 6144 (exited 24) Tests: 31 Failed: 24)
  Failed tests:  1-18, 20-21, 24, 26-28
  Non-zero exit status: 24
80-test_ssl_old.t                (Wstat: 512 (exited 2) Tests: 7 Failed: 2)
  Failed tests:  2-3
  Non-zero exit status: 2
80-test_sslcorrupt.t             (Wstat: 256 (exited 1) Tests: 1 Failed: 1)
  Failed test:  1
  Non-zero exit status: 1
90-test_sslapi.t                 (Wstat: 256 (exited 1) Tests: 1 Failed: 1)
  Failed test:  1
  Non-zero exit status: 1
90-test_sslbuffers.t             (Wstat: 256 (exited 1) Tests: 1 Failed: 1)
  Failed test:  1
  Non-zero exit status: 1
90-test_tls13ccs.t               (Wstat: 256 (exited 1) Tests: 1 Failed: 1)
  Failed test:  1
  Non-zero exit status: 1
90-test_tls13encryption.t        (Wstat: 256 (exited 1) Tests: 1 Failed: 1)
  Failed test:  1
  Non-zero exit status: 1
99-test_fuzz_client.t            (Wstat: 256 (exited 1) Tests: 2 Failed: 1)
  Failed test:  2
  Non-zero exit status: 1
99-test_fuzz_server.t            (Wstat: 256 (exited 1) Tests: 2 Failed: 1)
  Failed test:  2
  Non-zero exit status: 1
Files=256, Tests=3316, 562 wallclock secs (10.93 usr  0.69 sys + 495.51 cusr 36.15 csys = 543.28 CPU)
Result: FAIL

Comment 15 Dan Horák 2022-09-07 16:18:38 UTC
the AUXVAL results

[sharkcz@tyan-openpower-01 ~]$ ./test
getauxval(AT_HWCAP):  0xdc0065c2
getauxval(AT_HWCAP2): 0xff000000

Comment 16 Dan Horák 2022-09-07 16:19:09 UTC
I am going to open an upstream bug now ...

Comment 17 Pavel Raiskup 2022-09-08 15:00:32 UTC
*** Bug 2125295 has been marked as a duplicate of this bug. ***

Comment 18 Dan Horák 2022-09-09 07:57:05 UTC
Dmitry, could you build an update with the problematic AES patch disabled? It seems the upstream fix will take some time.

Comment 19 Dmitry Belyavskiy 2022-09-09 10:07:56 UTC
Dan, if you need it ASAP, I will do it. If not, could we delay it to, say, Tuesday?

Comment 20 Dan Horák 2022-09-09 10:18:55 UTC
I think Tuesday will be OK, the COPR guys have already applied a workaround for their buildsystem to unblock rawhide builds, thus it's not super urgent.

Comment 21 Dmitry Belyavskiy 2022-09-09 10:23:05 UTC
Could you also check that ChaCha is not affected?

Comment 22 Dan Horák 2022-09-09 12:42:26 UTC
(In reply to Dmitry Belyavskiy from comment #21)
> Could you also check that ChaCha is not affected?

Seems it is OK; the test suite passes in a local rpm build when only the AES patch is disabled. Also, the accelerated ChaCha is plugged in a different way, if I understand it right.

Comment 23 Dmitry Belyavskiy 2022-09-09 14:37:48 UTC
https://github.com/openssl/openssl/pull/19182 looks like a relevant fix.

Comment 24 Fedora Update System 2022-09-12 11:00:20 UTC
FEDORA-2022-343ea0d960 has been submitted as an update to Fedora 38. https://bodhi.fedoraproject.org/updates/FEDORA-2022-343ea0d960

Comment 25 Fedora Update System 2022-09-12 11:02:05 UTC
FEDORA-2022-343ea0d960 has been pushed to the Fedora 38 stable repository.
If the problem still persists, please make note of it in this bug report.

