Bug 2124845

Summary: SIGILL on Power8 systems after recent sync from RHEL
Product: [Fedora] Fedora
Reporter: Dan Horák <dan>
Component: openssl
Assignee: Clemens Lang <cllang>
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: urgent
Docs Contact:
Priority: high
Version: rawhide
CC: cllang, crypto-team, dbelyavs, fsumsal, fweimer, mhofmann, mspacek, mturk, nasastry, naynjain, praiskup, sahana, support.web-tv, tm, vkabatov
Target Milestone: ---
Keywords: Triaged
Target Release: ---
Hardware: ppc64le
OS: Unspecified
Whiteboard:
Fixed In Version: openssl-3.0.5-5.fc38
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-09-12 11:02:05 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1071880

Description Dan Horák 2022-09-07 09:14:32 UTC
Description of problem:
Applications (e.g. sshd) abort with SIGILL on Power8 machines after the recent openssl update. I suspect the sync from RHEL brought in a hardware-level expectation that is valid only in RHEL (e.g. RHEL-9 requires a Power9 or newer system).

Version-Release number of selected component (if applicable):
openssl-1:3.0.5-3.fc38

How reproducible:
100%

Steps to Reproduce:
1. ssh to sshd running on power8
or 2. try "dnf update" from local console
or 3. use some other app

Actual results:
[ 3705.137658] sshd[1703]: illegal instruction (4) at 7fff85526aac nip 7fff85526aac lr 7fff854828e0 code 1 in libcrypto.so.3.0.5[7fff85240000+300000]
[ 3705.137866] sshd[1703]: code: 7f4909ce 39290010 7f6909ce 39290010 7f8909ce 39290010 7fa909ce 39290010 
[ 3705.137920] sshd[1703]: code: 7fc909ce 39290010 7fe909ce f8010210 <7c0046d9> 39400020 7c4a4699 39400030 
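
A note on reading the dump above: in the kernel's "code:" lines, the word in angle brackets is the instruction at the faulting address, here 0x7c0046d9. The sketch below splits that word into fields, assuming it uses the XX1 instruction form of the Power ISA (a 6-bit primary opcode, three 5-bit register fields, a 10-bit extended opcode and a 1-bit TX flag). The struct and function names are hypothetical, and reading extended opcode 876 as lxvb16x (an instruction introduced with ISA 3.0, i.e. Power9) is an interpretation that the report itself does not state.

```c
#include <stdint.h>

/* Fields of a Power ISA XX1-form instruction word (assumed layout). */
struct xx1_fields {
    unsigned primary; /* primary opcode, top 6 bits */
    unsigned t;       /* low 5 bits of the target VSR number */
    unsigned ra;      /* base register field */
    unsigned rb;      /* index register field */
    unsigned xo;      /* 10-bit extended opcode */
    unsigned tx;      /* high bit of the target VSR number */
};

/* Split a 32-bit instruction word into its XX1-form fields. */
struct xx1_fields decode_xx1(uint32_t insn)
{
    struct xx1_fields f;

    f.primary = (insn >> 26) & 0x3f;
    f.t       = (insn >> 21) & 0x1f;
    f.ra      = (insn >> 16) & 0x1f;
    f.rb      = (insn >> 11) & 0x1f;
    f.xo      = (insn >> 1) & 0x3ff;
    f.tx      = insn & 0x1;
    return f;
}
```

For 0x7c0046d9 this yields primary opcode 31, extended opcode 876, RA 0, RB 8 and target VSR 32 (TX = 1, T = 0).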


Expected results:
no SIGILL

Additional info:

Comment 1 Dan Horák 2022-09-07 09:16:14 UTC
a downgrade to openssl-3.0.5-2.fc37 (via rpm) fixes the issue

Comment 2 Dmitry Belyavskiy 2022-09-07 09:43:42 UTC
Could you please check this scratch build? https://koji.fedoraproject.org/koji/taskinfo?taskID=91732215

Comment 3 Clemens Lang 2022-09-07 09:56:12 UTC
The only thing we did in this area is https://bugzilla.redhat.com/show_bug.cgi?id=2051312.

These are performance optimizations also applied upstream in https://github.com/openssl/openssl/commit/44a563dde1584cd9284e80b6e45ee5019be8d36c and https://github.com/openssl/openssl/commit/345c99b6654b8313c792d54f829943068911ddbd for AES-GCM and https://github.com/openssl/openssl/commit/f596bbe4da779b56eea34d96168b557d78e1149 and https://github.com/openssl/openssl/commit/7e1f3ffcc5bc15fb9a12b9e3bb202f544c6ed5aa for ChaCha20.

If these don't work on Power8, we should also notify OpenSSL upstream. I'm Cc'ing the IBM people that worked on this.

Comment 4 Dan Horák 2022-09-07 10:10:52 UTC
(In reply to Dmitry Belyavskiy from comment #2)
> Could you please check this scratch build?
> https://koji.fedoraproject.org/koji/taskinfo?taskID=91732215

seems this build is OK for Power8

Comment 5 Dan Horák 2022-09-07 10:18:50 UTC
backtrace from sshd captured by coredumpctl

                Stack trace of thread 1703:
                #0  0x00007fff85526aac n/a (libcrypto.so.3 + 0x396aac)
                #1  0x00007fff854828e0 aes_p10_gcm_crypt.lto_priv.0 (libcrypto.so.3 + 0x2f28e0)
                #2  0x00007fff85485888 generic_aes_gcm_cipher_update (libcrypto.so.3 + 0x2f5888)
                #3  0x00007fff854de720 gcm_cipher_internal (libcrypto.so.3 + 0x34e720)
                #4  0x00007fff854debd8 ossl_gcm_cipher (libcrypto.so.3 + 0x34ebd8)
                #5  0x00007fff85361cf0 EVP_Cipher (libcrypto.so.3 + 0x1d1cf0)
                #6  0x0000000114961320 cipher_crypt (sshd + 0x71320)
                #7  0x0000000114971890 ssh_packet_send2_wrapped (sshd + 0x81890)
                #8  0x00000001149756f8 sshpkt_send (sshd + 0x856f8)
                #9  0x00000001149837f0 kex_send_newkeys (sshd + 0x937f0)
                #10 0x0000000114987c18 input_kex_gen_init (sshd + 0x97c18)
                #11 0x0000000114979278 ssh_dispatch_run_fatal (sshd + 0x89278)
                #12 0x000000011490b048 do_ssh2_kex (sshd + 0x1b048)
                #13 0x0000000114907f8c main (sshd + 0x17f8c)
                #14 0x00007fff84bd802c __libc_start_call_main (libc.so.6 + 0x3802c)
                #15 0x00007fff84bd826c __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x3826c)
                ELF object binary architecture: PowerPC64

seems like it's using the p10 variant ...

Comment 6 Dmitry Belyavskiy 2022-09-07 10:42:52 UTC
Dan, could you also check against openssl upstream?

Comment 7 Clemens Lang 2022-09-07 10:48:09 UTC
Despite the aes_p10_gcm_crypt name, there does not seem to be anything Power10-specific about this function: https://github.com/openssl/openssl/commit/345c99b6654b8313c792d54f829943068911ddbd#diff-603e722ab30f575238f8b4b59fd4a6c1f6120463db2165e0578975067ff900f4R50. It delegates to ppc_aes_gcm_encrypt, which is implemented in https://github.com/openssl/openssl/commit/44a563dde1584cd9284e80b6e45ee5019be8d36c#diff-4dc4358ced630b88de0636ecb930703c759366f83dcc73302d6ec31c25a14aa7R458, and the commit message claims it should work on Power9 and above.

https://github.com/openssl/openssl/commit/44a563dde1584cd9284e80b6e45ee5019be8d36c#diff-603e722ab30f575238f8b4b59fd4a6c1f6120463db2165e0578975067ff900f4R37 should only select this algorithm if supported by the current architecture, and does that by checking OPENSSL_ppccap_P's PPC_MADD300 bit (1<<4), which according to https://github.com/openssl/openssl/blob/master/crypto/ppccap.c#L193-L194 should be set on POWER9 and later.
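
The selection logic described above can be sketched roughly as follows. This is not OpenSSL's code: the enum, macro and function names are hypothetical, and only the two bit positions, (1 << 4) for the MADD300 capability and (1U << 23) for ISA 3.00 in AT_HWCAP2, are taken from this thread.

```c
/* AT_HWCAP2 bit for ISA 3.00 (POWER9 and later); value as quoted in this thread. */
#define HWCAP2_ARCH_3_00 (1UL << 23)
/* Internal capability flag; hypothetical name, (1 << 4) as quoted above. */
#define CAP_MADD300 (1U << 4)

/* Hypothetical labels for the two code paths. */
enum gcm_impl { GCM_GENERIC, GCM_ISA_3_00 };

/* Translate the kernel-reported HWCAP2 word into internal capability bits. */
unsigned int caps_from_hwcap2(unsigned long hwcap2)
{
    unsigned int caps = 0;

    if (hwcap2 & HWCAP2_ARCH_3_00)
        caps |= CAP_MADD300;
    return caps;
}

/* Pick the accelerated path only when the capability bit is present. */
enum gcm_impl select_gcm_impl(unsigned int caps)
{
    return (caps & CAP_MADD300) ? GCM_ISA_3_00 : GCM_GENERIC;
}
```

Under these assumptions, an AT_HWCAP2 value without bit 23 always ends up on the generic path.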

Could you share the output of openssl version -c on your machine?

Comment 8 Dan Horák 2022-09-07 11:09:56 UTC
openssl version -c returns "CPUINFO: N/A" for both my Power8 and Power9 systems

Comment 9 Dan Horák 2022-09-07 11:20:50 UTC
for the record, OPENSSL_cpuid_setup() is taking the GETAUXVAL codepath on Fedora

Comment 10 Dan Horák 2022-09-07 12:04:36 UTC
(In reply to Dmitry Belyavskiy from comment #6)
> Dan, could you also check against openssl upstream?

upstream "make test" from master branch sees a bunch of failures with the same symptoms (illegal instruction) when run on my p8 machine

let me verify on another p8 ...

Comment 11 Dmitry Belyavskiy 2022-09-07 12:10:30 UTC
Great! Could you please raise a bug report upstream then?

Comment 12 Clemens Lang 2022-09-07 12:35:13 UTC
For the record:

[root@ibm-p9z-25-lp3 ~]# grep cpu /proc/cpuinfo | sort -u
cpu             : POWER9 (architected), altivec supported
[root@ibm-p9z-25-lp3 ~]# ./test
getauxval(AT_HWCAP):  0xdc0065c2
getauxval(AT_HWCAP2): 0xfff00000

[root@ibm-p8-pvm-09-guest-11 ~]# grep cpu /proc/cpuinfo | sort -u
cpu             : POWER8 (architected), altivec supported
[root@ibm-p8-pvm-09-guest-11 ~]# gcc -o test test.c && ./test
getauxval(AT_HWCAP):  0xdc0065c2
getauxval(AT_HWCAP2): 0xff000000

Comment 13 Clemens Lang 2022-09-07 12:47:26 UTC
From what I can see, a getauxval(AT_HWCAP2) of 0xff000000 should correctly disable this code, since crypto/ppccap.c checks for 0xff000000 & HWCAP_ARCH_3_00, which is 0xff000000 & (1U << 23) = 0x0.

Dan, could you compile and run

#include <sys/auxv.h>
#include <stdio.h>

/* Print the hardware capability words the kernel reports for this process. */
int main(void) {
    fprintf(stderr, "getauxval(AT_HWCAP):  0x%lx\n", getauxval(AT_HWCAP));
    fprintf(stderr, "getauxval(AT_HWCAP2): 0x%lx\n", getauxval(AT_HWCAP2));
    return 0;
}

on your machine?
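
The mask arithmetic at the top of this comment, applied to the two AT_HWCAP2 values from comment 12, can be double-checked with a few lines of C. The helper names are hypothetical; the values 0xff000000 (POWER8) and 0xfff00000 (POWER9) and the bit position 23 are taken from the comments.

```c
/* Bits that are set in exactly one of the two capability words. */
unsigned long hwcap_diff(unsigned long a, unsigned long b)
{
    return a ^ b;
}

/* Return 1 if bit number `bit` (0 = least significant) is set in `value`. */
int hwcap_bit_is_set(unsigned long value, unsigned int bit)
{
    return (int)((value >> bit) & 1UL);
}
```

hwcap_diff(0xfff00000, 0xff000000) is 0x00f00000, so the two machines differ only in bits 20 to 23. Bit 23 is set in the POWER9 value and clear in the POWER8 value, which agrees with the 0x0 result computed above.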

Comment 14 Dan Horák 2022-09-07 16:17:42 UTC
For the record, I have reserved a p8 machine (VM/LPAR) from beaker with F-36 and re-ran the upstream build and tests. This setup reproduced the results from my development machine.

Test Summary Report
-------------------
30-test_evp.t                    (Wstat: 256 (exited 1) Tests: 74 Failed: 1)
  Failed test:  2
  Non-zero exit status: 1
70-test_asyncio.t                (Wstat: 256 (exited 1) Tests: 1 Failed: 1)
  Failed test:  1
  Non-zero exit status: 1
70-test_comp.t                   (Wstat: 27392 (exited 107) Tests: 0 Failed: 0)
  Non-zero exit status: 107
  Parse errors: No plan found in TAP output
70-test_key_share.t              (Wstat: 27392 (exited 107) Tests: 0 Failed: 0)
  Non-zero exit status: 107
  Parse errors: No plan found in TAP output
70-test_recordlen.t              (Wstat: 256 (exited 1) Tests: 1 Failed: 1)
  Failed test:  1
  Non-zero exit status: 1
70-test_servername.t             (Wstat: 256 (exited 1) Tests: 1 Failed: 1)
  Failed test:  1
  Non-zero exit status: 1
70-test_sslextension.t           (Wstat: 27392 (exited 107) Tests: 7 Failed: 0)
  Non-zero exit status: 107
  Parse errors: Bad plan.  You planned 8 tests but ran 7.
70-test_sslrecords.t             (Wstat: 27392 (exited 107) Tests: 12 Failed: 0)
  Non-zero exit status: 107
  Parse errors: Bad plan.  You planned 21 tests but ran 12.
70-test_sslsigalgs.t             (Wstat: 27392 (exited 107) Tests: 0 Failed: 0)
  Non-zero exit status: 107
  Parse errors: No plan found in TAP output
70-test_sslsignature.t           (Wstat: 27392 (exited 107) Tests: 0 Failed: 0)
  Non-zero exit status: 107
  Parse errors: No plan found in TAP output
70-test_sslversions.t            (Wstat: 27392 (exited 107) Tests: 4 Failed: 0)
  Non-zero exit status: 107
  Parse errors: Bad plan.  You planned 8 tests but ran 4.
70-test_tls13alerts.t            (Wstat: 27392 (exited 107) Tests: 0 Failed: 0)
  Non-zero exit status: 107
  Parse errors: No plan found in TAP output
70-test_tls13cookie.t            (Wstat: 27392 (exited 107) Tests: 0 Failed: 0)
  Non-zero exit status: 107
  Parse errors: No plan found in TAP output
70-test_tls13downgrade.t         (Wstat: 27392 (exited 107) Tests: 4 Failed: 0)
  Non-zero exit status: 107
  Parse errors: Bad plan.  You planned 6 tests but ran 4.
70-test_tls13hrr.t               (Wstat: 27392 (exited 107) Tests: 0 Failed: 0)
  Non-zero exit status: 107
  Parse errors: No plan found in TAP output
70-test_tls13kexmodes.t          (Wstat: 27392 (exited 107) Tests: 0 Failed: 0)
  Non-zero exit status: 107
  Parse errors: No plan found in TAP output
70-test_tls13messages.t          (Wstat: 28416 (exited 111) Tests: 0 Failed: 0)
  Non-zero exit status: 111
  Parse errors: No plan found in TAP output
70-test_tls13psk.t               (Wstat: 27392 (exited 107) Tests: 0 Failed: 0)
  Non-zero exit status: 107
  Parse errors: No plan found in TAP output
70-test_tlsextms.t               (Wstat: 27392 (exited 107) Tests: 9 Failed: 0)
  Non-zero exit status: 107
  Parse errors: Bad plan.  You planned 10 tests but ran 9.
80-test_dtls_mtu.t               (Wstat: 256 (exited 1) Tests: 1 Failed: 1)
  Failed test:  1
  Non-zero exit status: 1
80-test_ssl_new.t                (Wstat: 6144 (exited 24) Tests: 31 Failed: 24)
  Failed tests:  1-18, 20-21, 24, 26-28
  Non-zero exit status: 24
80-test_ssl_old.t                (Wstat: 512 (exited 2) Tests: 7 Failed: 2)
  Failed tests:  2-3
  Non-zero exit status: 2
80-test_sslcorrupt.t             (Wstat: 256 (exited 1) Tests: 1 Failed: 1)
  Failed test:  1
  Non-zero exit status: 1
90-test_sslapi.t                 (Wstat: 256 (exited 1) Tests: 1 Failed: 1)
  Failed test:  1
  Non-zero exit status: 1
90-test_sslbuffers.t             (Wstat: 256 (exited 1) Tests: 1 Failed: 1)
  Failed test:  1
  Non-zero exit status: 1
90-test_tls13ccs.t               (Wstat: 256 (exited 1) Tests: 1 Failed: 1)
  Failed test:  1
  Non-zero exit status: 1
90-test_tls13encryption.t        (Wstat: 256 (exited 1) Tests: 1 Failed: 1)
  Failed test:  1
  Non-zero exit status: 1
99-test_fuzz_client.t            (Wstat: 256 (exited 1) Tests: 2 Failed: 1)
  Failed test:  2
  Non-zero exit status: 1
99-test_fuzz_server.t            (Wstat: 256 (exited 1) Tests: 2 Failed: 1)
  Failed test:  2
  Non-zero exit status: 1
Files=256, Tests=3316, 562 wallclock secs (10.93 usr  0.69 sys + 495.51 cusr 36.15 csys = 543.28 CPU)
Result: FAIL

Comment 15 Dan Horák 2022-09-07 16:18:38 UTC
the AUXVAL results

[sharkcz@tyan-openpower-01 ~]$ ./test
getauxval(AT_HWCAP):  0xdc0065c2
getauxval(AT_HWCAP2): 0xff000000

Comment 16 Dan Horák 2022-09-07 16:19:09 UTC
I am going to open an upstream bug now ...

Comment 17 Pavel Raiskup 2022-09-08 15:00:32 UTC
*** Bug 2125295 has been marked as a duplicate of this bug. ***

Comment 18 Dan Horák 2022-09-09 07:57:05 UTC
Dmitry, could you build an update with the problematic AES patch disabled? It seems the upstream fix will take some time.

Comment 19 Dmitry Belyavskiy 2022-09-09 10:07:56 UTC
Dan, if you need it ASAP, I will do it. If not, could we delay it to, say, Tuesday?

Comment 20 Dan Horák 2022-09-09 10:18:55 UTC
I think Tuesday will be OK, the COPR guys have already applied a workaround for their buildsystem to unblock rawhide builds, thus it's not super urgent.

Comment 21 Dmitry Belyavskiy 2022-09-09 10:23:05 UTC
Could you also check that ChaCha is not affected?

Comment 22 Dan Horák 2022-09-09 12:42:26 UTC
(In reply to Dmitry Belyavskiy from comment #21)
> Could you also check that ChaCha is not affected?

It seems to be OK: the test suite passes in a local rpm build when only the AES patch is disabled. Also, the accelerated ChaCha is plugged in in a different way, if I understand it right.

Comment 23 Dmitry Belyavskiy 2022-09-09 14:37:48 UTC
https://github.com/openssl/openssl/pull/19182 looks like a relevant fix.

Comment 24 Fedora Update System 2022-09-12 11:00:20 UTC
FEDORA-2022-343ea0d960 has been submitted as an update to Fedora 38. https://bodhi.fedoraproject.org/updates/FEDORA-2022-343ea0d960

Comment 25 Fedora Update System 2022-09-12 11:02:05 UTC
FEDORA-2022-343ea0d960 has been pushed to the Fedora 38 stable repository.
If the problem still persists, please make note of it in this bug report.