Created attachment 1758810
rook log

Description of problem (please be as detailed as possible and provide log snippets):
When KMS is configured to work with signed certificates (without SKIP_VAULT_VERIFY), the keys are fetched from Vault but the OSD pods are in Init:CrashLoopBackOff.

Version of all relevant components (if applicable):
quay.io/rhceph-dev/ocs-registry:4.7.0-268.ci

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Signed certificates cannot be used with KMS.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes, 2/2 on different OCS versions

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:
No

Steps to Reproduce:
1. Install OCP
2. Install OCS with an external Vault KMS using signed certificates, configured through the advanced KMS settings in the UI

Actual results:
The OSDs are in Init:CrashLoopBackOff although the Vault keys are fetched.

Expected results:
The OSDs should be up and running.

Additional info:

Type     Reason                  Age                 From                     Message
----     ------                  ----                ----                     -------
Normal   Scheduled               35m                 default-scheduler        Successfully assigned openshift-storage/rook-ceph-osd-0-76549b578-nz25g to compute-0
Warning  FailedAttachVolume      35m                 attachdetach-controller  Multi-Attach error for volume "pvc-25109648-6fad-42a6-bc91-c51fd0e40aac" Volume is already used by pod(s) rook-ceph-osd-prepare-ocs-deviceset-thin-2-data-0lxlg5-fjqz4
Normal   SuccessfulAttachVolume  35m                 attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-25109648-6fad-42a6-bc91-c51fd0e40aac"
Normal   SuccessfulMountVolume   35m                 kubelet                  MapVolume.MapPodDevice succeeded for volume "pvc-25109648-6fad-42a6-bc91-c51fd0e40aac" globalMapPath "/var/lib/kubelet/plugins/kubernetes.io/vsphere-volume/volumeDevices/[vsanDatastore] 66242d5f-cafa-91c3-8164-e4434bd7df48/srozen1-feb17-q7-2sqps-dynamic-pvc-25109648-6fad-42a6-bc91-c51fd0e40aac.vmdk"
Normal   SuccessfulMountVolume   35m                 kubelet                  MapVolume.MapPodDevice succeeded for volume "pvc-25109648-6fad-42a6-bc91-c51fd0e40aac" volumeMapPath "/var/lib/kubelet/pods/8fb2fc65-5f32-4202-8350-6c9b862a4908/volumeDevices/kubernetes.io~vsphere-volume"
Normal   Pulled                  35m                 kubelet                  Container image "quay.io/rhceph-dev/rhceph@sha256:526393c0bf0093d77a5a34560fc228be91942e562aea44f398d3ab5ea370915d" already present on machine
Normal   AddedInterface          35m                 multus                   Add eth0 [10.128.2.75/23]
Normal   Created                 35m                 kubelet                  Created container blkdevmapper
Normal   Started                 35m                 kubelet                  Started container blkdevmapper
Normal   Pulled                  34m (x4 over 35m)   kubelet                  Container image "quay.io/rhceph-dev/rhceph@sha256:526393c0bf0093d77a5a34560fc228be91942e562aea44f398d3ab5ea370915d" already present on machine
Normal   Created                 34m (x4 over 35m)   kubelet                  Created container encryption-kms-get-kek
Normal   Started                 34m (x4 over 35m)   kubelet                  Started container encryption-kms-get-kek
Warning  BackOff                 9s (x162 over 35m)  kubelet                  Back-off restarting failed container

$ oc logs deployment/rook-ceph-osd-0 --all-containers
Error from server (BadRequest): container "encryption-open" in pod "rook-ceph-osd-0-76549b578-nz25g" is waiting to start: PodInitializing

$ oc logs rook-ceph-osd-0-76549b578-nz25g --all-containers
+ PVC_SOURCE=/ocs-deviceset-thin-2-data-0lxlg5
+ PVC_DEST=/var/lib/ceph/osd/ceph-0/block-tmp
+ CP_ARGS=(--archive --dereference --verbose)
+ '[' -b /var/lib/ceph/osd/ceph-0/block-tmp ']'
+ cp --archive --dereference --verbose /ocs-deviceset-thin-2-data-0lxlg5 /var/lib/ceph/osd/ceph-0/block-tmp
'/ocs-deviceset-thin-2-data-0lxlg5' -> '/var/lib/ceph/osd/ceph-0/block-tmp'
Error from server (BadRequest): container "encrypted-block-status" in pod "rook-ceph-osd-0-76549b578-nz25g" is waiting to start: PodInitializing
Shay, can you make sure Vault is configured with the fullchain.pem in tls_cert_file? If not, please change it and restart the server, then let me know so I can try.
When using the fullchain cert I can see the request going through:

[root@rook-ceph-osd-0-5db7949fc5-fw29r /]# curl -vvvvv --request GET --header 'X-Vault-Token: s.eCRWUBYpXQVkWYkNlHuRLZI7' --cert vault.fullchain --key /etc/vault/vault.key --connect-to ::shay-vault.qe.rh-ocs.com: https://shay-vault.qe.rh-ocs.com:8200/v1/rook/rook-ceph-osd-encryption-key-ocs-deviceset-thin-2-data-0lxlg5
Note: Unnecessary use of -X or --request, GET is already inferred.
* Connecting to hostname: shay-vault.qe.rh-ocs.com
* Trying 3.133.152.79...
* TCP_NODELAY set
* Connected to shay-vault.qe.rh-ocs.com (3.133.152.79) port 8200 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
* CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Request CERT (13):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, [no content] (0):
* TLSv1.3 (OUT), TLS handshake, Certificate (11):
* TLSv1.3 (OUT), TLS handshake, [no content] (0):
* TLSv1.3 (OUT), TLS handshake, CERT verify (15):
* TLSv1.3 (OUT), TLS handshake, [no content] (0):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use h2
* Server certificate:
* subject: CN=shay-vault.qe.rh-ocs.com
* start date: Feb 22 07:51:47 2021 GMT
* expire date: May 23 07:51:47 2021 GMT
* subjectAltName: host "shay-vault.qe.rh-ocs.com" matched cert's "shay-vault.qe.rh-ocs.com"
* issuer: C=US; O=Let's Encrypt; CN=R3
* SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* TLSv1.3 (OUT), TLS app data, [no content] (0):
* TLSv1.3 (OUT), TLS app data, [no content] (0):
* TLSv1.3 (OUT), TLS app data, [no content] (0):
* Using Stream ID: 1 (easy handle 0x5615d2e7a6c0)
* TLSv1.3 (OUT), TLS app data, [no content] (0):
> GET /v1/rook/rook-ceph-osd-encryption-key-ocs-deviceset-thin-2-data-0lxlg5 HTTP/2
> Host: shay-vault.qe.rh-ocs.com:8200
> User-Agent: curl/7.61.1
> Accept: */*
> X-Vault-Token: s.eCRWUBYpXQVkWYkNlHuRLZI7
>
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS app data, [no content] (0):
* Connection state changed (MAX_CONCURRENT_STREAMS == 250)!
* TLSv1.3 (OUT), TLS app data, [no content] (0):
* TLSv1.3 (IN), TLS app data, [no content] (0):
* TLSv1.3 (IN), TLS app data, [no content] (0):
< HTTP/2 200
< cache-control: no-store
< content-type: application/json
< content-length: 404
< date: Tue, 23 Feb 2021 17:20:44 GMT
<
* TLSv1.3 (IN), TLS app data, [no content] (0):
{"request_id":"133cc4e9-4d8e-ce75-60f3-72a76af9ec23","lease_id":"","renewable":false,"lease_duration":2764800,"data":{"rook-ceph-osd-encryption-key-ocs-deviceset-thin-2-data-0lxlg5":"Lr1BdnIhoIHjWKihJRcEYv6uo4oGnoAz7VNnGa2ujKDFtpyf1eIZdH7Meqb+3FWvfChi/wZsfSXFB0OYqYjjxrSV3WRE0txThrNnr9iwI4dgrN92Up3z3f2AE2R2aeZ6FTfrOpZ40uent/arg/FO9YHtgGPVVUkeI9mtGVYgiaA="},"wrap_info":null,"warnings":null,"auth":null}
* Connection #0 to host shay-vault.qe.rh-ocs.com left intact

Where previously it was failing with:

[root@rook-ceph-osd-0-5db7949fc5-fw29r /]# curl -vvvvv --request GET --header 'X-Vault-Token: s.eCRWUBYpXQVkWYkNlHuRLZI7' --cacert /etc/vault/vault.ca --cert /etc/vault/vault.crt --key /etc/vault/vault.key --connect-to ::shay-vault.qe.rh-ocs.com: https://shay-vault.qe.rh-ocs.com:8200/v1/rook/rook-ceph-osd-encryption-key-ocs-deviceset-thin-2-data-0lxlg5
Note: Unnecessary use of -X or --request, GET is already inferred.
* Connecting to hostname: shay-vault.qe.rh-ocs.com
* Trying 3.133.152.79...
* TCP_NODELAY set
* Connected to shay-vault.qe.rh-ocs.com (3.133.152.79) port 8200 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
* CAfile: /etc/vault/vault.ca
  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Request CERT (13):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (OUT), TLS alert, unknown CA (560):
* SSL certificate problem: unable to get issuer certificate
* Closing connection 0
curl: (60) SSL certificate problem: unable to get issuer certificate
More details here: https://curl.haxx.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.

So just to be clear, there is no bug in Rook, only a misconfiguration of the Vault server. The PR attached in Rook exists only to surface curl errors in a better way.

For the Vault configuration, we need to pass the full chain of certificates (fullchain.pem) in the tls_cert_file config option. We also need to upload fullchain.pem from the UI for the cacert field.
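As a quick sanity check on the chain requirement described above, the sketch below counts the certificates in a PEM bundle. A leaf-only file contains one certificate, while a full chain contains the leaf plus at least one intermediate. The function name `count_pem_certs` and the example path are illustrative assumptions, not taken from Rook or Vault.

```shell
#!/bin/sh
# Count the certificates in a PEM bundle.
# 1 usually means a leaf-only file; 2 or more suggests a chain
# (leaf + intermediate(s)). This only counts PEM blocks; it does not
# validate signatures or ordering.
count_pem_certs() {
    grep -c -- '-----BEGIN CERTIFICATE-----' "$1"
}

# Hypothetical usage (the path is an assumption):
#   count_pem_certs /etc/vault/fullchain.pem
```

To see which certificates the server actually presents, `openssl s_client -connect shay-vault.qe.rh-ocs.com:8200 -showcerts` prints each certificate sent during the handshake.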
Changed tls_cert_file in the Vault configuration to point to fullchain.pem. In the UI, changed the CA certificate to fullchain.pem. The OSD still doesn't initialize.
In the end, there was something to do in Rook, addressed in https://github.com/rook/rook/pull/7298. Moving back to POST.
Same issue.

Setup:
Provider: VMware
OCS Version: 4.7.0-273.ci
OCP Version:

Procedure:
1. Install OCS via UI
2. Configure KMS settings:
   Service Name: vault
   IP: https://vault.qe.rh-ocs.com
   PORT: 8200
   TOKEN: ***
   Advanced settings:
   CA Certificate: fullchain.pem
   Client Certificate: cert.pem
   Client Private Key: privkey.pem
3. Check OSD pods status

OSD pods status is Error.

$ oc logs rook-ceph-osd-0-5d459565bd-zq7qk
error: a container name must be specified for pod rook-ceph-osd-0-5d459565bd-zq7qk, choose one of: [osd log-collector] or one of the init containers: [blkdevmapper encryption-kms-get-kek encryption-open blkdevmapper-encryption encrypted-block-status expand-encrypted-bluefs activate expand-bluefs chown-container-data-dir]

$ oc logs rook-ceph-osd-0-5d459565bd-zq7qk -c encryption-kms-get-kek
curl: (60) SSL certificate problem: unable to get issuer certificate
More details here: https://curl.haxx.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.
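Since `oc logs` needs a container name for this pod, a small helper can walk all init containers in one pass. This is a minimal sketch: the function name `init_container_logs` is hypothetical, and the default namespace `openshift-storage` is assumed from the pod events earlier in this report.

```shell
#!/bin/sh
# Print the logs of every init container in a pod, each under a header.
# Usage: init_container_logs <pod> [namespace]
init_container_logs() {
    pod=$1
    ns=${2:-openshift-storage}   # assumed default namespace
    # jsonpath returns the init container names separated by spaces
    for c in $(oc get pod "$pod" -n "$ns" \
            -o jsonpath='{.spec.initContainers[*].name}'); do
        echo "=== $c ==="
        # Containers that have not started yet return an error; keep going
        oc logs "$pod" -n "$ns" -c "$c" 2>&1 || true
    done
}

# Example with the pod name from the comment above:
#   init_container_logs rook-ceph-osd-0-5d459565bd-zq7qk
```

Init containers that are still waiting report a `PodInitializing` error, so the first container with a real error message is usually the one to look at.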
Oded, have you tried with the latest build? Moving to ON_QA again
The OSDs are up:

rook-ceph-osd-0-7b7b68cd67-gn9cd   2/2   Running   0   115s
rook-ceph-osd-1-54f56cd7fd-g5lbc   2/2   Running   0   104s
rook-ceph-osd-2-8c4cc5774-fdgrs    2/2   Running   0   98s

Encryption is on:

lsblk
NAME                                              MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
loop0                                               7:0    0   512G  0 loop
sda                                                 8:0    0   120G  0 disk
|-sda1                                              8:1    0   384M  0 part  /boot
|-sda2                                              8:2    0   127M  0 part
|-sda3                                              8:3    0     1M  0 part
`-sda4                                              8:4    0 119.5G  0 part
  `-coreos-luks-root-nocrypt                      253:0    0 119.5G  0 dm    /sysroot
sdb                                                 8:16   0    10G  0 disk  /var/lib/kubelet/pods/38764748-d69c-4015-809f-881ee0596a19/volumes/kubernetes.io~vsphere-volume/pvc-d66a21cf-ee17-4847-b4b1-abcb8a553112
sdc                                                 8:32   0   512G  0 disk
`-ocs-deviceset-thin-0-data-0ktvw6-block-dmcrypt  253:1    0   512G  0 crypt

Checked on version: quay.io/rhceph-dev/ocs-registry:4.7.0-280.ci

Moving to verified.
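The manual lsblk check above can also be scripted. The sketch below filters raw `lsblk` output down to devices of type `crypt`. The function name `crypt_devices` is a hypothetical helper, and it assumes input in the two-column form produced by `lsblk -rno NAME,TYPE`.

```shell
#!/bin/sh
# Read "NAME TYPE" pairs on stdin and print the names of dm-crypt devices.
crypt_devices() {
    awk '$2 == "crypt" { print $1 }'
}

# Usage on a node (raw output, no headings, two columns):
#   lsblk -rno NAME,TYPE | crypt_devices
```

On the node shown above, this should list the `ocs-deviceset-...-block-dmcrypt` mapping if the OSD device is encrypted, and print nothing otherwise.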
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2041
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days