Bug 1931839 - OSD in state init:CrashLoopBackOff with KMS signed certificates
Summary: OSD in state init:CrashLoopBackOff with KMS signed certificates
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: rook
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: OCS 4.7.0
Assignee: Sébastien Han
QA Contact: Shay Rozen
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-02-23 11:14 UTC by Shay Rozen
Modified: 2023-09-15 01:01 UTC (History)
7 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-19 09:20:01 UTC
Embargoed:


Attachments (Terms of Use)
rook log (204.98 KB, text/plain)
2021-02-23 11:14 UTC, Shay Rozen


Links
System ID Private Priority Status Summary Last Updated
Github rook rook pull 7292 0 None open ceph: show error even in silent curl 2021-02-23 16:16:45 UTC
Github rook rook pull 7298 0 None open ceph: do not use curl ca-cert for signed certificates 2021-02-24 10:54:59 UTC
Red Hat Product Errata RHSA-2021:2041 0 None None None 2021-05-19 09:20:29 UTC

Description Shay Rozen 2021-02-23 11:14:13 UTC
Created attachment 1758810 [details]
rook log

Description of problem (please be as detailed as possible and provide log
snippets):
When KMS is configured with signed certificates (with no VAULT_SKIP_VERIFY), the keys are fetched from the Vault but the OSD is stuck in init:CrashLoopBackOff.


Version of all relevant components (if applicable):
quay.io/rhceph-dev/ocs-registry:4.7.0-268.ci.


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
KMS cannot be used with signed certificates.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes, 2/2 on different OCS versions.

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:
No

Steps to Reproduce:
1. Install OCP
2. Install OCS via the UI with an external Vault KMS using signed certificates (configured under the advanced KMS settings)



Actual results:
OSDs are in init:CrashLoopBackOff although the Vault keys are fetched.

Expected results:
OSDs should be up and running.

Additional info:
Type     Reason                  Age                 From                     Message
  ----     ------                  ----                ----                     -------
  Normal   Scheduled               35m                 default-scheduler        Successfully assigned openshift-storage/rook-ceph-osd-0-76549b578-nz25g to compute-0
  Warning  FailedAttachVolume      35m                 attachdetach-controller  Multi-Attach error for volume "pvc-25109648-6fad-42a6-bc91-c51fd0e40aac" Volume is already used by pod(s) rook-ceph-osd-prepare-ocs-deviceset-thin-2-data-0lxlg5-fjqz4
  Normal   SuccessfulAttachVolume  35m                 attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-25109648-6fad-42a6-bc91-c51fd0e40aac"
  Normal   SuccessfulMountVolume   35m                 kubelet                  MapVolume.MapPodDevice succeeded for volume "pvc-25109648-6fad-42a6-bc91-c51fd0e40aac" globalMapPath "/var/lib/kubelet/plugins/kubernetes.io/vsphere-volume/volumeDevices/[vsanDatastore] 66242d5f-cafa-91c3-8164-e4434bd7df48/srozen1-feb17-q7-2sqps-dynamic-pvc-25109648-6fad-42a6-bc91-c51fd0e40aac.vmdk"
  Normal   SuccessfulMountVolume   35m                 kubelet                  MapVolume.MapPodDevice succeeded for volume "pvc-25109648-6fad-42a6-bc91-c51fd0e40aac" volumeMapPath "/var/lib/kubelet/pods/8fb2fc65-5f32-4202-8350-6c9b862a4908/volumeDevices/kubernetes.io~vsphere-volume"
  Normal   Pulled                  35m                 kubelet                  Container image "quay.io/rhceph-dev/rhceph@sha256:526393c0bf0093d77a5a34560fc228be91942e562aea44f398d3ab5ea370915d" already present on machine
  Normal   AddedInterface          35m                 multus                   Add eth0 [10.128.2.75/23]
  Normal   Created                 35m                 kubelet                  Created container blkdevmapper
  Normal   Started                 35m                 kubelet                  Started container blkdevmapper
  Normal   Pulled                  34m (x4 over 35m)   kubelet                  Container image "quay.io/rhceph-dev/rhceph@sha256:526393c0bf0093d77a5a34560fc228be91942e562aea44f398d3ab5ea370915d" already present on machine
  Normal   Created                 34m (x4 over 35m)   kubelet                  Created container encryption-kms-get-kek
  Normal   Started                 34m (x4 over 35m)   kubelet                  Started container encryption-kms-get-kek
  Warning  BackOff                 9s (x162 over 35m)  kubelet                  Back-off restarting failed container


oc logs deployment/rook-ceph-osd-0 --all-containers
Error from server (BadRequest): container "encryption-open" in pod "rook-ceph-osd-0-76549b578-nz25g" is waiting to start: PodInitializing

oc logs rook-ceph-osd-0-76549b578-nz25g --all-containers
+ PVC_SOURCE=/ocs-deviceset-thin-2-data-0lxlg5
+ PVC_DEST=/var/lib/ceph/osd/ceph-0/block-tmp
+ CP_ARGS=(--archive --dereference --verbose)
+ '[' -b /var/lib/ceph/osd/ceph-0/block-tmp ']'
+ cp --archive --dereference --verbose /ocs-deviceset-thin-2-data-0lxlg5 /var/lib/ceph/osd/ceph-0/block-tmp
'/ocs-deviceset-thin-2-data-0lxlg5' -> '/var/lib/ceph/osd/ceph-0/block-tmp'
Error from server (BadRequest): container "encrypted-block-status" in pod "rook-ceph-osd-0-76549b578-nz25g" is waiting to start: PodInitializing

Comment 4 Sébastien Han 2021-02-23 15:22:51 UTC
Shay, can you make sure Vault is configured with the fullchain.pem in tls_cert_file? If not, please change it and restart the server, then let me know so I can try.

Comment 7 Sébastien Han 2021-02-23 17:33:35 UTC
When using the fullchain cert I can see the request going through:


[root@rook-ceph-osd-0-5db7949fc5-fw29r /]# curl -vvvvv   --request GET --header 'X-Vault-Token: s.eCRWUBYpXQVkWYkNlHuRLZI7'  --cert vault.fullchain --key /etc/vault/vault.key --connect-to ::shay-vault.qe.rh-ocs.com: https://shay-vault.qe.rh-ocs.com:8200/v1/rook/rook-ceph-osd-encryption-key-ocs-deviceset-thin-2-data-0lxlg5
Note: Unnecessary use of -X or --request, GET is already inferred.
* Connecting to hostname: shay-vault.qe.rh-ocs.com
*   Trying 3.133.152.79...
* TCP_NODELAY set
* Connected to shay-vault.qe.rh-ocs.com (3.133.152.79) port 8200 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Request CERT (13):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, [no content] (0):
* TLSv1.3 (OUT), TLS handshake, Certificate (11):
* TLSv1.3 (OUT), TLS handshake, [no content] (0):
* TLSv1.3 (OUT), TLS handshake, CERT verify (15):
* TLSv1.3 (OUT), TLS handshake, [no content] (0):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=shay-vault.qe.rh-ocs.com
*  start date: Feb 22 07:51:47 2021 GMT
*  expire date: May 23 07:51:47 2021 GMT
*  subjectAltName: host "shay-vault.qe.rh-ocs.com" matched cert's "shay-vault.qe.rh-ocs.com"
*  issuer: C=US; O=Let's Encrypt; CN=R3
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* TLSv1.3 (OUT), TLS app data, [no content] (0):
* TLSv1.3 (OUT), TLS app data, [no content] (0):
* TLSv1.3 (OUT), TLS app data, [no content] (0):
* Using Stream ID: 1 (easy handle 0x5615d2e7a6c0)
* TLSv1.3 (OUT), TLS app data, [no content] (0):
> GET /v1/rook/rook-ceph-osd-encryption-key-ocs-deviceset-thin-2-data-0lxlg5 HTTP/2
> Host: shay-vault.qe.rh-ocs.com:8200
> User-Agent: curl/7.61.1
> Accept: */*
> X-Vault-Token: s.eCRWUBYpXQVkWYkNlHuRLZI7
>
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS app data, [no content] (0):
* Connection state changed (MAX_CONCURRENT_STREAMS == 250)!
* TLSv1.3 (OUT), TLS app data, [no content] (0):
* TLSv1.3 (IN), TLS app data, [no content] (0):
* TLSv1.3 (IN), TLS app data, [no content] (0):
< HTTP/2 200
< cache-control: no-store
< content-type: application/json
< content-length: 404
< date: Tue, 23 Feb 2021 17:20:44 GMT
<
* TLSv1.3 (IN), TLS app data, [no content] (0):
{"request_id":"133cc4e9-4d8e-ce75-60f3-72a76af9ec23","lease_id":"","renewable":false,"lease_duration":2764800,"data":{"rook-ceph-osd-encryption-key-ocs-deviceset-thin-2-data-0lxlg5":"Lr1BdnIhoIHjWKihJRcEYv6uo4oGnoAz7VNnGa2ujKDFtpyf1eIZdH7Meqb+3FWvfChi/wZsfSXFB0OYqYjjxrSV3WRE0txThrNnr9iwI4dgrN92Up3z3f2AE2R2aeZ6FTfrOpZ40uent/arg/FO9YHtgGPVVUkeI9mtGVYgiaA="},"wrap_info":null,"warnings":null,"auth":null}
* Connection #0 to host shay-vault.qe.rh-ocs.com left intact



Where previously it was failing with:

[root@rook-ceph-osd-0-5db7949fc5-fw29r /]# curl -vvvvv   --request GET --header 'X-Vault-Token: s.eCRWUBYpXQVkWYkNlHuRLZI7' --cacert /etc/vault/vault.ca --cert /etc/vault/vault.crt --key /etc/vault/vault.key --connect-to ::shay-vault.qe.rh-ocs.com: https://shay-vault.qe.rh-ocs.com:8200/v1/rook/rook-ceph-osd-encryption-key-ocs-deviceset-thin-2-data-0lxlg5
Note: Unnecessary use of -X or --request, GET is already inferred.
* Connecting to hostname: shay-vault.qe.rh-ocs.com
*   Trying 3.133.152.79...
* TCP_NODELAY set
* Connected to shay-vault.qe.rh-ocs.com (3.133.152.79) port 8200 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/vault/vault.ca
  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Request CERT (13):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (OUT), TLS alert, unknown CA (560):
* SSL certificate problem: unable to get issuer certificate
* Closing connection 0
curl: (60) SSL certificate problem: unable to get issuer certificate
More details here: https://curl.haxx.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.
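This verification failure can be reproduced locally with throwaway certificates. The sketch below is self-contained and every file name is illustrative (nothing here comes from the cluster in this bug): a leaf certificate only verifies when the CA bundle handed to the client contains its issuer.

```shell
# Create a throwaway root CA and a leaf certificate signed by it.
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=Demo Root" \
  -keyout root.key -out root.crt -days 1 2>/dev/null
openssl req -newkey rsa:2048 -nodes -subj "/CN=vault.example.com" \
  -keyout leaf.key -out leaf.csr 2>/dev/null
openssl x509 -req -in leaf.csr -CA root.crt -CAkey root.key \
  -CAcreateserial -out leaf.crt -days 1 2>/dev/null

# With the issuer in the bundle, verification succeeds:
openssl verify -CAfile root.crt leaf.crt
# Without it (the bundle holds only the leaf), verification fails the same
# way curl does here, with an "unable to get ... issuer certificate" error:
openssl verify -CAfile leaf.crt leaf.crt || true
```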


So just to be clear, there is no bug in Rook, just a misconfiguration from the Vault server.
The PR attached in Rook exists only to surface curl errors in a better way.

For the Vault configuration, we need to pass the full chain of certificates (fullchain.pem) in the tls_cert_file config option.
We also need to upload fullchain.pem from the UI for the cacert field.
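For reference, a Vault TCP listener stanza along these lines serves the full chain. This is only a sketch: the paths are assumptions, not taken from this cluster, and fullchain.pem is simply the leaf certificate concatenated with its intermediate(s).

```hcl
# Sketch of a Vault server listener config (all paths are illustrative).
listener "tcp" {
  address       = "0.0.0.0:8200"
  # Must contain the leaf certificate followed by the intermediate(s),
  # e.g. the output of: cat cert.pem chain.pem > fullchain.pem
  tls_cert_file = "/etc/vault/fullchain.pem"
  tls_key_file  = "/etc/vault/privkey.pem"
}
```

If tls_cert_file holds only the leaf, a client that trusts only the root CA cannot build the chain and fails verification.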

Comment 8 Shay Rozen 2021-02-23 18:16:19 UTC
Changed tls_cert_file in the Vault configuration to point at fullchain.pem.
In the UI, changed the CA certificate to fullchain.pem. The OSD still doesn't init.

Comment 10 Sébastien Han 2021-02-24 10:55:00 UTC
In the end, there was something to do in Rook, addressed in https://github.com/rook/rook/pull/7298.
Moving back to POST.

Comment 11 Oded 2021-03-01 10:27:01 UTC
Same issue

SetUp:
Provider: VMware
OCS Version: 4.7.0-273.ci
OCP Version:

Procedure:
1. Install OCS via the UI
2. Configure KMS settings:
Service Name: vault
IP: https://vault.qe.rh-ocs.com
Port: 8200
Token: ***

advanced settings:
CA Certificate: fullchain.pem   
Client Certificate: cert.pem  
Client Private Key: privkey.pem 

3. Check OSD pod status
The OSD pods are in Error state

$ oc logs rook-ceph-osd-0-5d459565bd-zq7qk
error: a container name must be specified for pod rook-ceph-osd-0-5d459565bd-zq7qk, choose one of: [osd log-collector] or one of the init containers: [blkdevmapper encryption-kms-get-kek encryption-open blkdevmapper-encryption encrypted-block-status expand-encrypted-bluefs activate expand-bluefs chown-container-data-dir]


$ oc logs rook-ceph-osd-0-5d459565bd-zq7qk -c encryption-kms-get-kek
curl: (60) SSL certificate problem: unable to get issuer certificate
More details here: https://curl.haxx.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.

Comment 12 Sébastien Han 2021-03-01 13:19:15 UTC
Oded, have you tried with the latest build?
Moving to ON_QA again

Comment 13 Shay Rozen 2021-03-02 13:50:49 UTC
OSDs are up:
rook-ceph-osd-0-7b7b68cd67-gn9cd                                  2/2     Running     0          115s
rook-ceph-osd-1-54f56cd7fd-g5lbc                                  2/2     Running     0          104s
rook-ceph-osd-2-8c4cc5774-fdgrs                                   2/2     Running     0          98s

Encryption is on:
lsblk
NAME                                             MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
loop0                                              7:0    0   512G  0 loop  
sda                                                8:0    0   120G  0 disk  
|-sda1                                             8:1    0   384M  0 part  /boot
|-sda2                                             8:2    0   127M  0 part  
|-sda3                                             8:3    0     1M  0 part  
`-sda4                                             8:4    0 119.5G  0 part  
  `-coreos-luks-root-nocrypt                     253:0    0 119.5G  0 dm    /sysroot
sdb                                                8:16   0    10G  0 disk  /var/lib/kubelet/pods/38764748-d69c-4015-809f-881ee0596a19/volumes/kubernetes.io~vsphere-volume/pvc-d66a21cf-ee17-4847-b4b1-abcb8a553112
sdc                                                8:32   0   512G  0 disk  
`-ocs-deviceset-thin-0-data-0ktvw6-block-dmcrypt 253:1    0   512G  0 crypt 

Checked on version:

quay.io/rhceph-dev/ocs-registry:4.7.0-280.ci

Moving to verified.

Comment 16 errata-xmlrpc 2021-05-19 09:20:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041

Comment 17 Red Hat Bugzilla 2023-09-15 01:01:55 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

