Description of problem:
-----------------------
hco.spec.tlsSecurityProfile was set to the 'Old' profile, and it was verified that the other CNV components' TLS security profiles were updated accordingly. After removing hco.spec.tlsSecurityProfile, the SSP operator pod goes into CrashLoopBackOff state for a while, then the pod restarts after a few seconds.

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
CNV v4.12.0

How reproducible:
-----------------
Always

Steps to Reproduce:
-------------------
1. Modify hco.spec.tlsSecurityProfile to the 'Old' profile
# oc patch hco --type=json kubevirt-hyperconverged -n openshift-cnv -p '[{"op": "replace", "path": "/spec/tlsSecurityProfile", "value": {"old": {}, "type": "Old"}}]'
2. Remove hco.spec.tlsSecurityProfile
# oc patch hco --type=json kubevirt-hyperconverged -n openshift-cnv -p '[{"op": "remove", "path": "/spec/tlsSecurityProfile"}]'
3. Check the SSP pods in the openshift-cnv namespace

Actual results:
---------------
The SSP operator pod goes into CrashLoopBackOff state for some time

<snip>
kubemacpool-cert-manager-754d94cdd7-xsw9r             1/1   Running            0              13h
kubemacpool-mac-controller-manager-588894c6bd-8s9bb   2/2   Running            0              24s
kubevirt-plugin-b4c89cbc6-2r8nx                       1/1   Running            0              13h
ssp-operator-5d7cdcdd65-pkr76                         0/1   CrashLoopBackOff   32 (28s ago)   13h
</snip>

Expected results:
------------------
The SSP operator should not go into CrashLoopBackOff, even for a brief amount of time
A very consistent reproducer is:
1. Modify hco.spec.tlsSecurityProfile to 'Old'
2. Verify that the SSP operator is running and that all the managed CRs (CNAO, SSP, KubeVirt & CDI) are updated with the 'old' security profile
3. Remove hco.spec.tlsSecurityProfile
4. Validate the status of the SSP operator pod

Capturing the output of the above steps on a 4.12 cluster:
----------------------------------------------------------
1. Editing hco.spec to modify tlsSecurityProfile to 'old'

<snip>
[cnv-qe-jenkins@ ~]$ oc edit hco kubevirt-hyperconverged -n openshift-cnv
hyperconverged.hco.kubevirt.io/kubevirt-hyperconverged edited
[cnv-qe-jenkins@ ~]$ sh getsecprof.sh
API server -
HCO - {"old":{},"type":"Old"}
CNAO - {"old":{},"type":"Old"}
SSP - {"old":{},"type":"Old"}
CDI - {"old":{},"type":"Old"}
Kubevirt - {"ciphers":["TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256","TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256","TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384","TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384","TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256","TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256","TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256","TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256","TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA","TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA","TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA","TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA","TLS_RSA_WITH_AES_128_GCM_SHA256","TLS_RSA_WITH_AES_256_GCM_SHA384","TLS_RSA_WITH_AES_128_CBC_SHA256","TLS_RSA_WITH_AES_128_CBC_SHA","TLS_RSA_WITH_AES_256_CBC_SHA","TLS_RSA_WITH_3DES_EDE_CBC_SHA"],"minTLSVersion":"VersionTLS10"}
</snip>

2. Confirming that the SSP operator is running

<snip>
[cnv-qe-jenkins@ ~]$ oc get pods -n openshift-cnv | grep ssp
ssp-operator-5d7cdcdd65-m4fjs   1/1   Running   4 (21s ago)   15h
[cnv-qe-jenkins@ ~]$
</snip>

3.
Remove hco.spec.tlsSecurityProfile and then check the SSP operator status

<snip>
[cnv-qe-jenkins@ ~]$ oc edit hco -n openshift-cnv kubevirt-hyperconverged
hyperconverged.hco.kubevirt.io/kubevirt-hyperconverged edited
[cnv-qe-jenkins@c01-ss-412d-9ksd9-executor ~]$ oc get pods -n openshift-cnv | grep ssp
ssp-operator-5d7cdcdd65-m4fjs   0/1   Completed          4 (56s ago)   15h
[cnv-qe-jenkins@c01-ss-412d-9ksd9-executor ~]$ oc get pods -n openshift-cnv | grep ssp
ssp-operator-5d7cdcdd65-m4fjs   0/1   Completed          4 (58s ago)   15h
[cnv-qe-jenkins@ ~]$ oc get pods -n openshift-cnv | grep ssp
ssp-operator-5d7cdcdd65-m4fjs   0/1   Completed          4 (60s ago)   15h
[cnv-qe-jenkins@ ~]$ oc get pods -n openshift-cnv | grep ssp
ssp-operator-5d7cdcdd65-m4fjs   0/1   CrashLoopBackOff   4 (8s ago)    15h
[cnv-qe-jenkins@ ~]$ oc get pods -n openshift-cnv | grep ssp
ssp-operator-5d7cdcdd65-m4fjs   0/1   CrashLoopBackOff   4 (9s ago)    15h
[cnv-qe-jenkins@ ~]$ oc get pods -n openshift-cnv | grep ssp
ssp-operator-5d7cdcdd65-m4fjs   0/1   CrashLoopBackOff   4 (10s ago)   15h
[cnv-qe-jenkins@ ~]$ oc get pods -n openshift-cnv | grep ssp
ssp-operator-5d7cdcdd65-m4fjs   0/1   CrashLoopBackOff   4 (12s ago)   15h
[cnv-qe-jenkins@ ~]$ oc get pods -n openshift-cnv | grep ssp
ssp-operator-5d7cdcdd65-m4fjs   0/1   CrashLoopBackOff   4 (13s ago)   15h
[cnv-qe-jenkins@ ~]$ oc get pods -n openshift-cnv | grep ssp
ssp-operator-5d7cdcdd65-m4fjs   0/1   CrashLoopBackOff   4 (14s ago)   15h
[cnv-qe-jenkins@ ~]$ oc get pods -n openshift-cnv | grep ssp
ssp-operator-5d7cdcdd65-m4fjs   0/1   CrashLoopBackOff   4 (16s ago)   15h
[cnv-qe-jenkins@ ~]$ oc get pods -n openshift-cnv | grep ssp
ssp-operator-5d7cdcdd65-m4fjs   0/1   CrashLoopBackOff   4 (17s ago)   15h
[cnv-qe-jenkins@ ~]$ oc get pods -n openshift-cnv | grep ssp
ssp-operator-5d7cdcdd65-m4fjs   0/1   Running            5 (19s ago)   15h
[cnv-qe-jenkins@ ~]$ oc get pods -n openshift-cnv | grep ssp
ssp-operator-5d7cdcdd65-m4fjs   0/1   Running            5 (21s ago)   15h
</snip>
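For reference, the getsecprof.sh helper used in the transcript above is not attached to this bug. Below is a minimal sketch of what such a script might look like; the CR kinds, object names, and field paths are assumptions inferred from the output shown, not taken from the report, and the cluster queries are skipped when no `oc` client is present:

```shell
#!/usr/bin/env bash
# Hypothetical reconstruction of getsecprof.sh: print the effective TLS
# security profile of each CNV-managed CR. Object names and field paths
# below are assumptions, not confirmed by this bug report.

# Print "<label> - <tlsSecurityProfile JSON>" for a single resource.
show() {
  local label=$1 kind=$2 name=$3 ns=$4 path=$5
  printf '%s - ' "$label"
  if [ -n "$ns" ]; then
    oc get "$kind" "$name" -n "$ns" -o jsonpath="{$path}"
  else
    oc get "$kind" "$name" -o jsonpath="{$path}"
  fi
  echo
}

# Only query the cluster when an oc client is actually available.
if command -v oc >/dev/null 2>&1; then
  show "API server" apiserver cluster "" .spec.tlsSecurityProfile
  show "HCO" hco kubevirt-hyperconverged openshift-cnv .spec.tlsSecurityProfile
  show "CNAO" networkaddonsconfig cluster "" .spec.tlsSecurityProfile
  show "SSP" ssp ssp-kubevirt-hyperconverged openshift-cnv .spec.tlsSecurityProfile
  show "CDI" cdi cdi-kubevirt-hyperconverged "" .spec.config.tlsSecurityProfile
  show "Kubevirt" kubevirt kubevirt-kubevirt-hyperconverged openshift-cnv .spec.configuration.tlsConfiguration
fi
```

Note that KubeVirt is queried at a different path: as the transcript shows, it stores an expanded cipher list and minTLSVersion rather than the profile name.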
Unable to reproduce with v4.12.0-745; ssp restarts only once, as expected:

<snip>
[cnv-qe-jenkins@c01-ss-412d-9ksd9-executor ~]$ oc get hco -n openshift-cnv kubevirt-hyperconverged -o=json | jq '.spec.tlsSecurityProfile'
{
  "old": {},
  "type": "Old"
}
[cnv-qe-jenkins@c01-ss-412d-9ksd9-executor ~]$ oc get pods -n openshift-cnv -l name=ssp-operator
NAME                            READY   STATUS    RESTARTS        AGE
ssp-operator-5d7cdcdd65-2664s   1/1     Running   2 (3m26s ago)   64m
[cnv-qe-jenkins@c01-ss-412d-9ksd9-executor ~]$ oc patch hco --type=json kubevirt-hyperconverged -n openshift-cnv -p '[{"op": "remove", "path": "/spec/tlsSecurityProfile" }]'
hyperconverged.hco.kubevirt.io/kubevirt-hyperconverged patched
[cnv-qe-jenkins@c01-ss-412d-9ksd9-executor ~]$ oc get hco -n openshift-cnv kubevirt-hyperconverged -o=json | jq '.spec.tlsSecurityProfile'
null
[cnv-qe-jenkins@c01-ss-412d-9ksd9-executor ~]$ oc get pods -n openshift-cnv -l name=ssp-operator
NAME                            READY   STATUS    RESTARTS       AGE
ssp-operator-5d7cdcdd65-2664s   0/1     Running   3 (15s ago)    64m
[cnv-qe-jenkins@c01-ss-412d-9ksd9-executor ~]$ oc get pods -n openshift-cnv -l name=ssp-operator
NAME                            READY   STATUS    RESTARTS       AGE
ssp-operator-5d7cdcdd65-2664s   0/1     Running   3 (19s ago)    64m
[cnv-qe-jenkins@c01-ss-412d-9ksd9-executor ~]$ oc get pods -n openshift-cnv -l name=ssp-operator
NAME                            READY   STATUS    RESTARTS       AGE
ssp-operator-5d7cdcdd65-2664s   0/1     Running   3 (20s ago)    64m
[cnv-qe-jenkins@c01-ss-412d-9ksd9-executor ~]$ oc get pods -n openshift-cnv -l name=ssp-operator
NAME                            READY   STATUS    RESTARTS       AGE
ssp-operator-5d7cdcdd65-2664s   1/1     Running   3 (24s ago)    64m
[cnv-qe-jenkins@c01-ss-412d-9ksd9-executor ~]$ oc get pods -n openshift-cnv -l name=ssp-operator
NAME                            READY   STATUS    RESTARTS       AGE
ssp-operator-5d7cdcdd65-2664s   1/1     Running   3 (27s ago)    64m
[cnv-qe-jenkins@c01-ss-412d-9ksd9-executor ~]$ oc get pods -n openshift-cnv -l name=ssp-operator
NAME                            READY   STATUS    RESTARTS       AGE
ssp-operator-5d7cdcdd65-2664s   1/1     Running   3 (30s ago)    64m
[cnv-qe-jenkins@c01-ss-412d-9ksd9-executor ~]$ oc get pods -n openshift-cnv -l name=ssp-operator
NAME                            READY   STATUS    RESTARTS       AGE
ssp-operator-5d7cdcdd65-2664s   1/1     Running   3 (3m6s ago)   67m
[cnv-qe-jenkins@c01-ss-412d-9ksd9-executor ~]$ oc get pods -n openshift-cnv -l name=ssp-operator
NAME                            READY   STATUS    RESTARTS       AGE
ssp-operator-5d7cdcdd65-2664s   1/1     Running   3 (146m ago)   3h31m
[cnv-qe-jenkins@c01-ss-412d-9ksd9-executor ~]$
</snip>

Even in the example in comment 2, ssp-operator moved from:
  Completed 4 (56s ago)
to:
  Running 5 (19s ago)

Moving from 4 to 5 means the pod got restarted only once. The issue is just that the pod had been running for only 56 seconds, so on restart it was still marked as CrashLoopBackOff. The more the experiment is repeated in a row, the more the exponential back-off time increases.

Closing as not a bug. Feel free to open a separate bug on the ssp component if you think that having a pod kill itself in order to consume a configuration change is a bad idea.
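The "restarted only once" observation above can be verified from the restart counter itself rather than by eyeballing repeated `oc get pods` output. A hedged sketch (the label selector is the one used in the transcripts; the helper name and the 60-second wait are illustrative assumptions, and the cluster interaction is skipped when no `oc` client is present):

```shell
# Hypothetical check: capture the ssp-operator restart count before and
# after removing hco.spec.tlsSecurityProfile; a delta of exactly 1 means
# the operator restarted once to consume the configuration change.
restart_count() {
  oc get pods -n openshift-cnv -l name=ssp-operator \
    -o jsonpath='{.items[0].status.containerStatuses[0].restartCount}'
}

# Only touch the cluster when an oc client is actually available.
if command -v oc >/dev/null 2>&1; then
  before=$(restart_count)
  oc patch hco --type=json kubevirt-hyperconverged -n openshift-cnv \
    -p '[{"op": "remove", "path": "/spec/tlsSecurityProfile"}]'
  sleep 60   # illustrative wait for the operator to exit and come back
  after=$(restart_count)
  echo "restarts went from $before to $after (expected delta: 1)"
fi
```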
https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy says:

  "After containers in a Pod exit, the kubelet restarts them with an
  exponential back-off delay (10s, 20s, 40s, ...), that is capped at five
  minutes. Once a container has executed for 10 minutes without any problems,
  the kubelet resets the restart backoff timer for that container."

So, in order to avoid the CrashLoopBackOff state, you should wait 10 minutes between one configuration change and the next.
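The back-off schedule quoted above can be reproduced with a few lines of shell (a sketch; the 10s starting delay and five-minute cap come from the Kubernetes documentation quoted above):

```shell
# Kubelet restart back-off as arithmetic: the delay starts at 10s,
# doubles per crash, and is capped at 300s (five minutes).
backoff_delay() {
  local restarts=$1 delay=10 i
  for ((i = 1; i < restarts; i++)); do
    delay=$((delay * 2))
    if [ "$delay" -gt 300 ]; then delay=300; break; fi
  done
  echo "$delay"
}

# Delays before restarts 1..6: 10 20 40 80 160 300 seconds
for n in 1 2 3 4 5 6; do
  echo "back-off before restart #$n: $(backoff_delay "$n")s"
done
```

This is why repeating the experiment in a row makes the CrashLoopBackOff window longer each time: the delay keeps doubling until the container has run cleanly for 10 minutes and the counter resets.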
*** This bug has been marked as a duplicate of bug 2151248 ***