Bug 2150333 - SSP operator goes into CrashLoopBackOff for some time after removing hco.spec.tlsSecurityProfile
Summary: SSP operator goes into CrashLoopBackOff for some time after removing hco.spec.tlsSecurityProfile
Keywords:
Status: CLOSED DUPLICATE of bug 2151248
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Installation
Version: 4.12.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.12.1
Assignee: Simone Tiraboschi
QA Contact: Natalie Gavrielov
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-12-02 14:27 UTC by SATHEESARAN
Modified: 2022-12-16 14:55 UTC
CC: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-12-16 13:56:42 UTC
Target Upstream Version:
Embargoed:




Links
Red Hat Issue Tracker CNV-23076 - last updated 2022-12-02 14:30:10 UTC

Description SATHEESARAN 2022-12-02 14:27:06 UTC
Description of problem:
-----------------------
hco.spec.tlsSecurityProfile was set to the 'Old' profile, and the TLS security profiles of the other CNV components were verified to be updated accordingly. After removing hco.spec.tlsSecurityProfile, the SSP pod goes into CrashLoopBackOff state for a while and then restarts after a few seconds.

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
CNV v4.12.0

How reproducible:
-----------------
Always

Steps to Reproduce:
-------------------
1. Modify hco.spec.tlsSecurityProfile to the 'Old' profile
# oc patch hco --type=json kubevirt-hyperconverged -n openshift-cnv -p '[{"op": "replace", "path": "/spec/tlsSecurityProfile", "value": {"old": {}, "type": "Old"}}]'

2. Remove the hco.spec.tlsSecurityProfile
# oc patch hco --type=json kubevirt-hyperconverged -n openshift-cnv -p '[{"op": "remove", "path": "/spec/tlsSecurityProfile"}]'

3. Check the SSP operator pod in the openshift-cnv namespace
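
A convenient way to do step 3 is to watch the operator pod by its label (the same 'name=ssp-operator' label used in comment 3 below):

# oc get pods -n openshift-cnv -l name=ssp-operator -w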

Actual results:
---------------
The SSP operator pod goes into CrashLoopBackOff state for some time

<snip>
kubemacpool-cert-manager-754d94cdd7-xsw9r                         1/1     Running            0              13h
kubemacpool-mac-controller-manager-588894c6bd-8s9bb               2/2     Running            0              24s
kubevirt-plugin-b4c89cbc6-2r8nx                                   1/1     Running            0              13h
ssp-operator-5d7cdcdd65-pkr76                                     0/1     CrashLoopBackOff   32 (28s ago)   13h
</snip>

Expected results:
------------------
The SSP operator should not go into CrashLoopBackOff, even for a brief amount of time

Comment 2 SATHEESARAN 2022-12-16 05:02:57 UTC
A very consistent reproducer:
1. Modify hco.spec.tlsSecurityProfile to *Old*
2. Verify that the SSP operator is running and that all the managed CRs (CNAO, SSP, KubeVirt & CDI) are updated with the 'old' security profile
3. Now remove hco.spec.tlsSecurityProfile
4. Validate the status of the SSP operator pod.

Capturing the output of the above steps on a 4.12 cluster
--------------------------------------------------------

1. Editing hco.spec to set tlsSecurityProfile to 'Old'

<snip>
[cnv-qe-jenkins@ ~]$ oc edit hco kubevirt-hyperconverged -n openshift-cnv
hyperconverged.hco.kubevirt.io/kubevirt-hyperconverged edited
[cnv-qe-jenkins@ ~]$ sh getsecprof.sh 
API server - 
HCO - {"old":{},"type":"Old"}
CNAO - {"old":{},"type":"Old"}
SSP - {"old":{},"type":"Old"}
CDI - {"old":{},"type":"Old"}
Kubevirt - {"ciphers":["TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256","TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256","TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384","TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384","TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256","TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256","TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256","TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256","TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA","TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA","TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA","TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA","TLS_RSA_WITH_AES_128_GCM_SHA256","TLS_RSA_WITH_AES_256_GCM_SHA384","TLS_RSA_WITH_AES_128_CBC_SHA256","TLS_RSA_WITH_AES_128_CBC_SHA","TLS_RSA_WITH_AES_256_CBC_SHA","TLS_RSA_WITH_3DES_EDE_CBC_SHA"],"minTLSVersion":"VersionTLS10"}
</snip>
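
getsecprof.sh itself is not attached to this bug. A minimal sketch of what it plausibly does, assuming the default CR names and field paths of an HCO deployment (these names are assumptions, not taken from the actual script):

<snip>
#!/bin/sh
# Sketch only: print each component's effective TLS security profile.
# CR names and field paths below are assumed defaults, not verified against getsecprof.sh.
echo "API server - $(oc get apiserver cluster -o json | jq -c '.spec.tlsSecurityProfile // empty')"
echo "HCO - $(oc get hco -n openshift-cnv kubevirt-hyperconverged -o json | jq -c '.spec.tlsSecurityProfile')"
echo "CNAO - $(oc get networkaddonsconfig cluster -o json | jq -c '.spec.tlsSecurityProfile')"
echo "SSP - $(oc get ssp -n openshift-cnv ssp-kubevirt-hyperconverged -o json | jq -c '.spec.tlsSecurityProfile')"
echo "CDI - $(oc get cdi cdi-kubevirt-hyperconverged -o json | jq -c '.spec.config.tlsSecurityProfile')"
# KubeVirt renders the profile as an explicit cipher list plus a minimum TLS version:
echo "Kubevirt - $(oc get kubevirt -n openshift-cnv kubevirt-kubevirt-hyperconverged -o json | jq -c '.spec.configuration.tlsConfiguration')"
</snip>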

2. Confirming that SSP is running
<snip>
[cnv-qe-jenkins@ ~]$ oc get pods -n openshift-cnv | grep ssp
ssp-operator-5d7cdcdd65-m4fjs                                     1/1     Running     4 (21s ago)     15h
[cnv-qe-jenkins@ ~]$ 
</snip>

3. Remove hco.spec.tlsSecurityProfile and then check the SSP operator status
<snip>
[cnv-qe-jenkins@ ~]$ oc edit hco -n openshift-cnv kubevirt-hyperconverged
hyperconverged.hco.kubevirt.io/kubevirt-hyperconverged edited
[cnv-qe-jenkins@c01-ss-412d-9ksd9-executor ~]$ oc get pods -n openshift-cnv | grep ssp
ssp-operator-5d7cdcdd65-m4fjs                                     0/1     Completed     4 (56s ago)     15h
[cnv-qe-jenkins@c01-ss-412d-9ksd9-executor ~]$ oc get pods -n openshift-cnv | grep ssp
ssp-operator-5d7cdcdd65-m4fjs                                     0/1     Completed           4 (58s ago)     15h
[cnv-qe-jenkins@ ~]$ oc get pods -n openshift-cnv | grep ssp
ssp-operator-5d7cdcdd65-m4fjs                                     0/1     Completed   4 (60s ago)     15h
[cnv-qe-jenkins@ ~]$ oc get pods -n openshift-cnv | grep ssp
ssp-operator-5d7cdcdd65-m4fjs                                     0/1     CrashLoopBackOff   4 (8s ago)      15h
[cnv-qe-jenkins@ ~]$ oc get pods -n openshift-cnv | grep ssp
ssp-operator-5d7cdcdd65-m4fjs                                     0/1     CrashLoopBackOff   4 (9s ago)      15h
[cnv-qe-jenkins@ ~]$ oc get pods -n openshift-cnv | grep ssp
ssp-operator-5d7cdcdd65-m4fjs                                     0/1     CrashLoopBackOff   4 (10s ago)     15h
[cnv-qe-jenkins@ ~]$ oc get pods -n openshift-cnv | grep ssp
ssp-operator-5d7cdcdd65-m4fjs                                     0/1     CrashLoopBackOff   4 (12s ago)     15h
[cnv-qe-jenkins@ ~]$ oc get pods -n openshift-cnv | grep ssp
ssp-operator-5d7cdcdd65-m4fjs                                     0/1     CrashLoopBackOff   4 (13s ago)     15h
[cnv-qe-jenkins@ ~]$ oc get pods -n openshift-cnv | grep ssp
ssp-operator-5d7cdcdd65-m4fjs                                     0/1     CrashLoopBackOff   4 (14s ago)     15h
[cnv-qe-jenkins@ ~]$ oc get pods -n openshift-cnv | grep ssp
ssp-operator-5d7cdcdd65-m4fjs                                     0/1     CrashLoopBackOff   4 (16s ago)     15h
[cnv-qe-jenkins@ ~]$ oc get pods -n openshift-cnv | grep ssp
ssp-operator-5d7cdcdd65-m4fjs                                     0/1     CrashLoopBackOff   4 (17s ago)     15h
[cnv-qe-jenkins@ ~]$ oc get pods -n openshift-cnv | grep ssp
ssp-operator-5d7cdcdd65-m4fjs                                     0/1     Running     5 (19s ago)     15h
[cnv-qe-jenkins@ ~]$ oc get pods -n openshift-cnv | grep ssp
ssp-operator-5d7cdcdd65-m4fjs                                     0/1     Running     5 (21s ago)     15h
</snip>

Comment 3 Simone Tiraboschi 2022-12-16 13:56:42 UTC
Unable to reproduce with v4.12.0-745; the SSP operator restarts only once, as expected:

[cnv-qe-jenkins@c01-ss-412d-9ksd9-executor ~]$ oc get hco -n openshift-cnv kubevirt-hyperconverged -o=json | jq '.spec.tlsSecurityProfile'
{
  "old": {},
  "type": "Old"
}
[cnv-qe-jenkins@c01-ss-412d-9ksd9-executor ~]$ oc get pods -n openshift-cnv -l name=ssp-operator
NAME                            READY   STATUS    RESTARTS        AGE
ssp-operator-5d7cdcdd65-2664s   1/1     Running   2 (3m26s ago)   64m
[cnv-qe-jenkins@c01-ss-412d-9ksd9-executor ~]$ oc patch hco --type=json kubevirt-hyperconverged -n openshift-cnv -p '[{"op": "remove", "path": "/spec/tlsSecurityProfile" }]'
hyperconverged.hco.kubevirt.io/kubevirt-hyperconverged patched
[cnv-qe-jenkins@c01-ss-412d-9ksd9-executor ~]$ oc get hco -n openshift-cnv kubevirt-hyperconverged -o=json | jq '.spec.tlsSecurityProfile'
null
[cnv-qe-jenkins@c01-ss-412d-9ksd9-executor ~]$ oc get pods -n openshift-cnv -l name=ssp-operator
NAME                            READY   STATUS    RESTARTS      AGE
ssp-operator-5d7cdcdd65-2664s   0/1     Running   3 (15s ago)   64m
[cnv-qe-jenkins@c01-ss-412d-9ksd9-executor ~]$ oc get pods -n openshift-cnv -l name=ssp-operator
NAME                            READY   STATUS    RESTARTS      AGE
ssp-operator-5d7cdcdd65-2664s   0/1     Running   3 (19s ago)   64m
[cnv-qe-jenkins@c01-ss-412d-9ksd9-executor ~]$ oc get pods -n openshift-cnv -l name=ssp-operator
NAME                            READY   STATUS    RESTARTS      AGE
ssp-operator-5d7cdcdd65-2664s   0/1     Running   3 (20s ago)   64m
[cnv-qe-jenkins@c01-ss-412d-9ksd9-executor ~]$ oc get pods -n openshift-cnv -l name=ssp-operator
NAME                            READY   STATUS    RESTARTS      AGE
ssp-operator-5d7cdcdd65-2664s   1/1     Running   3 (24s ago)   64m
[cnv-qe-jenkins@c01-ss-412d-9ksd9-executor ~]$ oc get pods -n openshift-cnv -l name=ssp-operator
NAME                            READY   STATUS    RESTARTS      AGE
ssp-operator-5d7cdcdd65-2664s   1/1     Running   3 (27s ago)   64m
[cnv-qe-jenkins@c01-ss-412d-9ksd9-executor ~]$ oc get pods -n openshift-cnv -l name=ssp-operator
NAME                            READY   STATUS    RESTARTS      AGE
ssp-operator-5d7cdcdd65-2664s   1/1     Running   3 (30s ago)   64m
[cnv-qe-jenkins@c01-ss-412d-9ksd9-executor ~]$ oc get pods -n openshift-cnv -l name=ssp-operator
NAME                            READY   STATUS    RESTARTS       AGE
ssp-operator-5d7cdcdd65-2664s   1/1     Running   3 (3m6s ago)   67m
[cnv-qe-jenkins@c01-ss-412d-9ksd9-executor ~]$ oc get pods -n openshift-cnv -l name=ssp-operator
NAME                            READY   STATUS    RESTARTS       AGE
ssp-operator-5d7cdcdd65-2664s   1/1     Running   3 (146m ago)   3h31m
[cnv-qe-jenkins@c01-ss-412d-9ksd9-executor ~]$ 



Even in the example in comment 2, the ssp-operator pod moved from:
Completed     4 (56s ago)
to
Running     5 (19s ago)

So moving from 4 to 5, the pod got restarted only once.
The issue is just that the pod had been running for only 56 seconds, so on restart it was still marked as CrashLoopBackOff.
The more often we repeat the experiment in a row, the longer the exponential back-off time becomes.
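
A quick way to tell a single restart from a real crash loop is to read the container restart counter before and after the change:

# oc get pods -n openshift-cnv -l name=ssp-operator -o jsonpath='{.items[0].status.containerStatuses[0].restartCount}'

If it increments by exactly one, the pod restarted once and the CrashLoopBackOff status is just the back-off delay being applied.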

Closing as not a bug; feel free to open a different one against the SSP component if you think that having a pod kill itself in order to consume a configuration change is a bad idea.

Comment 4 Simone Tiraboschi 2022-12-16 14:03:37 UTC
https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy
says:

After containers in a Pod exit, the kubelet restarts them with an exponential back-off delay (10s, 20s, 40s, …), that is capped at five minutes.
Once a container has executed for 10 minutes without any problems, the kubelet resets the restart backoff timer for that container.

So, to avoid ending up in the CrashLoopBackOff state, you should wait 10 minutes between one configuration change and the next.
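
Purely as an illustration of that documented schedule (a hypothetical sketch, not CNV code):

<snip>
# Print the kubelet's documented restart back-off: doubling from 10s, capped at 300s.
delay=10
for crash in 1 2 3 4 5 6; do
    echo "restart $crash: back-off ${delay}s"
    delay=$((delay * 2))
    [ "$delay" -gt 300 ] && delay=300
done
</snip>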

Comment 5 Simone Tiraboschi 2022-12-16 14:55:45 UTC

*** This bug has been marked as a duplicate of bug 2151248 ***

