2233811 – SSP - pods randomly fail with segmentation violation in client-go/discovery/aggregated_discovery.go

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Container Native Virtualization (CNV)
Classification:	Red Hat
Component:	SSP
Sub Component:
Version:	4.14.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.14.0
Assignee:	Karel Šimon
QA Contact:	zhe peng
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2023-08-23 14:20 UTC by vsibirsk
Modified:	2023-11-08 14:06 UTC
CC List:	2 users
Fixed In Version:	kubevirt-ssp-operator-rhel9-container-v4.14.0-107
Doc Text:
Clone Of:
Environment:
Last Closed:	2023-11-08 14:06:16 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	kubevirt ssp-operator pull 676	None	open	fix: bump k8s dependencies	2023-08-31 08:20:56 UTC
Red Hat Issue Tracker	CNV-32375	None	None	None	2023-09-06 17:31:28 UTC
Red Hat Product Errata	RHSA-2023:6817	None	None	None	2023-11-08 14:06:27 UTC

Description vsibirsk 2023-08-23 14:20:34 UTC

Description of problem:
ssp-operator pods sometimes end up in CrashLoopBackOff state.
VIRT pods (virt-controller and/or virt-operator) are also affected.

Version-Release number of selected component (if applicable):
4.14

How reproducible:
Sporadic. We could not find the exact trigger, and not all deployed clusters are affected.

Steps to Reproduce:
1. Deploy a 4.14 CNV cluster
2. After some time, CNV pods start to fail

Actual results:
openshift-cnv pods are in CrashLoopBackOff state

Expected results:
All CNV pods are in Running state

Additional info:
pods -A | grep -v Running | grep -v Completed
NAMESPACE                                          NAME                                                              READY   STATUS             RESTARTS          AGE
openshift-cnv                                      cdi-deployment-844845fd6d-9pkjr                                   0/1     CrashLoopBackOff   186 (4m24s ago)   16h
openshift-cnv                                      cdi-operator-6499bcc5b7-xxtzc                                     0/1     CrashLoopBackOff   187 (4m32s ago)   16h
openshift-cnv                                      hostpath-provisioner-operator-f4dc64d86-vhvlf                     0/1     CrashLoopBackOff   187 (5m ago)      16h
openshift-cnv                                      ssp-operator-644c98fdc9-cjncw                                     0/1     CrashLoopBackOff   188 (4m49s ago)   16h
openshift-cnv                                      virt-controller-b5b88dd59-prtk7                                   0/1     CrashLoopBackOff   186 (84s ago)     16h
openshift-cnv                                      virt-controller-b5b88dd59-wkpvz                                   0/1     CrashLoopBackOff   187 (81s ago)     16h
openshift-cnv                                      virt-operator-5cb848c66c-2mzmk                                    0/1     CrashLoopBackOff   181 (2m16s ago)   16h
openshift-cnv                                      virt-operator-5cb848c66c-cnkvh                                    0/1     CrashLoopBackOff   182 (2m44s ago)   16h
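The listing above can also be filtered programmatically. The following is a minimal sketch, not part of the original report: it parses tabular `get pods -A` style output and keeps rows whose STATUS is neither Running nor Completed. The function names `parse_pod_listing` and `unhealthy` are hypothetical, and the sample rows are copied from the listing above. Note that it checks only the STATUS column, which is slightly stricter than the `grep -v` pipeline (that one drops any line containing the word anywhere).

```python
import re

# Header plus two rows copied from the listing above.
LISTING = """\
NAMESPACE                                          NAME                                                              READY   STATUS             RESTARTS          AGE
openshift-cnv                                      cdi-deployment-844845fd6d-9pkjr                                   0/1     CrashLoopBackOff   186 (4m24s ago)   16h
openshift-cnv                                      ssp-operator-644c98fdc9-cjncw                                     0/1     CrashLoopBackOff   188 (4m49s ago)   16h
"""


def parse_pod_listing(text):
    """Parse tabular pod output into a list of dicts keyed by column name."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    # Columns are separated by two or more spaces. The RESTARTS cell
    # ("186 (4m24s ago)") contains single spaces, so a plain split() would
    # break it into several pieces.
    header = re.split(r"\s{2,}", lines[0].strip())
    pods = []
    for line in lines[1:]:
        cells = re.split(r"\s{2,}", line.strip())
        if len(cells) != len(header):
            continue  # skip rows that do not match the header layout
        row = dict(zip(header, cells))
        # The restart count is the leading integer of the RESTARTS cell.
        row["RESTART_COUNT"] = int(row["RESTARTS"].split()[0])
        pods.append(row)
    return pods


def unhealthy(pods):
    """Keep pods whose STATUS is neither Running nor Completed."""
    return [p for p in pods if p["STATUS"] not in ("Running", "Completed")]


if __name__ == "__main__":
    for pod in unhealthy(parse_pod_listing(LISTING)):
        print(pod["NAMESPACE"], pod["NAME"], pod["STATUS"], pod["RESTART_COUNT"])
```

Splitting on runs of two or more spaces is an assumption about the column layout; it holds for the listing above but may not hold for every output format.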

Comment 1 Dominik Holler 2023-08-30 12:05:16 UTC

https://github.com/kubevirt/managed-tenant-quota/pull/11/ might be helpful to fix this bug

Comment 2 zhe peng 2023-09-14 07:27:40 UTC

Verified with build: CNV-v4.14.0.rhel9-1914

1. Deploy three CNV 4.14 clusters
2. Check the pods after the clusters have been running for a long time

ssp-operator-68fd8b6b98-2kzwp                                     1/1     Running   1 (20h ago)    20h
virt-api-7c84cd7ffd-c6zxx                                         1/1     Running   0              20h
virt-api-7c84cd7ffd-twfmk                                         1/1     Running   0              20h
virt-controller-674586dbb8-8jrnj                                  1/1     Running   0              3h37m
virt-controller-674586dbb8-vls98                                  1/1     Running   0              20h
virt-exportproxy-856598c54b-259wd                                 1/1     Running   0              20h
virt-exportproxy-856598c54b-9bmz2                                 1/1     Running   0              20h
virt-handler-97cvb                                                1/1     Running   0              20h
virt-handler-dzspl                                                1/1     Running   1 (164m ago)   20h
virt-handler-jd7fq                                                1/1     Running   0              20h
virt-operator-86b97dd8bc-5gvfz                                    1/1     Running   0              20h
virt-operator-86b97dd8bc-gf8jx                                    1/1     Running   0              20h

No pods in CrashLoopBackOff state were found.

3. Check all three clusters: no CNV pods are in CrashLoopBackOff

Moving to VERIFIED.
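
The manual check in this comment could be automated along these lines. This is a sketch under assumptions, not the procedure QA actually used: it takes the parsed JSON of a Kubernetes PodList (for example, what `kubectl get pods -n openshift-cnv -o json` prints) and reports pods with a container waiting with reason CrashLoopBackOff. The function name `crashlooping_pods` and the sample pod names are hypothetical.

```python
def crashlooping_pods(pod_list):
    """Return names of pods with at least one container in CrashLoopBackOff.

    `pod_list` is a dict shaped like a Kubernetes PodList object.
    """
    names = []
    for pod in pod_list.get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses") or []
        for cs in statuses:
            # A crash-looping container sits in the "waiting" state with
            # reason "CrashLoopBackOff" between restart attempts.
            waiting = cs.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                names.append(pod["metadata"]["name"])
                break  # one failing container is enough to flag the pod
    return names


# Hand-written sample data with hypothetical pod names (not from this bug).
SAMPLE = {
    "items": [
        {
            "metadata": {"name": "example-healthy"},
            "status": {"containerStatuses": [{"state": {"running": {}}}]},
        },
        {
            "metadata": {"name": "example-crashing"},
            "status": {
                "containerStatuses": [
                    {"state": {"waiting": {"reason": "CrashLoopBackOff"}}}
                ]
            },
        },
    ]
}

if __name__ == "__main__":
    print(crashlooping_pods(SAMPLE))
```

Running the same check against each of the three clusters and expecting an empty list every time would correspond to step 3 above.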

Comment 4 errata-xmlrpc 2023-11-08 14:06:16 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.14.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6817
