Bug 2233098 - Storage - pods randomly fail with segmentation violation in client-go/discovery/aggregated_discovery.go
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Storage
Version: 4.14.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.14.0
Assignee: Alex Kalenyuk
QA Contact: Kevin Alon Goldblatt
URL:
Whiteboard: blocker
Depends On:
Blocks:
 
Reported: 2023-08-21 12:43 UTC by Kevin Alon Goldblatt
Modified: 2023-11-08 14:06 UTC (History)
3 users

Fixed In Version: CNV v4.14.0.rhel9-1724
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-11-08 14:06:16 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github kubevirt containerized-data-importer pull 2855 0 None open bump k8s.io/client-go dep for discovery client fixes 2023-08-21 16:38:11 UTC
Github kubevirt containerized-data-importer pull 2863 0 None Merged [release-v1.57] Backport 'bump k8s.io/client-go dep for discovery client fixes' 2023-08-24 08:25:13 UTC
Red Hat Issue Tracker CNV-32284 0 None None None 2023-08-21 13:03:16 UTC
Red Hat Product Errata RHSA-2023:6817 0 None None None 2023-11-08 14:06:27 UTC
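Per the pull requests linked above, the fix amounted to bumping the k8s.io/client-go dependency so CDI picks up the upstream fixes for the aggregated discovery client crash. A minimal sketch of what such a bump looks like in a consumer's go.mod; the module path and version strings here are illustrative, not the exact ones used in PR 2855:

```go
// go.mod fragment (illustrative module path and versions, not those of the actual PR)
module example.com/cdi-consumer

go 1.20

require (
	k8s.io/apimachinery v0.27.4
	// Bumped to pull in the aggregated discovery client fixes that
	// address the segmentation violation in aggregated_discovery.go.
	k8s.io/client-go v0.27.4
)
```

After editing the require directive, `go mod tidy` would update go.sum and transitive dependencies accordingly.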

Description Kevin Alon Goldblatt 2023-08-21 12:43:27 UTC
Description of problem:
hostpath-provisioner-operator pods, among others, are failing randomly after cluster deployment.

Other pods also failing:
virt-controller-b5b88dd59-8pr8j
machine-api-controllers-7dc65b48df-tbpcs
cdi-deployment-844845fd6d-n2pz2
cluster-node-tuning-operator-6f78bdb995-2qg77
openshift-adp-controller-manager

Version-Release number of selected component (if applicable):
Deployed: OCP-4.14.0-ec.3
Deployed: CNV-v4.14.0.rhel9-1576
How reproducible:
Happened on 3 deployments but each time random pods are failing

Steps to Reproduce:
1. Deploy PSI env with 4.14
2. After a while (sometimes more than an hour), the hostpath-provisioner-operator pod, among others, starts failing.

Actual results:
openshift-cnv pods are in CrashLoopBackOff status

Expected results:
Pods should be in Running state

Additional info:

oc get pods -A |grep -v Running |grep -v Completed
NAMESPACE                                          NAME                                                              READY   STATUS             RESTARTS          AGE
openshift-adp                                      openshift-adp-controller-manager-5dbc95bc86-cxw84                 0/1     CrashLoopBackOff   942 (2m51s ago)   3d19h
openshift-cluster-node-tuning-operator             cluster-node-tuning-operator-6f78bdb995-2qg77                     0/1     CrashLoopBackOff   749 (16s ago)     3d20h
openshift-cnv                                      cdi-deployment-844845fd6d-n2pz2                                   0/1     CrashLoopBackOff   943 (3m8s ago)    3d19h
openshift-cnv                                      cdi-operator-58b6766b45-mhmg2                                     0/1     CrashLoopBackOff   943 (19s ago)     3d19h
openshift-cnv                                      hostpath-provisioner-operator-b74bbd4ff-7x4bm                     0/1     CrashLoopBackOff   943 (33s ago)     3d19h
openshift-cnv                                      virt-controller-b5b88dd59-8pr8j                                   0/1     CrashLoopBackOff   1050 (4m4s ago)   3d19h
openshift-cnv                                      virt-controller-b5b88dd59-zmcx7                                   0/1     CrashLoopBackOff   938 (2m54s ago)   3d19h
openshift-cnv                                      virt-operator-64d8f997bf-r4q8f                                    0/1     CrashLoopBackOff   993 (2m54s ago)   3d19h
openshift-cnv                                      virt-operator-64d8f997bf-zfhgf                                    0/1     CrashLoopBackOff   910 (109s ago)    3d19h
openshift-machine-api                              machine-api-controllers-7dc65b48df-tbpcs                          6/7     CrashLoopBackOff   760 (36s ago)     3d20h
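The listing above can be reduced to just the crashlooping pods. A minimal sketch using awk over saved output; the heredoc sample stands in for live `oc get pods -A --no-headers` output, with pod names taken from this report:

```shell
# Sample rows standing in for live `oc get pods -A --no-headers` output.
cat <<'EOF' > /tmp/pods.txt
openshift-cnv        cdi-operator-58b6766b45-mhmg2                 0/1  CrashLoopBackOff  943  3d19h
openshift-cnv        hostpath-provisioner-operator-b74bbd4ff-7x4bm 0/1  CrashLoopBackOff  943  3d19h
openshift-apiserver  apiserver-abc123                              1/1  Running           0    3d19h
EOF

# Print namespace/name for every pod stuck in CrashLoopBackOff
# (field 4 is the STATUS column in headerless output).
awk '$4 == "CrashLoopBackOff" {print $1 "/" $2}' /tmp/pods.txt
```

For each pod this surfaces, `oc logs --previous -n <namespace> <pod>` retrieves the logs of the crashed container, which is where the segmentation violation stack trace from aggregated_discovery.go appears.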

Comment 1 Kevin Alon Goldblatt 2023-09-07 22:20:16 UTC
Verified with the following code:
-----------------------------------------------------------------------
oc get csv -n openshift-cnv
NAME                                       DISPLAY                       VERSION   REPLACES                                   PHASE
kubevirt-hyperconverged-operator.v4.14.0   OpenShift Virtualization      4.14.0    kubevirt-hyperconverged-operator.v4.13.3   Succeeded
openshift-pipelines-operator-rh.v1.11.0    Red Hat OpenShift Pipelines   1.11.0                                               Succeeded


v4.14.0.rhel9-1793

oc version
Client Version: 4.14.0-ec.3
Kustomize Version: v5.0.1
Server Version: 4.14.0-ec.3
Kubernetes Version: v1.27.3+e8b13aa

Verified with the following scenario:
------------------------------------------------------------------------
oc get pods -A |grep -v Running |grep -v Completed  >>>> no pods are failing on an environment that has been running for several days
This is not happening in the latest builds.

Moving to verified!

Comment 3 errata-xmlrpc 2023-11-08 14:06:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.14.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6817

