Bug 2233098 - Storage - pods randomly fail with segmentation violation in client-go/discovery/aggregated_discovery.go
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Storage
Version: 4.14.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.14.0
Assignee: Alex Kalenyuk
QA Contact: Kevin Alon Goldblatt
URL:
Whiteboard: blocker
Depends On:
Blocks:
 
Reported: 2023-08-21 12:43 UTC by Kevin Alon Goldblatt
Modified: 2023-11-08 14:06 UTC (History)
3 users

Fixed In Version: CNV v4.14.0.rhel9-1724
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-11-08 14:06:16 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github kubevirt containerized-data-importer pull 2855 0 None open bump k8s.io/client-go dep for discovery client fixes 2023-08-21 16:38:11 UTC
Github kubevirt containerized-data-importer pull 2863 0 None Merged [release-v1.57] Backport 'bump k8s.io/client-go dep for discovery client fixes' 2023-08-24 08:25:13 UTC
Red Hat Issue Tracker CNV-32284 0 None None None 2023-08-21 13:03:16 UTC
Red Hat Product Errata RHSA-2023:6817 0 None None None 2023-11-08 14:06:27 UTC
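Per the pull requests linked above, the fix amounted to bumping the k8s.io/client-go dependency so CDI picks up the upstream fixes for the aggregated discovery client crash. A minimal sketch of what such a bump looks like in a consumer's go.mod; the module path and version strings here are illustrative, not the exact ones used in PR 2855:

```go
// go.mod fragment (illustrative module path and versions, not those of the actual PR)
module example.com/cdi-consumer

go 1.20

require (
	k8s.io/apimachinery v0.27.4
	// Bumped to pull in the aggregated discovery client fixes that
	// address the segmentation violation in aggregated_discovery.go.
	k8s.io/client-go v0.27.4
)
```

After editing the require directive, `go mod tidy` would update go.sum and transitive dependencies accordingly.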

Description Kevin Alon Goldblatt 2023-08-21 12:43:27 UTC
Description of problem:
hostpath-provisioner-operator pods, among others, are failing randomly after cluster deployment.

Other pods also failing:
virt-controller-b5b88dd59-8pr8j
machine-api-controllers-7dc65b48df-tbpcs
cdi-deployment-844845fd6d-n2pz2
cluster-node-tuning-operator-6f78bdb995-2qg77
openshift-adp-controller-manager

Version-Release number of selected component (if applicable):
Deployed: OCP-4.14.0-ec.3
Deployed: CNV-v4.14.0.rhel9-1576
How reproducible:
Happened on 3 deployments but each time random pods are failing

Steps to Reproduce:
1. Deploy PSI env with 4.14
2. After a while (sometimes more than an hour), the hostpath-provisioner-operator pod, among others, starts failing.

Actual results:
openshift-cnv pods are in CrashLoopBackOff status

Expected results:
Pods should be in Running state

Additional info:

oc get pods -A |grep -v Running |grep -v Completed
NAMESPACE                                          NAME                                                              READY   STATUS             RESTARTS          AGE
openshift-adp                                      openshift-adp-controller-manager-5dbc95bc86-cxw84                 0/1     CrashLoopBackOff   942 (2m51s ago)   3d19h
openshift-cluster-node-tuning-operator             cluster-node-tuning-operator-6f78bdb995-2qg77                     0/1     CrashLoopBackOff   749 (16s ago)     3d20h
openshift-cnv                                      cdi-deployment-844845fd6d-n2pz2                                   0/1     CrashLoopBackOff   943 (3m8s ago)    3d19h
openshift-cnv                                      cdi-operator-58b6766b45-mhmg2                                     0/1     CrashLoopBackOff   943 (19s ago)     3d19h
openshift-cnv                                      hostpath-provisioner-operator-b74bbd4ff-7x4bm                     0/1     CrashLoopBackOff   943 (33s ago)     3d19h
openshift-cnv                                      virt-controller-b5b88dd59-8pr8j                                   0/1     CrashLoopBackOff   1050 (4m4s ago)   3d19h
openshift-cnv                                      virt-controller-b5b88dd59-zmcx7                                   0/1     CrashLoopBackOff   938 (2m54s ago)   3d19h
openshift-cnv                                      virt-operator-64d8f997bf-r4q8f                                    0/1     CrashLoopBackOff   993 (2m54s ago)   3d19h
openshift-cnv                                      virt-operator-64d8f997bf-zfhgf                                    0/1     CrashLoopBackOff   910 (109s ago)    3d19h
openshift-machine-api                              machine-api-controllers-7dc65b48df-tbpcs                          6/7     CrashLoopBackOff   760 (36s ago)     3d20h
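The listing above can be reduced to just the crashlooping pods. A minimal sketch using awk over saved output; the heredoc sample stands in for live `oc get pods -A --no-headers` output, with pod names taken from this report:

```shell
# Sample rows standing in for live `oc get pods -A --no-headers` output.
cat <<'EOF' > /tmp/pods.txt
openshift-cnv        cdi-operator-58b6766b45-mhmg2                 0/1  CrashLoopBackOff  943  3d19h
openshift-cnv        hostpath-provisioner-operator-b74bbd4ff-7x4bm 0/1  CrashLoopBackOff  943  3d19h
openshift-apiserver  apiserver-abc123                              1/1  Running           0    3d19h
EOF

# Print namespace/name for every pod stuck in CrashLoopBackOff
# (field 4 is the STATUS column in headerless output).
awk '$4 == "CrashLoopBackOff" {print $1 "/" $2}' /tmp/pods.txt
```

For each pod this surfaces, `oc logs --previous -n <namespace> <pod>` retrieves the logs of the crashed container, which is where the segmentation violation stack trace from aggregated_discovery.go appears.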

Comment 1 Kevin Alon Goldblatt 2023-09-07 22:20:16 UTC
Verified with the following code:
-----------------------------------------------------------------------
oc get csv -n openshift-cnv
NAME                                       DISPLAY                       VERSION   REPLACES                                   PHASE
kubevirt-hyperconverged-operator.v4.14.0   OpenShift Virtualization      4.14.0    kubevirt-hyperconverged-operator.v4.13.3   Succeeded
openshift-pipelines-operator-rh.v1.11.0    Red Hat OpenShift Pipelines   1.11.0                                               Succeeded


v4.14.0.rhel9-1793

oc version
Client Version: 4.14.0-ec.3
Kustomize Version: v5.0.1
Server Version: 4.14.0-ec.3
Kubernetes Version: v1.27.3+e8b13aa

Verified with the following scenario:
------------------------------------------------------------------------
oc get pods -A |grep -v Running |grep -v Completed  >>>> no pods are failing on an environment that has been running for several days
This is not happening in the latest builds.

Moving to verified!

Comment 3 errata-xmlrpc 2023-11-08 14:06:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.14.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6817

