Bug 2108473 - [vSphere CSI driver operator] CSI controller pod restarting constantly
Summary: [vSphere CSI driver operator] CSI controller pod restarting constantly
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.10
Hardware: All
OS: All
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 4.12.0
Assignee: Hemant Kumar
QA Contact: Wei Duan
Docs Contact: Olivia Payne
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-07-19 07:18 UTC by Miguel Blach
Modified: 2023-05-08 14:54 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
* Previously, if more than one secret was present for vSphere, the vSphere CSI Operator randomly picked a secret and sometimes caused the Operator to restart. With this update, a warning appears when there is more than one secret on the vCenter CSI Operator. (link:https://bugzilla.redhat.com/show_bug.cgi?id=2108473[*BZ#2108473*])
Clone Of:
Environment:
Last Closed: 2023-01-17 19:53:01 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift vmware-vsphere-csi-driver-operator pull 104 | 0 | None | open | Bug 2108473: Warn if multiple credentials exists in secrets | 2022-08-18 19:50:03 UTC
Red Hat Product Errata | RHSA-2022:7399 | 0 | None | None | None | 2023-01-17 19:53:15 UTC

Description Miguel Blach 2022-07-19 07:18:59 UTC
Description of problem:

After upgrading OCP from 4.9 to 4.10 and making the appropriate changes for the CSI deployment (changing the VMX version and setting the proper permissions), the CSI driver was deployed.

After deployment, the CSI controller pods are constantly being restarted, and several ReplicaSets exist for the controller:

NAME                                             DESIRED  CURRENT  READY  AGE
vmware-vsphere-csi-driver-controller-5f768dbffb  0        0        0      16m
vmware-vsphere-csi-driver-controller-6c47778856  1        1        1      23m
vmware-vsphere-csi-driver-controller-6fcf8d669d  0        0        0      17m
vmware-vsphere-csi-driver-controller-7d4d7dc494  1        0        0      17m

The provisioning and attachment operations are working fine so far.

Version-Release number of selected component (if applicable):

OCP 4.10.17

How reproducible:

Always, in one specific environment.

Steps to Reproduce:
1. Upgrade OCP from 4.9 to 4.10
2. Make the required changes for the CSI deployment


Actual results:

The CSI controller pod is constantly redeployed.

Expected results:

The CSI controller pod is not restarted.

Additional info:

Comment 1 Hemant Kumar 2022-07-19 19:11:20 UTC
I am still unsure what is causing this behaviour, but I found the following very strange entries in the KCM logs:

./namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-ocp-l-77f2m-master-1/kube-controller-manager/kube-controller-manager/logs/current.log:2022-07-13T05:35:12.324375839Z I0713 05:35:12.324358       1 replica_set.go:563] "Too few replicas" replicaSet="openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller-6c47778856" need=2 creating=1
./namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-ocp-l-77f2m-master-1/kube-controller-manager/kube-controller-manager/logs/current.log:2022-07-13T05:35:12.352374793Z I0713 05:35:12.352333       1 deployment_controller.go:490] "Error syncing deployment" deployment="openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller" err="Operation cannot be fulfilled on deployments.apps \"vmware-vsphere-csi-driver-controller\": the object has been modified; please apply your changes to the latest version and try again"
./namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-ocp-l-77f2m-master-1/kube-controller-manager/kube-controller-manager/logs/current.log:2022-07-13T05:35:12.362557790Z I0713 05:35:12.362517       1 event.go:294] "Event occurred" object="openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller-6c47778856" kind="ReplicaSet" apiVersion="apps/v1" type="Normal" reason="SuccessfulCreate" message="Created pod: vmware-vsphere-csi-driver-controller-6c47778856-bn42l"
./namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-ocp-l-77f2m-master-1/kube-controller-manager/kube-controller-manager/logs/current.log:2022-07-13T05:35:12.385904435Z I0713 05:35:12.385838       1 deployment_controller.go:490] "Error syncing deployment" deployment="openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller" err="Operation cannot be fulfilled on deployments.apps \"vmware-vsphere-csi-driver-controller\": the object has been modified; please apply your changes to the latest version and try again"
./namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-ocp-l-77f2m-master-1/kube-controller-manager/kube-controller-manager/logs/current.log:2022-07-13T05:35:13.266629870Z I0713 05:35:13.266578       1 deployment_controller.go:490] "Error syncing deployment" deployment="openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller" err="Operation cannot be fulfilled on replicasets.apps \"vmware-vsphere-csi-driver-controller-5f768dbffb\": the object has been modified; please apply your changes to the latest version and try again"
./namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-ocp-l-77f2m-master-1/kube-controller-manager/kube-controller-manager/logs/current.log:2022-07-13T05:35:13.292255781Z I0713 05:35:13.292210       1 replica_set.go:599] "Too many replicas" replicaSet="openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller-6c47778856" need=1 deleting=1
./namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-ocp-l-77f2m-master-1/kube-controller-manager/kube-controller-manager/logs/current.log:2022-07-13T05:35:13.292284049Z I0713 05:35:13.292257       1 replica_set.go:227] "Found related ReplicaSets" replicaSet="openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller-6c47778856" relatedReplicaSets=[vmware-vsphere-csi-driver-controller-7d4d7dc494 vmware-vsphere-csi-driver-controller-6fcf8d669d vmware-vsphere-csi-driver-controller-5f768dbffb vmware-vsphere-csi-driver-controller-6c47778856]
./namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-ocp-l-77f2m-master-1/kube-controller-manager/kube-controller-manager/logs/current.log:2022-07-13T05:35:13.292370500Z I0713 05:35:13.292350       1 controller_utils.go:592] "Deleting pod" controller="vmware-vsphere-csi-driver-controller-6c47778856" pod="openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller-6c47778856-bn42l"
./namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-ocp-l-77f2m-master-1/kube-controller-manager/kube-controller-manager/logs/current.log:2022-07-13T05:35:13.293773800Z I0713 05:35:13.293719       1 event.go:294] "Event occurred" object="openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller" kind="Deployment" apiVersion="apps/v1" type="Normal" reason="ScalingReplicaSet" message="Scaled down replica set vmware-vsphere-csi-driver-controller-6c47778856 to 1"
./namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-ocp-l-77f2m-master-1/kube-controller-manager/kube-controller-manager/logs/current.log:2022-07-13T05:35:13.303790403Z I0713 05:35:13.303742       1 deployment_controller.go:490] "Error syncing deployment" deployment="openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller" err="Operation cannot be fulfilled on deployments.apps \"vmware-vsphere-csi-driver-controller\": the object has been modified; please apply your changes to the latest version and try again"
./namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-ocp-l-77f2m-master-1/kube-controller-manager/kube-controller-manager/logs/current.log:2022-07-13T05:35:13.316508686Z I0713 05:35:13.316470       1 event.go:294] "Event occurred" object="openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller-6c47778856" kind="ReplicaSet" apiVersion="apps/v1" type="Normal" reason="SuccessfulDelete" message="Deleted pod: vmware-vsphere-csi-driver-controller-6c47778856-bn42l"
./namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-ocp-l-77f2m-master-1/kube-controller-manager/kube-controller-manager/logs/current.log:2022-07-13T05:35:13.327327787Z I0713 05:35:13.324156       1 replica_set.go:563] "Too few replicas" replicaSet="openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller-5f768dbffb" need=1 creating=1
./namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-ocp-l-77f2m-master-1/kube-controller-manager/kube-controller-manager/logs/current.log:2022-07-13T05:35:13.327327787Z I0713 05:35:13.325246       1 event.go:294] "Event occurred" object="openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller" kind="Deployment" apiVersion="apps/v1" type="Normal" reason="ScalingReplicaSet" message="Scaled up replica set vmware-vsphere-csi-driver-controller-5f768dbffb to 1"
./namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-ocp-l-77f2m-master-1/kube-controller-manager/kube-controller-manager/logs/current.log:2022-07-13T05:35:13.346966178Z I0713 05:35:13.346907       1 event.go:294] "Event occurred" object="openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller-5f768dbffb" kind="ReplicaSet" apiVersion="apps/v1" type="Normal" reason="SuccessfulCreate" message="Created pod: vmware-vsphere-csi-driver-controller-5f768dbffb-476sn"
./namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-ocp-l-77f2m-master-1/kube-controller-manager/kube-controller-manager/logs/current.log:2022-07-13T05:35:13.348045597Z I0713 05:35:13.348001       1 deployment_controller.go:490] "Error syncing deployment" deployment="openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller" err="Operation cannot be fulfilled on deployments.apps \"vmware-vsphere-csi-driver-controller\": the object has been modified; please apply your changes to the latest version and try again"
./namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-ocp-l-77f2m-master-1/kube-controller-manager/kube-controller-manager/logs/current.log:2022-07-13T05:35:13.398192442Z I0713 05:35:13.398145       1 deployment_controller.go:490] "Error syncing deployment" deployment="openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller" err="Operation cannot be fulfilled on deployments.apps \"vmware-vsphere-csi-driver-controller\": the object has been modified; please apply your changes to the latest version and try again"
./namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-ocp-l-77f2m-master-1/kube-controller-manager/kube-controller-manager/logs/current.log:2022-07-13T05:35:14.262704020Z I0713 05:35:14.262661       1 deployment_controller.go:490] "Error syncing deployment" deployment="openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller" err="Operation cannot be fulfilled on replicasets.apps \"vmware-vsphere-csi-driver-controller-6fcf8d669d\": the object has been modified; please apply your changes to the latest version and try again"
./namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-ocp-l-77f2m-master-1/kube-controller-manager/kube-controller-manager/logs/current.log:2022-07-13T05:35:14.290775599Z I0713 05:35:14.290733       1 replica_set.go:599] "Too many replicas" replicaSet="openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller-5f768dbffb" need=0 deleting=1


It appears that the replica count of the deployment is fluctuating between 0, 1 and 2 very frequently. I am not sure why this is happening.
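The fluctuation can be read directly off the "Too few replicas" / "Too many replicas" messages. The following is a minimal sketch, not part of any OpenShift tooling: the function name `desired_replica_history` is hypothetical, and the sample lines are abbreviated copies of the KCM log entries above (file path prefix dropped).

```python
import re
from collections import defaultdict

# Matches the "Too few replicas" / "Too many replicas" messages from
# replica_set.go as they appear in the KCM log excerpt above.
REPLICA_RE = re.compile(
    r'"Too (?:few|many) replicas" replicaSet="(?P<rs>[^"]+)" need=(?P<need>\d+)'
)


def desired_replica_history(log_lines):
    """Return {replicaSet: [need, need, ...]} in the order the lines appear."""
    history = defaultdict(list)
    for line in log_lines:
        match = REPLICA_RE.search(line)
        if match:
            history[match.group("rs")].append(int(match.group("need")))
    return dict(history)


# Abbreviated sample taken from the log excerpt in this comment.
sample = [
    'I0713 05:35:12.324358 1 replica_set.go:563] "Too few replicas" replicaSet="openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller-6c47778856" need=2 creating=1',
    'I0713 05:35:13.292210 1 replica_set.go:599] "Too many replicas" replicaSet="openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller-6c47778856" need=1 deleting=1',
    'I0713 05:35:13.324156 1 replica_set.go:563] "Too few replicas" replicaSet="openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller-5f768dbffb" need=1 creating=1',
    'I0713 05:35:14.290733 1 replica_set.go:599] "Too many replicas" replicaSet="openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller-5f768dbffb" need=0 deleting=1',
]

history = desired_replica_history(sample)
for replica_set, needs in history.items():
    print(replica_set, needs)
```

Within about two seconds of log time, the desired count of one ReplicaSet goes from 2 to 1 and that of another from 1 to 0, which is consistent with the deployment being rolled out repeatedly.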

Comment 2 Pablo Rodriguez Guillamon 2022-07-20 14:50:06 UTC
Hi Hemant, the customer commented on the upgrade path:

They updated from 4.6 (via 4.8) to 4.10.

Is there any information I should ask them for to help you investigate this issue?

Comment 5 Pablo Rodriguez Guillamon 2022-08-02 06:57:00 UTC
Hi @hekumar @jsafrane 

Is there any other data I should ask the customer for? Do we have everything we need to work on the bug?

Thanks!

Comment 8 Hemant Kumar 2022-08-18 19:51:23 UTC
This was happening because, when more than one set of credentials is present in the secret, the code can arbitrarily pick one of them. The credentials in use may therefore keep changing, which causes deployment rollouts.

For now, we are going to warn users when this happens, via https://github.com/openshift/vmware-vsphere-csi-driver-operator/pull/104
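The effect described in this comment can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the operator's code: `render_config`, `config_hash` and `distinct_hashes` are hypothetical helpers, the hostnames and credentials are made up, and hashing the rendered config stands in for whatever change detection actually triggers a rollout.

```python
import hashlib
import random


def render_config(credentials, pick):
    """Render a hypothetical driver config from one chosen credential set."""
    host = pick(sorted(credentials))
    user, password = credentials[host]
    return f"host={host}\nuser={user}\npassword={password}\n"


def config_hash(config):
    """Fingerprint of the rendered config (assumed stand-in for change detection)."""
    return hashlib.sha256(config.encode()).hexdigest()


def distinct_hashes(credentials, pick, syncs=50):
    """Simulate `syncs` reconcile loops and collect the distinct config hashes."""
    return {config_hash(render_config(credentials, pick)) for _ in range(syncs)}


# Made-up credentials for two vCenters (illustrative values only).
credentials = {
    "vcenter-1.example.com": ("user1", "pw1"),
    "vcenter-2.example.com": ("user2", "pw2"),
}

rng = random.Random(0)  # fixed seed so the simulation is repeatable
arbitrary = distinct_hashes(credentials, rng.choice)          # arbitrary pick
stable = distinct_hashes(credentials, lambda hosts: hosts[0])  # deterministic pick

print("arbitrary pick:", len(arbitrary), "distinct configs")
print("deterministic pick:", len(stable), "distinct config")
```

With an arbitrary pick the rendered config alternates between the two vCenters, so every change looks like a new desired state; a deterministic pick yields a single stable config. The fix referenced above does not change the selection, it only warns about the unsupported configuration.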

Comment 10 Wei Duan 2022-08-24 06:03:42 UTC
Adding a second vCenter hostname to the secret, as below:
$ oc -n openshift-cluster-csi-drivers get secret  vmware-vsphere-cloud-credentials -ojson | jq .data
{
  "vcenter.xxx-1.vmwarevmc.com.password": "xxx",
  "vcenter.xxx-1.vmwarevmc.com.username": "xxx",
  "vcenter.xxx-2.vmwarevmc.com.password": "xxx",
  "vcenter.xxx-2.vmwarevmc.com.username": "xxx"
}
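A configuration like the one above can be detected from the key names alone. The sketch below assumes the `<hostname>.username` / `<hostname>.password` key layout shown in the secret; `vcenter_hosts` and `check_credentials` are hypothetical names, and the returned text is illustrative rather than the operator's actual message.

```python
def vcenter_hosts(secret_data):
    """Return the sorted vCenter hostnames found in the secret's key names.

    Assumes keys of the form "<hostname>.username" and "<hostname>.password".
    """
    hosts = set()
    for key in secret_data:
        for suffix in (".username", ".password"):
            if key.endswith(suffix):
                hosts.add(key[: -len(suffix)])
    return sorted(hosts)


def check_credentials(secret_data):
    """Return a warning string if more than one vCenter is present, else None."""
    hosts = vcenter_hosts(secret_data)
    if len(hosts) > 1:
        return f"{len(hosts)} sets of credentials found: {', '.join(hosts)}"
    return None


# The .data section of the secret as shown above (values are redacted there too).
secret_data = {
    "vcenter.xxx-1.vmwarevmc.com.password": "xxx",
    "vcenter.xxx-1.vmwarevmc.com.username": "xxx",
    "vcenter.xxx-2.vmwarevmc.com.password": "xxx",
    "vcenter.xxx-2.vmwarevmc.com.username": "xxx",
}

print(check_credentials(secret_data))
```

The output of `oc ... get secret ... -ojson | jq .data` can be loaded with `json.load` and passed to `check_credentials` as is, since only the key names are inspected.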


The constant updating of the vmware-vsphere-csi-driver-controller deployment could be reproduced:
$ oc -n openshift-cluster-csi-drivers get deployment.apps/vmware-vsphere-csi-driver-controller -ojson | jq .metadata.generation;sleep 30;oc -n openshift-cluster-csi-drivers get deployment.apps/vmware-vsphere-csi-driver-controller -ojson | jq .metadata.generation
179
204
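The same check can be scripted. This is a rough sketch assuming `oc` is on the PATH and logged in to the cluster; `fetch_generation` and `generation_delta` are hypothetical helper names, and the fetch and sleep functions are injectable so the logic can be exercised without a cluster.

```python
import json
import subprocess
import time


def fetch_generation():
    """Read .metadata.generation of the controller deployment via `oc`."""
    out = subprocess.run(
        [
            "oc", "-n", "openshift-cluster-csi-drivers", "get",
            "deployment.apps/vmware-vsphere-csi-driver-controller", "-o", "json",
        ],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return json.loads(out)["metadata"]["generation"]


def generation_delta(fetch=fetch_generation, wait_seconds=30, sleep=time.sleep):
    """Return how much the deployment generation grew over `wait_seconds`."""
    before = fetch()
    sleep(wait_seconds)
    after = fetch()
    return after - before


# Offline example using the two values observed above instead of a live cluster.
observed = iter([179, 204])
delta = generation_delta(fetch=lambda: next(observed), sleep=lambda _: None)
print(f"generation grew by {delta} in 30 seconds")
```

On a healthy cluster the delta should normally be 0, because the generation only increases when the deployment spec changes.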


The operator log now shows a clear message that alerts us to this unsupported configuration:
W0824 05:59:00.732442       1 driver_starter.go:151] CSI driver can only connect to one vcenter, more than 1 set of credentials found for CSI driver
W0824 05:59:01.113462       1 driver_starter.go:151] CSI driver can only connect to one vcenter, more than 1 set of credentials found for CSI driver
W0824 05:59:02.280448       1 driver_starter.go:151] CSI driver can only connect to one vcenter, more than 1 set of credentials found for CSI driver
W0824 05:59:02.299887       1 driver_starter.go:151] CSI driver can only connect to one vcenter, more than 1 set of credentials found for CSI driver

Marked as "Verified".

Comment 13 errata-xmlrpc 2023-01-17 19:53:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399

