Bug 2108473

Summary: [vSphere CSI driver operator] CSI controller pod restarting constantly
Product: OpenShift Container Platform Reporter: Miguel Blach <mblach>
Component: Storage Assignee: Hemant Kumar <hekumar>
Storage sub component: Operators QA Contact: Wei Duan <wduan>
Status: CLOSED ERRATA Docs Contact: Olivia Payne <opayne>
Severity: low    
Priority: unspecified CC: hekumar, jsafrane, opayne, parodrig
Version: 4.10   
Target Milestone: ---   
Target Release: 4.12.0   
Hardware: All   
OS: All   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
* Previously, if more than one secret was present for vSphere, the vSphere CSI Operator randomly picked a secret and sometimes caused the Operator to restart. With this update, a warning appears when there is more than one secret on the vCenter CSI Operator. (link:https://bugzilla.redhat.com/show_bug.cgi?id=2108473[*BZ#2108473*])
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-01-17 19:53:01 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Miguel Blach 2022-07-19 07:18:59 UTC
Description of problem:

After upgrading OCP from 4.9 to 4.10 and making the appropriate changes for the CSI deployment (changing the VMX version and setting the proper permissions), the CSI driver was deployed.

After deployment, the CSI controller pods are constantly restarted, and several ReplicaSets exist for the controller:

NAME                                             DESIRED  CURRENT  READY  AGE
vmware-vsphere-csi-driver-controller-5f768dbffb  0        0        0      16m
vmware-vsphere-csi-driver-controller-6c47778856  1        1        1      23m
vmware-vsphere-csi-driver-controller-6fcf8d669d  0        0        0      17m
vmware-vsphere-csi-driver-controller-7d4d7dc494  1        0        0      17m

The provisioning and attachment operations are working fine so far.

Version-Release number of selected component (if applicable):

OCP 4.10.17

How reproducible:

Always, in one specific environment.

Steps to Reproduce:
1. Upgrade OCP from 4.9 to 4.10
2. Make the required changes for the CSI deployment


Actual results:

CSI Controller pod constantly getting redeployed

Expected results:

CSI Controller pod is not restarted.

Additional info:

Comment 1 Hemant Kumar 2022-07-19 19:11:20 UTC
I am still unsure what is causing this behaviour, but I found the following very strange entries in the KCM logs:

./namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-ocp-l-77f2m-master-1/kube-controller-manager/kube-controller-manager/logs/current.log:2022-07-13T05:35:12.324375839Z I0713 05:35:12.324358       1 replica_set.go:563] "Too few replicas" replicaSet="openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller-6c47778856" need=2 creating=1
./namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-ocp-l-77f2m-master-1/kube-controller-manager/kube-controller-manager/logs/current.log:2022-07-13T05:35:12.352374793Z I0713 05:35:12.352333       1 deployment_controller.go:490] "Error syncing deployment" deployment="openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller" err="Operation cannot be fulfilled on deployments.apps \"vmware-vsphere-csi-driver-controller\": the object has been modified; please apply your changes to the latest version and try again"
./namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-ocp-l-77f2m-master-1/kube-controller-manager/kube-controller-manager/logs/current.log:2022-07-13T05:35:12.362557790Z I0713 05:35:12.362517       1 event.go:294] "Event occurred" object="openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller-6c47778856" kind="ReplicaSet" apiVersion="apps/v1" type="Normal" reason="SuccessfulCreate" message="Created pod: vmware-vsphere-csi-driver-controller-6c47778856-bn42l"
./namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-ocp-l-77f2m-master-1/kube-controller-manager/kube-controller-manager/logs/current.log:2022-07-13T05:35:12.385904435Z I0713 05:35:12.385838       1 deployment_controller.go:490] "Error syncing deployment" deployment="openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller" err="Operation cannot be fulfilled on deployments.apps \"vmware-vsphere-csi-driver-controller\": the object has been modified; please apply your changes to the latest version and try again"
./namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-ocp-l-77f2m-master-1/kube-controller-manager/kube-controller-manager/logs/current.log:2022-07-13T05:35:13.266629870Z I0713 05:35:13.266578       1 deployment_controller.go:490] "Error syncing deployment" deployment="openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller" err="Operation cannot be fulfilled on replicasets.apps \"vmware-vsphere-csi-driver-controller-5f768dbffb\": the object has been modified; please apply your changes to the latest version and try again"
./namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-ocp-l-77f2m-master-1/kube-controller-manager/kube-controller-manager/logs/current.log:2022-07-13T05:35:13.292255781Z I0713 05:35:13.292210       1 replica_set.go:599] "Too many replicas" replicaSet="openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller-6c47778856" need=1 deleting=1
./namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-ocp-l-77f2m-master-1/kube-controller-manager/kube-controller-manager/logs/current.log:2022-07-13T05:35:13.292284049Z I0713 05:35:13.292257       1 replica_set.go:227] "Found related ReplicaSets" replicaSet="openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller-6c47778856" relatedReplicaSets=[vmware-vsphere-csi-driver-controller-7d4d7dc494 vmware-vsphere-csi-driver-controller-6fcf8d669d vmware-vsphere-csi-driver-controller-5f768dbffb vmware-vsphere-csi-driver-controller-6c47778856]
./namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-ocp-l-77f2m-master-1/kube-controller-manager/kube-controller-manager/logs/current.log:2022-07-13T05:35:13.292370500Z I0713 05:35:13.292350       1 controller_utils.go:592] "Deleting pod" controller="vmware-vsphere-csi-driver-controller-6c47778856" pod="openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller-6c47778856-bn42l"
./namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-ocp-l-77f2m-master-1/kube-controller-manager/kube-controller-manager/logs/current.log:2022-07-13T05:35:13.293773800Z I0713 05:35:13.293719       1 event.go:294] "Event occurred" object="openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller" kind="Deployment" apiVersion="apps/v1" type="Normal" reason="ScalingReplicaSet" message="Scaled down replica set vmware-vsphere-csi-driver-controller-6c47778856 to 1"
./namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-ocp-l-77f2m-master-1/kube-controller-manager/kube-controller-manager/logs/current.log:2022-07-13T05:35:13.303790403Z I0713 05:35:13.303742       1 deployment_controller.go:490] "Error syncing deployment" deployment="openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller" err="Operation cannot be fulfilled on deployments.apps \"vmware-vsphere-csi-driver-controller\": the object has been modified; please apply your changes to the latest version and try again"
./namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-ocp-l-77f2m-master-1/kube-controller-manager/kube-controller-manager/logs/current.log:2022-07-13T05:35:13.316508686Z I0713 05:35:13.316470       1 event.go:294] "Event occurred" object="openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller-6c47778856" kind="ReplicaSet" apiVersion="apps/v1" type="Normal" reason="SuccessfulDelete" message="Deleted pod: vmware-vsphere-csi-driver-controller-6c47778856-bn42l"
./namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-ocp-l-77f2m-master-1/kube-controller-manager/kube-controller-manager/logs/current.log:2022-07-13T05:35:13.327327787Z I0713 05:35:13.324156       1 replica_set.go:563] "Too few replicas" replicaSet="openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller-5f768dbffb" need=1 creating=1
./namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-ocp-l-77f2m-master-1/kube-controller-manager/kube-controller-manager/logs/current.log:2022-07-13T05:35:13.327327787Z I0713 05:35:13.325246       1 event.go:294] "Event occurred" object="openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller" kind="Deployment" apiVersion="apps/v1" type="Normal" reason="ScalingReplicaSet" message="Scaled up replica set vmware-vsphere-csi-driver-controller-5f768dbffb to 1"
./namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-ocp-l-77f2m-master-1/kube-controller-manager/kube-controller-manager/logs/current.log:2022-07-13T05:35:13.346966178Z I0713 05:35:13.346907       1 event.go:294] "Event occurred" object="openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller-5f768dbffb" kind="ReplicaSet" apiVersion="apps/v1" type="Normal" reason="SuccessfulCreate" message="Created pod: vmware-vsphere-csi-driver-controller-5f768dbffb-476sn"
./namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-ocp-l-77f2m-master-1/kube-controller-manager/kube-controller-manager/logs/current.log:2022-07-13T05:35:13.348045597Z I0713 05:35:13.348001       1 deployment_controller.go:490] "Error syncing deployment" deployment="openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller" err="Operation cannot be fulfilled on deployments.apps \"vmware-vsphere-csi-driver-controller\": the object has been modified; please apply your changes to the latest version and try again"
./namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-ocp-l-77f2m-master-1/kube-controller-manager/kube-controller-manager/logs/current.log:2022-07-13T05:35:13.398192442Z I0713 05:35:13.398145       1 deployment_controller.go:490] "Error syncing deployment" deployment="openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller" err="Operation cannot be fulfilled on deployments.apps \"vmware-vsphere-csi-driver-controller\": the object has been modified; please apply your changes to the latest version and try again"
./namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-ocp-l-77f2m-master-1/kube-controller-manager/kube-controller-manager/logs/current.log:2022-07-13T05:35:14.262704020Z I0713 05:35:14.262661       1 deployment_controller.go:490] "Error syncing deployment" deployment="openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller" err="Operation cannot be fulfilled on replicasets.apps \"vmware-vsphere-csi-driver-controller-6fcf8d669d\": the object has been modified; please apply your changes to the latest version and try again"
./namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-ocp-l-77f2m-master-1/kube-controller-manager/kube-controller-manager/logs/current.log:2022-07-13T05:35:14.290775599Z I0713 05:35:14.290733       1 replica_set.go:599] "Too many replicas" replicaSet="openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller-5f768dbffb" need=0 deleting=1


It appears that the replica count of the deployment is fluctuating between 0, 1 and 2 very frequently. I am not sure why this is happening.
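
The fluctuation is easier to see when the need= values from the ReplicaSet controller messages are tallied per ReplicaSet. The following is a minimal Python sketch and not part of the original investigation: the function name summarize_needs and the regular expression are assumptions made for illustration, and the sample lines are the message portions of four entries from the KCM log excerpt above, with the file path and timestamp prefixes trimmed.

```python
import re
from collections import defaultdict

# Matches the "Too few replicas" / "Too many replicas" messages and captures
# the ReplicaSet name and the requested replica count (need=N).
NEED_RE = re.compile(
    r'"Too (?:few|many) replicas" replicaSet="(?P<rs>[^"]+)" need=(?P<need>\d+)'
)


def summarize_needs(lines):
    """Return {replicaSet: [need values in log order]} (hypothetical helper)."""
    needs = defaultdict(list)
    for line in lines:
        match = NEED_RE.search(line)
        if match:
            needs[match.group("rs")].append(int(match.group("need")))
    return dict(needs)


# Message portions copied from the KCM log excerpt (prefixes trimmed).
SAMPLE = [
    '"Too few replicas" replicaSet="openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller-6c47778856" need=2 creating=1',
    '"Too many replicas" replicaSet="openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller-6c47778856" need=1 deleting=1',
    '"Too few replicas" replicaSet="openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller-5f768dbffb" need=1 creating=1',
    '"Too many replicas" replicaSet="openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller-5f768dbffb" need=0 deleting=1',
]
```

Calling summarize_needs(SAMPLE) groups the requested counts by ReplicaSet, so a ReplicaSet whose list keeps changing within a couple of seconds stands out immediately.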

Comment 2 Pablo Rodriguez Guillamon 2022-07-20 14:50:06 UTC
Hi Hemant, the customer commented on the upgrade path:

They updated from 4.6 (via 4.8) to 4.10

Is there any information I can ask them for to help you investigate this issue?

Comment 5 Pablo Rodriguez Guillamon 2022-08-02 06:57:00 UTC
Hi @hekumar @jsafrane 

Is there any other data I should ask the customer for? Do we have everything we need to work on the bug?

Thanks!

Comment 8 Hemant Kumar 2022-08-18 19:51:23 UTC
This was happening because, when more than one set of credentials is present in the secret, the code can arbitrarily pick one of them. The selected credentials may therefore keep changing, which in turn causes repeated deployment rollouts.

For now, we are going to warn users when this happens, via https://github.com/openshift/vmware-vsphere-csi-driver-operator/pull/104
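
The actual check lives in the pull request linked above. The Python sketch below only illustrates the idea and is not the operator's code. It assumes that the secret keys have the form <vcenter-host>.username and <vcenter-host>.password; the names credential_hosts and check_credentials, the logger name, and the wording of the warning are hypothetical.

```python
import logging

logger = logging.getLogger("vsphere-csi-sketch")  # hypothetical logger name

# Assumed key layout: "<vcenter-host>.username" and "<vcenter-host>.password".
SUFFIXES = (".username", ".password")


def credential_hosts(secret_data):
    """Return the sorted vCenter hostnames that appear in the secret keys."""
    hosts = set()
    for key in secret_data:
        for suffix in SUFFIXES:
            if key.endswith(suffix):
                hosts.add(key[: -len(suffix)])
    return sorted(hosts)


def check_credentials(secret_data):
    """Warn when the secret holds more than one set of credentials.

    Returns True only when exactly one vCenter host is configured.
    """
    hosts = credential_hosts(secret_data)
    if len(hosts) > 1:
        logger.warning(
            "more than 1 set of credentials found for CSI driver: %s",
            ", ".join(hosts),
        )
    return len(hosts) == 1
```

Sorting the hostnames makes the result deterministic, which is the property the arbitrary selection described above lacked.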

Comment 10 Wei Duan 2022-08-24 06:03:42 UTC
Setting another vCenter hostname in the secret, as below:
$ oc -n openshift-cluster-csi-drivers get secret  vmware-vsphere-cloud-credentials -ojson | jq .data
{
  "vcenter.xxx-1.vmwarevmc.com.password": "xxx",
  "vcenter.xxx-1.vmwarevmc.com.username": "xxx",
  "vcenter.xxx-2.vmwarevmc.com.password": "xxx",
  "vcenter.xxx-2.vmwarevmc.com.username": "xxx"
}


The continuous updating of vmware-vsphere-csi-driver-controller could be reproduced:
$ oc -n openshift-cluster-csi-drivers get deployment.apps/vmware-vsphere-csi-driver-controller -ojson | jq .metadata.generation;sleep 30;oc -n openshift-cluster-csi-drivers get deployment.apps/vmware-vsphere-csi-driver-controller -ojson | jq .metadata.generation
179
204
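
The same two-sample comparison can be scripted. This is a minimal Python sketch under stated assumptions: it reuses the oc command from the reproduction above without the jq filter and parses the JSON itself, and the names read_generation_with_oc and generation_delta are hypothetical. It needs a logged-in oc client to run against a cluster.

```python
import json
import subprocess
import time

# Command taken from the reproduction above, without the jq filter.
OC_CMD = [
    "oc", "-n", "openshift-cluster-csi-drivers", "get",
    "deployment.apps/vmware-vsphere-csi-driver-controller", "-ojson",
]


def read_generation_with_oc():
    """Read .metadata.generation of the controller Deployment through oc."""
    out = subprocess.run(
        OC_CMD, check=True, capture_output=True, text=True
    ).stdout
    return json.loads(out)["metadata"]["generation"]


def generation_delta(read_generation, wait_seconds=30, sleep=time.sleep):
    """Sample the generation twice and return how much it grew in between."""
    before = read_generation()
    sleep(wait_seconds)
    after = read_generation()
    return after - before
```

Calling generation_delta(read_generation_with_oc) returns the growth over 30 seconds. With the two values observed above, 179 and 204, the result would be 25, whereas a stable Deployment should normally report 0.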


The operator log now contains a clear message that alerts us to this unsupported configuration:
W0824 05:59:00.732442       1 driver_starter.go:151] CSI driver can only connect to one vcenter, more than 1 set of credentials found for CSI driver
W0824 05:59:01.113462       1 driver_starter.go:151] CSI driver can only connect to one vcenter, more than 1 set of credentials found for CSI driver
W0824 05:59:02.280448       1 driver_starter.go:151] CSI driver can only connect to one vcenter, more than 1 set of credentials found for CSI driver
W0824 05:59:02.299887       1 driver_starter.go:151] CSI driver can only connect to one vcenter, more than 1 set of credentials found for CSI driver

Marked as "Verified".

Comment 13 errata-xmlrpc 2023-01-17 19:53:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399