Bug 1952826
| Summary: | [azure disk csi operator] CSO is not available for 'AzureDiskCSIDriverOperatorDeploymentAvailable: Waiting for a Deployment pod to start' | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Qin Ping <piqin> |
| Component: | Storage | Assignee: | Jan Safranek <jsafrane> |
| Storage sub component: | Operators | QA Contact: | Wei Duan <wduan> |
| Status: | CLOSED NOTABUG | Docs Contact: | |
| Severity: | low | | |
| Priority: | unspecified | CC: | aos-bugs, jsafrane |
| Version: | 4.8 | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-05-13 14:07:20 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description Qin Ping 2021-04-23 09:44:48 UTC
We don't support un-installation of the CSI driver. Once TechPreviewNoUpgrade is set, it should be sticky and it should not be possible to disable it. That's what we support as tech preview. It's also possible to enable the driver with CustomNoUpgrade, but we do not support that. I am lowering the severity, as CustomNoUpgrade is not supported, even as tech preview (at least I hope so). We'll see if we can do something better than Available=false, but I still do not want to support un-installation of the driver. It's hard and messy and can break apps that use PVs provisioned by the driver.

I have a theory about what may be wrong:

1. Removing TechPreviewNoUpgrade removes CSI migration, therefore all nodes are drained and restarted.
2. The CSI driver controller pod is drained from its node; the CSI driver operator marks ClusterCSIDriver as AzureDiskCSIDriverOperatorDeploymentAvailable=false with "Waiting for a Deployment pod to start".
3. cluster-storage-operator copies the condition to the Storage CR conditions.
4. cluster-storage-operator is drained from its node.
5. A new cluster-storage-operator starts and, since it does not see CSIDriverAzureDisk / TechPreviewNoUpgrade, it does not start a controller to watch ClusterCSIDriver, so the Azure conditions on the Storage CR are never cleared.

I need to test it.

We discussed this issue and decided not to fix it. We don't want to support removal of the driver, either via the TechPreviewNoUpgrade or CustomNoUpgrade FeatureSets. While the feature can be disabled via CustomNoUpgrade, the status of the storage ClusterOperator is then unpredictable; it's up to the user to clean up its conditions. The presence of these conditions also suggests to the user that an operator and CSI driver are still running and need to be cleaned up too.
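The five-step theory above can be sketched as a toy simulation (a minimal sketch; the function and variable names are illustrative, not taken from cluster-storage-operator's actual code): a sync loop that only reconciles conditions while the feature gate is present leaves whatever it last wrote stuck in place once the gate is removed.

```python
# Toy model of the stale-condition theory described above.
# All names here are hypothetical, for illustration only.

def sync_storage_conditions(storage_conditions, driver_conditions, gate_enabled):
    """Copy CSI-driver conditions onto the Storage CR, but only while
    the feature gate is set; with the gate removed, the controller is
    never started and nothing touches the old conditions."""
    if not gate_enabled:
        # Step 5: operator does not watch ClusterCSIDriver anymore,
        # so previously copied conditions remain as-is.
        return storage_conditions
    merged = dict(storage_conditions)
    merged.update(driver_conditions)
    return merged

# Steps 1-3: gate still enabled, driver pod drained, Available=false
# is copied onto the Storage CR.
storage = {}
driver = {"AzureDiskCSIDriverOperatorDeploymentAvailable": "False"}
storage = sync_storage_conditions(storage, driver, gate_enabled=True)

# Steps 4-5: new CSO starts without the gate; even if the driver
# deployment later recovers, the stale condition is never cleared.
driver = {"AzureDiskCSIDriverOperatorDeploymentAvailable": "True"}
storage = sync_storage_conditions(storage, driver, gate_enabled=False)
print(storage["AzureDiskCSIDriverOperatorDeploymentAvailable"])  # prints "False"
```

The sketch only captures the control-flow point: once the gate is gone, no controller exists to reconcile the conditions away.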
Exactly the same thing may happen when a user downgrades from an OCP version that has the Azure CSI driver installed by default (say 4.10) to a version where the driver is optional (say 4.9): the 4.9 CSO does not know that a 4.10 version of the driver and operator is still running, and it will not remove them. Again, it's up to the user to clean up the mess. Ping, what do you think?

Hi, Jan. Makes sense to me. Only one question: do we need to document this?

We do not document CustomNoUpgrade and its values. If you think we should, please raise a BZ against docs.
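For reference, the unsupported CustomNoUpgrade path discussed in the comments above would be set via the cluster FeatureGate CR, roughly like this (a sketch; the gate name CSIDriverAzureDisk is taken from comment 5 above, and removing it again triggers the stale-condition behavior this bug describes):

```yaml
apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  name: cluster
spec:
  featureSet: CustomNoUpgrade
  customNoUpgrade:
    enabled:
    - CSIDriverAzureDisk
```

Unlike TechPreviewNoUpgrade, this list can later be edited to drop the gate, which is exactly the un-installation path the team decided not to support.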