Description of problem:
[azure disk csi operator] CSO is not available for 'AzureDiskCSIDriverOperatorDeploymentAvailable: Waiting for a Deployment pod to start'

Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-04-22-182303

How reproducible:
Hit once, not to reproduce it yet

Steps to Reproduce: (just the operations I did)
1. Install an OCP cluster on Azure
2. Install Azure disk csi driver with:
   oc patch featuregate cluster -p '{"spec": {"featureSet": "TechPreviewNoUpgrade"}}' --type merge
3. Delete featureset `TechPreviewNoUpgrade`
4. Run cmd:
   oc patch featuregate cluster -p '{"spec": {"customNoUpgrade": {"enabled": ["CSIDriverAzureDisk"]}, "featureSet": "CustomNoUpgrade"}}' --type merge
5. Run cmd:
   oc patch featuregate cluster -p '{"spec": {"customNoUpgrade": {"disabled": ["CSIDriverAzureDisk"]}, "featureSet": "CustomNoUpgrade"}}' --type merge
6. Run “[sig-arch]” e2e test cases

Actual results:
1. 3 test cases failed.
2. Check the CO:
   $ oc get co storage
   NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
   storage   4.8.0-0.nightly-2021-04-22-182303   False       False         False      3h42m
3. CSO conditions:
   Status:
     Conditions:
       Last Transition Time:  2021-04-23T05:35:26Z
       Message:               AzureDiskCSIDriverOperatorCRDegraded: All is well
       Reason:                AsExpected
       Status:                False
       Type:                  Degraded
       Last Transition Time:  2021-04-23T05:35:26Z
       Message:               AzureDiskCSIDriverOperatorCRProgressing: All is well
       Reason:                AsExpected
       Status:                False
       Type:                  Progressing
       Last Transition Time:  2021-04-23T05:35:26Z
       Message:               AzureDiskCSIDriverOperatorDeploymentAvailable: Waiting for a Deployment pod to start
       Reason:                AzureDiskCSIDriverOperatorDeployment_Deploying
       Status:                False
       Type:                  Available
       Last Transition Time:  2021-04-23T05:35:26Z
       Message:               All is well
       Reason:                AsExpected
       Status:                True
       Type:                  Upgradeable
     Extension:               <nil>

Expected results:

Master Log:

Node Log (of failed PODs):

PV Dump:

PVC Dump:

StorageClass Dump (if StorageClass used by PV/PVC):

Additional info:
$ oc -n openshift-cluster-csi-drivers get deployment azure-disk-csi-driver-operator -ojson|jq .status
{
  "availableReplicas": 1,
  "conditions": [
    {
      "lastTransitionTime": "2021-04-23T03:18:11Z",
      "lastUpdateTime": "2021-04-23T03:18:18Z",
      "message": "ReplicaSet \"azure-disk-csi-driver-operator-77f45b9d97\" has successfully progressed.",
      "reason": "NewReplicaSetAvailable",
      "status": "True",
      "type": "Progressing"
    },
    {
      "lastTransitionTime": "2021-04-23T05:25:46Z",
      "lastUpdateTime": "2021-04-23T05:25:46Z",
      "message": "Deployment has minimum availability.",
      "reason": "MinimumReplicasAvailable",
      "status": "True",
      "type": "Available"
    }
  ],
  "observedGeneration": 1,
  "readyReplicas": 1,
  "replicas": 1,
  "updatedReplicas": 1
}

Logs from the CSO:
I0423 05:36:52.055017 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-cluster-storage-operator", Name:"cluster-storage-operator", UID:"57823da1-cfce-45c9-8354-d5aa7e70eef2", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'FastControllerResync' Controller "DefaultStorageClassController" resync interval is set to 0s which might lead to client request throttling
I0423 05:36:52.055124 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-cluster-storage-operator", Name:"cluster-storage-operator", UID:"57823da1-cfce-45c9-8354-d5aa7e70eef2", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'FastControllerResync' Controller "SnapshotCRDController" resync interval is set to 0s which might lead to client request throttling
I0423 05:36:52.062590 1 event.go:282] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"openshift-cluster-storage-operator", Name:"cluster-storage-operator-lock", UID:"c4038e45-e071-487c-a1e5-2b9bc0a42f5b", APIVersion:"v1", ResourceVersion:"147142", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' cluster-storage-operator-c8d7c5bb5-bq9h8_9bd0a141-ee5f-4adb-8996-25979f6c2a65 became leader
I0423 05:36:52.089448 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-cluster-storage-operator", Name:"cluster-storage-operator", UID:"57823da1-cfce-45c9-8354-d5aa7e70eef2", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'FastControllerResync' Controller "CSIDriverStarter" resync interval is set to 0s which might lead to client request throttling
I0423 05:36:52.096129 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-cluster-storage-operator", Name:"cluster-storage-operator", UID:"57823da1-cfce-45c9-8354-d5aa7e70eef2", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'FastControllerResync' Controller "VSphereProblemDetectorStarter" resync interval is set to 0s which might lead to client request throttling
I0423 05:36:52.098097 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-cluster-storage-operator", Name:"cluster-storage-operator", UID:"57823da1-cfce-45c9-8354-d5aa7e70eef2", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'FastControllerResync' Controller "LoggingSyncer" resync interval is set to 0s which might lead to client request throttling

When ran must gather cmd, got the following msg:
ClusterID: 31d5f283-5dc5-472e-8460-50285d9c8b44
ClusterVersion: Stable at "4.8.0-0.nightly-2021-04-22-182303"
ClusterOperators:
    clusteroperator/kube-apiserver is not upgradeable because FeatureGatesUpgradeable: "CustomNoUpgrade" does not allow updates
    clusteroperator/storage is not available (AzureDiskCSIDriverOperatorDeploymentAvailable: Waiting for a Deployment pod to start) because AzureDiskCSIDriverOperatorCRDegraded: All is well
    clusteroperator/cloud-credential is missing
    clusteroperator/cluster-autoscaler is missing
We don't support un-installation of the CSI driver. Once TechPreviewNoUpgrade is set, it should be sticky and it should not be possible to disable it. That is the only thing we support as tech preview. It's also possible to enable the driver with customNoUpgrade, but we do not support that. I am lowering the severity, as customNoUpgrade is not supported, even as tech preview (at least I hope so). We'll see if we can do something better than Available=false, but I still do not want to support un-installation of the driver. It's hard and messy and can break apps that use PVs provisioned by the driver.
I have a theory about what may be wrong.
1. Removing TechPreviewNoUpgrade removes CSI migration, therefore all nodes are drained and restarted.
2. The CSI driver controller pod is drained from its node, and the CSI driver operator marks ClusterCSIDriver as AzureDiskCSIDriverOperatorDeploymentAvailable=false with "Waiting for a Deployment pod to start".
3. cluster-storage-operator copies the condition to the Storage CR conditions.
4. cluster-storage-operator is drained from its node.
5. A new cluster-storage-operator starts and, as it does not see CSIDriverAzureDisk / TechPreviewNoUpgrade, it does not start a controller to watch ClusterCSIDriver, so the Azure conditions on Storage are never cleared.
I need to test it.
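The theory above can be illustrated with a toy model. This is only a sketch of the suspected sequence, not the operator's actual code: StorageCR, start_operator, sync_azure_conditions and cluster_operator_available are hypothetical names, and the aggregation rule (Available only if every copied *Available condition is true) is an assumption made for the illustration.

```python
from dataclasses import dataclass, field

# Prefix taken from the conditions quoted in the report.
AZURE_PREFIX = "AzureDiskCSIDriverOperator"


@dataclass
class StorageCR:
    """Hypothetical stand-in for the Storage CR: condition type -> bool status."""
    conditions: dict = field(default_factory=dict)


def sync_azure_conditions(storage, driver_available):
    """What a running per-driver controller would do: copy the driver's state."""
    storage.conditions[AZURE_PREFIX + "DeploymentAvailable"] = driver_available


def start_operator(storage, feature_gate_enabled, driver_available):
    """Model of a freshly started cluster-storage-operator.

    The Azure controller only runs when the feature gate is on, so with the
    gate off nothing ever rewrites the previously copied conditions.
    """
    if feature_gate_enabled:
        sync_azure_conditions(storage, driver_available)


def cluster_operator_available(storage):
    """Assumed aggregation: Available only if every *Available condition is true."""
    return all(v for k, v in storage.conditions.items() if k.endswith("Available"))


storage = StorageCR()
# Gate on, driver pod drained: the controller copies Available=False.
start_operator(storage, feature_gate_enabled=True, driver_available=False)
# Operator restarts with the gate off; the driver Deployment has recovered,
# but no controller is started to observe that.
start_operator(storage, feature_gate_enabled=False, driver_available=True)
print(cluster_operator_available(storage))  # prints False: the stale condition wins
```

In this model the second start never touches the Azure condition, which matches the observed state: the Deployment reports Available=True while the storage ClusterOperator stays Available=False.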
We discussed this issue and decided not to fix it. We don't want to support removal of the driver, either via the TechPreviewNoUpgrade or the CustomNoUpgrade FeatureSet. While the feature can be disabled via CustomNoUpgrade, the status of the storage ClusterOperator is then unpredictable. It's up to the user to clean up its conditions. The presence of these conditions also suggests to the user that there is an operator and a CSI driver still running and that they need to be cleaned up too. Exactly the same thing may happen when a user downgrades from an OCP version that has the Azure CSI driver installed by default (say 4.10) to a version where the driver is optional (say 4.9): the 4.9 CSO does not know that there is a 4.10 version of the driver + operator still running and it will not remove them. Again, it's up to the user to clean up the mess. Ping, what do you think?
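For illustration only, here is a rough sketch of what "cleaning up the conditions" could mean at the data level. It works on a local dictionary shaped like a Kubernetes conditions list and does not talk to a cluster; how the result would be written back is deliberately left out. The function name drop_stale_conditions, the sample status, and the assumption that the leftover condition types all start with AzureDiskCSIDriverOperator are mine, not something confirmed in this bug.

```python
import json

# Assumption: leftover condition types carry this prefix.
STALE_PREFIX = "AzureDiskCSIDriverOperator"


def drop_stale_conditions(status, prefix=STALE_PREFIX):
    """Return a copy of `status` without conditions whose type starts with `prefix`."""
    kept = [
        c for c in status.get("conditions", [])
        if not c.get("type", "").startswith(prefix)
    ]
    return {**status, "conditions": kept}


# Hypothetical status for the example; not copied from a real cluster.
status = {
    "conditions": [
        {"type": "AzureDiskCSIDriverOperatorDeploymentAvailable", "status": "False"},
        {"type": "AzureDiskCSIDriverOperatorCRDegraded", "status": "False"},
        {"type": "DefaultStorageClassControllerAvailable", "status": "True"},
    ]
}

cleaned = drop_stale_conditions(status)
print(json.dumps(cleaned, indent=2))
```

The function returns a new dictionary and leaves its input untouched, so the original status can still be compared against the cleaned one before anything is changed.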
Hi Jan, that makes sense to me. Only one question from me: do we need to document this?
We do not document CustomNoUpgrade and its values. If you think we should, please raise a BZ against docs.