Bug 2081557

Summary: [bz-Storage] clusteroperator/storage should not change condition/Available
Product: OpenShift Container Platform
Reporter: W. Trevor King <wking>
Component: Storage
Storage sub component: Operators
Assignee: Jan Safranek <jsafrane>
QA Contact: Wei Duan <wduan>
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium
CC: aos-bugs, jsafrane, sippy
Version: 4.11
Target Release: 4.12.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Type: Bug
Last Closed: 2023-01-03 11:42:56 UTC

Description W. Trevor King 2022-05-04 00:11:48 UTC
[bz-Storage] clusteroperator/storage should not change condition/Available

is failing frequently in CI, see [1] and:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=storage+should+not+change+condition/Available' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-ovn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade (all) - 4 runs, 75% failed, 100% of failures match = 75% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-vsphere-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.11-e2e-aws-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.11-e2e-azure-upgrade-single-node (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-upgrade (all) - 5 runs, 40% failed, 50% of failures match = 20% impact
periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-ovn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade (all) - 62 runs, 50% failed, 168% of failures match = 84% impact
periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-gcp-ovn-upgrade (all) - 20 runs, 100% failed, 10% of failures match = 10% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-vsphere-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-vsphere-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.8-upgrade-from-stable-4.7-e2e-openstack-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
pull-ci-openshift-machine-config-operator-release-4.10-e2e-aws-upgrade-single-node (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
pull-ci-openshift-machine-config-operator-release-4.10-e2e-vsphere-upgrade (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
pull-ci-openshift-origin-master-e2e-aws-single-node-upgrade (all) - 9 runs, 78% failed, 100% of failures match = 78% impact
release-openshift-origin-installer-e2e-aws-upgrade-4.5-to-4.6-to-4.7-to-4.8-ci (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-e2e-aws-upgrade-4.6-to-4.7-to-4.8-to-4.9-ci (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-e2e-aws-upgrade-4.7-to-4.8-to-4.9-to-4.10-ci (all) - 1 runs, 100% failed, 100% of failures match = 100% impact

For example, [2] has:

  : [bz-Management Console] clusteroperator/console should not change condition/Available
    Run #0: Failed	2h2m40s
    {  4 unexpected clusteroperator state transitions during e2e test run 

    May 03 14:18:09.477 - 1s    E clusteroperator/console condition/Available status/False reason/RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.ci-op-mqrp73ib-0e085.aws-2.ci.openshift.org): Get "https://console-openshift-console.apps.ci-op-mqrp73ib-0e085.aws-2.ci.openshift.org": dial tcp: lookup console-openshift-console.apps.ci-op-mqrp73ib-0e085.aws-2.ci.openshift.org on 172.30.0.10:53: read udp 10.129.0.113:33286->172.30.0.10:53: read: connection refused
    1 tests failed during this blip (2022-05-03 14:18:09.477172196 +0000 UTC to 2022-05-03 14:18:09.477172196 +0000 UTC): [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]
    May 03 14:52:43.462 - 598ms E clusteroperator/console condition/Available status/False reason/RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.ci-op-mqrp73ib-0e085.aws-2.ci.openshift.org): Get "https://console-openshift-console.apps.ci-op-mqrp73ib-0e085.aws-2.ci.openshift.org": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
    1 tests failed during this blip (2022-05-03 14:52:43.462555534 +0000 UTC to 2022-05-03 14:52:43.462555534 +0000 UTC): [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]}

With:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-upgrade/1521470811435700224/build-log.txt | grep 'clusteroperator/storage condition/Available.*changed'
  May 03 14:12:27.267 E clusteroperator/storage condition/Available status/False reason/DefaultStorageClassController_SyncError changed: DefaultStorageClassControllerAvailable: Get "https://172.30.0.1:443/apis/storage.k8s.io/v1/storageclasses/gp2": context canceled
  May 03 14:12:27.313 W clusteroperator/storage condition/Available status/True reason/AsExpected changed: AWSEBSCSIDriverOperatorCRAvailable: All is well

The test-case is flake-only, so this isn't impacting CI success rates.  But having the operator claim Available=False is not a great customer experience. Possibly not a big enough UX impact to be worth backports, but certainly a big enough UX impact to be worth fixing in the development branch.

[1]: https://sippy.ci.openshift.org/sippy-ng/tests/4.11/analysis?test=%5Bbz-Storage%5D%20clusteroperator%2Fstorage%20should%20not%20change%20condition%2FAvailable
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-upgrade/1521470811435700224

Comment 1 Jan Safranek 2022-05-06 14:21:08 UTC
For the DefaultStorageClassController_SyncError: the controller should not flip Available: true -> false after a single error.
(It could instead mark the operator as Degraded: true, with some inertia.)

At the same time, the controller / ApplyStorageClass should use an informer rather than issuing GET calls.

(And someone should check whether all Available: true -> false changes in CI are just DefaultStorageClassController_SyncError, or whether other controllers need fixing as well.)
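
For illustration, a minimal sketch of the informer-backed read described above - a hypothetical standalone program using client-go, where the storage-class name, resync period and error handling are placeholders rather than the operator's actual code:

package main

import (
    "time"

    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

func main() {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)

    // A shared informer keeps a local, watch-driven cache of StorageClasses,
    // so each sync reads from memory instead of issuing a GET to the apiserver.
    factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
    scLister := factory.Storage().V1().StorageClasses().Lister()

    stop := make(chan struct{})
    defer close(stop)
    factory.Start(stop)
    factory.WaitForCacheSync(stop)

    // Cache read: a brief apiserver hiccup (like the "context canceled" GET
    // in the log above) no longer turns into a sync error.
    if _, err := scLister.Get("gp2"); err != nil {
        // handle apierrors.IsNotFound(err) etc. here
    }
}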

Comment 2 W. Trevor King 2022-05-08 06:15:40 UTC
Aggregating reasons over the past 24h:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=24h&type=junit&context=0&search=clusteroperator/storage+condition/Available+status/False+reason/' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's|.*reason/\([^ ]*\) changed:.*|\1|' | sort | uniq -c | sort -n
      1 DefaultStorageClassController_SyncError
      1 GCPPDCSIDriverOperatorDeployment_Deploying
      1 OVirtCSIDriverOperatorCR_OvirtDriverControllerServiceController_Deploying
      1 VSphereCSIDriverOperatorCR_VMwareVSphereDriverWebhookController_Deploying
      1 VSphereCSIDriverOperatorCR_WaitForOperator
      1 VSphereProblemDetectorDeploymentController_Deploying
      3 OVirtCSIDriverOperatorDeployment_Deploying
      5 AWSEBSCSIDriverOperatorCR_AWSEBSDriverControllerServiceController_Deploying
      5 AWSEBSCSIDriverOperatorDeployment_Deploying
      5 AzureDiskCSIDriverOperatorCR_WaitForOperator
     56 AzureFileCSIDriverOperatorCR_WaitForOperator

Comment 3 Jan Safranek 2022-05-09 10:24:47 UTC
Thanks for the statistics!

>      56 AzureFileCSIDriverOperatorCR_WaitForOperator

This is actually expected. In 4.11 we install a new Azure File CSI driver during the upgrade, and that driver starts as Available=False. It's a question of *when* Available should switch from reporting 4.10 to reporting 4.11 features - we do it when the CSO container is actually updated to 4.11, before it starts upgrading the rest of storage in OCP to 4.11 (i.e. before it starts installing the new CSI driver).

The other reasons might be random hiccups in the cluster and should probably be fixed. That may be quite challenging, since there are so many of them. Another good case for some inertia?
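
For illustration, a minimal sketch of the inertia idea - not library-go's actual API, just a made-up damper showing the concept: only publish an unhealthy condition once the failure has persisted for a grace period, so one-off sync errors never reach the ClusterOperator status.

package inertia

import "time"

// conditionDamper delays reporting an unhealthy condition until the failure
// has persisted for a grace period.
type conditionDamper struct {
    grace    time.Duration // how long a failure must persist before it is published
    badSince time.Time     // zero value means "currently healthy"
}

// Observe takes the instantaneous result of a controller sync and returns the
// health that should actually be published on the ClusterOperator.
func (d *conditionDamper) Observe(healthy bool, now time.Time) bool {
    if healthy {
        d.badSince = time.Time{} // any successful sync resets the clock
        return true
    }
    if d.badSince.IsZero() {
        d.badSince = now // first failure: start the clock, keep reporting healthy
    }
    return now.Sub(d.badSince) < d.grace
}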

Comment 4 Jan Safranek 2022-05-10 08:30:46 UTC
openshift/cluster-storage-operator/pull/277 fixes just DefaultStorageClassController_SyncError.

Comment 6 Jan Safranek 2022-05-26 08:21:21 UTC
> openshift/cluster-storage-operator/pull/277 fixes just DefaultStorageClassController_SyncError.

Comment 8 Jan Safranek 2022-06-27 09:32:08 UTC
I went through results in past 2 weeks in 4.11 jobs.

* Excluding Azure, see comment #3
* Excluding "aggregator" jobs, because they don't really fail; they just complain that the test was not run enough times:

> {Passed 2 times, failed 0 times, skipped 0 times: we require at least 3 attempts to have a chance at success  name: operator conditions storage
> testsuitename: Operator results
> summary: 'Passed 2 times, failed 0 times, skipped 0 times: we require at least 3 attempts
>   to have a chance at success'

* Excluding single-node jobs (will follow up later).

$ w3m -dump -cols 200 "https://search.ci.openshift.org/?search=storage+should+not+change+condition%2FAvailable&maxAge=336h" | grep 'failures match' | sort | grep 4.11  | egrep -v "azure|aggregator|single"
periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-vsphere-upgrade (all) - 14 runs, 64% failed, 122% of failures match = 79% impact
periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.11-e2e-openstack-techpreview-serial (all) - 14 runs, 100% failed, 50% of failures match = 50% impact

vSphere job is flipping Available true->false because of:

> Jun 14 20:29:17.959 - 11s   E clusteroperator/storage condition/Available status/False reason/VSphereProblemDetectorDeploymentControllerAvailable: Waiting for a Deployment pod to start}

The vsphere-problem-detector Deployment does not have the right disruption budget.

Techpreview jobs are flipping Available true->false because of:

> Jun 20 19:24:41.025 - 55s   E clusteroperator/storage condition/Available status/False reason/SHARESCSIDriverOperatorCRAvailable: SharedResourceCSIDriverWebhookControllerAvailable: Waiting for Deployment

The webhook Deployment does not have the right disruption budget.
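
Both cases point at the same pattern: an operand Deployment without a PodDisruptionBudget can have its pod evicted during a node drain, and the operator briefly reports it unavailable. For illustration, a minimal sketch of such a budget built with the Kubernetes Go API types - the name, namespace, labels and replica assumptions here are illustrative, not the operator's real manifests:

package main

import (
    "fmt"

    policyv1 "k8s.io/api/policy/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/util/intstr"
    "sigs.k8s.io/yaml"
)

func main() {
    minAvailable := intstr.FromInt(1)
    pdb := policyv1.PodDisruptionBudget{
        TypeMeta: metav1.TypeMeta{APIVersion: "policy/v1", Kind: "PodDisruptionBudget"},
        ObjectMeta: metav1.ObjectMeta{
            Name:      "vsphere-problem-detector",           // assumed name
            Namespace: "openshift-cluster-storage-operator", // assumed namespace
        },
        Spec: policyv1.PodDisruptionBudgetSpec{
            // Keep at least one pod running through voluntary disruptions
            // such as node drains during an upgrade.
            MinAvailable: &minAvailable,
            Selector: &metav1.LabelSelector{
                MatchLabels: map[string]string{"name": "vsphere-problem-detector"}, // assumed label
            },
        },
    }
    out, err := yaml.Marshal(pdb)
    if err != nil {
        panic(err)
    }
    fmt.Print(string(out))
}

A minAvailable of 1 only helps if the Deployment runs more than one replica; for a single-replica operand the rollout strategy (maxUnavailable: 0, maxSurge: 1) matters just as much.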

Comment 9 W. Trevor King 2022-06-27 19:11:48 UTC
(In reply to Jan Safranek from comment #8)
> vSphere job is flipping Available true->false because of:
> 
> > Jun 14 20:29:17.959 - 11s   E clusteroperator/storage condition/Available status/False reason/VSphereProblemDetectorDeploymentControllerAvailable: Waiting for a Deployment pod to start}
> 
> vsphere-problem-detector webhook Deployment does not have the right
> disruption budget.
> 
> Techpreview jobs are flipping Available true->false because of:
> 
> > Jun 20 19:24:41.025 - 55s   E clusteroperator/storage condition/Available status/False reason/SHARESCSIDriverOperatorCRAvailable: SharedResourceCSIDriverWebhookControllerAvailable: Waiting for Deployment
> 
> The webhook Deployment does not have the right disruption budget.

Neither of these showed up in [1], where I'm currently planning on excepting a bunch of *_Deploying reasons.  Please weigh in on [1] if you would like to see changes to the storage Available=False exception list.

[1]: https://github.com/openshift/origin/pull/27231/files

Comment 10 Jan Safranek 2022-06-28 13:25:55 UTC
> Neither of these showed up in [1], where I'm currently planning on excepting a bunch of *_Deploying reasons.  Please weigh in on [1] if you would like to see changes to the storage Available=False exception list.

Thanks for the heads-up. I hope the webhook PRs get merged soon so we don't need exceptions for them (in 4.12). I commented on the other regexps in the PR.

Comment 12 Jan Safranek 2022-08-16 15:51:15 UTC
I looked at today's failures in 4.12:

# w3m -dump -cols 200 "https://search.ci.openshift.org/?search=storage+should+not+change+condition%2FAvailable&maxAge=336h" | grep 'failures match' | sort | grep 4.12  | egrep -v "azure|aggregator|single"

> periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-vsphere-upgrade (all) - 14 runs, 29% failed, 300% of failures match = 86% impact
> reason/VSphereProblemDetectorDeploymentControllerAvailable: Waiting for a Deployment pod to start

It seems the problem-detector does not have a PDB; I will fix it.

> periodic-ci-openshift-release-master-nightly-4.12-e2e-vsphere-serial (all) - 59 runs, 97% failed, 18% of failures match = 17% impact
> periodic-ci-openshift-release-master-nightly-4.12-e2e-vsphere-techpreview-serial (all) - 58 runs, 98% failed, 12% of failures match = 12% impact

Both -serial- jobs run a test that does something bad to the master nodes, e.g. kube-apiserver endpoints become unavailable and even the network operator becomes Degraded:

[sig-etcd][Serial] etcd is able to vertically scale up and down with a single node [Timeout:60m] [Suite:openshift/conformance/serial]
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.12-e2e-vsphere-serial/1559450115490451456

> periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.12-e2e-openstack-parallel (all) - 14 runs, 64% failed, 11% of failures match = 7% impact

The cluster is very unhappy in the single failure; it's not our operator's fault. IMO it's fair to report storage as unavailable: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.12-e2e-vsphere-serial/1559450115490451456

Comment 13 Jan Safranek 2023-01-03 11:42:56 UTC
Checking 4.13 today: w3m -dump -cols 200 "https://search.ci.openshift.org/?search=storage+should+not+change+condition%2FAvailable&maxAge=336h" | grep 'failures match' | sort | grep 4.13

etcd-scaling job had one hiccup:
periodic-ci-openshift-release-master-nightly-4.13-e2e-vsphere-ovn-etcd-scaling (all) - 14 runs, 100% failed, 7% of failures match = 7% impact
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.13-e2e-vsphere-ovn-etcd-scaling/1609013431443132416

etcd was Degraded during this period, and several other ClusterOperators went Available=False as well.


Single-node upgrades report Available=False because storage is indeed briefly unavailable during the upgrade.

periodic-ci-openshift-release-master-ci-4.13-e2e-aws-upgrade-ovn-single-node (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-vsphere-sdn-upgrade (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.13-upgrade-from-stable-4.12-e2e-aws-upgrade-ovn-single-node (all) - 2 runs, 100% failed, 100% of failures match = 100% impact

While the state is not perfect, it's much better than when highly-available clusters went Available=False during upgrades. Now it's mostly just single-node clusters and one etcd hiccup. We will not have time to fix these properly in the near future.