[bz-Storage] clusteroperator/storage should not change condition/Available

is failing frequently in CI, see [1] and:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=storage+should+not+change+condition/Available' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-ovn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade (all) - 4 runs, 75% failed, 100% of failures match = 75% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-vsphere-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.11-e2e-aws-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.11-e2e-azure-upgrade-single-node (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-upgrade (all) - 5 runs, 40% failed, 50% of failures match = 20% impact
periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-ovn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade (all) - 62 runs, 50% failed, 168% of failures match = 84% impact
periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-gcp-ovn-upgrade (all) - 20 runs, 100% failed, 10% of failures match = 10% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-vsphere-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-vsphere-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.8-upgrade-from-stable-4.7-e2e-openstack-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
pull-ci-openshift-machine-config-operator-release-4.10-e2e-aws-upgrade-single-node (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
pull-ci-openshift-machine-config-operator-release-4.10-e2e-vsphere-upgrade (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
pull-ci-openshift-origin-master-e2e-aws-single-node-upgrade (all) - 9 runs, 78% failed, 100% of failures match = 78% impact
release-openshift-origin-installer-e2e-aws-upgrade-4.5-to-4.6-to-4.7-to-4.8-ci (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-e2e-aws-upgrade-4.6-to-4.7-to-4.8-to-4.9-ci (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-e2e-aws-upgrade-4.7-to-4.8-to-4.9-to-4.10-ci (all) - 1 runs, 100% failed, 100% of failures match = 100% impact

For example, [2] has:

: [bz-Management Console] clusteroperator/console should not change condition/Available
Run #0: Failed 2h2m40s
{ 4 unexpected clusteroperator state transitions during e2e test run

May 03 14:18:09.477 - 1s E clusteroperator/console condition/Available status/False reason/RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.ci-op-mqrp73ib-0e085.aws-2.ci.openshift.org): Get "https://console-openshift-console.apps.ci-op-mqrp73ib-0e085.aws-2.ci.openshift.org": dial tcp: lookup console-openshift-console.apps.ci-op-mqrp73ib-0e085.aws-2.ci.openshift.org on 172.30.0.10:53: read udp 10.129.0.113:33286->172.30.0.10:53: read: connection refused
1 tests failed during this blip (2022-05-03 14:18:09.477172196 +0000 UTC to 2022-05-03 14:18:09.477172196 +0000 UTC): [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]
May 03 14:52:43.462 - 598ms E clusteroperator/console condition/Available status/False reason/RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.ci-op-mqrp73ib-0e085.aws-2.ci.openshift.org): Get "https://console-openshift-console.apps.ci-op-mqrp73ib-0e085.aws-2.ci.openshift.org": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
1 tests failed during this blip (2022-05-03 14:52:43.462555534 +0000 UTC to 2022-05-03 14:52:43.462555534 +0000 UTC): [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]}

With:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-upgrade/1521470811435700224/build-log.txt | grep 'clusteroperator/storage condition/Available.*changed'
May 03 14:12:27.267 E clusteroperator/storage condition/Available status/False reason/DefaultStorageClassController_SyncError changed: DefaultStorageClassControllerAvailable: Get "https://172.30.0.1:443/apis/storage.k8s.io/v1/storageclasses/gp2": context canceled
May 03 14:12:27.313 W clusteroperator/storage condition/Available status/True reason/AsExpected changed: AWSEBSCSIDriverOperatorCRAvailable: All is well

The test-case is flake-only, so this isn't impacting CI success rates. But having the operator claim Available=False is not a great customer experience. Possibly not a big enough UX impact to be worth backports, but certainly a big enough UX impact to be worth fixing in the development branch.

[1]: https://sippy.ci.openshift.org/sippy-ng/tests/4.11/analysis?test=%5Bbz-Storage%5D%20clusteroperator%2Fstorage%20should%20not%20change%20condition%2FAvailable
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-upgrade/1521470811435700224
For the DefaultStorageClassController_SyncError: the controller should not flip Available from true to false after a single error. (It could mark the operator as Degraded: true instead, with some inertia.) At the same time, the controller / ApplyStorageClass should use an informer rather than issue GET calls. (And someone should check whether all Available: true -> false changes in CI are just DefaultStorageClassController_SyncError, or whether other controllers need fixing too.) A rough sketch of both ideas is below.
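To illustrate both points, here is a minimal Go sketch. It is not the actual DefaultStorageClassController code; the names defaultStorageClassChecker, errorInertia and check are made up for this example. It reads the StorageClass from an informer-backed lister instead of issuing live GET calls, and it only reports a problem after the problem has persisted for a while, so a single cancelled request cannot flip the reported state:

// A minimal sketch (assumptions: names and the 2-minute inertia are invented).
package sketch

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	storagelisters "k8s.io/client-go/listers/storage/v1"
)

// errorInertia is how long an error must persist before it is reported.
const errorInertia = 2 * time.Minute

type defaultStorageClassChecker struct {
	lister     storagelisters.StorageClassLister
	errorSince time.Time // zero while the last check succeeded
}

func newDefaultStorageClassChecker(clientset kubernetes.Interface, stopCh <-chan struct{}) *defaultStorageClassChecker {
	// The shared informer keeps a local cache in sync via WATCH, so the
	// checks below never hit the apiserver directly.
	factory := informers.NewSharedInformerFactory(clientset, 10*time.Minute)
	lister := factory.Storage().V1().StorageClasses().Lister()
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)
	return &defaultStorageClassChecker{lister: lister}
}

// check returns a non-empty message only when the named StorageClass has been
// missing or unreadable for longer than errorInertia.
func (c *defaultStorageClassChecker) check(name string) string {
	if _, err := c.lister.Get(name); err == nil { // cache lookup, no live GET
		c.errorSince = time.Time{}
		return ""
	} else if c.errorSince.IsZero() {
		c.errorSince = time.Now()
	}
	if time.Since(c.errorSince) < errorInertia {
		return "" // transient blip: keep the previously reported state
	}
	return fmt.Sprintf("default StorageClass %q has been unavailable for more than %s", name, errorInertia)
}

In the real operator the inertia would more likely live in the status-reporting layer than in the check itself, but the effect is the same: transient apiserver hiccups stop showing up as Available=False.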
Aggregating reasons over the past 24h:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=24h&type=junit&context=0&search=clusteroperator/storage+condition/Available+status/False+reason/' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's|.*reason/\([^ ]*\) changed:.*|\1|' | sort | uniq -c | sort -n
      1 DefaultStorageClassController_SyncError
      1 GCPPDCSIDriverOperatorDeployment_Deploying
      1 OVirtCSIDriverOperatorCR_OvirtDriverControllerServiceController_Deploying
      1 VSphereCSIDriverOperatorCR_VMwareVSphereDriverWebhookController_Deploying
      1 VSphereCSIDriverOperatorCR_WaitForOperator
      1 VSphereProblemDetectorDeploymentController_Deploying
      3 OVirtCSIDriverOperatorDeployment_Deploying
      5 AWSEBSCSIDriverOperatorCR_AWSEBSDriverControllerServiceController_Deploying
      5 AWSEBSCSIDriverOperatorDeployment_Deploying
      5 AzureDiskCSIDriverOperatorCR_WaitForOperator
     56 AzureFileCSIDriverOperatorCR_WaitForOperator
Thanks for the statistics!

> 56 AzureFileCSIDriverOperatorCR_WaitForOperator

This is actually expected. In 4.11 we install a new Azure File CSI driver during the upgrade, and that driver starts as Available=False. It's a question of *when* Available should switch from 4.10 to 4.11 features - we do it when the CSO container is actually updated to 4.11 and before it starts upgrading the rest of storage in OCP to 4.11 (i.e. before it starts installing the new CSI driver).

The other reasons are probably random hiccups in the cluster and should be fixed. That may be quite challenging, since there are too many of them. Another good case for some inertia?
openshift/cluster-storage-operator/pull/277 fixes just DefaultStorageClassController_SyncError.
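For context, a hedged sketch of the general idea behind that kind of fix (this is not the code from PR 277; conditionsForSyncError is a hypothetical helper): a failed sync surfaces as Degraded=True while Available stays True, because the StorageClass that was already created keeps working even when one GET fails:

// Assumption: a simplified mapping from a sync error to ClusterOperator
// conditions, not the actual cluster-storage-operator status logic.
package sketch

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	configv1 "github.com/openshift/api/config/v1"
)

// conditionsForSyncError keeps Available=True and reports the error via
// Degraded=True so admins still see it. LastTransitionTime handling is
// simplified; real status helpers only bump it when Status actually changes.
func conditionsForSyncError(reason string, err error) []configv1.ClusterOperatorStatusCondition {
	now := metav1.Now()
	return []configv1.ClusterOperatorStatusCondition{
		{
			Type:               configv1.OperatorAvailable,
			Status:             configv1.ConditionTrue,
			Reason:             "AsExpected",
			LastTransitionTime: now,
		},
		{
			Type:               configv1.OperatorDegraded,
			Status:             configv1.ConditionTrue,
			Reason:             reason,
			Message:            err.Error(),
			LastTransitionTime: now,
		},
	}
}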
> openshift/cluster-storage-operator/pull/277 fixes just DefaultStorageClassController_SyncError.
I went through the results from the past 2 weeks of 4.11 jobs.

* Excluding Azure, see comment #3.
* Excluding "aggregator" jobs, because they don't really fail; they complain that the test was not run enough times:

> {Passed 2 times, failed 0 times, skipped 0 times: we require at least 3 attempts to have a chance at success
>  name: operator conditions storage
>  testsuitename: Operator results
>  summary: 'Passed 2 times, failed 0 times, skipped 0 times: we require at least 3 attempts
>    to have a chance at success'

* Excluding single-node jobs (will follow up later).

$ w3m -dump -cols 200 "https://search.ci.openshift.org/?search=storage+should+not+change+condition%2FAvailable&maxAge=336h" | grep 'failures match' | sort | grep 4.11 | egrep -v "azure|aggregator|single"
periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-vsphere-upgrade (all) - 14 runs, 64% failed, 122% of failures match = 79% impact
periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.11-e2e-openstack-techpreview-serial (all) - 14 runs, 100% failed, 50% of failures match = 50% impact

The vSphere job is flipping Available true->false because of:

> Jun 14 20:29:17.959 - 11s E clusteroperator/storage condition/Available status/False reason/VSphereProblemDetectorDeploymentControllerAvailable: Waiting for a Deployment pod to start}

vsphere-problem-detector webhook Deployment does not have the right disruption budget.

The techpreview jobs are flipping Available true->false because of:

> Jun 20 19:24:41.025 - 55s E clusteroperator/storage condition/Available status/False reason/SHARESCSIDriverOperatorCRAvailable: SharedResourceCSIDriverWebhookControllerAvailable: Waiting for Deployment

The webhook Deployment does not have the right disruption budget.
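Both cases are the same shape of problem: the Deployment gets drained during the upgrade, its only pod goes away for a moment, and the operator reports Available=False. A hedged Go sketch of the missing piece follows; the name, namespace and labels below are assumptions for illustration, not the operator's actual manifests. It creates a PodDisruptionBudget that keeps at least one pod available during voluntary disruptions:

// Assumption-laden sketch: object names and labels are invented.
package sketch

import (
	"context"
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/client-go/kubernetes"
)

func ensureProblemDetectorPDB(ctx context.Context, client kubernetes.Interface) error {
	minAvailable := intstr.FromInt(1)
	pdb := &policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "vsphere-problem-detector-pdb",        // hypothetical name
			Namespace: "openshift-cluster-storage-operator",  // assumed namespace
		},
		Spec: policyv1.PodDisruptionBudgetSpec{
			// Never let a voluntary eviction take the last matching pod down.
			MinAvailable: &minAvailable,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"name": "vsphere-problem-detector"}, // assumed label
			},
		},
	}
	// Real operator code would update the object if it already exists.
	if _, err := client.PolicyV1().PodDisruptionBudgets(pdb.Namespace).Create(ctx, pdb, metav1.CreateOptions{}); err != nil {
		return fmt.Errorf("creating PodDisruptionBudget: %w", err)
	}
	return nil
}

For such a PDB to be satisfiable during node drains, the Deployment also needs more than one replica (or a rolling-update strategy that brings a replacement up before the old pod is evicted); the PDB alone only makes the eviction wait.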
(In reply to Jan Safranek from comment #8)
> vSphere job is flipping Available true->false because of:
>
> > Jun 14 20:29:17.959 - 11s E clusteroperator/storage condition/Available status/False reason/VSphereProblemDetectorDeploymentControllerAvailable: Waiting for a Deployment pod to start}
>
> vsphere-problem-detector webhook Deployment does not have the right
> disruption budget.
>
> Techpreview jobs are flipping Available true->false because of:
>
> > Jun 20 19:24:41.025 - 55s E clusteroperator/storage condition/Available status/False reason/SHARESCSIDriverOperatorCRAvailable: SharedResourceCSIDriverWebhookControllerAvailable: Waiting for Deployment
>
> The webhook Deployment does not have the right disruption budget.

Neither of these showed up in [1], where I'm currently planning on excepting a bunch of *_Deploying reasons. Please weigh in on [1] if you would like to see changes to the storage Available=False exception list.

[1]: https://github.com/openshift/origin/pull/27231/files
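Mechanically, an "exception list" like the one in [1] boils down to a set of tolerated reason patterns for the storage ClusterOperator. The following is only a hedged Go sketch of that idea, not the actual openshift/origin test code, and the patterns shown are illustrative; the authoritative list lives in the PR:

// Assumption: isExceptedReason and the patterns are hypothetical stand-ins.
package sketch

import "regexp"

// allowedStorageAvailableFalse lists reasons tolerated when the storage
// ClusterOperator briefly reports Available=False during an upgrade.
var allowedStorageAvailableFalse = []*regexp.Regexp{
	regexp.MustCompile(`_Deploying$`),
	regexp.MustCompile(`^AzureFileCSIDriverOperatorCR_WaitForOperator$`),
}

// isExceptedReason reports whether an Available=False reason should be
// treated as a tolerated blip rather than a test failure.
func isExceptedReason(reason string) bool {
	for _, re := range allowedStorageAvailableFalse {
		if re.MatchString(reason) {
			return true
		}
	}
	return false
}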
> Neither of these showed up in [1], where I'm currently planning on excepting a bunch of *_Deploying reasons. Please weigh in on [1] if you would like to see changes to the storage Available=False exception list.

Thanks for the heads-up. I hope the webhook PRs get merged soon and we won't need exceptions for them (in 4.12). I commented on the other regexps in the PR.
I looked at today's failures in 4.12:

# w3m -dump -cols 200 "https://search.ci.openshift.org/?search=storage+should+not+change+condition%2FAvailable&maxAge=336h" | grep 'failures match' | sort | grep 4.12 | egrep -v "azure|aggregator|single"

> periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-vsphere-upgrade (all) - 14 runs, 29% failed, 300% of failures match = 86% impact
> reason/VSphereProblemDetectorDeploymentControllerAvailable: Waiting for a Deployment pod to start

It seems the problem-detector does not have a PDB; I will fix it.

> periodic-ci-openshift-release-master-nightly-4.12-e2e-vsphere-serial (all) - 59 runs, 97% failed, 18% of failures match = 17% impact
> periodic-ci-openshift-release-master-nightly-4.12-e2e-vsphere-techpreview-serial (all) - 58 runs, 98% failed, 12% of failures match = 12% impact

Both -serial- jobs run a test that does something bad to the master nodes, e.g. kube-apiserver endpoints become unavailable and the network operator even becomes Degraded:

[sig-etcd][Serial] etcd is able to vertically scale up and down with a single node [Timeout:60m] [Suite:openshift/conformance/serial]

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.12-e2e-vsphere-serial/1559450115490451456

> periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.12-e2e-openstack-parallel (all) - 14 runs, 64% failed, 11% of failures match = 7% impact

The cluster is very unhappy in the single failure and it's not our operator's fault; IMO it's fair to report storage as unavailable there.

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.12-e2e-vsphere-serial/1559450115490451456
Checking 4.13 today:

$ w3m -dump -cols 200 "https://search.ci.openshift.org/?search=storage+should+not+change+condition%2FAvailable&maxAge=336h" | grep 'failures match' | sort | grep 4.13

The etcd-scaling job had one hiccup:

periodic-ci-openshift-release-master-nightly-4.13-e2e-vsphere-ovn-etcd-scaling (all) - 14 runs, 100% failed, 7% of failures match = 7% impact
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.13-e2e-vsphere-ovn-etcd-scaling/1609013431443132416

etcd was Degraded during this period and several other ClusterOperators reported Available=False at the same time.

Single-node upgrades get Available=False because storage indeed was unavailable for a brief period during the upgrade:

periodic-ci-openshift-release-master-ci-4.13-e2e-aws-upgrade-ovn-single-node (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-vsphere-sdn-upgrade (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.13-upgrade-from-stable-4.12-e2e-aws-upgrade-ovn-single-node (all) - 2 runs, 100% failed, 100% of failures match = 100% impact

While the state is not perfect, it's much better than it was when highly available clusters got Available=False during upgrade. Now it's mostly just single-node jobs and one etcd hiccup. We will not have time to fix these properly in the near future.