Bug 1994655
| Summary: | openshift-apiserver should not set Available=False APIServicesAvailable on update | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Scott Dodson <sdodson> |
| Component: | Networking | Assignee: | jamo luhrsen <jluhrsen> |
| Networking sub component: | openshift-sdn | QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED CURRENTRELEASE | Docs Contact: | |
| Severity: | high | | |
| Priority: | low | CC: | aos-bugs, bleanhar, ffernand, kewang, lszaszki, mfojtik, sttts, wking, wlewis, xxia |
| Version: | 4.8 | Keywords: | Reopened, Upgrades |
| Target Milestone: | --- | | |
| Target Release: | 4.8.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | tag-ci LifecycleReset | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1948089 | Environment: | |
| Last Closed: | 2022-01-05 15:14:51 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1948089 | | |
| Bug Blocks: | | | |
Description
Scott Dodson
2021-08-17 16:01:42 UTC
I'm adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

PRs are opened, waiting for review.

@lszaszki Failures are still frequent. Can you please confirm?

w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&name=^periodic.*upgrade&type=junit&search=clusteroperator/openshift-apiserver+should+not+change+condition/Available' | grep 'failures match' | sort | grep 4.8
periodic-ci-openshift-multiarch-master-nightly-4.8-upgrade-from-nightly-4.7-ocp-remote-libvirt-s390x (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 6 runs, 50% failed, 67% of failures match = 33% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 7 runs, 14% failed, 700% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 6 runs, 50% failed, 167% of failures match = 83% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 4 runs, 50% failed, 150% of failures match = 75% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-vsphere-upgrade (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade (all) - 5 runs, 80% failed, 50% of failures match = 40% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade (all) - 5 runs, 40% failed, 50% of failures match = 20% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-openstack-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact

Availability can be affected by many things and will never be perfect (100%). I used the following query to compare the results [1] vs [2]. I excluded updates from 4.7 because the fix has been applied to 4.8+. I also excluded OVN since the outage in that environment is usually longer, which might point at the underlying network provider.

[1] https://search.ci.openshift.org/?search=clusteroperator%2Fopenshift-apiserver.*status%2FFalse+reason%2FAPIServicesAvailable&maxAge=24h&context=1&type=junit&name=%5Eperiodic.*upgrade&excludeName=4.7%7Covn%7Csingle-node&maxMatches=5&maxBytes=20971520&groupBy=job
[2] https://search.ci.openshift.org/?search=clusteroperator%2Fopenshift-apiserver.*status%2FFalse+reason%2FAPIServicesAvailable&maxAge=336h&context=1&type=junit&name=%5Eperiodic.*upgrade&excludeName=4.7%7Covn%7Csingle-node&maxMatches=5&maxBytes=20971520&groupBy=job
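(For quick triage, the [1]/[2] comparison above can also be reduced to per-job counts of matching runs. This is only a sketch, assuming the /search JSON endpoint returns prow run URLs as its keys, as in the jq example further down in this bug:)

$ curl -s 'https://search.ci.openshift.org/search?search=clusteroperator%2Fopenshift-apiserver.*status%2FFalse+reason%2FAPIServicesAvailable&maxAge=24h&type=junit&name=%5Eperiodic.*upgrade&excludeName=4.7%7Covn%7Csingle-node' \
    | jq -r 'keys[]' \
    | sed 's|.*/logs/||; s|/[0-9]*$||' \
    | sort | uniq -c | sort -rn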
@lszaszki I used the query you shared above and failures are still frequent:

periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-upgrade (all) - 82 runs, 51% failed, 5% of failures match = 2% impact

#1460265835862953984 junit 10 hours ago
Nov 15 16:44:02.854 E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServices_Error changed: APIServicesAvailable: rpc error: code = Unknown desc = OK: HTTP status code 200; transport: missing content-type field
Nov 15 16:44:02.854 - 1s E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: rpc error: code = Unknown desc = OK: HTTP status code 200; transport: missing content-type field
Nov 15 16:45:11.000 - 10s E disruption/service-loadbalancer-with-pdb connection/reused disruption/service-loadbalancer-with-pdb connection/reused is not responding to GET requests over reused connections: missing error in the code

#1460265835862953984 junit 10 hours ago
Nov 15 16:44:02.854 - 1s E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: rpc error: code = Unknown desc = OK: HTTP status code 200; transport: missing content-type field
1 tests failed during this blip (2021-11-15 16:44:02.854351947 +0000 UTC to 2021-11-15 16:44:02.854351947 +0000 UTC): [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]

#1460108076450320384 junit 21 hours ago
Nov 15 05:32:41.654 E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServices_Error changed: APIServicesAvailable: "security.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request
Nov 15 05:32:41.654 - 2s E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: "security.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request
Nov 15 05:37:35.015 E ns/openshift-machine-api pod/machine-api-operator-56d84db788-djsq7 node/ci-op-348q3xzi-a9175-m75l6-master-2 container/machine-api-operator reason/ContainerExit code/2 cause/Error

#1460108076450320384 junit 21 hours ago
Nov 15 05:32:41.654 - 2s E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: "security.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request

I looked into the recent failures and they seem to correspond to updates of the sdn containers. In general, the failures are brief and seem to happen while the sdn is being installed/restarted. It looks like installing a new sdn container is not graceful and cuts off aggregated APIs (at least).
For example, in https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade/1473107794574970880:

sdn-zstbb (master-1) was killed at 03:02:27 and we lost openshift-oauth-apiserver (apiserver-5b7547db44-4vccw) and openshift-apiserver (apiserver-84c48bd9df-bnqmh); both were running on the same node (master-1).
sdn-hp5f5 (master-2) was killed at 03:02:40 and we lost openshift-oauth-apiserver (apiserver-5b7547db44-pk484) and openshift-apiserver (apiserver-84c48bd9df-xjwvk); both were running on the same node (master-2).
It looks similar on master-0, and happened around the initialization of the openshift-sdn/sdn-bxfqj container.

The failures are quite common in CI: https://search.ci.openshift.org/?search=clusteroperator%2Fopenshift-apiserver.*status%2FFalse+reason%2FAPIServicesAvailable&maxAge=24h&context=1&type=junit&name=%5Eperiodic.*upgrade&excludeName=4.7%7Covn%7Csingle-node&maxMatches=5&maxBytes=20971520&groupBy=job

Based on the above I am moving this to the SDN team.

(In reply to Scott Dodson from comment #0)
> This clone is intended to track backporting of the library-go bumps for 4.8 cluster-kube-apiserver-operator.

@sdodson, looks like the backports have all merged.

@rgangwar, @wking, @lszaszki, this clone of an already closed bug fell on my plate recently. I didn't know much about it, but as I dug in I realized that all the "failures" we are getting from those search queries are coming from "flakes". The test case fails once, but it always passes on the 2nd try. These are not affecting our job pass rate. I checked 4.7->4.8, 4.8->4.9 and 4.9->4.10 in testgrid, and you can see that the test case is never marked as a failure, just a flake:

"openshift-tests.[bz-apiserver-auth] clusteroperator/authentication should not change condition/Available"

I checked both GCP and AWS. BTW, there are *lots* of test cases in these upgrade jobs that flake once and pass on the 2nd try. Do we want to close this bug, or is there something I'm missing that we want to dig deeper on to get fixed?
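(To eyeball the SDN correlation described above in a given run, here is a minimal sketch; it assumes the raw build-log.txt URL pattern and the monitor line formats (`ns/<namespace> pod/<name>`, `clusteroperator/... condition/Available`) quoted elsewhere in this bug, using the azure run from the example above:)

$ LOG=https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade/1473107794574970880/build-log.txt
$ curl -s "$LOG" | grep -E 'ns/openshift-sdn pod/sdn-|clusteroperator/(openshift-apiserver|authentication) condition/Available'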
FYI, here are the 6 testgrid links I referenced above:

https://testgrid.k8s.io/redhat-openshift-ocp-release-4.10-informing#periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade&show-stale-tests=
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-informing#periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade&show-stale-tests=
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade&show-stale-tests=
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.10-informing#periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-gcp-ovn-upgrade&show-stale-tests=
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-informing#periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-ovn-upgrade&show-stale-tests=
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade&show-stale-tests=

Using the queries based on the 4.9.0 verification [1], but rolled back to look at 4.8 -> 4.9:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=clusteroperator%2Fopenshift-apiserver+condition%2FAvailable+status%2FFalse+reason%2FAPIServicesAvailable&maxAge=24h&type=junit&name=%5Eperiodic.*4.8.*upgrade&excludeName=4%5C.7' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade (all) - 5 runs, 80% failed, 125% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-ovn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-ovirt-upgrade (all) - 4 runs, 75% failed, 33% of failures match = 25% impact
periodic-ci-openshift-release-master-okd-4.9-upgrade-from-okd-4.8-e2e-upgrade-gcp (all) - 1 runs, 100% failed, 100% of failures match = 100% impact

Those runs:

$ curl -s 'https://search.ci.openshift.org/search?search=clusteroperator%2Fopenshift-apiserver+condition%2FAvailable+status%2FFalse+reason%2FAPIServicesAvailable&maxAge=24h&type=junit&name=%5Eperiodic.*4.8.*upgrade&excludeName=4%5C.7' | jq -r 'keys[]'
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-single-node/1478496947181457408
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade/1478239379674632192
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade/1478309813623459840
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade/1478413114000019456
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade/1478420659305451520
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade/1478508721863659520
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-ovn-upgrade/1478442060850663424
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-ovirt-upgrade/1478219006035890176
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-okd-4.9-upgrade-from-okd-4.8-e2e-upgrade-gcp/1478432819796512768

I dunno how important single-node is. Let's skip it and look at [2]:

[bz-openshift-apiserver] clusteroperator/openshift-apiserver should not change condition/Available
Run #0: Failed 1h43m50s
1 unexpected clusteroperator state transitions during e2e test run
Jan 05 01:46:38.189 - 155s E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: "apps.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "authorization.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "build.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "image.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "project.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "quota.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "route.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "security.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "template.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request

That certainly feels similar to the issue that caused me to initially open the 4.9.0 bug [3]. But I haven't internalized the distinctions in [4]; perhaps this 503 is one of the issues that got punted to some other bug series? And yes, the test-case is never fatal [5], which I guess is grounds to say we don't care all that much about backporting fixes, although ideally we're driving out alarmist noise like this.
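(For reference, the operator condition these tests watch can also be inspected directly on a live cluster; a minimal sketch using standard `oc` jsonpath, not specific to any of the runs above:)

$ oc get clusteroperator openshift-apiserver \
    -o jsonpath='{range .status.conditions[?(@.type=="Available")]}{.status}{"\t"}{.reason}{"\t"}{.message}{"\n"}{end}'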
[2] actually included some fatal test-cases, including:

disruption_tests: [sig-api-machinery] OpenShift APIs remain available with reused connections 1h40m53s
Jan 4 08:04:46.751: API "openshift-api-available-reused-connections" was unreachable during disruption (AWS has a known issue: https://bugzilla.redhat.com/show_bug.cgi?id=1943804) for at least 30s of 1h40m51s (1%):
Jan 04 07:57:10.654 E openshift-apiserver-reused-connection openshift-apiserver-reused-connection started failing: Get "https://api.ci-op-byyjrxly-978ed.aws-2.ci.openshift.org:6443/apis/image.openshift.io/v1/namespaces/default/imagestreams": read tcp 10.129.9.1:56734->52.9.155.127:6443: read: connection reset by peer
Jan 04 07:57:10.654 - 30s E openshift-apiserver-reused-connection openshift-apiserver-reused-connection is not responding to GET requests
Jan 04 07:57:41.260 I openshift-apiserver-reused-connection openshift-apiserver-reused-connection started responding to GET requests

although that 7:57 business diverges from the 7:38 Available=False block:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade/1478239379674632192/build-log.txt | grep 'clusteroperator/openshift-apiserver condition/Available'
Jan 04 07:38:00.674 E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServices_Error changed: APIServicesAvailable: "authorization.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "build.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "image.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "project.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "quota.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "route.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "security.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "template.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request
Jan 04 07:38:00.674 - 150s E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: "authorization.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "build.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "image.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "project.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "quota.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "route.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "security.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "template.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request
Jan 04 07:40:31.187 W clusteroperator/openshift-apiserver condition/Available status/True reason/AsExpected changed: All is well

Anyhow, I'm agnostic on backports here, so feel free to WONTFIX or CURRENTRELEASE or MODIFIED or whatever, as you see fit.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1948089#c31
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade/1478239379674632192
[3]: https://bugzilla.redhat.com/show_bug.cgi?id=1948089#c0
[4]: https://bugzilla.redhat.com/show_bug.cgi?id=1948089#c20
[5]: https://github.com/openshift/origin/blame/73f3c46763dc2afe16400f5e1cc18f7d2f399a59/pkg/synthetictests/operators.go#L67-L68

The changes which were made to 4.9 have been backported successfully to 4.8, which was my request. Given that there are other contributing factors which lead to these tests flaking, but not failing, as Jamo mentioned, I will mark this as CLOSED CURRENTRELEASE. That should not, however, stop us from pursuing additional fixes which reduce the flake rate of this job.