From CI runs like [1]:

: [bz-openshift-apiserver] clusteroperator/openshift-apiserver should not change condition/Available

Run #0: Failed  0s
6 unexpected clusteroperator state transitions during e2e test run

Apr 09 13:17:46.730 - 1s   E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.apps.openshift.io: not available: failing or missing response from https://10.128.0.73:8443/apis/apps.openshift.io/v1: Get "https://10.128.0.73:8443/apis/apps.openshift.io/v1": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Apr 09 13:18:01.727 - 1s   E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.build.openshift.io: not available: failing or missing response from https://10.128.0.73:8443/apis/build.openshift.io/v1: Get "https://10.128.0.73:8443/apis/build.openshift.io/v1": context deadline exceeded
Apr 09 13:18:16.874 - 1s   E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.apps.openshift.io: not available: failing or missing response from https://10.129.0.50:8443/apis/apps.openshift.io/v1: Get "https://10.129.0.50:8443/apis/apps.openshift.io/v1": context deadline exceeded
APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.project.openshift.io: not available: failing or missing response from https://10.128.0.73:8443/apis/project.openshift.io/v1: Get "https://10.128.0.73:8443/apis/project.openshift.io/v1": dial tcp 10.128.0.73:8443: i/o timeout
Apr 09 13:25:10.247 - 25s  E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: "apps.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
APIServicesAvailable: "authorization.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
APIServicesAvailable: "build.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
APIServicesAvailable: "image.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
APIServicesAvailable: "project.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
APIServicesAvailable: "quota.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
APIServicesAvailable: "route.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
APIServicesAvailable: "security.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
APIServicesAvailable: "template.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
Apr 09 13:31:25.504 - 1s   E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.apps.openshift.io: not available: failing or missing response from https://10.128.0.10:8443/apis/apps.openshift.io/v1: Get "https://10.128.0.10:8443/apis/apps.openshift.io/v1": context deadline exceeded
Apr 09 13:31:45.328 - 1s   E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.build.openshift.io: not available: failing or missing response from https://10.130.0.14:8443/apis/build.openshift.io/v1: Get "https://10.130.0.14:8443/apis/build.openshift.io/v1": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

Very popular:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&name=^periodic.*upgrade&type=junit&search=clusteroperator/openshift-apiserver+should+not+change+condition/Available' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 17 runs, 100% failed, 88% of failures match = 88% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 19 runs, 100% failed, 95% of failures match = 95% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 17 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 4 runs, 100% failed, 75% of failures match = 75% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 21 runs, 100% failed, 81% of failures match = 81% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 10 runs, 80% failed, 50% of failures match = 40% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade (all) - 10 runs, 50% failed, 60% of failures match = 30% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 10 runs, 100% failed, 90% of failures match = 90% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 10 runs, 100% failed, 90% of failures match = 90% impact

Possibly a dup of some non-update bug, but if so, please mention the test-case in that bug for Sippy ;).

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/4831/pull-ci-openshift-installer-master-e2e-aws-upgrade/1380486185595441152
*** This bug has been marked as a duplicate of bug 1943442 ***
This update issue is possibly a dup of the bug 1926867 series?
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.
Still popular:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&name=^periodic.*upgrade&type=junit&search=clusteroperator/openshift-apiserver+should+not+change+condition/Available' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 18 runs, 78% failed, 107% of failures match = 83% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-from-stable-4.7-from-stable-4.6-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 19 runs, 100% failed, 84% of failures match = 84% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 18 runs, 100% failed, 89% of failures match = 89% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 4 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 3 runs, 100% failed, 67% of failures match = 67% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 20 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 8 runs, 100% failed, 13% of failures match = 13% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade (all) - 7 runs, 86% failed, 50% of failures match = 43% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 8 runs, 100% failed, 63% of failures match = 63% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 8 runs, 100% failed, 100% of failures match = 100% impact
The LifecycleStale keyword was removed because the bug got commented on recently. The bug assignee was notified.
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
Still popular:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&name=^periodic.*upgrade&type=junit&search=clusteroperator/openshift-apiserver+should+not+change+condition/Available' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 4 runs, 75% failed, 100% of failures match = 75% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-from-stable-4.7-from-stable-4.6-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-compact-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 9 runs, 100% failed, 67% of failures match = 67% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 4 runs, 75% failed, 100% of failures match = 75% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 4 runs, 100% failed, 75% of failures match = 75% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 2 runs, 50% failed, 200% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-vsphere-upgrade (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-aws-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-azure-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 10 runs, 80% failed, 88% of failures match = 70% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade (all) - 14 runs, 93% failed, 23% of failures match = 21% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade-rollback (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade (all) - 10 runs, 80% failed, 25% of failures match = 20% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-ovn-upgrade (all) - 4 runs, 100% failed, 25% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 75% of failures match = 75% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-ovirt-upgrade (all) - 3 runs, 100% failed, 33% of failures match = 33% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi-upgrade (all) - 2 runs, 50% failed, 200% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.9-upgrade-from-stable-4.8-e2e-metal-ipi-upgrade (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
This really should've been a 4.8.0 blocker, but that intent was never conveyed to the assignees. I'm marking this as a blocker for 4.9.0 and would request that we backport the fix to 4.8 as soon as reasonable. We really need to get rid of the negative signal we generate during upgrades by operators going degraded during normal operations.
Updating fields based on comment 9 and to clear the blocker? list.
The PR merged long ago, and we have done bumps in cluster-kube-apiserver-operator since then.
Still popular in CI, including for 4.9 jobs:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&name=^periodic.*upgrade&type=junit&search=clusteroperator/openshift-apiserver+should+not+change+condition/Available' | grep 'failures match' | sort
periodic-ci-openshift-multiarch-master-nightly-4.8-upgrade-from-nightly-4.7-ocp-remote-libvirt-s390x (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 7 runs, 71% failed, 40% of failures match = 29% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 7 runs, 29% failed, 350% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 7 runs, 86% failed, 83% of failures match = 71% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-vsphere-upgrade (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-aws-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 59 runs, 90% failed, 34% of failures match = 31% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade (all) - 13 runs, 100% failed, 31% of failures match = 31% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade (all) - 13 runs, 85% failed, 45% of failures match = 38% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade (all) - 3 runs, 100% failed, 33% of failures match = 33% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 3 runs, 33% failed, 300% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 3 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade (all) - 3 runs, 67% failed, 50% of failures match = 33% impact
periodic-ci-openshift-release-master-nightly-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade (all) - 3 runs, 100% failed, 67% of failures match = 67% impact

Drilling into the periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade job:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=24h&name=^periodic.*upgrade&type=junit&search=clusteroperator/openshift-apiserver+should+not+change+condition/Available' | jq -r 'keys[]' | grep nightly-4.9-e2e-aws-upgrade/
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade/1428038724414869504

which has:

: [bz-openshift-apiserver] clusteroperator/openshift-apiserver should not change condition/Available

Run #0: Failed  48m18s
1 unexpected clusteroperator state transitions during e2e test run

Aug 18 18:16:03.612 - 793ms E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: [Get "https://172.30.0.1:443/apis/apiregistration.k8s.io/v1/apiservices/v1.route.openshift.io": context canceled, context canceled]

From the e2e-interval chart, that's happening as the first control-plane machine to reboot is coming back up, and the second one is about to start draining. I dunno if the "context canceled" is sufficiently different from this bug's original 503 to be worth a separate bug or not.
Statistics on the messages:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&name=^periodic.*upgrade&type=junit&search=clusteroperator/openshift-apiserver+condition/Available+status/False+reason/APIServicesAvailable.*is not ready: 503' | grep 'failures match'
...no hits...

So the original "is not ready: 503" message no longer shows up. The "context canceled" issue is unique to 4.9, which supports the new message being the same underlying issue, but with the messaging altered by the library-go pivot (or some other 4.9 change):

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&name=^periodic.*upgrade&type=junit&search=clusteroperator/openshift-apiserver+condition/Available+status/False+reason/APIServicesAvailable.*Get.*context+canceled' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 55 runs, 89% failed, 8% of failures match = 7% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade (all) - 13 runs, 100% failed, 8% of failures match = 8% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade (all) - 13 runs, 85% failed, 18% of failures match = 15% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade (all) - 3 runs, 67% failed, 50% of failures match = 33% impact
periodic-ci-openshift-release-master-nightly-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade (all) - 3 runs, 100% failed, 67% of failures match = 67% impact

Poking around in job names involving 4.9:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=24h&name=^periodic.*4.9.*upgrade&type=junit&context=0&search=clusteroperator/openshift-apiserver+condition/Available+status/False+reason/APIServicesAvailable' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's/.*APIServicesAvailable: //' | sort | uniq -c | sort -n
...lots of stuff...

Other common failure messages include "context deadline exceeded", "dial tcp ...:8443: i/o timeout", "Client.Timeout exceeded while awaiting headers", "etcdserver: leader changed", and other things going on as well.

$ curl -s 'https://search.ci.openshift.org/search?maxAge=24h&name=^periodic.*4.9.*upgrade&type=junit&context=0&search=clusteroperator/openshift-apiserver+condition/Available+status/False+reason/APIServicesAvailable' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's/.*APIServicesAvailable: //' | sort | grep "context canceled" | wc -l
12
$ curl -s 'https://search.ci.openshift.org/search?maxAge=24h&name=^periodic.*4.9.*upgrade&type=junit&context=0&search=clusteroperator/openshift-apiserver+condition/Available+status/False+reason/APIServicesAvailable' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's/.*APIServicesAvailable: //' | sort | grep "context deadline exceeded" | wc -l
3
$ curl -s 'https://search.ci.openshift.org/search?maxAge=24h&name=^periodic.*4.9.*upgrade&type=junit&context=0&search=clusteroperator/openshift-apiserver+condition/Available+status/False+reason/APIServicesAvailable' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's/.*APIServicesAvailable: //' | sort | grep "dial tcp.*8443: i/o timeout" | wc -l
4

The other modes each occurred only once in the past 24h.
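For convenience, the per-message counting above can be folded into a single pass. This is just a minimal bash sketch: it reuses the exact search URL and jq expression from the commands above, and the substring list is only illustrative.

#!/usr/bin/env bash
# Count how often each failure-message substring shows up in the matched
# APIServicesAvailable transitions over the past 24h (same search as above).
url='https://search.ci.openshift.org/search?maxAge=24h&name=^periodic.*4.9.*upgrade&type=junit&context=0&search=clusteroperator/openshift-apiserver+condition/Available+status/False+reason/APIServicesAvailable'

# Fetch once, extract the matched context lines, and strip the common prefix.
messages="$(curl -s "$url" | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's/.*APIServicesAvailable: //')"

# Illustrative substrings; extend the list as new failure modes show up.
for pattern in 'context canceled' 'context deadline exceeded' 'dial tcp.*8443: i/o timeout' 'is not ready: 503'; do
  count="$(grep -c "$pattern" <<<"$messages" || true)"
  printf '%5s  %s\n' "$count" "$pattern"
done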
It looks like there are at least 3 categories of errors:

1. An "i/o timeout" while connecting to an aggregated API - possibly an SDN error:
   https://search.ci.openshift.org/?search=clusteroperator%2Fopenshift-apiserver+condition%2FAvailable+status%2FFalse+reason%2FAPIServicesAvailable.*Get.*dial&maxAge=24h&context=1&type=junit&name=periodic.*4.9.*upgrade&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

2. A client-side timeout ("context canceled") while getting an APIService resource from the Kube API Server:
   https://search.ci.openshift.org/?search=clusteroperator%2Fopenshift-apiserver+condition%2FAvailable+status%2FFalse+reason%2FAPIServicesAvailable.*Get.*context+canceled&maxAge=24h&context=1&type=junit&name=periodic.*4.9.*upgrade&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

3. A "failing or missing response from an aggregated API ... context deadline exceeded" error, reported by the Kube API Server status controller:
   https://search.ci.openshift.org/?search=clusteroperator%2Fopenshift-apiserver+condition%2FAvailable+status%2FFalse+reason%2FAPIServicesAvailable.*Get.*8443%2Fapis.*context&maxAge=24h&context=1&type=junit&name=periodic.*4.9.*upgrade&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
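For a rough sense of how common each category is, the three searches above can be run through the same w3m-based impact query used earlier in this bug. A minimal sketch; the regexes simply mirror the three links above:

#!/usr/bin/env bash
# Rough per-category impact over the last 24h of 4.9 upgrade periodics.
# Regex 1: dial / i-o timeout to the aggregated API (possible SDN issue).
# Regex 2: client-side "context canceled" getting the APIService resource.
# Regex 3: aggregated-API discovery check hitting a context deadline.
base='https://search.ci.openshift.org/?maxAge=24h&type=junit&name=periodic.*4.9.*upgrade&context=1&maxMatches=5&maxBytes=20971520&groupBy=job&search='
for category in \
  'clusteroperator/openshift-apiserver+condition/Available+status/False+reason/APIServicesAvailable.*Get.*dial' \
  'clusteroperator/openshift-apiserver+condition/Available+status/False+reason/APIServicesAvailable.*Get.*context+canceled' \
  'clusteroperator/openshift-apiserver+condition/Available+status/False+reason/APIServicesAvailable.*Get.*8443/apis.*context'; do
  echo "=== ${category}"
  w3m -dump -cols 200 "${base}${category}" | grep 'failures match' | sort
done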
Looking at the first category, https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade/1429964492430643200

# The API server was unavailable around 01:43
01:43:32 openshift-apiserver-operator openshift-apiserver-operator-status-controller-statussyncer_openshift-apiserver openshift-apiserver-operator OperatorStatusChanged Status for clusteroperator/openshift-apiserver changed: Available changed from True to False ("APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.apps.openshift.io: not available: failing or missing response from https://10.130.0.60:8443/apis/apps.openshift.io/v1: Get \"https://10.130.0.60:8443/apis/apps.openshift.io/v1\": dial tcp 10.130.0.60:8443: i/o timeout\nAPIServicesAvailable: apiservices.apiregistration.k8s.io/v1.build.openshift.io: not available: failing or missing response from https://10.130.0.60:8443/apis/build.openshift.io/v1: Get \"https://10.130.0.60:8443/apis/build.openshift.io/v1\": dial tcp 10.130.0.60:8443: i/o timeout")

# The Kube API Server (actually a controller) on master-0 marked the API service as unavailable around that time
2021-08-24T01:43:32.042537997Z I0824 01:43:32.042433 18 available_controller.go:474] "changing APIService availability" name="v1.image.openshift.io" oldStatus=True newStatus=False message="failing or missing response from https://10.130.0.60:8443/apis/image.openshift.io/v1: Get \"https://10.130.0.60:8443/apis/image.openshift.io/v1\": dial tcp 10.130.0.60:8443: i/o timeout" reason="FailedDiscoveryCheck"
2021-08-24T01:43:32.049374015Z I0824 01:43:32.049281 18 available_controller.go:474] "changing APIService availability" name="v1.apps.openshift.io" oldStatus=True newStatus=False message="failing or missing response from https://10.130.0.60:8443/apis/apps.openshift.io/v1: Get \"https://10.130.0.60:8443/apis/apps.openshift.io/v1\": dial tcp 10.130.0.60:8443: i/o timeout" reason="FailedDiscoveryCheck"
2021-08-24T01:43:32.050004876Z I0824 01:43:32.049947 18 available_controller.go:474] "changing APIService availability" name="v1.build.openshift.io" oldStatus=True newStatus=False message="failing or missing response from https://10.130.0.60:8443/apis/build.openshift.io/v1: Get \"https://10.130.0.60:8443/apis/build.openshift.io/v1\": dial tcp 10.130.0.60:8443: i/o timeout" reason="FailedDiscoveryCheck"
2021-08-24T01:43:32.050549703Z I0824 01:43:32.050489 18 available_controller.go:474] "changing APIService availability" name="v1.project.openshift.io" oldStatus=True newStatus=False message="failing or missing response from https://10.130.0.60:8443/apis/project.openshift.io/v1: Get \"https://10.130.0.60:8443/apis/project.openshift.io/v1\": context deadline exceeded" reason="FailedDiscoveryCheck"
2021-08-24T01:43:32.051029638Z I0824 01:43:32.050973 18 available_controller.go:474] "changing APIService availability" name="v1.template.openshift.io" oldStatus=True newStatus=False message="failing or missing response from https://10.128.0.74:8443/apis/template.openshift.io/v1: Get \"https://10.128.0.74:8443/apis/template.openshift.io/v1\": context deadline exceeded" reason="FailedDiscoveryCheck"

# master-0 was drained before 01:43
# 10.130.0.60 was running on master-1
# SDN on master-0 wasn't ready around 01:43
# In consequence, the controller wasn't able to reach a pod on a different host and marked the API service as unavailable
# The outage was reported by the operator
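For reference, the timeline above can be re-derived from the job artifacts. A minimal sketch, assuming the gathered event list and kube-apiserver pod logs have already been downloaded locally; the ARTIFACTS directory is a hypothetical placeholder, adjust it to wherever the job's artifacts land:

#!/usr/bin/env bash
# ARTIFACTS points to a hypothetical local copy of the job's gathered artifacts.
ARTIFACTS=${ARTIFACTS:-./artifacts}

# 1. When did the operator report Available=False? (operator events)
grep -rh 'OperatorStatusChanged.*openshift-apiserver.*Available changed from True to False' "$ARTIFACTS" | sort

# 2. Which APIService flips did the kube-apiserver availability controller record?
grep -rh 'changing APIService availability' "$ARTIFACTS" | grep 'openshift.io' | sort

# 3. Cross-check against node drain / SDN readiness around the same timestamps.
grep -rhE 'drain|sdn.*(NotReady|not ready)' "$ARTIFACTS" | sort | tail -n 50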
The failures in the second category seem to correspond to the operator being terminated. During termination, the context is canceled. The cancellation signal propagates to the in-flight network calls, which makes them fail. The failures are reported by the operator and, finally, the operator terminates.
https://bugzilla.redhat.com/show_bug.cgi?id=1998516 will address category #3 from c20 (non-blocker).
The first category seems to be resolved:
https://search.ci.openshift.org/?search=clusteroperator%2Fopenshift-apiserver+condition%2FAvailable+status%2FFalse+reason%2FAPIServicesAvailable.*Get.*context+canceled&maxAge=24h&context=1&type=junit&name=periodic.*4.9.*upgrade&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

The second looks better:
https://search.ci.openshift.org/?search=clusteroperator%2Fopenshift-apiserver+condition%2FAvailable+status%2FFalse+reason%2FAPIServicesAvailable.*Get.*dial&maxAge=24h&context=1&type=junit&name=periodic.*4.9.*upgrade&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

The overall availability also seems better:
https://search.ci.openshift.org/?search=clusteroperator%2Fopenshift-apiserver+condition%2FAvailable+status%2FFalse+reason%2FAPIServicesAvailable.*&maxAge=24h&context=1&type=junit&name=periodic.*4.9.*upgrade&excludeName=4.8&maxMatches=5&maxBytes=20971520&groupBy=job

Not sure if the periodic-ci-openshift-release-master jobs contain the fixes.
QE also noticed a similar issue in a QE upgrade test case, recorded in https://issues.redhat.com/browse/OCPQE-5300; we have not yet been able to investigate it in time and will keep watching the latest QE CI jobs and investigate. Could you give some clue on how to verify this bug? It seems we can only rely on CI? But CI is not completely free of failures, e.g. https://search.ci.openshift.org/?search=clusteroperator%2Fopenshift-apiserver+condition%2FAvailable+status%2FFalse.*openshift.io.v1%22+is+not+ready&maxAge=24h&context=1&type=junit&name=upgrade.*4%5C.9%7C4%5C.9.*upgrade&excludeName=4%5C.7&maxMatches=1&maxBytes=20971520&groupBy=job . Thanks
Probably the best way is to search for occurrences in CI. The fixes were applied only to 4.9.

So far we are still getting ([1]):
- "dial tcp 172.30.0.1:443: connect: connection refused" - it seems to be restricted to SNO clusters
- "APIServicesAvailable: etcdserver: leader changed" - which seems to be a response from the Kube API Server

[1] https://search.ci.openshift.org/?search=clusteroperator%2Fopenshift-apiserver+condition%2FAvailable+status%2FFalse+reason%2FAPIServicesAvailable.*&maxAge=24h&context=1&type=junit&name=periodic.*4.9.*upgrade&excludeName=4.8&maxMatches=5&maxBytes=20971520&groupBy=job
OK, using the approach from comment 19, this confirms comment 30's conclusion:

[xxia 2021-09-02 18:49:07 CST my]$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=clusteroperator%2Fopenshift-apiserver+condition%2FAvailable+status%2FFalse+reason%2FAPIServicesAvailable&maxAge=24h&context=0&type=junit&name=%5Eperiodic.*4.9.*upgrade&excludeName=4%5C.8&maxMatches=5&maxBytes=20971520&groupBy=job'

periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all)
#1433278112061198336 junit 5 hours ago
Sep 02 05:18:10.803 - 1s E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: etcdserver: leader changed
Sep 02 05:18:10.803 - 1s E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: etcdserver: leader changed
#1433223340943740928 junit 8 hours ago
Sep 02 01:48:19.803 - 2s E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: etcdserver: leader changed
Sep 02 01:48:19.803 - 2s E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: etcdserver: leader changed
#1433174896422162432 junit 12 hours ago
Sep 01 21:45:13.697 - 8s E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: etcdserver: leader changed
Sep 01 21:45:13.697 - 8s E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: etcdserver: leader changed
#1433038397445771264 junit 21 hours ago
Sep 01 12:38:18.730 - 1s E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: etcdserver: leader changed
Sep 01 12:38:18.730 - 1s E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: etcdserver: leader changed

periodic-ci-openshift-release-master-ci-4.9-e2e-azure-upgrade-single-node (all)
#1433116964133277696 junit 14 hours ago
Sep 01 18:03:14.778 - 5s E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: Get "https://172.30.0.1:443/apis/apiregistration.k8s.io/v1/apiservices/v1.apps.openshift.io": dial tcp 172.30.0.1:443: i/o timeout
Sep 01 18:03:14.778 - 5s E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: Get "https://172.30.0.1:443/apis/apiregistration.k8s.io/v1/apiservices/v1.apps.openshift.io": dial tcp 172.30.0.1:443: i/o timeout

periodic-ci-openshift-release-master-nightly-4.9-upgrade-from-stable-4.7-e2e-aws-upgrade-paused (all)
#1433132315776651264 junit 15 hours ago
Sep 01 19:58:20.010 - 6s E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.apps.openshift.io: not available: failing or missing response from https://10.129.0.11:8443/apis/apps.openshift.io/v1: Get "https://10.129.0.11:8443/apis/apps.openshift.io/v1": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Sep 01 19:58:20.010 - 6s E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.apps.openshift.io: not available: failing or missing response from https://10.129.0.11:8443/apis/apps.openshift.io/v1: Get "https://10.129.0.11:8443/apis/apps.openshift.io/v1": net/http: request canceled (Client.Timeout exceeded while awaiting headers)

I would therefore like to move this to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759