The upstream "Should be able to support the 1.17 Sample API Server using the current Aggregator" test causes an alert to fire on the cluster, which we try to suppress to ensure we don't fire spurious alerts. A bug in the test masked this firing before.

The PR run https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/4686/pull-ci-openshift-installer-master-e2e-aws/1364976402780131328 shows the problem:

{ s: "query failed: sort_desc(count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured\",alertstate=\"firing\",severity!=\"info\"}[25m11s:1s]) > 0): promQL query: sort_desc(count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured\",alertstate=\"firing\",severity!=\"info\"}[25m11s:1s]) > 0) had reported incorrect results:\n[{\"metric\":{\"alertname\":\"AggregatedAPIDown\",\"alertstate\":\"firing\",\"name\":\"v1alpha1.wardle.example.com\",\"namespace\":\"default\",\"severity\":\"warning\"},\"value\":[1614275407.062,\"300\"]}]", }

The test that creates this server finishes at 17:30:

passed: (42.5s) 2021-02-25T17:30:02 "[sig-api-machinery] Aggregator Should be able to support the 1.17 Sample API Server using the current Aggregator [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]"

Looking at the metrics for this PR, the alert goes pending at 17:30:23 (which is awesome! our metrics work), then fires 5m later and keeps firing for another 5m, stopping at 17:40. Looking at the test, we try to clean up the aggregated API object, but it seems to take 10m to delete it(?).

Expected: the test cleans up the aggregated API object and ensures it is gone before removing the deployment, so that the alert doesn't fire (assuming the alert is correct and the cache the alert reads isn't out of date).

Setting high priority but medium severity; we will work around this for now in 4.8.
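As a consistency check on the timeline above, here is a hypothetical reconstruction assuming the alert uses a standard 5-minute `for:` clause (an assumption on my part; the actual rule definition isn't quoted here):

```python
from datetime import datetime, timedelta

# Hypothetical timeline reconstruction (assumed 5m `for:` clause, not
# taken from the actual alert rule): the expression first evaluates
# true ("pending") at 17:30:23, and the alert transitions to firing
# once it has been pending for the full `for:` duration.
pending = datetime(2021, 2, 25, 17, 30, 23)
for_clause = timedelta(minutes=5)

firing = pending + for_clause               # 17:35:23, alert starts firing
resolved = pending + timedelta(minutes=10)  # 17:40:23, expression window ages out

print(firing.time())                        # 17:35:23
print((resolved - firing).total_seconds())  # 300.0
```

The 300 seconds of firing matches the "value":[...,"300"] in the failed query result above, which at least makes the observed window internally consistent.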
Fails about 1-2% of CI jobs today.
Temporarily blocked from happening in 4.8 in https://github.com/openshift/origin/pull/25925
This seems to be a race between deregistering an aggregated API server and observing it from the individual Kube APIs. For example, in the test run mentioned in the description the aggregated API was deleted at ~17:30:00 and it was observed as unavailable around that time by the Kube APIs (see the logs below). I suspect that it was observed as unavailable because the test started tearing down the deployment right after deregistration. In addition to that, it seems that the alert itself is problematic.

# wardle API went unavailable at 17:30
query: aggregator_unavailable_apiservice{name="v1alpha1.wardle.example.com"}
output: aggregator_unavailable_apiservice{apiserver="kube-apiserver",endpoint="https",instance="10.0.242.84:6443",job="apiserver",name="v1alpha1.wardle.example.com",namespace="default",service="kubernetes"} 1

# repeating the same query one minute later (17:31) gives:
query: aggregator_unavailable_apiservice{name="v1alpha1.wardle.example.com"}
output: no data

# the alert is defined as:
(1 - max by(name, namespace)(avg_over_time(aggregator_unavailable_apiservice[10m]))) * 100 < 85
# it takes an average over a 10-minute period, which matches the observation: the alert started firing at 17:30 and stopped at 17:40
# there were no changes in availability for that service past 17:30, hence the above alert held true for 10 minutes

# delete request from the audit log:
{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"ffa6fad7-f24b-4651-a469-f03492ca2924","stage":"ResponseComplete","requestURI":"/apis/apiregistration.k8s.io/v1/apiservices/v1alpha1.wardle.example.com","verb":"delete","user":{"username":"system:admin","groups":["system:masters","system:authenticated"]},"sourceIPs":["10.0.81.123"],"userAgent":"openshift-tests/v0.0.0 (linux/amd64) kubernetes/$Format -- [sig-api-machinery] Aggregator Should be able to support the 1.17 Sample API Server using the current Aggregator [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]","objectRef":{"resource":"apiservices","name":"v1alpha1.wardle.example.com","apiGroup":"apiregistration.k8s.io","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","code":200},"requestReceivedTimestamp":"2021-02-25T17:30:00.408853Z","stageTimestamp":"2021-02-25T17:30:00.494837Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":""}}

# master kube-apiserver-ip-10-0-159-102.us-west-1.compute.internal:
2021-02-25T17:29:54.221208060Z W0225 17:29:54.221094 18 admission.go:98] ClusterQuotaAdmission received non-meta object *unstructured.Unstructured
2021-02-25T17:29:54.222102147Z E0225 17:29:54.222041 18 available_controller.go:508] v1alpha1.wardle.example.com failed with: failing or missing response from https://10.129.2.149:443/apis/wardle.example.com/v1alpha1: Get "https://10.129.2.149:443/apis/wardle.example.com/v1alpha1": read tcp 10.130.0.1:50444->10.129.2.149:443: read: connection reset by peer
2021-02-25T17:29:54.230045482Z E0225 17:29:54.229960 18 available_controller.go:508] v1alpha1.wardle.example.com failed with: failing or missing response from https://10.129.2.149:443/apis/wardle.example.com/v1alpha1: Get "https://10.129.2.149:443/apis/wardle.example.com/v1alpha1": dial tcp 10.129.2.149:443: connect: connection refused
2021-02-25T17:29:54.231349135Z E0225 17:29:54.231300 18 available_controller.go:508] v1alpha1.wardle.example.com failed with: failing or missing response from https://10.129.2.149:443/apis/wardle.example.com/v1alpha1: Get "https://10.129.2.149:443/apis/wardle.example.com/v1alpha1": dial tcp 10.129.2.149:443: connect: connection refused
2021-02-25T17:29:54.233393346Z E0225 17:29:54.233326 18 available_controller.go:508] v1alpha1.wardle.example.com failed with: failing or missing response from https://10.129.2.149:443/apis/wardle.example.com/v1alpha1: Get "https://10.129.2.149:443/apis/wardle.example.com/v1alpha1": dial tcp 10.129.2.149:443: connect: connection refused
2021-02-25T17:29:54.242763564Z E0225 17:29:54.242711 18 available_controller.go:508] v1alpha1.wardle.example.com failed with: failing or missing response from https://10.129.2.149:443/apis/wardle.example.com/v1alpha1: Get "https://10.129.2.149:443/apis/wardle.example.com/v1alpha1": dial tcp 10.129.2.149:443: connect: connection refused
2021-02-25T17:29:54.324968152Z E0225 17:29:54.323778 18 available_controller.go:508] v1alpha1.wardle.example.com failed with: failing or missing response from https://10.129.2.149:443/apis/wardle.example.com/v1alpha1: Get "https://10.129.2.149:443/apis/wardle.example.com/v1alpha1": dial tcp 10.129.2.149:443: connect: connection refused
2021-02-25T17:29:54.484799182Z E0225 17:29:54.484723 18 available_controller.go:508] v1alpha1.wardle.example.com failed with: failing or missing response from https://10.129.2.149:443/apis/wardle.example.com/v1alpha1: Get "https://10.129.2.149:443/apis/wardle.example.com/v1alpha1": dial tcp 10.129.2.149:443: connect: connection refused
2021-02-25T17:29:54.805843725Z E0225 17:29:54.805754 18 available_controller.go:508] v1alpha1.wardle.example.com failed with: failing or missing response from https://10.129.2.149:443/apis/wardle.example.com/v1alpha1: Get "https://10.129.2.149:443/apis/wardle.example.com/v1alpha1": dial tcp 10.129.2.149:443: connect: connection refused
2021-02-25T17:29:55.103813426Z W0225 17:29:55.103730 18 handler_proxy.go:103] no RequestInfo found in the context
2021-02-25T17:29:55.103877085Z E0225 17:29:55.103775 18 controller.go:116] loading OpenAPI spec for "v1alpha1.wardle.example.com" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable
2021-02-25T17:29:55.103877085Z , Header: map[Content-Type:[text/plain; charset=utf-8] X-Content-Type-Options:[nosniff]]
2021-02-25T17:29:55.103877085Z I0225 17:29:55.103784 18 controller.go:129] OpenAPI AggregationController: action for item v1alpha1.wardle.example.com: Rate Limited Requeue.
2021-02-25T17:29:55.398163348Z I0225 17:29:55.398083 18 client.go:360] parsed scheme: "passthrough"
2021-02-25T17:29:55.398163348Z I0225 17:29:55.398119 18 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{https://10.0.171.93:2379 <nil> 0 <nil>}] <nil> <nil>}
2021-02-25T17:29:55.398163348Z I0225 17:29:55.398130 18 clientconn.go:948] ClientConn switching balancer to "pick_first"
2021-02-25T17:29:55.398296362Z I0225 17:29:55.398243 18 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc0440f8480, {CONNECTING <nil>}
2021-02-25T17:29:55.404970728Z I0225 17:29:55.404889 18 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc0440f8480, {READY <nil>}
2021-02-25T17:29:55.405815321Z I0225 17:29:55.405760 18 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2021-02-25T17:29:55.446826087Z E0225 17:29:55.446741 18 available_controller.go:508] v1alpha1.wardle.example.com failed with: failing or missing response from https://10.129.2.149:443/apis/wardle.example.com/v1alpha1: Get "https://10.129.2.149:443/apis/wardle.example.com/v1alpha1": dial tcp 10.129.2.149:443: connect: connection refused
2021-02-25T17:29:55.572147540Z E0225 17:29:55.572009 18 available_controller.go:508] v1alpha1.wardle.example.com failed with: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io "v1alpha1.wardle.example.com": the object has been modified; please apply your changes to the latest version and try again
2021-02-25T17:29:56.096089652Z W0225 17:29:56.096008 18 admission.go:98] ClusterQuotaAdmission received non-meta object *unstructured.Unstructured
2021-02-25T17:29:56.570719216Z W0225 17:29:56.570634 18 handler_proxy.go:103] no RequestInfo found in the context
2021-02-25T17:29:56.570768751Z E0225 17:29:56.570675 18 controller.go:116] loading OpenAPI spec for "v1alpha1.wardle.example.com" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable
2021-02-25T17:29:56.570768751Z , Header: map[Content-Type:[text/plain; charset=utf-8] X-Content-Type-Options:[nosniff]]
2021-02-25T17:29:56.570768751Z I0225 17:29:56.570686 18 controller.go:129] OpenAPI AggregationController: action for item v1alpha1.wardle.example.com: Rate Limited Requeue.
2021-02-25T17:29:57.340713273Z W0225 17:29:57.340634 18 admission.go:98] ClusterQuotaAdmission received non-meta object *unstructured.Unstructured
2021-02-25T17:29:57.443778327Z E0225 17:29:57.443641 18 available_controller.go:508] v1alpha1.wardle.example.com failed with: failing or missing response from https://10.129.2.149:443/apis/wardle.example.com/v1alpha1: bad status from https://10.129.2.149:443/apis/wardle.example.com/v1alpha1: 403
2021-02-25T17:29:57.447601048Z E0225 17:29:57.447540 18 available_controller.go:508] v1alpha1.wardle.example.com failed with: failing or missing response from https://10.129.2.149:443/apis/wardle.example.com/v1alpha1: bad status from https://10.129.2.149:443/apis/wardle.example.com/v1alpha1: 403
2021-02-25T17:29:57.450984552Z E0225 17:29:57.450922 18 available_controller.go:508] v1alpha1.wardle.example.com failed with: failing or missing response from https://10.129.2.149:443/apis/wardle.example.com/v1alpha1: bad status from https://10.129.2.149:443/apis/wardle.example.com/v1alpha1: 403
2021-02-25T17:29:57.472657323Z E0225 17:29:57.472564 18 available_controller.go:508] v1alpha1.wardle.example.com failed with: failing or missing response from https://10.129.2.149:443/apis/wardle.example.com/v1alpha1: bad status from https://10.129.2.149:443/apis/wardle.example.com/v1alpha1: 403
2021-02-25T17:29:57.479351423Z W0225 17:29:57.479284 18 admission.go:98] ClusterQuotaAdmission received non-meta object *unstructured.Unstructured
2021-02-25T17:29:57.522789843Z W0225 17:29:57.522724 18 admission.go:98] ClusterQuotaAdmission received non-meta object *unstructured.Unstructured
2021-02-25T17:29:57.530417313Z E0225 17:29:57.529943 18 available_controller.go:508] v1alpha1.wardle.example.com failed with: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io "v1alpha1.wardle.example.com": the object has been modified; please apply your changes to the latest version and try again

# master kube-apiserver-ip-10-0-171-93.us-west-1.compute.internal:
2021-02-25T17:29:54.221388604Z E0225 17:29:54.221285 18 available_controller.go:508] v1alpha1.wardle.example.com failed with: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io "v1alpha1.wardle.example.com": the object has been modified; please apply your changes to the latest version and try again
2021-02-25T17:29:54.235478592Z E0225 17:29:54.235400 18 available_controller.go:508] v1alpha1.wardle.example.com failed with: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io "v1alpha1.wardle.example.com": the object has been modified; please apply your changes to the latest version and try again
2021-02-25T17:29:54.236334357Z E0225 17:29:54.236248 18 available_controller.go:508] v1alpha1.wardle.example.com failed with: failing or missing response from https://10.129.2.149:443/apis/wardle.example.com/v1alpha1: Get "https://10.129.2.149:443/apis/wardle.example.com/v1alpha1": dial tcp 10.129.2.149:443: connect: connection refused
2021-02-25T17:29:54.246279813Z E0225 17:29:54.246222 18 available_controller.go:508] v1alpha1.wardle.example.com failed with: failing or missing response from https://10.129.2.149:443/apis/wardle.example.com/v1alpha1: Get "https://10.129.2.149:443/apis/wardle.example.com/v1alpha1": dial tcp 10.129.2.149:443: connect: connection refused
2021-02-25T17:29:54.287214831Z E0225 17:29:54.287145 18 available_controller.go:508] v1alpha1.wardle.example.com failed with: failing or missing response from https://10.129.2.149:443/apis/wardle.example.com/v1alpha1: Get "https://10.129.2.149:443/apis/wardle.example.com/v1alpha1": dial tcp 10.129.2.149:443: connect: connection refused
2021-02-25T17:29:54.368232298Z E0225 17:29:54.368157 18 available_controller.go:508] v1alpha1.wardle.example.com failed with: failing or missing response from https://10.129.2.149:443/apis/wardle.example.com/v1alpha1: Get "https://10.129.2.149:443/apis/wardle.example.com/v1alpha1": dial tcp 10.129.2.149:443: connect: connection refused
2021-02-25T17:29:54.529232205Z E0225 17:29:54.529142 18 available_controller.go:508] v1alpha1.wardle.example.com failed with: failing or missing response from https://10.129.2.149:443/apis/wardle.example.com/v1alpha1: Get "https://10.129.2.149:443/apis/wardle.example.com/v1alpha1": dial tcp 10.129.2.149:443: connect: connection refused
2021-02-25T17:29:54.850163891Z E0225 17:29:54.850012 18 available_controller.go:508] v1alpha1.wardle.example.com failed with: failing or missing response from https://10.129.2.149:443/apis/wardle.example.com/v1alpha1: Get "https://10.129.2.149:443/apis/wardle.example.com/v1alpha1": dial tcp 10.129.2.149:443: connect: connection refused
2021-02-25T17:29:55.002488977Z W0225 17:29:55.002333 18 handler_proxy.go:103] no RequestInfo found in the context
2021-02-25T17:29:55.002488977Z E0225 17:29:55.002392 18 controller.go:116] loading OpenAPI spec for "v1alpha1.wardle.example.com" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable
2021-02-25T17:29:55.002488977Z , Header: map[Content-Type:[text/plain; charset=utf-8] X-Content-Type-Options:[nosniff]]
2021-02-25T17:29:55.002488977Z I0225 17:29:55.002402 18 controller.go:129] OpenAPI AggregationController: action for item v1alpha1.wardle.example.com: Rate Limited Requeue.
2021-02-25T17:29:55.495229074Z E0225 17:29:55.495133 18 available_controller.go:508] v1alpha1.wardle.example.com failed with: failing or missing response from https://10.129.2.149:443/apis/wardle.example.com/v1alpha1: Get "https://10.129.2.149:443/apis/wardle.example.com/v1alpha1": dial tcp 10.129.2.149:443: connect: connection refused
2021-02-25T17:29:56.570170378Z W0225 17:29:56.570077 18 handler_proxy.go:103] no RequestInfo found in the context
2021-02-25T17:29:56.570245865Z E0225 17:29:56.570136 18 controller.go:116] loading OpenAPI spec for "v1alpha1.wardle.example.com" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable
2021-02-25T17:29:56.570245865Z , Header: map[Content-Type:[text/plain; charset=utf-8] X-Content-Type-Options:[nosniff]]
2021-02-25T17:29:56.570245865Z I0225 17:29:56.570149 18 controller.go:129] OpenAPI AggregationController: action for item v1alpha1.wardle.example.com: Rate Limited Requeue.
2021-02-25T17:29:57.452771150Z E0225 17:29:57.452661 18 available_controller.go:508] v1alpha1.wardle.example.com failed with: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io "v1alpha1.wardle.example.com": the object has been modified; please apply your changes to the latest version and try again
2021-02-25T17:29:57.456771254Z E0225 17:29:57.454918 18 available_controller.go:508] v1alpha1.wardle.example.com failed with: failing or missing response from https://10.129.2.149:443/apis/wardle.example.com/v1alpha1: bad status from https://10.129.2.149:443/apis/wardle.example.com/v1alpha1: 403
2021-02-25T17:29:57.461016697Z E0225 17:29:57.459958 18 available_controller.go:508] v1alpha1.wardle.example.com failed with: failing or missing response from https://10.129.2.149:443/apis/wardle.example.com/v1alpha1: bad status from https://10.129.2.149:443/apis/wardle.example.com/v1alpha1: 403
2021-02-25T17:29:57.484659145Z E0225 17:29:57.484557 18 available_controller.go:508] v1alpha1.wardle.example.com failed with: failing or missing response from https://10.129.2.149:443/apis/wardle.example.com/v1alpha1: bad status from https://10.129.2.149:443/apis/wardle.example.com/v1alpha1: 403

# master kube-apiserver-ip-10-0-242-84.us-west-1.compute.internal:
2021-02-25T17:29:55.143192921Z E0225 17:29:55.143105 20 available_controller.go:508] v1alpha1.wardle.example.com failed with: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io "v1alpha1.wardle.example.com": the object has been modified; please apply your changes to the latest version and try again
2021-02-25T17:29:55.160665156Z E0225 17:29:55.160571 20 available_controller.go:508] v1alpha1.wardle.example.com failed with: failing or missing response from https://10.129.2.149:443/apis/wardle.example.com/v1alpha1: Get "https://10.129.2.149:443/apis/wardle.example.com/v1alpha1": dial tcp 10.129.2.149:443: connect: connection refused
2021-02-25T17:29:55.169672724Z E0225 17:29:55.169606 20 available_controller.go:508] v1alpha1.wardle.example.com failed with: failing or missing response from https://10.129.2.149:443/apis/wardle.example.com/v1alpha1: Get "https://10.129.2.149:443/apis/wardle.example.com/v1alpha1": dial tcp 10.129.2.149:443: connect: connection refused
2021-02-25T17:29:55.177612428Z E0225 17:29:55.177550 20 available_controller.go:508] v1alpha1.wardle.example.com failed with: failing or missing response from https://10.129.2.149:443/apis/wardle.example.com/v1alpha1: Get "https://10.129.2.149:443/apis/wardle.example.com/v1alpha1": dial tcp 10.129.2.149:443: connect: connection refused
2021-02-25T17:29:55.238687448Z E0225 17:29:55.238615 20 available_controller.go:508] v1alpha1.wardle.example.com failed with: failing or missing response from https://10.129.2.149:443/apis/wardle.example.com/v1alpha1: Get "https://10.129.2.149:443/apis/wardle.example.com/v1alpha1": dial tcp 10.129.2.149:443: connect: connection refused
2021-02-25T17:29:55.321120974Z E0225 17:29:55.321046 20 available_controller.go:508] v1alpha1.wardle.example.com failed with: failing or missing response from https://10.129.2.149:443/apis/wardle.example.com/v1alpha1: Get "https://10.129.2.149:443/apis/wardle.example.com/v1alpha1": dial tcp 10.129.2.149:443: connect: connection refused
2021-02-25T17:29:55.484710217Z E0225 17:29:55.484631 20 available_controller.go:508] v1alpha1.wardle.example.com failed with: failing or missing response from https://10.129.2.149:443/apis/wardle.example.com/v1alpha1: Get "https://10.129.2.149:443/apis/wardle.example.com/v1alpha1": dial tcp 10.129.2.149:443: connect: connection refused
2021-02-25T17:29:55.584729640Z E0225 17:29:55.584654 20 available_controller.go:508] v1alpha1.wardle.example.com failed with: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io "v1alpha1.wardle.example.com": the object has been modified; please apply your changes to the latest version and try again
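Why the alert keeps firing for a full 10 minutes after a single bad sample can be sketched with a toy avg_over_time (a simplified model of the PromQL semantics, not the real Prometheus implementation):

```python
# Toy model of the alert expression (an approximation, not real PromQL):
#   (1 - max by(name, namespace)(avg_over_time(aggregator_unavailable_apiservice[10m]))) * 100 < 85
# Samples are (minute, value) pairs; value 1 means "unavailable". After
# the APIService object is deleted the series simply goes stale, so no
# further samples (including a recovering 0) ever arrive.

def avg_over_time(samples, t, window=10):
    """Average of the samples inside the lookback window (t - window, t]."""
    pts = [v for ts, v in samples if t - window < ts <= t]
    return sum(pts) / len(pts) if pts else None

def alert_expr_true(samples, t):
    avg = avg_over_time(samples, t)
    if avg is None:
        return False  # no data in the window: the expression returns nothing
    return (1 - avg) * 100 < 85

# Minute 0 = 17:30, when the single "unavailable" sample was scraped.
samples = [(0, 1)]

print(alert_expr_true(samples, 1))   # True  (17:31: stale sample still in the window)
print(alert_expr_true(samples, 9))   # True  (17:39: still averaging over that one sample)
print(alert_expr_true(samples, 11))  # False (17:41: the sample has aged out)
```

Because the series never returns to 0, the lone stale `1` dominates the 10-minute average until it falls out of the lookback window, which lines up with the observed 17:30-to-17:40 firing.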
To solve the issue we could:
1. Change the test to only tear down the aggregated API once it has been deregistered from the individual Kube APIs.
2. Change/fix the alert itself.
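Option 1 amounts to a poll-until-gone loop in the test's teardown. A minimal, self-contained sketch (`apiservice_gone_on` is a hypothetical stand-in for checking each kube-apiserver instance; a real test would query the aggregator on every master):

```python
import time

def wait_for_deregistration(instances, apiservice_gone_on,
                            timeout=60.0, interval=1.0, sleep=time.sleep):
    """Poll until every kube-apiserver instance reports the APIService gone.

    Returns True once all instances agree, False on timeout. Only after
    this returns True would the test tear down the backing deployment,
    so no apiserver is still routing to (and health-checking) a dead backend.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if all(apiservice_gone_on(i) for i in instances):
            return True
        sleep(interval)
    return False

# Example with a fake checker that only succeeds after a few polls:
state = {"polls": 0}
def fake_gone(instance):
    state["polls"] += 1
    return state["polls"] > 4

ok = wait_for_deregistration(["master-0", "master-1"], fake_gone,
                             timeout=5, interval=0, sleep=lambda _: None)
print(ok)  # True
```

The sleep/clock injection keeps the helper unit-testable; in the real e2e test the timeout would need to comfortably exceed the aggregator's availability-controller resync interval.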
I'm going to open a PR that simply sets a long timeout as a workaround.
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
Closing as the upstream PR didn't get any traction and the test got better.
This is showing up in CI. You can reproduce it there or gather as much data as you need to debug; it's clearly reproducible. I'm seeing it as "alert KubeAggregatedAPIDown fired for 300 seconds with labels: {name="v1alpha1.wardle.example.com", namespace="default", severity="warning"}":

https://search.ci.openshift.org/?search=alert+KubeAggregatedAPIDown+fired+for+300+seconds+with+labels.*v1alpha1.wardle.example.com&maxAge=48h&context=1&type=bug%2Bjunit&name=4.10&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

It's in about 2% of runs, which is a lot to be failing.
Dear reporter, we greatly appreciate the bug you have reported here. Unfortunately, due to the migration to a new issue-tracking system (https://issues.redhat.com/), we cannot continue triaging bugs reported in Bugzilla. Since this bug has been stale for multiple days, we have decided to close it. If you think this is a mistake, or this bug deserves a higher priority or severity than set today, please feel free to reopen it and tell us why. We are going to move every re-opened bug to https://issues.redhat.com. Thank you for your patience and understanding.