Bug 1991601
| Summary: | alert TargetDown fired for XXX seconds with labels: {job="machine-config-daemon", namespace="openshift-machine-config-operator", service="machine-config-daemon", severity="warning"} | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Jan Chaloupka <jchaloup> |
| Component: | Machine Config Operator | Assignee: | John Kyros <jkyros> |
| Machine Config Operator sub component: | Machine Config Operator | QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED DEFERRED | Docs Contact: | |
| Severity: | high | | |
| Priority: | low | CC: | bpickard, dmistry, dperique, ffernand, mfojtik, mkennell, mkrejci, mtarsel, scuppett, skumari, ssadhale, surbania, vvinnako, wking, zzhao |
| Version: | 4.9 | Keywords: | Reopened |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | tag-ci LifecycleStale | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | [sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Skipped:Disconnected] [Suite:openshift/conformance/parallel] |
| Last Closed: | 2023-03-09 01:05:21 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Jan Chaloupka
2021-08-09 13:49:23 UTC
*** Bug 1997916 has been marked as a duplicate of this bug. ***

oVirt e2e jobs have been painted red lately due to a test failure:
```
[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Skipped:Disconnected] 02:26:11.764:
Unexpected alerts fired or pending after the test run:
alert TargetDown fired for 60 seconds with labels: {job="machine-config-daemon", namespace="openshift-machine-config-operator", service="machine-config-daemon", severity="warning"}
```
I think we might be hitting BZ#1991601.
Can you please take a look at one of the example jobs [1][2][3] and tell me whether you think it is related to this BZ or something else?
[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-ovirt/1430743201987170304
[2] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-ovirt/1430706888411975680
[3] https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-informing#periodic-ci-openshift-release-master-ni[…].9-e2e-ovirt&sort-by-failures=&width=20
Looking through the logs, these alerts fire after all of the MC* have settled into a config, as was pointed out in the original post, which leads me to believe that our operator isn't causing the issue but that this is some sort of node availability/connectivity issue.

On closer inspection, when I look at the MCO oauth pods I see some uncharacteristic errors:

```
2021/08/08 10:14:03 http.go:107: HTTPS: listening on [::]:9001
2021/08/08 10:22:50 oauthproxy.go:791: requestauth: 192.168.3.52:59776 context canceled
2021/08/08 10:23:00 oauthproxy.go:791: requestauth: 192.168.3.51:39562 context canceled
E0808 10:23:10.889173 7487 webhook.go:111] Failed to make webhook authenticator request: Post "https://172.30.0.1:443/apis/authentication.k8s.io/v1/tokenreviews": context deadline exceeded
2021/08/08 10:23:10 oauthproxy.go:791: requestauth: 192.168.3.52:60356 Post "https://172.30.0.1:443/apis/authentication.k8s.io/v1/tokenreviews": context deadline exceeded
E0808 10:23:20.922126 7487 webhook.go:111] Failed to make webhook authenticator request: Post "https://172.30.0.1:443/apis/authentication.k8s.io/v1/tokenreviews": context deadline exceeded
2021/08/08 10:23:20 oauthproxy.go:791: requestauth: 192.168.3.51:39996 Post "https://172.30.0.1:443/apis/authentication.k8s.io/v1/tokenreviews": context deadline exceeded
```

Some things I notice: there was an unready master:

```
I0807 10:21:53.268258 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"1fff7539-7032-4e62-949a-17ba3147b2f5", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: 2 of 3 members are available, libvirt-ppc64le-1-1-9-kfv8v-master-2 is unhealthy" to "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found"
```

Looking closer, we also have a couple of events during the outage period that indicate connectivity issues:

```
10:24:05 (x38) etcd-quorum-guard-59f44c6cc4-mj45q Unhealthy Readiness probe failed:
10:24:05 (x2) openshift-oauth-apiserver apiserver-6dcd4887f7-wdjt4 Unhealthy Liveness probe failed: Get "https://10.129.0.18:8443/healthz": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
```

In auth:

```
0807 10:22:23.695993 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-authentication-operator", Name:"authentication-operator", UID:"ee972910-ddbc-4b6e-a7e2-83fc666fa518", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/authentication changed: Degraded message changed from "All is well" to "WellKnownReadyControllerDegraded: kube-apiserver oauth endpoint https://192.168.1.12:6443/.well-known/oauth-authorization-server is not yet served and authentication operator keeps waiting (check kube-apiserver operator, and check that instances roll out successfully, which can take several minutes per instance)"
```

Looking closer:

```
I0807 10:26:32.000535 1 status_controller.go:211] clusteroperator/authentication diff {"status":{"conditions":[{"lastTransitionTime":"2021-08-07T10:24:44Z","message":"WellKnownReadyControllerDegraded: need at least 3 kube-apiservers, got 2","reason":"WellKnownReadyController_SyncError","status":"True","type":"Degraded"},{"lastTransitionTime":"2021-08-07T10:17:13Z","message":"WellKnownReadyControllerProgressing: kube-apiserver oauth endpoint https://192.168.1.12:6443/.well-known/oauth-authorization-server is not yet served and authentication operator keeps waiting (check kube-apiserver operator, and check that instances roll out successfully, which can take several minutes per instance)","reason":"WellKnownReadyController_OAuthMetadataNotYetServed","status":"True","type":"Progressing"},{"lastTransitionTime":"2021-08-07T10:11:20Z","message":"WellKnownAvailable: The well-known endpoint is not yet available: need at least 3 kube-apiservers, got 2","reason":"WellKnown_NotReady","status":"False","type":"Available"},{"lastTransitionTime":"2021-08-07T10:11:20Z","message":"All is well","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
I0807 10:26:32.021561 1 status_controller.go:211] clusteroperator/authentication diff {"status":{"conditions":[{"lastTransitionTime":"2021-08-07T10:24:44Z","message":"WellKnownReadyControllerDegraded: need at least 3 kube-apiservers, got 2","reason":"WellKnownReadyController_SyncError","status":"True","type":"Degraded"},{"lastTransitionTime":"2021-08-07T10:17:13Z","message":"WellKnownReadyControllerProgressing: kube-apiserver oauth endpoint https://192.168.1.12:6443/.well-known/oauth-authorization-server is not yet served and authentication operator keeps waiting (check kube-apiserver operator, and check that instances roll out successfully, which can take several minutes per instance)","reason":"WellKnownReadyController_OAuthMetadataNotYetServed","status":"True","type":"Progressing"},{"lastTransitionTime":"2021-08-07T10:11:20Z","message":"WellKnownAvailable: The well-known endpoint is not yet available: need at least 3 kube-apiservers, got 2","reason":"WellKnown_NotReady","status":"False","type":"Available"},{"lastTransitionTime":"2021-08-07T10:11:20Z","message":"All is well","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
I0807 10:26:32.022829 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-authentication-operator", Name:"authentication-operator", UID:"ee972910-ddbc-4b6e-a7e2-83fc666fa518", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/authentication changed: Progressing message changed from "OAuthServerDeploymentProgressing: deployment/oauth-openshift.openshift-authentication: 1/3 pods have been updated to the latest generation\nWellKnownReadyControllerProgressing: kube-apiserver oauth endpoint https://192.168.1.12:6443/.well-known/oauth-authorization-server is not yet served and authentication operator keeps waiting (check kube-apiserver operator, and check that instances roll out successfully, which can take several minutes per instance)" to "WellKnownReadyControllerProgressing: kube-apiserver oauth endpoint https://192.168.1.12:6443/.well-known/oauth-authorization-server is not yet served and authentication operator keeps waiting (check kube-apiserver operator, and check that instances roll out successfully, which can take several minutes per instance)"
I0807 10:26:32.033244 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-authentication-operator", Name:"authentication-operator", UID:"ee972910-ddbc-4b6e-a7e2-83fc666fa518", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/authentication changed: Progressing message changed from "OAuthServerDeploymentProgressing: deployment/oauth-openshift.openshift-authentication: 1/3 pods have been updated to the latest generation\nWellKnownReadyControllerProgressing: kube-apiserver oauth endpoint https://192.168.1.12:6443/.well-known/oauth-authorization-server is not yet served and authentication operator keeps waiting (check kube-apiserver operator, and check that instances roll out successfully, which can take several minutes per instance)" to "WellKnownReadyControllerProgressing: kube-apiserver oauth endpoint https://192.168.1.12:6443/.well-known/oauth-authorization-server is not yet served and authentication operator keeps waiting (check kube-apiserver operator, and check that instances roll out successfully, which can take several minutes per instance)"
I0807 10:26:33.368190 1 request.go:668] Waited for 1.071089597s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-authentication/serviceaccounts/oauth-openshift
E0807 10:26:41.981307 1 base_controller.go:272] WellKnownReadyController reconciliation failed: need at least 3 kube-apiservers, got 2
E0807 10:27:11.966161 1 base_controller.go:272] WellKnownReadyController reconciliation failed: need at least 3 kube-apiservers, got 2
```

I'm not certain what this e2e testing is doing, but it seems to be causing issues with auth/kube-apiserver, which in turn seems to cause a brief machine-config-daemon TargetDown alert while the tests are running. I'm going to pass this along to the auth team to take a look.

Observing the failure 'alert TargetDown fired for xxxx seconds with labels: {job="machine-config-daemon", namespace="openshift-machine-config-operator", service="machine-config-daemon", severity="warning"}' on recent 4.9 s390x CI jobs as well:
- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.9-ocp-e2e-remote-libvirt-s390x/1435150903186296832
- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.9-ocp-e2e-remote-libvirt-s390x/1434969768300908544
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Whiteboard if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Raised the severity, which was set to "unspecified", hoping this bug will get more attention. The reasons for the high severity are:
- The impact it has on the CI jobs - painting them red.
- The fact that we mostly see it on 4.9/4.10 jobs shows there is probably an actual bug in 4.9.

For oVirt:
- 4.9: periodic-ci-openshift-release-master-nightly-4.9-e2e-ovirt (all) - 40 runs, 65% failed, 58% of failures match = 38% impact
- 4.10: periodic-ci-openshift-release-master-nightly-4.10-e2e-ovirt (all) - 49 runs, 69% failed, 38% of failures match = 27% impact
- upgrade: periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-ovirt-upgrade (all) - 28 runs, 86% failed, 21% of failures match = 18% impact

For Libvirt:
- 4.9:
  - periodic-ci-openshift-multiarch-master-nightly-4.9-ocp-e2e-remote-libvirt-ppc64le (all) - 14 runs, 93% failed, 62% of failures match = 57% impact
  - periodic-ci-openshift-multiarch-master-nightly-4.9-ocp-image-ecosystem-remote-libvirt-ppc64le (all) - 14 runs, 50% failed, 57% of failures match = 29% impact
  - periodic-ci-openshift-multiarch-master-nightly-4.9-ocp-e2e-remote-libvirt-s390x (all) - 14 runs, 100% failed, 14% of failures match = 14% impact
- 4.10:
  - periodic-ci-openshift-multiarch-master-nightly-4.10-ocp-e2e-remote-libvirt-s390x (all) - 14 runs, 100% failed, 14% of failures match = 14% impact
  - periodic-ci-openshift-multiarch-master-nightly-4.10-ocp-image-ecosystem-remote-libvirt-ppc64le (all) - 14 runs, 57% failed, 63% of failures match = 36% impact
  - periodic-ci-openshift-multiarch-master-nightly-4.10-ocp-e2e-remote-libvirt-ppc64le (all) - 14 runs, 57% failed, 63% of failures match = 36% impact

Others:
- pull-ci-openshift-ovn-kubernetes-release-4.9-e2e-vsphere-ovn (all) - 7 runs, 86% failed, 17% of failures match = 14% impact
- periodic-ci-openshift-release-master-ci-4.9-e2e-azure (all) - 7 runs, 71% failed, 20% of failures match = 14% impact
- periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 7 runs, 100% failed, 14% of failures match = 14% impact
- periodic-ci-openshift-release-master-ci-4.9-e2e-azure-serial (all) - 7 runs, 43% failed, 33% of failures match = 14% impact

Search: https://search.ci.openshift.org/?search=alert+TargetDown+fired+for+.*+seconds+with+labels%3A+%5C%7Bjob%3D%22machine-config-daemon%22%2C+namespace%3D%22openshift-machine-config-operator%22%2C+service%3D%22machine-config-daemon%22%2C+severity%3D%22warning%22%5C%7D&maxAge=168h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

The errors here don't indicate an issue in the auth stack; I am only recognizing network timeouts during API calls throughout many layers of the stack (auth, etcd, machine config operator). Reassigning to the network team to continue investigation.

Errors are only showing in the MCO namespace, so we are passing this to them. We would expect networking errors to be present in the whole cluster. If you see errors in the overlay network, hand it back and we will investigate.

Looking into the logs for this, I found a gap in logs for the machine-config-daemon-qmgtp pod on master-2 (middle rectangle in the screenshot in the gist below). I don't know if a gap that big is normal (from 16:27:55 - 17:25:37). Also, the screenshot shows logs resuming after 17:25 (see bottom rectangle). https://gist.github.com/DennisPeriquet/8db3acc6ef690c41431f064ec349c98a

I'm wondering if there is just some network connectivity condition that happens and then gets resolved, and the logs are just being shipped to Loki later (I don't know enough about how Loki vs. promtail works; maybe logs get buffered during a network connectivity loss).

I looked at this job: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-ovirt-upgrade/1498692901662625792

This loki/LogQL query shows lots of 'context cancelled' and 'Failed to make webhook authenticator request ... context deadline exceeded' messages throughout the time of the TargetDown alert: https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%221646148326000%22,%221646156966000%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-ovirt-upgrade%2F1498692901662625792%5C%22%7D%20%7C%20unpack%7C%20pod_name%3D%5C%22machine-config-daemon-qmgtp%5C%22%5Cn%5Cn%22,%22refId%22:%22A%22,%22instant%22:false,%22range%22:true,%22resolution%22:1%7D%5D

This loki/LogQL query shows logs that are sent at 17:25 followed by logs at 16:26 (i.e., late logs), which suggests to me that the log scraper buffered logs until it was able to send them again: https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%221646148326000%22,%221646156966000%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-ovirt-upgrade%2F1498692901662625792%5C%22%7D%20%7C%20unpack%7C%20host%3D%5C%22ovirt18-d6d8x-master-2%5C%22%7Csystemd_unit%3D%5C%22kubelet.service%5C%22%7C~%20%5C%22machine-config-daemon-qmgtp%5C%22%5Cn%22,%22refId%22:%22A%22,%22instant%22:false,%22range%22:true,%22resolution%22:1%7D%5D

There was no evidence of a reboot causing the issue (the only reboots are at 17:30, which is after the upgrade). I'm very inclined to believe there was some networking issue going on. @bpickard, while I have the logs handy, if you can tell me what logs to look at, I can take a look.

I added logs in https://github.com/DennisPeriquet/misc-attachments/tree/master/BZ1991601 from the sdn pod.

BTW, I don't think the SDN pods I attached (via git repo) in the previous comment are present in the must-gather pod logs, because they are from the time of the alert (16:42 - 16:52 UTC). The must-gather logs probably contain logs after the upgrade (and the -previous.log files are empty).

Looking at run: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-ovirt-upgrade/1538069955088486400

I see the alert firing due to machine-config-operator metrics endpoint unavailability when it is polled periodically by the Prometheus stack to determine its "up" status. I see the container named oauth-proxy, which handles authentication for the metrics endpoint, log an error continuously during the time the alert is firing:

```
E0618 08:40:47.924196 8637 webhook.go:111] Failed to make webhook authenticator request: Post "https://172.30.0.1:443/apis/authentication.k8s.io/v1/tokenreviews": context deadline exceeded
```

oauth-proxy returns 403 (forbidden) to the caller, and this is why Prometheus detects this endpoint as not "up" and the alert fires. An error occurs within the oauth-proxy: it is unable to contact the kube-apiserver at 172.30.0.1. Other services on the node are able to contact this server. I notice this occurs during startup of the cluster and resolves itself after a while. I notice SDN is not emitting any errors and is syncing proxy rules. I notice machine-config-daemon is host networked. Looking into this further.

I suspected the issue was something in oauth or the caller (Prometheus) and not network connectivity, because I do not see k8s API connectivity issues with other services; e.g., the machine-config-daemon reflector would error out during this time. So, I created this PR (https://github.com/openshift/machine-config-operator/pull/3199) and launched it 4 times with clusterbot to prove the k8s API port is reachable. After the 3rd launch I see the alert firing just following cluster creation. Here are some screenshots showing the time when machine-config-daemon is NOT "up", screenshots of the oauth container logs, and screenshots showing my PR attempting and succeeding to connect to the k8s API port during the time when the oauth container fails to connect to the k8s API. The machine-config-daemon and oauth containers share the same network stack: https://gist.github.com/martinkennelly/6a7fef9f582c7faf454d6c687eb84adb

So, something in the monitoring stack connects to machine-config-daemon's oauth container port to detect its "up" status. The oauth component fails to connect to the k8s API to validate requester permissions with "context deadline exceeded". However, I have shown with my PR and screenshots that it is not a network issue but something else in oauth or the monitoring stack. Giving this BZ back to the machine-config-operator team to make the judgement on where this BZ should go now. It could also be an issue with the k8s API server, but I have shown connectivity is OK to the k8s API port. Also, I only see this issue during cluster startup.
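For context, here is a minimal sketch of the TokenReview call that a webhook authenticator such as oauth-proxy's makes against the kube-apiserver service (the `Post "https://172.30.0.1:443/apis/authentication.k8s.io/v1/tokenreviews"` seen in the logs above). This is not the oauth-proxy source; the function name and the 10-second timeout are illustrative assumptions. When the call exceeds its deadline, the proxy cannot authenticate the scraper and answers 403, which Prometheus then records as the target not being "up".

```go
// Illustrative sketch only: validating a bearer token via the TokenReview API
// with client-go. Names and the timeout are assumptions, not oauth-proxy code.
package main

import (
	"context"
	"fmt"
	"time"

	authv1 "k8s.io/api/authentication/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func reviewToken(token string) (bool, error) {
	// In-cluster config points at the kubernetes service IP (e.g. https://172.30.0.1:443).
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return false, err
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return false, err
	}
	// If the apiserver is slow to answer, this deadline expiring is what
	// surfaces as "context deadline exceeded" and a 403 to the scraper.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	tr, err := client.AuthenticationV1().TokenReviews().Create(ctx,
		&authv1.TokenReview{Spec: authv1.TokenReviewSpec{Token: token}},
		metav1.CreateOptions{})
	if err != nil {
		return false, fmt.Errorf("failed to make webhook authenticator request: %w", err)
	}
	return tr.Status.Authenticated, nil
}

func main() {
	ok, err := reviewToken("sha256~example") // hypothetical token
	fmt.Println(ok, err)
}
```

And a rough sketch of the kind of reachability probe the PR above uses to show the kube-apiserver port stays dialable while oauth-proxy is timing out; this illustrates the approach under that assumption and is not the code from the PR:

```go
// Rough sketch of a periodic connectivity probe to the kubernetes service
// endpoint. Target address and interval are illustrative.
package main

import (
	"crypto/tls"
	"log"
	"net"
	"time"
)

func main() {
	const target = "172.30.0.1:443" // default kubernetes service IP:port inside the cluster
	for {
		conn, err := tls.DialWithDialer(
			&net.Dialer{Timeout: 5 * time.Second},
			"tcp", target,
			&tls.Config{InsecureSkipVerify: true}, // probing reachability only, not verifying identity
		)
		if err != nil {
			log.Printf("kube-apiserver port NOT reachable: %v", err)
		} else {
			log.Printf("kube-apiserver port reachable")
			conn.Close()
		}
		time.Sleep(10 * time.Second)
	}
}
```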
:s/during/shortly after/

We are again seeing these failures on s390x in 4.9 and 4.10:

alert TargetDown fired for 60 seconds with labels: {job="machine-config-daemon", namespace="openshift-machine-config-operator", service="machine-config-daemon", severity="warning"}
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.9-ocp-e2e-remote-libvirt-s390x/1550813067808346112
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.10-ocp-e2e-remote-libvirt-s390x/1549650434023493632
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira: https://issues.redhat.com/browse/OCPBUGS-8938

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.