Bug 1991601

Summary: alert TargetDown fired for XXX seconds with labels: {job="machine-config-daemon", namespace="openshift-machine-config-operator", service="machine-config-daemon", severity="warning"}
Product: OpenShift Container Platform
Reporter: Jan Chaloupka <jchaloup>
Component: Machine Config Operator
Assignee: John Kyros <jkyros>
Machine Config Operator sub component: Machine Config Operator
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED DEFERRED
Docs Contact:
Severity: high
Priority: low
CC: bpickard, dmistry, dperique, ffernand, mfojtik, mkennell, mkrejci, mtarsel, scuppett, skumari, ssadhale, surbania, vvinnako, wking, zzhao
Version: 4.9
Keywords: Reopened
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard: tag-ci LifecycleStale
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment: [sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
Last Closed: 2023-03-09 01:05:21 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Jan Chaloupka 2021-08-09 13:49:23 UTC
From https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.9-ocp-jenkins-e2e-remote-libvirt-ppc64le/1423947091704549376:

```
alert TargetDown fired for 13 seconds with labels: {job="machine-config-daemon", namespace="openshift-machine-config-operator", service="machine-config-daemon", severity="warning"}
```

Checking the node journals (CRI-O logs) on all the nodes:
```
Aug 07 10:11:49.788245 libvirt-ppc64le-1-1-9-kfv8v-master-0 crio[1244]: time="2021-08-07 10:11:49.788169211Z" level=info msg="Started container dd7e2473c51870c1894531af9a3935b907340a31216f85c32e391bddf22d7fd0: openshift-machine-config-operator/machine-config-daemon-7r2bb/machine-config-daemon" id=15456b41-39c9-41ce-8f10-71398df6dd26 name=/runtime.v1alpha2.RuntimeService/StartContainer
Aug 07 10:11:49.265439 libvirt-ppc64le-1-1-9-kfv8v-master-1 crio[1242]: time="2021-08-07 10:11:49.264443242Z" level=info msg="Created container 0651d7904d63a3f2c1fa9177d2ccf890c8fc769e96c836074aa8cc28a8bd7e04: openshift-machine-config-operator/machine-config-daemon-pk29l/machine-config-daemon" id=a622e284-7d45-4b72-b271-c39081c2c77a name=/runtime.v1alpha2.RuntimeService/CreateContainer
Aug 07 10:11:49.602420 libvirt-ppc64le-1-1-9-kfv8v-master-2 crio[1243]: time="2021-08-07 10:11:49.602359290Z" level=info msg="Started container 5a24f464210595cd394aacd4e98903a196d67762a53d764bd6f4a6010cc17acf: openshift-machine-config-operator/machine-config-daemon-69fw6/machine-config-daemon" id=89b0650c-741e-4c61-ab49-f68aa82cb302 name=/runtime.v1alpha2.RuntimeService/StartContainer
Aug 07 10:15:54.666525 libvirt-ppc64le-1-1-9-kfv8v-worker-0-gddxw crio[1252]: time="2021-08-07 10:15:54.666233168Z" level=info msg="Started container 8ba32989af629e00c35578c51e9b5612ca8ddcf97b32f2b500d777a6eb2ff2e1: openshift-machine-config-operator/machine-config-daemon-5tb88/machine-config-daemon" id=4fa0e2ba-54aa-41a8-ab7b-7a3b6f6a9998 name=/runtime.v1alpha2.RuntimeService/StartContainer
Aug 07 10:16:14.170188 libvirt-ppc64le-1-1-9-kfv8v-worker-0-p76x7 crio[1235]: time="2021-08-07 10:16:14.170137303Z" level=info msg="Started container 78d933af1e7100050332b1df62e67d1fc71ca735c7a7d3c060411f61f32a0c74: openshift-machine-config-operator/machine-config-daemon-k6l8w/machine-config-daemon" id=c344fd94-abeb-4393-87f3-5bcaba21d45f name=/runtime.v1alpha2.RuntimeService/StartContainer
```

All containers started before the test started (before 2021-08-07T10:28:00Z, see https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.9-ocp-jenkins-e2e-remote-libvirt-ppc64le/1423947091704549376/build-log.txt). Checking https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.9-ocp-jenkins-e2e-remote-libvirt-ppc64le/1423947091704549376/artifacts/ocp-jenkins-e2e-remote-libvirt-ppc64le/gather-libvirt/artifacts/pods.json:

```
machine-config-daemon-5tb88_machine-config-daemon.log: assigned to libvirt-ppc64le-1-1-9-kfv8v-worker-0-gddxw, 0 restarts, ready since 2021-08-07T10:16:07Z
machine-config-daemon-k6l8w_machine-config-daemon.log: assigned to libvirt-ppc64le-1-1-9-kfv8v-worker-0-p76x7, 0 restarts, ready since 2021-08-07T10:16:14Z
machine-config-daemon-69fw6_machine-config-daemon.log: assigned to libvirt-ppc64le-1-1-9-kfv8v-master-2, 0 restarts, ready since 2021-08-07T10:11:49Z
machine-config-daemon-pk29l_machine-config-daemon.log: assigned to libvirt-ppc64le-1-1-9-kfv8v-master-1, 0 restarts, ready since 2021-08-07T10:11:49Z
machine-config-daemon-7r2bb_machine-config-daemon.log: assigned to libvirt-ppc64le-1-1-9-kfv8v-master-0, 0 restarts, ready since 2021-08-07T10:11:49Z
```

All containers had been running since they were created and never restarted.

The incident (alert TargetDown fired for 13 seconds) occurred at 10:33:18 UTC on August 7, 2021. The test suite finished at 2021-08-07T10:33:40Z.

Based on the TargetDown definition (see https://github.com/openshift/cluster-monitoring-operator/blob/001eccd81ff51af0ed7a9d463dd35bfa9b75d102/assets/cluster-monitoring-operator/prometheus-rule.yaml#L16-L28):
```
- alert: TargetDown
      annotations:
        description: '{{ printf "%.4g" $value }}% of the {{ $labels.job }}/{{ $labels.service
          }} targets in {{ $labels.namespace }} namespace have been unreachable for
          more than 15 minutes. This may be a symptom of network connectivity issues,
          down nodes, or failures within these components. Assess the health of the
          infrastructure and nodes running these targets and then contact support.'
        summary: Some targets were not reachable from the monitoring server for an
          extended period of time.
      expr: |
        100 * (count(up == 0 unless on (node) max by (node) (kube_node_spec_unschedulable == 1)) BY (job, namespace, service) /
          count(up unless on (node) max by (node) (kube_node_spec_unschedulable == 1)) BY (job, namespace, service)) > 10
      for: 15m
```

Per the "for: 15m" clause, the machine-config-daemon target had therefore been down for about 15m13s in total. Given the test suite had only been running for roughly 5m18s when the alert fired (10:28:00 to 10:33:18), the target must already have been down before the test suite started to run.
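
For anyone debugging a similar occurrence, the series this rule aggregates over can be inspected directly in the in-cluster Prometheus. A minimal sketch, assuming access to the Prometheus console, with a job matcher added here only to narrow the rule's own expression to this service (two separate queries):

```
up{job="machine-config-daemon", namespace="openshift-machine-config-operator"}

100 * (
    count(up{job="machine-config-daemon"} == 0
          unless on (node) max by (node) (kube_node_spec_unschedulable == 1))
      by (job, namespace, service)
  /
    count(up{job="machine-config-daemon"}
          unless on (node) max by (node) (kube_node_spec_unschedulable == 1))
      by (job, namespace, service)
) > 10
```

The first query shows the raw per-target scrape health (1 = up, 0 = down); the second is the alert's ratio restricted to the machine-config-daemon job.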

This pattern repeats in other jobs as well:
- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.9-ocp-jenkins-e2e-remote-libvirt-ppc64le/1424128286878863360 (alert TargetDown fired for 13 seconds with labels, Step ocp-jenkins-e2e-remote-libvirt-ppc64le-openshift-e2e-libvirt-test failed after 5m40s)
- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.9-ocp-jenkins-e2e-remote-libvirt-ppc64le/1424309483785424896 (alert TargetDown fired for 43 seconds with labels, Step ocp-jenkins-e2e-remote-libvirt-ppc64le-openshift-e2e-libvirt-test failed after 6m20s)
- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.9-ocp-jenkins-e2e-remote-libvirt-ppc64le/1424490678854881280 (alert TargetDown fired for 30 seconds with labels, Step ocp-jenkins-e2e-remote-libvirt-ppc64le-openshift-e2e-libvirt-test failed after 5m20s)
- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.9-ocp-image-ecosystem-remote-libvirt-ppc64le/1424294383288586240 (alert TargetDown fired for 60 seconds with labels, Step ocp-image-ecosystem-remote-libvirt-ppc64le-openshift-e2e-libvirt-test failed after 10m0s.)

For other jobs see:
https://search.ci.openshift.org/?search=alert+TargetDown+fired+for+.*+seconds+with+labels%3A+%5C%7Bjob%3D%22machine-config-daemon%22%2C+namespace%3D%22openshift-machine-config-operator%22%2C+service%3D%22machine-config-daemon%22%2C+severity%3D%22warning%22%5C%7D&maxAge=48h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 1 Deep Mistry 2021-08-26 18:49:06 UTC
*** Bug 1997916 has been marked as a duplicate of this bug. ***

Comment 2 Gal Zaidman 2021-09-01 11:05:56 UTC
oVirt e2e jobs have been painted red lately due to a test failure:
: [sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Skipped:Disconnected] 02:26:11.764: 

Unexpected alerts fired or pending after the test run:

alert TargetDown fired for 60 seconds with labels: {job="machine-config-daemon", namespace="openshift-machine-config-operator", service="machine-config-daemon", severity="warning"} 
I think we might be hitting BZ#1991601.

Can you please take a look at one of the example jobs [1][2][3] and tell me whether you think it is related to this BZ or to something else?

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-ovirt/1430743201987170304
[2] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-ovirt/1430706888411975680
[3] https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-informing#periodic-ci-openshift-release-master-ni[…].9-e2e-ovirt&sort-by-failures=&width=20

Comment 4 Kirsten Garrison 2021-09-01 22:21:56 UTC
Looking through the logs, these alerts fire after all of the MC* components have settled into a config, as was pointed out in the original post, which leads me to believe that our operator isn't causing the issue but that there is some sort of node availability/connectivity issue.
On closer inspection, when I look at the oauth-proxy logs in the MCO pods I see some uncharacteristic errors:
2021/08/08 10:14:03 http.go:107: HTTPS: listening on [::]:9001
2021/08/08 10:22:50 oauthproxy.go:791: requestauth: 192.168.3.52:59776 context canceled
2021/08/08 10:23:00 oauthproxy.go:791: requestauth: 192.168.3.51:39562 context canceled
E0808 10:23:10.889173    7487 webhook.go:111] Failed to make webhook authenticator request: Post "https://172.30.0.1:443/apis/authentication.k8s.io/v1/tokenreviews": context deadline exceeded
2021/08/08 10:23:10 oauthproxy.go:791: requestauth: 192.168.3.52:60356 Post "https://172.30.0.1:443/apis/authentication.k8s.io/v1/tokenreviews": context deadline exceeded
E0808 10:23:20.922126    7487 webhook.go:111] Failed to make webhook authenticator request: Post "https://172.30.0.1:443/apis/authentication.k8s.io/v1/tokenreviews": context deadline exceeded
2021/08/08 10:23:20 oauthproxy.go:791: requestauth: 192.168.3.51:39996 Post "https://172.30.0.1:443/apis/authentication.k8s.io/v1/tokenreviews": context deadline exceeded

Some things I notice:

There was an unhealthy master (the etcd member on master-2 was reported unhealthy before recovering):
```
I0807 10:21:53.268258       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"1fff7539-7032-4e62-949a-17ba3147b2f5", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: 2 of 3 members are available, libvirt-ppc64le-1-1-9-kfv8v-master-2 is unhealthy" to "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found"
```

Looking closer, there are a couple of events during the outage period that indicate connectivity issues:
```
10:24:05 (x38)	etcd-quorum-guard-59f44c6cc4-mj45q	Unhealthy Readiness probe failed:
10:24:05 (x2)	openshift-oauth-apiserver	apiserver-6dcd4887f7-wdjt4	Unhealthy Liveness probe failed: Get "https://10.129.0.18:8443/healthz": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
```

In auth:
```
0807 10:22:23.695993       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-authentication-operator", Name:"authentication-operator", UID:"ee972910-ddbc-4b6e-a7e2-83fc666fa518", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/authentication changed: Degraded message changed from "All is well" to "WellKnownReadyControllerDegraded: kube-apiserver oauth endpoint https://192.168.1.12:6443/.well-known/oauth-authorization-server is not yet served and authentication operator keeps waiting (check kube-apiserver operator, and check that instances roll out successfully, which can take several minutes per instance)"
```
Looking closer:
```
I0807 10:26:32.000535       1 status_controller.go:211] clusteroperator/authentication diff {"status":{"conditions":[{"lastTransitionTime":"2021-08-07T10:24:44Z","message":"WellKnownReadyControllerDegraded: need at least 3 kube-apiservers, got 2","reason":"WellKnownReadyController_SyncError","status":"True","type":"Degraded"},{"lastTransitionTime":"2021-08-07T10:17:13Z","message":"WellKnownReadyControllerProgressing: kube-apiserver oauth endpoint https://192.168.1.12:6443/.well-known/oauth-authorization-server is not yet served and authentication operator keeps waiting (check kube-apiserver operator, and check that instances roll out successfully, which can take several minutes per instance)","reason":"WellKnownReadyController_OAuthMetadataNotYetServed","status":"True","type":"Progressing"},{"lastTransitionTime":"2021-08-07T10:11:20Z","message":"WellKnownAvailable: The well-known endpoint is not yet available: need at least 3 kube-apiservers, got 2","reason":"WellKnown_NotReady","status":"False","type":"Available"},{"lastTransitionTime":"2021-08-07T10:11:20Z","message":"All is well","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
I0807 10:26:32.021561       1 status_controller.go:211] clusteroperator/authentication diff {"status":{"conditions":[{"lastTransitionTime":"2021-08-07T10:24:44Z","message":"WellKnownReadyControllerDegraded: need at least 3 kube-apiservers, got 2","reason":"WellKnownReadyController_SyncError","status":"True","type":"Degraded"},{"lastTransitionTime":"2021-08-07T10:17:13Z","message":"WellKnownReadyControllerProgressing: kube-apiserver oauth endpoint https://192.168.1.12:6443/.well-known/oauth-authorization-server is not yet served and authentication operator keeps waiting (check kube-apiserver operator, and check that instances roll out successfully, which can take several minutes per instance)","reason":"WellKnownReadyController_OAuthMetadataNotYetServed","status":"True","type":"Progressing"},{"lastTransitionTime":"2021-08-07T10:11:20Z","message":"WellKnownAvailable: The well-known endpoint is not yet available: need at least 3 kube-apiservers, got 2","reason":"WellKnown_NotReady","status":"False","type":"Available"},{"lastTransitionTime":"2021-08-07T10:11:20Z","message":"All is well","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
I0807 10:26:32.022829       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-authentication-operator", Name:"authentication-operator", UID:"ee972910-ddbc-4b6e-a7e2-83fc666fa518", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/authentication changed: Progressing message changed from "OAuthServerDeploymentProgressing: deployment/oauth-openshift.openshift-authentication: 1/3 pods have been updated to the latest generation\nWellKnownReadyControllerProgressing: kube-apiserver oauth endpoint https://192.168.1.12:6443/.well-known/oauth-authorization-server is not yet served and authentication operator keeps waiting (check kube-apiserver operator, and check that instances roll out successfully, which can take several minutes per instance)" to "WellKnownReadyControllerProgressing: kube-apiserver oauth endpoint https://192.168.1.12:6443/.well-known/oauth-authorization-server is not yet served and authentication operator keeps waiting (check kube-apiserver operator, and check that instances roll out successfully, which can take several minutes per instance)"
I0807 10:26:32.033244       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-authentication-operator", Name:"authentication-operator", UID:"ee972910-ddbc-4b6e-a7e2-83fc666fa518", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/authentication changed: Progressing message changed from "OAuthServerDeploymentProgressing: deployment/oauth-openshift.openshift-authentication: 1/3 pods have been updated to the latest generation\nWellKnownReadyControllerProgressing: kube-apiserver oauth endpoint https://192.168.1.12:6443/.well-known/oauth-authorization-server is not yet served and authentication operator keeps waiting (check kube-apiserver operator, and check that instances roll out successfully, which can take several minutes per instance)" to "WellKnownReadyControllerProgressing: kube-apiserver oauth endpoint https://192.168.1.12:6443/.well-known/oauth-authorization-server is not yet served and authentication operator keeps waiting (check kube-apiserver operator, and check that instances roll out successfully, which can take several minutes per instance)"
I0807 10:26:33.368190       1 request.go:668] Waited for 1.071089597s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-authentication/serviceaccounts/oauth-openshift
E0807 10:26:41.981307       1 base_controller.go:272] WellKnownReadyController reconciliation failed: need at least 3 kube-apiservers, got 2
E0807 10:27:11.966161       1 base_controller.go:272] WellKnownReadyController reconciliation failed: need at least 3 kube-apiservers, got 2
```

I'm not certain what this e2e testing is doing, but it seems to be causing issues with auth/kube-apiserver, which in turn seems to cause a brief machine-config-daemon TargetDown alert while the tests are running. I'm going to pass this along to the auth team to take a look.

Comment 7 Lakshmi Ravichandran 2021-09-07 15:38:05 UTC
Observing the failure

'alert TargetDown fired for xxxx seconds with labels: {job="machine-config-daemon", namespace="openshift-machine-config-operator", service="machine-config-daemon", severity="warning"}'

on recent 4.9 s390x CI jobs as well:

- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.9-ocp-e2e-remote-libvirt-s390x/1435150903186296832
- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.9-ocp-e2e-remote-libvirt-s390x/1434969768300908544

Comment 8 Michal Fojtik 2021-10-09 17:50:07 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Whiteboard if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 9 Gal Zaidman 2021-10-10 05:58:55 UTC
Raised the severity, which was set to "unspecified", hoping this bug will get more attention.
The reasons for the high severity are:
- The impact it has on the CI jobs - painting them red.
- The fact that we mostly see it on 4.9/4.10 jobs suggests there is probably an actual bug in 4.9.

For oVirt:
- 4.9: periodic-ci-openshift-release-master-nightly-4.9-e2e-ovirt (all) - 40 runs, 65% failed, 58% of failures match = 38% impact
- 4.10: periodic-ci-openshift-release-master-nightly-4.10-e2e-ovirt (all) - 49 runs, 69% failed, 38% of failures match = 27% impact
- upgrade: periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-ovirt-upgrade (all) - 28 runs, 86% failed, 21% of failures match = 18% impact


For Libvirt:
- 4.9:
periodic-ci-openshift-multiarch-master-nightly-4.9-ocp-e2e-remote-libvirt-ppc64le (all) - 14 runs, 93% failed, 62% of failures match = 57% impact
periodic-ci-openshift-multiarch-master-nightly-4.9-ocp-image-ecosystem-remote-libvirt-ppc64le (all) - 14 runs, 50% failed, 57% of failures match = 29% impact
periodic-ci-openshift-multiarch-master-nightly-4.9-ocp-e2e-remote-libvirt-s390x (all) - 14 runs, 100% failed, 14% of failures match = 14% impact

- 4.10:
periodic-ci-openshift-multiarch-master-nightly-4.10-ocp-e2e-remote-libvirt-s390x (all) - 14 runs, 100% failed, 14% of failures match = 14% impact
periodic-ci-openshift-multiarch-master-nightly-4.10-ocp-image-ecosystem-remote-libvirt-ppc64le (all) - 14 runs, 57% failed, 63% of failures match = 36% impact
periodic-ci-openshift-multiarch-master-nightly-4.10-ocp-e2e-remote-libvirt-ppc64le (all) - 14 runs, 57% failed, 63% of failures match = 36% impact

Others:
- pull-ci-openshift-ovn-kubernetes-release-4.9-e2e-vsphere-ovn (all) - 7 runs, 86% failed, 17% of failures match = 14% impact
- periodic-ci-openshift-release-master-ci-4.9-e2e-azure (all) - 7 runs, 71% failed, 20% of failures match = 14% impact
- periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 7 runs, 100% failed, 14% of failures match = 14% impact
- periodic-ci-openshift-release-master-ci-4.9-e2e-azure-serial (all) - 7 runs, 43% failed, 33% of failures match = 14% impact


Search:
https://search.ci.openshift.org/?search=alert+TargetDown+fired+for+.*+seconds+with+labels%3A+%5C%7Bjob%3D%22machine-config-daemon%22%2C+namespace%3D%22openshift-machine-config-operator%22%2C+service%3D%22machine-config-daemon%22%2C+severity%3D%22warning%22%5C%7D&maxAge=168h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 13 Sergiusz Urbaniak 2021-11-26 07:20:25 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 14 Sergiusz Urbaniak 2021-11-26 07:30:53 UTC
The errors here don't indicate an issue in the auth stack; I am only seeing network timeouts during API calls throughout many layers of the stack (auth, etcd, machine-config-operator). Reassigning to the network team to continue the investigation.

Comment 15 Ben Pickard 2021-11-29 14:39:09 UTC
Errors are only showing in the MCO namespace, so we are passing this to them. We would expect networking errors to be present across the whole cluster. If you see errors in the overlay network, hand it back and we will investigate.

Comment 19 Dennis Periquet 2022-03-09 14:52:06 UTC
Looking into the logs for this, I found a gap in the logs for the machine-config-daemon-qmgtp pod on master-2 (middle rectangle in the screenshot in the gist below). I don't know if a gap that big (from 16:27:55 to 17:25:37) is normal.

Also, the screenshot shows logs resuming after 17:25 (see bottom rectangle).

https://gist.github.com/DennisPeriquet/8db3acc6ef690c41431f064ec349c98a

I'm wondering if there is just some network connectivity condition that happens and then gets resolved, and the logs are simply shipped to Loki later (I don't know enough about how Loki and promtail interact; maybe logs get buffered during a network connectivity loss).

Comment 20 Dennis Periquet 2022-03-09 18:57:43 UTC
I looked at this job:  https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-ovirt-upgrade/1498692901662625792

This loki/LogQL query shows lots of 'context cancelled' and 'Failed to make webhook authenticator request ... context deadline exceeded' messages throughout the time of the TargetDown alert:

https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%221646148326000%22,%221646156966000%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-ovirt-upgrade%2F1498692901662625792%5C%22%7D%20%7C%20unpack%7C%20pod_name%3D%5C%22machine-config-daemon-qmgtp%5C%22%5Cn%5Cn%22,%22refId%22:%22A%22,%22instant%22:false,%22range%22:true,%22resolution%22:1%7D%5D

This loki/LogQL query shows logs sent at 17:25 followed by logs timestamped 16:26 (i.e., late logs), which suggests to me that the log scraper buffered logs until it was able to send them again.

https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%221646148326000%22,%221646156966000%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-ovirt-upgrade%2F1498692901662625792%5C%22%7D%20%7C%20unpack%7C%20host%3D%5C%22ovirt18-d6d8x-master-2%5C%22%7Csystemd_unit%3D%5C%22kubelet.service%5C%22%7C~%20%5C%22machine-config-daemon-qmgtp%5C%22%5Cn%22,%22refId%22:%22A%22,%22instant%22:false,%22range%22:true,%22resolution%22:1%7D%5D
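
Decoded from the URL parameters above, the two LogQL expressions are roughly the following (the first selects the machine-config-daemon-qmgtp pod logs for this CI invocation, the second the kubelet journal entries on master-2 that mention that pod):

```
{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-ovirt-upgrade/1498692901662625792"} | unpack | pod_name="machine-config-daemon-qmgtp"

{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-ovirt-upgrade/1498692901662625792"} | unpack | host="ovirt18-d6d8x-master-2" | systemd_unit="kubelet.service" |~ "machine-config-daemon-qmgtp"
```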

There was no evidence of a reboot causing the issue (the only reboots are at 17:30, which is after the upgrade).

I'm very inclined to believe there was some networking issue going on.

@bpickard while I have the logs handy, if you can tell me which logs to look at, I can take a look.

Comment 22 Dennis Periquet 2022-03-09 22:13:12 UTC

Following up on comment 19, I added logs from the SDN pod in https://github.com/DennisPeriquet/misc-attachments/tree/master/BZ1991601

Comment 23 Dennis Periquet 2022-03-09 22:17:38 UTC
BTW, I don't think the SDN pod logs I attached (via the git repo) in the previous comment are present in the must-gather pod logs, because they are from the time of the alert (16:42 - 16:52 UTC). The must-gather probably contains logs from after the upgrade (and the -previous.log files are empty).

Comment 29 Martin Kennelly 2022-06-21 10:39:42 UTC
Looking at run: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-ovirt-upgrade/1538069955088486400

I see the alert firing due to machine-config-operator metrics endpoint unavailability when it is polled periodically by the Prometheus stack to determine its "up" status. I see the container named oauth-proxy, which handles authentication for the metrics endpoint, log an error continuously during the time the alert is firing:

E0618 08:40:47.924196    8637 webhook.go:111] Failed to make webhook authenticator request: Post "https://172.30.0.1:443/apis/authentication.k8s.io/v1/tokenreviews": context deadline exceeded

oauth-proxy returns 403 (Forbidden) to the caller, and this is why Prometheus detects this endpoint as not "up" and the alert fires.
The error occurs within oauth-proxy: it is unable to contact the kube API server at 172.30.0.1, while other services on the node are able to contact it.
I notice this occurs during startup of the cluster and resolves itself after a while.
I notice the SDN is not emitting any errors and is syncing proxy rules.
I notice machine-config-daemon is host networked.

Looking into this further.
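
For context, the call that is timing out in the log line above is the webhook authenticator's TokenReview request to the kube API server (the URL in the error is the authentication.k8s.io/v1 tokenreviews endpoint). A minimal sketch of the request body such a webhook posts, assuming the standard v1 TokenReview API and a placeholder token value:

```
# Posted to https://172.30.0.1:443/apis/authentication.k8s.io/v1/tokenreviews
apiVersion: authentication.k8s.io/v1
kind: TokenReview
spec:
  # bearer token presented by the scraper (placeholder value)
  token: "<token-from-the-incoming-metrics-request>"
```

When this request cannot complete before the context deadline, oauth-proxy cannot verify the caller and rejects the scrape, which Prometheus then records as the target being down.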

Comment 30 Martin Kennelly 2022-06-21 14:50:38 UTC
I suspected the issue was something in oauth or the caller (Prometheus) and not network connectivity, because I do not see kube API connectivity issues with other services; e.g., the machine-config-daemon reflector would error out during this time if there were.

So I created this PR (https://github.com/openshift/machine-config-operator/pull/3199) and launched it 4 times with clusterbot to prove the kube API port is reachable. After the 3rd launch, I saw the alert firing just after cluster creation.

Here are some screenshots showing the time when machine-config-daemon is NOT "up", screenshots of the oauth container logs, and screenshots showing my PR attempting and succeeding to connect to the kube API port during the time when the oauth container fails to connect to it. The machine-config-daemon and oauth containers share the same network stack: https://gist.github.com/martinkennelly/6a7fef9f582c7faf454d6c687eb84adb

So, something in the monitoring stack connects to the machine-config-daemon's oauth container port to detect its "up" status. The oauth component fails to connect to the kube API to validate the requester's permissions, with "context deadline exceeded".
However, I have shown with my PR and screenshots that it's not a network issue but something else in the oauth or monitoring stack.

Giving this BZ back to the machine-config-operator team to make the judgement on where this BZ should go now.

Comment 32 Martin Kennelly 2022-06-21 15:08:02 UTC
It could also be an issue with the kube API server, but I have shown that connectivity to the kube API port is OK.
Also, I only see this issue during cluster startup.

Comment 33 Martin Kennelly 2022-06-21 15:09:21 UTC
:s/during/shortly after/

Comment 35 Rahul Sharma 2022-07-25 11:46:13 UTC
We are again seeing these failures on s390x in 4.9 and 4.10:

alert TargetDown fired for 60 seconds with labels: {job="machine-config-daemon", namespace="openshift-machine-config-operator", service="machine-config-daemon", severity="warning"}

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.9-ocp-e2e-remote-libvirt-s390x/1550813067808346112
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.10-ocp-e2e-remote-libvirt-s390x/1549650434023493632

Comment 40 Shiftzilla 2023-03-09 01:05:21 UTC
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira.

https://issues.redhat.com/browse/OCPBUGS-8938

Comment 41 Red Hat Bugzilla 2023-09-18 00:29:41 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days