Bug 1969598
| Summary: | During machine-config pool updates: apps.openshift.io.v1: the server is currently unable to handle the request | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> |
| Component: | Machine Config Operator | Assignee: | Yu Qi Zhang <jerzhang> |
| Machine Config Operator sub component: | Machine Config Operator | QA Contact: | Rio Liu <rioliu> |
| Status: | CLOSED INSUFFICIENT_DATA | Docs Contact: | |
| Severity: | unspecified | | |
| Priority: | unspecified | CC: | aos-bugs, jerzhang, mfojtik, mkrejci, rioliu, skumari |
| Version: | 4.8 | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | tag-ci | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-11-08 17:02:05 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
W. Trevor King
2021-06-08 18:17:07 UTC
```
Get "https://api-int.ci-op-krp5m4vh-8d118.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/openshift-kube-controller-manager/configmaps/cluster-policy-controller": read tcp 10.0.250.206:52822->10.0.216.48:6443: read: connection timed out
```

is a strong hint that the draining and shutdown of the node were not graceful. Otherwise we would not see a TCP connection timeout; the failure would surface at a higher level in the network stack.

Taking a look at the MCO namespaced objects, nothing jumps out. At the time of the must-gather for the above CI run, the MCO had fully settled, and notably all of the MCD containers had been restarted at some point, so we no longer have any logs from the older MCDs. The oldest log now is from 03:36, which seems to be after the API issues. It's also possible that the API issues were happening as the first master was rebooting for its update. I'm not sure how to dig into this further. Maybe we would need to look at the journal logs? Not sure what the root cause of the instability was.

> ...we no longer have any logs from the older MCDs...

We should have those in Loki, but I don't know whether we retain them for three weeks. From [1] -> custom-link-tools -> Loki -> adjust the time range to cover 2021-06-08 00:00Z through 2021-06-08 06:00Z, and using:

```
{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1402078479163330560"} | unpack | namespace="openshift-machine-config-operator"
```

seems to give me some MCD logs. Was there a particular MCD pod we wanted to dig into? An instance query for:

```
count by (pod_name) (count_over_time(({invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1402078479163330560"} | unpack | namespace="openshift-machine-config-operator")[24h]))
```

gives a list of options.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1402078479163330560

machine-config-daemon-llgpj is probably the most likely one to have any information, if any exists.

It does seem to exist in the query you gave, but I don't really know how to find the logs from the dashboard.

The newer manifestations of this should be https://bugzilla.redhat.com/show_bug.cgi?id=2019215. If you still see this @Trevor in newer runs, please feel free to open a new bug or attach to that one. Closing this one since it's a bit stale at this point. Thanks!
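As an aside, LogQL queries like the ones above can also be issued programmatically against Loki's `query_range` HTTP API rather than through the dashboard. A minimal sketch in Python, assuming a reachable Loki endpoint; the base URL below is hypothetical, and the real CI Loki instance (and whatever auth it requires) will differ:

```python
import urllib.parse

# Hypothetical Loki base URL; the real CI Loki endpoint differs.
LOKI_BASE = "https://loki.example.com"

INVOKER = ("openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.8-"
           "upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1402078479163330560")


def build_query_range_url(start_ns, end_ns, limit=1000):
    """Build a Loki /loki/api/v1/query_range URL that selects log lines
    for the given CI invoker, unpacks the packed labels, and filters to
    the openshift-machine-config-operator namespace."""
    logql = ('{invoker="%s"} | unpack '
             '| namespace="openshift-machine-config-operator"' % INVOKER)
    params = urllib.parse.urlencode({
        "query": logql,
        "start": start_ns,   # nanoseconds since the epoch
        "end": end_ns,
        "limit": limit,
    })
    return f"{LOKI_BASE}/loki/api/v1/query_range?{params}"


# 2021-06-08 00:00Z through 06:00Z, expressed in nanoseconds
start = 1623110400 * 10**9
end = 1623132000 * 10**9
url = build_query_range_url(start, end)
```

The resulting URL can be fetched with any HTTP client (adding whatever bearer token the instance expects); the response is JSON with a `data.result` array of streams and their log lines.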