Bug 1897918
| Summary: | [oVirt] e2e tests fail due to kube-apiserver not finishing | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Gal Zaidman <gzaidman> |
| Component: | kube-apiserver | Assignee: | Stefan Schimanski <sttts> |
| Status: | CLOSED ERRATA | QA Contact: | Xingxing Xia <xxia> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.7 | CC: | aos-bugs, eslutsky, kewang, mfojtik, xxia |
| Target Milestone: | --- | | |
| Target Release: | 4.8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-07-27 22:34:24 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description Gal Zaidman 2020-11-15 15:39:27 UTC
Stefan Schimanski (comment #1):

I see timeouts of the init container that updates permissions of the audit logs and waits for port 6443 to become free. From https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1327185540532932608/artifacts/e2e-ovirt/pods.json:

```
"containerID": "cri-o://8e17a3aa963e96dd1babb7903bb48c8448164c9def7afae6d12c122f51fca84b",
"exitCode": 124,
"finishedAt": "2020-11-13T10:45:17Z",
"message": "................................................................................",
"reason": "Error",
"startedAt": "2020-11-13T10:43:32Z"
```

We have not seen this much recently on other platforms.

We have https://github.com/openshift/kubernetes/pull/309 pending, which might help here.

Gal Zaidman:

(In reply to Stefan Schimanski from comment #1)

Maybe it is related to our CI infrastructure; adding Evgeny, who is our infrastructure guy :) Can you take a look at the PR and see if you think this is it?

Evgeny Slutsky:

It doesn't look like it. The init container checks whether port 6443 is available using `ss`: `[ -n "$(ss -Htan '( sport = 6443 )')" ]`. Since it runs within the OCP node, it doesn't involve our infra (proxy).

Stefan Schimanski (comment #4):

Does this still happen?

Gal Zaidman:

(In reply to Stefan Schimanski from comment #4)
> Does this still happen?

Yes, but I guess only on 4.6. I searched the last 7 days and, after filtering the results, ended up with these jobs:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1350795057384394752
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_csi-node-driver-registrar/25/pull-ci-openshift-csi-node-driver-registrar-release-4.6-e2e-ovirt/1350020106897330176
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/4547/pull-ci-openshift-installer-release-4.6-e2e-ovirt/1349620131344420864
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/4538/pull-ci-openshift-installer-release-4.6-e2e-ovirt/1349509687401254912
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/4535/pull-ci-openshift-installer-release-4.6-e2e-ovirt/1348726050527383552
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovirt-csi-driver/65/pull-ci-openshift-ovirt-csi-driver-release-4.6-e2e-ovirt/1349085435812384768
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovirt-csi-driver-operator/38/pull-ci-openshift-ovirt-csi-driver-operator-release-4.6-e2e-ovirt/1349023096438263808

That is 7 jobs in the last 7 days, so the flake still happens.
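A note on the failure signature in comment #1: exit code 124 is the status the `timeout` utility returns when the wrapped command exceeds its deadline, which fits an init container that polls until port 6443 is free and gets killed before the old sockets are released. Below is a minimal sketch of such a wait loop, built around the same `ss` check Evgeny quotes; the 105-second deadline (inferred from the startedAt/finishedAt timestamps above) and the dot-per-second output are assumptions, not the actual static pod manifest.

```bash
#!/bin/bash
# Hypothetical reconstruction of the init container's wait loop: poll
# until nothing is bound to local port 6443, printing one dot per second.
# If the port never frees up, `timeout` kills the loop and exits with 124,
# matching the "exitCode": 124 and dotted "message" in pods.json above.
timeout 105 /bin/bash -c '
  while [ -n "$(ss -Htan "( sport = 6443 )")" ]; do
    echo -n "."
    sleep 1
  done
'
```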
Xingxing Xia:

Read the PR code; tested 4.8.0-0.nightly-2021-02-23-200827 in an AWS env:

```
$ oc rsh -n openshift-kube-apiserver kube-apiserver-ip-10-0-148-173.us-east-2.compute.internal
sh-4.4# kube-apiserver --help
...
Secure serving flags:
...
      --permit-address-sharing   If true, SO_REUSEADDR will be used when binding the port. This allows binding
                                 to wildcard IPs like 0.0.0.0 and specific IPs in parallel, and it avoids waiting
                                 for the kernel to release sockets in TIME_WAIT state. [default=false]
...
sh-4.4# ps -eF | grep kube-apiserver
... kube-apiserver --openshift-config=/etc/kubernetes/static-pod-resources/configmaps/config/config.yaml --advertise-address=10.0.148.173 -v=2 --permit-address-sharing
```

So the --permit-address-sharing flag is in use.

As for the CI results, I filtered NodeInstallerProgressing with job name RE `e2e-ovirt.*4\.8|4\.8.*e2e-ovirt` over the last 14 days and got only 1 record:

```
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=NodeInstallerProgressing&maxAge=336h&context=1&type=junit&name=e2e-ovirt.*4%5C.8%7C4%5C.8.*e2e-ovirt&maxMatches=1&maxBytes=20971520&groupBy=job'
rehearse-15997-pull-ci-openshift-ovirt-csi-driver-release-4.8-e2e-ovirt #1362343131432882176 junit 5 days ago
# operator conditions kube-apiserver
Operator progressing (NodeInstaller): NodeInstallerProgressing: 1 nodes are at revision 0; 2 nodes are at revision 3
```

I looked at https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/15997/rehearse-15997-pull-ci-openshift-ovirt-csi-driver-release-4.8-e2e-ovirt/1362343131432882176/artifacts/e2e-ovirt/gather-extra/artifacts/kubeapiserver.json:

```
"lastTransitionTime": "2021-02-18T10:34:08Z",
"message": "1 nodes are at revision 0; 2 nodes are at revision 3; 0 nodes have achieved new revision 4",
"status": "True",
"type": "NodeInstallerProgressing"
```

Then I looked at https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/15997/rehearse-15997-pull-ci-openshift-ovirt-csi-driver-release-4.8-e2e-ovirt/1362343131432882176/artifacts/e2e-ovirt/gather-extra/artifacts/pods/openshift-kube-apiserver-operator_kube-apiserver-operator-5cd4766d68-fg8ws_kube-apiserver-operator.log; its last occurrence of "NodeInstallerProgressing" is:

```
I0218 10:41:08.880244       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"3b9996ed-0d1d-4ed3-b782-dac1e541d90e", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/kube-apiserver changed: Progressing message changed from "NodeInstallerProgressing: 1 nodes are at revision 0; 2 nodes are at revision 3" to "NodeInstallerProgressing: 1 nodes are at revision 0; 2 nodes are at revision 3; 0 nodes have achieved new revision 4",Available message changed from "StaticPodsAvailable: 2 nodes are active; 1 nodes are at revision 0; 2 nodes are at revision 3" to "StaticPodsAvailable: 2 nodes are active; 1 nodes are at revision 0; 2 nodes are at revision 3; 0 nodes have achieved new revision 4"
```

So 10:41:08 - 10:34:08 = 7 minutes: the rollout was stuck at "1 nodes are at revision 0; 2 nodes are at revision 3" for 7 minutes. Since the last line of kube-apiserver-operator.log has timestamp I0218 10:41:16.250020, there are no more log lines to check whether it stayed stuck longer.

> I filtered NodeInstallerProgressing with job name RE e2e-ovirt.*4\.8|4\.8.*e2e-ovirt in last 14 days, only got 1 record
> no more log lines to check if it was stuck more time at "1 nodes are at revision 0; 2 nodes are at revision 3"

Given only 1 record in the last 14 days, moving this to VERIFIED. If more occurrences are hit, this can be reopened.
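For context on why the flag resolves the original timeout: per the help text quoted above, SO_REUSEADDR lets the incoming kube-apiserver bind port 6443 while sockets from the previous revision still sit in TIME_WAIT, so the init container's wait no longer has to ride out the kernel's TIME_WAIT timer. A hypothetical spot check one could run when re-verifying; the node name is a placeholder:

```bash
# Hypothetical spot check on a control-plane node (<master-node> is a
# placeholder). With --permit-address-sharing (SO_REUSEADDR), the new
# listener and leftover TIME_WAIT sockets on 6443 can coexist.
oc debug node/<master-node> -- chroot /host ss -tln '( sport = 6443 )'
oc debug node/<master-node> -- chroot /host ss -tn state time-wait '( sport = 6443 )'
```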
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438