Bug 1897918 - [oVirt] e2e tests fail due to kube-apiserver not finishing
Summary: [oVirt] e2e tests fail due to kube-apiserver not finishing
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: Stefan Schimanski
QA Contact: Xingxing Xia
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-11-15 15:39 UTC by Gal Zaidman
Modified: 2021-07-27 22:34 UTC (History)
5 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 22:34:24 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Github openshift kubernetes pull 309 0 None closed Bug 1897918: UPSTREAM: 93861: apiserver: add --permit-address-sharing flag to listen with SO_REUSEADDR 2021-02-19 02:22:39 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 22:34:53 UTC

Description Gal Zaidman 2020-11-15 15:39:27 UTC
Description of problem:

On oVirt e2e tests (checked 4.5 and 4.6, since 4.7 is not stable) we started seeing cluster installation failures due to the kube-apiserver operator not finishing the installation, with nodes stuck at a certain revision.

When looking at the jobs we see that 2 operators didn't finish, auth and kube-apiserver.
When we look at the logs of the auth operator we see:
".... kube-apiserver oauth endpoint https://192.168.216.133:6443/.well-known/oauth-authorization-server is not yet served and authentication operator keeps waiting ...."

Looking at the kube-apiserver operator and installer logs we see that one of the masters is stuck.

Why is this happening?
- Is there a bug behind it?
- Is this a problem with a different component that causes this behavior?
- Does the installation simply need more time for the kube-apiserver to finish?

Additional info:

Examples:
4.6:

1.
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1327185540532932608 :
ovirt16-lt88t-master-2 didn’t finish moving from revision 5 to 6 https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1327185540532932608/artifacts/e2e-ovirt/pods/openshift-kube-apiserver-operator_kube-apiserver-operator-85bbfb4d94-6z22l_kube-apiserver-operator.log

2.
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1327766689307168768 :
ovirt15-w5pnr-master-0 didn’t finish moving from revision 3 to 6 https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1327766689307168768/artifacts/e2e-ovirt/pods/openshift-kube-apiserver-operator_kube-apiserver-operator-749dd55d87-729wq_kube-apiserver-operator.log

More examples:
4.6:
3. https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1325377973590691840
4. https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1326893628500480000
4.5:
5. https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.5/1327932175370162176
6. https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.5/1327638417177055232
7. https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.5/1327346595305361408

Version-Release number of selected component (if applicable):

Comment 1 Stefan Schimanski 2020-11-16 09:15:54 UTC
I see timeouts of the init container that updates permissions of the audit logs and waits for the 6443 port to become free https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1327185540532932608/artifacts/e2e-ovirt/pods.json:

                                "containerID": "cri-o://8e17a3aa963e96dd1babb7903bb48c8448164c9def7afae6d12c122f51fca84b",
                                "exitCode": 124,
                                "finishedAt": "2020-11-13T10:45:17Z",
                                "message": "................................................................................",
                                "reason": "Error",
                                "startedAt": "2020-11-13T10:43:32Z"

We have not seen much of this recently on other platforms.

We have https://github.com/openshift/kubernetes/pull/309 pending that might help here.

Comment 2 Gal Zaidman 2020-11-16 11:27:41 UTC
(In reply to Stefan Schimanski from comment #1)
> I see timeouts of the init container that updates permissions of the audit
> logs and waits for the 6443 port to become free
> https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/
> release-openshift-ocp-installer-e2e-ovirt-4.6/1327185540532932608/artifacts/
> e2e-ovirt/pods.json:
> 
>                                 "containerID":
> "cri-o://8e17a3aa963e96dd1babb7903bb48c8448164c9def7afae6d12c122f51fca84b",
>                                 "exitCode": 124,
>                                 "finishedAt": "2020-11-13T10:45:17Z",
>                                 "message":
> "............................................................................
> ....",
>                                 "reason": "Error",
>                                 "startedAt": "2020-11-13T10:43:32Z"
> 
> We have not seen this recently much on other platforms.
> 
> We have https://github.com/openshift/kubernetes/pull/309 pending that might
> help here.

Maybe it is related to our CI infrastructure. Adding Evgeny, who is our infrastructure guy :) Can you take a look at the PR and see if you think this is it?

Comment 3 Evgeny Slutsky 2020-11-16 13:45:14 UTC
It doesn't look like it. The init container checks whether port `6443` is available using `ss`:
[ -n "$(ss -Htan '( sport = 6443 )')" ]
Since it runs within the OCP node, it doesn't involve our infra (proxy).
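The exitCode of 124 in comment 1 is consistent with that check running under GNU coreutils `timeout` and simply hitting its deadline while something still held the port (the wrapping of the `ss` loop in a `timeout` is an assumption here, inferred from the exit code; 124 is the status `timeout` uses when it kills a command). A quick demonstration of that convention:

```python
import subprocess

# GNU coreutils `timeout` exits with status 124 when it kills the wrapped
# command at the deadline. Here a 1-second deadline on a 5-second sleep
# reproduces the exitCode 124 seen in the init container's status.
rc = subprocess.run(["timeout", "1", "sleep", "5"]).returncode
print("exit code:", rc)
```

So the init container most likely did not crash: its wait-for-port-free loop just never saw 6443 released before the deadline.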

Comment 4 Stefan Schimanski 2021-01-12 10:49:57 UTC
Does this still happen?

Comment 7 Xingxing Xia 2021-02-24 09:49:16 UTC
Read the PR code, tested 4.8.0-0.nightly-2021-02-23-200827 AWS env:
$ oc rsh -n openshift-kube-apiserver kube-apiserver-ip-10-0-148-173.us-east-2.compute.internal
sh-4.4# kube-apiserver --help
...
Secure serving flags:
...
      --permit-address-sharing
                If true, SO_REUSEADDR will be used when binding the port. This allows binding to wildcard IPs like 0.0.0.0 and
                specific IPs in parallel, and it avoids waiting for the kernel to release sockets in TIME_WAIT state. [default=false]
...
sh-4.4# ps -eF | grep kube-apiserver
... kube-apiserver --openshift-config=/etc/kubernetes/static-pod-resources/configmaps/config/config.yaml --advertise-address=10.0.148.173 -v=2 --permit-address-sharing

The --permit-address-sharing flag is in use.
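What the flag buys us, per its --help text above, is SO_REUSEADDR on the listening socket, so a restarted apiserver can bind 6443 without waiting for the kernel to release a socket lingering in TIME_WAIT. A minimal sketch of that socket option (Python for illustration; kube-apiserver itself is Go, and the exact wiring inside the apiserver is not shown here):

```python
import socket

# Sketch: set SO_REUSEADDR before bind(), as --permit-address-sharing does
# for the apiserver's listener. With the option set, bind() succeeds even
# if a previous socket on the same address/port is still in TIME_WAIT.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.bind(("0.0.0.0", 0))  # port 0: let the kernel pick a free port for the demo
reuse = s.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR)
print("SO_REUSEADDR set:", reuse != 0)
s.close()
```

This matches the failure mode in comment 1: the init container was waiting for 6443 to become free, and address sharing removes that wait.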

Comment 8 Xingxing Xia 2021-02-24 09:54:59 UTC
As for the CI results, I filtered NodeInstallerProgressing with the job name regex e2e-ovirt.*4\.8|4\.8.*e2e-ovirt over the last 14 days and only got 1 record:
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=NodeInstallerProgressing&maxAge=336h&context=1&type=junit&name=e2e-ovirt.*4%5C.8%7C4%5C.8.*e2e-ovirt&maxMatches=1&maxBytes=20971520&groupBy=job'
rehearse-15997-pull-ci-openshift-ovirt-csi-driver-release-4.8-e2e-ovirt
#1362343131432882176                                                          junit               5 days ago
# operator conditions kube-apiserver
Operator progressing (NodeInstaller): NodeInstallerProgressing: 1 nodes are at revision 0; 2 nodes are at revision 3

I looked at https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/15997/rehearse-15997-pull-ci-openshift-ovirt-csi-driver-release-4.8-e2e-ovirt/1362343131432882176/artifacts/e2e-ovirt/gather-extra/artifacts/kubeapiserver.json :
                        "lastTransitionTime": "2021-02-18T10:34:08Z",
                        "message": "1 nodes are at revision 0; 2 nodes are at revision 3; 0 nodes have achieved new revision 4",
                        "status": "True",
                        "type": "NodeInstallerProgressing"

Then looked at https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/15997/rehearse-15997-pull-ci-openshift-ovirt-csi-driver-release-4.8-e2e-ovirt/1362343131432882176/artifacts/e2e-ovirt/gather-extra/artifacts/pods/openshift-kube-apiserver-operator_kube-apiserver-operator-5cd4766d68-fg8ws_kube-apiserver-operator.log : its last occurrence of "NodeInstallerProgressing":
I0218 10:41:08.880244       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"3b9996ed-0d1d-4ed3-b782-dac1e541d90e", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/kube-apiserver changed: Progressing message changed from "NodeInstallerProgressing: 1 nodes are at revision 0; 2 nodes are at revision 3" to "NodeInstallerProgressing: 1 nodes are at revision 0; 2 nodes are at revision 3; 0 nodes have achieved new revision 4",Available message changed from "StaticPodsAvailable: 2 nodes are active; 1 nodes are at revision 0; 2 nodes are at revision 3" to "StaticPodsAvailable: 2 nodes are active; 1 nodes are at revision 0; 2 nodes are at revision 3; 0 nodes have achieved new revision 4"

So 10:41:08 - 10:34:08 = 7 minutes: it was stuck for 7 minutes at "1 nodes are at revision 0; 2 nodes are at revision 3".
Since the last timestamp in kube-apiserver-operator.log was I0218 10:41:16.250020, there are no more log lines to check whether it stayed stuck there longer.

Comment 9 Xingxing Xia 2021-02-24 09:56:43 UTC
> I filtered NodeInstallerProgressing with job name RE e2e-ovirt.*4\.8|4\.8.*e2e-ovirt in last 14 days, only got 1 record
> no more log lines to check if it was stuck more time at "1 nodes are at revision 0; 2 nodes are at revision 3"
Given only 1 record in the last 14 days, moving this to VERIFIED. If you hit more, you can reopen.

Comment 12 errata-xmlrpc 2021-07-27 22:34:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

