Bug 1897918

Summary: [oVirt] e2e tests fail due to kube-apiserver not finishing
Product: OpenShift Container Platform
Reporter: Gal Zaidman <gzaidman>
Component: kube-apiserver
Assignee: Stefan Schimanski <sttts>
Status: CLOSED ERRATA
QA Contact: Xingxing Xia <xxia>
Severity: medium
Docs Contact:
Priority: medium
Version: 4.7
CC: aos-bugs, eslutsky, kewang, mfojtik, xxia
Target Milestone: ---
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-07-27 22:34:24 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Gal Zaidman 2020-11-15 15:39:27 UTC
Description of problem:

On oVirt e2e tests (checked 4.5 and 4.6, since 4.7 is not stable) we started seeing cluster installation failures because the kube-apiserver operator does not finish the installation, with nodes stuck at a certain revision.

When looking at the jobs we see that 2 operators didn't finish: auth and kube-apiserver.
When we look at the logs of the auth operator we see:
".... kube-apiserver oauth endpoint https://192.168.216.133:6443/.well-known/oauth-authorization-server is not yet served and authentication operator keeps waiting ...."

Looking at the kube-apiserver operator and installer logs we see that one of the masters is stuck.

Why is this happening?
- Is there a bug behind it?
- Is this a problem with a different component that causes this behavior? 
- Do we just need to allow more time for the installation so the kube-apiserver can finish?
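
For reference, one quick way to see which node is stuck at which revision (a sketch; resource and field names follow the kube-apiserver operator API as I understand it, adjust as needed):

$ oc get clusteroperator kube-apiserver
$ oc get kubeapiserver cluster -o jsonpath='{range .status.nodeStatuses[*]}{.nodeName}{": current="}{.currentRevision}{" target="}{.targetRevision}{"\n"}{end}'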

Additional info:

Examples:
4.6:

1.
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1327185540532932608 :
ovirt16-lt88t-master-2 didn't finish moving from revision 5 to 6: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1327185540532932608/artifacts/e2e-ovirt/pods/openshift-kube-apiserver-operator_kube-apiserver-operator-85bbfb4d94-6z22l_kube-apiserver-operator.log

2.
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1327766689307168768 :
ovirt15-w5pnr-master-0 didn't finish moving from revision 3 to 6: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1327766689307168768/artifacts/e2e-ovirt/pods/openshift-kube-apiserver-operator_kube-apiserver-operator-749dd55d87-729wq_kube-apiserver-operator.log

More examples:
4.6:
3. https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1325377973590691840
4. https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1326893628500480000
4.5:
5. https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.5/1327932175370162176
6. https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.5/1327638417177055232
7. https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.5/1327346595305361408

Version-Release number of selected component (if applicable):

Comment 1 Stefan Schimanski 2020-11-16 09:15:54 UTC
I see timeouts of the init container that updates permissions of the audit logs and waits for port 6443 to become free; see https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1327185540532932608/artifacts/e2e-ovirt/pods.json:

                                "containerID": "cri-o://8e17a3aa963e96dd1babb7903bb48c8448164c9def7afae6d12c122f51fca84b",
                                "exitCode": 124,
                                "finishedAt": "2020-11-13T10:45:17Z",
                                "message": "................................................................................",
                                "reason": "Error",
                                "startedAt": "2020-11-13T10:43:32Z"

We have not seen this much recently on other platforms.

We have https://github.com/openshift/kubernetes/pull/309 pending that might help here.
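
For context: exit code 124 is what timeout(1) returns when the wrapped command runs out of time, and the run of dots in the message matches a "print a dot each second while waiting" loop. A minimal sketch of what such an init container plausibly runs (the 105s budget and exact commands are assumptions, not copied from the real manifest):

/usr/bin/timeout 105 /bin/bash -ec '
  # fix audit-log permissions first (sketch)
  chmod 0700 /var/log/kube-apiserver || true
  echo -n "Waiting for port :6443 to be released."
  # non-empty ss output means something still holds the port
  while [ -n "$(ss -Htan "( sport = 6443 )")" ]; do
    echo -n "."
    sleep 1
  done
'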

Comment 2 Gal Zaidman 2020-11-16 11:27:41 UTC
(In reply to Stefan Schimanski from comment #1)
> I see timeouts of the init container that updates permissions of the audit
> logs and waits for port 6443 to become free
> [...]
> We have https://github.com/openshift/kubernetes/pull/309 pending that might
> help here.

Maybe it is related to our CI infrastructure. Adding Evgeny, who is our infrastructure guy :) Can you take a look at the PR and see if you think this is it?

Comment 3 Evgeny Slutsky 2020-11-16 13:45:14 UTC
It doesn't look so; the init container checks whether port 6443 is still held by anything using `ss`:
[ -n "$(ss -Htan '( sport = 6443 )')" ]
Since it runs within the OCP node, it doesn't involve our infra (proxy).
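
To make the check concrete: ss -Htan '( sport = 6443 )' prints, without header and in numeric form, every TCP socket whose local port is 6443, so the [ -n ... ] test stays true while anything (for example the previous kube-apiserver still shutting down, or sockets lingering in TIME-WAIT) holds the port. Illustrative commands to poke at it on an affected node:

$ ss -Htan '( sport = 6443 )'
$ [ -n "$(ss -Htan '( sport = 6443 )')" ] && echo "port 6443 still in use" || echo "port 6443 free"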

Comment 4 Stefan Schimanski 2021-01-12 10:49:57 UTC
Does this still happen?

Comment 7 Xingxing Xia 2021-02-24 09:49:16 UTC
Read the PR code and tested a 4.8.0-0.nightly-2021-02-23-200827 AWS env:
$ oc rsh -n openshift-kube-apiserver kube-apiserver-ip-10-0-148-173.us-east-2.compute.internal
sh-4.4# kube-apiserver --help
...
Secure serving flags:
...
      --permit-address-sharing
                If true, SO_REUSEADDR will be used when binding the port. This allows binding to wildcard IPs like 0.0.0.0 and
                specific IPs in parallel, and it avoids waiting for the kernel to release sockets in TIME_WAIT state. [default=false]
...
sh-4.4# ps -eF | grep kube-apiserver
... kube-apiserver --openshift-config=/etc/kubernetes/static-pod-resources/configmaps/config/config.yaml --advertise-address=10.0.148.173 -v=2 --permit-address-sharing

So the kube-apiserver process has the --permit-address-sharing flag set.
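
Without rsh-ing into the pod, the flag should also be visible by grepping the static pod spec, since it is passed on the kube-apiserver command line (illustrative; output layout not verified):

$ oc -n openshift-kube-apiserver get pods -o yaml | grep -- '--permit-address-sharing'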

Comment 8 Xingxing Xia 2021-02-24 09:54:59 UTC
As for the CI results, I filtered NodeInstallerProgressing with the job-name regex e2e-ovirt.*4\.8|4\.8.*e2e-ovirt over the last 14 days and only got 1 record:
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=NodeInstallerProgressing&maxAge=336h&context=1&type=junit&name=e2e-ovirt.*4%5C.8%7C4%5C.8.*e2e-ovirt&maxMatches=1&maxBytes=20971520&groupBy=job'
rehearse-15997-pull-ci-openshift-ovirt-csi-driver-release-4.8-e2e-ovirt
#1362343131432882176                                                          junit               5 days ago
# operator conditions kube-apiserver
Operator progressing (NodeInstaller): NodeInstallerProgressing: 1 nodes are at revision 0; 2 nodes are at revision 3

I looked at https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/15997/rehearse-15997-pull-ci-openshift-ovirt-csi-driver-release-4.8-e2e-ovirt/1362343131432882176/artifacts/e2e-ovirt/gather-extra/artifacts/kubeapiserver.json :
                        "lastTransitionTime": "2021-02-18T10:34:08Z",
                        "message": "1 nodes are at revision 0; 2 nodes are at revision 3; 0 nodes have achieved new revision 4",
                        "status": "True",
                        "type": "NodeInstallerProgressing"

Then I looked at https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/15997/rehearse-15997-pull-ci-openshift-ovirt-csi-driver-release-4.8-e2e-ovirt/1362343131432882176/artifacts/e2e-ovirt/gather-extra/artifacts/pods/openshift-kube-apiserver-operator_kube-apiserver-operator-5cd4766d68-fg8ws_kube-apiserver-operator.log; its last occurrence of "NodeInstallerProgressing" is:
I0218 10:41:08.880244       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"3b9996ed-0d1d-4ed3-b782-dac1e541d90e", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/kube-apiserver changed: Progressing message changed from "NodeInstallerProgressing: 1 nodes are at revision 0; 2 nodes are at revision 3" to "NodeInstallerProgressing: 1 nodes are at revision 0; 2 nodes are at revision 3; 0 nodes have achieved new revision 4",Available message changed from "StaticPodsAvailable: 2 nodes are active; 1 nodes are at revision 0; 2 nodes are at revision 3" to "StaticPodsAvailable: 2 nodes are active; 1 nodes are at revision 0; 2 nodes are at revision 3; 0 nodes have achieved new revision 4"

So 10:41:08 - 10:34:08 = 7 minutes: it was stuck for 7 minutes at "1 nodes are at revision 0; 2 nodes are at revision 3".
Since the last line of kube-apiserver-operator.log has timestamp I0218 10:41:16.250020, there are no further log lines to check whether it stayed stuck longer at "1 nodes are at revision 0; 2 nodes are at revision 3".

Comment 9 Xingxing Xia 2021-02-24 09:56:43 UTC
> I filtered NodeInstallerProgressing with job name RE e2e-ovirt.*4\.8|4\.8.*e2e-ovirt in last 14 days, only got 1 record
> no more log lines to check if it was stuck more time at "1 nodes are at revision 0; 2 nodes are at revision 3"
Given only 1 record in the last 14 days, moving this to VERIFIED. If you hit more, you can reopen.

Comment 12 errata-xmlrpc 2021-07-27 22:34:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438