BZ#2002552 was backported to 4.8, but we still see instances of this failing on both IPv4 and IPv6. Example runs where this failed:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-ovn-ipv6/1453456250908971008
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi/1453425238455881728
Looking at the sparse test failures, the oc debug command - triggered for each master found - fails intermittently because the kube-apiserver on that particular master is unavailable. Cross-checking the test failure timestamps with the kube-apiserver events shows that the server is restarted several times, and the test sometimes tries to execute the command while the server is not yet ready. Even though the test implements a retry mechanism, it is not enough to cover the restart gaps.

For example, in the 1453817495201779712 job [1], the debug command failed for master-1.ostest.test.metalkube.org at 21:36:52.652:

> STEP: Testing master node master-1.ostest.test.metalkube.org
> Oct 28 21:36:50.073: INFO: Verifying kubeconfig "localhost-recovery.kubeconfig" on master "master-1.ostest.test.metalkube.org"
> Oct 28 21:36:50.073: INFO: Running 'oc --namespace=e2e-test-apiserver-rvg62 --kubeconfig=/tmp/secret/kubeconfig debug node/master-1.ostest.test.metalkube.org -- chroot /host /bin/bash -euxo pipefail -c oc --kubeconfig "/etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/localhost-recovery.kubeconfig" get namespace kube-system'
> Oct 28 21:36:52.652: INFO: Error running /usr/bin/oc --namespace=e2e-test-apiserver-rvg62 --kubeconfig=/tmp/secret/kubeconfig debug node/master-1.ostest.test.metalkube.org -- chroot /host /bin/bash -euxo pipefail -c oc --kubeconfig "/etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/localhost-recovery.kubeconfig" get namespace kube-system:
> StdOut>
> Starting pod/master-1ostesttestmetalkubeorg-debug ...
> To use host binaries, run `chroot /host`
> + oc --kubeconfig /etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/localhost-recovery.kubeconfig get namespace kube-system
> The connection to the server localhost:6443 was refused - did you specify the right host or port?

The kube-apiserver logs of master-1.ostest.test.metalkube.org [2] show that the last server restart happened at 21:38:15.058617 (the last kill event happened at 21:35:03, see [3]).

The same behavior has been observed in other job failures for the same tests. To prevent the tests from being triggered too early, we're introducing a first waiting condition in the metal-ipi jobs [4] (which should also address https://bugzilla.redhat.com/show_bug.cgi?id=2018208).

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-ovn-ipv6/1453817495201779712
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-ovn-ipv6/1453817495201779712/artifacts/e2e-metal-ipi-ovn-ipv6/gather-extra/artifacts/pods/openshift-kube-apiserver_kube-apiserver-master-1.ostest.test.metalkube.org_kube-apiserver.log
[3] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-ovn-ipv6/1453817495201779712/artifacts/e2e-metal-ipi-ovn-ipv6/gather-must-gather/artifacts/event-filter.html
[4] https://github.com/openshift/release/pull/23131
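To illustrate the kind of waiting condition mentioned above, here is a minimal sketch (an assumption about the approach, not the exact change merged in [4]): before starting the kubeconfig tests, poll the kube-apiserver clusteroperator until it reports Available=True and Progressing=False, i.e. no rollout/restart is in flight. The TIMEOUT/INTERVAL values are illustrative only.

#!/usr/bin/env bash
# Hypothetical pre-test wait: block until the kube-apiserver operator is stable.
set -euo pipefail

TIMEOUT=${TIMEOUT:-1800}   # overall wait in seconds (assumed value)
INTERVAL=30                # seconds between polls

deadline=$(( $(date +%s) + TIMEOUT ))
while true; do
  # Read the Available and Progressing conditions of the kube-apiserver operator.
  available=$(oc get clusteroperator kube-apiserver \
    -o jsonpath='{.status.conditions[?(@.type=="Available")].status}' || echo "")
  progressing=$(oc get clusteroperator kube-apiserver \
    -o jsonpath='{.status.conditions[?(@.type=="Progressing")].status}' || echo "")

  if [[ "$available" == "True" && "$progressing" == "False" ]]; then
    echo "kube-apiserver operator is stable; safe to start the tests"
    break
  fi

  if (( $(date +%s) >= deadline )); then
    echo "timed out waiting for kube-apiserver to settle" >&2
    exit 1
  fi

  echo "kube-apiserver still rolling out (Available=${available}, Progressing=${progressing}); retrying..."
  sleep "$INTERVAL"
done

A step like this in front of the test suite gives the static-pod rollout time to finish on every master, so the oc debug probe no longer races the localhost:6443 listener coming back up.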
From looking at https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-blocking#periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-ovn-ipv6 I see this is not a super-pressing issue, so I'm lowering the priority on this one.
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Whiteboard if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.
From looking at https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-blocking#periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-ovn-ipv6 the test seems pretty green, so I'll move this over to QA.
The LifecycleStale keyword was removed because the bug moved to QE. The bug assignee was notified.
Checked the kubeconfig-related tests from https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-blocking#periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-ovn-ipv6 and https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-blocking#periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi; they passed during the past month, so moving the bug to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.8.26 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:0021