Bug 2002374 - Inexplicably slow kubelet on bootstrap makes installation fail
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.9.0
Assignee: aos-install
QA Contact: Jianli Wei
URL:
Whiteboard:
Depends On: 1981999
Blocks: 2004716 2027414
Reported: 2021-09-08 16:14 UTC by Pablo Alonso Rodriguez
Modified: 2021-11-29 15:10 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2004716 2027414
Environment:
Last Closed: 2021-11-29 15:05:33 UTC
Target Upstream Version:
Embargoed:



Links
System                  | ID             | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata  | RHSA-2021:3759 | 0       | None     | None   | None    | 2021-10-18 17:51:45 UTC

Description Pablo Alonso Rodriguez 2021-09-08 16:14:29 UTC
Description of problem:

At one customer, every installation attempt hits an inexplicably slow kubelet on the bootstrap node: it does not start the kube-apiserver pod even after hours of waiting.

Judging by CRI-O, the kubelet does not even seem to try to start the pod, but I cannot point to any failure in the logs.

sar metrics were also collected and showed no apparent resource exhaustion (CPU, RAM, storage, or network) and no high load.

So I need help from the kubelet team to understand where the slowness could come from and whether it could be due to a kubelet bug.

Version-Release number of selected component (if applicable):

4.6 (several errata releases)

How reproducible:

Only in one specific environment.

Steps to Reproduce:
1. Install a cluster


Actual results:

Bootstrap kube-apiserver pod never starts due to apparent kubelet slowness

Expected results:

The bootstrap kube-apiserver pod starts.

Additional info:

Comment 14 Benjamin Gilbert 2021-09-17 18:41:56 UTC
Moving to POST because bug 1978268 has landed in a build, and we're just waiting for the bootimage bump.

Comment 15 Benjamin Gilbert 2021-09-22 21:47:43 UTC
The bootimage bump in bug 1981999 has landed.  Moving to MODIFIED.

Comment 16 Scott Dodson 2021-09-23 13:48:38 UTC
This made it into 4.9.0-rc.3; moving to ON_QA.

Comment 17 Gaoyun Pei 2021-09-25 09:21:16 UTC
In payload quay.io/openshift-release-dev/ocp-release:4.9.0-rc.3-x86_64, RHCOS-49.84.202109172039-0 was used as boot image.

[root@ip-10-0-13-79 ~]# rpm-ostree status
State: idle
Deployments:
* ostree://67a210b2d0d1c3787f813061995783c3528d132cfb97bd44b3eb003fb8dacde8
                   Version: 49.84.202109172039-0 (2021-09-17T20:43:24Z)

In QE's CI tests, we did not see any bootstrap failures with 4.9.0-rc.3-x86_64, so I am moving this bug to VERIFIED.
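
The boot image check in this comment can also be scripted. The following is only a minimal sketch, not part of the original verification: the helper name booted_version is hypothetical, the sample text is copied from the rpm-ostree status output quoted above, and the parsing assumes the booted deployment is the one marked with "*" and that its Version line follows it.

```python
import re

# Sample text copied from the comment above. On a live node it would come
# from running `rpm-ostree status` instead (assumption: same layout).
SAMPLE_STATUS = """\
State: idle
Deployments:
* ostree://67a210b2d0d1c3787f813061995783c3528d132cfb97bd44b3eb003fb8dacde8
                   Version: 49.84.202109172039-0 (2021-09-17T20:43:24Z)
"""


def booted_version(status_text):
    """Return the Version of the deployment marked with '*', or None.

    Hypothetical helper: it assumes the Version line appears after the
    starred deployment line, as in the sample above.
    """
    in_booted = False
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("*"):
            # The booted deployment is flagged with a leading asterisk.
            in_booted = True
            continue
        if in_booted:
            match = re.match(r"Version:\s*(\S+)", stripped)
            if match:
                return match.group(1)
    return None


if __name__ == "__main__":
    # Compare against the boot image version named in this comment.
    expected = "49.84.202109172039-0"
    print(booted_version(SAMPLE_STATUS) == expected)
```

Comparing the returned string with the version suffix of the expected RHCOS boot image gives a quick pass/fail signal for whether the node booted from the bumped image.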

Comment 20 errata-xmlrpc 2021-10-18 17:51:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

