Bug 1891742 - 4.6: OperatorStatusChanged is noisy
Summary: 4.6: OperatorStatusChanged is noisy
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-controller-manager
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.z
Assignee: Maciej Szulik
QA Contact: zhou ying
URL:
Whiteboard:
Depends On: 1891740
Blocks:
 
Reported: 2020-10-27 09:06 UTC by Stefan Schimanski
Modified: 2021-01-25 20:02 UTC
CC: 9 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of: 1891740
Environment:
Last Closed: 2021-01-25 20:02:12 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-kube-apiserver-operator pull 1029 0 None closed Bug 1891742: bump(library-go): controller/staticpod: reduce noise of not-ready containers 2021-01-18 02:18:38 UTC
Github openshift cluster-kube-controller-manager-operator pull 493 0 None closed [release-4.6] Bug 1891742: bump(library-go): controller/staticpod: reduce noise of … 2021-01-18 02:19:18 UTC
Github openshift cluster-kube-scheduler-operator pull 314 0 None closed [release-4.6] Bug 1891742: bump(library-go): controller/staticpod: reduce noise of … 2021-01-18 02:19:18 UTC
Github openshift library-go pull 932 0 None closed [release-4.6] Bug 1891742: controller/staticpod: reduce noise of not-ready containers 2021-01-18 02:19:19 UTC
Red Hat Product Errata RHSA-2021:0171 0 None None None 2021-01-25 20:02:34 UTC

Comment 3 Xingxing Xia 2020-11-12 09:53:49 UTC
Sorry for being late on this; I was busy with other work.
Checked the latest 4.6 (4.6.0-0.nightly-2020-11-11-195428) and rolled out the static pod components. Unlike 4.7, which was fixed per https://bugzilla.redhat.com/show_bug.cgi?id=1891740#c2 , 4.6 still reproduces the issue:
oc logs deployment/kube-controller-manager-operator -n openshift-kube-controller-manager-operator > logs/kcm_o.log
oc logs deployment/openshift-kube-scheduler-operator -n openshift-kube-scheduler-operator > logs/ks_o.log
oc logs deployment/kube-apiserver-operator -n openshift-kube-apiserver-operator > logs/kas_o.log

$ grep 'OperatorStatusChanged.*StaticPodsDegraded: pod.*kube-.*container.*is running.*but not ready: unknown reason"' logs/kas_o.log | wc -l
94

# same pattern, anchored to end of line with a trailing $
$ grep 'OperatorStatusChanged.*StaticPodsDegraded: pod.*kube-.*container.*is running.*but not ready: unknown reason"$' logs/kas_o.log | wc -l
54

$ vi logs/kas_o.log
...
I1112 08:45:29.448695       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"e83be615-48b0-4635-b635-81993241488d", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/kube-apiserver changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nStaticPodsDegraded: pod/kube-apiserver-xxia1112bm-xjj94-control-plane-1 container \"kube-apiserver\" is not ready: PodInitializing: \nStaticPodsDegraded: pod/kube-apiserver-xxia1112bm-xjj94-control-plane-1 container \"kube-apiserver-cert-regeneration-controller\" is not ready: PodInitializing: \nStaticPodsDegraded: pod/kube-apiserver-xxia1112bm-xjj94-control-plane-1 container \"kube-apiserver-cert-syncer\" is not ready: PodInitializing: \nStaticPodsDegraded: pod/kube-apiserver-xxia1112bm-xjj94-control-plane-1 container \"kube-apiserver-check-endpoints\" is not ready: PodInitializing: \nStaticPodsDegraded: pod/kube-apiserver-xxia1112bm-xjj94-control-plane-1 container \"kube-apiserver-insecure-readyz\" is not ready: PodInitializing: " to "NodeControllerDegraded: All master nodes are ready\nStaticPodsDegraded: pod/kube-apiserver-xxia1112bm-xjj94-control-plane-1 container \"kube-apiserver\" is running since 2020-11-12 08:45:15 +0000 UTC but not ready: unknown reason\nStaticPodsDegraded: pod/kube-apiserver-xxia1112bm-xjj94-control-plane-1 container \"kube-apiserver-check-endpoints\" is running since 2020-11-12 08:45:17 +0000 UTC but not ready: unknown reason"
...
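
For reference, the collection and grep steps above can be scripted in one pass. This is only a minimal sketch (it assumes a logged-in `oc` session, a writable logs/ directory, and the same file names used above):

#!/usr/bin/env bash
# Sketch: collect the three static-pod operator logs and count the noisy
# OperatorStatusChanged events, both anywhere in the event message and only
# when the noise is the last line of the message (trailing $ anchor).
set -euo pipefail
mkdir -p logs
oc logs deployment/kube-apiserver-operator -n openshift-kube-apiserver-operator > logs/kas_o.log
oc logs deployment/kube-controller-manager-operator -n openshift-kube-controller-manager-operator > logs/kcm_o.log
oc logs deployment/openshift-kube-scheduler-operator -n openshift-kube-scheduler-operator > logs/ks_o.log
noise='OperatorStatusChanged.*StaticPodsDegraded: pod.*kube-.*container.*is running.*but not ready: unknown reason"'
for f in logs/kas_o.log logs/kcm_o.log logs/ks_o.log; do
  total=$(grep -c "$noise" "$f" || true)
  trailing=$(grep -c "$noise"'$' "$f" || true)
  echo "$f: total=$total trailing=$trailing"
done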

Checked the 4.6 branch; assigning back for the library-go bump:
[xxia@pc ~]$ cd /data/src/github.com/openshift/cluster-kube-apiserver-operator/
[xxia@pc cluster-kube-apiserver-operator]$ git pull
[xxia@pc library-go]$ git checkout -b release-4.6 origin/release-4.6
[xxia@pc library-go]$ vi pkg/operator/staticpod/controller/staticpodstate/staticpodstate_controller.go
It does not yet contain the changed lines, such as `case containerStatus.State.Waiting != nil && containerStatus.State.Waiting.Reason != "PodInitializing"`, from https://github.com/openshift/library-go/pull/932/files
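
A quick way to check whether an operator repo has already vendored that fix is to grep the vendored library-go source. This is only a sketch; the vendor/ prefix below is assumed from standard Go vendoring and is not taken from this bug:

cd /data/src/github.com/openshift/cluster-kube-apiserver-operator
git checkout release-4.6
# If the fixed library-go is vendored, the PodInitializing special case from
# library-go PR 932 should show up in the vendored staticpodstate controller.
grep -rn 'Waiting.Reason != "PodInitializing"' \
  vendor/github.com/openshift/library-go/pkg/operator/staticpod/controller/staticpodstate/ \
  && echo "fix appears to be vendored" \
  || echo "fix from library-go PR 932 not vendored yet"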

Comment 5 Xingxing Xia 2021-01-12 07:57:51 UTC
I'm following JIRA issue DPTP-660 to do pre-merge verification. Used cluster-bot to launch an env with the still-open but dev-approved PR(s): launch openshift/cluster-kube-apiserver-operator#1029
Checked the env with the steps from the previous comment: the kube-apiserver-operator logs no longer contain the noise, but the KCM and kube-scheduler operator pods' logs still do, as in the previous step. Are there KCM and kube-scheduler operator PRs? Moving back to Assigned. Feel free to move it back if the answer is no, or apply their PRs if the answer is yes.
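
As an additional spot check (only a sketch; the relevant events may already have expired, since events have a limited TTL), the noise can also be looked for directly in the OperatorStatusChanged events of the two remaining operators:

for ns in openshift-kube-controller-manager-operator openshift-kube-scheduler-operator; do
  echo "== $ns =="
  # Count occurrences of the noisy message in recent status-change events.
  oc get events -n "$ns" --field-selector reason=OperatorStatusChanged -o yaml \
    | grep -c 'but not ready: unknown reason' || true
done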

Comment 6 Stefan Schimanski 2021-01-12 09:13:24 UTC
Moving to workloads team in order for them to bump library-go in kcm and ks.

Comment 7 Maciej Szulik 2021-01-15 11:47:48 UTC
PR in the queue.

Comment 9 Xingxing Xia 2021-01-18 02:23:07 UTC
This bug has 3 attached PRs, but only the KAS PR was pre-merge-verified, so the BZ robot should not have automatically moved it to VERIFIED. Manually moving it back to ON_QA to verify the other 2 KCM/KS PRs.

Comment 11 zhou ying 2021-01-19 03:02:17 UTC
Checked with the latest payload; the issue has been fixed:
[root@dhcp-140-138 ~]# oc get clusterversion 
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2021-01-18-070340   True        False         45m     Cluster version is 4.6.0-0.nightly-2021-01-18-070340


[root@dhcp-140-138 ~]# oc logs deployment/kube-controller-manager-operator -n openshift-kube-controller-manager-operator > logs/kcm_o.log
[root@dhcp-140-138 ~]# oc logs deployment/openshift-kube-scheduler-operator -n openshift-kube-scheduler-operator > logs/ks_o.log
[root@dhcp-140-138 ~]# grep 'OperatorStatusChanged.*StaticPodsDegraded: pod.*kube-.*container.*is running.*but not ready: unknown reason"' logs/kcm_o.log |wc -l 
0
[root@dhcp-140-138 ~]# grep 'OperatorStatusChanged.*StaticPodsDegraded: pod.*kube-.*container.*is running.*but not ready: unknown reason"$' logs/kcm_o.log |wc -l 
0
[root@dhcp-140-138 ~]# grep 'OperatorStatusChanged.*StaticPodsDegraded: pod.*kube-.*container.*is running.*but not ready: unknown reason"$' logs/ks_o.log |wc -l 
0
[root@dhcp-140-138 ~]# grep 'OperatorStatusChanged.*StaticPodsDegraded: pod.*kube-.*container.*is running.*but not ready: unknown reason"' logs/ks_o.log |wc -l 
0

Comment 13 errata-xmlrpc 2021-01-25 20:02:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.6.13 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0171

