Bug 1910801 - Nodes are going into a NotReady state and are unresponsive
Summary: Nodes are going into a NotReady state and are unresponsive
Keywords:
Status: CLOSED DUPLICATE of bug 1857446
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Ryan Phillips
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-12-24 18:22 UTC by mchebbi@redhat.com
Modified: 2024-03-25 17:40 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-01-08 15:43:20 UTC
Target Upstream Version:
Embargoed:



Description mchebbi@redhat.com 2020-12-24 18:22:24 UTC
Gathered information url: shorturl.at/louwy

The customer has had at least one node per week go into a NotReady state, and the node then remains unresponsive until we reboot it. This has happened on at least 4-5 different nodes in our cluster.

The problem occurs on both master and worker nodes.
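For reference, a minimal way to confirm which nodes are affected and to pull the kubelet journal from them (a sketch only; it assumes cluster-admin access with the oc client, and the node name below is just an example from this cluster):

    # show the Ready/NotReady condition for all nodes
    oc get nodes -o wide

    # pull the kubelet journal from a suspect node (node name is an example)
    oc adm node-logs worker-6.ott-ocp1.lab.rbbn.com -u kubelet | tail -n 200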

The problem is causing pods to go down, because they are attached to a PVC; the PVC gets into a locked state and cannot be reused until we take manual action to remove the RBD lock on the PVC.
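For context, the manual cleanup referred to above is along the following lines (a rough sketch, not the exact procedure used here; the pool and image names are placeholders, and the RBD image backing a PVC can be read from the corresponding PV spec):

    # from a host or toolbox pod with the Ceph rbd client configured:
    # list locks held on the RBD image backing the stuck PVC
    rbd lock list <pool>/<image-name>

    # remove the stale lock (lock id and locker come from the list output)
    rbd lock remove <pool>/<image-name> <lock-id> <locker>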

I have checked the gathered information and found the following issues:
--------------------
worker-06
--------------------
Dec 04 17:30:27 worker-6.ott-ocp1.lab.rbbn.com hyperkube[1455892]: E1204 17:30:27.600953 1455892 remote_runtime.go:321] UpdateContainerResources "2e87aa083f54cd9fbfe6515714d43dbb03d959fdc957fc1159880541bee6644f" from runtime service failed: rpc error: code = Unknown desc = container 2e87aa083f54cd9fbfe6515714d43dbb03d959fdc957fc1159880541bee6644f is not running or created state: stopped
Dec 04 17:30:27 worker-6.ott-ocp1.lab.rbbn.com hyperkube[1455892]: E1204 17:30:27.600967 1455892 cpu_manager.go:201] [cpumanager] AddContainer error: rpc error: code = Unknown desc = container 2e87aa083f54cd9fbfe6515714d43dbb03d959fdc957fc1159880541bee6644f is not running or created state: stopped
Dec 04 17:30:27 worker-6.ott-ocp1.lab.rbbn.com hyperkube[1455892]: I1204 17:30:27.600973 1455892 policy_static.go:249] [cpumanager] static policy: RemoveContainer (container id: 2e87aa083f54cd9fbfe6515714d43dbb03d959fdc957fc1159880541bee6644f)
Dec 04 17:30:27 worker-6.ott-ocp1.lab.rbbn.com hyperkube[1455892]: E1204 17:30:27.600983 1455892 cpu_manager.go:333] [cpumanager] reconcileState: failed to add container (pod: multus-j98jj, container: whereabouts-cni, container id: 2e87aa083f54cd9fbfe6515714d43dbb03d959fdc957fc1159880541bee6644f, error: rpc error: code = Unknown desc = container 2e87aa083f54cd9fbfe6515714d43dbb03d959fdc957fc1159880541bee6644f is not running or created state: stopped)
==================

master 3:

Dec 02 19:15:31.692158 master-3.ott-ocp1.lab.rbbn.com hyperkube[2867]: E1101 07:54:02.445808       1 leaderelection.go:331] error retrieving resource lock openshift-machine-config-operator/machine-config: etcdserver: leader changed
Dec 02 19:15:31.692158 master-3.ott-ocp1.lab.rbbn.com hyperkube[2867]: E1101 07:55:44.576854       1 leaderelection.go:331] error retrieving resource lock openshift-machine-config-operator/machine-config: etcdserver: leader changed
Dec 02 19:15:31.692158 master-3.ott-ocp1.lab.rbbn.com hyperkube[2867]: E1101 10:12:52.406397       1 operator.go:330] timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready. status: (desired: 9, updated: 9, ready: 8, unavailable: 1)

Dec 02 19:15:31.692158 master-3.ott-ocp1.lab.rbbn.com hyperkube[2867]: E1101 10:33:08.056879       1 leaderelection.go:331] error retrieving resource lock openshift-machine-config-operator/machine-config: etcdserver: request timed out
Dec 02 19:15:31.692158 master-3.ott-ocp1.lab.rbbn.com hyperkube[2867]: E1101 10:33:37.657178       1 event.go:319] Could not construct reference to: '&v1.ConfigMap{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"", GenerateName:"", Namespace:"", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Data:map[string]string(nil), BinaryData:map[string][]uint8(nil)}' due to: 'no kind is registered for the type v1.ConfigMap in scheme "github.com/openshift/machine-config-operator/cmd/common/helpers.go:30"'. Will not report event: 'Normal' 'LeaderElection' 'machine-config-operator-5f77b58d6b-sz5nz_a97fc9f0-aa7e-4073-92dd-38b209d06506 stopped leading'
Dec 02 19:15:31.692158 master-3.ott-ocp1.lab.rbbn.com hyperkube[2867]: I1101 10:33:37.657524       1 leaderelection.go:288] failed to renew lease openshift-machine-config-operator/machine-config: timed out waiting for the condition
Dec 02 19:15:31.692158 master-3.ott-ocp1.lab.rbbn.com hyperkube[2867]: F1101 10:33:37.657560       1 start.go:113] leaderelection lost
Dec 02 19:15:31.692158 master-3.ott-ocp1.lab.rbbn.com hyperkube[2867]: I1101 10:33:37.657608       1 operator.go:274] Shutting down MachineConfigOperator

Dec 02 19:16:28.691962 master-3.ott-ocp1.lab.rbbn.com hyperkube[2867]: I1202 19:16:28.691876    2867 status_manager.go:434] Ignoring same status for pod "kube-apiserver-operator-55cbcc4bbb-xddfr_openshift-kube-apiserver-operator(9fb317bc-0b23-4a5d-bef0-dd29c599ba1e)", status: {Phase:Running Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-10-31 05:49:53 +0000 UTC Reason: Message:} {Type:Ready Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-11-02 07:12:59 +0000 UTC Reason: Message:} {Type:ContainersReady Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-11-02 07:12:59 +0000 UTC Reason: Message:} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-10-31 05:49:53 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:172.29.114.23 PodIP:10.129.0.159 PodIPs:[{IP:10.129.0.159}] StartTime:2020-10-31 05:49:53 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:kube-apiserver-operator State:{Waiting:nil Running:&ContainerStateRunning{StartedAt:2020-11-02 07:12:59 +0000 UTC,} Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:&ContainerStateTerminated{ExitCode:255,Signal:0,Reason:Error,Message:et stopped posting node status.)" to "NodeControllerDegraded: The master nodes not ready: node \"master-1.ott-ocp1.lab.rbbn.com\" not ready since 2020-10-31 05:44:27 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)\nKubeAPIServerStaticResourcesDegraded: \"v4.1.0/kube-apiserver/ns.yaml\" (string): etcdserver: request timed out\nKubeAPIServerStaticResourcesDegraded: "


Dec 02 19:16:28.691962 master-3.ott-ocp1.lab.rbbn.com hyperkube[2867]: I1102 07:12:55.174756       1 event.go:281] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"077cbd9f-11c2-4960-9502-2010d1b6a2ff", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/kube-apiserver changed: Degraded message changed from "NodeControllerDegraded: The master nodes not ready: node \"master-1.ott-ocp1.lab.rbbn.com\" not ready since 2020-10-31 05:44:27 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)\nKubeAPIServerStaticResourcesDegraded: \"v4.1.0/kube-apiserver/ns.yaml\" (string): etcdserver: request timed out\nKubeAPIServerStaticResourcesDegraded: " to "NodeControllerDegraded: The master nodes not ready: node \"master-1.ott-ocp1.lab.rbbn.com\" not ready since 2020-10-31 05:44:27 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)"
==========================================================================
Thanks for your help and support.

Comment 1 Ryan Phillips 2021-01-04 19:44:07 UTC
Which platform is this on?

The good news is that we have a kernel patch and an HTTP/2 patch in 4.6 that fix this.

Comment 2 Ryan Phillips 2021-01-04 19:44:28 UTC
There is more info here: https://bugzilla.redhat.com/show_bug.cgi?id=1857446

Comment 3 mchebbi@redhat.com 2021-01-05 09:41:54 UTC
Hello,
It's OCP 4.4.

Comment 4 Ryan Phillips 2021-01-08 15:43:20 UTC
This is fixed in 4.6 with a backport to 4.5 pending. Please upgrade to at least 4.6.9+ for long term support.
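For reference, a minimal sketch of driving the upgrade from the CLI (the channel name below is only an example, and from 4.4 the path goes through 4.5 first, so the exact versions offered depend on the cluster's update graph):

    # point the cluster at a newer channel
    oc patch clusterversion version --type merge -p '{"spec":{"channel":"stable-4.5"}}'

    # review the available updates, then start one
    oc adm upgrade
    oc adm upgrade --to-latest=true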

*** This bug has been marked as a duplicate of bug 1857446 ***

