Bug 1868645 - After a disaster recovery pods a stuck in "NodeAffinity" state and not running [NEEDINFO]
Summary: After a disaster recovery pods a stuck in "NodeAffinity" state and not running
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.5
Hardware: All
OS: Linux
Target Milestone: ---
: 4.7.0
Assignee: Elana Hashman
QA Contact: Sunil Choudhary
Depends On:
Blocks: 1878638 1930960
TreeView+ depends on / blocked
Reported: 2020-08-13 11:31 UTC by Marko Karg
Modified: 2021-02-24 15:16 UTC (History)
14 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Node is marked as Ready and admits pods before it has a chance to sync. Consequence: Pod status may go out of sync, sometimes many are stuck in NodeAffinity, at node startup for a node that is not cordoned. Fix: Do not mark node as Ready until Node has synced with API servers at least once. Result: Pods should not get stuck in NodeAffinity after e.g. a cold cluster restart.
Clone Of:
Last Closed: 2021-02-24 15:15:36 UTC
Target Upstream Version:
abodhe: needinfo+
rphillips: needinfo? (mkarg)

Attachments (Terms of Use)
the namespace definition (1.48 KB, text/plain)
2020-08-18 06:33 UTC, Marko Karg
no flags Details
the entire contents of the namespace (1.86 MB, text/plain)
2020-08-18 06:34 UTC, Marko Karg
no flags Details
worker044 logs (compressed) (15.68 MB, application/gzip)
2020-08-18 09:20 UTC, Marko Karg
no flags Details

System ID Private Priority Status Summary Last Updated
Github openshift kubernetes pull 522 0 None closed Bug 1868645: UPSTREAM: 94087: kubelet: node sync at least once 2021-02-21 23:19:26 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:16:17 UTC

Description Marko Karg 2020-08-13 11:31:20 UTC
Description of problem:

A full outage on a 120 node baremetal cluster recovers fine and in time to a point where not all previously running pods come up. Out of 8428 pods 
3414 do not get to the Running state. 

Version-Release number of selected component (if applicable):
OCP 4.5.5

How reproducible:
I've tested the procedure twice, in both cases pods where stuck in NodeAffinity, similar numbers in both cases. 

Steps to Reproduce:
1. Power off all nodes hard simultaneously (masters and workers) while non-system pods are running fine (all in one namespace)
2. Power on the masters, let them form a quorate clusters and wait for the workers to come online as well.
3. Check for the number of running pods in the test namespace

Actual results:

Not all the expected pods do come back to the Running state, a large part is stuck in NodeAffinity.

Expected results:

All pods should come back online and enter the Running state.

Additional info:

The cluster is still up and running, I'm creating a must-gather for this case. Please let me know if you need any other information.

These are the pods in question:

[root@e16-h18-b03-fc640 ~]# oc get pods -n nodevertical-rerun-10 | wc -l
[root@e16-h18-b03-fc640 ~]# oc get pods -n nodevertical-rerun-10 | grep Running | wc -l
[root@e16-h18-b03-fc640 ~]# oc get pods -n nodevertical-rerun-10 | grep NodeAffinity | wc -l
[root@e16-h18-b03-fc640 ~]# oc get pods -n nodevertical-rerun-10 | grep -i creating | wc -l
[root@e16-h18-b03-fc640 ~]# 

One of the stuck ones:

[root@e16-h18-b03-fc640 ~]# oc describe pod nodevert-pod-999 -n nodevertical-rerun-10
Name:         nodevert-pod-999
Namespace:    nodevertical-rerun-10
Priority:     0
Node:         worker044/
Start Time:   Thu, 13 Aug 2020 09:02:51 +0000
Labels:       purpose=test
Annotations:  k8s.ovn.org/pod-networks:
                    "name": "ovn-kubernetes",
                    "interface": "eth0",
                    "ips": [
                    "mac": "8e:06:aa:83:04:0c",
                    "default": true,
                    "dns": {}
                    "name": "ovn-kubernetes",
                    "interface": "eth0",
                    "ips": [
                    "mac": "8e:06:aa:83:04:0c",
                    "default": true,
                    "dns": {}
              openshift.io/scc: privileged
Status:       Failed
Reason:       NodeAffinity
Message:      Pod Predicate NodeAffinity failed
IPs:          <none>
    Image:        gcr.io/google_containers/pause-amd64:3.0
    Port:         8080/TCP
    Host Port:    0/TCP
    Environment:  <none>
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-zbjsw (ro)
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-zbjsw
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  node-role.kubernetes.io/worker=
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
  Type     Reason                  Age   From                Message
  ----     ------                  ----  ----                -------
  Normal   Scheduled               136m  default-scheduler   Successfully assigned nodevertical-rerun-10/nodevert-pod-999 to worker044
  Warning  FailedCreatePodSandBox  136m  kubelet, worker044  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_nodevert-pod-999_nodevertical-rerun-10_8f5f6194-5e60-4e09-a13e-6b8de3689c41_0(2747c552ef645754e22bdb72380eefc0a445bfa4ff752fbcbcefe6fea0b1e1e7): Multus: [nodevertical-rerun-10/nodevert-pod-999]: error adding container to network "ovn-kubernetes": delegateAdd: error invoking confAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: '[nodevertical-rerun-10/nodevert-pod-999] failed to get pod annotation: timed out waiting for the condition
  Warning  FailedCreatePodSandBox  135m  kubelet, worker044  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_nodevert-pod-999_nodevertical-rerun-10_8f5f6194-5e60-4e09-a13e-6b8de3689c41_0(db08e2db7a0bbe39c7e451c969384eedcc15eb8c6b201d41e31260da92bbef8c): Multus: [nodevertical-rerun-10/nodevert-pod-999]: error adding container to network "ovn-kubernetes": delegateAdd: error invoking confAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: '[nodevertical-rerun-10/nodevert-pod-999] failed to get pod annotation: timed out waiting for the condition
  Normal   AddedInterface  134m  multus              Add eth0 []
  Normal   Pulled          134m  kubelet, worker044  Container image "gcr.io/google_containers/pause-amd64:3.0" already present on machine
  Normal   Created         134m  kubelet, worker044  Created container nodevert
  Normal   Started         134m  kubelet, worker044  Started container nodevert
  Warning  NodeAffinity    35m   kubelet, worker044  Predicate NodeAffinity failed

Comment 1 Marko Karg 2020-08-13 12:27:50 UTC
I've seen the same problem for 2 fluentd pods in the openshift-logging namespace and deleted them manually. After that they came up fine. Could this be some sort of a race condition between worker nodes coming up and pods already being created?

The must-gather is available at http://file.str.redhat.com/mkarg/bz1868645/must-gather.tgz

Comment 2 Tomáš Nožička 2020-08-18 06:22:26 UTC
can you get us the yaml for the pod? I don't see the affinity spec here

or ideally dump it all with:
`oc get namespace nodevertical-rerun-10 -o yaml`
`oc get all -n nodevertical-rerun-10 -o yaml`


Comment 3 Marko Karg 2020-08-18 06:33:11 UTC
I've attached the outputs as namespace.yml and all.yml to this bz. 
Let me know if you need anything else.

Comment 4 Marko Karg 2020-08-18 06:33:41 UTC
Created attachment 1711685 [details]
the namespace definition

Comment 5 Marko Karg 2020-08-18 06:34:09 UTC
Created attachment 1711686 [details]
the entire contents of the namespace

Comment 6 Tomáš Nožička 2020-08-18 07:36:04 UTC
Looks like those are bare pods without a controller to recreate them.

Unfortunately the dump isn't from the same state  - all the pods there are either Running or Pending.

The pods don't have explicit affinity but have a nodeSelector:
      node-role.kubernetes.io/worker: ""
      nodevertical: "true"

and the default project node selector set on the namespace is
  openshift.io/node-selector: node-role.kubernetes.io/worker=

kube-scheduler thinks the node is fit to run the pod

  ./namespaces/openshift-kube-scheduler/pods/openshift-kube-scheduler-master-2/kube-scheduler/kube-scheduler/logs/current.log:1392:2020-08-13T09:02:51.308294392Z I0813 09:02:51.308256       1 scheduler.go:731] pod nodevertical-rerun-10/nodevert-pod-999 is bound successfully on node "worker044", 123 nodes evaluated, 100 nodes were found feasible.

and the nodeName is set
  Warning  NodeAffinity    35m   kubelet, worker044  Predicate NodeAffinity failed

snapshotted labels on worker044 are:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/os: linux
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: worker044
    kubernetes.io/os: linux
    node-role.kubernetes.io/worker: ""
    node.openshift.io/os_id: rhcos
    nodevertical: "true"
    placement: logtest

which seem to match as well. kube-scheduler evaluated the predicates as matching an assigned the node worker044 to the pod but kubelet on that node thinks the labels don't match and fails the pod.

It would be good to get kubelet logs from the worker044 to help debug the issue. Also the YAML dump for a failed pod would be useful.

As the above points to kubelet, I am sending it to the Node team to have a look.

The behaviour seem similar to https://github.com/kubernetes/kubernetes/issues/93338

Comment 7 Tomáš Nožička 2020-08-18 08:52:35 UTC
this code looks suspicious btw. https://github.com/kubernetes/kubernetes/issues/92067#issuecomment-643711902

// getNodeAnyWay() must return a *v1.Node which is required by RunGeneralPredicates().
// The *v1.Node is obtained as follows:
// Return kubelet's nodeInfo for this node, except on error or if in standalone mode,
// in which case return a manufactured nodeInfo representing a node with no pods,
// zero capacity, and the default labels.
func (kl *Kubelet) getNodeAnyWay() (*v1.Node, error) {
	if kl.kubeClient != nil {
		if n, err := kl.nodeLister.Get(string(kl.nodeName)); err == nil {
			return n, nil
	return kl.initialNode(context.TODO())

Comment 8 Marko Karg 2020-08-18 09:20:25 UTC
Created attachment 1711707 [details]
worker044 logs (compressed)

created with 
oc adm node-logs worker044 --since=-6d > /tmp/worker044.logs

Comment 9 Marko Karg 2020-08-18 13:23:04 UTC
Forget about those log files please, the cluster was re-deployed in between so they are useless. Sorry about that.

Comment 10 Seth Jennings 2020-08-18 19:16:28 UTC
Derek and I discussed the situation and, for this particular situation of bare pods running on hard downed nodes in a DR situation, there shouldn't be an expectation that these pods restart on the same nodes.

However, this can be an issue in less contrived situations, like the ones mentioned in upstream issues, and Derek is coding up a PR for upstream.

Comment 11 Seth Jennings 2020-08-18 20:32:23 UTC
upstream PR https://github.com/kubernetes/kubernetes/pull/94087

Comment 12 Marko Karg 2020-08-19 01:51:46 UTC
The expectation was not that pods would get restarted on the same nodes as before the outage, but that they would all come up running at some point. 
Sorry if that wasn't clear.

Comment 13 Seth Jennings 2020-08-19 14:59:49 UTC
Pods, once scheduled, can not move nodes.  So a bare pod, with no higher level controller to recreate a new pod from a pod template (i.e. ReplicaSet, DaemonSet, etc), will not restart.

Comment 14 Itamar Heim 2020-08-19 22:15:48 UTC
(In reply to Seth Jennings from comment #13)
> Pods, once scheduled, can not move nodes.  So a bare pod, with no higher
> level controller to recreate a new pod from a pod template (i.e. ReplicaSet,
> DaemonSet, etc), will not restart.

So if all nodes come back up, all pods should have as well (no hard down nodes)?

Comment 15 Marko Karg 2020-08-21 09:21:31 UTC
I'm under the impression that there still is some confusion about what the expectation after a disaster outage is. For me it's like this:

Power outage for all nodes

Masters get powered on and form a quorate cluster

Masters start to power up worker nodes and schedule pods (it does not matter for the pods where they get started)

Nodes start the assigned pods

Ultimately all the pods that were running before the outage are running again, even on different hosts. 

If the pods are not running again, I would consider that a bug, and I'm sure our customers would see that the same way. If I understand the upstream PR correctly, pods should only get scheduled to a node when that node is synced and thus can actually run pods, is that correct?


Comment 18 Seth Jennings 2020-08-31 20:32:37 UTC
Deferring to 4.7 as the test, expected behavior, and potential fix are all in dispute and there is no regression here.

Comment 21 Marko Karg 2020-09-23 14:56:02 UTC
TestBlocker removed

Comment 46 errata-xmlrpc 2021-02-24 15:15:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.


Note You need to log in before you can comment on or make changes to this bug.