Description of problem:

A full outage on a 120-node bare metal cluster recovers fine and in time, but not all previously running pods come back up. Out of 8428 pods, 3414 do not reach the Running state.

Version-Release number of selected component (if applicable):

OCP 4.5.5

How reproducible:

I've tested the procedure twice; in both cases pods were stuck in NodeAffinity, with similar numbers each time.

Steps to Reproduce:
1. Power off all nodes hard and simultaneously (masters and workers) while non-system pods are running fine (all in one namespace).
2. Power on the masters, let them form a quorate cluster, and wait for the workers to come online as well.
3. Check the number of running pods in the test namespace.

Actual results:

Not all of the expected pods come back to the Running state; a large number are stuck in NodeAffinity.

Expected results:

All pods should come back online and enter the Running state.

Additional info:

The cluster is still up and running and I'm creating a must-gather for this case. Please let me know if you need any other information.

These are the pods in question:

[root@e16-h18-b03-fc640 ~]# oc get pods -n nodevertical-rerun-10 | wc -l
8428
[root@e16-h18-b03-fc640 ~]# oc get pods -n nodevertical-rerun-10 | grep Running | wc -l
5014
[root@e16-h18-b03-fc640 ~]# oc get pods -n nodevertical-rerun-10 | grep NodeAffinity | wc -l
3326
[root@e16-h18-b03-fc640 ~]# oc get pods -n nodevertical-rerun-10 | grep -i creating | wc -l
87
[root@e16-h18-b03-fc640 ~]#

One of the stuck ones:

[root@e16-h18-b03-fc640 ~]# oc describe pod nodevert-pod-999 -n nodevertical-rerun-10
Name:         nodevert-pod-999
Namespace:    nodevertical-rerun-10
Priority:     0
Node:         worker044/
Start Time:   Thu, 13 Aug 2020 09:02:51 +0000
Labels:       purpose=test
Annotations:  k8s.ovn.org/pod-networks: {"default":{"ip_addresses":["10.131.4.11/23"],"mac_address":"8e:06:aa:83:04:0c","gateway_ips":["10.131.4.1"],"ip_address":"10.131.4.11/23"...
              k8s.v1.cni.cncf.io/network-status: [{ "name": "ovn-kubernetes", "interface": "eth0", "ips": [ "10.131.4.11" ], "mac": "8e:06:aa:83:04:0c", "default": true, "dns": {} }]
              k8s.v1.cni.cncf.io/networks-status: [{ "name": "ovn-kubernetes", "interface": "eth0", "ips": [ "10.131.4.11" ], "mac": "8e:06:aa:83:04:0c", "default": true, "dns": {} }]
              openshift.io/scc: privileged
Status:       Failed
Reason:       NodeAffinity
Message:      Pod Predicate NodeAffinity failed
IP:
IPs:          <none>
Containers:
  nodevert:
    Image:        gcr.io/google_containers/pause-amd64:3.0
    Port:         8080/TCP
    Host Port:    0/TCP
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-zbjsw (ro)
Volumes:
  default-token-zbjsw:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-zbjsw
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  node-role.kubernetes.io/worker=
                 nodevertical=true
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                  Age   From                Message
  ----     ------                  ---   ----                -------
  Normal   Scheduled               136m  default-scheduler   Successfully assigned nodevertical-rerun-10/nodevert-pod-999 to worker044
  Warning  FailedCreatePodSandBox  136m  kubelet, worker044  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_nodevert-pod-999_nodevertical-rerun-10_8f5f6194-5e60-4e09-a13e-6b8de3689c41_0(2747c552ef645754e22bdb72380eefc0a445bfa4ff752fbcbcefe6fea0b1e1e7): Multus: [nodevertical-rerun-10/nodevert-pod-999]: error adding container to network "ovn-kubernetes": delegateAdd: error invoking confAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: '[nodevertical-rerun-10/nodevert-pod-999] failed to get pod annotation: timed out waiting for the condition '
  Warning  FailedCreatePodSandBox  135m  kubelet, worker044  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_nodevert-pod-999_nodevertical-rerun-10_8f5f6194-5e60-4e09-a13e-6b8de3689c41_0(db08e2db7a0bbe39c7e451c969384eedcc15eb8c6b201d41e31260da92bbef8c): Multus: [nodevertical-rerun-10/nodevert-pod-999]: error adding container to network "ovn-kubernetes": delegateAdd: error invoking confAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: '[nodevertical-rerun-10/nodevert-pod-999] failed to get pod annotation: timed out waiting for the condition '
  Normal   AddedInterface          134m  multus              Add eth0 [10.131.4.11/23]
  Normal   Pulled                  134m  kubelet, worker044  Container image "gcr.io/google_containers/pause-amd64:3.0" already present on machine
  Normal   Created                 134m  kubelet, worker044  Created container nodevert
  Normal   Started                 134m  kubelet, worker044  Started container nodevert
  Warning  NodeAffinity            35m   kubelet, worker044  Predicate NodeAffinity failed
I've seen the same problem for 2 fluentd pods in the openshift-logging namespace and deleted them manually. After that they came up fine. Could this be some sort of a race condition between worker nodes coming up and pods already being created? The must-gather is available at http://file.str.redhat.com/mkarg/bz1868645/must-gather.tgz
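In case it helps, the manual cleanup can be scripted. Here is a rough sketch, assuming client-go v0.18+ and a kubeconfig in the default location (the namespace is just an example), that deletes pods that failed kubelet's NodeAffinity admission so their controllers, if any, recreate them:

package main

import (
    "context"
    "fmt"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Example namespace; the fluentd pods mentioned above live in openshift-logging.
    namespace := "openshift-logging"

    config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client, err := kubernetes.NewForConfig(config)
    if err != nil {
        panic(err)
    }

    pods, err := client.CoreV1().Pods(namespace).List(context.TODO(), metav1.ListOptions{})
    if err != nil {
        panic(err)
    }
    for _, pod := range pods.Items {
        // Pods rejected by kubelet admission end up Failed with Reason=NodeAffinity.
        if pod.Status.Phase != corev1.PodFailed || pod.Status.Reason != "NodeAffinity" {
            continue
        }
        fmt.Printf("deleting %s/%s\n", namespace, pod.Name)
        // Only controller-managed pods (DaemonSet, ReplicaSet, ...) get recreated;
        // a bare pod is simply gone after this.
        if err := client.CoreV1().Pods(namespace).Delete(context.TODO(), pod.Name, metav1.DeleteOptions{}); err != nil {
            fmt.Println("delete failed:", err)
        }
    }
}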
Can you get us the YAML for the pod? I don't see the affinity spec here. Or, ideally, dump it all with:

`oc get namespace nodevertical-rerun-10 -o yaml`
`oc get all -n nodevertical-rerun-10 -o yaml`

Thanks
I've attached the outputs as namespace.yml and all.yml to this bz. Let me know if you need anything else.
Created attachment 1711685 [details] the namespace definition
Created attachment 1711686 [details] the entire contents of the namespace
Looks like those are bare pods without a controller to recreate them.

Unfortunately the dump isn't from the same state - all the pods there are either Running or Pending.

The pods don't have explicit affinity but they do have a nodeSelector:

  nodeSelector:
    node-role.kubernetes.io/worker: ""
    nodevertical: "true"

and the default project node selector set on the namespace is

  openshift.io/node-selector: node-role.kubernetes.io/worker=

kube-scheduler thinks the node is fit to run the pod:

./namespaces/openshift-kube-scheduler/pods/openshift-kube-scheduler-master-2/kube-scheduler/kube-scheduler/logs/current.log:1392:2020-08-13T09:02:51.308294392Z I0813 09:02:51.308256 1 scheduler.go:731] pod nodevertical-rerun-10/nodevert-pod-999 is bound successfully on node "worker044", 123 nodes evaluated, 100 nodes were found feasible.

and the nodeName is set, yet:

  Warning  NodeAffinity  35m  kubelet, worker044  Predicate NodeAffinity failed

The snapshotted labels on worker044 are:

  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/os: linux
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: worker044
    kubernetes.io/os: linux
    node-role.kubernetes.io/worker: ""
    node.openshift.io/os_id: rhcos
    nodevertical: "true"
    placement: logtest

which seem to match as well.

So kube-scheduler evaluated the predicates as matching and assigned node worker044 to the pod, but the kubelet on that node thinks the labels don't match and fails the pod.

It would be good to get kubelet logs from worker044 to help debug the issue. The YAML dump of a failed pod would also be useful.

As the above points to the kubelet, I am sending this to the Node team to have a look. The behaviour seems similar to https://github.com/kubernetes/kubernetes/issues/93338
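To double-check the label side of the argument, here is a minimal standalone sketch of the kind of nodeSelector match involved. This is not the upstream predicate code; the helper is made up, and the label values are copied from the snapshot above:

package main

import "fmt"

// podMatchesNode is a simplified stand-in for the nodeSelector part of the
// general predicates: every key/value in the pod's nodeSelector must be
// present on the node's labels.
func podMatchesNode(nodeSelector, nodeLabels map[string]string) bool {
    for k, v := range nodeSelector {
        if got, ok := nodeLabels[k]; !ok || got != v {
            return false
        }
    }
    return true
}

func main() {
    // nodeSelector from the failed pod (explicit selector plus the project default).
    nodeSelector := map[string]string{
        "node-role.kubernetes.io/worker": "",
        "nodevertical":                   "true",
    }
    // Labels snapshotted from worker044 in the must-gather.
    worker044 := map[string]string{
        "beta.kubernetes.io/arch":        "amd64",
        "beta.kubernetes.io/os":          "linux",
        "kubernetes.io/arch":             "amd64",
        "kubernetes.io/hostname":         "worker044",
        "kubernetes.io/os":               "linux",
        "node-role.kubernetes.io/worker": "",
        "node.openshift.io/os_id":        "rhcos",
        "nodevertical":                   "true",
        "placement":                      "logtest",
    }
    fmt.Println(podMatchesNode(nodeSelector, worker044)) // true
}

This prints true, so the stored labels do satisfy the selector; whatever the kubelet evaluated at admission time must have looked different.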
This code looks suspicious, btw:
https://github.com/kubernetes/kubernetes/issues/92067#issuecomment-643711902

// getNodeAnyWay() must return a *v1.Node which is required by RunGeneralPredicates().
// The *v1.Node is obtained as follows:
// Return kubelet's nodeInfo for this node, except on error or if in standalone mode,
// in which case return a manufactured nodeInfo representing a node with no pods,
// zero capacity, and the default labels.
func (kl *Kubelet) getNodeAnyWay() (*v1.Node, error) {
    if kl.kubeClient != nil {
        if n, err := kl.nodeLister.Get(string(kl.nodeName)); err == nil {
            return n, nil
        }
    }
    return kl.initialNode(context.TODO())
}
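A speculative sketch of why that fallback would produce exactly this failure mode, assuming the node lister cache can still be empty right after the node reboots (the function and label sets below are illustrative, not the actual kubelet code):

package main

import "fmt"

// matches mirrors the simple nodeSelector check from the previous comment.
func matches(nodeSelector, nodeLabels map[string]string) bool {
    for k, v := range nodeSelector {
        if got, ok := nodeLabels[k]; !ok || got != v {
            return false
        }
    }
    return true
}

func main() {
    nodeSelector := map[string]string{
        "node-role.kubernetes.io/worker": "",
        "nodevertical":                   "true",
    }
    // Roughly what a manufactured initialNode() would carry right after boot:
    // only default hostname/os/arch labels, none of the custom ones.
    manufactured := map[string]string{
        "kubernetes.io/hostname": "worker044",
        "kubernetes.io/os":       "linux",
        "kubernetes.io/arch":     "amd64",
    }
    fmt.Println(matches(nodeSelector, manufactured)) // false -> "Predicate NodeAffinity failed"
}

If the node lister misses during the restart window, the kubelet would evaluate the predicate against a node object lacking nodevertical=true and node-role.kubernetes.io/worker, fail the pod, and never re-admit it.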
Created attachment 1711707 [details]
worker044 logs (compressed)

Created with: oc adm node-logs worker044 --since=-6d > /tmp/worker044.logs
Please disregard those log files; the cluster was re-deployed in between, so they are useless. Sorry about that.
Derek and I discussed this. For this particular case of bare pods running on nodes that went down hard in a DR scenario, there shouldn't be an expectation that these pods restart on the same nodes. However, this can be an issue in less contrived situations, like the ones mentioned in the upstream issues, and Derek is coding up a PR for upstream.
upstream PR https://github.com/kubernetes/kubernetes/pull/94087
The expectation was not that pods would get restarted on the same nodes as before the outage, but that they would all come up running at some point. Sorry if that wasn't clear.
Pods, once scheduled, cannot move nodes. So a bare pod, with no higher-level controller to recreate a new pod from a pod template (e.g. ReplicaSet, DaemonSet, etc.), will not restart.
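For illustration, a rough client-go sketch (assuming client-go v0.18+ and a kubeconfig in the default location; object names are just examples) of wrapping the same pause container in a Deployment, so the ReplicaSet controller recreates replacement pods after an outage like this:

package main

import (
    "context"

    appsv1 "k8s.io/api/apps/v1"
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func int32Ptr(i int32) *int32 { return &i }

func main() {
    config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client, err := kubernetes.NewForConfig(config)
    if err != nil {
        panic(err)
    }

    dep := &appsv1.Deployment{
        ObjectMeta: metav1.ObjectMeta{Name: "nodevert", Namespace: "nodevertical-rerun-10"},
        Spec: appsv1.DeploymentSpec{
            Replicas: int32Ptr(1),
            Selector: &metav1.LabelSelector{MatchLabels: map[string]string{"purpose": "test"}},
            Template: corev1.PodTemplateSpec{
                ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{"purpose": "test"}},
                Spec: corev1.PodSpec{
                    // Same placement constraints as the bare test pods.
                    NodeSelector: map[string]string{
                        "node-role.kubernetes.io/worker": "",
                        "nodevertical":                   "true",
                    },
                    Containers: []corev1.Container{{
                        Name:  "nodevert",
                        Image: "gcr.io/google_containers/pause-amd64:3.0",
                    }},
                },
            },
        },
    }

    // If a pod owned by this Deployment ends up Failed with reason NodeAffinity,
    // the ReplicaSet controller creates a new one, which gets scheduled again.
    if _, err := client.AppsV1().Deployments(dep.Namespace).Create(context.TODO(), dep, metav1.CreateOptions{}); err != nil {
        panic(err)
    }
}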
(In reply to Seth Jennings from comment #13)
> Pods, once scheduled, cannot move nodes. So a bare pod, with no higher-level
> controller to recreate a new pod from a pod template (e.g. ReplicaSet,
> DaemonSet, etc.), will not restart.

So if all nodes come back up, all pods should have come back up as well (no hard-down nodes)?
I'm under the impression that there is still some confusion about what the expectation after a disaster outage is. For me it's like this:

--
Power outage hits all nodes.
Masters get powered on and form a quorate cluster.
Masters start to power up worker nodes and schedule pods (it does not matter for the pods where they get started).
Nodes start the assigned pods.
Ultimately, all the pods that were running before the outage are running again, even if on different hosts.
--

If the pods are not running again, I would consider that a bug, and I'm sure our customers would see it the same way.

If I understand the upstream PR correctly, pods should only get scheduled to a node once that node is synced and can actually run pods, is that correct?

Thanks!
Deferring to 4.7 as the test, expected behavior, and potential fix are all in dispute and there is no regression here.
TestBlocker removed
Upstream PR merged.

Upstream backports:
https://github.com/kubernetes/kubernetes/pull/97995
https://github.com/kubernetes/kubernetes/pull/97996
https://github.com/kubernetes/kubernetes/pull/97997

OpenShift PRs to follow.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days