Bug 1868645
Summary: | After a disaster recovery, pods are stuck in "NodeAffinity" state and not running | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Marko Karg <mkarg>
Component: | Node | Assignee: | Elana Hashman <ehashman>
Node sub component: | Kubelet | QA Contact: | Sunil Choudhary <schoudha>
Status: | CLOSED ERRATA | Docs Contact: |
Severity: | urgent | |
Priority: | urgent | CC: | abeekhof, abodhe, aos-bugs, dblack, decarr, ehashman, iheim, jokerman, mfojtik, nagrawal, rphillips, tsweeney, yjoseph, yprokule
Version: | 4.5 | Keywords: | Reopened
Target Milestone: | --- | Flags: | abodhe: needinfo+
Target Release: | 4.7.0 | |
Hardware: | All | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | Cause: The node is marked as Ready and admits pods before it has had a chance to sync. Consequence: Pod status can go out of sync at startup of a node that is not cordoned; sometimes many pods get stuck in NodeAffinity. Fix: Do not mark the node as Ready until it has synced with the API servers at least once. Result: Pods should no longer get stuck in NodeAffinity after e.g. a cold cluster restart. | |
Story Points: | --- | |
Clone Of: | | Environment: |
Last Closed: | 2021-02-24 15:15:36 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1878638, 1930960 | |
Attachments: | | |
Description
Marko Karg
2020-08-13 11:31:20 UTC
I've seen the same problem for 2 fluentd pods in the openshift-logging namespace and deleted them manually. After that they came up fine. Could this be some sort of race condition between worker nodes coming up and pods already being created? The must-gather is available at http://file.str.redhat.com/mkarg/bz1868645/must-gather.tgz

Can you get us the YAML for the pod? I don't see the affinity spec here. Or, ideally, dump it all with:

`oc get namespace nodevertical-rerun-10 -o yaml`
`oc get all -n nodevertical-rerun-10 -o yaml`

Thanks.

I've attached the outputs as namespace.yml and all.yml to this BZ. Let me know if you need anything else.

Created attachment 1711685 [details]
the namespace definition

Created attachment 1711686 [details]
the entire contents of the namespace
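For anyone who needs to repeat the manual cleanup described in the first comment at scale, here is a minimal client-go sketch. It is only a sketch, not a supported procedure: the namespace is this report's example, and it assumes the stuck pods show up as Failed with status reason "NodeAffinity" (which matches the events quoted later in this bug).

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumption: credentials come from the default kubeconfig location.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// "nodevertical-rerun-10" is the namespace from this report; substitute your own.
	pods, err := client.CoreV1().Pods("nodevertical-rerun-10").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		// Pods rejected by kubelet admission surface as Failed with reason "NodeAffinity".
		if p.Status.Phase != corev1.PodFailed || p.Status.Reason != "NodeAffinity" {
			continue
		}
		if len(p.OwnerReferences) == 0 {
			// A bare pod has no controller to recreate it; deleting it removes it for good.
			fmt.Printf("skipping bare pod %s/%s\n", p.Namespace, p.Name)
			continue
		}
		fmt.Printf("deleting stuck pod %s/%s\n", p.Namespace, p.Name)
		if err := client.CoreV1().Pods(p.Namespace).Delete(context.TODO(), p.Name, metav1.DeleteOptions{}); err != nil {
			fmt.Println("delete failed:", err)
		}
	}
}
```

Deleting is only a useful workaround for pods owned by a controller (DaemonSet, ReplicaSet, etc.) that will recreate them, which is why the sketch skips bare pods.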
Looks like those are bare pods without a controller to recreate them. Unfortunately the dump isn't from the same state - all the pods there are either Running or Pending.

The pods don't have explicit affinity but have a nodeSelector:

    nodeSelector:
      node-role.kubernetes.io/worker: ""
      nodevertical: "true"

and the default project node selector set on the namespace is

    openshift.io/node-selector: node-role.kubernetes.io/worker=

kube-scheduler thinks the node is fit to run the pod:

    ./namespaces/openshift-kube-scheduler/pods/openshift-kube-scheduler-master-2/kube-scheduler/kube-scheduler/logs/current.log:1392:2020-08-13T09:02:51.308294392Z I0813 09:02:51.308256 1 scheduler.go:731] pod nodevertical-rerun-10/nodevert-pod-999 is bound successfully on node "worker044", 123 nodes evaluated, 100 nodes were found feasible.

and the nodeName is set, yet the kubelet rejects the pod:

    Warning NodeAffinity 35m kubelet, worker044 Predicate NodeAffinity failed

The snapshotted labels on worker044 are:

    labels:
      beta.kubernetes.io/arch: amd64
      beta.kubernetes.io/os: linux
      kubernetes.io/arch: amd64
      kubernetes.io/hostname: worker044
      kubernetes.io/os: linux
      node-role.kubernetes.io/worker: ""
      node.openshift.io/os_id: rhcos
      nodevertical: "true"
      placement: logtest

which seem to match as well.

kube-scheduler evaluated the predicates as matching and assigned node worker044 to the pod, but the kubelet on that node thinks the labels don't match and fails the pod. It would be good to get kubelet logs from worker044 to help debug the issue. The YAML dump for a failed pod would also be useful. As the above points to the kubelet, I am sending it to the Node team to have a look.

The behaviour seems similar to https://github.com/kubernetes/kubernetes/issues/93338

This code looks suspicious, btw: https://github.com/kubernetes/kubernetes/issues/92067#issuecomment-643711902

    // getNodeAnyWay() must return a *v1.Node which is required by RunGeneralPredicates().
    // The *v1.Node is obtained as follows:
    // Return kubelet's nodeInfo for this node, except on error or if in standalone mode,
    // in which case return a manufactured nodeInfo representing a node with no pods,
    // zero capacity, and the default labels.
    func (kl *Kubelet) getNodeAnyWay() (*v1.Node, error) {
        if kl.kubeClient != nil {
            if n, err := kl.nodeLister.Get(string(kl.nodeName)); err == nil {
                return n, nil
            }
        }
        return kl.initialNode(context.TODO())
    }

Created attachment 1711707 [details]
worker044 logs (compressed)
created with
oc adm node-logs worker044 --since=-6d > /tmp/worker044.logs
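To make the label analysis above concrete, here is a minimal sketch (not the actual kubelet predicate code) of why the kubelet can reject a pod the scheduler already matched: if the admission check runs against a node object that has not synced yet - for example the manufactured node with only default labels that getNodeAnyWay() falls back to on a lister miss - the nodeSelector no longer matches and the pod fails with NodeAffinity. The "unsynced" label set below is an assumed illustration; the pod selector and the synced labels are taken from this report.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// nodeSelectorMatches is a simplified stand-in for the nodeSelector part of the
// NodeAffinity predicate: every key/value in the pod's nodeSelector must be
// present, with the same value, in the node's labels.
func nodeSelectorMatches(pod *corev1.Pod, node *corev1.Node) bool {
	for k, v := range pod.Spec.NodeSelector {
		got, ok := node.Labels[k]
		if !ok || got != v {
			return false
		}
	}
	return true
}

func main() {
	pod := &corev1.Pod{
		Spec: corev1.PodSpec{
			NodeSelector: map[string]string{
				"node-role.kubernetes.io/worker": "",
				"nodevertical":                   "true",
			},
		},
	}

	// Labels as snapshotted on worker044 (subset): the selector matches.
	synced := &corev1.Node{}
	synced.Labels = map[string]string{
		"kubernetes.io/hostname":         "worker044",
		"node-role.kubernetes.io/worker": "",
		"nodevertical":                   "true",
	}

	// A node object that has not synced its labels yet (assumed default-only set):
	// the same selector fails, producing a spurious "Predicate NodeAffinity failed".
	unsynced := &corev1.Node{}
	unsynced.Labels = map[string]string{"kubernetes.io/hostname": "worker044"}

	fmt.Println(nodeSelectorMatches(pod, synced))   // true
	fmt.Println(nodeSelectorMatches(pod, unsynced)) // false
}
```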
Forget about those log files please, the cluster was re-deployed in between so they are useless. Sorry about that.

Derek and I discussed the situation and, for this particular case of bare pods running on hard-downed nodes in a DR scenario, there shouldn't be an expectation that these pods restart on the same nodes. However, this can be an issue in less contrived situations, like the ones mentioned in the upstream issues, and Derek is coding up a PR for upstream.

The expectation was not that pods would get restarted on the same nodes as before the outage, but that they would all come up running at some point. Sorry if that wasn't clear.

Pods, once scheduled, cannot move nodes. So a bare pod, with no higher-level controller to recreate a new pod from a pod template (i.e. ReplicaSet, DaemonSet, etc.), will not restart.

(In reply to Seth Jennings from comment #13)
> Pods, once scheduled, cannot move nodes. So a bare pod, with no higher
> level controller to recreate a new pod from a pod template (i.e. ReplicaSet,
> DaemonSet, etc), will not restart.

So if all nodes come back up, all pods should have come back as well (no hard-down nodes)? I'm under the impression that there is still some confusion about what the expectation after a disaster outage is. For me it's like this:

- Power outage for all nodes.
- Masters get powered on and form a quorate cluster.
- Masters start to power up worker nodes and schedule pods (it does not matter for the pods where they get started).
- Nodes start the assigned pods.
- Ultimately, all the pods that were running before the outage are running again, even on different hosts.

If the pods are not running again, I would consider that a bug, and I'm sure our customers would see it the same way. If I understand the upstream PR correctly, pods should only get scheduled to a node when that node is synced and thus can actually run pods, is that correct? Thanks!

Deferring to 4.7 as the test, expected behavior, and potential fix are all in dispute and there is no regression here. TestBlocker removed.

Upstream PR merged. Upstream backports:

https://github.com/kubernetes/kubernetes/pull/97995
https://github.com/kubernetes/kubernetes/pull/97996
https://github.com/kubernetes/kubernetes/pull/97997

OpenShift PRs to follow.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.
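For reference, a minimal sketch of the idea behind the fix described in the Doc Text and the upstream backports above. This is an illustration only, not the actual kubelet patch; syncedOnce and runtimeHealthy are assumed inputs that stand in for the kubelet's own status-sync and runtime health signals. The point is that the node does not report Ready, and therefore does not receive or admit pods, until it has synced its state at least once.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// readyCondition gates NodeReady on an extra "synced at least once" signal in
// addition to runtime health, so admission never runs against a default/stale
// node object.
func readyCondition(syncedOnce, runtimeHealthy bool) corev1.NodeCondition {
	cond := corev1.NodeCondition{
		Type:   corev1.NodeReady,
		Status: corev1.ConditionFalse,
		Reason: "KubeletNotReady",
	}
	if syncedOnce && runtimeHealthy {
		cond.Status = corev1.ConditionTrue
		cond.Reason = "KubeletReady"
	}
	return cond
}

func main() {
	// Before the first successful sync the node stays NotReady, so pods are not
	// bound to it prematurely after e.g. a cold cluster restart.
	fmt.Println(readyCondition(false, true).Status) // False
	fmt.Println(readyCondition(true, true).Status)  // True
}
```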