This scheduling issue appears to be unique to VMs. We're on CNV v2.6.5 on (pre-GA) OCP 4.8.0; we also saw it on OCP 4.7 before a recent upgrade. We did NOT see this on previous CNV 2.6.x versions, but I'm not sure exactly when it started happening.

I started 50 simple pods and the distribution looks normal: each of the 24 nodes gets 2-3 pods, including the 6 nodes that VMs seem to "ignore". Expected output from the kube-sched pod logs, ex:

I0723 20:24:00.962967 1 scheduler.go:604] "Successfully bound pod to node" pod="default/mypod21" node="worker10" evaluatedNodes=24 feasibleNodes=24
I0723 20:24:00.968189 1 scheduler.go:604] "Successfully bound pod to node" pod="default/mypod22" node="worker16" evaluatedNodes=24 feasibleNodes=24

Also an example of a CNV DataVolume pod being assigned to an "ignored" node just fine:

I0723 20:52:22.671062 1 scheduler.go:604] "Successfully bound pod to node" pod="default/importer-rhel8.3-dv-source" node="worker02" evaluatedNodes=24 feasibleNodes=24

But if I start VMs, even very simple ones w/ only a containerDisk (i.e. no PVC), default scheduling just ignores those nodes. My VM definition, with some resources defined which the scheduler should be balancing:
http://perf1.perf.lab.eng.bos.redhat.com/pub/jhopper/OCP4/debug/testvm.yaml

# nodes=`oc get node | awk '{print $1}' | grep -v NAME`; for i in $nodes; do echo "$i: "; oc get vmi | grep test | grep $i | wc -l; done
master-0: 0
master-1: 3
master-2: 3
worker01: 3
worker02: 0
worker03: 0
worker04: 3
worker05: 3
worker06: 2
worker07: 0
worker08: 3
worker09: 3
worker10: 0
worker11: 3
worker12: 3
worker13: 3
worker14: 3
worker15: 2
worker16: 0
worker17: 3
worker18: 2
worker19: 3
worker20: 3
worker21: 2

(just a note: our masters are schedulable)

kube-sched logs look the same, so no clues there:

I0723 21:20:08.726475 1 scheduler.go:604] "Successfully bound pod to node" pod="default/virt-launcher-testvm2-qw9ls" node="worker18" evaluatedNodes=24 feasibleNodes=24
I do see one anomaly, but only on worker02: it is the only node w/ an e2e-az label defined??

e2e-az-name: e2e-az3
f:e2e-az-name: {}
> kube-sched logs look the same, so no clues there:
> I0723 21:20:08.726475 1 scheduler.go:604] "Successfully bound pod to node" pod="default/virt-launcher-testvm2-qw9ls" node="worker18" evaluatedNodes=24 feasibleNodes=24

This says all the nodes were evaluated and none of them were filtered out. It's also possible that the ignored nodes just get a very low score. Could you increase the log level of the scheduler to 10 and collect the logs, so we can see which scores are assigned to each node?

To increase the log level, you can edit the kubescheduler/cluster object and set the .spec.logLevel field to TraceAll. The default log level is Normal.
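For reference, a minimal sketch of that edit (field names per the OpenShift KubeScheduler operator object; verify against your cluster version):

```yaml
# Hedged sketch: raise scheduler verbosity via the cluster-scoped
# KubeScheduler operator object (e.g. `oc edit kubescheduler cluster`).
apiVersion: operator.openshift.io/v1
kind: KubeScheduler
metadata:
  name: cluster
spec:
  logLevel: TraceAll   # default is Normal
```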
I mean, can you increase the log level to TraceAll and schedule the VM pods again?
Ah, looks like we have a clue: all workers that VMs "ignore" have a score of 0 for "ImageLocality" (?) for virt-launcher:

I0727 20:22:34.831369 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="worker03" score=0
I0727 20:22:34.831376 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="worker08" score=23
I0727 20:22:34.831383 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="worker05" score=23
I0727 20:22:34.831389 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="worker20" score=23
I0727 20:22:34.831396 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="master-0" score=0
I0727 20:22:34.831403 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="worker16" score=0
I0727 20:22:34.831409 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="worker01" score=23
I0727 20:22:34.831416 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="worker17" score=23
I0727 20:22:34.831423 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="worker09" score=23
I0727 20:22:34.831429 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="worker11" score=23
I0727 20:22:34.831436 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="worker12" score=23
I0727 20:22:34.831442 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="worker10" score=0
I0727 20:22:34.831449 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="worker21" score=23
I0727 20:22:34.831456 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="worker04" score=23
I0727 20:22:34.831463 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="master-1" score=23
I0727 20:22:34.831469 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="worker06" score=23
I0727 20:22:34.831476 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="worker02" score=0
I0727 20:22:34.831496 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="worker07" score=0
I0727 20:22:34.831508 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="worker14" score=23
I0727 20:22:34.831519 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="master-2" score=23
I0727 20:22:34.831534 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="worker13" score=23
I0727 20:22:34.831545 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="worker15" score=23
I0727 20:22:34.831556 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="worker18" score=23
I0727 20:22:34.831569 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="worker19" score=23

Full log here: http://perf1.perf.lab.eng.bos.redhat.com/pub/jhopper/OCP4/debug/kube_sched_debug.txt

This was scheduling 50 simple containerDisk VMs: http://perf1.perf.lab.eng.bos.redhat.com/pub/jhopper/OCP4/debug/testvm.yaml

I am not sure if this is just looking for the virt-launcher image, because the "ignored" nodes do have that image:

worker01 ("good" node):
[core@worker01 ~]$ sudo crictl image | grep virt-launcher
registry.redhat.io/container-native-virtualization/virt-launcher <none> ac3dcebd26c92 678MB
registry.redhat.io/container-native-virtualization/virt-launcher <none> 133e89e7a991c 678MB
registry.redhat.io/container-native-virtualization/virt-launcher <none> 28d97d0bf9eab 682MB

worker02 (score=0 node):
[core@worker02 ~]$ sudo crictl image | grep virt-launcher
registry.redhat.io/container-native-virtualization/virt-launcher <none> ac3dcebd26c92 678MB
registry.redhat.io/container-native-virtualization/virt-launcher <none> 133e89e7a991c 678MB
registry.redhat.io/container-native-virtualization/virt-launcher <none> 28d97d0bf9eab 682MB
Side note: not sure what the reason for the score=0 ImageLocality is yet, but in general I wonder if the ImageLocality score of 23 is too high in relation to the NodeResources scores (70s to 90s in this case), since it could cause some large resource imbalances if the image is actually not there. I noticed some other pods (-n openshift-marketplace) vary in ImageLocality score from 8-19. Maybe it is based on image size?
(In reply to Jenifer Abrams from comment #8)
> Side note: not sure what's the reason for the score=0 ImageLocality yet, but
> in general I wonder if the ImageLocality score of 23 is too high in relation
> to the NodeResources scores (70s to 90s in this case) since it could cause
> some large resource imbalances if the image is actually not there? I
> noticed some other pods (-n openshift-marketplace) vary the ImageLocality
> score from 8-19. Maybe it is based on image size?

Note that the containerdisk itself also counts, not just virt-launcher (looking at the score plugin, init containers are not counted). Is the containerdisk bigger than before?
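Yes, image size matters. A rough, hypothetical sketch of the upstream ImageLocality scoring (assuming the thresholds and spread weighting used by the kube-scheduler plugin of that era: sums below ~23MB score 0, sums above ~1000MB score the max, and each image's size is weighted by the fraction of nodes that report it):

```python
# Hedged sketch of the kube-scheduler ImageLocality score plugin.
# Thresholds are assumptions based on the upstream plugin; treat the
# exact numbers as illustrative, not authoritative.
MB = 1024 * 1024
MIN_THRESHOLD = 23 * MB     # weighted sums below this score 0
MAX_THRESHOLD = 1000 * MB   # weighted sums above this score the max
MAX_NODE_SCORE = 100

def image_locality_score(image_sizes_on_node, nodes_with_image, total_nodes):
    """image_sizes_on_node: bytes of each pod image already on this node.
    nodes_with_image: how many nodes report each of those images (same order).
    An image's contribution is scaled by how widely it is spread, so a
    rarely-present image does not dominate the score."""
    total = sum(size * (n / total_nodes)
                for size, n in zip(image_sizes_on_node, nodes_with_image))
    if total < MIN_THRESHOLD:
        return 0
    if total > MAX_THRESHOLD:
        return MAX_NODE_SCORE
    return int(MAX_NODE_SCORE * (total - MIN_THRESHOLD)
               / (MAX_THRESHOLD - MIN_THRESHOLD))
```

Under a formula like this, a node where the kubelet does not report the image at all scores 0 no matter what crictl shows locally, which would match the score=0 lines in the logs above.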
Ok, here is my confusion... If I log onto each node I see the same crictl images:

"good" node:
[core@worker01 ~]$ sudo crictl image | grep virt-launcher
registry.redhat.io/container-native-virtualization/virt-launcher <none> 02b9d22a09975 716MB
registry.redhat.io/container-native-virtualization/virt-launcher <none> ac3dcebd26c92 678MB
registry.redhat.io/container-native-virtualization/virt-launcher <none> 133e89e7a991c 678MB
registry.redhat.io/container-native-virtualization/virt-launcher <none> 28d97d0bf9eab 682MB

"ignored" node:
[core@worker02 ~]$ sudo crictl image | grep virt-launcher
registry.redhat.io/container-native-virtualization/virt-launcher <none> 02b9d22a09975 716MB
registry.redhat.io/container-native-virtualization/virt-launcher <none> ac3dcebd26c92 678MB
registry.redhat.io/container-native-virtualization/virt-launcher <none> 133e89e7a991c 678MB
registry.redhat.io/container-native-virtualization/virt-launcher <none> 28d97d0bf9eab 682MB

But I do see the virt-launcher images missing in some of the 'oc get node -o yaml' output. For ex.

oc get node $i -o yaml | grep "registry.redhat.io/container-native-virtualization/virt-launcher"

node: worker01
- registry.redhat.io/container-native-virtualization/virt-launcher@sha256:09e0329dfe89d555f04ef56a3c6391c966404cf4ad608cddd690a8eafac136f9
- registry.redhat.io/container-native-virtualization/virt-launcher@sha256:1b951d299a36d557600bb8376d8879621f390c059c36bedd856dcb6a4ec5d6d0
- registry.redhat.io/container-native-virtualization/virt-launcher@sha256:97c5d5fe8006f76e28f5bdabe65260eea826c805a4f1184272f966ae4d622164
- registry.redhat.io/container-native-virtualization/virt-launcher@sha256:f1bba5bca1617a15c8e56ba804c0003f3bcbc34bcbe578b3bd51699044bfdb8a
- registry.redhat.io/container-native-virtualization/virt-launcher@sha256:61766becbae704a808fd191731064a4c1ff0faf158dba3f8de7d1e22ce2591b3
- registry.redhat.io/container-native-virtualization/virt-launcher@sha256:6dffa5e2229188d7f301163d73b0b4119841248c8e5d75d76a87e1e7ba6f1a8a
- registry.redhat.io/container-native-virtualization/virt-launcher@sha256:3da3f81b2dc2c6ad885587c7a3e90facb2bc68e03444e29e63c9c1328ab5806c
- registry.redhat.io/container-native-virtualization/virt-launcher@sha256:c5402ae650feb4be9327d826a161a6de876267bd06a0cbc63069bdcff7bab48b

node: worker02
<none>

And we know worker02 does in fact have a working virt-launcher image (as crictl shows), since VMs have run on that node (scheduled when the others were cordoned).
(In reply to Roman Mohr from comment #9)
> Note that the containerdisk itself also counts, not just virt-launcher
> (looking at the score plugin init containers are not counted). Is the
> containerdisk bigger than before?

Good to know the container-disk image is also considered, but I tried another test. I ran "sudo crictl image | grep container-disk" on all nodes; (some) nodes only had the "quay.io/kubevirt/cirros-container-disk-demo" image. So this time I started 100 VMs w/ fedora-cloud-container-disk-demo instead of cirros, but same behavior: the nodes with "supposedly missing" virt-launcher images (which crictl says are there) never download the fedora container-disk and never get VMs:

# nodes=`oc get node | awk '{print $1}' | grep -v NAME`; for i in $nodes; do echo "$i: "; oc get vmi | grep testfedora | grep $i | wc -l; done
master-0: 0
master-1: 5
master-2: 5
worker01: 6
worker02: 0
worker03: 0
worker04: 6
worker05: 6
worker06: 5
worker07: 0
worker08: 5
worker09: 6
worker10: 0
worker11: 6
worker12: 6
worker13: 6
worker14: 5
worker15: 5
worker16: 0
worker17: 6
worker18: 5
worker19: 6
worker20: 6
worker21: 5

Maybe the scheduler is behaving as it should based on the plugin input, but there appears to be some issue w/ how the image is "advertised" as available on nodes or something??
Just a note that CNV auto-updated to 4.8.0 last night so we were on the latest before the last round of sched testing.
Kubelet is responsible for populating the list of images on a node. Changing the component for further investigation.
Not sure if the image list shown in 'oc get node -o yaml' is the same as what the kubelet provides to the ImageLocality plugin, but if so, it seems there is a limit on the number of images reported? ex:

# nodes=`oc get node | awk '{print $1}' | grep -v NAME`; for i in $nodes; do echo "$i:"; echo -n " crictl count:"; ssh core@$i "sudo crictl images | grep registry | wc -l"; echo -n " node yaml count:"; oc get node $i -o yaml | grep registry | wc -l ; done
master-0:
 crictl count:120
 node yaml count:41
master-1:
 crictl count:57
 node yaml count:35
master-2:
 crictl count:96
 node yaml count:46
worker01:
 crictl count:78
 node yaml count:45
worker02:
 crictl count:237
 node yaml count:43
Tried an experiment w/ an "ignored" node, worker16. I deleted all VMs on the cluster, then drained & rebooted all nodes so everything was fresh. Checked 'oc get node worker16 -o yaml': no virt-launcher images were listed there, even though crictl on the node does show the images. Scheduled 50 ephemeral VMs (fedora container-disk again); 0 VMs landed on the "ignored" nodes (master-0, worker02, 03, 07, 10, 16) as usual.

>> I0730 19:17:21.115843 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testfedoravm2-d5knf" plugin="ImageLocality" node="worker16" score=0

btw, to prove it is not a problem for all container images, here is another pod that does get an ImageLocality hit for worker16:

>> I0730 19:31:15.250323 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="openshift-marketplace/redhat-marketplace-hwwpl" plugin="ImageLocality" node="worker16" score=15

The experiment: on worker16 I ran 'sudo podman image prune' (y). For that node it took the total image count from 202 to 147. Now 'oc get node worker16 -o yaml' does show virt-launcher images! Then I scheduled 50 more VMs. Now worker16 gets VMs!

I0730 19:35:16.935976 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testfedoravm100-jkd4j" plugin="ImageLocality" node="worker16" score=26
Found out the kubelet is configured to report only 50 images to the API server [1]. The code in [1] does a truncation before a sort, so the kubelet might report images differently every time the function is run. I will open an upstream bug for this.

We can disable the image limit functionality with this KubeletConfig to see if it clears up the issue with the kubelet not reporting an image that exists:

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: custom-config
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: enabled
  kubeletConfig:
    nodeStatusMaxImages: -1

1. https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/nodestatus/setters.go#L459-L477
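Whatever the ordering details, the effect of the cap is easy to see. A minimal sketch (not the actual kubelet code; names and sizes are made up) showing that once a node holds more images than nodeStatusMaxImages, some locally-present images never appear in node.status.images:

```python
# Hypothetical sketch of the nodeStatusMaxImages cap (default 50).
# Not the exact kubelet implementation; it only demonstrates that with
# more local images than the cap, some images are dropped from the node
# status even though crictl still lists them on the node.
def reported_images(local_images, max_images=50):
    """local_images: list of (name, size_bytes) tuples.
    Rank by size descending, then keep at most max_images; a value of -1
    (as in the KubeletConfig above) disables the truncation."""
    ranked = sorted(local_images, key=lambda img: img[1], reverse=True)
    if max_images >= 0:
        ranked = ranked[:max_images]
    return [name for name, _ in ranked]

# 60 large catalog/index images plus one ~680MB virt-launcher image:
images = [(f"catalog-{i}", (1 + i) * 1024**3) for i in range(60)]
images.append(("virt-launcher", 680 * 1024**2))

# The ~680MB image is crowded out of the reported top 50...
assert "virt-launcher" not in reported_images(images, max_images=50)
# ...but is reported once the cap is disabled.
assert "virt-launcher" in reported_images(images, max_images=-1)
```

This matches the symptom here: many multi-GB operator-index images per node push the virt-launcher image out of the reported list, so ImageLocality scores the node 0.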
Upstream Issue: https://github.com/kubernetes/kubernetes/issues/104136
The upstream issue got closed, so I wonder: what and where is the bug?
To recap: the truncated list of images seems to be the root cause. This is not CNV specific; it affects clusters with many container images per node. The workaround is to set "nodeStatusMaxImages: -1". This behavior affects the scheduling of all pods (not just CNV), and right now the consensus seems to be that this is generally not an issue in OpenShift. For CNV we should explore whether a) we can make the point that it's also an issue for pods in general and get the limit increased or the default of nodeStatusMaxImages changed, or b) find a CNV-only solution.
Thank you for the summary, Fabian! Going to close this for now... Please reopen if necessary.
Ryan, are there any optimizations we could do here? A known-issue note, an alert, some API change/scheduler change to compensate for this?
Fabian, I really like your suggestion here. Our team recently ran into this bug, and it resulted in a lot of troubleshooting before we were ultimately directed to this bug. I suspect that as OpenShift Virtualization becomes more popular, more and more customers are going to face this issue and go through the same, unnecessary efforts. Has an RFE been filed against this? Has the official documentation been updated to reflect this issue?
Keith, did you encounter this in an OpenShift Virtualization setting? For CNV we will be adding a documentation note about this: https://issues.redhat.com/browse/CNV-13681
Looking at the previous issue a little closer, we see that almost all nodes were at the max image limit:

$ for N in ./cluster-scoped-resources/core/nodes/* ; do shyaml get-length status.images < $N ; echo " ${N/…/}" ; done
50 ./cluster-scoped-resources/core/nodes/control-013.yaml
50 ./cluster-scoped-resources/core/nodes/control-014.yaml
50 ./cluster-scoped-resources/core/nodes/control-015.yaml
50 ./cluster-scoped-resources/core/nodes/worker-cnv-001.yaml
50 ./cluster-scoped-resources/core/nodes/worker-cnv-002.yaml
50 ./cluster-scoped-resources/core/nodes/worker-cnv-003.yaml
50 ./cluster-scoped-resources/core/nodes/worker-cnv-004.yaml
30 ./cluster-scoped-resources/core/nodes/worker-cnv-005.yaml
36 ./cluster-scoped-resources/core/nodes/worker-cnv-006.yaml
50 ./cluster-scoped-resources/core/nodes/worker-cnv-007.yaml
50 ./cluster-scoped-resources/core/nodes/worker-cnv-008.yaml
42 ./cluster-scoped-resources/core/nodes/worker-cnv-009.yaml
50 ./cluster-scoped-resources/core/nodes/worker-cnv-010.yaml
50 ./cluster-scoped-resources/core/nodes/worker-cnv-011.yaml
50 ./cluster-scoped-resources/core/nodes/worker-cnv-012.yaml
50 ./cluster-scoped-resources/core/nodes/worker-cnv-013.yaml
50 ./cluster-scoped-resources/core/nodes/worker-cnv-014.yaml
50 ./cluster-scoped-resources/core/nodes/worker-cnv-015.yaml
50 ./cluster-scoped-resources/core/nodes/worker-cnv-016.yaml
50 ./cluster-scoped-resources/core/nodes/worker-cnv-017.yaml

In general there were 370+ unique images across all nodes:

$ for N in ./cluster-scoped-resources/core/nodes/* ; do shyaml get-values status.images < $N ; echo " ${N/…/}" ; done | sort -u | grep registry | wc -l
387

Removing shasums - specifically pinned images - we end up with 90%+ fewer images:

$ for N in ./cluster-scoped-resources/core/nodes/* ; do shyaml get-values status.images < $N ; echo " ${N/…/}" ; done | sort -u | grep registry | sed "s/@sha256.*//" | sort -u | wc -l
33

Drilling a little into the data, we see that a few logical images are pinned many times and lead to the high number of images on the nodes:

2 registry.access.redhat.com/openshift3/ose@sha256
2 registry.redhat.io/container-native-virtualization/bridge-marker@sha256
2 registry.redhat.io/container-native-virtualization/cnv-containernetworking-plugins@sha256
2 registry.redhat.io/container-native-virtualization/cnv-must-gather-rhel8@sha256
2 registry.redhat.io/container-native-virtualization/kubernetes-nmstate-handler-rhel8@sha256
2 registry.redhat.io/container-native-virtualization/kubevirt-template-validator@sha256
2 registry.redhat.io/container-native-virtualization/virt-artifacts-server@sha256
2 registry.redhat.io/container-native-virtualization/virt-handler@sha256
2 registry.redhat.io/ocs4/ocs-rhel8-operator@sha256
2 registry.redhat.io/openshift4/ose-csi-node-driver-registrar@sha256
2 registry.redhat.io/rhel8/postgresql-12@sha256
3 registry.access.redhat.com/openshift3/ose
3 registry.openshift-image-registry.svc
3 registry.redhat.io/container-native-virtualization/cnv-must-gather-rhel8
4 registry.redhat.io/container-native-virtualization/virt-launcher@sha256
4 registry.redhat.io/rhel8/rhel-guest-image@sha256
5 registry.redhat.io/rhel8/rhel-guest-image
6 registry.redhat.io/ocs4/cephcsi-rhel8@sha256
6 registry.redhat.io/ocs4/mcg-core-rhel8@sha256
6 registry.redhat.io/ocs4/rook-ceph-rhel8-operator@sha256
45 registry.redhat.io/redhat/redhat-operator-index@sha256
47 registry.redhat.io/redhat/redhat-operator-index
50 registry.redhat.io/redhat/redhat-marketplace-index@sha256
52 registry.redhat.io/redhat/redhat-marketplace-index
100 registry.redhat.io/redhat/certified-operator-index@sha256
102 registry.redhat.io/redhat/certified-operator-index
130 registry.redhat.io/redhat/community-operator-index@sha256
132 registry.redhat.io/redhat/community-operator-index

Usually un-needed images will be pruned by GC as described in https://docs.openshift.com/container-platform/4.9/nodes/nodes/nodes-nodes-garbage-collection.html

IIUIC, the decision to prune images is purely driven by the node's disk usage. From a node perspective this is fine. However, I think this bug shows that node-level GC should also be driven by the number of images present on a node in relation to the max image count on the cluster. In other words: images should be pruned if the number of local images is close to or above the max image count that the scheduler is considering. Ryan, thoughts?
The other upstream discussion is at https://github.com/kubernetes/kubernetes/issues/93488
Reopening this issue, as it remains
Release note added to OpenShift Virtualization 4.10:

* If a single node contains more than 50 images, pod scheduling might be imbalanced across nodes. This is because the list of images on a node is shortened to 50 by default. (link:https://bugzilla.redhat.com/show_bug.cgi?id=1984442[*BZ#1984442*])
** As a workaround, you can disable the image limit by xref:../nodes/nodes/nodes-nodes-managing.adoc#nodes-nodes-managing[editing the `KubeletConfig` object] and setting the value of `nodeStatusMaxImages` to `-1`.

@rphillips When this issue is resolved, please change the "Doc Type" field from known issue to Bug Fix so that we can remove the known issue from the release notes. Thank you
Docs removed release note from OpenShift Virtualization 4.12 and added a note to the documentation - see https://bugzilla.redhat.com/show_bug.cgi?id=1995518. Removed doc text flag.