This scheduling issue appears to be unique to VMs. We're on CNV v2.6.5 on (pre-GA) OCP 4.8.0; we also saw it on OCP 4.7 before a recent upgrade. We did NOT see this on previous CNV 2.6.x versions, but I'm not sure exactly when it started happening.

I started 50 simple pods and the distribution looks normal: each of the 24 nodes gets 2-3 pods, including the 6 nodes that VMs seem to "ignore". Expected output from the kube-sched pod logs, ex:

I0723 20:24:00.962967 1 scheduler.go:604] "Successfully bound pod to node" pod="default/mypod21" node="worker10" evaluatedNodes=24 feasibleNodes=24
I0723 20:24:00.968189 1 scheduler.go:604] "Successfully bound pod to node" pod="default/mypod22" node="worker16" evaluatedNodes=24 feasibleNodes=24

Also an example of a CNV DataVolume pod being assigned to an "ignored" node just fine:

I0723 20:52:22.671062 1 scheduler.go:604] "Successfully bound pod to node" pod="default/importer-rhel8.3-dv-source" node="worker02" evaluatedNodes=24 feasibleNodes=24

But if I start VMs, even very simple ones w/ only a containerDisk (i.e. no PVC), default scheduling just ignores those nodes. My VM definition, with some resources defined which the scheduler should be balancing:
http://perf1.perf.lab.eng.bos.redhat.com/pub/jhopper/OCP4/debug/testvm.yaml

# nodes=`oc get node | awk '{print $1}' | grep -v NAME`; for i in $nodes; do echo "$i: "; oc get vmi | grep test | grep $i | wc -l; done
master-0: 0
master-1: 3
master-2: 3
worker01: 3
worker02: 0
worker03: 0
worker04: 3
worker05: 3
worker06: 2
worker07: 0
worker08: 3
worker09: 3
worker10: 0
worker11: 3
worker12: 3
worker13: 3
worker14: 3
worker15: 2
worker16: 0
worker17: 3
worker18: 2
worker19: 3
worker20: 3
worker21: 2

(just a note: our masters are schedulable)

kube-sched logs look the same, so no clues there:

I0723 21:20:08.726475 1 scheduler.go:604] "Successfully bound pod to node" pod="default/virt-launcher-testvm2-qw9ls" node="worker18" evaluatedNodes=24 feasibleNodes=24
I do see one anomaly, but only on worker02: it is the only node w/ an e2e-az label defined??

e2e-az-name: e2e-az3
f:e2e-az-name: {}
> kube-sched logs look the same, so no clues there:
> I0723 21:20:08.726475 1 scheduler.go:604] "Successfully bound pod to node" pod="default/virt-launcher-testvm2-qw9ls" node="worker18" evaluatedNodes=24 feasibleNodes=24

This says all the nodes were evaluated and none of them were filtered out. It's also possible that the ignored nodes just get a very low score. Could you increase the log level of the scheduler to 10 and collect the logs, so we can see which scores are assigned to each node?

To increase the log level, you can edit the kubescheduler/cluster object and set the .spec.logLevel field to TraceAll. The default log level is Normal.
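For reference, a minimal sketch of that edit (field names per the OpenShift KubeScheduler operator object; verify against your cluster version):

```yaml
# Hedged sketch: raise scheduler verbosity via the cluster-scoped
# KubeScheduler operator object (e.g. `oc edit kubescheduler cluster`).
apiVersion: operator.openshift.io/v1
kind: KubeScheduler
metadata:
  name: cluster
spec:
  logLevel: TraceAll   # default is Normal
```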
I mean, can you increase the log level to TraceAll and schedule the VM pods again?
Ah, looks like we have a clue: all workers that VMs "ignore" have a score of 0 for "ImageLocality" (?) for virt-launcher:

I0727 20:22:34.831369 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="worker03" score=0
I0727 20:22:34.831376 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="worker08" score=23
I0727 20:22:34.831383 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="worker05" score=23
I0727 20:22:34.831389 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="worker20" score=23
I0727 20:22:34.831396 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="master-0" score=0
I0727 20:22:34.831403 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="worker16" score=0
I0727 20:22:34.831409 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="worker01" score=23
I0727 20:22:34.831416 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="worker17" score=23
I0727 20:22:34.831423 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="worker09" score=23
I0727 20:22:34.831429 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="worker11" score=23
I0727 20:22:34.831436 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="worker12" score=23
I0727 20:22:34.831442 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="worker10" score=0
I0727 20:22:34.831449 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="worker21" score=23
I0727 20:22:34.831456 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="worker04" score=23
I0727 20:22:34.831463 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="master-1" score=23
I0727 20:22:34.831469 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="worker06" score=23
I0727 20:22:34.831476 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="worker02" score=0
I0727 20:22:34.831496 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="worker07" score=0
I0727 20:22:34.831508 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="worker14" score=23
I0727 20:22:34.831519 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="master-2" score=23
I0727 20:22:34.831534 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="worker13" score=23
I0727 20:22:34.831545 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="worker15" score=23
I0727 20:22:34.831556 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="worker18" score=23
I0727 20:22:34.831569 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testvm18-w7h7z" plugin="ImageLocality" node="worker19" score=23

Full log here: http://perf1.perf.lab.eng.bos.redhat.com/pub/jhopper/OCP4/debug/kube_sched_debug.txt

This was scheduling 50 simple containerDisk VMs: http://perf1.perf.lab.eng.bos.redhat.com/pub/jhopper/OCP4/debug/testvm.yaml

I am not sure if this is just looking for the virt-launcher image, because the "ignored" nodes do have that image:

worker01 ("good" node):
[core@worker01 ~]$ sudo crictl image | grep virt-launcher
registry.redhat.io/container-native-virtualization/virt-launcher <none> ac3dcebd26c92 678MB
registry.redhat.io/container-native-virtualization/virt-launcher <none> 133e89e7a991c 678MB
registry.redhat.io/container-native-virtualization/virt-launcher <none> 28d97d0bf9eab 682MB

worker02 (score=0 node):
[core@worker02 ~]$ sudo crictl image | grep virt-launcher
registry.redhat.io/container-native-virtualization/virt-launcher <none> ac3dcebd26c92 678MB
registry.redhat.io/container-native-virtualization/virt-launcher <none> 133e89e7a991c 678MB
registry.redhat.io/container-native-virtualization/virt-launcher <none> 28d97d0bf9eab 682MB
Side note: not sure what the reason for the score=0 ImageLocality is yet, but in general I wonder if the ImageLocality score of 23 is too high in relation to the NodeResources scores (70s to 90s in this case), since it could cause some large resource imbalances if the image is actually not there. I noticed some other pods (-n openshift-marketplace) vary in ImageLocality score from 8-19. Maybe it is based on image size?
(In reply to Jenifer Abrams from comment #8)
> Side note: not sure what's the reason for the score=0 ImageLocality yet, but
> in general I wonder if the ImageLocality score of 23 is too high in relation
> to the NodeResources scores (70s to 90s in this case) since it could cause
> some large resource imbalances if the image is actually not there? I
> noticed some other pods (-n openshift-marketplace) vary the ImageLocality
> score from 8-19. Maybe it is based on image size?

Note that the containerdisk itself also counts, not just virt-launcher (looking at the score plugin, init containers are not counted). Is the containerdisk bigger than before?
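Yes, image size matters. A rough, hypothetical sketch of the upstream ImageLocality scoring (assuming the thresholds and spread weighting used by the kube-scheduler plugin of that era: sums below ~23MB score 0, sums above ~1000MB score the max, and each image's size is weighted by the fraction of nodes that report it):

```python
# Hedged sketch of the kube-scheduler ImageLocality score plugin.
# Thresholds are assumptions based on the upstream plugin; treat the
# exact numbers as illustrative, not authoritative.
MB = 1024 * 1024
MIN_THRESHOLD = 23 * MB     # weighted sums below this score 0
MAX_THRESHOLD = 1000 * MB   # weighted sums above this score the max
MAX_NODE_SCORE = 100

def image_locality_score(image_sizes_on_node, nodes_with_image, total_nodes):
    """image_sizes_on_node: bytes of each pod image already on this node.
    nodes_with_image: how many nodes report each of those images (same order).
    An image's contribution is scaled by how widely it is spread, so a
    rarely-present image does not dominate the score."""
    total = sum(size * (n / total_nodes)
                for size, n in zip(image_sizes_on_node, nodes_with_image))
    if total < MIN_THRESHOLD:
        return 0
    if total > MAX_THRESHOLD:
        return MAX_NODE_SCORE
    return int(MAX_NODE_SCORE * (total - MIN_THRESHOLD)
               / (MAX_THRESHOLD - MIN_THRESHOLD))
```

Under a formula like this, a node where the kubelet does not report the image at all scores 0 no matter what crictl shows locally, which would match the score=0 lines in the logs above.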
Ok, here is my confusion... If I log onto each node I see the same crictl images:

"good" node:
[core@worker01 ~]$ sudo crictl image | grep virt-launcher
registry.redhat.io/container-native-virtualization/virt-launcher <none> 02b9d22a09975 716MB
registry.redhat.io/container-native-virtualization/virt-launcher <none> ac3dcebd26c92 678MB
registry.redhat.io/container-native-virtualization/virt-launcher <none> 133e89e7a991c 678MB
registry.redhat.io/container-native-virtualization/virt-launcher <none> 28d97d0bf9eab 682MB

"ignored" node:
[core@worker02 ~]$ sudo crictl image | grep virt-launcher
registry.redhat.io/container-native-virtualization/virt-launcher <none> 02b9d22a09975 716MB
registry.redhat.io/container-native-virtualization/virt-launcher <none> ac3dcebd26c92 678MB
registry.redhat.io/container-native-virtualization/virt-launcher <none> 133e89e7a991c 678MB
registry.redhat.io/container-native-virtualization/virt-launcher <none> 28d97d0bf9eab 682MB

But I do see the virt-launcher images missing in some of the 'oc get node -o yaml' output. For ex.

oc get node $i -o yaml | grep "registry.redhat.io/container-native-virtualization/virt-launcher"

node: worker01
- registry.redhat.io/container-native-virtualization/virt-launcher@sha256:09e0329dfe89d555f04ef56a3c6391c966404cf4ad608cddd690a8eafac136f9
- registry.redhat.io/container-native-virtualization/virt-launcher@sha256:1b951d299a36d557600bb8376d8879621f390c059c36bedd856dcb6a4ec5d6d0
- registry.redhat.io/container-native-virtualization/virt-launcher@sha256:97c5d5fe8006f76e28f5bdabe65260eea826c805a4f1184272f966ae4d622164
- registry.redhat.io/container-native-virtualization/virt-launcher@sha256:f1bba5bca1617a15c8e56ba804c0003f3bcbc34bcbe578b3bd51699044bfdb8a
- registry.redhat.io/container-native-virtualization/virt-launcher@sha256:61766becbae704a808fd191731064a4c1ff0faf158dba3f8de7d1e22ce2591b3
- registry.redhat.io/container-native-virtualization/virt-launcher@sha256:6dffa5e2229188d7f301163d73b0b4119841248c8e5d75d76a87e1e7ba6f1a8a
- registry.redhat.io/container-native-virtualization/virt-launcher@sha256:3da3f81b2dc2c6ad885587c7a3e90facb2bc68e03444e29e63c9c1328ab5806c
- registry.redhat.io/container-native-virtualization/virt-launcher@sha256:c5402ae650feb4be9327d826a161a6de876267bd06a0cbc63069bdcff7bab48b

node: worker02
<none>

And we know worker02 does in fact have a working virt-launcher image (as crictl shows), since VMs have run on that node (scheduled when the others were cordoned).
(In reply to Roman Mohr from comment #9)
> Note that the containerdisk itself also counts, not just virt-launcher
> (looking at the score plugin init containers are not counted). Is the
> containerdisk bigger than before?

Good to know the container-disk image is also considered, but I tried another test. I ran "sudo crictl image | grep container-disk" on all nodes; (some) nodes only had the "quay.io/kubevirt/cirros-container-disk-demo" image. So this time I started 100 VMs w/ fedora-cloud-container-disk-demo instead of cirros, but same behavior: the nodes with "supposedly missing" virt-launcher images (which crictl says are there) never download the fedora container-disk and never get VMs:

# nodes=`oc get node | awk '{print $1}' | grep -v NAME`; for i in $nodes; do echo "$i: "; oc get vmi | grep testfedora | grep $i | wc -l; done
master-0: 0
master-1: 5
master-2: 5
worker01: 6
worker02: 0
worker03: 0
worker04: 6
worker05: 6
worker06: 5
worker07: 0
worker08: 5
worker09: 6
worker10: 0
worker11: 6
worker12: 6
worker13: 6
worker14: 5
worker15: 5
worker16: 0
worker17: 6
worker18: 5
worker19: 6
worker20: 6
worker21: 5

Maybe the scheduler is behaving as it should based on the plugin input, but there appears to be some issue w/ how the image is "advertised" as available on nodes or something??
Just a note that CNV auto-updated to 4.8.0 last night so we were on the latest before the last round of sched testing.
Kubelet is responsible for populating the list of images on a node. Changing the component for further investigation.
Not sure if the image list shown in 'oc get node -o yaml' is the same as what the kubelet provides to the ImageLocality plugin, but if so, it seems there is a limit on the number of images reported? ex:

# nodes=`oc get node | awk '{print $1}' | grep -v NAME`; for i in $nodes; do echo "$i:"; echo -n " crictl count:"; ssh core@$i "sudo crictl images | grep registry | wc -l"; echo -n " node yaml count:"; oc get node $i -o yaml | grep registry | wc -l ; done
master-0:
 crictl count:120
 node yaml count:41
master-1:
 crictl count:57
 node yaml count:35
master-2:
 crictl count:96
 node yaml count:46
worker01:
 crictl count:78
 node yaml count:45
worker02:
 crictl count:237
 node yaml count:43
Tried an experiment w/ an "ignored" node, worker16. I deleted all VMs on the cluster, then drained & rebooted all nodes so everything was fresh. Checked 'oc get node worker16 -o yaml': no virt-launcher images were listed there, even though crictl on the node does show the images. Scheduled 50 ephemeral VMs (fedora container-disk again); 0 VMs landed on the "ignored" nodes (master-0, worker02, 03, 07, 10, 16) as usual.

>> I0730 19:17:21.115843 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testfedoravm2-d5knf" plugin="ImageLocality" node="worker16" score=0

btw, to prove it is not a problem for all container images, here is another pod that does get an ImageLocality hit for worker16:

>> I0730 19:31:15.250323 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="openshift-marketplace/redhat-marketplace-hwwpl" plugin="ImageLocality" node="worker16" score=15

The experiment: on worker16 I ran 'sudo podman image prune' (y). For that node it took the total image count from 202 to 147. Now 'oc get node worker16 -o yaml' does show virt-launcher images! Then I scheduled 50 more VMs. Now worker16 gets VMs!

I0730 19:35:16.935976 1 generic_scheduler.go:441] "Plugin scored node for pod" pod="default/virt-launcher-testfedoravm100-jkd4j" plugin="ImageLocality" node="worker16" score=26
Found out the kubelet is configured to report only 50 images to the API server [1]. The code in [1] does a truncation before a sort, so the kubelet might report images differently every time the function is run. I will open an upstream bug for this.

We can disable the image limit functionality with this KubeletConfig to see if it clears up the issue with the kubelet not reporting an image that exists:

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: custom-config
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: enabled
  kubeletConfig:
    nodeStatusMaxImages: -1

1. https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/nodestatus/setters.go#L459-L477
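Whatever the ordering details, the effect of the cap is easy to see. A minimal sketch (not the actual kubelet code; names and sizes are made up) showing that once a node holds more images than nodeStatusMaxImages, some locally-present images never appear in node.status.images:

```python
# Hypothetical sketch of the nodeStatusMaxImages cap (default 50).
# Not the exact kubelet implementation; it only demonstrates that with
# more local images than the cap, some images are dropped from the node
# status even though crictl still lists them on the node.
def reported_images(local_images, max_images=50):
    """local_images: list of (name, size_bytes) tuples.
    Rank by size descending, then keep at most max_images; a value of -1
    (as in the KubeletConfig above) disables the truncation."""
    ranked = sorted(local_images, key=lambda img: img[1], reverse=True)
    if max_images >= 0:
        ranked = ranked[:max_images]
    return [name for name, _ in ranked]

# 60 large catalog/index images plus one ~680MB virt-launcher image:
images = [(f"catalog-{i}", (1 + i) * 1024**3) for i in range(60)]
images.append(("virt-launcher", 680 * 1024**2))

# The ~680MB image is crowded out of the reported top 50...
assert "virt-launcher" not in reported_images(images, max_images=50)
# ...but is reported once the cap is disabled.
assert "virt-launcher" in reported_images(images, max_images=-1)
```

This matches the symptom here: many multi-GB operator-index images per node push the virt-launcher image out of the reported list, so ImageLocality scores the node 0.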
Upstream Issue: https://github.com/kubernetes/kubernetes/issues/104136
The upstream issue got closed, so I wonder: what and where is the bug?
To recap: the truncated list of images seems to be the root cause. This is not CNV specific; it affects clusters with many container images per node. The workaround is to set "nodeStatusMaxImages: -1". This behavior affects the scheduling of all pods (not just CNV), and right now the consensus seems to be that this is generally not an issue in OpenShift. For CNV we should explore whether a) we can make the point that it's also an issue for pods in general and get the limit increased or the default of nodeStatusMaxImages changed, or b) find a CNV-only solution.
Thank you for the summary, Fabian! Going to close this for now... Please reopen if necessary.
Ryan, are there any optimizations we could do here? A known-issue note, an alert, some API change/scheduler change to compensate for this?
Fabian, I really like your suggestion here. Our team recently ran into this bug, and it resulted in a lot of troubleshooting before we were ultimately directed to this bug. I suspect that as OpenShift Virtualization becomes more popular, more and more customers are going to face this issue and go through the same, unnecessary efforts. Has an RFE been filed against this? Has the official documentation been updated to reflect this issue?
Keith, did you encounter this in an OpenShift Virtualization setting? For CNV we will be adding a documentation note about this: https://issues.redhat.com/browse/CNV-13681
Looking at the previous issue a little closer, we see that almost all nodes were at the max image limit:

$ for N in ./cluster-scoped-resources/core/nodes/* ; do shyaml get-length status.images < $N ; echo " ${N/…/}" ; done
50 ./cluster-scoped-resources/core/nodes/control-013.yaml
50 ./cluster-scoped-resources/core/nodes/control-014.yaml
50 ./cluster-scoped-resources/core/nodes/control-015.yaml
50 ./cluster-scoped-resources/core/nodes/worker-cnv-001.yaml
50 ./cluster-scoped-resources/core/nodes/worker-cnv-002.yaml
50 ./cluster-scoped-resources/core/nodes/worker-cnv-003.yaml
50 ./cluster-scoped-resources/core/nodes/worker-cnv-004.yaml
30 ./cluster-scoped-resources/core/nodes/worker-cnv-005.yaml
36 ./cluster-scoped-resources/core/nodes/worker-cnv-006.yaml
50 ./cluster-scoped-resources/core/nodes/worker-cnv-007.yaml
50 ./cluster-scoped-resources/core/nodes/worker-cnv-008.yaml
42 ./cluster-scoped-resources/core/nodes/worker-cnv-009.yaml
50 ./cluster-scoped-resources/core/nodes/worker-cnv-010.yaml
50 ./cluster-scoped-resources/core/nodes/worker-cnv-011.yaml
50 ./cluster-scoped-resources/core/nodes/worker-cnv-012.yaml
50 ./cluster-scoped-resources/core/nodes/worker-cnv-013.yaml
50 ./cluster-scoped-resources/core/nodes/worker-cnv-014.yaml
50 ./cluster-scoped-resources/core/nodes/worker-cnv-015.yaml
50 ./cluster-scoped-resources/core/nodes/worker-cnv-016.yaml
50 ./cluster-scoped-resources/core/nodes/worker-cnv-017.yaml

In general there were 370+ unique images across all nodes:

$ for N in ./cluster-scoped-resources/core/nodes/* ; do shyaml get-values status.images < $N ; echo " ${N/…/}" ; done | sort -u | grep registry | wc -l
387

Removing shasums - specifically pinned images - we end up with 90%+ fewer images:

$ for N in ./cluster-scoped-resources/core/nodes/* ; do shyaml get-values status.images < $N ; echo " ${N/…/}" ; done | sort -u | grep registry | sed "s/@sha256.*//" | sort -u | wc -l
33

Drilling a little into the data, we see that a few logical images are pinned many times and lead to the high number of images on the nodes:

2 registry.access.redhat.com/openshift3/ose@sha256
2 registry.redhat.io/container-native-virtualization/bridge-marker@sha256
2 registry.redhat.io/container-native-virtualization/cnv-containernetworking-plugins@sha256
2 registry.redhat.io/container-native-virtualization/cnv-must-gather-rhel8@sha256
2 registry.redhat.io/container-native-virtualization/kubernetes-nmstate-handler-rhel8@sha256
2 registry.redhat.io/container-native-virtualization/kubevirt-template-validator@sha256
2 registry.redhat.io/container-native-virtualization/virt-artifacts-server@sha256
2 registry.redhat.io/container-native-virtualization/virt-handler@sha256
2 registry.redhat.io/ocs4/ocs-rhel8-operator@sha256
2 registry.redhat.io/openshift4/ose-csi-node-driver-registrar@sha256
2 registry.redhat.io/rhel8/postgresql-12@sha256
3 registry.access.redhat.com/openshift3/ose
3 registry.openshift-image-registry.svc
3 registry.redhat.io/container-native-virtualization/cnv-must-gather-rhel8
4 registry.redhat.io/container-native-virtualization/virt-launcher@sha256
4 registry.redhat.io/rhel8/rhel-guest-image@sha256
5 registry.redhat.io/rhel8/rhel-guest-image
6 registry.redhat.io/ocs4/cephcsi-rhel8@sha256
6 registry.redhat.io/ocs4/mcg-core-rhel8@sha256
6 registry.redhat.io/ocs4/rook-ceph-rhel8-operator@sha256
45 registry.redhat.io/redhat/redhat-operator-index@sha256
47 registry.redhat.io/redhat/redhat-operator-index
50 registry.redhat.io/redhat/redhat-marketplace-index@sha256
52 registry.redhat.io/redhat/redhat-marketplace-index
100 registry.redhat.io/redhat/certified-operator-index@sha256
102 registry.redhat.io/redhat/certified-operator-index
130 registry.redhat.io/redhat/community-operator-index@sha256
132 registry.redhat.io/redhat/community-operator-index

Usually un-needed images will be pruned by GC as described in https://docs.openshift.com/container-platform/4.9/nodes/nodes/nodes-nodes-garbage-collection.html

IIUIC, the decision to prune images is purely driven by the node's disk usage. From a node perspective this is fine. However, I think this bug shows that node-level GC should also be driven by the number of images present on a node in relation to the max image count on the cluster. In other words: images should be pruned if the number of local images is close to or above the max image count that the scheduler is considering. Ryan, thoughts?
The other upstream discussion is at https://github.com/kubernetes/kubernetes/issues/93488
Reopening this issue, as it remains
Release note added to OpenShift Virtualization 4.10:

* If a single node contains more than 50 images, pod scheduling might be imbalanced across nodes. This is because the list of images on a node is shortened to 50 by default. (link:https://bugzilla.redhat.com/show_bug.cgi?id=1984442[*BZ#1984442*])
** As a workaround, you can disable the image limit by xref:../nodes/nodes/nodes-nodes-managing.adoc#nodes-nodes-managing[editing the `KubeletConfig` object] and setting the value of `nodeStatusMaxImages` to `-1`.

@rphillips When this issue is resolved, please change the "Doc Type" field from known issue to Bug Fix so that we can remove the known issue from the release notes. Thank you
Docs removed release note from OpenShift Virtualization 4.12 and added a note to the documentation - see https://bugzilla.redhat.com/show_bug.cgi?id=1995518. Removed doc text flag.