Created attachment 1737980 [details]
CPU usage

Description of problem:
I'm not really convinced this falls under the installation component, but for lack of a better place I'm opening it there for now. Please feel free to move it wherever it belongs.

Immediately following the install of CNV on a 500-node bare-metal cluster (3 masters + 497 workers), we see 3511 CNV-related pods brought up, which is quite a lot. Along with that, kube-apiserver memory and CPU consumption increases, coinciding with the CNV install, and it stays high. It seems we need to investigate whether these pods are doing any unwanted polling or holding too many watches. I will update the bug with more info as I find it, and I'm happy to provide access to the cluster while it's up.

Version-Release number of selected component (if applicable):
2.4.3 (OCP 4.6.4)

How reproducible:
100%, following a CNV install on a large-scale environment.

Steps to Reproduce:
1. Install a large OCP cluster (500 nodes)
2. Install CNV
3. Observe kube-apiserver CPU and memory usage

Actual results:
The API server on all masters is overloaded.

Expected results:
The increase in load on the API server should not be this significant.

Additional info:
[kni@e16-h18-b03-fc640 ansible]$ oc get csv -n openshift-cnv
NAME                                      DISPLAY                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v2.4.3   OpenShift Virtualization   2.4.3     kubevirt-hyperconverged-operator.v2.4.2   Installing

[kni@e16-h18-b03-fc640 ansible]$ oc get ds
NAME                           DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                   AGE
bridge-marker                  498       498       498     498          498         beta.kubernetes.io/arch=amd64   4h57m
hostpath-provisioner           495       495       495     495          495         <none>                          4h32m
kube-cni-linux-bridge-plugin   498       498       498     498          498         beta.kubernetes.io/arch=amd64   4h57m
kubevirt-node-labeller         495       495       495     495          495         <none>                          4h57m
nmstate-handler                498       498       498     498          498         beta.kubernetes.io/arch=amd64   4h57m
ovs-cni-amd64                  498       498       498     498          498         beta.kubernetes.io/arch=amd64   4h57m
virt-handler                   495       495       495     495          495         <none>                          4h53m

In the attached pictures, you can see that around 13:30, when the CNV install first happened, API server CPU/memory usage increased and has stayed high since.
Created attachment 1737981 [details] Memory Usage
I would be happy to have a closer look at the cluster. Could you share the connection details with me privately?
Here are a few observations. Following a CNV install on this 499 worker node + 3 master node environment, there is a significant increase in the CPU and memory each kube-apiserver process consumes (the CNV operator install was initiated around 15:35 UTC):

CPU: https://snapshot.raintank.io/dashboard/snapshot/u7RdkJVT1sstpAGcZX03YQH1isjR68aw
Memory: https://snapshot.raintank.io/dashboard/snapshot/E46bUXkEee2zGNYzSj5uxJGTijAZyIHa

Looking at the API server requests, there is an obvious increase in the number of operations against the API server.

Request rate increase: https://snapshot.raintank.io/dashboard/snapshot/0EJ1jYcAzZlkHzK50k6swD9e68EGWSzh?orgId=2

There is also an increase in the total number of long-running requests (watches) across the three API server instances:
https://snapshot.raintank.io/dashboard/snapshot/yNbi5YNhr445B461UWiyha9UzKxsrvfx

Breaking down the watches by instance and resource:
https://snapshot.raintank.io/dashboard/snapshot/MyQshIldkTfX7Nh2eqcYtq0EcXUdp00k

It seems there is a big increase in the number of watches on secrets and configmaps.

Following the CNV install, the 99th percentile of the response size from the apiserver also shoots up drastically:
https://snapshot.raintank.io/dashboard/snapshot/LGyj3T0vMGtkwX8UdHrOE18ZBefq0CI5
Created attachment 1739002 [details]
CPU consumption from comment 3
We debugged the cluster and saw that the nmstate-handler pods use a significant amount of CPU all the time.

```
NAME                    CPU(cores)   MEMORY(bytes)
nmstate-handler-s9jq6   1186m        161Mi
nmstate-handler-pbjhp   579m         123Mi
nmstate-handler-fgk45   577m         128Mi
nmstate-handler-5x96n   577m         120Mi
nmstate-handler-n7rp7   575m         127Mi
nmstate-handler-j9g9h   574m         133Mi
nmstate-handler-nm42f   573m         120Mi
nmstate-handler-gtzmx   570m         122Mi
nmstate-handler-chbrx   569m         127Mi
nmstate-handler-rjd5l   567m         155Mi
```

In contrast, the virt-handler CPU consumption:

```
virt-handler-tsmxt   8m   119Mi
virt-handler-jx5vz   6m   119Mi
virt-handler-rsz56   6m   116Mi
virt-handler-lsqjl   6m   119Mi
virt-handler-z2qtj   6m   119Mi
virt-handler-qdhll   5m   111Mi
virt-handler-dkqds   5m   115Mi
virt-handler-jg5rn   5m   119Mi
```

The kube-apiserver usage is very high:

```
NAME                      CPU(cores)   MEMORY(bytes)
kube-apiserver-master-0   29080m       21303Mi
kube-apiserver-master-1   28151m       10823Mi
kube-apiserver-master-2   27660m       8020Mi
```

We scaled down HCO and the network addons operator and then deleted the nmstate-handler daemonset. The apiserver load dropped significantly, by about a third. Scaling back up made the load return.

There are a lot of 'GET: nodes' and 'UPDATE: nodenetworkstates:status' operations going on, which seem to cause the load. The nmstate-handler logs show nothing after the controller loop startup.
Not sure if there is awareness of this and whether it is fixed in 2.5 or 2.6, but raising severity and reassigning to network.
This is not new; it already appeared at https://bugzilla.redhat.com/show_bug.cgi?id=1845558, but it was not happening anymore and the BZ got closed.

The issue is the amount of CPU needed by Python to start up. Right now the NodeNetworkState is updated every 5 seconds. The solution is to run nmstatectl as a daemon, using varlink to communicate between kubernetes-nmstate and nmstatectl, so that we don't pay for the Python startup. I did a little PoC and CPU usage dropped drastically.

Also, varlink support was backported to nmstate 0.3 with this in mind. Let me prepare a PR with the fix.
(In reply to Quique Llorente from comment #7)
> This is not new it appears already at
> https://bugzilla.redhat.com/show_bug.cgi?id=1845558, but it was not
> happening anymore and bz get closed
>
> The issue is the amount of CPU needed by python to startup, right now every
> 5 seconds the NodeNetworkState is updated, the solution is to run nmstatectl
> as a daemon using varlink to communicate between kubernetes-nmstate and
> nmstatectl this way we don't pay for python startup, I did a little PoC and
> cpu usage dropped drastically.
>

But does that not only address the high CPU usage? I would highly recommend not updating NodeNetworkState every 5 seconds.

Note that the high CPU usage of nmstate-handler is just what made me suspect that nmstate-handler is the reason for the high resource consumption on the apiserver. In a 500-node setup this daemonset is responsible for 1/3 of the whole apiserver load.

> Also varlink support was backported to nmstate 0.3 with this in mind, let me
> prepare a PR with the fix.
(In reply to Roman Mohr from comment #8)
> (In reply to Quique Llorente from comment #7)
> > This is not new it appears already at
> > https://bugzilla.redhat.com/show_bug.cgi?id=1845558, but it was not
> > happening anymore and bz get closed
> >
> > The issue is the amount of CPU needed by python to startup, right now every
> > 5 seconds the NodeNetworkState is updated, the solution is to run nmstatectl
> > as a daemon using varlink to communicate between kubernetes-nmstate and
> > nmstatectl this way we don't pay for python startup, I did a little PoC and
> > cpu usage dropped drastically.
> >
>
> But does that not only address the hight CPU usage? I would highly recommend
> to not update NodeNetworkState every 5 seconds.
> Note that the high cpu-usage on nmstate-handler made me just suspect that
> nmstate-handler is the reason for the high resource consumption on the
> apiserver.
> In a 500 node setup this daemonset is responsible for 1/3 of the whole
> apiserver load.
>

To reduce the traffic to the apiserver we can update NodeNetworkState only if the state has changed: we can store the last version in memory and compare.

> > Also varlink support was backported to nmstate 0.3 with this in mind, let me
> > prepare a PR with the fix.
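For illustration only (this is not the actual kubernetes-nmstate code): a minimal sketch of that compare-before-update idea, where the handler keeps an in-memory digest of the last reported state and skips the UPDATE call when nothing changed. The hashing approach and the function names are assumptions made for the example.

```
// Illustrative sketch only -- not the actual kubernetes-nmstate code.
// The handler remembers a digest of the last NodeNetworkState it reported and
// skips the status UPDATE when nothing changed, so idle nodes generate no API
// traffic.
package main

import (
	"crypto/sha256"
	"fmt"
)

// lastReported caches the digest of the state most recently written to the
// apiserver, keyed by node name (kept only in the handler's memory).
var lastReported = map[string][32]byte{}

// needsUpdate returns true only when the freshly observed state differs from
// the last state pushed for this node.
func needsUpdate(node, observedState string) bool {
	digest := sha256.Sum256([]byte(observedState))
	if prev, ok := lastReported[node]; ok && prev == digest {
		return false // unchanged since the last UPDATE, skip the API call
	}
	lastReported[node] = digest
	return true
}

func main() {
	fmt.Println(needsUpdate("node01", "eth0: up"))   // true: first report
	fmt.Println(needsUpdate("node01", "eth0: up"))   // false: no change, no UPDATE
	fmt.Println(needsUpdate("node01", "eth0: down")) // true: state changed
}
```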
Thanks for putting it back to ASSIGNED. Let me just highlight again that this bug is about the apiserver being under pressure, and that the high CPU usage of nmstate-handler was just a symptom that helped find the cause; while it needs a fix, it is not the main issue. Indeed, something like

> To reduce the traffic to apiserver we can update NodeNetworkState only if state has changed, we can store last version at memory and compare.

will have to happen.
Echoing Roman's words here: while nmstate could be one of the reasons, there might be other reasons too. The fact that apiserver CPU usage is 8-10 cores without CNV and shoots up to around 30 cores with CNV, only to fall back to around 20 cores with nmstate disabled, hints at other potential causes.
(In reply to Sai Sindhur Malleni from comment #11)
> Echoing Roman's words here, while nmstate could be one of the reasons, thre
> might be other reasons too. The fact that apiserver CPU usage is 8-10 cores
> without cnv and it shoot up to around 30 cores with cnv, only to fall back
> to 20ush cores with nmstate disabled, hints at other potential causes.

After further investigation, it is not clear what makes the apiserver still use 20 cores when nmstate is disabled. We uninstalled CNV and it stayed around 20 cores. Sai and I agreed to repeat this on a fresh scale cluster the next time they install CNV.
I have put in place a PR to reduce the number of GET Node and UPDATE NodeNetworkState requests (https://github.com/nmstate/kubernetes-nmstate/pull/664).

Number of UPDATE NodeNetworkState requests:

cmd: ./cluster/kubectl.sh get --raw https://127.0.0.1:49267/metrics |grep nmstate |grep nodenetworkstates |grep apiserver |grep UPDATE |grep apiserver_request_total

Without the PR the number increases, so clearly NNSs are updated every 5 seconds; with the PR the number is static, so we have a winner for this.

Number of GET Node requests:

cmd: ./cluster/kubectl.sh get --raw https://127.0.0.1:49267/metrics |grep nmstate |grep nodenetworkstates |grep apiserver |grep UPDATE |grep apiserver_request_total

The only metric that increases over time is

apiserver_request_total{code="200",component="apiserver",contentType="application/vnd.kubernetes.protobuf",dry_run="",group="",resource="nodes",scope="cluster",subresource="",verb="GET",version="v1"}

and curiously this is happening even without kubernetes-nmstate installed, at the same frequency with or without the PR.

About CPU usage: I don't have access to a huge cluster, so I don't see much difference. On a huge cluster the numbers above will be more evident.
(In reply to Quique Llorente from comment #13) > I have put in place a PR to reduce the number of GET Node and UPDATE > NodeNetworkState (https://github.com/nmstate/kubernetes-nmstate/pull/664) > > > > > Numbers of UPDATE NodeNetworkState: > > cmd: ./cluster/kubectl.sh get --raw https://127.0.0.1:49267/metrics |grep > nmstate |grep nodenetworkstates |grep apiserver |grep UPDATE |grep > apiserver_request_total > > Without the PR the number increase so clearly NNSs are updated every 5 > seconds, but with the PR the number is static so we have a winner for this. > > > > Numbers of GET Node: > > cmd: ./cluster/kubectl.sh get --raw https://127.0.0.1:49267/metrics |grep > nmstate |grep nodenetworkstates |grep apiserver |grep UPDATE |grep > apiserver_request_total Correction: ./cluster/kubectl.sh get --raw https://127.0.0.1:49267/metrics |grep nodes |grep apiserver |grep GET |grep apiserver_request_total > > The only metric that increase over time is > > apiserver_request_total{code="200",component="apiserver", > contentType="application/vnd.kubernetes.protobuf",dry_run="",group="", > resource="nodes",scope="cluster",subresource="",verb="GET",version="v1"} > > And curiously this is happening even without kubernetes-nmstate installed > and as the same frequency as with or without the PR. > > > About CPU usage I don't have access to a huge cluster so I don't see much > difference. Also at a huge cluster the numbers above will be more evident.
Israel, I hope you don't mind me moving QE of this to you as this issue is not limited to network and will require scale testing.
On a CNV 2.6 cluster, this is the memory and CPU usage (3 masters, 3 workers):

$ oc adm top pods -A | grep nmstate-handler
openshift-cnv   nmstate-handler-bdtq6   101m   35Mi
openshift-cnv   nmstate-handler-brkj4   94m    39Mi
openshift-cnv   nmstate-handler-gdbjv   115m   58Mi
openshift-cnv   nmstate-handler-p2mjg   85m    25Mi
openshift-cnv   nmstate-handler-sxjbd   110m   44Mi
openshift-cnv   nmstate-handler-w75vt   110m   56Mi

$ oc adm top pods -A | grep virt-api
openshift-cnv   virt-api-66cdcd649f-9wnxr   4m   102Mi
openshift-cnv   virt-api-66cdcd649f-k65hd   5m   109Mi

$ oc adm top pods -A | grep kube-apiserver
openshift-kube-apiserver   kube-apiserver-ssp02-2mlw2-master-0   153m   1299Mi
openshift-kube-apiserver   kube-apiserver-ssp02-2mlw2-master-1   469m   2171Mi
openshift-kube-apiserver   kube-apiserver-ssp02-2mlw2-master-2   222m   2048Mi

$ oc adm top pods -A | grep virt-handler
openshift-cnv   virt-handler-26bjn   7m   104Mi
openshift-cnv   virt-handler-65czz   3m   120Mi
openshift-cnv   virt-handler-q2jcm   3m   116Mi
Hi, a PR was merged in kubernetes-nmstate (https://github.com/nmstate/kubernetes-nmstate/pull/664); we have to test again with that fix in place.
Also, the latest release of CNAO includes it: https://github.com/kubevirt/cluster-network-addons-operator/releases/tag/v0.44.0
To test that the fix is in place, we can monitor "/metrics" to find the total number of NNS updates, and also monitor the update timestamp in the NNS status.
The nmstate resource consumption improvement is available on D/S. As mentioned in this BZ, nmstate was not the only cause of this issue. I'm moving this to QE to get results for the current D/S build and identify the next culprit.
Can someone let me know in which "channel" a CNV build with this fix is available?
With CNV 2.6 + OCP 4.7 and kubernetes-nmstate-handler-container-v2.6.0-14, nmstate-handler looks more relaxed on CPU usage:

$ oc adm top pods -A | grep nmstate-handler
openshift-cnv   nmstate-handler-5vlxm   110m   46Mi
openshift-cnv   nmstate-handler-6fm9r   95m    26Mi
openshift-cnv   nmstate-handler-6h9dm   93m    23Mi
openshift-cnv   nmstate-handler-czb7n   77m    25Mi
openshift-cnv   nmstate-handler-djcts   172m   25Mi
openshift-cnv   nmstate-handler-k8b9h   67m    23Mi

For the other API servers in the openshift-cnv namespace:

cdi-apiserver-58ccb6859d-6tbmv   15m   21Mi
virt-api-6598bfdcd6-cjqps        1m    87Mi
virt-api-6598bfdcd6-xvxkj        4m    86Mi

Outside of openshift-cnv:

openshift-apiserver                 apiserver-5dd646db98-kx59n                     32m    250Mi
openshift-apiserver                 apiserver-5dd646db98-qmdn6                     15m    298Mi
openshift-apiserver                 apiserver-5dd646db98-vcg5w                     23m    287Mi
openshift-apiserver-operator        openshift-apiserver-operator-55fbff96-84ssh    17m    145Mi
openshift-kube-apiserver            kube-apiserver-verify-26-jw5hm-master-0        358m   2174Mi
openshift-kube-apiserver            kube-apiserver-verify-26-jw5hm-master-1        71m    1425Mi
openshift-kube-apiserver            kube-apiserver-verify-26-jw5hm-master-2        217m   1966Mi
openshift-kube-apiserver-operator   kube-apiserver-operator-5b574b9476-2vm74       65m    154Mi
openshift-machine-api               cluster-autoscaler-operator-84c7c78f7c-ds7g7   0m     60Mi
openshift-machine-api               cluster-baremetal-operator-77cf96f46b-5qcm5    0m     71Mi
openshift-machine-api               machine-api-controllers-68cfd48-29f68          1m     162Mi
openshift-machine-api               machine-api-operator-5c7456b446-qqcwt          0m     62Mi
openshift-oauth-apiserver           apiserver-58b88b85c-2gjb8                      3m     61Mi
openshift-oauth-apiserver           apiserver-58b88b85c-s8xsz                      6m     72Mi
openshift-oauth-apiserver           apiserver-58b88b85c-xb4vh                      6m     67Mi

I see that kube-apiserver has the highest usage. In the logs I see a lot of connection errors, for example:

I0107 07:33:05.128468 21 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc0214a8980, {CONNECTING <nil>}
I0107 07:33:05.138002 21 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc0214a8980, {READY <nil>}
I0107 07:33:05.139040 21 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0107 07:33:06.259477 21 client.go:360] parsed scheme: "passthrough"
I0107 07:33:06.259527 21 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{https://192.168.0.24:2379 <nil> 0 <nil>}] <nil> <nil>}
I0107 07:33:06.259536 21 clientconn.go:948] ClientConn switching balancer to "pick_first"
I0107 07:33:06.259644 21 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc023f546c0, {CONNECTING <nil>}
I0107 07:33:06.269286 21 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc023f546c0, {READY <nil>}
I0107 07:33:06.270159 21 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
W0107 07:33:07.424630 21 admission.go:98] ClusterQuotaAdmission received non-meta object *unstructured.Unstructured
I0107 07:33:15.744069 21 client.go:360] parsed scheme: "endpoint"

It seems to be related to this BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1844448

Sai, do you plan to retest by deploying a large cluster (500 nodes)?
Moving to verify.
I got a chance to test this at 500-node scale on OCP 4.6.4 with a 2.6 nightly of CNV:

[kni@e16-h18-b03-fc640 clusterconfigs]$ oc describe sub/hco-operatorhub | grep 2.6
  UID:            261f3260-da9b-4a8a-889a-b5e1216fbbd3
  Current CSV:    kubevirt-hyperconverged-operator.v2.6.0
  Installed CSV:  kubevirt-hyperconverged-operator.v2.6.0

The nmstate-handler pods are still hot on CPU, and overall kube-apiserver CPU usage is around 30 cores per apiserver, with memory consumption around 40G per apiserver. Not really sure if nmstate-handler is the only culprit here, but the CPU consumption is still high, as noted earlier.
nmstate-handler top output https://gist.github.com/smalleni/75a8974766e62c234da42beda25165f8
What is the right status to put this bug in to reopen it?
(In reply to Sai Sindhur Malleni from comment #25)
> I got a chance to test this at 500 node scale on OCP 4.6.4 with a 2.6
> nightly of CNV
> 
> [kni@e16-h18-b03-fc640 clusterconfigs]$ oc describe sub/hco-operatorhub |
> grep 2.6
>   UID:            261f3260-da9b-4a8a-889a-b5e1216fbbd3
>   Current CSV:    kubevirt-hyperconverged-operator.v2.6.0
>   Installed CSV:  kubevirt-hyperconverged-operator.v2.6.0
> 
> nmstate-handler pods are still hot on CPU and over kube-apiserver CPU usage
> is around 30 cores (per apiserver) and memory consumption is around 40G per
> apiserver. Not really sure if nmstate-handler is the only culprit here but
> the CPU consumption is still high as noted earlier.

Can you do something similar to what was done at https://bugzilla.redhat.com/show_bug.cgi?id=1906151#c13 so we can verify that the number of updates on NodeNetworkState is reduced? In the end, that is more or less what was done.
@smalleni Can you also verify on the 500-worker cluster that the NodeNetworkState is not updated? There is a "lastSuccessfulUpdateTime" field in "status" that gets refreshed every time a change in the node's network is detected.
We are preparing a version with a 1-minute NNS refresh period to see if it helps, as part of the effort for 2.6: https://github.com/nmstate/kubernetes-nmstate/pull/682.
Removing needinfo for now as I clarified over email/chat that the environment is currently not around. A dedicated CNV scale test might be the best way to reproduce this and dig deeper.
We tried to get the CPU consumption of nmstate even lower. Please check whether the nmstate issue persists on scaled-up clusters and whether there is any other process consuming too much.
Moving back to dev to clarify a couple of questions and see if we can make it even better.
We identified other issues in kubernetes-nmstate's controllers mechanism that will cause performance issues on midsize and big clusters. They are rooted deeply in the code and will require serious changes; we won't be able to address them in 2.6. I believe that this specific nmstate issue is not a regression introduced in 2.6.

I suggest deferring this to 4.8.
Petr, this bug was identified by me in our internal scale lab testing, if that helps. I'm more than happy to try out the bits, but hardware resources are in high contention and we are currently using the cluster for some upgrade testing, so I will try to identify a slot to try this. Do I have to be on OCP 4.7 to try CNV 2.6 nightlies, or is 4.6 fine too? And the way to get these fixes is through the CNV nightly operator, correct? I will leave a needinfo on me.
4.7 would be strongly preferred. It may work fine with 4.6, but there is a chance we would hit some incompatibility and I would not like to waste your time asking for suboptimal and risky solutions. Yes, the nightly operator should deliver our partial fixes.
(In reply to Petr Horáček from comment #34)
> We identified other issues in kubernetes-nmstate's controllers mechanism
> that will cause performance issues on midsize and big clusters. They are
> rooted deeply in the code and will require serious changes, we won't be able
> to address them in 2.6. I believe that this specific nmstate issue is not a
> regression introduced in 2.6.
> 
> I suggest to defer this to 4.8.

Let's be a little bit more specific, to help the scale people understand what is going on: the issue is that nmstate delivers a daemonset which starts pods on every node, and every pod on every node watches all nodes. Considering that kubevirt does 1 or 2 node updates per minute and the kubelet by default does 6 updates per minute on every node, we transfer every node object to every nmstate pod 6-8 times per minute. That is where the main issue is coming from.

To do the math: let's suppose 500 nodes, with 400 nmstate pods and default node update intervals. We have 500 * 400 * 7 node objects read from the apiserver per minute, which makes 14 million objects transferred per minute, or 23 000 node objects transferred per second.
> 14 million objects transferred per minute or 23 000 node objects transferred
> per second.

Sorry, 1.4 million objects.
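For reference, here is a minimal client-go sketch (hypothetical, not the actual kubernetes-nmstate or kubevirt code) of the kind of narrowing that avoids this N^2 traffic: each per-node agent lists and watches only its own Node object through a server-side field selector. The NODE_NAME environment variable is assumed to be injected via the downward API.

```
// Hypothetical sketch: a per-node agent that only watches its own Node object
// via a field selector, instead of receiving every Node update in the cluster.
package main

import (
	"fmt"
	"os"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	nodeName := os.Getenv("NODE_NAME") // assumed to be set via the downward API

	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The field selector is applied server-side, so the apiserver only sends
	// events for this one Node instead of all of them.
	factory := informers.NewSharedInformerFactoryWithOptions(
		client, 10*time.Hour, // same order of resync as the default informer cache
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.FieldSelector = fields.OneTermEqualSelector("metadata.name", nodeName).String()
		}),
	)

	factory.Core().V1().Nodes().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(_, newObj interface{}) {
			node := newObj.(*corev1.Node)
			fmt.Println("update received for my node only:", node.Name)
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	<-stop
}
```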
(In reply to Sai Sindhur Malleni from comment #3)
> Also there is an increase in the total number of long-running requests
> (watches) across the three API server instances:
> https://snapshot.raintank.io/dashboard/snapshot/
> yNbi5YNhr445B461UWiyha9UzKxsrvfx
> 
> breaking down the watches by instance and resource
> https://snapshot.raintank.io/dashboard/snapshot/
> MyQshIldkTfX7Nh2eqcYtq0EcXUdp00k
> Seems like there is a big increase in the number of watches on secrets and
> configmaps

Apart from that, I have seen that since we mount a configmap on every handler pod for the FILTER_INTERFACES env var, we have one watcher on the apiserver per node, so that would be 500 watchers just for that on the big cluster. Also, the service account token is watched by every kubernetes-nmstate-handler since the pod mounts it, so that makes another 500 secret watchers. For the first one we can easily put a solution in place, but the second one, I think, will be there whenever the pod has a service account.
(In reply to Roman Mohr from comment #38)
> (In reply to Petr Horáček from comment #34)
> > We identified other issues in kubernetes-nmstate's controllers mechanism
> > that will cause performance issues on midsize and big clusters. They are
> > rooted deeply in the code and will require serious changes, we won't be able
> > to address them in 2.6. I believe that this specific nmstate issue is not a
> > regression introduced in 2.6.
> > 
> > I suggest to defer this to 4.8.
> 
> Lets be a little bit more specific to help the scale people understand what
> is going on: The issue is that nmstate delivers a daemonset which starts
> pods on every node. And every pod on every node watches all nodes.
> Considering that kubevirt does 1 or 2 node updates per minute and the
> kubelet by default does 6 updates per minute on every node, we transfer
> every node object to every nmstate pod 6-8 times per minute.
> That is where the main issue is coming from. To do the match: Let's suppose
> 500 nodes, with 400 nmstate pods and default node update intervals, we have
> : 500 * 400 * 7 node objects read from the apiserver per minute, which makes
> 14 million objects transferred per minute or 23 000 node objects transferred
> per second.

Just to verify this, I have put some traces in place in the controller-runtime vendor code with https://github.com/nmstate/kubernetes-nmstate/pull/694

And it's clear that all handlers receive all the Nodes and all the NodeNetworkStates (NodeNetworkConfigurationPolicies do not increase with cluster size).

[ellorent@localhost kubernetes-nmstate]$ ./cluster/kubectl.sh logs -n nmstate nmstate-handler-ct5vh |grep event
{"level":"info","ts":1612442369.534102,"logger":"controller-runtime.source.EventHandler.OnAdd","msg":"event received","name":"node01","kind":"&TypeMeta{Kind:,APIVersion:,}"}
{"level":"info","ts":1612442369.5341187,"logger":"controller-runtime.source.EventHandler.OnAdd","msg":"event received","name":"node01","kind":"&TypeMeta{Kind:,APIVersion:,}"}
{"level":"info","ts":1612442369.5341403,"logger":"controller-runtime.source.EventHandler.OnAdd","msg":"event received","name":"node02","kind":"&TypeMeta{Kind:,APIVersion:,}"}
{"level":"info","ts":1612442369.534188,"logger":"controller-runtime.source.EventHandler.OnAdd","msg":"event received","name":"node01","kind":"&TypeMeta{Kind:,APIVersion:,}"}
{"level":"info","ts":1612442369.534203,"logger":"controller-runtime.source.EventHandler.OnAdd","msg":"event received","name":"node02","kind":"&TypeMeta{Kind:,APIVersion:,}"}
{"level":"info","ts":1612442369.534318,"logger":"controller-runtime.source.EventHandler.OnAdd","msg":"event received","name":"node02","kind":"&TypeMeta{Kind:,APIVersion:,}"}
{"level":"info","ts":1612442375.4340763,"logger":"controller-runtime.source.EventHandler.OnUpdate","msg":"event received","name":"node01","kind":"&TypeMeta{Kind:,APIVersion:,}"}
{"level":"info","ts":1612442375.4341297,"logger":"controller-runtime.source.EventHandler.OnUpdate","msg":"event received","name":"node01","kind":"&TypeMeta{Kind:,APIVersion:,}"}
{"level":"info","ts":1612442375.4341464,"logger":"controller-runtime.source.EventHandler.OnUpdate","msg":"event received","name":"node01","kind":"&TypeMeta{Kind:,APIVersion:,}"}
{"level":"info","ts":1612442376.5333335,"logger":"controller-runtime.source.EventHandler.OnAdd","msg":"event received","name":"node02","kind":"&TypeMeta{Kind:,APIVersion:,}"}
{"level":"info","ts":1612442376.5543258,"logger":"controller-runtime.source.EventHandler.OnUpdate","msg":"event received","name":"node02","kind":"&TypeMeta{Kind:,APIVersion:,}"}
{"level":"info","ts":1612442376.5719728,"logger":"controller-runtime.source.EventHandler.OnAdd","msg":"event received","name":"node01","kind":"&TypeMeta{Kind:,APIVersion:,}"}
{"level":"info","ts":1612442376.5807424,"logger":"controller-runtime.source.EventHandler.OnUpdate","msg":"event received","name":"node01","kind":"&TypeMeta{Kind:,APIVersion:,}"}
{"level":"info","ts":1612442397.385236,"logger":"controller-runtime.source.EventHandler.OnUpdate","msg":"event received","name":"node02","kind":"&TypeMeta{Kind:,APIVersion:,}"}
{"level":"info","ts":1612442397.3853154,"logger":"controller-runtime.source.EventHandler.OnUpdate","msg":"event received","name":"node02","kind":"&TypeMeta{Kind:,APIVersion:,}"}
{"level":"info","ts":1612442397.3853505,"logger":"controller-runtime.source.EventHandler.OnUpdate","msg":"event received","name":"node02","kind":"&TypeMeta{Kind:,APIVersion:,}"}

And reading the controller-runtime cache code, they create an informer for each "Watch" source with a resync of 10 hours, so what you say makes total sense.
We are currently investigating this issue on a performance cluster. This won't make it to 2.6, deferring to the upcoming 4.8.
So we have access to a 100-node cluster, and Sarah and Rai installed CNV 2.4 and Grafana on it. Below are the metrics, with the following events:

- 22:00 UTC 02/14/2021 - Installation of CNV 2.4
- 15:00 UTC 02/15/2021 - Deactivation of CNAO and kubernetes-nmstate

CPU: https://snapshot.raintank.io/dashboard/snapshot/kAV0sUY0Ptl1HOIhg5RTuAvmr2e9RERP?orgId=2
Memory: https://snapshot.raintank.io/dashboard/snapshot/WFJ2Xwfjo5Prr5i7GOyMhgpZ8DgoZ7jB
Long-running requests (watches): https://snapshot.raintank.io/dashboard/snapshot/8srTE6TsJNJwnzGgJAmM35oyVnpyTqQ3
Response size: https://snapshot.raintank.io/dashboard/snapshot/Ur2JgjaNUjDsXgDf0m46foqkwa6iA4W7
Request rate (read vs write): https://snapshot.raintank.io/dashboard/snapshot/Ur2JgjaNUjDsXgDf0m46foqkwa6iA4W7
Network (total number of bytes sent and received by gRPC clients): https://snapshot.raintank.io/dashboard/snapshot/wTGfSgYTkZA0bzk5E6g18iiXzrcFVz9z
Client traffic in: https://snapshot.raintank.io/dashboard/snapshot/Xb6XEVieFppz6i2nGrtmNqt49gVdLhRI
Peer traffic in: https://snapshot.raintank.io/dashboard/snapshot/jqrQBCE99rnGhBn5Ngb38JYRSQ3I9IMl
Peer traffic out: https://snapshot.raintank.io/dashboard/snapshot/wTGfSgYTkZA0bzk5E6g18iiXzrcFVz9z
Network utilization at worker: https://snapshot.raintank.io/dashboard/snapshot/Ks6Z8k0zxW4FcyrmpxKFeLLWKYYcpVk2

Top:

[kni@f08-h17-b01-5039ms ~]$ oc adm top pods -n openshift-cnv -l component=kubernetes-nmstate-handler
NAME                    CPU(cores)   MEMORY(bytes)
nmstate-handler-2699f   138m         35Mi
nmstate-handler-2fh29   121m         37Mi
nmstate-handler-2hw2t   123m         36Mi
nmstate-handler-2lhvc   110m         35Mi
nmstate-handler-2psph   119m         37Mi
nmstate-handler-44q5x   87m          39Mi
nmstate-handler-477bb   119m         38Mi
nmstate-handler-4cjsv   162m         38Mi
nmstate-handler-4t2xw   122m         37Mi
...

[kni@f08-h17-b01-5039ms ~]$ oc adm top pods -n openshift-cnv -l kubevirt.io=virt-handler
NAME                 CPU(cores)   MEMORY(bytes)
virt-handler-28cvv   0m           80Mi
virt-handler-29w29   0m           77Mi
virt-handler-2blb4   0m           79Mi
virt-handler-2cp2z   0m           76Mi
virt-handler-2kv6j   0m           77Mi
virt-handler-2wdlz   0m           78Mi
virt-handler-4cg5p   0m           85Mi

API server with knmstate running:

[kni@f08-h17-b01-5039ms ~]$ oc adm top pods -n openshift-kube-apiserver
NAME                      CPU(cores)   MEMORY(bytes)
kube-apiserver-master-0   359m         2152Mi
kube-apiserver-master-1   1802m        3078Mi
kube-apiserver-master-2   1586m        3620Mi

API server without knmstate running:

[kni@f08-h17-b01-5039ms ~]$ oc adm top pods -n openshift-kube-apiserver
NAME                      CPU(cores)   MEMORY(bytes)
kube-apiserver-master-0   93m          2174Mi
kube-apiserver-master-1   1284m        3134Mi
kube-apiserver-master-2   2373m        3543Mi

The good news is that we are able to reproduce the issue with a smaller cluster; the bad news is that kubernetes-nmstate looks like part of the problem, but not all of it.

So on the kubernetes-nmstate side we have two big issues:
- The refresh of NodeNetworkState; this has been mitigated in CNV 2.6 by reducing the frequency to 1 minute (while forcing a refresh if an NNCP is applied) and comparing before updating the NNS.
- The controller-runtime cache does not filter by field or label, so all the nmstate-handlers retrieve all the nodes/NNSs every time they change (or after 10 hours); to fix this we have a couple of PoCs we want to test.

I suppose Sai will be able to add much more info to this.
I have continued looking at the issue without kubernetes-nmstate, and it looks like the number of "GET" operations on nodes is quite high, so I took a look at the daemonsets that do something with nodes, like bridge-marker.

After disabling bridge-marker (11:00 UTC 02/16/2021), the number of GET operations on nodes drops: https://snapshot.raintank.io/dashboard/snapshot/S0LGUWt3T53PetORZRkz0lhTKBVGG1r8

The apiservers are not affected that much, though:

[kni@f08-h17-b01-5039ms ~]$ oc adm top pods -n openshift-kube-apiserver
NAME                      CPU(cores)   MEMORY(bytes)
kube-apiserver-master-0   79m          2328Mi
kube-apiserver-master-1   1062m        3154Mi
kube-apiserver-master-2   2552m        3638Mi

The same thing happens with ovs-cni, since it is the analogue of bridge-marker but for OVS bridges; after removing it (11:30 UTC 02/16/2021) we have a drop in node GET operations: https://snapshot.raintank.io/dashboard/snapshot/kjEjlIE5nVD3HJAiB6NDmEgp9PMgGPLj

But still not much difference on the apiservers:

[kni@f08-h17-b01-5039ms ~]$ oc adm top pods -n openshift-kube-apiserver
NAME                      CPU(cores)   MEMORY(bytes)
kube-apiserver-master-0   90m          2258Mi
kube-apiserver-master-1   1169m        3147Mi
kube-apiserver-master-2   2435m        3573Mi

Network graphs for the etcd cluster do not show much difference.
Checking which API requests stay at the top, the next candidate is configmaps. After killing virt-handler we get a drop in GET configmaps: https://snapshot.raintank.io/dashboard/snapshot/LQYwBEMfslWxlMovjIKpy3MDKiY8uh16

The apiserver continues with the same load/memory:

[kni@f08-h17-b01-5039ms ~]$ oc adm top pods -n openshift-kube-apiserver
NAME                      CPU(cores)   MEMORY(bytes)
kube-apiserver-master-0   109m         2342Mi
kube-apiserver-master-1   1191m        3142Mi
kube-apiserver-master-2   2179m        3575Mi
One thing to note is that after the CNV install, the 99th percentile API response size becomes very large. Do we know why? Are there requests that now return a large amount of data? This graph captures the increase in API server CPU usage after the install of CNV around 22:00 UTC. Theoretically, if we uninstall CNV completely, the API server usage should go back to previous levels; is that happening?
I also see that the response size does not change after disabling knmstate; it has to be something else. Let's start tackling what we know:
- Try the latest knmstate improvements, either with CNV 2.6 or CNV 4.8 (I think the latter is not ready yet).
- Deploy the kubernetes-nmstate PoC that refreshes NNS from a Deployment instead of a DaemonSet, so we don't face the N^2 issue retrieving nodes.

We also have to improve bridge-marker and ovs-cni so they don't retrieve Nodes so often, but this is not as impactful.

The other little issue is that we see some configmap GETs coming from either virt-operator or virt-handler, but we haven't found where.
I am going to try to deploy the upstream kubernetes-nmstate version with the PoC on top of the OCP cluster we have, so we get results earlier.
I have been able to deploy the latest u/s kubernetes-nmstate with the latest fixes and the PoC that refreshes NNS from a Deployment (Wed Feb 17 11:05:38 UTC 2021).

The numbers look better in top:

nmstate-handler-v4vq8               0m    24Mi
nmstate-handler-w8ngq               12m   24Mi
nmstate-handler-wl5pz               12m   24Mi
nmstate-handler-wmv6g               0m    23Mi
nmstate-handler-xgrnt               12m   24Mi
nmstate-handler-z74g5               0m    26Mi
nmstate-handler-zfvhx               12m   26Mi
nmstate-handler-zpgr9               0m    24Mi
nmstate-monitor-9489cd4f9-6dcfs     26m   225Mi
nmstate-operator-665949d7cb-5k8wg   0m    31Mi
nmstate-webhook-6548b4b9d9-jd5l8    0m    33Mi
nmstate-webhook-6548b4b9d9-pp4kh    1m    69Mi

Now the workers are doing less and all the load is on "nmstate-monitor"; the 12m on "nmstate-handler" is related to "nmstate-monitor" calling "nmstatectl show" there.

The apiserver looks better too:

[kni@f08-h17-b01-5039ms knmstate]$ kubectl top pod -n openshift-kube-apiserver
NAME                      CPU(cores)   MEMORY(bytes)
kube-apiserver-master-0   122m         1839Mi
kube-apiserver-master-1   1335m        2778Mi
kube-apiserver-master-2   2732m        3708Mi

@smalleni Can you characterize the cluster with this installed?
OK, just checked on the cluster; it seems nmstate-handler isn't running now. While the nmstate-handler-induced CPU/network busyness is definitely solved, having CNV installed on the cluster still seems to add some load on the API server. While not so pronounced on the 100-node cluster, this might cause problems on larger clusters. We also need to define what percentage of increase in API server load is acceptable with CNV.
> While definitely the nmstate-handler induced CPU/network business is solved, having a CNV install on the cluster still seems to add some load on the API server.

Yeah, I think we have more issues. I noticed recently that when introducing new VMIs into a cluster, every virt-handler is watching every VMI. That's a global VMI watcher on every node in the cluster that a VMI can run on. This will likely impact performance on large clusters as the density of VMIs increases. For instance, if we introduce a single VMI, then every virt-handler on every node is sent that VMI. Instead, only the virt-handler present on the node the VMI is scheduled to should receive that information.

There's a PR here to address that specific issue upstream.
https://github.com/kubevirt/kubevirt/pull/5060

> We also need to define what % of increase in API server load is acceptabled with CNV.

Does it make sense to define this at both an idle state and at load? At an idle state, my expectation is that our controllers should only result in a slight increase of API calls. Under load, I'd want to see a linear progression that correlates to how many VMIs are introduced into the system.
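Purely as an illustration (this is not the implementation in the PR above): the analogous narrowing for VMIs would have each virt-handler ask the apiserver only for the VMIs scheduled to its own node, for example through a label selector. The label key and the use of a plain dynamic-client watch below are assumptions made for the sketch.

```
// Hypothetical sketch: watch only the VMIs scheduled to this node, using a
// server-side label selector, instead of every VMI in the cluster.
package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

func main() {
	nodeName := os.Getenv("NODE_NAME") // assumed to be set via the downward API

	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := dynamic.NewForConfigOrDie(cfg)

	vmiGVR := schema.GroupVersionResource{
		Group:    "kubevirt.io",
		Version:  "v1",
		Resource: "virtualmachineinstances",
	}

	// Assumed label key recording the node a VMI is scheduled to; with it, the
	// apiserver filters events server-side and only this node's VMIs are sent.
	w, err := client.Resource(vmiGVR).Namespace(metav1.NamespaceAll).Watch(
		context.TODO(),
		metav1.ListOptions{LabelSelector: "kubevirt.io/nodeName=" + nodeName},
	)
	if err != nil {
		panic(err)
	}
	for ev := range w.ResultChan() {
		fmt.Println("VMI event on my node:", ev.Type)
	}
}
```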
(In reply to David Vossel from comment #52) > > While definitely the nmstate-handler induced CPU/network business is solved, having a CNV install on the cluster still seems to add some load on the API server. > > Yeah, i think we have more issues. I noticed recently that when introducing > new VMIs into a cluster, that every virt-handler is watching every VMI. > That's a global vmi watcher on every node in the cluster that a vmi can run > on. This will likely impact performance on large clusters as the density of > VMIs increase. For instance, if we introduce a single VMI then every virt > handler on every node is sent that VMI. Instead only the virt-handler > present on the node the VMI is scheduled to should receive that information. > > There's a PR here to address that specific issue upstream. > https://github.com/kubevirt/kubevirt/pull/5060 Whoa, that is even worse than the nmstate issue. I would propose porting that to 2.5 and 2.6. It is a very simple fix. > > > We also need to define what % of increase in API server load is acceptabled with CNV. > > Does it make sense to define this at both an idle state and at load? At an > idle state, my expectation is our controllers should only result in a slight > increase of API calls. Under load, I'd want to see a linear progression that > correlates to how many VMIs are introduced into the system. For me that makes absolute sense.
(In reply to Roman Mohr from comment #53)
> (In reply to David Vossel from comment #52)
> > > While definitely the nmstate-handler induced CPU/network business is solved, having a CNV install on the cluster still seems to add some load on the API server.
> > 
> > Yeah, i think we have more issues. I noticed recently that when introducing
> > new VMIs into a cluster, that every virt-handler is watching every VMI.
> > That's a global vmi watcher on every node in the cluster that a vmi can run
> > on. This will likely impact performance on large clusters as the density of
> > VMIs increase. For instance, if we introduce a single VMI then every virt
> > handler on every node is sent that VMI. Instead only the virt-handler
> > present on the node the VMI is scheduled to should receive that information.
> > 
> > There's a PR here to address that specific issue upstream.
> > https://github.com/kubevirt/kubevirt/pull/5060
> 
> Whoa, that is even worse than the nmstate issue. I would propose porting
> that to 2.5 and 2.6. It is a very simple fix.
> 
> > 
> > > We also need to define what % of increase in API server load is acceptabled with CNV.
> > 
> > Does it make sense to define this at both an idle state and at load? At an
> > idle state, my expectation is our controllers should only result in a slight
> > increase of API calls. Under load, I'd want to see a linear progression that
> > correlates to how many VMIs are introduced into the system.
> 
> For me that makes absolute sense.

But in this bugzilla we have to focus on the idle state and try to understand what else we may need to fix. Roman also pointed to https://github.com/kubevirt/kubevirt/pull/5061, but I have stopped virt-handler and the difference is not that much.
(In reply to Sai Sindhur Malleni from comment #51)
> OK just checked on the cluster, it seems like nmstate-handler isn't running
> now. While definitely the nmstate-handler induced CPU/network business is
> solved, having a CNV install on the cluster still seems to add some load on
> the API server. While not so pronounced on the 100 node cluster, this might
> cause problems on larger clusters. We also need to define what % of increase
> in API server load is acceptabled with CNV.

The nmstate pods are running under a different namespace, "nmstate"; that's why you don't see them. It's the u/s nmstate project.
After some investigation, the kubernetes-nmstate PoC gives us better results on CPU/memory, but it increases the request duration, since the network configuration is passed from the handler to the monitor.

The second PoC hacks the controller-runtime cache so it filters by field name; it has been deployed at Mon Feb 22 12:07:13 UTC 2021.

Top is similar to the first PoC:

[kni@f08-h17-b01-5039ms manifest-poc-nmstate-monitor]$ kubectl top pod -n openshift-kube-apiserver
NAME                      CPU(cores)   MEMORY(bytes)
kube-apiserver-master-0   162m         2446Mi
kube-apiserver-master-1   979m         3332Mi
kube-apiserver-master-2   2354m        3821Mi

And the nmstate numbers look similar:

nmstate-handler-vvjgt               12m   25Mi
nmstate-handler-w56pt               0m    23Mi
nmstate-handler-x4q5r               0m    29Mi
nmstate-handler-xbgtm               0m    23Mi
nmstate-handler-xh2qw               12m   25Mi
nmstate-handler-xpvpt               12m   26Mi
nmstate-handler-xq95g               0m    25Mi
nmstate-handler-zc7g6               13m   26Mi
nmstate-handler-znmrd               0m    25Mi
nmstate-operator-665949d7cb-bcxtn   0m    28Mi
nmstate-webhook-6548b4b9d9-9k87s    2m    68Mi
nmstate-webhook-6548b4b9d9-g7ggb    0m    35Mi
Also disabled virt-handler at Mon Feb 22 12:26:25 UTC 2021.
Re-enabled virt-handler at Mon Feb 22 15:11:20 UTC 2021; not much difference.
After removing HCO and the operators, it looks like the cluster does not go back to the request duration it had before installing CNV 2.4.

Request duration, read vs write: https://snapshot.raintank.io/dashboard/snapshot/xgmamp3xavxd7Igp1LYr0kOtrTPGIpJV?orgId=2

Maybe the monitoring is adding some noise?
Removing testblocker as we are not actively testing this scenario at the moment; we will get back to it when the full set of changes is in. I am adding the CNV perf/scale team to the CC list.
Petr,

The bug was retargeted to 4.10, but the fixed-in version is CNAO v4.9.0-18. Before attempting to verify it, I would like to ensure that the fix is available in a 4.10 CNAO. If it is, please specify exactly in which version, so we know we are attempting to verify on a correct cluster.

Thanks,
Yossi
Yossi, the code is already in 4.9.0. The bug was postponed to 4.10 time frame only due to QE capacity. Please verify it with the 500-node cluster currently used by Guy Chen.
> Yossi, the code is already in 4.9.0. The bug was postponed to 4.10 time frame only due to QE capacity. Please verify it with the 500-node cluster currently used by Guy Chen.

My question still stands: the bug is targeted to 4.10, so I need to know in which CNAO *4.10.x-y* version the fix is included. I guess it's not available from CNAO 4.10.0-0 and up, and I wouldn't want to waste any time checking a version which doesn't include the fix.
Verified on a 500-node cluster from CNV SSP QE.

CNV version: 4.9.1
CNAO version: v4.9.1-9

nmstate-handler, which was found to be responsible for the heavy load on kube-apiserver, now shows legitimate CPU values:

[kni@e16-h12-b02-fc640 ~]$ kubectl top pods -nopenshift-cnv | grep nmstate-handler
NAME                    CPU(cores)   MEMORY(bytes)
nmstate-handler-24h2l   2m           48Mi
nmstate-handler-24lqv   35m          60Mi
nmstate-handler-257hq   18m          52Mi
nmstate-handler-2626d   18m          60Mi
nmstate-handler-294b5   17m          62Mi
nmstate-handler-2knms   18m          54Mi
nmstate-handler-2lnsv   33m          47Mi
nmstate-handler-2pkv4   20m          50Mi
nmstate-handler-2ppmn   2m           52Mi

in comparison to the values presented in comment #5.

However, kube-apiserver still shows high CPU levels:

[kni@e16-h12-b02-fc640 ~]$ kubectl top pods -A | grep kube-apiserver
NAMESPACE                  NAME                      CPU(cores)   MEMORY(bytes)
openshift-kube-apiserver   kube-apiserver-master-0   23724m       28373Mi
openshift-kube-apiserver   kube-apiserver-master-1   21347m       26880Mi
openshift-kube-apiserver   kube-apiserver-master-2   14941m       27082Mi

We will now check whether kube-apiserver still consumes a large amount of CPU after uninstalling CNV; if it doesn't, we will open a new bug.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 4.10.0 Images security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0947