Bug 1845558 - CNV HyperConverged nmstate-handler-worker high cpu
Summary: CNV HyperConverged nmstate-handler-worker high cpu
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Networking
Version: 2.4.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 2.5.0
Assignee: Petr Horáček
QA Contact: Meni Yakove
URL:
Whiteboard:
Depends On:
Blocks: 1906151
 
Reported: 2020-06-09 14:06 UTC by Jason Huddleston
Modified: 2020-12-15 06:44 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-29 12:59:53 UTC
Target Upstream Version:
Embargoed:


Attachments
Console screenshot (564.57 KB, image/png), 2020-06-11 13:04 UTC, Jason Huddleston

Description Jason Huddleston 2020-06-09 14:06:00 UTC
Description of problem:

After adding a HyperConverged CR in the openshift-vnc namespace, the deployed nmstate-handler-worker pods level off at half to a full core of CPU usage.

Version-Release number of selected component (if applicable):

OpenShift 4.4.7
CNV 2.3.0

How reproducible:

I have seen this on OpenShift on VMware and on bare-metal clusters.

Steps to Reproduce:
1. Deploy an OpenShift 4.4 cluster
2. Upgrade to 4.4.7
3. Subscribe to the Container-native virtualization 2.3.0 Operator
4. Create a HyperConverged CR with the defaults (a minimal example is sketched below).
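
For reference, a minimal HyperConverged CR with the defaults looks roughly like the sketch below; the apiVersion, name, and namespace are assumptions for this CNV release and may need to be adjusted to match the installed operator (check the served version with oc get crd hyperconvergeds.hco.kubevirt.io).

# Apply a minimal HyperConverged CR with defaults (values below are assumptions)
cat <<EOF | oc apply -f -
apiVersion: hco.kubevirt.io/v1alpha1   # assumed for CNV 2.3; newer releases serve a different version
kind: HyperConverged
metadata:
  name: kubevirt-hyperconverged
  namespace: openshift-cnv             # assumed install namespace
spec: {}
EOF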

Actual results:

The resulting nmstate-handler-worker pods show 0.500 to 1.000 CPU cores used in the web console.
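
The same numbers can be cross-checked from the CLI via the metrics API; the namespace below is an assumption and should be whichever namespace the handler pods were deployed into.

# Per-pod CPU/memory as reported by the cluster metrics API
oc adm top pods -n openshift-cnv | grep nmstate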

Expected results:


I would expect the CPU usage for a service like this to be 0.050 cores or less.

Additional info:

Comment 1 Jason Huddleston 2020-06-11 13:04:40 UTC
Created attachment 1696766 [details]
Console screenshot

Screenshot showing CPU Usage.

Comment 2 Dan Kenigsberg 2020-06-20 16:56:30 UTC
In cnv.engineering (cnv-2.3) I see nmstate-handler taking ~0.5 core per master, and nmstate-handler-worker taking 0.3-0.5 core per worker. However, a closer look suggests a bigger problem: etcdserver errors.

An nmstate-handler pod is busy tight-looping with
{"level":"error","ts":1592671549.9629877,"logger":"controller-runtime.controller","msg":"Reconciler error","controller":"nodenetworkstate-controller","request":"/cnv-x4lcc-master-2","error":"etcdserver: request timed out","stacktrace":"github.com/nmstate/kubernetes-nmstate/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/nmstate/kubernetes-nmstate/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/nmstate/kubernetes-nmstate/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/nmstate/kubernetes-nmstate/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:218\ngithub.com/nmstate/kubernetes-nmstate/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/nmstate/kubernetes-nmstate/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:192\ngithub.com/nmstate/kubernetes-nmstate/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/src/github.com/nmstate/kubernetes-nmstate/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:171\ngithub.com/nmstate/kubernetes-nmstate/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/nmstate/kubernetes-nmstate/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\ngithub.com/nmstate/kubernetes-nmstate/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/nmstate/kubernetes-nmstate/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\ngithub.com/nmstate/kubernetes-nmstate/vendor/k8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/nmstate/kubernetes-nmstate/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}

And an nmstate-handler-worker pod is in a similar condition.
{"level":"error","ts":1592671713.9723308,"logger":"controller-runtime.controller","msg":"Reconciler error","controller":"nodenetworkstate-controller","request":"/cnv-x4lcc-worker-0-gndlf","error":"etcdserver: request timed out","stacktrace":"github.com/nmstate/kubernetes-nmstate/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/nmstate/kubernetes-nmstate/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/nmstate/kubernetes-nmstate/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/nmstate/kubernetes-nmstate/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:218\ngithub.com/nmstate/kubernetes-nmstate/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/nmstate/kubernetes-nmstate/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:192\ngithub.com/nmstate/kubernetes-nmstate/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/src/github.com/nmstate/kubernetes-nmstate/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:171\ngithub.com/nmstate/kubernetes-nmstate/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/nmstate/kubernetes-nmstate/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\ngithub.com/nmstate/kubernetes-nmstate/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/nmstate/kubernetes-nmstate/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\ngithub.com/nmstate/kubernetes-nmstate/vendor/k8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/nmstate/kubernetes-nmstate/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}

Comment 4 Quique Llorente 2020-09-21 08:49:55 UTC
I have started to verify this with a kubevirtci deployment of github.com/nmstate/kubevirt, and I see that every so often the handler goes to 45% of 6 cores, which means around 3 cores, although these cores are vCPUs, so I don't know whether this correlates with the issue here.

Comment 6 Quique Llorente 2020-09-21 11:27:38 UTC
Tested with nmstate 0.3 and nmstate 0.4; both use around 40% of CPU every time nmstatectl show is called.
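
To reproduce that measurement, the cost of a single query can be timed from inside a handler pod roughly as sketched below; the pod name and namespace are placeholders, and this assumes the container shell provides the time keyword.

# Time one state query; repeated calls like this drive the steady CPU usage
oc exec -n openshift-cnv nmstate-handler-worker-xxxxx -- sh -c 'time nmstatectl show > /dev/null'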

Comment 7 Quique Llorente 2020-09-21 12:55:45 UTC
I have been playing around with the nmstatectl varlink implementation in nmstate 0.4; since nmstate is then up and running as a daemon, using it drops the consumption to 6%.

Comment 9 Quique Llorente 2020-09-29 12:59:53 UTC
Jason reported that the newer releases behave better, so we can close this; if it reappears we can reopen it.

Comment 10 Jason Huddleston 2020-10-12 20:55:26 UTC
Yes, you can close this BZ.

