Bug 2097620 - node_exporter uses high cpu under load
Summary: node_exporter uses high cpu under load
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: Haoyu Sun
QA Contact: hongyan li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-06-16 07:31 UTC by Takayoshi Kimura
Modified: 2023-10-05 00:33 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-03-09 01:21:46 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID: Red Hat Knowledge Base (Solution) 6963044
Last Updated: 2022-06-16 07:32:08 UTC

Internal Links: 2117007

Description Takayoshi Kimura 2022-06-16 07:31:04 UTC
Description of problem:

The node_exporter sometimes uses high cpu under load and it looks like a spinlock race on multiple CPUs.

Version-Release number of selected component (if applicable):

4.8.23, on hosts with a large number of CPUs (e.g. 96)

How reproducible:

Always in customer env

Steps to Reproduce:
1. Generate load and monitor node resources using the top command at a 10-second interval
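The monitoring step above can be sketched as a small shell snippet; the PID lookup via pgrep is an assumption about how node_exporter appears in the host's process table:

```shell
# Hypothetical sketch of step 1: batch-mode top at a 10-second interval,
# filtered to the node_exporter process if one is running on this host.
pid=$(pgrep -o node_exporter || true)
if [ -n "$pid" ]; then
  top -b -d 10 -n 6 -p "$pid"   # 6 samples, one every 10 seconds
fi
```

Running top in batch mode (-b) makes the output easy to capture to a file for later comparison against the CPU spikes.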

Actual results:

node_exporter sometimes uses N * 100% CPU for a while. Normally it uses only 5%.

Expected results:

No unexpected high CPU usage with node_exporter

Additional info:

Similar spinlock-race high CPU usage has been reported upstream when the cpufreq collector is enabled. It appears the same spinlock race can also occur without the cpufreq collector, in cases where node_exporter cannot collect metrics smoothly for some reason.

https://github.com/prometheus/node_exporter/issues/1963
https://github.com/prometheus/node_exporter/pull/1964
https://github.com/prometheus/node_exporter/issues/1880

Comment 3 Simon Pasquier 2022-06-16 08:49:34 UTC
To investigate the issue, we would need a CPU profile from one of the nodes where you see excessive CPU usage.

oc exec -n openshift-monitoring <node-exporter-pod> -- curl -s http://localhost:9100/debug/pprof/profile?seconds=60 > cpu.pprof

Replace <node-exporter-pod> with the name of the node-exporter pod running on the node exhibiting the issue (oc exec targets a pod, not a node).
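Once cpu.pprof has been captured, it can be summarized locally; a minimal sketch, assuming the Go toolchain is installed and the file name used above:

```shell
# Hypothetical follow-up: print the functions consuming the most CPU time.
# Guarded so it is a no-op when go or cpu.pprof is absent.
if command -v go >/dev/null 2>&1 && [ -f cpu.pprof ]; then
  go tool pprof -top cpu.pprof
fi
```

If the profile is dominated by a single kernel- or collector-related function across many goroutines, that supports the spinlock-race hypothesis described above.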

Comment 20 Shiftzilla 2023-03-09 01:21:46 UTC
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira.

https://issues.redhat.com/browse/OCPBUGS-9325

