Bug 2097620 - node_exporter uses high cpu under load
Summary: node_exporter uses high cpu under load
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: Haoyu Sun
QA Contact: hongyan li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-06-16 07:31 UTC by Takayoshi Kimura
Modified: 2023-10-05 00:33 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-03-09 01:21:46 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID: Red Hat Knowledge Base (Solution) 6963044
Last Updated: 2022-06-16 07:32:08 UTC

Internal Links: 2117007

Description Takayoshi Kimura 2022-06-16 07:31:04 UTC
Description of problem:

The node_exporter sometimes uses high cpu under load and it looks like a spinlock race on multiple CPUs.

Version-Release number of selected component (if applicable):

4.8.23, on hosts with a large number of CPUs (e.g. 96)

How reproducible:

Always in customer env

Steps to Reproduce:
1. Generate load and monitor node resources using the top command at a 10-second interval
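The monitoring step above can be sketched as a small shell snippet; the PID lookup via pgrep is an assumption about how node_exporter appears in the host's process table:

```shell
# Hypothetical sketch of step 1: batch-mode top at a 10-second interval,
# filtered to the node_exporter process if one is running on this host.
pid=$(pgrep -o node_exporter || true)
if [ -n "$pid" ]; then
  top -b -d 10 -n 6 -p "$pid"   # 6 samples, one every 10 seconds
fi
```

Running top in batch mode (-b) makes the output easy to capture to a file for later comparison against the CPU spikes.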

Actual results:

node_exporter sometimes uses N * 100% CPU for a while. Normally it uses only 5%.

Expected results:

No unexpected high CPU usage with node_exporter

Additional info:

Similar spinlock-race high CPU usage has been reported upstream when the cpufreq collector is enabled. It appears the same spinlock race can also occur without the cpufreq collector, in cases where node_exporter cannot collect metrics smoothly for some reason.

https://github.com/prometheus/node_exporter/issues/1963
https://github.com/prometheus/node_exporter/pull/1964
https://github.com/prometheus/node_exporter/issues/1880

Comment 3 Simon Pasquier 2022-06-16 08:49:34 UTC
To investigate the issue, we would need a CPU profile from one of the nodes where you see excessive CPU usage.

oc exec -n openshift-monitoring <node-exporter-pod> -- curl -s http://localhost:9100/debug/pprof/profile?seconds=60 > cpu.pprof

Replace <node-exporter-pod> with the name of the node-exporter pod running on the node exhibiting the issue (oc exec targets a pod, not a node).
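Once cpu.pprof has been captured, it can be summarized locally; a minimal sketch, assuming the Go toolchain is installed and the file name used above:

```shell
# Hypothetical follow-up: print the functions consuming the most CPU time.
# Guarded so it is a no-op when go or cpu.pprof is absent.
if command -v go >/dev/null 2>&1 && [ -f cpu.pprof ]; then
  go tool pprof -top cpu.pprof
fi
```

If the profile is dominated by a single kernel- or collector-related function across many goroutines, that supports the spinlock-race hypothesis described above.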

Comment 20 Shiftzilla 2023-03-09 01:21:46 UTC
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira.

https://issues.redhat.com/browse/OCPBUGS-9325

