Description of problem:
The node_exporter sometimes uses high CPU under load, and it looks like a spinlock race across multiple CPUs.

Version-Release number of selected component (if applicable):
4.8.23, on hosts with a large number of CPUs (e.g. 96 CPUs)

How reproducible:
Always in the customer environment

Steps to Reproduce:
1. Generate load and monitor node resources with the top command at a 10 second interval (a monitoring sketch follows the links below)
2.
3.

Actual results:
node_exporter sometimes uses N * 100% CPU for a while. Normally it uses only about 5%.

Expected results:
No unexpected high CPU usage from node_exporter

Additional info:
Similar spinlock-race high CPU usage has been reported upstream when the cpufreq collector is enabled. It appears the spinlock race can also happen without cpufreq, when node_exporter cannot collect metrics smoothly for some reason.
https://github.com/prometheus/node_exporter/issues/1963
https://github.com/prometheus/node_exporter/pull/1964
https://github.com/prometheus/node_exporter/issues/1880
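For reference, a minimal sketch of the monitoring step above. It assumes the process appears as "node_exporter" in top's COMMAND column; adjust the pattern if the process name differs on your nodes:

# Sample CPU usage every 10 seconds in batch mode, keeping the header and node_exporter lines
top -b -d 10 | grep --line-buffered -E '^top|node_exporter'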
To investigate the issue, we would need a CPU profile from one of the nodes where you see the excessive CPU usage:

oc exec -n openshift-monitoring <node> -- curl -s http://localhost:9100/debug/pprof/profile?seconds=60 > cpu.pprof

Replace <node> with the actual name of the node exhibiting the issue.
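As a sketch, the node-exporter pod running on the affected node can be located and profiled like this. The label selector (app.kubernetes.io/name=node-exporter) and container name (node-exporter) are assumptions and may differ between OpenShift versions:

# Find the node-exporter pod scheduled on the affected node
POD=$(oc -n openshift-monitoring get pods -l app.kubernetes.io/name=node-exporter \
      --field-selector spec.nodeName=<node> -o name | head -n 1)
# Collect a 60-second CPU profile from the pprof endpoint on localhost:9100
oc -n openshift-monitoring exec "$POD" -c node-exporter -- \
  curl -s 'http://localhost:9100/debug/pprof/profile?seconds=60' > cpu.pprof

The resulting cpu.pprof can then be inspected locally with go tool pprof cpu.pprof (or go tool pprof -http=:8080 cpu.pprof for the web UI).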
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira. https://issues.redhat.com/browse/OCPBUGS-9325