Bug 2113875 - [node_exporter] Ceph node_exporter panics on AMD EPYC CPUs
Summary: [node_exporter] Ceph node_exporter panics on AMD EPYC CPUs
Keywords:
Status: CLOSED EOL
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Cephadm
Version: 4.3
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.3z2
Assignee: Guillaume Abrioux
QA Contact: Manasa
URL:
Whiteboard:
Depends On:
Blocks: 1760354
 
Reported: 2022-08-02 08:53 UTC by Flavio Piccioni
Modified: 2024-10-03 04:25 UTC
CC: 29 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-06-04 07:12:15 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-17953 0 None None None 2022-08-02 08:54:03 UTC
Red Hat Issue Tracker RHCEPH-5004 0 None None None 2022-08-04 14:47:26 UTC

Description Flavio Piccioni 2022-08-02 08:53:20 UTC
Description of problem:
After updating RHOSP to 16.2z3, the customer lost all hardware metrics for the CephStorage nodes on the Ceph dashboard. Checking the status of the node_exporter container, we found that it had panicked on every node:

```
Aug  1 16:26:57 stornode001.mydomain.com conmon[683060]: panic: "node_rapl_package-0-die_joules_total" is not a valid metric name
Aug  1 16:26:57 stornode001.mydomain.com conmon[683060]:
Aug  1 16:26:57 stornode001.mydomain.com conmon[683060]: goroutine 79 [running]:
Aug  1 16:26:57 stornode001.mydomain.com conmon[683060]: github.com/prometheus/client_golang/prometheus.MustNewConstMetric(...)
Aug  1 16:26:57 stornode001.mydomain.com conmon[683060]: #011/go/src/github.com/prometheus/node_exporter/vendor/github.com/prometheus/client_golang/prometheus/value.go:106
Aug  1 16:26:57 stornode001.mydomain.com conmon[683060]: github.com/prometheus/node_exporter/collector.(*raplCollector).Update(0xc0001292c0, 0xc0000ff620, 0x10d8460, 0x0)
Aug  1 16:26:57 stornode001.mydomain.com conmon[683060]: #011/go/src/github.com/prometheus/node_exporter/collector/rapl_linux.go:69 +0x49a
Aug  1 16:26:57 stornode001.mydomain.com conmon[683060]: github.com/prometheus/node_exporter/collector.execute(0xc004c7, 0x4, 0xcddb80, 0xc0001292c0, 0xc0000ff620, 0xcdd420, 0xc000126ff0)
Aug  1 16:26:57 stornode001.mydomain.com conmon[683060]: #011/go/src/github.com/prometheus/node_exporter/collector/collector.go:153 +0x84
Aug  1 16:26:57 stornode001.mydomain.com conmon[683060]: github.com/prometheus/node_exporter/collector.NodeCollector.Collect.func1(0xc0000ff620, 0xc000127200, 0xcdd420, 0xc000126ff0, 0xc0000375d4, 0xc004c7, 0x4, 0xcddb80, 0xc0001292c0)
Aug  1 16:26:57 stornode001.mydomain.com conmon[683060]: #011/go/src/github.com/prometheus/node_exporter/collector/collector.go:144 +0x71
Aug  1 16:26:57 stornode001.mydomain.com conmon[683060]: created by github.com/prometheus/node_exporter/collector.NodeCollector.Collect
Aug  1 16:26:57 stornode001.mydomain.com conmon[683060]: #011/go/src/github.com/prometheus/node_exporter/collector/collector.go:143 +0x13c
```

This appears to be caused by upstream issue https://github.com/prometheus/node_exporter/issues/2299.

Version-Release number of selected component (if applicable):
Red Hat OpenStack 16.2.3
ose-prometheus-node-exporter/images/v4.6.0-202206010727


How reproducible:
Run the node_exporter container on a host with an AMD EPYC CPU.

Actual results:
The application panics and the container dies.


Expected results:
The application collects metrics without panicking.

Additional info:
I'm going to attach more logs in the next comments.

Comment 3 Flavio Piccioni 2022-08-03 07:20:41 UTC
Hi,

We applied a simple workaround to the systemd unit that launches the node_exporter container, in order to exclude the rapl collector:

###############################
sudo systemctl edit node_exporter

In the ExecStart command, append --no-collector.rapl after --no-collector.timex:


[..]
--no-collector.timex \
--no-collector.rapl \
[..]

sudo systemctl daemon-reload
###############################


But of course we are going to lose this customization on any deploy/update.

Separately, disk I/O metrics are currently missing for some reason, and there are no significant error messages in the exporter logs. I'm not sure the rapl collector is involved in that issue.

Comment 101 Guillaume Abrioux 2024-06-04 07:11:29 UTC
Although this BZ was moved to POST more than a year ago, it never made it into a release, so I'm closing it.
Feel free to re-open if needed.

Comment 102 Red Hat Bugzilla 2024-10-03 04:25:02 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

