Bug 2113875 - [node_exporter] Ceph node_exporter panics on AMD EPYC CPUs
Summary: [node_exporter] Ceph node_exporter panics on AMD EPYC CPUs
Keywords:
Status: CLOSED EOL
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Cephadm
Version: 4.3
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.3z2
Assignee: Guillaume Abrioux
QA Contact: Manasa
URL:
Whiteboard:
Depends On:
Blocks: 1760354
 
Reported: 2022-08-02 08:53 UTC by Flavio Piccioni
Modified: 2024-10-03 04:25 UTC
CC: 29 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-06-04 07:12:15 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-17953 0 None None None 2022-08-02 08:54:03 UTC
Red Hat Issue Tracker RHCEPH-5004 0 None None None 2022-08-04 14:47:26 UTC

Description Flavio Piccioni 2022-08-02 08:53:20 UTC
Description of problem:
After updating RHOSP to 16.2z3, the customer lost all hardware metrics for the CephStorage nodes on the Ceph dashboard. Checking the status of the node_exporter container, we found that it had panicked on every node:

```
Aug  1 16:26:57 stornode001.mydomain.com conmon[683060]: panic: "node_rapl_package-0-die_joules_total" is not a valid metric name
Aug  1 16:26:57 stornode001.mydomain.com conmon[683060]:
Aug  1 16:26:57 stornode001.mydomain.com conmon[683060]: goroutine 79 [running]:
Aug  1 16:26:57 stornode001.mydomain.com conmon[683060]: github.com/prometheus/client_golang/prometheus.MustNewConstMetric(...)
Aug  1 16:26:57 stornode001.mydomain.com conmon[683060]: #011/go/src/github.com/prometheus/node_exporter/vendor/github.com/prometheus/client_golang/prometheus/value.go:106
Aug  1 16:26:57 stornode001.mydomain.com conmon[683060]: github.com/prometheus/node_exporter/collector.(*raplCollector).Update(0xc0001292c0, 0xc0000ff620, 0x10d8460, 0x0)
Aug  1 16:26:57 stornode001.mydomain.com conmon[683060]: #011/go/src/github.com/prometheus/node_exporter/collector/rapl_linux.go:69 +0x49a
Aug  1 16:26:57 stornode001.mydomain.com conmon[683060]: github.com/prometheus/node_exporter/collector.execute(0xc004c7, 0x4, 0xcddb80, 0xc0001292c0, 0xc0000ff620, 0xcdd420, 0xc000126ff0)
Aug  1 16:26:57 stornode001.mydomain.com conmon[683060]: #011/go/src/github.com/prometheus/node_exporter/collector/collector.go:153 +0x84
Aug  1 16:26:57 stornode001.mydomain.com conmon[683060]: github.com/prometheus/node_exporter/collector.NodeCollector.Collect.func1(0xc0000ff620, 0xc000127200, 0xcdd420, 0xc000126ff0, 0xc0000375d4, 0xc004c7, 0x4, 0xcddb80, 0xc0001292c0)
Aug  1 16:26:57 stornode001.mydomain.com conmon[683060]: #011/go/src/github.com/prometheus/node_exporter/collector/collector.go:144 +0x71
Aug  1 16:26:57 stornode001.mydomain.com conmon[683060]: created by github.com/prometheus/node_exporter/collector.NodeCollector.Collect
Aug  1 16:26:57 stornode001.mydomain.com conmon[683060]: #011/go/src/github.com/prometheus/node_exporter/collector/collector.go:143 +0x13c
```

This appears to be caused by upstream issue https://github.com/prometheus/node_exporter/issues/2299.

Version-Release number of selected component (if applicable):
Red Hat OpenStack 16.2.3
ose-prometheus-node-exporter/images/v4.6.0-202206010727


How reproducible:
Run the node_exporter container on a host with an AMD EPYC CPU.

Actual results:
The application panics and the container dies.


Expected results:
The application collects metrics without panicking.

Additional info:
I'm going to attach more logs in the next comments.

Comment 3 Flavio Piccioni 2022-08-03 07:20:41 UTC
Hi,

We applied a simple workaround to the systemd unit that launches the node_exporter container, in order to exclude the rapl collector:

###############################
sudo systemctl edit node_exporter

In the ExecStart command, append --no-collector.rapl after --no-collector.timex:


[..]
--no-collector.timex \
--no-collector.rapl \
[..]

sudo systemctl daemon-reload
###############################


But of course we are going to lose this customization on any deploy/update.

Separately, disk I/O metrics are currently missing for some reason, and there are no significant error messages in the exporter logs. I'm not sure the rapl collector is involved in that issue.

Comment 101 Guillaume Abrioux 2024-06-04 07:11:29 UTC
Although this BZ was moved to POST more than a year ago, it never made it into a release, so I'm closing it.
Feel free to re-open if needed.

Comment 102 Red Hat Bugzilla 2024-10-03 04:25:02 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

