Description of problem:

After updating RHOSP to 16.2z3, the customer lost all hardware metrics for the CephStorage nodes on the Ceph dashboard. Checking the status of the node_exporter container shows that it panics on every node:

```
Aug 1 16:26:57 stornode001.mydomain.com conmon[683060]: panic: "node_rapl_package-0-die_joules_total" is not a valid metric name
Aug 1 16:26:57 stornode001.mydomain.com conmon[683060]:
Aug 1 16:26:57 stornode001.mydomain.com conmon[683060]: goroutine 79 [running]:
Aug 1 16:26:57 stornode001.mydomain.com conmon[683060]: github.com/prometheus/client_golang/prometheus.MustNewConstMetric(...)
Aug 1 16:26:57 stornode001.mydomain.com conmon[683060]: #011/go/src/github.com/prometheus/node_exporter/vendor/github.com/prometheus/client_golang/prometheus/value.go:106
Aug 1 16:26:57 stornode001.mydomain.com conmon[683060]: github.com/prometheus/node_exporter/collector.(*raplCollector).Update(0xc0001292c0, 0xc0000ff620, 0x10d8460, 0x0)
Aug 1 16:26:57 stornode001.mydomain.com conmon[683060]: #011/go/src/github.com/prometheus/node_exporter/collector/rapl_linux.go
Aug 1 16:26:57 stornode001.mydomain.com conmon[683060]: :69 +0x49a
Aug 1 16:26:57 stornode001.mydomain.com conmon[683060]: github.com/prometheus/node_exporter/collector.execute(0xc004c7, 0x4, 0xcddb80, 0xc0001292c0, 0xc0000ff620, 0xcdd420, 0xc000126ff0)
Aug 1 16:26:57 stornode001.mydomain.com conmon[683060]: #011/go/src/github.com/prometheus/node_exporter/collector/collector.go:153 +0x84
Aug 1 16:26:57 stornode001.mydomain.com conmon[683060]: github.com/prometheus/node_exporter/collector.NodeCollector.Collect.func1(0xc0000ff620, 0xc000127200, 0xcdd420, 0xc000126ff0, 0xc0000375d4, 0xc004c7, 0x4, 0xcddb80, 0xc0001292c0)
Aug 1 16:26:57 stornode001.mydomain.com conmon[683060]: #011/go/src/github.com/prometheus/node_exporter/collector/collector.go:144 +0x71
Aug 1 16:26:57 stornode001.mydomain.com conmon[683060]: created by github.com/prometheus/node_exporter/collector.NodeCollector.Collect
Aug 1 16:26:57 stornode001.mydomain.com conmon[683060]: #011/go/src/github.com/prometheus/node_exporter/collector/collector.go:143 +0x13c
```

This appears to be caused by the upstream issue https://github.com/prometheus/node_exporter/issues/2299: the rapl collector builds the metric name from the sysfs RAPL zone name, which on some hardware contains dashes, and dashes are not allowed in Prometheus metric names.

Version-Release number of selected component (if applicable):
Red Hat OpenStack Platform 16.2.3
ose-prometheus-node-exporter/images/v4.6.0-202206010727

How reproducible:
Run the node_exporter container.

Actual results:
The application panics and the container dies.

Expected results:
The application collects metrics.

Additional info:
I'm going to attach more logs in the next comments.
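For reference, a minimal, illustrative Go snippet (standard library only, not node_exporter code) that applies the classic Prometheus metric-name rule to the dashed name from the panic and to its sanitized form:

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Legacy Prometheus metric-name rule: [a-zA-Z_:][a-zA-Z0-9_:]* ; dashes are not allowed.
	validName := regexp.MustCompile(`^[a-zA-Z_:][a-zA-Z0-9_:]*$`)

	names := []string{
		"node_rapl_package_0_die_joules_total", // sanitized form: valid
		"node_rapl_package-0-die_joules_total", // name from the panic above: invalid ('-')
	}
	for _, name := range names {
		fmt.Printf("%-40s valid=%v\n", name, validName.MatchString(name))
	}
}
```

In the traceback above, MustNewConstMetric is the "Must" variant that panics instead of returning an error when a name fails this check, which is why the whole exporter process dies rather than just skipping the rapl metric.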
Hi, for now we have applied a simple workaround to the systemd unit that launches the node_exporter container, in order to exclude the rapl collector:

###############################
sudo systemctl edit node_exporter

In the ExecStart command, append --no-collector.rapl after --no-collector.timex:

[..]
  --no-collector.timex \
  --no-collector.rapl \
[..]

sudo systemctl daemon-reload
###############################

But of course we will lose this customization on the next deploy/update (see the sketch below for what the resulting drop-in looks like).

Separately, disk I/O metrics are currently missing for some reason, and there are no significant error messages in the exporter logs; I'm not sure the rapl collector is involved in that issue.
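For completeness, a hypothetical sketch of what the drop-in created by `systemctl edit node_exporter` could look like. The override path is the standard systemd drop-in location; the unit's actual command line is not shown in this bug, so the launcher script below is only a placeholder:

```
# /etc/systemd/system/node_exporter.service.d/override.conf
# Hypothetical sketch: the real unit's command line must be copied here verbatim.
[Service]
# An empty ExecStart= is required to clear the original command before redefining it.
ExecStart=
# Repeat the unit's original ExecStart with --no-collector.rapl appended, e.g.:
ExecStart=/usr/local/bin/node_exporter_launcher.sh --no-collector.timex --no-collector.rapl
```

After `systemctl daemon-reload` and a restart of the unit this persists across reboots, but as noted above it does not survive an overcloud deploy/update that re-renders the unit.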
Although this BZ was moved to POST more than a year ago, it never made it into a release, so I'm closing it. Feel free to re-open if needed.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days