Bug 2028775 - [RFE] Allow containerized node-exporter to collect stats from the host storage devices
Summary: [RFE] Allow containerized node-exporter to collect stats from the host storage devices
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Ansible
Version: 4.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 4.3z1
Assignee: Guillaume Abrioux
QA Contact: Rahul Lepakshi
URL:
Whiteboard:
Depends On:
Blocks: 2074512
 
Reported: 2021-12-03 09:36 UTC by Sergii Mykhailushko
Modified: 2023-09-18 04:28 UTC (History)
CC: 15 users

Fixed In Version: ceph-ansible-4.0.70.4-1.el8cp, ceph-ansible-4.0.70.4-1.el7cp
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2074512 (view as bug list)
Environment:
Last Closed: 2022-09-22 11:21:04 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github ceph ceph-ansible pull 7138 0 None Merged [skip ci] dashboard: allow collecting stats from the host 2022-05-03 14:37:44 UTC
Red Hat Issue Tracker RHCEPH-2495 0 None None None 2021-12-03 11:33:32 UTC
Red Hat Product Errata RHBA-2022:6684 0 None None None 2022-09-22 11:21:29 UTC

Description Sergii Mykhailushko 2021-12-03 09:36:08 UTC
Description of problem:
We're using a node-exporter container in Ceph deployments for monitoring purposes.
The issue arises when we want the node-exporter instance to collect statistics from the storage devices that are local to the host; this fails because the exporter process is confined inside a container:

~~~
# curl -s localhost:9100/metrics| grep node_filesystem_device_error |  grep -v ^#
node_filesystem_device_error{device="/dev/mapper/rootvg-cephconlv",fstype="xfs",mountpoint="/var/lib/containers"} 1
node_filesystem_device_error{device="/dev/mapper/rootvg-cephconlv",fstype="xfs",mountpoint="/var/lib/containers/storage/overlay"} 1
node_filesystem_device_error{device="/dev/mapper/rootvg-cephloglv",fstype="xfs",mountpoint="/var/log/ceph"} 1
node_filesystem_device_error{device="/dev/mapper/rootvg-cephmonlv",fstype="xfs",mountpoint="/var/lib/ceph/mon"} 1
~~~

If we bind-mount the host's root into the container (read-only) and pass that location to node-exporter via its "--path.rootfs" option, the stats are collected as expected.
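
In essence this boils down to two additions to the container invocation. A minimal sketch, using the same image tag and flags referenced elsewhere in this report (only the pieces relevant to this RFE are shown):

~~~
# Sketch only: bind-mount the host root read-only and point node-exporter at it
podman run --rm --name=node-exporter --net=host \
  -v /proc:/host/proc:ro -v /sys:/host/sys:ro \
  -v /:/rootfs:ro \
  registry.redhat.io/openshift4/ose-prometheus-node-exporter:v4.6 \
  --path.procfs=/host/proc --path.sysfs=/host/sys \
  --path.rootfs=/rootfs
~~~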

Is there any chance we could get this hard-coded somehow into the container configuration as we ship it? Or does exposing the rootfs to the container raise any security concerns that I'm probably not aware of?


Version-Release number of selected component (if applicable):
Reproduced on v4.6 container tag:

~~~
# podman images
REPOSITORY                                                                    TAG               IMAGE ID      CREATED       SIZE
registry.redhat.io/openshift4/ose-prometheus-node-exporter                    v4.6              486b6b9a1c3a  4 weeks ago   252 MB
~~~


How reproducible:
Always

Steps to Reproduce:
1. With the default setup, node-exporter on the Ceph monitors is not able to export filesystem stats for the logical volumes that are local to the host.
2. For node-exporter to scrape the host's local filesystem stats, its container has to be started with two modifications:
   a. Bind-mount the host's / into the container at /rootfs (read-only).
   b. Point node-exporter's --path.rootfs argument at the container's /rootfs.
3. We tested this by modifying the ExecStart command line via a systemd unit override for node-exporter:
   
   ~~~
   $ cat /etc/systemd/system/node_exporter.service.d/override.conf
   [Service]
   ExecStart=
   ExecStart=/usr/bin/podman run --rm --name=node-exporter \
     -d --log-driver journald --conmon-pidfile /%t/%n-pid --cidfile /%t/%n-cid \
     --pids-limit=0 \
     --privileged \
     -v /proc:/host/proc:ro -v /sys:/host/sys:ro -v /:/rootfs:ro \         <---
     --net=host \
     registry.redhat.io/openshift4/ose-prometheus-node-exporter:v4.1 \
     --path.procfs=/host/proc \
     --path.sysfs=/host/sys \
     --path.rootfs=/rootfs \                                               <---
     --no-collector.timex \
     --web.listen-address=:9100

   $ sudo systemctl daemon-reload
   $ sudo systemctl restart node_exporter
   ~~~
 
4. Verified that the changes were picked up by the newly-spawned container:
   
   ~~~
   $ podman ps --no-trunc
   <...>  registry.redhat.io/openshift4/ose-prometheus-node-exporter:v4.1  --path.procfs=/host/proc --path.sysfs=/host/sys --path.rootfs=/rootfs --no-collector.timex
   --web.listen-address=:9100  18 seconds ago  Up 19 seconds ago           node-exporter
    
   $ podman exec node-exporter ls /rootfs
   bin
   boot
   dev
   etc
   home
   lib
   lib64
   media
   mnt
   opt
   proc
   root
   run
   sbin
   srv
   sys
   tmp
   usr
   var
   ~~~
   
5. Once the container is running with these changes, there are no errors when collecting stats from the LVM volumes that are local to the host (see the check below).
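
A quick way to confirm this, re-using the query from the description (node_filesystem_device_error reports 0 when the statfs call for a mount succeeds, 1 when it fails):

~~~
# With --path.rootfs in place, the gauge for the host-local mounts should read 0
curl -s localhost:9100/metrics | grep node_filesystem_device_error | grep -v ^#
~~~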

Additional info: this looks similar to the issue described in RHBZ #1814150, but the underlying reason is obviously different in this particular situation.

Comment 1 Jan Fajerski 2021-12-03 11:23:36 UTC
Reassigning to Ceph Storage. We manage the OpenShift Monitoring stack. The image referenced here is under an OpenShift namespace, but I don't think it is maintained by us.

Comment 2 Philip Gough 2021-12-03 11:25:36 UTC
I don't think this RFE is in the realm of the OCP monitoring platform. IIUC, this is a standalone node-exporter deployed to monitor a Ceph cluster, and it is not being managed by the cluster-monitoring-operator.

Hardcoding configuration values into the shipped image to fit this use case seems like the wrong solution, since the image has other use cases that could be affected (CMO, for example). I think this should be assigned to the storage team to evaluate an effective solution.

It might also be worth taking a look at https://github.com/openshift/enhancements/pull/838

Comment 3 Guillaume Abrioux 2021-12-03 14:04:59 UTC
(In reply to Jan Fajerski from comment #1)
> Reassigning to Ceph Storage. We manage the OpenShift Monitoring stack. The
> image referenced here is under an OpenShift namespace, but I don't think it
> is maintained by us.

I think this is incorrect.
We don't build our own node-exporter image for RH Ceph Storage; we just re-use OpenShift's component, so I think someone from OpenShift must build this image.

(In reply to Philip Gough from comment #2)
> Hardcoding values config into/with the shipped image to fit this use case
> seems like the wrong solution, since the image will have other use cases
> that could be effected (CMO for example).

I agree, I don't think this is the right solution here.



@Sergii, could you elaborate a bit more on your deployment?
Is it RHCS 4 related? In that case, I would simply patch ceph-ansible for that.
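
For illustration, such a ceph-ansible patch would add the read-only bind mount and the --path.rootfs flag to the node-exporter unit template. A rough sketch only; the template path and the jinja2 variable names below are placeholders, not the exact change:

~~~
# roles/ceph-node-exporter/templates/node_exporter.service.j2 (path assumed)
ExecStart=/usr/bin/{{ container_binary }} run --rm --name=node-exporter \
  -v /proc:/host/proc:ro -v /sys:/host/sys:ro -v /:/rootfs:ro \   <--- added mount
  --net=host \
  {{ node_exporter_container_image }} \
  --path.procfs=/host/proc \
  --path.sysfs=/host/sys \
  --path.rootfs=/rootfs \                                         <--- added flag
  --no-collector.timex
~~~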

Comment 4 Sergii Mykhailushko 2021-12-03 14:48:22 UTC
> @Sergii, could you elaborate a bit more on your deployment?
> Is it RHCS 4 related? In that case, I would simply patch ceph-ansible for
> that.


Hi,

Thanks for looking into this.

In this particular scenario we're dealing with RHCS 4, yes.

A few questions, however:

1. How would mounting the host's root inside a container hurt operations in the other use cases? TBH I don't know what CMO is, so I have little idea of the impact.
2. If we decide to go the route of patching ceph-ansible, are there any guidelines on how this can be achieved?
3. In case someone needs this with RHCS 5, what would the workflow be without ceph-ansible? Or should we just suggest manually overriding the systemd units, since building the values into the defaults is not a safe/optimal solution?

Sergii

Comment 20 errata-xmlrpc 2022-09-22 11:21:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 4.3 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:6684

Comment 21 Red Hat Bugzilla 2023-09-18 04:28:57 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

