Description of problem:
1) Trying to configure a Ceph cluster using a custom local registry.
2) Even though container_image_node_exporter is set to the custom image, node-exporter still tries to pull the default image. Since the node doesn't have access to the default registry, node-exporter stays in an unknown state.

Version-Release number of selected component (if applicable):
16.1.0-486.el8cp

How reproducible:
Tried once

Steps to Reproduce:
1. Bootstrap the cluster and set the monitoring stack images to a custom container registry.

Actual results:
# ceph config get mgr mgr/cephadm/container_image_node_exporter
172.16.34.57:5000/ose-prometheus-node-exporter:latest

# ceph orch ls
..
node-exporter  0/1  4m ago  18h  *  registry.redhat.io/openshift4/ose-prometheus-node-exporter:v4.5  <unknown>

# sudo journalctl -fu ceph-0bf1621a-7b6e-11eb-bcba-fa163eaf1801
...
Error: Error initializing source docker://registry.redhat.io/openshift4/ose-prometheus-node-exporter:v4.5: error pinging docker registry registry.redhat.io: Get "https://registry.redhat.io/v2/": dial tcp: lookup registry.redhat.io on xx.xx.xx.x: read udp 172.16.34.41:57872->10.5.30.160:53: i/o timeout

# cat /var/log/cephadm.log
2021-03-04 05:03:25,437 DEBUG Not possible to enable service <node-exporter>. firewalld.service is not available
2021-03-04 05:03:25,437 DEBUG Not possible to open ports <[9100]>. firewalld.service is not available

Expected results:
Node-exporter should be up and running.

Additional info:
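For reference, a sketch of how the node-exporter image option was pointed at the local registry (the value matches the config get shown above; the registry host/port is specific to this environment):

# ceph config set mgr mgr/cephadm/container_image_node_exporter 172.16.34.57:5000/ose-prometheus-node-exporter:latest
# ceph config get mgr mgr/cephadm/container_image_node_exporter
172.16.34.57:5000/ose-prometheus-node-exporter:latest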
does this actually work for any other daemons that don't use the Ceph image besides node-exporter (Grafana, Prometheus, Alertmanager)? I didn't think we ever actually implemented a way to properly use custom images for monitoring stack daemons. If it actually does work for those other daemons could you share your process for setting them up with custom images? That would help me figure out why node-exporter specifically is having an issue.
Hi Adam,

(In reply to Adam King from comment #1)
> does this actually work for any other daemons that don't use the Ceph image
> besides node-exporter (Grafana, Prometheus, Alertmanager)?

It worked for me for grafana, prometheus and alertmanager.

# ceph orch ls
NAME           RUNNING  REFRESHED  AGE  PLACEMENT    IMAGE NAME                                                                                                 IMAGE ID
alertmanager   1/1      48s ago    9d   count:1      172.16.34.57:5000/ose-prometheus-alertmanager  e39ebb02f7e9
crash          1/1      48s ago    9d   *            172.16.34.57:5000/rh-osbs/rhceph@sha256:966a9a874cdb1eff8365575918543a1e9fbf775a9cf93e7a634a06d18232822f  700feae6f592
grafana        1/1      48s ago    9d   count:1      172.16.34.57:5000/rh-osbs/grafana:5-17  b3d99ee311c3
mgr            1/2      48s ago    9d   <unmanaged>  172.16.34.57:5000/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-21981-20210302003306  700feae6f592
mon            1/5      48s ago    9d   <unmanaged>  172.16.34.57:5000/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-21981-20210302003306  700feae6f592
node-exporter  1/1      48s ago    9d   *            registry.redhat.io/openshift4/ose-prometheus-node-exporter:v4.5  e4be1e64c76a
prometheus     1/1      48s ago    9d   count:1      172.16.34.57:5000/ose-prometheus:latest  b7e8775f2f94

> If it actually does work for those other daemons could you share your
> process for setting them up with custom images? That would help me figure
> out why node-exporter specifically is having an issue.

I followed https://docs.ceph.com/en/latest/cephadm/monitoring/#using-custom-images

# for i in {container_image_prometheus,container_image_grafana,container_image_alertmanager,container_image_node_exporter};do ceph config get mgr mgr/cephadm/$i ;done
172.16.34.57:5000/ose-prometheus:latest
172.16.34.57:5000/rh-osbs/grafana:5-17
172.16.34.57:5000/ose-prometheus-alertmanager
172.16.34.57:5000/ose-prometheus-node-exporter:latest
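For completeness, per the linked documentation, the values above would have been set with commands along these lines (a sketch; the image references are the ones returned by the config gets above):

# ceph config set mgr mgr/cephadm/container_image_prometheus 172.16.34.57:5000/ose-prometheus:latest
# ceph config set mgr mgr/cephadm/container_image_grafana 172.16.34.57:5000/rh-osbs/grafana:5-17
# ceph config set mgr mgr/cephadm/container_image_alertmanager 172.16.34.57:5000/ose-prometheus-alertmanager
# ceph config set mgr mgr/cephadm/container_image_node_exporter 172.16.34.57:5000/ose-prometheus-node-exporter:latest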
For me this was fixable by redeploying the node-exporter service (ceph orch redeploy node-exporter).

To test this, I made a custom image that used upstream master code for the base Ceph image but downstream images for the monitoring stack containers. When the cluster was set up, none of the monitoring stack containers came up:

[ceph: root@vm-00 /]# ceph orch ls
NAME                       RUNNING  REFRESHED  AGE  PLACEMENT  IMAGE NAME                                                        IMAGE ID
alertmanager               0/1      -          -    count:1    <unknown>                                                         <unknown>
crash                      3/3      8m ago     13m  *          mix                                                               78bb486410ea
grafana                    0/1      -          -    count:1    <unknown>                                                         <unknown>
mgr                        2/2      8m ago     13m  count:2    mix                                                               78bb486410ea
mon                        3/5      8m ago     13m  count:5    mix                                                               78bb486410ea
node-exporter              0/3      8m ago     13m  *          registry.redhat.io/openshift4/ose-prometheus-node-exporter:v4.5  <unknown>
osd.all-available-devices  6/6      8m ago     13m  *          mix                                                               78bb486410ea
prometheus                 0/1      -          -    count:1    <unknown>                                                         <unknown>

I changed the alertmanager, prometheus and node-exporter images using the method you linked. Prometheus and Alertmanager fixed themselves without any additional steps from me, but node-exporter remained in an error state.

[ceph: root@vm-00 /]# ceph orch ps --refresh
NAME                 HOST   STATUS         REFRESHED  AGE  VERSION                IMAGE NAME  IMAGE ID  CONTAINER ID
alertmanager.vm-02   vm-02  running (2m)   66s ago    2m   0.20.0                 docker.io/prom/alertmanager:v0.20.0  0881eb8f169f  66380b67a1fe
crash.vm-00          vm-00  running (24m)  10m ago    24m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph:testing  78bb486410ea  ab8f449941df
crash.vm-01          vm-01  running (23m)  7m ago     23m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph@sha256:d10ca19a6cecd7b498a282b2ef05a51cba54a298a34c66a446bbecc4596c0f6d  78bb486410ea  776a5770f42e
crash.vm-02          vm-02  running (21m)  66s ago    21m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph@sha256:d10ca19a6cecd7b498a282b2ef05a51cba54a298a34c66a446bbecc4596c0f6d  78bb486410ea  ddef1854d9f3
mgr.vm-00.vqrmqr     vm-00  running (25m)  10m ago    25m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph:testing  78bb486410ea  33a4ac44e98b
mgr.vm-01.opynpz     vm-01  running (23m)  7m ago     23m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph@sha256:d10ca19a6cecd7b498a282b2ef05a51cba54a298a34c66a446bbecc4596c0f6d  78bb486410ea  7798096f87aa
mon.vm-00            vm-00  running (25m)  10m ago    25m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph:testing  78bb486410ea  49890b007727
mon.vm-01            vm-01  running (20m)  7m ago     20m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph@sha256:d10ca19a6cecd7b498a282b2ef05a51cba54a298a34c66a446bbecc4596c0f6d  78bb486410ea  042f582c13fb
mon.vm-02            vm-02  running (20m)  66s ago    20m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph@sha256:d10ca19a6cecd7b498a282b2ef05a51cba54a298a34c66a446bbecc4596c0f6d  78bb486410ea  acd8752d713b
node-exporter.vm-00  vm-00  unknown        10m ago    20m  <unknown>              registry.redhat.io/openshift4/ose-prometheus-node-exporter:v4.5  <unknown>  <unknown>
node-exporter.vm-01  vm-01  error          7m ago     20m  <unknown>              registry.redhat.io/openshift4/ose-prometheus-node-exporter:v4.5  <unknown>  <unknown>
node-exporter.vm-02  vm-02  error          66s ago    20m  <unknown>              registry.redhat.io/openshift4/ose-prometheus-node-exporter:v4.5  <unknown>  <unknown>
osd.0                vm-01  running (22m)  7m ago     22m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph@sha256:d10ca19a6cecd7b498a282b2ef05a51cba54a298a34c66a446bbecc4596c0f6d  78bb486410ea  3f606cf45732
osd.1                vm-00  running (22m)  10m ago    22m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph:testing  78bb486410ea  7419b0d380dd
osd.2                vm-00  running (22m)  10m ago    22m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph:testing  78bb486410ea  7c2adf6bb0a4
osd.3                vm-01  running (22m)  7m ago     22m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph@sha256:d10ca19a6cecd7b498a282b2ef05a51cba54a298a34c66a446bbecc4596c0f6d  78bb486410ea  9acacfa93f17
osd.4                vm-02  running (20m)  66s ago    20m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph@sha256:d10ca19a6cecd7b498a282b2ef05a51cba54a298a34c66a446bbecc4596c0f6d  78bb486410ea  e0f018252c1c
osd.5                vm-02  running (20m)  66s ago    20m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph@sha256:d10ca19a6cecd7b498a282b2ef05a51cba54a298a34c66a446bbecc4596c0f6d  78bb486410ea  5d1ce1e7bde6
prometheus.vm-02     vm-02  running (2m)   66s ago    9m   2.18.1                 docker.io/prom/prometheus:v2.18.1  de242295e225  a9d1eeb8337b

When I ran "ceph orch redeploy node-exporter", node-exporter came up as well:

[ceph: root@vm-00 /]# ceph orch redeploy node-exporter
Scheduled to redeploy node-exporter.vm-00 on host 'vm-00'
Scheduled to redeploy node-exporter.vm-01 on host 'vm-01'
Scheduled to redeploy node-exporter.vm-02 on host 'vm-02'

[ceph: root@vm-00 /]# ceph orch ps
NAME                 HOST   STATUS         REFRESHED  AGE  VERSION                IMAGE NAME  IMAGE ID  CONTAINER ID
alertmanager.vm-02   vm-02  running (3m)   12s ago    3m   0.20.0                 docker.io/prom/alertmanager:v0.20.0  0881eb8f169f  66380b67a1fe
crash.vm-00          vm-00  running (26m)  13s ago    26m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph:testing  78bb486410ea  ab8f449941df
crash.vm-01          vm-01  running (24m)  13s ago    24m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph@sha256:d10ca19a6cecd7b498a282b2ef05a51cba54a298a34c66a446bbecc4596c0f6d  78bb486410ea  776a5770f42e
crash.vm-02          vm-02  running (22m)  12s ago    22m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph@sha256:d10ca19a6cecd7b498a282b2ef05a51cba54a298a34c66a446bbecc4596c0f6d  78bb486410ea  ddef1854d9f3
mgr.vm-00.vqrmqr     vm-00  running (27m)  13s ago    27m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph:testing  78bb486410ea  33a4ac44e98b
mgr.vm-01.opynpz     vm-01  running (24m)  13s ago    24m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph@sha256:d10ca19a6cecd7b498a282b2ef05a51cba54a298a34c66a446bbecc4596c0f6d  78bb486410ea  7798096f87aa
mon.vm-00            vm-00  running (27m)  13s ago    27m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph:testing  78bb486410ea  49890b007727
mon.vm-01            vm-01  running (22m)  13s ago    22m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph@sha256:d10ca19a6cecd7b498a282b2ef05a51cba54a298a34c66a446bbecc4596c0f6d  78bb486410ea  042f582c13fb
mon.vm-02            vm-02  running (22m)  12s ago    22m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph@sha256:d10ca19a6cecd7b498a282b2ef05a51cba54a298a34c66a446bbecc4596c0f6d  78bb486410ea  acd8752d713b
node-exporter.vm-00  vm-00  running (24s)  13s ago    21m  0.18.1                 docker.io/prom/node-exporter:v0.18.1  e5a616e4b9cf  c9dac439897a
node-exporter.vm-01  vm-01  running (20s)  13s ago    21m  0.18.1                 docker.io/prom/node-exporter:v0.18.1  e5a616e4b9cf  0f04d9f5a59e
node-exporter.vm-02  vm-02  running (15s)  12s ago    21m  0.18.1                 docker.io/prom/node-exporter:v0.18.1  e5a616e4b9cf  1fcfcb307510
osd.0                vm-01  running (24m)  13s ago    23m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph@sha256:d10ca19a6cecd7b498a282b2ef05a51cba54a298a34c66a446bbecc4596c0f6d  78bb486410ea  3f606cf45732
osd.1                vm-00  running (23m)  13s ago    23m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph:testing  78bb486410ea  7419b0d380dd
osd.2                vm-00  running (23m)  13s ago    23m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph:testing  78bb486410ea  7c2adf6bb0a4
osd.3                vm-01  running (23m)  13s ago    23m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph@sha256:d10ca19a6cecd7b498a282b2ef05a51cba54a298a34c66a446bbecc4596c0f6d  78bb486410ea  9acacfa93f17
osd.4                vm-02  running (21m)  12s ago    21m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph@sha256:d10ca19a6cecd7b498a282b2ef05a51cba54a298a34c66a446bbecc4596c0f6d  78bb486410ea  e0f018252c1c
osd.5                vm-02  running (21m)  12s ago    21m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph@sha256:d10ca19a6cecd7b498a282b2ef05a51cba54a298a34c66a446bbecc4596c0f6d  78bb486410ea  5d1ce1e7bde6
prometheus.vm-02     vm-02  running (3m)   12s ago    11m  2.18.1                 docker.io/prom/prometheus:v2.18.1  de242295e225  a9d1eeb8337b

I'd see if this works for you as well.

NOTE: Fixing this with a redeploy might only work with these changes present in the image: https://github.com/ceph/ceph/pull/39385. The changes were only backported to Pacific 15 days ago (https://github.com/ceph/ceph/pull/39623), so I'd check that whatever image you're using for your Ceph base while testing this was created since then. That also means that redeploying might not have worked when you originally found this bug, so if you tried it before, I'd still try it again.

Another note: the reason node-exporter behaves differently here is that cephadm will try to reconfigure it if it has not been reconfigured since changes occurred in other related daemons. Reconfiguration does not change the image being used, which is why it keeps trying to pull the old image, and since it can never complete the reconfiguration, it always thinks it still needs to be reconfigured and gets stuck in that loop. Explicitly telling cephadm to redeploy node-exporter makes it actually update the image being used and breaks the cycle.

Let me know if this doesn't fix it. If it turns out not to, including some of the systemd logs from one of the node-exporter systemd units would help with debugging (i.e. anything near a message about failing to start, from a command like 'journalctl -xeu ceph-0ccaae08-8370-11eb-afcf-5254008c7ecc' but with your node-exporter unit name substituted in).
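To summarize the workaround as a short sketch (the journalctl unit name below is a placeholder/assumption; substitute your cluster fsid, daemon type and host as appropriate):

# ceph orch redeploy node-exporter
# ceph orch ps --refresh | grep node-exporter
# journalctl -xeu ceph-<fsid>@node-exporter.<host>.service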
Hi Adam,

Thanks a lot for the detailed information.

>> "ceph orch redeploy node-exporter"
That worked for me.

Additionally, this time I made a mistake and pointed Grafana at the internal registry, whereas I wanted the image to be pulled from the other local registry. So I re-set mgr/cephadm/container_image_grafana and waited for a long time, but the change did not take effect; I had to run "ceph orch redeploy grafana" to get the daemons updated.

I agree that "ceph orch redeploy node-exporter" can be documented as a workaround for this issue, but I'm not sure whether it is a suitable thing to do from a usability perspective. Can this be fixed? (Daemons should get redeployed when the user changes the value of mgr/cephadm/container_image_<daemon>.)

Regards,
Vasishta Shastry
QE, Ceph
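A sketch of that Grafana correction sequence, using the image reference shown in comment #2 (the pattern is: set the option, explicitly redeploy, then verify):

# ceph config set mgr mgr/cephadm/container_image_grafana 172.16.34.57:5000/rh-osbs/grafana:5-17
# ceph orch redeploy grafana
# ceph orch ps --refresh | grep grafana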
Just wanted to give an update since I haven't commented in a while and you put the needinfo flag. I've looked into this a bit and feel it should be possible to automatically redeploy if the image changes, but I'm considering the issue low priority since the workaround is fairly simple. I'll see if I can take a more serious look at the possibility next week, if not later this week, and update you on what I find. Thank you for your patience.
Attempt to make the monitoring stack daemons automatically redeploy if their container image config option is changed:

Tracker Issue: https://tracker.ceph.com/issues/50061
PR: https://github.com/ceph/ceph/pull/40507
This requires further upstream discussion; the upstream PR has a big impact and risks regressions.
Eric, this looks somewhat similar to the issue you described.
Hello Sebastian,

When you change the ceph config parameter to point to the local registry container as mentioned above, the node-exporter is correctly deployed.

In my case, I had an issue when I was trying to deploy an OSD using "ceph orch daemon add osd <host>:/dev/sdX". In that case, it was also trying to connect to a container registry URL that was outside the local registry.

Were you able to deploy OSDs in a disconnected environment here? Did you have to perform any specific setup for it to work?

Thanks,
Eric.
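(Purely as a hedged sketch of one common disconnected-install pattern, not something taken from this report: the base Ceph image is typically pointed at the local registry at bootstrap time and/or via the global container_image option before OSDs are added. The image tag, mon IP, host and device below are placeholders.)

# cephadm --image 172.16.34.57:5000/rh-osbs/rhceph:<tag> bootstrap --mon-ip <mon-ip>
# ceph config set global container_image 172.16.34.57:5000/rh-osbs/rhceph:<tag>
# ceph orch daemon add osd <host>:/dev/sdX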
Hi,

Will we fix this in a future release, or just use the workaround?

Regards,
Sam