Bug 1935044 - [cephadm] node-exporter not trying to pull custom image and in unknown state
Summary: [cephadm] node-exporter not trying to pull custom image and in unknown state
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Cephadm
Version: 5.0
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 6.1
Assignee: Adam King
QA Contact: Vasishta
Docs Contact: Karen Norteman
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-03-04 10:05 UTC by Vasishta
Modified: 2023-02-06 18:08 UTC (History)
CC List: 7 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-02-06 18:08:39 UTC
Embargoed:




Links
Ceph Project Bug Tracker 50061 (last updated 2021-03-31 21:44:04 UTC)

Internal Links: 2013215

Description Vasishta 2021-03-04 10:05:47 UTC
Description of problem:
1) Trying to configure a Ceph cluster using a custom local registry.
2) Even though container_image_node_exporter is set to the custom image, node-exporter still tries to pull the default image. Since the node does not have access to the default registry, node-exporter ends up in an unknown state.

Version-Release number of selected component (if applicable):
16.1.0-486.el8cp 

How reproducible:
Tried once

Steps to Reproduce:
1. Bootstrap the cluster and set the monitoring stack images to the custom container registry (see the sketch below)
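
For reference, a minimal sketch of that configuration step, following the upstream cephadm custom-images documentation (the registry host 172.16.34.57:5000 and the image tags are examples taken from the outputs in this report, not required values):

# ceph config set mgr mgr/cephadm/container_image_prometheus 172.16.34.57:5000/ose-prometheus:latest
# ceph config set mgr mgr/cephadm/container_image_grafana 172.16.34.57:5000/rh-osbs/grafana:5-17
# ceph config set mgr mgr/cephadm/container_image_alertmanager 172.16.34.57:5000/ose-prometheus-alertmanager
# ceph config set mgr mgr/cephadm/container_image_node_exporter 172.16.34.57:5000/ose-prometheus-node-exporter:latest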

Actual results:
# ceph config get mgr mgr/cephadm/container_image_node_exporter
172.16.34.57:5000/ose-prometheus-node-exporter:latest

# ceph orch ls
..
node-exporter      0/1  4m ago     18h  *            registry.redhat.io/openshift4/ose-prometheus-node-exporter:v4.5                                           <unknown>     

# sudo journalctl -fu ceph-0bf1621a-7b6e-11eb-bcba-fa163eaf1801 
...
Error: Error initializing source docker://registry.redhat.io/openshift4/ose-prometheus-node-exporter:v4.5: error pinging docker registry registry.redhat.io: Get "https://registry.redhat.io/v2/": dial tcp: lookup registry.redhat.io on xx.xx.xx.x: read udp 172.16.34.41:57872->10.5.30.160:53: i/o timeout

# cat /var/log/cephadm.log
2021-03-04 05:03:25,437 DEBUG Not possible to enable service <node-exporter>. firewalld.service is not available
2021-03-04 05:03:25,437 DEBUG Not possible to open ports <[9100]>. firewalld.service is not available


Expected results:
Node-exporter should be up and running

Additional info:

Comment 1 Adam King 2021-03-11 17:47:08 UTC
Does this actually work for any other daemons that don't use the Ceph image besides node-exporter (Grafana, Prometheus, Alertmanager)? I didn't think we ever actually implemented a way to properly use custom images for the monitoring stack daemons.

If it actually does work for those other daemons could you share your process for setting them up with custom images? That would help me figure out why node-exporter specifically is having an issue.

Comment 2 Vasishta 2021-03-12 11:46:42 UTC
Hi Adam,

(In reply to Adam King from comment #1)
> does this actually work for any other daemons that don't use the Ceph image
> besides node-exporter (Grafana, Prometheus, Alertmanager)? 

Worked for me for grafana, prometheus, alertmanager

# ceph orch ls
NAME           RUNNING  REFRESHED  AGE  PLACEMENT    IMAGE NAME                                                                                                IMAGE ID      
alertmanager       1/1  48s ago    9d   count:1      172.16.34.57:5000/ose-prometheus-alertmanager                                                             e39ebb02f7e9  
crash              1/1  48s ago    9d   *            172.16.34.57:5000/rh-osbs/rhceph@sha256:966a9a874cdb1eff8365575918543a1e9fbf775a9cf93e7a634a06d18232822f  700feae6f592  
grafana            1/1  48s ago    9d   count:1      172.16.34.57:5000/rh-osbs/grafana:5-17                                                                    b3d99ee311c3  
mgr                1/2  48s ago    9d   <unmanaged>  172.16.34.57:5000/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-21981-20210302003306                700feae6f592  
mon                1/5  48s ago    9d   <unmanaged>  172.16.34.57:5000/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-21981-20210302003306                700feae6f592  
node-exporter      1/1  48s ago    9d   *            registry.redhat.io/openshift4/ose-prometheus-node-exporter:v4.5                                           e4be1e64c76a  
prometheus         1/1  48s ago    9d   count:1      172.16.34.57:5000/ose-prometheus:latest                                                                   b7e8775f2f94  


> If it actually does work for those other daemons could you share your
> process for setting them up with custom images? That would help me figure
> out why node-exporter specifically is having an issue.

I followed https://docs.ceph.com/en/latest/cephadm/monitoring/#using-custom-images

# for i in {container_image_prometheus,container_image_grafana,container_image_alertmanager,container_image_node_exporter};do ceph config get mgr mgr/cephadm/$i ;done                     
172.16.34.57:5000/ose-prometheus:latest
172.16.34.57:5000/rh-osbs/grafana:5-17
172.16.34.57:5000/ose-prometheus-alertmanager
172.16.34.57:5000/ose-prometheus-node-exporter:latest

Comment 3 Adam King 2021-03-12 21:18:27 UTC
For me this was fixable by redeploying the node-exporter service (ceph orch redeploy node-exporter).

To test this I made a custom image that used upstream master code for the base Ceph image but downstream images for the monitoring stack containers. When the cluster was set up, none of the monitoring stack containers came up.

[ceph: root@vm-00 /]# ceph orch ls
NAME                       RUNNING  REFRESHED  AGE  PLACEMENT  IMAGE NAME                                                       IMAGE ID      
alertmanager                   0/1  -          -    count:1    <unknown>                                                        <unknown>     
crash                          3/3  8m ago     13m  *          mix                                                              78bb486410ea  
grafana                        0/1  -          -    count:1    <unknown>                                                        <unknown>     
mgr                            2/2  8m ago     13m  count:2    mix                                                              78bb486410ea  
mon                            3/5  8m ago     13m  count:5    mix                                                              78bb486410ea  
node-exporter                  0/3  8m ago     13m  *          registry.redhat.io/openshift4/ose-prometheus-node-exporter:v4.5  <unknown>     
osd.all-available-devices      6/6  8m ago     13m  *          mix                                                              78bb486410ea  
prometheus                     0/1  -          -    count:1    <unknown>                                                        <unknown>     


I changed the alertmanager, prometheus and node-exporter images using the method you linked. Prometheus and Alertmanager fixed themselves without any additional steps from me, but node-exporter remained in an error state.


[ceph: root@vm-00 /]# ceph orch ps --refresh
NAME                 HOST   STATUS         REFRESHED  AGE  VERSION                IMAGE NAME                                                                                      IMAGE ID      CONTAINER ID  
alertmanager.vm-02   vm-02  running (2m)   66s ago    2m   0.20.0                 docker.io/prom/alertmanager:v0.20.0                                                             0881eb8f169f  66380b67a1fe  
crash.vm-00          vm-00  running (24m)  10m ago    24m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph:testing                                                                  78bb486410ea  ab8f449941df  
crash.vm-01          vm-01  running (23m)  7m ago     23m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph@sha256:d10ca19a6cecd7b498a282b2ef05a51cba54a298a34c66a446bbecc4596c0f6d  78bb486410ea  776a5770f42e  
crash.vm-02          vm-02  running (21m)  66s ago    21m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph@sha256:d10ca19a6cecd7b498a282b2ef05a51cba54a298a34c66a446bbecc4596c0f6d  78bb486410ea  ddef1854d9f3  
mgr.vm-00.vqrmqr     vm-00  running (25m)  10m ago    25m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph:testing                                                                  78bb486410ea  33a4ac44e98b  
mgr.vm-01.opynpz     vm-01  running (23m)  7m ago     23m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph@sha256:d10ca19a6cecd7b498a282b2ef05a51cba54a298a34c66a446bbecc4596c0f6d  78bb486410ea  7798096f87aa  
mon.vm-00            vm-00  running (25m)  10m ago    25m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph:testing                                                                  78bb486410ea  49890b007727  
mon.vm-01            vm-01  running (20m)  7m ago     20m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph@sha256:d10ca19a6cecd7b498a282b2ef05a51cba54a298a34c66a446bbecc4596c0f6d  78bb486410ea  042f582c13fb  
mon.vm-02            vm-02  running (20m)  66s ago    20m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph@sha256:d10ca19a6cecd7b498a282b2ef05a51cba54a298a34c66a446bbecc4596c0f6d  78bb486410ea  acd8752d713b  
node-exporter.vm-00  vm-00  unknown        10m ago    20m  <unknown>              registry.redhat.io/openshift4/ose-prometheus-node-exporter:v4.5                                 <unknown>     <unknown>     
node-exporter.vm-01  vm-01  error          7m ago     20m  <unknown>              registry.redhat.io/openshift4/ose-prometheus-node-exporter:v4.5                                 <unknown>     <unknown>     
node-exporter.vm-02  vm-02  error          66s ago    20m  <unknown>              registry.redhat.io/openshift4/ose-prometheus-node-exporter:v4.5                                 <unknown>     <unknown>     
osd.0                vm-01  running (22m)  7m ago     22m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph@sha256:d10ca19a6cecd7b498a282b2ef05a51cba54a298a34c66a446bbecc4596c0f6d  78bb486410ea  3f606cf45732  
osd.1                vm-00  running (22m)  10m ago    22m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph:testing                                                                  78bb486410ea  7419b0d380dd  
osd.2                vm-00  running (22m)  10m ago    22m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph:testing                                                                  78bb486410ea  7c2adf6bb0a4  
osd.3                vm-01  running (22m)  7m ago     22m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph@sha256:d10ca19a6cecd7b498a282b2ef05a51cba54a298a34c66a446bbecc4596c0f6d  78bb486410ea  9acacfa93f17  
osd.4                vm-02  running (20m)  66s ago    20m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph@sha256:d10ca19a6cecd7b498a282b2ef05a51cba54a298a34c66a446bbecc4596c0f6d  78bb486410ea  e0f018252c1c  
osd.5                vm-02  running (20m)  66s ago    20m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph@sha256:d10ca19a6cecd7b498a282b2ef05a51cba54a298a34c66a446bbecc4596c0f6d  78bb486410ea  5d1ce1e7bde6  
prometheus.vm-02     vm-02  running (2m)   66s ago    9m   2.18.1                 docker.io/prom/prometheus:v2.18.1                                                               de242295e225  a9d1eeb8337b  



When I called "ceph orch redeploy node-exporter", node-exporter came up as well.


[ceph: root@vm-00 /]# ceph orch redeploy node-exporter
Scheduled to redeploy node-exporter.vm-00 on host 'vm-00'
Scheduled to redeploy node-exporter.vm-01 on host 'vm-01'
Scheduled to redeploy node-exporter.vm-02 on host 'vm-02'
[ceph: root@vm-00 /]# ceph orch ps
NAME                 HOST   STATUS         REFRESHED  AGE  VERSION                IMAGE NAME                                                                                      IMAGE ID      CONTAINER ID  
alertmanager.vm-02   vm-02  running (3m)   12s ago    3m   0.20.0                 docker.io/prom/alertmanager:v0.20.0                                                             0881eb8f169f  66380b67a1fe  
crash.vm-00          vm-00  running (26m)  13s ago    26m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph:testing                                                                  78bb486410ea  ab8f449941df  
crash.vm-01          vm-01  running (24m)  13s ago    24m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph@sha256:d10ca19a6cecd7b498a282b2ef05a51cba54a298a34c66a446bbecc4596c0f6d  78bb486410ea  776a5770f42e  
crash.vm-02          vm-02  running (22m)  12s ago    22m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph@sha256:d10ca19a6cecd7b498a282b2ef05a51cba54a298a34c66a446bbecc4596c0f6d  78bb486410ea  ddef1854d9f3  
mgr.vm-00.vqrmqr     vm-00  running (27m)  13s ago    27m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph:testing                                                                  78bb486410ea  33a4ac44e98b  
mgr.vm-01.opynpz     vm-01  running (24m)  13s ago    24m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph@sha256:d10ca19a6cecd7b498a282b2ef05a51cba54a298a34c66a446bbecc4596c0f6d  78bb486410ea  7798096f87aa  
mon.vm-00            vm-00  running (27m)  13s ago    27m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph:testing                                                                  78bb486410ea  49890b007727  
mon.vm-01            vm-01  running (22m)  13s ago    22m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph@sha256:d10ca19a6cecd7b498a282b2ef05a51cba54a298a34c66a446bbecc4596c0f6d  78bb486410ea  042f582c13fb  
mon.vm-02            vm-02  running (22m)  12s ago    22m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph@sha256:d10ca19a6cecd7b498a282b2ef05a51cba54a298a34c66a446bbecc4596c0f6d  78bb486410ea  acd8752d713b  
node-exporter.vm-00  vm-00  running (24s)  13s ago    21m  0.18.1                 docker.io/prom/node-exporter:v0.18.1                                                            e5a616e4b9cf  c9dac439897a  
node-exporter.vm-01  vm-01  running (20s)  13s ago    21m  0.18.1                 docker.io/prom/node-exporter:v0.18.1                                                            e5a616e4b9cf  0f04d9f5a59e  
node-exporter.vm-02  vm-02  running (15s)  12s ago    21m  0.18.1                 docker.io/prom/node-exporter:v0.18.1                                                            e5a616e4b9cf  1fcfcb307510  
osd.0                vm-01  running (24m)  13s ago    23m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph@sha256:d10ca19a6cecd7b498a282b2ef05a51cba54a298a34c66a446bbecc4596c0f6d  78bb486410ea  3f606cf45732  
osd.1                vm-00  running (23m)  13s ago    23m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph:testing                                                                  78bb486410ea  7419b0d380dd  
osd.2                vm-00  running (23m)  13s ago    23m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph:testing                                                                  78bb486410ea  7c2adf6bb0a4  
osd.3                vm-01  running (23m)  13s ago    23m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph@sha256:d10ca19a6cecd7b498a282b2ef05a51cba54a298a34c66a446bbecc4596c0f6d  78bb486410ea  9acacfa93f17  
osd.4                vm-02  running (21m)  12s ago    21m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph@sha256:d10ca19a6cecd7b498a282b2ef05a51cba54a298a34c66a446bbecc4596c0f6d  78bb486410ea  e0f018252c1c  
osd.5                vm-02  running (21m)  12s ago    21m  17.0.0-1275-g5e197a21  docker.io/amk3798/ceph@sha256:d10ca19a6cecd7b498a282b2ef05a51cba54a298a34c66a446bbecc4596c0f6d  78bb486410ea  5d1ce1e7bde6  
prometheus.vm-02     vm-02  running (3m)   12s ago    11m  2.18.1                 docker.io/prom/prometheus:v2.18.1                                                               de242295e225  a9d1eeb8337b  


I'd see if this works for you as well.

NOTE: Fixing with redeploy might only work with these changes present in the image: https://github.com/ceph/ceph/pull/39385
      The changes were only backported to Pacific 15 days ago (https://github.com/ceph/ceph/pull/39623), so I'd check that whatever image you're using for your Ceph base while testing this was created since then. That also means that redeploying might not have worked when you originally found this bug, so if you tried it before, I'd still try it again.

Another note: The reason node-exporter behaves differently here is that cephadm will try to reconfigure it if it has not been reconfigured since changes occurred in other related daemons. Reconfiguration does not change the image being used, which is why node-exporter keeps trying to pull the old image; and since the reconfiguration can never complete, cephadm always thinks the daemon still needs to be reconfigured and gets stuck in that loop. Explicitly telling cephadm to redeploy node-exporter will actually update the image being used and break the cycle.
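
Put together, the workaround sequence would look roughly like this (a sketch only; the image value is an example taken from the reporter's setup):

# ceph config set mgr mgr/cephadm/container_image_node_exporter 172.16.34.57:5000/ose-prometheus-node-exporter:latest
# ceph orch redeploy node-exporter
# ceph orch ps --refresh

The last command is just to verify that the node-exporter daemons come back in a running state with the new image.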
 

Let me know if this doesn't fix it. If it turns out not to, including some of the systemd logs from one of the node-exporter systemd units would be helpful for debugging (i.e. anything near a message about failing to start, from a command like 'journalctl -xeu ceph-0ccaae08-8370-11eb-afcf-5254008c7ecc' but with your node-exporter unit name substituted in).
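
For example, a node-exporter unit name would normally follow the standard cephadm pattern ceph-<fsid>@node-exporter.<hostname>.service, so the command would look something like this (the fsid is taken from the journalctl command in the bug description; the exact unit name on your hosts may differ):

# journalctl -xeu ceph-0bf1621a-7b6e-11eb-bcba-fa163eaf1801@node-exporter.<hostname>.service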

Comment 4 Vasishta 2021-03-18 15:53:48 UTC
Hi Adam,

Thanks a lot for the detailed information.

>>"ceph orch redeploy node-exporter"
Worked for me.

+
This time I made a mistake by pointing the Grafana image at the internal registry, whereas I wanted the image to be pulled from the other local registry.
So I set mgr/cephadm/container_image_grafana again and waited for a long time, but the change did not take effect; I had to run "ceph orch redeploy grafana" to get the daemons updated.

I agree that "ceph orch redeploy node-exporter" can be documented as a workaround for this issue, but I'm not sure whether it is a suitable thing to do from a usability perspective.
Can this be fixed? (Daemons should get redeployed when the user changes the value of mgr/cephadm/container_image_<daemon>.)

Regards,
Vasishta Shastry
QE, ceph

Comment 5 Adam King 2021-03-24 16:38:41 UTC
Just wanted to give an update since I haven't commented in a while and you set the needinfo flag. I've looked into this a bit and feel it should be possible to automatically redeploy if the image changes, but I'm considering the issue low priority since the workaround is fairly simple. I'll see if I can take a more serious look at the possibility next week, if not later this week, and update you on what I find.

Thank you for your patience.

Comment 6 Adam King 2021-03-31 12:27:31 UTC
Attempt to make the monitoring stack daemons automatically redeploy if their container image config option is changed:

Tracker Issue: https://tracker.ceph.com/issues/50061
PR:            https://github.com/ceph/ceph/pull/40507

Comment 8 Sebastian Wagner 2021-05-26 10:38:06 UTC
Requires further upstream discussion. The upstream PR has a big impact and risks regressions.

Comment 9 Sebastian Wagner 2021-10-12 10:52:51 UTC
Eric, this looks somewhat similar to the issue you described.

Comment 10 egoirand 2021-10-13 13:24:31 UTC
Hello Sebastian,

When you change the ceph config parameter to point to the local container registry as mentioned above, node-exporter is correctly deployed.

In my case, I had an issue when trying to deploy an OSD using ceph orch daemon add osd <host>:/dev/sdX. In that case, it was also trying to connect to a container registry URL outside the local registry.

Were you able to deploy OSDs in a disconnected environment here?

If so, did you have to perform a specific setup for it to work?

Thanks, 

Eric.

Comment 11 XinhuaLi 2021-11-13 07:20:49 UTC
Hi, will we fix this in a future release or just use the workaround?

Regards
Sam

