Bug 2167729

Summary: [Bluewash][Installation] Unable to deploy node-exporter daemon using ceph orchestrator
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Sayalee <saraut>
Component: Build
Assignee: Ken Dreyer (Red Hat) <kdreyer>
Status: CLOSED NOTABUG
QA Contact: Sayalee <saraut>
Severity: medium
Priority: unspecified
Version: 5.3
CC: cephqe-warriors, msaini, vereddy
Target Milestone: ---   
Target Release: 5.3z4   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2023-07-25 06:43:05 UTC
Type: Bug

Description Sayalee 2023-02-07 11:51:59 UTC
Description of problem:
=======================
On a fresh installation with the IBM build, deployment of the node-exporter daemon using the image shared in https://bugzilla.redhat.com/show_bug.cgi?id=2167314#c5 fails: the image is pulled from cp.icr.io/cp/ibm-ceph/prometheus-node-exporter:latest instead of the configured image -
node_exporter_container_image: cp.icr.io/cp/ibm-ceph/prometheus-node-exporter:v4.10

Note: This issue is not observed with the other monitoring stack images.
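For reference, one way to check which node-exporter image cephadm is configured to use is the ceph config get command (a sketch; the config key below is the same one referenced later in this bug) -

# ceph config get mgr mgr/cephadm/container_image_node_exporter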



Version-Release number of selected component (if applicable):
=============================================================
16.2.10-107.ibm.el8cp (afaf046ef07b474757f48ef45c02e3ed5e30a63f) pacific



How reproducible:
=================
Always



Steps to Reproduce:
===================
1. Deploy a ceph cluster using IBM build.
2. Pull monitoring stack images mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=2167314#c5
3. Deploy a node-exporter daemon using the command below (an equivalent service-spec sketch follows these steps) - 
   # ceph orch apply node-exporter --placement="saya-bluewash-dashboard-1"
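An equivalent way to apply the same placement is via a cephadm service spec file (a sketch, not from the original report; the hostname matches the cluster used here) -

# cat << EOF > node-exporter.yaml
service_type: node-exporter
placement:
  hosts:
    - saya-bluewash-dashboard-1
EOF
# ceph orch apply -i node-exporter.yaml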


Actual results:
===============

The node-exporter daemon is in a failed/error state.


# systemctl status ceph-7a2938fa-a391-11ed-9448-fa163ef14223.service

● ceph-7a2938fa-a391-11ed-9448-fa163ef14223.service - Ceph node-exporter.saya-bluewash-dashboard-1 for 7a2938fa-a391-11ed-9448-fa163ef14223
   Loaded: loaded (/etc/systemd/system/ceph-7a2938fa-a391-11ed-9448-fa163ef14223@.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Tue 2023-02-07 05:35:14 EST; 5s ago
  Process: 2802523 ExecStopPost=/bin/rm -f /run/ceph-7a2938fa-a391-11ed-9448-fa163ef14223.service-pid /run/ceph-7a2938fa-a391-11ed-9448-fa163ef14223@>
  Process: 2802522 ExecStopPost=/bin/bash /var/lib/ceph/7a2938fa-a391-11ed-9448-fa163ef14223/node-exporter.saya-bluewash-dashboard-1/unit.poststop (code=exited, status=0/SUCCESS)
  Process: 2802480 ExecStart=/bin/bash /var/lib/ceph/7a2938fa-a391-11ed-9448-fa163ef14223/node-exporter.saya-bluewash-dashboard-1/unit.run (code=exited, status=125)
  Process: 2802478 ExecStartPre=/bin/rm -f /run/ceph-7a2938fa-a391-11ed-9448-fa163ef14223.service-pid /run/ceph-7a2938fa-a391-11ed-9448-fa163ef14223@>

Feb 07 05:35:14 saya-bluewash-dashboard-1 systemd[1]: Failed to start Ceph node-exporter.saya-bluewash-dashboard-1 for 7a2938fa-a391-11ed-9448-fa163ef14223.



"journalctl -xe" log snippet-

-- Unit ceph-7a2938fa-a391-11ed-9448-fa163ef14223.service has begun starting up.
Feb 07 05:36:17 saya-bluewash-dashboard-1 ceph-7a2938fa-a391-11ed-9448-fa163ef14223-mon-saya-bluewash-dashboard-1[18244]: cluster 2023-02-07T10:36:16.746391+0000 mgr.saya-bluewash-dashboard>
Feb 07 05:36:17 saya-bluewash-dashboard-1 bash[2803010]: Trying to pull cp.icr.io/cp/ibm-ceph/prometheus-node-exporter:latest...
Feb 07 05:36:18 saya-bluewash-dashboard-1 bash[2803010]: Error: initializing source docker://cp.icr.io/cp/ibm-ceph/prometheus-node-exporter:latest: reading manifest latest in cp.icr.io/cp/i>
Feb 07 05:36:18 saya-bluewash-dashboard-1 systemd[1]: ceph-7a2938fa-a391-11ed-9448-fa163ef14223.service: Control process exited, code=exited status=1>
Feb 07 05:36:18 saya-bluewash-dashboard-1 systemd[1]: ceph-7a2938fa-a391-11ed-9448-fa163ef14223.service: Failed with result 'exit-code'.
-- Subject: Unit failed
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
-- 
-- The unit ceph-7a2938fa-a391-11ed-9448-fa163ef14223.service has entered the 'failed' state with result 'exit-code'.
Feb 07 05:36:18 saya-bluewash-dashboard-1 systemd[1]: Failed to start Ceph node-exporter.saya-bluewash-dashboard-1 for 7a2938fa-a391-11ed-9448-fa163ef14223.
-- Subject: Unit ceph-7a2938fa-a391-11ed-9448-fa163ef14223.service has failed
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
-- 
-- Unit ceph-7a2938fa-a391-11ed-9448-fa163ef14223.service has failed.
-- 
-- The result is failed.


Expected results:
=================
The node-exporter daemon should be deployed with the configured cp.icr.io/cp/ibm-ceph/prometheus-node-exporter:v4.10 image, and the service should be up and running.



Additional info:
================

Initially, after fresh installation and before pulling any images -

# ceph orch ps
NAME                                     HOST                       PORTS   STATUS        REFRESHED  AGE  MEM USE  MEM LIM  VERSION                IMAGE ID      CONTAINER ID  
crash.saya-bluewash-dashboard-1          saya-bluewash-dashboard-1          running (3d)     7m ago   3d    6987k        -  16.2.10-107.ibm.el8cp  1a3e836939db  2cc055d0faae  
mgr.saya-bluewash-dashboard-1.nlopgv     saya-bluewash-dashboard-1  *:9283  running (3d)     7m ago   3d     561M        -  16.2.10-107.ibm.el8cp  1a3e836939db  3b53758e6a65  
mon.saya-bluewash-dashboard-1            saya-bluewash-dashboard-1          running (3d)     7m ago   3d    1094M    2048M  16.2.10-107.ibm.el8cp  1a3e836939db  e1fe9c2a709f  
node-exporter.saya-bluewash-dashboard-1  saya-bluewash-dashboard-1  *:9100  error            7m ago   3d        -        -  <unknown>              <unknown>     <unknown>     


# ceph -s
  cluster:
    id:     7a2938fa-a391-11ed-9448-fa163ef14223
    health: HEALTH_WARN
            Failed to place 1 daemon(s)
            1 failed cephadm daemon(s)
            OSD count 0 < osd_pool_default_size 3
 
  services:
    mon: 1 daemons, quorum saya-bluewash-dashboard-1 (age 3d)
    mgr: saya-bluewash-dashboard-1.nlopgv(active, since 3d)
    osd: 0 osds: 0 up, 0 in
 
  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     
 

# ceph health detail
HEALTH_WARN Failed to place 1 daemon(s); 1 failed cephadm daemon(s); OSD count 0 < osd_pool_default_size 3
[WRN] CEPHADM_DAEMON_PLACE_FAIL: Failed to place 1 daemon(s)
    Failed while placing prometheus.saya-bluewash-dashboard-1 on saya-bluewash-dashboard-1: cephadm exited with an error code: 1, stderr:Non-zero exit code 125 from /bin/podman container inspect --format {{.State.Status}} ceph-7a2938fa-a391-11ed-9448-fa163ef14223-prometheus-saya-bluewash-dashboard-1
/bin/podman: stderr Error: inspecting object: no such container ceph-7a2938fa-a391-11ed-9448-fa163ef14223-prometheus-saya-bluewash-dashboard-1
Non-zero exit code 125 from /bin/podman container inspect --format {{.State.Status}} ceph-7a2938fa-a391-11ed-9448-fa163ef14223-prometheus.saya-bluewash-dashboard-1
/bin/podman: stderr Error: inspecting object: no such container ceph-7a2938fa-a391-11ed-9448-fa163ef14223-prometheus.saya-bluewash-dashboard-1
Deploy daemon prometheus.saya-bluewash-dashboard-1 ...
Non-zero exit code 125 from /bin/podman run --rm --ipc=host --stop-signal=SIGTERM --authfile=/etc/ceph/podman-auth.json --net=host --entrypoint stat --init -e CONTAINER_IMAGE=icr.io/ibm-ceph/prometheus:v4.10 -e NODE_NAME=saya-bluewash-dashboard-1 -e CEPH_USE_RANDOM_NONCE=1 icr.io/ibm-ceph/prometheus:v4.10 -c %u %g /etc/prometheus
stat: stderr Trying to pull icr.io/ibm-ceph/prometheus:v4.10...
stat: stderr Error: initializing source docker://icr.io/ibm-ceph/prometheus:v4.10: unable to retrieve auth token: invalid username/password: unauthorized: The login credentials are not valid, or your IBM Cloud account is not active.
ERROR: Failed to extract uid/gid for path /etc/prometheus: Failed command: /bin/podman run --rm --ipc=host --stop-signal=SIGTERM --authfile=/etc/ceph/podman-auth.json --net=host --entrypoint stat --init -e CONTAINER_IMAGE=icr.io/ibm-ceph/prometheus:v4.10 -e NODE_NAME=saya-bluewash-dashboard-1 -e CEPH_USE_RANDOM_NONCE=1 icr.io/ibm-ceph/prometheus:v4.10 -c %u %g /etc/prometheus: Trying to pull icr.io/ibm-ceph/prometheus:v4.10...
Error: initializing source docker://icr.io/ibm-ceph/prometheus:v4.10: unable to retrieve auth token: invalid username/password: unauthorized: The login credentials are not valid, or your IBM Cloud account is not active.

[WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
    daemon node-exporter.saya-bluewash-dashboard-1 on saya-bluewash-dashboard-1 is in error state
[WRN] TOO_FEW_OSDS: OSD count 0 < osd_pool_default_size 3




After pulling the images -

# podman pull cp.icr.io/cp/ibm-ceph/prometheus-node-exporter:v4.10
Trying to pull cp.icr.io/cp/ibm-ceph/prometheus-node-exporter:v4.10...
Getting image source signatures
Checking if image destination supports signatures
Copying blob 98a0c66b9ec7 done  
Copying blob 1df162fae087 done  
Copying blob d20a374ee8f7 done  
Copying blob fbcfec983935 done  
Copying config 6c8570b192 done  
Writing manifest to image destination
Storing signatures
6c8570b1928b4a811bc5e7175dde87d6c0e19b640d4584c547eebc02ab81e5f4

# ceph config set mgr mgr/cephadm/container_image_node_exporter cp.icr.io/cp/ibm-ceph/prometheus-node-exporter


# ceph orch apply node-exporter --placement="saya-bluewash-dashboard-1"
Scheduled node-exporter update...


# ceph orch ps --daemon_type=node-exporter
NAME                                     HOST                       PORTS   STATUS  REFRESHED  AGE  MEM USE  MEM LIM  VERSION    IMAGE ID   
node-exporter.saya-bluewash-dashboard-1  saya-bluewash-dashboard-1  *:9100  error      9m ago   4d        -        -  <unknown>  <unknown>  


# ceph orch daemon redeploy node-exporter.saya-bluewash-dashboard-1
Scheduled to redeploy node-exporter.saya-bluewash-dashboard-1 on host 'saya-bluewash-dashboard-1'


# ceph orch ps --daemon_type=node-exporter
NAME                                     HOST                       PORTS   STATUS  REFRESHED  AGE  MEM USE  MEM LIM  VERSION    IMAGE ID   
node-exporter.saya-bluewash-dashboard-1  saya-bluewash-dashboard-1  *:9100  error     22s ago   4d        -        -  <unknown>  <unknown>  


# ceph health detail
HEALTH_WARN 1 failed cephadm daemon(s); Degraded data redundancy: 1 pg undersized
[WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
    daemon node-exporter.saya-bluewash-dashboard-1 on saya-bluewash-dashboard-1 is in error state
[WRN] PG_DEGRADED: Degraded data redundancy: 1 pg undersized
    pg 1.0 is stuck undersized for 99m, current state active+undersized, last acting [2,0]



cephadm log snippet - 

RuntimeError: Failed command: systemctl restart ceph-7a2938fa-a391-11ed-9448-fa163ef14223: Job for ceph-7a2938fa-a391-11ed-9448-fa163ef14223.service failed because the control process exited with error code.
See "systemctl status ceph-7a2938fa-a391-11ed-9448-fa163ef14223.service" and "journalctl -xe" for details.
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1466, in _remote_connection
    yield (conn, connr)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1363, in _run_cephadm
    code, '\n'.join(err)))
orchestrator._interface.OrchestratorError: cephadm exited with an error code: 1, stderr:Non-zero exit code 125 from /bin/podman container inspect --format {{.State.Status}} ceph-7a2938fa-a391-11ed-9448-fa163ef14223-node-exporter-saya-bluewash-dashboard-1
/bin/podman: stderr Error: inspecting object: no such container ceph-7a2938fa-a391-11ed-9448-fa163ef14223-node-exporter-saya-bluewash-dashboard-1
Non-zero exit code 125 from /bin/podman container inspect --format {{.State.Status}} ceph-7a2938fa-a391-11ed-9448-fa163ef14223-node-exporter.saya-bluewash-dashboard-1
/bin/podman: stderr Error: inspecting object: no such container ceph-7a2938fa-a391-11ed-9448-fa163ef14223-node-exporter.saya-bluewash-dashboard-1
Reconfig daemon node-exporter.saya-bluewash-dashboard-1 ...
Non-zero exit code 1 from systemctl restart ceph-7a2938fa-a391-11ed-9448-fa163ef14223
systemctl: stderr Job for ceph-7a2938fa-a391-11ed-9448-fa163ef14223.service failed because the control process exited with error code.
systemctl: stderr See "systemctl status ceph-7a2938fa-a391-11ed-9448-fa163ef14223.service" and "journalctl -xe" for details.
Traceback (most recent call last):
  File "/var/lib/ceph/7a2938fa-a391-11ed-9448-fa163ef14223/cephadm.6445ef0141c8cd1318507afc55e1a5712f2f254909a0bf329ee294c106bb760a", line 9183, in <module>
    main()
  File "/var/lib/ceph/7a2938fa-a391-11ed-9448-fa163ef14223/cephadm.6445ef0141c8cd1318507afc55e1a5712f2f254909a0bf329ee294c106bb760a", line 9171, in main
    r = ctx.func(ctx)
  File "/var/lib/ceph/7a2938fa-a391-11ed-9448-fa163ef14223/cephadm.6445ef0141c8cd1318507afc55e1a5712f2f254909a0bf329ee294c106bb760a", line 1970, in _default_image
    return func(ctx)
  File "/var/lib/ceph/7a2938fa-a391-11ed-9448-fa163ef14223/cephadm.6445ef0141c8cd1318507afc55e1a5712f2f254909a0bf329ee294c106bb760a", line 5055, in command_deploy
    ports=daemon_ports)
  File "/var/lib/ceph/7a2938fa-a391-11ed-9448-fa163ef14223/cephadm.6445ef0141c8cd1318507afc55e1a5712f2f254909a0bf329ee294c106bb760a", line 2973, in deploy_daemon
    get_unit_name(fsid, daemon_type, daemon_id)])
  File "/var/lib/ceph/7a2938fa-a391-11ed-9448-fa163ef14223/cephadm.6445ef0141c8cd1318507afc55e1a5712f2f254909a0bf329ee294c106bb760a", line 1637, in call_throws
    raise RuntimeError(f'Failed command: {" ".join(command)}: {s}')
RuntimeError: Failed command: systemctl restart ceph-7a2938fa-a391-11ed-9448-fa163ef14223: Job for ceph-7a2938fa-a391-11ed-9448-fa163ef14223.service failed because the control process exited with error code.
See "systemctl status ceph-7a2938fa-a391-11ed-9448-fa163ef14223.service" and "journalctl -xe" for details.



# ceph orch ps
NAME                                     HOST                       PORTS        STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION                IMAGE ID      CONTAINER ID  
alertmanager.saya-bluewash-dashboard-1   saya-bluewash-dashboard-1  *:9093,9094  running (2h)      2m ago   2h    23.4M        -                         2de2e7d63e1b  bef1aecba2d0  
crash.saya-bluewash-dashboard-1          saya-bluewash-dashboard-1               running (4d)      2m ago   4d    6987k        -  16.2.10-107.ibm.el8cp  1a3e836939db  2cc055d0faae  
crash.saya-bluewash-dashboard-2          saya-bluewash-dashboard-2               running (4h)      8m ago   4h    10.1M        -  16.2.10-107.ibm.el8cp  1a3e836939db  3f940156798c  
crash.saya-bluewash-dashboard-3          saya-bluewash-dashboard-3               running (4h)     15s ago   4h    7142k        -  16.2.10-107.ibm.el8cp  1a3e836939db  18af89f78216  
crash.saya-bluewash-dashboard-4          saya-bluewash-dashboard-4               running (4h)      6m ago   4h    7147k        -  16.2.10-107.ibm.el8cp  1a3e836939db  c5e0e96e6f51  
grafana.saya-bluewash-dashboard-1        saya-bluewash-dashboard-1  *:3000       running (2h)      2m ago   2h     142M        -  8.3.5                  bf676a29bcc5  31c0cae49309  
mgr.saya-bluewash-dashboard-1.nlopgv     saya-bluewash-dashboard-1  *:9283       running (4d)      2m ago   4d     640M        -  16.2.10-107.ibm.el8cp  1a3e836939db  3b53758e6a65  
mgr.saya-bluewash-dashboard-2.dwierz     saya-bluewash-dashboard-2  *:8443,9283  running (4h)      8m ago   4h     401M        -  16.2.10-107.ibm.el8cp  1a3e836939db  51be1f1a1602  
mon.saya-bluewash-dashboard-1            saya-bluewash-dashboard-1               running (4d)      2m ago   4d    1157M    2048M  16.2.10-107.ibm.el8cp  1a3e836939db  e1fe9c2a709f  
mon.saya-bluewash-dashboard-2            saya-bluewash-dashboard-2               running (4h)      8m ago   4h    1116M    2048M  16.2.10-107.ibm.el8cp  1a3e836939db  b084d33a2c54  
mon.saya-bluewash-dashboard-3            saya-bluewash-dashboard-3               running (4h)     15s ago   4h    1120M    2048M  16.2.10-107.ibm.el8cp  1a3e836939db  34039a1b6513  
mon.saya-bluewash-dashboard-4            saya-bluewash-dashboard-4               running (4h)      6m ago   4h    1061M    2048M  16.2.10-107.ibm.el8cp  1a3e836939db  ef48c7de081b  
node-exporter.saya-bluewash-dashboard-1  saya-bluewash-dashboard-1  *:9100       error             2m ago  75m        -        -  <unknown>              <unknown>     <unknown>     
osd.0                                    saya-bluewash-dashboard-1               running (93m)     2m ago  93m    44.4M    4096M  16.2.10-107.ibm.el8cp  1a3e836939db  c756888c32f5  
osd.1                                    saya-bluewash-dashboard-1               running (93m)     2m ago  93m    38.1M    4096M  16.2.10-107.ibm.el8cp  1a3e836939db  170de24b22df  
osd.2                                    saya-bluewash-dashboard-2               running (92m)     8m ago  92m    35.3M    4096M  16.2.10-107.ibm.el8cp  1a3e836939db  481377c15200  
prometheus.saya-bluewash-dashboard-1     saya-bluewash-dashboard-1  *:9095       running (75m)     2m ago   2h    70.7M        -                         39847ff1cddf  088fdaecec0c

Comment 2 Ken Dreyer (Red Hat) 2023-03-01 16:26:35 UTC
The description has this command:

> ceph config set mgr mgr/cephadm/container_image_node_exporter cp.icr.io/cp/ibm-ceph/prometheus-node-exporter

When you do that, podman uses the ":latest" tag rather than ":v4.10". We haven't pushed a floating "latest" tag to IBM's registry for that image. We want users to use the v4.10 tag.
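For example, a minimal sketch of pinning the tag explicitly and redeploying the failed daemon (reusing the config key, image, and daemon name already shown in this bug) -

# ceph config set mgr mgr/cephadm/container_image_node_exporter cp.icr.io/cp/ibm-ceph/prometheus-node-exporter:v4.10
# ceph orch daemon redeploy node-exporter.saya-bluewash-dashboard-1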

I've confirmed that cp.icr.io/cp/ibm-ceph/prometheus-node-exporter:v4.10 is properly set in the two files we ship in IBM Storage Ceph:

 /usr/sbin/cephadm
 /usr/share/ceph/mgr/cephadm/module.py

Are you still seeing errors when deploying or upgrading?

Comment 3 Sayalee 2023-07-25 06:43:05 UTC
(In reply to Ken Dreyer (Red Hat) from comment #2)
> The description has this command:
> 
> > ceph config set mgr mgr/cephadm/container_image_node_exporter cp.icr.io/cp/ibm-ceph/prometheus-node-exporter
> 
> When you do that, podman uses the ":latest" tag rather than ":v4.10". We
> haven't pushed a floating "latest" tag to IBM's registry for that image. We
> want users to use the v4.10 tag.
> 
> I've confirmed that cp.icr.io/cp/ibm-ceph/prometheus-node-exporter:v4.10 is
> properly set in the two files we ship in IBM Storage Ceph:
> 
>  /usr/sbin/cephadm
>  /usr/share/ceph/mgr/cephadm/module.py
> 
> Are you still seeing errors when deploying or upgrading?

Hello Ken, 

This issue is not observed with the latest IBM Ceph Storage 5.3z4/IBM Ceph Storage 6.1 deployments.

Closing this bug.

Thanks