Description of problem:
=======================
On a fresh installation of the IBM build, deployment of a node-exporter daemon using the image shared in https://bugzilla.redhat.com/show_bug.cgi?id=2167314#c5 fails: the image is pulled from cp.icr.io/cp/ibm-ceph/prometheus-node-exporter:latest instead of the expected
- node_exporter_container_image: cp.icr.io/cp/ibm-ceph/prometheus-node-exporter:v4.10

Note: This issue is not observed with the other monitoring stack images.

Version-Release number of selected component (if applicable):
=============================================================
16.2.10-107.ibm.el8cp (afaf046ef07b474757f48ef45c02e3ed5e30a63f) pacific

How reproducible:
=================
Always

Steps to Reproduce:
===================
1. Deploy a Ceph cluster using the IBM build.
2. Pull the monitoring stack images mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=2167314#c5
3. Deploy a node-exporter daemon using the command:
   # ceph orch apply node-exporter --placement="saya-bluewash-dashboard-1"

Actual results:
===============
The node-exporter daemon is in failed/error state.

# systemctl status ceph-7a2938fa-a391-11ed-9448-fa163ef14223.service
● ceph-7a2938fa-a391-11ed-9448-fa163ef14223.service - Ceph node-exporter.saya-bluewash-dashboard-1 for 7a2938fa-a391-11ed-9448-fa163ef14223
   Loaded: loaded (/etc/systemd/system/ceph-7a2938fa-a391-11ed-9448-fa163ef14223@.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Tue 2023-02-07 05:35:14 EST; 5s ago
  Process: 2802523 ExecStopPost=/bin/rm -f /run/ceph-7a2938fa-a391-11ed-9448-fa163ef14223.service-pid /run/ceph-7a2938fa-a391-11ed-9448-fa163ef14223@>
  Process: 2802522 ExecStopPost=/bin/bash /var/lib/ceph/7a2938fa-a391-11ed-9448-fa163ef14223/node-exporter.saya-bluewash-dashboard-1/unit.poststop (code=exited, status=0/SUCCESS)
  Process: 2802480 ExecStart=/bin/bash /var/lib/ceph/7a2938fa-a391-11ed-9448-fa163ef14223/node-exporter.saya-bluewash-dashboard-1/unit.run (code=exited, status=125)
  Process: 2802478 ExecStartPre=/bin/rm -f /run/ceph-7a2938fa-a391-11ed-9448-fa163ef14223.service-pid /run/ceph-7a2938fa-a391-11ed-9448-fa163ef14223@>

Feb 07 05:35:14 saya-bluewash-dashboard-1 systemd[1]: Failed to start Ceph node-exporter.saya-bluewash-dashboard-1 for 7a2938fa-a391-11ed-9448-fa163ef14223.

"journalctl -xe" log snippet:

-- Unit ceph-7a2938fa-a391-11ed-9448-fa163ef14223.service has begun starting up.
Feb 07 05:36:17 saya-bluewash-dashboard-1 ceph-7a2938fa-a391-11ed-9448-fa163ef14223-mon-saya-bluewash-dashboard-1[18244]: cluster 2023-02-07T10:36:16.746391+0000 mgr.saya-bluewash-dashboard>
Feb 07 05:36:17 saya-bluewash-dashboard-1 bash[2803010]: Trying to pull cp.icr.io/cp/ibm-ceph/prometheus-node-exporter:latest...
Feb 07 05:36:18 saya-bluewash-dashboard-1 bash[2803010]: Error: initializing source docker://cp.icr.io/cp/ibm-ceph/prometheus-node-exporter:latest: reading manifest latest in cp.icr.io/cp/i>
Feb 07 05:36:18 saya-bluewash-dashboard-1 systemd[1]: ceph-7a2938fa-a391-11ed-9448-fa163ef14223.service: Control process exited, code=exited status=1>
Feb 07 05:36:18 saya-bluewash-dashboard-1 systemd[1]: ceph-7a2938fa-a391-11ed-9448-fa163ef14223.service: Failed with result 'exit-code'.
-- Subject: Unit failed
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
--
-- The unit ceph-7a2938fa-a391-11ed-9448-fa163ef14223.service has entered the 'failed' state with result 'exit-code'.
Feb 07 05:36:18 saya-bluewash-dashboard-1 systemd[1]: Failed to start Ceph node-exporter.saya-bluewash-dashboard-1 for 7a2938fa-a391-11ed-9448-fa163ef14223.
-- Subject: Unit ceph-7a2938fa-a391-11ed-9448-fa163ef14223.service has failed
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
--
-- Unit ceph-7a2938fa-a391-11ed-9448-fa163ef14223.service has failed.
--
-- The result is failed.

Expected results:
=================
The node-exporter service should be up and running.

Additional info:
================
Initially, after fresh installation and before pulling any images:

# ceph orch ps
NAME                                     HOST                       PORTS   STATUS        REFRESHED  AGE  MEM USE  MEM LIM  VERSION                IMAGE ID      CONTAINER ID
crash.saya-bluewash-dashboard-1          saya-bluewash-dashboard-1          running (3d)  7m ago     3d   6987k    -        16.2.10-107.ibm.el8cp  1a3e836939db  2cc055d0faae
mgr.saya-bluewash-dashboard-1.nlopgv     saya-bluewash-dashboard-1  *:9283  running (3d)  7m ago     3d   561M     -        16.2.10-107.ibm.el8cp  1a3e836939db  3b53758e6a65
mon.saya-bluewash-dashboard-1            saya-bluewash-dashboard-1          running (3d)  7m ago     3d   1094M    2048M    16.2.10-107.ibm.el8cp  1a3e836939db  e1fe9c2a709f
node-exporter.saya-bluewash-dashboard-1  saya-bluewash-dashboard-1  *:9100  error         7m ago     3d   -        -        <unknown>              <unknown>     <unknown>

# ceph -s
  cluster:
    id:     7a2938fa-a391-11ed-9448-fa163ef14223
    health: HEALTH_WARN
            Failed to place 1 daemon(s)
            1 failed cephadm daemon(s)
            OSD count 0 < osd_pool_default_size 3

  services:
    mon: 1 daemons, quorum saya-bluewash-dashboard-1 (age 3d)
    mgr: saya-bluewash-dashboard-1.nlopgv(active, since 3d)
    osd: 0 osds: 0 up, 0 in

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:

# ceph health detail
HEALTH_WARN Failed to place 1 daemon(s); 1 failed cephadm daemon(s); OSD count 0 < osd_pool_default_size 3
[WRN] CEPHADM_DAEMON_PLACE_FAIL: Failed to place 1 daemon(s)
    Failed while placing prometheus.saya-bluewash-dashboard-1 on saya-bluewash-dashboard-1: cephadm exited with an error code: 1, stderr:Non-zero exit code 125 from /bin/podman container inspect --format {{.State.Status}} ceph-7a2938fa-a391-11ed-9448-fa163ef14223-prometheus-saya-bluewash-dashboard-1
    /bin/podman: stderr Error: inspecting object: no such container ceph-7a2938fa-a391-11ed-9448-fa163ef14223-prometheus-saya-bluewash-dashboard-1
    Non-zero exit code 125 from /bin/podman container inspect --format {{.State.Status}} ceph-7a2938fa-a391-11ed-9448-fa163ef14223-prometheus.saya-bluewash-dashboard-1
    /bin/podman: stderr Error: inspecting object: no such container ceph-7a2938fa-a391-11ed-9448-fa163ef14223-prometheus.saya-bluewash-dashboard-1
    Deploy daemon prometheus.saya-bluewash-dashboard-1 ...
    Non-zero exit code 125 from /bin/podman run --rm --ipc=host --stop-signal=SIGTERM --authfile=/etc/ceph/podman-auth.json --net=host --entrypoint stat --init -e CONTAINER_IMAGE=icr.io/ibm-ceph/prometheus:v4.10 -e NODE_NAME=saya-bluewash-dashboard-1 -e CEPH_USE_RANDOM_NONCE=1 icr.io/ibm-ceph/prometheus:v4.10 -c %u %g /etc/prometheus
    stat: stderr Trying to pull icr.io/ibm-ceph/prometheus:v4.10...
    stat: stderr Error: initializing source docker://icr.io/ibm-ceph/prometheus:v4.10: unable to retrieve auth token: invalid username/password: unauthorized: The login credentials are not valid, or your IBM Cloud account is not active.
    ERROR: Failed to extract uid/gid for path /etc/prometheus: Failed command: /bin/podman run --rm --ipc=host --stop-signal=SIGTERM --authfile=/etc/ceph/podman-auth.json --net=host --entrypoint stat --init -e CONTAINER_IMAGE=icr.io/ibm-ceph/prometheus:v4.10 -e NODE_NAME=saya-bluewash-dashboard-1 -e CEPH_USE_RANDOM_NONCE=1 icr.io/ibm-ceph/prometheus:v4.10 -c %u %g /etc/prometheus: Trying to pull icr.io/ibm-ceph/prometheus:v4.10...
    Error: initializing source docker://icr.io/ibm-ceph/prometheus:v4.10: unable to retrieve auth token: invalid username/password: unauthorized: The login credentials are not valid, or your IBM Cloud account is not active.
[WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
    daemon node-exporter.saya-bluewash-dashboard-1 on saya-bluewash-dashboard-1 is in error state
[WRN] TOO_FEW_OSDS: OSD count 0 < osd_pool_default_size 3

After pulling the images:

# podman pull cp.icr.io/cp/ibm-ceph/prometheus-node-exporter:v4.10
Trying to pull cp.icr.io/cp/ibm-ceph/prometheus-node-exporter:v4.10...
Getting image source signatures
Checking if image destination supports signatures
Copying blob 98a0c66b9ec7 done
Copying blob 1df162fae087 done
Copying blob d20a374ee8f7 done
Copying blob fbcfec983935 done
Copying config 6c8570b192 done
Writing manifest to image destination
Storing signatures
6c8570b1928b4a811bc5e7175dde87d6c0e19b640d4584c547eebc02ab81e5f4

# ceph config set mgr mgr/cephadm/container_image_node_exporter cp.icr.io/cp/ibm-ceph/prometheus-node-exporter

# ceph orch apply node-exporter --placement="saya-bluewash-dashboard-1"
Scheduled node-exporter update...

# ceph orch ps --daemon_type=node-exporter
NAME                                     HOST                       PORTS   STATUS  REFRESHED  AGE  MEM USE  MEM LIM  VERSION    IMAGE ID
node-exporter.saya-bluewash-dashboard-1  saya-bluewash-dashboard-1  *:9100  error   9m ago     4d   -        -        <unknown>  <unknown>

# ceph orch daemon redeploy node-exporter.saya-bluewash-dashboard-1
Scheduled to redeploy node-exporter.saya-bluewash-dashboard-1 on host 'saya-bluewash-dashboard-1'

# ceph orch ps --daemon_type=node-exporter
NAME                                     HOST                       PORTS   STATUS  REFRESHED  AGE  MEM USE  MEM LIM  VERSION    IMAGE ID
node-exporter.saya-bluewash-dashboard-1  saya-bluewash-dashboard-1  *:9100  error   22s ago    4d   -        -        <unknown>  <unknown>

# ceph health detail
HEALTH_WARN 1 failed cephadm daemon(s); Degraded data redundancy: 1 pg undersized
[WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
    daemon node-exporter.saya-bluewash-dashboard-1 on saya-bluewash-dashboard-1 is in error state
[WRN] PG_DEGRADED: Degraded data redundancy: 1 pg undersized
    pg 1.0 is stuck undersized for 99m, current state active+undersized, last acting [2,0]

cephadm log snippet:

RuntimeError: Failed command: systemctl restart ceph-7a2938fa-a391-11ed-9448-fa163ef14223: Job for ceph-7a2938fa-a391-11ed-9448-fa163ef14223.service failed because the control process exited with error code.
See "systemctl status ceph-7a2938fa-a391-11ed-9448-fa163ef14223.service" and "journalctl -xe" for details.
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1466, in _remote_connection
    yield (conn, connr)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1363, in _run_cephadm
    code, '\n'.join(err)))
orchestrator._interface.OrchestratorError: cephadm exited with an error code: 1, stderr:Non-zero exit code 125 from /bin/podman container inspect --format {{.State.Status}} ceph-7a2938fa-a391-11ed-9448-fa163ef14223-node-exporter-saya-bluewash-dashboard-1
/bin/podman: stderr Error: inspecting object: no such container ceph-7a2938fa-a391-11ed-9448-fa163ef14223-node-exporter-saya-bluewash-dashboard-1
Non-zero exit code 125 from /bin/podman container inspect --format {{.State.Status}} ceph-7a2938fa-a391-11ed-9448-fa163ef14223-node-exporter.saya-bluewash-dashboard-1
/bin/podman: stderr Error: inspecting object: no such container ceph-7a2938fa-a391-11ed-9448-fa163ef14223-node-exporter.saya-bluewash-dashboard-1
Reconfig daemon node-exporter.saya-bluewash-dashboard-1 ...
Non-zero exit code 1 from systemctl restart ceph-7a2938fa-a391-11ed-9448-fa163ef14223
systemctl: stderr Job for ceph-7a2938fa-a391-11ed-9448-fa163ef14223.service failed because the control process exited with error code.
systemctl: stderr See "systemctl status ceph-7a2938fa-a391-11ed-9448-fa163ef14223.service" and "journalctl -xe" for details.
Traceback (most recent call last):
  File "/var/lib/ceph/7a2938fa-a391-11ed-9448-fa163ef14223/cephadm.6445ef0141c8cd1318507afc55e1a5712f2f254909a0bf329ee294c106bb760a", line 9183, in <module>
    main()
  File "/var/lib/ceph/7a2938fa-a391-11ed-9448-fa163ef14223/cephadm.6445ef0141c8cd1318507afc55e1a5712f2f254909a0bf329ee294c106bb760a", line 9171, in main
    r = ctx.func(ctx)
  File "/var/lib/ceph/7a2938fa-a391-11ed-9448-fa163ef14223/cephadm.6445ef0141c8cd1318507afc55e1a5712f2f254909a0bf329ee294c106bb760a", line 1970, in _default_image
    return func(ctx)
  File "/var/lib/ceph/7a2938fa-a391-11ed-9448-fa163ef14223/cephadm.6445ef0141c8cd1318507afc55e1a5712f2f254909a0bf329ee294c106bb760a", line 5055, in command_deploy
    ports=daemon_ports)
  File "/var/lib/ceph/7a2938fa-a391-11ed-9448-fa163ef14223/cephadm.6445ef0141c8cd1318507afc55e1a5712f2f254909a0bf329ee294c106bb760a", line 2973, in deploy_daemon
    get_unit_name(fsid, daemon_type, daemon_id)])
  File "/var/lib/ceph/7a2938fa-a391-11ed-9448-fa163ef14223/cephadm.6445ef0141c8cd1318507afc55e1a5712f2f254909a0bf329ee294c106bb760a", line 1637, in call_throws
    raise RuntimeError(f'Failed command: {" ".join(command)}: {s}')
RuntimeError: Failed command: systemctl restart ceph-7a2938fa-a391-11ed-9448-fa163ef14223: Job for ceph-7a2938fa-a391-11ed-9448-fa163ef14223.service failed because the control process exited with error code.
See "systemctl status ceph-7a2938fa-a391-11ed-9448-fa163ef14223.service" and "journalctl -xe" for details.
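As an additional cross-check (a hedged sketch only; it assumes the cephadm-generated unit.run for the daemon contains the full podman run command line, which can vary by cephadm version), the image reference the failing unit actually tries to pull can be read out of the daemon's unit.run file:

# grep -o 'prometheus-node-exporter[^ ]*' /var/lib/ceph/7a2938fa-a391-11ed-9448-fa163ef14223/node-exporter.saya-bluewash-dashboard-1/unit.run

On this node it would be expected to show the ":latest" reference seen in the journalctl snippet above.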
# ceph orch ps
NAME                                     HOST                       PORTS        STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION                IMAGE ID      CONTAINER ID
alertmanager.saya-bluewash-dashboard-1   saya-bluewash-dashboard-1  *:9093,9094  running (2h)   2m ago     2h   23.4M    -                               2de2e7d63e1b  bef1aecba2d0
crash.saya-bluewash-dashboard-1          saya-bluewash-dashboard-1               running (4d)   2m ago     4d   6987k    -        16.2.10-107.ibm.el8cp  1a3e836939db  2cc055d0faae
crash.saya-bluewash-dashboard-2          saya-bluewash-dashboard-2               running (4h)   8m ago     4h   10.1M    -        16.2.10-107.ibm.el8cp  1a3e836939db  3f940156798c
crash.saya-bluewash-dashboard-3          saya-bluewash-dashboard-3               running (4h)   15s ago    4h   7142k    -        16.2.10-107.ibm.el8cp  1a3e836939db  18af89f78216
crash.saya-bluewash-dashboard-4          saya-bluewash-dashboard-4               running (4h)   6m ago     4h   7147k    -        16.2.10-107.ibm.el8cp  1a3e836939db  c5e0e96e6f51
grafana.saya-bluewash-dashboard-1        saya-bluewash-dashboard-1  *:3000       running (2h)   2m ago     2h   142M     -        8.3.5                  bf676a29bcc5  31c0cae49309
mgr.saya-bluewash-dashboard-1.nlopgv     saya-bluewash-dashboard-1  *:9283       running (4d)   2m ago     4d   640M     -        16.2.10-107.ibm.el8cp  1a3e836939db  3b53758e6a65
mgr.saya-bluewash-dashboard-2.dwierz     saya-bluewash-dashboard-2  *:8443,9283  running (4h)   8m ago     4h   401M     -        16.2.10-107.ibm.el8cp  1a3e836939db  51be1f1a1602
mon.saya-bluewash-dashboard-1            saya-bluewash-dashboard-1               running (4d)   2m ago     4d   1157M    2048M    16.2.10-107.ibm.el8cp  1a3e836939db  e1fe9c2a709f
mon.saya-bluewash-dashboard-2            saya-bluewash-dashboard-2               running (4h)   8m ago     4h   1116M    2048M    16.2.10-107.ibm.el8cp  1a3e836939db  b084d33a2c54
mon.saya-bluewash-dashboard-3            saya-bluewash-dashboard-3               running (4h)   15s ago    4h   1120M    2048M    16.2.10-107.ibm.el8cp  1a3e836939db  34039a1b6513
mon.saya-bluewash-dashboard-4            saya-bluewash-dashboard-4               running (4h)   6m ago     4h   1061M    2048M    16.2.10-107.ibm.el8cp  1a3e836939db  ef48c7de081b
node-exporter.saya-bluewash-dashboard-1  saya-bluewash-dashboard-1  *:9100       error          2m ago     75m  -        -        <unknown>              <unknown>     <unknown>
osd.0                                    saya-bluewash-dashboard-1               running (93m)  2m ago     93m  44.4M    4096M    16.2.10-107.ibm.el8cp  1a3e836939db  c756888c32f5
osd.1                                    saya-bluewash-dashboard-1               running (93m)  2m ago     93m  38.1M    4096M    16.2.10-107.ibm.el8cp  1a3e836939db  170de24b22df
osd.2                                    saya-bluewash-dashboard-2               running (92m)  8m ago     92m  35.3M    4096M    16.2.10-107.ibm.el8cp  1a3e836939db  481377c15200
prometheus.saya-bluewash-dashboard-1     saya-bluewash-dashboard-1  *:9095       running (75m)  2m ago     2h   70.7M    -                               39847ff1cddf  088fdaecec0c
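For completeness, the image reference cephadm is configured to use for node-exporter can be confirmed with a standard config query (shown here only as a verification aid):

# ceph config get mgr mgr/cephadm/container_image_node_exporter

With the value set above, this returns the untagged cp.icr.io/cp/ibm-ceph/prometheus-node-exporter.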
The description has this command:

> ceph config set mgr mgr/cephadm/container_image_node_exporter cp.icr.io/cp/ibm-ceph/prometheus-node-exporter

When you do that, podman uses the ":latest" tag rather than ":v4.10". We haven't pushed a floating "latest" tag to IBM's registry for that image. We want users to use the v4.10 tag.

I've confirmed that cp.icr.io/cp/ibm-ceph/prometheus-node-exporter:v4.10 is properly set in the two files we ship in IBM Storage Ceph:

/usr/sbin/cephadm
/usr/share/ceph/mgr/cephadm/module.py

Are you still seeing errors when deploying or upgrading?
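For example, setting the image with the explicit tag and then redeploying should look something like this (the same commands already used in the description, just with the ":v4.10" tag; adjust the daemon name to your host):

# ceph config set mgr mgr/cephadm/container_image_node_exporter cp.icr.io/cp/ibm-ceph/prometheus-node-exporter:v4.10
# ceph orch daemon redeploy node-exporter.saya-bluewash-dashboard-1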
(In reply to Ken Dreyer (Red Hat) from comment #2)
> The description has this command:
>
> > ceph config set mgr mgr/cephadm/container_image_node_exporter cp.icr.io/cp/ibm-ceph/prometheus-node-exporter
>
> When you do that, podman uses the ":latest" tag rather than ":v4.10". We
> haven't pushed a floating "latest" tag to IBM's registry for that image. We
> want users to use the v4.10 tag.
>
> I've confirmed that cp.icr.io/cp/ibm-ceph/prometheus-node-exporter:v4.10 is
> properly set in the two files we ship in IBM Storage Ceph:
>
> /usr/sbin/cephadm
> /usr/share/ceph/mgr/cephadm/module.py
>
> Are you still seeing errors when deploying or upgrading?

Hello Ken,

This issue is not observed with the latest IBM Ceph Storage 5.3z4 / IBM Ceph Storage 6.1 deployments. Closing this bug.

Thanks