Bug 1917543
| Summary: | [cephadm] 5.0 - Registry based update to the latest build using Ceph orch upgrade start --image <registry url> is throwing an error "Unable to pull the target image" | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Preethi <pnataraj> |
| Component: | Cephadm | Assignee: | Adam King <adking> |
| Status: | CLOSED ERRATA | QA Contact: | Vasishta <vashastr> |
| Severity: | high | Docs Contact: | Karen Norteman <knortema> |
| Priority: | urgent | | |
| Version: | 5.0 | CC: | jolmomar, kdreyer, sewagner, tserlin, vereddy |
| Target Milestone: | --- | | |
| Target Release: | 5.0 | | |
| Hardware: | x86_64 | | |
| OS: | All | | |
| Whiteboard: | | | |
| Fixed In Version: | ceph-16.2.0-13.el8cp | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-08-30 08:27:52 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Related upstream tracker: https://tracker.ceph.com/issues/48695

Unfortunately, I don't think "ceph -s" or "ceph orch upgrade status" shows any completion message after the update has already finished; they are currently only useful for upgrade information while the upgrade is still in progress. The best way, currently, to tell whether an update completed successfully after "ceph orch upgrade status" reports "in_progress" as "false" is to run "ceph orch ls" and look at the image name for all ceph daemons (daemons of type 'mgr', 'mon', 'crash', 'osd', 'mds', 'rgw' and 'rbd-mirror').

For example, when upgrading a 15.2.8 cluster to the downstream image, "ceph orch ls" before the upgrade gives me this (note: all ceph daemon type services have image name "docker.io/amk3798/ceph:15.2.8"):

[ceph: root@vm-00 /]# ceph orch ls
NAME                       RUNNING  REFRESHED  AGE  PLACEMENT  IMAGE NAME                            IMAGE ID
alertmanager               1/1      1s ago     5m   count:1    docker.io/prom/alertmanager:v0.20.0   0881eb8f169f
crash                      3/3      2s ago     5m   *          docker.io/amk3798/ceph:15.2.8         3c03696dbf74
grafana                    1/1      1s ago     5m   count:1    docker.io/ceph/ceph-grafana:6.6.2     a0dce381714a
mgr                        2/2      2s ago     5m   count:2    docker.io/amk3798/ceph:15.2.8         3c03696dbf74
mon                        3/5      2s ago     5m   count:5    docker.io/amk3798/ceph:15.2.8         3c03696dbf74
node-exporter              3/3      2s ago     5m   *          docker.io/prom/node-exporter:v0.18.1  e5a616e4b9cf
osd.all-available-devices  6/6      2s ago     4m   *          docker.io/amk3798/ceph:15.2.8         3c03696dbf74
prometheus                 1/1      1s ago     5m   count:1    docker.io/prom/prometheus:v2.18.1     de242295e225

In the middle of the upgrade I might see something like this. Notice that some of the ceph daemons still have the old image name, mgr has the new one (mgr is upgraded first), and the mon service has image name "mix", which means some of the mon daemons have been upgraded and some have not:

[ceph: root@vm-00 /]# ceph orch ls
NAME                       RUNNING  REFRESHED  AGE  PLACEMENT  IMAGE NAME                                             IMAGE ID
alertmanager               1/1      37s ago    9m   count:1    docker.io/prom/alertmanager:v0.20.0                    0881eb8f169f
crash                      3/3      38s ago    10m  *          docker.io/amk3798/ceph:15.2.8                          3c03696dbf74
grafana                    1/1      37s ago    9m   count:1    docker.io/ceph/ceph-grafana:6.6.2                      a0dce381714a
mgr                        2/2      38s ago    10m  count:2    registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest  c88a5d60f510
mon                        3/5      38s ago    10m  count:5    mix                                                    mix
node-exporter              3/3      38s ago    9m   *          docker.io/prom/node-exporter:v0.18.1                   e5a616e4b9cf
osd.all-available-devices  6/6      38s ago    9m   *          docker.io/amk3798/ceph:15.2.8                          3c03696dbf74
prometheus                 1/1      37s ago    10m  count:1    docker.io/prom/prometheus:v2.18.1                      de242295e225

Finally, after the upgrade is fully complete, I can see that the image name for all ceph daemon type services is the new, upgraded image, telling me all ceph daemons have been upgraded and the upgrade can be considered complete.
[ceph: root@vm-00 /]# ceph orch ls
NAME                       RUNNING  REFRESHED  AGE  PLACEMENT  IMAGE NAME                                             IMAGE ID
alertmanager               1/1      4m ago     17m  count:1    docker.io/prom/alertmanager:v0.20.0                    0881eb8f169f
crash                      3/3      4m ago     17m  *          registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest  c88a5d60f510
grafana                    1/1      4m ago     17m  count:1    docker.io/ceph/ceph-grafana:6.6.2                      a0dce381714a
mgr                        2/2      4m ago     17m  count:2    registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest  c88a5d60f510
mon                        3/5      4m ago     17m  count:5    registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest  c88a5d60f510
node-exporter              3/3      4m ago     17m  *          docker.io/prom/node-exporter:v0.18.1                   e5a616e4b9cf
osd.all-available-devices  6/6      4m ago     16m  *          registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest  c88a5d60f510
prometheus                 1/1      4m ago     17m  count:1    docker.io/prom/prometheus:v2.18.1                      de242295e225

Can you share the "ceph orch ls" output from the clusters where you've done this upgrade? That will give us a good idea of whether the daemons have all been upgraded successfully and, if not, which ones failed to do so.

To adjust my previous comment: it's better to look at the "IMAGE ID" field than the "IMAGE NAME" field. Right now, the same image can sometimes temporarily show up under two different names: one with the expected name and one using the digest (e.g. "docker.io/amk3798/ceph:testing" and "docker.io/amk3798/ceph@sha256:2bd0cd0945534f321737f1d5959af199c05656d80d1fd4303bb55966876ab387" could actually be the same image). For that reason, the "IMAGE ID" field is a more consistent way to check. Simply find the image id of the image you're upgrading to and check that all services for ceph daemons have that image id in the output of "ceph orch ls".

@Adam, below is a snippet of "ceph orch ls". It is the same as before performing the build update. Since you tried the update commands from upstream to downstream, you can see the change in the "ceph orch ls" image name; however, we do not see any change before and after the upgrade unless there is a version change for prometheus, node-exporter, etc. I would strongly suggest we have a progress status when we perform build updates. Let me know if I can create an RFE BZ on the same. We had this progress status in the earlier builds.
[root@ceph-adm7 ~]# sudo cephadm shell
Inferring fsid 58149bf2-66ac-11eb-84bf-001a4a000262
Inferring config /var/lib/ceph/58149bf2-66ac-11eb-84bf-001a4a000262/mon.ceph-adm7/config
Using recent ceph image registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest
[ceph: root@ceph-adm7 /]# ceph orch ls
NAME                       RUNNING  REFRESHED  AGE  PLACEMENT                      IMAGE NAME                                                        IMAGE ID
alertmanager               1/1      118s ago   2w   count:1                        registry.redhat.io/openshift4/ose-prometheus-alertmanager:v4.5   b7bae610cd46
crash                      3/3      2m ago     2w   *                              registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest             c88a5d60f510
grafana                    1/1      118s ago   2w   count:1                        registry.redhat.io/rhceph-alpha/rhceph-5-dashboard-rhel8:latest  bd3d7748747b
mgr                        3/2      2m ago     7d   <unmanaged>                    registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest             c88a5d60f510
mon                        3/3      2m ago     2w   ceph-adm7;ceph-adm8;ceph-adm9  registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest             c88a5d60f510
node-exporter              1/3      2m ago     2w   *                              registry.redhat.io/openshift4/ose-prometheus-node-exporter:v4.5  mix
osd.all-available-devices  13/13    2m ago     12d  *                              registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest             c88a5d60f510
prometheus                 1/1      118s ago   2w   count:1                        registry.redhat.io/openshift4/ose-prometheus:v4.6                 bebb0ddef7f0
[ceph: root@ceph-adm7 /]# history

NOTE: We need to wait for a new alpha release to test this again.

@Preethi, I'm a bit confused. It looks like the image for crash, mgr, mon and osd here is "registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest", which is the image you were originally trying to upgrade to with the command "ceph orch upgrade start --image registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest", and they have the same image id as in my orch ls output from a few comments ago (c88a5d60f510). It looks like the daemons were successfully upgraded?

@Adam, we do not have the latest build in registry.redhat.io to perform a build update. I was performing a same-build-to-same-build update, since we were seeing the "failed to pull target image" issue described in this BZ. To confirm that an update to the latest build works with this command, we may need to wait until the latest build is pushed to the alpha path. I will verify the BZ once we have the latest image and update you.

@Adam, the issue is not seen. Hence, moving to verified.
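The IMAGE ID check described above can be sketched roughly as follows, assuming the column layout shown in the "ceph orch ls" samples in this bug and using c88a5d60f510 purely as an example target image id:

# run inside "cephadm shell"; set target_id to the image id of the image being upgraded to
target_id=c88a5d60f510
ceph orch ls | awk -v id="$target_id" \
    '/^(mgr|mon|crash|osd|mds|rgw|rbd-mirror)/ && $NF != id {print "not yet on target image:", $1, $NF}'

If this prints nothing once "ceph orch upgrade status" reports "in_progress" as false, every ceph daemon type service is on the target image id.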
Below is a snippet of the output with the workaround:
[ceph: root@magna007 /]# ceph orch ps | grep mgr
mgr.magna007.wpgvme magna007 running (6h) 11s ago 3w *:9283 16.1.0-1325-geb5d7a86 0a963d7074de 585d98f2cc16
mgr.magna010.syndxo magna010 running (29s) 13s ago 3w *:8443 *:9283 16.2.0-13.el8cp 89a188512eee 53aae976ffc6
[ceph: root@magna007 /]# ceph orch daemon redeploy mgr.magna007.wpgvme --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-99506-20210424023822
Scheduled to redeploy mgr.magna007.wpgvme on host 'magna007'
[ceph: root@magna007 /]# ceph mgr fail
[ceph: root@magna007 /]# ceph -s
  cluster:
    id:     802d6a00-9277-11eb-aa4f-002590fc2538
    health: HEALTH_OK

  services:
    mon:        3 daemons, quorum magna007,magna010,magna104 (age 6h)
    mgr:        magna007.wpgvme(active, since 7s), standbys: magna010.syndxo
    osd:        15 osds: 15 up (since 6h), 15 in (since 2w)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        5 daemons active (3 hosts, 1 zones)

  data:
    pools:   7 pools, 169 pgs
    objects: 372 objects, 24 KiB
    usage:   9.5 GiB used, 14 TiB / 14 TiB avail
    pgs:     169 active+clean
[ceph: root@magna007 /]# ceph orch ps | grep mgr
mgr.magna007.wpgvme magna007 running (19s) 0s ago 3w *:8443 *:9283 16.2.0-13.el8cp 89a188512eee a287245f4789
mgr.magna010.syndxo magna010 running (6m) 3s ago 3w *:8443 *:9283 16.2.0-13.el8cp 89a188512eee 53aae976ffc6
[ceph: root@magna007 /]# ceph orch
[ceph: root@magna007 /]# ceph orch upgrade start --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-99506-20210424023822
Initiating upgrade to registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-99506-20210424023822
[ceph: root@magna007 /]# ceph orch upgrade status
{
    "target_image": "registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:55206326df77ef04991a3d4a59621f9dfcff5a8e68c151febc3d5e0e1cfd79e8",
    "in_progress": true,
    "services_complete": [
        "mgr"
    ],
    "progress": "2/41 ceph daemons upgraded",
    "message": ""
}
[ceph: root@magna007 /]#
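For reference, the workaround shown above reduces to the following sequence (a sketch; <target-image> stands for the full registry URL used in the snippet, and mgr.<name> is whichever mgr daemon is still on the old image):

# redeploy the mgr daemon that is still running the old image, pulling the target image directly
ceph orch daemon redeploy mgr.<name> --image <target-image>
# fail the active mgr so a mgr running the new image takes over
ceph mgr fail
# then run the normal upgrade against the same image for the remaining daemons
ceph orch upgrade start --image <target-image>
# follow progress with the upgrade status and, optionally, the cephadm log channel
ceph orch upgrade status
ceph -W cephadm

The mgr daemons are handled first here because, as noted earlier in this bug, mgr is upgraded first and the upgrade is driven by the active mgr, so redeploying it directly from the target image presumably lets it orchestrate the rest of the upgrade.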
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat Ceph Storage 5.0 bug fix and enhancement), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3294
I tried the same steps on another bootstrap node, i.e. magna061, and noticed the same issue:

[ceph: root@magna061 /]# ceph status
  cluster:
    id:     a2a63eea-51b8-11eb-889f-002590fbd650
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum magna061,magna064,magna066,magna063,magna065 (age 10d)
    mgr: magna061.codfuh(active, since 10d), standbys: magna063.zohfew
    osd: 17 osds: 17 up (since 6d), 17 in (since 6d)
    rgw: 2 daemons active (myrealm1.myzone1.magna063.jpljqe, myrealm1.myzone1.magna064.rribai)

  data:
    pools:   6 pools, 137 pgs
    objects: 394 objects, 38 KiB
    usage:   2.2 GiB used, 15 TiB / 15 TiB avail
    pgs:     137 active+clean

  io:
    client: 1.7 KiB/s rd, 3 op/s rd, 0 op/s wr

[ceph: root@magna061 /]# ceph orch upgrade start --image registry.redhat.io
Initiating upgrade to registry.redhat.io

[ceph: root@magna061 /]# ceph orch upgrade status
{
    "target_image": "registry.redhat.io",
    "in_progress": true,
    "services_complete": [],
    "message": ""
}

[ceph: root@magna061 /]# ceph status
  cluster:
    id:     a2a63eea-51b8-11eb-889f-002590fbd650
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum magna061,magna064,magna066,magna063,magna065 (age 10d)
    mgr: magna061.codfuh(active, since 10d), standbys: magna063.zohfew
    osd: 17 osds: 17 up (since 6d), 17 in (since 6d)
    rgw: 2 daemons active (myrealm1.myzone1.magna063.jpljqe, myrealm1.myzone1.magna064.rribai)

  data:
    pools:   6 pools, 137 pgs
    objects: 394 objects, 38 KiB
    usage:   2.2 GiB used, 15 TiB / 15 TiB avail
    pgs:     137 active+clean

  io:
    client: 8.9 KiB/s rd, 17 op/s rd, 0 op/s wr

  progress:

[ceph: root@magna061 /]# ceph status
  cluster:
    id:     a2a63eea-51b8-11eb-889f-002590fbd650
    health: HEALTH_WARN
            Upgrade: failed to pull target image

  services:
    mon: 5 daemons, quorum magna061,magna064,magna066,magna063,magna065 (age 10d)
    mgr: magna061.codfuh(active, since 10d), standbys: magna063.zohfew
    osd: 17 osds: 17 up (since 6d), 17 in (since 6d)
    rgw: 2 daemons active (myrealm1.myzone1.magna063.jpljqe, myrealm1.myzone1.magna064.rribai)

  data:
    pools:   6 pools, 137 pgs
    objects: 394 objects, 38 KiB
    usage:   2.2 GiB used, 15 TiB / 15 TiB avail
    pgs:     137 active+clean

  io:
    client: 2.0 KiB/s rd, 3 op/s rd, 0 op/s wr

  progress:

[ceph: root@magna061 /]# ceph orch upgrade status
{
    "target_image": "registry.redhat.io",
    "in_progress": true,
    "services_complete": [],
    "message": "Error: UPGRADE_FAILED_PULL: Upgrade: failed to pull target image"
}
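When the cluster goes into HEALTH_WARN with UPGRADE_FAILED_PULL as above, a few generic checks can help narrow things down (a sketch; it assumes podman is the container engine on the hosts, and the image path is only an example taken from elsewhere in this bug):

# show which health check failed and any detail recorded for it
ceph health detail
# log in first if the registry requires authentication
podman login registry.redhat.io
# verify that the image reference passed to --image can actually be pulled on a cluster host
podman pull registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest
# stop the stuck upgrade before retrying with a corrected --image value
ceph orch upgrade stop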