Description of problem:
[cephadm] 5.0 - ceph orch upgrade from registry.redhat.io to the internal registry takes a long time to complete and still shows as in progress, and there is no progress bar in the cluster showing when the upgrade is expected to complete.

Version-Release number of selected component (if applicable):
[ubuntu@magna031 ~]$ sudo cephadm version
Using recent ceph image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:89948bcc07ab0f1364586b110044fe2ff9e3b3d5b9e39bdb9826bbdc5e368b72
ceph version 16.0.0-8633.el8cp (a0d3d2e82a786c006507b0df445433f7725e477d) pacific (dev)
[ubuntu@magna031 ~]$ sudo rpm -qa | grep cephadm
cephadm-16.0.0-8633.el8cp.noarch
[ubuntu@magna031 ~]$

How reproducible:

Steps to Reproduce:
1. Have a 5.0 cluster bootstrapped with registry.redhat.io
2. Perform an upgrade with the ceph orch upgrade command from registry.redhat.io to the internal registry, i.e. registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-97619-20210115184956
3. Observe the behaviour

Actual results:
The following output is for reference:

[ceph: root@magna031 /]# ceph orch upgrade start --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-97619-20210115184956
Initiating upgrade to registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-97619-20210115184956
[ceph: root@magna031 /]# ceph orch upgrade status
{
    "target_image": "registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-97619-20210115184956",
    "in_progress": true,
    "services_complete": [],
    "message": ""
}
[ceph: root@magna031 /]#
[ceph: root@magna031 /]# ceph status
  cluster:
    id:     ffa035a2-53c9-11eb-9e9c-002590fc25a4
    health: HEALTH_WARN
            failed to probe daemons or devices
  services:
    mon:         2 daemons, quorum magna031,magna033 (age 7d)
    mgr:         magna031.cdrwla(active, since 7d), standbys: magna032.iiowmg
    osd:         9 osds: 9 up (since 6d), 9 in (since 6d)
    tcmu-runner: 2 daemons active (magna032:iscsi/disk_1, magna033:iscsi/disk_1)
  data:
    pools:   2 pools, 33 pgs
    objects: 26 objects, 6.6 KiB
    usage:   87 MiB used, 8.2 TiB / 8.2 TiB avail
    pgs:     33 active+clean
  io:
    client:  1.7 KiB/s rd, 1 op/s rd, 0 op/s wr
  progress:

[ceph: root@magna031 /]# ceph orch ls
NAME                       RUNNING  REFRESHED  AGE  PLACEMENT          IMAGE NAME                                                        IMAGE ID
alertmanager               1/1      11s ago    7d   count:1            registry.redhat.io/openshift4/ose-prometheus-alertmanager:v4.5   b2c52bc75c45
crash                      6/6      14s ago    7d   *                  registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest             c88a5d60f510
grafana                    1/1      11s ago    7d   count:1            registry.redhat.io/rhceph-alpha/rhceph-5-dashboard-rhel8:latest  bd3d7748747b
iscsi.iscsi                2/2      13s ago    7d   label:iscsi        registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest             c88a5d60f510
mgr                        2/2      12s ago    7d   count:2            mix                                                               mix
mon                        2/2      13s ago    7d   magna033;magna032  registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest             c88a5d60f510
node-exporter              1/6      14s ago    7d   *                  registry.redhat.io/openshift4/ose-prometheus-node-exporter:v4.5  mix
osd.all-available-devices  9/9      14s ago    7d   *                  registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest             c88a5d60f510
prometheus                 1/1      11s ago    7d   count:1            registry.redhat.io/openshift4/ose-prometheus:v4.6                 2e37e2555fd5
[ceph: root@magna031 /]#
[ceph: root@magna031 /]# ceph orch ls
NAME                       RUNNING  REFRESHED  AGE  PLACEMENT          IMAGE NAME                                                                                                       IMAGE ID
alertmanager               1/1      30s ago    7d   count:1            registry.redhat.io/openshift4/ose-prometheus-alertmanager:v4.5                                                  b2c52bc75c45
crash                      6/6      84s ago    7d   *                  registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest                                                            c88a5d60f510
grafana                    1/1      30s ago    7d   count:1            registry.redhat.io/rhceph-alpha/rhceph-5-dashboard-rhel8:latest                                                 bd3d7748747b
iscsi.iscsi                2/2      83s ago    7d   label:iscsi        registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest                                                            c88a5d60f510
mgr                        2/2      82s ago    7d   count:2            registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-97619-20210115184956  43a4d2e8dfd3
mon                        2/2      83s ago    7d   magna033;magna032  registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest                                                            c88a5d60f510
node-exporter              1/6      84s ago    7d   *                  registry.redhat.io/openshift4/ose-prometheus-node-exporter:v4.5                                                 mix
osd.all-available-devices  9/9      84s ago    7d   *                  registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest                                                            c88a5d60f510
prometheus                 1/1      30s ago    7d   count:1            registry.redhat.io/openshift4/ose-prometheus:v4.6                                                               2e37e2555fd5
[ceph: root@magna031 /]# ceph status
  cluster:
    id:     ffa035a2-53c9-11eb-9e9c-002590fc25a4
    health: HEALTH_WARN
            failed to probe daemons or devices
  services:
    mon:         2 daemons, quorum magna031,magna033 (age 7d)
    mgr:         magna032.iiowmg(active, since 95s), standbys: magna031.cdrwla
    osd:         9 osds: 9 up (since 6d), 9 in (since 6d)
    tcmu-runner: 2 daemons active (magna032:iscsi/disk_1, magna033:iscsi/disk_1)
  data:
    pools:   2 pools, 33 pgs
    objects: 26 objects, 6.6 KiB
    usage:   88 MiB used, 8.2 TiB / 8.2 TiB avail
    pgs:     33 active+clean
  io:
    client:  1.7 KiB/s rd, 1 op/s rd, 0 op/s wr
  progress:

[ceph: root@magna031 /]#

[ubuntu@magna031 ~]$ sudo cephadm shell
Inferring fsid ffa035a2-53c9-11eb-9e9c-002590fc25a4
Inferring config /var/lib/ceph/ffa035a2-53c9-11eb-9e9c-002590fc25a4/mon.magna031/config
Using recent ceph image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:89948bcc07ab0f1364586b110044fe2ff9e3b3d5b9e39bdb9826bbdc5e368b72
WARNING: The same type, major and minor should not be used for multiple devices.
WARNING: The same type, major and minor should not be used for multiple devices.
WARNING: The same type, major and minor should not be used for multiple devices.
[ceph: root@magna031 /]# ceph orch ls
NAME                       RUNNING  REFRESHED  AGE  PLACEMENT                   IMAGE NAME                                                                                                       IMAGE ID
alertmanager               1/1      4m ago     8d   count:1                     registry.redhat.io/openshift4/ose-prometheus-alertmanager:v4.5                                                  b2c52bc75c45
crash                      6/6      7m ago     8d   *                           registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest                                                            c88a5d60f510
grafana                    1/1      4m ago     8d   count:1                     registry.redhat.io/rhceph-alpha/rhceph-5-dashboard-rhel8:latest                                                 bd3d7748747b
iscsi.iscsi                2/2      7m ago     8d   label:iscsi                 registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest                                                            c88a5d60f510
mds.test                   3/3      7m ago     10h  magna032;magna033;magna034  registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest                                                            c88a5d60f510
mgr                        2/2      7m ago     8d   count:2                     registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-97619-20210115184956  43a4d2e8dfd3
mon                        2/2      7m ago     8d   magna033;magna032           registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest                                                            c88a5d60f510
node-exporter              1/6      7m ago     8d   *                           registry.redhat.io/openshift4/ose-prometheus-node-exporter:v4.5                                                 mix
osd.all-available-devices  9/9      7m ago     15h  *                           registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest                                                            c88a5d60f510
prometheus                 1/1      4m ago     8d   count:1                     registry.redhat.io/openshift4/ose-prometheus:v4.6                                                               2e37e2555fd5
[ceph: root@magna031 /]# ceph orch upgrade status
{
    "target_image": "registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-97619-20210115184956",
    "in_progress": true,
    "services_complete": [],
    "message": ""
}
[ceph: root@magna031 /]#

Expected results:

Additional info:
magna031
Judging from the output of the last "ceph orch ls" command, it had at least upgraded both of the mgr daemons. Typically, the slowest portion of an upgrade is pulling the target image on each host (which usually happens early), so if the registry-proxy.engineering.redhat.com registry was slow at the time, that could be what's slowing this down so much.

However, this issue with the upgrade progress not showing up is a real problem (in this bug and others as well), as it makes it much more difficult for users to see what's going on with the upgrade. I'll ask upstream when I get the chance and see if we can find out why the upgrade progress is no longer showing up properly.
I have a PR open to improve the amount of info available from 'ceph orch upgrade status' during an upgrade, which will hopefully help diagnose issues like this: https://github.com/ceph/ceph/pull/39880

Until then, the only way to see where the upgrade got stuck is to check the mgr debug logs with a couple of extra ceph shell commands:

'ceph config set mgr mgr/cephadm/log_to_cluster_level debug'
<some action, like an upgrade, that you want debug logs for>
'ceph -W cephadm --watch-debug'

Running those two commands will print the debug logs from the mgr in the shell, including info that would tell what's going on with the upgrade. If you don't want to wait for the upgrade status improvements to find the cause here, you could try using those debug logs by running the two commands before and after starting an upgrade, respectively. However, there is a lot of info in those logs, so they can be hard to read if you don't know what you're looking for.
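For reference, a minimal sketch of that interactive flow, run from inside a cephadm shell (the <target-image> placeholder is just an example, not a specific build):

ceph config set mgr mgr/cephadm/log_to_cluster_level debug   # send cephadm debug logs to the cluster log
ceph orch upgrade start --image <target-image>               # or whatever action you want debug logs for
ceph -W cephadm --watch-debug                                # stream the mgr debug log in this shell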
@pnataraj To add to my previous comment about getting the debug logs during an upgrade, I've found this process works pretty well:

1) Go on the host you want to upgrade.

2) Run 'cephadm shell -- ceph config set mgr mgr/cephadm/log_to_cluster_level debug'

3) Run '(cephadm shell -- ceph -W cephadm --watch-debug) > debug.txt &'

This will create a container that runs in the background and prints all the mgr debug logs to "debug.txt". You can see the container listed just like any other:

[root@vm-00 ~]# podman ps
CONTAINER ID  IMAGE                                 COMMAND               CREATED         STATUS             PORTS  NAMES
7b1cdccfe9eb  docker.io/amk3798/ceph:15.2.5         -W cephadm --watc...  29 seconds ago  Up 29 seconds ago         tender_feistel
5cd51349edfa  docker.io/prom/alertmanager:v0.20.0   --config.file=/et...  25 minutes ago  Up 25 minutes ago         ceph-48c15d8a-8357-11eb-b846-52540094da7f-alertmanager.vm-00
59a4286d0036  docker.io/amk3798/ceph:15.2.5         -n osd.3 -f --set...  25 minutes ago  Up 25 minutes ago         ceph-48c15d8a-8357-11eb-b846-52540094da7f-osd.3
a6a726dfe6f6  docker.io/amk3798/ceph:15.2.5         -n osd.0 -f --set...  25 minutes ago  Up 25 minutes ago         ceph-48c15d8a-8357-11eb-b846-52540094da7f-osd.0
0148238dc48c  docker.io/prom/node-exporter:v0.18.1  --no-collector.ti...  26 minutes ago  Up 26 minutes ago         ceph-48c15d8a-8357-11eb-b846-52540094da7f-node-exporter.vm-00
458501eb0b5a  docker.io/amk3798/ceph:15.2.5         -n client.crash.v...  28 minutes ago  Up 28 minutes ago         ceph-48c15d8a-8357-11eb-b846-52540094da7f-crash.vm-00
861c19c164a9  docker.io/amk3798/ceph:15.2.5         -n mgr.vm-00.qemo...  29 minutes ago  Up 29 minutes ago         ceph-48c15d8a-8357-11eb-b846-52540094da7f-mgr.vm-00.qemozq
52cb6ebdfd80  docker.io/amk3798/ceph:15.2.5         -n mon.vm-00 -f -...  29 minutes ago  Up 29 minutes ago         ceph-48c15d8a-8357-11eb-b846-52540094da7f-mon.vm-00

Here it is the first one on the list. You can see it running the watch command in the COMMAND field.

4) Initiate the upgrade: 'cephadm shell -- ceph orch upgrade start --image <image-name>'

At this point, go into another ceph shell or whatever you'd like to try and check the progress of the upgrade. The upgrade will be running, and everything will be logged to debug.txt.

5) Once you think it has been long enough that the upgrade should definitely either be done or have failed (on a small cluster with ~3 hosts and ~20 or fewer daemons this should be around 15 minutes), stop the container that is outputting the debug logs. From the example 'podman ps' above, it would be 'podman stop 7b1cdccfe9eb' to kill the container producing the logs.

6) You should now have the debug logs for the upgrade in debug.txt. I highly recommend using scp to copy the file from the node back to your personal computer. The file will likely be large (when I tried this for an upgrade, the file ended up being over 11000 lines), so you'll want to use a nice text editor on your personal machine (I like Atom) to view the file rather than trying to view it directly on the ceph node with something like vi.

7) Once you have the file open in a nice text editor (one that has a find feature, in this case), search the file for instances of the word 'Upgrade'.
You should see some sections like:

2021-03-12T17:54:41.812733+0000 mgr.vm-02.mtclno [INF] Upgrade: Target is version 17.0.0-1275-g5e197a21 (quincy), container docker.io/amk3798/ceph@sha256:1d35e0a9f8be07b8bf25c1ae1306ec6b85127b981b714f9ca6f69f7a18932f70 digests ['docker.io/amk3798/ceph@sha256:1d35e0a9f8be07b8bf25c1ae1306ec6b85127b981b714f9ca6f69f7a18932f70']
2021-03-12T17:54:41.815143+0000 mgr.vm-02.mtclno [DBG] mon_command: 'config dump' -> 0 in 0.002s
2021-03-12T17:54:41.815288+0000 mgr.vm-02.mtclno [INF] Upgrade: Checking mgr daemons
2021-03-12T17:54:41.815363+0000 mgr.vm-02.mtclno [DBG] daemon mgr.vm-00.qemozq container digest correct
2021-03-12T17:54:41.815407+0000 mgr.vm-02.mtclno [DBG] daemon mgr.vm-02.mtclno container digest correct
2021-03-12T17:54:41.817068+0000 mgr.vm-02.mtclno [DBG] mon_command: 'versions' -> 0 in 0.002s
2021-03-12T17:54:41.817178+0000 mgr.vm-02.mtclno [WRN] Upgrade: 1 mgr daemon(s) are 15.2.5 != target 17.0.0-1275-g5e197a21
2021-03-12T17:54:41.817220+0000 mgr.vm-02.mtclno [INF] Upgrade: Setting container_image for all mgr
2021-03-12T17:54:41.839217+0000 mgr.vm-02.mtclno [DBG] mon_command: 'config set' -> 0 in 0.021s
2021-03-12T17:54:41.839735+0000 mgr.vm-02.mtclno [DBG] Upgrade: Cleaning up container_image for ['mgr.vm-00.qemozq', 'mgr.vm-02.mtclno']
2021-03-12T17:54:41.865324+0000 mgr.vm-02.mtclno [DBG] mon_command: 'config rm' -> 0 in 0.025s
2021-03-12T17:54:41.889028+0000 mgr.vm-02.mtclno [DBG] mon_command: 'config rm' -> 0 in 0.023s
2021-03-12T17:54:41.889621+0000 mgr.vm-02.mtclno [INF] Upgrade: All mgr daemons are up to date.
2021-03-12T17:54:41.890094+0000 mgr.vm-02.mtclno [INF] Upgrade: Checking mon daemons
2021-03-12T17:54:41.890598+0000 mgr.vm-02.mtclno [DBG] daemon mon.vm-00 not correct (docker.io/amk3798/ceph:15.2.5, ['docker.io/amk3798/ceph@sha256:467938ffe694262bea0b58712ebf0683a6745a1597d3ee9c14b3b77a97d10762'], 15.2.5)
2021-03-12T17:54:41.891125+0000 mgr.vm-02.mtclno [DBG] daemon mon.vm-01 not correct (docker.io/amk3798/ceph:15.2.5, ['docker.io/amk3798/ceph@sha256:467938ffe694262bea0b58712ebf0683a6745a1597d3ee9c14b3b77a97d10762'], 15.2.5)
2021-03-12T17:54:41.891673+0000 mgr.vm-02.mtclno [DBG] daemon mon.vm-02 not correct (docker.io/amk3798/ceph:15.2.5, ['docker.io/amk3798/ceph@sha256:467938ffe694262bea0b58712ebf0683a6745a1597d3ee9c14b3b77a97d10762'], 15.2.5)
2021-03-12T17:54:41.894240+0000 mgr.vm-02.mtclno [DBG] mon_command: 'mon ok-to-stop' -> 0 in 0.002s
2021-03-12T17:54:41.894793+0000 mgr.vm-02.mtclno [INF] It is presumed safe to stop mon.vm-00
2021-03-12T17:54:41.895258+0000 mgr.vm-02.mtclno [INF] Upgrade: It is presumed safe to stop mon.vm-00
2021-03-12T17:54:41.895983+0000 mgr.vm-02.mtclno [DBG] _run_cephadm : command = inspect-image
2021-03-12T17:54:41.896468+0000 mgr.vm-02.mtclno [DBG] _run_cephadm : args = []
2021-03-12T17:54:41.896987+0000 mgr.vm-02.mtclno [DBG] Have connection to vm-00
2021-03-12T17:54:41.897300+0000 mgr.vm-02.mtclno [DBG] args: --image docker.io/amk3798/ceph@sha256:1d35e0a9f8be07b8bf25c1ae1306ec6b85127b981b714f9ca6f69f7a18932f70 inspect-image
2021-03-12T17:54:42.809584+0000 mgr.vm-02.mtclno [DBG] code: 0
2021-03-12T17:54:42.809677+0000 mgr.vm-02.mtclno [DBG] out: {
    "ceph_version": "ceph version 17.0.0-1275-g5e197a21 (5e197a21e61b6d7e4f41a330cd63bc787164937d) quincy (dev)",
    "image_id": "b485ead7ae5f3b78ce5d33f5f3380e887cbbb737b64a4ca298eac78882c9b8b7",
    "repo_digests": [
        "docker.io/amk3798/ceph@sha256:1d35e0a9f8be07b8bf25c1ae1306ec6b85127b981b714f9ca6f69f7a18932f70"
    ]
}
2021-03-12T17:54:42.809819+0000 mgr.vm-02.mtclno [INF] Upgrade: Updating mon.vm-00
2021-03-12T17:54:42.830627+0000 mgr.vm-02.mtclno [DBG] mon_command: 'config set' -> 0 in 0.021s
2021-03-12T17:54:42.832780+0000 mgr.vm-02.mtclno [DBG] mon_command: 'auth get' -> 0 in 0.002s
2021-03-12T17:54:42.834794+0000 mgr.vm-02.mtclno [DBG] mon_command: 'config get' -> 0 in 0.002s
2021-03-12T17:54:42.836693+0000 mgr.vm-02.mtclno [DBG] mon_command: 'config generate-minimal-conf' -> 0 in 0.001s
2021-03-12T17:54:42.837244+0000 mgr.vm-02.mtclno [INF] Deploying daemon mon.vm-00 on vm-00

In that section, for example, you can see that the mgr daemons have already been upgraded and that it is currently upgrading the monitors. The last few instances where "Upgrade" is printed should hopefully give some valuable insight into what went wrong. If the upgrade succeeded, the last instance should be a line like:

2021-03-12T17:58:38.276424+0000 mgr.vm-02.mtclno [INF] Upgrade: Complete!

Otherwise, the last instance should be the last thing the upgrade attempted, which is hopefully whatever is causing the issue. Let me know if you have any issues with this process or if it helps you find anything useful.
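Putting the steps above together, a condensed sketch of the background-capture variant; <image-name> and <container-id> are placeholders for your target image and the ID shown by 'podman ps':

cephadm shell -- ceph config set mgr mgr/cephadm/log_to_cluster_level debug
(cephadm shell -- ceph -W cephadm --watch-debug) > debug.txt &
cephadm shell -- ceph orch upgrade start --image <image-name>
# ...wait until the upgrade should have finished or failed...
podman ps                     # note the container whose COMMAND is '-W cephadm --watc...'
podman stop <container-id>    # stop the log-capturing container
grep -n 'Upgrade' debug.txt   # jump to the upgrade-related log lines before copying the file off the node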
@Adam, Thanks for the info. The debug logs are helpful for checking the upgrade status.
The additional upgrade info PR was merged upstream as well: https://github.com/ceph/ceph/pull/39880. Hopefully it will be backported soon.
Sage has backported the relevant changes to pacific in https://github.com/ceph/ceph/pull/40202. When that is in pacific upstream, we'll have it in the next downstream rebase.
I think a new downstream image was built yesterday that should have the additional info in the 'ceph orch upgrade status' command. Some extra info from that, or even better the debug logs, is going to be needed to figure out what is causing the issue here. See if you can verify this using the new build and with the additional information.

Also, after looking at this again, I'm seeing that only 1/6 node-exporters were running while the upgrade was going on. Maybe the issue is related to that? I'd check the status of the node-exporter daemons before and during the upgrade. There can be situations where node-exporter is repeatedly reconfigured and this fails, which would really slow the upgrade down. You can see some information about situations like that in the "Another Note" section of this comment: https://bugzilla.redhat.com/show_bug.cgi?id=1935044#c3
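For example, from a cephadm shell the node-exporter daemons could be checked with standard orchestrator commands like the following (just a sketch; grep is used here the same way as in the transcripts below):

ceph orch ls | grep node-exporter   # service-level summary; expect 6/6 running rather than 1/6
ceph orch ps | grep node-exporter   # per-daemon state on each host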
I rebased ceph-5.0-rhel-patches again today for some other bugs, and that build includes all the changes in https://github.com/ceph/ceph/pull/40202. Please follow Adam's Comment #7 with today's build, ceph-16.1.0-1084.el8cp.
Tried the above commands with the specified build. The upgrade status lists the daemons upgraded, but the upgrade has been in progress for a long time:

# ceph orch upgrade status
{
    "target_image": "registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:5495fe25a0500573ce1787a6bce219169ea4c098fefe68c429cc5efb065e7246",
    "in_progress": true,
    "services_complete": [
        "mgr",
        "mds",
        "osd",
        "mon"
    ],
    "progress": "31/31 ceph daemons upgraded",
    "message": "Doing first pull of registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-22961-20210324054847 image"
}

Ceph status enters error mode:

# ceph -s
  cluster:
    id:     1878d0d4-8e3f-11eb-9fd8-fa163ece0c4e
    health: HEALTH_ERR
            Module 'cephadm' has failed: too many values to unpack (expected 3)
            Degraded data redundancy: 8625/204039 objects degraded (4.227%), 6 pgs degraded, 8 pgs undersized
  services:
    mon: 5 daemons, quorum ceph-pdhiran-cephadm-1616767756496-node1-mon-installer-node-exp,ceph-pdhiran-cephadm-1616767756496-node6-mon-mgr-mds-node-expor,ceph-pdhiran-cephadm-1616767756496-node2-mon-mgr-mds-node-expor,ceph-pdhiran-cephadm-1616767756496-node11-mon-node-exporter-cra,ceph-pdhiran-cephadm-1616767756496-node7-mon-node-exporter-cras (age 61m)
    mgr: ceph-pdhiran-cephadm-1616767756496-node1-mon-installer-node-exp.eesken(active, since 57m), standbys: ceph-pdhiran-cephadm-1616767756496-node2-mon-mgr-mds-node-expor.aohezh, ceph-pdhiran-cephadm-1616767756496-node6-mon-mgr-mds-node-expor.madiyk
    mds: 1/1 daemons up, 1 standby
    osd: 21 osds: 21 up (since 58m), 21 in (since 59m); 8 remapped pgs
  data:
    volumes: 1/1 healthy
    pools:   6 pools, 257 pgs
    objects: 68.01k objects, 266 MiB
    usage:   2.8 GiB used, 286 GiB / 289 GiB avail
    pgs:     8625/204039 objects degraded (4.227%)
             3936/204039 objects misplaced (1.929%)
             249 active+clean
             6   active+recovery_wait+undersized+degraded+remapped
             2   active+recovering+undersized+remapped
  io:
    recovery: 55 KiB/s, 13 objects/s
  progress:
    Global Recovery Event (33m)
      [===========================.] (remaining: 64s)
    Upgrade to registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-22961-20210324054847 (0s)
      [............................]
# ceph health detail
HEALTH_ERR Module 'cephadm' has failed: too many values to unpack (expected 3); Degraded data redundancy: 6527/204162 objects degraded (3.197%), 4 pgs degraded, 6 pgs undersized
[ERR] MGR_MODULE_ERROR: Module 'cephadm' has failed: too many values to unpack (expected 3)
    Module 'cephadm' has failed: too many values to unpack (expected 3)

'ceph orch ls' still shows the old image names:

# ceph orch ls
NAME                       RUNNING  REFRESHED  AGE  PLACEMENT                                                                                                                                 IMAGE NAME                                                                                                                 IMAGE ID
alertmanager               1/1      40m ago    71m  count:1                                                                                                                                   registry.redhat.io/openshift4/ose-prometheus-alertmanager:v4.5                                                            12d6d4b9afb2
grafana                    1/1      40m ago    71m  count:1                                                                                                                                   registry.redhat.io/rhceph-alpha/rhceph-5-dashboard-rhel8:latest                                                           ea002a20207d
mds.cephfs                 2/2      40m ago    62m  ceph-pdhiran-cephadm-1616767756496-node6-mon-mgr-mds-node-expor;ceph-pdhiran-cephadm-1616767756496-node2-mon-mgr-mds-node-expor;count:2  registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:5495fe25a0500573ce1787a6bce219169ea4c098fefe68c429cc5efb065e7246  c1a8b8b28c91
mgr                        3/2      40m ago    68m  label:mgr                                                                                                                                 mix                                                                                                                        c1a8b8b28c91
mon                        5/5      40m ago    65m  label:mon                                                                                                                                 mix                                                                                                                        c1a8b8b28c91
node-exporter              10/10    40m ago    71m  *                                                                                                                                         registry.redhat.io/openshift4/ose-prometheus-node-exporter:v4.5                                                           934722bb0c30
osd.all-available-devices  21/21    40m ago    63m  *                                                                                                                                         registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:5495fe25a0500573ce1787a6bce219169ea4c098fefe68c429cc5efb065e7246  c1a8b8b28c91
prometheus                 1/1      40m ago    71m  count:1                                                                                                                                   registry.redhat.io/openshift4/ose-prometheus:v4.6                                                                         476b3dbd7bc2

# cephadm version
Using recent ceph image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:5495fe25a0500573ce1787a6bce219169ea4c098fefe68c429cc5efb065e7246
ceph version 16.1.0-1084.el8cp (899d93a5c7913d6952438f4b48d29d1cef2aaa2a) pacific (rc)

# ceph orch upgrade start --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-22961-20210324054847
Initiating upgrade to registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-22961-20210324054847

# ceph versions
{
    "mon": {
        "ceph version 16.1.0-1084.el8cp (899d93a5c7913d6952438f4b48d29d1cef2aaa2a) pacific (rc)": 5
    },
    "mgr": {
        "ceph version 16.1.0-1084.el8cp (899d93a5c7913d6952438f4b48d29d1cef2aaa2a) pacific (rc)": 3
    },
    "osd": {
        "ceph version 16.1.0-1084.el8cp (899d93a5c7913d6952438f4b48d29d1cef2aaa2a) pacific (rc)": 21
    },
    "mds": {
        "ceph version 16.1.0-1084.el8cp (899d93a5c7913d6952438f4b48d29d1cef2aaa2a) pacific (rc)": 2
    },
    "overall": {
        "ceph version 16.1.0-1084.el8cp (899d93a5c7913d6952438f4b48d29d1cef2aaa2a) pacific (rc)": 31
    }
}
@Adam, I have used the latest build specified. However, I still see the upgrade in progress and no change can be seen in the "ceph orch upgrade status" output.

Host details below:
magna057
root/q
(In reply to Preethi from comment #15)
> @Adam, I have used the latest build specified. However, I still see the
> upgrade in progress and no changes could see in the "ceph orch upgrade
> status"
>
> Below host details
> magna057
> root/q

[ceph: root@magna057 /]# ceph orch upgrade status
{
    "target_image": "registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-22961-20210324054847",
    "in_progress": true,
    "services_complete": [],
    "message": ""
}
[ceph: root@magna057 /]#
@Adam, We still see "upgrade in progress" due to the below error in the health status. Hence, we cannot verify this until the below issue is fixed.

# ceph -s
  health: HEALTH_ERR
          Module 'cephadm' has failed: too many values to unpack (expected 3)

(The cephadm module is disabled because of the "too many values to unpack" error shown in the 'ceph -s' output. Some of our downstream images have the version formatted differently, and that is causing the above health error.)
@Adam, I created a separate BZ to track this module failure issue. Yes, once the fix is available downstream we will follow the above-mentioned guideline. https://bugzilla.redhat.com/show_bug.cgi?id=1944978
The fix for the "too many values to unpack" upgrade issue was merged and backported, which may also fix this BZ once it is downstream. Refer to this comment: https://bugzilla.redhat.com/show_bug.cgi?id=1944978#c6
@Adam, We do not see the issue for build-to-build upgrades in the following cases:

a) Internal registry to internal registry - The fix is present in Ceph v16.2.0-6 and above. Hence, we cannot directly run the upgrade command from a lower version of Ceph (i.e. below 16.2.0-6) to the latest Ceph version and used the workaround for the upgrade instead. If the Ceph version is 16.2.0-6 or above, build-to-build upgrades can be performed directly using "ceph orch upgrade start --image <image name>" - this also worked fine when verified.

b) Default path to internal registry - Verified and working fine with the workaround; since the default path has an older beta build, we needed to apply the workaround to test this.

Below is a snippet of the output with the workaround:

[ceph: root@magna007 /]# ceph orch ps | grep mgr
mgr.magna007.wpgvme  magna007  running (6h)   11s ago  3w  *:9283         16.1.0-1325-geb5d7a86  0a963d7074de  585d98f2cc16
mgr.magna010.syndxo  magna010  running (29s)  13s ago  3w  *:8443 *:9283  16.2.0-13.el8cp        89a188512eee  53aae976ffc6
[ceph: root@magna007 /]# ceph orch daemon redeploy mgr.magna007.wpgvme --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-99506-20210424023822
Scheduled to redeploy mgr.magna007.wpgvme on host 'magna007'
[ceph: root@magna007 /]# ceph mgr fail
[ceph: root@magna007 /]# ceph -s
  cluster:
    id:     802d6a00-9277-11eb-aa4f-002590fc2538
    health: HEALTH_OK
  services:
    mon:        3 daemons, quorum magna007,magna010,magna104 (age 6h)
    mgr:        magna007.wpgvme(active, since 7s), standbys: magna010.syndxo
    osd:        15 osds: 15 up (since 6h), 15 in (since 2w)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        5 daemons active (3 hosts, 1 zones)
  data:
    pools:   7 pools, 169 pgs
    objects: 372 objects, 24 KiB
    usage:   9.5 GiB used, 14 TiB / 14 TiB avail
    pgs:     169 active+clean
[ceph: root@magna007 /]# ceph orch ps | grep mgr
mgr.magna007.wpgvme  magna007  running (19s)  0s ago  3w  *:8443 *:9283  16.2.0-13.el8cp  89a188512eee  a287245f4789
mgr.magna010.syndxo  magna010  running (6m)   3s ago  3w  *:8443 *:9283  16.2.0-13.el8cp  89a188512eee  53aae976ffc6
[ceph: root@magna007 /]# ceph orch
[ceph: root@magna007 /]# ceph orch upgrade start --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-99506-20210424023822
Initiating upgrade to registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-99506-20210424023822
[ceph: root@magna007 /]# ceph orch upgrade status
{
    "target_image": "registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:55206326df77ef04991a3d4a59621f9dfcff5a8e68c151febc3d5e0e1cfd79e8",
    "in_progress": true,
    "services_complete": [
        "mgr"
    ],
    "progress": "2/41 ceph daemons upgraded",
    "message": ""
}
[ceph: root@magna007 /]#
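For anyone hitting the same pre-16.2.0-6 limitation, a rough sketch of the workaround as used above; the mgr daemon name and image are placeholders, and which mgr to redeploy first depends on your cluster:

ceph orch ps | grep mgr                                          # find the mgr daemons and their current versions
ceph orch daemon redeploy mgr.<host>.<id> --image <new-image>    # redeploy a mgr daemon on the target image
ceph mgr fail                                                    # fail over so an upgraded mgr becomes active
ceph orch upgrade start --image <new-image>                      # then run the normal orchestrator upgrade
ceph orch upgrade status                                         # progress should now be reported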
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat Ceph Storage 5.0 bug fix and enhancement), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:3294