Bug 1917949 - [cephadm] 5.0 - Ceph orch upgrade from redhat.registry to internal registry is taking long time to complete and still shows in progress
Summary: [cephadm] 5.0 - Ceph orch upgrade from redhat.registry to internal registry i...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Cephadm
Version: 5.0
Hardware: x86_64
OS: All
Priority: high
Severity: high
Target Milestone: ---
Target Release: 5.0
Assignee: Adam King
QA Contact: Preethi
Docs Contact: Karen Norteman
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-01-19 17:58 UTC by Preethi
Modified: 2021-08-30 08:28 UTC
CC: 6 users

Fixed In Version: ceph-16.2.0-13.el8cp
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-30 08:27:54 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHCEPH-1048 0 None None None 2021-08-27 05:14:51 UTC
Red Hat Product Errata RHBA-2021:3294 0 None None None 2021-08-30 08:28:10 UTC

Description Preethi 2021-01-19 17:58:40 UTC
Description of problem: [cephadm] 5.0 - Ceph orch upgrade from registry.redhat.io to the internal registry is taking a long time to complete and still shows as in progress,
and there is no progress bar displayed in the Ceph cluster indicating when the upgrade is expected to complete.

Version-Release number of selected component (if applicable):
[ubuntu@magna031 ~]$ sudo cephadm version
Using recent ceph image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:89948bcc07ab0f1364586b110044fe2ff9e3b3d5b9e39bdb9826bbdc5e368b72
ceph version 16.0.0-8633.el8cp (a0d3d2e82a786c006507b0df445433f7725e477d) pacific (dev)
[ubuntu@magna031 ~]$ sudo rpm -qa | grep cephadm
cephadm-16.0.0-8633.el8cp.noarch
[ubuntu@magna031 ~]$ 


How reproducible:


Steps to Reproduce:
1. Have a cluster with 5.0 bootstrapped with registry.redhat.io
2. Perform an upgrade using the ceph orch upgrade command from registry.redhat.io to the internal registry, i.e. registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-97619-20210115184956
3. Observe the behaviour

Actual results:
The following output is for reference:

[ceph: root@magna031 /]# ceph orch upgrade start --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-97619-20210115184956
Initiating upgrade to registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-97619-20210115184956
[ceph: root@magna031 /]# ceph orch upgrade status
{
    "target_image": "registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-97619-20210115184956",
    "in_progress": true,
    "services_complete": [],
    "message": ""
}
[ceph: root@magna031 /]# 

[ceph: root@magna031 /]# ceph status
  cluster:
    id:     ffa035a2-53c9-11eb-9e9c-002590fc25a4
    health: HEALTH_WARN
            failed to probe daemons or devices
 
  services:
    mon:         2 daemons, quorum magna031,magna033 (age 7d)
    mgr:         magna031.cdrwla(active, since 7d), standbys: magna032.iiowmg
    osd:         9 osds: 9 up (since 6d), 9 in (since 6d)
    tcmu-runner: 2 daemons active (magna032:iscsi/disk_1, magna033:iscsi/disk_1)
 
  data:
    pools:   2 pools, 33 pgs
    objects: 26 objects, 6.6 KiB
    usage:   87 MiB used, 8.2 TiB / 8.2 TiB avail
    pgs:     33 active+clean
 
  io:
    client:   1.7 KiB/s rd, 1 op/s rd, 0 op/s wr
 
  progress:

[ceph: root@magna031 /]# ceph orch ls
NAME                       RUNNING  REFRESHED  AGE  PLACEMENT          IMAGE NAME                                                       IMAGE ID      
alertmanager                   1/1  11s ago    7d   count:1            registry.redhat.io/openshift4/ose-prometheus-alertmanager:v4.5   b2c52bc75c45  
crash                          6/6  14s ago    7d   *                  registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest            c88a5d60f510  
grafana                        1/1  11s ago    7d   count:1            registry.redhat.io/rhceph-alpha/rhceph-5-dashboard-rhel8:latest  bd3d7748747b  
iscsi.iscsi                    2/2  13s ago    7d   label:iscsi        registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest            c88a5d60f510  
mgr                            2/2  12s ago    7d   count:2            mix                                                              mix           
mon                            2/2  13s ago    7d   magna033;magna032  registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest            c88a5d60f510  
node-exporter                  1/6  14s ago    7d   *                  registry.redhat.io/openshift4/ose-prometheus-node-exporter:v4.5  mix           
osd.all-available-devices      9/9  14s ago    7d   *                  registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest            c88a5d60f510  
prometheus                     1/1  11s ago    7d   count:1            registry.redhat.io/openshift4/ose-prometheus:v4.6                2e37e2555fd5  
[ceph: root@magna031 /]# 

[ceph: root@magna031 /]# ceph orch ls
NAME                       RUNNING  REFRESHED  AGE  PLACEMENT          IMAGE NAME                                                                                                      IMAGE ID      
alertmanager                   1/1  30s ago    7d   count:1            registry.redhat.io/openshift4/ose-prometheus-alertmanager:v4.5                                                  b2c52bc75c45  
crash                          6/6  84s ago    7d   *                  registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest                                                           c88a5d60f510  
grafana                        1/1  30s ago    7d   count:1            registry.redhat.io/rhceph-alpha/rhceph-5-dashboard-rhel8:latest                                                 bd3d7748747b  
iscsi.iscsi                    2/2  83s ago    7d   label:iscsi        registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest                                                           c88a5d60f510  
mgr                            2/2  82s ago    7d   count:2            registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-97619-20210115184956  43a4d2e8dfd3  
mon                            2/2  83s ago    7d   magna033;magna032  registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest                                                           c88a5d60f510  
node-exporter                  1/6  84s ago    7d   *                  registry.redhat.io/openshift4/ose-prometheus-node-exporter:v4.5                                                 mix           
osd.all-available-devices      9/9  84s ago    7d   *                  registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest                                                           c88a5d60f510  
prometheus                     1/1  30s ago    7d   count:1            registry.redhat.io/openshift4/ose-prometheus:v4.6                                                               2e37e2555fd5  
[ceph: root@magna031 /]# ceph status
  cluster:
    id:     ffa035a2-53c9-11eb-9e9c-002590fc25a4
    health: HEALTH_WARN
            failed to probe daemons or devices
 
  services:
    mon:         2 daemons, quorum magna031,magna033 (age 7d)
    mgr:         magna032.iiowmg(active, since 95s), standbys: magna031.cdrwla
    osd:         9 osds: 9 up (since 6d), 9 in (since 6d)
    tcmu-runner: 2 daemons active (magna032:iscsi/disk_1, magna033:iscsi/disk_1)
 
  data:
    pools:   2 pools, 33 pgs
    objects: 26 objects, 6.6 KiB
    usage:   88 MiB used, 8.2 TiB / 8.2 TiB avail
    pgs:     33 active+clean
 
  io:
    client:   1.7 KiB/s rd, 1 op/s rd, 0 op/s wr
 
  progress:
 
[ceph: root@magna031 /]# 

[ubuntu@magna031 ~]$ sudo cephadm shell
Inferring fsid ffa035a2-53c9-11eb-9e9c-002590fc25a4
Inferring config /var/lib/ceph/ffa035a2-53c9-11eb-9e9c-002590fc25a4/mon.magna031/config
Using recent ceph image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:89948bcc07ab0f1364586b110044fe2ff9e3b3d5b9e39bdb9826bbdc5e368b72
WARNING: The same type, major and minor should not be used for multiple devices.
WARNING: The same type, major and minor should not be used for multiple devices.
WARNING: The same type, major and minor should not be used for multiple devices.
[ceph: root@magna031 /]# ceph orch ls
NAME                       RUNNING  REFRESHED  AGE  PLACEMENT                   IMAGE NAME                                                                                                      IMAGE ID      
alertmanager                   1/1  4m ago     8d   count:1                     registry.redhat.io/openshift4/ose-prometheus-alertmanager:v4.5                                                  b2c52bc75c45  
crash                          6/6  7m ago     8d   *                           registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest                                                           c88a5d60f510  
grafana                        1/1  4m ago     8d   count:1                     registry.redhat.io/rhceph-alpha/rhceph-5-dashboard-rhel8:latest                                                 bd3d7748747b  
iscsi.iscsi                    2/2  7m ago     8d   label:iscsi                 registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest                                                           c88a5d60f510  
mds.test                       3/3  7m ago     10h  magna032;magna033;magna034  registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest                                                           c88a5d60f510  
mgr                            2/2  7m ago     8d   count:2                     registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-97619-20210115184956  43a4d2e8dfd3  
mon                            2/2  7m ago     8d   magna033;magna032           registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest                                                           c88a5d60f510  
node-exporter                  1/6  7m ago     8d   *                           registry.redhat.io/openshift4/ose-prometheus-node-exporter:v4.5                                                 mix           
osd.all-available-devices      9/9  7m ago     15h  *                           registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest                                                           c88a5d60f510  
prometheus                     1/1  4m ago     8d   count:1                     registry.redhat.io/openshift4/ose-prometheus:v4.6                                                               2e37e2555fd5  
[ceph: root@magna031 /]# ceph orch upgrade status
{
    "target_image": "registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-97619-20210115184956",
    "in_progress": true,
    "services_complete": [],
    "message": ""
}
[ceph: root@magna031 /]# 




Expected results:


Additional info: magna031

Comment 1 Adam King 2021-02-25 19:28:07 UTC
Judging from the output of the last "ceph orch ls" command, it had at least upgraded both of the mgr daemons. Typically, the slowest portion of an upgrade is pulling the target image on each host (which usually happens early), so if the registry-proxy.engineering.redhat.com registry was slow at the time, that could be what slowed this down so much. However, the upgrade progress not showing up is a real problem (in this bug and others as well), as it makes it much more difficult for users to see what's going on with the upgrade. I'll ask upstream when I get the chance and see if we can find out why the upgrade progress is no longer showing up properly.

Comment 2 Adam King 2021-03-11 21:42:14 UTC
I have a PR open to improve the amount of info available from 'ceph orch upgrade status' during an upgrade, which will hopefully help diagnose issues like this: https://github.com/ceph/ceph/pull/39880. Until then, the only way to see where the upgrade got stuck is to check the mgr debug logs with a couple of extra ceph shell commands:

'ceph config set mgr mgr/cephadm/log_to_cluster_level debug'

<some action, like an upgrade, that you want debug logs for>

'ceph -W cephadm --watch-debug'


Running those two commands will print the mgr debug logs in the shell, including info that indicates what's going on with the upgrade. If you don't want to wait for the upgrade status improvements to try to find the cause here, you could use those debug logs by running the two commands before and after an upgrade, respectively. However, there's a lot of info in those logs, so they can be hard to read if you don't know what you're looking for.
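
As a minimal sketch of that flow (the cleanup step at the end is my assumption that you will want to remove the debug override afterwards; 'ceph config rm' just deletes the setting again):

    ceph config set mgr mgr/cephadm/log_to_cluster_level debug   # enable cephadm debug logging
    ceph -W cephadm --watch-debug                                # stream the logs while the upgrade runs
    ceph config rm mgr mgr/cephadm/log_to_cluster_level          # drop the override when finished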

Comment 3 Adam King 2021-03-12 18:36:07 UTC
@pnataraj, to add to my previous comment about getting the debug logs during an upgrade, I've found this process works pretty well:

1) Go to the host you want to upgrade

2) run 'cephadm shell -- ceph config set mgr mgr/cephadm/log_to_cluster_level debug'

3) run '(cephadm shell -- ceph -W cephadm --watch-debug) > debug.txt &'

   this will create a container that will run in the background and print all the mgr debug logs to "debug.txt"

   You can see the container listed just like any other:

   [root@vm-00 ~]# podman ps
   CONTAINER ID  IMAGE                                 COMMAND               CREATED         STATUS             PORTS   NAMES
   7b1cdccfe9eb  docker.io/amk3798/ceph:15.2.5         -W cephadm --watc...  29 seconds ago  Up 29 seconds ago          tender_feistel
   5cd51349edfa  docker.io/prom/alertmanager:v0.20.0   --config.file=/et...  25 minutes ago  Up 25 minutes ago          ceph-48c15d8a-8357-11eb-b846-52540094da7f-alertmanager.vm-00
   59a4286d0036  docker.io/amk3798/ceph:15.2.5         -n osd.3 -f --set...  25 minutes ago  Up 25 minutes ago          ceph-48c15d8a-8357-11eb-b846-52540094da7f-osd.3
   a6a726dfe6f6  docker.io/amk3798/ceph:15.2.5         -n osd.0 -f --set...  25 minutes ago  Up 25 minutes ago          ceph-48c15d8a-8357-11eb-b846-52540094da7f-osd.0
   0148238dc48c  docker.io/prom/node-exporter:v0.18.1  --no-collector.ti...  26 minutes ago  Up 26 minutes ago          ceph-48c15d8a-8357-11eb-b846-52540094da7f-node-exporter.vm-00
   458501eb0b5a  docker.io/amk3798/ceph:15.2.5         -n client.crash.v...  28 minutes ago  Up 28 minutes ago          ceph-48c15d8a-8357-11eb-b846-52540094da7f-crash.vm-00
   861c19c164a9  docker.io/amk3798/ceph:15.2.5         -n mgr.vm-00.qemo...  29 minutes ago  Up 29 minutes ago          ceph-48c15d8a-8357-11eb-b846-52540094da7f-mgr.vm-00.qemozq
   52cb6ebdfd80  docker.io/amk3798/ceph:15.2.5         -n mon.vm-00 -f -...  29 minutes ago  Up 29 minutes ago          ceph-48c15d8a-8357-11eb-b846-52540094da7f-mon.vm-00

   Here it is the first one in the list. You can see it running the watch command in the COMMAND field.


4) Initiate the upgrade: 'cephadm shell -- ceph orch upgrade start --image <image-name>'

   At this point, go into another ceph shell (or whatever you'd like) to check the progress of the upgrade. The upgrade will be running and everything will be logged to debug.txt.


5) Once you think it has been long enough that the upgrade should definitely be either done or failed (on a small cluster with ~3 hosts and ~20 or fewer daemons this should be around 15 minutes),
   stop the container that is outputting the debug logs. From the example 'podman ps' above, it would be 'podman stop 7b1cdccfe9eb' to kill the container producing the logs.

6) You should now have the debug logs for the upgrade in debug.txt. I highly recommend using scp to copy the file from the node back to your personal computer. The file will likely be large
   (when I tried this for an upgrade, the file ended up being over 11000 lines), so you'll want to use a nice text editor on your personal machine (I like Atom) to view the file rather than
   trying to view it directly on the ceph node with something like vi.

7) Once you have the file open in a nice text editor (one that has a find feature, in this case), search the file for instances of the word 'Upgrade'. You should see some sections like:


2021-03-12T17:54:41.812733+0000 mgr.vm-02.mtclno [INF] Upgrade: Target is version 17.0.0-1275-g5e197a21 (quincy), container docker.io/amk3798/ceph@sha256:1d35e0a9f8be07b8bf25c1ae1306ec6b85127b981b714f9ca6f69f7a18932f70 digests ['docker.io/amk3798/ceph@sha256:1d35e0a9f8be07b8bf25c1ae1306ec6b85127b981b714f9ca6f69f7a18932f70']
2021-03-12T17:54:41.815143+0000 mgr.vm-02.mtclno [DBG] mon_command: 'config dump' -> 0 in 0.002s
2021-03-12T17:54:41.815288+0000 mgr.vm-02.mtclno [INF] Upgrade: Checking mgr daemons
2021-03-12T17:54:41.815363+0000 mgr.vm-02.mtclno [DBG] daemon mgr.vm-00.qemozq container digest correct
2021-03-12T17:54:41.815407+0000 mgr.vm-02.mtclno [DBG] daemon mgr.vm-02.mtclno container digest correct
2021-03-12T17:54:41.817068+0000 mgr.vm-02.mtclno [DBG] mon_command: 'versions' -> 0 in 0.002s
2021-03-12T17:54:41.817178+0000 mgr.vm-02.mtclno [WRN] Upgrade: 1 mgr daemon(s) are 15.2.5 != target 17.0.0-1275-g5e197a21
2021-03-12T17:54:41.817220+0000 mgr.vm-02.mtclno [INF] Upgrade: Setting container_image for all mgr
2021-03-12T17:54:41.839217+0000 mgr.vm-02.mtclno [DBG] mon_command: 'config set' -> 0 in 0.021s
2021-03-12T17:54:41.839735+0000 mgr.vm-02.mtclno [DBG] Upgrade: Cleaning up container_image for ['mgr.vm-00.qemozq', 'mgr.vm-02.mtclno']
2021-03-12T17:54:41.865324+0000 mgr.vm-02.mtclno [DBG] mon_command: 'config rm' -> 0 in 0.025s
2021-03-12T17:54:41.889028+0000 mgr.vm-02.mtclno [DBG] mon_command: 'config rm' -> 0 in 0.023s
2021-03-12T17:54:41.889621+0000 mgr.vm-02.mtclno [INF] Upgrade: All mgr daemons are up to date.
2021-03-12T17:54:41.890094+0000 mgr.vm-02.mtclno [INF] Upgrade: Checking mon daemons
2021-03-12T17:54:41.890598+0000 mgr.vm-02.mtclno [DBG] daemon mon.vm-00 not correct (docker.io/amk3798/ceph:15.2.5, ['docker.io/amk3798/ceph@sha256:467938ffe694262bea0b58712ebf0683a6745a1597d3ee9c14b3b77a97d10762'], 15.2.5)
2021-03-12T17:54:41.891125+0000 mgr.vm-02.mtclno [DBG] daemon mon.vm-01 not correct (docker.io/amk3798/ceph:15.2.5, ['docker.io/amk3798/ceph@sha256:467938ffe694262bea0b58712ebf0683a6745a1597d3ee9c14b3b77a97d10762'], 15.2.5)
2021-03-12T17:54:41.891673+0000 mgr.vm-02.mtclno [DBG] daemon mon.vm-02 not correct (docker.io/amk3798/ceph:15.2.5, ['docker.io/amk3798/ceph@sha256:467938ffe694262bea0b58712ebf0683a6745a1597d3ee9c14b3b77a97d10762'], 15.2.5)
2021-03-12T17:54:41.894240+0000 mgr.vm-02.mtclno [DBG] mon_command: 'mon ok-to-stop' -> 0 in 0.002s
2021-03-12T17:54:41.894793+0000 mgr.vm-02.mtclno [INF] It is presumed safe to stop mon.vm-00
2021-03-12T17:54:41.895258+0000 mgr.vm-02.mtclno [INF] Upgrade: It is presumed safe to stop mon.vm-00
2021-03-12T17:54:41.895983+0000 mgr.vm-02.mtclno [DBG] _run_cephadm : command = inspect-image
2021-03-12T17:54:41.896468+0000 mgr.vm-02.mtclno [DBG] _run_cephadm : args = []
2021-03-12T17:54:41.896987+0000 mgr.vm-02.mtclno [DBG] Have connection to vm-00
2021-03-12T17:54:41.897300+0000 mgr.vm-02.mtclno [DBG] args: --image docker.io/amk3798/ceph@sha256:1d35e0a9f8be07b8bf25c1ae1306ec6b85127b981b714f9ca6f69f7a18932f70 inspect-image
2021-03-12T17:54:42.809584+0000 mgr.vm-02.mtclno [DBG] code: 0
2021-03-12T17:54:42.809677+0000 mgr.vm-02.mtclno [DBG] out: {
    "ceph_version": "ceph version 17.0.0-1275-g5e197a21 (5e197a21e61b6d7e4f41a330cd63bc787164937d) quincy (dev)",
    "image_id": "b485ead7ae5f3b78ce5d33f5f3380e887cbbb737b64a4ca298eac78882c9b8b7",
    "repo_digests": [
        "docker.io/amk3798/ceph@sha256:1d35e0a9f8be07b8bf25c1ae1306ec6b85127b981b714f9ca6f69f7a18932f70"
    ]
}
2021-03-12T17:54:42.809819+0000 mgr.vm-02.mtclno [INF] Upgrade: Updating mon.vm-00
2021-03-12T17:54:42.830627+0000 mgr.vm-02.mtclno [DBG] mon_command: 'config set' -> 0 in 0.021s
2021-03-12T17:54:42.832780+0000 mgr.vm-02.mtclno [DBG] mon_command: 'auth get' -> 0 in 0.002s
2021-03-12T17:54:42.834794+0000 mgr.vm-02.mtclno [DBG] mon_command: 'config get' -> 0 in 0.002s
2021-03-12T17:54:42.836693+0000 mgr.vm-02.mtclno [DBG] mon_command: 'config generate-minimal-conf' -> 0 in 0.001s
2021-03-12T17:54:42.837244+0000 mgr.vm-02.mtclno [INF] Deploying daemon mon.vm-00 on vm-00


In that section, for example, you can see the mgr daemons have already been upgraded and it is currently upgrading the monitors.


The last few instances where "Upgrade" is printed should hopefully give some valuable insight into what went wrong. If the upgrade succeeded, the last instance should be a line like:

2021-03-12T17:58:38.276424+0000 mgr.vm-02.mtclno [INF] Upgrade: Complete!

Otherwise, the last instance should be the last thing upgrade attempted, which is hopefully whatever is causing issues.
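
If you'd rather not scroll through the whole file in an editor, a quick sketch with standard tools gives the same view of the last upgrade steps (debug.txt as produced in step 3 above):

    grep 'Upgrade' debug.txt | tail -n 20   # the 20 most recent upgrade-related log lines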

Let me know if you have any issues with this process or if it helps you find anything useful.

Comment 4 Preethi 2021-03-15 10:26:05 UTC
@Adam, Thanks for the info. Debug logs are helpful to check for the upgrade status.

Comment 5 Adam King 2021-03-17 17:32:50 UTC
The additional upgrade info PR was merged upstream as well: https://github.com/ceph/ceph/pull/39880. Hopefully it will be backported soon.

Comment 6 Ken Dreyer (Red Hat) 2021-03-19 17:40:40 UTC
Sage has backported the relevant changes to pacific in https://github.com/ceph/ceph/pull/40202 . When that is in pacific upstream, we'll have it in the next downstream rebase.

Comment 7 Adam King 2021-03-23 16:23:52 UTC
I think a new downstream image was built yesterday that should have the additional info in the ceph orch upgrade status command. Some extra info from that, or even better the debug logs, will be needed to figure out what is causing the issue here. See if you can verify this using the new build and the additional information. Also, after looking at this again, I see that only 1/6 node-exporters were running while the upgrade was going on. Maybe the issue was related to that? I'd check the status of the node-exporter daemons before and during the upgrade (a quick check is sketched below). There can be situations where node-exporter is repeatedly reconfigured and this fails, which would really slow the upgrade down. You can see some information about situations like that in the "Another Note" section of this comment: https://bugzilla.redhat.com/show_bug.cgi?id=1935044#c3
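
A minimal sketch of that check, just filtering the standard daemon listing (no special flags assumed):

    ceph orch ps | grep node-exporter   # one line per node-exporter daemon with its current status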

Comment 8 Ken Dreyer (Red Hat) 2021-03-23 22:03:03 UTC
I rebased ceph-5.0-rhel-patches again today for some other bugs, and that build includes all the changes in https://github.com/ceph/ceph/pull/40202. Please follow Adam's Comment #7 with today's build, ceph-16.1.0-1084.el8cp.

Comment 9 Pawan 2021-03-26 15:44:32 UTC
Tried the above commands with the specified build.

The upgrade status lists the daemons upgraded, but the upgrade has been in progress for a long time:

# ceph orch upgrade status
{
    "target_image": "registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:5495fe25a0500573ce1787a6bce219169ea4c098fefe68c429cc5efb065e7246",
    "in_progress": true,
    "services_complete": [
        "mgr",
        "mds",
        "osd",
        "mon"
    ],
    "progress": "31/31 ceph daemons upgraded",
    "message": "Doing first pull of registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-22961-20210324054847 image"
}

Ceph status enters the error state:

# ceph -s
  cluster:
    id:     1878d0d4-8e3f-11eb-9fd8-fa163ece0c4e
    health: HEALTH_ERR
            Module 'cephadm' has failed: too many values to unpack (expected 3)
            Degraded data redundancy: 8625/204039 objects degraded (4.227%), 6 pgs degraded, 8 pgs undersized

  services:
    mon: 5 daemons, quorum ceph-pdhiran-cephadm-1616767756496-node1-mon-installer-node-exp,ceph-pdhiran-cephadm-1616767756496-node6-mon-mgr-mds-node-expor,ceph-pdhiran-cephadm-1616767756496-node2-mon-mgr-mds-node-expor,ceph-pdhiran-cephadm-1616767756496-node11-mon-node-exporter-cra,ceph-pdhiran-cephadm-1616767756496-node7-mon-node-exporter-cras (age 61m)
    mgr: ceph-pdhiran-cephadm-1616767756496-node1-mon-installer-node-exp.eesken(active, since 57m), standbys: ceph-pdhiran-cephadm-1616767756496-node2-mon-mgr-mds-node-expor.aohezh, ceph-pdhiran-cephadm-1616767756496-node6-mon-mgr-mds-node-expor.madiyk
    mds: 1/1 daemons up, 1 standby
    osd: 21 osds: 21 up (since 58m), 21 in (since 59m); 8 remapped pgs

  data:
    volumes: 1/1 healthy
    pools:   6 pools, 257 pgs
    objects: 68.01k objects, 266 MiB
    usage:   2.8 GiB used, 286 GiB / 289 GiB avail
    pgs:     8625/204039 objects degraded (4.227%)
             3936/204039 objects misplaced (1.929%)
             249 active+clean
             6   active+recovery_wait+undersized+degraded+remapped
             2   active+recovering+undersized+remapped

  io:
    recovery: 55 KiB/s, 13 objects/s

  progress:
    Global Recovery Event (33m)
      [===========================.] (remaining: 64s)
    Upgrade to registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-22961-20210324054847 (0s)
      [............................]

# ceph health detail
HEALTH_ERR Module 'cephadm' has failed: too many values to unpack (expected 3); Degraded data redundancy: 6527/204162 objects degraded (3.197%), 4 pgs degraded, 6 pgs undersized
[ERR] MGR_MODULE_ERROR: Module 'cephadm' has failed: too many values to unpack (expected 3)
    Module 'cephadm' has failed: too many values to unpack (expected 3)

Ceph orch ls still shows the old image names:
# ceph orch ls
NAME                       RUNNING  REFRESHED  AGE  PLACEMENT                                                                                                                                IMAGE NAME                                                                                                                    IMAGE ID
alertmanager                   1/1  40m ago    71m  count:1                                                                                                                                  registry.redhat.io/openshift4/ose-prometheus-alertmanager:v4.5                                                                12d6d4b9afb2
grafana                        1/1  40m ago    71m  count:1                                                                                                                                  registry.redhat.io/rhceph-alpha/rhceph-5-dashboard-rhel8:latest                                                               ea002a20207d
mds.cephfs                     2/2  40m ago    62m  ceph-pdhiran-cephadm-1616767756496-node6-mon-mgr-mds-node-expor;ceph-pdhiran-cephadm-1616767756496-node2-mon-mgr-mds-node-expor;count:2  registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:5495fe25a0500573ce1787a6bce219169ea4c098fefe68c429cc5efb065e7246  c1a8b8b28c91
mgr                            3/2  40m ago    68m  label:mgr                                                                                                                                mix                                                                                                                           c1a8b8b28c91
mon                            5/5  40m ago    65m  label:mon                                                                                                                                mix                                                                                                                           c1a8b8b28c91
node-exporter                10/10  40m ago    71m  *                                                                                                                                        registry.redhat.io/openshift4/ose-prometheus-node-exporter:v4.5                                                               934722bb0c30
osd.all-available-devices    21/21  40m ago    63m  *                                                                                                                                        registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:5495fe25a0500573ce1787a6bce219169ea4c098fefe68c429cc5efb065e7246  c1a8b8b28c91
prometheus                     1/1  40m ago    71m  count:1                                                                                                                                  registry.redhat.io/openshift4/ose-prometheus:v4.6                                                                             476b3dbd7bc2

# cephadm version
Using recent ceph image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:5495fe25a0500573ce1787a6bce219169ea4c098fefe68c429cc5efb065e7246
ceph version 16.1.0-1084.el8cp (899d93a5c7913d6952438f4b48d29d1cef2aaa2a) pacific (rc)

]# ceph orch upgrade start --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-22961-20210324054847
Initiating upgrade to registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-22961-20210324054847

# ceph versions
{
    "mon": {
        "ceph version 16.1.0-1084.el8cp (899d93a5c7913d6952438f4b48d29d1cef2aaa2a) pacific (rc)": 5
    },
    "mgr": {
        "ceph version 16.1.0-1084.el8cp (899d93a5c7913d6952438f4b48d29d1cef2aaa2a) pacific (rc)": 3
    },
    "osd": {
        "ceph version 16.1.0-1084.el8cp (899d93a5c7913d6952438f4b48d29d1cef2aaa2a) pacific (rc)": 21
    },
    "mds": {
        "ceph version 16.1.0-1084.el8cp (899d93a5c7913d6952438f4b48d29d1cef2aaa2a) pacific (rc)": 2
    },
    "overall": {
        "ceph version 16.1.0-1084.el8cp (899d93a5c7913d6952438f4b48d29d1cef2aaa2a) pacific (rc)": 31
    }
}

Comment 15 Preethi 2021-03-29 18:01:05 UTC
@Adam, I have used the latest build specified. However, I still see the upgrade in progress, and no changes can be seen in the "ceph orch upgrade status" output.

Host details below:
magna057
root/q

Comment 16 Preethi 2021-03-29 18:02:41 UTC
(In reply to Preethi from comment #15)
> @Adam, I have used the latest build specified. However, I still see the
> upgrade in progress and no changes could see in the "ceph orch upgrade
> status"
> 
> Below host details
> magna057
> root/q

[ceph: root@magna057 /]# ceph orch upgrade status
{
    "target_image": "registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-22961-20210324054847",
    "in_progress": true,
    "services_complete": [],
    "message": ""
}
[ceph: root@magna057 /]#

Comment 17 Preethi 2021-03-29 18:21:33 UTC
@Adam,

We still see "upgrade in progress" due to the error below in the health status. Hence, we cannot verify until that issue is fixed.

ceph -s
    health: HEALTH_ERR
            Module 'cephadm' has failed: too many values to unpack (expected 3)

(The cephadm module has failed because of "too many values to unpack", as seen in the 'ceph -s' output. Some of our downstream images have the version formatted differently, and that is causing the above health error.)
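
Once a build with the fix is available, a minimal recovery sketch (both commands already appear elsewhere in this bug; failing the active mgr makes a standby take over, which reloads the cephadm module):

    ceph health detail   # confirm which mgr module failed and why
    ceph mgr fail        # fail over to a standby mgr so the module is reloaded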

Comment 19 Preethi 2021-03-31 06:40:49 UTC
@Adam, I created a separate BZ to track this module failure issue. Yes, once the fix is available downstream, we will follow the above-mentioned guideline.

https://bugzilla.redhat.com/show_bug.cgi?id=1944978

Comment 20 Adam King 2021-04-05 16:06:12 UTC
The fix for the "too many values to unpack" upgrade issue was merged and backported, which may also fix this BZ once it is downstream. Refer to this comment: https://bugzilla.redhat.com/show_bug.cgi?id=1944978#c6

Comment 21 Preethi 2021-04-26 11:05:18 UTC
@Adam,
We do not see the issue for build-to-build upgrades in the following cases:

a) Internal registry to internal registry - The fix is implemented in Ceph v16.2.0-6 and above, so we cannot directly apply the upgrade command from a lower Ceph version (i.e. below 16.2.0-6) to the latest Ceph version; for those versions we used the workaround for the upgrade. If the Ceph version is 16.2.0-6 or above, you can directly perform build-to-build upgrades using "ceph orch upgrade start --image <image name>" - this also worked fine when verified.

b) Default path to internal registry - Verified and working fine with the workaround; as the default path has an older beta build, we need to apply the workaround to test this.

Below is a snippet of the output with the workaround:



[ceph: root@magna007 /]# ceph orch ps | grep mgr
mgr.magna007.wpgvme         magna007  running (6h)   11s ago    3w   *:9283         16.1.0-1325-geb5d7a86  0a963d7074de  585d98f2cc16  
mgr.magna010.syndxo         magna010  running (29s)  13s ago    3w   *:8443 *:9283  16.2.0-13.el8cp        89a188512eee  53aae976ffc6  
[ceph: root@magna007 /]# ceph orch daemon redeploy mgr.magna007.wpgvme --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-99506-20210424023822
Scheduled to redeploy mgr.magna007.wpgvme on host 'magna007'
[ceph: root@magna007 /]# ceph mgr fail
[ceph: root@magna007 /]# ceph -s
  cluster:
    id:     802d6a00-9277-11eb-aa4f-002590fc2538
    health: HEALTH_OK
 
  services:
    mon:        3 daemons, quorum magna007,magna010,magna104 (age 6h)
    mgr:        magna007.wpgvme(active, since 7s), standbys: magna010.syndxo
    osd:        15 osds: 15 up (since 6h), 15 in (since 2w)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        5 daemons active (3 hosts, 1 zones)
 
  data:
    pools:   7 pools, 169 pgs
    objects: 372 objects, 24 KiB
    usage:   9.5 GiB used, 14 TiB / 14 TiB avail
    pgs:     169 active+clean

 
[ceph: root@magna007 /]# ceph orch ps | grep mgr
mgr.magna007.wpgvme         magna007  running (19s)  0s ago     3w   *:8443 *:9283  16.2.0-13.el8cp        89a188512eee  a287245f4789  
mgr.magna010.syndxo         magna010  running (6m)   3s ago     3w   *:8443 *:9283  16.2.0-13.el8cp        89a188512eee  53aae976ffc6  
[ceph: root@magna007 /]# ceph orch 

[ceph: root@magna007 /]# ceph orch upgrade start --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-99506-20210424023822
Initiating upgrade to registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-99506-20210424023822
[ceph: root@magna007 /]# ceph orch upgrade status
{
    "target_image": "registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:55206326df77ef04991a3d4a59621f9dfcff5a8e68c151febc3d5e0e1cfd79e8",
    "in_progress": true,
    "services_complete": [
        "mgr"
    ],
    "progress": "2/41 ceph daemons upgraded",
    "message": ""
}
[ceph: root@magna007 /]#

Comment 24 errata-xmlrpc 2021-08-30 08:27:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 5.0 bug fix and enhancement), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3294

