Bug 1917557 - [cephadm] 5.0 - Ceph orch upgrade start --image option with internal registry is not upgrading the cluster, cluster enters HEALTH_ERR state.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Cephadm
Version: 5.0
Hardware: x86_64
OS: All
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 5.0
Assignee: Adam King
QA Contact: Vasishta
Docs Contact: Karen Norteman
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-01-18 18:08 UTC by Preethi
Modified: 2021-08-30 08:28 UTC
CC: 6 users

Fixed In Version: ceph-16.2.0-13.el8cp
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-30 08:27:52 UTC
Embargoed:




Links:
Red Hat Issue Tracker RHCEPH-1046 - last updated 2021-08-27 05:14:47 UTC
Red Hat Product Errata RHBA-2021:3294 - last updated 2021-08-30 08:28:07 UTC

Description Preethi 2021-01-18 18:08:18 UTC
Description of problem: [cephadm] 5.0 - A Ceph compose update from the internal registry using the ceph orch upgrade start --image option does not update all services to the latest build; only a partial upgrade is observed, and not all daemons end up on the latest image.


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Install 5.0 cluster
2. Perform build update to the latest using the below command for internal registry
ceph orch upgrade start --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-97619-20210115184956

3. Observe the behaviour.
4. The build number shows the latest version, but ceph orch ls still points to the older image, so not all services listed by ceph orch ls are updated to the latest build (a verification sketch follows below).
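
As a quick cross-check after starting the upgrade, the image each daemon is actually running can be listed from the orchestrator. This is a minimal sketch; the JSON field names (daemon_type, daemon_id, container_image_name) are assumed from cephadm's daemon listing, and jq is assumed to be available on the admin node.

# List every daemon with the image it is currently running; after a
# completed upgrade every row should show the target image.
ceph orch ps --format json | jq -r '.[] | "\(.daemon_type).\(.daemon_id)\t\(.container_image_name)"' | sort

# Running Ceph versions per daemon type, for a second opinion.
ceph versions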



Actual results: see the output below for reference:

[ceph: root@magna094 /]# ceph orch upgrade start --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-97619-20210115184956
Initiating upgrade to registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-97619-20210115184956
[ceph: root@magna094 /]# ceph orch upgrade status
{
    "target_image": "registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-97619-20210115184956",
    "in_progress": true,
    "services_complete": [],
    "message": ""
}



[ceph: root@magna094 /]# ceph orch ls
NAME                               RUNNING  REFRESHED  AGE  PLACEMENT                   IMAGE NAME                                                                                                                    IMAGE ID      
alertmanager                           1/2  5m ago     3w   magna094;magna067           docker.io/prom/alertmanager:v0.20.0                                                                                           0881eb8f169f  
crash                                  9/9  5m ago     3M   *                           registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-97619-20210115184956                43a4d2e8dfd3  
grafana                                1/1  5m ago     3M   count:1                     docker.io/ceph/ceph-grafana:6.6.2                                                                                             a0dce381714a  
iscsi.iscsi                            1/1  5m ago     4w   magna094;count:1            registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:4b985089d14513ccab29c42e1531bfcb2e98a614c497726153800d72a2ac11f0  dd0a3c51082c  
mds.test                               3/3  5m ago     4w   count:3                     registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-93648-20201117204824                dd0a3c51082c  
mgr                                    2/2  5m ago     3w   magna067;magna094           registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-97619-20210115184956                43a4d2e8dfd3  
mon                                    3/3  5m ago     3w   magna067;magna093;magna094  registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-97619-20210115184956                43a4d2e8dfd3  
nfs.foo                                1/1  5m ago     4w   count:1                     registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-93648-20201117204824                dd0a3c51082c  
node-exporter                          9/9  5m ago     3M   *                           docker.io/prom/node-exporter:v0.18.1                                                                                          e5a616e4b9cf  
osd.None                               7/0  5m ago     -    <unmanaged>                 mix                                                                                                                           dd0a3c51082c  
osd.all-available-devices            16/20  5m ago     7w   *                           registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-93648-20201117204824                dd0a3c51082c  
osd.dashboard-admin-1605876982239      4/4  5m ago     8w   *                           mix                                                                                                                           mix           
prometheus                             1/1  5m ago     3w   magna094                    docker.io/prom/prometheus:v2.18.1                                                                                             de242295e225  
rgw.myorg.us-east-1                    2/2  5m ago     11w  magna092;magna093;count:2   registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-93648-20201117204824                dd0a3c51082c  
rgw.test_realm.test_zone               0/2  -          -    count:2                     <unknown>                                                                                                                     <unknown>     
[ceph: root@magna094 /]# 



[root@magna094 yum.repos.d]# sudo cephadm version
Using recent ceph image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:368dca105ba5fc2ab3f41eeeb6ee51351d3db66e0f984fecdabc6d908fc78d06
ceph version 16.0.0-8633.el8cp (a0d3d2e82a786c006507b0df445433f7725e477d) pacific (dev)
[root@magna094 yum.repos.d]# rpm -qa |grep cephadm 
cephadm-16.0.0-8633.el8cp.noarch
[root@magna094 yum.repos.d]# 

Expected results:


Additional info: magna094
root/q

Comment 1 Preethi 2021-01-18 18:13:20 UTC
Upgrade is in progress, but the current status shows the error below. Also, there is no message or display indicating how long the upgrade will take to complete.

[ceph: root@magna094 /]# ceph status
  cluster:
    id:     c97c2c8c-0942-11eb-ae18-002590fbecb6
    health: HEALTH_WARN
            Upgrading daemon osd.27 on host magna067 failed.
 
  services:
    mon: 3 daemons, quorum magna067,magna093,magna094 (age 4h)
    mgr: magna094.fnswbj(active, since 4h), standbys: magna067.nnxabw
    mds: test:1 {0=test.magna076.xymdrn=up:active} 2 up:standby
    osd: 27 osds: 27 up (since 4h), 27 in (since 8w)
    rgw: 2 daemons active (myorg.us-east-1.magna092.bxiihn, myorg.us-east-1.magna093.nhekwk)
 
  data:
    pools:   21 pools, 617 pgs
    objects: 456 objects, 429 KiB
    usage:   15 GiB used, 25 TiB / 25 TiB avail
    pgs:     617 active+clean
 
  io:
    client:   937 B/s rd, 0 op/s rd, 0 op/s wr
 
  progress:
 
[ceph: root@magna094 /]# ceph orch upgrade status
{
    "target_image": "registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-97619-20210115184956",
    "in_progress": true,
    "services_complete": [],
    "message": "Error: UPGRADE_REDEPLOY_DAEMON: Upgrading daemon osd.27 on host magna067 failed."
}
[ceph: root@magna094 /]# ceph orch ls
NAME                               RUNNING  REFRESHED  AGE  PLACEMENT                   IMAGE NAME                                                                                                                    IMAGE ID      
alertmanager                           1/2  9m ago     3w   magna094;magna067           docker.io/prom/alertmanager:v0.20.0                                                                                           0881eb8f169f  
crash                                  9/9  10m ago    3M   *                           registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-97619-20210115184956                43a4d2e8dfd3  
grafana                                1/1  9m ago     3M   count:1                     docker.io/ceph/ceph-grafana:6.6.2                                                                                             a0dce381714a  
iscsi.iscsi                            1/1  9m ago     4w   magna094;count:1            registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:4b985089d14513ccab29c42e1531bfcb2e98a614c497726153800d72a2ac11f0  dd0a3c51082c  
mds.test                               3/3  9m ago     4w   count:3                     registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-93648-20201117204824                dd0a3c51082c  
mgr                                    2/2  9m ago     3w   magna067;magna094           registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-97619-20210115184956                43a4d2e8dfd3  
mon                                    3/3  9m ago     3w   magna067;magna093;magna094  registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-97619-20210115184956                43a4d2e8dfd3  
nfs.foo                                1/1  9m ago     4w   count:1                     registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-93648-20201117204824                dd0a3c51082c  
node-exporter                          9/9  10m ago    3M   *                           docker.io/prom/node-exporter:v0.18.1                                                                                          e5a616e4b9cf  
osd.None                               7/0  10m ago    -    <unmanaged>                 mix                                                                                                                           dd0a3c51082c  
osd.all-available-devices            16/20  10m ago    7w   *                           registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-93648-20201117204824                dd0a3c51082c  
osd.dashboard-admin-1605876982239      4/4  10m ago    8w   *                           mix                                                                                                                           mix           
prometheus                             1/1  9m ago     3w   magna094                    docker.io/prom/prometheus:v2.18.1                                                                                             de242295e225  
rgw.myorg.us-east-1                    2/2  9m ago     11w  magna092;magna093;count:2   registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-93648-20201117204824                dd0a3c51082c  
rgw.test_realm.test_zone               0/2  -          -    count:2                     <unknown>                                                                                                                     <unknown>     
[ceph: root@magna094 /]#
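
When a single daemon fails to redeploy during an upgrade (as with osd.27 above), a common follow-up is to inspect the cephadm log, retry just that daemon, and then let the upgrade continue. This is a general-purpose sketch, not the confirmed resolution of this BZ, and <target-image> is a placeholder:

# Show recent cephadm/orchestrator events around the failed redeploy.
ceph log last cephadm

# Retry only the failed daemon against the target image.
ceph orch daemon redeploy osd.27 --image <target-image>

# The upgrade can also be paused and resumed around the retry.
ceph orch upgrade pause
ceph orch upgrade resume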

Comment 3 Preethi 2021-01-25 06:52:58 UTC
@ken, We had an issue earlier where a few dashboard-related services were pointing to docker.io; we no longer see that issue on later builds. I wanted to move a cluster running an older build to the latest with the ceph orch upgrade command, and now we see this behaviour.

Comment 4 Ken Dreyer (Red Hat) 2021-01-25 20:01:41 UTC
(Ok, sounds like that is just a problem with older builds, then. Thanks.)

Comment 5 Juan Miguel Olmo 2021-02-16 12:57:00 UTC
Preethi: can you confirm whether this issue is happening with the latest compose?

Comment 7 Preethi 2021-03-10 06:33:50 UTC
@Juan, the issue is still seen with the latest compose, i.e. build-to-build upgrades from an internal registry image to the latest build in the internal registry.

Comment 8 Preethi 2021-03-10 06:47:51 UTC
(In reply to Preethi from comment #7)
> @Juan, Issue is still seen with latest compose i.e Build to build upgrades
> from internal registry path to latest builds in internal registry.


It has been 2 days; the upgrade is still showing as in progress, but none of the services are getting upgraded.


Inferring fsid d8a1d97c-7cbb-11eb-82af-002590fc26f6
Inferring config /var/lib/ceph/d8a1d97c-7cbb-11eb-82af-002590fc26f6/mon.magna011/config
Using recent ceph image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:b2ca10515af7e243732ac10b43f68a0d218d9a34421ec3b807bdc33d58c5c00f
WARNING: The same type, major and minor should not be used for multiple devices.
WARNING: The same type, major and minor should not be used for multiple devices.
WARNING: The same type, major and minor should not be used for multiple devices.
[ceph: root@magna011 /]# 
[ceph: root@magna011 /]# 
[ceph: root@magna011 /]# ceph orch ls
NAME                       RUNNING  REFRESHED  AGE  PLACEMENT                           IMAGE NAME                                                                                                                    IMAGE ID      
alertmanager                   0/1  -          -    count:1                             <unknown>                                                                                                                     <unknown>     
crash                          4/4  16h ago    5d   *                                   registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:b2ca10515af7e243732ac10b43f68a0d218d9a34421ec3b807bdc33d58c5c00f  38e52bf51cef  
grafana                        0/1  -          -    count:1                             <unknown>                                                                                                                     <unknown>     
ha-rgw.haproxy_for_rgw         0/2  -          -    magna013                            <unknown>                                                                                                                     <unknown>     
mgr                            3/3  16h ago    5d   magna011;magna013;magna014;count:3  mix                                                                                                                           38e52bf51cef  
mon                            3/3  16h ago    5d   magna011;magna013;magna014;count:3  mix                                                                                                                           38e52bf51cef  
node-exporter                  4/4  16h ago    5d   *                                   registry.redhat.io/openshift4/ose-prometheus-node-exporter:v4.5                                                               f0a5cfd22f16  
osd.all-available-devices    12/12  16h ago    5d   *                                   registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:b2ca10515af7e243732ac10b43f68a0d218d9a34421ec3b807bdc33d58c5c00f  38e52bf51cef  
prometheus                     0/1  -          -    count:1                             <unknown>                                                                                                                     <unknown>     
rgw.haproxy_realm              3/3  16h ago    22h  magna013;magna014;magna016          registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:b2ca10515af7e243732ac10b43f68a0d218d9a34421ec3b807bdc33d58c5c00f  38e52bf51cef  
rgw.realm1.rg1-zo1             1/1  16h ago    10h  magna014;count:1                    registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:b2ca10515af7e243732ac10b43f68a0d218d9a34421ec3b807bdc33d58c5c00f  38e52bf51cef  
rgw.realm2.rg2-zo2             1/1  16h ago    1h   magna016;count:1                    registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:b2ca10515af7e243732ac10b43f68a0d218d9a34421ec3b807bdc33d58c5c00f  38e52bf51cef  
[ceph: root@magna011 /]# ceph orch upgrade status
{
    "target_image": "registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:4fdcef25152b368bc779eb7ffa6ecaf5d64f38bf06303cd611416f9e333d3720",
    "in_progress": true,
    "services_complete": [],
    "message": ""
}


Command used:
ceph orch upgrade start --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-3.3-rhel-7-containers-candidate-35844-20210209220929 

Ceph health went from OK to an error state.
[ceph: root@magna011 /]# ceph status
  cluster:
    id:     d8a1d97c-7cbb-11eb-82af-002590fc26f6
    health: HEALTH_ERR
            Module 'cephadm' has failed: too many values to unpack (expected 3)
 
  services:
    mon: 3 daemons, quorum magna011,magna014,magna013 (age 5d)
    mgr: magna011.vpdjxa(active, since 5d), standbys: magna014.pmkeku, magna013.evxipz
    osd: 12 osds: 12 up (since 5d), 12 in (since 5d)
    rgw: 5 daemons active (haproxy_realm.magna013.avbphi, haproxy_realm.magna014.iciudt, haproxy_realm.magna016.yvvugh, realm1.rg1-zo1.magna014.gdpgbb, realm2.rg2-zo2.magna016.jvreil)
 
  data:
    pools:   11 pools, 800 pgs
    objects: 1.05k objects, 77 KiB
    usage:   6.4 GiB used, 11 TiB / 11 TiB avail
    pgs:     800 active+clean
 
  progress:
 
[ceph: root@magna011 /]#
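
When the cephadm mgr module itself fails mid-upgrade, as in the HEALTH_ERR above, a general way to get the orchestrator responding again so the state can be inspected is sketched below. These are standard recovery commands, not a fix for the underlying parsing bug:

# Abort the stuck upgrade so no further daemons are touched.
ceph orch upgrade stop

# Fail over to a standby mgr to restart the cephadm module.
ceph mgr fail

# Check whether the module error clears and what health issues remain.
ceph health detail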

Comment 10 Adam King 2021-03-24 12:40:38 UTC
During the next upgrade to a new image there should be new information appearing in 'ceph orch upgrade status', including how many daemons have been upgraded and some additional info messages. Unfortunately, the new information won't start showing up until the currently active mgr has been upgraded (so the mgr can run the code with these changes). That said, if you could check what 'ceph orch upgrade status' prints after the mgr upgrades, and also follow the instructions in https://bugzilla.redhat.com/show_bug.cgi?id=1917949#c3 to get debug logs during the upgrade, it would be very helpful. I haven't been able to reproduce this issue on any test cluster I've upgraded, so that information is going to be necessary to figure out the problem here.
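
For reference, cephadm debug logging during an upgrade is usually captured along these lines (a sketch based on the standard cephadm logging options; the exact steps requested are in the linked comment):

# Raise the cephadm module's cluster log level to debug.
ceph config set mgr mgr/cephadm/log_to_cluster_level debug

# Stream cephadm debug messages while the upgrade runs...
ceph -W cephadm --watch-debug

# ...or dump the most recent cephadm log entries afterwards.
ceph log last cephadm

# Restore the default level once the logs are collected.
ceph config set mgr mgr/cephadm/log_to_cluster_level info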

Comment 12 Preethi 2021-03-29 18:19:18 UTC
@Adam,

We still see "upgrade in progress" because of the error below in the health status. Hence, we cannot verify this until that issue is fixed.

[ceph: root@magna057 /]# ceph status
  cluster:
    id:     0e91a2fa-88c0-11eb-9d44-002590fbd52a
    health: HEALTH_ERR
            failed to probe daemons or devices
            1 stray host(s) with 3 daemon(s) not managed by cephadm
            Module 'cephadm' has failed: 'magna110'

ceph -s
    health: HEALTH_ERR
            Module 'cephadm' has failed: too many values to unpack (expected 3)

(The module is disabled because of the 'too many values to unpack' error shown in the 'ceph -s' output. Some of our downstream images have the version formatted differently, and that is causing the health error above.)

Comment 14 Preethi 2021-03-31 06:42:07 UTC
Adam, we want to track this issue separately, so we logged the BZ below.
https://bugzilla.redhat.com/show_bug.cgi?id=1944978

Comment 15 Adam King 2021-04-05 16:04:51 UTC
The fix for the 'too many values to unpack' upgrade issue was merged and backported, which may also fix this BZ once it lands downstream. Refer to this comment: https://bugzilla.redhat.com/show_bug.cgi?id=1944978#c6
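
A quick way to confirm whether a given downstream image already carries the backport before re-running the upgrade (a hedged sketch; assumes podman is available on the host, the tag is reachable, and <candidate-tag> is a placeholder):

# Pull the candidate image and read the Ceph version baked into it.
podman pull registry-proxy.engineering.redhat.com/rh-osbs/rhceph:<candidate-tag>
podman run --rm --entrypoint ceph registry-proxy.engineering.redhat.com/rh-osbs/rhceph:<candidate-tag> --version

# After upgrading, every daemon type should report the fixed build.
ceph versions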

Comment 16 Preethi 2021-04-26 10:14:14 UTC
@Adam @Juan, we no longer see the issue for build-to-build upgrades in the following cases:

a) Internal registry to internal registry - The fix is present in Ceph 16.2.0-6 and above, so the upgrade command cannot be applied directly from a Ceph version below 16.2.0-6 to the latest version; for those clusters we used the workaround for the upgrade. If the Ceph version is 16.2.0-6 or above, build-to-build upgrades can be performed directly with "ceph orch upgrade start --image <image name>" - this also worked fine when verified.

b) Default path to internal registry - Verified and working fine with the workaround; the default path has an older beta build, so the workaround is needed to test this.

Below is a snippet of the output with the workaround:



[ceph: root@magna007 /]# ceph orch ps | grep mgr
mgr.magna007.wpgvme         magna007  running (6h)   11s ago    3w   *:9283         16.1.0-1325-geb5d7a86  0a963d7074de  585d98f2cc16  
mgr.magna010.syndxo         magna010  running (29s)  13s ago    3w   *:8443 *:9283  16.2.0-13.el8cp        89a188512eee  53aae976ffc6  
[ceph: root@magna007 /]# ceph orch daemon redeploy mgr.magna007.wpgvme --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-99506-20210424023822
Scheduled to redeploy mgr.magna007.wpgvme on host 'magna007'
[ceph: root@magna007 /]# ceph mgr fail
[ceph: root@magna007 /]# ceph -s
  cluster:
    id:     802d6a00-9277-11eb-aa4f-002590fc2538
    health: HEALTH_OK
 
  services:
    mon:        3 daemons, quorum magna007,magna010,magna104 (age 6h)
    mgr:        magna007.wpgvme(active, since 7s), standbys: magna010.syndxo
    osd:        15 osds: 15 up (since 6h), 15 in (since 2w)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        5 daemons active (3 hosts, 1 zones)
 
  data:
    pools:   7 pools, 169 pgs
    objects: 372 objects, 24 KiB
    usage:   9.5 GiB used, 14 TiB / 14 TiB avail
    pgs:     169 active+clean

 
[ceph: root@magna007 /]# ceph orch ps | grep mgr
mgr.magna007.wpgvme         magna007  running (19s)  0s ago     3w   *:8443 *:9283  16.2.0-13.el8cp        89a188512eee  a287245f4789  
mgr.magna010.syndxo         magna010  running (6m)   3s ago     3w   *:8443 *:9283  16.2.0-13.el8cp        89a188512eee  53aae976ffc6  
[ceph: root@magna007 /]# ceph orch 

[ceph: root@magna007 /]# ceph orch upgrade start --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-99506-20210424023822
Initiating upgrade to registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-99506-20210424023822
[ceph: root@magna007 /]# ceph orch upgrade status
{
    "target_image": "registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:55206326df77ef04991a3d4a59621f9dfcff5a8e68c151febc3d5e0e1cfd79e8",
    "in_progress": true,
    "services_complete": [
        "mgr"
    ],
    "progress": "2/41 ceph daemons upgraded",
    "message": ""
}
[ceph: root@magna007 /]#
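
In short, the workaround shown above amounts to moving the mgr daemons onto the new image by hand, failing over so a mgr with the new code is active, and only then starting the regular upgrade (sketched with placeholder names for the mgr daemon and image):

# 1. Redeploy each mgr daemon directly onto the new image.
ceph orch daemon redeploy mgr.<daemon-name> --image <new-image>

# 2. Fail over so a mgr running the new code becomes active.
ceph mgr fail

# 3. Start the normal upgrade; the upgraded mgr drives the remaining daemons.
ceph orch upgrade start --image <new-image>

# 4. Track progress.
ceph orch upgrade status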

Comment 20 errata-xmlrpc 2021-08-30 08:27:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 5.0 bug fix and enhancement), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3294

