Description of problem:
[cephadm] 5.0 - Ceph compose update for an internal registry using the "ceph orch upgrade start --image" option is not updating all services to the latest build, i.e. a partial upgrade is observed: not all daemons are updated with the latest builds.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Install a 5.0 cluster.
2. Perform a build update to the latest using the below command for the internal registry:
   ceph orch upgrade start --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-97619-20210115184956
3. Observe the behaviour.
4. The build number shows the latest, but "ceph orch ls" still points to the older image and details; hence not all services under "ceph orch ls" are updated to the latest.

Actual results:
Below output for reference:

[ceph: root@magna094 /]# ceph orch upgrade start --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-97619-20210115184956
Initiating upgrade to registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-97619-20210115184956
[ceph: root@magna094 /]# ceph orch upgrade status
{
    "target_image": "registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-97619-20210115184956",
    "in_progress": true,
    "services_complete": [],
    "message": ""
}
[ceph: root@magna094 /]# ceph orch ls
NAME                               RUNNING  REFRESHED  AGE  PLACEMENT                   IMAGE NAME                                                                                                                    IMAGE ID
alertmanager                       1/2      5m ago     3w   magna094;magna067           docker.io/prom/alertmanager:v0.20.0                                                                                           0881eb8f169f
crash                              9/9      5m ago     3M   *                           registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-97619-20210115184956               43a4d2e8dfd3
grafana                            1/1      5m ago     3M   count:1                     docker.io/ceph/ceph-grafana:6.6.2                                                                                             a0dce381714a
iscsi.iscsi                        1/1      5m ago     4w   magna094;count:1            registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:4b985089d14513ccab29c42e1531bfcb2e98a614c497726153800d72a2ac11f0  dd0a3c51082c
mds.test                           3/3      5m ago     4w   count:3                     registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-93648-20201117204824               dd0a3c51082c
mgr                                2/2      5m ago     3w   magna067;magna094           registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-97619-20210115184956               43a4d2e8dfd3
mon                                3/3      5m ago     3w   magna067;magna093;magna094  registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-97619-20210115184956               43a4d2e8dfd3
nfs.foo                            1/1      5m ago     4w   count:1                     registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-93648-20201117204824               dd0a3c51082c
node-exporter                      9/9      5m ago     3M   *                           docker.io/prom/node-exporter:v0.18.1                                                                                          e5a616e4b9cf
osd.None                           7/0      5m ago     -    <unmanaged>                 mix                                                                                                                           dd0a3c51082c
osd.all-available-devices          16/20    5m ago     7w   *                           registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-93648-20201117204824               dd0a3c51082c
osd.dashboard-admin-1605876982239  4/4      5m ago     8w   *                           mix                                                                                                                           mix
prometheus                         1/1      5m ago     3w   magna094                    docker.io/prom/prometheus:v2.18.1                                                                                             de242295e225
rgw.myorg.us-east-1                2/2      5m ago     11w  magna092;magna093;count:2   registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-93648-20201117204824               dd0a3c51082c
rgw.test_realm.test_zone           0/2      -          -    count:2                     <unknown>                                                                                                                     <unknown>
[ceph: root@magna094 /]#

[root@magna094 yum.repos.d]# sudo cephadm version
Using recent ceph image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:368dca105ba5fc2ab3f41eeeb6ee51351d3db66e0f984fecdabc6d908fc78d06
ceph version 16.0.0-8633.el8cp (a0d3d2e82a786c006507b0df445433f7725e477d) pacific (dev)
[root@magna094 yum.repos.d]# rpm -qa |grep cephadm
cephadm-16.0.0-8633.el8cp.noarch
[root@magna094 yum.repos.d]#

Expected results:

Additional info:
magna094 root/q
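For reference, two standard commands that can cross-check what each daemon is actually running after an upgrade (generic Ceph/cephadm commands, not output from this cluster):

# Versions reported by each daemon type (mon, mgr, osd, ...)
ceph versions

# Per-daemon view, including the container image ID each daemon runs
ceph orch ps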
The upgrade is in progress, but the error below is shown as the current status. Also, there is no message indicating how long the upgrade will take to complete.

[ceph: root@magna094 /]# ceph status
  cluster:
    id:     c97c2c8c-0942-11eb-ae18-002590fbecb6
    health: HEALTH_WARN
            Upgrading daemon osd.27 on host magna067 failed.

  services:
    mon: 3 daemons, quorum magna067,magna093,magna094 (age 4h)
    mgr: magna094.fnswbj(active, since 4h), standbys: magna067.nnxabw
    mds: test:1 {0=test.magna076.xymdrn=up:active} 2 up:standby
    osd: 27 osds: 27 up (since 4h), 27 in (since 8w)
    rgw: 2 daemons active (myorg.us-east-1.magna092.bxiihn, myorg.us-east-1.magna093.nhekwk)

  data:
    pools:   21 pools, 617 pgs
    objects: 456 objects, 429 KiB
    usage:   15 GiB used, 25 TiB / 25 TiB avail
    pgs:     617 active+clean

  io:
    client:   937 B/s rd, 0 op/s rd, 0 op/s wr

  progress:

[ceph: root@magna094 /]# ceph orch upgrade status
{
    "target_image": "registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-97619-20210115184956",
    "in_progress": true,
    "services_complete": [],
    "message": "Error: UPGRADE_REDEPLOY_DAEMON: Upgrading daemon osd.27 on host magna067 failed."
}
[ceph: root@magna094 /]# ceph orch ls
NAME                               RUNNING  REFRESHED  AGE  PLACEMENT                   IMAGE NAME                                                                                                                    IMAGE ID
alertmanager                       1/2      9m ago     3w   magna094;magna067           docker.io/prom/alertmanager:v0.20.0                                                                                           0881eb8f169f
crash                              9/9      10m ago    3M   *                           registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-97619-20210115184956               43a4d2e8dfd3
grafana                            1/1      9m ago     3M   count:1                     docker.io/ceph/ceph-grafana:6.6.2                                                                                             a0dce381714a
iscsi.iscsi                        1/1      9m ago     4w   magna094;count:1            registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:4b985089d14513ccab29c42e1531bfcb2e98a614c497726153800d72a2ac11f0  dd0a3c51082c
mds.test                           3/3      9m ago     4w   count:3                     registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-93648-20201117204824               dd0a3c51082c
mgr                                2/2      9m ago     3w   magna067;magna094           registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-97619-20210115184956               43a4d2e8dfd3
mon                                3/3      9m ago     3w   magna067;magna093;magna094  registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-97619-20210115184956               43a4d2e8dfd3
nfs.foo                            1/1      9m ago     4w   count:1                     registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-93648-20201117204824               dd0a3c51082c
node-exporter                      9/9      10m ago    3M   *                           docker.io/prom/node-exporter:v0.18.1                                                                                          e5a616e4b9cf
osd.None                           7/0      10m ago    -    <unmanaged>                 mix                                                                                                                           dd0a3c51082c
osd.all-available-devices          16/20    10m ago    7w   *                           registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-93648-20201117204824               dd0a3c51082c
osd.dashboard-admin-1605876982239  4/4      10m ago    8w   *                           mix                                                                                                                           mix
prometheus                         1/1      9m ago     3w   magna094                    docker.io/prom/prometheus:v2.18.1                                                                                             de242295e225
rgw.myorg.us-east-1                2/2      9m ago     11w  magna092;magna093;count:2   registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-93648-20201117204824               dd0a3c51082c
rgw.test_realm.test_zone           0/2      -          -    count:2                     <unknown>                                                                                                                     <unknown>
[ceph: root@magna094 /]#
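For reference, a way to follow the upgrade while it runs (generic cephadm commands, not from the original report; on older builds the progress detail may not be shown):

# Stream cephadm events from the cluster log while the upgrade runs
ceph -W cephadm

# Poll the upgrade state; newer builds include progress information here
ceph orch upgrade status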
@ken, We had an issue before where a few dashboard-related services were pointing to docker.io; we no longer see that issue on later builds. I wanted to move a cluster that has an older build to the latest with the ceph orch upgrade command, and now we see this behaviour.
(Ok, sounds like that is just a problem with older builds, then. Thanks.)
Preethy: can you confirm if this issue is happening with the latest compose?
@Juan, the issue is still seen with the latest compose, i.e. build-to-build upgrades from an internal registry image to the latest build in the internal registry.
(In reply to Preethi from comment #7)
> @Juan, Issue is still seen with latest compose i.e Build to build upgrades
> from internal registry path to latest builds in internal registry.

It's been 2 days; the upgrade still shows as in progress, but none of the services are getting upgraded.

Inferring fsid d8a1d97c-7cbb-11eb-82af-002590fc26f6
Inferring config /var/lib/ceph/d8a1d97c-7cbb-11eb-82af-002590fc26f6/mon.magna011/config
Using recent ceph image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:b2ca10515af7e243732ac10b43f68a0d218d9a34421ec3b807bdc33d58c5c00f
WARNING: The same type, major and minor should not be used for multiple devices.
WARNING: The same type, major and minor should not be used for multiple devices.
WARNING: The same type, major and minor should not be used for multiple devices.
[ceph: root@magna011 /]# ceph orch ls
NAME                       RUNNING  REFRESHED  AGE  PLACEMENT                           IMAGE NAME                                                                                                             IMAGE ID
alertmanager               0/1      -          -    count:1                             <unknown>                                                                                                              <unknown>
crash                      4/4      16h ago    5d   *                                   registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:b2ca10515af7e243732ac10b43f68a0d218d9a34421ec3b807bdc33d58c5c00f  38e52bf51cef
grafana                    0/1      -          -    count:1                             <unknown>                                                                                                              <unknown>
ha-rgw.haproxy_for_rgw     0/2      -          -    magna013                            <unknown>                                                                                                              <unknown>
mgr                        3/3      16h ago    5d   magna011;magna013;magna014;count:3  mix                                                                                                                    38e52bf51cef
mon                        3/3      16h ago    5d   magna011;magna013;magna014;count:3  mix                                                                                                                    38e52bf51cef
node-exporter              4/4      16h ago    5d   *                                   registry.redhat.io/openshift4/ose-prometheus-node-exporter:v4.5                                                       f0a5cfd22f16
osd.all-available-devices  12/12    16h ago    5d   *                                   registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:b2ca10515af7e243732ac10b43f68a0d218d9a34421ec3b807bdc33d58c5c00f  38e52bf51cef
prometheus                 0/1      -          -    count:1                             <unknown>                                                                                                              <unknown>
rgw.haproxy_realm          3/3      16h ago    22h  magna013;magna014;magna016          registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:b2ca10515af7e243732ac10b43f68a0d218d9a34421ec3b807bdc33d58c5c00f  38e52bf51cef
rgw.realm1.rg1-zo1         1/1      16h ago    10h  magna014;count:1                    registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:b2ca10515af7e243732ac10b43f68a0d218d9a34421ec3b807bdc33d58c5c00f  38e52bf51cef
rgw.realm2.rg2-zo2         1/1      16h ago    1h   magna016;count:1                    registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:b2ca10515af7e243732ac10b43f68a0d218d9a34421ec3b807bdc33d58c5c00f  38e52bf51cef
[ceph: root@magna011 /]# ceph orch upgrade status
{
    "target_image": "registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:4fdcef25152b368bc779eb7ffa6ecaf5d64f38bf06303cd611416f9e333d3720",
    "in_progress": true,
    "services_complete": [],
    "message": ""
}

Below command used:
ceph orch upgrade start --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-3.3-rhel-7-containers-candidate-35844-20210209220929

Ceph health went from OK to an error state.

[ceph: root@magna011 /]# ceph status
  cluster:
    id:     d8a1d97c-7cbb-11eb-82af-002590fc26f6
    health: HEALTH_ERR
            Module 'cephadm' has failed: too many values to unpack (expected 3)

  services:
    mon: 3 daemons, quorum magna011,magna014,magna013 (age 5d)
    mgr: magna011.vpdjxa(active, since 5d), standbys: magna014.pmkeku, magna013.evxipz
    osd: 12 osds: 12 up (since 5d), 12 in (since 5d)
    rgw: 5 daemons active (haproxy_realm.magna013.avbphi, haproxy_realm.magna014.iciudt, haproxy_realm.magna016.yvvugh, realm1.rg1-zo1.magna014.gdpgbb, realm2.rg2-zo2.magna016.jvreil)

  data:
    pools:   11 pools, 800 pgs
    objects: 1.05k objects, 77 KiB
    usage:   6.4 GiB used, 11 TiB / 11 TiB avail
    pgs:     800 active+clean

  progress:

[ceph: root@magna011 /]#
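As a general note (a sketch of standard recovery steps, not instructions from this BZ): when an upgrade wedges behind a failed cephadm module, it can be halted and the mgr failed over so the module restarts; whether the underlying error clears depends on the code in the running image:

# Abort the stuck upgrade
ceph orch upgrade stop

# Fail over to a standby mgr so the cephadm module restarts
ceph mgr fail

# Re-check the health detail afterwards
ceph health detail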
During the next upgrade to a new image, new information should appear in 'ceph orch upgrade status', including how many daemons have been upgraded, plus some additional info messages. Unfortunately, the new info won't start showing up until the currently active mgr has been upgraded (so that the mgr is running code with these changes); the additional info will only appear once the mgr has been upgraded. That said, it would be very helpful if you could check what 'ceph orch upgrade status' prints after the mgr upgrades, and also follow the instructions in https://bugzilla.redhat.com/show_bug.cgi?id=1917949#c3 to capture debug logs during the upgrade. I haven't been able to reproduce this issue with any test cluster I've upgraded, so that information is going to be necessary to figure out the problem here.
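For reference, the usual way to raise cephadm logging while capturing debug output during an upgrade looks like this (a sketch of the standard procedure; the linked comment may list additional steps):

# Send debug-level cephadm messages to the cluster log
ceph config set mgr mgr/cephadm/log_to_cluster_level debug

# Follow those messages live during the upgrade
ceph -W cephadm --watch-debug

# Or dump the most recent cephadm log entries afterwards
ceph log last cephadm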
@Adam, We still see "upgrade in progress" due to the below error in the health status. Hence, we cannot verify until the below issue is fixed.

[ceph: root@magna057 /]# ceph status
  cluster:
    id:     0e91a2fa-88c0-11eb-9d44-002590fbd52a
    health: HEALTH_ERR
            failed to probe daemons or devices
            1 stray host(s) with 3 daemon(s) not managed by cephadm
            Module 'cephadm' has failed: 'magna110'

ceph -s
  health: HEALTH_ERR
          Module 'cephadm' has failed: too many values to unpack (expected 3)

(The module is disabled because of "too many values to unpack" in the 'ceph -s' output. Some of our downstream images have the version formatted differently, and that is causing the above health error.)
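As an aside (a generic command, not from this BZ): the stray-host warning can be cross-checked against the host list cephadm manages; a host carrying daemons but missing from this list shows up as stray:

# List the hosts the cephadm module knows about
ceph orch host ls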
Adam, we want to track this issue separately, so we have logged the BZ below. https://bugzilla.redhat.com/show_bug.cgi?id=1944978
The fix for the "too many values to unpack" upgrade issue was merged and backported; it may also fix this BZ once it is downstream. Refer to this comment: https://bugzilla.redhat.com/show_bug.cgi?id=1944978#c6
@Adam @Juan, we do not see the issue for build-to-build upgrades in the following cases:

a) Internal registry to internal registry - The fix is present in Ceph 16.2.0-6 and above. Hence, the upgrade command cannot be applied directly from a lower Ceph version (i.e. below 16.2.0-6) to the latest; we used the workaround below for the upgrade. If the Ceph version is 16.2.0-6 or above, build-to-build upgrades can be performed directly using "ceph orch upgrade start --image <image name>" - this also worked fine when verified.

b) Default path to internal registry - Verified and working fine with the workaround; since the default path carries an older beta build, the workaround is needed to test this.

Below is a snippet of the output with the workaround (see the recap after the output):

[ceph: root@magna007 /]# ceph orch ps | grep mgr
mgr.magna007.wpgvme  magna007  running (6h)   11s ago  3w  *:9283          16.1.0-1325-geb5d7a86  0a963d7074de  585d98f2cc16
mgr.magna010.syndxo  magna010  running (29s)  13s ago  3w  *:8443 *:9283   16.2.0-13.el8cp        89a188512eee  53aae976ffc6
[ceph: root@magna007 /]# ceph orch daemon redeploy mgr.magna007.wpgvme --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-99506-20210424023822
Scheduled to redeploy mgr.magna007.wpgvme on host 'magna007'
[ceph: root@magna007 /]# ceph mgr fail
[ceph: root@magna007 /]# ceph -s
  cluster:
    id:     802d6a00-9277-11eb-aa4f-002590fc2538
    health: HEALTH_OK

  services:
    mon:        3 daemons, quorum magna007,magna010,magna104 (age 6h)
    mgr:        magna007.wpgvme(active, since 7s), standbys: magna010.syndxo
    osd:        15 osds: 15 up (since 6h), 15 in (since 2w)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        5 daemons active (3 hosts, 1 zones)

  data:
    pools:   7 pools, 169 pgs
    objects: 372 objects, 24 KiB
    usage:   9.5 GiB used, 14 TiB / 14 TiB avail
    pgs:     169 active+clean

[ceph: root@magna007 /]# ceph orch ps | grep mgr
mgr.magna007.wpgvme  magna007  running (19s)  0s ago  3w  *:8443 *:9283  16.2.0-13.el8cp  89a188512eee  a287245f4789
mgr.magna010.syndxo  magna010  running (6m)   3s ago  3w  *:8443 *:9283  16.2.0-13.el8cp  89a188512eee  53aae976ffc6
[ceph: root@magna007 /]# ceph orch
[ceph: root@magna007 /]# ceph orch upgrade start --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-99506-20210424023822
Initiating upgrade to registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-99506-20210424023822
[ceph: root@magna007 /]# ceph orch upgrade status
{
    "target_image": "registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:55206326df77ef04991a3d4a59621f9dfcff5a8e68c151febc3d5e0e1cfd79e8",
    "in_progress": true,
    "services_complete": [
        "mgr"
    ],
    "progress": "2/41 ceph daemons upgraded",
    "message": ""
}
[ceph: root@magna007 /]#
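In short, the workaround shown above is (the mgr daemon name is from this cluster; substitute your own daemon and target image):

# 1. Redeploy the active mgr daemon on the target image
ceph orch daemon redeploy mgr.magna007.wpgvme --image <target image>

# 2. Fail over so the redeployed mgr, now running the fixed code, becomes active
ceph mgr fail

# 3. Start the regular build-to-build upgrade
ceph orch upgrade start --image <target image>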
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat Ceph Storage 5.0 bug fix and enhancement), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:3294