Bug 1944978
| Field | Value |
|---|---|
| Summary | [cephadm] 5.0 - HEALTH_ERR - Module 'cephadm' has failed: too many values to unpack (expected 3) is seen in ceph status |
| Product | [Red Hat Storage] Red Hat Ceph Storage |
| Reporter | Preethi <pnataraj> |
| Component | Cephadm |
| Assignee | Adam King <adking> |
| Status | CLOSED ERRATA |
| QA Contact | Vasishta <vashastr> |
| Severity | urgent |
| Docs Contact | Karen Norteman <knortema> |
| Priority | unspecified |
| Version | 5.0 |
| CC | adking, gabrioux, jolmomar, kdreyer, sewagner, tserlin, vashastr, vereddy |
| Target Milestone | --- |
| Keywords | TestBlocker |
| Target Release | 5.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Fixed In Version | ceph-16.2.0-8.el8cp |
| Doc Type | No Doc Update |
| Type | Bug |
| Last Closed | 2021-08-30 08:29:31 UTC |
| Attachments | Debug logs (attachment 1768035) |
Description

Preethi 2021-03-31 06:37:37 UTC

Please specify the severity of this bug. Severity is defined here: https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

Please provide the active manager log.

Created attachment 1768035 [details]
Debug logs
Copying this comment from the other BZ since it's about fixing this issue:

-----------------------------------------------------------------------------

For dealing with the slight format difference in our ceph versions for downstream containers (causing the "too many values to unpack" issue; see the sketch after the workaround steps below for how that error arises):

Upstream Tracker: https://tracker.ceph.com/issues/50043
Upstream PR: https://github.com/ceph/ceph/pull/40478

I think this change should fix the upgrade issues we're having. HOWEVER, it's very important to note that you will not simply be able to upgrade from an image without this change to an image with this change if the image with the change has the quirk in its ceph version. The upgrade would still fail the same way before even getting to access the new code. You would have to either set up a new cluster (and would hopefully be able to upgrade to any future versions with no issue) or manually redeploy the mgr daemons with the new image with the change before starting the upgrade.

------------------------------------------------------------------------------

*** Bug 1945260 has been marked as a duplicate of this bug. ***

The fix was merged (https://github.com/ceph/ceph/pull/40478) and backported (https://github.com/ceph/ceph/pull/40544). When this change gets into the downstream image, please keep in mind the note at the end of this comment about getting past this issue: https://bugzilla.redhat.com/show_bug.cgi?id=1944978#c4

@Adam, We have received the latest build, which has the fix, as it was built on 6th April. However, we cannot verify the cluster as per the below guidelines. We need to deploy a new cluster and verify the upgrades to the later versions, or redeploy the mgr daemons with the new image before starting the upgrade, but we do not know the steps to perform the second option. Please let us know how we can do this.

************************
HOWEVER, it's very important to note that you will not simply be able to upgrade from an image without this change to an image with this change if the image with the change has the quirk in its ceph version. The upgrade would still fail the same way before even getting to access the new code. You would have to either set up a new cluster (and would hopefully be able to upgrade to any future versions with no issue) or manually redeploy the mgr daemons with the new image with the change before starting the upgrade.
***********************

@Preethi Luckily, that process is pretty straightforward. For example, let's say I have a cluster on my docker.io/amk3798/ceph:latest image and I want to upgrade to docker.io/amk3798/ceph:testing but am worried about the upgrade code on the current image. I could follow these steps:
1) Record the container image id of the mgr daemons, found by running 'ceph orch ls --service_type mgr --format yaml'
[ceph: root@vm-00 /]# ceph orch ls --service_type mgr --format yaml
service_type: mgr
service_name: mgr
placement:
count: 2
status:
container_image_id: 3659d15c68034bd6ac545d5ed403fd62904cd100052c9fa3d0073d45564170a2
container_image_name: mix
created: '2021-04-07T12:35:44.384221Z'
last_refresh: '2021-04-07T12:40:55.271091Z'
running: 2
size: 2
Here you can see the image id is 3659d15c68034bd6ac545d5ed403fd62904cd100052c9fa3d0073d45564170a2
Note this should be the same image id for all ceph services (mgr, mon, osd, crash, iscsi, etc., but not the monitoring stack)
[ceph: root@vm-00 /]# ceph orch ls
NAME RUNNING REFRESHED AGE PLACEMENT IMAGE ID
alertmanager 1/1 20s ago 5m count:1 0881eb8f169f
crash 3/3 22s ago 5m * 3659d15c6803
grafana 1/1 20s ago 5m count:1 80728b29ad3f
mgr 2/2 22s ago 5m count:2 3659d15c6803
mon 3/5 22s ago 5m count:5 3659d15c6803
node-exporter 3/3 22s ago 5m * e5a616e4b9cf
osd.all-available-devices 6/6 22s ago 4m * 3659d15c6803
prometheus 1/1 20s ago 5m count:1 de242295e225
2) Check which mgr is active using 'ceph -s'
[ceph: root@vm-00 /]# ceph -s
cluster:
id: 7b1d4898-979d-11eb-b277-5254003f444b
health: HEALTH_OK
services:
mon: 3 daemons, quorum vm-00,vm-01,vm-02 (age 14m)
mgr: vm-00.gstosb(active, since 7m), standbys: vm-01.schczo
osd: 6 osds: 6 up (since 14m), 6 in (since 14m)
data:
pools: 1 pools, 256 pgs
objects: 0 objects, 0 B
usage: 42 MiB used, 900 GiB / 900 GiB avail
pgs: 256 active+clean
Here you can see that the vm-00.gstosb mgr is active
3) Redeploy each standby mgr daemon using 'ceph orch daemon redeploy <mgr-daemon-name> --image <image-you-want-to-upgrade-to>'
In my case there is only one standby mgr, vm-01.schczo (you can see it in the ceph -s output above)
[ceph: root@vm-00 /]# ceph orch daemon redeploy mgr.vm-01.schczo --image docker.io/amk3798/ceph:testing
Scheduled to redeploy mgr.vm-01.schczo on host 'vm-01'
4) Do a mgr failover to change which mgr is active. This is done by running 'ceph mgr fail' and then waiting about 30 seconds. Check that the active mgr has changed in 'ceph -s'
[ceph: root@vm-00 /]# ceph mgr fail
[ceph: root@vm-00 /]# ceph -s
cluster:
id: 7b1d4898-979d-11eb-b277-5254003f444b
health: HEALTH_OK
services:
mon: 3 daemons, quorum vm-00,vm-01,vm-02 (age 17m)
mgr: vm-01.schczo(active, since 15s), standbys: vm-00.gstosb
osd: 6 osds: 6 up (since 15m), 6 in (since 37m)
data:
pools: 1 pools, 256 pgs
objects: 0 objects, 0 B
usage: 48 MiB used, 900 GiB / 900 GiB avail
pgs: 256 active+clean
5) Redeploy the mgr you didn't redeploy already using the new image. In my case it's the vm-00 mgr
[ceph: root@vm-00 /]# ceph orch daemon redeploy mgr.vm-00.gstosb --image docker.io/amk3798/ceph:testing
Scheduled to redeploy mgr.vm-00.gstosb on host 'vm-00'
6) Once that redeploy is complete, check the image id for the mgr service again. (I recommend running 'ceph orch ps --refresh' first so the info is updated.)
[ceph: root@vm-00 /]# ceph orch ls --service_type mgr --format yaml
service_type: mgr
service_name: mgr
placement:
count: 2
status:
container_image_id: eef7ab5fad59c1f47afb00214d28c0cfddd1e9ced462eb16156edd52785ef9fb
container_image_name: docker.io/amk3798/ceph:testing
created: '2021-04-07T12:35:44.384221Z'
last_refresh: '2021-04-07T12:47:44.068510Z'
running: 2
size: 2
Note that the image id for my mgr service has changed from the one recorded in step 1. If you see that your mgr service image id has not changed or is "mix", you must go back and redeploy the mgr daemons that do not have the new image id, using the --image flag with the new image (as done in steps 3 and 5). You can check the image id for each individual mgr using 'ceph orch ps --daemon-type mgr --format yaml' (a rough helper script for this check is sketched after the orch ls output below).
At this point you can now see that the mgr service has a different image id than the other ceph services
[ceph: root@vm-00 /]# ceph orch ls
NAME RUNNING REFRESHED AGE PLACEMENT IMAGE ID
alertmanager 1/1 4m ago 16m count:1 0881eb8f169f
crash 3/3 4m ago 16m * 3659d15c6803
grafana 1/1 4m ago 16m count:1 80728b29ad3f
mgr 2/2 4m ago 16m count:2 eef7ab5fad59
mon 3/5 4m ago 16m count:5 3659d15c6803
node-exporter 3/3 4m ago 16m * e5a616e4b9cf
osd.all-available-devices 6/6 4m ago 15m * 3659d15c6803
prometheus 1/1 4m ago 16m count:1 de242295e225
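As referenced in the note above, one way to confirm that every mgr daemon is on the new image is to parse the orchestrator's JSON output. The snippet below is only a rough helper sketch, not part of cephadm or this BZ; it assumes the 'ceph' CLI is available on the host and that the JSON fields daemon_type, daemon_id, and container_image_id match the YAML shown earlier (adjust the field names if your build differs).

```python
#!/usr/bin/env python3
"""Rough helper sketch: flag mgr daemons running mixed container images.

Not part of cephadm; field names are assumed from the YAML output above.
"""
import json
import subprocess
from collections import Counter


def mgr_image_ids():
    # Ask the orchestrator for all mgr daemons in JSON form.
    out = subprocess.run(
        ["ceph", "orch", "ps", "--daemon-type", "mgr", "--format", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return {
        f"{d.get('daemon_type')}.{d.get('daemon_id')}": d.get("container_image_id")
        for d in json.loads(out)
    }


if __name__ == "__main__":
    ids = mgr_image_ids()
    counts = Counter(ids.values())
    print("image id counts:", dict(counts))
    if len(counts) > 1:
        # More than one distinct image id means some mgrs were not redeployed.
        print("mgr daemons are on mixed images; redeploy the stragglers with "
              "'ceph orch daemon redeploy <name> --image <new-image>'")
        for name, image_id in ids.items():
            print(f"  {name}: {image_id}")
```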
7) At this point, you have successfully redeployed the mgr daemons with the new image and can safely upgrade without worrying about faulty upgrade code in the old image.
[ceph: root@vm-00 /]# ceph orch upgrade start docker.io/amk3798/ceph:testing
Initiating upgrade to docker.io/amk3798/ceph:testing
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Let me know if you have questions or issues carrying out this process.
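For completeness, here is a minimal sketch of where the error in the bug title comes from. This is not the actual cephadm code; the helper names are made up for illustration. It only shows how a downstream version string such as 16.2.0-8.el8cp, which splits into four dot-separated fields instead of three, makes a fixed three-way tuple unpack raise "too many values to unpack (expected 3)", plus one tolerant way to parse such a string.

```python
# Minimal sketch (not the actual cephadm code) of how a downstream version
# string can trigger "too many values to unpack (expected 3)".
import re

UPSTREAM_VERSION = "16.2.0"            # three dot-separated fields
DOWNSTREAM_VERSION = "16.2.0-8.el8cp"  # splits into four fields on "."


def parse_strict(version: str):
    # Fragile: assumes exactly three dot-separated parts, so the downstream
    # string above raises ValueError("too many values to unpack (expected 3)").
    major, minor, patch = version.split(".")
    return int(major), int(minor), patch


def parse_tolerant(version: str):
    # Tolerant: only take the leading numeric x.y.z and ignore any suffix
    # such as "-8.el8cp" added by downstream builds.
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)", version)
    if not match:
        raise ValueError(f"unrecognised ceph version: {version!r}")
    return tuple(int(part) for part in match.groups())


if __name__ == "__main__":
    print(parse_tolerant(UPSTREAM_VERSION))    # (16, 2, 0)
    print(parse_tolerant(DOWNSTREAM_VERSION))  # (16, 2, 0)
    try:
        parse_strict(DOWNSTREAM_VERSION)
    except ValueError as err:
        print(f"strict parse failed: {err}")   # too many values to unpack (expected 3)
```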
@Adam, Redeploying the mgr daemons with the new image before performing the upgrade worked for upgrading to the latest build, except for the monitoring stack. Please refer to the below:

http://pastebin.test.redhat.com/955647

However, a direct build-to-build upgrade should work from 16.2.0-8 to later builds.

(In reply to Preethi from comment #22)
> @Adam, Redeploying mgr daemons with new image before performing upgrade
> worked in upgrading to the latest except monitoring stack. Please refer the
> below
>
> http://pastebin.test.redhat.com/955647
>
> However, Build to build direct upgrade should work from 16.2.0-8 to later
> builds

We also saw the upgrade go into an infinite loop trying to redeploy one monitor. When we restarted the cephadm module it worked, but we do not know the root cause of the problem:

ceph mgr module disable cephadm
ceph mgr module enable cephadm

@Adam, Sure, we will try again with another cluster and update if we see any issue with the mgr redeploy. This cannot be moved to verified until we verify the direct build-to-build upgrade. I will test both ways and update the result. We can log a separate low-severity BZ for the ceph orch build check behaviour for tracking purposes.

@Adam, We have not seen any issues in the workaround steps. It is working fine both with the workaround and with direct upgrades. Moving to verified state. We have followed the same steps as you mentioned above.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat Ceph Storage 5.0 bug fix and enhancement), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3294