
Bug 1944978

Summary: [cephadm] 5.0 - HEALTH_ERR - Module 'cephadm' has failed: too many values to unpack (expected 3) is seen in ceph status
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Preethi <pnataraj>
Component: Cephadm
Assignee: Adam King <adking>
Status: CLOSED ERRATA
QA Contact: Vasishta <vashastr>
Severity: urgent
Docs Contact: Karen Norteman <knortema>
Priority: unspecified
Version: 5.0
CC: adking, gabrioux, jolmomar, kdreyer, sewagner, tserlin, vashastr, vereddy
Target Milestone: ---
Keywords: TestBlocker
Target Release: 5.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: ceph-16.2.0-8.el8cp
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-08-30 08:29:31 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Attachments: Debug logs (flags: none)

Description Preethi 2021-03-31 06:37:37 UTC
Description of problem: [cephadm] 5.0 - HEALTH_ERR - Module 'cephadm' has failed: too many values to unpack (expected 3) is seen in ceph status during build upgrades, due to which the upgrade progress shows as in progress and does not complete.

We hit this issue while verifying BZ https://bugzilla.redhat.com/show_bug.cgi?id=1917949.

Version-Release number of selected component (if applicable):
 cephadm version
Using recent ceph image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:5495fe25a0500573ce1787a6bce219169ea4c098fefe68c429cc5efb065e7246
ceph version 16.1.0-1084.el8cp (899d93a5c7913d6952438f4b48d29d1cef2aaa2a) pacific (rc)

How reproducible:


Steps to Reproduce:
1. Have a cluster with 5.0 bootstrapped with internal registry
2. Perform an upgrade using the ceph orch upgrade command to a newer image, e.g. registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-97619-20210115184956
3. Observe the behaviour

ceph orch upgrade start --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-22961-20210324054847
Initiating upgrade to registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-22961-20210324054847
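
While the upgrade runs, progress can be checked with the following; this is a hedged aside using standard cephadm/ceph commands (output format varies by build):

ceph orch upgrade status    # target image and whether the upgrade is still in progress
ceph -s                     # overall cluster health and progress events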

Actual results: The upgrade stays in progress and the following health error is seen:

ceph -s
    health: HEALTH_ERR
            Module 'cephadm' has failed: too many values to unpack (expected 3)

(The cephadm module is disabled because of the "too many values to unpack" failure shown in the 'ceph -s' output. Some of our downstream images have the version formatted differently, which causes the above health error.)
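
For illustration only (this is not the actual cephadm source), a minimal Python sketch of how this class of error arises when a downstream-style version string is unpacked into exactly three fields:

python3 - <<'EOF'
upstream = "16.2.0"
downstream = "16.2.0-8.el8cp"  # downstream builds carry an extra release/dist suffix

major, minor, patch = upstream.split(".")        # three fields: fine
try:
    major, minor, patch = downstream.split(".")  # four fields after splitting on "."
except ValueError as err:
    print(err)  # prints: too many values to unpack (expected 3)
EOF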

Expected results:
Health status should be OK and upgrade should resume and complete.

Additional info:
magna057 root/q

Comment 1 RHEL Program Management 2021-03-31 06:37:42 UTC
Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

Comment 2 Juan Miguel Olmo 2021-03-31 09:02:21 UTC
Please provide the active manager log.

Comment 3 Preethi 2021-03-31 11:01:29 UTC
Created attachment 1768035 [details]
Debug logs

Comment 4 Adam King 2021-03-31 12:33:15 UTC
Copying this comment from the other BZ since it's about fixing this issue

-----------------------------------------------------------------------------

For dealing with the slight format difference in our ceph versions for downstream containers (causing the "too many values to unpack" issue):

Upstream Tracker: https://tracker.ceph.com/issues/50043
Upstream PR: https://github.com/ceph/ceph/pull/40478

I think this change should fix the upgrade issues we're having.

HOWEVER, it's very important to note that you will not simply be able to upgrade from an image without this change to an image with this change if the image with the change has the quirk in its ceph version. The upgrade would still fail the same way before ever reaching the new code. You would have to either set up a new cluster (which should then be able to upgrade to any future versions without issue) or manually redeploy the mgr daemons with the new image containing the change before starting the upgrade.

------------------------------------------------------------------------------
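
As context for the version-format difference described above, a hedged sketch (not the actual change in the linked PR) of version parsing that tolerates the downstream suffix by matching only the leading numeric components:

python3 - <<'EOF'
import re

def parse_ceph_release(version: str):
    # Match only the leading "major.minor.patch" digits, ignoring any
    # downstream suffix such as "-8.el8cp".
    m = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if not m:
        raise ValueError(f"unrecognized ceph version: {version}")
    return tuple(int(x) for x in m.groups())

print(parse_ceph_release("16.2.0"))          # (16, 2, 0)
print(parse_ceph_release("16.2.0-8.el8cp"))  # (16, 2, 0)
EOF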

Comment 5 Guillaume Abrioux 2021-03-31 18:37:43 UTC
*** Bug 1945260 has been marked as a duplicate of this bug. ***

Comment 6 Adam King 2021-04-05 16:00:31 UTC
Fix was merged https://github.com/ceph/ceph/pull/40478
and backported https://github.com/ceph/ceph/pull/40544

When this change gets into the downstream image, please keep in mind the note at the end of this comment about getting past this issue: https://bugzilla.redhat.com/show_bug.cgi?id=1944978#c4

Comment 8 Preethi 2021-04-07 10:07:26 UTC
@Adam, We have received the latest build, which has the fix, as it was built on 6th April. However, we cannot verify on the existing cluster as per the guidelines below. We either need to deploy a new cluster and verify upgrades to later versions, or redeploy the mgr daemons with the new image before starting the upgrade, but we do not know the steps for the second option. Please let us know how we can do this.

************************
HOWEVER, it's very important to note that you will not simply be able to upgrade from an image without this change to an image with this change if the image with the change has the quirk in its ceph version. The upgrade would still fail the same way before ever reaching the new code. You would have to either set up a new cluster (which should then be able to upgrade to any future versions without issue) or manually redeploy the mgr daemons with the new image containing the change before starting the upgrade.
***********************

Comment 9 Adam King 2021-04-07 13:43:00 UTC
@Preethi Luckily, that process is pretty straightforward. For example, let's say I have a cluster on my docker.io/amk3798/ceph:latest image and I want to upgrade to docker.io/amk3798/ceph:testing but am worried about the upgrade code on the current image. I could follow these steps:

1) Record the container image id of the mgr daemons, found by running 'ceph orch ls --service_type mgr --format yaml':

[ceph: root@vm-00 /]# ceph orch ls --service_type mgr --format yaml
service_type: mgr
service_name: mgr
placement:
  count: 2
status:
  container_image_id: 3659d15c68034bd6ac545d5ed403fd62904cd100052c9fa3d0073d45564170a2
  container_image_name: mix
  created: '2021-04-07T12:35:44.384221Z'
  last_refresh: '2021-04-07T12:40:55.271091Z'
  running: 2
  size: 2

Here you can see the image id is 3659d15c68034bd6ac545d5ed403fd62904cd100052c9fa3d0073d45564170a2

Note: this should be the same image id for all ceph services (mgr, mon, osd, crash, iscsi, etc., but not the monitoring stack).

[ceph: root@vm-00 /]# ceph orch ls
NAME                       RUNNING  REFRESHED  AGE  PLACEMENT  IMAGE ID      
alertmanager                   1/1  20s ago    5m   count:1    0881eb8f169f  
crash                          3/3  22s ago    5m   *          3659d15c6803  
grafana                        1/1  20s ago    5m   count:1    80728b29ad3f  
mgr                            2/2  22s ago    5m   count:2    3659d15c6803  
mon                            3/5  22s ago    5m   count:5    3659d15c6803  
node-exporter                  3/3  22s ago    5m   *          e5a616e4b9cf  
osd.all-available-devices      6/6  22s ago    4m   *          3659d15c6803  
prometheus                     1/1  20s ago    5m   count:1    de242295e225  





2) Check which mgr is active using 'ceph -s':

[ceph: root@vm-00 /]# ceph -s                
  cluster:
    id:     7b1d4898-979d-11eb-b277-5254003f444b
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum vm-00,vm-01,vm-02 (age 14m)
    mgr: vm-00.gstosb(active, since 7m), standbys: vm-01.schczo
    osd: 6 osds: 6 up (since 14m), 6 in (since 14m)
 
  data:
    pools:   1 pools, 256 pgs
    objects: 0 objects, 0 B
    usage:   42 MiB used, 900 GiB / 900 GiB avail
    pgs:     256 active+clean
 



Here you can see that the vm-00.gstosb mgr is active



3) Redeploy each standby mgr daemon using 'ceph orch daemon redeploy <mgr-daemon-name> --image <image-you-want-to-upgrade-to>'


In my case there is only one standby mgr vm-01.schczo (you can see it in ceph -s output above)



[ceph: root@vm-00 /]# ceph orch daemon redeploy mgr.vm-01.schczo --image docker.io/amk3798/ceph:testing
Scheduled to redeploy mgr.vm-01.schczo on host 'vm-01'



4) Do a mgr failover to change which mgr is active, using 'ceph mgr fail', then wait about 30 seconds. Check that the active mgr has changed in 'ceph -s':


[ceph: root@vm-00 /]# ceph mgr fail
[ceph: root@vm-00 /]# ceph -s
  cluster:
    id:     7b1d4898-979d-11eb-b277-5254003f444b
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum vm-00,vm-01,vm-02 (age 17m)
    mgr: vm-01.schczo(active, since 15s), standbys: vm-00.gstosb
    osd: 6 osds: 6 up (since 15m), 6 in (since 37m)
 
  data:
    pools:   1 pools, 256 pgs
    objects: 0 objects, 0 B
    usage:   48 MiB used, 900 GiB / 900 GiB avail
    pgs:     256 active+clean



5) Redeploy the mgr you didn't redeploy already, using the new image. In my case it's the vm-00 mgr:


[ceph: root@vm-00 /]# ceph orch daemon redeploy mgr.vm-00.gstosb --image docker.io/amk3798/ceph:testing
Scheduled to redeploy mgr.vm-00.gstosb on host 'vm-00'


6) Once that redeploy is complete, check the image id for the mgr service again. (I recommend running 'ceph orch ps --refresh' first so the info is updated.)


[ceph: root@vm-00 /]# ceph orch ls --service_type mgr --format yaml
service_type: mgr
service_name: mgr
placement:
  count: 2
status:
  container_image_id: eef7ab5fad59c1f47afb00214d28c0cfddd1e9ced462eb16156edd52785ef9fb
  container_image_name: docker.io/amk3798/ceph:testing
  created: '2021-04-07T12:35:44.384221Z'
  last_refresh: '2021-04-07T12:47:44.068510Z'
  running: 2
  size: 2


Note that the image id for my mgr service has changed from the one recorded in step 1. If you see that your mgr service image id has not changed, or is "mix", you must go back and redeploy the mgr daemons that do not have the new image id, using the --image flag with the new image (as done in steps 3 and 5). You can check the image id for each individual mgr using 'ceph orch ps --daemon-type mgr --format yaml', as shown below.
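
For that per-daemon check, a hedged example invocation (the yaml output should include a container_image_id field per mgr daemon; exact field names can vary by version):

ceph orch ps --daemon-type mgr --format yaml | grep -E 'daemon_id|container_image_id'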


At this point you can see that the mgr service has a different image id than the other ceph services:


[ceph: root@vm-00 /]# ceph orch ls
NAME                       RUNNING  REFRESHED  AGE  PLACEMENT  IMAGE ID      
alertmanager                   1/1  4m ago     16m  count:1    0881eb8f169f  
crash                          3/3  4m ago     16m  *          3659d15c6803  
grafana                        1/1  4m ago     16m  count:1    80728b29ad3f  
mgr                            2/2  4m ago     16m  count:2    eef7ab5fad59  
mon                            3/5  4m ago     16m  count:5    3659d15c6803  
node-exporter                  3/3  4m ago     16m  *          e5a616e4b9cf  
osd.all-available-devices      6/6  4m ago     15m  *          3659d15c6803  
prometheus                     1/1  4m ago     16m  count:1    de242295e225  




7) At this point, you have successfully redeployed the mgr daemons with the new image and can safely upgrade without worrying about faulty upgrade code in the old image.



[ceph: root@vm-00 /]# ceph orch upgrade start docker.io/amk3798/ceph:testing
Initiating upgrade to docker.io/amk3798/ceph:testing
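
For reference, a condensed shell sketch of the whole sequence (daemon names and image are taken from the example above; substitute your own):

# 1. Record the current mgr image id and note which mgr is active
ceph orch ls --service_type mgr --format yaml
ceph -s

# 2. Redeploy the standby mgr(s) with the new image
ceph orch daemon redeploy mgr.vm-01.schczo --image docker.io/amk3798/ceph:testing

# 3. Fail over so a redeployed mgr becomes active, then wait ~30 seconds
ceph mgr fail

# 4. Redeploy the formerly active mgr with the new image
ceph orch daemon redeploy mgr.vm-00.gstosb --image docker.io/amk3798/ceph:testing

# 5. Confirm the mgr service reports the new image id, then start the upgrade
ceph orch ps --refresh
ceph orch ls --service_type mgr --format yaml
ceph orch upgrade start docker.io/amk3798/ceph:testing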





--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------




Let me know if you have questions or issues carrying out this process.

Comment 22 Preethi 2021-04-14 13:32:19 UTC
@Adam, Redeploying the mgr daemons with the new image before performing the upgrade worked; everything upgraded to the latest build except the monitoring stack. Please refer to the link below:

http://pastebin.test.redhat.com/955647

However, a build-to-build direct upgrade should still work from 16.2.0-8 to later builds.

Comment 23 Preethi 2021-04-14 13:51:49 UTC
(In reply to Preethi from comment #22)
> @Adam, Redeploying mgr daemons with new image before performing upgrade
> worked in upgrading to the latest except monitoring stack. Please refer the
> below 
> 
> http://pastebin.test.redhat.com/955647
> 
> However, Build to build direct upgrade should work from 16.2.0-8 to later
> builds

We also saw the upgrade stuck in an infinite loop trying to redeploy one monitor. Restarting the cephadm module got it working, but we do not know the root cause of the problem:
ceph mgr module disable cephadm
ceph mgr module enable cephadm
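
A hedged follow-up check (standard ceph commands, not part of the original report) that the module came back and the health error cleared:

ceph mgr module ls | grep -i cephadm
ceph health detail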

Comment 25 Preethi 2021-04-14 15:45:35 UTC
@Adam, Sure, I will try again with another cluster and update if we see any issue with the mgr redeploy. This cannot be moved to verified until we verify the straightforward build-to-build upgrade. I will test both ways and update with the results.

We can log a low-severity BZ for the ceph orch build check behaviour for tracking purposes.

Comment 26 Preethi 2021-04-15 13:13:31 UTC
@Adam, We have not seen any issues with the workaround steps. It is working fine with both the workaround and direct upgrades. Moving to the VERIFIED state.


We followed the same steps as you mentioned above.

Comment 29 errata-xmlrpc 2021-08-30 08:29:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 5.0 bug fix and enhancement), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3294