Bug 1908678 - ocs-osd-removal job failed with "Invalid value" error when using multiple ids
Summary: ocs-osd-removal job failed with "Invalid value" error when using multiple ids
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: OCS 4.7.0
Assignee: Servesha
QA Contact: Itzhak
URL:
Whiteboard:
Depends On:
Blocks: 1913557 1938134
 
Reported: 2020-12-17 10:26 UTC by Itzhak
Modified: 2021-06-01 08:50 UTC
CC List: 12 users

Fixed In Version: 4.7.0-721.ci
Doc Type: Bug Fix
Doc Text:
.Multiple OSD removal job no longer fails
Previously, when triggering the job for multiple OSD removal, the template included a comma with the OSD IDs in the job name, which caused job creation to fail. With this update, the OSD IDs have been removed from the job name to maintain a valid format. The job name has been changed from `ocs-osd-removal-${FAILED_OSD_IDS}` to `ocs-osd-removal-job`.
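As a quick illustration of the fix (a hedged sketch, not part of the official doc text): on a fixed build, processing the template should always yield the constant job name regardless of how many IDs are passed, which can be checked without creating the job:
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=1,2 -o yaml | grep "name: ocs-osd-removal"
# expected to show only the fixed name "ocs-osd-removal-job" (no OSD IDs embedded in the name)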
Clone Of:
: 1913557 (view as bug list)
Environment:
Last Closed: 2021-05-19 09:17:08 UTC
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift ocs-operator pull 1003 | 0 | None | closed | Job: Removes the OSD IDs from the label | 2021-02-18 16:08:30 UTC
Github | openshift ocs-operator pull 1029 | 0 | None | closed | Bug 1908678: [release-4.7] Job: Removes the OSD IDs from the label | 2021-02-18 16:08:30 UTC
Github | openshift ocs-operator pull 969 | 0 | None | closed | Removes the OSD IDs from the osd removal job name | 2021-02-18 16:08:30 UTC
Github | openshift ocs-operator pull 998 | 0 | None | closed | Bug 1908678: [release-4.7] Removes the OSD IDs from the osd removal job name | 2021-02-18 16:08:30 UTC
Red Hat Product Errata | RHSA-2021:2041 | 0 | None | None | None | 2021-05-19 09:17:59 UTC

Description Itzhak 2020-12-17 10:26:35 UTC
Description of problem (please be as detailed as possible and provide log snippets):
When executing the ocs-osd-removal job with more than one OSD ID, we get an "Invalid value" error.

Version of all relevant components (if applicable):
OCP version:
Client Version: 4.6.0-0.nightly-2020-12-08-021151
Server Version: 4.6.0-0.nightly-2020-12-16-010206
Kubernetes Version: v1.19.0+7070803

OCS version:
ocs-operator.v4.6.0-195.ci   OpenShift Container Storage   4.6.0-195.ci              Succeeded

cluster version
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-12-16-010206   True        False         24h     Cluster version is 4.6.0-0.nightly-2020-12-16-010206

Rook version
rook: 4.6-80.1ae5ac6a.release_4.6
go: go1.15.2

Ceph version
ceph version 14.2.8-115.el8cp (183dfafff0de1f79fccc983d82e733fedc0e988b) nautilus (stable)


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes. If we want to remove more than one failed OSD, we cannot remove them all with a single ocs-osd-removal job.

Is there any workaround available to the best of your knowledge?
Yes. Execute a separate ocs-osd-removal job for each OSD ID.
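For example, a sketch of that workaround based on the commands in comment 2 (OSD IDs 0 and 1 are placeholders):
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=0 | oc create -n openshift-storage -f -
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=1 | oc create -n openshift-storage -f -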

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes.

Can this issue be reproduced from the UI?
No

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
Execute the command:
$oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=0,1 |oc create -n openshift-storage -f -
(0 and 1 are just example OSD IDs; you can use other OSD IDs as well.)
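For reference, the generated job name can be inspected without creating anything; a hedged sketch (assuming the 4.6 template substitutes FAILED_OSD_IDS into the job name, as the linked fix indicates):
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=0,1 -o yaml | grep "name: ocs-osd-removal"
# on the affected 4.6 template this shows "ocs-osd-removal-0,1", which is not a valid Kubernetes object name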

Actual results:
The Job "ocs-osd-removal-0,1" failed with "Invalid value" error

Expected results:
The command should finish successfully and create 2 ocs-osd-removal jobs:
"ocs-osd-removal-0" and "ocs-osd-removal-1".

Additional info:

Comment 2 Itzhak 2020-12-17 10:37:47 UTC
One other note about this BZ. 
When I executed 2 separate ocs-osd-removal jobs simultaneously, the process finished successfully.

I tested it with a dynamic cluster on vSphere.
I deleted 2 hard drives from the worker nodes 'compute-0' and 'compute-1', which correspond to OSDs 0 and 2.

Then I ran the commands: 
$oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=0 |oc create -n openshift-storage -f -
$oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=2 |oc create -n openshift-storage -f -

And deleted their corresponding PVs.
The process finished successfully, with all 3 OSDs up and running and Ceph health OK.
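For reference, a hedged sketch of the checks behind "up and running and Ceph health OK" (the app=rook-ceph-osd label and the toolbox pod name are assumptions based on a standard Rook/OCS deployment):
$ oc get pods -n openshift-storage -l app=rook-ceph-osd
$ oc rsh -n openshift-storage <rook-ceph-tools-pod> ceph health
# all OSD pods should be 2/2 Running and Ceph should report HEALTH_OK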

Comment 5 Servesha 2020-12-17 13:23:38 UTC
@Neha Sure, I will take a look.

Comment 6 Travis Nielsen 2020-12-17 19:16:01 UTC
Moving out of 4.6.0 since it's not a blocking issue. At least there is a known workaround to remove OSDs individually.

Comment 7 Servesha 2021-01-07 04:16:27 UTC
The PR is created here: https://github.com/openshift/ocs-operator/pull/969/

Comment 8 Martin Bukatovic 2021-01-15 17:50:13 UTC
Providing QE ack, see reproducer from the bug description.

Comment 11 Servesha 2021-02-02 07:00:44 UTC
https://github.com/openshift/ocs-operator/pull/1003/ is merged.

Comment 13 Itzhak 2021-02-18 12:57:13 UTC
Will this be tested on both dynamic and LSO clusters, or would testing it on a dynamic cluster suffice?

Comment 14 Travis Nielsen 2021-02-18 14:32:32 UTC
This fix can be verified on any cluster. In general, the OSD removal job does need to be tested on both dynamic and LSO clusters, but this fix does not require both types.

Comment 16 Itzhak 2021-02-18 17:46:51 UTC
I tested it with a vSphere 4.7 dynamic cluster. 

Steps I performed to reproduce the bug scenario and verify the fix:

1. Go to the vSphere platform where the cluster is located, and delete 2 disks.

2. Look at the terminal and see that 2 of the OSDs are down:
$ oc get pods -n openshift-storage | grep osd
rook-ceph-osd-0-7855c957d-6ps45                                   2/2     Running            0          21h
rook-ceph-osd-1-78ffdc9644-hc2gc                                  1/2     CrashLoopBackOff   5          20h
rook-ceph-osd-2-b68f4c767-4hgrg                                   1/2     CrashLoopBackOff   4          21h
rook-ceph-osd-prepare-ocs-deviceset-1-data-08489z-pq65k           0/1     Completed          0          21h
rook-ceph-osd-prepare-ocs-deviceset-2-data-02bwcc-jkttr           0/1     Completed          0          21h

3. Delete OSD 1:

$ osd_id_to_remove=1
$ oc scale -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove} --replicas=0
deployment.apps/rook-ceph-osd-1 scaled

$ oc get -n openshift-storage pods -l ceph-osd-id=${osd_id_to_remove}
NAME                               READY   STATUS        RESTARTS   AGE
rook-ceph-osd-1-78ffdc9644-hc2gc   0/2     Terminating   6          20h

$ oc project openshift-storage 
Now using project "openshift-storage" on server "https://api.ikave-vm47-feb17.qe.rh-ocs.com:6443".

$ oc delete pod rook-ceph-osd-1-78ffdc9644-hc2gc 
pod "rook-ceph-osd-1-78ffdc9644-hc2gc" deleted

$ oc delete pod rook-ceph-osd-1-78ffdc9644-hc2gc --grace-period=0 --force 
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "rook-ceph-osd-1-78ffdc9644-hc2gc" force deleted


4. Delete OSD 2 following the same steps as above.

5. Execute the ocs-osd-removal-job:
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=1,2 |oc create -n openshift-storage -f -
job.batch/ocs-osd-removal-job created

$ oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
NAME                        READY   STATUS      RESTARTS   AGE
ocs-osd-removal-job-r5wh5   0/1     Completed   0          28s

6. Check the status of the PVs:
$ oc get pv | grep 100Gi | grep openshift-storage
pvc-3ce32a3f-6786-4e5d-ab1c-73a905381ed3   100Gi      RWO            Delete           Bound    openshift-storage/ocs-deviceset-0-data-0hnnwc                    thin                                   56s
pvc-a8c94f98-dccf-4504-a2d3-828d3adc8173   100Gi      RWO            Delete           Bound    openshift-storage/ocs-deviceset-1-data-0qsxkm                    thin                                   55s
pvc-accf611b-39ab-4c9d-8ecc-958e384e9959   100Gi      RWO            Delete           Failed   openshift-storage/ocs-deviceset-1-data-08489z                    thin                                   21h
pvc-d9e34905-4ba0-4368-8f60-3aa51efc2931   100Gi      RWO            Delete           Bound    openshift-storage/ocs-deviceset-2-data-02bwcc                    thin                                   21h
pvc-e86e797e-f62f-453a-9a78-c8f62f3b18f5   100Gi      RWO            Delete           Failed   openshift-storage/ocs-deviceset-0-data-0dq4g5                    thin                                   21h


7. There are 2 PVs in Failed status and 3 PVs in Bound status. Delete the PVs in Failed status:
$ oc delete pv pvc-accf611b-39ab-4c9d-8ecc-958e384e9959 
persistentvolume "pvc-accf611b-39ab-4c9d-8ecc-958e384e9959" deleted
                             
$ oc delete pv pvc-e86e797e-f62f-453a-9a78-c8f62f3b18f5 
persistentvolume "pvc-e86e797e-f62f-453a-9a78-c8f62f3b18f5" deleted


8. Delete the ocs-osd-removal-job:
$oc delete jobs.batch ocs-osd-removal-job
job.batch "ocs-osd-removal-job" deleted

9. Check the OSD pods:
$ oc get pods -n openshift-storage | grep osd
rook-ceph-osd-0-7855c957d-6ps45                                   2/2     Running     0          22h
rook-ceph-osd-1-559bf6578c-nxq6p                                  2/2     Running     0          69m
rook-ceph-osd-2-cf6bf47bc-8lcfk                                   2/2     Running     0          69m
rook-ceph-osd-prepare-ocs-deviceset-0-data-0hnnwc-cjc7p           0/1     Completed   0          70m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0qsxkm-4vggb           0/1     Completed   0          70m
rook-ceph-osd-prepare-ocs-deviceset-2-data-02bwcc-jkttr           0/1     Completed   0          22h


10. Silence the Ceph warnings about the OSD crashes:
$ oc rsh rook-ceph-tools-5c6ddd4df9-9v2dk
 
sh-4.4# 
sh-4.4# ceph crash ls-new
ID                                                               ENTITY NEW 
2021-02-18_11:00:51.223639Z_c7fa00d1-fd63-4dab-bb8d-831f205d91e8 osd.1   *  
2021-02-18_11:04:02.151486Z_9964cf00-e986-48c1-91e1-210273c2f9c7 osd.2   *  
sh-4.4# ceph crash archive 2021-02-18_11:00:51.223639Z_c7fa00d1-fd63-4dab-bb8d-831f205d91e8
sh-4.4# ceph crash archive 2021-02-18_11:04:02.151486Z_9964cf00-e986-48c1-91e1-210273c2f9c7

11. After approximately 40 minutes, Ceph health went back to HEALTH_OK (a sketch of related check commands follows below).
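For completeness, a hedged sketch of related check commands (the job-logs check applies before step 8 deletes the job; the ceph commands run inside the toolbox shell opened in step 10; availability of `ceph crash archive-all` in this Ceph release is an assumption):
$ oc logs -n openshift-storage -l job-name=ocs-osd-removal-job   # confirm the removal job completed without errors
sh-4.4# ceph crash archive-all                                   # alternative to archiving the crash IDs one by one
sh-4.4# ceph health                                              # should eventually report HEALTH_OK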


Versions:

OCP version:
Client Version: 4.6.0-0.nightly-2021-01-12-112514
Server Version: 4.7.0-0.nightly-2021-02-13-071408
Kubernetes Version: v1.20.0+bd9e442

OCS version:
ocs-operator.v4.7.0-263.ci   OpenShift Container Storage   4.7.0-263.ci              Succeeded

cluster version
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-02-13-071408   True        False         27h     Cluster version is 4.7.0-0.nightly-2021-02-13-071408

Rook version
rook: 4.7-94.16bbf3806.release_4.7
go: go1.15.5

Ceph version
ceph version 14.2.11-112.el8cp (f00060cb2688083840d657432768de1f6609767e) nautilus (stable)

Comment 17 Itzhak 2021-02-18 17:50:24 UTC
According to the steps above, we can see that the ocs-osd-removal job worked with multiple OSD IDs.
So I am moving the bug to VERIFIED status.

Comment 19 Erin Donnelly 2021-04-13 20:31:44 UTC
Thank you for the doc text, Servesha. Could you take a look at my edited version and let me know if it looks OK?

Comment 22 errata-xmlrpc 2021-05-19 09:17:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041

