Bug 1834301 - The OCS 4.5 build is not deployable as rook-ceph-detect-version is in a loop of creating/terminating
Summary: The OCS 4.5 build is not deployable as rook-ceph-detect-version is in a loop of creating/terminating
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: rook
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: OCS 4.5.0
Assignee: Travis Nielsen
QA Contact: Petr Balogh
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-05-11 13:39 UTC by Petr Balogh
Modified: 2020-09-15 10:17 UTC
CC List: 12 users

Fixed In Version: 4.5.0-434.ci
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-15 10:17:01 UTC
Embargoed:


Links:
  GitHub rook/rook pull 5524 (closed): ceph: default the device set template name to data (last updated 2021-02-04 07:01:15 UTC)
  Red Hat Product Errata RHBA-2020:3754 (last updated 2020-09-15 10:17:23 UTC)

Internal Links: 1930466

Description Petr Balogh 2020-05-11 13:39:05 UTC
Description of problem (please be as detailed as possible and provide log snippets):
From the events you can see:
[...] Container image "quay.io/rhceph-dev/rook-ceph@sha256:e4e20a1e8756a8b9847def42a60aa117d8ab5633c6eaec3f8013132c2800c72c" already present on machine
openshift-storage                                       27s         Normal    Started                                      pod/rook-ceph-detect-version-rn4x6                                            Started container init-copy-binaries
openshift-storage                                       27s         Normal    Created                                      pod/rook-ceph-detect-version-rn4x6                                            Created container init-copy-binaries
openshift-storage                                       26s         Normal    Started                                      pod/rook-ceph-detect-version-rn4x6                                            Started container cmd-reporter
openshift-storage                                       26s         Normal    Created                                      pod/rook-ceph-detect-version-rn4x6                                            Created container cmd-reporter
openshift-storage                                       26s         Normal    Pulled                                       pod/rook-ceph-detect-version-rn4x6                                            Container image "quay.io/rhceph-dev/rhceph@sha256:2aca817ad21c8b204d8fdee03a0cfee6e2cc7a177b0b25b46d4fabb9c3f099b3" already present on machine
openshift-storage                                       <unknown>   Normal    Scheduled                                    pod/rook-ceph-detect-version-wctf6                                            Successfully assigned openshift-storage/rook-ceph-detect-version-wctf6 to ip-10-0-129-52.us-east-2.compute.internal
openshift-storage                                       21s         Normal    SuccessfulCreate                             job/rook-ceph-detect-version                                                  Created pod: rook-ceph-detect-version-wctf6
openshift-storage                                       19s         Normal    Created                                      pod/rook-ceph-detect-version-wctf6                                            Created container init-copy-binaries
openshift-storage                                       19s         Normal    Pulled                                       pod/rook-ceph-detect-version-wctf6                                            Container image "quay.io/rhceph-dev/rook-ceph@sha256:e4e20a1e8756a8b9847def42a60aa117d8ab5633c6eaec3f8013132c2800c72c" already present on machine
openshift-storage                                       19s         Normal    Started                                      pod/rook-ceph-detect-version-wctf6                                            Started container init-copy-binaries
openshift-storage                                       18s         Normal    Created                                      pod/rook-ceph-detect-version-wctf6                                            Created container cmd-reporter
openshift-storage                                       18s         Normal    Pulled                                       pod/rook-ceph-detect-version-wctf6                                            Container image "quay.io/rhceph-dev/rhceph@sha256:2aca817ad21c8b204d8fdee03a0cfee6e2cc7a177b0b25b46d4fabb9c3f099b3" already present on machine
openshift-storage                                       18s         Normal    Started                                      pod/rook-ceph-detect-version-wctf6                                            Started container cmd-reporter
openshift-storage                                       15s         Normal    SuccessfulCreate                             job/rook-ceph-detect-version                                                  Created pod: rook-ceph-detect-version-jgzsc
openshift-storage                                       <unknown>   Normal    Scheduled                                    pod/rook-ceph-detect-version-jgzsc                                            Successfully assigned openshift-storage/rook-ceph-detect-version-jgzsc to ip-10-0-129-52.us-east-2.compute.internal
openshift-storage                                       13s         Normal    Pulled                                       pod/rook-ceph-detect-version-jgzsc                                            Container image "quay.io/rhceph-dev/rook-ceph@sha256:e4e20a1e8756a8b9847def42a60aa117d8ab5633c6eaec3f8013132c2800c72c" already present on machine
openshift-storage                                       13s         Normal    Created                                      pod/rook-ceph-detect-version-jgzsc                                            Created container init-copy-binaries
openshift-storage                                       13s         Normal    Started                                      pod/rook-ceph-detect-version-jgzsc                                            Started container init-copy-binaries
openshift-storage                                       12s         Normal    Pulled                                       pod/rook-ceph-detect-version-jgzsc                                            Container image "quay.io/rhceph-dev/rhceph@sha256:2aca817ad21c8b204d8fdee03a0cfee6e2cc7a177b0b25b46d4fabb9c3f099b3" already present on machine
openshift-storage                                       12s         Normal    Created                                      pod/rook-ceph-detect-version-jgzsc                                            Created container cmd-reporter
openshift-storage                                       12s         Normal    Started                                      pod/rook-ceph-detect-version-jgzsc                                            Started container cmd-reporter
openshift-storage                                       <unknown>   Normal    Scheduled                                    pod/rook-ceph-detect-version-qxhls                                            Successfully assigned openshift-storage/rook-ceph-detect-version-qxhls to ip-10-0-129-52.us-east-2.compute.internal
openshift-storage                                       9s          Normal    SuccessfulCreate                             job/rook-ceph-detect-version                                                  Created pod: rook-ceph-detect-version-qxhls
openshift-storage                                       7s          Normal    Pulled                                       pod/rook-ceph-detect-version-qxhls                                            Container image "quay.io/rhceph-dev/rook-ceph@sha256:e4e20a1e8756a8b9847def42a60aa117d8ab5633c6eaec3f8013132c2800c72c" already present on machine
openshift-storage                                       7s          Normal    Started                                      pod/rook-ceph-detect-version-qxhls                                            Started container init-copy-binaries
openshift-storage                                       7s          Normal    Created                                      pod/rook-ceph-detect-version-qxhls                                            Created container init-copy-binaries
openshift-storage                                       6s          Normal    Created                                      pod/rook-ceph-detect-version-qxhls                                            Created container cmd-reporter
openshift-storage                                       6s          Normal    Pulled                                       pod/rook-ceph-detect-version-qxhls                                            Container image "quay.io/rhceph-dev/rhceph@sha256:2aca817ad21c8b204d8fdee03a0cfee6e2cc7a177b0b25b46d4fabb9c3f099b3" already present on machine
openshift-storage                                       6s          Normal    Started                                      pod/rook-ceph-detect-version-qxhls                                            Started container cmd-reporter
openshift-storage                                       <unknown>   Normal    Scheduled                                    pod/rook-ceph-detect-version-p65nt                                            Successfully assigned openshift-storage/rook-ceph-detect-version-p65nt to ip-10-0-129-52.us-east-2.compute.internal
openshift-storage                                       3s          Normal    SuccessfulCreate                             job/rook-ceph-detect-version                                                  Created pod: rook-ceph-detect-version-p65nt
openshift-storage                                       1s          Normal    Started                                      pod/rook-ceph-detect-version-p65nt                                            Started container init-copy-binaries
openshift-storage                                       1s          Normal    Pulled                                       pod/rook-ceph-detect-version-p65nt                                            Container image "quay.io/rhceph-dev/rook-ceph@sha256:e4e20a1e8756a8b9847def42a60aa117d8ab5633c6eaec3f8013132c2800c72c" already present on machine
openshift-storage                                       1s          Normal    Created                                      pod/rook-ceph-detect-version-p65nt                                            Created container init-copy-binaries

$ oc get pod -n openshift-storage
NAME                                  READY   STATUS     RESTARTS   AGE
noobaa-operator-b4ff6749d-fvphd       1/1     Running    0          33m
ocs-operator-6b9cbfb878-w7c5x         0/1     Running    0          33m
rook-ceph-detect-version-fprzc        0/1     Init:0/1   0          0s
rook-ceph-operator-75b8479457-cm72h   1/1     Running    0          33m

$ oc get pod -n openshift-storage
NAME                                  READY   STATUS        RESTARTS   AGE
noobaa-operator-b4ff6749d-fvphd       1/1     Running       0          32m
ocs-operator-6b9cbfb878-w7c5x         0/1     Running       0          32m
rook-ceph-detect-version-hkllm        0/1     Terminating   0          8s
rook-ceph-operator-75b8479457-cm72h   1/1     Running       0          32m
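
To watch the create/terminate churn live, standard oc commands can be used (illustrative follow-up commands, not output captured from this run):

# Watch the detect-version pods being created and terminated
$ oc -n openshift-storage get pods -w | grep rook-ceph-detect-version

# List the related events in time order
$ oc -n openshift-storage get events --sort-by=.lastTimestamp | grep rook-ceph-detect-version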

Operator logs:
{"level":"info","ts":"2020-05-11T12:52:02.179Z","logger":"controller_storagecluster","msg":"Waiting on ceph cluster to initialize before starting noobaa","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"error","ts":"2020-05-11T12:52:02.184Z","logger":"controller_storagecluster","msg":"Failed to set PhaseProgressing","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","error":"Operation cannot be fulfilled on storageclusters.ocs.openshift.io \"ocs-storagecluster\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/go-logr/zapr/zapr.go:128\ngithub.com/openshift/ocs-operator/pkg/controller/storagecluster.(*ReconcileStorageCluster).Reconcile\n\t/go/src/github.com/openshift/ocs-operator/pkg/controller/storagecluster/reconcile.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:216\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:192\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/src/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:171\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker-fm\n\t/go/src/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:157\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
{"level":"error","ts":"2020-05-11T12:52:02.197Z","logger":"controller_storagecluster","msg":"Failed to set PhaseProgressing","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","error":"Operation cannot be fulfilled on storageclusters.ocs.openshift.io \"ocs-storagecluster\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/go-logr/zapr/zapr.go:128\ngithub.com/openshift/ocs-operator/pkg/controller/storagecluster.(*ReconcileStorageCluster).Reconcile\n\t/go/src/github.com/openshift/ocs-operator/pkg/controller/storagecluster/reconcile.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:216\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:192\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/src/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:171\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker-fm\n\t/go/src/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:157\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
{"level":"error","ts":"2020-05-11T12:52:02.203Z","logger":"controller_storagecluster","msg":"Failed to update status","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","error":"Operation cannot be fulfilled on storageclusters.ocs.openshift.io \"ocs-storagecluster\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/go-logr/zapr/zapr.go:128\ngithub.com/openshift/ocs-operator/pkg/controller/storagecluster.(*ReconcileStorageCluster).Reconcile\n\t/go/src/github.com/openshift/ocs-operator/pkg/controller/storagecluster/reconcile.go:332\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:216\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:192\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/src/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:171\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker-fm\n\t/go/src/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:157\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
{"level":"error","ts":"2020-05-11T12:52:02.203Z","logger":"controller-runtime.controller","msg":"Reconciler error","controller":"storagecluster-controller","request":"openshift-storage/ocs-storagecluster","error":"Operation cannot be fulfilled on storageclusters.ocs.openshift.io \"ocs-storagecluster\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:218\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:192\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/src/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:171\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker-fm\n\t/go/src/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:157\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
{"level":"info","ts":"2020-05-11T12:52:03.203Z","logger":"controller_storagecluster","msg":"Reconciling StorageCluster","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":"2020-05-11T12:52:03.218Z","logger":"controller_storagecluster","msg":"not creating a CephObjectStore because the platform is AWS"}
{"level":"info","ts":"2020-05-11T12:52:03.271Z","logger":"controller_storagecluster","msg":"Waiting on ceph cluster to initialize before starting noobaa","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":"2020-05-11T12:52:03.303Z","logger":"controller_storagecluster","msg":"Reconciling StorageCluster","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":"2020-05-11T12:52:03.322Z","logger":"controller_storagecluster","msg":"not creating a CephObjectStore because the platform is AWS"}
{"level":"info","ts":"2020-05-11T12:52:03.366Z","logger":"controller_storagecluster","msg":"Waiting on ceph cluster to initialize before starting noobaa","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
W0511 13:04:10.857621       1 reflector.go:289] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204: watch of *v1.Template ended with: The resourceVersion for the provided watch is too old.
W0511 13:12:40.910782       1 reflector.go:289] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204: watch of *v1.Template ended with: The resourceVersion for the provided watch is too old.
W0511 13:25:17.991870       1 reflector.go:289] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204: watch of *v1.Template ended with: The resourceVersion for the provided watch is too old.

Version of all relevant components (if applicable):
OCS 4.5 internal build over OCP 4.4 nightly build


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes, we cannot even deploy OCS.


Is there any workaround available to the best of your knowledge?
NO


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1


Can this issue be reproduced?
Haven't tried yet, but most likely yes.


Can this issue be reproduced from the UI?
Haven't tried


If this is a regression, please provide more details to justify this:
Yes

Steps to Reproduce:
1. Install OCS 4.5 from internal build



Actual results:
Not able to deploy


Expected results:
OCS deployment should succeed.


Additional info:
Must gather:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-pr2034-b1671/jnk-pr2034-b1671_20200511T121001/logs/failed_testcase_ocs_logs_1589199655/deployment_ocs_logs/

Jenkins run:
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/7425/

Comment 3 Petr Balogh 2020-05-11 13:41:16 UTC
Just forgot to add the OCS build:
4.5.0-419.ci

Comment 5 Travis Nielsen 2020-05-11 15:12:59 UTC
From the Rook operator log, the CSI version is not being detected, so the CSI driver fails to load:

2020-05-11 12:51:47.996190 E | op-cluster: invalid csi version: failed to extract ceph CSI version: failed to parse version from: "quay.io/rhceph-dev/cephcsi@sha256:86087a7123945ce4f7f720539693395e5a6fc8175318d050d0d983af8ea0e216"

Then later Rook fails when attempting to start the cluster because the CSI driver is not initialized.

2020-05-11 14:02:46.148402 E | op-cluster: failed to create cluster in namespace "openshift-storage". failed to start the mons: failed to initialize ceph cluster info: failed to save mons: failed to update csi cluster config: failed to fetch current csi config map: configmaps "rook-ceph-csi-config" not found

OCS needs to set the operator flag ROOK_CSI_ALLOW_UNSUPPORTED_VERSION: "true", since the CSI version cannot be detected from the downstream image. There is a fix for this in OCS:
https://github.com/openshift/ocs-operator/pull/501/files, and there is a related BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1832889
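
For completeness, the same flag can be set by hand on a test cluster while waiting for the OCS fix; this is only an illustrative stop-gap (OLM may revert manual edits to the operator deployment), not the shipped fix:

# Let Rook accept a CSI image whose version string cannot be parsed
$ oc -n openshift-storage set env deployment/rook-ceph-operator ROOK_CSI_ALLOW_UNSUPPORTED_VERSION=true

# Once the CSI driver initializes, this config map should exist
$ oc -n openshift-storage get configmap rook-ceph-csi-config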

Comment 7 Petr Balogh 2020-05-13 13:20:58 UTC
I see that even 4.5.0-423.ci fails to deploy with the same issue.

Looking at the engineering job here: https://ceph-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/ocs-ci/440/console

I triggered a run on VMware here:
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/Tier1/job/qe-trigger-vsphere-upi-1az-rhcos-vsan-3m-3w-tier1/4/console

But I expect this to fail as well.

Comment 8 Travis Nielsen 2020-05-14 17:56:59 UTC
We need to get this merged for the CSI versioning issue: https://github.com/openshift/ocs-operator/pull/501

Comment 9 Petr Balogh 2020-05-15 12:28:00 UTC
Thanks Travis for the info.

I see the latest build was made 13 hours ago: 4.5.0-425.ci

And deployment in the engineering pipeline is failing with the same issue:
https://ceph-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/ocs-ci/441/console
I see https://github.com/openshift/ocs-operator/pull/501 got merged 18 hours ago.

Not sure if this build is supposed to have the fix or not, but it looks like it's still not deployable.

Comment 10 Travis Nielsen 2020-05-15 13:53:33 UTC
@Petr The backport PR was missed. Now we need to get this one merged and run a new build.
https://github.com/openshift/ocs-operator/pull/512

Comment 13 Travis Nielsen 2020-05-21 15:25:54 UTC
@umanga thanks for the analysis. There are two possible changes for this:
1. The OCS operator could generate the storageClassDeviceSet in the CephCluster CR with the name "data" instead of leaving it blank
2. Rook should default the name to "data" if it is blank

We need #2 anyway for the Rook default and backward compatibility, so we will make the change in Rook.
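
As a quick check of which case a cluster is in, the generated device set name can be read off the CephCluster CR (illustrative command; spec.storage.storageClassDeviceSets is the relevant field):

$ oc -n openshift-storage get cephcluster -o jsonpath='{.items[*].spec.storage.storageClassDeviceSets[*].name}'

An empty result means the name was left blank, which is the case Rook will now default to "data".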

Comment 14 Travis Nielsen 2020-05-21 17:26:40 UTC
Local testing looks good, now working on getting the PR merged and backported...
https://github.com/rook/rook/pull/5524

Comment 15 Travis Nielsen 2020-05-21 21:05:47 UTC
The fix is merged to the downstream release-4.5 branch now and will be picked up in the next 4.5 build.
https://github.com/openshift/rook/pull/60

Comment 22 Petr Balogh 2020-06-15 07:44:42 UTC
In the latest builds we don't see this issue, so I'm marking it as verified.

Comment 24 errata-xmlrpc 2020-09-15 10:17:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3754

