Description of problem (please be as detailed as possible and provide log snippets):
When trying to create RBD PVCs using the ocs-storagecluster-ceph-rbd storageclass on a cluster encrypted with KMS (internally deployed), the PVC stays in Pending state and is never created.

Version of all relevant components (if applicable):
OCP: 4.7.0-rc.2
OCS: 4.7.0-262.ci
rook: 4.7-93.bf9b9ddb1.release_4.7
ceph: 14.2.11-112.el8cp

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
yes

Is there any workaround available to the best of your knowledge?
no

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Can this issue be reproduced?
yes

Can this issue be reproduced from the UI?
yes

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy OCP 4.7 on VMware with LSO
2. Deploy an internal KMS server on the OCP cluster
3. Deploy OCS 4.7 with cluster encryption using the KMS server
4. Create a new PVC from the UI on the ocs-storagecluster-ceph-rbd storageclass (or with a manifest, see the sketch after this comment)

Actual results:
The PVC stays in Pending state

Expected results:
The PVC reaches Bound state

Additional info:
For installing the KMS server I used https://docs.google.com/document/d/1WvTQf3XfKDW9AFT2BRPKsjHCoVODxW2xtZxA5gjTQQ8/
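For reference, a minimal PVC manifest that exercises the same path without the UI (a sketch only: the name and size below are made up, the storageclass is the one from this report):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-rbd-pvc            # hypothetical name
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi             # arbitrary size
  storageClassName: ocs-storagecluster-ceph-rbd

oc create -f test-rbd-pvc.yaml -n <any namespace>
oc get pvc test-rbd-pvc -n <same namespace>   # stays Pending on affected clusters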
I hit the same issue on VMware-LSO even when encryption is not enabled, and on CephFS as well.
I agree with what Neha says, RBD PVC creation fails because the storage pool is not created:

value:"csi-rbd-provisioner" > secrets:<key:"userKey" value:"AQDiJilgcTTwABAABzkstz84hwkuaW4q472SeQ==" >
2021-02-14T13:34:53.002249410Z I0214 13:34:53.002233 1 connection.go:182] GRPC call: /csi.v1.Controller/CreateVolume
2021-02-14T13:34:53.002486698Z I0214 13:34:53.002239 1 connection.go:183] GRPC request: {"capacity_range":{"required_bytes":53687091200},"name":"pvc-d281a1f3-e639-4fcf-9c58-4b42f8a76cee","parameters":{"clusterID":"openshift-storage","csi.storage.k8s.io/pv/name":"pvc-d281a1f3-e639-4fcf-9c58-4b42f8a76cee","csi.storage.k8s.io/pvc/name":"db-noobaa-db-pg-0","csi.storage.k8s.io/pvc/namespace":"openshift-storage","imageFeatures":"layering","imageFormat":"2","pool":"ocs-storagecluster-cephblockpool"},"secrets":"***stripped***","volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}]}
2021-02-14T13:34:53.062075634Z I0214 13:34:53.061933 1 connection.go:185] GRPC response: {}
2021-02-14T13:34:53.062103885Z I0214 13:34:53.062061 1 connection.go:186] GRPC error: rpc error: code = Internal desc = pool not found: pool (ocs-storagecluster-cephblockpool) not found in Ceph cluster
2021-02-14T13:34:53.062110542Z I0214 13:34:53.062099 1 controller.go:752] CreateVolume failed, supports topology = false, node selected false => may reschedule = false => state = Finished: rpc error: code = Internal desc = pool not found: pool (ocs-storagecluster-cephblockpool) not found in Ceph cluster
2021-02-14T13:34:53.062158998Z I0214 13:34:53.062145 1 controller.go:1102] Final error received, removing PVC d281a1f3-e639-4fcf-9c58-4b42f8a76cee from claims in progress
2021-02-14T13:34:53.062164881Z W0214 13:34:53.062156 1 controller.go:961] Retrying syncing claim "d281a1f3-e639-4fcf-9c58-4b42f8a76cee", failure 0
2021-02-14T13:34:53.062197589Z E0214 13:34:53.062170 1 controller.go:984] error syncing claim "d281a1f3-e639-4fcf-9c58-4b42f8a76cee": failed to provision volume with StorageClass "ocs-storagecluster-ceph-rbd": rpc error: code = Internal desc = pool not found: pool (ocs-storagecluster-cephblockpool) not found in Ceph cluster

Not a CSI issue.
The "ocs-storagecluster-cephblockpool" was created successfully, see: 2021-02-14T13:35:06.941839975Z 2021-02-14 13:35:06.941790 I | cephclient: creating replicated pool ocs-storagecluster-cephblockpool succeeded In http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/BZ-1928471/no_encryption/logs-20210214-160608/ocs-must-gather-us/quay-io-ocs-dev-ocs-must-gather-sha256-18613a28bd1cac187a657d842ab4fab9facd06958c928d9819cd74f0a44326e0/namespaces/openshift-storage/pods/rook-ceph-operator-8454c6f88-22qvs/rook-ceph-operator/rook-ceph-operator/logs/current.log See the timestamp, the creation was successful at 2021-02-14 13:35:06.941790 and ceph-csi tried at 2021-02-14T13:34:53.062197589Z, so a bit too early?
> That could be the reason, but doesn't CSI keep trying to create a PVC until
> we stop the attempt? So once the CBP was created, why didn't the noobaa DB
> PVC get to Bound state?

This is a known issue/behaviour in CSI: once we hit the deadline exceeded error we kind of lose that request, and future requests go into an endless loop saying there is already an operation in progress for the PVC in question. This occurs most of the time when the cephcsi node plugin is not able to get any response from the Ceph cluster, because the Ceph cluster is not healthy or there are slow ops. Madhu, please correct me if I am wrong.
Doesn't look like a rook issue, moving it to ocs-operator.
José, before proceeding with any SC creation I'd wait for the CephBlockPool CR status to display "Ready". We need to wait for that resource to become available. Does that work for you?
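A sketch of the kind of check I mean, assuming the CephBlockPool CR exposes the phase under .status.phase (which is what recent Rook versions report):

oc -n openshift-storage get cephblockpool ocs-storagecluster-cephblockpool -o jsonpath='{.status.phase}'

or as a crude wait loop before creating PVCs:

until [ "$(oc -n openshift-storage get cephblockpool ocs-storagecluster-cephblockpool -o jsonpath='{.status.phase}')" = "Ready" ]; do sleep 5; done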
It seems the inability to create volumes is an effect of the state the cluster is in (100% PGs inactive/unknown). That may just be normal and correct behaviour. But why is the cluster in this state in the first place? That's the question.
There is something wrong with the CRUSH map. We should have hosts and racks but we have nothing. The labels have been applied correctly on the nodes and Rook also used them to prepare the OSDs. The prepare and main OSD specs are correct. I have looked at all the logs but I cannot find the "ceph command outputs" anywhere; those would be useful. In the meantime I've asked Prasad Desala for the env, I'm waiting so I can investigate further.
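For whoever gets to the env first, the outputs I'm after can be grabbed from the toolbox — a sketch, assuming the usual rook-ceph-tools pod label:

oc -n openshift-storage rsh $(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name)
# then, inside the toolbox:
ceph status
ceph osd tree
ceph osd crush tree
ceph osd crush rule dump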
After logging into the system, I can tell that for some reason the OSDs are not registering their CRUSH location during their initial startup, although the flags on the CLI are correct. After restarting one OSD, it successfully registered itself in the CRUSH map correctly and the tree looks a bit better:

[root@compute-2 /]# ceph osd tree
ID CLASS WEIGHT  TYPE NAME                                STATUS REWEIGHT PRI-AFF
-1       0.09760 root default
-4       0.09760     rack rack2
-3       0.09760         host ocs-deviceset-0-data-0dxksr
 0   hdd 0.09760             osd.0                        up     1.00000 1.00000
 1   hdd       0             osd.1                        up     1.00000 1.00000
 2   hdd       0             osd.2                        up     1.00000 1.00000

The only thing that changed recently is the ceph version, for a few arbiter fixes. I've looked into Rook and I can't see what could cause this; we have done very few backports recently.

We continue the investigation. Most likely not an OCS-Op bug, so moving to Rook and perhaps to Ceph eventually.
(In reply to Sébastien Han from comment #25)
> After logging into the system, I can tell that for some reason the OSDs are
> not registering their CRUSH location during their initial startup, although
> the flags on the CLI are correct. After restarting one OSD, it successfully
> registered itself in the CRUSH map correctly and the tree looks a bit better:
>
> [root@compute-2 /]# ceph osd tree
> ID CLASS WEIGHT  TYPE NAME                                STATUS REWEIGHT PRI-AFF
> -1       0.09760 root default
> -4       0.09760     rack rack2
> -3       0.09760         host ocs-deviceset-0-data-0dxksr
>  0   hdd 0.09760             osd.0                        up     1.00000 1.00000
>  1   hdd       0             osd.1                        up     1.00000 1.00000
>  2   hdd       0             osd.2                        up     1.00000 1.00000
>
> The only thing that changed recently is the ceph version, for a few arbiter
> fixes. I've looked into Rook and I can't see what could cause this; we have
> done very few backports recently.
>
> We continue the investigation. Most likely not an OCS-Op bug, so moving to
> Rook and perhaps to Ceph eventually.

Sébastien,
Will restarting all the OSDs fix the issue, as a very temporary workaround?

Orit
(In reply to Orit Wasserman from comment #26)
> (In reply to Sébastien Han from comment #25)
> > After logging into the system, I can tell that for some reason the OSDs
> > are not registering their CRUSH location during their initial startup,
> > although the flags on the CLI are correct. After restarting one OSD, it
> > successfully registered itself in the CRUSH map correctly and the tree
> > looks a bit better:
> >
> > [root@compute-2 /]# ceph osd tree
> > ID CLASS WEIGHT  TYPE NAME                                STATUS REWEIGHT PRI-AFF
> > -1       0.09760 root default
> > -4       0.09760     rack rack2
> > -3       0.09760         host ocs-deviceset-0-data-0dxksr
> >  0   hdd 0.09760             osd.0                        up     1.00000 1.00000
> >  1   hdd       0             osd.1                        up     1.00000 1.00000
> >  2   hdd       0             osd.2                        up     1.00000 1.00000
> >
> > The only thing that changed recently is the ceph version, for a few
> > arbiter fixes. I've looked into Rook and I can't see what could cause
> > this; we have done very few backports recently.
> >
> > We continue the investigation. Most likely not an OCS-Op bug, so moving
> > to Rook and perhaps to Ceph eventually.
>
> Sébastien,
> Will restarting all the OSDs fix the issue, as a very temporary workaround?
>
> Orit

Yes, but not ideal.
> > Sébastien,
> > Will restarting all the OSDs fix the issue, as a very temporary workaround?
> >
> > Orit
>
> Yes, but not ideal.

But can it still unblock QA?
(In reply to Mudit Agarwal from comment #28)
> > > Sébastien,
> > > Will restarting all the OSDs fix the issue, as a very temporary workaround?
> > >
> > > Orit
> >
> > Yes, but not ideal.
>
> But can it still unblock QA?

Yes.

oc delete pod/<osd pod id>

On all OSD pods.
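If it helps, a single command for "all OSD pods", assuming the default Rook label on them (app=rook-ceph-osd):

oc -n openshift-storage delete pod -l app=rook-ceph-osd

The OSD deployments recreate the pods, and the restart is what re-registers the CRUSH location.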
There is another BZ that looks like the same root cause:
https://bugzilla.redhat.com/show_bug.cgi?id=1929565#c7

> Can we capture monitor logs with debug_mon=20, debug_ms=1, debug_paxos 20 and debug_crush 20 to verify this?

Can we gather more detailed logging per Neha's request? Let me know if you need assistance to increase the logging.
We cannot easily increase the logging since the issue appears at boot time, so we must increase the logs before the process starts. Essentially, once the mons are up and running:

* quickly jump into the toolbox, BEFORE THE OSDs START
* run "ceph config set mon.* debug_mon 20", and repeat for all the args mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1928471#c30 (see the sketch after this comment)

Ideally we would use the rook-config-override configmap, but the ocs-op will reconcile it, so it's not practical.
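Spelled out, a sketch of the full set to run inside the toolbox, based on the args from comment #30 (I'm targeting the "mon" section here; adjust the target if you prefer per-daemon settings):

ceph config set mon debug_mon 20
ceph config set mon debug_ms 1
ceph config set mon debug_paxos 20
ceph config set mon debug_crush 20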
*** Bug 1929565 has been marked as a duplicate of this bug. ***
Not sure if you are still interested in new occurrences of the issue: I was asked by Boris to run acceptance tests for this latest build, 4.7.0-266.ci. I see it has failed deployment here: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/8/console

14:41:23 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get pod -l 'app=rook-ceph-tools' -o jsonpath='{.items[0].metadata.name}'
14:41:23 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage exec rook-ceph-tools-5cb9b5d8cd-468sj -- ceph health
14:41:24 - MainThread - ocs_ci.deployment.deployment - WARNING - Ceph health check failed with Ceph cluster health is not OK. Health: HEALTH_WARN 1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set; Degraded data redundancy: 540/1620 objects degraded (33.333%), 91 pgs degraded, 200 pgs undersized
14:41:24 - MainThread - ocs_ci.deployment.deployment - INFO - Patch thin storageclass as non-default
14:41:24 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc patch storageclass thin -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}' --request-timeout=120s
14:41:24 - MainThread - ocs_ci.ocs.utils - INFO - Must gather image: quay.io/rhceph-dev/ocs-must-gather:latest-4.7 will be used.
14:41:24 - MainThread - ocs_ci.ocs.utils - INFO - OCS logs will be placed in location /home/jenkins/current-cluster-dir/logs/deployment_1613655379/ocs_must_gather
14:41:24 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.7 --dest-dir=/home/jenkins/current-cluster-dir/logs/deployment_1613655379/ocs_must_gather

Logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j009vu1cs33-a/j009vu1cs33-a_20210218T133012/logs/failed_testcase_ocs_logs_1613655379/test_deployment_ocs_logs/
The latest report from QA indicates we are leaning toward a Ceph issue, thus moving to Ceph and assigning to Neha since she was already investigating. Now we need a way to produce the log level Neha has requested.
Last relevant comment: https://bugzilla.redhat.com/show_bug.cgi?id=1929565#c11
(In reply to Sébastien Han from comment #34)
> The latest report from QA indicates we are leaning toward a Ceph issue,
> thus moving to Ceph and assigning to Neha since she was already
> investigating.
> Now we need a way to produce the log level Neha has requested.

Just to give some more context on why we are leaning towards a Ceph issue: this issue is reproducible when we build OCS with RHCS 4.2z1 and not with the RHCS 4.2 async. AFAIK, only a very few arbiter-related commits went into RHCS 4.2z1, and we should revisit those commits to see if one of them introduced an issue.
I tried the workaround (deleting the OSD pods) and it is not working; the noobaa-db pod is still in Pending state.
(In reply to Avi Liani from comment #42)
> I tried the workaround (deleting the OSD pods) and it is not working; the
> noobaa-db pod is still in Pending state.

Please be more specific: what does the osd tree look like after the restart? If the OSDs still don't register, then increase the log level as requested earlier. Thanks
Moving back to assigned since the merge was only for the configuration that will allow debugging.
(In reply to Travis Nielsen from comment #45)
> Moving back to assigned since the merge was only for the configuration that
> will allow debugging.

Thanks Travis! Jose or Travis, could you please provide details on how to enable the debugging now?
FWIW, the currently running nightly OCS 4.7 build might just not have picked up the patch. We will see...
https://storage-jenkins-csb-ceph.cloud.paas.psi.redhat.com/job/OCS%20Build%20Pipeline%204.7/151/

We might need to trigger another build.
This configmap needs to be created before the OCS cluster is created:

kind: ConfigMap
apiVersion: v1
metadata:
  name: rook-config-override
  namespace: openshift-storage
data:
  config: |
    [global]
    mon_osd_full_ratio = .85
    mon_osd_backfillfull_ratio = .8
    mon_osd_nearfull_ratio = .75
    mon_max_pg_per_osd = 600
    [mon]
    debug_mon=20
    debug_ms=1
    debug_paxos=20
    debug_crush=20
    [osd]
    osd_memory_target_cgroup_limit_ratio = 0.5

Jose, do we need to change the reconcile setting as well?
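For the record, applying it is just the following (assuming the YAML above is saved locally as rook-config-override.yaml and the openshift-storage namespace already exists):

oc create -f rook-config-override.yaml

As for the reconcile question, it looks like the StorageCluster in the next comment handles that with spec.managedResources.cephConfig.reconcileStrategy: ignore.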
Tried reproducing with ocs-registry:4.7.0-273.ci, both while Ceph is configured to run in debug log level and while it is not.

-------------------------------------------------------------------------------
With Ceph in debug log level:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-test-pr/183/testReport/
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-test-pr/186/testReport/

19:41:42 - MainThread - ocs_ci.ocs.ocp - INFO - Resource ocs-operator.v4.7.0-273.ci is in phase: Succeeded!
19:41:42 - MainThread - ocs_ci.utility.templating - INFO -
apiVersion: v1
data:
  config: '[global]
    mon_osd_full_ratio = .85
    mon_osd_backfillfull_ratio = .8
    mon_osd_nearfull_ratio = .75
    mon_max_pg_per_osd = 600
    [mon]
    debug_mon=20
    debug_ms=1
    debug_paxos=20
    debug_crush=20
    [osd]
    osd_memory_target_cgroup_limit_ratio = 0.5
    '
kind: ConfigMap
metadata:
  name: rook-config-override
  namespace: openshift-storage

---------------------

19:41:42 - MainThread - ocs_ci.deployment.deployment - INFO - Setting Ceph to work in debug log level using a new configmap resource
19:41:42 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc create -f /tmp/config_maphi2htxsb
19:41:42 - MainThread - ocs_ci.utility.templating - INFO -
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: ocs-storagecluster
  namespace: openshift-storage
spec:
  managedResources:
    cephConfig:
      reconcileStrategy: ignore
  storageDeviceSets:
  - count: 1
    dataPVCTemplate:
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 256Gi
        storageClassName: thin
        volumeMode: Block
    name: ocs-deviceset
    placement: {}
    portable: true
    replica: 3
    resources: {}

19:41:42 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc create -f /tmp/cluster_storagellp1grvc

---------------------

A snippet from the mon logs:

debug 2021-02-24 19:59:16.315 7f7092a54700 20 allow all
debug 2021-02-24 19:59:16.315 7f7092a54700 10 mon.b@1(peon).elector(12) handle_ping mon_ping(ping stamp 2021-02-24 19:59:16.316331) v1
debug 2021-02-24 19:59:16.315 7f7092a54700  1 -- [v2:172.30.127.73:3300/0,v1:172.30.127.73:6789/0] --> [v2:172.30.193.169:3300/0,v1:172.30.193.169:6789/0] -- mon_ping(ping_reply stamp 2021-02-24 19:59:16.316331) v1 -- 0x5591d938b8c0 con 0x5591d7065180
debug 2021-02-24 19:59:16.330 7f7095259700 20 mon.b@1(peon).elector(12) dead_ping to peer 2

-------------------------------------------------------------------------------
With Ceph not in debug:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-test-pr/187/testReport/
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/848/testReport/
-------------------------------------------------------------------------------

All the deployment attempts were successful and the bug was not reproduced. I suggest keeping this bug open and continuing to examine the upcoming OCS 4.7 builds, starting with ocs-registry:4.7.0-273.ci, which will consume the latest Ceph image. In case of no reproduction, we can move the bug to VERIFIED.
I just deployed a cluster with Arbiter on VMware-LSO, and it succeeded without any WA.

OCP: 4.7.0
OCS: ocs-operator.v4.7.0-278.ci
ceph version 14.2.11-123.el8cp

IMO, it can be moved to VERIFIED.
Hi,

There was a suspicion, raised by Mudit, about why we are unable to reproduce this BZ with the latest OCS 4.7 builds: that the fix for bug 1931810 also prevented this BZ from reproducing.

To check this, I tried deploying with an OCS 4.7 build from before the fix for bug 1931810, ocs-registry:4.7.0-268.ci, while changing the CSV prior to storagecluster creation to consume a newer Ceph image, the one we consume in the latest OCS 4.7 builds. This is in order to isolate the factor of the fix for bug 1931810.

Executed here - https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/922/

And it indeed reproduced:

E  Events:
E    Type     Reason            Age  From               Message
E    ----     ------            ---- ----               -------
E    Warning  FailedScheduling  10m  default-scheduler  0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims.
E    Warning  FailedScheduling  10m  default-scheduler  0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims.
Elad, thanks for the deeper validation that this OSD registration issue appears to be fixed by 1931810.

@Neha Can you see why this would be related though? I don't see the relationship.

The issue fixed by 1931810 was that the CRUSH rules for pools were incorrectly being created with two steps that were both for the same level. For example, a pool was being created with a rule to select from the zone bucket, then select another "zone" bucket. The two-step rules were only intended for stretch clusters, but were being applied incorrectly to all clusters. This put the PGs in a place where they cannot be fulfilled, since the second step must be at another level from the first bucket (e.g. rack or host). The fix was to use a single rule from the zone bucket in non-stretch scenarios.

The issue seen for this BZ and described in https://bugzilla.redhat.com/show_bug.cgi?id=1929565#c7 is that the OSD is not registering correctly at first startup, but it did not always happen. If the bad CRUSH rule was created before the OSD was started, could it affect the weight assigned to the OSD? If so, it explains the behavior and we can close this BZ. But if it's not related, I still don't see why this BZ was fixed by 1931810.
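For anyone who wants to check which case a given cluster fell into, the rule can be inspected from the toolbox with standard Ceph commands (a sketch; the pool name is the one from this report):

ceph osd pool get ocs-storagecluster-cephblockpool crush_rule
ceph osd crush rule dump <rule name printed by the previous command>

A rule with the bad pattern shows two steps against the same level; the fixed rule has a single step from the zone bucket.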
(In reply to Travis Nielsen from comment #53)
> Elad, thanks for the deeper validation that this OSD registration issue
> appears to be fixed by 1931810.

@Elad, is https://bugzilla.redhat.com/show_bug.cgi?id=1928471#c52 with additional logging?

> @Neha Can you see why this would be related though? I don't see the
> relationship.

I don't see a direct correlation between these issues, but additional logging will definitely help us figure out if there is one at all. I'd like to emphasize the importance of enabling logging for testing by default; that will save us a lot of back and forth in cases like this.

> The issue fixed by 1931810 was that the CRUSH rules for pools were
> incorrectly being created with two steps that were both for the same level.
> For example, a pool was being created with a rule to select from the zone
> bucket, then select another "zone" bucket. The two-step rules were only
> intended for stretch clusters, but were being applied incorrectly to all
> clusters. This put the PGs in a place where they cannot be fulfilled, since
> the second step must be at another level from the first bucket (e.g. rack or
> host). The fix was to use a single rule from the zone bucket in non-stretch
> scenarios.
>
> The issue seen for this BZ and described in
> https://bugzilla.redhat.com/show_bug.cgi?id=1929565#c7 is that the OSD is
> not registering correctly at first startup, but it did not always happen. If
> the bad CRUSH rule was created before the OSD was started, could it affect
> the weight assigned to the OSD? If so, it explains the behavior and we can
> close this BZ. But if it's not related, I still don't see why this BZ was
> fixed by 1931810.
Hi Neha,

This is not in debug. I am using an old OCS 4.7 build for the reproduction, but this build doesn't have https://github.com/openshift/ocs-operator/pull/1091 included, so OCS deployment with Ceph in debug is not possible.
@Neha @Elad Since there is no repro and we can't get debug logs at this point, shall we move this to Verified? It seems there is not much else to do for now. Going forward, the increased logging would be available though when other issues are hit.
(In reply to Travis Nielsen from comment #56)
> @Neha @Elad Since there is no repro and we can't get debug logs at this
> point, shall we move this to Verified? It seems there is not much else to do
> for now. Going forward, the increased logging would be available though when
> other issues are hit.

Sounds good to me; there is not much we can do without a reproducer (with enough logs).
Moving to Verified per comments above
Actually, intended to move to ON_QA first and QE can move to verified...
After my deployment (see Comment #51), moving to VERIFIED
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2041