Bug 1928471 - [Deployment blocker] Ceph OSDs do not register properly in the CRUSH map
Summary: [Deployment blocker] Ceph OSDs do not register properly in the CRUSH map
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ceph
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: OCS 4.7.0
Assignee: Neha Ojha
QA Contact: Avi Liani
URL:
Whiteboard:
Duplicates: 1929565
Depends On:
Blocks: 1859570
 
Reported: 2021-02-14 10:19 UTC by Avi Liani
Modified: 2021-05-19 09:20 UTC
CC: 19 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-19 09:20:00 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2021:2041 0 None None None 2021-05-19 09:20:29 UTC

Description Avi Liani 2021-02-14 10:19:40 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

When trying to create RBD PVCs using the ocs-storagecluster-ceph-rbd storageclass on a cluster encrypted with KMS (internally deployed), the PVC stays in Pending state and is never created.

Version of all relevant components (if applicable):

OCP:   4.7.0-rc.2
OCS:   4.7.0-262.ci
rook:  4.7-93.bf9b9ddb1.release_4.7
ceph:  14.2.11-112.el8cp


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

yes

Is there any workaround available to the best of your knowledge?

no

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

1

Is this issue reproducible?

yes

Can this issue be reproduced from the UI?

yes

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy OCP 4.7 on VMware with LSO
2. Deploy an internal KMS server on the OCP cluster
3. Deploy OCS 4.7 with cluster encryption using the KMS server
4. From the UI, create a new PVC using the ocs-storagecluster-ceph-rbd storageclass (a CLI equivalent is sketched below)
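
For reference, a rough CLI equivalent of step 4 (a minimal sketch; the PVC name and size are arbitrary placeholders, only the storage class name comes from this bug):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-rbd-pvc              # hypothetical name
  namespace: openshift-storage
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi               # arbitrary size
  storageClassName: ocs-storagecluster-ceph-rbd

Apply it with "oc create -f <file>"; the PVC is expected to reach Bound, but in this bug it stays Pending.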


Actual results:

The PVC stays in Pending state.

Expected results:


Additional info:

For installing the KMS server I used https://docs.google.com/document/d/1WvTQf3XfKDW9AFT2BRPKsjHCoVODxW2xtZxA5gjTQQ8/

Comment 4 Avi Liani 2021-02-14 13:54:53 UTC
I hit the same issue on VMware LSO even when it is not encrypted, and with CephFS as well.

Comment 8 Mudit Agarwal 2021-02-15 06:28:08 UTC
I agree with what Neha says; RBD PVC creation fails because the storage pool is not created:

value:"csi-rbd-provisioner" > secrets:<key:"userKey" value:"AQDiJilgcTTwABAABzkstz84hwkuaW4q472SeQ==" > 
2021-02-14T13:34:53.002249410Z I0214 13:34:53.002233       1 connection.go:182] GRPC call: /csi.v1.Controller/CreateVolume
2021-02-14T13:34:53.002486698Z I0214 13:34:53.002239       1 connection.go:183] GRPC request: {"capacity_range":{"required_bytes":53687091200},"name":"pvc-d281a1f3-e639-4fcf-9c58-4b42f8a76cee","parameters":{"clusterID":"openshift-storage","csi.storage.k8s.io/pv/name":"pvc-d281a1f3-e639-4fcf-9c58-4b42f8a76cee","csi.storage.k8s.io/pvc/name":"db-noobaa-db-pg-0","csi.storage.k8s.io/pvc/namespace":"openshift-storage","imageFeatures":"layering","imageFormat":"2","pool":"ocs-storagecluster-cephblockpool"},"secrets":"***stripped***","volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}]}
2021-02-14T13:34:53.062075634Z I0214 13:34:53.061933       1 connection.go:185] GRPC response: {}
2021-02-14T13:34:53.062103885Z I0214 13:34:53.062061       1 connection.go:186] GRPC error: rpc error: code = Internal desc = pool not found: pool (ocs-storagecluster-cephblockpool) not found in Ceph cluster
2021-02-14T13:34:53.062110542Z I0214 13:34:53.062099       1 controller.go:752] CreateVolume failed, supports topology = false, node selected false => may reschedule = false => state = Finished: rpc error: code = Internal desc = pool not found: pool (ocs-storagecluster-cephblockpool) not found in Ceph cluster
2021-02-14T13:34:53.062158998Z I0214 13:34:53.062145       1 controller.go:1102] Final error received, removing PVC d281a1f3-e639-4fcf-9c58-4b42f8a76cee from claims in progress
2021-02-14T13:34:53.062164881Z W0214 13:34:53.062156       1 controller.go:961] Retrying syncing claim "d281a1f3-e639-4fcf-9c58-4b42f8a76cee", failure 0
2021-02-14T13:34:53.062197589Z E0214 13:34:53.062170       1 controller.go:984] error syncing claim "d281a1f3-e639-4fcf-9c58-4b42f8a76cee": failed to provision volume with StorageClass "ocs-storagecluster-ceph-rbd": rpc error: code = Internal desc = pool not found: pool (ocs-storagecluster-cephblockpool) not found in Ceph cluster

Not a CSI issue.

Comment 9 Sébastien Han 2021-02-15 10:58:27 UTC
The "ocs-storagecluster-cephblockpool" was created successfully, see:

2021-02-14T13:35:06.941839975Z 2021-02-14 13:35:06.941790 I | cephclient: creating replicated pool ocs-storagecluster-cephblockpool succeeded


In http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/BZ-1928471/no_encryption/logs-20210214-160608/ocs-must-gather-us/quay-io-ocs-dev-ocs-must-gather-sha256-18613a28bd1cac187a657d842ab4fab9facd06958c928d9819cd74f0a44326e0/namespaces/openshift-storage/pods/rook-ceph-operator-8454c6f88-22qvs/rook-ceph-operator/rook-ceph-operator/logs/current.log

See the timestamps: the pool creation succeeded at 2021-02-14 13:35:06.941790 while ceph-csi tried at 2021-02-14T13:34:53.062197589Z, so a bit too early?

Comment 11 Mudit Agarwal 2021-02-15 13:18:34 UTC
>> That could be the reason, but doesn't CSI keep trying to create a PVC until we stop the attempt? So once the CBP was created, why didn't the noobaa DB PVC get to Bound state?
This is a known issue/behaviour in CSI: once we hit the deadline-exceeded error we kind of lose that request, and future requests go into an endless loop saying there is already an operation in progress for the PVC in question.
This occurs most of the time when the cephcsi node plugin is not able to get any response from the Ceph cluster because the cluster is not healthy or there are slow ops.
Madhu, please correct me if I am wrong.

Comment 18 Mudit Agarwal 2021-02-16 15:57:12 UTC
Doesn't look like a rook issue, moving it to ocs-operator.

Comment 20 Sébastien Han 2021-02-16 17:46:33 UTC
José, before proceeding with any SC creation I'd wait for the CephBlockPool CR status to display "Ready".
We would need to wait for that resource to become available (see the sketch below).

Does that work for you?
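
For example, a minimal sketch of such a readiness check (assuming the default CR name used in this bug, ocs-storagecluster-cephblockpool, and that the CephBlockPool status exposes a phase field):

# wait for the pool CR to report Ready before creating the StorageClass
oc -n openshift-storage get cephblockpool ocs-storagecluster-cephblockpool \
  -o jsonpath='{.status.phase}'
# expected output once the pool is usable: Ready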

Comment 23 Michael Adam 2021-02-17 08:24:04 UTC
It seems the inability to create volumes is an effect of the state the cluster is in (100% of PGs inactive/unknown).
That may just be normal and correct behaviour.

But why is the cluster in this state in the first place? That's the question.

Comment 24 Sébastien Han 2021-02-17 14:07:50 UTC
There is something wrong with the CRUSH map.
We should have hosts and racks, but we have nothing. The labels have been applied correctly on the nodes, and Rook also used them to prepare the OSDs.
The prepare and main OSD specs are correct.


I have looked at all the logs but I cannot find the "ceph command outputs" anywhere.
That'd be useful.

In the meantime I've asked Prasad Desala for the env; I'm waiting so I can investigate further.

Comment 25 Sébastien Han 2021-02-17 14:49:11 UTC
After logging into the system, I can tell that for some reason the OSDs are not registering their CRUSH location during their initial startup, although the flags on the CLI are correct.
After restarting one OSD, it registered itself in the CRUSH map correctly and the tree looks a bit better:

[root@compute-2 /]# ceph osd tree
ID CLASS WEIGHT  TYPE NAME                                STATUS REWEIGHT PRI-AFF
-1       0.09760 root default
-4       0.09760     rack rack2
-3       0.09760         host ocs-deviceset-0-data-0dxksr
 0   hdd 0.09760             osd.0                            up  1.00000 1.00000
 1   hdd       0 osd.1                                        up  1.00000 1.00000
 2   hdd       0 osd.2                                        up  1.00000 1.00000

The only thing that changed recently is the Ceph version, for a few arbiter fixes.
I've looked into Rook and I can't see what could cause this; we have done very few backports recently.

We'll continue the investigation. Most likely not an OCS-op bug, so moving to Rook and perhaps to Ceph eventually.
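
Purely to illustrate what "registering in the CRUSH map" means (this is not a step that was taken here; the restart above is what fixed it), an OSD's location and weight can in principle also be set by hand from the toolbox, e.g. for osd.1 using the weight and rack from the tree above and a placeholder host bucket:

ceph osd crush create-or-move osd.1 0.09760 root=default rack=rack2 host=<host-bucket-name>

This is roughly what an OSD does itself at startup when osd_crush_update_on_start is enabled.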

Comment 26 Orit Wasserman 2021-02-17 16:37:21 UTC
(In reply to Sébastien Han from comment #25)
> After logging into the system, I can tell that for some reason the OSDs are
> not registering their CRUSH location during their initial startup. Although,
> the flag on the CLI are correct.
> After restarting one OSD, it successfully registered itself in the CRUSH map
> correctly and the tree looks a bit better:
> 
> [root@compute-2 /]# ceph osd tree
> ID CLASS WEIGHT  TYPE NAME                                STATUS REWEIGHT
> PRI-AFF
> -1       0.09760 root default
> -4       0.09760     rack rack2
> -3       0.09760         host ocs-deviceset-0-data-0dxksr
>  0   hdd 0.09760             osd.0                            up  1.00000
> 1.00000
>  1   hdd       0 osd.1                                        up  1.00000
> 1.00000
>  2   hdd       0 osd.2                                        up  1.00000
> 1.00000
> 
> The only thing that changed recently is the ceph version for a few arbiter
> fixes.
> I've looked into Rook and I can't see what could cause this, we have done
> very few backports recently.
> 
> We continue the investigation. Most likely not an OCS-Op bug so moving to
> Rook and perhaps in Ceph eventually.

Sébastien,
Will restarting all the OSDs fix the issue, as a very temporary workaround?

Orit

Comment 27 Sébastien Han 2021-02-17 16:38:52 UTC
(In reply to Orit Wasserman from comment #26)
> Sébastien,
> Will restarting all the OSDs fix the issue, as a very temporary workaround?
> 
> Orit

Yes, but not ideal.

Comment 28 Mudit Agarwal 2021-02-17 16:42:24 UTC
> > Sébastien,
> > Will restarting all the OSDs fix the issue, as a very temporary workaround?
> > 
> > Orit
> 
> Yes, but not ideal.

But can it still unblock QA?

Comment 29 Sébastien Han 2021-02-17 16:43:27 UTC
(In reply to Mudit Agarwal from comment #28)
> > > Sébastien,
> > > Will restarting all the OSDs fix the issue, as a very temporary workaround?
> > > 
> > > Orit
> > 
> > Yes, but not ideal.
> 
> But it still can unblock QA?

Yes, run "oc delete pod/<osd pod id>" on all OSD pods (a sketch of this follows below).
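
A minimal sketch of that workaround, assuming Rook's usual app=rook-ceph-osd pod label:

# delete every OSD pod; the deployments recreate them and each OSD
# re-registers its CRUSH location on startup
for pod in $(oc -n openshift-storage get pod -l app=rook-ceph-osd -o name); do
  oc -n openshift-storage delete "$pod"
done
# then verify from the toolbox that hosts/racks show up:
ceph osd tree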

Comment 30 Travis Nielsen 2021-02-17 19:25:24 UTC
There is another BZ that looks like the same root cause.
https://bugzilla.redhat.com/show_bug.cgi?id=1929565#c7

> Can we capture monitor logs with debug_mon=20, debug_ms=1, debug_paxos 20 and debug_crush 20 to verify this?

Can we gather more detailed logging per Neha's request? Let me know if you need assistance to increase the logging.

Comment 31 Sébastien Han 2021-02-18 13:49:02 UTC
We cannot easily increase the logging since the issue appears at boot time, so we must raise the log levels before the process starts.
Essentially, once the mons are up and running:

* quickly jump into the toolbox, BEFORE THE OSDs START
* run "ceph config set mon.* debug_mon 20", and repeat for all the options mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1928471#c30

Ideally we would use the rook-config-override configmap but the ocs-op will reconcile it so it's not practical.
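
A minimal sketch of those toolbox commands (assuming the app=rook-ceph-tools label seen elsewhere in this bug; "mon" is used as the config target here, which may differ slightly from the mon.* form above):

TOOLBOX=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}')
# raise mon logging before the OSDs start
oc -n openshift-storage exec "$TOOLBOX" -- ceph config set mon debug_mon 20
oc -n openshift-storage exec "$TOOLBOX" -- ceph config set mon debug_ms 1
oc -n openshift-storage exec "$TOOLBOX" -- ceph config set mon debug_paxos 20
oc -n openshift-storage exec "$TOOLBOX" -- ceph config set mon debug_crush 20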

Comment 32 Sébastien Han 2021-02-18 15:09:31 UTC
*** Bug 1929565 has been marked as a duplicate of this bug. ***

Comment 33 Petr Balogh 2021-02-18 15:11:43 UTC
Not sure if you are still interested in new occurrences of the issue:
I was asked by Boris to run acceptance tests for the latest build, 4.7.0-266.ci, and I see the deployment has failed here: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/8/console

14:41:23 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get pod -l 'app=rook-ceph-tools' -o jsonpath='{.items[0].metadata.name}'
14:41:23 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage exec rook-ceph-tools-5cb9b5d8cd-468sj -- ceph health
14:41:24 - MainThread - ocs_ci.deployment.deployment - WARNING - Ceph health check failed with Ceph cluster health is not OK. Health: HEALTH_WARN 1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set; Degraded data redundancy: 540/1620 objects degraded (33.333%), 91 pgs degraded, 200 pgs undersized
14:41:24 - MainThread - ocs_ci.deployment.deployment - INFO - Patch thin storageclass as non-default
14:41:24 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc patch storageclass thin -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}' --request-timeout=120s
14:41:24 - MainThread - ocs_ci.ocs.utils - INFO - Must gather image: quay.io/rhceph-dev/ocs-must-gather:latest-4.7 will be used.
14:41:24 - MainThread - ocs_ci.ocs.utils - INFO - OCS logs will be placed in location /home/jenkins/current-cluster-dir/logs/deployment_1613655379/ocs_must_gather
14:41:24 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.7 --dest-dir=/home/jenkins/current-cluster-dir/logs/deployment_1613655379/ocs_must_gather

Logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j009vu1cs33-a/j009vu1cs33-a_20210218T133012/logs/failed_testcase_ocs_logs_1613655379/test_deployment_ocs_logs/

Comment 34 Sébastien Han 2021-02-18 15:27:48 UTC
The latest report from QA indicates we are leaning toward a Ceph issue, thus moving to Ceph and assigning to Neha since she was already investigating it.
Now we need a way to produce the logs at the level Neha has requested.

Comment 35 Neha Ojha 2021-02-18 21:01:18 UTC
Last relevant comment: https://bugzilla.redhat.com/show_bug.cgi?id=1929565#c11

Comment 36 Mudit Agarwal 2021-02-19 04:54:08 UTC
(In reply to Sébastien Han from comment #34)
> The latest report from QA indicates we are leaning toward a Ceph issue,
> thus moving to Ceph and assigning to Neha since she was already
> investigating that.
> Now, we need a way to produce the log level Neha has requested.

Just to give some more context on why we are leaning towards a Ceph issue:

This issue is reproducible when we build OCS with RHCS 4.2z1 and not with the RHCS 4.2 async.

AFAIK, only a few arbiter-related commits have gone into RHCS 4.2z1, and we should revisit those commits to see whether one of them introduced this issue.

Comment 42 Avi Liani 2021-02-23 11:58:21 UTC
I tried the workaround (deleting the OSD pods) and it is not working; the noobaa-db pod is still in Pending state.

Comment 43 Sébastien Han 2021-02-23 16:04:06 UTC
(In reply to Avi Liani from comment #42)
> I tried the workaround (deleting the OSD pods) and it is not working; the
> noobaa-db pod is still in Pending state.

Please be more specific: what does the OSD tree look like after the restart?
If they still don't register, then increase the log level as requested earlier.

Thanks

Comment 45 Travis Nielsen 2021-02-24 00:11:37 UTC
Moving back to assigned since the merge was only for the configuration that will allow debugging.

Comment 46 Michael Adam 2021-02-24 00:33:15 UTC
(In reply to Travis Nielsen from comment #45)
> Moving back to assigned since the merge was only for the configuration that
> will allow debugging.

Thanks Travis!

Jose or Travis, could you please provide details on how to enable the debugging now?

Comment 47 Michael Adam 2021-02-24 00:36:47 UTC
FWIW, the currently running nightly OCS 4.7 build might just not have picked up the patch. We will see...

Comment 48 Michael Adam 2021-02-24 00:37:34 UTC
https://storage-jenkins-csb-ceph.cloud.paas.psi.redhat.com/job/OCS%20Build%20Pipeline%204.7/151/

We might need to trigger another build.

Comment 49 Travis Nielsen 2021-02-24 00:39:39 UTC
This configmap needs to be created before the OCS cluster is created:

kind: ConfigMap
apiVersion: v1
metadata:
  name: rook-config-override
  namespace: openshift-storage
data:
  config: |
    [global]
    mon_osd_full_ratio = .85
    mon_osd_backfillfull_ratio = .8
    mon_osd_nearfull_ratio = .75
    mon_max_pg_per_osd = 600
    [mon]
    debug_mon=20
    debug_ms=1
    debug_paxos=20
    debug_crush=20
    [osd]
    osd_memory_target_cgroup_limit_ratio = 0.5

Jose, do we need to change the reconcile setting as well?

Comment 50 Elad 2021-02-25 17:31:37 UTC
Tried reproducing with ocs-registry:4.7.0-273.ci both while Ceph is configured to run in debug log level and while it is not. 


-------------------------------------------------------------------------------------------------------------------------------

With Ceph in debug log level:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-test-pr/183/testReport/
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-test-pr/186/testReport/


19:41:42 - MainThread - ocs_ci.ocs.ocp - INFO - Resource ocs-operator.v4.7.0-273.ci is in phase: Succeeded!
19:41:42 - MainThread - ocs_ci.utility.templating - INFO - apiVersion: v1
data:
  config: '[global]

    mon_osd_full_ratio = .85

    mon_osd_backfillfull_ratio = .8

    mon_osd_nearfull_ratio = .75

    mon_max_pg_per_osd = 600

    [mon]

    debug_mon=20

    debug_ms=1

    debug_paxos=20

    debug_crush=20

    [osd]

    osd_memory_target_cgroup_limit_ratio = 0.5

    '
kind: ConfigMap
metadata:
  name: rook-config-override
  namespace: openshift-storage


---------------------


19:41:42 - MainThread - ocs_ci.deployment.deployment - INFO - Setting Ceph to work in debug log level using a new configmap resource
19:41:42 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc create -f /tmp/config_maphi2htxsb
19:41:42 - MainThread - ocs_ci.utility.templating - INFO - apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: ocs-storagecluster
  namespace: openshift-storage
spec:
  managedResources:
    cephConfig:
      reconcileStrategy: ignore
  storageDeviceSets:
  - count: 1
    dataPVCTemplate:
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 256Gi
        storageClassName: thin
        volumeMode: Block
    name: ocs-deviceset
    placement: {}
    portable: true
    replica: 3
    resources: {}

19:41:42 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc create -f /tmp/cluster_storagellp1grvc


---------------------

A snippet from mon logs:

debug 2021-02-24 19:59:16.315 7f7092a54700 20  allow all
debug 2021-02-24 19:59:16.315 7f7092a54700 10 mon.b@1(peon).elector(12) handle_ping mon_ping(ping stamp 2021-02-24 19:59:16.316331) v1
debug 2021-02-24 19:59:16.315 7f7092a54700  1 -- [v2:172.30.127.73:3300/0,v1:172.30.127.73:6789/0] --> [v2:172.30.193.169:3300/0,v1:172.30.193.169:6789/0] -- mon_ping(ping_reply stamp 2021-02-24 19:59:16.316331) v1 -- 0x5591d938b8c0 con 0x5591d7065180
debug 2021-02-24 19:59:16.330 7f7095259700 20 mon.b@1(peon).elector(12) dead_ping to peer 2

-------------------------------------------------------------------------------------------------------------------------------

With Ceph not in debug:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-test-pr/187/testReport/
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/848/testReport/


-------------------------------------------------------------------------------------------------------------------------------

All the deployment attempts were successful and the bug was not reproduced.

I suggest keeping this bug open and continuing to examine the upcoming OCS 4.7 builds, starting with ocs-registry:4.7.0-273.ci, which will consume the latest Ceph image. If there is no reproduction, we can move the bug to VERIFIED.

Comment 51 Avi Liani 2021-03-01 08:37:14 UTC
I just deployed a cluster with Arbiter on VMware LSO, and it succeeded without any workaround.

OCP : 4.7.0
OCS : ocs-operator.v4.7.0-278.ci
ceph version 14.2.11-123.el8cp

IMO, it can be moved to VERIFIED.

Comment 52 Elad 2021-03-02 12:15:11 UTC
Hi,

There was a suspicion, raised by Mudit, about why we are unable to reproduce this BZ with the latest OCS 4.7 builds: the fix for bug 1931810 may also have prevented this BZ.

To check this, I tried deploying an OCS 4.7 build from before the fix for bug 1931810, ocs-registry:4.7.0-268.ci, while changing the CSV prior to storagecluster creation to consume a newer Ceph image (the one we consume in the latest OCS 4.7 builds). This isolates the factor of the fix for bug 1931810.

Executed here - https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/922/

And it indeed reproduced:

E           Events:
E             Type     Reason            Age   From               Message
E             ----     ------            ----  ----               -------
E             Warning  FailedScheduling  10m   default-scheduler  0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims.
E             Warning  FailedScheduling  10m   default-scheduler  0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims.

Comment 53 Travis Nielsen 2021-03-02 15:03:03 UTC
Elad, thanks for the deeper validation that this OSD registration issue appears to be fixed by 1931810. 

@Neha Can you see why this would be related though? I don't see the relationship. 

The issue fixed by 1931810 was that the CRUSH rules for pools were incorrectly being created with two steps that were both for the same level. For example, a pool was being created with a rule to select from the zone bucket, then select another "zone" bucket. The two-step rules were only intended for stretch clusters, but were being applied incorrectly to all clusters. This put the PGs in a place where they cannot be fulfilled since the second step must be at another level from the first bucket (e.g. rack or host). The fix was to use a single rule from the zone bucket in non-stretch scenarios.

The issue seen for this BZ and described in https://bugzilla.redhat.com/show_bug.cgi?id=1929565#c7 is that the OSD is not registering correctly at first startup, but it did not always happen. If the bad CRUSH rule was created before the OSD was started, could it affect the weight assigned to the OSD? If so, it explains the behavior and we can close this BZ. But if it's not related, I still don't see why this BZ was fixed by 1931810.
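
To illustrate the shape of the rules being described (a sketch in CRUSH map text form, not the exact rules Rook generated): the broken rule selected the zone level twice, while the fix uses a single chooseleaf step for non-stretch clusters.

# broken: two steps, both at the zone level, so PGs cannot be placed
rule bad_rule {
    type replicated
    step take default
    step choose firstn 0 type zone
    step chooseleaf firstn 2 type zone
    step emit
}

# fixed (non-stretch): one step at the failure domain
rule fixed_rule {
    type replicated
    step take default
    step chooseleaf firstn 0 type zone
    step emit
}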

Comment 54 Neha Ojha 2021-03-03 00:54:47 UTC
(In reply to Travis Nielsen from comment #53)
> Elad, thanks for the deeper validation that this OSD registration issue
> appears to be fixed by 1931810. 

@Elad, is https://bugzilla.redhat.com/show_bug.cgi?id=1928471#c52 with additional logging?

> 
> @Neha Can you see why this would be related though? I don't see the
> relationship. 

I don't see a direct correlation between these issues, but additional logging will definitely help us figure out whether there is one. I'd like to emphasize the importance of enabling logging for testing by default; that will save us a lot of back and forth in cases like this.

> 
> The issue fixed by 1931810 was that the CRUSH rules for pools were
> incorrectly being created with two steps that were both for the same level.
> For example, a pool was being created with a rule to select from the zone
> bucket, then select another "zone" bucket. The two-step rules were only
> intended for stretch clusters, but were being applied incorrectly to all
> clusters. This put the PGs in a place where they cannot be fulfilled since
> the second step must be at another level from the first bucket (e.g. rack or
> host). The fix was to use a single rule from the zone bucket in non-stretch
> scenarios.
> 
> The issue seen for this BZ and described in
> https://bugzilla.redhat.com/show_bug.cgi?id=1929565#c7 is that the OSD is
> not registering correctly at first startup, but it did not always happen. If
> the bad CRUSH rule was created before the OSD was started, could it affect
> the weight assigned to the OSD? If so, it explains the behavior and we can
> close this BZ. But if it's not related, I still don't see why this BZ was
> fixed by 1931810.

Comment 55 Elad 2021-03-03 06:53:54 UTC
Hi Neha,

This is not in debug. I am using an old OCS 4.7 build for the reproduction, but this build doesn't have https://github.com/openshift/ocs-operator/pull/1091 included, so OCS deployment with Ceph in debug is not possible.

Comment 56 Travis Nielsen 2021-03-03 16:16:36 UTC
@Neha @Elad Since there is no repro and we can't get debug logs at this point, shall we move this to Verified? It seems there is not much else to do for now. Going forward, the increased logging would be available though when other issues are hit.

Comment 57 Neha Ojha 2021-03-03 17:03:11 UTC
(In reply to Travis Nielsen from comment #56)
> @Neha @Elad Since there is no repro and we can't get debug logs at this
> point, shall we move this to Verified? It seems there is not much else to do
> for now. Going forward, the increased logging would be available though when
> other issues are hit.

Sounds good to me; there is not much we can do without a reproducer (with enough logs).

Comment 58 Travis Nielsen 2021-03-03 19:23:02 UTC
Moving to Verified per comments above

Comment 59 Travis Nielsen 2021-03-03 19:25:50 UTC
Actually, I intended to move to ON_QA first so QE can move it to VERIFIED...

Comment 60 Avi Liani 2021-03-04 09:06:50 UTC
After my deployment (see Comment #51), moving to VERIFIED

Comment 65 errata-xmlrpc 2021-05-19 09:20:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041

