Description of problem (please be as detailed as possible and provide log snippets):

On a newly installed cluster, Ceph cluster health is not OK. There are "Degraded data redundancy", "pgs degraded", and "pgs undersized" warnings:

WARNING - Ceph cluster health is not OK. Health: HEALTH_WARN 1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set; Degraded data redundancy: 325/975 objects degraded (33.333%), 47 pgs degraded, 96 pgs undersized

Version of all relevant components (if applicable):
openshift installer (4.7.0-0.nightly-2021-02-13-071408)
ocs-registry:4.7.0-264.ci

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
1/1

Can this issue be reproduced from the UI?
Not tried

If this is a regression, please provide more details to justify this:
Yes

Steps to Reproduce:
1. Install OCS using ocs-ci
2. Check the Ceph health

Actual results:
22:27:41 - MainThread - ocs_ci.utility.retry - WARNING - Ceph cluster health is not OK. Health: HEALTH_WARN 1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set; Degraded data redundancy: 326/978 objects degraded (33.333%), 47 pgs degraded, 96 pgs undersized

Expected results:
Ceph health should be OK

Additional info:
Job link: https://storage-jenkins-csb-ceph.cloud.paas.psi.redhat.com/job/ocs-ci/241/consoleFull
Must-gather logs: https://storage-jenkins-csb-ceph.cloud.paas.psi.redhat.com/job/ocs-ci/241/artifact/logs/failed_testcase_ocs_logs_1613530585/test_deployment_ocs_logs/
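For reference, the reported health state can be re-checked from inside the cluster with the standard Ceph CLI. A minimal sketch, assuming the rook-ceph-tools pod is deployed in the openshift-storage namespace (the pod name below is illustrative):

  # Open a shell in the toolbox pod (pod name is illustrative)
  oc -n openshift-storage rsh rook-ceph-tools-xxxxxxxxxx-xxxxx

  # Inside the toolbox: overall status, the specific health warnings,
  # and per-OSD usage/placement in one view
  ceph status
  ceph health detail
  ceph osd df tree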
Wondering whether this is related to https://bugzilla.redhat.com/show_bug.cgi?id=1928471
osd.0 is not showing as healthy. The OSD does not show any weight and is not in the expected location in the OSD tree.

ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP META  AVAIL  %USE VAR  PGS STATUS
 2   ssd 0.09769  1.00000 100 GiB 1.3 GiB 271 MiB  0 B 1 GiB 99 GiB 1.26 1.07  96 up
 1   ssd 0.09769  1.00000 100 GiB 1.3 GiB 271 MiB  0 B 1 GiB 99 GiB 1.26 1.07  96 up
 0   ssd       0  1.00000 100 GiB 1.0 GiB 200 KiB  0 B 1 GiB 99 GiB 1.00 0.85   0 up
                    TOTAL 300 GiB 3.5 GiB 541 MiB  0 B 3 GiB 296 GiB 1.18

MIN/MAX VAR: 0.85/1.07  STDDEV: 0.12

ID  CLASS WEIGHT  TYPE NAME                                STATUS REWEIGHT PRI-AFF
 -1       0.19537 root default
 -6       0.19537     region us-west-1
 -5       0.19537         zone us-west-1a
-12       0.09769             rack rack1
-11       0.09769                 host ocs-deviceset-0-data-0k99pn
  2   ssd 0.09769                     osd.2                    up  1.00000 1.00000
 -4       0.09769             rack rack2
 -3       0.09769                 host ocs-deviceset-1-data-02wd6v
  1   ssd 0.09769                     osd.1                    up  1.00000 1.00000
  0   ssd       0 osd.0                                        up  1.00000 1.00000

The OSD pod spec looks as expected for the OSD to start in the correct location in the CRUSH map. The OSD pod is running and the pod logs [3] don't show any errors, to my limited training in reading OSD logs.

--crush-location=root=default host=ocs-deviceset-2-data-0h4gxk rack=rack0 region=us-west-1 zone=us-west-1b

The operator logs look normal, except of course for the messages about the PGs not being clean.

@Neha could you take a look at the ceph logs? [1] [2]

[1] https://storage-jenkins-csb-ceph.cloud.paas.psi.redhat.com/job/ocs-ci/241/artifact/logs/failed_testcase_ocs_logs_1613530585/test_deployment_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-6888c16dcc349c2d98f1b72f3013290264f59ce88aa88f357b0dc63f93c128f3/ceph/must_gather_commands/
[2] https://storage-jenkins-csb-ceph.cloud.paas.psi.redhat.com/job/ocs-ci/241/artifact/logs/failed_testcase_ocs_logs_1613530585/test_deployment_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-6888c16dcc349c2d98f1b72f3013290264f59ce88aa88f357b0dc63f93c128f3/namespaces/openshift-storage/pods/
[3] https://storage-jenkins-csb-ceph.cloud.paas.psi.redhat.com/job/ocs-ci/241/artifact/logs/failed_testcase_ocs_logs_1613530585/test_deployment_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-6888c16dcc349c2d98f1b72f3013290264f59ce88aa88f357b0dc63f93c128f3/namespaces/openshift-storage/pods/rook-ceph-osd-0-5c75cff9d7-lbff4/osd/osd/logs/current.log
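For anyone triaging a similar cluster, a few read-only checks from the toolbox pod that surface the same symptom. These are standard Ceph CLI commands, nothing specific to this build; the reweight command in the comment is only a hedged workaround idea, not a root-cause fix:

  # osd.0 shows up in the "stray" array of the tree output with crush_weight 0
  ceph osd tree --format json-pretty

  # What the cluster knows about osd.0's location and metadata
  ceph osd find 0
  ceph osd metadata 0
  ceph osd crush get-device-class osd.0

  # Note: "ceph osd crush reweight osd.0 0.09769" would set the missing CRUSH
  # weight on the existing item, but it would not by itself move the stray
  # entry under the intended host bucket, and it would mask the repro we are
  # trying to capture.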
The same issue was encountered here:

Build: ocs-operator.v4.7.0-263.ci
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/635/console

Logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j002vu1cs33-a/j002vu1cs33-a_20210217T135542/logs/failed_testcase_ocs_logs_1613570472/test_deployment_ocs_logs/

It looks like this is currently blocking OCS 4.7 deployment on vSphere.
(In reply to Travis Nielsen from comment #5)
> osd.0 is not showing as healthy. The OSD does not show any weight and is not in the expected location in the OSD tree.
>
> ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP META  AVAIL  %USE VAR  PGS STATUS
>  2   ssd 0.09769  1.00000 100 GiB 1.3 GiB 271 MiB  0 B 1 GiB 99 GiB 1.26 1.07  96 up
>  1   ssd 0.09769  1.00000 100 GiB 1.3 GiB 271 MiB  0 B 1 GiB 99 GiB 1.26 1.07  96 up
>  0   ssd       0  1.00000 100 GiB 1.0 GiB 200 KiB  0 B 1 GiB 99 GiB 1.00 0.85   0 up
>                     TOTAL 300 GiB 3.5 GiB 541 MiB  0 B 3 GiB 296 GiB 1.18
>
> MIN/MAX VAR: 0.85/1.07  STDDEV: 0.12
>
> ID  CLASS WEIGHT  TYPE NAME                                STATUS REWEIGHT PRI-AFF
>  -1       0.19537 root default
>  -6       0.19537     region us-west-1
>  -5       0.19537         zone us-west-1a
> -12       0.09769             rack rack1
> -11       0.09769                 host ocs-deviceset-0-data-0k99pn
>   2   ssd 0.09769                     osd.2                    up  1.00000 1.00000
>  -4       0.09769             rack rack2
>  -3       0.09769                 host ocs-deviceset-1-data-02wd6v
>   1   ssd 0.09769                     osd.1                    up  1.00000 1.00000
>   0   ssd       0 osd.0                                        up  1.00000 1.00000
>
> The OSD pod spec looks as expected for the OSD to start in the correct location in the CRUSH map. The OSD pod is running and the pod logs [3] don't show any errors, to my limited training in reading OSD logs.
>
> --crush-location=root=default host=ocs-deviceset-2-data-0h4gxk rack=rack0 region=us-west-1 zone=us-west-1b
>
> The operator logs look normal, except of course for the messages about the PGs not being clean.
>
> @Neha could you take a look at the ceph logs? [1] [2]

The problem is that osd.0's crush_weight isn't being set properly.

"stray": [
    {
        "id": 0,
        "device_class": "ssd",
        "name": "osd.0",
        "type": "osd",
        "type_id": 0,
        "crush_weight": 0,
        "depth": 0,
        "reweight": 1,
        "kb": 104857600,
        "kb_used": 1048784,
        "kb_used_data": 200,
        "kb_used_omap": 0,
        "kb_used_meta": 1048576,
        "kb_avail": 103808816,
        "utilization": 1.0001983642578125,
        "var": 0.85168021551108097,
        "pgs": 0,
        "status": "up"
    }
],

From the mon logs, the following is the last thing I can see:

2021-02-17T18:29:34.504+0000 7fa4c1693700 2 mon.a@0(leader) e1 send_reply 0x55d5be3a4760 0x55d5be61fba0 mon_command_ack([{"prefix": "osd crush create-or-move", "id": 0, "weight":0.0986, "args": ["host=vossi06", "root=default"]}]=0 create-or-move updating item name 'osd.0' weight 0.0986 at location {host=vossi06,root=default} to crush map v5) v1

This tells us that we tried to set the crush_weight, but it doesn't tell us whether this mon command was acked properly or not.

Can we capture monitor logs with debug_mon=20, debug_ms=1, debug_paxos=20 and debug_crush=20 to verify this?
>
> [1] https://storage-jenkins-csb-ceph.cloud.paas.psi.redhat.com/job/ocs-ci/241/artifact/logs/failed_testcase_ocs_logs_1613530585/test_deployment_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-6888c16dcc349c2d98f1b72f3013290264f59ce88aa88f357b0dc63f93c128f3/ceph/must_gather_commands/
> [2] https://storage-jenkins-csb-ceph.cloud.paas.psi.redhat.com/job/ocs-ci/241/artifact/logs/failed_testcase_ocs_logs_1613530585/test_deployment_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-6888c16dcc349c2d98f1b72f3013290264f59ce88aa88f357b0dc63f93c128f3/namespaces/openshift-storage/pods/
> [3] https://storage-jenkins-csb-ceph.cloud.paas.psi.redhat.com/job/ocs-ci/241/artifact/logs/failed_testcase_ocs_logs_1613530585/test_deployment_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-6888c16dcc349c2d98f1b72f3013290264f59ce88aa88f357b0dc63f93c128f3/namespaces/openshift-storage/pods/rook-ceph-osd-0-5c75cff9d7-lbff4/osd/osd/logs/current.log
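Regarding the debug-level request above, a minimal sketch of how those levels could be raised from the toolbox pod via the centralized config store (standard Ceph CLI; note this only captures the failure if it is applied before the OSDs register, which is the timing problem discussed in the following comments):

  ceph config set mon debug_mon 20
  ceph config set mon debug_ms 1
  ceph config set mon debug_paxos 20
  ceph config set mon debug_crush 20

  # Verify the values took effect
  ceph config get mon debug_mon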
This appears to be the same root cause as https://bugzilla.redhat.com/show_bug.cgi?id=1928471. Shall we close this as a dup? Or, even better, we could try to get multiple repros with the increased logging.
(In reply to Neha Ojha from comment #7)
> (In reply to Travis Nielsen from comment #5)
> > osd.0 is not showing as healthy. The OSD does not show any weight and is not in the expected location in the OSD tree.
> >
> > ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP META  AVAIL  %USE VAR  PGS STATUS
> >  2   ssd 0.09769  1.00000 100 GiB 1.3 GiB 271 MiB  0 B 1 GiB 99 GiB 1.26 1.07  96 up
> >  1   ssd 0.09769  1.00000 100 GiB 1.3 GiB 271 MiB  0 B 1 GiB 99 GiB 1.26 1.07  96 up
> >  0   ssd       0  1.00000 100 GiB 1.0 GiB 200 KiB  0 B 1 GiB 99 GiB 1.00 0.85   0 up
> >                     TOTAL 300 GiB 3.5 GiB 541 MiB  0 B 3 GiB 296 GiB 1.18
> >
> > MIN/MAX VAR: 0.85/1.07  STDDEV: 0.12
> >
> > ID  CLASS WEIGHT  TYPE NAME                                STATUS REWEIGHT PRI-AFF
> >  -1       0.19537 root default
> >  -6       0.19537     region us-west-1
> >  -5       0.19537         zone us-west-1a
> > -12       0.09769             rack rack1
> > -11       0.09769                 host ocs-deviceset-0-data-0k99pn
> >   2   ssd 0.09769                     osd.2                    up  1.00000 1.00000
> >  -4       0.09769             rack rack2
> >  -3       0.09769                 host ocs-deviceset-1-data-02wd6v
> >   1   ssd 0.09769                     osd.1                    up  1.00000 1.00000
> >   0   ssd       0 osd.0                                        up  1.00000 1.00000
> >
> > The OSD pod spec looks as expected for the OSD to start in the correct location in the CRUSH map. The OSD pod is running and the pod logs [3] don't show any errors, to my limited training in reading OSD logs.
> >
> > --crush-location=root=default host=ocs-deviceset-2-data-0h4gxk rack=rack0 region=us-west-1 zone=us-west-1b
> >
> > The operator logs look normal, except of course for the messages about the PGs not being clean.
> >
> > @Neha could you take a look at the ceph logs? [1] [2]
>
> The problem is that osd.0's crush_weight isn't being set properly.
>
> "stray": [
>     {
>         "id": 0,
>         "device_class": "ssd",
>         "name": "osd.0",
>         "type": "osd",
>         "type_id": 0,
>         "crush_weight": 0,
>         "depth": 0,
>         "reweight": 1,
>         "kb": 104857600,
>         "kb_used": 1048784,
>         "kb_used_data": 200,
>         "kb_used_omap": 0,
>         "kb_used_meta": 1048576,
>         "kb_avail": 103808816,
>         "utilization": 1.0001983642578125,
>         "var": 0.85168021551108097,
>         "pgs": 0,
>         "status": "up"
>     }
> ],
>
> From the mon logs, the following is the last thing I can see:
>
> 2021-02-17T18:29:34.504+0000 7fa4c1693700 2 mon.a@0(leader) e1 send_reply 0x55d5be3a4760 0x55d5be61fba0 mon_command_ack([{"prefix": "osd crush create-or-move", "id": 0, "weight":0.0986, "args": ["host=vossi06", "root=default"]}]=0 create-or-move updating item name 'osd.0' weight 0.0986 at location {host=vossi06,root=default} to crush map v5) v1
>
> This tells us that we tried to set the crush_weight, but it doesn't tell us whether this mon command was acked properly or not.
>
> Can we capture monitor logs with debug_mon=20, debug_ms=1, debug_paxos=20 and debug_crush=20 to verify this?
> >
> > [1] https://storage-jenkins-csb-ceph.cloud.paas.psi.redhat.com/job/ocs-ci/241/artifact/logs/failed_testcase_ocs_logs_1613530585/test_deployment_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-6888c16dcc349c2d98f1b72f3013290264f59ce88aa88f357b0dc63f93c128f3/ceph/must_gather_commands/
> > [2] https://storage-jenkins-csb-ceph.cloud.paas.psi.redhat.com/job/ocs-ci/241/artifact/logs/failed_testcase_ocs_logs_1613530585/test_deployment_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-6888c16dcc349c2d98f1b72f3013290264f59ce88aa88f357b0dc63f93c128f3/namespaces/openshift-storage/pods/
> > [3] https://storage-jenkins-csb-ceph.cloud.paas.psi.redhat.com/job/ocs-ci/241/artifact/logs/failed_testcase_ocs_logs_1613530585/test_deployment_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-6888c16dcc349c2d98f1b72f3013290264f59ce88aa88f357b0dc63f93c128f3/namespaces/openshift-storage/pods/rook-ceph-osd-0-5c75cff9d7-lbff4/osd/osd/logs/current.log

Since this is failing during deployment, I am not sure how to add debug_mon=20, debug_ms=1, debug_paxos=20 and debug_crush=20. Is it OK to add them post-deployment, or is there any way to set these debug options during deployment? Could you tell us the procedure to add them?
Closing in favour of bug 1928471, which is older. Let's move the discussion over there. Neha, we still need your help :)

*** This bug has been marked as a duplicate of bug 1928471 ***
(In reply to Vijay Avuthu from comment #9)
> (In reply to Neha Ojha from comment #7)
> > (In reply to Travis Nielsen from comment #5)
> > > osd.0 is not showing as healthy. The OSD does not show any weight and is not in the expected location in the OSD tree.
> > >
> > > ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP META  AVAIL  %USE VAR  PGS STATUS
> > >  2   ssd 0.09769  1.00000 100 GiB 1.3 GiB 271 MiB  0 B 1 GiB 99 GiB 1.26 1.07  96 up
> > >  1   ssd 0.09769  1.00000 100 GiB 1.3 GiB 271 MiB  0 B 1 GiB 99 GiB 1.26 1.07  96 up
> > >  0   ssd       0  1.00000 100 GiB 1.0 GiB 200 KiB  0 B 1 GiB 99 GiB 1.00 0.85   0 up
> > >                     TOTAL 300 GiB 3.5 GiB 541 MiB  0 B 3 GiB 296 GiB 1.18
> > >
> > > MIN/MAX VAR: 0.85/1.07  STDDEV: 0.12
> > >
> > > ID  CLASS WEIGHT  TYPE NAME                                STATUS REWEIGHT PRI-AFF
> > >  -1       0.19537 root default
> > >  -6       0.19537     region us-west-1
> > >  -5       0.19537         zone us-west-1a
> > > -12       0.09769             rack rack1
> > > -11       0.09769                 host ocs-deviceset-0-data-0k99pn
> > >   2   ssd 0.09769                     osd.2                    up  1.00000 1.00000
> > >  -4       0.09769             rack rack2
> > >  -3       0.09769                 host ocs-deviceset-1-data-02wd6v
> > >   1   ssd 0.09769                     osd.1                    up  1.00000 1.00000
> > >   0   ssd       0 osd.0                                        up  1.00000 1.00000
> > >
> > > The OSD pod spec looks as expected for the OSD to start in the correct location in the CRUSH map. The OSD pod is running and the pod logs [3] don't show any errors, to my limited training in reading OSD logs.
> > >
> > > --crush-location=root=default host=ocs-deviceset-2-data-0h4gxk rack=rack0 region=us-west-1 zone=us-west-1b
> > >
> > > The operator logs look normal, except of course for the messages about the PGs not being clean.
> > >
> > > @Neha could you take a look at the ceph logs? [1] [2]
> >
> > The problem is that osd.0's crush_weight isn't being set properly.
> >
> > "stray": [
> >     {
> >         "id": 0,
> >         "device_class": "ssd",
> >         "name": "osd.0",
> >         "type": "osd",
> >         "type_id": 0,
> >         "crush_weight": 0,
> >         "depth": 0,
> >         "reweight": 1,
> >         "kb": 104857600,
> >         "kb_used": 1048784,
> >         "kb_used_data": 200,
> >         "kb_used_omap": 0,
> >         "kb_used_meta": 1048576,
> >         "kb_avail": 103808816,
> >         "utilization": 1.0001983642578125,
> >         "var": 0.85168021551108097,
> >         "pgs": 0,
> >         "status": "up"
> >     }
> > ],
> >
> > From the mon logs, the following is the last thing I can see:
> >
> > 2021-02-17T18:29:34.504+0000 7fa4c1693700 2 mon.a@0(leader) e1 send_reply 0x55d5be3a4760 0x55d5be61fba0 mon_command_ack([{"prefix": "osd crush create-or-move", "id": 0, "weight":0.0986, "args": ["host=vossi06", "root=default"]}]=0 create-or-move updating item name 'osd.0' weight 0.0986 at location {host=vossi06,root=default} to crush map v5) v1
> >
> > This tells us that we tried to set the crush_weight, but it doesn't tell us whether this mon command was acked properly or not.
> >
> > Can we capture monitor logs with debug_mon=20, debug_ms=1, debug_paxos=20 and debug_crush=20 to verify this?
> > >
> > > [1] https://storage-jenkins-csb-ceph.cloud.paas.psi.redhat.com/job/ocs-ci/241/artifact/logs/failed_testcase_ocs_logs_1613530585/test_deployment_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-6888c16dcc349c2d98f1b72f3013290264f59ce88aa88f357b0dc63f93c128f3/ceph/must_gather_commands/
> > > [2] https://storage-jenkins-csb-ceph.cloud.paas.psi.redhat.com/job/ocs-ci/241/artifact/logs/failed_testcase_ocs_logs_1613530585/test_deployment_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-6888c16dcc349c2d98f1b72f3013290264f59ce88aa88f357b0dc63f93c128f3/namespaces/openshift-storage/pods/
> > > [3] https://storage-jenkins-csb-ceph.cloud.paas.psi.redhat.com/job/ocs-ci/241/artifact/logs/failed_testcase_ocs_logs_1613530585/test_deployment_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-6888c16dcc349c2d98f1b72f3013290264f59ce88aa88f357b0dc63f93c128f3/namespaces/openshift-storage/pods/rook-ceph-osd-0-5c75cff9d7-lbff4/osd/osd/logs/current.log
>
> Since this is failing during deployment, I am not sure how to add debug_mon=20, debug_ms=1, debug_paxos=20 and debug_crush=20. Is it OK to add them post-deployment, or is there any way to set these debug options during deployment? Could you tell us the procedure to add them?

You'll need to change these settings before the OSDs are deployed. Perhaps you could do this using "ceph config set osd debug_*". @tnielsen any other ideas?
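One possible way to have the settings in place before any OSD is created is Rook's ceph.conf override ConfigMap. This is only a sketch based on the upstream Rook convention (a ConfigMap named rook-config-override with a "config" key); whether ocs-ci can create or edit it early enough in this deployment flow is an assumption to verify:

  # Edit (or pre-create) the override ConfigMap in the openshift-storage namespace
  oc -n openshift-storage edit configmap rook-config-override

  # Set its "config" key to a ceph.conf snippet such as:
  #   [global]
  #   debug mon = 20
  #   debug ms = 1
  #   debug paxos = 20
  #   debug crush = 20

  # If the mons are already running, restart them one at a time (preserving
  # quorum) so they pick up the new ceph.conf, e.g.:
  oc -n openshift-storage delete pod rook-ceph-mon-a-<hash>   # pod name illustrative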