Bug 1929565 - Ceph cluster health is not OK: degraded data redundancy, PGs degraded and PGs undersized on new cluster
Summary: Ceph cluster health is not OK: degraded data redundancy, PGs degraded and PGs undersized on new cluster
Keywords:
Status: CLOSED DUPLICATE of bug 1928471
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ceph
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Assignee: Neha Ojha
QA Contact: Raz Tamir
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-02-17 07:37 UTC by Vijay Avuthu
Modified: 2021-02-18 21:00 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-18 15:09:31 UTC
Embargoed:



Description Vijay Avuthu 2021-02-17 07:37:38 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

On a newly installed cluster, the Ceph cluster health is not OK. It reports degraded data redundancy, with PGs degraded and PGs undersized:

WARNING - Ceph cluster health is not OK. Health: HEALTH_WARN 1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set; Degraded data redundancy: 325/975 objects degraded (33.333%), 47 pgs degraded, 96 pgs undersized
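
For reference, the health and OSD state above can be inspected from the rook-ceph-tools pod; a minimal sketch, assuming the toolbox pod has been enabled in openshift-storage (not shown in this report):

# Find the toolbox pod and run the standard health checks in it
TOOLS_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -n1)
oc -n openshift-storage rsh "$TOOLS_POD" ceph status
oc -n openshift-storage rsh "$TOOLS_POD" ceph health detail
oc -n openshift-storage rsh "$TOOLS_POD" ceph osd df tree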


Version of all relevant components (if applicable):

openshift installer (4.7.0-0.nightly-2021-02-13-071408)
ocs-registry:4.7.0-264.ci


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
1/1

Can this issue be reproduced from the UI?
Not tried

If this is a regression, please provide more details to justify this:
Yes

Steps to Reproduce:
1. Install OCS using ocs-ci
2. Check the Ceph health


Actual results:

22:27:41 - MainThread - ocs_ci.utility.retry - WARNING - Ceph cluster health is not OK. Health: HEALTH_WARN 1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set; Degraded data redundancy: 326/978 objects degraded (33.333%), 47 pgs degraded, 96 pgs undersized


Expected results:

ceph health should be OK


Additional info:

job link: https://storage-jenkins-csb-ceph.cloud.paas.psi.redhat.com/job/ocs-ci/241/consoleFull

must gather logs: https://storage-jenkins-csb-ceph.cloud.paas.psi.redhat.com/job/ocs-ci/241/artifact/logs/failed_testcase_ocs_logs_1613530585/test_deployment_ocs_logs/

Comment 4 Michael Adam 2021-02-17 08:51:30 UTC
Wondering whether this is related to https://bugzilla.redhat.com/show_bug.cgi?id=1928471

Comment 5 Travis Nielsen 2021-02-17 14:26:53 UTC
osd.0 is not showing as healthy. The OSD does not show any weight and is not in the expected location in the OSD tree.

ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP META  AVAIL   %USE VAR  PGS STATUS 
 2   ssd 0.09769  1.00000 100 GiB 1.3 GiB 271 MiB  0 B 1 GiB  99 GiB 1.26 1.07  96     up 
 1   ssd 0.09769  1.00000 100 GiB 1.3 GiB 271 MiB  0 B 1 GiB  99 GiB 1.26 1.07  96     up 
 0   ssd       0  1.00000 100 GiB 1.0 GiB 200 KiB  0 B 1 GiB  99 GiB 1.00 0.85   0     up 
                    TOTAL 300 GiB 3.5 GiB 541 MiB  0 B 3 GiB 296 GiB 1.18                 
MIN/MAX VAR: 0.85/1.07  STDDEV: 0.12

ID  CLASS WEIGHT  TYPE NAME                                        STATUS REWEIGHT PRI-AFF 
 -1       0.19537 root default                                                             
 -6       0.19537     region us-west-1                                                     
 -5       0.19537         zone us-west-1a                                                  
-12       0.09769             rack rack1                                                   
-11       0.09769                 host ocs-deviceset-0-data-0k99pn                         
  2   ssd 0.09769                     osd.2                            up  1.00000 1.00000 
 -4       0.09769             rack rack2                                                   
 -3       0.09769                 host ocs-deviceset-1-data-02wd6v                         
  1   ssd 0.09769                     osd.1                            up  1.00000 1.00000 
  0   ssd       0 osd.0                                                up  1.00000 1.00000 

The OSD pod spec looks as expected for the OSD to start in the correct location in the CRUSH map. The OSD pod is running, and the pod logs [3] don't show any errors, at least to my untrained eye for OSD logs.

--crush-location=root=default host=ocs-deviceset-2-data-0h4gxk rack=rack0 region=us-west-1 zone=us-west-1b

The operator logs look normal, except of course the messages about the PGs not being clean.

@Neha could you take a look at the ceph logs? [1] [2]


[1] https://storage-jenkins-csb-ceph.cloud.paas.psi.redhat.com/job/ocs-ci/241/artifact/logs/failed_testcase_ocs_logs_1613530585/test_deployment_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-6888c16dcc349c2d98f1b72f3013290264f59ce88aa88f357b0dc63f93c128f3/ceph/must_gather_commands/
[2] https://storage-jenkins-csb-ceph.cloud.paas.psi.redhat.com/job/ocs-ci/241/artifact/logs/failed_testcase_ocs_logs_1613530585/test_deployment_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-6888c16dcc349c2d98f1b72f3013290264f59ce88aa88f357b0dc63f93c128f3/namespaces/openshift-storage/pods/
[3] https://storage-jenkins-csb-ceph.cloud.paas.psi.redhat.com/job/ocs-ci/241/artifact/logs/failed_testcase_ocs_logs_1613530585/test_deployment_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-6888c16dcc349c2d98f1b72f3013290264f59ce88aa88f357b0dc63f93c128f3/namespaces/openshift-storage/pods/rook-ceph-osd-0-5c75cff9d7-lbff4/osd/osd/logs/current.log

Comment 7 Neha Ojha 2021-02-17 19:16:49 UTC
(In reply to Travis Nielsen from comment #5)
> osd.0 is not showing as healthy. The OSD does not show any weight and is not
> in the expected location in the OSD tree.
> [...]
> @Neha could you take a look at the ceph logs? [1] [2]

The problem is that osd.0's crush_weight isn't being set properly. 

    "stray": [
        {
            "id": 0,
            "device_class": "ssd",
            "name": "osd.0",
            "type": "osd",
            "type_id": 0,
            "crush_weight": 0,
            "depth": 0,
            "reweight": 1,
            "kb": 104857600,
            "kb_used": 1048784,
            "kb_used_data": 200,
            "kb_used_omap": 0,
            "kb_used_meta": 1048576,
            "kb_avail": 103808816,
            "utilization": 1.0001983642578125,
            "var": 0.85168021551108097,
            "pgs": 0,
            "status": "up"
        }
    ],
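
For context, a "stray" entry like this is what the JSON form of the OSD utilization tree reports for an OSD that exists but is not linked into the CRUSH hierarchy. The exact must-gather command isn't shown here, but something like the following, run from the toolbox pod, should reproduce it:

ceph osd df tree -f json-pretty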

From the mon logs, the following is the last relevant entry I can see:

2021-02-17T18:29:34.504+0000 7fa4c1693700  2 mon.a@0(leader) e1 send_reply 0x55d5be3a4760 0x55d5be61fba0 mon_command_ack([{"prefix": "osd crush create-or-move", "id": 0, "weight":0.0986, "args": ["host=vossi06", "root=default"]}]=0 create-or-move updating item name 'osd.0' weight 0.0986 at location {host=vossi06,root=default} to crush map v5) v1

This tells us that we tried to set the crush_weight, but it doesn't tell us whether this mon command was acked properly or not.

Can we capture monitor logs with debug_mon=20, debug_ms=1, debug_paxos=20, and debug_crush=20 to verify this?
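
For reference, one way to raise those levels on a cluster that is already up (a sketch run from the rook-ceph-tools pod; these are standard Ceph commands, not something specific to this report):

# Persist the requested debug levels for the mons in the config database
ceph config set mon debug_mon 20
ceph config set mon debug_ms 1
ceph config set mon debug_paxos 20
ceph config set mon debug_crush 20

# Or inject them into the running mons without persisting them
ceph tell mon.* injectargs '--debug_mon 20 --debug_ms 1 --debug_paxos 20 --debug_crush 20'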


Comment 8 Travis Nielsen 2021-02-17 19:34:18 UTC
This appears to be the same root cause as https://bugzilla.redhat.com/show_bug.cgi?id=1928471. Shall we close this as a dup? Or, even better, can we get multiple repros with the increased logging?

Comment 9 Vijay Avuthu 2021-02-18 11:41:19 UTC
(In reply to Neha Ojha from comment #7)
> [...]
> Can we capture monitor logs with debug_mon=20, debug_ms=1, debug_paxos=20,
> and debug_crush=20 to verify this?

Since this is failing during deployment, I am not sure how to add debug_mon=20, debug_ms=1, debug_paxos=20,
and debug_crush=20. Is it OK to add them post-deployment, or is there any way to set these debug levels during deployment?
Could you tell us the procedure?

Comment 10 Sébastien Han 2021-02-18 15:09:31 UTC
Closing in favour of bug 1928471, which is older. Let's move the discussion over there.
Neha, we still need your help :)

*** This bug has been marked as a duplicate of bug 1928471 ***

Comment 11 Neha Ojha 2021-02-18 21:00:34 UTC
(In reply to Vijay Avuthu from comment #9)
> [...]
> Since this is failing during deployment, I am not sure how to add
> debug_mon=20, debug_ms=1, debug_paxos=20, and debug_crush=20. Is it OK to
> add them post-deployment, or is there any way to set these debug levels
> during deployment? Could you tell us the procedure?

You'll need to change these settings before the OSDs are deployed. Perhaps you could do this using "ceph config set osd debug_*". @tnielsen, any other ideas?
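
One possible way to have the higher mon debug levels in place before the OSDs are deployed (a sketch only, assuming Rook's rook-config-override ConfigMap is honored on this OCS build and is created before the mons start; the operator may reconcile or overwrite it):

# Hypothetical pre-deployment override; keys use standard ceph.conf syntax
cat <<EOF | oc -n openshift-storage apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-config-override
data:
  config: |
    [mon]
    debug mon = 20
    debug ms = 1
    debug paxos = 20
    debug crush = 20
EOF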

