Bug 1921811 - [IBM Z and Power] Ceph cluster goes into Warning state and also OSDs OOM during various tier1 tests listed
Summary: [IBM Z and Power] Ceph cluster goes into Warning state and also OSDs OOM during various tier1 tests listed
Keywords:
Status: CLOSED DUPLICATE of bug 1917815
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ceph
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Neha Ojha
QA Contact: Raz Tamir
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-01-28 16:26 UTC by Sravika
Modified: 2021-02-01 15:31 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-01 15:31:49 UTC
Embargoed:


Attachments
Tier1 test logs excluding test_pvc_expansion.py due to bug #1920498 (431.50 KB, application/zip), 2021-01-28 16:26 UTC, Sravika
OSDs describe (15.79 KB, application/zip), 2021-01-28 16:32 UTC, Sravika



Description Sravika 2021-01-28 16:26:39 UTC
Created attachment 1751751 [details]
Tier1 test logs excluding test_pvc_expansion.py due to bug #1920498

Description of problem (please be as detailed as possible and provide log snippets):

During execution of the following tier1 tests, the Ceph cluster went into Warning state and later returned to Healthy; the OSDs were also OOM-killed and restarted.

1.tests/manage/pv_services/test_change_reclaim_policy_of_pv.py::TestChangeReclaimPolicyOfPv::test_change_reclaim_policy_of_pv[CephFileSystem-Delete]

Ceph Status:

2021-01-27 19:26:43.376219 mon.a [WRN] Health check failed: 0 slow ops, oldest one blocked for 33 sec, osd.1 has slow ops (SLOW_OPS)
2021-01-27 19:26:45.691454 mon.a [WRN] Health check failed: 1 MDSs report slow metadata IOs (MDS_SLOW_METADATA_IO)
2021-01-27 19:26:48.751071 mon.a [WRN] Health check update: 0 slow ops, oldest one blocked for 38 sec, osd.1 has slow ops (SLOW_OPS)
2021-01-27 19:26:53.271610 mon.a [INF] MDS health message cleared (mds.?): 1 slow metadata IOs are blocked > 30 secs, oldest blocked for 39 secs
2021-01-27 19:26:54.235007 mon.a [INF] Health check cleared: MDS_SLOW_METADATA_IO (was: 1 MDSs report slow metadata IOs)
2021-01-27 19:26:55.995883 mon.a [WRN] Health check update: 0 slow ops, oldest one blocked for 43 sec, osd.1 has slow ops (SLOW_OPS)
2021-01-27 19:26:59.271091 mon.a [INF] Health check cleared: SLOW_OPS (was: 0 slow ops, oldest one blocked for 43 sec, osd.1 has slow ops)
2021-01-27 19:26:59.271150 mon.a [INF] Cluster is now healthy
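
The SLOW_OPS and MDS_SLOW_METADATA_IO entries above come from the monitor log; a minimal sketch of how the same health state can be captured live from the toolbox pod (assuming the usual app=rook-ceph-tools label; the exact pod name appears in the pod listing under "Actual results"):

# Sketch: query cluster health from the rook-ceph-tools pod while a test runs
TOOLS=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -n1)
oc -n openshift-storage exec -it "$TOOLS" -- ceph health detail
oc -n openshift-storage exec -it "$TOOLS" -- ceph status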


2.tests/manage/pv_services/test_change_reclaim_policy_of_pv.py::TestChangeReclaimPolicyOfPv::test_change_reclaim_policy_of_pv[CephFileSystem-Retain] 

 
Ceph Status:


2021-01-27 19:37:37.111099 mon.a [WRN] Health check failed: 0 slow ops, oldest one blocked for 33 sec, osd.3 has slow ops (SLOW_OPS)
2021-01-27 19:37:38.207741 mon.a [WRN] Health check failed: 1 MDSs report slow metadata IOs (MDS_SLOW_METADATA_IO)
2021-01-27 19:37:43.374077 mon.a [WRN] Health check update: 0 slow ops, oldest one blocked for 38 sec, osd.3 has slow ops (SLOW_OPS)
2021-01-27 19:37:45.318075 mon.a [INF] MDS health message cleared (mds.?): 1 slow metadata IOs are blocked > 30 secs, oldest blocked for 34 secs
2021-01-27 19:37:46.052227 mon.a [INF] Health check cleared: MDS_SLOW_METADATA_IO (was: 1 MDSs report slow metadata IOs)
2021-01-27 19:37:47.067737 mon.a [INF] Health check cleared: SLOW_OPS (was: 0 slow ops, oldest one blocked for 38 sec, osd.3 has slow ops)
2021-01-27 19:37:47.067793 mon.a [INF] Cluster is now healthy
2021-01-27 19:41:13.187089 mon.a [WRN] Health check failed: 0 slow ops, oldest one blocked for 39 sec, daemons [osd.2,osd.3] have slow ops. (SLOW_OPS)
2021-01-27 19:41:19.245882 mon.a [INF] Health check cleared: SLOW_OPS (was: 0 slow ops, oldest one blocked for 39 sec, osd.2 has slow ops)
2021-01-27 19:41:19.245948 mon.a [INF] Cluster is now healthy
2021-01-27 20:00:00.000178 mon.a [INF] overall HEALTH_OK


3.tests/manage/pv_services/pvc_snapshot/test_snapshot_at_different_pvc_utlilization_level.py::TestSnapshotAtDifferentPvcUsageLevel::test_snapshot_at_different_usage_level 


Ceph Status:


2021-01-27 21:42:54.635079 mon.a [INF] osd.2 failed (root=default,rack=rack0,host=worker-0-m1312001ocs-lnxne-boe) (connection refused reported by osd.1)
2021-01-27 21:42:54.948781 mon.a [WRN] Health check failed: 1 osds down (OSD_DOWN)
2021-01-27 21:42:54.948816 mon.a [WRN] Health check failed: 1 host (1 osds) down (OSD_HOST_DOWN)
2021-01-27 21:42:54.948826 mon.a [WRN] Health check failed: 1 rack (1 osds) down (OSD_RACK_DOWN)
2021-01-27 21:42:58.003508 mon.a [WRN] Health check failed: Degraded data redundancy: 9354/58722 objects degraded (15.929%), 64 pgs degraded (PG_DEGRADED)
2021-01-27 21:43:03.197293 mon.a [WRN] Health check update: Degraded data redundancy: 13679/59508 objects degraded (22.987%), 96 pgs degraded (PG_DEGRADED)
2021-01-27 21:43:07.296215 mon.a [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2021-01-27 21:43:07.296260 mon.a [INF] Health check cleared: OSD_HOST_DOWN (was: 1 host (1 osds) down)
2021-01-27 21:43:07.296275 mon.a [INF] Health check cleared: OSD_RACK_DOWN (was: 1 rack (1 osds) down)
2021-01-27 21:43:07.344101 mon.a [INF] osd.2 [v2:10.130.2.43:6800/1851606,v1:10.130.2.43:6801/1851606] boot
2021-01-27 21:43:09.363979 mon.a [WRN] Health check update: Degraded data redundancy: 10709/60075 objects degraded (17.826%), 77 pgs degraded (PG_DEGRADED)
2021-01-27 21:43:17.640219 mon.a [WRN] Health check update: Degraded data redundancy: 326/60591 objects degraded (0.538%), 27 pgs degraded (PG_DEGRADED)
2021-01-27 21:43:22.643581 mon.a [WRN] Health check update: Degraded data redundancy: 304/61203 objects degraded (0.497%), 25 pgs degraded (PG_DEGRADED)
2021-01-27 21:43:24.928576 mon.a [INF] osd.1 failed (root=default,rack=rack1,host=worker-1-m1312001ocs-lnxne-boe) (connection refused reported by osd.2)
2021-01-27 21:43:24.948596 mon.a [INF] osd.3 failed (root=default,rack=rack3,host=worker-3-m1312001ocs-lnxne-boe) (connection refused reported by osd.2)
2021-01-27 21:43:25.277733 mon.a [WRN] Health check failed: 2 osds down (OSD_DOWN)
2021-01-27 21:43:25.277765 mon.a [WRN] Health check failed: 2 hosts (2 osds) down (OSD_HOST_DOWN)
2021-01-27 21:43:25.277778 mon.a [WRN] Health check failed: 2 racks (2 osds) down (OSD_RACK_DOWN)
2021-01-27 21:43:27.294908 mon.a [WRN] Health check failed: Reduced data availability: 3 pgs inactive, 22 pgs peering (PG_AVAILABILITY)
2021-01-27 21:43:27.644557 mon.a [WRN] Health check update: Degraded data redundancy: 273/61836 objects degraded (0.441%), 19 pgs degraded (PG_DEGRADED)
2021-01-27 21:43:29.306160 mon.a [WRN] Health check failed: 132/20644 objects unfound (0.639%) (OBJECT_UNFOUND)
2021-01-27 21:43:29.306197 mon.a [ERR] Health check failed: Possible data damage: 8 pgs recovery_unfound (PG_DAMAGED)
2021-01-27 21:43:32.335616 mon.a [WRN] Health check update: 1 osds down (OSD_DOWN)
2021-01-27 21:43:32.335651 mon.a [WRN] Health check update: 1 host (1 osds) down (OSD_HOST_DOWN)
2021-01-27 21:43:32.335661 mon.a [WRN] Health check update: 1 rack (1 osds) down (OSD_RACK_DOWN)
2021-01-27 21:43:32.343753 mon.a [INF] osd.3 [v2:10.128.2.21:6800/3474833,v1:10.128.2.21:6801/3474833] boot
2021-01-27 21:43:32.645444 mon.a [WRN] Health check update: Reduced data availability: 3 pgs inactive, 25 pgs peering (PG_AVAILABILITY)
2021-01-27 21:43:32.645548 mon.a [WRN] Health check update: Degraded data redundancy: 15421/61932 objects degraded (24.900%), 71 pgs degraded (PG_DEGRADED)
2021-01-27 21:43:36.408517 mon.a [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2021-01-27 21:43:36.408553 mon.a [INF] Health check cleared: OSD_HOST_DOWN (was: 1 host (1 osds) down)
2021-01-27 21:43:36.408565 mon.a [INF] Health check cleared: OSD_RACK_DOWN (was: 1 rack (1 osds) down)
2021-01-27 21:43:36.440144 mon.a [INF] osd.1 [v2:10.129.2.18:6800/2709708,v1:10.129.2.18:6801/2709708] boot
2021-01-27 21:43:37.439576 mon.a [WRN] Health check update: 132/20672 objects unfound (0.639%) (OBJECT_UNFOUND)
2021-01-27 21:43:37.646313 mon.a [WRN] Health check update: Reduced data availability: 3 pgs inactive (PG_AVAILABILITY)
2021-01-27 21:43:37.646372 mon.a [ERR] Health check update: Possible data damage: 9 pgs recovery_unfound (PG_DAMAGED)
2021-01-27 21:43:37.646388 mon.a [WRN] Health check update: Degraded data redundancy: 26553/62016 objects degraded (42.816%), 124 pgs degraded (PG_DEGRADED)
2021-01-27 21:43:42.647237 mon.a [WRN] Health check update: 46/20771 objects unfound (0.221%) (OBJECT_UNFOUND)
2021-01-27 21:43:42.647310 mon.a [WRN] Health check update: Reduced data availability: 2 pgs inactive (PG_AVAILABILITY)
2021-01-27 21:43:42.647327 mon.a [ERR] Health check update: Possible data damage: 3 pgs recovery_unfound (PG_DAMAGED)
2021-01-27 21:43:42.647338 mon.a [WRN] Health check update: Degraded data redundancy: 11945/62313 objects degraded (19.169%), 60 pgs degraded (PG_DEGRADED)
2021-01-27 21:43:43.573285 mon.a [INF] Health check cleared: OBJECT_UNFOUND (was: 46/20771 objects unfound (0.221%))
2021-01-27 21:43:43.573334 mon.a [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 2 pgs inactive)
2021-01-27 21:43:43.573351 mon.a [INF] Health check cleared: PG_DAMAGED (was: Possible data damage: 3 pgs recovery_unfound)
2021-01-27 21:43:47.648029 mon.a [WRN] Health check update: Degraded data redundancy: 317/62826 objects degraded (0.505%), 33 pgs degraded (PG_DEGRADED)
2021-01-27 21:43:50.276660 mon.a [INF] osd.0 failed (root=default,rack=rack2,host=worker-2-m1312001ocs-lnxne-boe) (connection refused reported by osd.1)
2021-01-27 21:43:50.756047 mon.a [WRN] Health check failed: 1 osds down (OSD_DOWN)
2021-01-27 21:43:50.756081 mon.a [WRN] Health check failed: 1 host (1 osds) down (OSD_HOST_DOWN)
2021-01-27 21:43:50.756097 mon.a [WRN] Health check failed: 1 rack (1 osds) down (OSD_RACK_DOWN)
2021-01-27 21:43:52.654913 mon.a [WRN] Health check update: Degraded data redundancy: 292/63393 objects degraded (0.461%), 30 pgs degraded (PG_DEGRADED)
2021-01-27 21:43:53.819868 mon.a [WRN] Health check failed: Reduced data availability: 10 pgs peering (PG_AVAILABILITY)
2021-01-27 21:43:57.199099 mon.a [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 10 pgs peering)
2021-01-27 21:43:57.655838 mon.a [WRN] Health check update: Degraded data redundancy: 16154/63813 objects degraded (25.315%), 100 pgs degraded (PG_DEGRADED)
2021-01-27 21:44:00.471750 mon.a [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2021-01-27 21:44:00.471830 mon.a [INF] Health check cleared: OSD_HOST_DOWN (was: 1 host (1 osds) down)
2021-01-27 21:44:00.471847 mon.a [INF] Health check cleared: OSD_RACK_DOWN (was: 1 rack (1 osds) down)
2021-01-27 21:44:00.483355 mon.a [INF] osd.0 [v2:10.131.0.26:6800/4139156,v1:10.131.0.26:6801/4139156] boot
2021-01-27 21:44:02.656758 mon.a [WRN] Health check update: Degraded data redundancy: 16235/64212 objects degraded (25.283%), 99 pgs degraded (PG_DEGRADED)
2021-01-27 21:44:07.657652 mon.a [WRN] Health check update: Degraded data redundancy: 481/64941 objects degraded (0.741%), 38 pgs degraded (PG_DEGRADED)
2021-01-27 21:44:12.658418 mon.a [WRN] Health check update: Degraded data redundancy: 465/65370 objects degraded (0.711%), 35 pgs degraded (PG_DEGRADED)
2021-01-27 21:44:17.659213 mon.a [WRN] Health check update: Degraded data redundancy: 422/66105 objects degraded (0.638%), 30 pgs degraded (PG_DEGRADED)
2021-01-27 21:44:22.660115 mon.a [WRN] Health check update: Degraded data redundancy: 400/66492 objects degraded (0.602%), 29 pgs degraded (PG_DEGRADED)
2021-01-27 21:44:27.661018 mon.a [WRN] Health check update: Degraded data redundancy: 363/67227 objects degraded (0.540%), 23 pgs degraded (PG_DEGRADED)
2021-01-27 21:44:32.661851 mon.a [WRN] Health check update: Degraded data redundancy: 326/67686 objects degraded (0.482%), 22 pgs degraded (PG_DEGRADED)
2021-01-27 21:44:37.662666 mon.a [WRN] Health check update: Degraded data redundancy: 269/68373 objects degraded (0.393%), 19 pgs degraded (PG_DEGRADED)
2021-01-27 21:44:42.663643 mon.a [WRN] Health check update: Degraded data redundancy: 243/68817 objects degraded (0.353%), 19 pgs degraded (PG_DEGRADED)
2021-01-27 21:44:47.664531 mon.a [WRN] Health check update: Degraded data redundancy: 219/69504 objects degraded (0.315%), 17 pgs degraded (PG_DEGRADED)
2021-01-27 21:44:52.666924 mon.a [WRN] Health check update: Degraded data redundancy: 170/69930 objects degraded (0.243%), 15 pgs degraded (PG_DEGRADED)
2021-01-27 21:44:57.667764 mon.a [WRN] Health check update: Degraded data redundancy: 150/70635 objects degraded (0.212%), 13 pgs degraded (PG_DEGRADED)
2021-01-27 21:45:02.668566 mon.a [WRN] Health check update: Degraded data redundancy: 130/71040 objects degraded (0.183%), 11 pgs degraded (PG_DEGRADED)
2021-01-27 21:45:07.669321 mon.a [WRN] Health check update: Degraded data redundancy: 92/71697 objects degraded (0.128%), 9 pgs degraded (PG_DEGRADED)
2021-01-27 21:45:12.670034 mon.a [WRN] Health check update: Degraded data redundancy: 68/72012 objects degraded (0.094%), 6 pgs degraded (PG_DEGRADED)
2021-01-27 21:45:17.670701 mon.a [WRN] Health check update: Degraded data redundancy: 35/72024 objects degraded (0.049%), 4 pgs degraded (PG_DEGRADED)
2021-01-27 21:45:22.671428 mon.a [WRN] Health check update: Degraded data redundancy: 12/72024 objects degraded (0.017%), 3 pgs degraded (PG_DEGRADED)
2021-01-27 21:45:23.231946 mon.a [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 12/72024 objects degraded (0.017%), 3 pgs degraded)
2021-01-27 21:45:23.232004 mon.a [INF] Cluster is now healthy
2021-01-27 22:00:00.000175 mon.a [INF] overall HEALTH_OK
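
The OBJECT_UNFOUND and PG_DAMAGED checks above clear on their own once the OSDs come back; had they persisted, a hedged sketch of how the affected PGs could have been inspected from the toolbox pod (the PG id below is a placeholder, not taken from this cluster):

# Sketch: inspect PGs reporting unfound objects (PG id is a placeholder)
TOOLS=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -n1)
oc -n openshift-storage exec -it "$TOOLS" -- ceph health detail
oc -n openshift-storage exec -it "$TOOLS" -- ceph pg 1.2f query
oc -n openshift-storage exec -it "$TOOLS" -- ceph pg 1.2f list_unfound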

Version of all relevant components (if applicable):

OCP 4.12
OCS 4.6.2 (4.6.2-233.ci) 
Ceph 14.2.8-115.el8cp

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

Ceph reports warnings; however, the cluster returns to a Healthy state by the end of test execution.

Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install OCP 4.12 and OCS 4.6.2 (4.6.2-233.ci) with 4 workers
2. The 4 OSDs should have resource limits as follows (a quick way to verify the applied limits is sketched after these steps):
      Limits:
      cpu:     2
      memory:  5Gi

3. Run ocs-ci with the "tier1" marker as follows, excluding test_pvc_expansion.py due to bug #1920498:

run-ci -m 'tier1' --ocsci-conf config.yaml --cluster-path <cluster_path> --html=<path> --self-contained-html 
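
Before starting the run, the limits from step 2 can be cross-checked against the live OSD deployments; a minimal sketch, assuming the rook-ceph-osd-<N> deployment names implied by the pod listing under "Actual results":

# Sketch: print the resource requests/limits of each OSD deployment
for i in 0 1 2 3; do
  oc -n openshift-storage get deployment rook-ceph-osd-$i \
    -o jsonpath='{.spec.template.spec.containers[0].resources}{"\n"}'
done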


Actual results:
The Ceph cluster goes into Warning state during the tier1 tests listed in the Description, and the OSDs are OOM-killed and restarted while the tests run.

# oc get po -n openshift-storage
NAME                                                              READY   STATUS      RESTARTS   AGE
csi-cephfsplugin-dkcct                                            3/3     Running     0          33h
csi-cephfsplugin-fc86d                                            3/3     Running     0          33h
csi-cephfsplugin-g9tr2                                            3/3     Running     0          33h
csi-cephfsplugin-provisioner-d8ccd695d-lcf87                      6/6     Running     0          28h
csi-cephfsplugin-provisioner-d8ccd695d-wwgdn                      6/6     Running     0          33h
csi-cephfsplugin-sv5dg                                            3/3     Running     0          33h
csi-rbdplugin-5v75r                                               3/3     Running     0          33h
csi-rbdplugin-b6h4z                                               3/3     Running     0          33h
csi-rbdplugin-jdsdp                                               3/3     Running     0          33h
csi-rbdplugin-provisioner-76988fbc89-45nj5                        6/6     Running     0          33h
csi-rbdplugin-provisioner-76988fbc89-tvbd5                        6/6     Running     0          33h
csi-rbdplugin-shq89                                               3/3     Running     0          33h
noobaa-core-0                                                     1/1     Running     0          28h
noobaa-db-0                                                       1/1     Running     0          28h
noobaa-endpoint-69b866c674-cxjxm                                  1/1     Running     1          33h
noobaa-endpoint-69b866c674-mfsd8                                  1/1     Running     1          32h
noobaa-operator-55fc95dc4c-7jfc6                                  1/1     Running     0          33h
ocs-metrics-exporter-c5655b599-pnhs7                              1/1     Running     0          28h
ocs-operator-c946699b4-bd7tm                                      1/1     Running     0          33h
rook-ceph-crashcollector-worker-0.m1312001ocs.lnxne.boe-7czl5n4   1/1     Running     0          6h31m
rook-ceph-crashcollector-worker-1.m1312001ocs.lnxne.boe-69b2xvl   1/1     Running     0          33h
rook-ceph-crashcollector-worker-2.m1312001ocs.lnxne.boe-57ztsc9   1/1     Running     0          33h
rook-ceph-crashcollector-worker-3.m1312001ocs.lnxne.boe-f4rdkkk   1/1     Running     0          33h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7874f787l7zrr   1/1     Running     0          33h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-667579dc9tlvt   1/1     Running     0          28h
rook-ceph-mgr-a-5b5b4b7cff-4gr6n                                  1/1     Running     0          33h
rook-ceph-mon-a-5466c8c67d-wpvxx                                  1/1     Running     0          33h
rook-ceph-mon-b-786785f49d-7s2hh                                  1/1     Running     0          33h
rook-ceph-mon-c-dc6965b87-p96kn                                   1/1     Running     0          33h
rook-ceph-operator-6c97bf77-7sk27                                 1/1     Running     0          33h
rook-ceph-osd-0-766ffcdfd5-vqdhm                                  1/1     Running     3          33h
rook-ceph-osd-1-556b7758b5-dzwq2                                  1/1     Running     3          33h
rook-ceph-osd-2-b5b84556d-pwjv7                                   1/1     Running     0          6h31m
rook-ceph-osd-3-79f4c5f689-vxhj9                                  1/1     Running     3          33h
rook-ceph-osd-prepare-ocs-deviceset-1-data-0-n94jc-qcsrh          0/1     Completed   0          33h
rook-ceph-osd-prepare-ocs-deviceset-2-data-0-tpvw8-z94ws          0/1     Completed   0          33h
rook-ceph-osd-prepare-ocs-deviceset-3-data-0-xg67l-6cwq9          0/1     Completed   0          33h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-5966fcfqwhb8   1/1     Running     0          33h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-8557766c2nsp   1/1     Running     0          28h
rook-ceph-tools-6fdd868f75-kgm82                                  1/1     Running     0          33h
worker-1m1312001ocslnxneboe-debug                                 0/1     Completed   0          6h34m
worker-2m1312001ocslnxneboe-debug                                 0/1     Completed   0          6h34m
worker-3m1312001ocslnxneboe-debug                                 0/1     Completed   0          6h34m
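
The RESTARTS column above only shows the count; one way to confirm that the restarts are OOM kills is to check the last container termination reason (a sketch, using one of the OSD pod names from the listing above; kubelet reports Reason: OOMKilled for memory-limit kills):

# Sketch: confirm that the previous OSD container termination was an OOM kill
oc -n openshift-storage get pod rook-ceph-osd-1-556b7758b5-dzwq2 \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'
oc -n openshift-storage describe pod rook-ceph-osd-1-556b7758b5-dzwq2 | grep -A 5 'Last State'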

Expected results:
The Ceph cluster should not go into Warning state and the OSDs should not be OOM-killed during the tests.

Additional info:

NOTE: The tier1 test log is in the CET timezone, while the OpenShift cluster (master, worker, pods) is in UTC; timestamps in the ocs-ci test log are therefore one hour ahead of the cluster logs.


Please find the must-gather logs here

https://drive.google.com/file/d/15oDn8CHAjRKpQwpBkBGtcAS9tZgZNJVv/view?usp=sharing


Worker node resources are as follows:

No. of workers: 4
Memory: 64 GB
CPU: 16
Disk: 500 GB each for OSDs

Comment 2 Sravika 2021-01-28 16:32:39 UTC
Created attachment 1751752 [details]
OSDs describe

Comment 3 Aaruni Aggarwal 2021-01-29 16:53:41 UTC
We are also seeing multiple OSD pod restarts due to OOM kills on our IBM Power platform. I ran the scale and tier1 tests and then started a performance test on the cluster; the OSD pod restarted 25 times.

[root@ocs4-aaragga1-5ed0-bastion-0 ~]# oc get pods -n openshift-storage |grep osd-1
rook-ceph-osd-1-5bd4d44b6f-dd6nq               1/1     Running     25         3d12h

I then set up a kruize pod to monitor the OSD pod, and while the performance test was running I checked the values generated by kruize.
[root@ocs4-aaragga1-5ed0-bastion-0 ~]# curl http://kruize-openshift-monitoring.apps.ocs4-aaragga1-5ed0.ibm.com/recommendations?application_name=rook-ceph-osd-1

[
  {
    "application_name": "rook-ceph-osd-1",
    "resources": {
      "requests": {
        "memory": "3427.2M",
        "cpu": 0.5
      },
      "limits": {
        "memory": "6415.4M",
        "cpu": 1.0
      }
    }
  }
]

We are using this storagecluster.yaml file to deploy our storage cluster: https://github.com/red-hat-storage/ocs-ci/blob/master/ocs_ci/templates/ocs-deployment/ibm-storage-cluster.yaml
We have 3 worker nodes, each with 16 vCPUs, 64 GB memory, and an additional 500 GB disk; the OCS version is 4.6.2 (4.6.2-233.ci).
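
If the ~6.4G memory limit recommended by kruize above were to be tried, one possible route is raising the OSD resources in the StorageCluster CR. The sketch below is illustrative only: it assumes the OSD resources live under spec.storageDeviceSets[].resources and that the CR is named ocs-storagecluster, both of which should be confirmed for this deployment, and the values are not a recommendation from this bug.

# Hypothetical sketch: raise OSD memory limits via the StorageCluster device set
oc -n openshift-storage patch storagecluster ocs-storagecluster --type json \
  -p '[{"op":"add","path":"/spec/storageDeviceSets/0/resources","value":{"requests":{"cpu":"1","memory":"5Gi"},"limits":{"cpu":"2","memory":"7Gi"}}}]'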

Comment 4 Aaruni Aggarwal 2021-01-29 16:56:01 UTC
The 3 OSDs are configured as follows:

Limits:
      cpu:     2
      memory:  5Gi
    Requests:
      cpu:     2
      memory:  5Gi

Comment 5 Josh Durgin 2021-02-01 15:31:49 UTC

*** This bug has been marked as a duplicate of bug 1917815 ***

