Description of problem (please be as detailed as possible and provide log snippets):

OSD pod crashes due to OOM.

Version of all relevant components (if applicable):

OCP v4.8 / ODF v4.9

$ ceph version
ceph version 16.2.0-143.el8cp (0e2c6f9639c37a03e55885fb922dc0cb1b5173cb) pacific (stable)

Default ODF installation - default limits/requests for ODF pods.

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Yes

Is there any workaround available to the best of your knowledge?

NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

3

Is this issue reproducible?

I hit this issue 2 times, but do not have a clear reproducer:
- The first time, the issue happened when I expanded the cluster from 3 to 6 OSD nodes. When it happened, the cluster was in the state described in https://bugzilla.redhat.com/show_bug.cgi?id=2021079. As the issue here started to be visible when I expanded the cluster, it looks similar to what is logged in https://bugzilla.redhat.com/show_bug.cgi?id=2008420.
- The second time, the issue happened as described in "Steps to Reproduce" below.

Can this issue be reproduced from the UI?

No

If this is a regression, please provide more details to justify this:

NA

Steps to Reproduce:

Below are the steps which led to this issue.

1. Install OCP/ODF on 2 clusters with the above versions and use ACM to set up ODF mirroring between the clusters.
2. Create hundreds of pods (in this case 600), attach a 5 GB PVC to each pod, and write 1 GB of data per pod. This results in roughly 600 GB written to the Ceph backend.
3. Check that the images are replicated between the OCP/ODF clusters (I used "rbd -p storagecluster-cephblockpool ls") and compare the "ceph df" output - it should be the same on both clusters (see the command sketch after these steps).
4. On the first cluster, delete all pods and VolumeReplication resources - this triggers data deletion on the Ceph backend.
5. After step 4, some of the OSDs hit OOM and restart, leaving the cluster in an unstable state.
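For reference, a minimal sketch of the check and delete commands behind steps 3-5, run from the rook-ceph toolbox ($TOOLS_POD, as in the output below) and the oc CLI. The application namespace "busybox-workloads" is only a placeholder for wherever the pods/PVCs from step 2 were created:

$ oc rsh -n openshift-storage $TOOLS_POD
# step 3: list the mirrored images and compare pool usage on both clusters
sh-4.4$ rbd -p storagecluster-cephblockpool ls | wc -l
sh-4.4$ ceph df

# step 4: bulk-delete the workload on the first cluster (namespace is a placeholder)
$ oc delete volumereplication --all -n busybox-workloads
$ oc delete pod --all -n busybox-workloads

# step 5: watch the OSD pods while the backend deletes/snaptrims the data
$ oc get pods -n openshift-storage -w | grep rook-ceph-osd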
I have seen this problem on both of the clusters involved, but not at the same time.

Actual results:
OSD(s) crash due to OOM.

Expected results:
OSD(s) do not crash due to OOM.

Additional info:

OSD logs: http://perf148b.perf.lab.eng.bos.redhat.com/osd_crash_bz/

oc rsh -n openshift-storage $TOOLS_POD

sh-4.4$ ceph df
--- RAW STORAGE ---
CLASS  SIZE    AVAIL   USED     RAW USED  %RAW USED
ssd    18 TiB  16 TiB  1.7 TiB  1.7 TiB   9.37
TOTAL  18 TiB  16 TiB  1.7 TiB  1.7 TiB   9.37

--- POOLS ---
POOL                                    ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
storagecluster-cephblockpool             1  128  723 GiB  175.64k  1.7 TiB  11.02    5.0 TiB
storagecluster-cephfilesystem-metadata   2   32  503 KiB       22  1.5 MiB      0    4.7 TiB
storagecluster-cephfilesystem-data0      3  128      0 B        0      0 B      0    4.5 TiB
device_health_metrics                    4    1  1.2 MiB       12  2.3 MiB      0    6.7 TiB

sh-4.4$ ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME                      STATUS  REWEIGHT  PRI-AFF
 -1         18.00000  root default
 -5         18.00000      region us-west-2
-10          6.00000          zone us-west-2a
 -9          6.00000              host ip-10-0-134-115
  0    ssd   2.00000                  osd.0              up   1.00000  1.00000
  3    ssd   2.00000                  osd.3              up   1.00000  1.00000
  6    ssd   2.00000                  osd.6              up   1.00000  1.00000
 -4          6.00000          zone us-west-2b
 -3          6.00000              host ip-10-0-168-65
  1    ssd   2.00000                  osd.1              up   1.00000  1.00000
  5    ssd   2.00000                  osd.5            down   1.00000  1.00000
  8    ssd   2.00000                  osd.8              up   1.00000  1.00000
-14          6.00000          zone us-west-2c
-13          6.00000              host ip-10-0-212-246
  2    ssd   2.00000                  osd.2              up   1.00000  1.00000
  4    ssd   2.00000                  osd.4              up   1.00000  1.00000
  7    ssd   2.00000                  osd.7              up   1.00000  1.00000

sh-4.4$ ceph -s
  cluster:
    id:     d559afcb-accb-4431-a689-2e0555bf4b2b
    health: HEALTH_WARN
            1 osds down
            Slow OSD heartbeats on back (longest 9222.186ms)
            Slow OSD heartbeats on front (longest 8944.892ms)
            Degraded data redundancy: 54696/525393 objects degraded (10.410%), 45 pgs degraded, 92 pgs undersized
            snap trim queue for 5 pg(s) >= 32768 (mon_osd_snap_trim_queue_warn_on)

  services:
    mon:        3 daemons, quorum a,b,c (age 6d)
    mgr:        a(active, since 6d)
    mds:        1/1 daemons up, 1 hot standby
    osd:        9 osds: 8 up (since 80s), 9 in (since 2d)
    rbd-mirror: 1 daemon active (1 hosts)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 289 pgs
    objects: 175.13k objects, 570 GiB
    usage:   1.7 TiB used, 16 TiB / 18 TiB avail
    pgs:     54696/525393 objects degraded (10.410%)
             135 active+clean
             48  active+clean+snaptrim_wait
             47  active+undersized
             45  active+undersized+degraded
             14  active+clean+snaptrim

  io:
    client: 29 KiB/s rd, 20 KiB/s wr, 33 op/s rd, 75 op/s wr
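A rough sketch of how the OOM kills can be confirmed, assuming the default rook-ceph pod labels; the OSD pod name is a placeholder and osd.5 is used only because it is the OSD shown down above:

# which OSD containers were OOM-killed / restarted
$ oc get pods -n openshift-storage -l app=rook-ceph-osd
$ oc describe pod -n openshift-storage rook-ceph-osd-5-xxxxxxxxxx-xxxxx | grep -A 5 'Last State'
    # expect 'Reason: OOMKilled' on the affected OSD

# memory usage vs. the pod limits and osd_memory_target
$ oc adm top pods -n openshift-storage | grep rook-ceph-osd
sh-4.4$ ceph config get osd osd_memory_target
sh-4.4$ ceph tell osd.5 dump_mempools    # per-OSD memory pools, if the OSD is back up and reachable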
Missed the 6.1 z1 window. Retargeting to 6.1 z2.