Bug 1917815 - [IBM Z and Power] OSD pods restarting due to OOM during upgrade test using ocs-ci
Summary: [IBM Z and Power] OSD pods restarting due to OOM during upgrade test using ocs-ci
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ceph
Version: 4.6
Hardware: s390x
OS: Linux
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: OCS 4.7.0
Assignee: Mark Nelson
QA Contact: Raz Tamir
URL:
Whiteboard:
Duplicates: 1921216 1921811
Depends On: 1925650 1925651
Blocks: 1920498
 
Reported: 2021-01-19 12:48 UTC by Abdul Kandathil (IBM)
Modified: 2021-07-12 08:28 UTC (History)
28 users (show)

Fixed In Version: v4.7.0-280.ci
Doc Type: Known Issue
Doc Text:
Clone Of:
Clones: 1925650 1925651
Environment:
Last Closed: 2021-05-19 09:18:16 UTC
Embargoed:


Attachments
must gather logs (16.03 MB, application/zip), 2021-01-20 08:49 UTC, Abdul Kandathil (IBM)


Links
Ceph Project Bug Tracker 49296 (last updated 2021-02-16 23:21:49 UTC)
Red Hat Product Errata RHSA-2021:2041 (last updated 2021-05-19 09:18:43 UTC)

Description Abdul Kandathil (IBM) 2021-01-19 12:48:01 UTC
Description of problem (please be as detailed as possible and provide log snippets):
During the OCS upgrade test (from 4.6.0-195.ci to 4.6.1-206.ci) using ocs-ci, the OSD pods restarted multiple times due to OOM:

```
rook-ceph-osd-0-54bd76b57f-vw9ld                                  1/1     Running     4          17h
rook-ceph-osd-1-7db8558947-q7n8s                                  1/1     Running     4          17h
rook-ceph-osd-2-548cfbdbff-bgcmd                                  1/1     Running     3          17h
```
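
A quick way to confirm these restarts are OOM kills is to check the last terminated state of the OSD containers. This is an illustrative sketch, assuming Rook's default `app=rook-ceph-osd` pod label:

```
# Illustrative: print each OSD pod with the reason its container last terminated
# (expected to show "OOMKilled" when the kernel killed it for exceeding the limit).
oc -n openshift-storage get pod -l app=rook-ceph-osd \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'
```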


Version of all relevant components (if applicable):

Upgrade from OCS 4.6.0-195.ci to 4.6.1-206.ci


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

Once the upgrade test was complete, the pods were stable and no more restarts happened.

$oc get csv -n openshift-storage
NAME                         DISPLAY                       VERSION        REPLACES                     PHASE
ocs-operator.v4.6.1-206.ci   OpenShift Container Storage   4.6.1-206.ci   ocs-operator.v4.6.0-195.ci   Succeeded
$

Is there any workaround available to the best of your knowledge?

Trying with an increased memory limit for the OSD pods, but I don't have results yet.
Currently, the memory limit is set at the default of 5 GB.
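
For illustration only, a minimal sketch of how the limit could be raised, assuming the OSD resources are driven by the first storageDeviceSets entry of the StorageCluster CR (the patch path and values are assumptions, not the verified fix):

```
# Hypothetical sketch: raise the OSD memory request/limit to 8Gi on the default
# StorageCluster. The JSON-patch path assumes OSD resources live under
# storageDeviceSets[0].resources; adjust to the actual CR layout.
oc -n openshift-storage patch storagecluster ocs-storagecluster --type json -p '[
  {"op": "add", "path": "/spec/storageDeviceSets/0/resources",
   "value": {"requests": {"cpu": "1", "memory": "8Gi"},
             "limits":   {"cpu": "2", "memory": "8Gi"}}}
]'
```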


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
Yes


Can this issue be reproduced from the UI?
No

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install OCS: OCS operator + Local Storage Operator + storage cluster
2. Run ocs-ci with the "pre_upgrade or upgrade or post_upgrade" markers

$ run-ci tests/ --ocsci-conf <config yaml> --cluster-path <path> -m 'pre_upgrade or upgrade or post_upgrade' --ocs-version 4.6 --upgrade-ocs-version 4.6.1 --upgrade-ocs-registry-image 'quay.io/rhceph-dev/ocs-registry:4.6.1-206.ci'

3. The issue starts appearing once the upgrade test (tests/ecosystem/upgrade/test_upgrade.py::test_upgrade) starts


Actual results:
The upgrade happened but OSD pods restarted multiple times due to OOM.

Expected results:
The OCS components upgrade without issues.


Additional info:

OCP version: 4.6.9
Worker node size: 4 worker nodes, each with 64 GB memory, 16 cores, and a 1 TB disk for the OSD.

Must-gather logs are uploaded to google drive because of size restriction:
https://drive.google.com/file/d/1oyU2MzS5BA86UuaEyrmPRMZ5pnurnpEE/view?usp=sharing

Comment 2 Yaniv Kaul 2021-01-19 14:05:42 UTC
Anything in the logs? Is that specific to the IBM platform?

Comment 3 Abdul Kandathil (IBM) 2021-01-19 14:30:52 UTC
I am testing only on the IBM Platform.

I just tested with an 8 GB memory limit and don't see any pod restarts during or after the upgrade.

Comment 4 Abdul Kandathil (IBM) 2021-01-20 08:49:03 UTC
Created attachment 1748977 [details]
must gather logs

Around two hours after the upgrade (with the 8 GB limit for the OSD pods), I noticed a restart of 2 out of 3 OSD pods due to OOM.

----
rook-ceph-osd-0-7d57865df5-7ch7s                                  1/1     Running     0          4h54m
rook-ceph-osd-1-844464b9b7-g2g4d                                  1/1     Running     1          4h54m
rook-ceph-osd-2-7fcd989b4c-j48rd                                  1/1     Running     1          4h54m
-----

Comment 5 Mudit Agarwal 2021-01-22 08:46:19 UTC
IMHO, the rook team should take a look first.

Comment 7 Travis Nielsen 2021-01-27 18:40:36 UTC
I am not aware of any reason OSDs would consume so much more memory during the upgrade from OCS 4.6.0 to 4.6.1. 

Some questions:
- Can we get access to the system where the OSDs are seeing this issue?
- Does this only happen in the upgrade tests and not in the other tests?
- What operations are happening during the tests? I'm not familiar with them.

Moving to the Ceph component to have the core team look at the OSDs.

Comment 8 Josh Durgin 2021-01-27 21:18:49 UTC
Mark, can you take a look?

I think we'll need access to the system if possible, since must-gather does not collect the necessary logs for issues like this due to https://bugzilla.redhat.com/show_bug.cgi?id=1869406 and https://bugzilla.redhat.com/show_bug.cgi?id=1882534

Comment 9 Abdul Kandathil (IBM) 2021-01-28 10:34:23 UTC
I don't think it is possible to give access to our internal systems.
Would it be OK to have a debug session instead?

Comment 10 Mudit Agarwal 2021-01-29 05:06:24 UTC
Hi Mark, do you need a debug session?

Comment 11 Aaruni Aggarwal 2021-01-29 16:14:27 UTC
We are also facing multiple restarts of OSD pods due to OOMKilled on our IBM Power platform. I ran the scale and tier1 tests and also started a performance test on the cluster. The OSD pod restarted 25 times.

[root@ocs4-aaragga1-5ed0-bastion-0 ~]# oc get pods -n openshift-storage |grep osd-1
rook-ceph-osd-1-5bd4d44b6f-dd6nq               1/1     Running     25         3d12h

So I set up a kruize pod to monitor the OSD pod. When the performance test was running, I checked the values generated by kruize.
  
[root@ocs4-aaragga1-5ed0-bastion-0 ~]# curl http://kruize-openshift-monitoring.apps.ocs4-aaragga1-5ed0.ibm.com/recommendations?application_name=rook-ceph-osd-1

[
  {
    "application_name": "rook-ceph-osd-1",
    "resources": {
      "requests": {
        "memory": "3427.2M",
        "cpu": 0.5
      },
      "limits": {
        "memory": "6415.4M",
        "cpu": 1.0
      }
    }
  }
]

We are using this storagecluster.yaml file to deploy our storage cluster: https://github.com/red-hat-storage/ocs-ci/blob/master/ocs_ci/templates/ocs-deployment/ibm-storage-cluster.yaml
We have 3 worker nodes, each with 16 vCPUs, 64 GB memory, and an additional 500 GB disk.

Comment 12 Mark Nelson 2021-01-29 17:09:46 UTC
If we have the ability to look at the perf counter and mempool stats for the OSD(s), that would probably be a good first step.  We want to see if this is memory usage due to bluestore caches or something else.
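
If it helps, a rough sketch of how those stats can be pulled from the toolbox pod (assumes the rook-ceph-tools pod is deployed and uses osd.0 as an example):

```
# Rough sketch: dump perf counters and mempool stats for osd.0 via the toolbox pod.
TOOLS=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -n1)
oc -n openshift-storage exec "$TOOLS" -- ceph tell osd.0 perf dump
oc -n openshift-storage exec "$TOOLS" -- ceph tell osd.0 dump_mempools
```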

Comment 13 Mark Nelson 2021-01-29 17:48:21 UTC
Quick update:  After doing a live session with Abdul it appears that the ceph executable is not linked to libtcmalloc as it normally is on x86:

"libtcmalloc.so.4 => /lib64/libtcmalloc.so.4 (0x00007f150663d000)"

The next immediate step is probably to investigate the build and make sure we are in fact using tcmalloc.  It's been observed in the past that the ceph-osd process can experience significant memory fragmentation with libc malloc and that could be an explanation for the behavior being observed.
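
For reference, one way to check the linkage from a running OSD pod (illustrative; the container name and binary path are assumptions about the Rook OSD pod layout):

```
# Illustrative check: list the shared libraries of ceph-osd inside an OSD pod and
# look for libtcmalloc; no output means the binary is falling back to libc malloc.
OSD_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-osd -o name | head -n1)
oc -n openshift-storage exec "$OSD_POD" -c osd -- ldd /usr/bin/ceph-osd | grep -i tcmalloc
```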

side note: On this build the admin socket was returning {} for perf commands and dump_mempools was also failing with a different error.  We currently appear to have limited visibility into what's actually using memory.

Mark

Comment 14 Manoj Kumar 2021-01-29 18:51:46 UTC
ceph-osd on ppc64le is not linked with libtcmalloc either.

Comment 15 lmcfadde 2021-01-29 21:57:20 UTC
@mnelson I reached out to the build manager for OCS builds, mentioned this bug, and stated we might need to get updated builds.

I was told to please email storage-container-release-team and they'll include it in the appropriate update release ... probably the next Z stream.

Should we do that now or wait for further details on your investigation?

Comment 16 Manoj Kumar 2021-01-29 22:56:17 UTC
@aaaggarw bumped up the memory for the OSD pods to 10Gi, and they still get OOMKilled. We would like to monitor memory utilization with libtcmalloc.

Comment 17 Ulrich Weigand 2021-02-01 14:10:28 UTC
@mnelson where do you get the version of libtcmalloc from that you're linking against on Intel?   I understand this was provided by the gperftools package in RHEL 7, but that has been removed in RHEL 8.

I see the gperftools package in EPEL 8 (also available for s390x), but this is not something we've ever really tested, so I'm wondering what the implications would be of using it in the production Ceph package.

Do we know for certain that the symptom is indeed caused by the malloc library and not anything else?

Comment 18 Josh Durgin 2021-02-01 15:31:49 UTC
*** Bug 1921811 has been marked as a duplicate of this bug. ***

Comment 19 Josh Durgin 2021-02-01 15:31:57 UTC
*** Bug 1921216 has been marked as a duplicate of this bug. ***

Comment 20 Josh Durgin 2021-02-01 15:42:36 UTC
(In reply to Ulrich Weigand from comment #17)
> @mnelson where do you get the version of libtcmalloc from that
> you're linking against on Intel?   I understand this was provided by the
> gperftools package in RHEL 7, but that has been removed in RHEL 8.
> 
> I see the gperftools package in EPEL 8 (also available for s390x), but this
> is not something we're ever really tested, so I'm wondering what the
> implications would be of using it in the production Ceph package.

This is what we use for x86 afaik. Ken, could you verify?

> Do we know for certain that the symptom is indeed caused by the malloc
> library and not anything else?

Yes glibc malloc using much more memory is a well-known behavior for ceph on x86.

tcmalloc is specifically required for the daemons to manage their memory envelope since it:

1) creates much less fragmentation with ceph's allocation patterns

2) gives us statistics about how much RSS is devoted to the application, the allocator, and how much the kernel has yet to reclaim.

Since ceph only has control over the application-level use, this allows us to stay within a memory envelope by adjusting usage and observing the impact to RSS and application usage.
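
For example, once tcmalloc is in place, the target envelope and allocator stats can be inspected from the toolbox pod (sketch; osd.0 as an example, and heap stats only works with tcmalloc):

```
# Illustrative: show the configured memory target and tcmalloc heap statistics for osd.0.
TOOLS=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -n1)
oc -n openshift-storage exec "$TOOLS" -- ceph config get osd.0 osd_memory_target
oc -n openshift-storage exec "$TOOLS" -- ceph tell osd.0 heap stats
```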

These are both critical capabilities for ceph, I'm surprised this is the first time we're seeing this issue.

Comment 21 Ken Dreyer (Red Hat) 2021-02-01 17:54:10 UTC
Upstream Ceph on el8: We use EPEL 8's gperftools-libs.

Downstream RH Ceph Storage on RHEL 8: We use gperftools-libs that we build internally and ship as part of the RH Ceph Storage product.

Comment 22 Ulrich Weigand 2021-02-01 18:08:50 UTC
@kdreyer For the gperftools-libs you build internally, which version of the source code are you using?  Is this the same as the EPEL package or something else?

I'm asking because at some point *after* the version in EPEL, the upstream tree was temporarily broken on s390x by commit 73ee9b15440d72d5c4f93586ea1179c0a265980c, until it was fixed again in commit e40c7f231ad89e1ee8bf37a1d6680880c519c901.  So if you have anything in between, you need to take care.

Otherwise, I guess you should be able to build and use the package on Z (and probably Power) just as you do on Intel.

Comment 23 Sridhar Venkat (IBM) 2021-02-01 18:56:48 UTC
Josh, Are you rebuilding ceph based on the comments above, and if so, is there an estimate on when we can expect a new driver?

Comment 24 lbrownin 2021-02-01 20:10:23 UTC
@jdurgin ^^ Question for you above

Comment 25 Ken Dreyer (Red Hat) 2021-02-01 20:20:58 UTC
EPEL 8 uses gperftools-2.7-6.el8 , https://koji.fedoraproject.org/koji/buildinfo?buildID=1401729

RH Ceph Storage uses gperftools-2.6.3 (imported from Fedora 28, specifically https://src.fedoraproject.org/rpms/gperftools/c/751651e676c758e19ee74d776514f620b814519f)

Comment 26 Josh Durgin 2021-02-01 20:49:49 UTC
(In reply to Sridhar Venkat (IBM) from comment #23)
> Josh, Are you rebuilding ceph based on the comments above, and if so, is
> there an estimate on when we can expect a new driver?

This is a question for Ken

Comment 27 Sridhar Venkat (IBM) 2021-02-02 13:12:18 UTC
@kdreyer Can you provide an estimate on when this fix is available? It will help us to plan for various OCS tests on ppc64le platform.

Comment 28 Ulrich Weigand 2021-02-02 17:03:24 UTC
(In reply to Ken Dreyer (Red Hat) from comment #25)
> EPEL 8 uses gperftools-2.7-6.el8 ,
> https://koji.fedoraproject.org/koji/buildinfo?buildID=1401729
> 
> RH Ceph Storage uses gperftools-2.6.3 (imported from Fedora 28, specifically
> https://src.fedoraproject.org/rpms/gperftools/c/
> 751651e676c758e19ee74d776514f620b814519f)

Ok, those versions should be fine on s390x.

Comment 29 Ken Dreyer (Red Hat) 2021-02-02 17:35:10 UTC
On s390x, we have disabled tcmalloc since 2016: https://github.com/ceph/ceph/commit/efa7f7b365d27797573bf4e5a9878f94f41aede2 . We must remove that upstream to fix this for s390x.

On ppc64le, the problem is less clear. The build log only says "Could NOT find gperftools (missing: GPERFTOOLS_INCLUDE_DIR) (found version "2.6.3")" http://download.eng.bos.redhat.com/brewroot/vol/rhel-8/packages/ceph/14.2.11/95.el8cp/data/logs/ppc64le/build.log

Bisecting this problem in Fedora's ppc64le builds:

ceph-14.2.16-1.fc32 - Could NOT find gperftools (missing: GPERFTOOLS_INCLUDE_DIR) (found version "2.7")
ceph-15.1.0-2.fc33 - Could NOT find gperftools (missing: GPERFTOOLS_INCLUDE_DIR) (found version "2.7") (used gperftools-devel-2.7-7.fc32)

ceph-15.1.1-2.fc33 - Found gperftools: /usr/include (found version "2.7.90") (used gperftools-devel-2.7.90-1.fc33)
ceph-15.2.0-1.fc33 - Found gperftools: /usr/include (found version "2.7.90")
ceph-15.2.7-1.fc33 - Found gperftools: /usr/include (found version "2.8")

Did something change between gperftools 2.7 and 2.7.90 to enable ppc64le support?

Comment 30 Christina Meno 2021-02-02 20:58:36 UTC
Here's my summary and recommendations for what should happen next:

During upgrade we run out of memory in OCS on Z and P because ceph isn't linked to tcmalloc, so...
* Split out the different platforms into separate bugzillas
* For Z: Uli, create a PR upstream in ceph that re-enables linking against tcmalloc and let's see what the results are.
* For P: Sridhar, find the EPEL maintainer of the gperftools package and ask why we see https://bugzilla.redhat.com/show_bug.cgi?id=1917815#c29

What do you think?

Comment 31 Tulio Magno Quites Machado Filho 2021-02-02 21:09:24 UTC
(In reply to Ken Dreyer (Red Hat) from comment #29)
> Did something change between gperftools 2.7 and 2.7.90 to enable ppc64le
> support?

The first gperftools version to support ppc64le was gperftools-2.1.90 [1].

But looking at the gperftools F32 build logs [2] we see:
checking how to access the program counter from a struct ucontext... configure: WARNING: Could not find the PC.  Will not try to compile libprofiler...

ceph uses the profiler header to check for gperftools version [3].

The ucontext issue was indeed fixed in gperftools 2.7.90 [4].

[1] https://github.com/gperftools/gperftools/releases/tag/gperftools-2.1.90
[2] https://kojipkgs.fedoraproject.org//packages/gperftools/2.7/7.fc32/data/logs/ppc64le/build.log
[3] https://github.com/ceph/ceph/blob/d98075628e3d0ea3fb5e636a733ef94feaa77037/cmake/modules/Findgperftools.cmake#L15
[4] https://github.com/gperftools/gperftools/commit/fc00474ddc21fff618fc3f009b46590e241e425e

Comment 32 Manoj Kumar 2021-02-02 21:23:58 UTC
@gmeno https://bugzilla.redhat.com/show_bug.cgi?id=1917815#c31 seems to be addressing why gperftools 2.7.90 resolves the issue? Can we use this BZ to resolve the issue for both P and Z?

Comment 33 Ken Dreyer (Red Hat) 2021-02-03 15:36:43 UTC
It sounds like we will upgrade gperftools to 2.7.90 in RHCS and EPEL. In Fedora, Danny also pushed this change on top of 2.7.90: https://src.fedoraproject.org/rpms/gperftools/c/f7fb993c56e6ee0c84b9e43fca12136cc7164393 Do we need this as well?

Comment 34 lbrownin 2021-02-03 17:08:07 UTC
Looks like the Fedora spec file fix is required for Z. It doesn't affect Power. Wouldn't you want to have a common source code base for all platforms?

Comment 35 Ulrich Weigand 2021-02-03 17:19:13 UTC
(In reply to Ken Dreyer (Red Hat) from comment #33)
> It sounds like we will upgrade gperftools to 2.7.90 in RHCS and EPEL. In
> Fedora, Danny also pushed this change on top of 2.7.90:
> https://src.fedoraproject.org/rpms/gperftools/c/
> f7fb993c56e6ee0c84b9e43fca12136cc7164393 Do we need this as well?

Note that if you add that change, you'll also have to update libunwind to a version that supports s390x.  Fedora updated to version 1.4.0 here
https://src.fedoraproject.org/rpms/libunwind/c/5d7bc2445160ca914c927ed302d412f7b21906a0?branch=rawhide
However, I think in RHEL this is not yet present, so you might have to pull it in as well.

Whether you *need* to use libunwind in the first place is another question. It seems that without libunwind it will fall back to _Unwind_Backtrace, which should work fine on Z, but may provide slightly less information in backtraces.

Comment 36 Dan Horák 2021-02-05 13:32:50 UTC
If I read this thread right, you will want to update both libunwind and gperftools in EPEL-8 to sync the features across all the arches, same as Fedora, right?

Comment 37 Christina Meno 2021-02-05 20:05:27 UTC
In reply to comment 32:

No Manoj,

I've filed two BZs in the ceph product that block this. Essentially those need to exist for tracking and testing in the ceph product.
Once fixed upstream and accepted, we can get that image put into OCS (this BZ).

cheers,
C

Comment 39 Sridhar Venkat (IBM) 2021-02-19 17:50:22 UTC
We received a new OCS 4.7 build, http://quay.io/rhceph-dev/ocs-registry:4.7.0-268.ci, and tested it. After the install, the storage cluster is still in the installing state:

[root@ocp47-327a-bastion-0 storage-cluster]# oc get storagecluster
NAME                 AGE   PHASE         EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   45m   Progressing              2021-02-19T16:12:21Z   4.7.0

[root@ocp47-327a-bastion-0 storage-cluster]# oc get csv
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.7.0-268.ci   OpenShift Container Storage   4.7.0-268.ci              Installing

All the pods were created except the RGW pods, and the storage cluster has been in the Progressing state for the past 45 minutes.

[root@ocp47-327a-bastion-0 storage-cluster]# oc get pods
NAME                                                              READY   STATUS      RESTARTS   AGE
csi-cephfsplugin-6lq9k                                            3/3     Running     0          49m
csi-cephfsplugin-7qjft                                            3/3     Running     0          49m
csi-cephfsplugin-provisioner-9dd96549c-lsfhw                      6/6     Running     0          49m
csi-cephfsplugin-provisioner-9dd96549c-zmx67                      6/6     Running     0          49m
csi-cephfsplugin-prrnn                                            3/3     Running     0          49m
csi-rbdplugin-2hnwz                                               3/3     Running     0          49m
csi-rbdplugin-fmcw7                                               3/3     Running     0          49m
csi-rbdplugin-l49cm                                               3/3     Running     0          49m
csi-rbdplugin-provisioner-6748ccdf44-bhqvx                        6/6     Running     0          49m
csi-rbdplugin-provisioner-6748ccdf44-kcbdm                        6/6     Running     0          49m
noobaa-core-0                                                     1/1     Running     0          47m
noobaa-db-pg-0                                                    1/1     Running     0          47m
noobaa-endpoint-84fbcfc99b-dvjm5                                  1/1     Running     0          45m
noobaa-operator-6fbd8d898-82xsm                                   1/1     Running     0          52m
ocs-metrics-exporter-6bfbc957c7-76km2                             1/1     Running     0          52m
ocs-operator-8c8c7cdc-76ls4                                       0/1     Running     0          21m
rook-ceph-crashcollector-worker-0-77977f8656-5lz9g                1/1     Running     0          48m
rook-ceph-crashcollector-worker-1-7fcdc884f7-mb5nt                1/1     Running     0          49m
rook-ceph-crashcollector-worker-2-5b47d755fd-xhv67                1/1     Running     0          49m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-776998bd87bdr   2/2     Running     0          47m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-bdb749675vb8v   2/2     Running     0          47m
rook-ceph-mgr-a-6f5bd98bb8-8n4lf                                  2/2     Running     0          48m
rook-ceph-mon-a-77cc97b9f-nnr2g                                   2/2     Running     0          49m
rook-ceph-mon-b-8f7dc76b9-dhlkr                                   2/2     Running     0          49m
rook-ceph-mon-c-76f6b5674c-xhl4d                                  2/2     Running     0          48m
rook-ceph-operator-58d8bf6c87-mpn4r                               1/1     Running     0          33m
rook-ceph-osd-0-6765598c49-vvgnn                                  2/2     Running     0          47m
rook-ceph-osd-1-c6ff488d9-jc5s6                                   2/2     Running     0          47m
rook-ceph-osd-2-5d6b9847c6-fv255                                  2/2     Running     0          47m
rook-ceph-osd-prepare-ocs-deviceset-1-0-data-0jr96f-84mmd         0/1     Completed   0          48m
rook-ceph-osd-prepare-ocs-deviceset-2-0-data-0kz8x8-9gcmb         0/1     Completed   0          48m
rook-ceph-osd-prepare-ocs-deviceset-3-0-data-0bh7zl-f92d2         0/1     Completed   0          48m

Comment 43 Christina Meno 2021-02-22 18:22:28 UTC
tcmalloc is present in http://quay.io/rhceph-dev/ocs-registry:4.7.0-268.ci; unfortunately it cannot be tested due to https://bugzilla.redhat.com/show_bug.cgi?id=1928471

Comment 44 Raz Tamir 2021-02-22 18:30:59 UTC
Moving to MODIFIED state; as soon as bz #1928471 is fixed, we'll move to ON_QA.

Comment 45 Ken Dreyer (Red Hat) 2021-02-23 21:45:34 UTC
gperftools 2.8 needs deeper performance testing. We've changed plans and Yaakov Selkowitz backported the following three commits to gperftools 2.6.3:

https://github.com/gperftools/gperftools/commit/fc00474ddc21fff618fc3f009b46590e241e425e
https://github.com/gperftools/gperftools/commit/8f9a873fce14337e113a3837603a11ade06da533
https://github.com/gperftools/gperftools/commit/73ee9b15440d72d5c4f93586ea1179c0a265980c

We're going to ship gperftools-2.6.3-3.el8cp with these changes.

Comment 46 lmcfadde 2021-03-10 19:23:31 UTC
@akandath is this verified? I realize you cannot run the upgrade test case, but I think you can validate that tier 1 OCS-CI runs no longer produce the OOM errors? Should it be moved to VERIFIED?

@tstober

Comment 50 Abdul Kandathil (IBM) 2021-04-07 10:37:58 UTC
We executed tier1 with OCS version 4.7.0-801.ci, and below is the current status of the OCS pods:
------
[root@m1308001 ~]# oc -n openshift-storage get pod
NAME                               READY  STATUS   RESTARTS  AGE
csi-cephfsplugin-5mq8f                      3/3   Running   0     6d
csi-cephfsplugin-6k82v                      3/3   Running   0     6d
csi-cephfsplugin-8hxcd                      3/3   Running   0     6d
csi-cephfsplugin-b9h7d                      3/3   Running   0     6d
csi-cephfsplugin-kmfxv                      3/3   Running   0     6d
csi-cephfsplugin-provisioner-75784cd9b5-hfg94           6/6   Running   0     6d
csi-cephfsplugin-provisioner-75784cd9b5-ls6zx           6/6   Running   0     6d
csi-cephfsplugin-rnkzb                      3/3   Running   0     6d
csi-cephfsplugin-tmq4n                      3/3   Running   0     6d
csi-cephfsplugin-xqdnz                      3/3   Running   0     6d
csi-cephfsplugin-zk27z                      3/3   Running   0     6d
csi-rbdplugin-22tl4                        3/3   Running   0     6d
csi-rbdplugin-8htfs                        3/3   Running   0     6d
csi-rbdplugin-97s9f                        3/3   Running   0     6d
csi-rbdplugin-98nhj                        3/3   Running   0     6d
csi-rbdplugin-fgfjm                        3/3   Running   0     6d
csi-rbdplugin-hqzs5                        3/3   Running   0     6d
csi-rbdplugin-pqr77                        3/3   Running   0     6d
csi-rbdplugin-provisioner-68fcbf97f4-bdtdv            6/6   Running   0     6d
csi-rbdplugin-provisioner-68fcbf97f4-kpslj            6/6   Running   0     5d17h
csi-rbdplugin-q6md7                        3/3   Running   0     6d
csi-rbdplugin-tfvw8                        3/3   Running   0     6d
noobaa-core-0                           1/1   Running   0     6d
noobaa-db-pg-0                          1/1   Running   0     5d17h
noobaa-endpoint-6b46c8d459-g8dl9                 1/1   Running   0     5d17h
noobaa-endpoint-6b46c8d459-ttgks                 1/1   Running   0     5d23h
noobaa-operator-579c68d6c5-kgvl9                 1/1   Running   0     6d
ocs-metrics-exporter-b5965f77d-h27sr               1/1   Running   0     6d
ocs-operator-9c454d7d8-qz5l2                   1/1   Running   0     6d
rook-ceph-crashcollector-worker-1.ocs-ci-large.test.ocs-6d2t9rf  1/1   Running   0     6d
rook-ceph-crashcollector-worker-2.ocs-ci-large.test.ocs-7c2dmsd  1/1   Running   0     6d
rook-ceph-crashcollector-worker-3.ocs-ci-large.test.ocs-6dg95xl  1/1   Running   0     6d
rook-ceph-crashcollector-worker-5.ocs-ci-large.test.ocs-5cjss8m  1/1   Running   0     6d
rook-ceph-crashcollector-worker-7.ocs-ci-large.test.ocs-5d5sxqz  1/1   Running   0     6d
rook-ceph-crashcollector-worker-8.ocs-ci-large.test.ocs-7cr2xbr  1/1   Running   0     6d
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5b5d77b5gj4w9  2/2   Running   0     6d
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6b7977986gd7s  2/2   Running   0     6d
rook-ceph-mgr-a-79895c7856-nzhwr                 2/2   Running   4     6d
rook-ceph-mon-a-6f8647d859-n87sd                 2/2   Running   1     6d
rook-ceph-mon-b-8484d47c54-9hvzx                 2/2   Running   0     6d
rook-ceph-mon-c-5cc7ccc867-dwvxb                 2/2   Running   0     6d
rook-ceph-operator-67dc64bbd4-qkd2c                1/1   Running   0     6d
rook-ceph-osd-0-6678cb6f7f-h9gbm                 2/2   Running   1     6d
rook-ceph-osd-1-545c698c57-xmwz7                 2/2   Running   0     6d
rook-ceph-osd-2-67f4dbb555-l2wkf                 2/2   Running   0     6d
rook-ceph-osd-3-85769fddfd-lc69p                 2/2   Running   0     5d17h
rook-ceph-osd-4-68fb59fdcb-pn65g                 2/2   Running   0     5d17h
rook-ceph-osd-5-77f4dd97fd-j6vv9                 2/2   Running   0     5d17h
rook-ceph-osd-prepare-ocs-deviceset-0-data-0524gs-6jgd5      0/1   Completed  0     6d
rook-ceph-osd-prepare-ocs-deviceset-0-data-1b98jl-jlpjk      0/1   Completed  0     5d17h
rook-ceph-osd-prepare-ocs-deviceset-1-data-0kc4bq-2crtg      0/1   Completed  0     6d
rook-ceph-osd-prepare-ocs-deviceset-1-data-1xvfbh-mhqsq      0/1   Completed  0     5d17h
rook-ceph-osd-prepare-ocs-deviceset-2-data-0qbjrn-7nzhs      0/1   Completed  0     6d
rook-ceph-osd-prepare-ocs-deviceset-2-data-1b7rq4-4d6sc      0/1   Completed  0     5d17h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-85976bb7gt7k  2/2   Running   0     6d
rook-ceph-tools-5f4dc9c678-9xbmp                 1/1   Running   0     6d
[root@m1308001 ~]#

Comment 51 lmcfadde 2021-04-29 17:13:03 UTC
@rcyriac the Z team would like to do one more validation with Tier 1 to be sure and will then update this BZ.

Comment 52 Sridhar Venkat (IBM) 2021-04-29 17:33:30 UTC
For System P, we no longer see the OSD pods restarting during normal operation of OCS. This problem was seen in the System P environment earlier with OCS 4.6, and when OCS was rebuilt with the tcmalloc libraries, the problem went away in OCS 4.7.

Comment 55 Abdul Kandathil (IBM) 2021-05-06 11:51:55 UTC
We are not experiencing this issue anymore on IBM Z using OCS 4.7.

Comment 57 lmcfadde 2021-05-11 12:36:13 UTC
Rejy, I think you got the info you needed, so I'm removing the needinfo flag on my name.

Comment 59 errata-xmlrpc 2021-05-19 09:18:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041

