Bug 1970483 - Nodes go down while running performance suite of tests from ocs-ci.
Summary: Nodes go down while running performance suite of tests from ocs-ci.
Keywords:
Status: CLOSED DUPLICATE of bug 1945016
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ceph
Version: 4.7
Hardware: ppc64le
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Scott Ostapovicz
QA Contact: Raz Tamir
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-06-10 14:40 UTC by Sridhar Venkat (IBM)
Modified: 2021-07-12 18:23 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-06-15 03:16:10 UTC
Embargoed:


Attachments
Screen shot from IBM Cloud PowerVS console - Part 1 (81.94 KB, image/png)
2021-06-10 20:55 UTC, lbrownin
Screen shot from IBM Cloud PowerVS console - Part 2 (76.79 KB, image/png)
2021-06-10 20:58 UTC, lbrownin
Screen shot from IBM Cloud PowerVS console - Part 0 (76.68 KB, image/png)
2021-06-10 21:07 UTC, lbrownin

Description Sridhar Venkat (IBM) 2021-06-10 14:40:12 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

Set up OCS on a ppc64le environment with four worker/storage nodes and three master nodes, then ran the performance suite of ocs-ci. A couple of the worker/storage nodes went down, and a soft CPU lockup for 22 seconds was noticed on the nodes.
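For nodes that are still reachable, the soft lockup messages can usually be confirmed from the host kernel log. A minimal sketch, assuming worker node names worker-0/worker-1/worker-2 (substitute the real names from "oc get nodes"):

# Hypothetical node names; the lockup message looks like
# "BUG: soft lockup - CPU#N stuck for 22s!".
for node in worker-0 worker-1 worker-2; do
  echo "== $node =="
  # oc debug mounts the host filesystem at /host; chroot lets us read the
  # host kernel ring buffer directly.
  oc debug node/$node -- chroot /host sh -c 'dmesg -T | grep -i "soft lockup"'
done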

Version of all relevant components (if applicable): 4.7
$ oc version
Client Version: 4.7.8
Server Version: 4.7.8


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
This is related to performance testing of the product; since the nodes are down, we cannot continue.

Is there any workaround available to the best of your knowledge?
No.


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
N/A

If this is a regression, please provide more details to justify this:
N/A

Steps to Reproduce:
1. Deploy OCS on OCP on ppc64le.
2. Run the performance suite of tests from ocs-ci (a rough invocation sketch is shown below).
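For reference, a rough sketch of how the ocs-ci performance marker is typically invoked against an existing cluster; the exact flags, paths, and the config file name below are assumptions and should be taken from the ocs-ci documentation:

# Sketch only: paths and the config file are placeholders.
git clone https://github.com/red-hat-storage/ocs-ci.git
cd ocs-ci
pip install -r requirements.txt                  # install ocs-ci dependencies
export KUBECONFIG=/path/to/ppc64le-cluster/auth/kubeconfig
# run-ci is the ocs-ci pytest wrapper; "-m performance" selects the
# performance-marked tests.
run-ci -m performance \
    --cluster-path /path/to/ppc64le-cluster \
    --ocsci-conf conf/my_platform_conf.yaml      # placeholder config file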


Actual results:
Tests fail and worker nodes go down.

Expected results:
Tests complete and the nodes stay healthy.

Additional info:

Comment 2 lbrownin 2021-06-10 20:47:37 UTC
$ oc get cephcluster -n openshift-storage
NAME                             DATADIRHOSTPATH   MONCOUNT   AGE   PHASE     MESSAGE                            HEALTH
ocs-storagecluster-cephcluster   /var/lib/rook     3          11h   Failure   Failed to configure ceph cluster   HEALTH_ERR

$ oc get csv -A
NAMESPACE                              NAME                                           DISPLAY                       VERSION                 REPLACES   PHASE
openshift-local-storage                local-storage-operator.4.7.0-202105210300.p0   Local Storage                 4.7.0-202105210300.p0              Succeeded
openshift-operator-lifecycle-manager   packageserver                                  Package Server                0.17.0                             Succeeded
openshift-storage                      ocs-operator.v4.7.1-410.ci                     OpenShift Container Storage   4.7.1-410.ci                       Succeeded

$ oc get pods -n openshift-storage
NAME                                                              READY   STATUS             RESTARTS   AGE
csi-cephfsplugin-6d9ss                                            3/3     Running            0          11h
csi-cephfsplugin-bjf7z                                            3/3     Running            0          11h
csi-cephfsplugin-g29mf                                            3/3     Running            0          11h
csi-cephfsplugin-n5hqp                                            3/3     Running            0          11h
csi-cephfsplugin-provisioner-5f668cb9df-hfsb8                     0/6     Pending            0          72m
csi-cephfsplugin-provisioner-5f668cb9df-krmwp                     6/6     Running            0          11h
csi-cephfsplugin-v6sqq                                            3/3     Running            0          11h
csi-rbdplugin-fckh2                                               3/3     Running            0          11h
csi-rbdplugin-kf4f6                                               3/3     Running            0          11h
csi-rbdplugin-lm2xw                                               3/3     Running            0          11h
csi-rbdplugin-provisioner-846f7dddd4-fw2l7                        6/6     Running            0          11h
csi-rbdplugin-provisioner-846f7dddd4-mpbc6                        6/6     Running            0          11h
csi-rbdplugin-qz4xl                                               3/3     Running            0          11h
csi-rbdplugin-r9hl2                                               3/3     Running            0          11h
noobaa-core-0                                                     1/1     Running            0          11h
noobaa-db-pg-0                                                    1/1     Running            0          11h
noobaa-endpoint-c9d985895-p68xk                                   1/1     Running            0          11h
noobaa-operator-c97cf58f7-wghls                                   1/1     Running            0          11h
ocs-metrics-exporter-fb465c96d-27kfb                              1/1     Running            0          11h
ocs-operator-7667c6f4cc-qpkhn                                     1/1     Running            0          11h
rook-ceph-crashcollector-worker-0-74d44fdf57-8thst                1/1     Terminating        0          11h
rook-ceph-crashcollector-worker-0-74d44fdf57-hcrc5                0/1     Pending            0          77m
rook-ceph-crashcollector-worker-1-6ff74969c6-k2hbs                1/1     Running            0          11h
rook-ceph-crashcollector-worker-2-7c966f66c5-mxc9l                0/1     Pending            0          77m
rook-ceph-crashcollector-worker-2-7c966f66c5-rzlxt                1/1     Terminating        0          11h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5d68b886fqwjv   2/2     Terminating        0          11h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5d68b886tgcg5   0/2     Pending            0          77m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-779fb8d7jhwp8   2/2     Terminating        0          11h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-779fb8d7wzlgq   1/2     CrashLoopBackOff   27         77m
rook-ceph-mgr-a-7c45d74674-fv7fw                                  0/2     Init:1/2           1          77m
rook-ceph-mgr-a-7c45d74674-xfgdp                                  2/2     Terminating        0          11h
rook-ceph-mon-a-695c5888b6-rmskk                                  2/2     Running            1          11h
rook-ceph-mon-b-d4db76d7d-7srzw                                   0/2     Pending            0          77m
rook-ceph-mon-b-d4db76d7d-xt6hm                                   2/2     Terminating        0          11h
rook-ceph-mon-c-7dc98cf6bc-jg9ht                                  2/2     Terminating        0          11h
rook-ceph-mon-c-7dc98cf6bc-srmxf                                  0/2     Pending            0          77m
rook-ceph-operator-5dc4cd9cfb-f4f5j                               1/1     Running            0          11h
rook-ceph-osd-0-6fccf45866-6vb6d                                  2/2     Terminating        0          11h
rook-ceph-osd-0-6fccf45866-wbj9b                                  0/2     Pending            0          77m
rook-ceph-osd-1-587bb48b67-cszq6                                  2/2     Running            0          11h
rook-ceph-osd-2-b6d8cd589-7rfgr                                   0/2     Pending            0          77m
rook-ceph-osd-2-b6d8cd589-bn2vk                                   2/2     Terminating        0          11h
rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data-279zshn2h   0/1     Completed          0          11h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-cfd8945c5x8x   2/2     Terminating        0          11h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-cfd8945mqj5f   2/2     Running            29         77m
rook-ceph-tools-76dbc6f57f-6pfq4                                  1/1     Terminating        0          11h
rook-ceph-tools-76dbc6f57f-sffst                                  0/1     Pending            0          77m
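
For completeness, a few read-only commands that are typically used to dig further into the HEALTH_ERR state above. The rook-ceph-tools deployment name is taken from the pod listing; the toolbox commands only work once the tools pod is scheduled again (it is Pending here):

$ oc describe cephcluster ocs-storagecluster-cephcluster -n openshift-storage
$ oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph health detail
$ oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph -s
$ oc get nodes -o wide     # confirms which worker nodes are NotReady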

Comment 3 lbrownin 2021-06-10 20:51:21 UTC
file:///home/luke/Desktop/Screenshot%20from%202021-06-10%2009-03-45.png
file:///home/luke/Desktop/Screenshot%20from%202021-06-10%2008-47-24.png

Comment 4 lbrownin 2021-06-10 20:55:40 UTC
Created attachment 1790023 [details]
Screen shot from IBM Cloud PowerVS console - Part 1

Comment 5 lbrownin 2021-06-10 20:58:15 UTC
Created attachment 1790024 [details]
Screen shot from IBM Cloud PowerVS console - Part 2

Comment 6 lbrownin 2021-06-10 21:07:32 UTC
Created attachment 1790029 [details]
Screen shot from IBM Cloud PowerVS console - Part 0

Comment 7 lbrownin 2021-06-10 21:13:14 UTC
Sorry for the gaps in the console log; it kept changing. My guess is that this is a locking problem: either a lock was not released when the critical section ended, or the same lock was taken at different levels.

Comment 8 lbrownin 2021-06-10 21:17:44 UTC
This is the script that causes the error: https://github.com/ocp-power-automation/ocs-upi-kvm/blob/master/samples/test-ocs-perf.sh

The failure occurs while running the benchmark-operator fio cephfs random test. This is the third of four fio tests; the first two succeed.
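
To see which benchmark-operator workload was active when the nodes went down, something like the following can help. The namespace is an assumption (benchmark-operator has historically been deployed into my-ripsaw or benchmark-operator):

$ oc get benchmarks.ripsaw.cloudbulldozer.io --all-namespaces
$ oc get pods -n benchmark-operator -o wide    # -o wide shows which worker hosts each fio pod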

Comment 9 lbrownin 2021-06-10 21:20:48 UTC
The first two fio tests, which succeeded, are fio rbd sequential and fio cephfs sequential.
The benchmark-operator is run by the ocs-ci performance suite.

Comment 10 Mudit Agarwal 2021-06-11 09:22:29 UTC
This again looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1945016

Comment 11 lbrownin 2021-06-11 20:46:59 UTC
I am not able to access the tracker BZ 1953540 noted on the bugzilla above. Is there a fix coming soon? Is a development build available?

I am not able to see any bugzilla beyond 1945016, but from what I can see in it, the problem looks very different. The node is NotReady because the kernel has a problem, specifically in the ceph kernel module. That is what the console attachments are trying to show. The console noted above is supported through the support processor, not the ppc64le processors controlled by the kernel. This needs to be assigned and fixed in the ceph kernel code...

Is that what BZ 1953540 is fixing?

Comment 12 lbrownin 2021-06-11 20:50:29 UTC
I meant BZ 1953430 above.

Comment 13 Mudit Agarwal 2021-06-15 03:16:10 UTC

*** This bug has been marked as a duplicate of bug 1945016 ***

Comment 14 lmcfadde 2021-06-18 17:59:58 UTC
@muagarwa, as per https://bugzilla.redhat.com/show_bug.cgi?id=1970483#c11, do you still think this is a duplicate? The originator also mentions a "soft CPU lockup for 22 seconds" on the nodes. BZ 1953430 is private so we cannot see it.

Comment 15 Mudit Agarwal 2021-06-23 16:55:18 UTC
(In reply to lmcfadde from comment #14)
> @muagarwa, as per https://bugzilla.redhat.com/show_bug.cgi?id=1970483#c11,
> do you still think this is a duplicate? The originator also mentions a
> "soft CPU lockup for 22 seconds" on the nodes. BZ 1953430 is private so we
> cannot see it.

I have made all the comments public on BZ #1953430, so you should be able to see it.
If you are able to reproduce this, then please help the ceph team, who are looking into this BZ.

And yes, they are looking only in the ceph kernel code to find the issue.

Comment 16 lbrownin 2021-07-12 18:20:34 UTC
I have tested with the latest OCP 4.9 and the same issue recurs, as shown by the new console logs posted on the duplicate BZ https://bugzilla.redhat.com/show_bug.cgi?id=1945016. This is a ceph module issue, as noted above. Has that specific ceph module problem been resolved? A ceph module stack traceback is included in the console log.

Comment 17 lbrownin 2021-07-12 18:23:07 UTC
I meant OCP 4.8. We tested with RHCOS 4.8 rc3. Note that the kernel version is included in the serial console log image from the support processor. The server cannot be accessed via login or ssh, as this is a kernel exception.

