Description of problem (please be as detailed as possible and provide log snippets):
Set up OCS on a ppc64le environment with four worker/storage nodes and three master nodes. Ran the performance suite of ocs-ci and found that a couple of worker/storage nodes went down. Noticed soft CPU lockups for 22 seconds on the nodes.

Version of all relevant components (if applicable):
4.7
$ oc version
Client Version: 4.7.8
Server Version: 4.7.8

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
This is related to performance testing of the product; since the nodes are down, we cannot continue.

Is there any workaround available to the best of your knowledge?
No.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
N/A

If this is a regression, please provide more details to justify this:
N/A

Steps to Reproduce:
1. Deploy OCS on OCP on ppc64le.
2. Run the performance suite of tests from ocs-ci.

Actual results:
Tests fail and worker nodes go down.

Expected results:
Tests complete and nodes stay healthy.

Additional info:
$ oc get cephcluster -n openshift-storage
NAME                             DATADIRHOSTPATH   MONCOUNT   AGE   PHASE     MESSAGE                            HEALTH
ocs-storagecluster-cephcluster   /var/lib/rook     3          11h   Failure   Failed to configure ceph cluster   HEALTH_ERR

$ oc get csv -A
NAMESPACE                              NAME                                           DISPLAY                       VERSION                 REPLACES   PHASE
openshift-local-storage                local-storage-operator.4.7.0-202105210300.p0   Local Storage                 4.7.0-202105210300.p0              Succeeded
openshift-operator-lifecycle-manager   packageserver                                  Package Server                0.17.0                             Succeeded
openshift-storage                      ocs-operator.v4.7.1-410.ci                     OpenShift Container Storage   4.7.1-410.ci                       Succeeded

$ oc get pods -n openshift-storage
NAME   READY   STATUS   RESTARTS   AGE
csi-cephfsplugin-6d9ss   3/3   Running   0   11h
csi-cephfsplugin-bjf7z   3/3   Running   0   11h
csi-cephfsplugin-g29mf   3/3   Running   0   11h
csi-cephfsplugin-n5hqp   3/3   Running   0   11h
csi-cephfsplugin-provisioner-5f668cb9df-hfsb8   0/6   Pending   0   72m
csi-cephfsplugin-provisioner-5f668cb9df-krmwp   6/6   Running   0   11h
csi-cephfsplugin-v6sqq   3/3   Running   0   11h
csi-rbdplugin-fckh2   3/3   Running   0   11h
csi-rbdplugin-kf4f6   3/3   Running   0   11h
csi-rbdplugin-lm2xw   3/3   Running   0   11h
csi-rbdplugin-provisioner-846f7dddd4-fw2l7   6/6   Running   0   11h
csi-rbdplugin-provisioner-846f7dddd4-mpbc6   6/6   Running   0   11h
csi-rbdplugin-qz4xl   3/3   Running   0   11h
csi-rbdplugin-r9hl2   3/3   Running   0   11h
noobaa-core-0   1/1   Running   0   11h
noobaa-db-pg-0   1/1   Running   0   11h
noobaa-endpoint-c9d985895-p68xk   1/1   Running   0   11h
noobaa-operator-c97cf58f7-wghls   1/1   Running   0   11h
ocs-metrics-exporter-fb465c96d-27kfb   1/1   Running   0   11h
ocs-operator-7667c6f4cc-qpkhn   1/1   Running   0   11h
rook-ceph-crashcollector-worker-0-74d44fdf57-8thst   1/1   Terminating   0   11h
rook-ceph-crashcollector-worker-0-74d44fdf57-hcrc5   0/1   Pending   0   77m
rook-ceph-crashcollector-worker-1-6ff74969c6-k2hbs   1/1   Running   0   11h
rook-ceph-crashcollector-worker-2-7c966f66c5-mxc9l   0/1   Pending   0   77m
rook-ceph-crashcollector-worker-2-7c966f66c5-rzlxt   1/1   Terminating   0   11h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5d68b886fqwjv   2/2   Terminating   0   11h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5d68b886tgcg5   0/2   Pending   0   77m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-779fb8d7jhwp8   2/2   Terminating   0   11h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-779fb8d7wzlgq   1/2   CrashLoopBackOff   27   77m
rook-ceph-mgr-a-7c45d74674-fv7fw   0/2   Init:1/2   1   77m
rook-ceph-mgr-a-7c45d74674-xfgdp   2/2   Terminating   0   11h
rook-ceph-mon-a-695c5888b6-rmskk   2/2   Running   1   11h
rook-ceph-mon-b-d4db76d7d-7srzw   0/2   Pending   0   77m
rook-ceph-mon-b-d4db76d7d-xt6hm   2/2   Terminating   0   11h
rook-ceph-mon-c-7dc98cf6bc-jg9ht   2/2   Terminating   0   11h
rook-ceph-mon-c-7dc98cf6bc-srmxf   0/2   Pending   0   77m
rook-ceph-operator-5dc4cd9cfb-f4f5j   1/1   Running   0   11h
rook-ceph-osd-0-6fccf45866-6vb6d   2/2   Terminating   0   11h
rook-ceph-osd-0-6fccf45866-wbj9b   0/2   Pending   0   77m
rook-ceph-osd-1-587bb48b67-cszq6   2/2   Running   0   11h
rook-ceph-osd-2-b6d8cd589-7rfgr   0/2   Pending   0   77m
rook-ceph-osd-2-b6d8cd589-bn2vk   2/2   Terminating   0   11h
rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data-279zshn2h   0/1   Completed   0   11h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-cfd8945c5x8x   2/2   Terminating   0   11h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-cfd8945mqj5f   2/2   Running   29   77m
rook-ceph-tools-76dbc6f57f-6pfq4   1/1   Terminating   0   11h
rook-ceph-tools-76dbc6f57f-sffst   0/1   Pending   0   77m
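For anyone triaging this, a rough sketch of checks that should confirm which workers went NotReady and surface the soft-lockup messages (node names are placeholders; oc debug only works while a node still responds, once it locks up only the support-processor console shows anything):

$ oc get nodes -o wide
$ oc describe node <worker-node> | grep -A 5 Conditions
$ oc debug node/<worker-node> -- chroot /host dmesg -T | grep -i "soft lockup"
$ oc adm node-logs <worker-node> -u kubelet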
Created attachment 1790023 [details] Screen shot from IBM Cloud PowerVS console - Part 1
Created attachment 1790024 [details] Screen shot from IBM Cloud PowerVS console - Part 2
Created attachment 1790029 [details] Screen shot from IBM Cloud PowerVS console - Part 0
Sorry for the gaps in the console log; the console output kept changing while I captured it. My guess is that this is a locking problem: either a lock was not released when the critical section ended, or the same lock was taken at different levels.
This is the script that causes the error: https://github.com/ocp-power-automation/ocs-upi-kvm/blob/master/samples/test-ocs-perf.sh The failure occurs while running the benchmark-operator fio cephfs random test. This is the third of four fio tests; the first two succeed.
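For anyone trying to reproduce this, a rough sketch of the flow (the benchmark-operator namespace shown is an assumption, and the script may need the environment setup described in the ocs-upi-kvm repo):

$ git clone https://github.com/ocp-power-automation/ocs-upi-kvm
$ cd ocs-upi-kvm
$ ./samples/test-ocs-perf.sh

While the cephfs random test is running, watch the benchmark pods and the workers from another terminal:

$ oc get benchmarks -A
$ oc get pods -n benchmark-operator -w
$ oc get nodes -w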
The first two fio tests that succeeded are fio rbd sequential and fio cephfs sequential. The benchmark-operator is run by the ocs-ci performance suite.
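For reference, the suite is invoked through ocs-ci's run-ci wrapper along these lines (the flags shown reflect typical ocs-ci usage and are an assumption here; the exact invocation is driven by test-ocs-perf.sh above):

$ run-ci -m performance --cluster-name <cluster-name> --cluster-path <cluster-dir> --ocsci-conf <ocs-ci-conf.yaml> tests/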
This again looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1945016
I am not able to access the tracker BZ 1953540 noted on the bugzilla above. Is there a fix coming soon? Is a development build available? I am not able to see bugzillas beyond 1945016, but from what I see in it the problem looks very different. The node is NotReady because the kernel has a problem, specifically the ceph kernel module. That is what the console attachments are trying to show. The console noted above is supported through the support processor, not the ppc64le processors controlled by the kernel. This needs to be assigned and fixed in the ceph kernel code... Is that what BZ 1953540 is fixing?
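To back up the ceph kernel module theory, it may help to capture the kernel and module details from a worker before it wedges (a sketch only; the node name is a placeholder and the commands assume the node is still reachable):

$ oc debug node/<worker-node> -- chroot /host uname -r
$ oc debug node/<worker-node> -- chroot /host lsmod | grep -E 'ceph|rbd'
$ oc debug node/<worker-node> -- chroot /host modinfo libceph

CephFS volumes mounted by the CSI plugin normally go through the kernel client, so the ceph and libceph modules are the ones in play during the cephfs fio tests.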
I meant BZ 1953430 above.
*** This bug has been marked as a duplicate of bug 1945016 ***
@muagarwa as per https://bugzilla.redhat.com/show_bug.cgi?id=1970483#c11 do you still think this is a duplicate? The originator also mentions "...Noticed soft CPU lockups for 22 seconds on the nodes". BZ 1953430 is private so we cannot see it.
(In reply to lmcfadde from comment #14)
> @muagarwa as per https://bugzilla.redhat.com/show_bug.cgi?id=1970483#c11 do
> you still think this is a duplicate? The originator also mentions
> "...Noticed soft CPU lockups for 22 seconds on the nodes". BZ 1953430 is
> private so we cannot see it.

I have made all the comments public on BZ #1953430, so you should be able to see it. If you are able to reproduce this, please help the Ceph team, who are looking into this BZ. And yes, they are looking into the ceph kernel code to find the issue.
I have tested with the latest ocp 4.9 and the same issue recurs, as shown in the new console logs posted on the duplicate BZ https://bugzilla.redhat.com/show_bug.cgi?id=1945016. This is a ceph module issue as noted above. Has that specific ceph module problem been resolved? A ceph module stack traceback is included in the console log.
I meant ocp 4.8. We tested with rhcos 4.8 rc 3. Note that the kernel version is included in the serial console log image from the support processor. The server cannot be accessed via login or ssh as this is a kernel exception.