Bug 1940860
| Field | Value |
| --- | --- |
| Summary | [IBM Z] Local Storage Operator and most pods in openshift-storage namespace in Pending/Terminating status after tier4c tests with ocs-ci |
| Product | [Red Hat Storage] Red Hat OpenShift Container Storage |
| Component | rook |
| Version | 4.7 |
| Hardware | s390x |
| OS | Linux |
| Status | CLOSED DUPLICATE |
| Severity | medium |
| Priority | unspecified |
| Reporter | Sarah Julia Kriesch <skriesch> |
| Assignee | Santosh Pillai <sapillai> |
| QA Contact | Elad <ebenahar> |
| CC | akandath, madam, muagarwa, ocs-bugs, prsurve, ratamir |
| Keywords | Reopened |
| Target Milestone | --- |
| Target Release | --- |
| Doc Type | If docs needed, set a value |
| Type | Bug |
| Regression | --- |
| Last Closed | 2021-04-26 06:21:56 UTC |
| Attachments | must-gather-ocs tier4c (attachment 1764679) |
Description
Sarah Julia Kriesch, 2021-03-19 11:53:14 UTC
All worker nodes are in NotReady status after the test, too:

    # oc get nodes
    NAME                               STATUS     ROLES    AGE    VERSION
    master-01.m1307001ocs.lnxne.boe    Ready      master   3d2h   v1.20.0+5fbfd19
    master-02.m1307001ocs.lnxne.boe    Ready      master   3d2h   v1.20.0+5fbfd19
    master-03.m1307001ocs.lnxne.boe    Ready      master   3d2h   v1.20.0+5fbfd19
    worker-001.m1307001ocs.lnxne.boe   NotReady   worker   3d1h   v1.20.0+5fbfd19
    worker-002.m1307001ocs.lnxne.boe   NotReady   worker   3d1h   v1.20.0+5fbfd19
    worker-003.m1307001ocs.lnxne.boe   NotReady   worker   3d1h   v1.20.0+5fbfd19

I am waiting for must-gather results from this command:

    # oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:latest --dest-dir=/root/ocp4-workdir/

General logs of the tier4c test: https://ibm.box.com/s/58ti5357dp2l9k5sohxge1k3jszl9xka
Log output file of tier4c: https://ibm.box.com/s/v6d8p5idg8efh7a3gl0g1p2wv1fw95rk

Created attachment 1764679 [details]
must-gather-ocs tier4c
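
A minimal triage sketch for the NotReady workers while the must-gather runs; the node name is taken from the output above, and the `oc debug` step assumes the kubelet on that node can still start a debug pod (otherwise the node console or SSH is needed):

    # Check overall node state and the conditions/events on one NotReady worker.
    oc get nodes -o wide
    oc describe node worker-001.m1307001ocs.lnxne.boe

    # If a debug pod can still be scheduled on the node, pull the recent
    # kubelet journal; this is where a wedged kubelet usually shows up first.
    oc debug node/worker-001.m1307001ocs.lnxne.boe -- chroot /host journalctl -u kubelet --no-pager -n 200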
What are tier4c tests? Pods are pending/terminating because all worker nodes are NotReady. Any idea what caused that?

(In reply to Santosh Pillai from comment #7)
> What are tier4c tests?
> Pods are pending/terminating because all worker nodes are NotReady. Any idea
> what caused that?

Adding needinfo.

ocs-ci provides multiple tier tests for OCS (see https://ocs-ci.readthedocs.io/en/latest/). You can test the functionality of OCP and OCS with it. One of the test suites is tier4c. I used the following command to run it:

    # run-ci -m "tier4c" --ocsci-conf ~/ocs-ci/config.yaml --cluster-path /root/ocp4-workdir/ tests --no-print-logs --capture=no --html ~/testtier4c_18thMarch2021.html --self-contained-html | tee ~/tier4c_18thMarch2021.log

The strange thing is that the HTML log makes it look as if most tests passed, yet at the end the worker nodes are NotReady and the pods are not working. That should not happen with the default tests in customer environments...

(In reply to Sarah Julia Kriesch from comment #6)
> Created attachment 1764679 [details]
> must-gather-ocs tier4c

No logs are available for the OCS/Rook operator in the attached must-gather, possibly because the pods are no longer available. The OCS CI logs should be investigated to find out what went wrong. Adding needinfo for Prateek.

From comment #9 it looks like "--collect-logs" was not passed to the run-ci command, and that is why must-gather logs were not collected when the test case failed. I would suggest rerunning the run-ci test with "--collect-logs", e.g.:

    run-ci -m "tier4c" --ocsci-conf ~/ocs-ci/config.yaml --cluster-path /root/ocp4-workdir/ tests --no-print-logs --capture=no --html ~/testtier4c_18thMarch2021.html --self-contained-html --collect-logs

(In reply to Pratik Surve from comment #12)
> From comment #9 it looks like "--collect-logs" was not passed to the run-ci
> command, and that is why must-gather logs were not collected when the test
> case failed.
>
> I would suggest rerunning the run-ci test with "--collect-logs", e.g.:
>
> run-ci -m "tier4c" --ocsci-conf ~/ocs-ci/config.yaml --cluster-path
> /root/ocp4-workdir/ tests --no-print-logs --capture=no --html
> ~/testtier4c_18thMarch2021.html --self-contained-html --collect-logs

Please see this comment. Initial analysis suggests the OCS CI tests might have caused the issue with the nodes. QE needs must-gather logs to debug further. Please reopen with the required logs.

Hi @muagarwa, I have reproduced the issue again with OCS 4.7.
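
Before re-attaching a must-gather, a quick way to confirm the Rook/OCS operator logs actually made it into the collection could look like the sketch below; the directory matches the earlier must-gather command, and the search is deliberately loose because the layout inside a must-gather tree varies by image version:

    # Search the must-gather output tree for any rook-ceph-operator or
    # ocs-operator pod logs; warn if nothing is found.
    MG_DIR=/root/ocp4-workdir
    find "$MG_DIR" -type f -name '*.log' | grep -E 'rook-ceph-operator|ocs-operator' \
      || echo "No rook-ceph-operator / ocs-operator logs found under $MG_DIR"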
Collected logs as mentioned in the previous comment; please find them on Google Drive: https://drive.google.com/file/d/1x8MeBg24_AJeUP8ZxGOy9aqoQ4OnkqPF/view?usp=sharing

Not sure whether this is related to https://bugzilla.redhat.com/show_bug.cgi?id=1945016

This is the current status of the nodes and OCS pods:

    [root@m1301015 ~]# oc get nodes
    NAME                             STATUS     ROLES    AGE   VERSION
    master-0.m1301015ocs.lnxne.boe   Ready      master   17h   v1.20.0+551f7b2
    master-1.m1301015ocs.lnxne.boe   Ready      master   17h   v1.20.0+551f7b2
    master-2.m1301015ocs.lnxne.boe   Ready      master   17h   v1.20.0+551f7b2
    worker-0.m1301015ocs.lnxne.boe   Ready      worker   17h   v1.20.0+551f7b2
    worker-1.m1301015ocs.lnxne.boe   NotReady   worker   17h   v1.20.0+551f7b2
    worker-2.m1301015ocs.lnxne.boe   Ready      worker   17h   v1.20.0+551f7b2

    [root@m1301015 ~]# oc -n openshift-storage get po
    NAME                                                              READY   STATUS        RESTARTS   AGE
    csi-cephfsplugin-4mnz7                                            3/3     Running       0          4h53m
    csi-cephfsplugin-g8g42                                            3/3     Running       0          17h
    csi-cephfsplugin-provisioner-f975d886c-98b6c                      6/6     Running       0          8h
    csi-cephfsplugin-provisioner-f975d886c-b4q7k                      6/6     Running       0          4h2m
    csi-cephfsplugin-q4w7t                                            3/3     Running       0          9h
    csi-rbdplugin-ppcj7                                               3/3     Running       0          9h
    csi-rbdplugin-provisioner-6bbf798bfb-w85fg                        6/6     Running       0          4h2m
    csi-rbdplugin-provisioner-6bbf798bfb-xsk6f                        6/6     Running       0          9h
    csi-rbdplugin-scgbt                                               3/3     Running       0          4h40m
    csi-rbdplugin-tg658                                               3/3     Running       0          17h
    noobaa-core-0                                                     1/1     Running       0          17h
    noobaa-db-pg-0                                                    1/1     Running       0          17h
    noobaa-endpoint-7dcccc557b-kn8ph                                  1/1     Running       0          17h
    noobaa-operator-99b9845d5-67h84                                   1/1     Running       0          17h
    ocs-metrics-exporter-555554fd7b-b9kff                             1/1     Running       0          4h2m
    ocs-metrics-exporter-555554fd7b-hktzz                             1/1     Terminating   0          17h
    ocs-operator-6798f49bc6-5m68d                                     0/1     Terminating   0          17h
    ocs-operator-6798f49bc6-bqvqn                                     1/1     Running       0          4h2m
    rook-ceph-crashcollector-worker-0.m1301015ocs.lnxne.boe-6d6lxj9   1/1     Running       0          17h
    rook-ceph-crashcollector-worker-2.m1301015ocs.lnxne.boe-55ndz9t   1/1     Running       0          17h
    rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-65ffbc9c6q4qb   2/2     Running       0          4h2m
    rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-c787d8c9wm8vk   2/2     Running       0          17h
    rook-ceph-mgr-a-6b7955f858-kn2gh                                  2/2     Running       0          6h6m
    rook-ceph-mon-b-687f4bcf98-nwr6h                                  2/2     Running       0          17h
    rook-ceph-mon-c-5f55f54bd9-fkjwn                                  2/2     Running       0          17h
    rook-ceph-mon-d-canary-7997c4b4bf-rfkxt                           0/2     Pending       0          55s
    rook-ceph-operator-56698787c-462tz                                1/1     Running       0          7h22m
    rook-ceph-osd-0-b86d6d78c-btnzd                                   2/2     Running       0          5h29m
    rook-ceph-osd-1-9499578cf-bqnnn                                   2/2     Running       0          17h
    rook-ceph-osd-2-7cb485c5f-t4q7g                                   0/2     Pending       0          4h2m
    rook-ceph-osd-prepare-ocs-deviceset-1-data-0h7bd2-lnwks           0/1     Completed     0          17h
    rook-ceph-osd-prepare-ocs-deviceset-2-data-0l6cmn-mj6hp           0/1     Completed     0          17h
    rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-586cc8555dtb   2/2     Running       0          17h
    rook-ceph-tools-69c5449589-t6gxp

Any idea what's causing the nodes to go down when running the ocs-ci tests?

Actually it looks like the test "tests/manage/pv_services/test_daemon_kill_during_pvc_pod_creation_and_io.py::TestDaemonKillDuringCreationOperations::test_daemon_kill_during_pvc_pod_creation_and_io[CephFileSystem-mgr]" is bringing the cluster to an unhealthy state, which looks similar to https://bugzilla.redhat.com/show_bug.cgi?id=1945016.

*** This bug has been marked as a duplicate of bug 1945016 ***
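
If anyone needs to confirm the suspect test in isolation (for example while verifying the fix tracked in bug 1945016), a possible invocation is sketched below. It assumes run-ci forwards a pytest node ID in place of the `tests` directory used in the earlier commands, which has not been verified here:

    # Rerun only the daemon-kill test case with log collection enabled.
    # Assumption: run-ci passes the node ID below straight through to pytest,
    # the same way it passes the "tests" directory in the earlier commands.
    run-ci --ocsci-conf ~/ocs-ci/config.yaml --cluster-path /root/ocp4-workdir/ \
      --collect-logs --capture=no \
      "tests/manage/pv_services/test_daemon_kill_during_pvc_pod_creation_and_io.py::TestDaemonKillDuringCreationOperations::test_daemon_kill_during_pvc_pod_creation_and_io[CephFileSystem-mgr]"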