Description of problem (please be as detailed as possible and provide log snippets):

While running the FIO workload test (test_fio_workload_simple[CephBlockPool-sequential]) through automation on a FIPS-enabled VMware vSAN environment, one OSD (rook-ceph-osd-1-5dc87475bf-9zh5m) crashed.

Snippet from the mon log:
=================
/ceph/ocsci-jenkins/openshift-clusters/jnk-vuf1cs33-p/jnk-vuf1cs33-p_20200815T005845/logs/failed_testcase_ocs_logs_1597456807/test_fio_workload_simple[CephBlockPool-sequential]_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-191e1a9fadc5b379104a64cc6516b8712acaf72f7c1ec31ad80263f5a3ba8128/ceph/namespaces/openshift-storage/pods/rook-ceph-mon-d-57b89c4978-cqj96/mon/mon/logs/current.log:2020-08-15T02:37:30.320182985Z audit 2020-08-15 02:39:06.2020-08-15T02:37:30.32019882Z 881419 mon.a (mon.0) 1612 : audit [INF] from='mgr.7725 10.131.0.48:0/1' entity='mgr.a' cmd='[{"prefix":"config-key set","key":"mgr/crash/crash/2020-08-15_02:26:03.240126Z_92de0579-78b1-4f79-a9b7-b3ec71a30452","val":"{\"crash_id\": \"2020-08-15_02:26:03.240126Z_92de0579-78b1-4f79-a9b7-b3ec71a30452\", \"timestamp\": \"2020-08-15 02:26:03.240126Z\", \"process_name\": \"ceph-osd\", \"entity_name\": \"osd.1\", \"ceph_version\": \"14.2.8-91.el8cp\", \"utsname_hostname\": \"rook-ceph-osd-1-5dc87475bf-9zh5m\", \"utsname_sysname\": \"Linux\", \"utsname_release\": \"4.18.0-193.14.3.el8_2.x86_64\", \"utsname_version\": \"#1 SMP Mon Jul 20 15:02:29 UTC 2020\", \"utsname_machine\": \"x86_64\", \"os_name\": \"Red Hat Enterprise Linux\", \"os_id\": \"rhel\", \"os_version_id\": \"8.2\", \"os_version\": \"8.2 (Ootpa)\", \"backtrace\": [\"(()+0x12dd0) [0x7f5487baedd0]\", \"(pthread_cond_wait()+0x1fc) [0x7f5487baa48c]\", \"(std::condition_variable::wait(std::unique_lock<std::mutex>&)+0x10) [0x7f54871f88c0]\", \"(Throttle::_wait(long, std::unique_lock<std::mutex>&)+0x10d) [0x557ca01e86ad]\", \"(Throttle::get(long, long)+0xb9) [0x557ca01e90c9]\", \"(BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x921) [0x557ca00af1b1]\", \"(non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x58) [0x557c9fdfbde8]\", \"(ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>)+0xeaf) [0x557c9feff2ef]\", \"(ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x267) [0x557c9ff0e277]\", \"(PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x52) [0x557c9fe13b42]\", \"(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5ce) [0x557c9fdc24ee]\", \"(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x372) [0x557c9fbe0f52]\", \"(PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x66) [0x557c9fe98d46]\", \"(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x1366) [0x557c9fc131b6]\", \"(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x557ca01f3974]\", \"(ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x557ca01f6544]\", \"(()+0x82de) [0x7f5487ba42de]\", \"(clone()+0x43) [0x7f54868dbe83]\"]}"}]': finished
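The audit line above is the mgr crash module storing the crash report. For anyone triaging, the stored report can be pulled back out from the rook-ceph toolbox pod. A minimal sketch, assuming the toolbox is deployed with the usual app=rook-ceph-tools label (adjust if your deployment differs); the crash ID is the one from the audit log above:

    # Find the rook-ceph toolbox pod and open a shell in it
    TOOLS_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name)
    oc -n openshift-storage rsh "$TOOLS_POD"

    # Inside the toolbox: list crash reports recorded by the mgr crash module
    ceph crash ls

    # Dump the full metadata and backtrace for the crash seen above
    ceph crash info 2020-08-15_02:26:03.240126Z_92de0579-78b1-4f79-a9b7-b3ec71a30452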
Snippet from the ocs-ci log at the time of the crash:
===================================================
02:33:44 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig get pod fio-client-ff1e0f2b-m2wgg -o yaml
02:33:45 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig get pod fio-client-ff1e0f2b-m2wgg
02:33:45 - MainThread - ocs_ci.ocs.ocp - INFO - status of fio-client-ff1e0f2b-m2wgg at column STATUS was Terminating, but we were waiting for Completed
02:33:45 - MainThread - ocs_ci.utility.utils - INFO - Going to sleep for 300 seconds before next iteration
02:38:45 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig get pod fio-client-ff1e0f2b-m2wgg -o yaml
02:38:45 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig get pod fio-client-ff1e0f2b-m2wgg
02:38:45 - MainThread - ocs_ci.ocs.ocp - INFO - status of fio-client-ff1e0f2b-m2wgg at column STATUS was Terminating, but we were waiting for Completed
02:38:45 - MainThread - ocs_ci.utility.utils - INFO - Going to sleep for 300 seconds before next iteration
02:43:46 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig get pod fio-client-ff1e0f2b-m2wgg -o yaml
02:43:46 - MainThread - ocs_ci.utility.utils - WARNING - Command stderr: Error from server (NotFound): pods "fio-client-ff1e0f2b-m2wgg" not found
02:43:46 - MainThread - ocs_ci.ocs.ocp - WARNING - Failed to get resource: fio-client-ff1e0f2b-m2wgg of kind: pod, selector: None, Error: Error during execution of command: oc --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig get pod fio-client-ff1e0f2b-m2wgg -o yaml. Error is Error from server (NotFound): pods "fio-client-ff1e0f2b-m2wgg" not found
Must-gather logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-vuf1cs33-p/jnk-vuf1cs33-p_20200815T005845/logs/failed_testcase_ocs_logs_1597456807/test_fio_workload_simple%5bCephBlockPool-sequential%5d_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-191e1a9fadc5b379104a64cc6516b8712acaf72f7c1ec31ad80263f5a3ba8128/

Test case log:
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/11091/testReport/junit/tests.e2e.performance.test_fio_benchmark/TestFIOBenchmark/test_fio_workload_simple_CephBlockPool_sequential_/

Complete test log:
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/11091/consoleText

Version of all relevant components (if applicable):
OCS operator v4.6.0-53.ci
Cluster Version 4.6.0-0.nightly-2020-08-12-155346
Ceph Version 14.2.8-91.el8cp
rook_csi_ceph cephcsi@sha256:9a78a22c3c375b61e5112468d1160afe5793e87dd1788b576ab9f05892d83e5e
rook_ceph rook-ceph@sha256:89a6a7c0fe5f45a07db7cb1e2e41a5177c61813dd30b2faa2aa428b236e85d90

02:00:31 - MainThread - ocs_ci.ocs.version - INFO - ClusterVersion .spec.channel: stable-4.5
02:00:31 - MainThread - ocs_ci.ocs.version - INFO - ClusterVersion .status.desired.version: 4.5.0-0.nightly-2020-08-13-174234
02:00:31 - MainThread - ocs_ci.ocs.version - INFO - ClusterVersion .status.desired.image: registry.svc.ci.openshift.org/ocp/release@sha256:e41b58b45c7a15009e5161580092eac9b31baaffb8cc15fc1ac7e28eb3380383
02:00:31 - MainThread - ocs_ci.ocs.version - INFO - storage namespace openshift-cluster-storage-operator
02:00:31 - MainThread - ocs_ci.ocs.version - INFO - image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:55527efb25dc71aa392b59f269afc5fed6a03af1bb0c2fa78a90cc67ac40342b {'quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:55527efb25dc71aa392b59f269afc5fed6a03af1bb0c2fa78a90cc67ac40342b'}
02:00:31 - MainThread - ocs_ci.ocs.version - INFO - image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2d0505764aab80d4cc297727f5baea31efd4d8627b5e6f3ebcb6e3c0b82af19b {'quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2d0505764aab80d4cc297727f5baea31efd4d8627b5e6f3ebcb6e3c0b82af19b'}
02:00:31 - MainThread - ocs_ci.ocs.version - INFO - image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:874c7266607cdf9cd6996d1a3345a493fd13b7f719263bfae3c10ddaf0ae1132 {'quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:874c7266607cdf9cd6996d1a3345a493fd13b7f719263bfae3c10ddaf0ae1132'}
02:00:31 - MainThread - ocs_ci.ocs.version - INFO - storage namespace openshift-kube-storage-version-migrator
02:00:31 - MainThread - ocs_ci.ocs.version - INFO - image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:df263c82ee7da6142f4cd633b590468005f23e72f61427db3783d0c7b6120b3c {'quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:df263c82ee7da6142f4cd633b590468005f23e72f61427db3783d0c7b6120b3c'}
02:00:31 - MainThread - ocs_ci.ocs.version - INFO - storage namespace openshift-kube-storage-version-migrator-operator
02:00:31 - MainThread - ocs_ci.ocs.version - INFO - image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:552c2a0af54aa522e4e7545ce3d6813d7b103aea4a983387bca50a0a1178dc18 {'quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:552c2a0af54aa522e4e7545ce3d6813d7b103aea4a983387bca50a0a1178dc18'}
02:00:31 - MainThread - ocs_ci.ocs.version - INFO - storage namespace openshift-storage
02:00:31 - MainThread - ocs_ci.ocs.version - INFO - image quay.io/rhceph-dev/cephcsi@sha256:6f873f8aaa4367ef835f43c35850d7bb86cc971fac7d0949d4079c58cb6728fc {'quay.io/rhceph-dev/cephcsi@sha256:540c0b93f6d2c76845ebbfa96a728b5eb58f08fd4ec78641ba3d23aaadbfcc0c'}
02:00:31 - MainThread - ocs_ci.ocs.version - INFO - image registry.redhat.io/openshift4/ose-csi-driver-registrar@sha256:39930a20d518455a9776fdae1f70945564fec4acd4f028a66ba9f24ee31bf1dc {'registry.redhat.io/openshift4/ose-csi-driver-registrar@sha256:39930a20d518455a9776fdae1f70945564fec4acd4f028a66ba9f24ee31bf1dc'}
02:00:31 - MainThread - ocs_ci.ocs.version - INFO - image registry.redhat.io/openshift4/ose-csi-external-attacher@sha256:74504ef79d8bb8ec3d517bf47ef5513fcd183190915ef55b7e1ddaca1e98d2cc {'registry.redhat.io/openshift4/ose-csi-external-attacher@sha256:74504ef79d8bb8ec3d517bf47ef5513fcd183190915ef55b7e1ddaca1e98d2cc'}
02:00:31 - MainThread - ocs_ci.ocs.version - INFO - image registry.redhat.io/openshift4/ose-csi-external-provisioner-rhel7@sha256:c237b0349c7aba8b3f32f27392f90ad07e1ca4bede000ff3a6dea34253b2278e {'registry.redhat.io/openshift4/ose-csi-external-provisioner-rhel7@sha256:bbdf56eec860aeeead082f54c7a7685a63d54f230df83216493af5623c1d6498'}
02:00:31 - MainThread - ocs_ci.ocs.version - INFO - image registry.redhat.io/openshift4/ose-csi-external-resizer-rhel7@sha256:12f6ed87b8b71443da15faa1c521cfac8fd5defeaf2734fb88c3305d8bd71a3d {'registry.redhat.io/openshift4/ose-csi-external-resizer-rhel7@sha256:12f6ed87b8b71443da15faa1c521cfac8fd5defeaf2734fb88c3305d8bd71a3d'}
02:00:31 - MainThread - ocs_ci.ocs.version - INFO - image quay.io/rhceph-dev/mcg-core@sha256:d2e4edc717533ae0bdede3d8ada917cec06a946e0662b560ffd4493fa1b51f27 {'quay.io/rhceph-dev/mcg-core@sha256:6a511b8d44d9ced96db9156a0b672f85f2424a671c8a2c978e6f52c1d37fe9e2'}
02:00:31 - MainThread - ocs_ci.ocs.version - INFO - image registry.redhat.io/rhscl/mongodb-36-rhel7@sha256:ba74027bb4b244df0b0823ee29aa927d729da33edaa20ebdf51a2430cc6b4e95 {'registry.redhat.io/rhscl/mongodb-36-rhel7@sha256:ba74027bb4b244df0b0823ee29aa927d729da33edaa20ebdf51a2430cc6b4e95'}
02:00:31 - MainThread - ocs_ci.ocs.version - INFO - image quay.io/rhceph-dev/mcg-operator@sha256:7883296b72541ce63d127cdfa0f92fcdd7d5e977add678365401ac668489c805 {'quay.io/rhceph-dev/mcg-operator@sha256:7883296b72541ce63d127cdfa0f92fcdd7d5e977add678365401ac668489c805'}
02:00:31 - MainThread - ocs_ci.ocs.version - INFO - image quay.io/rhceph-dev/ocs-operator@sha256:a25b99a86f0fcabf2289c04495a75788e79f5e750425b8b54c056cfae958900c {'quay.io/rhceph-dev/ocs-operator@sha256:2987b6300a63a155e8f20637b28f921804bf74bd34c6dbe1202890268a4a8a95'}
02:00:31 - MainThread - ocs_ci.ocs.version - INFO - image quay.io/rhceph-dev/rhceph@sha256:eafd1acb0ada5d7cf93699056118aca19ed7a22e4938411d307ef94048746cc8 {'quay.io/rhceph-dev/rhceph@sha256:3def885ad9e8440c5bd6d5c830dafdd59edf9c9e8cce0042b0f44a5396b5b0f6'}
02:00:31 - MainThread - ocs_ci.ocs.version - INFO - image quay.io/rhceph-dev/rook-ceph@sha256:d2a38f84f0c92d5427b41b9ff2b20db69c765291789e3419909d80255b1bbd7b {'quay.io/rhceph-dev/rook-ceph@sha256:38e5d6daaaef3a933b6e2328efeaf79130011d74a77bc0451429e51d7aeaf3ff'}

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?
Not sure, I believe rebalance was in progress and

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
1/1

Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:
NA; performance tests had not previously been run in a FIPS-enabled environment.

Steps to Reproduce:
1. Run the performance tier tests from ocs-ci on a FIPS-enabled VMware environment - https://github.com/red-hat-storage/ocs-ci/blob/master/tests/e2e/performance (a sample invocation is sketched under Additional info below)

Actual results:
OSD crash observed

Expected results:
No crash or disruption should be seen

Additional info:
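A rough sketch of how the failing test can be invoked through ocs-ci's run-ci wrapper. The cluster path and config file below are placeholders, and the exact flags may differ by ocs-ci version; the test node ID matches the Jenkins report linked above:

    # Hypothetical invocation; adjust the cluster dir and conf file to your setup
    run-ci -m performance \
        --cluster-path /path/to/cluster-dir \
        --ocsci-conf conf/your_fips_vmware_conf.yaml \
        tests/e2e/performance/test_fio_benchmark.py::TestFIOBenchmark::test_fio_workload_simple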
Can you try reproducing this and capturing the log from osd.1, e.g. by setting log_to_file to true for all OSDs, as documented under "Activate log to file for a particular Ceph daemon" here: https://rook.io/docs/rook/v1.4/ceph-common-issues.html

It would be helpful to know whether this is a race (likely, from the backtrace) or is in some way made more likely by FIPS mode. It looks like must-gather is not saving the log from the initial incarnation of osd.1, only the log from after it was restarted (https://bugzilla.redhat.com/show_bug.cgi?id=1869406).
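A sketch of the Ceph config commands that achieve this, run from the rook-ceph toolbox pod; consult the linked doc for the authoritative steps:

    # Enable file logging for all OSDs, or target the affected daemon only
    ceph config set osd log_to_file true
    ceph config set osd.1 log_to_file true

    # Remove the overrides once the crash has been captured
    ceph config rm osd log_to_file
    ceph config rm osd.1 log_to_file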
The version details above contain a copy-paste error. These are the actual versions:

OCS operator v4.5.0-49.ci
Cluster Version 4.5.0-0.nightly-2020-08-13-174234
Ceph Version 14.2.8-91.el8cp (75b4845da7d469665bd48d1a49badcc3677bf5cd) nautilus (stable)

More info available here - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-vuf1cs33-p/jnk-vuf1cs33-p_20200815T005845/logs/test_report_1597377706.html
Clearly not 4.5 material at this point. To evaluate severity and prioritize it for future releases, it would help to know how reproducible this is.
This issue hasn't reproduced since its first occurrence; I've run the tests around 5 times so far. We'll keep a close watch for this crash in the upcoming automation jobs too.
It looks like this should be closed at this point as non-reproducible.
Please re-open if this reproduces. We need logs to be able to diagnose, as described in comment #2.