Bug 1869372 - [FIPS]: OSD crash seen while running FIO test
Summary: [FIPS]: OSD crash seen while running FIO test
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ceph
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Josh Durgin
QA Contact: Raz Tamir
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-08-17 18:03 UTC by krishnaram Karthick
Modified: 2021-06-23 01:15 UTC (History)
5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-04 22:57:57 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1971084 1 high CLOSED [Tracker for BZ #1971118] [GSS] Ceph crash - aio lock 2024-10-01 18:36:07 UTC

Description krishnaram Karthick 2020-08-17 18:03:24 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
While running the FIO workload test (test_fio_workload_simple[CephBlockPool-sequential]) on a FIPS-enabled VMware vSAN environment through automation, one OSD (rook-ceph-osd-1-5dc87475bf-9zh5m) was observed to have crashed.

snippet from log:
=================
/ceph/ocsci-jenkins/openshift-clusters/jnk-vuf1cs33-p/jnk-vuf1cs33-p_20200815T005845/logs/failed_testcase_ocs_logs_1597456807/test_fio_workload_simple[CephBlockPool-sequential]_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-191e1a9fadc5b379104a64cc6516b8712acaf72f7c1ec31ad80263f5a3ba8128/ceph/namespaces/openshift-storage/pods/rook-ceph-mon-d-57b89c4978-cqj96/mon/mon/logs/current.log:2020-08-15T02:37:30.320182985Z audit 2020-08-15 02:39:06.2020-08-15T02:37:30.32019882Z 881419 mon.a (mon.0) 1612 : audit [INF] from='mgr.7725 10.131.0.48:0/1' entity='mgr.a' cmd='[{"prefix":"config-key set","key":"mgr/crash/crash/2020-08-15_02:26:03.240126Z_92de0579-78b1-4f79-a9b7-b3ec71a30452","val":"{\"crash_id\": \"2020-08-15_02:26:03.240126Z_92de0579-78b1-4f79-a9b7-b3ec71a30452\", \"timestamp\": \"2020-08-15 02:26:03.240126Z\", \"process_name\": \"ceph-osd\", \"entity_name\": \"osd.1\", \"ceph_version\": \"14.2.8-91.el8cp\", \"utsname_hostname\": \"rook-ceph-osd-1-5dc87475bf-9zh5m\", \"utsname_sysname\": \"Linux\", \"utsname_release\": \"4.18.0-193.14.3.el8_2.x86_64\", \"utsname_version\": \"#1 SMP Mon Jul 20 15:02:29 UTC 2020\", \"utsname_machine\": \"x86_64\", \"os_name\": \"Red Hat Enterprise Linux\", \"os_id\": \"rhel\", \"os_version_id\": \"8.2\", \"os_version\": \"8.2 (Ootpa)\", \"backtrace\": [\"(()+0x12dd0) [0x7f5487baedd0]\", \"(pthread_cond_wait()+0x1fc) [0x7f5487baa48c]\", \"(std::condition_variable::wait(std::unique_lock<std::mutex>&)+0x10) [0x7f54871f88c0]\", \"(Throttle::_wait(long, std::unique_lock<std::mutex>&)+0x10d) [0x557ca01e86ad]\", \"(Throttle::get(long, long)+0xb9) [0x557ca01e90c9]\", \"(BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x921) [0x557ca00af1b1]\", \"(non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, 
std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x58) [0x557c9fdfbde8]\", \"(ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>)+0xeaf) [0x557c9feff2ef]\", \"(ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x267) [0x557c9ff0e277]\", \"(PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x52) [0x557c9fe13b42]\", \"(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5ce) [0x557c9fdc24ee]\", \"(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x372) [0x557c9fbe0f52]\", \"(PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x66) [0x557c9fe98d46]\", \"(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x1366) [0x557c9fc131b6]\", \"(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x557ca01f3974]\", \"(ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x557ca01f6544]\", \"(()+0x82de) [0x7f5487ba42de]\", \"(clone()+0x43) [0x7f54868dbe83]\"]}"}]': finished
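The backtrace above shows the OSD worker thread parked in Throttle::_wait() on a condition variable while Throttle::get() was trying to acquire BlueStore transaction budget in queue_transactions(). A minimal Python sketch of that counting-throttle pattern (a hypothetical simplification for illustration only; Ceph's Throttle is C++ and considerably more elaborate):

```python
import threading

class Throttle:
    """Simplified counting throttle, loosely modeled on the pattern in the
    backtrace (not Ceph's actual implementation): get() blocks on a
    condition variable until enough budget is free, which is where the
    osd.1 worker thread was waiting."""

    def __init__(self, maximum):
        self.maximum = maximum   # total budget (e.g. bytes in flight)
        self.current = 0         # budget currently taken
        self.cond = threading.Condition()

    def get(self, count):
        """Acquire `count` units, blocking until they fit under maximum
        (analogous to Throttle::get -> Throttle::_wait -> cond.wait)."""
        with self.cond:
            while self.current + count > self.maximum:
                self.cond.wait()
            self.current += count

    def put(self, count):
        """Release `count` units and wake any blocked waiters."""
        with self.cond:
            self.current -= count
            self.cond.notify_all()
```

If the budget is never returned (or a waiter is never woken), a thread sits in that wait indefinitely, which is consistent with Josh's question below about whether this is a race.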

snippet from ocs-ci log during the time of crash:
===================================================
02:33:44 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig get pod fio-client-ff1e0f2b-m2wgg -o yaml
02:33:45 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig get pod fio-client-ff1e0f2b-m2wgg
02:33:45 - MainThread - ocs_ci.ocs.ocp - INFO - status of fio-client-ff1e0f2b-m2wgg at column STATUS was Terminating, but we were waiting for Completed
02:33:45 - MainThread - ocs_ci.utility.utils - INFO - Going to sleep for 300 seconds before next iteration
02:38:45 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig get pod fio-client-ff1e0f2b-m2wgg -o yaml
02:38:45 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig get pod fio-client-ff1e0f2b-m2wgg
02:38:45 - MainThread - ocs_ci.ocs.ocp - INFO - status of fio-client-ff1e0f2b-m2wgg at column STATUS was Terminating, but we were waiting for Completed
02:38:45 - MainThread - ocs_ci.utility.utils - INFO - Going to sleep for 300 seconds before next iteration
02:43:46 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig get pod fio-client-ff1e0f2b-m2wgg -o yaml
02:43:46 - MainThread - ocs_ci.utility.utils - WARNING - Command stderr: Error from server (NotFound): pods "fio-client-ff1e0f2b-m2wgg" not found

02:43:46 - MainThread - ocs_ci.ocs.ocp - WARNING - Failed to get resource: fio-client-ff1e0f2b-m2wgg of kind: pod, selector: None, Error: Error during execution of command: oc --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig get pod fio-client-ff1e0f2b-m2wgg -o yaml.
Error is Error from server (NotFound): pods "fio-client-ff1e0f2b-m2wgg" not found
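The ocs-ci snippet above is a simple poll-and-sleep loop: check the pod's STATUS column, and if it is not the desired value yet, sleep 300 seconds and try again. A rough Python sketch of that pattern (`wait_for_pod_status` and `get_status` are hypothetical names for illustration, not the actual ocs_ci.ocs.ocp API):

```python
import time

def wait_for_pod_status(get_status, desired, interval=300, attempts=3):
    """Poll a pod's STATUS column until it reaches `desired`.

    get_status: callable returning the current STATUS string
    desired:    the status we are waiting for (e.g. "Completed")
    interval:   seconds to sleep between checks (300 in the log above)
    attempts:   how many times to check before giving up
    """
    for _ in range(attempts):
        status = get_status()
        if status == desired:
            return True
        print(f"status was {status}, but we were waiting for {desired}")
        time.sleep(interval)
    return False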

mustgather logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-vuf1cs33-p/jnk-vuf1cs33-p_20200815T005845/logs/failed_testcase_ocs_logs_1597456807/test_fio_workload_simple%5bCephBlockPool-sequential%5d_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-191e1a9fadc5b379104a64cc6516b8712acaf72f7c1ec31ad80263f5a3ba8128/

Test case log: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/11091/testReport/junit/tests.e2e.performance.test_fio_benchmark/TestFIOBenchmark/test_fio_workload_simple_CephBlockPool_sequential_/

complete test log: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/11091/consoleText

Version of all relevant components (if applicable):

OCS operator 	v4.6.0-53.ci
Cluster Version 	4.6.0-0.nightly-2020-08-12-155346
Ceph Version 	14.2.8-91.el8cp
rook_csi_ceph 	cephcsi@sha256:9a78a22c3c375b61e5112468d1160afe5793e87dd1788b576ab9f05892d83e5e
rook_ceph 	rook-ceph@sha256:89a6a7c0fe5f45a07db7cb1e2e41a5177c61813dd30b2faa2aa428b236e85d90


02:00:31 - MainThread - ocs_ci.ocs.version - INFO - ClusterVersion .spec.channel: stable-4.5
02:00:31 - MainThread - ocs_ci.ocs.version - INFO - ClusterVersion .status.desired.version: 4.5.0-0.nightly-2020-08-13-174234
02:00:31 - MainThread - ocs_ci.ocs.version - INFO - ClusterVersion .status.desired.image: registry.svc.ci.openshift.org/ocp/release@sha256:e41b58b45c7a15009e5161580092eac9b31baaffb8cc15fc1ac7e28eb3380383
02:00:31 - MainThread - ocs_ci.ocs.version - INFO - storage namespace openshift-cluster-storage-operator
02:00:31 - MainThread - ocs_ci.ocs.version - INFO - image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:55527efb25dc71aa392b59f269afc5fed6a03af1bb0c2fa78a90cc67ac40342b {'quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:55527efb25dc71aa392b59f269afc5fed6a03af1bb0c2fa78a90cc67ac40342b'}
02:00:31 - MainThread - ocs_ci.ocs.version - INFO - image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2d0505764aab80d4cc297727f5baea31efd4d8627b5e6f3ebcb6e3c0b82af19b {'quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2d0505764aab80d4cc297727f5baea31efd4d8627b5e6f3ebcb6e3c0b82af19b'}
02:00:31 - MainThread - ocs_ci.ocs.version - INFO - image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:874c7266607cdf9cd6996d1a3345a493fd13b7f719263bfae3c10ddaf0ae1132 {'quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:874c7266607cdf9cd6996d1a3345a493fd13b7f719263bfae3c10ddaf0ae1132'}
02:00:31 - MainThread - ocs_ci.ocs.version - INFO - storage namespace openshift-kube-storage-version-migrator
02:00:31 - MainThread - ocs_ci.ocs.version - INFO - image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:df263c82ee7da6142f4cd633b590468005f23e72f61427db3783d0c7b6120b3c {'quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:df263c82ee7da6142f4cd633b590468005f23e72f61427db3783d0c7b6120b3c'}
02:00:31 - MainThread - ocs_ci.ocs.version - INFO - storage namespace openshift-kube-storage-version-migrator-operator
02:00:31 - MainThread - ocs_ci.ocs.version - INFO - image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:552c2a0af54aa522e4e7545ce3d6813d7b103aea4a983387bca50a0a1178dc18 {'quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:552c2a0af54aa522e4e7545ce3d6813d7b103aea4a983387bca50a0a1178dc18'}
02:00:31 - MainThread - ocs_ci.ocs.version - INFO - storage namespace openshift-storage
02:00:31 - MainThread - ocs_ci.ocs.version - INFO - image quay.io/rhceph-dev/cephcsi@sha256:6f873f8aaa4367ef835f43c35850d7bb86cc971fac7d0949d4079c58cb6728fc {'quay.io/rhceph-dev/cephcsi@sha256:540c0b93f6d2c76845ebbfa96a728b5eb58f08fd4ec78641ba3d23aaadbfcc0c'}
02:00:31 - MainThread - ocs_ci.ocs.version - INFO - image registry.redhat.io/openshift4/ose-csi-driver-registrar@sha256:39930a20d518455a9776fdae1f70945564fec4acd4f028a66ba9f24ee31bf1dc {'registry.redhat.io/openshift4/ose-csi-driver-registrar@sha256:39930a20d518455a9776fdae1f70945564fec4acd4f028a66ba9f24ee31bf1dc'}
02:00:31 - MainThread - ocs_ci.ocs.version - INFO - image registry.redhat.io/openshift4/ose-csi-external-attacher@sha256:74504ef79d8bb8ec3d517bf47ef5513fcd183190915ef55b7e1ddaca1e98d2cc {'registry.redhat.io/openshift4/ose-csi-external-attacher@sha256:74504ef79d8bb8ec3d517bf47ef5513fcd183190915ef55b7e1ddaca1e98d2cc'}
02:00:31 - MainThread - ocs_ci.ocs.version - INFO - image registry.redhat.io/openshift4/ose-csi-external-provisioner-rhel7@sha256:c237b0349c7aba8b3f32f27392f90ad07e1ca4bede000ff3a6dea34253b2278e {'registry.redhat.io/openshift4/ose-csi-external-provisioner-rhel7@sha256:bbdf56eec860aeeead082f54c7a7685a63d54f230df83216493af5623c1d6498'}
02:00:31 - MainThread - ocs_ci.ocs.version - INFO - image registry.redhat.io/openshift4/ose-csi-external-resizer-rhel7@sha256:12f6ed87b8b71443da15faa1c521cfac8fd5defeaf2734fb88c3305d8bd71a3d {'registry.redhat.io/openshift4/ose-csi-external-resizer-rhel7@sha256:12f6ed87b8b71443da15faa1c521cfac8fd5defeaf2734fb88c3305d8bd71a3d'}
02:00:31 - MainThread - ocs_ci.ocs.version - INFO - image quay.io/rhceph-dev/mcg-core@sha256:d2e4edc717533ae0bdede3d8ada917cec06a946e0662b560ffd4493fa1b51f27 {'quay.io/rhceph-dev/mcg-core@sha256:6a511b8d44d9ced96db9156a0b672f85f2424a671c8a2c978e6f52c1d37fe9e2'}
02:00:31 - MainThread - ocs_ci.ocs.version - INFO - image registry.redhat.io/rhscl/mongodb-36-rhel7@sha256:ba74027bb4b244df0b0823ee29aa927d729da33edaa20ebdf51a2430cc6b4e95 {'registry.redhat.io/rhscl/mongodb-36-rhel7@sha256:ba74027bb4b244df0b0823ee29aa927d729da33edaa20ebdf51a2430cc6b4e95'}
02:00:31 - MainThread - ocs_ci.ocs.version - INFO - image quay.io/rhceph-dev/mcg-operator@sha256:7883296b72541ce63d127cdfa0f92fcdd7d5e977add678365401ac668489c805 {'quay.io/rhceph-dev/mcg-operator@sha256:7883296b72541ce63d127cdfa0f92fcdd7d5e977add678365401ac668489c805'}
02:00:31 - MainThread - ocs_ci.ocs.version - INFO - image quay.io/rhceph-dev/ocs-operator@sha256:a25b99a86f0fcabf2289c04495a75788e79f5e750425b8b54c056cfae958900c {'quay.io/rhceph-dev/ocs-operator@sha256:2987b6300a63a155e8f20637b28f921804bf74bd34c6dbe1202890268a4a8a95'}
02:00:31 - MainThread - ocs_ci.ocs.version - INFO - image quay.io/rhceph-dev/rhceph@sha256:eafd1acb0ada5d7cf93699056118aca19ed7a22e4938411d307ef94048746cc8 {'quay.io/rhceph-dev/rhceph@sha256:3def885ad9e8440c5bd6d5c830dafdd59edf9c9e8cce0042b0f44a5396b5b0f6'}
02:00:31 - MainThread - ocs_ci.ocs.version - INFO - image quay.io/rhceph-dev/rook-ceph@sha256:d2a38f84f0c92d5427b41b9ff2b20db69c765291789e3419909d80255b1bbd7b {'quay.io/rhceph-dev/rook-ceph@sha256:38e5d6daaaef3a933b6e2328efeaf79130011d74a77bc0451429e51d7aeaf3ff'}


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?
Not sure; I believe a rebalance was in progress at the time.

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Can this issue be reproduced?
1/1

Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:
NA; performance tests had not previously been run in a FIPS-enabled environment.

Steps to Reproduce:
1. Run the performance-tier tests from ocs-ci on a FIPS-enabled VMware environment - https://github.com/red-hat-storage/ocs-ci/blob/master/tests/e2e/performance


Actual results:
OSD crash observed

Expected results:
No crash or disruption should be seen.

Additional info:

Comment 2 Josh Durgin 2020-08-17 21:52:37 UTC
Can you try reproducing this and capturing the log from osd.1, e.g. by setting log_to_file to true for all OSDs, as documented under "Activate log to file for a particular Ceph daemon" at https://rook.io/docs/rook/v1.4/ceph-common-issues.html?
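For reference, enabling file logging as the linked Rook page describes can be done with the Ceph config commands below. This is a sketch based on that doc and standard Ceph CLI usage; verify against the Rook/Ceph versions in use:

```shell
# Run inside the rook-ceph-tools (toolbox) pod, assumed to be deployed, e.g.:
#   oc -n openshift-storage rsh deploy/rook-ceph-tools

# Persist OSD logs to files so they survive a crash/restart
ceph config set osd log_to_file true

# Revert once logs are collected, to avoid filling the node's disk
ceph config set osd log_to_file false
```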

It would be helpful to know whether this is a race (likely, judging from the backtrace) or something made more likely by FIPS mode.

It looks like must-gather is not saving the log from the initial incarnation of osd.1, only the log from running after it was restarted (https://bugzilla.redhat.com/show_bug.cgi?id=1869406).

Comment 3 krishnaram Karthick 2020-08-18 06:36:22 UTC
The version details above have a copy-paste error. These are the actual versions:

OCS operator	v4.5.0-49.ci
Cluster Version	4.5.0-0.nightly-2020-08-13-174234
Ceph Version	14.2.8-91.el8cp (75b4845da7d469665bd48d1a49badcc3677bf5cd) nautilus (stable)

More info available here - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-vuf1cs33-p/jnk-vuf1cs33-p_20200815T005845/logs/test_report_1597377706.html

Comment 4 Josh Durgin 2020-08-18 14:47:14 UTC
Clearly not 4.5 material at this point. To evaluate severity and prioritize this for future releases, it would help to know how reproducible it is.

Comment 5 krishnaram Karthick 2020-08-24 04:57:09 UTC
This issue hasn't reproduced since its first occurrence; I've run the tests around 5 times so far. We'll keep a close watch for this crash in upcoming automation jobs as well.

Comment 6 Scott Ostapovicz 2020-09-02 13:58:28 UTC
It looks like this should be closed at this point as non-reproducible.

Comment 7 Josh Durgin 2020-09-04 22:57:57 UTC
Please re-open if this reproduces. We need logs to be able to diagnose, as described in comment #2.

