Bug 2419834
| Summary: | Multiple OSD errors due to Ceph assertion failures during EC 4+2 FIO tests | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Tejas Chaphekar <techaphe> |
| Component: | RADOS | Assignee: | Adam Kupczyk <akupczyk> |
| Status: | CLOSED ERRATA | QA Contact: | Pawan <pdhiran> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 9.0 | CC: | bhubbard, ceph-eng-bugs, cephqe-warriors, idryomov, ngangadh, nojha, pdhiran, sangadi, tserlin, vereddy, vumrao |
| Target Milestone: | --- | | |
| Target Release: | 9.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | ceph-20.1.0-124 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2026-01-29 07:04:29 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Created attachment 2117911 [details]
osd80-journalctl logs
Created attachment 2117912 [details]
osd107-journalctl logs
Created attachment 2117913 [details]
osd140-journalctl logs
Adding sosreports in the Box folder:
- https://ibm.box.com/s/2xuarlmbtqcmp92fq1hjfzd18sb22oyk
- https://ibm.box.com/s/yxht8q8nda4b3zwmc9q8rba4a5uwfpqq
- https://ibm.box.com/s/pxlbqnixkjsauddvg9j4d9u8nhv1p9r0

I have disabled bluestore_elastic_shared_blobs and retriggered the tests, but I am getting the same error on the failed OSDs. Attaching the latest logs.

Created attachment 2118129 [details]
Round2 Test Logs
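For reference, a minimal sketch of how the bluestore_elastic_shared_blobs option could be disabled and the OSDs restarted on a cephadm-managed cluster; the exact commands used for this run are not recorded in the comment, and osd.80 is only an example daemon name:

```shell
# Hedged sketch: set the option cluster-wide for OSDs via the config database.
ceph config set osd bluestore_elastic_shared_blobs false

# Confirm the value an individual OSD will pick up (osd.80 is an example).
ceph config get osd.80 bluestore_elastic_shared_blobs

# Restart an OSD daemon so the new value takes effect.
ceph orch daemon restart osd.80
```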
Following the fixes in the IBM-CEPH-9.0-202512102340.ci.0 build, we are able to move forward with Fast EC testing for the 2+2 and 4+2 profiles without the Ceph assertion failures. The following hardware environment was used to verify the fix:

- Ceph cluster: 6 x Dell X5D bare-metal servers (2 x Intel(R) Xeon(R) Gold 6438N, 512 GB memory) with 144 OSDs
- FIO clients: 6 x Dell R660 bare-metal servers (Intel(R) Xeon(R) Silver 4416+, 128 GB memory)
- RBD images: 30 x 512 GB images (5 images per client)

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat Ceph Storage 9.0 Security and Enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2026:1536
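As an illustration of the kind of spot-checks used during verification (not quoted from the original comment), the fixed build and the absence of failed OSD daemons can be confirmed with standard commands; osd.80 is a hypothetical daemon name:

```shell
# Confirm the fixed build is running on all daemons.
ceph versions

# During the FIO run, watch for HEALTH_WARN and failed cephadm daemons.
ceph health detail
ceph orch ps --daemon-type osd

# If an OSD does crash, its journal can be pulled on the host, e.g.:
#   cephadm logs --name osd.80
```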
Created attachment 2117910 [details]
osd60-journalctl logs

Description of problem:
While running FIO tests against an EC 4+2 pool with 6 clients and 30 images, we are encountering failed cephadm daemon errors. Examining the journalctl logs for the affected OSDs, we see "FAILED ceph_assert(!ito->is_valid())" assertion failures for each OSD.

Version-Release number of selected component (if applicable):
20.1.0-107.el9cp

How reproducible:
100% reproducible on all four OSDs

Steps to Reproduce:
1. Start FIO with 30 images and 6 clients against RBD images on an EC 4+2 pool (see the sketch after this description).

Actual results:
The cluster is in HEALTH_WARN status with 4 failed cephadm daemon(s).

Expected results:
All OSDs should stay up and the cluster should remain healthy for the duration of the test.

Additional info:
Attaching:
- sosreports from ceph55, ceph57, ceph59
- journalctl logs for the failed OSDs
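For illustration only, a minimal sketch of the kind of EC 4+2 RBD setup and FIO workload described above; the profile, pool, and image names (ec42profile, ecpool, rbdmeta, img1) and the FIO parameters are examples, not values taken from the report:

```shell
# Hedged sketch: create an EC 4+2 profile and a data pool that allows
# partial overwrites, which RBD requires on erasure-coded pools.
ceph osd erasure-code-profile set ec42profile k=4 m=2
ceph osd pool create ecpool erasure ec42profile
ceph osd pool set ecpool allow_ec_overwrites true

# RBD metadata lives in a replicated pool; image data goes to the EC pool.
ceph osd pool create rbdmeta replicated
rbd pool init rbdmeta
rbd pool init ecpool
rbd create rbdmeta/img1 --size 512G --data-pool ecpool

# Example FIO invocation using the rbd ioengine (one image per job shown).
fio --name=ec42-test --ioengine=rbd --pool=rbdmeta --rbdname=img1 \
    --rw=randwrite --bs=64k --iodepth=32 --numjobs=1 \
    --time_based --runtime=600
```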