Bug 2054724 - Detect and report OSD k8s job failures
Summary: Detect and report OSD k8s job failures
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: rook
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Travis Nielsen
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks: 2054898
Reported: 2022-02-15 15:14 UTC by Samuel Blais-Dowdy
Modified: 2022-03-01 16:20 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2054898
Environment:
Last Closed: 2022-03-01 16:20:28 UTC
Embargoed:


Attachments
Logs for rook-ceph-osd-prepare-default-1-data-0f8266--1-zbjtx (16.91 KB, text/plain)
2022-02-15 15:14 UTC, Samuel Blais-Dowdy


Links
System ID Private Priority Status Summary Last Updated
Github rook rook pull 9746 0 None Merged osd: Return error if fail to list osds in prepare job 2022-02-21 16:35:07 UTC

Description Samuel Blais-Dowdy 2022-02-15 15:14:29 UTC
Created attachment 1861278 [details]
Logs for rook-ceph-osd-prepare-default-1-data-0f8266--1-zbjtx

Description of problem (please be as detailed as possible and provide log
snippets):

Following BZ2047318, we found that the OSD.1 prepare job failed but was reported as "Completed/Success". The job log shows a Python traceback.

See the attached logs: "Logs for rook-ceph-osd-prepare-default-1-data-0f8266--1-zbjtx".


Additional info:

Please see the original BZ and the post-mortem notes for extensive details.

- BZ2047318: https://bugzilla.redhat.com/show_bug.cgi?id=2047318
- Post-mortem OCS and MTSRE: https://docs.google.com/document/d/11VZL3OjL-gZzHtvdW3BzaBi9g26em6YN6LAfabb3lYA/

Comment 1 Travis Nielsen 2022-02-15 22:45:22 UTC
This bug is cloned to 4.10 here: https://bugzilla.redhat.com/show_bug.cgi?id=2054898
I would expect this was a rare condition where the disk was corrupt. The fix just makes it a bit more obvious where the failure is. 

Not sure we need to backport this to 4.8 unless it can be reproduced more consistently.
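
For illustration only (this is not the actual Rook change, which is written in Go): a minimal Python sketch of how a prepare step can fail yet still be reported as Completed. Kubernetes treats a Job's container as successful when it exits with code 0, so a step that logs an error but still returns 0 hides the failure, while returning a non-zero code surfaces it. The function names and the ListOSDsError type are hypothetical and exist only for this sketch.

```python
import sys
from typing import Callable, List


class ListOSDsError(Exception):
    """Hypothetical error raised when the OSD listing step fails."""


def prepare_swallowing(list_osds: Callable[[], List[int]]) -> int:
    """Problem pattern: the failure is logged, but the exit code is still 0."""
    try:
        osds = list_osds()
    except ListOSDsError as err:
        print(f"failed to list osds: {err}", file=sys.stderr)
        return 0  # exit code 0: the Job would be shown as Completed
    print(f"prepared osds: {osds}")
    return 0


def prepare_propagating(list_osds: Callable[[], List[int]]) -> int:
    """Fixed pattern: a listing failure becomes a non-zero exit code."""
    try:
        osds = list_osds()
    except ListOSDsError as err:
        print(f"failed to list osds: {err}", file=sys.stderr)
        return 1  # non-zero exit code: the Job would be shown as failed
    print(f"prepared osds: {osds}")
    return 0


def broken_listing() -> List[int]:
    """Stand-in for a listing step that fails (assumed, simulated failure)."""
    raise ListOSDsError("simulated failure while listing OSDs")
```

A real entry point would pass the returned value to sys.exit, so that the container's exit status reflects the outcome of the listing step.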

Comment 2 Travis Nielsen 2022-02-28 16:44:36 UTC
Samuel, is the fix in 4.10 sufficient, or to which release would you propose backporting it? It's simple to backport, but the failure is rare enough that I'm not sure it's needed.

Comment 3 Samuel Blais-Dowdy 2022-02-28 18:13:01 UTC
Fixing 4.10 is sufficient for MTSRE. It's not a critical bug, mostly a nice to have thing for debugging so let's not bother. :) Thank you.

Comment 4 Travis Nielsen 2022-03-01 16:20:28 UTC
(In reply to Samuel Blais-Dowdy from comment #3)
> Fixing 4.10 is sufficient for MTSRE. It's not a critical bug, mostly a nice
> to have thing for debugging so let's not bother. :) Thank you.

Sounds good, I'll close this one since the fix was merged with the clone for the 4.10 release. Thanks!

