Bug 2054898 - Detect and report OSD k8s job failures
Summary: Detect and report OSD k8s job failures
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ODF 4.10.0
Assignee: Travis Nielsen
QA Contact: Oded
URL:
Whiteboard:
Depends On: 2054724
Blocks:
 
Reported: 2022-02-15 22:39 UTC by Travis Nielsen
Modified: 2023-08-09 17:03 UTC

Fixed In Version: 4.10.0-175
Doc Type: No Doc Update
Doc Text:
Clone Of: 2054724
Environment:
Last Closed: 2022-04-21 09:12:46 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github red-hat-storage rook pull 352 0 None open Bug 2054898: osd: Return error if fail to list osds in prepare job 2022-02-24 23:00:51 UTC
Github rook rook pull 9746 0 None open osd: Return error if fail to list osds in prepare job 2022-02-15 22:40:38 UTC

Description Travis Nielsen 2022-02-15 22:39:06 UTC
+++ This bug was initially created as a clone of Bug #2054724 +++

Description of problem (please be as detailed as possible and provide log snippets):

Following BZ2047318, it was found that the OSD.1 prepare job failed but was reported as "Completed/Success". The log showed a Python traceback:

View the attached logs "Logs for rook-ceph-osd-prepare-default-1-data-0f8266--1-zbjtx".


Additional info:

Please view extensive details in the original BZ and post-mortem notes.

- BZ2047318: https://bugzilla.redhat.com/show_bug.cgi?id=2047318
- Post-mortem OCS and MTSRE: https://docs.google.com/document/d/11VZL3OjL-gZzHtvdW3BzaBi9g26em6YN6LAfabb3lYA/

Comment 1 Travis Nielsen 2022-02-15 22:40:38 UTC
In several places, the OSD prepare job only logged an error when listing OSDs failed, instead of returning that error. Returning the error makes the job fail and retry, and makes it more obvious that there is a failure that needs investigation.
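
The difference between the two behaviors can be sketched roughly as follows. This is a minimal illustration, not the actual Rook code: `listOSDs`, `prepareLogOnly`, `prepareReturnError`, `failingList`, and `exitCode` are hypothetical names, and the simulated failure stands in for the traceback seen in the prepare job log. It relies only on the general Kubernetes behavior that a Job pod whose container exits with a non-zero code is treated as failed.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// listOSDs is a hypothetical stand-in for the step that lists the
// OSDs found on a device during the prepare job.
type listOSDs func() ([]int, error)

// prepareLogOnly mirrors the problematic pattern: the listing error is
// logged and then dropped, so the caller sees success.
func prepareLogOnly(list listOSDs) error {
	osds, err := list()
	if err != nil {
		log.Printf("failed to list osds: %v", err)
	}
	log.Printf("found %d osds", len(osds))
	return nil
}

// prepareReturnError mirrors the fixed pattern: the listing error is
// wrapped and returned so the caller can fail the job.
func prepareReturnError(list listOSDs) error {
	osds, err := list()
	if err != nil {
		return fmt.Errorf("failed to list osds: %w", err)
	}
	log.Printf("found %d osds", len(osds))
	return nil
}

// failingList simulates the listing step failing (assumption for the demo).
func failingList() ([]int, error) {
	return nil, errors.New("simulated failure while listing osds")
}

// exitCode maps the result to the process exit code the job container
// would report: 0 is treated as success, non-zero as failure.
func exitCode(err error) int {
	if err != nil {
		return 1
	}
	return 0
}

func main() {
	fmt.Println("log-only exit code:", exitCode(prepareLogOnly(failingList)))
	fmt.Println("return-error exit code:", exitCode(prepareReturnError(failingList)))
}
```

With the log-only variant the failure is visible only in the pod log, which matches the "Completed/Success" status described in the original report; with the returning variant the caller has an error it can turn into a failed job.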

Comment 2 Mudit Agarwal 2022-02-24 03:43:53 UTC
Travis, please open a backport PR for 4.10

Comment 5 Oded 2022-03-06 21:28:49 UTC
Hi Travis,

How can this BZ be verified?

Should we just check the rook-ceph-osd-prepare-ocs-deviceset-X-data logs, as in this example?
http://pastebin.test.redhat.com/1034732

Do we need to test it on a specific platform?

Thanks,
Oded

Comment 6 Travis Nielsen 2022-03-07 15:15:56 UTC
It is not clear how to reproduce the issue that caused the failure during OSD creation. The best I can think of is that a normal regression test for cluster and OSD creation will confirm that OSDs can still be created successfully, but I am not sure of a specific way to verify this change other than code inspection.

