Bug 2054898

Summary: Detect and report OSD k8s job failures
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Travis Nielsen <tnielsen>
Component: rook Assignee: Travis Nielsen <tnielsen>
Status: CLOSED CURRENTRELEASE QA Contact: Oded <oviner>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 4.9 CC: ebenahar, madam, muagarwa, ocs-bugs, odf-bz-bot, sblaisdo
Target Milestone: ---   
Target Release: ODF 4.10.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 4.10.0-175 Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: 2054724 Environment:
Last Closed: 2022-04-21 09:12:46 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 2054724    
Bug Blocks:    

Description Travis Nielsen 2022-02-15 22:39:06 UTC
+++ This bug was initially created as a clone of Bug #2054724 +++

Description of problem (please be as detailed as possible and provide log
snippets):

Following BZ2047318, it was found that the OSD.1 prepare job failed but was reported as "Completed/Success". The log showed a Python traceback.

See the attached logs: "Logs for rook-ceph-osd-prepare-default-1-data-0f8266--1-zbjtx".


Additional info:

Please view extensive details in the original BZ and post-mortem notes.

- BZ2047318: https://bugzilla.redhat.com/show_bug.cgi?id=2047318
- Post-mortem OCS and MTSRE: https://docs.google.com/document/d/11VZL3OjL-gZzHtvdW3BzaBi9g26em6YN6LAfabb3lYA/

Comment 1 Travis Nielsen 2022-02-15 22:40:38 UTC
In several places, the OSD prepare job only logged an error when listing OSDs instead of returning it. Returning the error fails the job, triggers a retry, and makes it more obvious that there is a failure that needs investigation.

Comment 2 Mudit Agarwal 2022-02-24 03:43:53 UTC
Travis, please open a backport PR for 4.10

Comment 5 Oded 2022-03-06 21:28:49 UTC
Hi Travis,

How can we verify this BZ?

Is it enough to check the rook-ceph-osd-prepare-ocs-deviceset-X-data logs, as in this example?
http://pastebin.test.redhat.com/1034732

Do we need to test it on a specific platform?

Thanks,
Oded

Comment 6 Travis Nielsen 2022-03-07 15:15:56 UTC
It is not clear how to reproduce the issue that caused the failure during OSD creation. The best I can think of is a normal regression test for cluster and OSD creation, which will confirm that OSDs can still be created successfully. I am not sure of a specific way to verify this change other than code inspection.