+++ This bug was initially created as a clone of Bug #2054724 +++

Description of problem (please be as detailed as possible and provide log snippets):
Following BZ2047318, it was found that the OSD.1 prepare job failed but was reported as "Completed/Success". The log showed a Python traceback; see the attached logs, "Logs for rook-ceph-osd-prepare-default-1-data-0f8266--1-zbjtx".

Additional info:
Please see the extensive details in the original BZ and the post-mortem notes.
- BZ2047318: https://bugzilla.redhat.com/show_bug.cgi?id=2047318
- Post-mortem OCS and MTSRE: https://docs.google.com/document/d/11VZL3OjL-gZzHtvdW3BzaBi9g26em6YN6LAfabb3lYA/
In several places the OSD prepare job only logged an error when listing OSDs failed, instead of returning that error. Returning it fails the job, triggers a retry, and makes it more obvious that there is a failure that needs investigation.
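To illustrate the difference, here is a minimal Go sketch. It is not the actual Rook code: `osdLister`, `failingLister`, `prepareLogOnly`, and `prepareStrict` are hypothetical names invented for this example, and the listing failure is simulated.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// osdLister is a hypothetical stand-in for whatever call lists the OSDs.
type osdLister func() ([]string, error)

// failingLister simulates a listing call that always fails.
func failingLister() ([]string, error) {
	return nil, errors.New("simulated listing failure")
}

// prepareLogOnly shows the problematic pattern: the error is logged and
// then dropped, so the caller sees success.
func prepareLogOnly(list osdLister) error {
	osds, err := list()
	if err != nil {
		log.Printf("failed to list OSDs: %v", err)
	}
	log.Printf("found %d OSDs", len(osds))
	return nil
}

// prepareStrict shows the corrected pattern: the error is wrapped and
// returned, so the caller can fail the job.
func prepareStrict(list osdLister) error {
	osds, err := list()
	if err != nil {
		return fmt.Errorf("failed to list OSDs: %w", err)
	}
	log.Printf("found %d OSDs", len(osds))
	return nil
}

func main() {
	// The log-only variant reports no error even though listing failed.
	fmt.Println("log-only result:", prepareLogOnly(failingLister))

	// The strict variant surfaces the error; exiting non-zero is what
	// lets a job runner mark the run as failed instead of completed.
	if err := prepareStrict(failingLister); err != nil {
		log.Fatalf("prepare failed: %v", err)
	}
}
```

With the log-only variant the process exits 0 and the job looks "Completed"; with the strict variant it exits non-zero, so the failure is visible and can be retried.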
Travis, please open a backport PR for 4.10
Hi Travis,

How do we verify this BZ? Do we just check the rook-ceph-osd-prepare-ocs-deviceset-X-data logs, like here: http://pastebin.test.redhat.com/1034732
Do we need to test it on a specific platform?

Thanks,
Oded
It is not clear how to reproduce the issue that caused the failure during OSD creation. The best I can think of is that a normal regression test for cluster and OSD creation will confirm that OSDs can still be created successfully, but I am not sure of a specific way to verify this change other than code inspection.
This issue did not reproduce on these setups:

AWS-IPI-3AZ-RHCOS-3M-3W 4.10.0-189
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-169ai3c33-a/j-169ai3c33-a_20220312T020159/logs/deployment_1647050762/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-50850c5a2a341151f8ee3e2ff3838b00c53ec224ae85ad8552cd5a39f2103953/namespaces/openshift-storage/pods/rook-ceph-osd-prepare-ocs-deviceset-0-data-0867mh-54f2h/provision/provision/logs/current.log

OCP4-10-AWS-IPI-3AZ-RHCOS-3M-3W-3I 4.10.0-187
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-025ai3c333-d/j-025ai3c333-d_20220311T100345/logs/deployment_1646993276/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-069ec954fe9382370ffff0871009b5ec5da39c1f3e724c3e6e61f04dc8ade648/namespaces/openshift-storage/pods/rook-ceph-osd-prepare-ocs-deviceset-0-data-024mz7-dld98/provision/provision/logs/current.log

OCS4-10-Downstream-OCP4-10-VSPHERE6-UPI-ENCRYPTION-1AZ-RHCOS-VSAN-LSO-VMDK-3M-3W 4.10.0-187
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-011vue1cslv33-d/j-011vue1cslv33-d_20220311T095520/logs/deployment_1646992850/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-069ec954fe9382370ffff0871009b5ec5da39c1f3e724c3e6e61f04dc8ade648/namespaces/openshift-storage/pods/rook-ceph-osd-prepare-02fdf003a77e82a11c09e90ec3f15975-wm997/provision/provision/logs/current.log