Description of problem (please be as detailed as possible and provide log snippets):

I am looking at some of the upgrade executions from 4.14 to 4.15 and all of them failed. In all of them it looks like the must-gather was not collected.

Error:
Command '['oc', '--kubeconfig', '/home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig', 'adm', 'must-gather', '--image=quay.io/rhceph-dev/ocs-must-gather:latest-4.15', '--dest-dir=/home/jenkins/current-cluster-dir/logs/failed_testcase_ocs_logs_1702896210/test_upgrade_ocs_logs/j-113vi1cs33-uba/ocs_must_gather']' timed out after 2100 seconds
Must-Gather Output:

This is the command I see we run, from the console output of the job, so it looks like it didn't collect the logs in 35 minutes and our timeout killed the command:

pod "must-gather-jmwfl-helper" deleted
error: resource name may not be empty
error: resource name may not be empty
error: resource name may not be empty
error: resource name may not be empty
error: resource name may not be empty
error: resource name may not be empty
Error from server (NotFound): error when deleting "pod_helper.yaml": pods "must-gather-jmwfl-helper" not found
error: resource name may not be empty
error: resource name may not be empty
error: resource name may not be empty
error: resource name may not be empty
error: resource name may not be empty
error: resource name may not be empty

So it seems the helper pod was deleted before the collection completed and before the gather pod could copy the collection.

Discussed here: https://chat.google.com/room/AAAAREGEba8/sSNlEKi4Gmk
Yati said it's a bug on the MG side and should be fixed here: https://github.com/red-hat-storage/odf-must-gather/pull/100

Version of all relevant components (if applicable):
4.15.0-89

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue reproduce from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Run must-gather on a 4.15 cluster - in this execution it was a cluster upgraded from 4.14, which I don't think is relevant.
2.
3.

Actual results:
Must gather is not collected; the command times out after 2100 seconds.

Expected results:
Must gather is collected.
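For reference, a minimal sketch of how a CI wrapper like ours ends up killing the must-gather command; the oc arguments and the 2100-second timeout are taken from the error above, but the function and variable names are illustrative, not the actual ocs-ci code:

import subprocess

def collect_must_gather(kubeconfig, dest_dir, timeout=2100):
    # Same oc adm must-gather invocation seen in the job console output.
    cmd = [
        "oc", "--kubeconfig", kubeconfig, "adm", "must-gather",
        "--image=quay.io/rhceph-dev/ocs-must-gather:latest-4.15",
        "--dest-dir=" + dest_dir,
    ]
    # subprocess kills the child and raises TimeoutExpired after `timeout`
    # seconds, which is what produces the "timed out after 2100 seconds" error.
    return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)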
I tested MG on ODF 4.15 [odf-operator.v4.15.0-98.stable]:

$ time oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.15
....
[must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-s5j5q deleted

Reprinting Cluster State:
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:
ClusterID: fce1583c-d2b1-44a7-9345-ee787616b50b
ClusterVersion: Stable at "4.15.0-0.nightly-2023-12-25-100326"
ClusterOperators:
	All healthy and stable

real	7m3.801s
user	0m1.896s
sys	0m0.672s

Niraj, can I test it on the "latest-4.15" image?
ocs-operator.v4.15.0-100.stable: must gather is still failing to collect:

2024-01-03 08:27:57 02:57:56 - ThreadPoolExecutor-5_0 - ocs_ci.ocs.utils - INFO - Must gather image: quay.io/rhceph-dev/ocs-must-gather:latest-4.15 will be used.
2024-01-03 08:27:57 02:57:56 - ThreadPoolExecutor-5_0 - ocs_ci.ocs.utils - INFO - OCS logs will be placed in location /home/jenkins/current-cluster-dir/logs/testcases_1704249707/vavuthuupq1/ocs_must_gather
2024-01-03 08:27:57 02:57:56 - ThreadPoolExecutor-5_0 - ocs_ci.utility.utils - INFO - Executing command: oc --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.15 --dest-dir=/home/jenkins/current-cluster-dir/logs/testcases_1704249707/vavuthuupq1/ocs_must_gather
2024-01-03 09:03:04 03:32:56 - ThreadPoolExecutor-5_0 - ocs_ci.ocs.utils - ERROR - Failed during must gather logs! Error: Command '['oc', '--kubeconfig', '/home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig', 'adm', 'must-gather', '--image=quay.io/rhceph-dev/ocs-must-gather:latest-4.15', '--dest-dir=/home/jenkins/current-cluster-dir/logs/testcases_1704249707/vavuthuupq1/ocs_must_gather']' timed out after 2100 seconds
Must-Gather Output:

job (upgrade from 4.14 to 4.15): https://url.corp.redhat.com/024bded
must gather (not collected): https://url.corp.redhat.com/54ddfd5
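To make "not collected" concrete: one way to check is whether the --dest-dir ends up empty after the timeout, since oc adm must-gather copies the gathered data into a per-image subdirectory there. A small sketch, using the dest-dir path from the log above; the helper name is illustrative:

import os

def must_gather_collected(dest_dir):
    # On success, oc adm must-gather leaves a subdirectory (named after the
    # image) containing the gathered files; an empty or missing dest dir means
    # nothing was copied back before the command was killed.
    return os.path.isdir(dest_dir) and any(os.scandir(dest_dir))

print(must_gather_collected(
    "/home/jenkins/current-cluster-dir/logs/testcases_1704249707/vavuthuupq1/ocs_must_gather"))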
We don't need to wait for a cluster or for a job. Don't we have any cluster with the latest 4.15 image installed? This is just must-gather collection, which we can simply verify; why make it so complicated?
Can you check my test procedure? Do I need to test it after tier1?

Test process:

1. Upgrade ODF 4.14 -> 4.15

$ oc get csv -A
NAMESPACE                              NAME                                         DISPLAY                       VERSION             REPLACES                                PHASE
openshift-operator-lifecycle-manager   packageserver                                Package Server                0.0.1-snapshot                                              Succeeded
openshift-storage                      mcg-operator.v4.15.0-113.stable              NooBaa Operator               4.15.0-113.stable   mcg-operator.v4.14.4-rhodf              Succeeded
openshift-storage                      ocs-operator.v4.15.0-113.stable              OpenShift Container Storage   4.15.0-113.stable   ocs-operator.v4.14.4-rhodf              Succeeded
openshift-storage                      odf-csi-addons-operator.v4.15.0-113.stable   CSI Addons                    4.15.0-113.stable   odf-csi-addons-operator.v4.14.4-rhodf   Succeeded
openshift-storage                      odf-operator.v4.15.0-113.stable              OpenShift Data Foundation     4.15.0-113.stable   odf-operator.v4.14.4-rhodf              Succeeded

2. Collect MG

$ time oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.15

real	3m29.078s
user	0m0.819s
sys	0m0.373s
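If it helps to script the precondition for step 1, here is a minimal sketch that checks all openshift-storage CSVs reached the 4.15 build and are Succeeded before collecting MG. It assumes oc access to the cluster; the function name and expected version string are illustrative only:

import json
import subprocess

def odf_csvs_ready(expected_version="4.15.0-113.stable", namespace="openshift-storage"):
    # Equivalent of eyeballing `oc get csv -n openshift-storage`: every CSV
    # should report the expected version and the Succeeded phase.
    out = subprocess.run(
        ["oc", "get", "csv", "-n", namespace, "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    items = json.loads(out)["items"]
    return bool(items) and all(
        csv["spec"]["version"] == expected_version
        and csv.get("status", {}).get("phase") == "Succeeded"
        for csv in items
    )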
tier1 is a suite with many tests, so after the tier1 suite there is a lot of data on the cluster, and that's how we can simulate a "customer cluster".
I will move this BZ to the verified state based on this test: https://bugzilla.redhat.com/show_bug.cgi?id=2255240#c20

I will try to find a method to create a cluster like a customer's [with a lot of data and operators].
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:1383