@bkunal As discussed in the meeting, asked the customer if they can wait to upgrade to 4.9. If not, a backport fix is needed for 4.7.5 and 4.8.3.
@bkunal Customer stated no backport fix is needed for 4.7.5 and 4.8.3; it can wait until the 4.9 release.
Verification of enabling huge pages after OCP deployment and before ODF deployment is here: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/view/Tier1/job/qe-trigger-vsphere-upi-1az-rhcos-vsan-3m-3w-tier1/38/

Here I scheduled a job which will do a regular deployment and will pause before tier execution: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-tier1/39/

Once it's paused, we can enable huge pages and continue the run, as sketched below.
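At that pause point, enabling huge pages boils down to applying a MachineConfig that adds hugepages kernel arguments to the worker pool. A minimal sketch follows; the resource name, page size, and page count here are illustrative assumptions, the actual manifest we apply is the huge_pages.yaml template referenced in the next comment:

  # Sketch only: name, page size, and count are assumptions, not the real template values.
  cat <<'EOF' | oc apply -f -
  apiVersion: machineconfiguration.openshift.io/v1
  kind: MachineConfig
  metadata:
    name: 50-worker-hugepages          # hypothetical name
    labels:
      machineconfiguration.openshift.io/role: worker
  spec:
    kernelArguments:
      - hugepagesz=2M                  # assumed page size
      - hugepages=64                   # assumed page count
  EOF

The Machine Config Operator then drains and reboots the workers one at a time, which is why the run has to stay paused until the worker pool finishes rolling out.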
I enabled huge pages on a cluster which already had ODF installed, by applying this file:

  oc apply -f https://raw.githubusercontent.com/red-hat-storage/ocs-ci/master/ocs_ci/templates/ocp-deployment/huge_pages.yaml

Then I waited for the nodes to restart:

pbalogh@pbalogh-mac hugepages $ oc get machineconfigpool
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-5285892ff5c4de19c01780772d80a409   True      False      False      3              3                   3                     0                      176m
worker   rendered-worker-272f125863faaffd629cf8e12356da2e   False     True       False      3              2                   2                     0                      176m

pbalogh@pbalogh-mac hugepages $ oc get node
NAME                                         STATUS                     ROLES    AGE    VERSION
ip-10-0-129-165.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   169m   v1.22.0-rc.0+a44d0f0
ip-10-0-147-46.us-east-2.compute.internal    Ready                      master   176m   v1.22.0-rc.0+a44d0f0
ip-10-0-174-6.us-east-2.compute.internal     Ready                      worker   170m   v1.22.0-rc.0+a44d0f0
ip-10-0-179-40.us-east-2.compute.internal    Ready                      master   177m   v1.22.0-rc.0+a44d0f0
ip-10-0-206-103.us-east-2.compute.internal   Ready                      worker   170m   v1.22.0-rc.0+a44d0f0
ip-10-0-219-185.us-east-2.compute.internal   Ready                      master   176m   v1.22.0-rc.0+a44d0f0

pbalogh@pbalogh-mac hugepages $ oc get node
NAME                                         STATUS   ROLES    AGE    VERSION
ip-10-0-129-165.us-east-2.compute.internal   Ready    worker   169m   v1.22.0-rc.0+a44d0f0
ip-10-0-147-46.us-east-2.compute.internal    Ready    master   176m   v1.22.0-rc.0+a44d0f0
ip-10-0-174-6.us-east-2.compute.internal     Ready    worker   170m   v1.22.0-rc.0+a44d0f0
ip-10-0-179-40.us-east-2.compute.internal    Ready    master   177m   v1.22.0-rc.0+a44d0f0
ip-10-0-206-103.us-east-2.compute.internal   Ready    worker   170m   v1.22.0-rc.0+a44d0f0
ip-10-0-219-185.us-east-2.compute.internal   Ready    master   177m   v1.22.0-rc.0+a44d0f0

pbalogh@pbalogh-mac hugepages $ oc get machineconfigpool
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-5285892ff5c4de19c01780772d80a409   True      False      False      3              3                   3                     0                      176m
worker   rendered-worker-7b55731a49f9190cf8b37fcc7e23ca77   True      False      False      3              3                   3                     0                      176m

Checking the status of the cluster:

pbalogh@pbalogh-mac hugepages $ oc rsh -n openshift-storage rook-ceph-tools-57b9b69bc5-765r6
sh-4.4$ ceph status
  cluster:
    id:     7e48a5d1-14df-49fa-aee1-408264005a2f
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 101s)
    mgr: a(active, since 5m)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 99s), 3 in (since 2h)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 97 pgs
    objects: 522 objects, 1.3 GiB
    usage:   3.0 GiB used, 1.5 TiB / 1.5 TiB avail
    pgs:     97 active+clean

  io:
    client: 853 B/s rd, 5.7 KiB/s wr, 1 op/s rd, 0 op/s wr

sh-4.4$ exit

pbalogh@pbalogh-mac hugepages $ oc get noobaa -n openshift-storage
NAME     MGMT-ENDPOINTS                   S3-ENDPOINTS                     IMAGE                                                                                                 PHASE   AGE
noobaa   ["https://10.0.206.103:32651"]   ["https://10.0.206.103:30801"]   quay.io/rhceph-dev/mcg-core@sha256:ff043dde04a8b83f10be1a2437c88b3cfd0c7e691868ed418b191a02fb8129c8   Ready   142m

pbalogh@pbalogh-mac hugepages $ oc get pod -n openshift-storage -o wide
NAME                                                              READY   STATUS    RESTARTS   AGE     IP             NODE                                         NOMINATED NODE   READINESS GATES
csi-cephfsplugin-89zzv                                            3/3     Running   3          152m    10.0.129.165   ip-10-0-129-165.us-east-2.compute.internal   <none>           <none>
csi-cephfsplugin-hpvrh                                            3/3     Running   3          152m    10.0.174.6     ip-10-0-174-6.us-east-2.compute.internal     <none>           <none>
csi-cephfsplugin-provisioner-f5485c88c-599c4                      6/6     Running   0          9m20s   10.131.0.9     ip-10-0-206-103.us-east-2.compute.internal   <none>           <none>
csi-cephfsplugin-provisioner-f5485c88c-gfjfv                      6/6     Running   0          6m29s   10.128.2.21    ip-10-0-174-6.us-east-2.compute.internal     <none>           <none>
csi-cephfsplugin-tvx2x                                            3/3     Running   3          152m    10.0.206.103   ip-10-0-206-103.us-east-2.compute.internal   <none>           <none>
csi-rbdplugin-fk5zw                                               3/3     Running   3          152m    10.0.174.6     ip-10-0-174-6.us-east-2.compute.internal     <none>           <none>
csi-rbdplugin-kfdb8                                               3/3     Running   3          152m    10.0.206.103   ip-10-0-206-103.us-east-2.compute.internal   <none>           <none>
csi-rbdplugin-provisioner-65d9bf8587-2hjrf                        6/6     Running   0          9m19s   10.131.0.14    ip-10-0-206-103.us-east-2.compute.internal   <none>           <none>
csi-rbdplugin-provisioner-65d9bf8587-4g6ck                        6/6     Running   0          6m27s   10.128.2.25    ip-10-0-174-6.us-east-2.compute.internal     <none>           <none>
csi-rbdplugin-z2xbb                                               3/3     Running   3          152m    10.0.129.165   ip-10-0-129-165.us-east-2.compute.internal   <none>           <none>
noobaa-core-0                                                     1/1     Running   0          8m49s   10.131.0.21    ip-10-0-206-103.us-east-2.compute.internal   <none>           <none>
noobaa-db-pg-0                                                    1/1     Running   0          9m2s    10.131.0.24    ip-10-0-206-103.us-east-2.compute.internal   <none>           <none>
noobaa-endpoint-7bb45cc5c8-xgpkn                                  1/1     Running   0          9m20s   10.131.0.13    ip-10-0-206-103.us-east-2.compute.internal   <none>           <none>
noobaa-operator-5b96d4cf64-fw8g6                                  1/1     Running   0          6m31s   10.128.2.7     ip-10-0-174-6.us-east-2.compute.internal     <none>           <none>
ocs-metrics-exporter-c894f4fd5-5q6sv                              1/1     Running   0          6m30s   10.128.2.17    ip-10-0-174-6.us-east-2.compute.internal     <none>           <none>
ocs-operator-5454cb86b7-9s5sg                                     1/1     Running   0          9m20s   10.131.0.12    ip-10-0-206-103.us-east-2.compute.internal   <none>           <none>
odf-console-77dc4875d4-82gl6                                      1/1     Running   0          9m21s   10.131.0.8     ip-10-0-206-103.us-east-2.compute.internal   <none>           <none>
odf-operator-controller-manager-568f657687-qlttw                  2/2     Running   0          6m28s   10.128.2.23    ip-10-0-174-6.us-east-2.compute.internal     <none>           <none>
rook-ceph-crashcollector-32e003117c14a6a8adbfda64bd1f34bd-sq2kc   1/1     Running   0          9m28s   10.131.0.6     ip-10-0-206-103.us-east-2.compute.internal   <none>           <none>
rook-ceph-crashcollector-a20d1188be8da66ce800d2f1ff2d5c6c-2lblx   1/1     Running   0          6m38s   10.128.2.6     ip-10-0-174-6.us-east-2.compute.internal     <none>           <none>
rook-ceph-crashcollector-f4797d902b7bc0a6bbc07b7c6fa6f896-k5ptn   1/1     Running   0          2m40s   10.129.2.6     ip-10-0-129-165.us-east-2.compute.internal   <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7c8577f5qg5lt   2/2     Running   0          6m29s   10.128.2.19    ip-10-0-174-6.us-east-2.compute.internal     <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7b4fb48fgnrz7   2/2     Running   0          9m20s   10.131.0.15    ip-10-0-206-103.us-east-2.compute.internal   <none>           <none>
rook-ceph-mgr-a-9bd978474-nrkxw                                   2/2     Running   0          6m30s   10.128.2.16    ip-10-0-174-6.us-east-2.compute.internal     <none>           <none>
rook-ceph-mon-a-64986d5cfc-z569s                                  2/2     Running   0          11m     10.131.0.19    ip-10-0-206-103.us-east-2.compute.internal   <none>           <none>
rook-ceph-mon-b-6468dc46cf-pb7wh                                  2/2     Running   0          8m53s   10.128.2.26    ip-10-0-174-6.us-east-2.compute.internal     <none>           <none>
rook-ceph-mon-c-c87988669-xt9z6                                   2/2     Running   0          5m51s   10.129.2.9     ip-10-0-129-165.us-east-2.compute.internal   <none>           <none>
rook-ceph-operator-86bc97678-46vcd                                1/1     Running   0          6m28s   10.128.2.24    ip-10-0-174-6.us-east-2.compute.internal     <none>           <none>
rook-ceph-osd-0-d57ccfbcb-grk8m                                   2/2     Running   0          3m56s   10.129.2.8     ip-10-0-129-165.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-1-84768d9c6f-p89tg                                  2/2     Running   0          11m     10.131.0.18    ip-10-0-206-103.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-2-5c4c88b9bd-cr2lv                                  2/2     Running   0          7m54s   10.128.2.27    ip-10-0-174-6.us-east-2.compute.internal     <none>           <none>
rook-ceph-tools-57b9b69bc5-765r6                                  1/1     Running   0          6m31s   10.0.174.6     ip-10-0-174-6.us-east-2.compute.internal     <none>           <none>

$ oc get csv -n openshift-storage
NAME                     DISPLAY                       VERSION   REPLACES   PHASE
noobaa-operator.v4.9.0   NooBaa Operator               4.9.0                Succeeded
ocs-operator.v4.9.0      OpenShift Container Storage   4.9.0                Succeeded
odf-operator.v4.9.0      OpenShift Data Foundation     4.9.0                Succeeded

So this scenario looks good so far. Let's see how the tier1 results turn out here: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-tier1/39/
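As an extra sanity check (not captured in the transcript above), we can confirm the worker pool finished rolling out and that the workers actually advertise the huge pages. A sketch, assuming the 2Mi page size from the template sketched earlier:

  # block until the worker pool reports the new rendered config as rolled out
  oc wait machineconfigpool/worker --for=condition=Updated --timeout=30m

  # each worker should now report a non-zero hugepages-2Mi allocatable value
  oc get nodes -l node-role.kubernetes.io/worker \
    -o custom-columns=NAME:.metadata.name,HUGEPAGES-2Mi:.status.allocatable.hugepages-2Mi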
The first job failed in deployment, so it was re-triggered here: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-vsphere-upi-1az-rhcos-vsan-3m-3w-tier1/40/ The second scenario described above also finished tier1, and the results looked OK apart from the known issue with the GCP MCG tests.
The previous job also failed; it finally finished here: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-vsphere-upi-1az-rhcos-vsan-3m-3w-tier1/41/console https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/2175/console So marking as verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Data Foundation 4.9.0 enhancement, security, and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:5086
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.