Created attachment 1959812 [details]
additional_logs_from_hub

Description of problem (please be as detailed as possible and provide log snippets):

managed cluster1 - mc1
managed cluster2 - mc2

Application failover from mc1 to mc2 is stuck in the "FailingOver" state because restoring the PVs to mc2 failed due to a NooBaa S3 communication failure. Only the namespace of the application got created on mc2 during the failover operation. Before initiating the failover, the NooBaa status was Ready on both mc1 and mc2; the must-gather logs of mc1 and mc2 collected before the failover operation are linked in the Additional info section below.

Hub:

[root@a3e25001 ~]# oc get drpc -n busybox-sample
NAME                       AGE   PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE
busybox-placement-1-drpc   20h   ocsm4205001        ocpm4202001       Failover       FailingOver

[root@a3e25001 ~]# oc get drpc busybox-placement-1-drpc -n busybox-sample -oyaml
...
status:
  actionStartTime: "2023-04-24T18:08:50Z"
  conditions:
  - lastTransitionTime: "2023-04-24T18:08:50Z"
    message: Started failover to cluster "ocpm4202001"
    observedGeneration: 3
    reason: NotStarted
    status: "False"
    type: PeerReady
  - lastTransitionTime: "2023-04-24T18:08:50Z"
    message: Waiting for PV restore to complete...
    observedGeneration: 3
    reason: FailingOver
    status: "False"
    type: Available
  lastUpdateTime: "2023-04-25T14:34:01Z"
  phase: FailingOver
  preferredDecision:
    clusterName: ocsm4205001
    clusterNamespace: ocsm4205001
  progression: WaitingForPVRestore
  resourceConditions:
    conditions:
    - lastTransitionTime: "2023-04-24T17:58:02Z"
      message: PVCs in the VolumeReplicationGroup are ready for use
      observedGeneration: 1
      reason: Ready
      status: "True"
      type: DataReady
    - lastTransitionTime: "2023-04-24T17:58:02Z"
      message: VolumeReplicationGroup is replicating
      observedGeneration: 1
      reason: Replicating
      status: "False"
      type: DataProtected
    - lastTransitionTime: "2023-04-24T17:58:01Z"
      message: Restored PV cluster data
      observedGeneration: 1
      reason: Restored
      status: "True"
      type: ClusterDataReady
    - lastTransitionTime: "2023-04-25T14:02:42Z"
      message: VRG Kube object protect error
      observedGeneration: 1
      reason: UploadError
      status: "False"
      type: ClusterDataProtected
    resourceMeta:
      generation: 1
      kind: VolumeReplicationGroup
      name: busybox-placement-1-drpc
      namespace: busybox-sample
      protectedpvcs:
      - busybox-pvc

MC2:

[root@m4202001 ~]# oc get ns busybox-sample
NAME             STATUS   AGE
busybox-sample   Active   20h

[root@m4202001 ~]# oc get all,pvc -n busybox-sample
No resources found in busybox-sample namespace.
[root@m4202001 ~]# oc get po -n openshift-storage
NAME                                               READY   STATUS    RESTARTS   AGE
csi-addons-controller-manager-6bb96f77b6-fcb22     2/2     Running   0          22h
csi-cephfsplugin-8h6td                             2/2     Running   2          25h
csi-cephfsplugin-9nwpf                             2/2     Running   2          25h
csi-cephfsplugin-provisioner-6c7d889599-25knr      5/5     Running   0          22h
csi-cephfsplugin-provisioner-6c7d889599-cn6kg      5/5     Running   0          22h
csi-cephfsplugin-sbx2r                             2/2     Running   2          25h
csi-rbdplugin-484rx                                3/3     Running   3          25h
csi-rbdplugin-5qpsx                                3/3     Running   3          25h
csi-rbdplugin-k7qkv                                3/3     Running   3          25h
csi-rbdplugin-provisioner-d46b79bbb-868p8          6/6     Running   0          22h
csi-rbdplugin-provisioner-d46b79bbb-frgq8          6/6     Running   0          22h
noobaa-core-0                                      1/1     Running   0          22h
noobaa-db-pg-0                                     1/1     Running   0          22h
noobaa-endpoint-5bdc586b7d-v97bf                   1/1     Running   0          22h
noobaa-operator-66fb78dd94-m7lbh                   1/1     Running   0          22h
ocs-metrics-exporter-6b96597864-sbrtd              1/1     Running   0          22h
ocs-operator-5598965945-pkmgw                      1/1     Running   0          22h
odf-console-55f8c5f6dd-7fhxc                       1/1     Running   0          22h
odf-operator-controller-manager-5cbb545ddc-h72wf   2/2     Running   0          22h
rook-ceph-operator-64bb84d64f-z5fs9                1/1     Running   0          22h
token-exchange-agent-7fd47f9bd8-m6465              1/1     Running   0          21h

[root@m4202001 ~]# oc get noobaa -n openshift-storage noobaa -o yaml
....
  phase: Configuring
    status: "False"
    type: Available
  - lastHeartbeatTime: "2023-04-24T12:57:13Z"
    lastTransitionTime: "2023-04-24T18:05:12Z"
    message: 'could not open file "base/16385/2601": Read-only file system'
    reason: TemporaryError
    status: "True"
    type: Progressing
  - lastHeartbeatTime: "2023-04-24T12:57:13Z"
    lastTransitionTime: "2023-04-24T12:57:13Z"
    message: 'could not open file "base/16385/2601": Read-only file system'
    reason: TemporaryError
    status: "False"
    type: Degraded
  - lastHeartbeatTime: "2023-04-24T12:57:13Z"
    lastTransitionTime: "2023-04-24T18:05:12Z"
    message: 'could not open file "base/16385/2601": Read-only file system'
    reason: TemporaryError
    status: "False"
    type: Upgradeable
  - lastHeartbeatTime: "2023-04-24T12:57:13Z"
    lastTransitionTime: "2023-04-24T12:57:13Z"
    status: k8s
    type: KMS-Type
  - lastHeartbeatTime: "2023-04-24T12:57:13Z"
    lastTransitionTime: "2023-04-24T12:58:15Z"
    status: Sync
    type: KMS-Status
  endpoints:
    readyCount: 1
    virtualHosts:
    - s3.openshift-storage.svc
  observedGeneration: 2
  phase: Configuring
  readme: "\n\n\tNooBaa operator is still working to reconcile this system.\n\tCheck out the system status.phase, status.conditions, and events with:\n\n\t\tkubectl -n openshift-storage describe noobaa\n\t\tkubectl -n openshift-storage get noobaa -o yaml\n\t\tkubectl -n openshift-storage get events --sort-by=metadata.creationTimestamp\n\n\tYou can wait for a specific condition with:\n\n\t\tkubectl -n openshift-storage wait noobaa/noobaa --for condition=available --timeout -1s\n\n\tNooBaa Core Version: \ master-20220913\n\tNooBaa Operator Version: 5.12.0\n"

RHCS:

[root@rhcs01 ~]# ceph -s
  cluster:
    id:     778d5284-ddf7-11ed-a790-525400c41d12
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum rhcs01,rhcs02,rhcs04,rhcs05,rhcs07 (age 5d)
    mgr: rhcs01.ipckaw(active, since 6d), standbys: rhcs04.kfpmco
    mds: 1/1 daemons up, 1 standby
    osd: 6 osds: 6 up (since 6d), 6 in (since 6d)
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   10 pools, 289 pgs
    objects: 1.18k objects, 1.8 GiB
    usage:   9.5 GiB used, 2.9 TiB / 2.9 TiB avail
    pgs:     289 active+clean

Version of all relevant components (if applicable):
OCP: 4.12.11
odf-operator.v4.12.2-rhodf
RHCS: 5.3.z2

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user
impact)?
Yes

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Can this issue be reproduced?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Configure a Metro-DR environment with mc1, mc2, and a hub cluster
2. Deploy the sample application busybox
3. Apply fencing to mc1 on the hub cluster and verify that the fencing is successful
4. Initiate failover of the application from mc1 to mc2 (example hub-side commands are sketched at the end of this description)

Actual results:
Failover is stuck in the "FailingOver" state

Expected results:
Application failover should be successful

Additional info:

Must-gather logs of mc1 and mc2 before the failover operation:
https://drive.google.com/file/d/1JjZ3e2xSCw33eszmwpr3iGLYYw9ac7NW/view?usp=share_link
https://drive.google.com/file/d/1XZVOxplFCsF4PNLBhZePl9LWsCSni7se/view?usp=share_link

Must-gather logs of mc1, mc2 and the hub after the failover operation was initiated:
https://drive.google.com/file/d/1WdKv-rTOO0cAtdotdz4G_yPEXC1RBmS7/view?usp=share_link
https://drive.google.com/file/d/1tFZ2pvuJ9D_2yYC5EstpNYqQpuP0tvys/view?usp=share_link
https://drive.google.com/file/d/1e4J2J_UzgcBvpWguEIZgMzlZDR9jsKKE/view?usp=share_link
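For reference, steps 3 and 4 above are driven from the hub cluster by editing the DR resources. The commands below are only an illustrative sketch: the field names (spec.clusterFence on DRCluster, spec.action and spec.failoverCluster on DRPlacementControl) are assumptions based on the Ramen CRDs and are not copied from this environment; the Metro-DR documentation is the authoritative procedure.

```
# Step 3 (sketch): fence mc1 from the hub by setting the fencing state on its DRCluster
oc patch drcluster ocsm4205001 --type merge \
  -p '{"spec":{"clusterFence":"Fenced"}}'

# Step 4 (sketch): trigger failover of the busybox application to mc2 via its DRPlacementControl
oc patch drpc busybox-placement-1-drpc -n busybox-sample --type merge \
  -p '{"spec":{"action":"Failover","failoverCluster":"ocpm4202001"}}'
```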
Created attachment 1959813 [details] mc1 logs
Based on the attached logs, I am guessing that this is a NooBaa FIPS issue that has been fixed in 4.13. Sravika, is this an OpenShift cluster hosted on a VMware setup that has FIPS enabled? If yes, you should try one of the latest unreleased 4.13 builds. Here is the bug: https://bugzilla.redhat.com/show_bug.cgi?id=2175612.
@rtalur: FIPS enablement is not verified with ODF on IBM Z yet, and the MDR environment that I am testing on 4.12.2-rhodf does not have FIPS enabled. Is FIPS enablement a prerequisite?

The following greps return no output on any of the clusters:

[root@m4205001 ~]# oc logs pod/noobaa-core-0 -n openshift-storage | grep "found /proc/sys/crypto/fips_enabled"
[root@m4205001 ~]#
[root@a3e25001 ~]# oc logs pod/noobaa-core-0 -n openshift-storage | grep "found /proc/sys/crypto/fips_enabled"
[root@a3e25001 ~]#
[root@m4202001 ~]# oc logs pod/noobaa-core-0 -n openshift-storage | grep "found /proc/sys/crypto/fips_enabled"
[root@m4202001 ~]#
Sravika, the NooBaa team has clarified that this is a problem at the storage layer. I remember you were gathering logs using must-gather, but they are not attached to the bug yet. Please attach them.
@rtalur: All the must-gather Google Drive links from all 3 clusters were already added to this Bugzilla when it was created.

Additional info:

Must-gather logs of mc1 and mc2 before the failover operation:
https://drive.google.com/file/d/1JjZ3e2xSCw33eszmwpr3iGLYYw9ac7NW/view?usp=share_link
https://drive.google.com/file/d/1XZVOxplFCsF4PNLBhZePl9LWsCSni7se/view?usp=share_link

Must-gather logs of mc1, mc2 and the hub after the failover operation was initiated:
https://drive.google.com/file/d/1WdKv-rTOO0cAtdotdz4G_yPEXC1RBmS7/view?usp=share_link
https://drive.google.com/file/d/1tFZ2pvuJ9D_2yYC5EstpNYqQpuP0tvys/view?usp=share_link
https://drive.google.com/file/d/1e4J2J_UzgcBvpWguEIZgMzlZDR9jsKKE/view?usp=share_link
Hi @Sravika, would it be possible to upload the must-gather to a remote server so that we can review the logs without having to download them? This would be a more efficient option, especially for individuals with limited bandwidth who may have difficulty downloading large volumes of data. Thank you for your help.

Hi @Talur, I've started reviewing the logs. Could you please provide further details about the issue at hand? Additionally, which cluster should I focus on for debugging purposes, and where would be the best place to begin? Thanks!
@mrajanna: Can you please share the remote server details where you want me to upload the logs?
Yati, Sravika, Niels and I debugged this further. We looked at the kernel logs for the RBD device that the NooBaa pod is using on cluster 2 and found these error messages:

Apr 24 18:05:06 worker-0.ocpm4202001.lnxero1.boe kernel: rbd: rbd0: encountered watch error: -107
Apr 24 18:05:06 worker-0.ocpm4202001.lnxero1.boe kernel: rbd: rbd0: failed to unwatch: -108
Apr 24 18:05:06 worker-0.ocpm4202001.lnxero1.boe kernel: rbd: rbd0: failed to reregister watch: -108
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: rbd: rbd0: write at objno 6179 159744~69632 result -108
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: rbd: rbd0: write result -108
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: blk_update_request: I/O error, dev rbd0, sector 50618680 op 0x1:(WRITE) flags 0x800 phys_seg 17 prio class 0
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: Aborting journal on device rbd0-8.
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: rbd: rbd0: write at objno 6176 0~4096 result -108
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: rbd: rbd0: write result -108
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: blk_update_request: I/O error, dev rbd0, sector 50593792 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: Buffer I/O error on dev rbd0, logical block 6324224, lost sync page write
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: JBD2: Error -5 detected when updating journal superblock for rbd0-8.
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: rbd: rbd0: write at objno 0 0~4096 result -108
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: rbd: rbd0: write result -108
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: blk_update_request: I/O error, dev rbd0, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: Buffer I/O error on dev rbd0, logical block 0, lost sync page write
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: EXT4-fs (rbd0): I/O error while writing superblock
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: EXT4-fs error (device rbd0): ext4_journal_check_start:61: Detected aborted journal
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: EXT4-fs (rbd0): Remounting filesystem read-only

The timestamps are very close to the fence event. On checking the configuration data again, we found that the CIDRs for the DRClusters have a /24 suffix. This resulted in fencing of all the worker nodes of the second cluster as well when the fence operation was performed.

Moving this bug to the doc component now to ensure a note is added that explains the CIDR notation in a little more detail, to prevent such misconfigurations.
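For anyone repeating this check, the kernel messages above can be pulled from the affected worker node. A sketch, assuming the standard oc debug node workflow (the node name is taken from the log lines above):

```
# Read the kernel ring buffer on the worker and filter for RBD/ext4/journal errors
oc debug node/worker-0.ocpm4202001.lnxero1.boe -- chroot /host journalctl -k | grep -E 'rbd|EXT4|JBD2'
```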
Doc team, in https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.12/html/configuring_openshift_data_foundation_disaster_recovery_for_openshift_workloads/metro-dr-solution#add-node-ip-addresses-to-drclusters_mdr we need to add a note after step 3. The content of the note can be:
```
When using CIDR notation, specifying an individual IP address requires a /32 suffix. Using a suffix other than /32 results in fencing of all IP addresses in the range specified by the CIDR notation.
```
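To make the note concrete, an illustrative DRCluster snippet is shown below. This is a hypothetical example only: the apiVersion, profile name, and IP addresses are placeholder assumptions rather than values from this environment; the point being illustrated is the /32 vs /24 suffix.

```
# Hypothetical example: each worker node IP carries a /32 suffix so that fencing
# applies to that single host. A broader suffix such as /24 would fence every
# address in that subnet, which is what happened in this bug.
apiVersion: ramendr.openshift.io/v1alpha1   # assumed API group/version
kind: DRCluster
metadata:
  name: ocsm4205001
spec:
  s3ProfileName: s3profile-ocsm4205001      # placeholder
  cidrs:
  - 192.168.10.21/32   # worker node 1 (example IP)
  - 192.168.10.22/32   # worker node 2 (example IP)
  - 192.168.10.23/32   # worker node 3 (example IP)
```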
Failover works fine when following the updated documentation.