Bug 2189547 - [IBM Z] [MDR]: Failover of application stuck in "Failing over" state
Summary: [IBM Z] [MDR]: Failover of application stuck in "Failing over" state
Keywords:
Status: VERIFIED
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: documentation
Version: 4.12
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Erin Donnelly
QA Contact: Neha Berry
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-04-25 14:57 UTC by Sravika
Modified: 2024-07-13 08:28 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:


Attachments
mc1 logs (8.00 KB, application/zip), 2023-04-25 14:58 UTC, Sravika

Description Sravika 2023-04-25 14:57:12 UTC
Created attachment 1959812 [details]
additional_logs_from_hub

Description of problem (please be as detailed as possible and provide log snippets):

managed cluster 1 - mc1
managed cluster 2 - mc2

Application failover from mc1 to mc2 is stuck in the "Failing Over" state because restoring the PVs to mc2 failed due to a NooBaa S3 communication failure.
Only the namespace of the application got created on mc2 during the failover operation.

Before initiating the failover operation, the NooBaa status was Ready on both mc1 and mc2. I am uploading the must-gather logs of mc1 and mc2 taken before the failover operation to this BZ.

Hub:

[root@a3e25001 ~]# oc get drpc -n busybox-sample
NAME                       AGE   PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE
busybox-placement-1-drpc   20h   ocsm4205001        ocpm4202001       Failover       FailingOver
[root@a3e25001 ~]#

# oc get drpc busybox-placement-1-drpc -n busybox-sample -oyaml
...
status:
  actionStartTime: "2023-04-24T18:08:50Z"
  conditions:
  - lastTransitionTime: "2023-04-24T18:08:50Z"
    message: Started failover to cluster "ocpm4202001"
    observedGeneration: 3
    reason: NotStarted
    status: "False"
    type: PeerReady
  - lastTransitionTime: "2023-04-24T18:08:50Z"
    message: Waiting for PV restore to complete...)
    observedGeneration: 3
    reason: FailingOver
    status: "False"
    type: Available
  lastUpdateTime: "2023-04-25T14:34:01Z"
  phase: FailingOver
  preferredDecision:
    clusterName: ocsm4205001
    clusterNamespace: ocsm4205001
  progression: WaitingForPVRestore
  resourceConditions:
    conditions:
    - lastTransitionTime: "2023-04-24T17:58:02Z"
      message: PVCs in the VolumeReplicationGroup are ready for use
      observedGeneration: 1
      reason: Ready
      status: "True"
      type: DataReady
    - lastTransitionTime: "2023-04-24T17:58:02Z"
      message: VolumeReplicationGroup is replicating
      observedGeneration: 1
      reason: Replicating
      status: "False"
      type: DataProtected
    - lastTransitionTime: "2023-04-24T17:58:01Z"
      message: Restored PV cluster data
      observedGeneration: 1
      reason: Restored
      status: "True"
      type: ClusterDataReady
    - lastTransitionTime: "2023-04-25T14:02:42Z"
      message: VRG Kube object protect error
      observedGeneration: 1
      reason: UploadError
      status: "False"
      type: ClusterDataProtected
    resourceMeta:
      generation: 1
      kind: VolumeReplicationGroup
      name: busybox-placement-1-drpc
      namespace: busybox-sample
      protectedpvcs:
      - busybox-pvc

MC2:

[root@m4202001 ~]# oc get ns busybox-sample
NAME             STATUS   AGE
busybox-sample   Active   20h

[root@m4202001 ~]# oc get all,pvc -n busybox-sample
No resources found in busybox-sample namespace.
[root@m4202001 ~]#


[root@m4202001 ~]# oc get po -n openshift-storage
NAME                                               READY   STATUS    RESTARTS   AGE
csi-addons-controller-manager-6bb96f77b6-fcb22     2/2     Running   0          22h
csi-cephfsplugin-8h6td                             2/2     Running   2          25h
csi-cephfsplugin-9nwpf                             2/2     Running   2          25h
csi-cephfsplugin-provisioner-6c7d889599-25knr      5/5     Running   0          22h
csi-cephfsplugin-provisioner-6c7d889599-cn6kg      5/5     Running   0          22h
csi-cephfsplugin-sbx2r                             2/2     Running   2          25h
csi-rbdplugin-484rx                                3/3     Running   3          25h
csi-rbdplugin-5qpsx                                3/3     Running   3          25h
csi-rbdplugin-k7qkv                                3/3     Running   3          25h
csi-rbdplugin-provisioner-d46b79bbb-868p8          6/6     Running   0          22h
csi-rbdplugin-provisioner-d46b79bbb-frgq8          6/6     Running   0          22h
noobaa-core-0                                      1/1     Running   0          22h
noobaa-db-pg-0                                     1/1     Running   0          22h
noobaa-endpoint-5bdc586b7d-v97bf                   1/1     Running   0          22h
noobaa-operator-66fb78dd94-m7lbh                   1/1     Running   0          22h
ocs-metrics-exporter-6b96597864-sbrtd              1/1     Running   0          22h
ocs-operator-5598965945-pkmgw                      1/1     Running   0          22h
odf-console-55f8c5f6dd-7fhxc                       1/1     Running   0          22h
odf-operator-controller-manager-5cbb545ddc-h72wf   2/2     Running   0          22h
rook-ceph-operator-64bb84d64f-z5fs9                1/1     Running   0          22h
token-exchange-agent-7fd47f9bd8-m6465              1/1     Running   0          21h



[root@m4202001 ~]# oc get noobaa -n openshift-storage noobaa -o yaml

....
  phase: Configuring
    status: "False"                                                                                                          [34/1412]
    type: Available
  - lastHeartbeatTime: "2023-04-24T12:57:13Z"
    lastTransitionTime: "2023-04-24T18:05:12Z"
    message: 'could not open file "base/16385/2601": Read-only file system'
    reason: TemporaryError
    status: "True"
    type: Progressing
  - lastHeartbeatTime: "2023-04-24T12:57:13Z"
    lastTransitionTime: "2023-04-24T12:57:13Z"
    message: 'could not open file "base/16385/2601": Read-only file system'
    reason: TemporaryError
    status: "False"
    type: Degraded
  - lastHeartbeatTime: "2023-04-24T12:57:13Z"
    lastTransitionTime: "2023-04-24T18:05:12Z"
    message: 'could not open file "base/16385/2601": Read-only file system'
    reason: TemporaryError
    status: "False"
    type: Upgradeable
  - lastHeartbeatTime: "2023-04-24T12:57:13Z"
    lastTransitionTime: "2023-04-24T12:57:13Z"
    status: k8s
    type: KMS-Type
  - lastHeartbeatTime: "2023-04-24T12:57:13Z"
    lastTransitionTime: "2023-04-24T12:58:15Z"
    status: Sync
    type: KMS-Status
  endpoints:
    readyCount: 1
    virtualHosts:
    - s3.openshift-storage.svc
  observedGeneration: 2
  phase: Configuring
  readme: "\n\n\tNooBaa operator is still working to reconcile this system.\n\tCheck
    out the system status.phase, status.conditions, and events with:\n\n\t\tkubectl
    -n openshift-storage describe noobaa\n\t\tkubectl -n openshift-storage get noobaa
    -o yaml\n\t\tkubectl -n openshift-storage get events --sort-by=metadata.creationTimestamp\n\n\tYou
    can wait for a specific condition with:\n\n\t\tkubectl -n openshift-storage wait
    noobaa/noobaa --for condition=available --timeout -1s\n\n\tNooBaa Core Version:
    \    master-20220913\n\tNooBaa Operator Version: 5.12.0\n"




RHCS:

[root@rhcs01 ~]# ceph -s
  cluster:
    id:     778d5284-ddf7-11ed-a790-525400c41d12
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum rhcs01,rhcs02,rhcs04,rhcs05,rhcs07 (age 5d)
    mgr: rhcs01.ipckaw(active, since 6d), standbys: rhcs04.kfpmco
    mds: 1/1 daemons up, 1 standby
    osd: 6 osds: 6 up (since 6d), 6 in (since 6d)
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   10 pools, 289 pgs
    objects: 1.18k objects, 1.8 GiB
    usage:   9.5 GiB used, 2.9 TiB / 2.9 TiB avail
    pgs:     289 active+clean

Version of all relevant components (if applicable):
OCP: 4.12.11
odf-operator.v4.12.2-rhodf
RHCS: 5.3.z2

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?

Yes

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Configure Metro DR environment with MC1, MC2 and Hub cluster
2. Deploy sample application busybox
3. Apply fencing to mc1 on the hub cluster and verify that the fencing is successful (see the sketch after these steps)
4. Initiate failover of the application from mc1 to mc2 
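
For reference, a minimal sketch of steps 3 and 4 as run from the hub cluster, assuming the DRCluster name ocsm4205001 for mc1 and the DRPlacementControl shown above; the clusterFence, action and failoverCluster field names are assumptions based on the Ramen DRCluster/DRPlacementControl CRDs, so treat this as illustrative rather than the documented procedure:

```
# Step 3: fence mc1 by setting clusterFence on its DRCluster, then inspect its status
# (field names assumed; adjust to your environment)
oc patch drcluster ocsm4205001 --type merge -p '{"spec":{"clusterFence":"Fenced"}}'
oc get drcluster ocsm4205001 -o yaml

# Step 4: trigger the failover by setting the action and failover cluster on the DRPC
oc patch drpc busybox-placement-1-drpc -n busybox-sample --type merge \
  -p '{"spec":{"action":"Failover","failoverCluster":"ocpm4202001"}}'
oc get drpc -n busybox-sample -w
```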


Actual results:
Failover stuck in "Failing Over" state

Expected results:
Application failover should be successful

Additional info:

Must gather logs of mc1 and mc2 before failover operation:

https://drive.google.com/file/d/1JjZ3e2xSCw33eszmwpr3iGLYYw9ac7NW/view?usp=share_link

https://drive.google.com/file/d/1XZVOxplFCsF4PNLBhZePl9LWsCSni7se/view?usp=share_link

Must gather logs of mc1, mc2 and hub after failover operation initiated:

https://drive.google.com/file/d/1WdKv-rTOO0cAtdotdz4G_yPEXC1RBmS7/view?usp=share_link

https://drive.google.com/file/d/1tFZ2pvuJ9D_2yYC5EstpNYqQpuP0tvys/view?usp=share_link

https://drive.google.com/file/d/1e4J2J_UzgcBvpWguEIZgMzlZDR9jsKKE/view?usp=share_link

Comment 2 Sravika 2023-04-25 14:58:24 UTC
Created attachment 1959813 [details]
mc1 logs

Comment 4 Raghavendra Talur 2023-04-25 20:03:14 UTC
Based on the logs attached, I am guessing that it is a Noobaa FIPS issue that has been fixed in 4.13.

Sravika, is this an OpenShift cluster hosted on a VMware setup that has FIPS enabled? If yes, you should try one of the latest unreleased 4.13 builds. Here is the bug: https://bugzilla.redhat.com/show_bug.cgi?id=2175612.

Comment 5 Sravika 2023-04-26 08:27:55 UTC
@rtalur: FIPS enablement is not verified with ODF on IBM Z yet, and the MDR environment which I am testing on 4.12.2-rhodf does not have FIPS enabled. Is FIPS enablement a prerequisite?

[root@m4205001 ~]# oc logs pod/noobaa-core-0 -n openshift-storage | grep "found /proc/sys/crypto/fips_enabled"
[root@m4205001 ~]#

[root@a3e25001 ~]# oc logs pod/noobaa-core-0 -n openshift-storage | grep "found /proc/sys/crypto/fips_enabled"
[root@a3e25001 ~]#

[root@m4202001 ~]# oc logs pod/noobaa-core-0 -n openshift-storage | grep "found /proc/sys/crypto/fips_enabled"
[root@m4202001 ~]#

Comment 6 Raghavendra Talur 2023-04-27 17:13:21 UTC
Sravika,

The NooBaa team has clarified that it is a problem at the storage layer. I remember you were gathering the logs using must-gather, but they are not attached to the bug yet. Please attach them.

Comment 7 Sravika 2023-04-28 07:26:22 UTC
@rtalur: All the must-gather Google Drive links from all 3 clusters were already added to the Bugzilla when it was created.

Additional info:

Must gather logs of mc1 and mc2 before failover operation:

https://drive.google.com/file/d/1JjZ3e2xSCw33eszmwpr3iGLYYw9ac7NW/view?usp=share_link

https://drive.google.com/file/d/1XZVOxplFCsF4PNLBhZePl9LWsCSni7se/view?usp=share_link

Must gather logs of mc1, mc2 and hub after failover operation initiated:

https://drive.google.com/file/d/1WdKv-rTOO0cAtdotdz4G_yPEXC1RBmS7/view?usp=share_link

https://drive.google.com/file/d/1tFZ2pvuJ9D_2yYC5EstpNYqQpuP0tvys/view?usp=share_link

https://drive.google.com/file/d/1e4J2J_UzgcBvpWguEIZgMzlZDR9jsKKE/view?usp=share_link

Comment 9 Madhu Rajanna 2023-05-02 07:23:34 UTC
Hi @Sravika, would it be possible to upload the must-gather to a remote server so that we can review the logs without having to download them? This would be a more efficient option, especially for individuals with limited bandwidth who may have difficulty downloading large volumes of data. Thank you for your help.
Hi @Talur, I've started reviewing the logs. Could you please provide further details about the issue at hand? Additionally, which cluster should I focus on for debugging purposes and where would be the best place to begin? Thanks!

Comment 10 Sravika 2023-05-02 10:07:46 UTC
@mrajanna: Can you please share the remote server details where you want me to upload the logs?

Comment 19 Raghavendra Talur 2023-05-03 11:50:39 UTC
Yati, Sravika, Niels and I debugged this further.

We looked at the kernel logs for the rbd device that the noobaa pod is using on cluster 2 and found the following error messages:
Apr 24 18:05:06 worker-0.ocpm4202001.lnxero1.boe kernel: rbd: rbd0: encountered watch error: -107
Apr 24 18:05:06 worker-0.ocpm4202001.lnxero1.boe kernel: rbd: rbd0: failed to unwatch: -108
Apr 24 18:05:06 worker-0.ocpm4202001.lnxero1.boe kernel: rbd: rbd0: failed to reregister watch: -108
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: rbd: rbd0: write at objno 6179 159744~69632 result -108
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: rbd: rbd0: write result -108
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: blk_update_request: I/O error, dev rbd0, sector 50618680 op 0x1:(WRITE) flags 0x800 phys_seg 17 prio class 0
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: Aborting journal on device rbd0-8.
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: rbd: rbd0: write at objno 6176 0~4096 result -108
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: rbd: rbd0: write result -108
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: blk_update_request: I/O error, dev rbd0, sector 50593792 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: Buffer I/O error on dev rbd0, logical block 6324224, lost sync page write
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: JBD2: Error -5 detected when updating journal superblock for rbd0-8.
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: rbd: rbd0: write at objno 0 0~4096 result -108
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: rbd: rbd0: write result -108
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: blk_update_request: I/O error, dev rbd0, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: Buffer I/O error on dev rbd0, logical block 0, lost sync page write
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: EXT4-fs (rbd0): I/O error while writing superblock
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: EXT4-fs error (device rbd0): ext4_journal_check_start:61: Detected aborted journal
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: EXT4-fs (rbd0): Remounting filesystem read-only


The timestamps were very close to the fence event, and on checking the configuration data again we found that the CIDRs for the DRClusters have a /24 suffix.
This resulted in fencing of all the worker nodes of the second cluster as well when the fence operation was performed.
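
(For anyone retracing this, a minimal sketch of how such kernel messages can be pulled from a worker, assuming cluster-admin access on mc2; the node name is taken from the log lines above and the grep filter is just a convenience.)

```
# Open a debug pod on the node and stream its kernel log;
# the grep filters the streamed output locally for the rbd / ext4 / journal errors
oc debug node/worker-0.ocpm4202001.lnxero1.boe -- chroot /host journalctl -k | grep -E 'rbd|EXT4-fs|JBD2'
```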


Moving this bug to the doc component now to ensure a note is added that explains the CIDR notation in a bit more detail, to prevent such misconfigurations.

Comment 20 Raghavendra Talur 2023-05-03 11:54:19 UTC
Doc team,

In https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.12/html/configuring_openshift_data_foundation_disaster_recovery_for_openshift_workloads/metro-dr-solution#add-node-ip-addresses-to-drclusters_mdr, after step 3, we need to add a note. The content of the note can be:

```
When using CIDR notation, specifying an individual IP address requires the /32 suffix. Using a suffix other than /32 results in fencing of all IP addresses in the range specified by the CIDR notation.
```
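
To make the note concrete, a sketch under stated assumptions: the cidrs field is assumed to be what the linked DRCluster procedure edits (field name taken from the Ramen DRCluster CRD), and the IP addresses below are placeholders. Each worker node IP should get its own /32 entry; a broader suffix such as /24 covers the whole subnet, which is exactly what fenced the mc2 workers here.

```
# Correct: one /32 entry per node IP (placeholder addresses; cidrs field name assumed)
oc patch drcluster ocsm4205001 --type merge \
  -p '{"spec":{"cidrs":["10.1.2.11/32","10.1.2.12/32","10.1.2.13/32"]}}'

# Wrong: a single 10.1.2.0/24 entry would cover every IP in the subnet and
# fence unintended nodes when the fence operation runs
```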

Comment 24 Abdul Kandathil (IBM) 2024-01-29 09:04:34 UTC
Failover works fine when following the updated documentation.

