Bug 2189547
| Summary: | [IBM Z] [MDR]: Failover of application stuck in "Failing over" state |
|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation |
| Component: | documentation |
| Version: | 4.12 |
| Status: | VERIFIED |
| Severity: | unspecified |
| Priority: | unspecified |
| Reporter: | Sravika <sbalusu> |
| Assignee: | Erin Donnelly <edonnell> |
| QA Contact: | Neha Berry <nberry> |
| CC: | akandath, amagrawa, hnallurv, kramdoss, kseeger, muagarwa, odf-bz-bot, rtalur, ypadia |
| Keywords: | TestBlocker |
| Target Milestone: | --- |
| Target Release: | --- |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Doc Type: | If docs needed, set a value |
| Type: | Bug |
| Attachments: | mc1 logs |
Description

Sravika 2023-04-25 14:57:12 UTC

Created attachment 1959813 [details]
mc1 logs
Based on the logs attached, I am guessing that it is a Noobaa FIPS issue that has been fixed in 4.13. Sravika, is this an OpenShift cluster hosted on a VMware setup that has FIPS enabled? If yes, you should try one of the latest unreleased 4.13 builds. Here is the bug: https://bugzilla.redhat.com/show_bug.cgi?id=2175612

@rtalur : FIPS enablement is not verified with ODF on IBM Z yet, and the MDR environment which I am testing on 4.12.2rhodf does not have FIPS enabled. Is FIPS enablement a prerequisite?

```
[root@m4205001 ~]# oc logs pod/noobaa-core-0 -n openshift-storage | grep "found /proc/sys/crypto/fips_enabled"
[root@m4205001 ~]#
[root@a3e25001 ~]# oc logs pod/noobaa-core-0 -n openshift-storage | grep "found /proc/sys/crypto/fips_enabled"
[root@a3e25001 ~]#
[root@m4202001 ~]# oc logs pod/noobaa-core-0 -n openshift-storage | grep "found /proc/sys/crypto/fips_enabled"
[root@m4202001 ~]#
```

Sravika, the Noobaa team has clarified that it is a problem at the storage layer. I remember you were gathering the logs using must-gather but they are not attached to the bug yet. Please attach.

@rtalur : All the must-gather Google Drive links from all 3 clusters were already added to the Bugzilla when it was created.

Additional info:

Must-gather logs of mc1 and mc2 before the failover operation:
https://drive.google.com/file/d/1JjZ3e2xSCw33eszmwpr3iGLYYw9ac7NW/view?usp=share_link
https://drive.google.com/file/d/1XZVOxplFCsF4PNLBhZePl9LWsCSni7se/view?usp=share_link

Must-gather logs of mc1, mc2 and hub after the failover operation was initiated:
https://drive.google.com/file/d/1WdKv-rTOO0cAtdotdz4G_yPEXC1RBmS7/view?usp=share_link
https://drive.google.com/file/d/1tFZ2pvuJ9D_2yYC5EstpNYqQpuP0tvys/view?usp=share_link
https://drive.google.com/file/d/1e4J2J_UzgcBvpWguEIZgMzlZDR9jsKKE/view?usp=share_link

Hi @Sravika, would it be possible to upload the must-gather to a remote server so that we can review the logs without having to download them? This would be a more efficient option, especially for individuals with limited bandwidth who may have difficulty downloading large volumes of data. Thank you for your help.

Hi @Talur, I've started reviewing the logs. Could you please provide further details about the issue at hand? Additionally, which cluster should I focus on for debugging purposes, and where would be the best place to begin? Thanks!

@mrajanna : Can you please share the remote server details where you want me to upload the logs?

Yati, Sravika, Niels and I debugged this further. We looked at the kernel logs for the rbd device that the noobaa pod is using on cluster 2 and found these error messages:

```
Apr 24 18:05:06 worker-0.ocpm4202001.lnxero1.boe kernel: rbd: rbd0: encountered watch error: -107
Apr 24 18:05:06 worker-0.ocpm4202001.lnxero1.boe kernel: rbd: rbd0: failed to unwatch: -108
Apr 24 18:05:06 worker-0.ocpm4202001.lnxero1.boe kernel: rbd: rbd0: failed to reregister watch: -108
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: rbd: rbd0: write at objno 6179 159744~69632 result -108
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: rbd: rbd0: write result -108
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: blk_update_request: I/O error, dev rbd0, sector 50618680 op 0x1:(WRITE) flags 0x800 phys_seg 17 prio class 0
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: Aborting journal on device rbd0-8.
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: rbd: rbd0: write at objno 6176 0~4096 result -108
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: rbd: rbd0: write result -108
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: blk_update_request: I/O error, dev rbd0, sector 50593792 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: Buffer I/O error on dev rbd0, logical block 6324224, lost sync page write
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: JBD2: Error -5 detected when updating journal superblock for rbd0-8.
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: rbd: rbd0: write at objno 0 0~4096 result -108
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: rbd: rbd0: write result -108
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: blk_update_request: I/O error, dev rbd0, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: Buffer I/O error on dev rbd0, logical block 0, lost sync page write
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: EXT4-fs (rbd0): I/O error while writing superblock
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: EXT4-fs error (device rbd0): ext4_journal_check_start:61: Detected aborted journal
Apr 24 18:05:08 worker-0.ocpm4202001.lnxero1.boe kernel: EXT4-fs (rbd0): Remounting filesystem read-only
```

The timestamps were very close to the fence event, and on checking the configuration data again we found that the cidrs for the drclusters have a /24 suffix. This resulted in fencing of all the worker nodes of the second cluster as well when the fence operation was performed. Moving this bug to the doc component now to ensure a note is added to explain the CIDR notation a little bit more to prevent such misconfigurations.

Doc team, in https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.12/html/configuring_openshift_data_foundation_disaster_recovery_for_openshift_workloads/metro-dr-solution#add-node-ip-addresses-to-drclusters_mdr, after step 3, we need to add a note. The content of the note can be:

```
When using CIDR notation, specifying an individual IP requires the /32 suffix. Using a suffix other than /32 results in fencing of all IPs in the range specified by the CIDR notation.
```

Failover works fine following the updated documentation.
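To illustrate the note above, here is a minimal sketch using Python's standard ipaddress module, showing why a /24 entry in the drcluster cidrs list covers a whole range of node IPs while /32 pins exactly one address. The IP addresses below are made up for illustration only; they are not the node IPs of the affected clusters.

```python
import ipaddress

# Hypothetical addresses, used only to demonstrate the CIDR behaviour.
node_to_fence = ipaddress.ip_address("10.70.56.23")   # node that should be fenced
unrelated_node = ipaddress.ip_address("10.70.56.118") # node that should stay untouched

# A /32 entry matches exactly one address.
single_ip = ipaddress.ip_network("10.70.56.23/32")

# A /24 entry matches the whole 10.70.56.0/24 subnet (256 addresses).
whole_subnet = ipaddress.ip_network("10.70.56.23/24", strict=False)

print(single_ip.num_addresses)          # 1
print(whole_subnet.num_addresses)       # 256
print(node_to_fence in single_ip)       # True  -> only the intended node is fenced
print(unrelated_node in single_ip)      # False -> other nodes are unaffected
print(unrelated_node in whole_subnet)   # True  -> with /24, unrelated nodes get fenced too
```

This is the failure mode seen here: with a /24 suffix the fence operation applied to every address in the subnet, so the worker nodes of the second cluster lost access to their rbd devices, producing the watch and I/O errors in the kernel log above.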