Bug 2131581
| Summary: | After the node replacement procedure [provider] and editing the rosa addon, the mon IP endpoint reverts to the old worker node IP | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Itzhak <ikave> |
| Component: | odf-managed-service | Assignee: | Parth Arora <paarora> |
| Status: | ON_QA --- | QA Contact: | Itzhak <ikave> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 4.10 | CC: | aeyal, dbindra, fbalak, lgangava, muagarwa, nberry, odf-bz-bot, paarora |
| Target Milestone: | --- | Keywords: | Automation, Tracking |
| Target Release: | --- | Flags: | paarora: needinfo? (ikave) |
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 2147580 | | |
| Bug Blocks: | | | |
Description
Itzhak
2022-10-02 16:18:43 UTC
Additional info:

Provider versions:

OC version:
Client Version: 4.10.24
Server Version: 4.10.33
Kubernetes Version: v1.23.5+012e945
OCS version:
ocs-operator.v4.10.5   OpenShift Container Storage   4.10.5   ocs-operator.v4.10.4   Succeeded
Cluster version
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.33   True        False         5h6m    Cluster version is 4.10.33
Rook version:
rook: v4.10.5-0.985405daeba3b29a178cb19aa864324e65548a63
go: go1.16.12
Ceph version:
ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)

Consumer versions:

OC version:
Client Version: 4.10.24
Server Version: 4.10.33
Kubernetes Version: v1.23.5+012e945
OCS version:
ocs-operator.v4.10.5   OpenShift Container Storage   4.10.5   ocs-operator.v4.10.4   Succeeded
Cluster version
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.33   True        False         3h51m   Cluster version is 4.10.33
Rook version:
rook: v4.10.5-0.985405daeba3b29a178cb19aa864324e65548a63
go: go1.16.12
Ceph version:
ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)

Link to the provider Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/17142/
Link to the consumer Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/17144/

Please let me know if someone can look at it. Thanks.

@ikave I see the cluster is down. Do we have the rook operator logs for the same?

No, I didn't get the rook operator logs. Should I test it again and get the rook operator logs of the consumer and provider?

I checked the steps in comment https://bugzilla.redhat.com/show_bug.cgi?id=2131581#c0 again. I didn't edit the rosa addon this time.

1. Check the configmap rook-ceph-mon-endpoints data on the consumer:
$ oc get cm rook-ceph-mon-endpoints -o yaml | grep data
data:
  data: b=10.0.156.26:6789
metadata:
2. Find the corresponding worker node with the IP above on the provider - here it is the worker node "ip-10-0-156-26.ec2.internal" (a scripted version of this cross-check is sketched after this comment).
3. Delete the worker node with the provided IP (here it is 'ip-10-0-135-169.ec2.internal') as described in the doc https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.10/html/replacing_nodes/openshift_data_foundation_deployed_using_dynamic_devices#replacing-an-operational-aws-node-ipi_rhodf
4. Wait for a new worker node to come up and reach the "Ready" state.
5. Wait about 25 minutes (until the mon pods are running on the provider), and check the configmap rook-ceph-mon-endpoints data on the consumer again:
$ oc get cm rook-ceph-mon-endpoints -o yaml | grep data
data:
  data: b=10.0.156.26:6789
metadata:

We can see from the output above that we still have the old worker node IP. I also tried to edit the configmap manually, and it was reverted to the old IP.

I remember there was a bug related to it, https://bugzilla.redhat.com/show_bug.cgi?id=2086485 - can you make sure your version includes this BZ fix?
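As an aside on step 2 above, the cross-check between the consumer's mon endpoint and the provider's worker nodes can be scripted roughly as follows. This is a minimal sketch, not taken from the report: CONSUMER_KUBECONFIG and PROVIDER_KUBECONFIG are illustrative placeholders, the openshift-storage namespace is assumed, and the configmap is assumed to hold a single mon entry (as in the output above).

# Sketch: extract the mon endpoint IP reported on the consumer and check whether it
# matches any provider worker node InternalIP.
MON_IP=$(oc --kubeconfig "$CONSUMER_KUBECONFIG" -n openshift-storage \
  get cm rook-ceph-mon-endpoints -o jsonpath='{.data.data}' | cut -d= -f2 | cut -d: -f1)
echo "mon endpoint IP on the consumer: $MON_IP"

# List the provider worker nodes with their InternalIP and look for the mon IP among them.
oc --kubeconfig "$PROVIDER_KUBECONFIG" get nodes -l node-role.kubernetes.io/worker \
  -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}' \
  | grep -Fw "$MON_IP" || echo "mon IP $MON_IP not found on any provider worker node (stale endpoint)"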
Additional info:

Consumer versions:

OC version:
Client Version: 4.10.24
Server Version: 4.10.35
Kubernetes Version: v1.23.5+8471591
OCS version:
ocs-operator.v4.10.5   OpenShift Container Storage   4.10.5   ocs-operator.v4.10.4   Succeeded
Cluster version
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.35   True        False         3h18m   Cluster version is 4.10.35
Rook version:
rook: v4.10.5-0.985405daeba3b29a178cb19aa864324e65548a63
go: go1.16.12

Provider versions:

OC version:
Client Version: 4.10.24
Server Version: 4.10.35
Kubernetes Version: v1.23.5+8471591
OCS version:
ocs-operator.v4.10.5   OpenShift Container Storage   4.10.5   ocs-operator.v4.10.4   Succeeded
Cluster version
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.35   True        False         5h34m   Cluster version is 4.10.35
Rook version:
rook: v4.10.5-0.985405daeba3b29a178cb19aa864324e65548a63
go: go1.16.12
Ceph version:
ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)

Link to the consumer Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/17455/
Link to the provider Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/17453/

I have looked at BZ https://bugzilla.redhat.com/show_bug.cgi?id=2086485, and the target version is ODF 4.11. Should we also check this BZ with ODF 4.11?

Yes, I think so.

After terminating a provider worker node, I tested this scenario again and found that deleting the ocs-operator pod resolved the issue. After deleting it, the "rook-ceph-mon-endpoints" configmap was updated to one of the provider worker nodes' IPs, and the ceph health command worked again as expected (a sketch of this workaround follows the outputs below).
This is the "rook-ceph-mon-endpoints" before deleting the ocs-operator pod"
$ oc get configmaps rook-ceph-mon-endpoints -o yaml
apiVersion: v1
data:
  data: c=10.0.170.158:6789
  mapping: '{}'
  maxMonId: "0"
And after deleting the pod:
$ oc get configmaps rook-ceph-mon-endpoints -o yaml
apiVersion: v1
data:
  data: d=10.0.170.232:6789
  mapping: '{}'
  maxMonId: "0"
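For reference, the workaround described above (deleting the ocs-operator pod so that rook-ceph-mon-endpoints is refreshed) amounts to something like the following sketch. The openshift-storage namespace and the name=ocs-operator label selector are assumptions, not taken from this report.

# Minimal sketch of the workaround.
oc -n openshift-storage get pods | grep ocs-operator      # find the current operator pod
oc -n openshift-storage delete pod -l name=ocs-operator   # label selector is an assumption
# Give the operator a couple of minutes to reconcile, then verify the endpoint changed.
oc -n openshift-storage get configmaps rook-ceph-mon-endpoints -o jsonpath='{.data.data}{"\n"}'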
Here is the link to the test: https://os4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-odf-multicluster/1055/console.
As you can see in the test, the ceph health command was stuck until this line:
"2022-11-23 17:18:31 15:18:31 - MainThread - ocs_ci.utility.utils - INFO - C[ikave-48-c2] - Executing command: oc -n openshift-storage exec rook-ceph-tools-7c7ccfb8d6-mv24p -- ceph health"
which is when I deleted the ocs-operator pod; after that, it recovered.
Provider Versions:
OC version:
Client Version: 4.10.24
Server Version: 4.10.40
Kubernetes Version: v1.23.12+7566c4d
OCS version:
ocs-operator.v4.10.5 OpenShift Container Storage 4.10.5 ocs-operator.v4.10.4 Succeeded
Cluster version
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.10.40 True False 4h54m Cluster version is 4.10.40
Rook version:
rook: v4.10.5-0.985405daeba3b29a178cb19aa864324e65548a63
go: go1.16.12
Ceph version:
ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)
Consumer versions:
OC version:
Client Version: 4.10.24
Server Version: 4.10.40
Kubernetes Version: v1.23.12+7566c4d
OCS version:
ocs-operator.v4.10.5 OpenShift Container Storage 4.10.5 ocs-operator.v4.10.4 Succeeded
Cluster version
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.10.40 True False 6h37m Cluster version is 4.10.40
Rook version:
rook: v4.10.5-0.985405daeba3b29a178cb19aa864324e65548a63
go: go1.16.12
Ceph version:
ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)
Here is the correct link to the test: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-odf-multicluster/1055/console

To be tested in v2.0.11 (ODF v4.10.9).

According to the test run https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-odf-multicluster/1338/console and the output:
2023-02-13 18:31:40 16:31:40 - MainThread - ocs_ci.ocs.resources.storage_cluster - WARNING - C[ikave-60-c1] - The endpoint ip 10.0.12.82 of mon d is not found in the provider worker node ips
2023-02-13 18:31:40 16:31:40 - MainThread - ocs_ci.utility.utils - INFO - C[ikave-60-c1] - Going to sleep for 10 seconds before next iteration
2023-02-13 18:31:50 16:31:50 - MainThread - ocs_ci.utility.utils - ERROR - C[ikave-60-c1] - function inner failed to return expected value True after multiple retries during 180 second timeout
2023-02-13 18:31:50 16:31:50 - MainThread - ocs_ci.ocs.node - INFO - C[ikave-60-c1] - Try to restart the ocs-operator pod
this bug still exists with the ODF 4.10.9 version (the polling the test performs is approximated in a sketch after this comment).

Cluster versions:

OC version:
Client Version: 4.10.24
Server Version: 4.10.50
Kubernetes Version: v1.23.12+8a6bfe4
OCS version:
ocs-operator.v4.10.9   OpenShift Container Storage   4.10.9   ocs-operator.v4.10.5   Succeeded
Cluster version
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.50   True        False         4h25m   Cluster version is 4.10.50
Rook version:
rook: v4.10.9-0.b7b3a0044169fd9364683e2e4e6968361f8f3c08
go: go1.16.12
Ceph version:
ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)
CSV versions:
NAME                                      DISPLAY                       VERSION          REPLACES                                  PHASE
mcg-operator.v4.10.9                      NooBaa Operator               4.10.9           mcg-operator.v4.10.8                      Succeeded
observability-operator.v0.0.20            Observability Operator        0.0.20           observability-operator.v0.0.19            Succeeded
ocs-operator.v4.10.9                      OpenShift Container Storage   4.10.9           ocs-operator.v4.10.5                      Succeeded
ocs-osd-deployer.v2.0.11                  OCS OSD Deployer              2.0.11-11        ocs-osd-deployer.v2.0.10                  Succeeded
odf-csi-addons-operator.v4.10.9           CSI Addons                    4.10.9           odf-csi-addons-operator.v4.10.5           Succeeded
odf-operator.v4.10.9                      OpenShift Data Foundation     4.10.9           odf-operator.v4.10.5                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0           ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.461-dbddf1f   Route Monitor Operator        0.1.461-dbddf1f  route-monitor-operator.v0.1.456-02ea942   Succeeded

The above are the provider versions.
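For context, the check the automation performs here can be approximated by a simple polling loop like the one below. This is a rough sketch, not the actual ocs_ci code: mon_ip_is_on_provider_worker is a hypothetical helper (for example, the cross-check sketched earlier in this report), and the pod label selector is also an assumption.

# Poll for up to 180 seconds with 10-second sleeps, then fall back to restarting
# the ocs-operator pod, mirroring what the log above describes.
timeout=180
interval=10
elapsed=0
until mon_ip_is_on_provider_worker; do
  if [ "$elapsed" -ge "$timeout" ]; then
    echo "mon endpoint still stale after ${timeout}s - restarting the ocs-operator pod"
    oc -n openshift-storage delete pod -l name=ocs-operator   # label selector is an assumption
    break
  fi
  echo "Going to sleep for ${interval} seconds before next iteration"
  sleep "$interval"
  elapsed=$((elapsed + interval))
done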
The consumer versions:

OC version:
Client Version: 4.10.24
Server Version: 4.10.50
Kubernetes Version: v1.23.12+8a6bfe4
OCS version:
ocs-operator.v4.10.9   OpenShift Container Storage   4.10.9   ocs-operator.v4.10.8   Succeeded
Cluster version
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.50   True        False         4h22m   Cluster version is 4.10.50
Rook version:
rook: v4.10.9-0.b7b3a0044169fd9364683e2e4e6968361f8f3c08
go: go1.16.12
Ceph version:
ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)
CSV versions:
NAME                                      DISPLAY                       VERSION          REPLACES                                  PHASE
mcg-operator.v4.10.9                      NooBaa Operator               4.10.9           mcg-operator.v4.10.8                      Succeeded
observability-operator.v0.0.20            Observability Operator        0.0.20           observability-operator.v0.0.19            Succeeded
ocs-operator.v4.10.9                      OpenShift Container Storage   4.10.9           ocs-operator.v4.10.8                      Succeeded
ocs-osd-deployer.v2.0.11                  OCS OSD Deployer              2.0.11-11        ocs-osd-deployer.v2.0.10                  Succeeded
odf-csi-addons-operator.v4.10.9           CSI Addons                    4.10.9           odf-csi-addons-operator.v4.10.8           Succeeded
odf-operator.v4.10.9                      OpenShift Data Foundation     4.10.9           odf-operator.v4.10.8                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0           ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.461-dbddf1f   Route Monitor Operator        0.1.461-dbddf1f  route-monitor-operator.v0.1.456-02ea942   Succeeded

The rook-ceph-mon-endpoints configmap was updated, but only after restarting the ocs-operator pod. Parth, please take a look.

Is this PR https://github.com/red-hat-storage/ocs-operator/pull/1891 in the build you are testing?

Yes, I think the changes in the PR above are included in the automation test, because the test I ran above was a month after the PR was merged.

Can you share the ocs and rook operator pod logs before and after the restart? It seems like the OCS operator didn't get a reconciliation trigger, which is why rook also didn't reconcile.

Okay, I will rerun the test and save the ocs and rook operator logs (a log-collection sketch follows this comment).

I attached in the previous comments the ocs operator logs before and after the restart, as well as the rook-ceph-operator logs.

Consumer versions:

OC version:
Client Version: 4.10.24
Server Version: 4.11.27
Kubernetes Version: v1.24.6+263df15
OCS version:
ocs-operator.v4.10.9   OpenShift Container Storage   4.10.9   ocs-operator.v4.10.8   Succeeded
Cluster version
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.27   True        False         8h      Cluster version is 4.11.27
Rook version:
rook: v4.10.9-0.b7b3a0044169fd9364683e2e4e6968361f8f3c08
go: go1.16.12
Ceph version:
ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)
CSV versions:
NAME                                      DISPLAY                       VERSION          REPLACES                                  PHASE
mcg-operator.v4.10.10                     NooBaa Operator               4.10.10          mcg-operator.v4.10.9                      Succeeded
observability-operator.v0.0.20            Observability Operator        0.0.20           observability-operator.v0.0.19            Succeeded
ocs-operator.v4.10.9                      OpenShift Container Storage   4.10.9           ocs-operator.v4.10.8                      Succeeded
ocs-osd-deployer.v2.0.11                  OCS OSD Deployer              2.0.11-11        ocs-osd-deployer.v2.0.10                  Succeeded
odf-csi-addons-operator.v4.10.9           CSI Addons                    4.10.9           odf-csi-addons-operator.v4.10.8           Succeeded
odf-operator.v4.10.9                      OpenShift Data Foundation     4.10.9           odf-operator.v4.10.8                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0           ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.461-dbddf1f   Route Monitor Operator        0.1.461-dbddf1f  route-monitor-operator.v0.1.456-02ea942   Succeeded

Link to the relevant automation test: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-odf-multicluster/1482/console
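Collecting the operator logs requested in this exchange could look roughly like the sketch below. The deployment names are the usual openshift-storage ones and the output file names are only illustrative.

# Capture both operator logs, restart the ocs-operator deployment, then capture again.
oc -n openshift-storage logs deployment/ocs-operator > ocs-operator-before-restart.log
oc -n openshift-storage logs deployment/rook-ceph-operator > rook-ceph-operator-before-restart.log
oc -n openshift-storage rollout restart deployment/ocs-operator
oc -n openshift-storage rollout status deployment/ocs-operator
oc -n openshift-storage logs deployment/ocs-operator > ocs-operator-after-restart.log
oc -n openshift-storage logs deployment/rook-ceph-operator > rook-ceph-operator-after-restart.log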
Parth, any update on this?

After the new ocs worker node is up and all the ocs pods are running, the test waits 3 minutes for rook-ceph-mon-endpoints to be updated. When I checked it manually a month ago, waiting longer didn't help. Maybe we can reproduce it if we only remove a mon IP from the rook-ceph-mon-endpoints configmap (sketched below) - I haven't tried that yet. I will add additional logs in the following comment, including the storage cluster state in the test.

Okay, I can try adding more time. How much time do you think we should wait for the 'rook-ceph-mon-endpoints' to be updated?

According to the comment at https://bugzilla.redhat.com/show_bug.cgi?id=2086485#c15, the fix will be ready in the ODF 4.11 version.
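The reproduction idea above (removing the mon IP from rook-ceph-mon-endpoints instead of replacing a node) could be tried along these lines. This is a sketch only: the patch payload is illustrative, and the operator may simply write the stale value back, which is exactly what this would help confirm.

# Blank out the mon entry and watch what the operators write back.
oc -n openshift-storage patch configmap rook-ceph-mon-endpoints --type merge -p '{"data":{"data":""}}'
# Re-check after a couple of minutes to see which value was restored.
sleep 120
oc -n openshift-storage get configmap rook-ceph-mon-endpoints -o jsonpath='{.data.data}{"\n"}'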