Description of problem:

After the node replacement procedure [provider], the mon IP endpoints are updated as expected. But after editing the rosa addon, the mon IP endpoints revert to the old worker node IP.

Version-Release number of selected component (if applicable):

Provider:

OC version:
Client Version: 4.10.24
Server Version: 4.10.33
Kubernetes Version: v1.23.5+012e945

OCS version:
ocs-operator.v4.10.5   OpenShift Container Storage   4.10.5   ocs-operator.v4.10.4   Succeeded

Cluster version:
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.33   True        False         5h6m    Cluster version is 4.10.33

Consumer:

OC version:
Client Version: 4.10.24
Server Version: 4.10.33
Kubernetes Version: v1.23.5+012e945

OCS version:
ocs-operator.v4.10.5   OpenShift Container Storage   4.10.5   ocs-operator.v4.10.4   Succeeded

Cluster version:
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.33   True        False         3h51m   Cluster version is 4.10.33

How reproducible:

Steps to Reproduce:

1. Check the configmap rook-ceph-mon-endpoints data on the consumer:
$ oc get cm rook-ceph-mon-endpoints -o yaml | grep data
data:
  data: a=10.0.135.169:6789
metadata:

2. Find the corresponding worker node with the IP above on the provider:
$ oc get nodes
NAME                           STATUS   ROLES          AGE    VERSION
ip-10-0-134-230.ec2.internal   Ready    infra,worker   158m   v1.23.5+012e945
ip-10-0-135-169.ec2.internal   Ready    worker         174m   v1.23.5+012e945
ip-10-0-142-110.ec2.internal   Ready    master         3h1m   v1.23.5+012e945
ip-10-0-145-47.ec2.internal    Ready    infra,worker   158m   v1.23.5+012e945
ip-10-0-149-149.ec2.internal   Ready    worker         174m   v1.23.5+012e945
ip-10-0-159-240.ec2.internal   Ready    master         3h1m   v1.23.5+012e945
ip-10-0-161-138.ec2.internal   Ready    infra,worker   158m   v1.23.5+012e945
ip-10-0-168-154.ec2.internal   Ready    worker         174m   v1.23.5+012e945
ip-10-0-168-248.ec2.internal   Ready    master         3h1m   v1.23.5+012e945

3. Delete the worker node with the IP above (here it is 'ip-10-0-135-169.ec2.internal') as described in the doc https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.10/html/replacing_nodes/openshift_data_foundation_deployed_using_dynamic_devices#replacing-an-operational-aws-node-ipi_rhodf

4. Wait for a new worker node to come up and reach the "Ready" state.

5. Wait about 20 min, and check the configmap rook-ceph-mon-endpoints data on the consumer:
$ oc get cm rook-ceph-mon-endpoints -o yaml | grep data
data:
  data: b=10.0.149.149:6789,c=10.0.168.154:6789,d=10.0.139.52:6789
metadata:

6. Check the worker node IPs on the provider:
$ oc get nodes | grep worker | grep -v infra
ip-10-0-139-52.ec2.internal    Ready   worker   167m   v1.23.5+012e945
ip-10-0-149-149.ec2.internal   Ready   worker   6h4m   v1.23.5+012e945
ip-10-0-168-154.ec2.internal   Ready   worker   6h5m   v1.23.5+012e945

We can see that the IPs on the provider match the mon endpoint IPs.

7. Check the storage provider endpoint on the consumer:
$ oc get storageclusters.ocs.openshift.io ocs-storagecluster -o yaml | grep storageProviderEndpoint
storageProviderEndpoint: 10.0.135.169:31659

We can see that it still uses the old provider worker node IP.

8. Edit the rosa addon with the following command on the consumer:
rosa edit addon ocs-consumer-qe -c ikave-24-c1 --storage-provider-endpoint "10.0.149.149:31659"

9. Check the storage provider endpoint on the consumer:
$ oc get storageclusters.ocs.openshift.io ocs-storagecluster -o yaml | grep storageProviderEndpoint
storageProviderEndpoint: 10.0.149.149:31659

10. Now check the configmap rook-ceph-mon-endpoints data again on the consumer:
$ oc get cm rook-ceph-mon-endpoints -o yaml | grep data
data:
  data: a=10.0.135.169:6789
metadata:

We can see that for some reason the configmap rook-ceph-mon-endpoints data changed back to the old IP of the provider worker node.

Actual results:
After editing the rosa addon and changing the storage provider endpoint, the configmap rook-ceph-mon-endpoints data changed back to the old IP of the provider worker node.

Expected results:
After editing the rosa addon and changing the storage provider endpoint, the configmap rook-ceph-mon-endpoints data should point to the existing provider worker node IPs, not the old IP.
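To repeat the comparison from steps 5-7 quickly, here is a rough shell sketch (not part of the original report). It assumes the openshift-storage namespace, and KUBECONFIG_CONSUMER / KUBECONFIG_PROVIDER are hypothetical placeholders pointing at the consumer and provider clusters:

# List the mon endpoint IPs the consumer currently knows about:
oc --kubeconfig "$KUBECONFIG_CONSUMER" -n openshift-storage get cm rook-ceph-mon-endpoints \
  -o jsonpath='{.data.data}' | tr ',' '\n' | cut -d'=' -f2 | cut -d':' -f1

# List the provider's current (non-infra) worker nodes with their internal IPs:
oc --kubeconfig "$KUBECONFIG_PROVIDER" get nodes -o wide | grep worker | grep -v infra

# Show the provider endpoint configured on the consumer StorageCluster:
oc --kubeconfig "$KUBECONFIG_CONSUMER" -n openshift-storage get storageclusters.ocs.openshift.io \
  ocs-storagecluster -o yaml | grep storageProviderEndpoint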
Additional info:

Provider versions:

OC version:
Client Version: 4.10.24
Server Version: 4.10.33
Kubernetes Version: v1.23.5+012e945

OCS version:
ocs-operator.v4.10.5   OpenShift Container Storage   4.10.5   ocs-operator.v4.10.4   Succeeded

Cluster version:
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.33   True        False         5h6m    Cluster version is 4.10.33

Rook version:
rook: v4.10.5-0.985405daeba3b29a178cb19aa864324e65548a63
go: go1.16.12

Ceph version:
ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)

Consumer versions:

OC version:
Client Version: 4.10.24
Server Version: 4.10.33
Kubernetes Version: v1.23.5+012e945

OCS version:
ocs-operator.v4.10.5   OpenShift Container Storage   4.10.5   ocs-operator.v4.10.4   Succeeded

Cluster version:
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.33   True        False         3h51m   Cluster version is 4.10.33

Rook version:
rook: v4.10.5-0.985405daeba3b29a178cb19aa864324e65548a63
go: go1.16.12

Ceph version:
ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)
Link to the provider Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/17142/
Link to the consumer Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/17144/

Please let me know if someone can take a look at it. Thanks.
@ikave I see the cluster is down. Do we have the rook operator logs for it?
No. I didn't get the rook operator logs. Should I test it again and get the rook operator logs of the consumer and provider?
I checked the steps from the comment https://bugzilla.redhat.com/show_bug.cgi?id=2131581#c0 again. I didn't edit the rosa addon this time.

1. Check the configmap rook-ceph-mon-endpoints data on the consumer:
$ oc get cm rook-ceph-mon-endpoints -o yaml | grep data
data:
  data: b=10.0.156.26:6789
metadata:

2. Find the corresponding worker node with the IP above on the provider - here it is the worker node "ip-10-0-156-26.ec2.internal".

3. Delete the worker node with the IP above (here it is 'ip-10-0-156-26.ec2.internal') as described in the doc https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.10/html/replacing_nodes/openshift_data_foundation_deployed_using_dynamic_devices#replacing-an-operational-aws-node-ipi_rhodf

4. Wait for a new worker node to come up and reach the "Ready" state.

5. Wait about 25 min (until the mon pods are running on the provider), and check the configmap rook-ceph-mon-endpoints data on the consumer:
$ oc get cm rook-ceph-mon-endpoints -o yaml | grep data
data:
  data: b=10.0.156.26:6789
metadata:

We can see from the output above that we still have the old worker node IP. I also tried to edit the configmap manually, and it went back to the old IP.
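For completeness, a minimal sketch of what such a manual edit could look like (assuming the openshift-storage namespace; the replacement IP is a placeholder, not a value from this run):

# Hypothetical manual edit of the mon endpoint entry (placeholder IP):
oc -n openshift-storage patch cm rook-ceph-mon-endpoints --type merge \
  -p '{"data":{"data":"b=<new-provider-worker-ip>:6789"}}'

# Re-check a short time later; as noted above, the value reverts to the old worker node IP:
oc -n openshift-storage get cm rook-ceph-mon-endpoints -o jsonpath='{.data.data}'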
I remember there was a bug related to this: https://bugzilla.redhat.com/show_bug.cgi?id=2086485. Can you make sure your version includes this BZ fix?
Additional info:

Consumer versions:

OC version:
Client Version: 4.10.24
Server Version: 4.10.35
Kubernetes Version: v1.23.5+8471591

OCS version:
ocs-operator.v4.10.5   OpenShift Container Storage   4.10.5   ocs-operator.v4.10.4   Succeeded

Cluster version:
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.35   True        False         3h18m   Cluster version is 4.10.35

Rook version:
rook: v4.10.5-0.985405daeba3b29a178cb19aa864324e65548a63
go: go1.16.12

Provider versions:

OC version:
Client Version: 4.10.24
Server Version: 4.10.35
Kubernetes Version: v1.23.5+8471591

OCS version:
ocs-operator.v4.10.5   OpenShift Container Storage   4.10.5   ocs-operator.v4.10.4   Succeeded

Cluster version:
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.35   True        False         5h34m   Cluster version is 4.10.35

Rook version:
rook: v4.10.5-0.985405daeba3b29a178cb19aa864324e65548a63
go: go1.16.12

Ceph version:
ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)
Link to consumer Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/17455/
Link to provider Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/17453/
I have looked at the BZ https://bugzilla.redhat.com/show_bug.cgi?id=2086485, and the target version is ODF 4.11. Should we also check this BZ with ODF 4.11?
Yes, I think so.
After terminating a provider worker node, I tested this scenario again and found that deleting the ocs-operator pod resolved the issue. After deleting it, the "rook-ceph-mon-endpoints" configmap was updated to one of the provider worker node IPs, and the ceph health command worked again as expected.

This is the "rook-ceph-mon-endpoints" before deleting the ocs-operator pod:
$ oc get configmaps rook-ceph-mon-endpoints -o yaml
apiVersion: v1
data:
  data: c=10.0.170.158:6789
  mapping: '{}'
  maxMonId: "0"

And after deleting the pod:
$ oc get configmaps rook-ceph-mon-endpoints -o yaml
apiVersion: v1
data:
  data: d=10.0.170.232:6789
  mapping: '{}'
  maxMonId: "0"

Here is the link to the test: https://os4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-odf-multicluster/1055/console.
As you can see in the test, the ceph health check was stuck until this line:
"2022-11-23 17:18:31 15:18:31 - MainThread - ocs_ci.utility.utils - INFO - C[ikave-48-c2] - Executing command: oc -n openshift-storage exec rook-ceph-tools-7c7ccfb8d6-mv24p -- ceph health"
which is when I deleted the ocs-operator pod, and after that, it recovered.

Provider versions:

OC version:
Client Version: 4.10.24
Server Version: 4.10.40
Kubernetes Version: v1.23.12+7566c4d

OCS version:
ocs-operator.v4.10.5   OpenShift Container Storage   4.10.5   ocs-operator.v4.10.4   Succeeded

Cluster version:
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.40   True        False         4h54m   Cluster version is 4.10.40

Rook version:
rook: v4.10.5-0.985405daeba3b29a178cb19aa864324e65548a63
go: go1.16.12

Ceph version:
ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)

Consumer versions:

OC version:
Client Version: 4.10.24
Server Version: 4.10.40
Kubernetes Version: v1.23.12+7566c4d

OCS version:
ocs-operator.v4.10.5   OpenShift Container Storage   4.10.5   ocs-operator.v4.10.4   Succeeded

Cluster version:
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.40   True        False         6h37m   Cluster version is 4.10.40

Rook version:
rook: v4.10.5-0.985405daeba3b29a178cb19aa864324e65548a63
go: go1.16.12

Ceph version:
ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)
Here is the correct link to the test: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-odf-multicluster/1055/console
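For anyone who wants to repeat the workaround above, a rough sketch (assuming the openshift-storage namespace on the consumer; the name=ocs-operator and app=rook-ceph-tools label selectors are assumptions, not taken from the report):

# Restart the ocs-operator pod (the workaround described above):
oc -n openshift-storage delete pod -l name=ocs-operator

# Re-check the mon endpoints the consumer knows about:
oc -n openshift-storage get cm rook-ceph-mon-endpoints -o jsonpath='{.data.data}'

# Verify ceph health from the tools pod:
TOOLS_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}')
oc -n openshift-storage exec "$TOOLS_POD" -- ceph health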
To be tested in v2.0.11 (ODF v4.10.9).
According to the test run https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-odf-multicluster/1338/console and the output:

2023-02-13 18:31:40 16:31:40 - MainThread - ocs_ci.ocs.resources.storage_cluster - WARNING - C[ikave-60-c1] - The endpoint ip 10.0.12.82 of mon d is not found in the provider worker node ips
2023-02-13 18:31:40 16:31:40 - MainThread - ocs_ci.utility.utils - INFO - C[ikave-60-c1] - Going to sleep for 10 seconds before next iteration
2023-02-13 18:31:50 16:31:50 - MainThread - ocs_ci.utility.utils - ERROR - C[ikave-60-c1] - function inner failed to return expected value True after multiple retries during 180 second timeout
2023-02-13 18:31:50 16:31:50 - MainThread - ocs_ci.ocs.node - INFO - C[ikave-60-c1] - Try to restart the ocs-operator pod

this bug still exists with the ODF 4.10.9 version.

Cluster versions:

OC version:
Client Version: 4.10.24
Server Version: 4.10.50
Kubernetes Version: v1.23.12+8a6bfe4

OCS version:
ocs-operator.v4.10.9   OpenShift Container Storage   4.10.9   ocs-operator.v4.10.5   Succeeded

Cluster version:
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.50   True        False         4h25m   Cluster version is 4.10.50

Rook version:
rook: v4.10.9-0.b7b3a0044169fd9364683e2e4e6968361f8f3c08
go: go1.16.12

Ceph version:
ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)

CSV versions:
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
mcg-operator.v4.10.9                      NooBaa Operator               4.10.9            mcg-operator.v4.10.8                      Succeeded
observability-operator.v0.0.20            Observability Operator        0.0.20            observability-operator.v0.0.19            Succeeded
ocs-operator.v4.10.9                      OpenShift Container Storage   4.10.9            ocs-operator.v4.10.5                      Succeeded
ocs-osd-deployer.v2.0.11                  OCS OSD Deployer              2.0.11-11         ocs-osd-deployer.v2.0.10                  Succeeded
odf-csi-addons-operator.v4.10.9           CSI Addons                    4.10.9            odf-csi-addons-operator.v4.10.5           Succeeded
odf-operator.v4.10.9                      OpenShift Data Foundation     4.10.9            odf-operator.v4.10.5                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0            ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.461-dbddf1f   Route Monitor Operator        0.1.461-dbddf1f   route-monitor-operator.v0.1.456-02ea942   Succeeded
The above are the provider versions.

The consumer versions:

OC version:
Client Version: 4.10.24
Server Version: 4.10.50
Kubernetes Version: v1.23.12+8a6bfe4

OCS version:
ocs-operator.v4.10.9   OpenShift Container Storage   4.10.9   ocs-operator.v4.10.8   Succeeded

Cluster version:
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.50   True        False         4h22m   Cluster version is 4.10.50

Rook version:
rook: v4.10.9-0.b7b3a0044169fd9364683e2e4e6968361f8f3c08
go: go1.16.12

Ceph version:
ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)

CSV versions:
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
mcg-operator.v4.10.9                      NooBaa Operator               4.10.9            mcg-operator.v4.10.8                      Succeeded
observability-operator.v0.0.20            Observability Operator        0.0.20            observability-operator.v0.0.19            Succeeded
ocs-operator.v4.10.9                      OpenShift Container Storage   4.10.9            ocs-operator.v4.10.8                      Succeeded
ocs-osd-deployer.v2.0.11                  OCS OSD Deployer              2.0.11-11         ocs-osd-deployer.v2.0.10                  Succeeded
odf-csi-addons-operator.v4.10.9           CSI Addons                    4.10.9            odf-csi-addons-operator.v4.10.8           Succeeded
odf-operator.v4.10.9                      OpenShift Data Foundation     4.10.9            odf-operator.v4.10.8                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0            ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.461-dbddf1f   Route Monitor Operator        0.1.461-dbddf1f   route-monitor-operator.v0.1.456-02ea942   Succeeded
The rook-ceph-mon-endpoints configmap was updated, but only after restarting the ocs-operator pod.
Parth, please take a look.
Is this PR https://github.com/red-hat-storage/ocs-operator/pull/1891 included in the build you are testing?
Yes, I think the changes in the PR above are included in the build used by the automation test, because the test I ran above was a month after the PR was merged.
Can you share the ocs and rook operator pod logs from before and after the restart? It seems like the OCS operator didn't get a reconciliation trigger, which is why rook also didn't reconcile.
Okay, I will rerun the test and save the ocs and rook operator logs.
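A possible way to capture those logs around the restart, as a sketch only (assuming both operators run in the openshift-storage namespace; the file names and the name=ocs-operator label selector are assumptions):

# Save the operator logs before the restart:
oc -n openshift-storage logs deploy/ocs-operator > ocs-operator-before.log
oc -n openshift-storage logs deploy/rook-ceph-operator > rook-ceph-operator-before.log

# Restart the ocs-operator pod:
oc -n openshift-storage delete pod -l name=ocs-operator

# Save the operator logs again after the restart:
oc -n openshift-storage logs deploy/ocs-operator > ocs-operator-after.log
oc -n openshift-storage logs deploy/rook-ceph-operator > rook-ceph-operator-after.log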
In the previous comments I attached the ocs operator logs from before and after the restart, as well as the rook-ceph-operator logs.

Consumer versions:

OC version:
Client Version: 4.10.24
Server Version: 4.11.27
Kubernetes Version: v1.24.6+263df15

OCS version:
ocs-operator.v4.10.9   OpenShift Container Storage   4.10.9   ocs-operator.v4.10.8   Succeeded

Cluster version:
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.27   True        False         8h      Cluster version is 4.11.27

Rook version:
rook: v4.10.9-0.b7b3a0044169fd9364683e2e4e6968361f8f3c08
go: go1.16.12

Ceph version:
ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)

CSV versions:
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
mcg-operator.v4.10.10                     NooBaa Operator               4.10.10           mcg-operator.v4.10.9                      Succeeded
observability-operator.v0.0.20            Observability Operator        0.0.20            observability-operator.v0.0.19            Succeeded
ocs-operator.v4.10.9                      OpenShift Container Storage   4.10.9            ocs-operator.v4.10.8                      Succeeded
ocs-osd-deployer.v2.0.11                  OCS OSD Deployer              2.0.11-11         ocs-osd-deployer.v2.0.10                  Succeeded
odf-csi-addons-operator.v4.10.9           CSI Addons                    4.10.9            odf-csi-addons-operator.v4.10.8           Succeeded
odf-operator.v4.10.9                      OpenShift Data Foundation     4.10.9            odf-operator.v4.10.8                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0            ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.461-dbddf1f   Route Monitor Operator        0.1.461-dbddf1f   route-monitor-operator.v0.1.456-02ea942   Succeeded
Link to the relevant automation test: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-odf-multicluster/1482/console.
Parth, any update on this?
After the new ocs worker node is up and all the ocs pods are running, the test waits 3 minutes for rook-ceph-mon-endpoints to be updated. When I checked it manually a month ago, waiting longer didn't help. Maybe we can reproduce it if we only remove a mon IP from the rook-ceph-mon-endpoints configmap - I haven't tried it yet. I will add additional logs in the following comment, including the storage cluster state in the test.
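If it helps, an untested sketch of that reproduction idea (placeholder mon entries, assuming the openshift-storage namespace on the consumer):

# Show the current mon entries, e.g. "b=<ip-b>:6789,c=<ip-c>:6789,d=<ip-d>:6789":
oc -n openshift-storage get cm rook-ceph-mon-endpoints -o jsonpath='{.data.data}'

# Drop one mon entry and keep the rest:
oc -n openshift-storage patch cm rook-ceph-mon-endpoints --type merge \
  -p '{"data":{"data":"b=<ip-b>:6789,c=<ip-c>:6789"}}'

# Re-check periodically to see whether and when the operator restores or updates the entry:
oc -n openshift-storage get cm rook-ceph-mon-endpoints -o jsonpath='{.data.data}'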
Okay, I can try adding more time. How much time do you think we should wait for the 'rook-ceph-mon-endpoints' to be updated?
According to the comment https://bugzilla.redhat.com/show_bug.cgi?id=2086485#c15, the fix will be available in the ODF 4.11 version.
This issue is no longer relevant. The latest tier4 tests that check the issue above passed successfully.
The ODF Managed Service project has been sunset and is now considered obsolete.