Description of problem:
After the node replacement procedure (on the provider), the mon IP endpoints are updated as expected. But after editing the rosa addon, the mon IP endpoints revert to the old worker node IP.

Version-Release number of selected component (if applicable):

Provider:
OC version:
Client Version: 4.10.24
Server Version: 4.10.33
Kubernetes Version: v1.23.5+012e945

OCS version:
ocs-operator.v4.10.5   OpenShift Container Storage   4.10.5   ocs-operator.v4.10.4   Succeeded

Cluster version:
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.33   True        False         5h6m    Cluster version is 4.10.33

Consumer:
OC version:
Client Version: 4.10.24
Server Version: 4.10.33
Kubernetes Version: v1.23.5+012e945

OCS version:
ocs-operator.v4.10.5   OpenShift Container Storage   4.10.5   ocs-operator.v4.10.4   Succeeded

Cluster version:
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.33   True        False         3h51m   Cluster version is 4.10.33

How reproducible:

Steps to Reproduce:
1. Check the configmap rook-ceph-mon-endpoints data on the consumer:
$ oc get cm rook-ceph-mon-endpoints -o yaml | grep data
data:
  data: a=10.0.135.169:6789
metadata:

2. Find the corresponding worker node with the IP above on the provider:
$ oc get nodes
NAME                           STATUS   ROLES          AGE    VERSION
ip-10-0-134-230.ec2.internal   Ready    infra,worker   158m   v1.23.5+012e945
ip-10-0-135-169.ec2.internal   Ready    worker         174m   v1.23.5+012e945
ip-10-0-142-110.ec2.internal   Ready    master         3h1m   v1.23.5+012e945
ip-10-0-145-47.ec2.internal    Ready    infra,worker   158m   v1.23.5+012e945
ip-10-0-149-149.ec2.internal   Ready    worker         174m   v1.23.5+012e945
ip-10-0-159-240.ec2.internal   Ready    master         3h1m   v1.23.5+012e945
ip-10-0-161-138.ec2.internal   Ready    infra,worker   158m   v1.23.5+012e945
ip-10-0-168-154.ec2.internal   Ready    worker         174m   v1.23.5+012e945
ip-10-0-168-248.ec2.internal   Ready    master         3h1m   v1.23.5+012e945

3. Delete the worker node with that IP (here it is 'ip-10-0-135-169.ec2.internal') as described in the doc https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.10/html/replacing_nodes/openshift_data_foundation_deployed_using_dynamic_devices#replacing-an-operational-aws-node-ipi_rhodf

4. Wait for a new worker node to come up and reach the "Ready" state.

5. Wait about 20 min, and check the configmap rook-ceph-mon-endpoints data on the consumer:
$ oc get cm rook-ceph-mon-endpoints -o yaml | grep data
data:
  data: b=10.0.149.149:6789,c=10.0.168.154:6789,d=10.0.139.52:6789
metadata:

6. Check the worker node IPs on the provider:
$ oc get nodes | grep worker | grep -v infra
ip-10-0-139-52.ec2.internal    Ready    worker   167m   v1.23.5+012e945
ip-10-0-149-149.ec2.internal   Ready    worker   6h4m   v1.23.5+012e945
ip-10-0-168-154.ec2.internal   Ready    worker   6h5m   v1.23.5+012e945

We can see that the IPs on the provider match the mon endpoint IPs.

7. Check the storage provider endpoint on the consumer:
$ oc get storageclusters.ocs.openshift.io ocs-storagecluster -o yaml | grep storageProviderEndpoint
    storageProviderEndpoint: 10.0.135.169:31659

We can see that it still uses the old provider worker node IP.

8. Edit the rosa addon with the following command on the consumer:
rosa edit addon ocs-consumer-qe -c ikave-24-c1 --storage-provider-endpoint "10.0.149.149:31659"

9. Check the storage provider endpoint on the consumer:
$ oc get storageclusters.ocs.openshift.io ocs-storagecluster -o yaml | grep storageProviderEndpoint
    storageProviderEndpoint: 10.0.149.149:31659
10. Now check the configmap rook-ceph-mon-endpoints data again on the consumer:
$ oc get cm rook-ceph-mon-endpoints -o yaml | grep data
data:
  data: a=10.0.135.169:6789
metadata:

We can see that, for some reason, the configmap rook-ceph-mon-endpoints data changed back to the old IP of the provider worker node.

Actual results:
After editing the rosa addon and changing the storage provider endpoint, the configmap rook-ceph-mon-endpoints data changed back to the old IP of the provider worker node.

Expected results:
After editing the rosa addon and changing the storage provider endpoint, the configmap rook-ceph-mon-endpoints data should point to the existing provider worker node IPs and not the old IP.
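For reference, a quick way to cross-check the consumer mon endpoints against the provider worker node IPs, based on the commands used above (a sketch only; it assumes the resources live in the openshift-storage namespace):

# On the consumer: print the mon endpoints recorded in the configmap.
oc -n openshift-storage get cm rook-ceph-mon-endpoints -o jsonpath='{.data.data}{"\n"}'

# On the provider: list worker nodes with their internal IPs (the -o wide output includes the INTERNAL-IP column).
oc get nodes -o wide | grep worker | grep -v infra

# On the consumer: check which provider endpoint the StorageCluster points at.
oc -n openshift-storage get storageclusters.ocs.openshift.io ocs-storagecluster -o yaml | grep storageProviderEndpoint

The mon IPs and the storageProviderEndpoint should all resolve to current provider worker nodes; in this bug the configmap keeps falling back to the terminated node's IP.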
Additional info:

Provider versions:
OC version:
Client Version: 4.10.24
Server Version: 4.10.33
Kubernetes Version: v1.23.5+012e945

OCS version:
ocs-operator.v4.10.5   OpenShift Container Storage   4.10.5   ocs-operator.v4.10.4   Succeeded

Cluster version:
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.33   True        False         5h6m    Cluster version is 4.10.33

Rook version:
rook: v4.10.5-0.985405daeba3b29a178cb19aa864324e65548a63
go: go1.16.12

Ceph version:
ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)

Consumer versions:
OC version:
Client Version: 4.10.24
Server Version: 4.10.33
Kubernetes Version: v1.23.5+012e945

OCS version:
ocs-operator.v4.10.5   OpenShift Container Storage   4.10.5   ocs-operator.v4.10.4   Succeeded

Cluster version:
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.33   True        False         3h51m   Cluster version is 4.10.33

Rook version:
rook: v4.10.5-0.985405daeba3b29a178cb19aa864324e65548a63
go: go1.16.12

Ceph version:
ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)
Link to the provider Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/17142/
Link to the consumer Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/17144/

Please let me know if someone can look at it. Thanks.
@ikave I see the cluster is down. Do we have the rook operator logs for the same?
No. I didn't get the rook operator logs. Should I test it again and get the rook operator logs of the consumer and provider?
I checked the steps from the comment https://bugzilla.redhat.com/show_bug.cgi?id=2131581#c0 again. I didn't edit the rosa addon this time.

1. Check the configmap rook-ceph-mon-endpoints data on the consumer:
$ oc get cm rook-ceph-mon-endpoints -o yaml | grep data
data:
  data: b=10.0.156.26:6789
metadata:

2. Find the corresponding worker node with the IP above on the provider - here it is 'ip-10-0-156-26.ec2.internal'.

3. Delete that worker node as described in the doc https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.10/html/replacing_nodes/openshift_data_foundation_deployed_using_dynamic_devices#replacing-an-operational-aws-node-ipi_rhodf

4. Wait for a new worker node to come up and reach the "Ready" state.

5. Wait about 25 min (until the mon pods are running on the provider), and check the configmap rook-ceph-mon-endpoints data on the consumer:
$ oc get cm rook-ceph-mon-endpoints -o yaml | grep data
data:
  data: b=10.0.156.26:6789
metadata:

We can see from the output above that we still have the old worker node IP. I also tried to edit the configmap manually, and it went back to the old IP.
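For reference, this is roughly what the manual edit attempt looked like; a sketch only (it assumes the openshift-storage namespace, and '<new-worker-ip>' is a placeholder for the new provider worker node IP), and it can also be done with a plain 'oc edit' of the configmap:

# Try to point the mon entry at the new worker node IP.
oc -n openshift-storage patch cm rook-ceph-mon-endpoints --type merge \
  -p '{"data":{"data":"b=<new-worker-ip>:6789"}}'

# Shortly afterwards the value is reconciled back to the old IP:
oc -n openshift-storage get cm rook-ceph-mon-endpoints -o jsonpath='{.data.data}{"\n"}'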
I remember there was a bug related to it: https://bugzilla.redhat.com/show_bug.cgi?id=2086485. Can you make sure your version includes this BZ fix?
Additional info:

Consumer versions:
OC version:
Client Version: 4.10.24
Server Version: 4.10.35
Kubernetes Version: v1.23.5+8471591

OCS version:
ocs-operator.v4.10.5   OpenShift Container Storage   4.10.5   ocs-operator.v4.10.4   Succeeded

Cluster version:
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.35   True        False         3h18m   Cluster version is 4.10.35

Rook version:
rook: v4.10.5-0.985405daeba3b29a178cb19aa864324e65548a63
go: go1.16.12

Provider versions:
OC version:
Client Version: 4.10.24
Server Version: 4.10.35
Kubernetes Version: v1.23.5+8471591

OCS version:
ocs-operator.v4.10.5   OpenShift Container Storage   4.10.5   ocs-operator.v4.10.4   Succeeded

Cluster version:
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.35   True        False         5h34m   Cluster version is 4.10.35

Rook version:
rook: v4.10.5-0.985405daeba3b29a178cb19aa864324e65548a63
go: go1.16.12

Ceph version:
ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)
Link to the consumer Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/17455/
Link to the provider Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/17453/
I have looked at the BZ https://bugzilla.redhat.com/show_bug.cgi?id=2086485, and the target version is ODF 4.11. Should we also check this BZ with ODF 4.11?
Yes, I think so.
After terminating a provider worker node, I tested this scenario again and found that deleting the ocs-operator pod resolved the issue. After deleting it, the "rook-ceph-mon-endpoints" configmap updated to one of the provider worker nodes' IPs, and the ceph health command worked again as expected.

This is the "rook-ceph-mon-endpoints" before deleting the ocs-operator pod:
$ oc get configmaps rook-ceph-mon-endpoints -o yaml
apiVersion: v1
data:
  data: c=10.0.170.158:6789
  mapping: '{}'
  maxMonId: "0"

And after deleting the pod:
$ oc get configmaps rook-ceph-mon-endpoints -o yaml
apiVersion: v1
data:
  data: d=10.0.170.232:6789
  mapping: '{}'
  maxMonId: "0"

Here is the link to the test: https://os4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-odf-multicluster/1055/console. As you can see in the test, the ceph health command was stuck until this line:
"2022-11-23 17:18:31 15:18:31 - MainThread - ocs_ci.utility.utils - INFO - C[ikave-48-c2] - Executing command: oc -n openshift-storage exec rook-ceph-tools-7c7ccfb8d6-mv24p -- ceph health"
That is when I deleted the ocs-operator pod, and after that, it recovered.

Provider versions:
OC version:
Client Version: 4.10.24
Server Version: 4.10.40
Kubernetes Version: v1.23.12+7566c4d

OCS version:
ocs-operator.v4.10.5   OpenShift Container Storage   4.10.5   ocs-operator.v4.10.4   Succeeded

Cluster version:
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.40   True        False         4h54m   Cluster version is 4.10.40

Rook version:
rook: v4.10.5-0.985405daeba3b29a178cb19aa864324e65548a63
go: go1.16.12

Ceph version:
ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)

Consumer versions:
OC version:
Client Version: 4.10.24
Server Version: 4.10.40
Kubernetes Version: v1.23.12+7566c4d

OCS version:
ocs-operator.v4.10.5   OpenShift Container Storage   4.10.5   ocs-operator.v4.10.4   Succeeded

Cluster version:
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.40   True        False         6h37m   Cluster version is 4.10.40

Rook version:
rook: v4.10.5-0.985405daeba3b29a178cb19aa864324e65548a63
go: go1.16.12

Ceph version:
ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)
Here is the correct link to the test: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-odf-multicluster/1055/console
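For anyone hitting this, a minimal sketch of the workaround described above (assumptions: the operators run in the openshift-storage namespace, the operator deployment is named ocs-operator, and the rook-ceph-tools toolbox is deployed; restarting the deployment has the same effect as deleting the pod):

# Restart the OCS operator so it reconciles again and refreshes the mon endpoints.
oc -n openshift-storage rollout restart deployment/ocs-operator
oc -n openshift-storage rollout status deployment/ocs-operator

# Verify the mon endpoints and ceph health afterwards.
oc -n openshift-storage get cm rook-ceph-mon-endpoints -o jsonpath='{.data.data}{"\n"}'
oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph health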
To be tested in v2.0.11 (ODF v4.10.9).
According to the test run https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-odf-multicluster/1338/console and the output:

2023-02-13 18:31:40 16:31:40 - MainThread - ocs_ci.ocs.resources.storage_cluster - WARNING - C[ikave-60-c1] - The endpoint ip 10.0.12.82 of mon d is not found in the provider worker node ips
2023-02-13 18:31:40 16:31:40 - MainThread - ocs_ci.utility.utils - INFO - C[ikave-60-c1] - Going to sleep for 10 seconds before next iteration
2023-02-13 18:31:50 16:31:50 - MainThread - ocs_ci.utility.utils - ERROR - C[ikave-60-c1] - function inner failed to return expected value True after multiple retries during 180 second timeout
2023-02-13 18:31:50 16:31:50 - MainThread - ocs_ci.ocs.node - INFO - C[ikave-60-c1] - Try to restart the ocs-operator pod

this bug still exists with the ODF 4.10.9 version.

Cluster versions:
OC version:
Client Version: 4.10.24
Server Version: 4.10.50
Kubernetes Version: v1.23.12+8a6bfe4

OCS version:
ocs-operator.v4.10.9   OpenShift Container Storage   4.10.9   ocs-operator.v4.10.5   Succeeded

Cluster version:
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.50   True        False         4h25m   Cluster version is 4.10.50

Rook version:
rook: v4.10.9-0.b7b3a0044169fd9364683e2e4e6968361f8f3c08
go: go1.16.12

Ceph version:
ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)

CSV versions:
NAME                                      DISPLAY                       VERSION          REPLACES                                  PHASE
mcg-operator.v4.10.9                      NooBaa Operator               4.10.9           mcg-operator.v4.10.8                      Succeeded
observability-operator.v0.0.20            Observability Operator        0.0.20           observability-operator.v0.0.19            Succeeded
ocs-operator.v4.10.9                      OpenShift Container Storage   4.10.9           ocs-operator.v4.10.5                      Succeeded
ocs-osd-deployer.v2.0.11                  OCS OSD Deployer              2.0.11-11        ocs-osd-deployer.v2.0.10                  Succeeded
odf-csi-addons-operator.v4.10.9           CSI Addons                    4.10.9           odf-csi-addons-operator.v4.10.5           Succeeded
odf-operator.v4.10.9                      OpenShift Data Foundation     4.10.9           odf-operator.v4.10.5                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0           ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.461-dbddf1f   Route Monitor Operator        0.1.461-dbddf1f  route-monitor-operator.v0.1.456-02ea942   Succeeded
The above are the provider versions. The consumer versions:

OC version:
Client Version: 4.10.24
Server Version: 4.10.50
Kubernetes Version: v1.23.12+8a6bfe4

OCS version:
ocs-operator.v4.10.9   OpenShift Container Storage   4.10.9   ocs-operator.v4.10.8   Succeeded

Cluster version:
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.50   True        False         4h22m   Cluster version is 4.10.50

Rook version:
rook: v4.10.9-0.b7b3a0044169fd9364683e2e4e6968361f8f3c08
go: go1.16.12

Ceph version:
ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)

CSV versions:
NAME                                      DISPLAY                       VERSION          REPLACES                                  PHASE
mcg-operator.v4.10.9                      NooBaa Operator               4.10.9           mcg-operator.v4.10.8                      Succeeded
observability-operator.v0.0.20            Observability Operator        0.0.20           observability-operator.v0.0.19            Succeeded
ocs-operator.v4.10.9                      OpenShift Container Storage   4.10.9           ocs-operator.v4.10.8                      Succeeded
ocs-osd-deployer.v2.0.11                  OCS OSD Deployer              2.0.11-11        ocs-osd-deployer.v2.0.10                  Succeeded
odf-csi-addons-operator.v4.10.9           CSI Addons                    4.10.9           odf-csi-addons-operator.v4.10.8           Succeeded
odf-operator.v4.10.9                      OpenShift Data Foundation     4.10.9           odf-operator.v4.10.8                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0           ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.461-dbddf1f   Route Monitor Operator        0.1.461-dbddf1f  route-monitor-operator.v0.1.456-02ea942   Succeeded
The rook-ceph-mon-endpoints configmap did update, but only after restarting the ocs-operator pod.
Parth, please take a look.
Is this PR https://github.com/red-hat-storage/ocs-operator/pull/1891 in the build you are testing?
Yes, I think the changes in the PR above are included in the build under test, because the test I ran above was executed about a month after the PR was merged.
Can you share the ocs and rook operator pod logs from before and after the restart? It seems like the OCS operator didn't get a reconciliation trigger, which is why Rook also didn't reconcile.
Okay, I will rerun the test and save the ocs and rook operator logs.
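For completeness, a sketch of one way to capture these logs (assumptions: both operators run in the openshift-storage namespace and use the default deployment names ocs-operator and rook-ceph-operator):

# Before the restart:
oc -n openshift-storage logs deploy/ocs-operator > ocs-operator-before.log
oc -n openshift-storage logs deploy/rook-ceph-operator > rook-ceph-operator-before.log

# Restart the OCS operator and wait for the new pod:
oc -n openshift-storage rollout restart deploy/ocs-operator
oc -n openshift-storage rollout status deploy/ocs-operator

# After the restart:
oc -n openshift-storage logs deploy/ocs-operator > ocs-operator-after.log
oc -n openshift-storage logs deploy/rook-ceph-operator > rook-ceph-operator-after.log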
I attached the ocs-operator logs from before and after the restart, and the rook-ceph-operator logs, in the previous comments.

Consumer versions:
OC version:
Client Version: 4.10.24
Server Version: 4.11.27
Kubernetes Version: v1.24.6+263df15

OCS version:
ocs-operator.v4.10.9   OpenShift Container Storage   4.10.9   ocs-operator.v4.10.8   Succeeded

Cluster version:
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.27   True        False         8h      Cluster version is 4.11.27

Rook version:
rook: v4.10.9-0.b7b3a0044169fd9364683e2e4e6968361f8f3c08
go: go1.16.12

Ceph version:
ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)

CSV versions:
NAME                                      DISPLAY                       VERSION          REPLACES                                  PHASE
mcg-operator.v4.10.10                     NooBaa Operator               4.10.10          mcg-operator.v4.10.9                      Succeeded
observability-operator.v0.0.20            Observability Operator        0.0.20           observability-operator.v0.0.19            Succeeded
ocs-operator.v4.10.9                      OpenShift Container Storage   4.10.9           ocs-operator.v4.10.8                      Succeeded
ocs-osd-deployer.v2.0.11                  OCS OSD Deployer              2.0.11-11        ocs-osd-deployer.v2.0.10                  Succeeded
odf-csi-addons-operator.v4.10.9           CSI Addons                    4.10.9           odf-csi-addons-operator.v4.10.8           Succeeded
odf-operator.v4.10.9                      OpenShift Data Foundation     4.10.9           odf-operator.v4.10.8                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0           ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.461-dbddf1f   Route Monitor Operator        0.1.461-dbddf1f  route-monitor-operator.v0.1.456-02ea942   Succeeded
Link to the relevant automation test: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-odf-multicluster/1482/console.
Parth, any update on this?
After the new OCS worker node is up and all the OCS pods are running, the test waits 3 minutes for rook-ceph-mon-endpoints to be updated. When I checked it manually a month ago, waiting longer did not help. Maybe we can reproduce it by only removing a mon IP from rook-ceph-mon-endpoints - I haven't tried that yet. I will add additional logs in the following comment, including the storage cluster state during the test.
Okay, I can try adding more time. How much time do you think we should wait for the 'rook-ceph-mon-endpoints' to be updated?
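If we only want a longer manual check, a simple polling loop along these lines would work (a bash sketch; assumptions: openshift-storage namespace, 20-minute timeout, 10-second interval):

# Poll the mon endpoints for up to 20 minutes, printing the value on each iteration.
timeout=1200; interval=10; elapsed=0
while [ "$elapsed" -lt "$timeout" ]; do
  oc -n openshift-storage get cm rook-ceph-mon-endpoints -o jsonpath='{.data.data}{"\n"}'
  # Compare the printed IPs against the current provider worker nodes (oc get nodes -o wide).
  sleep "$interval"
  elapsed=$((elapsed + interval))
done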
According to the comment https://bugzilla.redhat.com/show_bug.cgi?id=2086485#c15, the fix will be ready in the ODF 4.11 version.