Bug 2131581 - After the node replacement procedure [provider] and editing the rosa addon, the mon IP endpoint reverts to the old worker node IP [NEEDINFO]
Keywords:
Status: ON_QA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-managed-service
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Parth Arora
QA Contact: Itzhak
URL:
Whiteboard:
Depends On: 2147580
Blocks:
 
Reported: 2022-10-02 16:18 UTC by Itzhak
Modified: 2023-08-09 17:00 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:
paarora: needinfo? (ikave)




Links:
Red Hat Bugzilla 2147580 (Priority: unspecified, Status: CLOSED): [odf-clone] Mons IP not updated correctly in the rook-ceph-mon-endpoints cm - last updated 2023-08-09 17:00:43 UTC

Description Itzhak 2022-10-02 16:18:43 UTC
Description of problem:

After the node replacement procedure [provider], the mon IP endpoints are updated as expected. But after editing the rosa addon, the mon IP endpoints revert to the old worker node IP.

Version-Release number of selected component (if applicable):
Provider:
OC version:
Client Version: 4.10.24
Server Version: 4.10.33
Kubernetes Version: v1.23.5+012e945

OCS version:
ocs-operator.v4.10.5                      OpenShift Container Storage   4.10.5            ocs-operator.v4.10.4                      Succeeded

Cluster version
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.33   True        False         5h6m    Cluster version is 4.10.33

Consumer:
OC version:
Client Version: 4.10.24
Server Version: 4.10.33
Kubernetes Version: v1.23.5+012e945

OCS version:
ocs-operator.v4.10.5                      OpenShift Container Storage   4.10.5            ocs-operator.v4.10.4                      Succeeded

Cluster version
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.33   True        False         3h51m   Cluster version is 4.10.33
 
How reproducible:


Steps to Reproduce:
1. Check the rook-ceph-mon-endpoints configmap data on the consumer:
$ oc get cm rook-ceph-mon-endpoints -o yaml | grep data
data:
  data: a=10.0.135.169:6789
metadata:

2. Find the corresponding worker node with the IP above on the provider:
$ oc get nodes
NAME                           STATUS   ROLES          AGE    VERSION
ip-10-0-134-230.ec2.internal   Ready    infra,worker   158m   v1.23.5+012e945
ip-10-0-135-169.ec2.internal   Ready    worker         174m   v1.23.5+012e945
ip-10-0-142-110.ec2.internal   Ready    master         3h1m   v1.23.5+012e945
ip-10-0-145-47.ec2.internal    Ready    infra,worker   158m   v1.23.5+012e945
ip-10-0-149-149.ec2.internal   Ready    worker         174m   v1.23.5+012e945
ip-10-0-159-240.ec2.internal   Ready    master         3h1m   v1.23.5+012e945
ip-10-0-161-138.ec2.internal   Ready    infra,worker   158m   v1.23.5+012e945
ip-10-0-168-154.ec2.internal   Ready    worker         174m   v1.23.5+012e945
ip-10-0-168-248.ec2.internal   Ready    master         3h1m   v1.23.5+012e945

3. Delete the worker node with the provided IP (here it is 'ip-10-0-135-169.ec2.internal') as described in the doc https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.10/html/replacing_nodes/openshift_data_foundation_deployed_using_dynamic_devices#replacing-an-operational-aws-node-ipi_rhodf

4. Wait for a new worker node to come up and reach the "Ready" state.
5. Wait about 20 min, and check the rook-ceph-mon-endpoints configmap data on the consumer:
$ oc get cm rook-ceph-mon-endpoints -o yaml | grep data
data:
  data: b=10.0.149.149:6789,c=10.0.168.154:6789,d=10.0.139.52:6789
metadata:

6. Check the worker node IPs on the provider:
$ oc get nodes | grep worker | grep -v infra
ip-10-0-139-52.ec2.internal    Ready    worker         167m    v1.23.5+012e945
ip-10-0-149-149.ec2.internal   Ready    worker         6h4m    v1.23.5+012e945
ip-10-0-168-154.ec2.internal   Ready    worker         6h5m    v1.23.5+012e945

We can see that the IPs on the provider match the mon endpoint IPs. 

7. Check the storage provider endpoint on the consumer: 
$ oc get storageclusters.ocs.openshift.io ocs-storagecluster -o yaml | grep storageProviderEndpoint
storageProviderEndpoint: 10.0.135.169:31659

We can see that it still uses the old provider worker node IP.

8. Edit the rosa addon with the following command for the consumer cluster:
rosa edit addon ocs-consumer-qe -c ikave-24-c1 --storage-provider-endpoint "10.0.149.149:31659"

9. Check the storage provider endpoint on the consumer: 
$ oc get storageclusters.ocs.openshift.io ocs-storagecluster -o yaml | grep storageProviderEndpoint
storageProviderEndpoint: 10.0.149.149:31659

10. Now check the rook-ceph-mon-endpoints configmap data again on the consumer:
$ oc get cm rook-ceph-mon-endpoints -o yaml | grep data
data:
  data: a=10.0.135.169:6789
metadata:

We can see that the rook-ceph-mon-endpoints configmap data reverted to the old IP of the provider worker node.
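
For reference, a minimal shell sketch of the cross-check performed in steps 5-7 above (assumptions: the openshift-storage namespace, and kubeconfig contexts pointing at the right clusters; the jsonpath expressions are illustrative, not taken from any tooling):

# On the consumer: extract only the mon endpoint IPs from the configmap
$ oc -n openshift-storage get cm rook-ceph-mon-endpoints -o jsonpath='{.data.data}' \
    | tr ',' '\n' | cut -d= -f2 | cut -d: -f1

# On the provider: list the worker node InternalIPs (infra nodes also carry the worker role)
$ oc get nodes -l node-role.kubernetes.io/worker \
    -o jsonpath='{range .items[*]}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}'

# Every mon IP from the first command should appear in the second list. For step 8, only the
# node IP in the storage provider endpoint changes; the NodePort (31659 here) stays the same.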

Actual results:
After editing the rosa addon and changing the storage provider endpoint, the rook-ceph-mon-endpoints configmap data reverted to the old IP of the provider worker node.

Expected results:
After editing the rosa addon and changing the storage provider endpoint, the rook-ceph-mon-endpoints configmap data should point to the current provider worker node IPs, not the old IP.

Additional info:

Comment 1 Itzhak 2022-10-02 16:27:04 UTC
Additional info:

Provider versions:

OC version:
Client Version: 4.10.24
Server Version: 4.10.33
Kubernetes Version: v1.23.5+012e945

OCS version:
ocs-operator.v4.10.5                      OpenShift Container Storage   4.10.5            ocs-operator.v4.10.4                      Succeeded

Cluster version
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.33   True        False         5h6m    Cluster version is 4.10.33

Rook version:
rook: v4.10.5-0.985405daeba3b29a178cb19aa864324e65548a63
go: go1.16.12

Ceph version:
ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)


Consumer Versions:

OC version:
Client Version: 4.10.24
Server Version: 4.10.33
Kubernetes Version: v1.23.5+012e945

OCS version:
ocs-operator.v4.10.5                      OpenShift Container Storage   4.10.5            ocs-operator.v4.10.4                      Succeeded

Cluster version
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.33   True        False         3h51m   Cluster version is 4.10.33

Rook version:
rook: v4.10.5-0.985405daeba3b29a178cb19aa864324e65548a63
go: go1.16.12

Ceph version:
ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)

Comment 2 Itzhak 2022-10-06 14:34:59 UTC
Link to the provider Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/17142/
Link to the consumer Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/17144/

Please let me know if someone can look at it. 
Thanks

Comment 3 Parth Arora 2022-10-11 08:04:19 UTC
@ikave I see the cluster is down,
Do we have the rook operator logs for the same?

Comment 4 Itzhak 2022-10-18 07:55:35 UTC
No. I didn't get the rook operator logs. 
Should I test it again and get the rook operator logs of the consumer and provider?

Comment 6 Itzhak 2022-10-18 15:17:28 UTC
I checked the steps from comment https://bugzilla.redhat.com/show_bug.cgi?id=2131581#c0 again.
This time I didn't edit the rosa addon.

1. Check the rook-ceph-mon-endpoints configmap data on the consumer:
$ oc get cm rook-ceph-mon-endpoints -o yaml | grep data
data:
  data: b=10.0.156.26:6789
metadata:

2. Find the corresponding worker node with the IP above on the provider - 
Here is the worker node: "ip-10-0-156-26.ec2.internal".

3. Delete the worker node with the IP above (here it is 'ip-10-0-156-26.ec2.internal') as described in the doc https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.10/html/replacing_nodes/openshift_data_foundation_deployed_using_dynamic_devices#replacing-an-operational-aws-node-ipi_rhodf

4. Wait for a new worker node to come up and reach the "Ready" state.
5. Wait about 25 min (until the mon pods are running on the provider), and check the rook-ceph-mon-endpoints configmap data on the consumer:
$ oc get cm rook-ceph-mon-endpoints -o yaml | grep data
data:
  data: b=10.0.156.26:6789
metadata:


We can see from the output above that we still have the old worker node IP.

I also tried to edit the configmap manually, and it reverted to the old IP.
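
A hedged sketch of what such a manual edit could look like (hypothetical values: <NEW_NODE_IP> stands for one of the current provider worker node IPs, and the openshift-storage namespace is assumed). As noted above, the value gets reverted to the old IP shortly afterwards:

$ oc -n openshift-storage patch cm rook-ceph-mon-endpoints --type merge \
    -p '{"data":{"data":"b=<NEW_NODE_IP>:6789"}}'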

Comment 7 Parth Arora 2022-10-18 15:21:35 UTC
I remember there was a related bug:
https://bugzilla.redhat.com/show_bug.cgi?id=2086485. Can you make sure your version includes this BZ fix?

Comment 8 Itzhak 2022-10-18 16:09:39 UTC
Additional info: 

Consumer versions: 

OC version:
Client Version: 4.10.24
Server Version: 4.10.35
Kubernetes Version: v1.23.5+8471591

OCS version:
ocs-operator.v4.10.5                      OpenShift Container Storage   4.10.5            ocs-operator.v4.10.4                      Succeeded

Cluster version
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.35   True        False         3h18m   Cluster version is 4.10.35

Rook version:
rook: v4.10.5-0.985405daeba3b29a178cb19aa864324e65548a63
go: go1.16.12


Provider Versions: 

OC version:
Client Version: 4.10.24
Server Version: 4.10.35
Kubernetes Version: v1.23.5+8471591

OCS version:
ocs-operator.v4.10.5                      OpenShift Container Storage   4.10.5            ocs-operator.v4.10.4                      Succeeded

Cluster version
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.35   True        False         5h34m   Cluster version is 4.10.35

Rook version:
rook: v4.10.5-0.985405daeba3b29a178cb19aa864324e65548a63
go: go1.16.12

Ceph version:
ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)

Comment 12 Itzhak 2022-10-19 09:31:04 UTC
I have looked at the BZ https://bugzilla.redhat.com/show_bug.cgi?id=2086485, and the target version is ODF 4.11. 
Should we also check this BZ with ODF 4.11?

Comment 14 Itzhak 2022-10-19 12:27:53 UTC
Yes, I think so.

Comment 18 Itzhak 2022-11-23 16:31:11 UTC
After terminating a provider worker node, I tested this scenario again and found that deleting the ocs-operator pod resolved the issue. After deleting it, the "rook-ceph-mon-endpoints" updated to one of the provider worker nodes' IP, and the ceph health command worked again as expected.

This is the "rook-ceph-mon-endpoints" before deleting the ocs-operator pod"
$ oc get configmaps rook-ceph-mon-endpoints -o yaml
apiVersion: v1
data:
  data: c=10.0.170.158:6789
  mapping: '{}'
  maxMonId: "0"

And after deleting the pod: 
$ oc get configmaps rook-ceph-mon-endpoints -o yaml
apiVersion: v1
data:
  data: d=10.0.170.232:6789
  mapping: '{}'
  maxMonId: "0"


Here is the link to the test: https://os4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-odf-multicluster/1055/console.
As you can see in the test, the ceph health command was stuck at this line:
" 2022-11-23 17:18:31  15:18:31 - MainThread - ocs_ci.utility.utils - INFO - C[ikave-48-c2] - Executing command: oc -n openshift-storage exec rook-ceph-tools-7c7ccfb8d6-mv24p -- ceph health"
until I deleted the ocs-operator pod, after which it recovered.
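
A minimal sketch of that workaround (assuming the openshift-storage namespace, run against the cluster showing the stale configmap; the grep just selects the ocs-operator pod by name):

# Restart the ocs-operator by deleting its pod; the deployment recreates it
$ oc -n openshift-storage delete pod \
    $(oc -n openshift-storage get pods -o name | grep '^pod/ocs-operator-')

# Then confirm the mon endpoint data was refreshed
$ oc -n openshift-storage get cm rook-ceph-mon-endpoints -o jsonpath='{.data.data}{"\n"}'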


Provider Versions:

OC version:
Client Version: 4.10.24
Server Version: 4.10.40
Kubernetes Version: v1.23.12+7566c4d

OCS version:
ocs-operator.v4.10.5                      OpenShift Container Storage   4.10.5            ocs-operator.v4.10.4                      Succeeded

Cluster version
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.40   True        False         4h54m   Cluster version is 4.10.40

Rook version:
rook: v4.10.5-0.985405daeba3b29a178cb19aa864324e65548a63
go: go1.16.12

Ceph version:
ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)

Consumer versions:

OC version:
Client Version: 4.10.24
Server Version: 4.10.40
Kubernetes Version: v1.23.12+7566c4d

OCS version:
ocs-operator.v4.10.5                      OpenShift Container Storage   4.10.5            ocs-operator.v4.10.4                      Succeeded

Cluster version
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.40   True        False         6h37m   Cluster version is 4.10.40

Rook version:
rook: v4.10.5-0.985405daeba3b29a178cb19aa864324e65548a63
go: go1.16.12

Ceph version:
ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)

Comment 25 Dhruv Bindra 2023-01-20 10:00:27 UTC
To be tested in v2.0.11 (ODF v4.10.9)

Comment 26 Itzhak 2023-02-13 16:47:34 UTC
According to the test run https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-odf-multicluster/1338/console and the output: 
2023-02-13 18:31:40  16:31:40 - MainThread - ocs_ci.ocs.resources.storage_cluster - WARNING - C[ikave-60-c1] - The endpoint ip 10.0.12.82 of mon d is not found in the provider worker node ips
2023-02-13 18:31:40  16:31:40 - MainThread - ocs_ci.utility.utils - INFO - C[ikave-60-c1] - Going to sleep for 10 seconds before next iteration
2023-02-13 18:31:50  16:31:50 - MainThread - ocs_ci.utility.utils - ERROR - C[ikave-60-c1] - function inner failed to return expected value True after multiple retries during 180 second timeout
2023-02-13 18:31:50  16:31:50 - MainThread - ocs_ci.ocs.node - INFO - C[ikave-60-c1] - Try to restart the ocs-operator pod


This bug still exists in ODF 4.10.9.
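
For context, a rough bash reconstruction of the check the automation performs (an assumed equivalent, not the ocs-ci code; the consumer/provider kubeconfig context names are hypothetical): poll every 10 seconds, for up to 180 seconds, until every mon endpoint IP on the consumer appears among the provider worker node IPs.

$ for i in $(seq 1 18); do
    mon_ips=$(oc --context consumer -n openshift-storage get cm rook-ceph-mon-endpoints \
        -o jsonpath='{.data.data}' | grep -oE '([0-9]+\.){3}[0-9]+')
    worker_ips=$(oc --context provider get nodes -l node-role.kubernetes.io/worker \
        -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}')
    missing=0
    for ip in $mon_ips; do
        echo "$worker_ips" | grep -qFw "$ip" || missing=1
    done
    [ "$missing" -eq 0 ] && { echo "mon endpoints updated"; break; }
    sleep 10
  done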

Cluster versions:


OC version:
Client Version: 4.10.24
Server Version: 4.10.50
Kubernetes Version: v1.23.12+8a6bfe4

OCS version:
ocs-operator.v4.10.9                      OpenShift Container Storage   4.10.9            ocs-operator.v4.10.5                      Succeeded

Cluster version
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.50   True        False         4h25m   Cluster version is 4.10.50

Rook version:
rook: v4.10.9-0.b7b3a0044169fd9364683e2e4e6968361f8f3c08
go: go1.16.12

Ceph version:
ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)

CSV version:
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
mcg-operator.v4.10.9                      NooBaa Operator               4.10.9            mcg-operator.v4.10.8                      Succeeded
observability-operator.v0.0.20            Observability Operator        0.0.20            observability-operator.v0.0.19            Succeeded
ocs-operator.v4.10.9                      OpenShift Container Storage   4.10.9            ocs-operator.v4.10.5                      Succeeded
ocs-osd-deployer.v2.0.11                  OCS OSD Deployer              2.0.11-11         ocs-osd-deployer.v2.0.10                  Succeeded
odf-csi-addons-operator.v4.10.9           CSI Addons                    4.10.9            odf-csi-addons-operator.v4.10.5           Succeeded
odf-operator.v4.10.9                      OpenShift Data Foundation     4.10.9            odf-operator.v4.10.5                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0            ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.461-dbddf1f   Route Monitor Operator        0.1.461-dbddf1f   route-monitor-operator.v0.1.456-02ea942   Succeeded

Comment 27 Itzhak 2023-02-13 16:51:26 UTC
The above are the provider versions.

The consumer versions:

OC version:
Client Version: 4.10.24
Server Version: 4.10.50
Kubernetes Version: v1.23.12+8a6bfe4

OCS version:
ocs-operator.v4.10.9                      OpenShift Container Storage   4.10.9            ocs-operator.v4.10.8                      Succeeded

Cluster version
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.50   True        False         4h22m   Cluster version is 4.10.50

Rook version:
rook: v4.10.9-0.b7b3a0044169fd9364683e2e4e6968361f8f3c08
go: go1.16.12

Ceph version:
ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)


CSV versions:
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
mcg-operator.v4.10.9                      NooBaa Operator               4.10.9            mcg-operator.v4.10.8                      Succeeded
observability-operator.v0.0.20            Observability Operator        0.0.20            observability-operator.v0.0.19            Succeeded
ocs-operator.v4.10.9                      OpenShift Container Storage   4.10.9            ocs-operator.v4.10.8                      Succeeded
ocs-osd-deployer.v2.0.11                  OCS OSD Deployer              2.0.11-11         ocs-osd-deployer.v2.0.10                  Succeeded
odf-csi-addons-operator.v4.10.9           CSI Addons                    4.10.9            odf-csi-addons-operator.v4.10.8           Succeeded
odf-operator.v4.10.9                      OpenShift Data Foundation     4.10.9            odf-operator.v4.10.8                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0            ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.461-dbddf1f   Route Monitor Operator        0.1.461-dbddf1f   route-monitor-operator.v0.1.456-02ea942   Succeeded

Comment 28 Itzhak 2023-02-13 17:07:26 UTC
The rook-ceph-mon-endpoints configmap did get updated, but only after restarting the ocs-operator pod.

Comment 29 Mudit Agarwal 2023-02-22 02:35:16 UTC
Parth, please take a look.

Comment 30 Parth Arora 2023-02-22 10:19:27 UTC
Is this PR https://github.com/red-hat-storage/ocs-operator/pull/1891 in the build you are testing?

Comment 31 Itzhak 2023-02-27 10:42:21 UTC
Yes, I think the changes from the PR above are included in the build used by the automation test,
because the test I ran above was a month after the PR was merged.

Comment 32 Parth Arora 2023-02-27 12:54:35 UTC
Can you share the ocs and rook operator pod logs before and after restart?

It seems the OCS operator didn't get a reconciliation trigger, which is why rook also didn't reconcile.

Comment 33 Itzhak 2023-02-28 08:44:37 UTC
Okay, I will rerun the test and save the ocs and rook operator logs.
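
A hedged sketch of how those logs could be captured around the restart (assuming the openshift-storage namespace; file names are arbitrary):

# Before the restart
$ oc -n openshift-storage logs deploy/ocs-operator > ocs-operator-before.log
$ oc -n openshift-storage logs deploy/rook-ceph-operator > rook-ceph-operator-before.log

# Restart the ocs-operator pod (assumes the pod carries the name=ocs-operator label;
# otherwise delete it by its generated pod name), then capture the logs again
$ oc -n openshift-storage delete pod -l name=ocs-operator
$ oc -n openshift-storage logs deploy/ocs-operator > ocs-operator-after.log
$ oc -n openshift-storage logs deploy/rook-ceph-operator > rook-ceph-operator-after.log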

Comment 37 Itzhak 2023-02-28 18:06:42 UTC
In the previous comments I attached the ocs-operator logs before and after the restart, as well as the rook-ceph-operator logs.

Consumer versions: 

OC version:
Client Version: 4.10.24
Server Version: 4.11.27
Kubernetes Version: v1.24.6+263df15

OCS version:
ocs-operator.v4.10.9                      OpenShift Container Storage   4.10.9            ocs-operator.v4.10.8                      Succeeded

Cluster version
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.27   True        False         8h      Cluster version is 4.11.27

Rook version:
rook: v4.10.9-0.b7b3a0044169fd9364683e2e4e6968361f8f3c08
go: go1.16.12

Ceph version:
ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)


CSV versions:

NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
mcg-operator.v4.10.10                     NooBaa Operator               4.10.10           mcg-operator.v4.10.9                      Succeeded
observability-operator.v0.0.20            Observability Operator        0.0.20            observability-operator.v0.0.19            Succeeded
ocs-operator.v4.10.9                      OpenShift Container Storage   4.10.9            ocs-operator.v4.10.8                      Succeeded
ocs-osd-deployer.v2.0.11                  OCS OSD Deployer              2.0.11-11         ocs-osd-deployer.v2.0.10                  Succeeded
odf-csi-addons-operator.v4.10.9           CSI Addons                    4.10.9            odf-csi-addons-operator.v4.10.8           Succeeded
odf-operator.v4.10.9                      OpenShift Data Foundation     4.10.9            odf-operator.v4.10.8                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0            ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.461-dbddf1f   Route Monitor Operator        0.1.461-dbddf1f   route-monitor-operator.v0.1.456-02ea942   Succeeded

Comment 39 Mudit Agarwal 2023-03-06 06:31:51 UTC
Parth, any update on this?

Comment 42 Itzhak 2023-03-06 14:35:58 UTC
After the new OCS worker node is up and all the OCS pods are running, the test waits 3 minutes for rook-ceph-mon-endpoints to be updated.
When I checked manually a month ago, waiting longer didn't help.

Maybe we can reproduce it by only removing a mon IP from rook-ceph-mon-endpoints - I haven't tried it yet.
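
If that route is tried, here is a hypothetical (untested) sketch of dropping one entry from the comma-separated data field, assuming the openshift-storage namespace:

$ current=$(oc -n openshift-storage get cm rook-ceph-mon-endpoints -o jsonpath='{.data.data}')
$ trimmed=$(echo "$current" | tr ',' '\n' | tail -n +2 | paste -sd, -)
$ oc -n openshift-storage patch cm rook-ceph-mon-endpoints --type merge \
    -p "{\"data\":{\"data\":\"$trimmed\"}}"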

I will add additional logs in the following comment, including the storage cluster state in the test.

Comment 45 Itzhak 2023-03-08 12:36:15 UTC
Okay, I can try adding more time. How much time do you think we should wait for the 'rook-ceph-mon-endpoints' to be updated?

Comment 46 Itzhak 2023-03-30 13:12:45 UTC
From the comment https://bugzilla.redhat.com/show_bug.cgi?id=2086485#c15, the fix will be ready in ODF 4.11.

