Bug 1789305 - [DOCS] Need description about how to move the monitoring solution for cluster in the vSphere environment
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Documentation
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: low
Target Milestone: ---
Target Release: 4.2.z
Assignee: Maxim Svistunov
QA Contact: Junqi Zhao
Docs Contact: Vikram Goyal
URL:
Whiteboard:
Depends On:
Blocks: 1807852
 
Reported: 2020-01-09 10:10 UTC by Masaki Furuta ( RH )
Modified: 2023-09-07 21:24 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1807852 (view as bug list)
Environment:
Last Closed: 2020-04-09 03:25:52 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1807852 0 medium CLOSED nodeSelector for openshift-state-metrics pod should be configurable. 2023-09-07 22:05:49 UTC

Internal Links: 1808183

Description Masaki Furuta ( RH ) 2020-01-09 10:10:33 UTC
Document URL: 

  https://docs.openshift.com/container-platform/4.2/machine_management/creating-infrastructure-machinesets.html#infrastructure-moving-monitoring_creating-infrastructure-machinesets

Section Number and Name: 

  Creating infrastructure MachineSets - Moving the monitoring solution | Machine management | OpenShift Container Platform 4.2

Describe the issue: 

      Q1:
        I was able to move the monitoring pods (except openshift-state-metrics-7f4bdfbdf9-qpdn4) to the infra nodes by following [1], but is this procedure supported and applicable in a vSphere environment without machinesets?

        [cloud-user@rhcl-0 ~]$ oc get pod -n openshift-monitoring -o wide
        NAME                                           READY   STATUS    RESTARTS   AGE     IP               NODE                         NOMINATED NODE   READINESS GATES
        alertmanager-main-0                            3/3     Running   0          79s     10.130.2.17      infra-0.test4.example.com    <none>           <none>
        alertmanager-main-1                            3/3     Running   0          99s     10.131.2.13      infra-1.test4.example.com    <none>           <none>
        alertmanager-main-2                            3/3     Running   0          2m20s   10.130.2.13      infra-0.test4.example.com    <none>           <none>
        cluster-monitoring-operator-6bf7c89799-m5jn5   1/1     Running   0          7d1h    10.130.0.19      master-2.test4.example.com   <none>           <none>
        grafana-59dffb4f5-8hsvn                        2/2     Running   0          2m19s   10.130.2.14      infra-0.test4.example.com    <none>           <none>
        kube-state-metrics-68bc45c96b-98pg4            3/3     Running   0          2m33s   10.131.2.9       infra-1.test4.example.com    <none>           <none>
        node-exporter-6xsgm                            2/2     Running   0          7d1h    172.16.231.188   master-1.test4.example.com   <none>           <none>
        node-exporter-77mqn                            2/2     Running   0          7d1h    172.16.231.195   worker-1.test4.example.com   <none>           <none>
        node-exporter-95kdc                            2/2     Running   0          7d1h    172.16.231.187   master-0.test4.example.com   <none>           <none>
        node-exporter-9n49n                            2/2     Running   2          25h     172.16.231.190   infra-0.test4.example.com    <none>           <none>
        node-exporter-kq6q2                            2/2     Running   0          7d1h    172.16.231.189   master-2.test4.example.com   <none>           <none>
        node-exporter-kz9mh                            2/2     Running   2          25h     172.16.231.191   infra-1.test4.example.com    <none>           <none>
        node-exporter-l2qtm                            2/2     Running   0          7d1h    172.16.231.196   worker-2.test4.example.com   <none>           <none>
        node-exporter-s68xg                            2/2     Running   0          7d1h    172.16.231.194   worker-0.test4.example.com   <none>           <none>
        openshift-state-metrics-7f4bdfbdf9-qpdn4       3/3     Running   0          7d1h    10.131.0.4       worker-0.test4.example.com   <none>           <none>
        prometheus-adapter-5546dc5fb4-27v8q            1/1     Running   0          2m      10.130.2.15      infra-0.test4.example.com    <none>           <none>
        prometheus-adapter-5546dc5fb4-rlb42            1/1     Running   0          2m20s   10.131.2.11      infra-1.test4.example.com    <none>           <none>
        prometheus-k8s-0                               6/6     Running   1          87s     10.130.2.16      infra-0.test4.example.com    <none>           <none>
        prometheus-k8s-1                               6/6     Running   1          2m17s   10.131.2.12      infra-1.test4.example.com    <none>           <none>
        prometheus-operator-b95584fbb-7qzdc            1/1     Running   0          2m33s   10.130.2.12      infra-0.test4.example.com    <none>           <none>
        telemeter-client-64955f868f-rm7lp              3/3     Running   0          2m24s   10.131.2.10      infra-1.test4.example.com    <none>           <none>
        [cloud-user@rhcl-0 ~]$ 

      
      Q2:
        Please tell me why the "openshift-state-metrics-7f4bdfbdf9-qpdn4" pod does not move to an infrastructure node.
        - If there is a problem with the procedure I followed, please tell me how to fix it.
        - If there is a reason it should not move to an infrastructure node, please let us know.

      Q3:
        I suspect that this procedure is not the correct one for a vSphere environment (without machinesets) and that there is something more appropriate.
        I could not find any other documentation; would you please let me know if there are more appropriate instructions?

Suggestions for improvement: 

    I wrote and provided the following answer to the customer, based on the verification by our Sr. SME (please see Additional information).
    This should be doable; our Sr. SME reviewed my proposed answer and confirmed that it looks correct and reasonable.
    Please explicitly describe the exact steps and expected results for moving the monitoring solution.

    >       Q1:
    >         I was able to move the monitoring pods (except openshift-state-metrics-7f4bdfbdf9-qpdn4) to the infra nodes by following [1], but is this procedure supported and applicable in a vSphere environment without machinesets?
    ...
    
    Yes, this is supported without machinesets because this is a vSphere environment. Please note that creating the machine config pool is also very important (see the sketch below).
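    For illustration only, a minimal sketch of an "infra" machine config pool, assuming the nodes carry the node-role.kubernetes.io/infra label (the exact manifest should follow the linked documentation and the KCS article in Additional information):

      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfigPool
      metadata:
        name: infra
      spec:
        machineConfigSelector:
          matchExpressions:
            - key: machineconfiguration.openshift.io/role
              operator: In
              values: [worker, infra]
        nodeSelector:
          matchLabels:
            node-role.kubernetes.io/infra: ""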
    
    > ...
    >       
    >       Q2:
    >         Please tell me why the "openshift-state-metrics-7f4bdfbdf9-qpdn4" pod does not move to an infrastructure node.
    >         - If there is a problem with the procedure I followed, please tell me how to fix it.
    >         - If there is a reason it should not move to an infrastructure node, please let us know.
    
    During testing, our Sr. SME found that some pods had to be deleted once so that they would be rescheduled onto the infra nodes, but this works and is supported, as answered for Q1 (see the example below).
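    For example (the pod name is taken from the customer output above; deleting all pods in the namespace is also an option, since their deployments recreate them on nodes matching the new nodeSelector):

      $ oc -n openshift-monitoring delete pod openshift-state-metrics-7f4bdfbdf9-qpdn4
      # or recreate every monitoring pod in one go, as our Sr. SME did:
      $ oc -n openshift-monitoring delete pod --all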
    
    > 
    >       Q3:
    >         I suspect that this procedure is not the correct one for a vSphere environment (without machinesets) and that there is something more appropriate.
    >         I could not find any other documentation; would you please let me know if there are more appropriate instructions?
    
    This is supported and doable (the pods just need to be deleted once so that they are rescheduled).

Additional information: 

  Here is the result that our Sr. SME (rvanderp) verified on sfdc#02546889.


    There is a procedure in the article [0] below that describes the process of creating an infra node. The article you linked covers the creation of the new machine config pool, which is very important as well. You might try deleting all the pods out of the openshift-monitoring namespace. I ran through the process this morning of creating infra nodes and configuring the monitoring stack to schedule on the infra nodes. Some pods did not reschedule until I deleted them.
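    For reference, a hypothetical example of labeling an existing vSphere worker as an infra node (the node name is illustrative; the KCS article [0] below has the full procedure):

      $ oc label node worker-0 node-role.kubernetes.io/infra=
      # optionally remove the worker role so regular workloads no longer schedule there:
      $ oc label node worker-0 node-role.kubernetes.io/worker-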
    
    References:
    [0] - https://access.redhat.com/solutions/4287111
    
    [1] - configmap
    [shadowman@gss-ose-3-openshift ~]$ oc -n openshift-monitoring get configmap cluster-monitoring-config -o yaml
    apiVersion: v1
    data:
      config.yaml: |
        prometheusOperator:
          nodeSelector:
            node-role.kubernetes.io/infra: ""
        prometheusK8s:
          nodeSelector:
            node-role.kubernetes.io/infra: ""
        alertmanagerMain:
          nodeSelector:
            node-role.kubernetes.io/infra: ""
        kubeStateMetrics:
          nodeSelector:
            node-role.kubernetes.io/infra: ""
        grafana:
          nodeSelector:
            node-role.kubernetes.io/infra: ""
        telemeterClient:
          nodeSelector:
            node-role.kubernetes.io/infra: ""
        k8sPrometheusAdapter:
          nodeSelector:
            node-role.kubernetes.io/infra: ""
    kind: ConfigMap
    metadata:
      creationTimestamp: "2019-12-31T17:16:26Z"
      name: cluster-monitoring-config
      namespace: openshift-monitoring
      resourceVersion: "3739349"
      selfLink: /api/v1/namespaces/openshift-monitoring/configmaps/cluster-monitoring-config
      uid: 46b4d51a-2bf1-11ea-a09e-5254002ffd64
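
    If this ConfigMap does not exist yet, it can be created from a local file (the file name here is only an assumption) or edited in place, e.g.:

    $ oc -n openshift-monitoring create configmap cluster-monitoring-config --from-file=config.yaml
    $ oc -n openshift-monitoring edit configmap cluster-monitoring-config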
    
    
    [2] - oc get nodes
    
    [shadowman@gss-ose-3-openshift ~]$ oc get nodes
    NAME       STATUS   ROLES           AGE   VERSION
    master-0   Ready    master,worker   9d    v1.14.6+cebabbf4a
    master-1   Ready    master,worker   9d    v1.14.6+cebabbf4a
    master-2   Ready    master,worker   9d    v1.14.6+cebabbf4a
    worker-0   Ready    infra           9d    v1.14.6+cebabbf4a
    worker-1   Ready    infra           9d    v1.14.6+cebabbf4a
    worker-2   Ready    worker          9d    v1.14.6+cebabbf4a
    
    [3] - oc get pods -o wide
    
    [shadowman@gss-ose-3-openshift ~]$ oc get pods -o wide
    NAME                                           READY   STATUS    RESTARTS   AGE     IP               NODE       NOMINATED NODE   READINESS GATES
    alertmanager-main-0                            3/3     Running   0          5m20s   10.128.2.54      worker-0   <none>           <none>
    alertmanager-main-1                            3/3     Running   0          5m46s   10.131.0.64      worker-1   <none>           <none>
    alertmanager-main-2                            3/3     Running   0          6m12s   10.131.0.61      worker-1   <none>           <none>
    cluster-monitoring-operator-598656fd74-79ndn   1/1     Running   0          6m49s   10.129.0.131     master-2   <none>           <none>
    grafana-77bf6d8bf-lvkjx                        2/2     Running   0          6m10s   10.128.2.52      worker-0   <none>           <none>
    kube-state-metrics-6b9f864976-f2ft9            3/3     Running   0          6m20s   10.128.2.48      worker-0   <none>           <none>
    node-exporter-8wd5w                            2/2     Running   0          6m35s   192.168.100.10   master-0   <none>           <none>
    node-exporter-dn77v                            2/2     Running   0          6m41s   192.168.100.21   worker-1   <none>           <none>
    node-exporter-fqpxr                            2/2     Running   0          6m38s   192.168.100.22   worker-2   <none>           <none>
    node-exporter-k6gcj                            2/2     Running   0          6m37s   192.168.100.20   worker-0   <none>           <none>
    node-exporter-lzfv5                            2/2     Running   0          6m39s   192.168.100.12   master-2   <none>           <none>
    node-exporter-xskn7                            2/2     Running   0          6m37s   192.168.100.11   master-1   <none>           <none>
    openshift-state-metrics-65488cbc6f-2znkm       3/3     Running   0          6m45s   10.131.0.59      worker-1   <none>           <none>
    prometheus-adapter-79884dd577-4bmmd            1/1     Running   0          5m56s   10.131.0.63      worker-1   <none>           <none>
    prometheus-adapter-79884dd577-sx4zt            1/1     Running   0          6m11s   10.128.2.51      worker-0   <none>           <none>
    prometheus-k8s-0                               6/6     Running   1          5m18s   10.128.2.55      worker-0   <none>           <none>
    prometheus-k8s-1                               6/6     Running   1          5m40s   10.131.0.65      worker-1   <none>           <none>
    prometheus-operator-7494cfc564-74rtp           1/1     Running   0          6m21s   10.128.2.47      worker-0   <none>           <none>
    telemeter-client-67c86784bd-m9fsf              3/3     Running   0          6m14s   10.128.2.49      worker-0   <none>           <none>

Comment 2 Masaki Furuta ( RH ) 2020-03-05 07:11:46 UTC
(In reply to Maxim Svistunov from comment #1)
<...>
> Unless I am mistaken, the rest of the customer's questions have been
> positively answered by rvanderp and do not require changes in documentation.
<...>

Hi Maxim Svistunov,

Thank you for your help and support on this BZ.

In addition to what Richard Vanderpool kindly answered on sfdc#02546889, I think we also need a note here along the lines of what Pawel Krupa suggested in https://bugzilla.redhat.com/show_bug.cgi?id=1807852#c6 (or should we instead cover everything related to openshiftStateMetrics in Bug 1808183?).

Anyway, I'm leaving this comment as a cross-reference between this BZ and Bug 1808183.

  - https://bugzilla.redhat.com/show_bug.cgi?id=1807852#c6

    Pawel Krupa 2020-02-27 23:17:04 JST
    This feature is supported since 4.2 (or maybe earlier). You need to add:
    
    openshiftStateMetrics:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
    
    to the cluster-monitoring-config ConfigMap, in the same way as for the other components.
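
For example, a sketch of applying that suggestion (the snippet goes under data.config.yaml next to the other components, and the openshift-state-metrics pod may need to be deleted once afterwards so it is rescheduled):

  $ oc -n openshift-monitoring edit configmap cluster-monitoring-config
  # append under config.yaml, next to the other components:
  #   openshiftStateMetrics:
  #     nodeSelector:
  #       node-role.kubernetes.io/infra: ""
  # then delete the running openshift-state-metrics pod once so it is rescheduled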

Thank you,

BR,
Masaki

