Bug 1909179 - [DOCS] Cluster Logging scaling down supportability
Summary: [DOCS] Cluster Logging scaling down supportability
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Documentation
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: low
Target Milestone: ---
Target Release: 4.6.z
Assignee: Latha S
QA Contact: Anping Li
Docs Contact: Latha S
URL:
Whiteboard: logging-exploration
Depends On:
Blocks:
 
Reported: 2020-12-18 14:09 UTC by Masaki Furuta ( RH )
Modified: 2022-09-27 12:46 UTC (History)
14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-09-27 12:46:14 UTC
Target Upstream Version:
Embargoed:
rdlugyhe: needinfo-



Description Masaki Furuta ( RH ) 2020-12-18 14:09:43 UTC
Document URL: 

  https://docs.openshift.com/container-platform/4.6/logging/config/cluster-logging-log-store.html#cluster-logging-elasticsearch-scaledown_cluster-logging-store

Section Number and Name: 

  Scaling down Elasticsearch pods - Configuring the log store - Configuring your cluster logging deployment | Logging | OpenShift Container Platform 4.6

Describe the issue: 

  How, concretely, should we scale down Elasticsearch (ES)?
  In the Managed state, NEC would normally expect the ClusterLogging Operator and Elasticsearch Operator to revert our operation automatically after we scale down.
  In that case, will Red Hat allow us to move them to the Unmanaged state for maintenance?

  This time, the support team and NEC confirmed that the change is not reverted.
  If this is expected behaviour and modifying the deployments is allowed, NEC requests that Red Hat document the concrete steps in the OCP 4 manual, since most users might not realize that they need to set the Unmanaged state before scaling down.
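
  (For reference, putting the operators into the Unmanaged state would look roughly like the commands below. This is only a sketch: the openshift-logging namespace, the ClusterLogging instance name "instance", and the Elasticsearch resource name "elasticsearch" are assumptions based on a default Cluster Logging deployment.)

    ~~~
    $ oc -n openshift-logging patch clusterlogging/instance --type merge \
        -p '{"spec":{"managementState":"Unmanaged"}}'
    $ oc -n openshift-logging patch elasticsearch/elasticsearch --type merge \
        -p '{"spec":{"managementState":"Unmanaged"}}'
    ~~~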

Suggestions for improvement: 

  A Red Hat support representative answered:
    ~~~
    We can directly scale down an ES Deployment, perform the storage increase step, and then scale it back up; please do this for each ES pod one by one.

      $ oc get deployment
      $ oc scale deploy/<ES_deployment_name> --replicas=0

    To scale back to 1, please execute:

      $ oc scale deploy/<ES_deployment_name> --replicas=1
    ~~~
 
  But do we really not need to move the ClusterLogging/Elasticsearch operators to the Unmanaged state?
  In our understanding, the Elasticsearch Operator monitors the ES cluster, so if we scale down the ES pods manually, it will report an error.
  Do we really not need to take care of that?

Additional information: 

  The Red Hat support representative also answered the above question as follows:
     ~~~
    Well, yes, I agree with you. The ES Operator manages it. But in this case I checked scaling down the Deployments, and they were scaled down without the ES Operator scaling them back up.

    So we can scale them down without putting the ES into the Unmanaged state.

    We should lean toward not making it Unmanaged, as there are other things the ES Operator manages as well and we should not leave them unmanaged, so it is better to do the work with the ES Operator remaining in the Managed state.
     ~~~

    But NEC still has a concern.

  The Elasticsearch Operator periodically checks the status of the Elasticsearch cluster and updates the status of the Elasticsearch object.
  We also know that the operator has a feature to recover the Elasticsearch cluster by updating its deployment objects.

  We do not know whether it is a bug that the Elasticsearch Operator does not attempt to revert changes made manually.
  However, at least we know that the deployment objects that were changed were created by the Elasticsearch Operator.

  So, even though the operator does not revert the change for now, it might later be modified to sync the configuration of the deployment objects to the spec of the Elasticsearch object.

  We cannot agree that changing the configuration without the Unmanaged state is safe, as long as it has not been clearly stated as expected behaviour.

Comment 1 Masaki Furuta ( RH ) 2021-01-04 05:49:09 UTC
 === In Red Hat Customer Portal Case 02638426 ===
--- Comment by NEC, OpenShift engineer on 12/23/2020 11:03 AM ---

Dear Anshul,

Thank you for your update.
However, we would like to know the answer to Bug 1909179, since it relates to the official support policy for ClusterLogging.

Also, if Red Hat supports the steps you mentioned, Red Hat should describe them in the manual,
since customers might otherwise try to scale down ES with different steps (e.g. modifying nodeCount of the ClusterLogging instance with the oc edit command).
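
(As an illustration, that kind of edit would look roughly like the following. The instance name "instance", the openshift-logging namespace, and the field path are assumptions based on a default Cluster Logging deployment; the value 2 is only an example.)

~~~
$ oc -n openshift-logging edit clusterlogging instance
# then change, for example:
#   spec:
#     logStore:
#       elasticsearch:
#         nodeCount: 2
~~~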

Best Regards,
Masaki Hatada

Comment 2 Masaki Furuta ( RH ) 2021-01-04 06:09:32 UTC
(In reply to Masaki Furuta from comment #1)

Hello Vikram Goyal (and whoever may be assigned to work on this),

Regarding the issue and concern reported by NEC on this BZ, could you please clarify it specifically?

As we know, Bug 1846458 and https://github.com/openshift/openshift-docs/pull/27773 have already been merged, so I think the behaviour NEC is asking about should be supported. But could you please also clear up the doubt NEC has raised on this BZ?

NEC strongly wants to clarify whether we can scale down without setting the "Unmanaged" state, so that their many end customers can carry this out completely safely.

If that is the correct understanding, NEC has asked us not to leave it as an undocumented specification, but wants Red Hat to state it explicitly in the product documentation.

I am grateful for your help and clarification,

Thank you so much,

BR,
Masaki

Comment 3 Rolfe Dlugy-Hegwer 2021-05-17 16:54:52 UTC
Jeff, could you assign someone to review and comment on this?

Comment 10 Jeff Cantrill 2022-01-19 15:56:20 UTC
It was asked here whether this was sufficient: https://bugzilla.redhat.com/show_bug.cgi?id=1909179#c5, with no response AFAIK.  It's not clear to me why the customer is asking about scaling down an ES cluster and for what purpose.  Knowing that would help us understand the intention and what needs to be documented or corrected.

Comment 11 Masaki Hatada 2022-01-20 03:04:01 UTC
Dear Jeff,

Maybe https://bugzilla.redhat.com/show_bug.cgi?id=1909179#c5 was a private comment; we couldn't see it.

Let me explain the background below.

In the first place, we wanted to know how to extend the PV of each Elasticsearch pod, as we described in Case 02638426.
We attached vSphere VMDK PVs using static provisioning, and the vSphere in-tree driver does not support dynamic resizing.
So, in order to extend a PV, we need to scale down the Elasticsearch pods temporarily and recreate their PVs/PVCs one by one with a new size.

We verified that the PVs can be extended with the following steps (a command-level sketch follows the list):

  1. Set ClusterLogging to the Unmanaged state
  2. Set the Elasticsearch object to the Unmanaged state
  3. Scale down one of the Elasticsearch pods with oc scale --replicas=0 deployments elastic...
  4. Delete the PVC and PV of that pod
  5. Extend the volume size
     I. Extend the VMDK size
     II. Attach it to one of the worker nodes, then run xfs_growfs
     III. Detach it from the node
  6. Create a PVC and PV with the same names, but with the new size
  7. Scale the Elasticsearch pod back up with oc scale --replicas=1 deployments elastic...
  8. Repeat steps 3 to 7 for the other Elasticsearch pods
  9. Set the Elasticsearch object back to the Managed state
  10. Set ClusterLogging back to the Managed state
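
Here is a command-level sketch of steps 3, 4, 6 and 7 for a single Elasticsearch node. The deployment name elasticsearch-cdm-example-1, the PVC name, and the manifest file names are hypothetical placeholders, and the openshift-logging namespace is assumed; substitute the real names from your cluster.

  # Step 3: stop one ES node
  $ oc -n openshift-logging scale deployment/elasticsearch-cdm-example-1 --replicas=0

  # Step 4: delete its PVC and PV
  $ oc -n openshift-logging delete pvc elasticsearch-elasticsearch-cdm-example-1
  $ oc delete pv <pv-name>

  # Step 6: recreate the PV and PVC with the same names but the new size
  $ oc create -f resized-pv.yaml
  $ oc -n openshift-logging create -f resized-pvc.yaml

  # Step 7: start the ES node again
  $ oc -n openshift-logging scale deployment/elasticsearch-cdm-example-1 --replicas=1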

We had thought that we must set ClusterLogging/Elasticsearch to the Unmanaged state, since the Elasticsearch pods are managed by those operators.
However, in Case 02638426 the Red Hat support team said that we don't need to set them to the Unmanaged state.

Does Red Hat really allow users to scale down Elasticsearch pods temporarily without setting the Unmanaged state?
If Red Hat allows it, there is no problem.
But as I mentioned before, the Elasticsearch pods are managed by the ClusterLogging/Elasticsearch operators.
As a general principle, we think it is not good to modify the configuration manually without setting them to the Unmanaged state.

We would like to know the opinion of a Red Hat engineer about the above.
In addition, we would like Red Hat to document in the OpenShift manual how users should extend the PVs used by Logging, if possible.

Best Regards,
Masaki Hatada

Comment 13 Masaki Hatada 2022-02-04 01:06:06 UTC
Dear Jeff,

Hara-san has made Comments 4 and 5 public, so now I can see them.

Unfortunately, they do not match our request.
Our request is to scale down each Elasticsearch pod one by one, temporarily, in order to modify their PV configurations.
Comment 4, on the other hand, gives information about scaling down the Elasticsearch cluster permanently.

Yes, we can reduce the size of the Elasticsearch cluster by reducing the nodeCount of the ClusterLogging object.
However, if we reduce it, the same Elasticsearch pod is always the one removed, isn't it?
For example, if the current nodeCount is 3 and you reduce it to 2, elasticsearch-xxx-2 is always the one removed. So we can modify the PV of elasticsearch-xxx-2, but we cannot modify the PVs of elasticsearch-xxx-0 and 1.

To modify the PV configurations of elasticsearch-xxx-0, 1 and 2, we need to scale them down one by one.
I think the procedure I wrote in Comment 11 is the only way to do that.

Best Regards,
Masaki Hatada

Comment 17 Jeff Cantrill 2022-02-22 16:16:19 UTC
Moving this to the docs team for review and scheduling. In general, https://bugzilla.redhat.com/show_bug.cgi?id=1909179#c11 looks to be accurate, but IMO the bits about expanding the disk should be documented more generically, as they apply to any pod that relies on storage.  The storage team may have a better recommendation for disk expansion.  The disk expansion part is likely implementation dependent.

In summary, if you want minimal service interruption, then you would do as identified and scale the ES nodes individually.  Note that your ES cluster's ability to retain more logs may be unaffected by this action.  There is no feature in our product offering that moves indices into a cold state, which would allow those indices to free up memory.  The storage capacity is restricted by the amount of memory available to the node.

Comment 18 Kazuhisa Hara 2022-06-08 02:03:47 UTC
Hello @cbremble and team,

The RHDEVDOCS-3093 issue linked to this BZ is already closed, but this BZ appears to be still unresolved.
Could you please proceed with documenting Comment 11, using the insights provided in Comment 17? (Or could you please assign the right person?)

Thanks,
Kazuhisa

Comment 19 landerso 2022-06-13 14:16:54 UTC
(In reply to Kazuhisa Hara from comment #18)
> Hello @cbremble and team,
> 
> The RHDEVDOCS-3093 linked to this BZ is already closed, but this BZ seems to
> be still unresolved.
> Would you please process to document Comment 11 using the insights provided
> in Comment 17? (or could you please assign the right person?)
> 
> Thanks,
> Kazuhisa

Per input provided by Jeff Cantrill, the documentation requested is widely applicable to all pod types, and as such the storage team is best equipped to fulfill this request.

