Bug 1944580
| Summary: | Increase mon DB allocated space (fresh deployment and upgrade to 4.8+) | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Orit Wasserman <owasserm> |
| Component: | ocs-operator | Assignee: | Nitin Goyal <nigoyal> |
| Status: | VERIFIED | QA Contact: | Itzhak <ikave> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 4.8 | CC: | ebenahar, jdurgin, lithomas, muagarwa, nberry, nigoyal, pdhange, sostapov, tnielsen |
| Target Milestone: | --- | Keywords: | AutomationBackLog |
| Target Release: | OCS 4.8.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
Comment 2
Travis Nielsen
2021-04-01 15:06:19 UTC
Before we increase the size of the mon PVs, a couple questions:

1. Why haven't we seen an issue with the existing 10G mon size? I don't recall hearing any issues with this previously. Is Ceph already handling the compaction when it sees the space is running low?
2. What new limit will be sufficient? Any limit we choose seems arbitrary.

Comment 3
Orit Wasserman

(In reply to Travis Nielsen from comment #2)
> 1. Why haven't we seen an issue with the existing 10G mon size? Is ceph
> already handling the compaction when it sees the space is running low?

I suspect this is related to the load on the cluster and its size. In https://bugzilla.redhat.com/show_bug.cgi?id=1941939 the mon datastore is filling up during OSD slow ops, and it seems the trimming was too slow.

> 2. What new limit will be sufficient? Any limit we choose seems arbitrary.

Right, we should set the limit according to the cluster size. @Josh, any guidelines? Since users start with a 3-node cluster and expand, let's start with a size that fits most cases and expand the datastore if needed. I will open a separate ticket to handle mon datastore expansion.

Josh Durgin

(In reply to Orit Wasserman from comment #3)
> Right, we should set the limit according to the cluster size.
> @Josh, any guidelines?

What's the cost function for PV size vs. the risk of downtime for the cluster? We recommend at least 50GB for RHCS: https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/4/html/hardware_guide/minimum-hardware-recommendations_hw

Generally, any cluster we've seen would be fine with a few hundred GB.

Itzhak

Hi. Can we look at the ocs-ci runs from the weekend instead of deploying a new cluster and testing it? For example, in this deployment on AWS, https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1157/, we can see in the logs http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j009ai3c333-t4bn/j009ai3c333-t4bn_20210619T042251/logs/deployment_1624077149/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-d42be5905fd3e8ea280695a48b162a1ff44706a25a4815edbf75499a0c8f516b/cluster-scoped-resources/oc_output/get_pv that the mon PV size is 50Gi.

Two more questions:

- Besides the different platforms we need to test, do we also need to check each deployment type (UPI, IPI, LSO)?
- Which message do we expect to find in the rook-ceph pod logs when upgrading from OCS 4.7 to 4.8? Can you provide more details about the message content?

Okay, thanks Nirit. Sorry, I was confused with someone else... Thanks for the information, Nitin.

I checked a few executions with 4.8 clusters (without the upgrade) and checked the mon PVC size (a live-cluster version of the same check is sketched below).
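A minimal sketch of that live check, assuming the default openshift-storage namespace and Rook's usual app=rook-ceph-mon PVC label (both are assumptions to verify on your own cluster):

```sh
# List the mon PVCs with their requested size (expected: 50Gi on a fresh 4.8 cluster).
oc get pvc -n openshift-storage -l app=rook-ceph-mon

# Or print just the name and the bound capacity of each mon PVC:
oc get pvc -n openshift-storage -l app=rook-ceph-mon \
  -o custom-columns=NAME:.metadata.name,SIZE:.status.capacity.storage
```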
Here are the results:

- VSPHERE UPI Dynamic cluster: Jenkins job https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/3487/, pv output in the logs at http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ikave-vc7-vm48/ikave-vc7-vm48_20210606T090127/logs/deployment_1622971392/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-5be86e9bbc0750c9d042fdb88007d3259c344b33176854626282dd7b52eb8f93/cluster-scoped-resources/oc_output/get_pv. Mon PVC size is 50Gi.
- VSPHERE UPI FIPS 1AZ RHCOS VSAN 3M 6W cluster: Jenkins job https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/989/, pv output in the logs at http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j051vuf1cs36-t4a/j051vuf1cs36-t4a_20210603T063238/logs/deployment_1622702792/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-4cf9b04bc34bccb6fd801e42867308aee3dec18987d8507f2b58552d6d45dc19/cluster-scoped-resources/oc_output/get_pv. Mon PVC size is 50Gi.
- AWS IPI 3AZ RHCOS cluster: Jenkins job https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1157/, pv output in the logs at http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j009ai3c333-t4bn/j009ai3c333-t4bn_20210619T042251/logs/deployment_1624077149/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-d42be5905fd3e8ea280695a48b162a1ff44706a25a4815edbf75499a0c8f516b/cluster-scoped-resources/oc_output/get_pv. Mon PVC size is 50Gi.
- AWS UPI 3AZ RHCOS cluster: Jenkins job https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/4019/, pv output in the logs at http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ikave-aws-upi48/ikave-aws-upi48_20210623T064915/logs/deployment_1624431541/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-2d11d7a16502a1635705f98f62213e9504d9dc6fb3a883b2165eaf4a219c8da9/cluster-scoped-resources/oc_output/get_pv. Mon PVC size is 50Gi.
- AZURE IPI 3AZ RHCOS 3M 3W cluster: Jenkins job https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1183/, pv output in the logs at http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j006zi3c33-t1/j006zi3c33-t1_20210623T153322/logs/deployment_1624463182/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-2d11d7a16502a1635705f98f62213e9504d9dc6fb3a883b2165eaf4a219c8da9/cluster-scoped-resources/oc_output/get_pv. Mon PVC size is 50Gi.

In summary, all the 4.8 clusters have a mon PVC size of 50Gi.
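Since this bug carries the AutomationBackLog keyword, a scripted pass/fail version of the fresh-deployment check may be useful. A sketch under the same namespace and label assumptions as above, with the 50Gi expectation taken from this fix:

```sh
# Fail loudly if any mon PVC requests something other than the expected 50Gi.
expected="50Gi"
for size in $(oc get pvc -n openshift-storage -l app=rook-ceph-mon \
    -o jsonpath='{.items[*].spec.resources.requests.storage}'); do
  if [ "$size" = "$expected" ]; then
    echo "OK: mon PVC requests $size"
  else
    echo "FAIL: mon PVC requests $size (expected $expected)" >&2
  fi
done
```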
I also checked 2 clusters when upgrading OCS from 4.7 to 4.8. Here are the results:

- AWS IPI 3AZ RHCOS 3M 3W cluster: Jenkins job https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1168/. PV output before the upgrade at http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j033ai3c33-ua/j033ai3c33-ua_20210622T142140/logs/deployment_1624372481/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-2f1e3210dece6e783ce0b2630b7dd1aa8f821aba60d4c28d550edd9d0275f1c2/cluster-scoped-resources/oc_output/get_pv: mon PVC size is 10Gi. After the upgrade, the pv output in the test-case logs at http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j033ai3c33-ua/j033ai3c33-ua_20210622T142140/logs/testcases_1624378429/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-2d11d7a16502a1635705f98f62213e9504d9dc6fb3a883b2165eaf4a219c8da9/cluster-scoped-resources/oc_output/get_pv shows the mon PVC size is 50Gi.
- VSPHERE UPI 1AZ RHCOS VSAN 3M 3W cluster: Jenkins job https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1169/. PV output before the upgrade at http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j063vu1cs33-ua/j063vu1cs33-ua_20210622T142311/logs/deployment_1624372597/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-2f1e3210dece6e783ce0b2630b7dd1aa8f821aba60d4c28d550edd9d0275f1c2/cluster-scoped-resources/oc_output/get_pv: mon PVC size is 10Gi. After the upgrade, the pv output in the test-case logs at http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j063vu1cs33-ua/j063vu1cs33-ua_20210622T142311/logs/testcases_1624377731/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-2d11d7a16502a1635705f98f62213e9504d9dc6fb3a883b2165eaf4a219c8da9/cluster-scoped-resources/oc_output/get_pv shows the mon PVC size is still 10Gi.

When we look at the rook-ceph-operator logs at http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j063vu1cs33-ua/j063vu1cs33-ua_20210622T142311/logs/testcases_1624377731/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-2d11d7a16502a1635705f98f62213e9504d9dc6fb3a883b2165eaf4a219c8da9/namespaces/openshift-storage/pods/rook-ceph-operator-5b644b8cd4-nj97d/rook-ceph-operator/rook-ceph-operator/logs/current.log, we can see this warning: `cannot expand PVC "rook-ceph-mon-a". storage class "thin" does not allow expansion` (and the same warning for the other 2 mons). This warning is expected according to this function: https://github.com/rook/rook/blob/master/pkg/operator/k8sutil/pvc.go#L29.

Please let me know if these are the results you expected.
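The vSphere behavior above comes down to the storage class: the operator can only grow the mon PVCs when the class sets allowVolumeExpansion: true. A quick way to confirm what a given class allows (a sketch; `thin` is the class named in the warning, and `gp2` is assumed to be the default class on the AWS runs, where the expansion did succeed):

```sh
# Prints "true" when the class permits PVC expansion; empty or "false" otherwise.
oc get storageclass thin -o jsonpath='{.allowVolumeExpansion}{"\n"}'

# For comparison, on the AWS clusters (where the mon PVCs did grow to 50Gi):
oc get storageclass gp2 -o jsonpath='{.allowVolumeExpansion}{"\n"}'
```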
I deployed an AWS IPI 3AZ cluster, upgraded it from OCS 4.7 to 4.8, and checked the size of the mounted volume inside the mon pods. Here are the results:

Cluster link: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/4175/

Before upgrade:

```
sh-4.4# df -kh
Filesystem      Size  Used Avail Use% Mounted on
overlay         120G   12G  108G  10% /
tmpfs            64M     0   64M   0% /dev
tmpfs            32G     0   32G   0% /sys/fs/cgroup
shm              64M     0   64M   0% /dev/shm
tmpfs            32G   57M   32G   1% /etc/hostname
tmpfs            32G  4.0K   32G   1% /etc/ceph
/dev/nvme0n1p4  120G   12G  108G  10% /etc/hosts
tmpfs            32G  4.0K   32G   1% /etc/ceph/keyring-store
/dev/nvme1n1    9.8G  107M  9.7G   2% /var/lib/ceph/mon/ceph-a
tmpfs            32G   20K   32G   1% /run/secrets/kubernetes.io/serviceaccount
```

The size of the mounted PVC is 10Gi.

After upgrade:

```
sh-4.4# df -kh
Filesystem      Size  Used Avail Use% Mounted on
overlay         120G   13G  107G  11% /
tmpfs            64M     0   64M   0% /dev
tmpfs            32G     0   32G   0% /sys/fs/cgroup
shm              64M     0   64M   0% /dev/shm
tmpfs            32G   56M   32G   1% /etc/hostname
tmpfs            32G  4.0K   32G   1% /etc/ceph
/dev/nvme0n1p4  120G   13G  107G  11% /etc/hosts
tmpfs            32G  4.0K   32G   1% /etc/ceph/keyring-store
/dev/nvme1n1     50G  123M   50G   1% /var/lib/ceph/mon/ceph-a
tmpfs            32G   20K   32G   1% /run/secrets/kubernetes.io/serviceaccount

sh-4.4# df -kh
Filesystem      Size  Used Avail Use% Mounted on
overlay         120G   14G  107G  11% /
tmpfs            64M     0   64M   0% /dev
tmpfs            32G     0   32G   0% /sys/fs/cgroup
shm              64M     0   64M   0% /dev/shm
tmpfs            32G   59M   32G   1% /etc/hostname
tmpfs            32G  4.0K   32G   1% /etc/ceph
/dev/nvme0n1p4  120G   14G  107G  11% /etc/hosts
tmpfs            32G  4.0K   32G   1% /etc/ceph/keyring-store
/dev/nvme1n1     50G  119M   50G   1% /var/lib/ceph/mon/ceph-b
tmpfs            32G   20K   32G   1% /run/secrets/kubernetes.io/serviceaccount

sh-4.4# df -kh
Filesystem      Size  Used Avail Use% Mounted on
overlay         120G   17G  104G  14% /
tmpfs            64M     0   64M   0% /dev
tmpfs            32G     0   32G   0% /sys/fs/cgroup
shm              64M     0   64M   0% /dev/shm
tmpfs            32G   63M   31G   1% /etc/hostname
tmpfs            32G  4.0K   32G   1% /etc/ceph
/dev/nvme0n1p4  120G   17G  104G  14% /etc/hosts
tmpfs            32G  4.0K   32G   1% /etc/ceph/keyring-store
/dev/nvme1n1     50G  119M   50G   1% /var/lib/ceph/mon/ceph-c
tmpfs            32G   20K   32G   1% /run/secrets/kubernetes.io/serviceaccount
```

The size of the mounted PVC for the 3 mon pods is 50Gi.

I am moving the bug to Verified, as the results above were as expected.
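For future regression runs, the per-pod df check above can be scripted rather than rsh-ing into each mon by hand. A sketch, again assuming the default namespace and the app=rook-ceph-mon pod label; the container name `mon` is Rook's usual choice, so verify it on your deployment:

```sh
# Show the size of the mon store mount in every mon pod.
for pod in $(oc get pods -n openshift-storage -l app=rook-ceph-mon \
    -o jsonpath='{.items[*].metadata.name}'); do
  echo "== $pod =="
  # The PVC is mounted at /var/lib/ceph/mon/ceph-<id>, so grep for the prefix.
  oc exec -n openshift-storage "$pod" -c mon -- sh -c 'df -h | grep /var/lib/ceph/mon'
done
```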