Bug 1944580
| Summary: | Increase mon DB allocated space (fresh deployment and upgrade to 4.8+) | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Orit Wasserman <owasserm> |
| Component: | ocs-operator | Assignee: | Nitin Goyal <nigoyal> |
| Status: | VERIFIED | QA Contact: | Itzhak <ikave> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 4.8 | CC: | ebenahar, jdurgin, lithomas, muagarwa, nberry, nigoyal, pdhange, sostapov, tnielsen |
| Target Milestone: | --- | Keywords: | AutomationBackLog |
| Target Release: | OCS 4.8.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
Comment 2
Travis Nielsen
2021-04-01 15:06:19 UTC
Before we increase the size of the mon PVs, a couple questions:

1. Why haven't we seen an issue with the existing 10G mon size? I don't recall hearing any issues with this previously. Is Ceph already handling the compaction when it sees the space is running low?
2. What new limit will be sufficient? Any limit we choose seems arbitrary.

Comment 3
Orit Wasserman

(In reply to Travis Nielsen from comment #2)
> 1. Why haven't we seen an issue with the existing 10G mon size? Is ceph
> already handling the compaction when it sees the space is running low?

I suspect this is related to the load on the cluster and its size. In https://bugzilla.redhat.com/show_bug.cgi?id=1941939 the mon datastore is filling up during OSD slow ops, and it seems the trimming was too slow.

> 2. What new limit will be sufficient? Any limit we choose seems arbitrary.

Right, we should set the limit according to the cluster size. @Josh, any guidelines? Since users start with a 3-node cluster and expand, let's start with a size that fits most cases and expand the datastore if needed. I will open a separate ticket to handle mon datastore expansion.

Josh Durgin

(In reply to Orit Wasserman from comment #3)
> Right, we should set the limit according to the cluster size.
> @Josh, any guidelines?

What's the cost function for PV size vs. the risk of downtime for the cluster? We recommend at least 50GB for RHCS: https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/4/html/hardware_guide/minimum-hardware-recommendations_hw

Generally, any cluster we've seen would be fine with a few hundred GB.

Itzhak

Hi. Can we look at the ocs-ci runs from the weekend instead of deploying a new cluster and testing it? For example, in this deployment on AWS, https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1157/, we can see in the logs http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j009ai3c333-t4bn/j009ai3c333-t4bn_20210619T042251/logs/deployment_1624077149/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-d42be5905fd3e8ea280695a48b162a1ff44706a25a4815edbf75499a0c8f516b/cluster-scoped-resources/oc_output/get_pv that the mon PV size is 50Gi.

Two more questions:

- Besides the different platforms we need to test, do we also need to check each deployment type (UPI, IPI, LSO)?
- Which message do we expect to find in the rook-ceph pod logs when upgrading from OCS 4.7 to 4.8? Can you provide more details about the message content?

Okay, thanks Nirit. Sorry, I was confused with someone else... Thanks for the information, Nitin.

I checked a few executions with 4.8 clusters (without the upgrade) and checked the mon PVC size (a live-cluster version of the same check is sketched below).
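A minimal sketch of that live check, assuming the default openshift-storage namespace and Rook's usual app=rook-ceph-mon PVC label (both are assumptions to verify on your own cluster):

```sh
# List the mon PVCs with their requested size (expected: 50Gi on a fresh 4.8 cluster).
oc get pvc -n openshift-storage -l app=rook-ceph-mon

# Or print just the name and the bound capacity of each mon PVC:
oc get pvc -n openshift-storage -l app=rook-ceph-mon \
  -o custom-columns=NAME:.metadata.name,SIZE:.status.capacity.storage
```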
Here are the results:

- VSPHERE UPI Dynamic cluster: Jenkins job https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/3487/, pv output in the logs at http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ikave-vc7-vm48/ikave-vc7-vm48_20210606T090127/logs/deployment_1622971392/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-5be86e9bbc0750c9d042fdb88007d3259c344b33176854626282dd7b52eb8f93/cluster-scoped-resources/oc_output/get_pv. Mon PVC size is 50Gi.
- VSPHERE UPI FIPS 1AZ RHCOS VSAN 3M 6W cluster: Jenkins job https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/989/, pv output in the logs at http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j051vuf1cs36-t4a/j051vuf1cs36-t4a_20210603T063238/logs/deployment_1622702792/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-4cf9b04bc34bccb6fd801e42867308aee3dec18987d8507f2b58552d6d45dc19/cluster-scoped-resources/oc_output/get_pv. Mon PVC size is 50Gi.
- AWS IPI 3AZ RHCOS cluster: Jenkins job https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1157/, pv output in the logs at http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j009ai3c333-t4bn/j009ai3c333-t4bn_20210619T042251/logs/deployment_1624077149/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-d42be5905fd3e8ea280695a48b162a1ff44706a25a4815edbf75499a0c8f516b/cluster-scoped-resources/oc_output/get_pv. Mon PVC size is 50Gi.
- AWS UPI 3AZ RHCOS cluster: Jenkins job https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/4019/, pv output in the logs at http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ikave-aws-upi48/ikave-aws-upi48_20210623T064915/logs/deployment_1624431541/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-2d11d7a16502a1635705f98f62213e9504d9dc6fb3a883b2165eaf4a219c8da9/cluster-scoped-resources/oc_output/get_pv. Mon PVC size is 50Gi.
- AZURE IPI 3AZ RHCOS 3M 3W cluster: Jenkins job https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1183/, pv output in the logs at http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j006zi3c33-t1/j006zi3c33-t1_20210623T153322/logs/deployment_1624463182/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-2d11d7a16502a1635705f98f62213e9504d9dc6fb3a883b2165eaf4a219c8da9/cluster-scoped-resources/oc_output/get_pv. Mon PVC size is 50Gi.

In summary, all the 4.8 clusters have a mon PVC size of 50Gi.
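Since this bug carries the AutomationBackLog keyword, a scripted pass/fail version of the fresh-deployment check may be useful. A sketch under the same namespace and label assumptions as above, with the 50Gi expectation taken from this fix:

```sh
# Fail loudly if any mon PVC requests something other than the expected 50Gi.
expected="50Gi"
for size in $(oc get pvc -n openshift-storage -l app=rook-ceph-mon \
    -o jsonpath='{.items[*].spec.resources.requests.storage}'); do
  if [ "$size" = "$expected" ]; then
    echo "OK: mon PVC requests $size"
  else
    echo "FAIL: mon PVC requests $size (expected $expected)" >&2
  fi
done
```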
I also checked 2 clusters when upgrading OCS from 4.7 to 4.8. Here are the results:

- AWS IPI 3AZ RHCOS 3M 3W cluster: Jenkins job https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1168/. PV output before the upgrade at http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j033ai3c33-ua/j033ai3c33-ua_20210622T142140/logs/deployment_1624372481/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-2f1e3210dece6e783ce0b2630b7dd1aa8f821aba60d4c28d550edd9d0275f1c2/cluster-scoped-resources/oc_output/get_pv: mon PVC size is 10Gi. After the upgrade, the pv output in the test-case logs at http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j033ai3c33-ua/j033ai3c33-ua_20210622T142140/logs/testcases_1624378429/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-2d11d7a16502a1635705f98f62213e9504d9dc6fb3a883b2165eaf4a219c8da9/cluster-scoped-resources/oc_output/get_pv shows the mon PVC size is 50Gi.
- VSPHERE UPI 1AZ RHCOS VSAN 3M 3W cluster: Jenkins job https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1169/. PV output before the upgrade at http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j063vu1cs33-ua/j063vu1cs33-ua_20210622T142311/logs/deployment_1624372597/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-2f1e3210dece6e783ce0b2630b7dd1aa8f821aba60d4c28d550edd9d0275f1c2/cluster-scoped-resources/oc_output/get_pv: mon PVC size is 10Gi. After the upgrade, the pv output in the test-case logs at http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j063vu1cs33-ua/j063vu1cs33-ua_20210622T142311/logs/testcases_1624377731/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-2d11d7a16502a1635705f98f62213e9504d9dc6fb3a883b2165eaf4a219c8da9/cluster-scoped-resources/oc_output/get_pv shows the mon PVC size is still 10Gi.

When we look at the rook-ceph-operator logs at http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j063vu1cs33-ua/j063vu1cs33-ua_20210622T142311/logs/testcases_1624377731/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-2d11d7a16502a1635705f98f62213e9504d9dc6fb3a883b2165eaf4a219c8da9/namespaces/openshift-storage/pods/rook-ceph-operator-5b644b8cd4-nj97d/rook-ceph-operator/rook-ceph-operator/logs/current.log, we can see this warning: `cannot expand PVC "rook-ceph-mon-a". storage class "thin" does not allow expansion` (and the same warning for the other 2 mons). This warning is expected according to this function: https://github.com/rook/rook/blob/master/pkg/operator/k8sutil/pvc.go#L29.

Please let me know if these are the results you expected.
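The vSphere behavior above comes down to the storage class: the operator can only grow the mon PVCs when the class sets allowVolumeExpansion: true. A quick way to confirm what a given class allows (a sketch; `thin` is the class named in the warning, and `gp2` is assumed to be the default class on the AWS runs, where the expansion did succeed):

```sh
# Prints "true" when the class permits PVC expansion; empty or "false" otherwise.
oc get storageclass thin -o jsonpath='{.allowVolumeExpansion}{"\n"}'

# For comparison, on the AWS clusters (where the mon PVCs did grow to 50Gi):
oc get storageclass gp2 -o jsonpath='{.allowVolumeExpansion}{"\n"}'
```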
I deployed an AWS IPI 3AZ cluster, upgraded it from OCS 4.7 to 4.8, and checked the size of the mounted volume inside the mon pods. Here are the results:

Cluster link: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/4175/

Before upgrade:

```
sh-4.4# df -kh
Filesystem      Size  Used Avail Use% Mounted on
overlay         120G   12G  108G  10% /
tmpfs            64M     0   64M   0% /dev
tmpfs            32G     0   32G   0% /sys/fs/cgroup
shm              64M     0   64M   0% /dev/shm
tmpfs            32G   57M   32G   1% /etc/hostname
tmpfs            32G  4.0K   32G   1% /etc/ceph
/dev/nvme0n1p4  120G   12G  108G  10% /etc/hosts
tmpfs            32G  4.0K   32G   1% /etc/ceph/keyring-store
/dev/nvme1n1    9.8G  107M  9.7G   2% /var/lib/ceph/mon/ceph-a
tmpfs            32G   20K   32G   1% /run/secrets/kubernetes.io/serviceaccount
```

The size of the mounted PVC is 10Gi.

After upgrade:

```
sh-4.4# df -kh
Filesystem      Size  Used Avail Use% Mounted on
overlay         120G   13G  107G  11% /
tmpfs            64M     0   64M   0% /dev
tmpfs            32G     0   32G   0% /sys/fs/cgroup
shm              64M     0   64M   0% /dev/shm
tmpfs            32G   56M   32G   1% /etc/hostname
tmpfs            32G  4.0K   32G   1% /etc/ceph
/dev/nvme0n1p4  120G   13G  107G  11% /etc/hosts
tmpfs            32G  4.0K   32G   1% /etc/ceph/keyring-store
/dev/nvme1n1     50G  123M   50G   1% /var/lib/ceph/mon/ceph-a
tmpfs            32G   20K   32G   1% /run/secrets/kubernetes.io/serviceaccount

sh-4.4# df -kh
Filesystem      Size  Used Avail Use% Mounted on
overlay         120G   14G  107G  11% /
tmpfs            64M     0   64M   0% /dev
tmpfs            32G     0   32G   0% /sys/fs/cgroup
shm              64M     0   64M   0% /dev/shm
tmpfs            32G   59M   32G   1% /etc/hostname
tmpfs            32G  4.0K   32G   1% /etc/ceph
/dev/nvme0n1p4  120G   14G  107G  11% /etc/hosts
tmpfs            32G  4.0K   32G   1% /etc/ceph/keyring-store
/dev/nvme1n1     50G  119M   50G   1% /var/lib/ceph/mon/ceph-b
tmpfs            32G   20K   32G   1% /run/secrets/kubernetes.io/serviceaccount

sh-4.4# df -kh
Filesystem      Size  Used Avail Use% Mounted on
overlay         120G   17G  104G  14% /
tmpfs            64M     0   64M   0% /dev
tmpfs            32G     0   32G   0% /sys/fs/cgroup
shm              64M     0   64M   0% /dev/shm
tmpfs            32G   63M   31G   1% /etc/hostname
tmpfs            32G  4.0K   32G   1% /etc/ceph
/dev/nvme0n1p4  120G   17G  104G  14% /etc/hosts
tmpfs            32G  4.0K   32G   1% /etc/ceph/keyring-store
/dev/nvme1n1     50G  119M   50G   1% /var/lib/ceph/mon/ceph-c
tmpfs            32G   20K   32G   1% /run/secrets/kubernetes.io/serviceaccount
```

The size of the mounted PVC for the 3 mon pods is 50Gi.

I am moving the bug to Verified, as the results above were as expected.
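For future regression runs, the per-pod df check above can be scripted rather than rsh-ing into each mon by hand. A sketch, again assuming the default namespace and the app=rook-ceph-mon pod label; the container name `mon` is Rook's usual choice, so verify it on your deployment:

```sh
# Show the size of the mon store mount in every mon pod.
for pod in $(oc get pods -n openshift-storage -l app=rook-ceph-mon \
    -o jsonpath='{.items[*].metadata.name}'); do
  echo "== $pod =="
  # The PVC is mounted at /var/lib/ceph/mon/ceph-<id>, so grep for the prefix.
  oc exec -n openshift-storage "$pod" -c mon -- sh -c 'df -h | grep /var/lib/ceph/mon'
done
```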