Before we increase the size of the mon PVs, a couple questions:

1. Why haven't we seen an issue with the existing 10G mon size? I don't recall hearing any issues with this previously. Is Ceph already handling the compaction when it sees the space is running low?

2. What new limit will be sufficient? Any limit we choose seems arbitrary.
(In reply to Travis Nielsen from comment #2)
> Before we increase the size of the mon PVs, a couple questions:
> 1. Why haven't we seen an issue with the existing 10G mon size? I don't
> recall hearing any issues with this previously. Is ceph already handling the
> compaction when it sees the space is running low?

I suspect this is related to the load on the cluster and its size.
In https://bugzilla.redhat.com/show_bug.cgi?id=1941939 the mons' datastore is filling up with OSD slow ops, and it seems the trimming was too slow.

> 2. What new limit will be sufficient? Any limit we choose seems arbitrary.

Right, we should set the limit according to the cluster size.
@Josh, any guidelines?

As users start with a 3-node cluster and expand, let's start with a size that fits most cases and expand the datastore if needed. I will open a separate ticket to handle mon datastore expansion.
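(Side note on the compaction question: a rough sketch of how to inspect the mon store size and force a compaction, run from the rook-ceph toolbox; the mon id "a" below is just an example, adjust to your mon ids.)

$ ceph tell mon.a compact                        # manually compact mon.a's RocksDB store
$ ceph config set mon mon_compact_on_start true  # compact the store whenever a mon restarts
# Inside the mon pod, to see how big the store actually is on disk:
$ du -sh /var/lib/ceph/mon/ceph-a/store.db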
(In reply to Orit Wasserman from comment #3)
> (In reply to Travis Nielsen from comment #2)
> > Before we increase the size of the mon PVs, a couple questions:
> > 1. Why haven't we seen an issue with the existing 10G mon size? I don't
> > recall hearing any issues with this previously. Is ceph already handling the
> > compaction when it sees the space is running low?
>
> I suspect this is related to the load on the cluster and its size.
> In https://bugzilla.redhat.com/show_bug.cgi?id=1941939 the mons datastore is
> filling with OSD slow ops and it seems the trimming was too slow.
>
> > 2. What new limit will be sufficient? Any limit we choose seems arbitrary.
> Right, we should set the limit according to the cluster size.
> @Josh, any guidelines?

What's the cost function for PV size vs. the risk of downtime for the cluster?

We recommend at least 50GB for RHCS:
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/4/html/hardware_guide/minimum-hardware-recommendations_hw

Generally, any cluster we've seen would be fine with a few hundred GB.
https://github.com/openshift/ocs-operator/pull/1155
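Once that PR lands, a quick sanity check on a live cluster is to read the mon volumeClaimTemplate out of the CephCluster CR and list the mon PVCs; this is only a sketch, and the namespace and CR name below assume a default OCS install:

$ oc -n openshift-storage get cephcluster ocs-storagecluster-cephcluster \
    -o jsonpath='{.spec.mon.volumeClaimTemplate.spec.resources.requests.storage}{"\n"}'
# expected to print 50Gi on a fresh 4.8 deployment
$ oc -n openshift-storage get pvc | grep rook-ceph-mon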
Hi. Can we look at the ocs-ci runs from the weekend instead of deploying a new cluster and testing it? For example, in this AWS deployment, https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1157/, we can see in the logs http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j009ai3c333-t4bn/j009ai3c333-t4bn_20210619T042251/logs/deployment_1624077149/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-d42be5905fd3e8ea280695a48b162a1ff44706a25a4815edbf75499a0c8f516b/cluster-scoped-resources/oc_output/get_pv that the mon PV size is 50Gi.

Two more questions:
- Besides the different platforms we need to test, do we also need to check each deployment type (UPI, IPI, LSO)?
- Which message do we expect to find in the rook-ceph pod logs when upgrading from OCS 4.7 to 4.8? Can you provide more details about the message content?
Okay, thanks Nirit.
Sorry, I was confused with someone else... Thanks for the information, Nitin.
I checked a few executions with 4.8 clusters (without the upgrade) and checked the mon PVC size. Here are the results:

VSPHERE UPI Dynamic cluster:
Jenkins job link https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/3487/, pv output in the logs http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ikave-vc7-vm48/ikave-vc7-vm48_20210606T090127/logs/deployment_1622971392/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-5be86e9bbc0750c9d042fdb88007d3259c344b33176854626282dd7b52eb8f93/cluster-scoped-resources/oc_output/get_pv - mon PVC size is 50Gi.

VSPHERE UPI FIPS 1AZ RHCOS VSAN 3M 6W cluster:
Jenkins job link https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/989/, pv output in the logs http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j051vuf1cs36-t4a/j051vuf1cs36-t4a_20210603T063238/logs/deployment_1622702792/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-4cf9b04bc34bccb6fd801e42867308aee3dec18987d8507f2b58552d6d45dc19/cluster-scoped-resources/oc_output/get_pv - mon PVC size is 50Gi.

AWS IPI 3AZ RHCOS cluster:
Jenkins job link https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1157/, pv output in the logs http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j009ai3c333-t4bn/j009ai3c333-t4bn_20210619T042251/logs/deployment_1624077149/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-d42be5905fd3e8ea280695a48b162a1ff44706a25a4815edbf75499a0c8f516b/cluster-scoped-resources/oc_output/get_pv - mon PVC size is 50Gi.

AWS UPI 3AZ RHCOS cluster:
Jenkins job link https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/4019/, pv output in the logs http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ikave-aws-upi48/ikave-aws-upi48_20210623T064915/logs/deployment_1624431541/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-2d11d7a16502a1635705f98f62213e9504d9dc6fb3a883b2165eaf4a219c8da9/cluster-scoped-resources/oc_output/get_pv - mon PVC size is 50Gi.

AZURE IPI 3AZ RHCOS 3M 3W cluster:
Jenkins job link https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1183/, pv output in the logs http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j006zi3c33-t1/j006zi3c33-t1_20210623T153322/logs/deployment_1624463182/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-2d11d7a16502a1635705f98f62213e9504d9dc6fb3a883b2165eaf4a219c8da9/cluster-scoped-resources/oc_output/get_pv - mon PVC size is 50Gi.

To summarize, all the 4.8 clusters have a mon PVC size of 50Gi.
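As an aside, this per-cluster check can be scripted over downloaded must-gather directories instead of opening each get_pv link by hand; a rough sketch, assuming the directory layout of the logs linked above:

$ for f in */ocs_must_gather/*/cluster-scoped-resources/oc_output/get_pv; do
      echo "== $f"
      grep rook-ceph-mon "$f"   # the CAPACITY column should read 50Gi for each mon PV
  done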
I checked another 2 clusters when upgrading OCS from 4.7 to 4.8. Here are the results:

AWS IPI 3AZ RHCOS 3M 3W cluster:
Jenkins job link https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1168/. The pv output in the logs before the upgrade - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j033ai3c33-ua/j033ai3c33-ua_20210622T142140/logs/deployment_1624372481/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-2f1e3210dece6e783ce0b2630b7dd1aa8f821aba60d4c28d550edd9d0275f1c2/cluster-scoped-resources/oc_output/get_pv - shows a mon PVC size of 10Gi. After the upgrade, the pv output in the logs from the test case - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j033ai3c33-ua/j033ai3c33-ua_20210622T142140/logs/testcases_1624378429/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-2d11d7a16502a1635705f98f62213e9504d9dc6fb3a883b2165eaf4a219c8da9/cluster-scoped-resources/oc_output/get_pv - shows a mon PVC size of 50Gi.

VSPHERE UPI 1AZ RHCOS VSAN 3M 3W cluster:
Jenkins job link https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1169/. The pv output in the logs before the upgrade - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j063vu1cs33-ua/j063vu1cs33-ua_20210622T142311/logs/deployment_1624372597/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-2f1e3210dece6e783ce0b2630b7dd1aa8f821aba60d4c28d550edd9d0275f1c2/cluster-scoped-resources/oc_output/get_pv - shows a mon PVC size of 10Gi. After the upgrade, the pv output in the logs from the test case - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j063vu1cs33-ua/j063vu1cs33-ua_20210622T142311/logs/testcases_1624377731/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-2d11d7a16502a1635705f98f62213e9504d9dc6fb3a883b2165eaf4a219c8da9/cluster-scoped-resources/oc_output/get_pv - shows the mon PVC size is still 10Gi.

When we look at the rook-ceph-operator logs http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j063vu1cs33-ua/j063vu1cs33-ua_20210622T142311/logs/testcases_1624377731/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-2d11d7a16502a1635705f98f62213e9504d9dc6fb3a883b2165eaf4a219c8da9/namespaces/openshift-storage/pods/rook-ceph-operator-5b644b8cd4-nj97d/rook-ceph-operator/rook-ceph-operator/logs/current.log, we can see this warning: "cannot expand PVC "rook-ceph-mon-a". storage class "thin" does not allow expansion" (and the same warning for the other 2 mon pods). This warning is expected according to this function: https://github.com/rook/rook/blob/master/pkg/operator/k8sutil/pvc.go#L29.

Please let me know if these are the results you expected.
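For anyone reproducing this on another cluster: the linked function skips the resize when the storage class does not allow volume expansion, so a rough way to confirm the behaviour is to check the storage class flag and grep the operator log for the warning (the storage class name "thin" is taken from the warning above; adjust it for your platform):

$ oc get storageclass thin -o jsonpath='{.allowVolumeExpansion}{"\n"}'
# empty or "false" means the existing 10Gi mon PVCs cannot be expanded
$ oc -n openshift-storage logs deploy/rook-ceph-operator | grep 'does not allow expansion'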
I deployed an AWS IPI 3AZ cluster, upgraded from OCS 4.7 to 4.8, and checked the size of the mounted volume inside the mon pods. Here are the results:

Cluster link: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/4175/

Before upgrade:

sh-4.4# df -kh
Filesystem      Size  Used Avail Use% Mounted on
overlay         120G   12G  108G  10% /
tmpfs            64M     0   64M   0% /dev
tmpfs            32G     0   32G   0% /sys/fs/cgroup
shm              64M     0   64M   0% /dev/shm
tmpfs            32G   57M   32G   1% /etc/hostname
tmpfs            32G  4.0K   32G   1% /etc/ceph
/dev/nvme0n1p4  120G   12G  108G  10% /etc/hosts
tmpfs            32G  4.0K   32G   1% /etc/ceph/keyring-store
/dev/nvme1n1    9.8G  107M  9.7G   2% /var/lib/ceph/mon/ceph-a
tmpfs            32G   20K   32G   1% /run/secrets/kubernetes.io/serviceaccount
sh-4.4#

The size of the mounted PVC is 10Gi.

After upgrade:

sh-4.4# df -kh
Filesystem      Size  Used Avail Use% Mounted on
overlay         120G   13G  107G  11% /
tmpfs            64M     0   64M   0% /dev
tmpfs            32G     0   32G   0% /sys/fs/cgroup
shm              64M     0   64M   0% /dev/shm
tmpfs            32G   56M   32G   1% /etc/hostname
tmpfs            32G  4.0K   32G   1% /etc/ceph
/dev/nvme0n1p4  120G   13G  107G  11% /etc/hosts
tmpfs            32G  4.0K   32G   1% /etc/ceph/keyring-store
/dev/nvme1n1     50G  123M   50G   1% /var/lib/ceph/mon/ceph-a
tmpfs            32G   20K   32G   1% /run/secrets/kubernetes.io/serviceaccount
sh-4.4#

sh-4.4# df -kh
Filesystem      Size  Used Avail Use% Mounted on
overlay         120G   14G  107G  11% /
tmpfs            64M     0   64M   0% /dev
tmpfs            32G     0   32G   0% /sys/fs/cgroup
shm              64M     0   64M   0% /dev/shm
tmpfs            32G   59M   32G   1% /etc/hostname
tmpfs            32G  4.0K   32G   1% /etc/ceph
/dev/nvme0n1p4  120G   14G  107G  11% /etc/hosts
tmpfs            32G  4.0K   32G   1% /etc/ceph/keyring-store
/dev/nvme1n1     50G  119M   50G   1% /var/lib/ceph/mon/ceph-b
tmpfs            32G   20K   32G   1% /run/secrets/kubernetes.io/serviceaccount
sh-4.4#

sh-4.4# df -kh
Filesystem      Size  Used Avail Use% Mounted on
overlay         120G   17G  104G  14% /
tmpfs            64M     0   64M   0% /dev
tmpfs            32G     0   32G   0% /sys/fs/cgroup
shm              64M     0   64M   0% /dev/shm
tmpfs            32G   63M   31G   1% /etc/hostname
tmpfs            32G  4.0K   32G   1% /etc/ceph
/dev/nvme0n1p4  120G   17G  104G  14% /etc/hosts
tmpfs            32G  4.0K   32G   1% /etc/ceph/keyring-store
/dev/nvme1n1     50G  119M   50G   1% /var/lib/ceph/mon/ceph-c
tmpfs            32G   20K   32G   1% /run/secrets/kubernetes.io/serviceaccount
sh-4.4#

The size of the mounted PVC for the 3 mon pods is 50Gi.
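For reference, a sketch of running the same df check across all three mons without rsh-ing into each pod by hand, assuming the usual Rook pod labels (app=rook-ceph-mon, ceph_daemon_id) and the "mon" container name:

$ for id in a b c; do
      pod=$(oc -n openshift-storage get pod -l app=rook-ceph-mon,ceph_daemon_id=$id -o name)
      oc -n openshift-storage exec "$pod" -c mon -- df -h /var/lib/ceph/mon/ceph-$id
  done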
I am moving the bug to Verified, as the results above were as expected.