Bug 1966662

Summary: cannot set just CPU or just memory in StorageDeviceSet resources
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Ben England <bengland>
Component: ocs-operator
Assignee: Jose A. Rivera <jrivera>
Status: CLOSED WONTFIX
QA Contact: Raz Tamir <ratamir>
Severity: high
Priority: unspecified
Version: 4.8
CC: ekuric, jhopper, jrivera, kramdoss, madam, mmuench, ocs-bugs, odf-bz-bot, owasserm, sostapov
Target Milestone: ---
Target Release: ---
Keywords: Performance
Hardware: Unspecified
OS: Unspecified
Last Closed: 2021-10-11 15:50:57 UTC
Type: Bug

Description Ben England 2021-06-01 15:57:41 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

OCS ceph-osd pods no longer have a memory limit.   They used to pass this limit to the OSD software via osd_memory_target.   AFAICT neither that setting nor the memory cgroup limit is being applied.   With the right sort of workload this can result in OOM kills across the cluster and fatal failure of the application.  This was triggered by a smallfile run, but in principle it can happen for any workload where the OSD tries to cache enough data in memory.

Version of all relevant components (if applicable):

OCS 4.8.0-394.ci 
OCP 4.7 GA 


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

I'll answer as if I were a customer.   If a customer hit this problem, it would be fatal.   


Is there any workaround available to the best of your knowledge?

Yes, explicitly set the OSD memory requests/limits.


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

1


Is this issue reproducible?

I haven't reproduced it a second time yet because the run takes a while, but I will work on that.


Can this issue be reproduced from the UI?

Issue has nothing to do with the UI.


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. install OCP 4.7 GA
2. install OCS 4.8 version above using CR (more info below)
3. run smallfile from benchmark-operator to create 800 million 16-KiB files


Actual results:

OSDs grow in memory consumption until OOM kills happen.  OSD repair is very slow due to the workload.  Is the repair process itself triggering OOMs?


Expected results:

It doesn't crash and delivers performance commensurate with the hardware available in the system.


Additional info:

evidence that OOMs happened and that OSDs were unconstrained on memory - look for "rook-ceph-osd" in this log:
http://nfs-storage01.scalelab.redhat.com/bengland/tmp/alias-cloud03-2012-05-10/crs/logs/describe-nodes-oom.log


partial must-gather:
http://nfs-storage01.scalelab.redhat.com/bengland/tmp/alias-cloud03-2012-05-10/crs/logs/must-gather.local.4617900638156272935/


The cluster was built with 7 Dell 740xds: 3 were masters, 4 were workers and OCS nodes.   Each had 1 NVMe device; because of varying device sizes and only 1 NVMe device per node, I created 3 400-GB partitions on each NVMe and made those into OSDs using the Local Storage Operator.   I deployed with Multus.   

LSO CR:
http://nfs-storage01.scalelab.redhat.com/bengland/tmp/alias-cloud03-2012-05-10/deploy/lso-cr.yaml

OCS operator deploy YAML:
http://nfs-storage01.scalelab.redhat.com/bengland/tmp/alias-cloud03-2012-05-10/deploy/deploy-ocs.yaml

OCS cluster deploy YAML:
http://nfs-storage01.scalelab.redhat.com/bengland/tmp/alias-cloud03-2012-05-10/deploy/ocs-storagecluster-4.8.0-4node-3part-publiconly.yaml


benchmark-operator was deployed as described here:
https://github.com/cloud-bulldozer/benchmark-operator

and smallfile workload was deployed using this CR:
http://nfs-storage01.scalelab.redhat.com/bengland/tmp/alias-cloud03-2012-05-10/crs/smf.yaml

The results can be seen through grafana dashboards available in the openshift console GUI, screenshots in this directory:
http://nfs-storage01.scalelab.redhat.com/bengland/public/ceph/rhocs/smf-oom-meltdown/

Screenshot%20from%202021-05-29%2020-58-46.png is a terminal session showing the huge growth in memory for the Ceph OSDs, up to around 40 GiB of RAM, way more than they used to use.   Almost all system RAM is in use at this point.

Screenshot%20from%202021-05-29%2021-08-58.png : the absolute amount isn't the real problem, though - the problem is the unconstrained growth!  Note the straight-line memory growth.

Screenshot%20from%202021-05-29%2021-10-18.png : another way of looking at the same thing

Screenshot%20from%202021-05-29%2023-31-03.png  :shows an OOM kill happening and ceph OSD pods being restarted

Screenshot%20from%202021-05-29%2023-49-48.png : by this point the application was terminating and all memory growth was caused by backfill and recovery activities.   This workload consisted of creating 800 million 16-KiB files (in parallel), so these activities take a long time and stress BlueStore metadata management.

Comment 3 Sébastien Han 2021-06-01 16:19:36 UTC
ocs-op is setting the resources. José, has anything changed for the OSD recently?
Thanks!

Comment 4 Ben England 2021-06-01 17:52:52 UTC
Also see bz https://bugzilla.redhat.com/show_bug.cgi?id=1850954, which was closed as NOTABUG.  They resolved it by saying Elko should have specified resources: {}, but I didn't specify memory resources at all and this happened to me; I did specify CPU resources.  Specifically, I put:

  storageDeviceSets:
  - config: {}
    count: 12
    dataPVCTemplate:
      metadata: {}
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: "1"
        storageClassName: localblock
        volumeMode: Block
    name: ocs-deviceset-nvmdevs
    replica: 1
    resources:
      limits:
        cpu: 10
      requests:
        cpu: 5

BTW, how would Kubernetes be able to schedule pods correctly or avoid an OOM kill situation if it does not know what memory is required by the Ceph OSDs?

Comment 5 Orit Wasserman 2021-06-02 06:29:07 UTC
(In reply to Ben England from comment #4)
> They resolved it by saying Elko should have specified resources: {}, but I
> didn't specify memory resources at all and this happened to me; I did
> specify CPU resources.
> 
> BTW, how would Kubernetes be able to schedule pods correctly or avoid an OOM
> kill situation if it does not know what memory is required by the Ceph OSDs?

I see: the bug is that you only overrode the CPU requests/limits, but this cleared the OSD memory settings as well.
Two action items:
- doc bug to make sure users know they have to set the memory requests/limits as well (for older versions)
- Fix it in OCS Operator

@Jose, can you look at it?

Comment 6 Jose A. Rivera 2021-06-02 15:24:33 UTC
I don't see anything that needs fixing. With an explicitly defined Resources field, the behavior of completely overriding the defaults is working as intended. There is no way (that I'm aware of) to determine whether an empty field means use default or use nothing. I guess we can try and make that more explicit? But I don't even know if we have documentation for this in the first place, since I don't know if we officially support this in GA. Maybe a KCS is in order.
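
To illustrate that ambiguity, here is a minimal, self-contained sketch (my own illustration using the corev1 types and sigs.k8s.io/yaml, not operator code): an omitted resources field and an explicit "resources: {}" decode to the same zero value, so the only signal the operator has is whether the Requests/Limits maps are nil.

    package main

    import (
        "fmt"

        corev1 "k8s.io/api/core/v1"
        "sigs.k8s.io/yaml"
    )

    func main() {
        // An omitted resources field, "resources: {}", and a CPU-only spec all
        // decode into corev1.ResourceRequirements; the first two are
        // indistinguishable, and a missing memory key in the third cannot be
        // told apart from "intentionally no memory setting".
        inputs := []string{
            ``,                     // resources omitted entirely
            `{}`,                   // resources: {}
            `{limits: {cpu: "4"}}`, // CPU limit only, memory omitted
        }
        for _, in := range inputs {
            var r corev1.ResourceRequirements
            if err := yaml.Unmarshal([]byte(in), &r); err != nil {
                panic(err)
            }
            fmt.Printf("input=%-22q requestsNil=%v limitsNil=%v\n", in, r.Requests == nil, r.Limits == nil)
        }
    }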

Leaving this open for now, but moving to ODF 4.9.

Comment 7 krishnaram Karthick 2021-06-07 13:27:45 UTC
(In reply to Orit Wasserman from comment #5)
> I see: the bug is that you only overrode the CPU requests/limits, but this
> cleared the OSD memory settings as well.

@Orit - if there is no limit set for memory, does the OSD pod go ahead and consume all of the available memory in the system? 
My understanding was that it shouldn't go beyond 5G irrespective of whether a limit is defined or not.

Comment 8 Orit Wasserman 2021-06-07 14:58:28 UTC
(In reply to krishnaram Karthick from comment #7)
> @Orit - if there is no limit set for memory, does the OSD pod go ahead and
> consume all of the available memory in the system? 
> My understanding was that it shouldn't go beyond 5G irrespective of whether
> a limit is defined or not.

The main issue is not the limits but the empty requests: with no memory request, Kubernetes may place other pods on the node that consume memory and leave too little for the OSD.
OSD memory usage depends on the workload type and disk size, and could be more than 5G in your case.

Comment 9 Ben England 2021-06-10 17:59:02 UTC
I have an AWS cluster where I can experiment with this further.  The plan is to try reproducing the null memory limit using the same CR as before (CPU specified, no memory specified), then try setting the memory limit explicitly, then try not specifying resources at all.  I think it's just an easily fixed bug in how the rook-ceph-operator interprets the resources: section of the storagecluster CR, but I have to prove it.

Comment 10 Ben England 2021-06-11 11:34:22 UTC
I reproduced and isolated the problem in AWS with a similar set of OCS storagecluster CRs.   The good news is that the problem can be worked around by editing the live storagecluster CR; ocs-operator will then redeploy the OSDs with the right limits. 

The problem is exactly what I thought.  There are actually 4 cases to consider:

- neither memory nor CPU resources specified, "resources: {}" - no problem
- CPU specified, but not memory - memory is unbounded
- both CPU and memory specified - no problem
- memory specified but not CPU - CPU is unbounded

What you want is that if one resource is specified, the other defaults just as it does in the case where neither resource is specified, correct?   This is a much more intuitive, surprise-free behavior.   

when neither memory nor CPU is specified:

  storageDeviceSets:
    - resources: {}
      ...

I get:

$ ocos describe node | awk '/CPU/||/ceph-osd/' 
  Namespace                        Name                                                                  CPU Requests  CPU Limits  Memory Requests  Memory Limits 
  openshift-storage                rook-ceph-osd-0-547788c87c-r622l                                      2 (12%)       2 (12%)     5Gi (8%)         5Gi (8%)      
  openshift-storage                rook-ceph-osd-2-84c49b497d-6jbss                                      2 (12%)       2 (12%)     5Gi (8%)         5Gi (8%)     
  openshift-storage                rook-ceph-osd-1-677445dbb6-5b649                                      2 (12%)       2 (12%)     5Gi (8%)         5Gi (8%)  


when CPU is specified but not memory:

  storageDeviceSets:
    - resources:
        limits:
          cpu: 4
        requests:
          cpu: 2
      ...

I get:

ocos describe node | awk '/CPU/||/ceph-osd/' 
  Namespace                               Name                                                                  CPU Requests  CPU Limits  Memory Requests  Memory Limits
  openshift-storage                       rook-ceph-osd-0-5d6bdf4f78-qs4zk                                      2 (12%)       4 (25%)     0 (0%)           0 (0%) 
  openshift-storage                       rook-ceph-osd-1-84bb6c56b9-qnkws                                      2 (12%)       4 (25%)     0 (0%)           0 (0%) 
  openshift-storage                       rook-ceph-osd-2-69c469ff64-dvh8p                                      2 (12%)       4 (25%)     0 (0%)           0 (0%)  


When both CPU and memory are specified, such as:

    - resources:
        limits:
          cpu: 4
          memory: "6Gi"
        requests:
          cpu: 2
          memory: "6Gi"
      ...

I get 

ocos describe node | awk '/CPU/||/ceph-osd/' | tee aws-cpu-mem-limit-specified.log
  Namespace                               Name                                                                  CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  openshift-storage                       rook-ceph-osd-0-77bbbf45c4-npjdr                                      2 (12%)       4 (25%)     6Gi (9%)         6Gi (9%)       97s
  openshift-storage                       rook-ceph-osd-1-5b97995cf5-tmjcr                                      2 (12%)       4 (25%)     6Gi (9%)         6Gi (9%)       98s
  openshift-storage                       rook-ceph-osd-2-cb7ccb8-w8r4g                                         2 (12%)       4 (25%)     6Gi (9%)         6Gi (9%)       92s


when memory is specified but not CPU:

  storageDeviceSets:
    - resources:
        limits:
          memory: "6Gi"
        requests:
          memory: "6Gi"
     ...

I get:

$ ocos describe node | awk '/CPU/||/ceph-osd/'
  Namespace                               Name                                                                  CPU Requests  CPU Limits  Memory Requests  Memory Limits
  openshift-storage                       rook-ceph-osd-1-6d598b6744-glfqs                                      0 (0%)        0 (0%)      6Gi (9%)         6Gi (9%)     
  openshift-storage                       rook-ceph-osd-2-56d5c5db9f-c9r8r                                      0 (0%)        0 (0%)      6Gi (9%)         6Gi (9%)    
  openshift-storage                       rook-ceph-osd-0-859774c898-4gqcq                                      0 (0%)        0 (0%)      6Gi (9%)         6Gi (9%)

Comment 11 Ben England 2021-06-17 17:26:41 UTC
The fix is pretty simple, see here where defaulting for OSD memory and CPU limits happens:

https://github.com/openshift/ocs-operator/blob/master/controllers/storagecluster/cephcluster.go#L517

                if resources.Requests == nil && resources.Limits == nil {
                        resources = defaults.DaemonResources["osd"]
                }

It clearly shows that defaulting only happens if both resources.Requests and resources.Limits are nil.   What it should do in pseudocode is:

if resources.requests.cpu is undefined:
   default to defaults.DaemonResources["osd"].requests.cpu
if resources.requests.memory is undefined:
   default to defaults.DaemonResources["osd"].requests.memory

if resources.limits.cpu is undefined:
   default to defaults.DaemonResources["osd"].limits.cpu
if resources.limits.memory is undefined:
   default to defaults.DaemonResources["osd"].limits.memory

if resources.requests.cpu > resources.limits.cpu:
   error: requests must be <= limits
if resources.requests.memory > resources.limits.memory:
   error: requests must be <= limits

Make sense?   This way, the user can default any subset of these 4 fields and still get a sane answer.  Why make the user work so hard at it?   The user may want to override the CPU request/limit without overriding the memory request/limit, or vice versa, and may not know enough to specify both of them correctly.
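
For reference, a rough Go sketch of that per-field defaulting (my own illustration against the corev1 types; not the actual ocs-operator code, and the helper name defaultOSDResources is made up):

    package storagecluster

    import (
        corev1 "k8s.io/api/core/v1"
    )

    // defaultOSDResources fills in any CPU or memory request/limit the user
    // left unset with the corresponding OSD default, instead of applying the
    // defaults only when both Requests and Limits are nil.
    func defaultOSDResources(user, osdDefaults corev1.ResourceRequirements) corev1.ResourceRequirements {
        if user.Requests == nil {
            user.Requests = corev1.ResourceList{}
        }
        if user.Limits == nil {
            user.Limits = corev1.ResourceList{}
        }
        for _, name := range []corev1.ResourceName{corev1.ResourceCPU, corev1.ResourceMemory} {
            if _, ok := user.Requests[name]; !ok {
                if def, ok := osdDefaults.Requests[name]; ok {
                    user.Requests[name] = def
                }
            }
            if _, ok := user.Limits[name]; !ok {
                if def, ok := osdDefaults.Limits[name]; ok {
                    user.Limits[name] = def
                }
            }
        }
        return user
    }

The operator could then call something like this with defaults.DaemonResources["osd"] in place of the nil-check at the line linked above, and add the requests <= limits validation on top.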

Comment 15 Jose A. Rivera 2021-10-11 15:50:57 UTC
Since there's been no further discussion nor customer input, closing this as WONTFIX. If there's any further demand for this, feel free to reopen this BZ.