Bug 1866848

Summary: [RFE] Limit concurrent Block Storage service backup/restore operations to control memory usage
Product: Red Hat OpenStack
Reporter: Gorka Eguileor <geguileo>
Component: openstack-cinder
Assignee: Gorka Eguileor <geguileo>
Status: CLOSED MIGRATED
QA Contact: Yosi Ben Shimon <ybenshim>
Severity: high
Docs Contact: Ian Frangs <ifrangs>
Priority: high
Version: 16.2 (Train)
CC: brian.rosmaita, dhill, eharney, gcharot, geguileo, ifrangs, jamsmith, ltoscano, mariel, pcaruana, pratik.bandarkar, rlondhe, spower, vhariria
Target Milestone: ---
Keywords: FutureFeature, TestOnly, Triaged
Target Release: ---
Flags: tshefi: automate_bug?
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version: openstack-cinder-18.2.1-0.20220526042308.5532645.el8ost
Doc Type: Enhancement
Doc Text:
With the new `backup_max_operations` parameter, you can now tune Block Storage service backups to operate more reliably within your hardware environment and usage patterns. An unlimited number of concurrent backup and restore operations can lead to excessive memory consumption, which can kill the cinder backup service and result in service disruptions. You can adjust the value of `backup_max_operations` to prevent these service disruptions.
Story Points: ---
Clone Of: 1806975
Last Closed: 2025-01-17 16:13:06 UTC
Bug Blocks: 1866853    

Description Gorka Eguileor 2020-08-06 14:53:52 UTC
This feature limits the number of concurrent backup/restore operations that each cinder-backup service runs, which bounds the maximum amount of memory the service uses.
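
A minimal sketch of the idea, not the cinder implementation: a semaphore caps how many memory-heavy operations run at once, so peak memory is bounded by roughly the limit times the per-operation footprint. The names `OperationLimiter`, `fake_restore`, and `restore_all` are hypothetical, plain threads stand in for the service's own concurrency model, and treating a limit of 0 as "unlimited" is an assumption made for this example.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class OperationLimiter:
    """Cap how many memory-heavy operations run at once (hypothetical helper)."""

    def __init__(self, max_operations):
        # Assumption for this sketch: 0 means "no limit".
        self._semaphore = (
            threading.BoundedSemaphore(max_operations) if max_operations > 0 else None
        )
        self._lock = threading.Lock()
        self._running = 0
        self.peak = 0  # highest number of operations seen running together

    def run(self, operation, *args):
        if self._semaphore is None:
            return self._tracked(operation, *args)
        # Callers beyond the limit block here until a slot is released.
        with self._semaphore:
            return self._tracked(operation, *args)

    def _tracked(self, operation, *args):
        with self._lock:
            self._running += 1
            self.peak = max(self.peak, self._running)
        try:
            return operation(*args)
        finally:
            with self._lock:
                self._running -= 1


def fake_restore(backup_id):
    """Stand-in for a restore; a real one would hold large buffers while running."""
    time.sleep(0.05)
    return backup_id


def restore_all(limiter, backup_ids):
    """Request every restore at once; the limiter decides how many actually run."""
    with ThreadPoolExecutor(max_workers=len(backup_ids)) as pool:
        return list(pool.map(lambda b: limiter.run(fake_restore, b), backup_ids))


if __name__ == "__main__":
    limiter = OperationLimiter(max_operations=2)
    restore_all(limiter, list(range(8)))
    print("peak concurrent restores:", limiter.peak)
```

With a limit of 2, eight simultaneous restore requests still complete, but never more than two run at the same time; the rest wait for a free slot instead of adding to memory pressure.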


+++ This bug was initially created as a clone of Bug #1806975 +++
+++ Original summary was: cinder backup restore: decompression uses lots of memory +++

Description of problem:

Unable to restore Cinder volumes created after an FFU upgrade from OSP10 to OSP13.

nova_api_wsgi and nova-conductor are currently the processes using the most memory.

It seems that cinder-backup was consuming about 162 GB of RAM (anon-rss) when it was OOM-killed.

~~~
Feb 24 14:28:18 controller3 kernel: Out of memory: Kill process 2501135 (cinder-backup) score 797 or sacrifice child
Feb 24 14:28:18 controller3 kernel: Killed process 2501135 (cinder-backup), UID 0, total-vm:195150272kB, anon-rss:162185040kB, file-rss:536kB, shmem-rss:0kB
Feb 24 14:28:18 controller3 kernel: cinder-backup: page allocation failure: order:0, mode:0x280da
Feb 24 14:28:18 controller3 kernel: CPU: 13 PID: 2501135 Comm: cinder-backup Kdump: loaded Tainted: G               ------------ T 3.10.0-1062.12.1.el7.x86_64 #1
~~~
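
To make the memory figure easier to check, here is a small sketch that extracts the counters from the "Killed process" line above and converts them. It assumes the kernel's `kB` suffix means 1024 bytes; under that assumption the anon-rss value is roughly 155 GiB (about 166 GB in decimal units), while the 162 GB figure quoted above reads the raw kB number as thousands. The helper names `parse_oom_kill` and `kib_to_gib` are hypothetical.

```python
import re

# Log line copied from the excerpt above.
OOM_LINE = (
    "Feb 24 14:28:18 controller3 kernel: Killed process 2501135 (cinder-backup), "
    "UID 0, total-vm:195150272kB, anon-rss:162185040kB, file-rss:536kB, shmem-rss:0kB"
)


def parse_oom_kill(line):
    """Return pid, process name and memory counters (in kB as printed), or None."""
    match = re.search(r"Killed process (\d+) \(([^)]+)\)", line)
    if match is None:
        return None
    # Collect every "<name>:<digits>kB" pair, e.g. anon-rss:162185040kB.
    counters = {key: int(value) for key, value in re.findall(r"([a-z-]+):(\d+)kB", line)}
    return {"pid": int(match.group(1)), "name": match.group(2), **counters}


def kib_to_gib(kib):
    """Convert a kB counter to GiB, assuming 1 kB here is 1024 bytes."""
    return kib / (1024 * 1024)


if __name__ == "__main__":
    info = parse_oom_kill(OOM_LINE)
    print(info["name"], "anon-rss: %.2f GiB" % kib_to_gib(info["anon-rss"]))
```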

Also noticed high resource utilization by snmpd on the same controller.


Version-Release number of selected component (if applicable):

openstack-cinder-12.0.8-3.el7ost.noarch                     Fri Feb  7 12:53:05 2020
puppet-cinder-12.4.1-5.el7ost.noarch                        Fri Feb  7 12:52:15 2020
python2-cinderclient-3.5.0-1.el7ost.noarch                  Fri Feb  7 12:50:55 2020
python-cinder-12.0.8-3.el7ost.noarch                        Fri Feb  7 12:53:00 2020

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
2822235 root      20   0   76.5g  76.3g   3296 R 100.0 40.5 867:53.35 snmpd

# rpm -qf  /usr/sbin/snmpd
net-snmp-5.7.2-43.el7_7.3.x86_64

Tried downgrading the net-snmp version but still got the same results.


How reproducible:


Steps to Reproduce:
1. Create a backup of an OpenStack volume that contains a large amount of data.
2. Try to restore multiple backups at the same time.
3. Observe that cinder-backup is OOM-killed.

Actual results:

cinder-backup is OOM-killed.

Expected results:

Multiple Cinder volume backups should be restorable at the same time.

At this moment we are able to restore single volumes, but not multiple volumes at the same time.

Comment 18 Brian Rosmaita 2023-08-04 13:21:34 UTC
@astillma Added suggested doc text.  There is more extensive documentation upstream that could be added somewhere if appropriate: https://review.opendev.org/c/openstack/cinder/+/710297/9/doc/source/admin/blockstorage-volume-backups.rst