Description of problem:
Unable to restore Cinder volume backups after an FFU upgrade from OSP10 to OSP13.

nova_api_wsgi and nova-conductor are currently the high-memory processes. It seems that cinder-backup was consuming 162GB of RAM when it was OOM-killed:

~~~
Feb 24 14:28:18 controller3 kernel: Out of memory: Kill process 2501135 (cinder-backup) score 797 or sacrifice child
Feb 24 14:28:18 controller3 kernel: Killed process 2501135 (cinder-backup), UID 0, total-vm:195150272kB, anon-rss:162185040kB, file-rss:536kB, shmem-rss:0kB
Feb 24 14:28:18 controller3 kernel: cinder-backup: page allocation failure: order:0, mode:0x280da
Feb 24 14:28:18 controller3 kernel: CPU: 13 PID: 2501135 Comm: cinder-backup Kdump: loaded Tainted: G ------------ T 3.10.0-1062.12.1.el7.x86_64 #1
~~~

Also noticed high resource utilization by snmpd on the same controller:

~~~
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
2822235 root      20   0   76.5g  76.3g   3296 R 100.0 40.5 867:53.35 snmpd

# rpm -qf /usr/sbin/snmpd
net-snmp-5.7.2-43.el7_7.3.x86_64
~~~

Tried downgrading the net-snmp version, but still got the same results.

Version-Release number of selected component (if applicable):
openstack-cinder-12.0.8-3.el7ost.noarch     Fri Feb 7 12:53:05 2020
puppet-cinder-12.4.1-5.el7ost.noarch        Fri Feb 7 12:52:15 2020
python2-cinderclient-3.5.0-1.el7ost.noarch  Fri Feb 7 12:50:55 2020
python-cinder-12.0.8-3.el7ost.noarch        Fri Feb 7 12:53:00 2020

How reproducible:

Steps to Reproduce:
1. Create a backup of an OpenStack volume containing a large amount of data.
2. Try to restore multiple backups at the same time.
3. Observe the OOM kill.

Actual results:
cinder-backup gets OOM-killed.

Expected results:
Multiple cinder volume backups should be restorable at the same time. At the moment we are able to restore single volumes, but not multiple volumes simultaneously.

Additional info:
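For convenience, the reproduction steps can also be driven from python-cinderclient. The snippet below is only an illustrative sketch: the auth values, volume name and backup names are placeholders for a test environment, not values taken from this report.

~~~
# Illustrative reproduction sketch using python-cinderclient; all
# credentials and names below are placeholders.
from cinderclient import client as cinder_client
from keystoneauth1 import loading, session

loader = loading.get_plugin_loader('password')
auth = loader.load_from_options(
    auth_url='http://controller:5000/v3',       # placeholder
    username='admin', password='secret',        # placeholders
    project_name='admin',
    user_domain_id='default', project_domain_id='default')
cinder = cinder_client.Client('3', session=session.Session(auth=auth))

# 1. Back up several volumes that already contain a large amount of data.
volumes = cinder.volumes.list(search_opts={'name': 'big-data-volume'})
backups = [cinder.backups.create(v.id, name='bkp-%s' % v.name) for v in volumes]

# (Wait for all backups to reach the 'available' status before continuing;
# the polling loop is omitted here for brevity.)

# 2. Fire all the restores at once. The API calls return immediately, so
#    the cinder-backup service processes them concurrently and its RSS
#    grows with every in-flight restore until the kernel OOM-kills it.
for b in backups:
    cinder.restores.restore(b.id)
~~~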
The amount of memory that cinder-backup may require when restoring 41 compressed volumes simultaneously can be huge. I believe it's even worse for RBD volumes (many customers use RBD for both volumes and backups, and that's more efficient).

Here's the breakdown of the peak memory we need for 1 restore:

- First we use as much memory as the size of the chunk we have stored, which in our case is compressed. If we assume a 50% compression ratio, that's 50% of the original chunk, which can be 1999994880 bytes. So here we use 0.93GB when the ChunkedBackupDriver reads the object [1]:

~~~
with self._get_object_reader(
        container, object_name,
        extra_metadata=extra_metadata) as reader:
    body = reader.read()
~~~

- Then when we decompress the data [2] we need an additional 1.86GB, which is the full original chunk size:

~~~
decompressed = decompressor.decompress(body)
~~~

- Then the ChunkedBackupDriver writes the data [3]:

~~~
volume_file.write(decompressed)
~~~

  What happens behind that call, because this is an RBD volume, is that we use the os-brick connector, which uses the librbd Image object to do the writing [4]:

~~~
def write(self, data):
    self._rbd_volume.image.write(data, self._offset)
~~~

  The write method in librbd calls the rbd_write2 method [5]:

~~~
ret = rbd_write2(self.image, _offset, length, _data, _fadvise_flags)
~~~

  And that method calls create_write_raw with nullptr as the aio_completion parameter [6]:

~~~
bl.push_back(create_write_raw(ictx, buf, len, nullptr));
~~~

  Because of this, create_write_raw copies the data into a different buffer [7]:

~~~
if (ictx->disable_zero_copy || aio_completion == nullptr) {
  // must copy the buffer if writeback/writearound cache is in-use (or using
  // non-AIO)
  return buffer::copy(buf, len);
~~~

  So we end up using another 1.86GB of RAM.

In total a single restore operation needs 0.93GB + 1.86GB + 1.86GB = 4.65GB at its peak. If we run 41 simultaneous operations, we end up using 190.65GB, more than the machine has, hence the OOM kill.

I see 2 improvements that can be made in the Cinder code:

- Help Python free memory faster by setting the body variable to None as soon as we decompress it, and setting the decompressed variable to None as soon as we've written it (sketched below).
- Introduce a limit on concurrent backup & restore operations and queue operations that exceed it.

To mitigate the problem in this deployment they can do any of the following:

- Reduce the number of concurrent restore operations
- Disable compression
- Reduce the size of the chunks with the backup_file_size option

[1]: https://opendev.org/openstack/cinder/src/commit/a154a1360be62eed0e2bf20937503b55659f4701/cinder/backup/chunkeddriver.py#L712
[2]: https://opendev.org/openstack/cinder/src/commit/a154a1360be62eed0e2bf20937503b55659f4701/cinder/backup/chunkeddriver.py#L719
[3]: https://opendev.org/openstack/cinder/src/commit/a154a1360be62eed0e2bf20937503b55659f4701/cinder/backup/chunkeddriver.py#L720
[4]: https://opendev.org/openstack/os-brick/src/commit/49d5616f86d637c846d54cd48c5ed4e17bd6695e/os_brick/initiator/linuxrbd.py#L195
[5]: https://github.com/ceph/ceph/blob/53febd478dfc7282f0948853c117061d96cda9b1/src/pybind/rbd/rbd.pyx#L4321
[6]: https://github.com/ceph/ceph/blob/b2e825debc4d47cede8df86b96af94893241ddf7/src/librbd/librbd.cc#L5826
[7]: https://github.com/ceph/ceph/blob/b2e825debc4d47cede8df86b96af94893241ddf7/src/librbd/librbd.cc#L91-L94
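To illustrate the first improvement, here is a minimal self-contained sketch of the "free memory sooner" idea applied to a restore-style loop. It is not the merged Cinder patch; the chunk size, helper names and output path are made up for the demo.

~~~
# Minimal sketch of the "free memory sooner" idea; not the merged patch.
import zlib

CHUNK_SIZE = 8 * 1024 * 1024                 # stand-in for backup_file_size

def iter_compressed_chunks(count=3):
    # Stand-in for the ChunkedBackupDriver reading backup objects [1].
    for _ in range(count):
        yield zlib.compress(b'\0' * CHUNK_SIZE)

def restore(volume_file):
    for body in iter_compressed_chunks():
        decompressed = zlib.decompress(body)
        # Drop the compressed chunk as soon as it has been decompressed so
        # its memory can be reclaimed while the decompressed data is written.
        body = None
        volume_file.write(decompressed)
        # Drop the decompressed buffer right after the write so neither
        # buffer from this iteration is still alive while the next chunk
        # is read.
        decompressed = None

if __name__ == '__main__':
    with open('/tmp/demo-restored-volume', 'wb') as volume_file:
        restore(volume_file)
~~~

Without the two explicit `= None` assignments, the previous iteration's compressed and decompressed buffers stay referenced until they are rebound on the next pass, so up to three large buffers can be alive at once per restore.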
According to our records, this should be resolved by openstack-cinder-12.0.10-2.el7ost. This build is available now.
This BZ is for the mitigation fix that speeds up the freeing of memory during backup restore operations. There is an additional feature we are working on to limit the number of concurrent "memory heavy" operations, but that one will only be backported to OSP16. That RFE is being tracked by https://bugzilla.redhat.com/show_bug.cgi?id=1866848
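For context, the general shape of that RFE is a cap on in-flight memory-heavy operations. The following is only an illustrative sketch using a counting semaphore, not the actual patch; MAX_CONCURRENT_RESTORES and restore_backup() are made-up names.

~~~
# Illustrative sketch of capping concurrent memory-heavy operations;
# not the actual Cinder implementation tracked in the RFE above.
import threading
import time

MAX_CONCURRENT_RESTORES = 4          # hypothetical tunable
_slots = threading.Semaphore(MAX_CONCURRENT_RESTORES)

def restore_backup(backup_id):
    # Each concurrent restore can peak at several GB (see comment 2), so the
    # (N+1)th caller blocks here until one of the N in-flight restores
    # releases its slot, keeping peak RSS near N * per-restore-peak.
    with _slots:
        print("restoring backup %s" % backup_id)
        time.sleep(1)                # placeholder for the real chunk loop

threads = [threading.Thread(target=restore_backup, args=(i,)) for i in range(41)]
for t in threads:
    t.start()
for t in threads:
    t.join()
~~~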
*** Bug 1810629 has been marked as a duplicate of this bug. ***
Verified on: openstack-cinder-12.0.10-11.el7ost.noarch

Note Eric's comment #17: the fixed-in build is openstack-cinder-12.0.10-9.el7.

I deployed two separate OSP13 systems on two identical servers (CPU/RAM/disk/network):
Titan92 (openstack-cinder-12.0.10-2.el7ost.noarch)
Titan93 (openstack-cinder-12.0.10-11.el7ost.noarch)

On both systems c-vol was backed by Ceph and cinder-backup was NFS-backed.

Created a 3G volume filled with random data, uploaded the volume to Glance, and cloned the same image to both systems. From that image I created 5 volumes on each system and backed up all 5 volumes on each system.

Opened top on both systems to monitor cinder-backup's memory consumption, then restored all 5 backups on each system simultaneously.

The Titan92 (pre-fix) system consistently consumed more RAM than the Titan93 (post-fix) system, on average about 2.5-3 times more. I repeated the restore procedure twice and the RAM consumption trend was re-confirmed.

Unfortunately my resources are not production grade, so I can't simulate ten or more large volumes in the 100G+ range, which explains why my pre-fix system didn't exhibit the reported OOM state. However, as mentioned above, the reduction in RAM consumption was clearly visible when comparing both systems.

Good to verify.
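As an optional aid for the "watch cinder-backup in top" step, the per-process RSS can also be sampled from a small script. This is not part of the formal verification; it assumes psutil is installed on the controller.

~~~
# Optional helper to sample cinder-backup RSS over time; assumes psutil
# is available on the controller host.
import time
import psutil

def cinder_backup_rss_mb():
    total = 0
    for proc in psutil.process_iter(['cmdline', 'memory_info']):
        cmd = ' '.join(proc.info['cmdline'] or [])
        if 'cinder-backup' in cmd and proc.info['memory_info']:
            total += proc.info['memory_info'].rss
    return total / (1024.0 * 1024.0)

while True:
    print('cinder-backup RSS: %.0f MiB' % cinder_backup_rss_mb())
    time.sleep(5)
~~~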
TestOnly bug, shipped in 13z13; needs to be manually closed.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days