Description of problem:
When performing a backup of a large Cinder volume, the cinder-backup service shows as down in 'cinder service-list', even though the backup succeeds. This causes monitoring software to incorrectly report an issue.

Version-Release number of selected component (if applicable):
openstack-cinder-7.0.1-8.el7ost.noarch

How reproducible:
Every time a backup is made of a large volume that takes more than a few minutes to complete.

Steps to Reproduce:
1. cinder backup-create <uuid> --name test --force
2. watch cinder service-list

Actual results:
The cinder-backup service reports down until the backup completes.

Expected results:
The cinder-backup service remains up throughout the backup task.

Additional info:
This was reported on RHOSP 8 using a Ceph backend. However, I also observed the same issue on RHOSP 9 using NFS.
A few questions before I try to recreate this:
1. Internal or external Ceph?
2. If external Ceph, which version?
3. By "large volume", what size tripped this: 5 GB, 10 GB, more?
4. Which steps were used to configure cinder-backup? I think its configuration was only automated starting from OSPd 9.
This is external Ceph:

$ ceph version
ceph version 0.94.3

The volumes that triggered this issue were around or larger than 15 GB.

# grep backup /etc/cinder/cinder.conf | grep -v ^#
backup_ceph_conf = /etc/ceph/ceph.conf
backup_ceph_user = backup
backup_ceph_chunk_size = 134217728
backup_ceph_pool = backup
backup_ceph_stripe_unit = 0
backup_ceph_stripe_count = 0
backup_driver = cinder.backup.drivers.ceph

It seems to happen if the backup has been running longer than about 1 minute. The service stays 'down' until the backup is complete and then returns to 'up'. For clarification: when I observed this on OSP 9 with NFS, the service would flap up and down periodically rather than stay down for the duration.
I've just tested this with an external Ceph. I created an image from a ~8 GB CentOS ISO, and Cinder created a 10 GB volume from that image. The backup took a good few minutes while I watched 'cinder service-list'. The backup service remained up the whole time; 'watch -n2 -d' should have picked up any change. I'll try to reproduce again with a larger volume. Maybe this only happens with mounted volumes? I noted the --force option in reproduce step 1.
Most of these volumes are mounted, however, this is reproducible on 'available' volumes as well.
I've created a 40 GB raw image and, from it, a 41 GB volume. The cinder-backup service remained up during the whole 17 minutes the backup was being created.

Next I deleted that backup, booted a small Cirros instance, and attached the same 41 GB volume. I again created a backup of that volume, now with --force. It took 18 minutes, and the cinder-backup service again remained up the whole time. Actually, the service has remained up since last year (happy new year :) Today is 2017-01-02:

openstack-cinder-backup.service - OpenStack Cinder Backup Server
   Loaded: loaded (/usr/lib/systemd/system/openstack-cinder-backup.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2016-12-29 12:27:27 UTC; 3 days ago
 Main PID: 4401 (cinder-backup)
   CGroup: /system.slice/openstack-cinder-backup.service
           └─4401 /usr/bin/python2 /usr/bin/cinder-backup

My external Ceph version: ceph version 10.2.2-38.el7cp

I'll try a few more attempts and report if I see any service status change.
I can reproduce this behaviour on Mitaka with the NFS backend, even with much smaller volumes. The problem seems to be CPU load, not the size of the volume. The cinder backup.log while a backup is running on a 6-core Xeon E5-2603 system shows messages like these:

2017-05-12 11:34:11.551 12139 DEBUG oslo_service.periodic_task [req-65bd26d1-eeb1-48c9-b63b-62beb05e8ce6 Openstack Admin 15d30213519a4e55aa38821b1dc9f21c - - -] Running periodic task BackupManager._publish_service_capabilities run_periodic_tasks /usr/lib/python2.7/site-packages/oslo_service/periodic_task.py:215

2017-05-12 11:34:11.566 12139 WARNING oslo.service.loopingcall [req-65bd26d1-eeb1-48c9-b63b-62beb05e8ce6 Openstack Admin 15d30213519a4e55aa38821b1dc9f21c - - -] Function 'cinder.service.Service.report_state' run outlasted interval by 106.58 sec

This causes the cinder service to be marked as "down" after a few minutes.
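Under eventlet, report_state runs in a greenthread and only gets CPU time when the other greenthreads cooperatively yield. A minimal sketch of the starvation effect, in plain Python with all names illustrative (no eventlet required; the generator's yield stands in for eventlet.sleep(0)):

```python
def backup_worker(total_chunks, yield_every):
    """Simulated CPU-bound backup loop: it only yields control
    (like eventlet.sleep(0)) once every `yield_every` chunks."""
    for i in range(total_chunks):
        # ... imagine compressing/hashing one chunk here (pure CPU work) ...
        if (i + 1) % yield_every == 0:
            yield i  # only now could the scheduler run report_state

def heartbeat_opportunities(total_chunks, yield_every):
    # Count how many chances the heartbeat greenthread would get.
    return sum(1 for _ in backup_worker(total_chunks, yield_every))

# Large chunks -> rare yields -> report_state misses its interval.
print(heartbeat_opportunities(100, 50))
# Smaller chunks -> frequent yields -> the service can report in time.
print(heartbeat_opportunities(100, 10))
```

With rare yields the heartbeat gets almost no chances to run, which matches the "run outlasted interval" warning above.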
I also see this issue when I put the cinder-backup service under heavy load (multiple cinder backup-create tasks). Platform: CentOS 7.3.1611 + Liberty (EOL).
1. Service status goes down.
2. "outlasted interval" message appears in the log.
3. Service goes back to "up" once the tasks complete.
I could track the problem down further. In cinder/backup/chunkeddriver.py, line 341, eventlet.sleep(0) is called only once per data chunk. With the default chunk size (backup_file_size) of 1999994880 bytes (~2 GB), this simply does not happen often enough, especially on slower (1 GBit) Ethernet connections. With a smaller backup_file_size (499974144 bytes), eventlet.sleep(0) fires more often and gives the backup service enough time to report its state.
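Based on the analysis above, a possible workaround until a proper fix lands is to lower the chunk size in cinder.conf (value taken from the comment above; as I understand it, backup_file_size should remain a multiple of backup_sha_block_size for the chunked driver):

```ini
[DEFAULT]
# Smaller backup chunks mean eventlet.sleep(0) in chunkeddriver.py
# fires more often, letting report_state run on schedule.
backup_file_size = 499974144
```

This trades slightly more per-chunk overhead for a responsive service heartbeat.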
We're hitting this issue with RHOSP 10 as well.
I believe this to be the same problem reported in bz #1464146. It seems to be caused by Cinder's data compression (a CPU-intensive operation) during backups being done directly in the greenthread, which prevents switching to other greenthreads: the CPU-intensive operation starves everything else. The solution would be to run the compression in a native thread so it doesn't interfere with greenthread switching.
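The native-thread approach can be sketched as follows. In cinder this would be done with eventlet.tpool.execute(); since eventlet may not be available everywhere, this sketch uses a plain thread pool to show the same pattern (all names here are illustrative, not Cinder's actual code):

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

# Offload CPU-bound compression to a native OS thread so the cooperative
# event loop (greenthreads) keeps scheduling the heartbeat meanwhile.
# zlib.compress releases the GIL for large buffers, so this genuinely
# lets other threads run.
_native_pool = ThreadPoolExecutor(max_workers=1)

def compress_chunk(data, level=6):
    # Cinder's equivalent would be eventlet.tpool.execute(zlib.compress, data, level).
    return _native_pool.submit(zlib.compress, data, level).result()

chunk = b"backup data " * 4096
compressed = compress_chunk(chunk)
```

The greenthread still blocks in .result() for its own chunk, but because the heavy work happens on a native thread, the hub can switch to report_state while it waits.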