Bug 1403948

Summary: Cinder-backup service reports as down during backup of large volumes
Product: Red Hat OpenStack
Component: openstack-cinder
Version: 8.0 (Liberty)
Target Release: 8.0 (Liberty)
Hardware: All
OS: Linux
Status: CLOSED EOL
Severity: medium
Priority: medium
Keywords: Triaged, ZStream
Reporter: nalmond
Assignee: Gorka Eguileor <geguileo>
QA Contact: Avi Avraham <aavraham>
CC: cschwede, dhill, eharney, geguileo, gregor.bregenzer, mario.sommer, mfuruta, mschuppe, nalmond, pativ5dev, pcaruana, pgrist, srevivo
Type: Bug
Bug Depends On: 1464146
Last Closed: 2019-05-01 20:39:49 UTC

Description nalmond 2016-12-12 17:00:53 UTC
Description of problem:
When performing a backup of a large Cinder volume, the cinder-backup service shows as down in 'cinder service-list'; however, the backup still succeeds. This causes monitoring software to incorrectly report that there is an issue.

Version-Release number of selected component (if applicable):
openstack-cinder-7.0.1-8.el7ost.noarch

How reproducible:
Every time a backup is made of a large volume that takes more than a few minutes to complete

Steps to Reproduce:
1. cinder backup-create <uuid> --name test --force
2. watch cinder service-list

Actual results:
The cinder-backup service will report as down until the backup is complete.

Expected results:
The cinder-backup service will remain up throughout the backup task.

Additional info:
This was reported on RHOSP 8 using a Ceph backend. However, I also observed the same issue on RHOSP 9 using NFS.

Comment 2 Tzach Shefi 2016-12-14 14:20:15 UTC
A few questions before I try to recreate this:

1. Internal or external Ceph?

2. If external Ceph, which version?

3. By "large volume", what size tripped this: 5G, 10G, or more?

4. Which steps were used to configure Cinder backup?
I think its configuration was only automated starting from OSPd 9.

Comment 3 nalmond 2016-12-14 22:49:57 UTC
This is external ceph:
$ ceph version
ceph version 0.94.3

The ceph volumes that triggered this issue were around or larger than 15 GB.

# grep backup /etc/cinder/cinder.conf|grep -v ^#
backup_ceph_conf = /etc/ceph/ceph.conf
backup_ceph_user = backup
backup_ceph_chunk_size = 134217728
backup_ceph_pool = backup
backup_ceph_stripe_unit = 0
backup_ceph_stripe_count = 0
backup_driver = cinder.backup.drivers.ceph


It seems to happen if the backup has been running longer than about 1 minute. The service will stay 'down' until the backup is complete and then return to 'up'.


For clarification, when I observed this in OSP 9 with NFS, I noticed the service would flap up and down periodically, not stay down for the duration.
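
For context on the ~1 minute threshold: 'cinder service-list' does not probe the process itself; it compares each service's last database heartbeat against the service_down_time option (60 seconds by default, with report_state scheduled every report_interval = 10 seconds). A minimal sketch of that liveness check (illustrative, not Cinder's actual code):

    from datetime import datetime, timedelta

    SERVICE_DOWN_TIME = timedelta(seconds=60)  # default service_down_time

    def service_is_up(last_heartbeat, now=None):
        # A service counts as 'up' only if its last heartbeat (the
        # updated_at column of the services table) is recent enough.
        now = now or datetime.utcnow()
        return (now - last_heartbeat) <= SERVICE_DOWN_TIME

If the heartbeat is blocked for longer than service_down_time, the service flips to 'down' even though the backup itself is progressing normally, which matches the ~1 minute observation above.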

Comment 4 Tzach Shefi 2016-12-29 13:52:00 UTC
I've just tested this with an external ceph.
Created an image from a ~8G CentOS ISO.
Cinder created a 10G volume from that image.
The Cinder backup took a good few minutes while I watched cinder service-list.
The backup service remained up the whole time; watch -n2 -d would have picked up any change.

I'll try to reproduce again with a larger volume. 

Maybe this only happens with mounted volumes?
I noticed the --force option in step 1 of the reproduce steps.

Comment 5 nalmond 2016-12-30 18:58:04 UTC
Most of these volumes are mounted; however, this is reproducible on 'available' volumes as well.

Comment 6 Tzach Shefi 2017-01-02 07:27:40 UTC
I created a 40G raw image and, from it, a 41G volume.
The Cinder backup service remained up during the whole 17 minutes the backup took.

Next, I deleted the above cinder backup.

Booted a small Cirros instance and attached that same 41G volume.
Again created a backup of that volume, now with --force.
It took 18 minutes; the Cinder backup service again remained up the whole time.

Actually, the service has remained up since last year (happy new year :)
Today is 2017-01-02

openstack-cinder-backup.service - OpenStack Cinder Backup Server
   Loaded: loaded (/usr/lib/systemd/system/openstack-cinder-backup.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2016-12-29 12:27:27 UTC; 3 days ago
 Main PID: 4401 (cinder-backup)
   CGroup: /system.slice/openstack-cinder-backup.service
           └─4401 /usr/bin/python2 /usr/bin/cinder-backup -


My external Ceph version:
ceph version 10.2.2-38.el7cp 

I'll try a few more attempts and report if I see any service status change.

Comment 12 Mario Sommer 2017-05-12 09:55:53 UTC
I can reproduce this behaviour on Mitaka with the NFS backend, even with much smaller volumes. The problem seems to be the CPU load, not the size of the volume.

Cinder's backup.log, while a backup is running on a 6-core Xeon E5-2603 system, shows this type of message...

2017-05-12 11:34:11.551 12139 DEBUG oslo_service.periodic_task [req-65bd26d1-eeb1-48c9-b63b-62beb05e8ce6 Openstack Admin 15d30213519a4e55aa38821b1dc9f21c - - -] Running periodic task BackupManager._publish_service_capabilities run_periodic_tasks /usr/lib/python2.7/site-packages/oslo_service/periodic_task.py:215
2017-05-12 11:34:11.566 12139 WARNING oslo.service.loopingcall [req-65bd26d1-eeb1-48c9-b63b-62beb05e8ce6 Openstack Admin 15d30213519a4e55aa38821b1dc9f21c - - -] Function 'cinder.service.Service.report_state' run outlasted interval by 106.58 sec

This causes the cinder service to be marked as "down" after a few minutes.
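
The "outlasted interval" warning above comes from oslo.service's looping call, which schedules Service.report_state on a greenthread at a fixed interval. Since eventlet greenthreads are cooperatively scheduled, a CPU-bound task that never yields keeps the heartbeat from firing on time. A self-contained demonstration of the starvation pattern (illustrative only, not Cinder code):

    import time
    import eventlet

    def heartbeat():
        # Stands in for report_state, scheduled every report_interval.
        while True:
            print("heartbeat at %.1f" % time.time())
            eventlet.sleep(1)

    def cpu_hog():
        # Stands in for compressing a large chunk: a busy loop with no
        # eventlet.sleep(0) call, so it never yields to other greenthreads.
        end = time.time() + 5
        while time.time() < end:
            pass

    eventlet.spawn(heartbeat)
    eventlet.sleep(0)   # let the heartbeat start ticking
    eventlet.spawn(cpu_hog)
    eventlet.sleep(8)   # the heartbeat goes silent for ~5 seconds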

Comment 13 pativ5dev 2017-05-15 09:47:31 UTC
I also see this issue when I send heavy requests (multiple cinder backup-create tasks) to the cinder-backup service.
Platform: CentOS 7.3.1611 + Liberty (EOL)
1. Service status goes down.
2. "Outlasted interval" messages appear in the log.
3. The service goes back to "up" once the tasks complete.

Comment 14 Mario Sommer 2017-05-15 13:42:11 UTC
I was able to track the problem down further.
It seems that in cinder/backup/chunkeddriver.py, line 341, eventlet.sleep(0) is called only once per data chunk.
With the default chunk size (backup_file_size) of 1999994880 bytes (~2 GB), this just does not happen often enough, especially over slower (1 Gbit) Ethernet connections.
With a smaller backup_file_size (499974144 bytes), eventlet.sleep(0) fires more often and gives the backup service enough time to report its state.
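
A paraphrased sketch of the per-chunk loop being described (the names are illustrative; the real code in cinder/backup/chunkeddriver.py differs in detail):

    # Hypothetical paraphrase of the chunked backup loop, not the
    # driver's actual code.
    while True:
        data = volume_file.read(backup_file_size)  # ~2 GB by default
        if not data:
            break
        chunk = compressor.compress(data)  # CPU-bound, no yields inside
        object_store.put(chunk)            # I/O
        eventlet.sleep(0)                  # the only yield: once per chunk

With the default chunk size, the greenthread can hold the CPU for minutes between yields. The workaround described above corresponds to setting, in the [DEFAULT] section of cinder.conf:

    backup_file_size = 499974144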

Comment 17 David Hill 2017-09-09 15:22:04 UTC
We're hitting this issue with RHOSP 10 as well.

Comment 18 Gorka Eguileor 2017-09-26 12:14:12 UTC
I believe this to be the same problem reported in bz #1464146.

This seems to be caused by Cinder's data compression (a CPU-intensive operation) during backups being done directly in the greenthread. Because greenthreads are cooperatively scheduled, the CPU-intensive compression prevents switching to other greenthreads, including the one that reports the service's state.

The solution would be to run the compression in a native thread so it doesn't interfere with greenthread switching.
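
A sketch of that approach, assuming eventlet's standard tpool API (the exact upstream fix may differ): tpool.execute() runs a function in a native OS thread pool and suspends only the calling greenthread until it finishes, so report_state can keep firing while compression runs. Reusing the illustrative names from the earlier sketch:

    from eventlet import tpool

    # Instead of compressing directly in the greenthread:
    #     chunk = compressor.compress(data)
    # run the CPU-bound call in a native thread; the greenthread yields
    # while waiting, so the heartbeat stays on schedule:
    chunk = tpool.execute(compressor.compress, data)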