Bug 1464146

Summary: Problem in creating several cinder backups at same time
Product: Red Hat OpenStack Reporter: Eric Beaudoin <ebeaudoi>
Component: openstack-cinder    Assignee: Gorka Eguileor <geguileo>
Status: CLOSED ERRATA QA Contact: Avi Avraham <aavraham>
Severity: medium Docs Contact:
Priority: medium    
Version: 9.0 (Mitaka)    CC: bschmaus, colm77, cschwede, dciabrin, dhill, ebeaudoi, geguileo, joflynn, lkuchlan, mbayer, pgrist, scohen, srevivo, tshefi
Target Milestone: ---    Keywords: Reopened, Triaged, ZStream
Target Release: 9.0 (Mitaka)    Flags: lkuchlan: automate_bug+
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-cinder-8.1.1-19.el7ost Doc Type: Bug Fix
Doc Text:
Previously, certain method calls in backup and restore operations blocked eventlet's thread switching. Consequently, operations were slower and connection errors were observed in the database and RabbitMQ logs. With this update, the blocking method calls are proxied to native threads so that they no longer block thread switching. As a result, restore and backup operations are faster and the connection issues are resolved.
Story Points: ---
Clone Of:
: 1504661 (view as bug list) Environment:
Last Closed: 2018-07-05 12:31:02 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1504661, 1504670, 1504671, 1542607    
Bug Blocks: 1403948, 1549598    
Attachments:
Description                                                 Flags
grepped database errors in cinder backup                    none
haproxy segment showing connectivity during these times     none
haproxy's doc on tcplog                                     none

Comment 39 Michael Bayer 2017-08-24 16:31:19 UTC
Created attachment 1317835 [details]
grepped database errors in cinder backup

These are the individual occurrences of the error within the backup log.

Comment 40 Michael Bayer 2017-08-24 16:34:59 UTC
Created attachment 1317838 [details]
haproxy segment showing connectivity during these times

In this file, the errors observed in the service line up exactly with requests in haproxy that look like this:

Aug 16 22:52:22 localhost haproxy[2681]: 10.223.147.12:58600 [16/Aug/2017:22:51:21.883] mysql mysql/cld1-control-0 1/0/60241 92 -- 966/957/957/957/0 0/0


The key is the number "92": you can grep this file literally for " 92 " (with the leading/trailing space) to locate these lines. It indicates that a total of 92 bytes were transmitted from server to client on the connection, and these lines line up with where the connections are getting cut off. The " 92 " lines match the errors in backup.log exactly, in both count and timestamps.

Comment 41 Michael Bayer 2017-08-24 16:37:47 UTC
Created attachment 1317839 [details]
haproxy's doc on tcplog

Comment 42 Michael Bayer 2017-08-24 17:00:12 UTC
We observe that the disconnects in haproxy indicate the connection was established, was not queued, and 92 bytes were sent successfully; then the connection was unexpectedly ended. The time spent on each connection was very long: a minimum of about 12 seconds, and at the high end 60 - 70 seconds. That is, a client could be connected for a minute and receive only these 92 bytes before being cut off at the authentication phase.

Over at https://bugs.mysql.com/bug.php?id=28359, if you search for the comment "[31 Aug 2007 9:21] Magnus Blåudd", he gives a great description of why you would get dropped with exactly this behavior: the MySQL server specifically cuts the client off if it gets no response within a fixed timeout. The comment there says five seconds, but from my own testing on current MariaDB versions it is ten seconds. This timeout is specific to the authentication phase because the server is trying to get rid of clients that are not able to communicate up front.

If we grep for the entries that show this, we see:

$ cat haproxy_17081622.log | grep " 92 " | cut -c 62-200
[16/Aug/2017:22:25:56.738] mysql mysql/cld1-control-0 1/0/12623 92 -- 913/866/866/866/0 0/0
[16/Aug/2017:22:26:11.855] mysql mysql/cld1-control-0 1/0/11953 92 -- 925/888/888/888/0 0/0
[16/Aug/2017:22:29:30.138] mysql mysql/cld1-control-0 1/0/13588 92 -- 898/888/888/888/0 0/0
[16/Aug/2017:22:38:03.639] mysql mysql/cld1-control-0 1/0/11569 92 -- 936/926/926/926/0 0/0
[16/Aug/2017:22:42:46.359] mysql mysql/cld1-control-0 1/0/15211 92 -- 938/928/928/928/0 0/0
[16/Aug/2017:22:43:03.408] mysql mysql/cld1-control-0 1/0/22303 92 -- 962/953/953/953/0 0/0
[16/Aug/2017:22:43:03.411] mysql mysql/cld1-control-0 1/0/22304 92 -- 962/953/953/953/0 0/0
[16/Aug/2017:22:43:03.406] mysql mysql/cld1-control-0 1/0/22314 92 -- 962/953/953/953/0 0/0
[16/Aug/2017:22:51:21.883] mysql mysql/cld1-control-0 1/0/60241 92 -- 966/957/957/957/0 0/0
[16/Aug/2017:22:51:18.117] mysql mysql/cld1-control-0 1/0/64011 92 -- 965/956/956/956/0 0/0
[16/Aug/2017:22:51:40.785] mysql mysql/cld1-control-0 1/0/41347 92 -- 964/955/955/955/0 0/0
[16/Aug/2017:22:51:09.935] mysql mysql/cld1-control-0 1/0/72200 92 -- 963/954/954/954/0 0/0
[16/Aug/2017:22:51:13.076] mysql mysql/cld1-control-0 1/0/69062 92 -- 962/953/953/953/0 0/0
[16/Aug/2017:22:51:06.801] mysql mysql/cld1-control-0 1/0/75340 92 -- 961/952/952/952/0 0/0
[16/Aug/2017:22:51:34.490] mysql mysql/cld1-control-0 1/0/47655 92 -- 960/951/951/951/0 0/0
[16/Aug/2017:22:51:30.058] mysql mysql/cld1-control-0 1/0/52090 92 -- 959/950/950/950/0 0/0
[16/Aug/2017:22:52:09.598] mysql mysql/cld1-control-0 1/0/12555 92 -- 958/949/949/949/0 0/0
[16/Aug/2017:22:51:57.533] mysql mysql/cld1-control-0 1/0/24623 92 -- 957/948/948/948/0 0/0
[16/Aug/2017:22:52:09.618] mysql mysql/cld1-control-0 1/0/12541 92 -- 956/947/947/947/0 0/0
[16/Aug/2017:22:52:09.625] mysql mysql/cld1-control-0 1/0/12539 92 -- 955/946/946/946/0 0/0


We want to look at the five-digit number after the "1/0/", e.g. 12623, 11953, 13588, 11569, etc. That is the number of milliseconds each connection was open before ending. None are fewer than roughly 11 seconds, which corresponds to the ten-second timeout on the server. A bunch are much longer, indicating that the client took a long time to notice the connection had been dropped.
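
For reference, here is a small, illustrative parsing sketch (Python; it assumes the tcplog field layout shown above, i.e. the Tw/Tc/Tt timing triple followed by the bytes-read column) that pulls those two values out of a line:

# Illustrative only: extract the total connection time (the "1/0/60241"
# Tw/Tc/Tt triple) and the bytes-read column ("92") from a tcplog line.
import re

LINE = ('Aug 16 22:52:22 localhost haproxy[2681]: 10.223.147.12:58600 '
        '[16/Aug/2017:22:51:21.883] mysql mysql/cld1-control-0 '
        '1/0/60241 92 -- 966/957/957/957/0 0/0')

m = re.search(r'\s(\d+)/(\d+)/(\d+)\s+(\d+)\s', LINE)
if m:
    tw, tc, tt, bytes_read = (int(g) for g in m.groups())
    # prints: total time: 60.2s, bytes from server: 92
    print('total time: %.1fs, bytes from server: %d' % (tt / 1000.0, bytes_read))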

This pattern, of the client taking a very long time to read data that has already been sent to it, points to CPU-bound operations preventing the non-blocking IO pattern used by Cinder from attending to socket activity in a timely manner. The cause is that Cinder, like many OpenStack services, is based on the eventlet server model, which uses a single OS thread to handle all IO and CPU work, pausing the handling of pending IO until the single task it is currently working on is done. If these backup processes block on CPU for a long time, other greenlets will be stuck waiting for a long time, and that is exactly what this looks like.
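
To make that concrete, here is a minimal, self-contained sketch (not Cinder code) showing a CPU-bound greenthread delaying an IO-style greenthread that only wants to wake up once per second:

import time

import eventlet


def heartbeat():
    # Stands in for an IO-bound greenthread (e.g. a DB or RabbitMQ keepalive).
    for _ in range(3):
        start = time.time()
        eventlet.sleep(1)  # wants to wake roughly once per second
        print('heartbeat woke %.1fs late' % (time.time() - start - 1))


def compress_chunk():
    # Stands in for CPU-bound work (e.g. compressing a backup chunk); it never
    # yields, so no other greenthread can run until it returns.
    sum(i * i for i in range(20000000))


hb = eventlet.spawn(heartbeat)
eventlet.spawn(compress_chunk)
hb.wait()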

OpenStack services like Keystone and Nova have already moved off of eventlet and onto Apache / mod_wsgi to resolve this inefficiency. Older versions of Nova at least use many worker processes in order to spread requests across multiple processes. Looking at Cinder, however, even in current versions the backup service (like the other Cinder services) uses only a single process. If Cinder were to allow a "workers" configuration, you would see it being passed to oslo_service right here: https://github.com/openstack/cinder/blob/master/cinder/cmd/backup.py#L59 . If we look at Nova, for example, we can see a "workers" number being passed: https://github.com/openstack/nova/blob/master/nova/cmd/conductor.py#L45 .
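
For illustration only (the option name "backup_workers" and the wiring here are hypothetical, not Cinder's actual code), a workers value handed to oslo_service would look roughly like this:

from oslo_config import cfg
from oslo_service import service

CONF = cfg.CONF
CONF.register_opts([cfg.IntOpt('backup_workers', default=1,
                               help='Number of backup worker processes')])


def serve(server):
    # 'server' is an oslo_service ServiceBase instance.  With workers > 1,
    # launch() forks that many processes, so CPU-heavy requests are spread
    # across processes instead of piling up on a single eventlet hub.
    launcher = service.launch(CONF, server, workers=CONF.backup_workers)
    launcher.wait()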

The setting you want to adjust in your my.cnf.d/galera.cnf is connect_timeout: https://dev.mysql.com/doc/refman/5.7/en/server-system-variables.html#sysvar_connect_timeout . Change this to 120 seconds. However, this is really a bigger architectural issue in Cinder: Cinder should provide a configurable workers value, and should also move off of the eventlet model and run under Apache mod_wsgi.
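
As a concrete reference for the workaround (the file path and section are as typically used in Galera deployments; adjust to your environment), the change amounts to:

# /etc/my.cnf.d/galera.cnf
[mysqld]
connect_timeout = 120

connect_timeout is also a dynamic global variable, so it can additionally be changed at runtime with SET GLOBAL until the configuration file change is rolled out.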

Comment 43 Eric Beaudoin 2017-09-08 12:22:33 UTC
Hi Michael,
The customer and the customer's customer confirmed that changing the galera "connect_timeout" value to 120 fixed the problem of creating cinder backups using scripts.
Thanks for the hard work.
Eric

Comment 46 Gorka Eguileor 2017-09-26 12:06:13 UTC
I believe @mbayer's assessment is correct.

This seems to be caused by Cinder's data compression (a CPU-intensive operation) during backups being done directly in the greenthread, which prevents switching to other greenthreads.

Given enough greenthreads doing compression, they would end up running mostly serially and prevent other greenthreads from running.

The solution would be to run the compression on a native thread so it does not interfere with greenthread switching.
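
A minimal sketch of that approach (illustrative only, not the actual patch), using eventlet's tpool to push the compression into a native thread:

import zlib

from eventlet import tpool


def compress_chunk(data):
    # zlib.compress is CPU-bound; called directly in a greenthread it blocks
    # every other greenthread until it returns.
    return zlib.compress(data, 9)


def compress_chunk_native(data):
    # tpool.execute runs the call in a real OS thread and suspends only the
    # calling greenthread, so DB/RabbitMQ heartbeat greenthreads keep running.
    return tpool.execute(compress_chunk, data)

eventlet also provides tpool.Proxy, which wraps an object so that all of its blocking method calls are dispatched to native threads; that is the kind of change described in the Doc Text above.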

Comment 48 Benjamin Schmaus 2018-03-12 00:46:51 UTC
*** Bug 1549598 has been marked as a duplicate of this bug. ***

Comment 64 Tzach Shefi 2018-06-21 09:19:46 UTC
Verified on:
openstack-cinder-8.1.1-20.el7ost.noarch

Ran both loops: created 9 volumes, then backed them up.
Reran the backup-and-delete cycle on the 9 volumes for two more cycles.
Nothing failed; OK to verify.

Comment 66 errata-xmlrpc 2018-07-05 12:31:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2136