Bug 1464146 - Problem creating several cinder backups at the same time
Summary: Problem creating several cinder backups at the same time
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-cinder
Version: 9.0 (Mitaka)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 9.0 (Mitaka)
Assignee: Gorka Eguileor
QA Contact: Avi Avraham
URL:
Whiteboard:
Duplicates: 1549598 (view as bug list)
Depends On: 1504661 1504670 1504671 1542607
Blocks: 1403948 1549598
 
Reported: 2017-06-22 14:02 UTC by Eric Beaudoin
Modified: 2022-08-16 12:54 UTC
CC List: 14 users

Fixed In Version: openstack-cinder-8.1.1-19.el7ost
Doc Type: Bug Fix
Doc Text:
Previously, certain method calls for backup/restore operations blocked the eventlet's thread switching. Consequently, operations were slower and connection errors were observed in the database and RabbitMQ logs. With this update, proxy blocking method calls were changed into native threads to prevent blocking. As a result, restore and backup operations are faster and the connection issues are resolved.
Clone Of:
Clones: 1504661 (view as bug list)
Environment:
Last Closed: 2018-07-05 12:31:02 UTC
Target Upstream Version:
Embargoed:
lkuchlan: automate_bug+


Attachments
grepped database errors in cinder backup (3.29 KB, text/plain) - 2017-08-24 16:31 UTC, Michael Bayer
haproxy segment showing connectivity during these times (198.46 KB, text/plain) - 2017-08-24 16:34 UTC, Michael Bayer
haproxy's doc on tcplog (9.63 KB, text/plain) - 2017-08-24 16:37 UTC, Michael Bayer


Links
Launchpad 1719580 - 2017-09-26 12:00:02 UTC
OpenStack gerrit 507510 - MERGED: Run backup compression on native thread - 2021-02-12 17:30:39 UTC
Red Hat Issue Tracker OSP-4648 - 2022-08-16 12:54:47 UTC
Red Hat Product Errata RHBA-2018:2136 - 2018-07-05 12:31:33 UTC

Comment 39 Michael Bayer 2017-08-24 16:31:19 UTC
Created attachment 1317835 [details]
grepped database errors in cinder backup

These are the individual occurrences of the error within the backup log.

Comment 40 Michael Bayer 2017-08-24 16:34:59 UTC
Created attachment 1317838 [details]
haproxy segment showing connectivity during these times

In this file, the errors observed in the service line up exactly with haproxy requests that look like this:

Aug 16 22:52:22 localhost haproxy[2681]: 10.223.147.12:58600 [16/Aug/2017:22:51:21.883] mysql mysql/cld1-control-0 1/0/60241 92 -- 966/957/957/957/0 0/0


The key is the number "92": you can grep this file literally for " 92 " (with the leading/trailing space) to locate these lines. It indicates that a total of 92 bytes were transmitted from server to client, and these lines line up with where the connections are being cut off. There are exactly as many " 92 " lines, with the same timestamps, as there are errors in the backup.log.

Comment 41 Michael Bayer 2017-08-24 16:37:47 UTC
Created attachment 1317839 [details]
haproxy's doc on tcplog

Comment 42 Michael Bayer 2017-08-24 17:00:12 UTC
The disconnects we observe in haproxy indicate that the connection was established, was not queued, and 92 bytes were sent successfully; then the connection was unexpectedly ended. The time spent on each connection was very long: a minimum of about 12 seconds, and in some cases 60 to 70 seconds. In other words, a connection could be open for a minute and receive only these 92 bytes before being cut off at the authentication phase.

Over at https://bugs.mysql.com/bug.php?id=28359, if you search for the comment "[31 Aug 2007 9:21] Magnus Blåudd", there is a good description of why a client gets dropped with exactly this behavior: the MySQL server cuts the client off if it gets no response within a fixed timeout. The comment there says five seconds, but from my own testing on current MariaDB versions it is ten seconds. This timeout is specific to the authentication phase, because the server is trying to get rid of clients that are unable to communicate up front.

If we grep for the entries that show this, we see:

$ cat haproxy_17081622.log | grep " 92 " | cut -c 62-200
[16/Aug/2017:22:25:56.738] mysql mysql/cld1-control-0 1/0/12623 92 -- 913/866/866/866/0 0/0
[16/Aug/2017:22:26:11.855] mysql mysql/cld1-control-0 1/0/11953 92 -- 925/888/888/888/0 0/0
[16/Aug/2017:22:29:30.138] mysql mysql/cld1-control-0 1/0/13588 92 -- 898/888/888/888/0 0/0
[16/Aug/2017:22:38:03.639] mysql mysql/cld1-control-0 1/0/11569 92 -- 936/926/926/926/0 0/0
[16/Aug/2017:22:42:46.359] mysql mysql/cld1-control-0 1/0/15211 92 -- 938/928/928/928/0 0/0
[16/Aug/2017:22:43:03.408] mysql mysql/cld1-control-0 1/0/22303 92 -- 962/953/953/953/0 0/0
[16/Aug/2017:22:43:03.411] mysql mysql/cld1-control-0 1/0/22304 92 -- 962/953/953/953/0 0/0
[16/Aug/2017:22:43:03.406] mysql mysql/cld1-control-0 1/0/22314 92 -- 962/953/953/953/0 0/0
[16/Aug/2017:22:51:21.883] mysql mysql/cld1-control-0 1/0/60241 92 -- 966/957/957/957/0 0/0
[16/Aug/2017:22:51:18.117] mysql mysql/cld1-control-0 1/0/64011 92 -- 965/956/956/956/0 0/0
[16/Aug/2017:22:51:40.785] mysql mysql/cld1-control-0 1/0/41347 92 -- 964/955/955/955/0 0/0
[16/Aug/2017:22:51:09.935] mysql mysql/cld1-control-0 1/0/72200 92 -- 963/954/954/954/0 0/0
[16/Aug/2017:22:51:13.076] mysql mysql/cld1-control-0 1/0/69062 92 -- 962/953/953/953/0 0/0
[16/Aug/2017:22:51:06.801] mysql mysql/cld1-control-0 1/0/75340 92 -- 961/952/952/952/0 0/0
[16/Aug/2017:22:51:34.490] mysql mysql/cld1-control-0 1/0/47655 92 -- 960/951/951/951/0 0/0
[16/Aug/2017:22:51:30.058] mysql mysql/cld1-control-0 1/0/52090 92 -- 959/950/950/950/0 0/0
[16/Aug/2017:22:52:09.598] mysql mysql/cld1-control-0 1/0/12555 92 -- 958/949/949/949/0 0/0
[16/Aug/2017:22:51:57.533] mysql mysql/cld1-control-0 1/0/24623 92 -- 957/948/948/948/0 0/0
[16/Aug/2017:22:52:09.618] mysql mysql/cld1-control-0 1/0/12541 92 -- 956/947/947/947/0 0/0
[16/Aug/2017:22:52:09.625] mysql mysql/cld1-control-0 1/0/12539 92 -- 955/946/946/946/0 0/0


We want to look at the five-digit number after the "1/0/", e.g. 12623, 11953, 13588, 11569, etc. That is the number of milliseconds each connection was open before ending. None are fewer than roughly 11 seconds, which corresponds to the ten-second timeout on the server side. A good number are much longer, indicating that the client took a long time to even notice that the connection had been dropped.
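
For reference, here is a minimal Python sketch (mine, not part of any attachment) that pulls those timer values out of the " 92 " lines; it assumes the timer field in haproxy's tcplog format is Tw/Tc/Tt, i.e. milliseconds queued, milliseconds to connect to the server, and total session milliseconds (see the tcplog doc attached above):

import re

with open("haproxy_17081622.log") as f:
    for line in f:
        m = re.search(r" (\d+)/(\d+)/(\d+) 92 ", line)
        if m:
            tw, tc, tt = (int(x) for x in m.groups())
            # e.g. "queued 1 ms, connect 0 ms, open 60241 ms before cut off"
            print("queued %d ms, connect %d ms, open %d ms before cut off"
                  % (tw, tc, tt))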

This pattern, where the client takes a very long time to read data that has already been sent to it, points to CPU-bound operations preventing the non-blocking I/O model used by Cinder from attending to socket activity in a timely manner. Cinder, like many OpenStack services, is based on the eventlet server model, which uses a single OS thread to handle all I/O and CPU work, pausing the handling of pending I/O until whatever it is currently working on is done. If the backup processes are blocking on CPU for a long time, other greenlets will sit waiting for a long time, and that is exactly what this looks like.
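
As a rough, self-contained illustration of that behavior (a sketch, not Cinder code): under eventlet, a CPU-bound greenthread starves every other greenthread, because the hub only switches greenlets at I/O or explicit yield points.

import time

import eventlet

eventlet.monkey_patch()


def heartbeat():
    # Stands in for the periodic database / RabbitMQ traffic that was timing out.
    for _ in range(5):
        print("heartbeat at %.1f" % time.time())
        eventlet.sleep(1)


def compress():
    # Stands in for compressing a backup chunk: pure CPU, no yield points.
    total = 0
    for i in range(30000000):
        total += i
    return total


hb = eventlet.spawn(heartbeat)
eventlet.spawn(compress).wait()   # the heartbeats stall until this returns
hb.wait()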

OpenStack services like Keystone and Nova have already moved off of eventlet and onto Apache / mod_wsgi to resolve this inefficiency. Older versions of Nova at least use many worker processes to spread requests across multiple processes. Looking at Cinder, however, even in current versions the backup service, like the others, runs as a single process. If Cinder allowed for a "workers" configuration, you would see it being passed to oslo_service right here: https://github.com/openstack/cinder/blob/master/cinder/cmd/backup.py#L59 . In Nova, for example, we can see a "workers" value being passed: https://github.com/openstack/nova/blob/master/nova/cmd/conductor.py#L45 .
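
For illustration only, a hypothetical "backup_workers" option wired into oslo_service at that spot could look like the sketch below; the option name and the wiring are assumptions, not existing Cinder code.

from oslo_config import cfg
from oslo_service import service

CONF = cfg.CONF
CONF.register_opts([
    cfg.IntOpt('backup_workers', default=1,
               help='Hypothetical: number of backup worker processes'),
])


def serve(server):
    # oslo_service forks "workers" copies of the service, the same way the
    # Nova conductor example above passes its workers count.
    launcher = service.launch(CONF, server, workers=CONF.backup_workers)
    launcher.wait()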

The setting you want to adjust in your my.cnf.d/galera.cnf is connect_timeout: https://dev.mysql.com/doc/refman/5.7/en/server-system-variables.html#sysvar_connect_timeout . Change it to 120 seconds. However, this is really a bigger architectural issue in Cinder: Cinder should provide a configurable workers value, and should also move off the eventlet model and run under Apache mod_wsgi.
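
For example (the file path and the [mysqld] section name are assumptions about a typical Galera deployment; adjust them to your layout):

# /etc/my.cnf.d/galera.cnf
[mysqld]
connect_timeout = 120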

Comment 43 Eric Beaudoin 2017-09-08 12:22:33 UTC
Hi Michael,
The customer and their end customer confirmed that changing the galera "connect_timeout" value to 120 fixed the problem with creating cinder backups from scripts.
Thanks for the hard work.
Eric

Comment 46 Gorka Eguileor 2017-09-26 12:06:13 UTC
I believe @mbayer's assessment is correct.

This seems to be caused by Cinder's data compression during backups (a CPU-intensive operation) being done directly in the greenthread, which prevents switching to other greenthreads.

Given enough greenthreads doing compression, they end up running mostly serially and prevent other greenthreads from running.

The solution would be to run the compression on a native thread so that it does not interfere with greenthread switching.
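
That is what the merged upstream change ("Run backup compression on native thread", gerrit 507510 in the links above) does. A minimal sketch of the technique using eventlet's tpool, as an illustration rather than the actual Cinder patch:

import zlib

from eventlet import tpool


def compress_chunk(data):
    # tpool.execute() runs the callable in eventlet's native thread pool and
    # only blocks this greenthread, so the hub keeps servicing database and
    # RabbitMQ I/O while the chunk is being compressed.
    return tpool.execute(zlib.compress, data, 9)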

Comment 48 Benjamin Schmaus 2018-03-12 00:46:51 UTC
*** Bug 1549598 has been marked as a duplicate of this bug. ***

Comment 64 Tzach Shefi 2018-06-21 09:19:46 UTC
Verified on:
openstack-cinder-8.1.1-20.el7ost.noarch

Ran both loops: created 9 volumes, then backed them up.
Reran the backup-and-delete cycle on the 9 volumes two more times.
Nothing failed; OK to verify.

Comment 66 errata-xmlrpc 2018-07-05 12:31:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2136

