Bug 1209711 - Thousands of OSP cinder snapshots cause significant EmsRefresh slowdown
Summary: Thousands of OSP cinder snapshots cause significant EmsRefresh slowdown
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Providers
Version: 5.3.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: GA
Target Release: 5.4.0
Assignee: Matthew Draper
QA Contact: Thom Carlin
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-04-08 03:26 UTC by Thomas Hennessy
Modified: 2019-07-11 08:54 UTC
CC List: 11 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
CloudForms Management Engine previously collected inventory for all snapshots on Red Hat OpenStack Platform providers, causing a significant EmsRefresh slowdown: the refresh process does not yet scale to large numbers of OpenStack Platform Cinder snapshots. As a temporary workaround until the underlying scalability problem is addressed, users with a large number of snapshots can now disable the collection of snapshot inventory, which avoids the EmsRefresh slowdown.
Clone Of:
Environment:
Last Closed: 2015-06-16 12:57:12 UTC
Category: ---
Cloudforms Team: ---
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
zip of current fog.log from customer QA appliance (not the full archive set) (11.11 MB, application/x-rar)
2015-04-08 03:26 UTC, Thomas Hennessy
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2015:1100 0 normal SHIPPED_LIVE CFME 5.4.0 bug fixes, and enhancement update 2015-06-16 16:28:42 UTC

Description Thomas Hennessy 2015-04-08 03:26:07 UTC
Created attachment 1012010 [details]
zip of current fog.log from customer QA appliance (not the full archive set)

Description of problem: Customer is reporting that a single OpenStack instance is being used in multiple CFME environments.

Each environment reports the same degradation in EmsRefresh times over the course of several weeks, growing from about two minutes to more than 1,200 seconds.


Version-Release number of selected component (if applicable): 
Version: 5.3.0.15
Build:   20140929084440_2192916



How reproducible: The behavior has not been reproduced in a test environment, but it is reproduced in several of the customer environments after CFME has been operating for at least two weeks.


Steps to Reproduce:
1.
2.
3.

Actual results: EMS refresh times grow on a daily basis.  Each worker process lives for the restart_interval period and is then replaced by a new worker process, yet the growth in EmsRefresh times continues across worker processes, so it appears to be independent of any individual worker.


Expected results: EMS refresh times should stay roughly the same in an environment where VM instance counts are either roughly constant or have actually been reduced by nearly 50%.


Additional info:

Comment 8 Greg Blomquist 2015-04-22 17:52:31 UTC
https://github.com/ManageIQ/manageiq/pull/2723

Doc text could largely be pulled directly from the PR comment.

Comment 9 CFME Bot 2015-04-22 19:00:51 UTC
New commit detected on manageiq/master:
https://github.com/ManageIQ/manageiq/commit/a54999a048cc40be7af899375ac9e5d969463579

commit a54999a048cc40be7af899375ac9e5d969463579
Author:     Matthew Draper <mdraper>
AuthorDate: Thu Apr 23 02:51:52 2015 +0930
Commit:     Matthew Draper <mdraper>
CommitDate: Thu Apr 23 02:51:52 2015 +0930

    Provide a limited ability for OpenStack refresh to skip item types
    
    Documenting this in the config file seems likely to make it too
    enticing; as a quick fix, it's not really as universal as it would first
    appear. To wit: while it works for our immediate situation (needing to
    skip `:cloud_volumes` and `:cloud_volume_snapshots`), it would have no
    effect on (for example) `:firewall_rules`, and would actively break for
    `:security_groups`.
    
    Not to mention the likelihood of someone assuming such an option will
    work equally in other providers, where it would in fact be ignored
    completely.
    
    With the above limitations in mind, to use, configure:
    
    ems_refresh:
      openstack:
        :inventory_ignore:
          - :cloud_volumes
          - :cloud_volume_snapshots
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1209711

 vmdb/app/models/ems_refresh/parsers/openstack.rb   |  1 +
 .../openstack_refresher_rhos_havana_spec.rb        | 24 ++++++++++++++++++++++
 2 files changed, 25 insertions(+)
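
For context, here is a minimal Ruby sketch of how such a skip list could work; the class, constant, and method names below are illustrative, not the actual ManageIQ parser code touched by the commit above.

    # Illustrative only -- not the actual code from openstack.rb.
    # Assumes the parser collects inventory by calling one get_* method per
    # item type, and that :inventory_ignore lists the types to skip.
    class OpenstackInventoryParser
      ITEM_TYPES = %i[cloud_volumes cloud_volume_snapshots flavors images].freeze

      def initialize(ignore_list)
        @ignore = Array(ignore_list).map(&:to_sym)
      end

      def parse_inventory
        (ITEM_TYPES - @ignore).each { |type| send("get_#{type}") }
      end

      private

      def get_cloud_volumes; end          # would query Cinder volumes
      def get_cloud_volume_snapshots; end # would query Cinder snapshots
      def get_flavors; end
      def get_images; end
    end

    # With the config from the commit message, the two Cinder collections are skipped:
    OpenstackInventoryParser.new([:cloud_volumes, :cloud_volume_snapshots]).parse_inventory

This is also why the commit message warns against documenting the option broadly: only item types that map onto independent collection steps like these can be skipped safely.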

Comment 10 Greg Blomquist 2015-04-23 14:12:13 UTC
This fix is largely a temporary workaround until we can address the real scalability issue being tracked here: https://bugzilla.redhat.com/show_bug.cgi?id=1214780

Comment 12 Thom Carlin 2015-05-19 15:39:40 UTC
Please provide steps to reproduce.

Comment 13 Thomas Hennessy 2015-05-20 15:41:20 UTC
Sure:
Create an OpenStack environment with a few thousand images.
Take multiple volume snapshots of each from OpenStack.
When you have about 16k volume snapshots, do a standard EMS refresh of the OpenStack provider.

I suspect this is not a reasonable ask for the QE department, but that is the environment in which this problem has presented itself.

It seems to me that a better question to ask is: is there a reasonable number of volume snapshots that we ought to include?  Should any be included at all?  What does CFME actually do with them?
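
A rough scripted version of the reproduction steps above is sketched below. It assumes the python-cinderclient CLI is installed, the usual OS_* auth variables are exported, and the tenant already contains a few thousand volumes; flag spellings differ between cinderclient versions, so treat it as an illustration rather than an exact recipe.

    #!/usr/bin/env ruby
    # Rough reproduction sketch, not verified against the customer's setup.
    # Creates several snapshots per volume so the provider ends up with tens
    # of thousands of Cinder snapshots.
    SNAPSHOTS_PER_VOLUME = 4

    # Pull the volume UUIDs out of the "cinder list" table output.
    volume_ids = `cinder list`.scan(/^\|\s*([0-9a-f-]{36})\s*\|/).flatten

    volume_ids.each do |vol_id|
      SNAPSHOTS_PER_VOLUME.times do |i|
        # --force allows snapshotting in-use volumes; older cinderclient
        # versions use --display-name where newer ones use --name.
        system("cinder", "snapshot-create", "--force", "True",
               "--display-name", "scale-test-#{vol_id}-#{i}", vol_id)
      end
    end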

Comment 14 Thom Carlin 2015-06-09 15:39:36 UTC
Verified in 5.4.0.5.20150605150206_7daa1a8 by:
1) Running SmartState Analysis on an image and verifying that both cloud_volumes and cloud_volume_snapshots are created
2) Configuring as above and verifying that neither cloud_volumes nor cloud_volume_snapshots are created
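
For reference, one way to spot-check the two cases from the appliance Rails console; the association names here are my assumption about the CFME 5.4-era models, not taken from the verification run itself.

    # Run inside the appliance Rails console (bundle exec rails c in vmdb/).
    # "my-openstack" is a hypothetical provider name.
    ems = ExtManagementSystem.find_by_name("my-openstack")
    EmsRefresh.refresh(ems)

    # Without :inventory_ignore both counts should be non-zero after refresh;
    # with :cloud_volumes and :cloud_volume_snapshots ignored, both stay at 0.
    puts ems.cloud_volumes.count
    puts ems.cloud_volume_snapshots.count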

Comment 16 errata-xmlrpc 2015-06-16 12:57:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-1100.html

