Bug 1209711

Summary:

Thousands of OSP cinder snapshots cause significant EmsRefresh slowdown

Product:

Red Hat CloudForms Management Engine

Reporter:

Thomas Hennessy <thenness>

Component:

Providers

Assignee:

Matthew Draper <mdraper>

Status:

CLOSED ERRATA

QA Contact:

Thom Carlin <tcarlin>

Severity:

high

Docs Contact:

Priority:

high

Version:

5.3.0

CC:

dclarizi, gblomqui, jdeubel, jfrey, jhardy, jocarter, mfeifer, obarenbo, snansi, tcarlin, thenness

Target Milestone:

Target Release:

5.4.0

Hardware:

x86_64

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Known Issue

Doc Text:

The previous version of CloudForms Management Engine collected inventory on all snapshots on Red Hat OpenStack Platform providers, causing significant EmsRefresh slowdown. The slowdown was caused by scalability issues in the refresh process, which could not handle large numbers of OpenStack Platform cinder snapshots. A temporary workaround is provided until the underlying scalability problem is fixed: users with a large number of snapshots can disable the collection of inventory information for snapshots, which avoids the EmsRefresh slowdown.

Story Points:

---

Clone Of:

Environment:

Last Closed:

2015-06-16 12:57:12 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
zip of current fog.log from customer QA appliance (not the full archive set)	none

Description Thomas Hennessy 2015-04-08 03:26:07 UTC

Created attachment 1012010
zip of current fog.log from customer QA appliance (not the full archive set)

Description of problem: Customer is reporting that a single instance of OpenStack is being used in multiple CFME environments.

Each environment is reporting the same degradation in EmsRefresh times over the course of several weeks, growing from about two minutes to more than 1200 seconds.


Version-Release number of selected component (if applicable): 
Version: 5.3.0.15
Build:   20140929084440_2192916



How reproducible: Behavior not reproduced in testing environment, but is reproduced in several of the customer environments after CFME has been operating for at least two weeks.


Steps to Reproduce:
1.
2.
3.

Actual results: EMS refresh times grow on a daily basis.  Each worker process lives for the restart_interval period and is then replaced by another worker process.  The growth in EMS refresh times seems to be independent of the worker process.


Expected results: EMS refresh times should stay roughly the same in an environment where VM instance counts are either roughly the same or are actually reduced by nearly 50%.


Additional info:

Comment 8 Greg Blomquist 2015-04-22 17:52:31 UTC

https://github.com/ManageIQ/manageiq/pull/2723

Doc text could largely be pulled directly from the PR comment.

Comment 9 CFME Bot 2015-04-22 19:00:51 UTC

New commit detected on manageiq/master:
https://github.com/ManageIQ/manageiq/commit/a54999a048cc40be7af899375ac9e5d969463579

commit a54999a048cc40be7af899375ac9e5d969463579
Author:     Matthew Draper <mdraper>
AuthorDate: Thu Apr 23 02:51:52 2015 +0930
Commit:     Matthew Draper <mdraper>
CommitDate: Thu Apr 23 02:51:52 2015 +0930

    Provide a limited ability for OpenStack refresh to skip item types
    
    Documenting this in the config file seems likely to make it too
    enticing; as a quick fix, it's not really as universal as it would first
    appear. To wit: while it works for our immediate situation (needing to
    skip `:cloud_volumes` and `:cloud_volume_snapshots`), it would have no
    effect on (for example) `:firewall_rules`, and would actively break for
    `:security_groups`.
    
    Not to mention the likelihood of someone assuming such an option will
    work equally in other providers, where it would in fact be ignored
    completely.
    
    With the above limitations in mind, to use, configure:
    
    ems_refresh:
      openstack:
        :inventory_ignore:
          - :cloud_volumes
          - :cloud_volume_snapshots
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1209711

 vmdb/app/models/ems_refresh/parsers/openstack.rb   |  1 +
 .../openstack_refresher_rhos_havana_spec.rb        | 24 ++++++++++++++++++++++
 2 files changed, 25 insertions(+)
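
The skip behaviour described in the commit message can be illustrated with a minimal Ruby sketch. This is not the code from the commit: the settings hash, the COLLECTORS table and the collect_inventory helper are all hypothetical names invented for illustration, and the lambdas stand in for the real OpenStack inventory calls. The sketch only shows the general idea of filtering inventory types against an :inventory_ignore list.

```ruby
# Hypothetical settings hash mirroring the YAML keys from the commit message.
settings = {
  "ems_refresh" => {
    "openstack" => {
      :inventory_ignore => [:cloud_volumes, :cloud_volume_snapshots]
    }
  }
}

# Hypothetical collectors; each lambda stands in for a real inventory call.
COLLECTORS = {
  :flavors                => -> { ["m1.small"] },
  :cloud_volumes          => -> { ["vol-1", "vol-2"] },
  :cloud_volume_snapshots => -> { ["snap-1"] }
}

# Run every collector whose type is not listed in the ignore list.
def collect_inventory(collectors, ignored)
  collectors.each_with_object({}) do |(type, collector), result|
    next if ignored.include?(type) # skipped types are never collected
    result[type] = collector.call
  end
end

# A missing setting falls back to an empty list, so nothing is skipped.
ignored   = settings.dig("ems_refresh", "openstack", :inventory_ignore) || []
inventory = collect_inventory(COLLECTORS, ignored)

puts inventory.keys.inspect
```

With the ignore list above, only the :flavors collector runs. As the commit message warns, the real option is not this general: it has no effect on :firewall_rules, breaks for :security_groups, and is ignored by other providers.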

Comment 10 Greg Blomquist 2015-04-23 14:12:13 UTC

This fix is largely a temporary workaround until we can address the real scalability issue being tracked here: https://bugzilla.redhat.com/show_bug.cgi?id=1214780

Comment 12 Thom Carlin 2015-05-19 15:39:40 UTC

Please provide steps to reproduce.

Comment 13 Thomas Hennessy 2015-05-20 15:41:20 UTC

Sure:
Create an OpenStack environment with a few thousand images.
Take multiple volume snapshots of each from OpenStack.
When you have about 16k volume snapshots, do a standard EMS refresh of the OpenStack provider.

I suspect this is not likely to be a reasonable ask for the QE department, but that is the environment in which this problem has presented itself.

It seems to me that a better question to ask is: Is there a reasonable number of volume snapshots that we ought to include?  Should any be included?  What does CFME do with these?

Comment 14 Thom Carlin 2015-06-09 15:39:36 UTC

Verified in 5.4.0.5.20150605150206_7daa1a8 by:
1) Running SmartState Analysis on an image and verifying that both cloud_volumes and cloud_volume_snapshots are created
2) Configuring as above and verifying that neither cloud_volumes nor cloud_volume_snapshots are created

Comment 16 errata-xmlrpc 2015-06-16 12:57:12 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-1100.html