Created attachment 1266611 [details]
chart showing successive ems refresh worker memory growth and stepping on prior ems refresh

Description of problem:
See the provided charts and logs: the customer has a very large OpenShift provider environment which has never successfully completed a full ems refresh. The memory threshold has been increased to 6GB, which is routinely exceeded. The default timeout of 7200 seconds is insufficient, and the restart_interval of two hours causes the active ems refresh to be terminated while a replacement refresh worker starts and pushes the appliance further into swap.

Version-Release number of selected component (if applicable):
Version: 5.7.1.3

How reproducible:
Engineering has not yet been able to reproduce.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
https://github.com/ManageIQ/manageiq/pull/14542
Federico, there seem to be multiple misunderstandings in our exchanges, but I will do my best to clarify. The materials I have provided with this case are not based on any testing done by myself or anyone else in Red Hat GSS, as we have no OpenShift environment to perform *ANY* testing, let alone scale testing. All of the information provided in this BZ is distilled from the set of customer logs provided for the *6825 customer appliance. I do not understand how I can confirm what you are asking me to confirm any better than you can.

I have provided the full set of ems refresh worker process logs associated with each of the processes for which the memory usage charts have been provided. If you are asking me to provide the raw top data for each of these processes, which would include both memory and cpu consumption every 60 seconds, I can and will do that as soon as I save this case response. I will provide this data as a separate xls file for each of the identified ems refresh worker processes.

Your description of slowness is not something I can make any assessment about, because the processes emit too few log lines with any information that would allow anyone to judge how fast or slow they are doing anything; there are just long gaps of time between log lines, with a few logs indicating the deletion of some element of inventory.
See sample below:
=====
[----] I, [2017-03-24T15:23:54.084120 #16443:e53134] INFO -- : MIQ(EmsRefresh.save_inventory_multi) [container_builds] Deleting id: [10000000004850] name: [php70-oracle-client]
[----] I, [2017-03-24T15:30:07.014947 #16443:e53134] INFO -- : MIQ(EmsRefresh.save_inventory_multi) [container_build_pods] Deleting id: [10000000025199] name: [php70-oracle-client-1-1], id: [10000000025200] name: [php70-oracle-client-4]
[----] I, [2017-03-24T15:37:38.800698 #16443:e53134] INFO -- : MIQ(EmsRefresh.save_inventory_multi) [persistent_volume_claims] Deleting id: [10000000001720] name: [mysqldb-bkp], id: [10000000001721] name: [mysqldb-data]
========

As the gaps between these log lines are about 7 minutes, and these are the only emitted bits of information about the refresh processing, I am at a loss to make any assessment about speed of any kind. I will provide the top information for each of the identified processes for which the full logs are provided in the attachment "AllEmsRefreshWorkerProcessLogsFromMarch24Logs.zip".
Looking at the code:
https://github.com/ManageIQ/manageiq/blob/f4936d6/app/models/manageiq/providers/kubernetes/container_manager/refresh_parser.rb#L16-L30

the collection of component statuses (get_component_statuses) is the last collection we perform, and luckily it generates a warning: "cannot list all componentstatuses in the cluster". Based on that, and on the logs provided here, the collection seems fast enough:

pid_16443_combined.txt:[----] I, [2017-03-24T15:02:14.296142 #16443:e53134] INFO -- : MIQ(ManageIQ::Providers::OpenshiftEnterprise::ContainerManager::Refresher#refresh) Refreshing all targets...
pid_16443_combined.txt:[----] W, [2017-03-24T15:02:33.187699 #16443:e53134] WARN -- : Unexpected Exception during refresh: HTTP status code 403, User "system:serviceaccount:management-infra:management-admin" cannot list all componentstatuses in the cluster
=> ~19 seconds collection time

pid_19437_combined.txt:[----] I, [2017-03-24T11:24:59.146854 #19437:e53134] INFO -- : MIQ(ManageIQ::Providers::OpenshiftEnterprise::ContainerManager::Refresher#refresh) Refreshing all targets...
pid_19437_combined.txt:[----] W, [2017-03-24T11:25:09.301395 #19437:e53134] WARN -- : Unexpected Exception during refresh: HTTP status code 403, User "system:serviceaccount:management-infra:management-admin" cannot list all componentstatuses in the cluster
=> ~10 seconds collection time

pid_21793_combined.txt:[----] I, [2017-03-24T07:46:59.688957 #21793:e53134] INFO -- : MIQ(ManageIQ::Providers::OpenshiftEnterprise::ContainerManager::Refresher#refresh) Refreshing all targets...
pid_21793_combined.txt:[----] W, [2017-03-24T07:47:09.927956 #21793:e53134] WARN -- : Unexpected Exception during refresh: HTTP status code 403, User "system:serviceaccount:management-infra:management-admin" cannot list all componentstatuses in the cluster
=> ~10 seconds collection time

pid_27999_combined.txt:[----] I, [2017-03-24T04:07:41.580341 #27999:e53134] INFO -- : MIQ(ManageIQ::Providers::OpenshiftEnterprise::ContainerManager::Refresher#refresh) Refreshing all targets...
pid_27999_combined.txt:[----] W, [2017-03-24T04:07:52.015339 #27999:e53134] WARN -- : Unexpected Exception during refresh: HTTP status code 403, User "system:serviceaccount:management-infra:management-admin" cannot list all componentstatuses in the cluster
=> ~11 seconds collection time

pid_27999_combined.txt:[----] I, [2017-03-24T04:10:06.674175 #27999:e53134] INFO -- : MIQ(ManageIQ::Providers::OpenshiftEnterprise::ContainerManager::Refresher#refresh) Refreshing all targets...
pid_27999_combined.txt:[----] W, [2017-03-24T04:10:18.049651 #27999:e53134] WARN -- : Unexpected Exception during refresh: HTTP status code 403, User "system:serviceaccount:management-infra:management-admin" cannot list all componentstatuses in the cluster
=> ~12 seconds collection time

pid_4348_combined.txt:[----] I, [2017-03-24T05:46:51.161437 #4348:e53134] INFO -- : MIQ(ManageIQ::Providers::OpenshiftEnterprise::ContainerManager::Refresher#refresh) Refreshing all targets...
pid_4348_combined.txt:[----] W, [2017-03-24T05:47:04.434823 #4348:e53134] WARN -- : Unexpected Exception during refresh: HTTP status code 403, User "system:serviceaccount:management-infra:management-admin" cannot list all componentstatuses in the cluster
=> ~13 seconds collection time

pid_4623_combined.txt:[----] I, [2017-03-24T13:20:54.199982 #4623:e53134] INFO -- : MIQ(ManageIQ::Providers::OpenshiftEnterprise::ContainerManager::Refresher#refresh) Refreshing all targets...
pid_4623_combined.txt:[----] W, [2017-03-24T13:21:04.464541 #4623:e53134] WARN -- : Unexpected Exception during refresh: HTTP status code 403, User "system:serviceaccount:management-infra:management-admin" cannot list all componentstatuses in the cluster
=> ~10 seconds collection time

pid_6207_combined.txt:[----] I, [2017-03-24T09:37:20.796049 #6207:e53134] INFO -- : MIQ(ManageIQ::Providers::OpenshiftEnterprise::ContainerManager::Refresher#refresh) Refreshing all targets...
pid_6207_combined.txt:[----] W, [2017-03-24T09:37:30.967726 #6207:e53134] WARN -- : Unexpected Exception during refresh: HTTP status code 403, User "system:serviceaccount:management-infra:management-admin" cannot list all componentstatuses in the cluster
=> ~10 seconds collection time

It makes sense to think that the rest of the time is spent in database operations.

Adam, assigning this to you: I saw in comment 10 that you already sent a PR to improve the performance of save inventory (I am not sure whether that will be enough). There may also be issues specific to this environment (e.g. a slow/lossy connection between the appliance and the db?). Adam, let me know if we can assist you in any way. Thanks.
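For clarity, the "~N seconds collection time" figures above are simply the deltas between the "Refreshing all targets..." timestamp and the componentstatuses warning that follows it in the same process log. A minimal sketch of that computation (the helper name is made up, this is not ManageIQ code):

```ruby
require "time"

# Extract the bracketed ISO8601 timestamp from an evm.log-style line.
LOG_TS = /\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+)/

# Collection time = warning-line timestamp minus refresh-start timestamp.
def collection_seconds(start_line, warn_line)
  (Time.parse(warn_line[LOG_TS, 1]) - Time.parse(start_line[LOG_TS, 1])).round
end

start_line = "[----] I, [2017-03-24T15:02:14.296142 #16443:e53134] INFO -- : MIQ(...) Refreshing all targets..."
warn_line  = "[----] W, [2017-03-24T15:02:33.187699 #16443:e53134] WARN -- : Unexpected Exception during refresh: HTTP status code 403, ..."

puts collection_seconds(start_line, warn_line) # => 19
```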
Created attachment 1267618 [details]
5.7.1.3 optimize store_ids_for_new_records

From https://github.com/agrare/manageiq/tree/uhc_5713_ladas_optimize_store_ids_for_new_records which is a 5.7.z backport of https://github.com/ManageIQ/manageiq/pull/14542

Copy to /var/www/miq/vmdb/app/models/ems_refresh/save_inventory_helper.rb, then restart evmserverd or reboot the appliance.
I set the 'customer escalation' flag; UHC will want a hotfix for this also. Thanks,
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/3d135d0966c3748f18f2618c1382bd850ee42154

commit 3d135d0966c3748f18f2618c1382bd850ee42154
Author:     Ladislav Smola <lsmola>
AuthorDate: Tue Mar 28 18:40:06 2017 +0200
Commit:     Ladislav Smola <lsmola>
CommitDate: Tue Mar 28 22:36:09 2017 +0200

    Optimize store_ids_for_new_records by getting rid of the O(n^2) lookups

    Optimize store_ids_for_new_records by getting rid of the O(n^2) lookups.
    These lookups can take hours when we go over 50-100k records being
    processed by store_ids_for_new_records.

    And seems like there might be a Rails issue, since by doing
    association.push(new_records) at
    https://github.com/Ladas/manageiq/blob/48aa323551c8825e02170e251afa717ca807d2ed/app/models/ems_refresh/save_inventory_helper.rb#L69-L69
    the association has the records doubled. The association is passed into
    store_ids_for_new_records as the 'records' parameter:
    https://github.com/Ladas/manageiq/blob/48aa323551c8825e02170e251afa717ca807d2ed/app/models/ems_refresh/save_inventory_helper.rb#L114-L114
    And here when we do:
    records.count => 50k
    records.uniq.count => 25k
    records.reload.count => 25k
    That was slowing down the store_ids_for_new_records method even more it seems.

    Partially fixes:
    https://bugzilla.redhat.com/show_bug.cgi?id=1436176

 app/models/ems_refresh/save_inventory_helper.rb | 27 ++++++++++++++++++++++---
 1 file changed, 24 insertions(+), 3 deletions(-)
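For anyone reading along, the general shape of the optimization in that commit is replacing a per-record linear scan with a hash index built once. The sketch below is illustrative only (made-up method names and simplified data, not the actual save_inventory_helper.rb code):

```ruby
# Sketch: `records` stand in for saved rows (carrying :id), `hashes` for the
# parsed inventory hashes that still need their ids, `keys` for the attributes
# used to match them up.

# O(n^2): every record triggers a linear scan of all hashes.
def store_ids_slow(records, hashes, keys)
  records.each do |record|
    match = hashes.detect { |h| keys.all? { |k| h[k] == record[k] } }
    match[:id] = record[:id] if match
  end
end

# O(n): index the hashes by their key attributes once, then do O(1) lookups.
# `records.uniq` guards against the doubled association records the commit
# message describes.
def store_ids_fast(records, hashes, keys)
  index = hashes.each_with_object({}) { |h, idx| idx[keys.map { |k| h[k] }] = h }
  records.uniq.each do |record|
    match = index[keys.map { |k| record[k] }]
    match[:id] = record[:id] if match
  end
end

records = [{ name: "a", id: 1 }, { name: "b", id: 2 }, { name: "b", id: 2 }]
hashes  = [{ name: "a" }, { name: "b" }]
store_ids_fast(records, hashes, [:name])
# hashes now carry the ids from the matching records
```

With 50-100k records, the detect-based scan does billions of comparisons, while the indexed version does one pass to build the index and one pass to look everything up.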
https://github.com/ManageIQ/manageiq/pull/14028
https://github.com/ManageIQ/manageiq/pull/14606
https://github.com/ManageIQ/manageiq/pull/14670
https://github.com/ManageIQ/manageiq/pull/13594
https://github.com/ManageIQ/manageiq/pull/14542
https://github.com/ManageIQ/manageiq/pull/14670
*** Bug 1428570 has been marked as a duplicate of this bug. ***
With the PRs in https://bugzilla.redhat.com/show_bug.cgi?id=1436176#c21 applied, refresh against a remote database with a 6ms db ping was reduced from >2hr to 22min.
New commit detected on ManageIQ/manageiq/euwe:
https://github.com/ManageIQ/manageiq/commit/23f6fbbaf8117d8af0a1af1fe179a06dc78e7bfa

commit 23f6fbbaf8117d8af0a1af1fe179a06dc78e7bfa
Author:     Adam Grare <agrare>
AuthorDate: Sat Apr 1 14:22:13 2017 -0400
Commit:     Satoe Imaishi <simaishi>
CommitDate: Tue Apr 11 08:49:39 2017 -0400

    Merge pull request #14542 from Ladas/optimize_store_ids_for_new_records

    Optimize store_ids_for_new_records by getting rid of the O(n^2) lookups
    (cherry picked from commit 053c16e673d1ba73b98b6ebf73d98f46d2853c75)
    https://bugzilla.redhat.com/show_bug.cgi?id=1436176

 app/models/ems_refresh/save_inventory_helper.rb | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)
Tested in the scope of refresh regression.