Created attachment 1266611 [details]
chart showing successive ems refresh worker memory growth and stepping on prior ems refresh

Description of problem:
See the provided charts and logs: the customer has a very large OpenShift provider environment which has never successfully completed a full ems refresh. The memory threshold has been increased to 6GB, which is routinely exceeded. The default timeout of 7200 seconds is insufficient, and the restart_interval of two hours causes the active ems refresh to be terminated while a replacement refresh worker starts and pushes the appliance further into swap.

Version-Release number of selected component (if applicable):
Version: 5.7.1.3

How reproducible:
Engineering has not yet been able to reproduce.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
https://github.com/ManageIQ/manageiq/pull/14542
Federico, there seem to be multiple misunderstandings in our exchanges, but I will do my best to clarify. The materials I have provided with this case are not based on any testing done by myself or anyone else in Red Hat GSS, as we have no OpenShift environment to perform *ANY* testing, let alone scale testing. All of the information provided in this BZ is distilled from the set of customer logs provided for the *6825 customer appliance. I do not understand how I can confirm what you are asking me to confirm any better than you can.

I have provided the full set of ems refresh worker process logs associated with each of the processes for which the memory usage charts have been provided. If you are asking me to provide the raw top data for each of these processes, which would include both memory and cpu consumption every 60 seconds, I can and will do that as soon as I save this case response. I will provide this data as a separate xls file for each of the identified ems refresh worker processes.

Your description of slowness is not something I can make any assessment about, because the processes emit too few log lines with any information that would allow anyone to judge how fast or slow they are doing anything; there are just long gaps of time between log lines, with a few logs indicating the deletion of some element of inventory.
See sample below:
=====
[----] I, [2017-03-24T15:23:54.084120 #16443:e53134] INFO -- : MIQ(EmsRefresh.save_inventory_multi) [container_builds] Deleting id: [10000000004850] name: [php70-oracle-client]
[----] I, [2017-03-24T15:30:07.014947 #16443:e53134] INFO -- : MIQ(EmsRefresh.save_inventory_multi) [container_build_pods] Deleting id: [10000000025199] name: [php70-oracle-client-1-1], id: [10000000025200] name: [php70-oracle-client-4]
[----] I, [2017-03-24T15:37:38.800698 #16443:e53134] INFO -- : MIQ(EmsRefresh.save_inventory_multi) [persistent_volume_claims] Deleting id: [10000000001720] name: [mysqldb-bkp], id: [10000000001721] name: [mysqldb-data]
========

As the gaps between these log lines are about 7 minutes, and these are the only emitted bits of information about the refresh processing, I am at a loss to make any assessment about speed of any kind. I will provide the top information for each of the identified processes for which the full logs are provided in the attachment "AllEmsRefreshWorkerProcessLogsFromMarch24Logs.zip".
Looking at the code:
https://github.com/ManageIQ/manageiq/blob/f4936d6/app/models/manageiq/providers/kubernetes/container_manager/refresh_parser.rb#L16-L30

the collection of component statuses (get_component_statuses) is the last collection we perform, and luckily it generates a warning: "cannot list all componentstatuses in the cluster". Based on that, and on the logs provided here, the collection seems fast enough:

pid_16443_combined.txt:[----] I, [2017-03-24T15:02:14.296142 #16443:e53134] INFO -- : MIQ(ManageIQ::Providers::OpenshiftEnterprise::ContainerManager::Refresher#refresh) Refreshing all targets...
pid_16443_combined.txt:[----] W, [2017-03-24T15:02:33.187699 #16443:e53134] WARN -- : Unexpected Exception during refresh: HTTP status code 403, User "system:serviceaccount:management-infra:management-admin" cannot list all componentstatuses in the cluster
=> ~19 seconds collection time

pid_19437_combined.txt:[----] I, [2017-03-24T11:24:59.146854 #19437:e53134] INFO -- : MIQ(ManageIQ::Providers::OpenshiftEnterprise::ContainerManager::Refresher#refresh) Refreshing all targets...
pid_19437_combined.txt:[----] W, [2017-03-24T11:25:09.301395 #19437:e53134] WARN -- : Unexpected Exception during refresh: HTTP status code 403, User "system:serviceaccount:management-infra:management-admin" cannot list all componentstatuses in the cluster
=> ~10 seconds collection time

pid_21793_combined.txt:[----] I, [2017-03-24T07:46:59.688957 #21793:e53134] INFO -- : MIQ(ManageIQ::Providers::OpenshiftEnterprise::ContainerManager::Refresher#refresh) Refreshing all targets...
pid_21793_combined.txt:[----] W, [2017-03-24T07:47:09.927956 #21793:e53134] WARN -- : Unexpected Exception during refresh: HTTP status code 403, User "system:serviceaccount:management-infra:management-admin" cannot list all componentstatuses in the cluster
=> ~10 seconds collection time

pid_27999_combined.txt:[----] I, [2017-03-24T04:07:41.580341 #27999:e53134] INFO -- : MIQ(ManageIQ::Providers::OpenshiftEnterprise::ContainerManager::Refresher#refresh) Refreshing all targets...
pid_27999_combined.txt:[----] W, [2017-03-24T04:07:52.015339 #27999:e53134] WARN -- : Unexpected Exception during refresh: HTTP status code 403, User "system:serviceaccount:management-infra:management-admin" cannot list all componentstatuses in the cluster
=> ~11 seconds collection time

pid_27999_combined.txt:[----] I, [2017-03-24T04:10:06.674175 #27999:e53134] INFO -- : MIQ(ManageIQ::Providers::OpenshiftEnterprise::ContainerManager::Refresher#refresh) Refreshing all targets...
pid_27999_combined.txt:[----] W, [2017-03-24T04:10:18.049651 #27999:e53134] WARN -- : Unexpected Exception during refresh: HTTP status code 403, User "system:serviceaccount:management-infra:management-admin" cannot list all componentstatuses in the cluster
=> ~12 seconds collection time

pid_4348_combined.txt:[----] I, [2017-03-24T05:46:51.161437 #4348:e53134] INFO -- : MIQ(ManageIQ::Providers::OpenshiftEnterprise::ContainerManager::Refresher#refresh) Refreshing all targets...
pid_4348_combined.txt:[----] W, [2017-03-24T05:47:04.434823 #4348:e53134] WARN -- : Unexpected Exception during refresh: HTTP status code 403, User "system:serviceaccount:management-infra:management-admin" cannot list all componentstatuses in the cluster
=> ~13 seconds collection time

pid_4623_combined.txt:[----] I, [2017-03-24T13:20:54.199982 #4623:e53134] INFO -- : MIQ(ManageIQ::Providers::OpenshiftEnterprise::ContainerManager::Refresher#refresh) Refreshing all targets...
pid_4623_combined.txt:[----] W, [2017-03-24T13:21:04.464541 #4623:e53134] WARN -- : Unexpected Exception during refresh: HTTP status code 403, User "system:serviceaccount:management-infra:management-admin" cannot list all componentstatuses in the cluster
=> ~10 seconds collection time

pid_6207_combined.txt:[----] I, [2017-03-24T09:37:20.796049 #6207:e53134] INFO -- : MIQ(ManageIQ::Providers::OpenshiftEnterprise::ContainerManager::Refresher#refresh) Refreshing all targets...
pid_6207_combined.txt:[----] W, [2017-03-24T09:37:30.967726 #6207:e53134] WARN -- : Unexpected Exception during refresh: HTTP status code 403, User "system:serviceaccount:management-infra:management-admin" cannot list all componentstatuses in the cluster
=> ~10 seconds collection time

It makes sense to think that the rest of the time is spent in database operations.

Adam, assigning this to you: I saw in comment 10 that you already sent a PR to improve the performance of save inventory (I am not sure whether that will be enough). There may also be issues specific to this environment (e.g. a slow/lossy connection between the appliance and the db?). Adam, let me know if we can assist you in any way. Thanks.
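For clarity, the "~N seconds collection time" figures above are simply the deltas between the "Refreshing all targets..." timestamp and the componentstatuses warning that follows it in the same process log. A minimal sketch of that computation (the helper name is made up, this is not ManageIQ code):

```ruby
require "time"

# Extract the bracketed ISO8601 timestamp from an evm.log-style line.
LOG_TS = /\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+)/

# Collection time = warning-line timestamp minus refresh-start timestamp.
def collection_seconds(start_line, warn_line)
  (Time.parse(warn_line[LOG_TS, 1]) - Time.parse(start_line[LOG_TS, 1])).round
end

start_line = "[----] I, [2017-03-24T15:02:14.296142 #16443:e53134] INFO -- : MIQ(...) Refreshing all targets..."
warn_line  = "[----] W, [2017-03-24T15:02:33.187699 #16443:e53134] WARN -- : Unexpected Exception during refresh: HTTP status code 403, ..."

puts collection_seconds(start_line, warn_line) # => 19
```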
Created attachment 1267618 [details]
5.7.1.3 optimize store_ids_for_new_records

From https://github.com/agrare/manageiq/tree/uhc_5713_ladas_optimize_store_ids_for_new_records which is a 5.7.z backport of https://github.com/ManageIQ/manageiq/pull/14542

Copy to /var/www/miq/vmdb/app/models/ems_refresh/save_inventory_helper.rb, then restart evmserverd or reboot the appliance.
I set the 'customer escalation' flag; UHC will want a hotfix for this also. Thanks,
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/3d135d0966c3748f18f2618c1382bd850ee42154

commit 3d135d0966c3748f18f2618c1382bd850ee42154
Author:     Ladislav Smola <lsmola>
AuthorDate: Tue Mar 28 18:40:06 2017 +0200
Commit:     Ladislav Smola <lsmola>
CommitDate: Tue Mar 28 22:36:09 2017 +0200

    Optimize store_ids_for_new_records by getting rid of the O(n^2) lookups

    Optimize store_ids_for_new_records by getting rid of the O(n^2) lookups.
    These lookups can take hours when we go over 50-100k records being
    processed by store_ids_for_new_records.

    And seems like there might be a Rails issue, since by doing
    association.push(new_records) at
    https://github.com/Ladas/manageiq/blob/48aa323551c8825e02170e251afa717ca807d2ed/app/models/ems_refresh/save_inventory_helper.rb#L69-L69
    the association has the records doubled. The association is passed into
    store_ids_for_new_records as the 'records' parameter:
    https://github.com/Ladas/manageiq/blob/48aa323551c8825e02170e251afa717ca807d2ed/app/models/ems_refresh/save_inventory_helper.rb#L114-L114
    And here when we do:
    records.count => 50k
    records.uniq.count => 25k
    records.reload.count => 25k
    That was slowing down the store_ids_for_new_records method even more it seems.

    Partially fixes:
    https://bugzilla.redhat.com/show_bug.cgi?id=1436176

 app/models/ems_refresh/save_inventory_helper.rb | 27 ++++++++++++++++++++++---
 1 file changed, 24 insertions(+), 3 deletions(-)
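For anyone reading along, the general shape of the optimization in that commit is replacing a per-record linear scan with a hash index built once. The sketch below is illustrative only (made-up method names and simplified data, not the actual save_inventory_helper.rb code):

```ruby
# Sketch: `records` stand in for saved rows (carrying :id), `hashes` for the
# parsed inventory hashes that still need their ids, `keys` for the attributes
# used to match them up.

# O(n^2): every record triggers a linear scan of all hashes.
def store_ids_slow(records, hashes, keys)
  records.each do |record|
    match = hashes.detect { |h| keys.all? { |k| h[k] == record[k] } }
    match[:id] = record[:id] if match
  end
end

# O(n): index the hashes by their key attributes once, then do O(1) lookups.
# `records.uniq` guards against the doubled association records the commit
# message describes.
def store_ids_fast(records, hashes, keys)
  index = hashes.each_with_object({}) { |h, idx| idx[keys.map { |k| h[k] }] = h }
  records.uniq.each do |record|
    match = index[keys.map { |k| record[k] }]
    match[:id] = record[:id] if match
  end
end

records = [{ name: "a", id: 1 }, { name: "b", id: 2 }, { name: "b", id: 2 }]
hashes  = [{ name: "a" }, { name: "b" }]
store_ids_fast(records, hashes, [:name])
# hashes now carry the ids from the matching records
```

With 50-100k records, the detect-based scan does billions of comparisons, while the indexed version does one pass to build the index and one pass to look everything up.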
https://github.com/ManageIQ/manageiq/pull/14028
https://github.com/ManageIQ/manageiq/pull/14606
https://github.com/ManageIQ/manageiq/pull/14670
https://github.com/ManageIQ/manageiq/pull/13594
https://github.com/ManageIQ/manageiq/pull/14542
https://github.com/ManageIQ/manageiq/pull/14670
*** Bug 1428570 has been marked as a duplicate of this bug. ***
With the PRs in https://bugzilla.redhat.com/show_bug.cgi?id=1436176#c21 applied, refresh against a remote database with a 6ms db ping was reduced from >2hr to 22min.
New commit detected on ManageIQ/manageiq/euwe:
https://github.com/ManageIQ/manageiq/commit/23f6fbbaf8117d8af0a1af1fe179a06dc78e7bfa

commit 23f6fbbaf8117d8af0a1af1fe179a06dc78e7bfa
Author:     Adam Grare <agrare>
AuthorDate: Sat Apr 1 14:22:13 2017 -0400
Commit:     Satoe Imaishi <simaishi>
CommitDate: Tue Apr 11 08:49:39 2017 -0400

    Merge pull request #14542 from Ladas/optimize_store_ids_for_new_records

    Optimize store_ids_for_new_records by getting rid of the O(n^2) lookups
    (cherry picked from commit 053c16e673d1ba73b98b6ebf73d98f46d2853c75)
    https://bugzilla.redhat.com/show_bug.cgi?id=1436176

 app/models/ems_refresh/save_inventory_helper.rb | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)
Tested in the scope of refresh regression.