Bug 1722808 - Refresh of OpenShift provider in CloudForms causes the apiserver to panic
Summary: Refresh of OpenShift provider in CloudForms causes the apiserver to panic
Keywords:
Status: CLOSED EOL
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Providers
Version: 5.10.4
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: GA
Target Release: 5.12.0
Assignee: Oved Ourfali
QA Contact: juwatts
Docs Contact: Red Hat CloudForms Documentation
URL:
Whiteboard:
Depends On: 1729461
Blocks: 1754071 1767834
 
Reported: 2019-06-21 11:28 UTC by Khushbu Borole
Modified: 2023-03-24 14:57 UTC
CC List: 18 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1754071 1767834
Environment:
Last Closed: 2020-03-25 18:03:47 UTC
Category: Bug
Cloudforms Team: Container Management
Target Upstream Version:
Embargoed:


Attachments

Description Khushbu Borole 2019-06-21 11:28:16 UTC
Description of problem:
The OCP provider is unable to refresh and receives an 'apiserver panic' error.
~~~
[----] E, [2019-06-21T10:03:51.072752 #19344:590f4c] ERROR -- : MIQ(ManageIQ::Providers::Openshift::ContainerManager::Refresher#refresh) EMS: [prod2-ocp], id: [1000000000005] Refresh failed
[----] E, [2019-06-21T10:03:51.073003 #19344:590f4c] ERROR -- : [JSON::ParserError]: 765: unexpected token at 'This request caused apiserver to panic. Look in the logs for details.
'  Method:[block (2 levels) in <class:LogProxy>]
[----] E, [2019-06-21T10:03:51.073119 #19344:590f4c] ERROR -- : /usr/share/gems/gems/json-2.1.0/lib/json/common.rb:156:in `parse'
/usr/share/gems/gems/json-2.1.0/lib/json/common.rb:156:in `parse'
/opt/rh/cfme-gemset/gems/kubeclient-4.1.2/lib/kubeclient/common.rb:501:in `format_response'
/opt/rh/cfme-gemset/gems/kubeclient-4.1.2/lib/kubeclient/common.rb:315:in `get_entities'
/opt/rh/cfme-gemset/gems/kubeclient-4.1.2/lib/kubeclient/common.rb:206:in `block (2 levels) in define_entity_methods'
. . .
~~~

A 'namespace missing' error can also be seen:
~~~
[----] E, [2019-06-17T13:54:06.551838 #29257:4a8f58] ERROR -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics unavailable: [ContainerGroup(1000000083721)] Hawkular::Exception: {"ErrorMsg": "Could not determine a namespace id for namespace its-debrief-cbj-build"}
[----] I, [2019-06-17T13:54:06.572714 #29257:4a8f58]  INFO -- : Exception in realtime_block :total_time - Timings: {:capture_state=>0.0023462772369384766, :collect_data=>1.13718843460083, :total_time=>1.163102388381958}
[----] E, [2019-06-17T13:54:06.573013 #29257:4a8f58] ERROR -- : MIQ(MiqQueue#deliver) Message id: [1000168357079], Error: [Hawkular::Exception: {"ErrorMsg": "Could not determine a namespace id for namespace its-debrief-cbj-build"}]
[----] E, [2019-06-17T13:54:06.573163 #29257:4a8f58] ERROR -- : [ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture::CollectionFailure]: Hawkular::Exception: {"ErrorMsg": "Could not determine a namespace id for namespace its-debrief-cbj-build"}  Method:[block (2 levels) in <class:LogProxy>]
[----] E, [2019-06-17T13:54:06.573248 #29257:4a8f58] ERROR -- : /opt/rh/cfme-gemset/bundler/gems/cfme-providers-kubernetes-339029025f99/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_legacy_capture_context.rb:59:in `rescue in fetch_counters_data'
~~~

Note: The CU has two providers; one refreshes successfully, but the other does not.

Version-Release number of selected component (if applicable):
Red Hat CloudForms 4.7

How reproducible:
The CU has two providers; one refreshes successfully, but the other does not.

Steps to Reproduce:
1. Add an OpenShift provider in CloudForms.
The data is never refreshed, because it is not fetched from OpenShift into CloudForms (in the CU scenario).

Actual results:
Refresh for the provider fails

Expected results:
The provider should refresh successfully

Comment 4 Beni Paskin-Cherniavsky 2019-06-24 09:47:25 UTC
This sounds like a crash in OpenShift, which should be looked at by OpenShift people.
We definitely need OpenShift logs, especially from the apiserver!

My completely uneducated guess is that CloudForms makes unusually large requests — e.g. get all pods in all namespaces — and this exhausts some resource on OpenShift side (RAM? timeout?).
(If so, the crash should be reproducible by requests like `oc get pod --all-namespaces -o json` etc.)
If customer can _easily_ increase RAM on OpenShift masters, that's worth trying, but as I said this is a wild guess...
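
For reference, here is a minimal sketch of the kind of unchunked, cluster-wide listing the refresh issues via kubeclient (the gem visible in the backtrace above). The connection details are purely illustrative and not taken from the customer environment:

~~~
require 'kubeclient'

# Illustrative connection only; the real refresh builds this from the
# provider's endpoint and token.
client = Kubeclient::Client.new(
  'https://openshift-master.example.com:8443/api', 'v1',
  auth_options: { bearer_token: ENV['OPENSHIFT_TOKEN'] }
)

# A single request for every pod in every namespace; on a large cluster the
# apiserver and etcd must assemble the entire response in one shot.
pods = client.get_pods
puts "fetched #{pods.size} pods in one request"
~~~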

Does openshift have HA apiservers?  Do other openshift users experience access interruption from the crash?
You can "pause" the Cloudforms provider to stop it from attempting refresh.
https://access.redhat.com/documentation/en-us/red_hat_cloudforms/4.7/html/managing_providers/containers_providers#pausing_resuming_containers_providers

---

The exception about Hawkular "Could not determine a namespace id for namespace its-debrief-cbj-build" might not be a big deal.
IIUC this will only fail metrics for this specific resource (is it a pod? project?), unless the message repeats a lot.

Is this the same provider that can't complete refresh?  Without inventory refresh, metrics collection will get increasingly broken, and it is not worth debugging until refresh is fixed.

Comment 5 dmetzger 2019-06-24 14:02:05 UTC
Can we get the questions in comment #4 answered and possibly verify if memory exhaustion caused an OpenShift crash?

Comment 6 Beni Paskin-Cherniavsky 2019-06-25 09:13:17 UTC
Quick summary of findings on linked case:
The /api/v1/pods request times out in the OpenShift apiserver while waiting for the etcd response.
https://access.redhat.com/solutions/4220191 might provide relief by increasing the timeout?
This is a workaround (>1min for a single etcd request feels bad); somebody with more OpenShift experience is needed for better advice.

The high load that large requests like "get all pods" put on OpenShift is a long-known issue.
Ideally ManageIQ would use chunked queries which are easier on the whole chain; https://github.com/ManageIQ/manageiq-providers-kubernetes/pull/312 had some abandoned work towards this.
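
For context, kubeclient 4.x exposes the Kubernetes API's `limit`/`continue` pagination, which is the kind of chunked query meant here. A minimal sketch (not that PR's actual code; `client` and the chunk size are illustrative):

~~~
# Assumes `client` is a Kubeclient::Client as in the earlier sketch.
continue_token = nil
all_pods = []

loop do
  # Each request returns at most `limit` items plus a continue token pointing
  # at the next chunk (servers that ignore `limit` just return everything).
  chunk = client.get_pods(limit: 500, continue: continue_token)
  all_pods.concat(chunk.to_a)
  break if chunk.last?            # no continue token => last chunk
  continue_token = chunk.continue
end

puts "fetched #{all_pods.size} pods in chunks of up to 500"
~~~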

Comment 10 Beni Paskin-Cherniavsky 2019-06-26 11:51:42 UTC
Interestingly, prod1-ocp env refreshes also occasionally failed, in 10–20% of runs, for a different reason:

[----] E, [2019-06-17T15:01:55.414115 #17048:4a8f58] ERROR -- : Error when saving InventoryCollection:<ContainerImage> with strategy: , saver_strategy: batch, targeted: false. Message: undefined method `[]=' for nil:NilClass
[----] I, [2019-06-17T15:01:55.414492 #17048:4a8f58]  INFO -- : Exception in realtime_block :ems_refresh - Timings: {:collect_inventory_for_targets=>49.55200004577637, :parse_targeted_inventory=>96.73378157615662, :save_inventory=>52.38033437728882, :ems_refresh=>198.66671109199524}
[----] E, [2019-06-17T15:01:55.414643 #17048:4a8f58] ERROR -- : MIQ(ManageIQ::Providers::Openshift::ContainerManager::Refresher#refresh) EMS: [prod1-ocp], id: [1000000000004] Refresh failed
[----] E, [2019-06-17T15:01:55.416552 #17048:4a8f58] ERROR -- : [NoMethodError]: undefined method `[]=' for nil:NilClass  Method:[block (2 levels) in <class:LogProxy>]
[----] E, [2019-06-17T15:01:55.419961 #17048:4a8f58] ERROR -- : /var/www/miq/vmdb/app/models/manageiq/providers/inventory/persister/builder/container_manager.rb:252:in `block (3 levels) in custom_reconnect_block'
/opt/rh/cfme-gemset/gems/activerecord-5.0.7.2/lib/active_record/relation/delegation.rb:38:in `each'
/opt/rh/cfme-gemset/gems/activerecord-5.0.7.2/lib/active_record/relation/delegation.rb:38:in `each'
/var/www/miq/vmdb/app/models/manageiq/providers/inventory/persister/builder/container_manager.rb:243:in `block (2 levels) in custom_reconnect_block'
/var/www/miq/vmdb/app/models/manageiq/providers/inventory/persister/builder/container_manager.rb:242:in `each'
/var/www/miq/vmdb/app/models/manageiq/providers/inventory/persister/builder/container_manager.rb:242:in `each_slice'
/var/www/miq/vmdb/app/models/manageiq/providers/inventory/persister/builder/container_manager.rb:242:in `block in custom_reconnect_block'
/opt/rh/cfme-gemset/gems/inventory_refresh-0.1.2/lib/inventory_refresh/save_collection/saver/concurrent_safe_batch.rb:107:in `save!'
/opt/rh/cfme-gemset/gems/inventory_refresh-0.1.2/lib/inventory_refresh/save_collection/saver/base.rb:84:in `save_inventory_collection!'
/opt/rh/cfme-gemset/gems/inventory_refresh-0.1.2/lib/inventory_refresh/save_collection/base.rb:37:in `save_inventory'
/opt/rh/cfme-gemset/gems/inventory_refresh-0.1.2/lib/inventory_refresh/save_collection/base.rb:24:in `save_inventory_object_inventory'
/opt/rh/cfme-gemset/gems/inventory_refresh-0.1.2/lib/inventory_refresh/save_collection/topological_sort.rb:29:in `block (2 levels) in save_collections'
/opt/rh/cfme-gemset/gems/inventory_refresh-0.1.2/lib/inventory_refresh/save_collection/topological_sort.rb:28:in `each'
/opt/rh/cfme-gemset/gems/inventory_refresh-0.1.2/lib/inventory_refresh/save_collection/topological_sort.rb:28:in `block in save_collections'
/opt/rh/cfme-gemset/gems/inventory_refresh-0.1.2/lib/inventory_refresh/save_collection/topological_sort.rb:26:in `each'
/opt/rh/cfme-gemset/gems/inventory_refresh-0.1.2/lib/inventory_refresh/save_collection/topological_sort.rb:26:in `each_with_index'
/opt/rh/cfme-gemset/gems/inventory_refresh-0.1.2/lib/inventory_refresh/save_collection/topological_sort.rb:26:in `save_collections'
/opt/rh/cfme-gemset/gems/inventory_refresh-0.1.2/lib/inventory_refresh/save_inventory.rb:24:in `save_inventory'
/var/www/miq/vmdb/app/models/manageiq/providers/inventory/persister.rb:29:in `persist!'
/var/www/miq/vmdb/app/models/manageiq/providers/base_manager/refresher.rb:189:in `save_inventory'
/var/www/miq/vmdb/app/models/manageiq/providers/base_manager/refresher.rb:109:in `block in refresh_targets_for_ems'

The relevant line is apparently https://github.com/ManageIQ/manageiq/blob/hammer-7/app/models/manageiq/providers/inventory/persister/builder/container_manager.rb#L252
Indeed `hash = attributes_index.delete(index)` might return nil if `attributes_index` didn't contain `index` as a key.
Adam or Ladislav, could you take a look?
This is less urgent, as most prod1-ocp refreshes do complete.
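
To illustrate the failure mode: `Hash#delete` returns nil when the key is absent, and the later `[]=` call then raises the NoMethodError seen above. A minimal sketch of the problem and of a defensive guard (variable names and values are illustrative; this is not the actual fix):

~~~
attributes_index = { ["sha256:abc", "registry.example.com/img"] => { :name => "img" } }

hash = attributes_index.delete(["sha256:missing", "other/img"])
# => nil, because that index key is not present
# hash[:id] = 42   # would raise NoMethodError: undefined method `[]=' for nil:NilClass

# Defensive variant: only touch the hash if the index was actually found.
if (hash = attributes_index.delete(["sha256:abc", "registry.example.com/img"]))
  hash[:id] = 42
else
  # log and skip this record instead of failing the whole save
end
~~~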

Comment 11 Beni Paskin-Cherniavsky 2019-07-01 12:01:18 UTC
Ladislav, see comment 10, sounds like a bug.  I'm not familiar with that code, could you take a look?

Comment 12 Ladislav Smola 2019-07-01 12:44:47 UTC
Beni: could you try to reproduce it?

Also, this seems to be explained and fixed in our cloud/infra reconnect code: https://github.com/Ladas/manageiq/blob/6834b42669bb0d33494736d0653ca1158c25f6ab/app/models/manageiq/providers/inventory/persister/builder/shared.rb#L21 (the MIQ DB is not protected from duplicates).

The code under https://github.com/ManageIQ/manageiq/blob/hammer-7/app/models/manageiq/providers/inventory/persister/builder/container_manager.rb#L252 was taken from the OpenShift provider.

The two code paths look very similar (except for some asserts and the batch size); we should probably unify them.

Comment 13 Beni Paskin-Cherniavsky 2019-07-07 09:17:44 UTC
prod2 openshift "panic" timeout: Basically, I'm stuck and unlikely to solve this.  Need someone who understands Openshift apiserver!

We failed to reproduce the timeout with `oc` and `curl`: they succeed in 7 seconds, with 2 MB transferred. No idea why, but CF reliably sees the "panic" error.


prod1 NoMethodError: haven't touched it.  Ladislav's analysis looks good.  It should probably be split into a separate BZ.

Comment 26 Beni Paskin-Cherniavsky 2019-09-20 14:59:42 UTC
API chunking implemented now (thanks to Adam!):
https://github.com/ManageIQ/manageiq-providers-kubernetes/pull/312
https://github.com/ManageIQ/manageiq-providers-openshift/pull/145
https://github.com/ManageIQ/manageiq-providers-openshift/pull/146

We don't really know if that'll solve this specific customer's problem, but CloudForms' requests should be much easier on OpenShift now.
Moving to POST as this is the improvement we know how to make.

I don't think we can backport these now.
IMO we need them on master for a while to become confident that they are not risky and that they actually help, before considering a backport.
But I'll leave this decision to Adam.

== Notes for QE ==

We don't know how to reproduce customer's problem.

To test these changes, the main goal is testing for regressions - does refresh still work:
- Does refresh work with OpenShift 3.7 (or earlier), which ignores the `limit` param?
- Does refresh work with OpenShift 3.8–3.11, which understands `limit`?

Hopefully, on OpenShift ≥3.8, this change might also make refresh faster, but that's not certain.
What it should do is make more, smaller requests to OpenShift, (A) reducing the risk that they fail and (B) reducing OpenShift RAM usage (in the apiserver and etcd).

Comment 27 Adam Grare 2019-09-20 15:12:06 UTC
I agree we should do some testing with an upstream appliance before backporting this.
In addition to the tests that Beni lists above, I would add: on a 3.8–3.11 OpenShift, drop the chunk_size down to ~10 on a relatively large cluster to confirm that looping over the pages works correctly.
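
One way to check that out of band: a sketch using kubeclient directly with an artificially small limit (not a test taken from the PRs; `client` as in the earlier sketches):

~~~
pages = 0
paged_total = 0
token = nil

loop do
  chunk = client.get_pods(limit: 10, continue: token)
  pages += 1
  paged_total += chunk.count
  break if chunk.last?
  token = chunk.continue
end

unpaged_total = client.get_pods.count
puts "#{pages} pages, #{paged_total} pods (unchunked listing: #{unpaged_total})"
# The two totals should match, modulo pods created or deleted between the calls.
~~~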

Comment 31 dmetzger 2019-09-20 17:43:26 UTC
QE has verified they have resources to address this ticket in 5.10.11 and 5.11.1 testing.

Comment 33 Satoe Imaishi 2019-09-20 17:57:16 UTC
(In reply to dmetzger from comment #31)
> QE has verified they have resources to address this ticket in 5.10.11 and
> 5.11.1 testing.

In that case... Beni/Adam - not sure if all 3 PRs in Comment 26 should be "ivanchuk/yes" and "hammer/yes". Please label the PRs accordingly.

Comment 34 Adam Grare 2019-09-20 18:17:38 UTC
ACK will add the right labels

Comment 37 dmetzger 2019-10-22 18:49:35 UTC
*** Bug 1753788 has been marked as a duplicate of this bug. ***

