Bug 1367114 - Events for recently created pods & replicators don't trigger policies "Unable to find target"
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Control
Version: 5.6.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: GA
Target Release: 5.10.0
Assignee: Loic Avenel
QA Contact: juwatts
URL:
Whiteboard: container:event
Duplicates: 1489616 1518188
Depends On:
Blocks: 1503797
Reported: 2016-08-15 14:51 UTC by Pavel Zagalsky
Modified: 2021-09-09 11:54 UTC (History)
CC List: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-07-30 17:57:58 UTC
Category: ---
Cloudforms Team: CFME Core
Target Upstream Version:
Embargoed:



Description Pavel Zagalsky 2016-08-15 14:51:21 UTC

Comment 2 Pavel Zagalsky 2016-08-15 15:01:53 UTC
Description of problem:
When a Pod event is simulated for policies, an "Unable to find target" error prevents the created Policy from running.

Version-Release number of selected component (if applicable):


How reproducible:
Often

Steps to Reproduce:
1. Open https://10.16.5.159 to see the created Pod Policy Profile
2. Log into 10.35.161.112 to control Pods scaling
3. Run oc scale --replicas=10 replicationcontrollers jenkins-1 to scale up the jenkins pod
4. Check that the Pod Policy profile was run successfully following the scale up
5. Run grep "Unable to find target" evm.log to look for the error
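
The grep check in the last step can be extended to count how often each target type is affected. The following is a minimal sketch, assuming the message format quoted elsewhere in this bug ("Unable to find target [container_group], skipping policy evaluation"); the function name is made up for illustration and the path to evm.log is left to the caller.

```python
import re
from collections import Counter

# Assumed message format: "Unable to find target [<type>], skipping policy evaluation".
# The bracketed part is captured as the target type.
PATTERN = re.compile(r"Unable to find target \[(?P<target>[^\]]*)\]")


def count_missing_targets(lines):
    """Count 'Unable to find target' occurrences per target type."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group("target")] += 1
    return counts


# Example use (the log path is an assumption, adjust to your appliance):
#   with open("evm.log", errors="replace") as log:
#       print(count_missing_targets(log))
```

A rising count for one target type after each scale-up suggests that events for newly created objects of that type are being skipped.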

Actual results:
The mail is often not sent, and "Unable to find target" errors appear in the log

Expected results:
The policy should run and the user should be able to receive a mail

Additional info:

Comment 3 Mooli Tayer 2016-08-16 17:21:49 UTC
Beni please take a look, thanks

Comment 4 Beni Paskin-Cherniavsky 2016-08-17 08:10:30 UTC
Indeed, this is a known limitation of the current implementation shipping in 5.6.1; it's certainly a bug from a user's perspective and deserves a BZ to track.

The problem happens when events arrive for a recently created Pod/Replicator/Node that we don't yet have in CFME's inventory.  The symptoms are: "Unable to find target [...], skipping policy evaluation" message in evm.log, and no policies run for this event.  (Which event it was can be figured out from the preceding evm.log lines.)

How bad is it?  Inventory refresh defaults to every 15 minutes, plus the system *tries* to schedule inventory refresh for that provider soon after *every* event, but that doesn't scale well: the "window of lost events" will get longer for bigger Kubernetes/OpenShift clusters!

- Pods are the main victim, as they are created frequently.
  Events like Pod Scheduled / Pod Failed Scheduling happen quickly after
  creation and will *almost always* get dropped!

- New Replicators: first few events are likely to get dropped.

- Nodes?  Bringing up a new node is much slower, but I suppose 
  the first Node Ready event might get dropped.

- There is no problem with Container Image Discovered event as it's synthesized internally *after* we see the image in inventory refresh.
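
The race described in this comment can be illustrated with a toy model. This is only a sketch of the behaviour as described here, not CFME code: the class, the function, the event labels and the pod name are all hypothetical.

```python
class ToyInventory:
    """Toy stand-in for the inventory: it only changes when refresh() runs."""

    def __init__(self):
        self.known = set()

    def refresh(self, cluster_objects):
        # A full refresh copies whatever exists in the cluster right now.
        self.known = set(cluster_objects)


def deliver(inventory, event_name, target):
    """Return 'evaluated' if the target is in inventory, else 'skipped'."""
    if target not in inventory.known:
        # Corresponds to "Unable to find target [...], skipping policy evaluation".
        return "skipped"
    return "evaluated"


cluster = set()
inventory = ToyInventory()
inventory.refresh(cluster)

cluster.add("jenkins-1-example")  # hypothetical pod created by the scale-up
# The event arrives before the next refresh, so the target is unknown.
first = deliver(inventory, "pod_scheduled", "jenkins-1-example")
inventory.refresh(cluster)
# A later (hypothetical) event for the same pod finds its target.
second = deliver(inventory, "pod_ready", "jenkins-1-example")
```

In this model the first event is skipped and the second is evaluated; the longer a refresh takes, the more events fall into the first category.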

Comment 5 Beni Paskin-Cherniavsky 2016-08-17 08:17:46 UTC
There is some discussion of directions starting at
https://github.com/ManageIQ/manageiq/issues/8654#issuecomment-221315280

There are secondary problems with Scope & Condition evaluation using stale data.
E.g. a replicator condition checking its number of replicas will use the number we saw at last refresh; when Replicator Successfully Created Pod event arrives, that number will always be incorrect.
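
A tiny illustration of the stale-data problem, assuming a hypothetical condition "replicas greater than a threshold" (the function, the threshold and the numbers are made up):

```python
def replicas_condition(replicas_seen, threshold):
    """Hypothetical policy condition: true when the replica count exceeds threshold."""
    return replicas_seen > threshold


cached_replicas = 2   # value stored at the last inventory refresh
live_replicas = 10    # value in the cluster after scaling up to 10 replicas

# The condition is evaluated against the cached value, not the live one.
stale_result = replicas_condition(cached_replicas, threshold=5)
fresh_result = replicas_condition(live_replicas, threshold=5)
```

Here the condition fails on the cached value even though it would pass on the live value, so the policy action would not fire for this event.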

Comment 6 Beni Paskin-Cherniavsky 2017-03-27 20:47:01 UTC
A good solution will only be possible after we implement targeted refresh.  We started work towards that but it's a long way off.
An alternative which *maybe* could work is delaying event delivery until we observe the entity.  Also non-trivial.

=> Switching to cfme-future; this won't make 5.8.0, and in my estimation any solution will be too invasive for 5.8.z.
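
The "delay event delivery until we observe the entity" alternative could look roughly like the sketch below. It is an assumption-laden illustration, not a proposed patch: the class name, the target key shape, the timeout default and the injected clock are all hypothetical.

```python
import time
from collections import defaultdict


class PendingEvents:
    """Holds events whose target is not in inventory yet (hypothetical design)."""

    def __init__(self, max_wait_seconds=900.0, clock=time.monotonic):
        self._max_wait = max_wait_seconds
        self._clock = clock
        self._waiting = defaultdict(list)  # target key -> [(arrival time, event)]

    def hold(self, target_key, event):
        """Park an event until its target shows up."""
        self._waiting[target_key].append((self._clock(), event))

    def release(self, target_key):
        """Call when refresh observes target_key; returns events still worth delivering."""
        now = self._clock()
        held = self._waiting.pop(target_key, [])
        return [event for arrived, event in held if now - arrived <= self._max_wait]

    def expire(self):
        """Drop and return events whose target never appeared within max_wait."""
        now = self._clock()
        expired = []
        for key in list(self._waiting):
            keep = []
            for arrived, event in self._waiting[key]:
                if now - arrived <= self._max_wait:
                    keep.append((arrived, event))
                else:
                    expired.append(event)
            if keep:
                self._waiting[key] = keep
            else:
                del self._waiting[key]
        return expired
```

The timeout matters because some targets (e.g. short-lived pods) may never be seen by a refresh; without it the buffer would grow without bound. Ordering relative to events that were delivered immediately is not addressed by this sketch.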

Comment 7 Federico Simoncelli 2017-03-27 21:16:26 UTC
(In reply to Beni Paskin-Cherniavsky from comment #6)
> => Switching cfme-future, this won't make 5.8.0, and in my estimation any
> solution will be too invasive for 5.8.z.

cfme-future is currently for unscheduled work.
This is a scheduled/assigned BZ and I am managing the target release.
If indeed it won't be possible in 5.8, I will move to 5.9/6.0 (not cfme-future).

Comment 8 Mooli Tayer 2017-04-23 12:51:14 UTC
Federico should this be moved to 5.9?

Comment 9 Beni Paskin-Cherniavsky 2017-09-26 17:07:21 UTC
Usability improvement: the current log message is buried in evm.log and only says whether it was a pod, replicator, or node:

    Unable to find target [container_group], skipping policy evaluation

It should go to policy.log, because that's where people look to understand if/why events are not working.
And it should include the event name and the specific target name (and namespace?) that was not found.
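
One possible shape for the richer message, as a sketch only: the function name, argument names and exact wording are hypothetical and not what CFME logs today.

```python
def missing_target_message(event_name, target_type, target_name, namespace=None):
    """Build a log line naming the event and the specific target that was not found."""
    location = f"{namespace}/{target_name}" if namespace else target_name
    return (
        f"Unable to find target [{target_type}] {location} "
        f"for event [{event_name}], skipping policy evaluation"
    )
```

Keeping the existing "Unable to find target [...]" prefix means current grep-based checks would still match.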

Comment 10 Beni Paskin-Cherniavsky 2017-10-19 10:37:56 UTC
*** Bug 1489616 has been marked as a duplicate of this bug. ***

Comment 11 Ari Zellner 2017-12-07 16:06:31 UTC
Discussion on this in https://github.com/ManageIQ/manageiq/pull/16497

Comment 12 Beni Paskin-Cherniavsky 2018-01-25 15:35:57 UTC
*** Bug 1518188 has been marked as a duplicate of this bug. ***

Comment 17 Beni Paskin-Cherniavsky 2018-08-13 17:14:23 UTC
While we haven't figured out how to properly solve the root issue,
there is one workaround direction that's clearly doable, for *some use cases*:
Generate synthetic events from refresh.

We had Container Image Discovered and added Container Project Discovered in https://github.com/ManageIQ/manageiq/pull/16903.
Not hard to add more (e.g. BZ 1499539 proposed Pod Discovered)...
These are reliable (except for short-lived objects that refresh might never see; streaming refresh would be necessary and sufficient for those).

This is hard to extend for events other than created/deleted (although if we get streaming refresh we will see all transitions, and it _might_ become an interesting idea)
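
The "synthetic events from refresh" idea amounts to diffing two inventory snapshots. A minimal sketch, assuming each snapshot is a set of object names; the function and the event name suffixes are hypothetical:

```python
def synthetic_events(previous, current, kind):
    """Derive created/deleted events by comparing two refresh snapshots."""
    created = sorted(current - previous)
    deleted = sorted(previous - current)
    return (
        [(f"{kind}_discovered", name) for name in created]
        + [(f"{kind}_deleted", name) for name in deleted]
    )


# An object that is created and deleted between two refreshes appears in
# neither snapshot, so no event is generated for it.
before = {"a", "b"}
after = {"b", "c"}
events = synthetic_events(before, after, "pod")
```

Because the target is by construction already in inventory when a "discovered" event is generated, such events cannot hit the "Unable to find target" path; the cost is that only creation and deletion are visible, as noted above.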

