Description of problem:
When simulating a Pod event in policies, an "Unable to find target" error prevents a created Policy from running.

Version-Release number of selected component (if applicable):

How reproducible:
Often

Steps to Reproduce:
1. Open https://10.16.5.159 to see the created Pod Policy Profile
2. Log into 10.35.161.112 to control Pod scaling
3. Run oc scale --replicas=10 replicationcontrollers jenkins-1 to scale up the jenkins pod
4. Check whether the Pod Policy Profile ran successfully following the scale-up
5. Run grep "Unable to find target" evm.log to look for the error

Actual results:
The mail is often not sent, and "Unable to find target" errors appear in the log.

Expected results:
The policy should run and the user should receive a mail.

Additional info:
Beni please take a look, thanks
Indeed, this is a known limitation of the current implementation shipping in 5.6.1; it's certainly a bug from a user's perspective and deserves a BZ to track.

The problem happens when events arrive for a recently created Pod/Replicator/Node that we don't yet have in CFME's inventory. The symptoms are an "Unable to find target [...], skipping policy evaluation" message in evm.log, and no policies run for this event. (Which event it was can be figured out from the preceding evm.log lines.)

How bad is it? Inventory refresh defaults to every 15min, plus the system *tries* to schedule an inventory refresh for that provider soon after *every* event, but that doesn't scale well: the "window of lost events" will get longer for bigger kubernetes/openshift clusters!
- Pods are the main victim, as they are created frequently. Events like Pod Scheduled / Pod Failed Scheduling happen quickly after creation and will *almost always* get dropped!
- New Replicators: the first few events are likely to get dropped.
- Nodes? Bringing up a new node is much slower, but I suppose the first Node Ready event might get dropped.
- There is no problem with the Container Image Discovered event, as it's synthesized internally *after* we see the image in inventory refresh.
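To make the "window of lost events" concrete, here is a minimal Python sketch. It is a toy model under stated assumptions, not ManageIQ code: the `Event` class, the `dropped_events` function, the pod name, and all timings are hypothetical. It assumes a target becomes visible only at the first refresh at or after its creation, and that any event arriving before then is skipped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    time: float       # arrival time in seconds (hypothetical clock)
    event_type: str   # e.g. "Pod Scheduled"
    target: str       # name of the pod/replicator/node the event refers to


def dropped_events(created_at, events, refresh_times):
    """Return the events that arrive before their target is in inventory.

    Assumption: a target enters inventory at the first refresh that
    happens at or after the target's creation time.
    """
    dropped = []
    for ev in events:
        born = created_at[ev.target]
        visible_from = min(
            (t for t in refresh_times if t >= born),
            default=float("inf"),  # never refreshed: never visible
        )
        if ev.time < visible_from:
            dropped.append(ev)
    return dropped


# Hypothetical timeline: pod created at t=10s, refreshes at t=0, 60, 120.
created_at = {"jenkins-1-abcde": 10.0}
events = [
    Event(10.5, "Pod Scheduled", "jenkins-1-abcde"),  # before the t=60 refresh
    Event(75.0, "later event", "jenkins-1-abcde"),    # after the t=60 refresh
]
lost = dropped_events(created_at, events, refresh_times=[0.0, 60.0, 120.0])
```

In this model only the early "Pod Scheduled" event ends up in `lost`; lengthening the gap between refreshes widens the window in which events are dropped.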
There is some discussion of directions starting at https://github.com/ManageIQ/manageiq/issues/8654#issuecomment-221315280

There are secondary problems with Scope & Condition evaluation using stale data. E.g. a replicator condition checking its number of replicas will use the number we saw at the last refresh; when a Replicator Successfully Created Pod event arrives, that number will always be incorrect.
A good solution will only be possible after we implement targeted refresh. We started work towards that, but it's a long way off.

An alternative which *maybe* could work is delaying event delivery until we observe the entity. Also non-trivial.

=> Switching to cfme-future; this won't make 5.8.0, and in my estimation any solution will be too invasive for 5.8.z.
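As a rough illustration of the "delay event delivery until we observe the entity" idea, here is a minimal Python sketch. Everything in it is an assumption made for illustration (the `DeferredEventQueue` class, its method names, and the `max_wait` timeout); it does not reflect how ManageIQ queues events.

```python
from collections import defaultdict


class DeferredEventQueue:
    """Hold events for unknown targets until a refresh observes them."""

    def __init__(self, max_wait):
        self.max_wait = max_wait            # seconds to keep an undeliverable event
        self.inventory = set()              # targets seen by refresh so far
        self.pending = defaultdict(list)    # target -> [(arrival_time, event)]
        self.delivered = []                 # (target, event) handed to policy
        self.expired = []                   # (target, event) given up on

    def on_event(self, now, target, event):
        if target in self.inventory:
            self.delivered.append((target, event))
        else:
            # Target not in inventory yet: park the event instead of dropping it.
            self.pending[target].append((now, event))

    def on_refresh(self, now, seen_targets):
        self.inventory |= set(seen_targets)
        for target in list(self.pending):   # copy keys; we mutate the dict below
            waiting = self.pending.pop(target)
            if target in self.inventory:
                # Deliver parked events in arrival order.
                self.delivered.extend((target, ev) for _, ev in waiting)
                continue
            keep = []
            for arrived, ev in waiting:
                if now - arrived > self.max_wait:
                    self.expired.append((target, ev))
                else:
                    keep.append((arrived, ev))
            if keep:
                self.pending[target] = keep


queue = DeferredEventQueue(max_wait=1800)
queue.on_event(10.5, "jenkins-1-abcde", "Pod Scheduled")   # parked, not dropped
queue.on_refresh(60.0, {"jenkins-1-abcde"})                # now delivered
```

The timeout matters for short-lived objects that a refresh never sees: without it their events would stay parked forever. The sketch also does nothing about the stale-data problem; it only changes when evaluation happens.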
(In reply to Beni Paskin-Cherniavsky from comment #6)
> => Switching cfme-future, this won't make 5.8.0, and in my estimation any
> solution will be too invasive for 5.8.z.

cfme-future is currently for unscheduled work. This is a scheduled/assigned BZ and I am managing the target release. If indeed it won't be possible in 5.8, I will move it to 5.9/6.0 (not cfme-future).
Federico, should this be moved to 5.9?
Usability improvement: the current log message is buried in evm.log and only says whether the target was a pod, replicator, or node:

    Unable to find target [container_group], skipping policy evaluation

It should go to policy.log, because that's where people look to understand if/why events are not working. And it should include the event name and the specific target name (and namespace?) that was not found.
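For discussion, here is a minimal Python sketch of what a more informative message could look like. The wording, the function names, and the `policy` logger are all hypothetical; this is not the format ManageIQ uses, only one way to carry the event name, target class, name, and namespace in a single line.

```python
import logging

# Stand-in for policy.log; the logger name is an assumption for this sketch.
policy_log = logging.getLogger("policy")


def missing_target_message(event_type, target_class, name, namespace=None):
    """Build a log line naming the event and the exact target that was not found."""
    where = f"{namespace}/{name}" if namespace else name
    return (
        f"Event [{event_type}]: unable to find target "
        f"[{target_class}] [{where}], skipping policy evaluation"
    )


def log_missing_target(event_type, target_class, name, namespace=None):
    """Emit the message at warning level on the policy logger."""
    policy_log.warning(
        missing_target_message(event_type, target_class, name, namespace)
    )


message = missing_target_message(
    "Pod Scheduled", "container_group", "jenkins-1-abcde", namespace="ci"
)
```

Keeping the original "unable to find target ... skipping policy evaluation" phrasing means an existing grep for that text would still match if done case-insensitively.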
*** Bug 1489616 has been marked as a duplicate of this bug. ***
Discussion on this in https://github.com/ManageIQ/manageiq/pull/16497
*** Bug 1518188 has been marked as a duplicate of this bug. ***
While we haven't figured out how to properly solve the root issue, there is one workaround direction that's clearly doable, for *some use cases*: generate synthetic events from refresh.

We already had Container Image Discovered and added Container Project Discovered in https://github.com/ManageIQ/manageiq/pull/16903. It is not hard to add more (e.g. BZ 1499539 proposed Pod Discovered)...

These are reliable (except for short-lived objects that refresh might never see; streaming refresh would be necessary and sufficient for those).

This is hard to extend to events other than created/deleted (although if we get streaming refresh we will see all transitions, and it _might_ become an interesting idea).
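The idea of synthesizing events from refresh can be sketched as a diff between two inventory snapshots. This is a toy Python model, not the ManageIQ refresh code: the `synthetic_events` function, the snapshot layout (kind mapped to a set of names), and the "Deleted" event naming are assumptions for illustration.

```python
def synthetic_events(previous, current):
    """Diff two inventory snapshots and return (event_name, object_name) pairs.

    Each snapshot maps a kind (e.g. "Pod") to the set of object names that
    the refresh saw. Only creation and deletion can be detected this way.
    """
    events = []
    for kind in sorted(previous.keys() | current.keys()):
        before = previous.get(kind, set())
        after = current.get(kind, set())
        events += [(f"{kind} Discovered", name) for name in sorted(after - before)]
        events += [(f"{kind} Deleted", name) for name in sorted(before - after)]
    return events


# Hypothetical snapshots from two consecutive refreshes.
previous = {"Container Project": {"default"}}
current = {"Container Project": {"default", "ci"}, "Pod": {"jenkins-1-abcde"}}
new_events = synthetic_events(previous, current)
```

Because the target is by construction already in inventory when the event is generated, the "Unable to find target" case cannot occur for these events. An object created and deleted between two refreshes appears in neither snapshot, so it produces no event at all, which is the short-lived-object gap mentioned above.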