Description of problem:
We have connected to an OpenShift provider and are trying to create a control policy, however the actions are not being executed. I do not see any OpenShift events in the evm logs on any of the Event Monitor appliances.

Version-Release number of selected component (if applicable):
5.8.1.5-20170725160636_e433fc0

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
The first thing that stands out is that workers were stopped with "evm_worker_memory_exceeded", thousands of times:

[----] W, [2017-09-05T03:12:45.635906 #14767:1073130] WARN -- : MIQ(MiqServer#validate_worker) Worker [MiqPriorityWorker] with ID: [101000000120629], PID: [30391], GUID: [6b21d240-9208-11e7-a7fe-005056962079] process memory usage [734983000] exceeded limit [629145600], requesting worker to exit

> grep 'WARN.*requesting worker to exit' */log/evm.log | egrep -o 'Worker \S+' | sort | uniq -c
   2632 Worker [ManageIQ::Providers::Openshift::ContainerManager::MetricsCollectorWorker]
   2272 Worker [MiqGenericWorker]
    360 Worker [MiqPriorityWorker]
    133 Worker [MiqReportingWorker]

I don't think this explains the lack of events, but it is not a healthy situation. This system needs more RAM. Looking deeper...
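For anyone who wants to repeat the count without the shell pipeline, here is a rough Python equivalent. It is only a sketch: the function name count_memory_exits and the glob pattern are my own choices, and it assumes the warning lines look like the one quoted above.

```python
import glob
import re
from collections import Counter

# Same filter as the grep: a WARN line that ends in a worker exit request.
EXIT_RE = re.compile(r"WARN.*requesting worker to exit")
# Worker class name inside the brackets, e.g. "Worker [MiqPriorityWorker]".
WORKER_RE = re.compile(r"Worker \[([^\]]+)\]")


def count_memory_exits(lines):
    """Count worker exit requests per worker class (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        if EXIT_RE.search(line):
            match = WORKER_RE.search(line)
            if match:
                counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    totals = Counter()
    # Assumed layout: one unpacked log bundle per directory, as in the grep above.
    for path in glob.glob("*/log/evm.log"):
        with open(path, errors="replace") as handle:
            totals.update(count_memory_exits(handle))
    for worker, count in totals.most_common():
        print(f"{count:7d} Worker [{worker}]")
```

Unlike the grep output, this sorts by count rather than by name, which makes the worst offenders easier to spot.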
Good news from evm_current_densba3osclf01.qic.tiaa-cref.org_20170914_124432/log/evm.log is that it did receive events from OpenShift:

      2 event_type=>"CONTAINER_CREATED"
    525 event_type=>"CONTAINER_FAILED"
      1 event_type=>"CONTAINER_KILLING"
      2 event_type=>"CONTAINER_STARTED"
     31 event_type=>"CONTAINER_UNHEALTHY"
      2 event_type=>"POD_SCHEDULED"

Checking whether they made it any further through automate/policy...
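A tally like the one above can be reproduced with a few lines of Python. This is a sketch under assumptions: the only thing taken from the log is the event_type=>"NAME" fragment, the rest of the line layout is not assumed, and count_event_types is a name I made up. The same event may be logged on more than one line, so treat the numbers as log occurrences rather than unique events.

```python
import re
from collections import Counter

# Matches fragments such as event_type=>"CONTAINER_FAILED" anywhere in a line.
EVENT_TYPE_RE = re.compile(r'event_type=>"([A-Z_]+)"')


def count_event_types(lines):
    """Count occurrences of each event_type value (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        counts.update(EVENT_TYPE_RE.findall(line))
    return counts


def print_tally(path):
    """Print counts for one evm.log, sorted by event type like `sort | uniq -c`."""
    with open(path, errors="replace") as handle:
        counts = count_event_types(handle)
    for event_type in sorted(counts):
        print(f'{counts[event_type]:7d} event_type=>"{event_type}"')
```

Call print_tally with the path to the evm.log of interest.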
Found another problem in the logs: the customer has several Node alerts defined ("OSE Node CPU > 0", "OSE Node Datawarehouse Alerts", "OSE Node Mem > 0"), and they don't work. Node alerts have never worked; it is a mistake that the UI allows defining them. RFE to implement them: bug 1494599 (I added the stack trace from this log there).
For this BZ, the problem the customer is seeing is specifically that Pod policies don't work most of the time. Node and Image events do trigger policies reliably. I suspected, but wasn't certain, that this is bug 1367114, and have now finally found the evidence in the logs. It is bug 1367114: policies don't work when an event arrives before the pod has been seen by inventory refresh. (Such an event may not even be logged to policy.log, but in evm.log we can see it being processed and hitting "Unable to find target".)

Work continues on implementing workarounds for the customer's use case, and several RFEs were filed, but I'm going to close this BZ as a duplicate.

*** This bug has been marked as a duplicate of bug 1367114 ***