Description of problem:
When multiple pods are created at the same time, the automation skips some of the events and is not executed for every pod.

Version-Release number of selected component (if applicable):
4.6

How reproducible:

Steps to Reproduce:
1. Import the attached automate domain.
2. Add an OCP provider if you don't have one already added.
3. On the OCP side, create a new app, e.g.:
~~~
oc new-app redis
~~~
After the app completes you should see the automate code executed in "automation.log":
<AEMethod create_start_date_msg> ----- Create ServiceNow CMDB Record ----- - Return Event: CONTAINER_CREATED POD:
4. Now try to scale the app up:
~~~
oc scale dc --replicas=1 redis
~~~
At this point the automation also runs fine, and you should see the message in the logs.
5. Scale the app up again with 4 or more replicas:
~~~
oc scale dc --replicas=4 redis
~~~

Actual results:
The automation executed once.

Expected results:
The automation should execute as many times as there are replicas.

Additional info:
Does it reliably execute exactly once, or is it a random, less-than-expected number of times?
I think it's more like a less-than-expected number of times.
I managed to reproduce (happens reliably), debugging...

- The problem is before Automate! Not all the events get an EmsEvent created!
- ManageIQ does receive the events! The kubernetes event_catcher_mixin does log "Queuing event ..."

This is encouraging, because it suggests a bug in the kubernetes event processing code path, which will be easier for me to debug than Automate :-))
Debugged with help from Elad & Mooli.

EmsEvent.create_event drops "duplicate" events. For container events, this happens if they are of the same event type, provider, and timestamp - which we only get with 1-second resolution. Scaling up pods hits this easily, if they're all scheduled within the same second. It doesn't even have to be from scaling 1 replication controller - simultaneous events of the same type on completely unrelated pods will still collide!

I'm going to set the EmsEvent ems_ref field to the k8s event's uid. Then 2 events will never collide. It's a somewhat recently added field; I'm not yet sure whether this worked reliably at some older point, or never worked.

----

This is independent of whether we already have the object in inventory when the event arrives. Same type, same second => dropped!

After this is fixed, some things will still not work when the event arrives before refresh got the object (which happens a lot):
- policies won't evaluate (bz 1367114)
+ I hope automate will run - untested, will know when I complete the change...
- EmsEvent container_*_id will be nil
+ EmsEvent container_*_name and container_namespace are set
- Automate scripts *might* break if they look up the object in the inventory.
+ the attached customer's automate scripts seem safe at a quick glance: they only access fields on the event ($evm.root['event_stream']), they don't look up the object

I should add unit tests for event processing where we don't have the object yet, to exercise both code paths.
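To make the collision concrete, here is an illustrative sketch (not the real ManageIQ implementation) of how a duplicate-drop keyed on (event_type, timestamp, chain_id, ems_id, ems_ref) behaves; `dedup_key`, `create_events` and the sample hashes are hypothetical helpers:

```ruby
require 'time'

# The dedup key used to decide whether two events are "duplicates".
def dedup_key(event)
  event.values_at(:event_type, :timestamp, :chain_id, :ems_id, :ems_ref)
end

# Keep only the first event per dedup key, mimicking the duplicate drop.
def create_events(events)
  events.uniq { |e| dedup_key(e) }
end

same_second = Time.parse("2018-06-26 13:02:38 UTC")

# Two pods scheduled within the same second; before the fix ems_ref was nil:
before_fix = [
  { :event_type => "POD_SCHEDULED", :timestamp => same_second, :chain_id => nil, :ems_id => 1, :ems_ref => nil },
  { :event_type => "POD_SCHEDULED", :timestamp => same_second, :chain_id => nil, :ems_id => 1, :ems_ref => nil }
]

# After the fix, ems_ref carries each k8s event's uid, so the keys differ:
after_fix = [
  before_fix[0].merge(:ems_ref => "33be41ad-7941-11e8-8706-001a4a162695"),
  before_fix[1].merge(:ems_ref => "33c4909c-7941-11e8-8706-001a4a162695")
]

puts create_events(before_fix).size # => 1 (one event silently dropped)
puts create_events(after_fix).size  # => 2 (both survive)
```

With a second-resolution timestamp and no ems_ref, the two events are indistinguishable; a unique per-event uid breaks the tie.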
https://github.com/ManageIQ/manageiq-providers-kubernetes/pull/264
New commit detected on ManageIQ/manageiq-providers-kubernetes/master:
https://github.com/ManageIQ/manageiq-providers-kubernetes/commit/6d1cb2c7e69212e8b7d655cdfc8f31e95bb0763a

commit 6d1cb2c7e69212e8b7d655cdfc8f31e95bb0763a
Author:     Beni Cherniavsky-Paskin <cben>
AuthorDate: Thu Jun 21 11:20:27 2018 -0400
Commit:     Beni Cherniavsky-Paskin <cben>
CommitDate: Thu Jun 21 11:20:27 2018 -0400

    Set EmsEvent ems_ref to event's uid, to avoid same-second collision

    EmsEvent.create_event drops "duplicate" events, determined by
    (event_type, timestamp, chain_id, ems_id, ems_ref).
    Container events had no ems_ref (and no chain_id), meaning events of same
    type in same second collided and only one EmsEvent was created.
    The lost events weren't handled by Automate / Policy.

    Using k8s event's uid, not involvedObject's uid which still wouldn't be
    unique (for repeating events on same object).

    https://bugzilla.redhat.com/show_bug.cgi?id=1583832

 app/models/manageiq/providers/kubernetes/container_manager/event_catcher_mixin.rb       | 3 +-
 app/models/manageiq/providers/kubernetes/container_manager/event_parser_mixin.rb        | 3 +-
 spec/models/manageiq/providers/kubernetes/container_manager/event_catcher_mixin_spec.rb | 9 +-
 3 files changed, 10 insertions(+), 5 deletions(-)
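The uid choice in the commit message can be sketched as follows; this is a hypothetical standalone helper (the real change lives in the event parser/catcher mixins, not shown here). The point is that repeated k8s events on the same object share involvedObject.uid, while each Event's own metadata.uid is unique:

```ruby
# Hypothetical helper: take ems_ref from the k8s Event's own uid,
# not from the uid of the object the event is about.
def event_ems_ref(k8s_event)
  k8s_event.dig("metadata", "uid")
end

# Sample k8s Event shaped like the API object (fields trimmed):
event = {
  "metadata"       => { "uid" => "33be41ad-7941-11e8-8706-001a4a162695" },
  "involvedObject" => { "kind" => "Pod", "uid" => "pod-uid-shared-by-repeat-events" },
  "reason"         => "Scheduled"
}

puts event_ems_ref(event) # => 33be41ad-7941-11e8-8706-001a4a162695
```

Had involvedObject.uid been used instead, a second "Scheduled"-type event on the same pod in the same second would still have collided.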
Moving to POST so this BZ can be cloned for 5.9. Testing the 5.9 backport now; will update on the clone BZ. I'm going to attach a minimal reproducer automate domain soon for QE.
Created attachment 1455224 [details]
minimal repro, without confidential params

Minimal repro
=============
Attached container_scheduled_datastore.zip, clear of any confidential params.

1. Automate -> Import/Export -> import container_scheduled_datastore.zip. This creates a new automate domain `repro`.
2. Automate -> Explorer -> select the `repro` domain -> edit the domain: make sure the domain is enabled.
3. Scale up some deployment, replication controller or replicaset by 3 replicas at once.
4. Wait until the new pods get created and *scheduled* to nodes. (POD_SCHEDULED is the event the automate script reacts to.) Can be confirmed by `oc get event | grep -i scheduled`.
5. Wait a few seconds, then `grep '@@@@@' log/automation.log`.

Expected:
[----] I, [2018-06-26T16:02:50.379418 #12186:2accd596b528] INFO -- : <AEMethod create_start_date_msg> @@@@@@@@@@@ container event: POD_SCHEDULED, timestamp: 2018-06-26 13:02:38 UTC, event uid: 33be41ad-7941-11e8-8706-001a4a162695, project: automate-bz, name: hello-world-1-87478, message: Successfully assigned hello-world-1-87478 to cben39-compute002.10.35.48.148.cmio.eng.lab.tlv.redhat.com
[----] I, [2018-06-26T16:02:50.483113 #12084:2accd596bbe0] INFO -- : <AEMethod create_start_date_msg> @@@@@@@@@@@ container event: POD_SCHEDULED, timestamp: 2018-06-26 13:02:38 UTC, event uid: 33c4909c-7941-11e8-8706-001a4a162695, project: automate-bz, name: hello-world-1-vsp6g, message: Successfully assigned hello-world-1-vsp6g to cben39-master001.10.35.48.149.cmio.eng.lab.tlv.redhat.com
[----] I, [2018-06-26T16:02:50.650881 #6188:2accd596b3d4] INFO -- : <AEMethod create_start_date_msg> @@@@@@@@@@@ container event: POD_SCHEDULED, timestamp: 2018-06-26 13:02:38 UTC, event uid: 33c61607-7941-11e8-8706-001a4a162695, project: automate-bz, name: hello-world-1-5xcrn, message: Successfully assigned hello-world-1-5xcrn to cben39-compute001.10.35.48.150.cmio.eng.lab.tlv.redhat.com

Each '@@@@@@@@@@@' line comes from 1 execution of the Automate script - the number of lines should match how many pods you scaled up.

Important: confirm at least 2 lines have exactly the same `timestamp:` (probably all of them will). If you were unlucky and all had distinct timestamps, you didn't test the bug (that case worked before as well) - scale down and up again...
> Important: confirm at least 2 lines have exactly the same `timestamp:` (probably all of them will).

To clarify, this refers to the EmsEvent's timestamp, printed as `timestamp: 2018-06-26 13:02:38 UTC` in the middle of the line. The moment of logging, `[----] I, [2018-06-26T16:02:50.650881 #6188:2accd596b3d4]` at the start of the line, will differ for every line; that's irrelevant.
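A quick way to check for a shared timestamp is to strip each log line down to just the `timestamp:` value and count repeats; any count above 1 confirms the same-second case was exercised. This is a sketch assuming the log format shown above, fed with inline sample lines; on a real appliance, pipe in `grep '@@@@@' log/automation.log` instead of the `printf`:

```shell
printf '%s\n' \
  '... timestamp: 2018-06-26 13:02:38 UTC, event uid: 33be41ad ...' \
  '... timestamp: 2018-06-26 13:02:38 UTC, event uid: 33c4909c ...' \
  '... timestamp: 2018-06-26 13:02:39 UTC, event uid: 33c61607 ...' \
  | sed 's/.*timestamp: \([^,]*\),.*/\1/' \
  | sort | uniq -c
# Sample output: "2 2018-06-26 13:02:38 UTC" and "1 2018-06-26 13:02:39 UTC"
```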
Verified in: 5.10.0.22.20181030184024_26956a0

Followed verification steps in https://bugzilla.redhat.com/show_bug.cgi?id=1583832#c20

After scaling redis to 4 replicas, saw 3 pods scheduled in the oc events:

to epacific-ocp-compute2-v3.cmqe.lab.eng.rdu2.redhat.com
9s 9s 1 redis-1-h4vlt.1563120f0d6b1da2 Pod Normal Scheduled default-scheduler Successfully assigned redis-1-h4vlt to epacific-ocp-compute1-v3.cmqe.lab.eng.rdu2.redhat.com
9s 9s 1 redis-1-hwbdn.1563120f0dad2ab9 Pod Normal Scheduled default-scheduler Successfully assigned redis-1-hwbdn to epacific-ocp-compute2-v3.cmqe.lab.eng.rdu2.redhat.com
9s 9s 1 redis-1-jpxkf.1563120f0dee71bb Pod Normal Scheduled default-scheduler Successfully assigned redis-1-jpxkf to epacific-ocp-compute1-v3.cmqe.lab.eng.rdu2.redhat.com
[root@epacific-ocp-master-v3 ~]#

Verified 3 POD_SCHEDULED events with the same timestamp:

[root@dhcp-8-198-226 log]# grep '@@@@@' automation.log
[----] I, [2018-11-01T14:00:08.253488 #15177:6af048c] INFO -- : <AEMethod create_start_date_msg> @@@@@@@@@@@ container event: POD_SCHEDULED, timestamp: 2018-11-01 17:59:57 UTC, event uid: f1ac9557-ddff-11e8-bc54-001a4a520000, project: bugverify, name: redis-1-deploy, message: Successfully assigned redis-1-deploy to epacific-ocp-compute2-v3.cmqe.lab.eng.rdu2.redhat.com
[----] I, [2018-11-01T14:00:09.185951 #15177:d054f8] INFO -- : <AEMethod create_start_date_msg> @@@@@@@@@@@ container event: POD_SCHEDULED, timestamp: 2018-11-01 18:00:00 UTC, event uid: f354c5e2-ddff-11e8-bc54-001a4a520000, project: bugverify, name: redis-1-7k5r7, message: Successfully assigned redis-1-7k5r7 to epacific-ocp-compute2-v3.cmqe.lab.eng.rdu2.redhat.com
[----] I, [2018-11-01T14:02:41.886685 #15177:d099f4] INFO -- : <AEMethod create_start_date_msg> @@@@@@@@@@@ container event: POD_SCHEDULED, timestamp: 2018-11-01 18:02:33 UTC, event uid: 4e851efb-de00-11e8-bc54-001a4a520000, project: bugverify, name: redis-1-h4vlt, message: Successfully assigned redis-1-h4vlt to epacific-ocp-compute1-v3.cmqe.lab.eng.rdu2.redhat.com
[----] I, [2018-11-01T14:02:42.231086 #3181:d0019c] INFO -- : <AEMethod create_start_date_msg> @@@@@@@@@@@ container event: POD_SCHEDULED, timestamp: 2018-11-01 18:02:33 UTC, event uid: 4e867753-de00-11e8-bc54-001a4a520000, project: bugverify, name: redis-1-hwbdn, message: Successfully assigned redis-1-hwbdn to epacific-ocp-compute2-v3.cmqe.lab.eng.rdu2.redhat.com
[----] I, [2018-11-01T14:02:42.527367 #15159:d09468] INFO -- : <AEMethod create_start_date_msg> @@@@@@@@@@@ container event: POD_SCHEDULED, timestamp: 2018-11-01 18:02:33 UTC, event uid: 4e874463-de00-11e8-bc54-001a4a520000, project: bugverify, name: redis-1-jpxkf, message: Successfully assigned redis-1-jpxkf to epacific-ocp-compute1-v3.cmqe.lab.eng.rdu2.redhat.com
[root@dhcp-8-198-226 log]#