Bug 1583832
| Summary: | Cloudforms Automation not executing properly when multiple pods are created or killed in a short timeframe. | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat CloudForms Management Engine | Reporter: | Saif Ali <saali> | ||||
| Component: | Providers | Assignee: | Beni Paskin-Cherniavsky <cben> | ||||
| Status: | CLOSED CURRENTRELEASE | QA Contact: | juwatts | ||||
| Severity: | urgent | Docs Contact: | |||||
| Priority: | urgent | ||||||
| Version: | 5.9.0 | CC: | akarol, cben, cpelland, gblomqui, jfrey, jhardy, jprause, mkanoor, mrobson, obarenbo, oourfali, saali, simaishi, tfitzger | ||||
| Target Milestone: | GA | Keywords: | TestOnly, ZStream | ||||
| Target Release: | 5.10.0 | ||||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | 5.10.0.2 | Doc Type: | If docs needed, set a value | ||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | |||||||
| : | 1595324 (view as bug list) | Environment: | |||||
| Last Closed: | 2019-02-11 14:09:51 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | Container Management | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Bug Depends On: | |||||||
| Bug Blocks: | 1595324 | ||||||
| Attachments: |
|
||||||
|
Description
Saif Ali
2018-05-29 20:29:32 UTC
Does it reliably execute exactly once, or is a random less-than-expected-times? I think its more like less than-than-expected times. I managed to reproduce (happens reliably), debugging... - The problem is before Automate! Not all the events get an EmsEvent created! - ManageIQ does receive the events! kubernetes event_catcher_mixin does log "Queuing event ..." This is encouraging, because it suggests a bug in kubernetes event processing code path, which will be easier for me to debug than Automate :-)) Debugged with help from Elad & Mooli. EmsEvent.create_event drops "duplicate" events. For container events, this happens if they are of same event type, provider, and timestamp - which we only get with 1 second resolution. Scaling up pods hits this easily, if they're all scheduled within same second. It doesn't even have to be from scaling 1 replication controller - simultaneous events of same type on completely unrelated pods will still collide! I'm going to set EmsEvent ems_ref field to the k8s event's uid. Then 2 events will never collide. It's a somewhat recently added field; I'm not yet sure whether this worked reliably at some older point, or never worked. ---- This is independent of whether we already have the object in inventory when event arrives. Same type same second => dropped! After this is fixed, some things will still not work when event arrives before refresh got the object (which happens a lot). - policies won't evaluate (bz 1367114) + I hope automate will run - untested, will know when I complete the change... - EmsEvent container_*_id will be nil + EmsEvent container_*_name and container_namespace are set - Automate script *might* break if they look up the object in the inventory. + attached customer's automate scripts seem safe at a quick glance, they only access fields on the event ($evm.root['event_stream']), don't look up the object I should add unit tests for event processing where we don't have object yet to exercise both code paths. New commit detected on ManageIQ/manageiq-providers-kubernetes/master: https://github.com/ManageIQ/manageiq-providers-kubernetes/commit/6d1cb2c7e69212e8b7d655cdfc8f31e95bb0763a commit 6d1cb2c7e69212e8b7d655cdfc8f31e95bb0763a Author: Beni Cherniavsky-Paskin <cben> AuthorDate: Thu Jun 21 11:20:27 2018 -0400 Commit: Beni Cherniavsky-Paskin <cben> CommitDate: Thu Jun 21 11:20:27 2018 -0400 Set EmsEvent ems_ref to event's uid, to avoid same-second collision EmsEvent.create_event drops "duplicate" events, determined by (event_type, timestamp, chain_id, ems_id, ems_ref). Container events had no ems_ref (and no chain_id), meaning events of same type in same second collided and only one EmsEvent was created. The lost events weren't handled by Automate / Policy. Using k8s event's uid, not involvedObject's uid which still wouldn't be unique (for repeating events on same object). https://bugzilla.redhat.com/show_bug.cgi?id=1583832 app/models/manageiq/providers/kubernetes/container_manager/event_catcher_mixin.rb | 3 +- app/models/manageiq/providers/kubernetes/container_manager/event_parser_mixin.rb | 3 +- spec/models/manageiq/providers/kubernetes/container_manager/event_catcher_mixin_spec.rb | 9 +- 3 files changed, 10 insertions(+), 5 deletions(-) Moving to POST so this BZ can be cloned for 5.9. Testing 5.9 backport now, will update on clone BZ. I'm going to attach minimal reproducer automate domain soon for QE. Created attachment 1455224 [details]
minimal repro, without confidential params
Minimal repro
=============
Attached container_scheduled_datastore.zip, clear of any confidential params.
1. Automate->Import/export-> import container_scheduled_datastore.zip
This creates a new automate domain `repro`.
2. Automate->Explorer-> select `repro` domain -> edit the domain:
make sure the domain is enabled.
3. Scale up some deployment, replication controller or replicaset
by 3 replicas at once.
4. Wait until the new pods got created and *scheduled* to nodes.
(POD_SCHEDULED is the event automate script reacts to)
Can be confirmed by `oc get event | grep -i scheduled`.
5. Wait a few seconds, grep '@@@@@' log/automation.log
Expected:
[----] I, [2018-06-26T16:02:50.379418 #12186:2accd596b528] INFO -- : <AEMethod create_start_date_msg> @@@@@@@@@@@ container event: POD_SCHEDULED, timestamp: 2018-06-26 13:02:38 UTC, event uid: 33be41ad-7941-11e8-8706-001a4a162695, project: automate-bz, name: hello-world-1-87478, message: Successfully assigned hello-world-1-87478 to cben39-compute002.10.35.48.148.cmio.eng.lab.tlv.redhat.com
[----] I, [2018-06-26T16:02:50.483113 #12084:2accd596bbe0] INFO -- : <AEMethod create_start_date_msg> @@@@@@@@@@@ container event: POD_SCHEDULED, timestamp: 2018-06-26 13:02:38 UTC, event uid: 33c4909c-7941-11e8-8706-001a4a162695, project: automate-bz, name: hello-world-1-vsp6g, message: Successfully assigned hello-world-1-vsp6g to cben39-master001.10.35.48.149.cmio.eng.lab.tlv.redhat.com
[----] I, [2018-06-26T16:02:50.650881 #6188:2accd596b3d4] INFO -- : <AEMethod create_start_date_msg> @@@@@@@@@@@ container event: POD_SCHEDULED, timestamp: 2018-06-26 13:02:38 UTC, event uid: 33c61607-7941-11e8-8706-001a4a162695, project: automate-bz, name: hello-world-1-5xcrn, message: Successfully assigned hello-world-1-5xcrn to cben39-compute001.10.35.48.150.cmio.eng.lab.tlv.redhat.com
Each '@@@@@@@@@@@' line comes from 1 execution of the Automate script - number of lines should match by how many pods you scaled up.
Important: confirm at least 2 lines have exactly same `timestamp:` (probably all of them will). If unlucky and all had distinct timestamps, you didn't test the bug (that worked before as well), scale down and up again...
> Important: confirm at least 2 lines have exactly same `timestamp:` (probably all of them will).
To clarify, this refers to the EmsEvent's timestamps printed as `timestamp: 2018-06-26 13:02:38 UTC` in the middle of the line.
The moment of logging `[----] I, [2018-06-26T16:02:50.650881 #6188:2accd596b3d4]` at the start of the line will all differ, that's irrelevant.
Verified in: 5.10.0.22.20181030184024_26956a0 Followed verification steps in https://bugzilla.redhat.com/show_bug.cgi?id=1583832#c20 After scaling redis to 4 replicas, saw 3 pods scheduled in the oc events: to epacific-ocp-compute2-v3.cmqe.lab.eng.rdu2.redhat.com 9s 9s 1 redis-1-h4vlt.1563120f0d6b1da2 Pod Normal Scheduled default-scheduler Successfully assigned redis-1-h4vlt to epacific-ocp-compute1-v3.cmqe.lab.eng.rdu2.redhat.com 9s 9s 1 redis-1-hwbdn.1563120f0dad2ab9 Pod Normal Scheduled default-scheduler Successfully assigned redis-1-hwbdn to epacific-ocp-compute2-v3.cmqe.lab.eng.rdu2.redhat.com 9s 9s 1 redis-1-jpxkf.1563120f0dee71bb Pod Normal Scheduled default-scheduler Successfully assigned redis-1-jpxkf to epacific-ocp-compute1-v3.cmqe.lab.eng.rdu2.redhat.com [root@epacific-ocp-master-v3 ~]# Verified 3 POD_SCHEDULED events with same timestamp: [root@dhcp-8-198-226 log]# grep '@@@@@' automation.log [----] I, [2018-11-01T14:00:08.253488 #15177:6af048c] INFO -- : <AEMethod create_start_date_msg> @@@@@@@@@@@ container event: POD_SCHEDULED, timestamp: 2018-11-01 17:59:57 UTC, event uid: f1ac9557-ddff-11e8-bc54-001a4a520000, project: bugverify, name: redis-1-deploy, message: Successfully assigned redis-1-deploy to epacific-ocp-compute2-v3.cmqe.lab.eng.rdu2.redhat.com [----] I, [2018-11-01T14:00:09.185951 #15177:d054f8] INFO -- : <AEMethod create_start_date_msg> @@@@@@@@@@@ container event: POD_SCHEDULED, timestamp: 2018-11-01 18:00:00 UTC, event uid: f354c5e2-ddff-11e8-bc54-001a4a520000, project: bugverify, name: redis-1-7k5r7, message: Successfully assigned redis-1-7k5r7 to epacific-ocp-compute2-v3.cmqe.lab.eng.rdu2.redhat.com [----] I, [2018-11-01T14:02:41.886685 #15177:d099f4] INFO -- : <AEMethod create_start_date_msg> @@@@@@@@@@@ container event: POD_SCHEDULED, timestamp: 2018-11-01 18:02:33 UTC, event uid: 4e851efb-de00-11e8-bc54-001a4a520000, project: bugverify, name: redis-1-h4vlt, message: Successfully assigned redis-1-h4vlt to epacific-ocp-compute1-v3.cmqe.lab.eng.rdu2.redhat.com [----] I, [2018-11-01T14:02:42.231086 #3181:d0019c] INFO -- : <AEMethod create_start_date_msg> @@@@@@@@@@@ container event: POD_SCHEDULED, timestamp: 2018-11-01 18:02:33 UTC, event uid: 4e867753-de00-11e8-bc54-001a4a520000, project: bugverify, name: redis-1-hwbdn, message: Successfully assigned redis-1-hwbdn to epacific-ocp-compute2-v3.cmqe.lab.eng.rdu2.redhat.com [----] I, [2018-11-01T14:02:42.527367 #15159:d09468] INFO -- : <AEMethod create_start_date_msg> @@@@@@@@@@@ container event: POD_SCHEDULED, timestamp: 2018-11-01 18:02:33 UTC, event uid: 4e874463-de00-11e8-bc54-001a4a520000, project: bugverify, name: redis-1-jpxkf, message: Successfully assigned redis-1-jpxkf to epacific-ocp-compute1-v3.cmqe.lab.eng.rdu2.redhat.com [root@dhcp-8-198-226 log]# |