Bug 1583832

Summary:

Cloudforms Automation not executing properly when multiple pods are created or killed in a short timeframe.

Product:

Red Hat CloudForms Management Engine

Reporter:

Saif Ali <saali>

Component:

Providers

Assignee:

Beni Paskin-Cherniavsky <cben>

Status:

CLOSED CURRENTRELEASE

QA Contact:

juwatts

Severity:

urgent

Docs Contact:

Priority:

urgent

Version:

5.9.0

CC:

akarol, cben, cpelland, gblomqui, jfrey, jhardy, jprause, mkanoor, mrobson, obarenbo, oourfali, saali, simaishi, tfitzger

Target Milestone:

Keywords:

TestOnly, ZStream

Target Release:

5.10.0

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

5.10.0.2

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Clones:

1595324 (view as bug list)

Environment:

Last Closed:

2019-02-11 14:09:51 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

Container Management

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

1595324

Attachments:

Description	Flags
minimal repro, without confidential params	none

Description Saif Ali 2018-05-29 20:29:32 UTC

Description of problem:
When multiple pods created on the same time the automation  skip the event and the automation is not executing 

Version-Release number of selected component (if applicable):
4.6

How reproducible:


Steps to Reproduce:
1. Import the attached automate domain
2. Add OCP provider if you don't have one already added
3. On OCP side create new app
e.g
~~~
oc new-app redis 
~~~
After the app completed you should see the automate code executed.

"automation.log"

 <AEMethod create_start_date_msg> ----- Create ServiceNow CMDB Record ----- - Return Event: CONTAINER_CREATED POD:

4. No try to scale the up
~~~ 
oc scale dc --replicas=1 redis

~~~
At this point the automation also run fine, and you should see the message in the logs.

5. Scale the app up again with 4 or more replicas

~~~
oc scale dc --replicas=4 redis 
~~~

Actual results:
The automation executed once 

Expected results:
The automation should be execution equal to the replicas 

Additional info:

Comment 4 Beni Paskin-Cherniavsky 2018-06-12 16:39:07 UTC

Does it reliably execute exactly once, or is a random less-than-expected-times?

Comment 5 Saif Ali 2018-06-12 16:47:03 UTC

I think its more like less than-than-expected times.

Comment 12 Beni Paskin-Cherniavsky 2018-06-20 13:13:04 UTC

I managed to reproduce (happens reliably), debugging...

- The problem is before Automate!  Not all the events get an EmsEvent created!
- ManageIQ does receive the events!  kubernetes event_catcher_mixin does log "Queuing event ..."

This is encouraging, because it suggests a bug in kubernetes event processing code path, which will be easier for me to debug than Automate :-))

Comment 13 Beni Paskin-Cherniavsky 2018-06-21 12:58:09 UTC

Debugged with help from Elad & Mooli.
EmsEvent.create_event drops "duplicate" events.
For container events, this happens if they are of same event type, provider, and timestamp - which we only get with 1 second resolution.
Scaling up pods hits this easily, if they're all scheduled within same second.  It doesn't even have to be from scaling 1 replication controller - simultaneous events of same type on completely unrelated pods will still collide!

I'm going to set EmsEvent ems_ref field to the k8s event's uid.  Then 2 events will never collide.
It's a somewhat recently added field; I'm not yet sure whether this worked reliably at some older point, or never worked.

----

This is independent of whether we already have the object in inventory when event arrives.  Same type same second => dropped!
After this is fixed, some things will still not work when event arrives before refresh got the object (which happens a lot).
- policies won't evaluate (bz 1367114)
+ I hope automate will run - untested, will know when I complete the change...
- EmsEvent container_*_id will be nil
+ EmsEvent container_*_name and container_namespace are set
- Automate script *might* break if they look up the object in the inventory.
  + attached customer's automate scripts seem safe at a quick glance, they only access fields on the event ($evm.root['event_stream']), don't look up the object

I should add unit tests for event processing where we don't have object yet to exercise both code paths.

Comment 14 CFME Bot 2018-06-24 21:55:35 UTC

https://github.com/ManageIQ/manageiq-providers-kubernetes/pull/264

Comment 15 CFME Bot 2018-06-25 08:49:23 UTC

New commit detected on ManageIQ/manageiq-providers-kubernetes/master:

https://github.com/ManageIQ/manageiq-providers-kubernetes/commit/6d1cb2c7e69212e8b7d655cdfc8f31e95bb0763a
commit 6d1cb2c7e69212e8b7d655cdfc8f31e95bb0763a
Author:     Beni Cherniavsky-Paskin <cben>
AuthorDate: Thu Jun 21 11:20:27 2018 -0400
Commit:     Beni Cherniavsky-Paskin <cben>
CommitDate: Thu Jun 21 11:20:27 2018 -0400

    Set EmsEvent ems_ref to event's uid, to avoid same-second collision

    EmsEvent.create_event drops "duplicate" events, determined by
    (event_type, timestamp, chain_id, ems_id, ems_ref).
    Container events had no ems_ref (and no chain_id), meaning events of same
    type in same second collided and only one EmsEvent was created.
    The lost events weren't handled by Automate / Policy.

    Using k8s event's uid, not involvedObject's uid which still wouldn't
    be unique (for repeating events on same object).

    https://bugzilla.redhat.com/show_bug.cgi?id=1583832

 app/models/manageiq/providers/kubernetes/container_manager/event_catcher_mixin.rb | 3 +-
 app/models/manageiq/providers/kubernetes/container_manager/event_parser_mixin.rb | 3 +-
 spec/models/manageiq/providers/kubernetes/container_manager/event_catcher_mixin_spec.rb | 9 +-
 3 files changed, 10 insertions(+), 5 deletions(-)

Comment 17 Beni Paskin-Cherniavsky 2018-06-26 09:54:38 UTC

Moving to POST so this BZ can be cloned for 5.9.
Testing 5.9 backport now, will update on clone BZ.

I'm going to attach minimal reproducer automate domain soon for QE.

Comment 20 Beni Paskin-Cherniavsky 2018-06-28 09:59:46 UTC

Created attachment 1455224 [details]
minimal repro, without confidential params

Minimal repro
=============
Attached container_scheduled_datastore.zip, clear of any confidential params.

1. Automate->Import/export-> import container_scheduled_datastore.zip
   This creates a new automate domain `repro`.
2. Automate->Explorer-> select `repro` domain -> edit the domain:
   make sure the domain is enabled.
3. Scale up some deployment, replication controller or replicaset
   by 3 replicas at once.
4. Wait until the new pods got created and *scheduled* to nodes.
   (POD_SCHEDULED is the event automate script reacts to)
   Can be confirmed by `oc get event | grep -i scheduled`.
5. Wait a few seconds, grep '@@@@@' log/automation.log

Expected:

[----] I, [2018-06-26T16:02:50.379418 #12186:2accd596b528]  INFO -- : <AEMethod create_start_date_msg> @@@@@@@@@@@ container event: POD_SCHEDULED, timestamp: 2018-06-26 13:02:38 UTC, event uid: 33be41ad-7941-11e8-8706-001a4a162695, project: automate-bz, name: hello-world-1-87478, message: Successfully assigned hello-world-1-87478 to cben39-compute002.10.35.48.148.cmio.eng.lab.tlv.redhat.com
[----] I, [2018-06-26T16:02:50.483113 #12084:2accd596bbe0]  INFO -- : <AEMethod create_start_date_msg> @@@@@@@@@@@ container event: POD_SCHEDULED, timestamp: 2018-06-26 13:02:38 UTC, event uid: 33c4909c-7941-11e8-8706-001a4a162695, project: automate-bz, name: hello-world-1-vsp6g, message: Successfully assigned hello-world-1-vsp6g to cben39-master001.10.35.48.149.cmio.eng.lab.tlv.redhat.com
[----] I, [2018-06-26T16:02:50.650881 #6188:2accd596b3d4]  INFO -- : <AEMethod create_start_date_msg> @@@@@@@@@@@ container event: POD_SCHEDULED, timestamp: 2018-06-26 13:02:38 UTC, event uid: 33c61607-7941-11e8-8706-001a4a162695, project: automate-bz, name: hello-world-1-5xcrn, message: Successfully assigned hello-world-1-5xcrn to cben39-compute001.10.35.48.150.cmio.eng.lab.tlv.redhat.com

Each '@@@@@@@@@@@' line comes from 1 execution of the Automate script - number of lines should match by how many pods you scaled up.

Important: confirm at least 2 lines have exactly same `timestamp:` (probably all of them will).  If unlucky and all had distinct timestamps, you didn't test the bug (that worked before as well), scale down and up again...

Comment 21 Beni Paskin-Cherniavsky 2018-07-02 09:31:42 UTC

> Important: confirm at least 2 lines have exactly same `timestamp:` (probably all of them will). 

To clarify, this refers to the EmsEvent's timestamps printed as `timestamp: 2018-06-26 13:02:38 UTC` in the middle of the line.
The moment of logging `[----] I, [2018-06-26T16:02:50.650881 #6188:2accd596b3d4]` at the start of the line will all differ, that's irrelevant.

Comment 24 juwatts 2018-11-01 18:06:52 UTC

Verified in:
5.10.0.22.20181030184024_26956a0

Followed verification steps in 
https://bugzilla.redhat.com/show_bug.cgi?id=1583832#c20

After scaling redis to 4 replicas, saw 3 pods scheduled in the oc events:

to epacific-ocp-compute2-v3.cmqe.lab.eng.rdu2.redhat.com
9s          9s           1         redis-1-h4vlt.1563120f0d6b1da2     Pod                                                        Normal    Scheduled                     default-scheduler                                                Successfully assigned redis-1-h4vlt to epacific-ocp-compute1-v3.cmqe.lab.eng.rdu2.redhat.com
9s          9s           1         redis-1-hwbdn.1563120f0dad2ab9     Pod                                                        Normal    Scheduled                     default-scheduler                                                Successfully assigned redis-1-hwbdn to epacific-ocp-compute2-v3.cmqe.lab.eng.rdu2.redhat.com
9s          9s           1         redis-1-jpxkf.1563120f0dee71bb     Pod                                                        Normal    Scheduled                     default-scheduler                                                Successfully assigned redis-1-jpxkf to epacific-ocp-compute1-v3.cmqe.lab.eng.rdu2.redhat.com
[root@epacific-ocp-master-v3 ~]# 

Verified 3 POD_SCHEDULED events with same timestamp:

[root@dhcp-8-198-226 log]# grep '@@@@@' automation.log
[----] I, [2018-11-01T14:00:08.253488 #15177:6af048c]  INFO -- : <AEMethod create_start_date_msg> @@@@@@@@@@@ container event: POD_SCHEDULED, timestamp: 2018-11-01 17:59:57 UTC, event uid: f1ac9557-ddff-11e8-bc54-001a4a520000, project: bugverify, name: redis-1-deploy, message: Successfully assigned redis-1-deploy to epacific-ocp-compute2-v3.cmqe.lab.eng.rdu2.redhat.com
[----] I, [2018-11-01T14:00:09.185951 #15177:d054f8]  INFO -- : <AEMethod create_start_date_msg> @@@@@@@@@@@ container event: POD_SCHEDULED, timestamp: 2018-11-01 18:00:00 UTC, event uid: f354c5e2-ddff-11e8-bc54-001a4a520000, project: bugverify, name: redis-1-7k5r7, message: Successfully assigned redis-1-7k5r7 to epacific-ocp-compute2-v3.cmqe.lab.eng.rdu2.redhat.com
[----] I, [2018-11-01T14:02:41.886685 #15177:d099f4]  INFO -- : <AEMethod create_start_date_msg> @@@@@@@@@@@ container event: POD_SCHEDULED, timestamp: 2018-11-01 18:02:33 UTC, event uid: 4e851efb-de00-11e8-bc54-001a4a520000, project: bugverify, name: redis-1-h4vlt, message: Successfully assigned redis-1-h4vlt to epacific-ocp-compute1-v3.cmqe.lab.eng.rdu2.redhat.com
[----] I, [2018-11-01T14:02:42.231086 #3181:d0019c]  INFO -- : <AEMethod create_start_date_msg> @@@@@@@@@@@ container event: POD_SCHEDULED, timestamp: 2018-11-01 18:02:33 UTC, event uid: 4e867753-de00-11e8-bc54-001a4a520000, project: bugverify, name: redis-1-hwbdn, message: Successfully assigned redis-1-hwbdn to epacific-ocp-compute2-v3.cmqe.lab.eng.rdu2.redhat.com
[----] I, [2018-11-01T14:02:42.527367 #15159:d09468]  INFO -- : <AEMethod create_start_date_msg> @@@@@@@@@@@ container event: POD_SCHEDULED, timestamp: 2018-11-01 18:02:33 UTC, event uid: 4e874463-de00-11e8-bc54-001a4a520000, project: bugverify, name: redis-1-jpxkf, message: Successfully assigned redis-1-jpxkf to epacific-ocp-compute1-v3.cmqe.lab.eng.rdu2.redhat.com
[root@dhcp-8-198-226 log]#