Bug 1386463 - [platformmanagement_public_479] No new jobs scheduled as cron setting when set concurrencyPolicy of scheduledjobs as Replace
Summary: [platformmanagement_public_479] No new jobs scheduled as cron setting when set concurrencyPolicy of scheduledjobs as Replace
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Master
Version: 3.3.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Maciej Szulik
QA Contact: Chuan Yu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-10-19 03:09 UTC by Chuan Yu
Modified: 2017-03-08 18:43 UTC
CC List: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Active jobs were mistakenly counted during synchronization. Consequence: The incorrect active count led to no new jobs being scheduled when concurrencyPolicy was set to Replace. Fix: Correct how active jobs for a ScheduledJob are calculated. Result: concurrencyPolicy works as expected when set to Replace.
Clone Of:
Environment:
Last Closed: 2017-01-18 12:43:29 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:0066 0 normal SHIPPED_LIVE Red Hat OpenShift Container Platform 3.4 RPM Release Advisory 2017-01-18 17:23:26 UTC

Description Chuan Yu 2016-10-19 03:09:28 UTC
Description of problem:
After updating the concurrencyPolicy of a scheduledjob to Replace, jobs are not scheduled according to the policy: one more job is scheduled, and then no new jobs are scheduled at all:
[root@dhcp-140-15 ~]# oc get scheduledjobs
NAME      SCHEDULE      SUSPEND   ACTIVE    LAST-SCHEDULE
sj3       */1 * * * *   False     2         Wed, 19 Oct 2016 10:28:00 +0800
[root@dhcp-140-15 ~]# oc get jobs
NAME             DESIRED   SUCCESSFUL   AGE
sj3-3360002802   1         0            24s
sj3-467178220    1         0            1m
[root@dhcp-140-15 ~]# oc patch scheduledjobs sj3 -p '{"spec":{"concurrencyPolicy": "Replace"}}'
"sj3" patched
[root@dhcp-140-15 ~]# oc get scheduledjobs
NAME      SCHEDULE      SUSPEND   ACTIVE    LAST-SCHEDULE
sj3       */1 * * * *   False     2         Wed, 19 Oct 2016 10:28:00 +0800
[root@dhcp-140-15 ~]# oc get scheduledjobs
NAME      SCHEDULE      SUSPEND   ACTIVE    LAST-SCHEDULE
sj3       */1 * * * *   False     3         Wed, 19 Oct 2016 10:29:00 +0800
[root@dhcp-140-15 ~]# oc get scheduledjobs
NAME      SCHEDULE      SUSPEND   ACTIVE    LAST-SCHEDULE
sj3       */1 * * * *   False     3         Wed, 19 Oct 2016 10:29:00 +0800
[root@dhcp-140-15 ~]# oc get pod
NAME                   READY     STATUS    RESTARTS   AGE
sj3-4060648175-uqfm9   1/1       Running   0          40s
[root@dhcp-140-15 ~]# oc get jobs
NAME             DESIRED   SUCCESSFUL   AGE
sj3-4060648175   1         0            46s
[root@dhcp-140-15 ~]# oc get jobs
NAME             DESIRED   SUCCESSFUL   AGE
sj3-4060648175   1         0            1m
[root@dhcp-140-15 ~]# oc get jobs
NAME             DESIRED   SUCCESSFUL   AGE
sj3-4060648175   1         1            10m
[root@dhcp-140-15 ~]# oc get jobs
NAME             DESIRED   SUCCESSFUL   AGE
sj3-4060648175   1         1            28m
[root@dhcp-140-15 ~]# oc get scheduledjobs
NAME      SCHEDULE      SUSPEND   ACTIVE    LAST-SCHEDULE
sj3       */1 * * * *   False     2         Wed, 19 Oct 2016 10:29:00 +0800


And the master log repeatedly records FailedGet warnings for a job that no longer exists:
Oct 18 22:34:44 qe-pm-chuyumaster-1 docker[23945]: I1018 22:34:44.823091       1 event.go:216] Event(api.ObjectReference{Kind:"ScheduledJob", Namespace:"chuyu", Name:"sj3", UID:"67363c5c-95a3-11e6-a188-42010af00014", APIVersion:"batch", ResourceVersion:"3109", FieldPath:""}): type: 'Warning' reason: 'FailedGet' Get job: jobs.batch "sj3-467178220" not found
Oct 18 22:34:54 qe-pm-chuyumaster-1 docker[23945]: I1018 22:34:54.834128       1 event.go:216] Event(api.ObjectReference{Kind:"ScheduledJob", Namespace:"chuyu", Name:"sj3", UID:"67363c5c-95a3-11e6-a188-42010af00014", APIVersion:"batch", ResourceVersion:"3109", FieldPath:""}): type: 'Warning' reason: 'FailedGet' Get job: jobs.batch "sj3-467178220" not found
Oct 18 22:35:04 qe-pm-chuyumaster-1 docker[23945]: I1018 22:35:04.846256       1 event.go:216] Event(api.ObjectReference{Kind:"ScheduledJob", Namespace:"chuyu", Name:"sj3", UID:"67363c5c-95a3-11e6-a188-42010af00014", APIVersion:"batch", ResourceVersion:"3109", FieldPath:""}): type: 'Warning' reason: 'FailedGet' Get job: jobs.batch "sj3-467178220" not found
Oct 18 22:35:14 qe-pm-chuyumaster-1 docker[23945]: I1018 22:35:14.858327       1 event.go:216] Event(api.ObjectReference{Kind:"ScheduledJob", Namespace:"chuyu", Name:"sj3", UID:"67363c5c-95a3-11e6-a188-42010af00014", APIVersion:"batch", ResourceVersion:"3109", FieldPath:""}): type: 'Warning' reason: 'FailedGet' Get job: jobs.batch "sj3-467178220" not found
Oct 18 22:35:24 qe-pm-chuyumaster-1 docker[23945]: I1018 22:35:24.874125       1 event.go:216] Event(api.ObjectReference{Kind:"ScheduledJob", Namespace:"chuyu", Name:"sj3", UID:"67363c5c-95a3-11e6-a188-42010af00014", APIVersion:"batch", ResourceVersion:"3109", FieldPath:""}): type: 'Warning' reason: 'FailedGet' Get job: jobs.batch "sj3-467178220" not found


Version-Release number of selected component (if applicable):
openshift v3.3.1.3

How reproducible:
Always

Steps to Reproduce:
1. Create a scheduledjob
oc run sj3 --image=busybox --restart=Never --schedule="*/1 * * * *" -- sleep 300
2. Check the master log
3. Set the concurrencyPolicy of the scheduledjob to Replace
oc patch scheduledjobs sj3 -p '{"spec":{"concurrencyPolicy": "Replace"}}'
4. Check the jobs and scheduledjobs with 'oc get jobs' and 'oc get scheduledjobs', and check the master log.

Actual results:
1. No new jobs are scheduled according to the cron schedule.
2. The master log contains repeated errors such as:
Oct 18 22:35:24 qe-pm-chuyumaster-1 docker[23945]: I1018 22:35:24.874125       1 event.go:216] Event(api.ObjectReference{Kind:"ScheduledJob", Namespace:"chuyu", Name:"sj3", UID:"67363c5c-95a3-11e6-a188-42010af00014", APIVersion:"batch", ResourceVersion:"3109", FieldPath:""}): type: 'Warning' reason: 'FailedGet' Get job: jobs.batch "sj3-467178220" not found

Expected results:
1. New jobs are scheduled according to the cron schedule.
2. The master log should not contain errors like the above.

Additional info:

Comment 1 Maciej Szulik 2016-10-19 09:19:41 UTC
ScheduledJobs are a Technology Preview feature in 3.3.1 (alpha in Kubernetes, in origin master). That's why this is not blocking the release in any way, but I'll fix the problem in master only.

Comment 2 Maciej Szulik 2016-10-24 12:15:00 UTC
Upstream fix is in https://github.com/kubernetes/kubernetes/pull/35420

Downstream cherry-pick in https://github.com/openshift/origin/pull/11523
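
For readers without the PRs at hand, the gist of the change can be illustrated with a minimal, self-contained sketch (the type and helper names below are invented for illustration; this is not the upstream code): before counting active jobs, the controller has to drop status references to jobs that no longer exist, e.g. because the Replace policy deleted them. Otherwise a stale entry keeps the active count high and no new run is ever scheduled.

package main

import "fmt"

// Hypothetical, simplified stand-ins for the ScheduledJob status types;
// a sketch of the idea behind the fix, not the upstream implementation.
type jobRef struct{ name string }

type scheduledJobStatus struct{ active []jobRef }

// pruneActive drops references to jobs that no longer exist (for example,
// jobs deleted by the Replace policy) so they are not counted as active
// forever. jobExists stands in for a GET against the API server.
func pruneActive(status *scheduledJobStatus, jobExists func(name string) bool) {
    kept := status.active[:0]
    for _, ref := range status.active {
        if jobExists(ref.name) {
            kept = append(kept, ref)
        }
    }
    status.active = kept
}

func main() {
    status := &scheduledJobStatus{active: []jobRef{
        {name: "sj3-467178220"},  // deleted; a GET would return NotFound
        {name: "sj3-4060648175"}, // still running
    }}
    existing := map[string]bool{"sj3-4060648175": true}
    pruneActive(status, func(name string) bool { return existing[name] })
    fmt.Println(len(status.active)) // prints 1: the stale reference is gone
}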

Comment 3 openshift-github-bot 2016-10-31 16:10:42 UTC
Commit pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/f5e3c6dbbb59b8a6e374485c8d11d829b4571a91
Merge pull request #11523 from soltysh/bug1386463

Merged by openshift-bot

Comment 4 Chuan Yu 2016-11-02 06:10:09 UTC
Checked with devenv-fedora_5292; the issue is still not fixed.

openshift v1.4.0-alpha.0+8ecb3f5-997


[chuyu@dhcp-140-15 redhat]$ oc get jobs
NAME             DESIRED   SUCCESSFUL   AGE
sj3-1705609028   1         0            1m
[chuyu@dhcp-140-15 redhat]$ oc get pods
NAME                   READY     STATUS    RESTARTS   AGE
sj3-1705609028-o9rjg   1/1       Running   0          1m
[chuyu@dhcp-140-15 redhat]$ oc get scheduledjobs
NAME      SCHEDULE      SUSPEND   ACTIVE    LAST-SCHEDULE
sj3       */1 * * * *   False     2         Wed, 02 Nov 2016 13:59:00 +0800

I1102 06:04:52.772985    1133 event.go:217] Event(api.ObjectReference{Kind:"ScheduledJob", Namespace:"chuyu", Name:"sj3", UID:"3f739531-a0c1-11e6-904b-0ef7883dfef8", APIVersion:"batch", ResourceVersion:"649", FieldPath:""}): type: 'Warning' reason: 'FailedGet' Get job: jobs.batch "sj3-1781434183" not found

Comment 5 Maciej Szulik 2016-11-02 11:52:21 UTC
I've followed the steps described in #c1 and it's working as expected. I am running v1.4.0-alpha.0+537c0a5-1006 which is only a few commits ahead of what you were testing with. Can you give me the exact steps you're testing with?

Comment 6 Chuan Yu 2016-11-03 07:53:10 UTC
Here are my steps on openshift v1.4.0-alpha.0+90d8c62-1000:
1. Log in as a normal user and create a new project "chuyu".
2. Schedule a job with 'oc run sj3 --image=busybox --restart=Never --schedule="*/1 * * * *" -- sleep 300'.
3. Wait about 5 minutes, then edit the scheduledjob sj3 to set concurrencyPolicy to 'Replace'.
4. Then check the scheduledjobs, jobs, and pods; since then, no new job has been scheduled. According to the docs, with concurrencyPolicy Replace a new job should be scheduled every minute to replace the old one.

see the command output:
[chuyu@dhcp-140-15 ~]$ oc get scheduledjobs
NAME      SCHEDULE      SUSPEND   ACTIVE    LAST-SCHEDULE
sj3       */1 * * * *   False     4         Thu, 03 Nov 2016 15:27:00 +0800
[chuyu@dhcp-140-15 ~]$ oc get jobs
NAME             DESIRED   SUCCESSFUL   AGE
sj3-1782089543   1         0            2m
sj3-1857914698   1         0            2m
sj3-1858045770   1         0            17s
sj3-1933870925   1         0            1m
[chuyu@dhcp-140-15 ~]$ oc edit scheduledjobs sj3
scheduledjob "sj3" edited
[chuyu@dhcp-140-15 ~]$ oc get scheduledjobs
NAME      SCHEDULE      SUSPEND   ACTIVE    LAST-SCHEDULE
sj3       */1 * * * *   False     5         Thu, 03 Nov 2016 15:28:00 +0800
[chuyu@dhcp-140-15 ~]$ oc get scheduledjobs
NAME      SCHEDULE      SUSPEND   ACTIVE    LAST-SCHEDULE
sj3       */1 * * * *   False     4         Thu, 03 Nov 2016 15:28:00 +0800
[chuyu@dhcp-140-15 ~]$ oc get jobs
NAME             DESIRED   SUCCESSFUL   AGE
sj3-1782089543   1         1            5m
sj3-1858045770   1         0            3m
[chuyu@dhcp-140-15 ~]$ oc get pod
NAME                   READY     STATUS      RESTARTS   AGE
sj3-1782089543-8cf0q   0/1       Completed   0          5m
sj3-1858045770-tel3i   1/1       Running     0          3m
[chuyu@dhcp-140-15 ~]$ oc get pod
NAME                   READY     STATUS      RESTARTS   AGE
sj3-1782089543-8cf0q   0/1       Completed   0          6m
sj3-1858045770-tel3i   1/1       Running     0          4m
[chuyu@dhcp-140-15 ~]$ oc get jobs
NAME             DESIRED   SUCCESSFUL   AGE
sj3-1782089543   1         1            7m
sj3-1858045770   1         1            5m
[chuyu@dhcp-140-15 ~]$ oc get scheduledjobs
NAME      SCHEDULE      SUSPEND   ACTIVE    LAST-SCHEDULE
sj3       */1 * * * *   False     3         Thu, 03 Nov 2016 15:28:00 +0800
[chuyu@dhcp-140-15 ~]$ oc get jobs
NAME             DESIRED   SUCCESSFUL   AGE
sj3-1782089543   1         1            7m
sj3-1858045770   1         1            5m
[chuyu@dhcp-140-15 ~]$ oc get pods
NAME                   READY     STATUS      RESTARTS   AGE
sj3-1782089543-8cf0q   0/1       Completed   0          7m
sj3-1858045770-tel3i   0/1       Completed   0          5m
[chuyu@dhcp-140-15 ~]$ oc get pods
NAME                   READY     STATUS      RESTARTS   AGE
sj3-1782089543-8cf0q   0/1       Completed   0          8m
sj3-1858045770-tel3i   0/1       Completed   0          6m
[chuyu@dhcp-140-15 ~]$ oc get scheduledjobs
NAME      SCHEDULE      SUSPEND   ACTIVE    LAST-SCHEDULE
sj3       */1 * * * *   False     3         Thu, 03 Nov 2016 15:28:00 +0800
[chuyu@dhcp-140-15 ~]$ oc get scheduledjobs
NAME      SCHEDULE      SUSPEND   ACTIVE    LAST-SCHEDULE
sj3       */1 * * * *   False     3         Thu, 03 Nov 2016 15:28:00 +0800
[chuyu@dhcp-140-15 ~]$ date
Thu Nov  3 15:38:05 CST 2016

Comment 7 Chuan Yu 2016-11-03 09:46:41 UTC
Checked with openshift v1.4.0-alpha.0+019d471-1064; still getting the same results.

Also checked the source code on the instance; the PR has been merged.

According to https://github.com/openshift/openshift-docs/blob/master/dev_guide/scheduled_jobs.adoc, the "Replace" concurrency policy "cancels the currently running job and replaces it with a new one." That is not what happens here, so this should be the same issue.
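
To spell out what that documented behavior implies for the controller, here is a simplified, hypothetical sketch of one scheduled tick under Replace (deleteJob and createJob are invented stand-ins for the real API calls, not the upstream implementation):

package main

import "fmt"

// replaceTick sketches what one controller tick should do under
// concurrencyPolicy: Replace: cancel every job still running, then
// start a single new job in their place.
func replaceTick(active []string, deleteJob func(string), createJob func(string)) {
    for _, name := range active {
        deleteJob(name) // cancel the currently running job...
    }
    createJob("sj3-<new>") // ...and replace it with a new one
}

func main() {
    replaceTick([]string{"sj3-1858045770"},
        func(name string) { fmt.Println("delete job", name) },
        func(name string) { fmt.Println("create job", name) })
}

In the output in comment 6 the delete half happens (the old jobs disappear), but no replacement job is ever created, which matches the doc text: the miscalculated active count blocks scheduling of the new job.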

Comment 8 Maciej Szulik 2016-11-03 13:59:28 UTC
The fix is in https://github.com/openshift/origin/pull/11751 and is awaiting upstream approval.

Comment 9 Chuan Yu 2016-11-08 07:58:21 UTC
Checked with the latest OCP version; the issue is fixed.
openshift v3.4.0.23+24b1a58
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

Comment 11 errata-xmlrpc 2017-01-18 12:43:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0066

