Bug 1386463 - [platformmanagement_public_479] No new jobs scheduled as cron setting when set concurrencyPolicy of scheduledjobs as Replace
Summary: [platformmanagement_public_479] No new jobs scheduled as cron setting when set concurrencyPolicy of scheduledjobs as Replace
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Master
Version: 3.3.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Maciej Szulik
QA Contact: Chuan Yu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-10-19 03:09 UTC by Chuan Yu
Modified: 2017-03-08 18:43 UTC
CC List: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Active jobs were mistakenly counted during synchronization. Consequence: The incorrect active count led to no new jobs being scheduled when concurrencyPolicy was set to Replace. Fix: Correct how active jobs for a ScheduledJob are calculated. Result: concurrencyPolicy works as expected when set to Replace.
Clone Of:
Environment:
Last Closed: 2017-01-18 12:43:29 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:0066 0 normal SHIPPED_LIVE Red Hat OpenShift Container Platform 3.4 RPM Release Advisory 2017-01-18 17:23:26 UTC

Description Chuan Yu 2016-10-19 03:09:28 UTC
Description of problem:
After updating the concurrencyPolicy of a scheduledjob to Replace, jobs are not scheduled according to the policy: one more job is scheduled, and then no new jobs are scheduled at all:
[root@dhcp-140-15 ~]# oc get scheduledjobs
NAME      SCHEDULE      SUSPEND   ACTIVE    LAST-SCHEDULE
sj3       */1 * * * *   False     2         Wed, 19 Oct 2016 10:28:00 +0800
[root@dhcp-140-15 ~]# oc get jobs
NAME             DESIRED   SUCCESSFUL   AGE
sj3-3360002802   1         0            24s
sj3-467178220    1         0            1m
[root@dhcp-140-15 ~]# oc patch scheduledjobs sj3 -p '{"spec":{"concurrencyPolicy": "Replace"}}'
"sj3" patched
[root@dhcp-140-15 ~]# oc get scheduledjobs
NAME      SCHEDULE      SUSPEND   ACTIVE    LAST-SCHEDULE
sj3       */1 * * * *   False     2         Wed, 19 Oct 2016 10:28:00 +0800
[root@dhcp-140-15 ~]# oc get scheduledjobs
NAME      SCHEDULE      SUSPEND   ACTIVE    LAST-SCHEDULE
sj3       */1 * * * *   False     3         Wed, 19 Oct 2016 10:29:00 +0800
[root@dhcp-140-15 ~]# oc get scheduledjobs
NAME      SCHEDULE      SUSPEND   ACTIVE    LAST-SCHEDULE
sj3       */1 * * * *   False     3         Wed, 19 Oct 2016 10:29:00 +0800
[root@dhcp-140-15 ~]# oc get pod
NAME                   READY     STATUS    RESTARTS   AGE
sj3-4060648175-uqfm9   1/1       Running   0          40s
[root@dhcp-140-15 ~]# oc get jobs
NAME             DESIRED   SUCCESSFUL   AGE
sj3-4060648175   1         0            46s
[root@dhcp-140-15 ~]# oc get jobs
NAME             DESIRED   SUCCESSFUL   AGE
sj3-4060648175   1         0            1m
[root@dhcp-140-15 ~]# oc get jobs
NAME             DESIRED   SUCCESSFUL   AGE
sj3-4060648175   1         1            10m
[root@dhcp-140-15 ~]# oc get jobs
NAME             DESIRED   SUCCESSFUL   AGE
sj3-4060648175   1         1            28m
[root@dhcp-140-15 ~]# oc get scheduledjobs
NAME      SCHEDULE      SUSPEND   ACTIVE    LAST-SCHEDULE
sj3       */1 * * * *   False     2         Wed, 19 Oct 2016 10:29:00 +0800


And the master log repeatedly records FailedGet warnings for a job that no longer exists:
Oct 18 22:34:44 qe-pm-chuyumaster-1 docker[23945]: I1018 22:34:44.823091       1 event.go:216] Event(api.ObjectReference{Kind:"ScheduledJob", Namespace:"chuyu", Name:"sj3", UID:"67363c5c-95a3-11e6-a188-42010af00014", APIVersion:"batch", ResourceVersion:"3109", FieldPath:""}): type: 'Warning' reason: 'FailedGet' Get job: jobs.batch "sj3-467178220" not found
Oct 18 22:34:54 qe-pm-chuyumaster-1 docker[23945]: I1018 22:34:54.834128       1 event.go:216] Event(api.ObjectReference{Kind:"ScheduledJob", Namespace:"chuyu", Name:"sj3", UID:"67363c5c-95a3-11e6-a188-42010af00014", APIVersion:"batch", ResourceVersion:"3109", FieldPath:""}): type: 'Warning' reason: 'FailedGet' Get job: jobs.batch "sj3-467178220" not found
Oct 18 22:35:04 qe-pm-chuyumaster-1 docker[23945]: I1018 22:35:04.846256       1 event.go:216] Event(api.ObjectReference{Kind:"ScheduledJob", Namespace:"chuyu", Name:"sj3", UID:"67363c5c-95a3-11e6-a188-42010af00014", APIVersion:"batch", ResourceVersion:"3109", FieldPath:""}): type: 'Warning' reason: 'FailedGet' Get job: jobs.batch "sj3-467178220" not found
Oct 18 22:35:14 qe-pm-chuyumaster-1 docker[23945]: I1018 22:35:14.858327       1 event.go:216] Event(api.ObjectReference{Kind:"ScheduledJob", Namespace:"chuyu", Name:"sj3", UID:"67363c5c-95a3-11e6-a188-42010af00014", APIVersion:"batch", ResourceVersion:"3109", FieldPath:""}): type: 'Warning' reason: 'FailedGet' Get job: jobs.batch "sj3-467178220" not found
Oct 18 22:35:24 qe-pm-chuyumaster-1 docker[23945]: I1018 22:35:24.874125       1 event.go:216] Event(api.ObjectReference{Kind:"ScheduledJob", Namespace:"chuyu", Name:"sj3", UID:"67363c5c-95a3-11e6-a188-42010af00014", APIVersion:"batch", ResourceVersion:"3109", FieldPath:""}): type: 'Warning' reason: 'FailedGet' Get job: jobs.batch "sj3-467178220" not found


Version-Release number of selected component (if applicable):
openshift v3.3.1.3

How reproducible:
Always

Steps to Reproduce:
1. Create a scheduledjob
oc run sj3 --image=busybox --restart=Never --schedule="*/1 * * * *" -- sleep 300
2. Check the master log
3. Set the concurrencyPolicy of the scheduledjob to Replace
oc patch scheduledjobs sj3 -p '{"spec":{"concurrencyPolicy": "Replace"}}'
4. Check the jobs and scheduledjobs with 'oc get jobs' and 'oc get scheduledjobs', and check the master log.

Actual results:
1. No new jobs are scheduled according to the cron schedule.
2. The master log contains repeated errors such as:
Oct 18 22:35:24 qe-pm-chuyumaster-1 docker[23945]: I1018 22:35:24.874125       1 event.go:216] Event(api.ObjectReference{Kind:"ScheduledJob", Namespace:"chuyu", Name:"sj3", UID:"67363c5c-95a3-11e6-a188-42010af00014", APIVersion:"batch", ResourceVersion:"3109", FieldPath:""}): type: 'Warning' reason: 'FailedGet' Get job: jobs.batch "sj3-467178220" not found

Expected results:
1. New jobs are scheduled according to the cron schedule.
2. The master log should not contain errors like the above.

Additional info:

Comment 1 Maciej Szulik 2016-10-19 09:19:41 UTC
ScheduledJobs are a Technology Preview feature in 3.3.1 (alpha in Kubernetes, in origin master). That's why this is not blocking the release in any way, but I'll fix the problem in master only.

Comment 2 Maciej Szulik 2016-10-24 12:15:00 UTC
Upstream fix is in https://github.com/kubernetes/kubernetes/pull/35420

Downstream cherry-pick in https://github.com/openshift/origin/pull/11523
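
For readers without the PRs at hand, the gist of the change can be illustrated with a minimal, self-contained sketch (the type and helper names below are invented for illustration; this is not the upstream code): before counting active jobs, the controller has to drop status references to jobs that no longer exist, e.g. because the Replace policy deleted them. Otherwise a stale entry keeps the active count high and no new run is ever scheduled.

package main

import "fmt"

// Hypothetical, simplified stand-ins for the ScheduledJob status types;
// a sketch of the idea behind the fix, not the upstream implementation.
type jobRef struct{ name string }

type scheduledJobStatus struct{ active []jobRef }

// pruneActive drops references to jobs that no longer exist (for example,
// jobs deleted by the Replace policy) so they are not counted as active
// forever. jobExists stands in for a GET against the API server.
func pruneActive(status *scheduledJobStatus, jobExists func(name string) bool) {
    kept := status.active[:0]
    for _, ref := range status.active {
        if jobExists(ref.name) {
            kept = append(kept, ref)
        }
    }
    status.active = kept
}

func main() {
    status := &scheduledJobStatus{active: []jobRef{
        {name: "sj3-467178220"},  // deleted; a GET would return NotFound
        {name: "sj3-4060648175"}, // still running
    }}
    existing := map[string]bool{"sj3-4060648175": true}
    pruneActive(status, func(name string) bool { return existing[name] })
    fmt.Println(len(status.active)) // prints 1: the stale reference is gone
}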

Comment 3 openshift-github-bot 2016-10-31 16:10:42 UTC
Commit pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/f5e3c6dbbb59b8a6e374485c8d11d829b4571a91
Merge pull request #11523 from soltysh/bug1386463

Merged by openshift-bot

Comment 4 Chuan Yu 2016-11-02 06:10:09 UTC
Checked with devenv-fedora_5292; the issue is still not fixed.

openshift v1.4.0-alpha.0+8ecb3f5-997


[chuyu@dhcp-140-15 redhat]$ oc get jobs
NAME             DESIRED   SUCCESSFUL   AGE
sj3-1705609028   1         0            1m
[chuyu@dhcp-140-15 redhat]$ oc get pods
NAME                   READY     STATUS    RESTARTS   AGE
sj3-1705609028-o9rjg   1/1       Running   0          1m
[chuyu@dhcp-140-15 redhat]$ oc get scheduledjobs
NAME      SCHEDULE      SUSPEND   ACTIVE    LAST-SCHEDULE
sj3       */1 * * * *   False     2         Wed, 02 Nov 2016 13:59:00 +0800

I1102 06:04:52.772985    1133 event.go:217] Event(api.ObjectReference{Kind:"ScheduledJob", Namespace:"chuyu", Name:"sj3", UID:"3f739531-a0c1-11e6-904b-0ef7883dfef8", APIVersion:"batch", ResourceVersion:"649", FieldPath:""}): type: 'Warning' reason: 'FailedGet' Get job: jobs.batch "sj3-1781434183" not found

Comment 5 Maciej Szulik 2016-11-02 11:52:21 UTC
I've followed the steps described in #c1 and it's working as expected. I am running v1.4.0-alpha.0+537c0a5-1006 which is only a few commits ahead of what you were testing with. Can you give me the exact steps you're testing with?

Comment 6 Chuan Yu 2016-11-03 07:53:10 UTC
Here are my steps on openshift v1.4.0-alpha.0+90d8c62-1000:
1. Log in as a normal user and create a new project "chuyu".
2. Schedule a job with 'oc run sj3 --image=busybox --restart=Never --schedule="*/1 * * * *" -- sleep 300'.
3. Wait about 5 minutes, then edit the scheduledjob sj3 to set concurrencyPolicy to 'Replace'.
4. Then check the scheduledjobs, jobs, and pods; since then, no new job has been scheduled. According to the docs, with concurrencyPolicy Replace a new job should be scheduled every minute to replace the old one.

see the command output:
[chuyu@dhcp-140-15 ~]$ oc get scheduledjobs
NAME      SCHEDULE      SUSPEND   ACTIVE    LAST-SCHEDULE
sj3       */1 * * * *   False     4         Thu, 03 Nov 2016 15:27:00 +0800
[chuyu@dhcp-140-15 ~]$ oc get jobs
NAME             DESIRED   SUCCESSFUL   AGE
sj3-1782089543   1         0            2m
sj3-1857914698   1         0            2m
sj3-1858045770   1         0            17s
sj3-1933870925   1         0            1m
[chuyu@dhcp-140-15 ~]$ oc edit scheduledjobs sj3
scheduledjob "sj3" edited
[chuyu@dhcp-140-15 ~]$ oc get scheduledjobs
NAME      SCHEDULE      SUSPEND   ACTIVE    LAST-SCHEDULE
sj3       */1 * * * *   False     5         Thu, 03 Nov 2016 15:28:00 +0800
[chuyu@dhcp-140-15 ~]$ oc get scheduledjobs
NAME      SCHEDULE      SUSPEND   ACTIVE    LAST-SCHEDULE
sj3       */1 * * * *   False     4         Thu, 03 Nov 2016 15:28:00 +0800
[chuyu@dhcp-140-15 ~]$ oc get jobs
NAME             DESIRED   SUCCESSFUL   AGE
sj3-1782089543   1         1            5m
sj3-1858045770   1         0            3m
[chuyu@dhcp-140-15 ~]$ oc get pod
NAME                   READY     STATUS      RESTARTS   AGE
sj3-1782089543-8cf0q   0/1       Completed   0          5m
sj3-1858045770-tel3i   1/1       Running     0          3m
[chuyu@dhcp-140-15 ~]$ oc get pod
NAME                   READY     STATUS      RESTARTS   AGE
sj3-1782089543-8cf0q   0/1       Completed   0          6m
sj3-1858045770-tel3i   1/1       Running     0          4m
[chuyu@dhcp-140-15 ~]$ oc get jobs
NAME             DESIRED   SUCCESSFUL   AGE
sj3-1782089543   1         1            7m
sj3-1858045770   1         1            5m
[chuyu@dhcp-140-15 ~]$ oc get scheduledjobs
NAME      SCHEDULE      SUSPEND   ACTIVE    LAST-SCHEDULE
sj3       */1 * * * *   False     3         Thu, 03 Nov 2016 15:28:00 +0800
[chuyu@dhcp-140-15 ~]$ oc get jobs
NAME             DESIRED   SUCCESSFUL   AGE
sj3-1782089543   1         1            7m
sj3-1858045770   1         1            5m
[chuyu@dhcp-140-15 ~]$ oc get pods
NAME                   READY     STATUS      RESTARTS   AGE
sj3-1782089543-8cf0q   0/1       Completed   0          7m
sj3-1858045770-tel3i   0/1       Completed   0          5m
[chuyu@dhcp-140-15 ~]$ oc get pods
NAME                   READY     STATUS      RESTARTS   AGE
sj3-1782089543-8cf0q   0/1       Completed   0          8m
sj3-1858045770-tel3i   0/1       Completed   0          6m
[chuyu@dhcp-140-15 ~]$ oc get scheduledjobs
NAME      SCHEDULE      SUSPEND   ACTIVE    LAST-SCHEDULE
sj3       */1 * * * *   False     3         Thu, 03 Nov 2016 15:28:00 +0800
[chuyu@dhcp-140-15 ~]$ oc get scheduledjobs
NAME      SCHEDULE      SUSPEND   ACTIVE    LAST-SCHEDULE
sj3       */1 * * * *   False     3         Thu, 03 Nov 2016 15:28:00 +0800
[chuyu@dhcp-140-15 ~]$ date
Thu Nov  3 15:38:05 CST 2016

Comment 7 Chuan Yu 2016-11-03 09:46:41 UTC
Checked with openshift v1.4.0-alpha.0+019d471-1064; still getting the same results.

Also checked the source code on the instance; the PR has been merged.

According to https://github.com/openshift/openshift-docs/blob/master/dev_guide/scheduled_jobs.adoc, the "Replace" concurrency policy "cancels the currently running job and replaces it with a new one." That is not what happens here, so this should be the same issue.
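
To spell out what that documented behavior implies for the controller, here is a simplified, hypothetical sketch of one scheduled tick under Replace (deleteJob and createJob are invented stand-ins for the real API calls, not the upstream implementation):

package main

import "fmt"

// replaceTick sketches what one controller tick should do under
// concurrencyPolicy: Replace: cancel every job still running, then
// start a single new job in their place.
func replaceTick(active []string, deleteJob func(string), createJob func(string)) {
    for _, name := range active {
        deleteJob(name) // cancel the currently running job...
    }
    createJob("sj3-<new>") // ...and replace it with a new one
}

func main() {
    replaceTick([]string{"sj3-1858045770"},
        func(name string) { fmt.Println("delete job", name) },
        func(name string) { fmt.Println("create job", name) })
}

In the output in comment 6 the delete half happens (the old jobs disappear), but no replacement job is ever created, which matches the doc text: the miscalculated active count blocks scheduling of the new job.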

Comment 8 Maciej Szulik 2016-11-03 13:59:28 UTC
The fix is in https://github.com/openshift/origin/pull/11751 and is awaiting upstream approval.

Comment 9 Chuan Yu 2016-11-08 07:58:21 UTC
Checked with the latest OCP version; the issue is fixed.
openshift v3.4.0.23+24b1a58
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

Comment 11 errata-xmlrpc 2017-01-18 12:43:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0066

