Bug 1500513 - The extensions/v1beta1 API is not updated on old successful Jobs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Master
Version: 3.6.1
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.6.z
Assignee: Maciej Szulik
QA Contact: zhou ying
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-10-10 19:31 UTC by Matthew Robson
Modified: 2020-12-14 10:27 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-12-07 07:12:13 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:3389 0 normal SHIPPED_LIVE Moderate: Red Hat OpenShift Enterprise security, bug fix, and enhancement update 2017-12-07 12:09:10 UTC

Description Matthew Robson 2017-10-10 19:31:06 UTC
Description of problem:

In 3.6, the Job kind was removed from the extensions/v1beta1 API. In 3.5, there was a play[1] to migrate the jobs backend to the batch API.

It appears that the play did not update completed jobs, because they still reference the old API.

This floods the logs with unexpected ListAndWatch errors and slows down the API server.

Oct 10 12:35:10 atomic-openshift-master-api[40847]: E1010 12:35:10.805130   40847 cacher.go:274] unexpected ListAndWatch error: github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/storage/cacher.go:215: Failed to list *batch.Job: no kind "Job" is registered for version "extensions/v1beta1"

[root@home]# oc get all
Error from server: no kind "Job" is registered for version "extensions/v1beta1"

[1] https://github.com/openshift/openshift-ansible/blob/release-1.5/playbooks/common/openshift-cluster/upgrades/v3_5/storage_upgrade.yml
Version-Release number of selected component (if applicable):


How reproducible:

3.4 to 3.5 to 3.6.1 upgrade


Steps to Reproduce:
1. These jobs were created in 3.4 and running with
2. Upgraded to 3.5 GA when it was released
3. Upgraded to 3.6 and the API is gone

Actual results:
The cluster is degraded because of API availability problems. The jobs and the namespace can no longer be deleted.


Expected results:
Smooth transition.

Additional info:

Comment 3 Maciej Szulik 2017-10-16 15:04:24 UTC
These look like old jobs that somehow slipped through the migration. Remove them with the following command:
 
ETCDCTL_API=3 etcdctl --key=<path_to_master.etcd-client.key> --cert=<path_to_master.etcd-client.crt> --cacert=<path_to_ca.crt> --endpoints=<etcd_address> del /kubernetes.io/jobs/<namespace>/<job_name>

It's also worth checking the pods created by those jobs, although the garbage collector should clean them up
once the job is gone.
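
To work out which keys to pass to the del command above, something like the following might help. It is only a sketch: the function name stale_job_keys is made up for illustration, and it assumes the stored job objects contain the literal string extensions/v1beta1 (expected for JSON-encoded objects, not verified for other encodings). The connection flags are passed through to etcdctl unchanged.

```shell
# Sketch: print every key under /kubernetes.io/jobs/ whose stored value
# still mentions the removed extensions/v1beta1 API.
# stale_job_keys is a hypothetical helper name, not part of any tool.
stale_job_keys() {
    # "$@" carries the connection flags (--key, --cert, --cacert, --endpoints).
    ETCDCTL_API=3 etcdctl "$@" get /kubernetes.io/jobs/ --prefix --keys-only |
    while IFS= read -r key; do
        # Skip the blank lines that --keys-only prints between keys.
        [ -n "$key" ] || continue
        # Fetch only the value and look for the old API version string.
        if ETCDCTL_API=3 etcdctl "$@" get "$key" --print-value-only </dev/null |
               grep -q 'extensions/v1beta1'; then
            printf '%s\n' "$key"
        fi
    done
}
```

Called with the same flags as the del command (--key, --cert, --cacert, --endpoints), it prints one key per line and deletes nothing, so the list can be reviewed before anything is removed.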

Comment 4 zhou ying 2017-10-17 08:30:59 UTC
Matthew Robson:

Did the delete command work for you?

Comment 5 Matthew Robson 2017-10-31 13:16:15 UTC
Sorry, yes. The delete command worked and it resolved the issue.

We cleaned up all of the jobs and pods and confirmed they were gone by listing all remaining objects with:

ETCDCTL_API=3 etcdctl --key=<path_to_master.etcd-client.key> --cert=<path_to_master.etcd-client.crt> --cacert=<path_to_ca.crt> --endpoints=<etcd_address> get / --prefix
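
A full dump of every object can be hard to scan, so a per-resource key count may be easier to check. The following is a sketch only: count_keys_by_resource is a hypothetical helper name, and it assumes the keys are listed one per line (for example with the --keys-only flag of etcdctl get) and that Kubernetes objects live under /kubernetes.io/ as in the del command earlier in this bug.

```shell
# Sketch: read etcd keys on stdin, one per line, and print
# "<count> <resource>" for every key under /kubernetes.io/.
# count_keys_by_resource is a hypothetical helper name.
count_keys_by_resource() {
    # Splitting "/kubernetes.io/jobs/ns/name" on "/" gives an empty $1,
    # "kubernetes.io" as $2 and the resource type ("jobs") as $3.
    awk -F/ '$2 == "kubernetes.io" && $3 != "" { n[$3]++ }
             END { for (r in n) print n[r], r }' | sort -k2
}
```

Piping the output of the get / --prefix command, with --keys-only added, into count_keys_by_resource should show no "jobs" line once every job is gone.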

Comment 6 zhou ying 2017-11-01 01:15:54 UTC
Matthew Robson:

Thanks, then I'll verify this issue.

Comment 9 errata-xmlrpc 2017-12-07 07:12:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3389

