1500513 – The extensions/v1beta1 API is not updated on old successful Jobs

Summary: The extensions/v1beta1 API is not updated on old successful Jobs

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Master
Sub Component:
Version:	3.6.1
Hardware:	All
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	3.6.z
Assignee:	Maciej Szulik
QA Contact:	zhou ying
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2017-10-10 19:31 UTC by Matthew Robson
Modified:	2020-12-14 10:27 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:	undefined
Clone Of:
Environment:
Last Closed:	2017-12-07 07:12:13 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2017:3389	0	normal	SHIPPED_LIVE	Moderate: Red Hat OpenShift Enterprise security, bug fix, and enhancement update	2017-12-07 12:09:10 UTC

Description Matthew Robson 2017-10-10 19:31:06 UTC

Description of problem:

In 3.6, Job support in extensions/v1beta1 was removed. In 3.5, there was an upgrade play[1] to update the jobs backend to the batch API.

It appears that the play did not update completed jobs, because they still reference the old API.

This leads to a flood of unexpected ListAndWatch errors in the logs, which slows down the API server:

Oct 10 12:35:10 atomic-openshift-master-api[40847]: E1010 12:35:10.805130   40847 cacher.go:274] unexpected ListAndWatch error: github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/storage/cacher.go:215: Failed to list *batch.Job: no kind "Job" is registered for version "extensions/v1beta1"

[root@home]# oc get all
Error from server: no kind "Job" is registered for version "extensions/v1beta1"
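
A rough way to gauge how much of the master log this error takes up is to count the matching lines. This is only a sketch: the function name count_job_kind_errors is made up for illustration, and the pattern is copied from the error text quoted above.

```shell
# Count log lines (read from standard input) that contain the
# 'no kind "Job" is registered' error quoted in this report.
# count_job_kind_errors is a hypothetical helper name.
count_job_kind_errors() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # there are none, so "|| true" keeps the function usable under "set -e".
    grep -c 'no kind "Job" is registered for version "extensions/v1beta1"' || true
}
```

It could be fed from the journal, for example with `journalctl -u atomic-openshift-master-api --since "1 hour ago" | count_job_kind_errors`, assuming the systemd unit has the same name as the log prefix shown above.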

[1] https://github.com/openshift/openshift-ansible/blob/release-1.5/playbooks/common/openshift-cluster/upgrades/v3_5/storage_upgrade.yml

Version-Release number of selected component (if applicable):


How reproducible:

3.4 to 3.5 to 3.6.1 upgrade


Steps to Reproduce:
1. These jobs were created in 3.4 and running with
2. Upgraded to 3.5 GA when it was released
3. Upgraded to 3.6 and the API is gone

Actual results:
The cluster is degraded due to reduced API availability. The jobs and the namespace can no longer be deleted.


Expected results:
Smooth transition.

Additional info:

Comment 3 Maciej Szulik 2017-10-16 15:04:24 UTC

These look like old jobs that somehow slipped past the migration. Remove them with the following command:
 
ETCDCTL_API=3 etcdctl --key=<path_to_master.etcd-client.key> --cert=<path_to_master.etcd-client.crt> --cacert=<path_to_ca.crt> --endpoints=<etcd_address> del /kubernetes.io/jobs/<namespace>/<job_name>

It's also worth checking the pods created by those jobs, although they should be cleaned up by the garbage collector
once the job is gone.
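
Before deleting anything, it may help to find which stored jobs still reference the old API. The following is a minimal sketch, not a tested procedure. It assumes that ETCDCTL_API=3 is exported, that jobs live under the /kubernetes.io/jobs/ prefix used in the command above, and that the stored value contains the API version as a readable string. The function names job_key and stale_job_keys are made up for illustration.

```shell
# Build the etcd key for a job, following the key layout of the del command above.
# Arguments: <namespace> <job_name>
job_key() {
    printf '/kubernetes.io/jobs/%s/%s\n' "$1" "$2"
}

# Print the key of every stored job whose value still mentions extensions/v1beta1.
# Any arguments (for example --key, --cert, --cacert, --endpoints) are passed
# through to etcdctl unchanged.
stale_job_keys() {
    etcdctl "$@" get /kubernetes.io/jobs/ --prefix --keys-only |
    while IFS= read -r key; do
        # --keys-only output separates keys with blank lines; skip those.
        [ -n "$key" ] || continue
        # grep -a treats the value as text even if it contains binary data.
        if etcdctl "$@" get "$key" --print-value-only </dev/null |
            grep -aq 'extensions/v1beta1'; then
            printf '%s\n' "$key"
        fi
    done
}
```

Calling stale_job_keys with the same TLS and endpoint flags as above would print one key per line. Each key can then be reviewed and passed to the del command shown earlier.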

Comment 4 zhou ying 2017-10-17 08:30:59 UTC

Matthew Robson:

Did the delete command work for you?

Comment 5 Matthew Robson 2017-10-31 13:16:15 UTC

Sorry, yes. The delete command worked and it resolved the issue.

We cleaned up all of the jobs and pods, then listed all remaining objects to confirm they were gone:

ETCDCTL_API=3 etcdctl --key=<path_to_master.etcd-client.key> --cert=<path_to_master.etcd-client.crt> --cacert=<path_to_ca.crt> --endpoints=<etcd_address> get / --prefix

Comment 6 zhou ying 2017-11-01 01:15:54 UTC

Matthew Robson:

Thanks, then I'll verify this issue.

Comment 9 errata-xmlrpc 2017-12-07 07:12:13 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3389
