1805474 – New and existing deployments/builds/pods are not scheduled to any nodes

Summary: New and existing deployments/builds/pods are not scheduled to any nodes

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	kube-controller-manager
Sub Component:
Version:	3.11.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	3.11.z
Assignee:	Maciej Szulik
QA Contact:	zhou ying
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-02-20 20:47 UTC by emahoney
Modified:	2023-09-07 21:58 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-02-21 18:57:51 UTC
Target Upstream Version:
Embargoed:

Description emahoney 2020-02-20 20:47:13 UTC

Description of problem: The customer is unable to schedule any new pods. Existing pods are up and running on the workers, but any pod that is deleted is never rescheduled. For example, deleting a webconsole pod leaves only 2 replicas running, and no third pod attempts to schedule. 'oc get events' shows no events in any of the affected namespaces. Scale up/down operations appear in the controller logs, but they have no effect in practice: pods can be scaled down but not back up. All control plane pods (API/Controllers/etcd) are listed as running/healthy, and no errors particularly stand out in their logs.


Version-Release number of selected component (if applicable):
3.11

How reproducible:
Can reproduce at will on customer env. Have not reproduced on any other env. 

Steps to Reproduce:
1. oc scale dc <dc_name> --replicas=0; oc scale dc <dc_name> --replicas=1
2. OR oc delete pod <podname>
3. No pod is recreated, no errors in controller/apiserver

Actual results:
No pod is recreated/scheduled in the environment


Expected results:
pods are rescheduled/scheduled based on OCP objects


Additional info:

Comment 4 Maciej Szulik 2020-02-21 14:46:47 UTC

So from what I see in the logs there are several problems that need solving first:

1. logs.go:49] http: TLS handshake error from 10.10.72.173:39818: no serving certificate available for the kubelet

happens from the start of the logs until around 10:00. 

2. remote_image.go:108] PullImage "registry.redhat.io/openshift3/ose-deployer:v3.11" from image service failed: rpc error: code = Unknown desc = Get https://registry.redhat.io/v2/openshift3/ose-deployer/manifests/v3.11: unauthorized: Please login to the Red Hat Registry using your Customer Portal credentials. Further instructions can be found here: https://access.redhat.com/RegistryAuthentication

After 1 ends and the kubelet actually starts (?), there are a ton of these errors, which suggest an authentication failure when pulling images.

Whatever happens afterwards is an avalanche of the above. I'd suggest solving these problems first, and then we can look into other problems.
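
To check whether the two errors quoted above are still occurring after a fix attempt, a small helper along these lines could count them in a saved node log. This is only a sketch: the function name triage_kubelet_log and the idea of working from a saved log file are assumptions for illustration, not something taken from the customer's data. The search strings are copied from the two log lines in this comment.

```shell
#!/bin/sh
# Hypothetical helper: count the two error signatures quoted in this comment
# in a saved node/kubelet log file.
# Usage: triage_kubelet_log <logfile>
triage_kubelet_log() {
    log_file="$1"

    # grep -c prints the number of matching lines. It exits non-zero when
    # there are no matches, so "|| true" keeps this usable under "set -e".
    cert_errors=$(grep -c 'no serving certificate available for the kubelet' "$log_file" || true)
    pull_errors=$(grep -c 'unauthorized: Please login to the Red Hat Registry' "$log_file" || true)

    echo "serving_cert_errors=$cert_errors"
    echo "registry_auth_errors=$pull_errors"
}
```

Run it as "triage_kubelet_log node.log", where node.log is a placeholder name for a log captured from the affected node (on 3.11 this would typically come from the node service journal, but that is an assumption about the environment). Counts that drop to zero in a log captured after the fix would suggest the corresponding problem is resolved.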
