Bug 1805474

Summary: New and existing deployments/builds/pods are not scheduled to any nodes
Product: OpenShift Container Platform    Reporter: emahoney
Component: kube-controller-manager    Assignee: Maciej Szulik <maszulik>
Status: CLOSED INSUFFICIENT_DATA    QA Contact: zhou ying <yinzhou>
Severity: high
Priority: high
Version: 3.11.0    CC: adam.kaplan, aos-bugs, mfojtik, scuppett
Target Milestone: ---   
Target Release: 3.11.z   
Hardware: Unspecified   
OS: Unspecified   
Type: Bug
Last Closed: 2020-02-21 18:57:51 UTC

Description emahoney 2020-02-20 20:47:13 UTC
Description of problem: The customer is unable to schedule any new pods. Existing pods are up and running on the workers, but deleting any pod in the cluster results in that pod never being rescheduled. For example, deleting a webconsole pod leaves only 2 replicas running and no third pod ever attempts to schedule. 'oc get events' in the affected namespaces shows nothing. From the controller logs we can see the scale up/down requests, but they have no effect: pods can be scaled down but not back up. All control plane pods (API/controllers/etcd) are listed as running/healthy, and no errors particularly stand out in their logs.
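
For reference, the checks described above correspond roughly to the following commands (the namespace is just the webconsole example; the controller pod name is a placeholder and varies per master host):

    # look for scheduling/replication events in an affected namespace
    oc get events -n openshift-web-console
    # check how many replicas actually exist vs. what the controllers expect
    oc get all -n openshift-web-console
    # inspect the controller manager logs (static pod name varies per master)
    oc logs -n kube-system <master-controllers-pod>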


Version-Release number of selected component (if applicable):
3.11

How reproducible:
Can reproduce at will in the customer environment. Have not reproduced in any other environment.

Steps to Reproduce:
1. oc scale dc <dc_name> --replicas=0; oc scale dc <dc_name> --replicas=1
2. OR oc delete pod <podname>
3. Observe that no pod is recreated and no errors appear in the controller or apiserver logs

Actual results:
No pod is recreated/scheduled in the environment


Expected results:
Pods are recreated and scheduled as dictated by the corresponding OCP objects


Additional info:

Comment 4 Maciej Szulik 2020-02-21 14:46:47 UTC
So from what I see in the logs there are several problems that need solving first:

1. logs.go:49] http: TLS handshake error from 10.10.72.173:39818: no serving certificate available for the kubelet

happens from the start of the logs until around 10:00. 
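
If the kubelet serving certificates were never issued, a first thing to check (a suggestion based on this error message, not confirmed from the attached logs) is whether there are pending CSRs awaiting approval:

    # list certificate signing requests; pending kubelet serving CSRs show up here
    oc get csr
    # approve a specific pending request
    oc adm certificate approve <csr_name>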

2. remote_image.go:108] PullImage "registry.redhat.io/openshift3/ose-deployer:v3.11" from image service failed: rpc error: code = Unknown desc = Get https://registry.redhat.io/v2/openshift3/ose-deployer/manifests/v3.11: unauthorized: Please login to the Red Hat Registry using your Customer Portal credentials. Further instructions can be found here: https://access.redhat.com/RegistryAuthentication

After problem 1 ends and the kubelet actually starts (?), there is a ton of these errors, suggesting the nodes cannot authenticate to the registry to pull images.
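
The usual remediation for that registry error (sketched here; the secret name is just an example) is to verify the Customer Portal credentials on the nodes, or to create a pull secret and link it to the service accounts that pull the image:

    # verify the Customer Portal credentials work against the registry
    docker login registry.redhat.io
    # or create a pull secret and link it for image pulls
    oc create secret docker-registry redhat-registry \
        --docker-server=registry.redhat.io \
        --docker-username=<portal_user> --docker-password=<portal_password>
    oc secrets link default redhat-registry --for=pull
    oc secrets link deployer redhat-registry --for=pull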

Whatever is happening afterwards is an avalanche of the above. I'd suggest solving these two problems first, and then we can look into any remaining issues.