Description of problem: The customer is unable to schedule any new pods. Existing pods are up and running on the workers, but deleting any pod in the cluster results in the pod not being rescheduled at all. For example, deleting a webconsole pod leaves only 2 replicas running, and no third pod ever attempts to schedule. 'oc get events' in any of the affected namespaces shows no events. The controller logs show scale up/down operations, but these have no effect: pods can be scaled down but not back up. All control plane pods (API/controllers/etcd) are listed as running/healthy, and no errors particularly stand out in their logs.
Version-Release number of selected component (if applicable):
Reproducible at will in the customer environment. Not reproduced in any other environment.
Steps to Reproduce:
1. oc scale dc <dc_name> --replicas=0; oc scale dc <dc_name> --replicas=1
2. OR oc delete pod <podname>
3. No pod is recreated, and there are no errors in the controller/apiserver logs
Actual results:
No pod is recreated/scheduled in the environment.
Expected results:
Pods are rescheduled/scheduled based on OCP objects.
From what I see in the logs, there are several problems that need solving first:
1. logs.go:49] http: TLS handshake error from 10.10.72.173:39818: no serving certificate available for the kubelet
This happens from the start of the logs until around 10:00.
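The "no serving certificate available for the kubelet" error usually indicates node certificate signing requests stuck in Pending. A possible first check, assuming cluster-admin access (approve only CSRs you recognize as coming from the cluster's own nodes):

```shell
# List certificate signing requests; look for entries stuck in Pending.
oc get csr

# Approving pending node CSRs lets the kubelet obtain its serving
# certificate. Review the list first before bulk-approving like this.
oc get csr -o name | xargs oc adm certificate approve
```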
2. remote_image.go:108] PullImage "registry.redhat.io/openshift3/ose-deployer:v3.11" from image service failed: rpc error: code = Unknown desc = Get https://registry.redhat.io/v2/openshift3/ose-deployer/manifests/v3.11: unauthorized: Please login to the Red Hat Registry using your Customer Portal credentials. Further instructions can be found here: https://access.redhat.com/RegistryAuthentication
After 1 ends and the kubelet actually starts (?), there is a flood of these errors, suggesting an authentication failure when pulling images from registry.redhat.io.
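The unauthorized pull errors can typically be addressed by adding a registry.redhat.io pull secret to the affected namespaces. A sketch, assuming valid Customer Portal credentials; the secret name, namespace, and credential placeholders below are illustrative, not taken from the customer environment:

```shell
# Create a pull secret for registry.redhat.io (placeholder credentials).
oc create secret docker-registry redhat-registry \
  --docker-server=registry.redhat.io \
  --docker-username='<customer-portal-user>' \
  --docker-password='<customer-portal-password>' \
  -n openshift-web-console

# Link the secret so the namespace's service accounts use it for pulls.
oc secrets link default redhat-registry --for=pull -n openshift-web-console
oc secrets link deployer redhat-registry --for=pull -n openshift-web-console
```

This would need to be repeated for each namespace pulling from registry.redhat.io, per the RegistryAuthentication article linked in the error message.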
Whatever happens afterwards is an avalanche of the two errors above. I'd suggest solving these problems first, and then we can look into any remaining issues.