Description of problem: The customer is unable to schedule any new pods. Existing pods are up and running on the workers, but deleting any pod in the cluster results in the pod not being rescheduled at all. For example, deleting a webconsole pod leaves only 2 replicas running, and no third pod ever attempts to schedule. 'oc get events' in any of the affected namespaces shows no events. The controller logs show scale up/down operations, but these have no effect: pods can be scaled down but not back up. All control plane pods (API/controllers/etcd) are listed as running/healthy, and no errors particularly stand out in their logs.
Version-Release number of selected component (if applicable):
Reproducible at will in the customer environment. Not reproduced in any other environment.
Steps to Reproduce:
1. oc scale dc <dc_name> --replicas=0; oc scale dc <dc_name> --replicas=1
2. OR oc delete pod <podname>
3. No pod is recreated, and there are no errors in the controller/apiserver logs
Actual results:
No pod is recreated/scheduled in the environment.
Expected results:
Pods are rescheduled/scheduled based on OCP objects.
From what I see in the logs, there are several problems that need solving first:
1. logs.go:49] http: TLS handshake error from 10.10.72.173:39818: no serving certificate available for the kubelet
This happens from the start of the logs until around 10:00.
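The "no serving certificate available for the kubelet" error usually indicates node certificate signing requests stuck in Pending. A possible first check, assuming cluster-admin access (approve only CSRs you recognize as coming from the cluster's own nodes):

```shell
# List certificate signing requests; look for entries stuck in Pending.
oc get csr

# Approving pending node CSRs lets the kubelet obtain its serving
# certificate. Review the list first before bulk-approving like this.
oc get csr -o name | xargs oc adm certificate approve
```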
2. remote_image.go:108] PullImage "registry.redhat.io/openshift3/ose-deployer:v3.11" from image service failed: rpc error: code = Unknown desc = Get https://registry.redhat.io/v2/openshift3/ose-deployer/manifests/v3.11: unauthorized: Please login to the Red Hat Registry using your Customer Portal credentials. Further instructions can be found here: https://access.redhat.com/RegistryAuthentication
After 1 ends and the kubelet actually starts (?), there is a flood of these errors, suggesting an authentication failure when pulling images from registry.redhat.io.
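The unauthorized pull errors can typically be addressed by adding a registry.redhat.io pull secret to the affected namespaces. A sketch, assuming valid Customer Portal credentials; the secret name, namespace, and credential placeholders below are illustrative, not taken from the customer environment:

```shell
# Create a pull secret for registry.redhat.io (placeholder credentials).
oc create secret docker-registry redhat-registry \
  --docker-server=registry.redhat.io \
  --docker-username='<customer-portal-user>' \
  --docker-password='<customer-portal-password>' \
  -n openshift-web-console

# Link the secret so the namespace's service accounts use it for pulls.
oc secrets link default redhat-registry --for=pull -n openshift-web-console
oc secrets link deployer redhat-registry --for=pull -n openshift-web-console
```

This would need to be repeated for each namespace pulling from registry.redhat.io, per the RegistryAuthentication article linked in the error message.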
Whatever happens afterwards is an avalanche of the two errors above. I'd suggest solving these problems first, and then we can look into any remaining issues.