Description of problem:
Customer has an OCP 3.9.25 cluster running with the following architecture:
- 3 masters and etcd and 12 compute nodes on top of RHEV, with an NFS server as the storage backend for the Datacenter storage domain (so the main VMs' qcow images reside on NFS).
- Masters are also used as infra nodes, with Elasticsearch, Cassandra, routers, registry, ipfailover, etc. running on them.
- Masters have 8 vCPUs, 32GB vRAM and 17GB vDisks holding the root partition, plus a second volume for docker.
- All PVs are backed by NFS as well (and the customer has a lot of them).
- Deploying or rolling out a new application can get stuck without creating any deploy pod, or the deployment times out after some time and fails. Examples of the tests done:
# oc rollout latest dc <some-dc-name>
1. The deploy pod runs, but after a couple of minutes it fails with a timeout rolling out the new rc.
2. Or it does nothing and the new rollout version simply stays "in queue" waiting to start.
# oc new-app --template=mariadb-ephemeral -p MYSQL_USER=user -p MYSQL_PASSWORD=1234 -p MYSQL_ROOT_PASSWORD=5678 -l testdb=mariadb (or using the catalog via web-console, doesn't matter)
1. The deploymentConfig, svc and endpoints objects are created, but no replicationController or deploy pod rolls out. The dc stays at desired=1 but nothing happens.
2. Or it may create the deploy pod, which then fails the same way as above.
# oc get events
Nothing is displayed
# oc status -v
Some dc versions appear in pending state
- With the above stuck, restarting the atomic-openshift-master-controllers service on the masters makes everything proceed immediately:
# systemctl restart atomic-openshift-master-controllers
# oc get all -n <testing-projects>
1. All pods start running and/or builds start running as well.
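To tell the two stuck cases apart more reliably than by watching `oc get all`, something like the sketch below may help. It is only a rough check, assuming the usual `<dc-name>-<version>` naming of deployment RCs. `check_rollout_started` is a hypothetical helper name (not an oc subcommand), and the timeout and polling interval defaults are arbitrary.

```shell
#!/bin/sh
# Sketch: report whether the controllers created an rc for the latest dc version.
# check_rollout_started is a hypothetical helper, not part of oc.
# Usage: check_rollout_started <dc-name> [timeout-seconds] [interval-seconds]
check_rollout_started() {
    dc="$1"
    timeout="${2:-60}"     # assumed default, adjust as needed
    interval="${3:-5}"     # assumed default polling interval
    waited=0
    version=""
    while [ "$waited" -lt "$timeout" ]; do
        # status.latestVersion stays 0 until a first rollout is triggered
        version=$(oc get dc "$dc" -o jsonpath='{.status.latestVersion}' 2>/dev/null)
        # Deployment RCs are named <dc-name>-<version>
        if [ -n "$version" ] && [ "$version" != "0" ] &&
           oc get rc "${dc}-${version}" >/dev/null 2>&1; then
            echo "started: rc/${dc}-${version}"
            return 0
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    echo "stuck: dc/${dc} latestVersion=${version:-unknown}, no matching rc after ${timeout}s"
    return 1
}
```

For example, `check_rollout_started mariadb 120` could be run in the test project right after the `oc new-app` or `oc rollout latest` commands above, and again after restarting the controllers, to compare the results.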
Version-Release number of selected component (if applicable):
OpenShift Container Platform 3.9.25
Steps to Reproduce:
Will attach logs from master-controllers.
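For reference, the controller logs can be pulled from journald on each master with something along these lines. This is a minimal sketch: `collect_controller_logs` is a hypothetical helper name, and the default time window and output path are assumptions, not values from the customer environment.

```shell
#!/bin/sh
# Sketch: save recent master-controllers logs to a file for attaching.
# collect_controller_logs is a hypothetical helper; the defaults are assumptions.
# Usage: collect_controller_logs [since] [output-file]
collect_controller_logs() {
    since="${1:-1 hour ago}"
    out="${2:-/tmp/master-controllers-$(hostname -s).log}"
    # -u selects the systemd unit, --since limits the time window
    journalctl -u atomic-openshift-master-controllers \
        --since "$since" --no-pager > "$out" || return 1
    echo "$out"
}
```

Running it on each master before the restart, with a window that covers the stuck rollout, should capture the relevant period.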