Bug 1657179 - deployments and builds stuck on waiting to start until openshift-master-controllers service is restarted
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.9.z
Assignee: Tomas Smetana
QA Contact: Liang Xia
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-12-07 11:16 UTC by Vladislav Walek
Modified: 2019-04-04 08:31 UTC (History)
CC List: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-04-04 08:31:09 UTC
Target Upstream Version:
Embargoed:


Description Vladislav Walek 2018-12-07 11:16:56 UTC
Description of problem:

A customer is running an OCP 3.9.25 cluster with the following architecture:

  - 3 masters and etcd and 12 compute nodes on top of RHEV, with an NFS server as the storage backend for the Datacenter storage domain (so the main VMs' qcow images are running on top of NFS).
  - The masters are also used as infra nodes, with elasticsearch, cassandra, routers, registry, ipfailover, etc. running on them.
  - The masters have 8 vCPUs, 32GB vRAM and 17GB vDisks with the root partition, plus a second volume for docker.
  - All PVs are backed by NFS as well (and there are a lot of them).

Issue:

  - Deploying or rolling out a new application can get stuck without creating any deploy pod, or the deployment times out after some time and fails. Examples of the tests done:

   # oc rollout latest dc <some-dc-name>
     1. The deploy pod runs, but after a couple of minutes it fails with a timeout rolling out the new rc.
     2. Or nothing happens and the new rollout version simply stays "in queue", waiting to start.

   # oc new-app --template=mariadb-ephemeral -p MYSQL_USER=user -p MYSQL_PASSWORD=1234 -p MYSQL_ROOT_PASSWORD=5678 -l testdb=mariadb (or using the catalog via web-console, doesn't matter)
     1. The deploymentConfig, svc and endpoints objects are created, but no replicationController or deploy pod rolls out. The dc stays at desired=1 but nothing happens.
     2. Or it may create the deploy pod, which then fails the same way as above.

  # oc get events 
    Nothing is displayed
  
  # oc status -v 
    Some dc versions appear in pending state

  - With the above stuck, going to the masters and restarting the openshift-master-controllers service makes everything work immediately:

   # systemctl restart atomic-openshift-master-controllers
   # oc get all -n <testing-projects> 
     1. We see all pods start running and/or builds start to run as well.
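
Before applying the restart workaround, it may help to record which objects are still waiting to start, so the state can be compared before and after. The following is only a minimal sketch and not something taken from the customer environment: the function name filter_waiting is hypothetical, and it assumes its input has three whitespace-separated columns (namespace, name, phase).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the report): read "namespace name phase" rows
# on stdin and print only those whose phase is New or Pending, i.e. objects
# that are still waiting to start.
filter_waiting() {
  awk '$3 == "New" || $3 == "Pending" { print $1 "/" $2, $3 }'
}

# Possible use on a master before restarting the controllers. This assumes a
# logged-in oc client with cluster-wide read access:
#   oc get builds --all-namespaces --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,PHASE:.status.phase \
#     | filter_waiting
```

Running the same pipeline again after the restart should show the list shrinking if the controllers have picked the queued work up.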


Version-Release number of selected component (if applicable):
OpenShift Container Platform 3.9.25


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
Will attach logs from master-controllers.

