Description of problem:
I am running concurrent build tests. When running concurrent builds of the django quickstart app, several builds get stuck in Pending status. This cluster is a system container install of OpenShift.

NAMESPACE     NAME                    TYPE     FROM   STATUS    STARTED   DURATION
svt-proj-22   django-psql-example-1   Source   Git    Pending
svt-proj-23   django-psql-example-1   Source   Git    Pending
svt-proj-25   django-psql-example-1   Source   Git    Pending
svt-proj-28   django-psql-example-1   Source   Git    Pending

Here are the events from one project:

LASTSEEN  FIRSTSEEN  COUNT  NAME                         KIND         SUBOBJECT  TYPE     REASON                        SOURCE                                                 MESSAGE
15m       15m        1      django-psql-example-1-build  Pod                     Normal   Scheduled                     default-scheduler                                      Successfully assigned django-psql-example-1-build to ip-172-31-62-163.us-west-2.compute.internal
15m       15m        1      django-psql-example-1-build  Pod                     Normal   SuccessfulMountVolume         kubelet, ip-172-31-62-163.us-west-2.compute.internal   MountVolume.SetUp succeeded for volume "crio-socket"
15m       15m        1      django-psql-example-1-build  Pod                     Normal   SuccessfulMountVolume         kubelet, ip-172-31-62-163.us-west-2.compute.internal   MountVolume.SetUp succeeded for volume "docker-socket"
15m       15m        1      django-psql-example-1-build  Pod                     Normal   SuccessfulMountVolume         kubelet, ip-172-31-62-163.us-west-2.compute.internal   MountVolume.SetUp succeeded for volume "buildworkdir"
15m       15m        1      django-psql-example-1-build  Pod                     Normal   SuccessfulMountVolume         kubelet, ip-172-31-62-163.us-west-2.compute.internal   MountVolume.SetUp succeeded for volume "builder-token-6qcbf"
15m       15m        1      django-psql-example-1-build  Pod                     Normal   SuccessfulMountVolume         kubelet, ip-172-31-62-163.us-west-2.compute.internal   MountVolume.SetUp succeeded for volume "builder-dockercfg-4l3k4-push"
4m        14m        19     django-psql-example-1-build  Pod                     Warning  FailedSync                    kubelet, ip-172-31-62-163.us-west-2.compute.internal   Error syncing pod
9m        14m        10     django-psql-example-1-build  Pod                     Normal   SandboxChanged                kubelet, ip-172-31-62-163.us-west-2.compute.internal   Pod sandbox changed, it will be killed and re-created.
36s       36s        1      django-psql-example-1        Build                   Normal   BuildStarted                  build-controller                                       Build svt-proj-18/django-psql-example-1 is now running
15m       15m        1      django-psql-example          BuildConfig             Warning  BuildConfigInstantiateFailed  buildconfig-controller                                 gave up on Build for BuildConfig svt-proj-18/django-psql-example (0) due to fatal error: the LastVersion(1) on build config svt-proj-18/django-psql-example does not match the build request LastVersion(0)

Version-Release number of selected component (if applicable):
openshift v3.7.0-0.147.0
kubernetes v1.7.6+a08f5eeb62
etcd 3.2.1

How reproducible:
With the django app, when doing 50 concurrent builds on 2 compute nodes.

Steps to Reproduce:
1. Create 50 build configs in 50 different projects.
2. Watch the builds.

Actual results:
Some builds are stuck in Pending status.

Expected results:
Builds should finish successfully.

Additional info:
When I run "oc get events" on the project where a build is stuck, somehow that triggers the build to start and it finishes fine. Attaching master and node logs.
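For reference, a minimal reproduction sketch along the lines of the steps above. The loop count, the svt-proj-N project naming, and the availability of the django-psql-example template in the openshift namespace are assumptions, not confirmed details from the test harness:

for i in $(seq 1 50); do
  # create one project per build config (project naming is assumed)
  oc new-project svt-proj-$i
  # instantiating the quickstart template creates the BuildConfig and triggers build #1
  oc new-app django-psql-example -n svt-proj-$i
done

# watch build status across all projects
oc get builds --all-namespaces -w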
This seems like the most interesting event:

9m  14m  10  django-psql-example-1-build  Pod  Normal  SandboxChanged  kubelet, ip-172-31-62-163.us-west-2.compute.internal  Pod sandbox changed, it will be killed and re-created.

But in general, any case where a pod is stuck in Pending should be looked at by the pod team.
Created attachment 1336886 [details] pod yaml
Builds finally finish after around 2 hours: one of the builds will start after a few minutes, and once that one finishes the rest remain Pending again for some time.
There are many non-bug explanations for why this might occur. Did all the pods schedule at the same time? What were the pod resource limits? What are max-pods and pods-per-core set to in the node config? What are the node resource capacity/allocatable values? My understanding is that the pods did eventually complete successfully. What exactly is the issue then?
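For completeness, the requested data can be gathered with standard commands; a minimal sketch, assuming the node name from the events above and one of the stuck build pods (both taken from earlier output, the grep windows are approximate):

# node capacity and allocatable values (Capacity/Allocatable sections of the output)
oc describe node ip-172-31-62-163.us-west-2.compute.internal

# resource requests/limits on one of the stuck build pods
oc get pod django-psql-example-1-build -n svt-proj-22 -o yaml | grep -A 6 'resources:'

# kubelet settings in the node config (path assumes a default OpenShift 3.x node install)
grep -A 2 -E 'max-pods|pods-per-core' /etc/origin/node/node-config.yaml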
Yes, all the pods were scheduled at the same time.

kubeletArguments:
  cloud-config:
  - /etc/origin/cloudprovider/aws.conf
  cloud-provider:
  - aws
  image-gc-high-threshold:
  - '80'
  image-gc-low-threshold:
  - '70'
  max-pods:
  - '510'
  maximum-dead-containers:
  - '20'
  maximum-dead-containers-per-container:
  - '1'
  minimum-container-ttl-duration:
  - 10s
  node-labels:
  - region=infra
  - zone=default
  pods-per-core:
  - '0'

The issue is that out of 50 builds around 40 complete, and then there is no activity for an hour. Then some of the builds will go to Running state and complete, and after that there is again no activity for some time. My question is: why are pods remaining in Pending when both nodes have no Running pods? Please let me know if you want me to reproduce and give you access to the cluster.
This is very different behavior from previous releases, where all build pods would go to Running within a few minutes at most and run to completion. There have been bugs across releases where builds hang in various states and for various reasons, but they were eventually fixed. https://www.google.com/search?q=site%3Abugzilla.redhat.com+openshift+build+hangs
Vikas, I am going through the previous comments/history/logs. Meanwhile, whenever you observe/reproduce the issue, please give me access to the cluster.
From the stuck build pod yaml:

  initContainerStatuses:
  - image: registry.ops.openshift.com/openshift3/ose-sti-builder:v3.7.0-0.158.0
    imageID: ""
    lastState: {}
    name: git-clone
    ready: false
    restartCount: 0
    state:
      waiting:
        reason: PodInitializing

And:

# oc get pods --all-namespaces | grep -v Completed
NAMESPACE      NAME              READY     STATUS       RESTARTS   AGE
svt-proj-044   eap-app-3-build   0/1       Init:0/2     0          4h

The same behaviour is observed in https://bugzilla.redhat.com/show_bug.cgi?id=1486356, just manifested in a different way. So closing it as a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1486356.
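For anyone hitting the same symptom, a quick way to confirm a build pod is wedged in this state is to look at the init container status and pod events directly; a minimal sketch, assuming the pod and namespace names from the output above:

# print the state of each init container on the stuck build pod
oc get pod eap-app-3-build -n svt-proj-044 \
  -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{" => "}{.state}{"\n"}{end}'

# pod-level events usually show why the sandbox keeps being killed and re-created
oc describe pod eap-app-3-build -n svt-proj-044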
*** This bug has been marked as a duplicate of bug 1486356 ***