Bug 1500498 - Builds stuck in Pending status
Summary: Builds stuck in Pending status
Keywords:
Status: CLOSED DUPLICATE of bug 1486356
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Avesh Agarwal
QA Contact: DeShuai Ma
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-10-10 18:25 UTC by Vikas Laad
Modified: 2017-10-24 23:20 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-10-24 23:20:28 UTC
Target Upstream Version:
Embargoed:


Attachments
pod yaml (23.98 KB, text/plain)
2017-10-10 19:26 UTC, Vikas Laad

Description Vikas Laad 2017-10-10 18:25:38 UTC
Description of problem:
I am running concurrent build tests. When running concurrent builds of the django quickstart app, several builds get stuck in Pending status. This cluster is a system container install of OpenShift.

NAMESPACE     NAME                    TYPE      FROM          STATUS     STARTED             DURATION
svt-proj-22   django-psql-example-1   Source    Git           Pending
svt-proj-23   django-psql-example-1   Source    Git           Pending
svt-proj-25   django-psql-example-1   Source    Git           Pending
svt-proj-28   django-psql-example-1   Source    Git           Pending

Here are the events from one project:
LASTSEEN   FIRSTSEEN   COUNT     NAME                          KIND          SUBOBJECT   TYPE      REASON                         SOURCE                                                 MESSAGE
15m        15m         1         django-psql-example-1-build   Pod                       Normal    Scheduled                      default-scheduler                                      Successfully assigned django-psql-example-1-build to ip-172-31-62-163.us-west-2.compute.internal
15m        15m         1         django-psql-example-1-build   Pod                       Normal    SuccessfulMountVolume          kubelet, ip-172-31-62-163.us-west-2.compute.internal   MountVolume.SetUp succeeded for volume "crio-socket" 
15m        15m         1         django-psql-example-1-build   Pod                       Normal    SuccessfulMountVolume          kubelet, ip-172-31-62-163.us-west-2.compute.internal   MountVolume.SetUp succeeded for volume "docker-socket" 
15m        15m         1         django-psql-example-1-build   Pod                       Normal    SuccessfulMountVolume          kubelet, ip-172-31-62-163.us-west-2.compute.internal   MountVolume.SetUp succeeded for volume "buildworkdir" 
15m        15m         1         django-psql-example-1-build   Pod                       Normal    SuccessfulMountVolume          kubelet, ip-172-31-62-163.us-west-2.compute.internal   MountVolume.SetUp succeeded for volume "builder-token-6qcbf" 
15m        15m         1         django-psql-example-1-build   Pod                       Normal    SuccessfulMountVolume          kubelet, ip-172-31-62-163.us-west-2.compute.internal   MountVolume.SetUp succeeded for volume "builder-dockercfg-4l3k4-push" 
4m         14m         19        django-psql-example-1-build   Pod                       Warning   FailedSync                     kubelet, ip-172-31-62-163.us-west-2.compute.internal   Error syncing pod
9m         14m         10        django-psql-example-1-build   Pod                       Normal    SandboxChanged                 kubelet, ip-172-31-62-163.us-west-2.compute.internal   Pod sandbox changed, it will be killed and re-created.
36s        36s         1         django-psql-example-1         Build                     Normal    BuildStarted                   build-controller                                       Build svt-proj-18/django-psql-example-1 is now running
15m        15m         1         django-psql-example           BuildConfig               Warning   BuildConfigInstantiateFailed   buildconfig-controller                                 gave up on Build for BuildConfig svt-proj-18/django-psql-example (0) due to fatal error: the LastVersion(1) on build config svt-proj-18/django-psql-example does not match the build request LastVersion(0)

Version-Release number of selected component (if applicable):
openshift v3.7.0-0.147.0
kubernetes v1.7.6+a08f5eeb62
etcd 3.2.1

How reproducible:
With the django app, when doing 50 concurrent builds on 2 compute nodes.

Steps to Reproduce:
1. Create 50 build configs in 50 different projects.
2. Watch the builds (a rough sketch of these steps follows below).
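
A rough sketch of those steps, assuming the django-psql-example template is available in the openshift namespace (the svt-proj-NN project naming matches the output above and is illustrative):

# Create 50 projects and instantiate the quickstart in each
for i in $(seq 1 50); do
  oc new-project svt-proj-$i
  oc new-app django-psql-example -n svt-proj-$i
done

# Watch build status across all projects
oc get builds --all-namespaces -w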

Actual results:
Some builds are stuck in Pending status

Expected results:
Builds should finish successfully.

Additional info:
When I run oc get events on the project where a build is stuck, that somehow triggers the build to start, and it then finishes fine. Attaching master and node logs.
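
The workaround can be scripted; a rough sketch, assuming the same svt-proj-NN project naming as in the reproduce steps:

# Listing events in each project appears to nudge the stuck builds into starting
for i in $(seq 1 50); do
  oc get events -n svt-proj-$i > /dev/null
done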

Comment 2 Ben Parees 2017-10-10 19:12:03 UTC
This seems like the most interesting event:

9m         14m         10        django-psql-example-1-build   Pod                       Normal    SandboxChanged                 kubelet, ip-172-31-62-163.us-west-2.compute.internal   Pod sandbox changed, it will be killed and re-created.

But in general, any case where a pod is stuck in Pending should be looked at by the pod team.

Comment 3 Vikas Laad 2017-10-10 19:26:39 UTC
Created attachment 1336886 [details]
pod yaml

Comment 4 Vikas Laad 2017-10-11 15:45:46 UTC
Builds finally finish after around 2 hours. One of the builds starts after a few minutes, and once that finishes everything is Pending again for some time.

Comment 5 Seth Jennings 2017-10-19 19:17:50 UTC
There are many non-bug explanations for why this might occur.

Did all pods schedule at the same time? What were the pod resource limits? What are max-pods and pods-per-core set to in the node config? What are the node resource capacity/allocatable values?
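
For reference, a rough sketch of how these values could be gathered, assuming a standard 3.7 node install (the node name is taken from the events above):

# Node capacity and allocatable values
oc describe node ip-172-31-62-163.us-west-2.compute.internal | grep -A 8 -E 'Capacity|Allocatable'

# max-pods and pods-per-core from the node config
grep -A 1 -E 'max-pods|pods-per-core' /etc/origin/node/node-config.yaml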

My understanding is that the pods did eventually complete successfully.  What exactly is the issue then?

Comment 6 Vikas Laad 2017-10-19 19:31:54 UTC
Yes, all the pods were scheduled at the same time.

kubeletArguments: 
  cloud-config:
  - /etc/origin/cloudprovider/aws.conf
  cloud-provider:
  - aws
  image-gc-high-threshold:
  - '80'
  image-gc-low-threshold:
  - '70'
  max-pods:
  - '510'
  maximum-dead-containers:
  - '20'
  maximum-dead-containers-per-container:
  - '1'
  minimum-container-ttl-duration:
  - 10s
  node-labels:
  - region=infra
  - zone=default
  pods-per-core:
  - '0'

The issue is that out of 50 builds around 40 complete, and then there is no activity for an hour. Then some of the builds go to Running state and complete, and after that there is again no activity for some time. My question is: why do pods remain Pending when both nodes have no Running pods? Please let me know if you want me to reproduce the issue and give you access to the cluster.
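
One way to confirm the nodes are idle while builds sit in Pending (a sketch; column positions may differ by version):

# Count Running pods per node
oc get pods --all-namespaces -o wide | grep Running | awk '{print $8}' | sort | uniq -c

# List build pods still Pending
oc get pods --all-namespaces | grep -- '-build' | grep Pending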

Comment 7 Mike Fiedler 2017-10-19 20:02:07 UTC
This is very different behavior from previous releases, when all build pods would go to Running within a few minutes at most and run to completion. There have been bugs across releases where builds hang in various states for various reasons, but they were eventually fixed.

https://www.google.com/search?q=site%3Abugzilla.redhat.com+openshift+build+hangs

Comment 8 Avesh Agarwal 2017-10-20 16:28:04 UTC
Vikas,

I am going through the previous comments/history/logs. Meanwhile, whenever you observe/reproduce the issue, please give me access to the cluster.

Comment 10 Avesh Agarwal 2017-10-24 23:20:00 UTC
From the stuck build pod yaml

  initContainerStatuses:
  - image: registry.ops.openshift.com/openshift3/ose-sti-builder:v3.7.0-0.158.0
    imageID: ""
    lastState: {}
    name: git-clone
    ready: false
    restartCount: 0
    state:
      waiting:
        reason: PodInitializing


And from the pod listing:

# oc get pods --all-namespaces | grep -v Completed
NAMESPACE      NAME                       READY     STATUS                    RESTARTS   AGE
svt-proj-044   eap-app-3-build            0/1       Init:0/2                  0          4h
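
For comparison, the init container state of such a pod can also be pulled directly (pod and namespace names are from the listing above):

# oc get pod eap-app-3-build -n svt-proj-044 -o jsonpath='{.status.initContainerStatuses[*].state}'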

The same behaviour is observed in https://bugzilla.redhat.com/show_bug.cgi?id=1486356, just manifested in a different way, so closing this as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1486356.

Comment 11 Avesh Agarwal 2017-10-24 23:20:28 UTC

*** This bug has been marked as a duplicate of bug 1486356 ***

