Bug 1486457 - Many builds fail due to FetchSourceFailed
Summary: Many builds fail due to FetchSourceFailed
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Build
Version: 3.7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.7.0
Assignee: Ben Parees
QA Contact: Vikas Laad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-08-29 19:38 UTC by Vikas Laad
Modified: 2017-11-01 13:41 UTC
CC: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-11-01 13:41:27 UTC
Target Upstream Version:
Embargoed:


Attachments
build yaml (3.28 KB, text/plain)
2017-08-29 19:41 UTC, Vikas Laad
pod yaml (26.75 KB, text/plain)
2017-08-29 19:41 UTC, Vikas Laad
describe pod (12.04 KB, text/plain)
2017-08-29 19:43 UTC, Vikas Laad
new pod yaml (25.80 KB, text/plain)
2017-09-27 15:15 UTC, Vikas Laad

Description Vikas Laad 2017-08-29 19:38:49 UTC
Description of problem:
Starting 40 concurrent EAP builds results in around 15-20 of them failing due to FetchSourceFailed.

oc get build
NAME        TYPE      FROM          STATUS                       STARTED       DURATION
eap-app-1   Source    Git@caec202   Failed (FetchSourceFailed)   2 hours ago   2m19s

oc get pods
NAME              READY     STATUS       RESTARTS   AGE
eap-app-1-build   0/1       Init:Error   0          1h

Version-Release number of selected component (if applicable):
openshift v3.7.0-0.117.0
kubernetes v1.7.0+695f48a16f
etcd 3.2.1


How reproducible:
Always

Steps to Reproduce:
1. Start 30-40 concurrent builds on 2 compute nodes
2. 15-20 builds will fail due to this error
3. 3.6 had all builds pass

Actual results:
Build fails due to FetchSourceFailed

Expected results:
All builds should pass

Additional info:
Attaching logs and other information.

Comment 1 Vikas Laad 2017-08-29 19:41:00 UTC
Created attachment 1319762 [details]
build yaml

Comment 2 Vikas Laad 2017-08-29 19:41:17 UTC
Created attachment 1319763 [details]
pod yaml

Comment 3 Vikas Laad 2017-08-29 19:43:04 UTC
Created attachment 1319764 [details]
describe pod

Comment 4 Vikas Laad 2017-08-29 19:43:50 UTC
oc logs for the pod returns following error
Error from server (BadRequest): container "sti-build" in pod "eap-app-1-build" is waiting to start: PodInitializing

Comment 5 Vikas Laad 2017-08-29 19:54:00 UTC
root@ip-172-31-36-253: ~/FetchSourceFailed # oc logs eap-app-1-build -c git-clone
I0829 17:44:51.550054       1 builder.go:73] redacted build: {"kind":"Build","apiVersion":"v1","metadata":{"name":"eap-app-1","namespace":"proj0","selfLink":"/apis/build.openshift.io/v1/namespaces/proj0/builds/eap-app-1","uid":"6fe4f0c9-8ce1-11e7-bc47-028cf5771d52","resourceVersion":"74637","creationTimestamp":"2017-08-29T17:42:33Z","labels":{"application":"eap-app","buildconfig":"eap-app","openshift.io/build-config.name":"eap-app","openshift.io/build.start-policy":"Serial","template":"eap64-mysql-persistent-s2i","xpaas":"1.3.2"},"annotations":{"openshift.io/build-config.name":"eap-app","openshift.io/build.number":"1"},"ownerReferences":[{"apiVersion":"build.openshift.io/v1","kind":"BuildConfig","name":"eap-app","uid":"6fdf86ce-8ce1-11e7-bc47-028cf5771d52","controller":true}]},"spec":{"serviceAccount":"builder","source":{"type":"Git","git":{"uri":"https://github.com/jboss-openshift/openshift-quickstarts","ref":"1.2"},"contextDir":"todolist/todolist-jdbc"},"strategy":{"type":"Source","sourceStrategy":{"from":{"kind":"DockerImage","name":"registry.access.redhat.com/jboss-eap-6/eap64-openshift@sha256:416eec8d735883757f5e7f812a97ad0dbb38717050d229b0613c6815523654c9"},"env":[{"name":"BUILD_LOGLEVEL","value":"5"}],"forcePull":true}},"output":{"to":{"kind":"DockerImage","name":"docker-registry.default.svc:5000/proj0/eap-app:latest"},"pushSecret":{"name":"builder-dockercfg-bftwj"}},"resources":{},"postCommit":{},"nodeSelector":null,"triggeredBy":[{"message":"Image change","imageChangeBuild":{"imageID":"registry.access.redhat.com/jboss-eap-6/eap64-openshift@sha256:416eec8d735883757f5e7f812a97ad0dbb38717050d229b0613c6815523654c9","fromRef":{"kind":"ImageStreamTag","namespace":"openshift","name":"jboss-eap64-openshift:latest"}}}]},"status":{"phase":"New","outputDockerImageReference":"docker-registry.default.svc:5000/proj0/eap-app:latest","config":{"kind":"BuildConfig","namespace":"proj0","name":"eap-app"},"output":{}}}
I0829 17:44:51.550397       1 builder.go:84] Master version "v3.7.0-0.117.0", Builder version "v3.7.0-0.117.0"
Cloning "https://github.com/jboss-openshift/openshift-quickstarts" ...
I0829 17:44:51.550883       1 source.go:161] git ls-remote --heads https://github.com/jboss-openshift/openshift-quickstarts
I0829 17:44:51.550899       1 repository.go:385] Executing git ls-remote --heads https://github.com/jboss-openshift/openshift-quickstarts
I0829 17:44:52.031515       1 source.go:164] 292cc9ab9332bc4d3521f6902d905c267fabef76   refs/heads/1.1
caec20220374804b2cb3d3622a754f9091af7c57        refs/heads/1.2
819af3dcde4ef54f6b58181eec018a7efef5dfb0        refs/heads/1.3
a2ad21169f43bb47a17ffb0fb62f62dc8823f947        refs/heads/master
I0829 17:44:52.031552       1 source.go:262] Cloning source from https://github.com/jboss-openshift/openshift-quickstarts
I0829 17:44:52.031575       1 repository.go:385] Executing git clone https://github.com/jboss-openshift/openshift-quickstarts /tmp/build/inputs
I0829 17:44:52.043481       1 repository.go:445] Error executing command: exit status 128
F0829 17:44:52.104035       1 helpers.go:120] error: fatal: destination path '/tmp/build/inputs' already exists and is not an empty directory.

Comment 6 Ben Parees 2017-08-29 20:18:31 UTC
So the git-clone failed because the emptydir that we create in the pod was not...empty.

Looking at the pod events from https://bugzilla.redhat.com/attachment.cgi?id=1319764

I see two theories:

1) the git-clone init container was run twice.  I suspect this because you can see that both the git-clone and manage-dockerfile init containers ran.  If the logs we have from the git-clone init container are accurate, then it failed, and the manage-dockerfile init container should not have run, yet it did.  That would imply that the following sequence occurred:

  - git-clone runs successfully
  - manage-dockerfile runs successfully
  - git-clone gets run again.  This time it fails because the dir isn't empty anymore, and that's the only run we can get logs from.


2) the init containers are being run out of order/in parallel (and thus the manage-dockerfile initcontainer populated the emptydir before the git-clone initcontainer ran).  This is less likely since even if manage-dockerfile ran, it shouldn't have touched the emptydir given this particular buildconfig.

Seth, can you confirm that:

init containers run serially
init containers run in the order defined in the pod
after one init container fails, subsequent init containers will not be run?


Theory (1) may also be related to the existing bug in which we see init container status get reset back to "initializing" after the init container ran successfully.

Comment 7 Seth Jennings 2017-08-29 20:37:46 UTC
Yes, I think this could be related to the other bug where the first init container is being removed while the second init container runs.  Then when the second init container completes, the status on the first has been reset, causing the kubelet to start it again, leading to the error.

Like the other bug, I think the fix is to not update the status of init containers that have terminated and are thus subject to removal.

Comment 8 Seth Jennings 2017-08-30 18:17:41 UTC

*** This bug has been marked as a duplicate of bug 1486356 ***

Comment 9 Vikas Laad 2017-09-20 19:34:19 UTC
reopening this bz, it is happening even after https://bugzilla.redhat.com/show_bug.cgi?id=1486356 has been fixed.

To re-create this bug I am creating 50 EAP builds using cluster loader (https://github.com/openshift/svt/tree/master/openshift_scalability) with this template: https://raw.githubusercontent.com/openshift/svt/master/openshift_performance/ci/content/conc_builds_eap.yaml

Please let me know if you have any questions.

Comment 10 Seth Jennings 2017-09-27 13:52:57 UTC
Vikas, just to verify, did you recreate this exact issue?  Namely, that the init container is re-running and erroring because the emptyDir is not empty?

Would be helpful to have yaml from the failed pod and node logs.

Comment 11 Vikas Laad 2017-09-27 15:14:36 UTC
Here is what I am getting 

 oc logs eap-app-2-build -c git-clone -n proj47
Cloning "https://github.com/jboss-openshift/openshift-quickstarts" ...
error: fatal: destination path '/tmp/build/inputs' already exists and is not an empty directory.


root@ip-172-31-24-181: ~ # openshift version
openshift v3.7.0-0.126.4
kubernetes v1.7.0+80709908fd
etcd 3.2.1

will attach yaml and node logs.

Comment 12 Vikas Laad 2017-09-27 15:15:16 UTC
Created attachment 1331467 [details]
new pod yaml

Comment 14 Seth Jennings 2017-09-27 20:57:48 UTC
The logs verify that the sequence was git-clone (init), manage-dockerfile (init), sti-build (main container), all of which succeeded.  However, the git-clone init container is re-executed due to a corruption of its status, caused by GC (I think).  That makes the overall status of the pod Failed, and the build, keying off the pod status, indicates that it failed fetching the source (the second time).

Comment 15 Seth Jennings 2017-09-29 21:09:29 UTC
It is unknown why the init container is being re-run; however, the kubelet is within its rights to do so for any reason according to the docs:

https://kubernetes.io/docs/concepts/workloads/pods/init-containers/

"Because Init Containers can be restarted, retried, or re-executed, Init Container code should be idempotent. In particular, code that writes to files on EmptyDirs should be prepared for the possibility that an output file already exists."

The git-clone init container is currently not idempotent, and it fails when being re-run:

journal: Cloning "https://github.com/jboss-openshift/openshift-quickstarts" ...
journal: error: fatal: destination path '/tmp/build/inputs' already exists and is not an empty directory.

Build team should fix the git-clone init container to be idempotent.

Comment 16 Ben Parees 2017-09-29 21:57:15 UTC
We can do that, but the restart policy for the pod is RestartNever.  I would view it as a kube bug if containers are being restarted.

"If an Init Container fails for a Pod, Kubernetes restarts the Pod repeatedly until the Init Container succeeds. However, if the Pod has a restartPolicy of Never, it is not restarted."
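For reference, the build pod in the attached yaml has roughly this shape (condensed, with fields elided); with restartPolicy: Never, a failed init container is supposed to fail the pod outright rather than be restarted:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: eap-app-1-build
spec:
  restartPolicy: Never
  initContainers:
  - name: git-clone          # fetches source into the shared emptyDir
  - name: manage-dockerfile
  containers:
  - name: sti-build          # main build container
  volumes:
  - name: buildworkdir
    emptyDir: {}
```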

Comment 17 Clayton Coleman 2017-09-29 22:02:24 UTC
You're both right.  A restart never pod shouldn't get restarted if it fails.  git-clone should be re-entrant.

Comment 18 Seth Jennings 2017-09-29 22:05:33 UTC
Pod team is still tracking the undesired reset of init container statuses in https://bugzilla.redhat.com/show_bug.cgi?id=1486356 so we are continuing to look into that.

Comment 19 Ben Parees 2017-09-29 22:07:17 UTC
Making git-clone re-entrant here:
https://github.com/openshift/origin/pull/16624

the other initcontainers should be re-entrant already (they blindly overwrite files or use unique tempdirs), so I think it was only git-clone that would fail if there was content already there.

Comment 20 openshift-github-bot 2017-09-30 14:41:55 UTC
Commit pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/f0f7ce9063aa3e49d20cacef954a31e28daf5e28
Wipe out any existing content in the workdir before cloning

bug 1486457

https://bugzilla.redhat.com/show_bug.cgi?id=1486457

Comment 22 Vikas Laad 2017-10-31 13:17:44 UTC
Completed a full cycle of the concurrent build test (50 concurrent builds on 2 compute nodes) for all the quickstart apps. Did not see this issue in the following version:

openshift v3.7.0-0.184.0
kubernetes v1.7.6+a08f5eeb62
etcd 3.2.8

