Bug 1486457
Summary: | Many builds fail due to FetchSourceFailed | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Vikas Laad <vlaad> | ||||||||||
Component: | Build | Assignee: | Ben Parees <bparees> | ||||||||||
Status: | CLOSED CURRENTRELEASE | QA Contact: | Vikas Laad <vlaad> | ||||||||||
Severity: | medium | Docs Contact: | |||||||||||
Priority: | unspecified | ||||||||||||
Version: | 3.7.0 | CC: | aos-bugs, ccoleman, jeder, jokerman, mifiedle, mmccomas, sjenning, vlaad, xtian | ||||||||||
Target Milestone: | --- | Keywords: | Reopened | ||||||||||
Target Release: | 3.7.0 | ||||||||||||
Hardware: | Unspecified | ||||||||||||
OS: | Unspecified | ||||||||||||
Whiteboard: | |||||||||||||
Fixed In Version: | Doc Type: | If docs needed, set a value | |||||||||||
Doc Text: | Story Points: | --- | |||||||||||
Clone Of: | Environment: | ||||||||||||
Last Closed: | 2017-11-01 13:41:27 UTC | Type: | Bug | ||||||||||
Regression: | --- | Mount Type: | --- | ||||||||||
Documentation: | --- | CRM: | |||||||||||
Verified Versions: | Category: | --- | |||||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||||
Embargoed: | |||||||||||||
Attachments: |
|
Description
Vikas Laad
2017-08-29 19:38:49 UTC
Created attachment 1319762 [details]
build yaml
Created attachment 1319763 [details]
pod yaml
Created attachment 1319764 [details]
describe pod
oc logs for the pod returns following error Error from server (BadRequest): container "sti-build" in pod "eap-app-1-build" is waiting to start: PodInitializing root@ip-172-31-36-253: ~/FetchSourceFailed # oc logs eap-app-1-build -c git-clone I0829 17:44:51.550054 1 builder.go:73] redacted build: {"kind":"Build","apiVersion":"v1","metadata":{"name":"eap-app-1","namespace":"proj0","selfLink":"/apis/build.openshift.io/v1/namespaces/proj0/builds/eap-app-1","uid":"6fe4f0c9-8ce1-11e7-bc47-028cf5771d52","resourceVersion":"74637","creationTimestamp":"2017-08-29T17:42:33Z","labels":{"application":"eap-app","buildconfig":"eap-app","openshift.io/build-config.name":"eap-app","openshift.io/build.start-policy":"Serial","template":"eap64-mysql-persistent-s2i","xpaas":"1.3.2"},"annotations":{"openshift.io/build-config.name":"eap-app","openshift.io/build.number":"1"},"ownerReferences":[{"apiVersion":"build.openshift.io/v1","kind":"BuildConfig","name":"eap-app","uid":"6fdf86ce-8ce1-11e7-bc47-028cf5771d52","controller":true}]},"spec":{"serviceAccount":"builder","source":{"type":"Git","git":{"uri":"https://github.com/jboss-openshift/openshift-quickstarts","ref":"1.2"},"contextDir":"todolist/todolist-jdbc"},"strategy":{"type":"Source","sourceStrategy":{"from":{"kind":"DockerImage","name":"registry.access.redhat.com/jboss-eap-6/eap64-openshift@sha256:416eec8d735883757f5e7f812a97ad0dbb38717050d229b0613c6815523654c9"},"env":[{"name":"BUILD_LOGLEVEL","value":"5"}],"forcePull":true}},"output":{"to":{"kind":"DockerImage","name":"docker-registry.default.svc:5000/proj0/eap-app:latest"},"pushSecret":{"name":"builder-dockercfg-bftwj"}},"resources":{},"postCommit":{},"nodeSelector":null,"triggeredBy":[{"message":"Image change","imageChangeBuild":{"imageID":"registry.access.redhat.com/jboss-eap-6/eap64-openshift@sha256:416eec8d735883757f5e7f812a97ad0dbb38717050d229b0613c6815523654c9","fromRef":{"kind":"ImageStreamTag","namespace":"openshift","name":"jboss-eap64-openshift:latest"}}}]},"status":{"phase":"New","outputDockerImageReference":"docker-registry.default.svc:5000/proj0/eap-app:latest","config":{"kind":"BuildConfig","namespace":"proj0","name":"eap-app"},"output":{}}} I0829 17:44:51.550397 1 builder.go:84] Master version "v3.7.0-0.117.0", Builder version "v3.7.0-0.117.0" Cloning "https://github.com/jboss-openshift/openshift-quickstarts" ... I0829 17:44:51.550883 1 source.go:161] git ls-remote --heads https://github.com/jboss-openshift/openshift-quickstarts I0829 17:44:51.550899 1 repository.go:385] Executing git ls-remote --heads https://github.com/jboss-openshift/openshift-quickstarts I0829 17:44:52.031515 1 source.go:164] 292cc9ab9332bc4d3521f6902d905c267fabef76 refs/heads/1.1 caec20220374804b2cb3d3622a754f9091af7c57 refs/heads/1.2 819af3dcde4ef54f6b58181eec018a7efef5dfb0 refs/heads/1.3 a2ad21169f43bb47a17ffb0fb62f62dc8823f947 refs/heads/master I0829 17:44:52.031552 1 source.go:262] Cloning source from https://github.com/jboss-openshift/openshift-quickstarts I0829 17:44:52.031575 1 repository.go:385] Executing git clone https://github.com/jboss-openshift/openshift-quickstarts /tmp/build/inputs I0829 17:44:52.043481 1 repository.go:445] Error executing command: exit status 128 F0829 17:44:52.104035 1 helpers.go:120] error: fatal: destination path '/tmp/build/inputs' already exists and is not an empty directory. So the git-clone failed because the emptydir that we create in the pod was not...empty. Looking at the podevents from https://bugzilla.redhat.com/attachment.cgi?id=1319764 I see two theories: 1) the git-clone init container was run twice. I suspect this because you can see the git-clone init container ran, and the manage-dockerfile init container ran. If the logs we have from the git-clone initcontainer are accurate, then the git-clone initcontainer failed and the manage-dockerfile init container should not have been run, yet it was run. That would imply to be that the following sequence occurred: - git-clone runs successfully - manage-dockerfile runs successfully - git-clone gets run again. This time it fails because the dir isn't empty anymore, and that's the only run we can get logs from. 2) the init containers are being run out of order/in parallel (and thus the manage-dockerfile initcontainer populated the emptydir before the git-clone initcontainer ran). This is less likely since even if manage-dockerfile ran, it shouldn't have touched the emptydir given this particular buildconfig. Seth, can you confirm that: init containers run serially init containers run in the order defined in the pod after one init container fails, subsequent init containers will not be fun? Theory (1) may also be related to the existing bug in which we see init container status get sent back to "initializing" after the initcontainer ran successfully. Yes, I think this could be related to the other bug where the first init container is being removed while the second init container runs. Then when the second init container completes, the status on the first has been reset, causing the kubelet to start it again, leading to the error. Like the other bug, I think the fix is not updating the status for init containers that have terminated and are thus subject to removal. *** This bug has been marked as a duplicate of bug 1486356 *** reopening this bz, it is happening even after https://bugzilla.redhat.com/show_bug.cgi?id=1486356 has been fixed. To re-create this bug I am creating 50 eap builds using cluster loader (https://github.com/openshift/svt/tree/master/openshift_scalability) and using this template with it https://raw.githubusercontent.com/openshift/svt/master/openshift_performance/ci/content/conc_builds_eap.yaml Please let me know if you have any questions. Vikas, just to verify, did you recreate this exact issue? Namely that the init container is re-running and erroring because the empty_dir is not empty? Would be helpful to have yaml from the failed pod and node logs. Here is what I am getting oc logs eap-app-2-build -c git-clone -n proj47 Cloning "https://github.com/jboss-openshift/openshift-quickstarts" ... error: fatal: destination path '/tmp/build/inputs' already exists and is not an empty directory. root@ip-172-31-24-181: ~ # openshift version openshift v3.7.0-0.126.4 kubernetes v1.7.0+80709908fd etcd 3.2.1 will attach yaml and node logs. Created attachment 1331467 [details]
new pod yaml
The logs verify that the process was git-clone (init), manage-dockerfile (init), sti-build (main container), which all succeeds. However the git-clone init container is re-executed due to a corruption of it's status due to GC (I think) causing the overall status of the pod to be failed and the build, keying of the pod status, indicates that the build failed fetching the source (the second time). It is unknown why the init container is being re-run; however, the kubelet is within its rights to do so for any reason according to the docs: https://kubernetes.io/docs/concepts/workloads/pods/init-containers/ "Because Init Containers can be restarted, retried, or re-executed, Init Container code should be idempotent. In particular, code that writes to files on EmptyDirs should be prepared for the possibility that an output file already exists." The git-clone init container is currently not idempotent, and it fails when being re-run: journal: Cloning "https://github.com/jboss-openshift/openshift-quickstarts" ... journal: error: fatal: destination path '/tmp/build/inputs' already exists and is not an empty directory. Build team should fix the git-clone init container to be idempotent. We can do that, but the restart policy for the pod is RestartNever. I would view it as a kube bug if containers are being restarted. "If an Init Container fails for a Pod, Kubernetes restarts the Pod repeatedly until the Init Container succeeds. However, if the Pod has a restartPolicy of Never, it is not restarted." You're both right. A restart never pod shouldn't get restarted if it fails. git-clone should be re-entrant. Pod team is still tracking the undesired reset of init container statuses in https://bugzilla.redhat.com/show_bug.cgi?id=1486356 so we are continuing to look into that. Making git-clone re-entrant here: https://github.com/openshift/origin/pull/16624 the other initcontainers should be re-entrant already (they blindly overwrite files or use unique tempdirs), so I think it was only git-clone that would fail if there was content already there. Commit pushed to master at https://github.com/openshift/origin https://github.com/openshift/origin/commit/f0f7ce9063aa3e49d20cacef954a31e28daf5e28 Wipe out any existing content in the workdir before cloning bug 1486457 https://bugzilla.redhat.com/show_bug.cgi?id=1486457 Completed a full cycle of concurrent build test (50 concurrent builds on 2 compute nodes) for all the quickstart apps. Did not see this issue in following version openshift v3.7.0-0.184.0 kubernetes v1.7.6+a08f5eeb62 etcd 3.2.8 |