Description of problem:
I started a few EAP builds on a 3.7 cluster and some of the builds got stuck in Running status. The build log shows the build completed fine, but the pod status shows Init:0/2.

$ oc get pods
NAME              READY     STATUS     RESTARTS   AGE
eap-app-1-build   0/1       Init:0/2   0          26m

$ oc get builds
NAME        TYPE      FROM          STATUS    STARTED          DURATION
eap-app-1   Source    Git@caec202   Running   26 minutes ago

Version-Release number of selected component (if applicable):
openshift v3.7.0-0.117.0
kubernetes v1.7.0+695f48a16f
etcd 3.2.1

How reproducible:
Start 20-30 concurrent builds on a 2-compute-node cluster; some of the builds get stuck.

Steps to Reproduce:
1. Create an OpenShift 3.7 cluster
2. Start concurrent EAP builds

Actual results:
Build is stuck in Running, pod shows Init:0/2.

Expected results:
Build and pod status should show Completed.

Additional info:
Attaching logs, build and pod yamls.
Created attachment 1319579 [details] build config
Created attachment 1319580 [details] pod yaml
Created attachment 1319581 [details] build logs
Created attachment 1319595 [details] describe pod
do you have the build yaml itself too?
Created attachment 1319627 [details] build yaml
The build ran successfully (as seen in the build state and build log), which means the git clone succeeded or else the build could not have completed; git-clone runs in one of the init containers, and the main pod container clearly ran as well. So the init containers did run at some point, but their state was later reverted to indicate they had not run yet. This looks like a k8s pod state/management problem.
Also, if you can recreate this with build loglevel=5, that would show us more about the init container execution.
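For reference, one way to bump the build log level (a sketch, assuming the standard BUILD_LOGLEVEL mechanism and the eap-app build config name from the attachments):

# set it on the build config's strategy environment
$ oc set env bc/eap-app BUILD_LOGLEVEL=5
# or, I believe, per build via the start-build flag
$ oc start-build eap-app --build-loglevel=5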
I agree with Ben; for some reason the init container status is being reset to:

  "state":{"waiting":{"reason":"PodInitializing"}}

This is why the oc printer is showing the status as "Init:0/2" even though the pod (init and normal containers) has run and completed successfully. Trying to figure out what would be resetting this.
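A quick way to watch for the reset (a sketch; the pod name is taken from the attachments and may differ):

$ oc get pod eap-app-1-build -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{"\t"}{.state}{"\n"}{end}'
# a completed build pod should show both init containers with a terminated state,
# not waiting/PodInitializing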
Related upstream:
https://github.com/kubernetes/kubernetes/issues/29972
https://github.com/kubernetes/kubernetes/issues/41580

It looks like there might be an issue with GC of init containers before the main pod containers exit; when the pod terminates, the init container statuses are updated to PodInitializing because the init containers have been removed.
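If that theory is right, the exited init containers should disappear from docker on the node while the pod is still around. A rough check (assuming the docker runtime and the git-clone init container name from the pod yaml):

# on the node hosting the build pod
$ docker ps -a --filter "name=git-clone"
# if this comes back empty while the pod still exists, the init containers have been GC'd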
Created attachment 1319706 [details] build logs with loglevel 5
*** Bug 1486457 has been marked as a duplicate of this bug. ***
Kube PR: https://github.com/kubernetes/kubernetes/pull/51644
Origin PR: https://github.com/openshift/origin/pull/16244
I am still able to reproduce this issue with the django quickstart app on the following version of OpenShift:

openshift v3.7.0-0.126.4
kubernetes v1.7.0+80709908fd
etcd 3.2.1

Attaching all the logs.
Created attachment 1329101 [details] pod yaml
Created attachment 1329102 [details] describe pod
Created attachment 1329103 [details] build yaml
Created attachment 1329104 [details] build log
Andrew, this is in the same vein as https://bugzilla.redhat.com/show_bug.cgi?id=1486457, which I bounced to the Build team. This one is slightly different in that the pod is showing "Init:0/2" status forever rather than "Init:Error". The init container is stuck in the Waiting state with reason PodInitializing:

initContainerStatuses:
- image: registry.ops.openshift.com/openshift3/ose-sti-builder:v3.7.0-0.126.4
  imageID: ""
  lastState: {}
  name: git-clone
  ready: false
  restartCount: 0
  state:
    waiting:
      reason: PodInitializing

The build pod succeeded as in the other BZ, but in this case the init container status is reset and the init container is NOT being run again.
I have been unable to reproduce this using versions:

openshift v3.7.0-alpha.1+848776c-719
kubernetes v1.7.0+80709908fd

@vikas - do you have more explicit instructions I can try w.r.t. running the cluster-loader?
cluster-loader is located here:
https://github.com/openshift/svt/tree/master/openshift_scalability

I am using the following config with cluster-loader to create the build configs:
https://raw.githubusercontent.com/openshift/svt/master/openshift_performance/ci/content/conc_builds_eap.yaml
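For reference, the way I run it end to end is roughly this (a sketch; the relative path to the config file is assumed from the svt repo layout):

$ git clone https://github.com/openshift/svt.git
$ cd svt/openshift_scalability
$ ./cluster-loader -f ../openshift_performance/ci/content/conc_builds_eap.yaml

This creates the projects and kicks off the concurrent eap-app builds defined in that yaml.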
@vikas, I'm invoking this as:

$ ./cluster-loader -f t

where t == your-URL-to-conc_builds_eap.yaml

Also, what user are you running this under? I was unable to reproduce using either Fedora or CentOS as the base install.
*** Bug 1499695 has been marked as a duplicate of this bug. ***
(In reply to Andrew McDermott from comment #26)

cluster-loader always needs to be run as a cluster admin user and also needs to be run from the default project. Please let me know if you want me to demo this to you.
(In reply to Vikas Laad from comment #29)
> (In reply to Andrew McDermott from comment #26)
> cluster-loader always needs to be run as a cluster admin user and also needs
> to be run from the default project. Please let me know if you want me to demo
> this to you.

Yes please. Even today (Friday PM GMT+1) would be fine if you are available.
I went through the steps I was using with Vikas today, and together we ran both the openshift_performance and openshift_scalability tests without seeing the issue. For good measure I ran the performance test 3 times - no Init issues were seen. One difference in my setup vis-a-vis the reported bug is that my openshift version is behind (though it was current at the time the bug was reported and when I stood up the cluster), and I'm deploying via KVM, using CentOS for the base install. I'll switch to RHEL 7 to see if that is the difference. @Vikas - did I miss anything here from our screenshare?
I was able to reproduce this issue again with the following version:

openshift v3.7.0-0.153.0
kubernetes v1.7.6+a08f5eeb62
etcd 3.2.8

Again, this is on a RHEL 7 cluster created on AWS.
Created attachment 1339394 [details] pod yaml on 10-16-2017
Avesh found a pick that kube had done back to 1.7 upstream to avoid GCing (init) containers from running pods. I picked it. vlaad is testing now with a custom binary to shorten the round-trip time for fix confirmation/failure.

https://github.com/openshift/origin/pull/16896
Yes, I was able to re-create.
Actually the behavior is a little different. The pod status shows:

svt-proj-35   eap-app-1-build   0/1   Init:0/2   0   2h

The build status shows:

eap-app-1   Source   Git@caec202   Pending
Vikas, unfortunately this is a recreate. Back to the drawing board..
I tried this using RHEL 7.4 as the base OS. I did not run into this bug, but I did notice the following:

$ oc get pods --all-namespaces | grep Error
proj13   eap-app-1-build   0/1   CreateContainerError   0   8m
proj19   eap-app-1-build   0/1   CreateContainerError   0   8m

The error was:

$ oc describe all | grep Error
10m   10m   1   kubelet, rhel-74-vm-4.k8s.home   spec.containers{sti-build}   Warning   Failed   Error: context deadline exceeded
    Reason:   CreateContainerError
10m   10m   1   kubelet, rhel-74-vm-4.k8s.home   spec.containers{sti-build}   Warning   Failed   Error: context deadline exceeded
7m    7m    1   kubelet, rhel-74-vm-4.k8s.home   spec.containers{sti-build}   Warning   Failed   Error: context deadline exceeded

My oc version has moved on since I first started looking into this, so this is possibly a new issue given the following version:

$ oc version
oc v3.7.0-alpha.1+a91eb4b-1195
kubernetes v1.7.6+a08f5eeb62
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://rhel-74-vm-1.k8s.home:8443
openshift v3.7.0-alpha.1+a91eb4b-1195
kubernetes v1.7.6+a08f5eeb62
When I try to recreate this, I don't get any pods in Init:0/2, but I do get quite a few that end in Error state with "error: build error: Failed to push image: unauthorized: authentication required" in the container logs:

openshift v3.7.0-0.143.7
kubernetes v1.7.0+80709908fd

$ oc get pods --all-namespaces
NAMESPACE   NAME                      READY     STATUS      RESTARTS   AGE
default     docker-registry-1-c812b   1/1       Running     0          21h
default     router-1-23hcr            1/1       Running     0          21h
proj0       eap-app-1-build           0/1       Completed   0          11m
proj1       eap-app-1-build           0/1       Completed   0          11m
proj10      eap-app-1-build           0/1       Error       0          11m
proj11      eap-app-1-build           0/1       Error       0          11m
proj12      eap-app-1-build           0/1       Completed   0          11m
proj13      eap-app-1-build           0/1       Error       0          11m
proj14      eap-app-1-build           0/1       Error       0          11m
proj15      eap-app-1-build           0/1       Completed   0          11m
proj16      eap-app-1-build           0/1       Completed   0          11m
proj17      eap-app-1-build           0/1       Error       0          11m
proj18      eap-app-1-build           0/1       Error       0          11m
proj19      eap-app-1-build           0/1       Error       0          11m
proj2       eap-app-1-build           0/1       Completed   0          11m
proj3       eap-app-1-build           0/1       Completed   0          11m
proj4       eap-app-1-build           0/1       Completed   0          11m
proj5       eap-app-1-build           0/1       Completed   0          11m
proj6       eap-app-1-build           0/1       Completed   0          11m
proj7       eap-app-1-build           0/1       Completed   0          11m
proj8       eap-app-1-build           0/1       Completed   0          11m
proj9       eap-app-1-build           0/1       Completed   0          11m

$ for i in $(oc get pod --all-namespaces | grep Error | cut -f1 -d' '); do oc logs eap-app-1-build -n $i | grep "build error"; done
error: build error: Failed to push image: unauthorized: authentication required
error: build error: Failed to push image: unauthorized: authentication required
error: build error: Failed to push image: unauthorized: authentication required
error: build error: Failed to push image: unauthorized: authentication required
error: build error: Failed to push image: unauthorized: authentication required
error: build error: Failed to push image: unauthorized: authentication required
error: build error: Failed to push image: unauthorized: authentication required

This is because the dockercfg secret is not being added to the builder pod:

$ oc describe pod eap-app-1-build -n proj0 | grep SecretName
    SecretName:  builder-dockercfg-xptk6   <-- good
    SecretName:  builder-token-w7bbs

$ oc describe pod eap-app-1-build -n proj10 | grep SecretName
    SecretName:  builder-token-337rs

Vikas, has this happened to you? I don't want to hijack the bug and might open a separate bug for this; just wondering if you have seen it.
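To confirm whether the dockercfg secret ever got created for the failing namespaces, something like this should work (a sketch using standard oc commands; proj10 is just one of the failing projects above):

$ oc get sa builder -n proj10 -o yaml | grep -A2 imagePullSecrets
$ oc get secrets -n proj10 | grep builder-dockercfg
# if the dockercfg secret exists but isn't referenced by the pod, the race is in pod creation;
# if it doesn't exist at all, the service account controller hadn't caught up yet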
No, "authentication required" is not happening to me.
Sorry for not updating the bug. The previous "authentication required" failure was due to a race with the build container starting before the controller that creates the dockercfg secret had run.

I still couldn't reproduce. I looked through the node log with the debug binary and my debug message is never hit. Even though you can recreate this, I'm not sure what we can do about it. Since https://github.com/openshift/origin/pull/16244, I have personally not been able to recreate this, and https://github.com/openshift/origin/pull/16896 guards against that recreation mode even more.

Not sure where we go with this from here. The value of a test doing 50 parallel builds on a single node can be debated; it is definitely not a use case we expect to encounter, and we've burned a lot of engineering time trying to run this down.
In the concurrent build test I am doing 25 builds on 1 node, not 50. Also, since we are running into this issue in free-stg, it seems concurrent builds are not the only trigger. Comparing with the 3.6 release, this is a new issue; there we did not have any builds getting stuck in Running.
The CI jobs are going to be running a large number of builds on nodes in parallel soon. Since this is possibly showing up in deployments as well, if the kubelet is resetting status in a way that violates the pod phase sequence diagram, this is potentially very serious (it could break stateful sets as well).
With the tight reproducer in https://github.com/openshift/origin/issues/17011, which does reproduce for me nearly every time, we should be able to make progress on this, assuming they are indeed the same issue.
*** Bug 1500498 has been marked as a duplicate of this bug. ***
Vikas, does this failure occur on every single CI run or just on some of them? I was trying to reproduce on AWS today and the first run completed OK for me, but I was wondering if CI gets into this state each and every time.
No, this does not happen on every run.
I was able to reproduce this behaviour on an AWS deploy today.

[ec2-user@ip-172-18-12-180 openshift_scalability]$ oc get pods --all-namespaces | grep Completed | wc -l
41
[ec2-user@ip-172-18-12-180 openshift_scalability]$ oc get pods --all-namespaces | grep -v Completed | wc -l
13
[ec2-user@ip-172-18-12-180 openshift_scalability]$ oc get pods --all-namespaces | grep -v Completed
NAMESPACE   NAME                       READY     STATUS       RESTARTS   AGE
default     docker-registry-1-bnt6p    1/1       Running      0          2h
default     registry-console-1-q7chn   1/1       Running      0          2h
default     router-2-69tkc             1/1       Running      0          25m
proj18      eap-app-1-build            0/1       Error        0          9m
proj19      eap-app-1-build            0/1       Init:Error   0          9m
proj22      eap-app-1-build            0/1       Init:Error   0          9m
proj26      eap-app-1-build            0/1       Error        0          9m
proj3       eap-app-1-build            0/1       Init:Error   0          10m
proj30      eap-app-1-build            0/1       Init:0/2     0          9m
proj37      eap-app-1-build            0/1       Init:Error   0          9m
proj4       eap-app-1-build            0/1       Init:Error   0          10m
proj8       eap-app-1-build            0/1       Init:Error   0          10m

The difference today was that I added the following (which comes from the CI inventory):

openshift_node_kubelet_args={"pods-per-core": ["0"], "max-pods": ["510"], "minimum-container-ttl-duration": ["10s"], "maximum-dead-containers-per-container": ["1"], "maximum-dead-containers": ["20"], "image-gc-high-threshold": ["80"], "image-gc-low-threshold": ["70"]}

I had tried this before on a local VM and it didn't happen, so I convinced myself it was not significant. Perhaps the combination of running in the cloud _and_ those values is enough to trigger the failure.

However, with regard to behaviour, do these parameters make sense for real-world deployments? Are they a hangover which could be culled from the CI job? Can we cull them from the CI job to see if this problem disappears, and if so, resurrect them to convince us the problem reappears and these settings represent the root cause of the failure?
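To double-check which of these actually land on the node, the rendered node config can be inspected (a sketch; /etc/origin/node/node-config.yaml is the usual location for openshift-ansible installs):

$ grep -A 20 kubeletArguments /etc/origin/node/node-config.yaml
# the GC-related settings of interest here are maximum-dead-containers,
# maximum-dead-containers-per-container and minimum-container-ttl-duration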
And, as Seth pointed out, it looks like these parameters are deprecated too:
https://github.com/kubernetes/kubernetes/blob/master/cmd/kubelet/app/options/options.go#L308-L325
My very next run, without changing anything, was successful, so this just confirms what Vikas said in comment #54 (it does not happen on every run).
Andrew, can you pick these two to origin master and open a PR:

https://github.com/kubernetes/kubernetes/pull/54593 (possible fix)
https://github.com/kubernetes/kubernetes/pull/54597 (illegal transition check)

Can you build the patched openshift binary, put it on the AWS cluster you got a recreate on, and try to recreate again? Also, please monitor the node logs for the "attempted illegal transition" message.

Might pick those to openshift/ose master as well and build some binaries for Vikas to put on his cluster and try a recreate.
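A rough outline of the custom-binary step (a sketch; the branch name is hypothetical and the output path assumes the usual origin build layout):

$ cd $GOPATH/src/github.com/openshift/origin
$ git checkout -b init-status-fix master   # hypothetical branch name
# apply the two upstream patches under vendor/k8s.io/kubernetes as UPSTREAM picks, then:
$ make build
# copy _output/local/bin/linux/amd64/openshift to the node(s), restart the node service,
# and watch the node logs for "attempted illegal transition"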
Origin PRs (already merged):
https://github.com/openshift/origin/pull/17042
https://github.com/openshift/origin/pull/17051
I tested for quite some time today with a build that had the PRs mentioned in comment #59. I did not see any cases where the pod status was left at Init:0/2. This was on the same cluster I was using yesterday, where I got my one and only repro. I reverted to the originally deployed openshift binaries (from yesterday) but was unable to repro, which makes this a bit inconclusive.

I did run into occasional build errors which look like:

$ oc logs -f bc/eap-app
Cloning "https://github.com/jboss-openshift/openshift-quickstarts" ...
error: fatal: destination path '/tmp/build/inputs' already exists and is not an empty directory.

I have a vague recollection that we have seen this before -- an init container needed to be more idempotent with regard to its actions.
The build init container re-entrancy issue is https://bugzilla.redhat.com/show_bug.cgi?id=1486457
Regarding the build init container issues I was seeing yesterday, I spent some time testing with a ":latest" version of the openshift-sti-builder image today. Using the ansible installer and an inventory that uses "deployment=origin", I noticed I get an sti-builder image that is 8 weeks old. After pulling the latest image and re-pointing the existing image tag at today's latest, I have been able to run 50 concurrent builds repeatedly without issue. No "Init:0/2" errors and no other errors were reported; each time I ran 50 concurrent builds they all went to "Completed". It's not immediately obvious to me, given my fairly vanilla inventory, why I get such an old openshift-sti-builder image.
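The retagging was along these lines (a sketch; the exact image name and existing tag depend on the install, so treat these as placeholders):

# on each node
$ docker pull docker.io/openshift/origin-sti-builder:latest
$ docker tag docker.io/openshift/origin-sti-builder:latest openshift/origin-sti-builder:v3.7.0   # hypothetical existing tag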
Completed multiple cycles of EAP concurrent builds and did not see this issue again.

openshift v3.7.0-0.184.0
kubernetes v1.7.6+a08f5eeb62
etcd 3.2.8
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:3188