Bug 1486356 - Build stuck in Running Pod status shows Init:0/2
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.7.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.7.0
Assignee: Andrew McDermott
QA Contact: Vikas Laad
URL:
Whiteboard:
Duplicates: 1499695 1500498 (view as bug list)
Depends On:
Blocks: 1524460
 
Reported: 2017-08-29 14:37 UTC by Vikas Laad
Modified: 2018-01-03 23:00 UTC
CC List: 16 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Container state was transitioning from the terminal phases Succeeded or Failed back to Pending, due to a bug in the kubelet. Consequence: This invalid phase transition could destroy user data if a controller assumes that pods do not regress in phase (which is what happened in the OpenShift case). Fix: A state-transition check was added to the kubelet status manager to detect and block illegal transitions, namely from a terminated to a non-terminated state. This check only applies when a pod's RestartPolicy is set to Never. Result: Pods whose RestartPolicy is Never no longer regress from a terminal phase, so controllers that depend on the pod phase sequence see consistent state.
Clone Of:
Clones: 1524460 (view as bug list)
Environment:
Last Closed: 2017-11-28 22:08:20 UTC
Target Upstream Version:
Embargoed:


Attachments
build config (1.17 KB, text/plain), 2017-08-29 14:40 UTC, Vikas Laad
pod yaml (24.81 KB, text/plain), 2017-08-29 14:40 UTC, Vikas Laad
build logs (55.81 KB, text/plain), 2017-08-29 14:40 UTC, Vikas Laad
describe pod (11.82 KB, text/plain), 2017-08-29 14:48 UTC, Vikas Laad
build yaml (3.55 KB, text/plain), 2017-08-29 15:55 UTC, Vikas Laad
build logs with loglevel 5 (84.20 KB, text/plain), 2017-08-29 17:54 UTC, Vikas Laad
pod yaml (23.68 KB, text/plain), 2017-09-21 17:17 UTC, Vikas Laad
describe pod (11.32 KB, text/plain), 2017-09-21 17:18 UTC, Vikas Laad
build yaml (3.02 KB, text/plain), 2017-09-21 17:18 UTC, Vikas Laad
build log (93.87 KB, text/plain), 2017-09-21 17:18 UTC, Vikas Laad
pod yaml on 10-16-2017 (24.40 KB, text/plain), 2017-10-16 17:50 UTC, Vikas Laad


Links
Red Hat Knowledge Base (Solution) 3307941 (last updated 2018-01-03 23:00:19 UTC)
Red Hat Product Errata RHSA-2017:3188, SHIPPED_LIVE: Moderate: Red Hat OpenShift Container Platform 3.7 security, bug, and enhancement update (2017-11-29 02:34:54 UTC)

Description Vikas Laad 2017-08-29 14:37:49 UTC
Description of problem:
I started a few EAP builds on a 3.7 cluster; some of the builds got stuck in Running status. The build logs show the build completed fine, but the pod status shows Init:0/2.

oc get pods
NAME              READY     STATUS     RESTARTS   AGE
eap-app-1-build   0/1       Init:0/2   0          26m

oc get builds
NAME        TYPE      FROM          STATUS    STARTED          DURATION
eap-app-1   Source    Git@caec202   Running   26 minutes ago   

Version-Release number of selected component (if applicable):
openshift v3.7.0-0.117.0
kubernetes v1.7.0+695f48a16f
etcd 3.2.1

How reproducible:
Start 20-30 concurrent builds on a 2-compute-node cluster; some of the builds get stuck.

Steps to Reproduce:
1. Create an OpenShift 3.7 cluster
2. Start concurrent EAP builds

Actual results:
Build stuck in Running, pod shows Init:0/2

Expected results:
Build and pod status should show Completed.

Additional info:
Attaching logs, build and pod yamls.

Comment 1 Vikas Laad 2017-08-29 14:40:16 UTC
Created attachment 1319579 [details]
build config

Comment 2 Vikas Laad 2017-08-29 14:40:32 UTC
Created attachment 1319580 [details]
pod yaml

Comment 3 Vikas Laad 2017-08-29 14:40:50 UTC
Created attachment 1319581 [details]
build logs

Comment 4 Vikas Laad 2017-08-29 14:48:48 UTC
Created attachment 1319595 [details]
describe pod

Comment 5 Ben Parees 2017-08-29 15:50:18 UTC
do you have the build yaml itself too?

Comment 6 Vikas Laad 2017-08-29 15:55:59 UTC
Created attachment 1319627 [details]
build yaml

Comment 7 Ben Parees 2017-08-29 15:56:43 UTC
The build ran successfully (as seen in the build state and build log), which means the git clone succeeded, or else the build could not have succeeded. Since git-clone runs in one of the init containers, and since the main pod container did run, it seems the init containers did run at some point, but their state was later reverted to indicate they had not run yet.

Seems like a k8s pod state/management problem.

Comment 8 Ben Parees 2017-08-29 15:57:58 UTC
Also if you can recreate this with build loglevel=5 that would show us more about the init container execution.

Comment 9 Seth Jennings 2017-08-29 17:44:16 UTC
I agree with Ben; for some reason the init container status is being reset to:

"state":{"waiting":{"reason":"PodInitializing"}}

This is why the oc printer shows the status as "Init:0/2" even though the pod's containers (init and normal) have run and completed successfully.

Trying to figure out what would be resetting this.
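To make the symptom concrete, here is a minimal sketch of how a printer like oc/kubectl derives the "Init:&lt;done&gt;/&lt;total&gt;" column from initContainerStatuses. This is not the actual kubectl printer code; the types and field names are simplified stand-ins for illustration only.

```go
package main

import "fmt"

// ContainerState is a trimmed-down stand-in for the "state" field of an
// entry in a pod's initContainerStatuses (only what this sketch needs).
type ContainerState struct {
	WaitingReason      string // non-empty => waiting (e.g. "PodInitializing")
	Terminated         bool
	TerminatedExitCode int
}

// initStatus loosely mimics the printer: count init containers that
// terminated successfully and report them against the total.
func initStatus(states []ContainerState) string {
	done := 0
	for _, s := range states {
		if s.Terminated && s.TerminatedExitCode == 0 {
			done++
		}
	}
	return fmt.Sprintf("Init:%d/%d", done, len(states))
}

func main() {
	// Both init container statuses were reset to waiting/PodInitializing,
	// as in this bug, so the pod prints Init:0/2 forever.
	reset := []ContainerState{
		{WaitingReason: "PodInitializing"},
		{WaitingReason: "PodInitializing"},
	}
	fmt.Println(initStatus(reset)) // Init:0/2
}
```

So whatever resets the statuses to waiting/PodInitializing drags the printed column back to 0/2, regardless of what actually ran.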

Comment 10 Seth Jennings 2017-08-29 17:53:02 UTC
Related upstream:
https://github.com/kubernetes/kubernetes/issues/29972
https://github.com/kubernetes/kubernetes/issues/41580

Looks like there might be an issue with GC of init containers before the main pod containers exit; when the pod terminates, the init container statuses are updated to PodInitializing because the init containers have already been removed.
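The direction of the eventual upstream mitigation is to never garbage-collect containers (including init containers) whose pod is still active, since removing them loses the status the kubelet later needs to report. A minimal sketch of that filtering idea, with hypothetical types (not the actual kubelet GC code):

```go
package main

import "fmt"

// Container is a minimal stand-in for a dead container the kubelet's GC
// considers evicting; PodActive means the owning pod is still running.
type Container struct {
	Name      string
	PodActive bool
}

// evictable sketches the fix direction: skip containers whose pod is
// still active so their statuses survive until the pod terminates.
func evictable(cs []Container) []string {
	var out []string
	for _, c := range cs {
		if c.PodActive {
			continue // keep: pod still running, its statuses are still needed
		}
		out = append(out, c.Name)
	}
	return out
}

func main() {
	cs := []Container{
		{Name: "git-clone", PodActive: true}, // init container of a running pod
		{Name: "old-build", PodActive: false},
	}
	fmt.Println(evictable(cs)) // [old-build]
}
```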

Comment 11 Vikas Laad 2017-08-29 17:54:22 UTC
Created attachment 1319706 [details]
build logs with loglevel 5

Comment 12 Seth Jennings 2017-08-30 18:17:41 UTC
*** Bug 1486457 has been marked as a duplicate of this bug. ***

Comment 13 Seth Jennings 2017-08-30 18:18:12 UTC
Kube PR:
https://github.com/kubernetes/kubernetes/pull/51644

Comment 14 Seth Jennings 2017-09-08 13:48:49 UTC
Origin PR:
https://github.com/openshift/origin/pull/16244

Comment 16 Vikas Laad 2017-09-21 17:17:06 UTC
I am still able to reproduce this issue with the django quickstart app on the following version of OpenShift:

openshift v3.7.0-0.126.4
kubernetes v1.7.0+80709908fd
etcd 3.2.1

Attaching all the logs.

Comment 17 Vikas Laad 2017-09-21 17:17:46 UTC
Created attachment 1329101 [details]
pod yaml

Comment 18 Vikas Laad 2017-09-21 17:18:12 UTC
Created attachment 1329102 [details]
describe pod

Comment 19 Vikas Laad 2017-09-21 17:18:37 UTC
Created attachment 1329103 [details]
build yaml

Comment 20 Vikas Laad 2017-09-21 17:18:57 UTC
Created attachment 1329104 [details]
build log

Comment 23 Seth Jennings 2017-09-29 21:15:08 UTC
Andrew, this is in the same vein as https://bugzilla.redhat.com/show_bug.cgi?id=1486457, which I bounced to Build team. This one is slightly different in that the pod is showing "Init:0/2" status forever rather than "Init:Error".

The init container is stuck in Waiting state with reason PodInitializing

initContainerStatuses:
  - image: registry.ops.openshift.com/openshift3/ose-sti-builder:v3.7.0-0.126.4
    imageID: ""
    lastState: {}
    name: git-clone
    ready: false
    restartCount: 0
    state:
      waiting:
        reason: PodInitializing

The build pod has succeeded, as in the other BZ, but in this case the init container status was reset and the init container is NOT being run again.

Comment 24 Andrew McDermott 2017-10-03 17:01:14 UTC
I have been unable to reproduce this using versions:

  openshift v3.7.0-alpha.1+848776c-719
  kubernetes v1.7.0+80709908fd

@vikas - do you have more explicit instructions I can try w.r.t running the cluster-loader?

Comment 25 Vikas Laad 2017-10-03 17:10:48 UTC
cluster-loader is located here https://github.com/openshift/svt/tree/master/openshift_scalability

I am using the following config with cluster-loader to create the build configs: https://raw.githubusercontent.com/openshift/svt/master/openshift_performance/ci/content/conc_builds_eap.yaml

Comment 26 Andrew McDermott 2017-10-06 14:50:29 UTC
@vikas, I'm invoking this as:

  $ ./cluster-loader -f t

where t == your-URL-to-conc_builds_eap.yaml

Also, what user are you running this under?

I was unable to reproduce using either fedora or centos as the base install.

Comment 27 Seth Jennings 2017-10-12 14:03:19 UTC
*** Bug 1499695 has been marked as a duplicate of this bug. ***

Comment 29 Vikas Laad 2017-10-12 15:10:16 UTC
(In reply to Andrew McDermott from comment #26)
Cluster loader always needs to be run as the cluster admin user, and from the default project. Please let me know if you want me to demo this for you.

Comment 34 Andrew McDermott 2017-10-13 14:29:34 UTC
(In reply to Vikas Laad from comment #29)
> (In reply to Andrew McDermott from comment #26)
> Cluster loader always need to be run as cluster admin user and also need to
> be run from default project. Please let me know if you want me to demo this
> to you.

Yes please. Even today (Friday PM GMT+1) would be fine if you are available.

Comment 35 Andrew McDermott 2017-10-13 16:02:39 UTC
I went through the steps I was using with Vikas today, and together we ran both the openshift_performance and openshift_scalability tests without seeing the issue. For good measure I ran the performance test 3 times; no Init issues were seen.

One difference in my setup vis-a-vis the bug report is that my openshift version is behind (though it was current at the time the bug was reported and when I stood up the cluster), and I'm deploying via KVM, using CentOS for the base install. I'll switch to RHEL 7 to see if that is the difference.

@Vikas - did I miss anything here from our screenshare?

Comment 36 Vikas Laad 2017-10-16 17:34:23 UTC
I was able to reproduce this issue again with the following version:

openshift v3.7.0-0.153.0
kubernetes v1.7.6+a08f5eeb62
etcd 3.2.8

Again, this is on RHEL 7 cluster created on AWS.

Comment 37 Vikas Laad 2017-10-16 17:50:52 UTC
Created attachment 1339394 [details]
pod yaml on 10-16-2017

Comment 38 Seth Jennings 2017-10-16 22:05:18 UTC
Avesh found a pick that kube had done back to 1.7 upstream to avoid GCing (init) containers from running pods.  I picked it.  vlaad is testing now with a custom binary to shorten round trip time for fix confirmation/failure.

https://github.com/openshift/origin/pull/16896

Comment 41 Vikas Laad 2017-10-17 15:25:44 UTC
Yes, I was able to re-create.

Comment 43 Vikas Laad 2017-10-17 15:29:19 UTC
Actually, the behavior is a little different; the pod status shows:

svt-proj-35   eap-app-1-build            0/1       Init:0/2    0          2h

Build status shows 
eap-app-1   Source    Git@caec202   Pending

Comment 44 Seth Jennings 2017-10-17 16:17:24 UTC
Vikas, unfortunately this is a recreate.  Back to the drawing board..

Comment 45 Andrew McDermott 2017-10-18 20:09:36 UTC
I tried this using RHEL 7.4 as the base OS. I did not run into this bug but I did notice the following:

$ oc get pods --all-namespaces |grep Error
proj13      eap-app-1-build            0/1       CreateContainerError   0          8m
proj19      eap-app-1-build            0/1       CreateContainerError   0          8m

The error was:

$ oc describe all |grep Error
  10m           10m             1       kubelet, rhel-74-vm-4.k8s.home  spec.containers{sti-build}              Warning         Failed                  Error: context deadline exceeded
      Reason:           CreateContainerError
  10m           10m             1       kubelet, rhel-74-vm-4.k8s.home  spec.containers{sti-build}              Warning         Failed                  Error: context deadline exceeded

  7m		7m		1	kubelet, rhel-74-vm-4.k8s.home	spec.containers{sti-build}		Warning		Failed			Error: context deadline exceeded


My oc version has moved on since I first started looking into this, so possibly a new issue given the following version:

$ oc version
oc v3.7.0-alpha.1+a91eb4b-1195
kubernetes v1.7.6+a08f5eeb62
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://rhel-74-vm-1.k8s.home:8443
openshift v3.7.0-alpha.1+a91eb4b-1195
kubernetes v1.7.6+a08f5eeb62

Comment 46 Seth Jennings 2017-10-19 16:43:33 UTC
When I try to recreate this, I don't get any pods stuck in Init:0/2, but I do get quite a few that end in the Error state with "error: build error: Failed to push image: unauthorized: authentication required" in the container logs:

openshift v3.7.0-0.143.7
kubernetes v1.7.0+80709908fd

$ oc get pods --all-namespaces
NAMESPACE   NAME                      READY     STATUS      RESTARTS   AGE
default     docker-registry-1-c812b   1/1       Running     0          21h
default     router-1-23hcr            1/1       Running     0          21h
proj0       eap-app-1-build           0/1       Completed   0          11m
proj1       eap-app-1-build           0/1       Completed   0          11m
proj10      eap-app-1-build           0/1       Error       0          11m
proj11      eap-app-1-build           0/1       Error       0          11m
proj12      eap-app-1-build           0/1       Completed   0          11m
proj13      eap-app-1-build           0/1       Error       0          11m
proj14      eap-app-1-build           0/1       Error       0          11m
proj15      eap-app-1-build           0/1       Completed   0          11m
proj16      eap-app-1-build           0/1       Completed   0          11m
proj17      eap-app-1-build           0/1       Error       0          11m
proj18      eap-app-1-build           0/1       Error       0          11m
proj19      eap-app-1-build           0/1       Error       0          11m
proj2       eap-app-1-build           0/1       Completed   0          11m
proj3       eap-app-1-build           0/1       Completed   0          11m
proj4       eap-app-1-build           0/1       Completed   0          11m
proj5       eap-app-1-build           0/1       Completed   0          11m
proj6       eap-app-1-build           0/1       Completed   0          11m
proj7       eap-app-1-build           0/1       Completed   0          11m
proj8       eap-app-1-build           0/1       Completed   0          11m
proj9       eap-app-1-build           0/1       Completed   0          11m

$ for i in $(oc get pod --all-namespaces | grep Error | cut -f1 -d' '); do oc logs eap-app-1-build -n $i | grep "build error"; done
error: build error: Failed to push image: unauthorized: authentication required
error: build error: Failed to push image: unauthorized: authentication required
error: build error: Failed to push image: unauthorized: authentication required
error: build error: Failed to push image: unauthorized: authentication required
error: build error: Failed to push image: unauthorized: authentication required
error: build error: Failed to push image: unauthorized: authentication required
error: build error: Failed to push image: unauthorized: authentication required

This is because the dockercfg secret is not being added to the builder pod.

$ oc describe pod eap-app-1-build -n proj0 | grep SecretName
    SecretName:	builder-dockercfg-xptk6  <-- good
    SecretName:	builder-token-w7bbs

$ oc describe pod eap-app-1-build -n proj10 | grep SecretName
    SecretName:	builder-token-337rs

Vikas, has this happened to you? I don't want to hijack the bug; I might open a separate bug for this. Just wondering if you have seen it.

Comment 47 Vikas Laad 2017-10-19 17:13:07 UTC
No, "authentication required" is not happening to me.

Comment 48 Seth Jennings 2017-10-23 18:54:56 UTC
Sorry for not updating the bug.  The previous "authentication required" was due to a race with the build container starting before the controller that creates the dockercfg secret had run.

Still couldn't reproduce.

I looked through the node log with the debug binary and my debug message is never hit.

Even though you can recreate this, I'm not sure what we can do about it.

Since https://github.com/openshift/origin/pull/16244, I have personally not been able to recreate this.

https://github.com/openshift/origin/pull/16896 guards against that recreation mode even more.

Not sure where we go with this from here.  The value of a test doing 50 parallel
builds on a single node can be debated. Definitely not a use case we expect to be encountering.  We've burned a lot of engineering time trying to run this down.

Comment 49 Vikas Laad 2017-10-23 19:05:24 UTC
In the concurrent build test I am doing 25 builds on 1 node, not 50. Also, since we are running into this issue in free-stg, it seems concurrent builds are not the only trigger.
Comparing with the 3.6 release, this is a new issue; there we did not have any builds getting stuck in Running.

Comment 50 Clayton Coleman 2017-10-24 02:53:15 UTC
The CI jobs are going to be running a large number of builds on nodes in parallel soon.  Since this is possibly showing up in deployments as well, if Kubelet is resetting status in a way that violates the pod phase sequence diagram this is potentially very serious (could break stateful sets as well).

Comment 51 Seth Jennings 2017-10-24 03:07:20 UTC
With the tight reproducer in https://github.com/openshift/origin/issues/17011, which does reproduce for me nearly every time, we should be able to make progress on this, assuming they are indeed the same issue.

Comment 52 Avesh Agarwal 2017-10-24 23:20:28 UTC
*** Bug 1500498 has been marked as a duplicate of this bug. ***

Comment 53 Andrew McDermott 2017-10-25 15:02:36 UTC
Vikas, does this failure occur on every single CI run, or just some of them? I was trying to reproduce on AWS today and the first run completed OK for me, but I was wondering if CI gets into this state each and every time.

Comment 54 Vikas Laad 2017-10-25 15:08:40 UTC
No, this does not happen on every run.

Comment 55 Andrew McDermott 2017-10-25 15:25:52 UTC
I was able to reproduce this behaviour on an AWS deploy today.

[ec2-user@ip-172-18-12-180 openshift_scalability]$ oc get pods --all-namespaces | grep Completed | wc -l
41

[ec2-user@ip-172-18-12-180 openshift_scalability]$ oc get pods --all-namespaces | grep -v Completed | wc -l
13

[ec2-user@ip-172-18-12-180 openshift_scalability]$ oc get pods --all-namespaces | grep -v Completed 
NAMESPACE   NAME                       READY     STATUS       RESTARTS   AGE
default     docker-registry-1-bnt6p    1/1       Running      0          2h
default     registry-console-1-q7chn   1/1       Running      0          2h
default     router-2-69tkc             1/1       Running      0          25m
proj18      eap-app-1-build            0/1       Error        0          9m
proj19      eap-app-1-build            0/1       Init:Error   0          9m
proj22      eap-app-1-build            0/1       Init:Error   0          9m
proj26      eap-app-1-build            0/1       Error        0          9m
proj3       eap-app-1-build            0/1       Init:Error   0          10m
proj30      eap-app-1-build            0/1       Init:0/2     0          9m
proj37      eap-app-1-build            0/1       Init:Error   0          9m
proj4       eap-app-1-build            0/1       Init:Error   0          10m
proj8       eap-app-1-build            0/1       Init:Error   0          10m

The difference today was that I added the following (which comes from the CI inventory):

openshift_node_kubelet_args={"pods-per-core": ["0"], "max-pods": ["510"],"minimum-container-ttl-duration": ["10s"], "maximum-dead-containers-per-container": ["1"], "maximum-dead-containers": ["20"], "image-gc-high-threshold": ["80"], "image-gc-low-threshold": ["70"]}

I tried this before on a local VM but it didn't happen, so I convinced myself that it was not significant. Perhaps the combination of running in the cloud _and_ those values is enough to trigger the failure.

However, with regard to behaviour, do these parameters make sense for real-world deployments? Are they a holdover that could be culled from the CI job? Can we cull them from the CI job to see if this problem disappears, and if so, resurrect them to convince ourselves that the problem reappears and that these settings represent the root cause of the failure?

Comment 56 Andrew McDermott 2017-10-25 15:33:48 UTC
And as Seth pointed out, it looks like these parameters are deprecated too:

 https://github.com/kubernetes/kubernetes/blob/master/cmd/kubelet/app/options/options.go#L308-L325

Comment 57 Andrew McDermott 2017-10-25 15:36:27 UTC
My very next run, without changing anything, was successful so just confirming what Vikas said in comment #54 (does not happen every run).

Comment 58 Seth Jennings 2017-10-26 04:56:15 UTC
Andrew, can you pick these two to origin master and open a PR:

https://github.com/kubernetes/kubernetes/pull/54593 (possible fix)
https://github.com/kubernetes/kubernetes/pull/54597 (illegal transition check)

Can you build the patched openshift binary, put it on the AWS cluster you got a recreate on, and try to recreate again.  Also, please monitor the node logs for the "attempted illegal transition" message.

Might pick those to openshift/ose master as well and build some binaries for Vikas to put on his cluster and try a recreate.
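For reference, the "illegal transition" guard described in the Doc Text can be sketched as below. This is a minimal illustration of the invariant being enforced (terminal phases must not regress to non-terminal ones, checked only for RestartPolicy: Never), not the actual status-manager code; all names here are hypothetical.

```go
package main

import "fmt"

// PodPhase mirrors the Kubernetes pod lifecycle phases.
type PodPhase string

const (
	PodPending   PodPhase = "Pending"
	PodRunning   PodPhase = "Running"
	PodSucceeded PodPhase = "Succeeded"
	PodFailed    PodPhase = "Failed"
)

// isTerminal reports whether a phase is one of the terminal phases.
func isTerminal(p PodPhase) bool {
	return p == PodSucceeded || p == PodFailed
}

// allowedTransition blocks a pod regressing out of a terminal phase.
// Per the Doc Text, the check applies only when RestartPolicy is Never.
func allowedTransition(restartPolicy string, oldPhase, newPhase PodPhase) bool {
	if restartPolicy != "Never" {
		return true
	}
	if isTerminal(oldPhase) && !isTerminal(newPhase) {
		return false // e.g. Succeeded -> Pending, as seen in this bug
	}
	return true
}

func main() {
	fmt.Println(allowedTransition("Never", PodSucceeded, PodPending)) // false: blocked
	fmt.Println(allowedTransition("Never", PodRunning, PodSucceeded)) // true: legal
}
```

Any attempt at a blocked transition is what should surface as the "attempted illegal transition" message in the node logs.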

Comment 59 Seth Jennings 2017-10-26 16:12:06 UTC
Origin PRs (already merged):
https://github.com/openshift/origin/pull/17042
https://github.com/openshift/origin/pull/17051

Comment 60 Andrew McDermott 2017-10-26 18:45:51 UTC
I tested for quite some time today with a build that had the PRs mentioned in comment #59. I did not see any cases where the pod status was left at Init:0/2. This was also on the same cluster I was using yesterday, where I got my one and only repro.

I reverted to the originally deployed openshift binaries (from yesterday) but was unable to repro, which makes this a bit inconclusive.

I did run into occasional build errors which looks like:

$ oc logs -f bc/eap-app
Cloning "https://github.com/jboss-openshift/openshift-quickstarts" ...
error: fatal: destination path '/tmp/build/inputs' already exists and is not an empty directory.

I have a vague recollection that we had seen this before -- an init container needed to be more idempotent with regards to its actions.

Comment 61 Mike Fiedler 2017-10-26 18:51:45 UTC
The build init container re-entrancy issue is https://bugzilla.redhat.com/show_bug.cgi?id=1486457

Comment 62 Andrew McDermott 2017-10-27 17:06:32 UTC
Regarding the build init container issues I was seeing yesterday, I spent some time testing with a ":latest" version of the openshift-sti-builder image today. Using the ansible installer and an inventory that uses "deployment=origin", I noticed I get an sti-builder image that is 8 weeks old.

Pulling the latest image and forcing the existing image tag to today's latest, I have been able to run 50 concurrent builds repeatedly without issue. No "Init:0/2" errors and no other errors were reported; each time I ran 50 concurrent builds they all went to "Completed".

It's not immediately obvious to me given my fairly vanilla inventory why I get such an old openshift-sti-builder image.

Comment 64 Vikas Laad 2017-10-31 15:51:26 UTC
Completed multiple cycles of eap concurrent builds, did not see this issue again.

openshift v3.7.0-0.184.0
kubernetes v1.7.6+a08f5eeb62
etcd 3.2.8

Comment 67 errata-xmlrpc 2017-11-28 22:08:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188

