Description of problem: I am running Reliability tests for 3.3 release, most of the time when the builds are started with --follow flag the oc command times out with following error. When the tests were done with 3.2 release I never saw this error. root@ip-172-31-46-10: ~ # oc start-build cakephp-mysql-example -n cakephp --follow build "cakephp-mysql-example-3" started error getting logs: Timeout: timed out waiting for build cakephp-mysql-example-3 to start after 10s Sometime build starts after 10 secs and finishes. Version-Release number of selected component (if applicable): openshift v3.3.0.22 kubernetes v1.3.0+507d3a7 etcd 2.3.0+git How reproducible: Frequently Steps to Reproduce: 1. Create 3.3 cluster in AWS env with m4.xlarge instances 2. Run reliability tests for few hours 3. start-build starts timing out in couple of hours Actual results: Timeout when start-build is issued with --follow command Expected results: Build should start and command should start showing build logs. Additional info:
Ben, adding you on cc. This doesn't look like an issue with builds per se, but how quickly build pods are getting started. Is there a more appropriate owner for this bug?
I think we need to fix this on our side either way (though not as a blocker). I think we have an issue open for better handling the timeout waiting for build logs (retry instead). As for looking into why builds are taking time to start, we should gather some data on the state of the pod (our build describe should indicate at least some of that) in terms of time to go from New to pending to container create to running. Once we have that we can take it to the cluster team if indeed we see that the time is increasing as the test runs.
Vikas, is it possible to get a 'oc describe build [build-name]' when the 'oc start-build --follow' fails?
root@ip-172-31-11-188: ~/openshift-ansible # oc start-build dancer-mysql-example --follow build "dancer-mysql-example-3" started error getting logs: Timeout: timed out waiting for build dancer-mysql-example-3 to start after 10s root@ip-172-31-11-188: ~/openshift-ansible # oc describe build dancer-mysql-example-3 Name: dancer-mysql-example-3 Namespace: p1 Created: 21 seconds ago Labels: app=dancer-mysql-example buildconfig=dancer-mysql-example openshift.io/build-config.name=dancer-mysql-example openshift.io/build.start-policy=Serial template=dancer-mysql-example Annotations: openshift.io/build-config.name=dancer-mysql-example openshift.io/build.number=3 openshift.io/build.pod-name=dancer-mysql-example-3-build Status: Running Started: Tue, 23 Aug 2016 13:21:49 EDT Duration: running for 1s Build Config: dancer-mysql-example Build Pod: dancer-mysql-example-3-build Strategy: Source URL: https://github.com/openshift/dancer-ex.git From Image: DockerImage registry.access.redhat.com/rhscl/perl-520-rhel7:latest Output to: ImageStreamTag dancer-mysql-example:latest Push Secret: builder-dockercfg-t9e6i Build trigger cause: Manually triggered Events: FirstSeen LastSeen Count From SubobjectPath Type Reason Message --------- -------- ----- ---- ------------- -------- ------ ------- 21s 21s 1 {default-scheduler } Normal Scheduled Successfully assigned dancer-mysql-example-3-build to ip-172-31-49-146.us-west-2.compute.internal 14s 14s 1 {kubelet ip-172-31-49-146.us-west-2.compute.internal} spec.containers{sti-build} Normal Pulled Container image "registry.qe.openshift.com/openshift3/ose-sti-builder:v3.3.0.23" already present on machine 8s 8s 1 {kubelet ip-172-31-49-146.us-west-2.compute.internal} spec.containers{sti-build} Normal Created Created container with docker id 45c8322e67a2 1s 1s 1 {kubelet ip-172-31-49-146.us-west-2.compute.internal} spec.containers{sti-build} Normal Started Started container with docker id 45c8322e67a2
On the client side, we will add logic to the 'oc start-build --follow' command (or any other command that waits for build logs) to continue waiting until a pod is available. The timeout (waiting for the log) hasn't changed from 3.2 to 3.3. From the events above, I can see that it took about 7 secs from the time the pod was scheduled til the node started looking for the image and another 6 secs for the container to be created. Andy, is this a performance concern from the cluster's point of view?
Yes, those numbers don't look good. Vikas, what does your cluster look like (# of nodes, etc), and how active was it when you ran this test (how many other pods did you have)?
I was running reliability tests on this cluster and due to this timeout issue my tests kept creating builds, because of that It became really slow. I stopped the tests, deleted all the projects made sure nodes were ready and created 1 project and provided oc describe build data. This cluster has 1 master, 2 primary and 1 infra nodes. when I did oc describe build there was only 1 project on this cluster with few completed builds. I still have the cluster around in case you want to check anything. Please let me know if you need anything else.
Instance size is m4.xlarge for all these nodes.
Another observation : I put a wait for one min in my automation if the start-build is timing out. In many instances I still get timeout after one min when the test is trying to load the logs with same following error oc logs bc/cakephp-mysql-example --follow Error from server: Timeout: timed out waiting for build cakephp-mysql-example-19 to start after 10s[#]
Can you 'oc describe' the build and its associated pod (if it exists)?
Here is oc describe on build, build pod does not exist. root@ip-172-31-31-121: ~ # oc describe build cakephp-mysql-example-32 Name: cakephp-mysql-example-32 Namespace: cakephp-mysql-example-user-2-22-2-55 Created: 37 minutes ago Labels: app=cakephp-mysql-example buildconfig=cakephp-mysql-example openshift.io/build-config.name=cakephp-mysql-example openshift.io/build.start-policy=Serial template=cakephp-mysql-example Annotations: openshift.io/build-config.name=cakephp-mysql-example openshift.io/build.number=32 Status: New Duration: waiting for 37m6s Build Config: cakephp-mysql-example Build Pod: cakephp-mysql-example-32-build Strategy: Source URL: https://github.com/openshift/cakephp-ex.git From Image: DockerImage registry.access.redhat.com/rhscl/php-56-rhel7:latest Output to: ImageStreamTag cakephp-mysql-example:latest Push Secret: builder-dockercfg-wqv6n Build trigger cause: Manually triggered
root@ip-172-31-31-121: ~ # oc get pods -o wide NAME READY STATUS RESTARTS AGE IP NODE cakephp-mysql-example-16-build 0/1 Completed 0 13h 172.20.1.4 ip-172-31-19-53.us-west-2.compute.internal cakephp-mysql-example-17-build 0/1 Completed 0 12h 172.20.1.5 ip-172-31-19-53.us-west-2.compute.internal cakephp-mysql-example-18-build 0/1 Completed 0 12h 172.20.1.4 ip-172-31-19-53.us-west-2.compute.internal cakephp-mysql-example-19-8fgab 1/1 Running 0 11h 172.20.1.32 ip-172-31-19-53.us-west-2.compute.internal cakephp-mysql-example-19-build 0/1 Completed 0 11h 172.20.1.6 ip-172-31-19-53.us-west-2.compute.internal cakephp-mysql-example-20-deploy 0/1 Error 0 9h 172.20.2.31 ip-172-31-19-54.us-west-2.compute.internal cakephp-mysql-example-23-build 0/1 Completed 0 9h 172.20.1.4 ip-172-31-19-53.us-west-2.compute.internal cakephp-mysql-example-24-build 0/1 Error 0 8h 172.20.2.28 ip-172-31-19-54.us-west-2.compute.internal cakephp-mysql-example-25-build 1/1 Running 0 8h 172.20.2.2 ip-172-31-19-54.us-west-2.compute.internal mysql-1-nswnt 1/1 Running 0 23h 172.20.2.6 ip-172-31-19-54.us-west-2.compute.internal
Created attachment 1194082 [details] /var/log/messages from master
So this bug has gotten pretty muddled, summarizing: 1) we're going to make the log timeout logic more tolerant, but that's low severity. (and we'll use this bug to fix that issue, so lowering this bug's severity) 2) the build with no pod has not been scheduled because there is an existing build already running and the buildconfig is in serial mode 3) the build that's already running hung for reasons that are being investigated in another bug. 4) there is still concern over the overall performance of starting builds as it appears to have gone up from 3.2. We probably need a new bug to chase that.
Ben, I'm looking at (1) in context of the conversation at https://github.com/openshift/origin/pull/6268 Q: what do you want 'make the log timeout logic more tolerant' to mean in practice? Increase the server-side timeout from 10s, retry on the client N times (or until the build "concludes" in some where), or do something else? FWIW (1) I think at the moment --follow --wait will already do the retry until build "concludes" case; I'm not immediately clear why --follow shouldn't actually just imply --wait? FWIW (2) However I'm also wondering if there's currently a small race window in the implementation as it stands that could cause the end of the logs to be truncated accidentally when --follow --wait is specified?
(1) yes having --follow imply --wait makes sense. (2) do you have evidence of that, or something in the code that makes you think there is a danger of that?
on (2), it was even commented: // if --wait option is set, then don't wait for logs to finish streaming // but wait for the build to reach its final state FWIW I don't think ^ is the right approach. https://github.com/openshift/origin/pull/11119 , WIP
OK, #11119 is merged to origin. Update to start-build --follow, somewhat contradicting #6268 - make --follow always retry on timeouts - make --follow imply --wait: guarantee returning an error, or ensure stream has finished and build status is in complete, failed, cancelled, error (aka not new, pending, running) - remove possible undesirable race condition (I think --follow --wait could unwittingly truncate returned logs)
This has been merged into ose and is in OSE v3.4.0.12 or newer.
verified with : openshift v3.4.0.12 kubernetes v1.4.0+776c994 etcd 3.1.0-alpha.1 steps: [root@ip-172-18-5-140 ~]# oc new-build https://github.com/openshift/ruby-ex --> Found image 79f079e (5 weeks old) in image stream "ruby" in project "openshift" under tag "2.3" for "ruby" Ruby 2.3 -------- Platform for building and running Ruby 2.3 applications Tags: builder, ruby, ruby23, rh-ruby23 * The source repository appears to match: ruby * A source build using source code from https://github.com/openshift/ruby-ex will be created * The resulting image will be pushed to image stream "ruby-ex:latest" * Use 'start-build' to trigger a new build --> Creating resources with label build=ruby-ex ... imagestream "ruby-ex" created buildconfig "ruby-ex" created --> Success Build configuration "ruby-ex" created and build triggered. Run 'new-build logs -f bc/ruby-ex' to stream the build progress. [root@ip-172-18-5-140 ~]# oc start-build ruby-ex --wait build "ruby-ex-2" started ^C [root@ip-172-18-5-140 ~]# oc get build NAME TYPE FROM STATUS STARTED DURATION ruby-ex-1 Source Git Running 13 seconds ago 13s ruby-ex-2 Source Git New [root@ip-172-18-5-140 ~]# oc start-build ruby-ex --follow build "ruby-ex-3" started timed out getting logs, retrying timed out getting logs, retrying timed out getting logs, retrying timed out getting logs, retrying timed out getting logs, retrying timed out getting logs, retrying timed out getting logs, retrying timed out getting logs, retrying timed out getting logs, retrying timed out getting logs, retrying timed out getting logs, retrying timed out getting logs, retrying timed out getting logs, retrying timed out getting logs, retrying timed out getting logs, retrying timed out getting logs, retrying timed out getting logs, retrying timed out getting logs, retrying timed out getting logs, retrying timed out getting logs, retrying timed out getting logs, retrying timed out getting logs, retrying timed out getting logs, retrying I1019 01:38:48.894498 1 docker.go:375] Connecting to docker on unix:///var/run/docker.sock I1019 01:38:48.894783 1 docker.go:395] Start docker client with request timeout=0 I1019 01:38:48.899476 1 docker.go:375] Connecting to docker on unix:///var/run/docker.sock I1019 01:38:48.899489 1 docker.go:395] Start docker client with request timeout=0 I1019 01:38:48.912993 1 docker.go:375] Connecting to docker on unix:///var/run/docker.sock I1019 01:38:48.913008 1 docker.go:395] Start docker client with request timeout=0 I1019 01:38:48.915086 1 docker.go:375] Connecting to docker on unix:///var/run/docker.sock I1019 01:38:48.915131 1 docker.go:395] Start docker client with request timeout=0 Cloning "https://github.com/openshift/ruby-ex" ... Commit: f63d076b602441ebd65fd0749c5c58ea4bafaf90 (Merge pull request #2 from mfojtik/add-puma) Author: Michal Fojtik <mi> Date: Thu Jun 30 10:47:53 2016 +0200 ---> Installing application source ... ---> Building your Ruby application from source ... ---> Running 'bundle install --deployment' ... Warning: the running version of Bundler is older than the version that created the lockfile. We suggest you upgrade to the latest version of Bundler by running `gem install bundler`. Fetching gem metadata from https://rubygems.org/............... Fetching version metadata from https://rubygems.org/.. Warning: the running version of Bundler is older than the version that created the lockfile. We suggest you upgrade to the latest version of Bundler by running `gem install bundler`. Installing puma 3.4.0 with native extensions Installing rack 1.6.4 Using bundler 1.10.6 Bundle complete! 2 Gemfile dependencies, 3 gems now installed. Bundled gems are installed into ./bundle. ---> Cleaning up unused ruby gems ... Warning: the running version of Bundler is older than the version that created the lockfile. We suggest you upgrade to the latest version of Bundler by running `gem install bundler`. Pushing image 172.30.71.96:5000/haowang/ruby-ex:latest ... Pushed 4/5 layers, 81% complete Pushed 5/5 layers, 100% complete Push successful
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:0066