Bug 1368581 - start-build timeout when started with --follow
Summary: start-build timeout when started with --follow
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Build
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
unspecified
low
Target Milestone: ---
: ---
Assignee: Jim Minter
QA Contact: Wang Haoran
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2016-08-19 20:38 UTC by Vikas Laad
Modified: 2017-03-08 18:43 UTC (History)
7 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, oc start-build --follow could return a timeout error if there were delays in scheduling the build. Now, oc start-build --follow will block until the build completes.
Clone Of:
Environment:
Last Closed: 2017-01-18 12:52:32 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
/var/log/messages from master (13.17 MB, text/plain)
2016-08-25 18:49 UTC, Vikas Laad
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:0066 0 normal SHIPPED_LIVE Red Hat OpenShift Container Platform 3.4 RPM Release Advisory 2017-01-18 17:23:26 UTC

Description Vikas Laad 2016-08-19 20:38:51 UTC
Description of problem:
I am running Reliability tests for 3.3 release, most of the time when the builds are started with --follow flag the oc command times out with following error. When the tests were done with 3.2 release I never saw this error.

root@ip-172-31-46-10: ~ # oc start-build cakephp-mysql-example -n cakephp --follow
build "cakephp-mysql-example-3" started
error getting logs: Timeout: timed out waiting for build cakephp-mysql-example-3 to start after 10s

Sometime build starts after 10 secs and finishes.

Version-Release number of selected component (if applicable):
openshift v3.3.0.22
kubernetes v1.3.0+507d3a7
etcd 2.3.0+git

How reproducible:
Frequently

Steps to Reproduce:
1. Create 3.3 cluster in AWS env with m4.xlarge instances
2. Run reliability tests for few hours
3. start-build starts timing out in couple of hours

Actual results:
Timeout when start-build is issued with --follow command

Expected results:
Build should start and command should start showing build logs.

Additional info:

Comment 1 Cesar Wong 2016-08-20 02:04:40 UTC
Ben, adding you on cc. This doesn't look like an issue with builds per se, but how quickly build pods are getting started. Is there a more appropriate owner for this bug?

Comment 2 Ben Parees 2016-08-20 02:42:31 UTC
I think we need to fix this on our side either way (though not as a blocker). I think we have an issue open for better handling the timeout waiting for build logs (retry instead). 

As for looking into why builds are taking time to start, we should gather some data on the state of the pod (our build describe should indicate at least some of that) in terms of time to go from New to pending to container create to running. 

Once we have that we can take it to the cluster team if indeed we see that the time is increasing as the test runs.

Comment 3 Cesar Wong 2016-08-23 13:13:59 UTC
Vikas, is it possible to get a 'oc describe build [build-name]' when the 'oc start-build --follow' fails?

Comment 4 Vikas Laad 2016-08-23 17:22:50 UTC
root@ip-172-31-11-188: ~/openshift-ansible # oc start-build dancer-mysql-example --follow
build "dancer-mysql-example-3" started
error getting logs: Timeout: timed out waiting for build dancer-mysql-example-3 to start after 10s


root@ip-172-31-11-188: ~/openshift-ansible # oc describe build dancer-mysql-example-3
Name:           dancer-mysql-example-3
Namespace:      p1
Created:        21 seconds ago
Labels:         app=dancer-mysql-example
                buildconfig=dancer-mysql-example
                openshift.io/build-config.name=dancer-mysql-example
                openshift.io/build.start-policy=Serial
                template=dancer-mysql-example
Annotations:    openshift.io/build-config.name=dancer-mysql-example
                openshift.io/build.number=3
                openshift.io/build.pod-name=dancer-mysql-example-3-build

Status:         Running
Started:        Tue, 23 Aug 2016 13:21:49 EDT
Duration:       running for 1s
Build Config:   dancer-mysql-example
Build Pod:      dancer-mysql-example-3-build

Strategy:       Source
URL:            https://github.com/openshift/dancer-ex.git
From Image:     DockerImage registry.access.redhat.com/rhscl/perl-520-rhel7:latest
Output to:      ImageStreamTag dancer-mysql-example:latest
Push Secret:    builder-dockercfg-t9e6i

Build trigger cause:    Manually triggered

Events:
  FirstSeen     LastSeen        Count   From                                                    SubobjectPath                   Type            Reason          Message
  ---------     --------        -----   ----                                                    -------------                   --------        ------          -------
  21s           21s             1       {default-scheduler }                                                                    Normal          Scheduled       Successfully assigned dancer-mysql-example-3-build to ip-172-31-49-146.us-west-2.compute.internal
  14s           14s             1       {kubelet ip-172-31-49-146.us-west-2.compute.internal}   spec.containers{sti-build}      Normal          Pulled          Container image "registry.qe.openshift.com/openshift3/ose-sti-builder:v3.3.0.23" already present on machine
  8s            8s              1       {kubelet ip-172-31-49-146.us-west-2.compute.internal}   spec.containers{sti-build}      Normal          Created         Created container with docker id 45c8322e67a2
  1s            1s              1       {kubelet ip-172-31-49-146.us-west-2.compute.internal}   spec.containers{sti-build}      Normal          Started         Started container with docker id 45c8322e67a2

Comment 5 Cesar Wong 2016-08-23 18:08:24 UTC
On the client side, we will add logic to the 'oc start-build --follow' command (or any other command that waits for build logs) to continue waiting until a pod is available.

The timeout (waiting for the log) hasn't changed from 3.2 to 3.3. From the events above, I can see that it took about 7 secs from the time the pod was scheduled til the node started looking for the image and another 6 secs for the container to be created. 
Andy, is this a performance concern from the cluster's point of view?

Comment 6 Andy Goldstein 2016-08-23 18:41:34 UTC
Yes, those numbers don't look good. Vikas, what does your cluster look like (# of nodes, etc), and how active was it when you ran this test (how many other pods did you have)?

Comment 7 Vikas Laad 2016-08-23 18:55:10 UTC
I was running reliability tests on this cluster and due to this timeout issue my tests kept creating builds, because of that It became really slow. I stopped the tests, deleted all the projects made sure nodes were ready and created 1 project and provided oc describe build data.

This cluster has 1 master, 2 primary and 1 infra nodes. when I did oc describe build there was only 1 project on this cluster with few completed builds.

I still have the cluster around in case you want to check anything.

Please let me know if you need anything else.

Comment 8 Vikas Laad 2016-08-23 19:58:12 UTC
Instance size is m4.xlarge for all these nodes.

Comment 9 Vikas Laad 2016-08-25 17:06:44 UTC
Another observation : I put a wait for one min in my automation if the start-build is timing out. In many instances I still get timeout after one min when the test is trying to load the logs with same following error

oc logs bc/cakephp-mysql-example --follow

Error from server: Timeout: timed out waiting for build cakephp-mysql-example-19 to start after 10s[#]

Comment 10 Andy Goldstein 2016-08-25 17:11:49 UTC
Can you 'oc describe' the build and its associated pod (if it exists)?

Comment 11 Vikas Laad 2016-08-25 18:14:43 UTC
Here is oc describe on build, build pod does not exist.

root@ip-172-31-31-121: ~ # oc describe build cakephp-mysql-example-32
Name:           cakephp-mysql-example-32
Namespace:      cakephp-mysql-example-user-2-22-2-55
Created:        37 minutes ago
Labels:         app=cakephp-mysql-example
                buildconfig=cakephp-mysql-example
                openshift.io/build-config.name=cakephp-mysql-example
                openshift.io/build.start-policy=Serial
                template=cakephp-mysql-example
Annotations:    openshift.io/build-config.name=cakephp-mysql-example
                openshift.io/build.number=32

Status:         New
Duration:       waiting for 37m6s
Build Config:   cakephp-mysql-example
Build Pod:      cakephp-mysql-example-32-build

Strategy:       Source
URL:            https://github.com/openshift/cakephp-ex.git
From Image:     DockerImage registry.access.redhat.com/rhscl/php-56-rhel7:latest
Output to:      ImageStreamTag cakephp-mysql-example:latest
Push Secret:    builder-dockercfg-wqv6n

Build trigger cause:    Manually triggered

Comment 12 Vikas Laad 2016-08-25 18:25:43 UTC
root@ip-172-31-31-121: ~ # oc get pods -o wide
NAME                              READY     STATUS      RESTARTS   AGE       IP            NODE
cakephp-mysql-example-16-build    0/1       Completed   0          13h       172.20.1.4    ip-172-31-19-53.us-west-2.compute.internal
cakephp-mysql-example-17-build    0/1       Completed   0          12h       172.20.1.5    ip-172-31-19-53.us-west-2.compute.internal
cakephp-mysql-example-18-build    0/1       Completed   0          12h       172.20.1.4    ip-172-31-19-53.us-west-2.compute.internal
cakephp-mysql-example-19-8fgab    1/1       Running     0          11h       172.20.1.32   ip-172-31-19-53.us-west-2.compute.internal
cakephp-mysql-example-19-build    0/1       Completed   0          11h       172.20.1.6    ip-172-31-19-53.us-west-2.compute.internal
cakephp-mysql-example-20-deploy   0/1       Error       0          9h        172.20.2.31   ip-172-31-19-54.us-west-2.compute.internal
cakephp-mysql-example-23-build    0/1       Completed   0          9h        172.20.1.4    ip-172-31-19-53.us-west-2.compute.internal
cakephp-mysql-example-24-build    0/1       Error       0          8h        172.20.2.28   ip-172-31-19-54.us-west-2.compute.internal
cakephp-mysql-example-25-build    1/1       Running     0          8h        172.20.2.2    ip-172-31-19-54.us-west-2.compute.internal
mysql-1-nswnt                     1/1       Running     0          23h       172.20.2.6    ip-172-31-19-54.us-west-2.compute.internal

Comment 13 Vikas Laad 2016-08-25 18:49:48 UTC
Created attachment 1194082 [details]
/var/log/messages from master

Comment 14 Ben Parees 2016-08-25 18:56:04 UTC
So this bug has gotten pretty muddled, summarizing:

1) we're going to make the log timeout logic more tolerant, but that's low severity.  (and we'll use this bug to fix that issue, so lowering this bug's severity)

2) the build with no pod has not been scheduled because there is an existing build already running and the buildconfig is in serial mode

3) the build that's already running hung for reasons that are being investigated in another bug.

4) there is still concern over the overall performance of starting builds as it appears to have gone up from 3.2.  We probably need a new bug to chase that.

Comment 15 Jim Minter 2016-09-27 10:29:51 UTC
Ben, I'm looking at (1) in context of the conversation at https://github.com/openshift/origin/pull/6268

Q: what do you want 'make the log timeout logic more tolerant' to mean in practice?  Increase the server-side timeout from 10s, retry on the client N times (or until the build "concludes" in some where), or do something else?

FWIW (1) I think at the moment --follow --wait will already do the retry until build "concludes" case; I'm not immediately clear why --follow shouldn't actually just imply --wait?

FWIW (2) However I'm also wondering if there's currently a small race window in the implementation as it stands that could cause the end of the logs to be truncated accidentally when --follow --wait is specified?

Comment 16 Ben Parees 2016-09-27 11:54:51 UTC
(1) yes having --follow imply --wait makes sense.
(2) do you have evidence of that, or something in the code that makes you think there is a danger of that?

Comment 17 Jim Minter 2016-09-27 17:08:04 UTC
on (2), it was even commented:
  // if --wait option is set, then don't wait for logs to finish streaming
  // but wait for the build to reach its final state

FWIW I don't think ^ is the right approach.

https://github.com/openshift/origin/pull/11119 , WIP

Comment 18 Jim Minter 2016-09-29 08:34:33 UTC
OK, #11119 is merged to origin.

Update to start-build --follow, somewhat contradicting #6268
- make --follow always retry on timeouts
- make --follow imply --wait: guarantee returning an error, or ensure stream
  has finished and build status is in complete, failed, cancelled,
  error (aka not new, pending, running)
- remove possible undesirable race condition (I think --follow --wait
  could unwittingly truncate returned logs)

Comment 19 Troy Dawson 2016-10-18 15:57:24 UTC
This has been merged into ose and is in OSE v3.4.0.12 or newer.

Comment 21 Wang Haoran 2016-10-19 05:40:24 UTC
verified with :
openshift v3.4.0.12
kubernetes v1.4.0+776c994
etcd 3.1.0-alpha.1


steps:
[root@ip-172-18-5-140 ~]# oc new-build https://github.com/openshift/ruby-ex
--> Found image 79f079e (5 weeks old) in image stream "ruby" in project "openshift" under tag "2.3" for "ruby"

    Ruby 2.3 
    -------- 
    Platform for building and running Ruby 2.3 applications

    Tags: builder, ruby, ruby23, rh-ruby23

    * The source repository appears to match: ruby
    * A source build using source code from https://github.com/openshift/ruby-ex will be created
      * The resulting image will be pushed to image stream "ruby-ex:latest"
      * Use 'start-build' to trigger a new build

--> Creating resources with label build=ruby-ex ...
    imagestream "ruby-ex" created
    buildconfig "ruby-ex" created
--> Success
    Build configuration "ruby-ex" created and build triggered.
    Run 'new-build logs -f bc/ruby-ex' to stream the build progress.
[root@ip-172-18-5-140 ~]# oc start-build ruby-ex --wait
build "ruby-ex-2" started
^C
[root@ip-172-18-5-140 ~]# oc get build
NAME        TYPE      FROM      STATUS    STARTED          DURATION
ruby-ex-1   Source    Git       Running   13 seconds ago   13s
ruby-ex-2   Source    Git       New                        
[root@ip-172-18-5-140 ~]# oc start-build ruby-ex --follow
build "ruby-ex-3" started
timed out getting logs, retrying
timed out getting logs, retrying
timed out getting logs, retrying
timed out getting logs, retrying
timed out getting logs, retrying
timed out getting logs, retrying
timed out getting logs, retrying
timed out getting logs, retrying
timed out getting logs, retrying
timed out getting logs, retrying
timed out getting logs, retrying
timed out getting logs, retrying
timed out getting logs, retrying
timed out getting logs, retrying
timed out getting logs, retrying
timed out getting logs, retrying
timed out getting logs, retrying
timed out getting logs, retrying
timed out getting logs, retrying
timed out getting logs, retrying
timed out getting logs, retrying
timed out getting logs, retrying
timed out getting logs, retrying
I1019 01:38:48.894498       1 docker.go:375] Connecting to docker on unix:///var/run/docker.sock
I1019 01:38:48.894783       1 docker.go:395] Start docker client with request timeout=0
I1019 01:38:48.899476       1 docker.go:375] Connecting to docker on unix:///var/run/docker.sock
I1019 01:38:48.899489       1 docker.go:395] Start docker client with request timeout=0
I1019 01:38:48.912993       1 docker.go:375] Connecting to docker on unix:///var/run/docker.sock
I1019 01:38:48.913008       1 docker.go:395] Start docker client with request timeout=0
I1019 01:38:48.915086       1 docker.go:375] Connecting to docker on unix:///var/run/docker.sock
I1019 01:38:48.915131       1 docker.go:395] Start docker client with request timeout=0
Cloning "https://github.com/openshift/ruby-ex" ...
	Commit:	f63d076b602441ebd65fd0749c5c58ea4bafaf90 (Merge pull request #2 from mfojtik/add-puma)
	Author:	Michal Fojtik <mi>
	Date:	Thu Jun 30 10:47:53 2016 +0200

---> Installing application source ...
---> Building your Ruby application from source ...
---> Running 'bundle install --deployment' ...
Warning: the running version of Bundler is older than the version that created the lockfile. We suggest you upgrade to the latest version of Bundler by running `gem install bundler`.
Fetching gem metadata from https://rubygems.org/...............
Fetching version metadata from https://rubygems.org/..
Warning: the running version of Bundler is older than the version that created the lockfile. We suggest you upgrade to the latest version of Bundler by running `gem install bundler`.
Installing puma 3.4.0 with native extensions
Installing rack 1.6.4
Using bundler 1.10.6
Bundle complete! 2 Gemfile dependencies, 3 gems now installed.
Bundled gems are installed into ./bundle.
---> Cleaning up unused ruby gems ...
Warning: the running version of Bundler is older than the version that created the lockfile. We suggest you upgrade to the latest version of Bundler by running `gem install bundler`.


Pushing image 172.30.71.96:5000/haowang/ruby-ex:latest ...
Pushed 4/5 layers, 81% complete
Pushed 5/5 layers, 100% complete
Push successful

Comment 23 errata-xmlrpc 2017-01-18 12:52:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0066


Note You need to log in before you can comment on or make changes to this bug.