Bug 1547551

Summary: The pod for this build already exists and is older than the build.
Product: OpenShift Container Platform
Component: Build
Version: 3.7.0
Target Release: 3.10.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: high
Keywords: Reopened
Reporter: Luiz Carvalho <lucarval>
Assignee: Adam Kaplan <adam.kaplan>
QA Contact: Wenjing Zheng <wzheng>
CC: adam.kaplan, aos-bugs, bparees, fshaikh, hgomes, jminter, lionelve, lucarval, suchaudh, tatanaka, xiuwang
Doc Type: Bug Fix
Doc Text:
The build controller was susceptible to incorrectly failing builds when time was not synchronised accurately between multiple masters. The controller logic has been improved to not depend on time synchronisation.
Type: Bug
Last Closed: 2019-05-03 18:03:02 UTC

Description Luiz Carvalho 2018-02-21 14:19:02 UTC
Description of problem:
After manually triggering a build via POST /oapi/v1/namespaces/$NAMESPACE/buildconfigs/$NAME/instantiate, the build fails with "The pod for this build already exists and is older than the build."

Version-Release number of selected component (if applicable):
OpenShift Master:
v3.7.23
Kubernetes Master:
v1.7.6+a08f5eeb6

How reproducible:
This appears to happen randomly, in roughly 10% of attempts.

Steps to Reproduce:
1. Create BuildConfig object
2. POST /oapi/v1/namespaces/$NAMESPACE/buildconfigs/$NAME/instantiate
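For reference, step 2 can be exercised with curl roughly as follows (a minimal sketch; the master hostname, $TOKEN, and the BuildRequest body are illustrative assumptions, not taken from this report):

    # Equivalent to `oc start-build $NAME -n $NAMESPACE`; assumes $TOKEN
    # holds a bearer token allowed to instantiate builds in $NAMESPACE.
    curl -k -X POST \
      -H "Authorization: Bearer $TOKEN" \
      -H "Content-Type: application/json" \
      -d "{\"kind\": \"BuildRequest\", \"apiVersion\": \"v1\", \"metadata\": {\"name\": \"$NAME\"}}" \
      "https://master.example.com:8443/oapi/v1/namespaces/$NAMESPACE/buildconfigs/$NAME/instantiate"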

Actual results:

Build fails with "The pod for this build already exists and is older than the build."


Expected results:

Build starts and completes successfully.


Additional info:

Not sure if relevant, but this cluster has two namespaces, each of which will have Pods of the same name. This shouldn't matter though because they are in different namespaces.

Comment 1 Luiz Carvalho 2018-02-21 14:43:40 UTC
There appear to be some etcd errors:

Feb 21 09:39:33 osbs-stage-mstr-01.ocp.osbs.upshift.eng.rdu2.redhat.com etcd[41576]: apply entries took too long [406.008943ms for 1 entries]
Feb 21 09:39:33 osbs-stage-mstr-01.ocp.osbs.upshift.eng.rdu2.redhat.com etcd[41576]: avoid queries with large range/delete range!
Feb 21 09:39:34 osbs-stage-mstr-01.ocp.osbs.upshift.eng.rdu2.redhat.com etcd[41576]: store.index: compact 1178072
Feb 21 09:39:34 osbs-stage-mstr-01.ocp.osbs.upshift.eng.rdu2.redhat.com etcd[41576]: finished scheduled compaction at 1178072 (took 2.254176ms)
Feb 21 09:39:35 osbs-stage-mstr-01.ocp.osbs.upshift.eng.rdu2.redhat.com etcd[41576]: apply entries took too long [406.920156ms for 1 entries]
Feb 21 09:39:35 osbs-stage-mstr-01.ocp.osbs.upshift.eng.rdu2.redhat.com etcd[41576]: avoid queries with large range/delete range!
Feb 21 09:39:39 osbs-stage-mstr-01.ocp.osbs.upshift.eng.rdu2.redhat.com etcd[41576]: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 3133; CANCEL")
Feb 21 09:39:39 osbs-stage-mstr-01.ocp.osbs.upshift.eng.rdu2.redhat.com etcd[41576]: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 3135; CANCEL")

But cluster health seems ok:

Cluster-Endpoints: https://10.0.10.20:2379
cURL Command: curl -X GET https://10.0.10.20:2379/v2/members
member 1f1f0203c9b14f3e is healthy: got healthy result from https://10.0.10.13:2379
member 583b291be906b37b is healthy: got healthy result from https://10.0.10.20:2379
member 7136e824e9edb4dd is healthy: got healthy result from https://10.0.10.21:2379
cluster is healthy
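
The same check can also be run with the v2 etcdctl client (a sketch; the certificate paths are the usual OpenShift 3.x defaults under /etc/etcd and may differ on this cluster):

    etcdctl --endpoints=https://10.0.10.20:2379 \
        --ca-file=/etc/etcd/ca.crt \
        --cert-file=/etc/etcd/peer.crt \
        --key-file=/etc/etcd/peer.key \
        cluster-health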

Comment 2 Ben Parees 2018-02-21 17:10:46 UTC
This is likely caused by the lastVersion sequence in your buildconfig getting reset (perhaps you previously had a buildconfig with the same name and deleted it, but the build pods were not deleted?).

Did you actually check whether said pod already exists? If so, where did it come from, and when was it created?
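
One way to check both theories from the command line (a sketch; assumes an oc session logged in to this cluster):

    # Current build sequence counter on the buildconfig:
    oc get buildconfig $NAME -n $NAMESPACE -o jsonpath='{.status.lastVersion}'
    # Leftover build pods (named <build-name>-build) from any earlier,
    # same-named buildconfig:
    oc get pods -n $NAMESPACE | grep -- '-build'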

Comment 3 Luiz Carvalho 2018-02-21 18:50:31 UTC
I have not deleted any BuildConfig objects.

The pod does exist. I'm not sure how it was created. The information seems misleading:

BUILD:
    osbs-test-hamburger-doc-distscope-restricted-au-db9f4-2
    Status: The pod for this build already exists and is older than the build. 
    Started: Feb 21, 2018 6:52:03 AM 
    Duration: 0 seconds
    
POD:
    osbs-test-hamburger-doc-distscope-restricted-au-db9f4-2-build
    Status: Completed, ran for 51 seconds
    Build: osbs-test-hamburger-doc-distscope-restricted-au-db9f4-2
    State: Terminated at Feb 21, 2018 6:52:52 AM (Completed)

The logs for the pod, which would help me determine what created it, are missing.

Note the build attribute on the pod object. It actually links to the correct build, but at the same time, the build states that a pod already exists.

Also note the timestamps. It looks like the Build and Pod objects were created pretty much at the same time.
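
The two creationTimestamps can be pulled directly to compare them (a sketch; $BUILD is a placeholder for the build name above, and the error message suggests the controller compares exactly these two fields):

    oc get build $BUILD -n $NAMESPACE -o jsonpath='{.metadata.creationTimestamp}'
    oc get pod "$BUILD-build" -n $NAMESPACE -o jsonpath='{.metadata.creationTimestamp}'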

Comment 4 Jim Minter 2018-02-21 21:42:50 UTC
It looks like the masters on this cluster are not synced to NTP and there is a few seconds' drift between them.  Please fix up the clocks, see if the problem recurs, then update the bug with your findings.  Thanks.
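
A quick way to eyeball the drift (a sketch; the hostnames are placeholders and SSH access to each master is assumed):

    for h in master-1 master-2 master-3; do
        ssh "$h" 'echo "$(hostname): $(date -u +%H:%M:%S.%N)"'
    done
    # Confirm NTP is enabled and synchronized on each host (RHEL 7):
    ssh master-1 timedatectl | grep -i ntp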

Comment 5 Luiz Carvalho 2018-02-22 13:53:19 UTC
After enabling NTP on all hosts in the cluster, the issue has gone away.

Thanks, Jim!

Comment 6 Ben Parees 2018-02-26 16:31:08 UTC
*** Bug 1548952 has been marked as a duplicate of this bug. ***

Comment 7 Jim Minter 2018-02-26 16:55:48 UTC
https://github.com/openshift/origin/pull/18735 should make this codepath more robust in the face of clock desync between masters.

Comment 8 Takayoshi Tanaka 2018-04-03 01:43:15 UTC
One of our customers is seeing this error message during builds in Jenkins. They say the time on their masters is synced.
Do you have any idea why this error happens, other than a time sync issue?

Comment 9 Ben Parees 2018-04-03 02:56:57 UTC
Other possible causes: the build pod really does exist from a prior build attempt and the build sequence number has been reset in the buildconfig, or build objects are being created directly with old sequence numbers.
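
Both causes should be visible in the cluster state (a sketch; assumes builds carry the standard openshift.io/build.number annotation):

    # Sequence numbers recorded on existing builds:
    oc get builds -n $NAMESPACE \
        -o custom-columns='NAME:.metadata.name,NUMBER:.metadata.annotations.openshift\.io/build\.number'
    # The buildconfig's counter; a value at or below an existing build's
    # number would indicate a reset sequence:
    oc get bc $NAME -n $NAMESPACE -o jsonpath='{.status.lastVersion}'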

Comment 16 XiuJuan Wang 2018-05-18 09:00:23 UTC
Set the master's and node's time so they are not synchronized.
Started dozens of builds and could not reproduce this bug, so moving this bug to verified.

oc v3.10.0-0.47.0
openshift v3.10.0-0.47.0
kubernetes v1.10.0+b81c8f8
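
A loop along these lines could drive that kind of verification (a sketch; the exact commands used were not given in the comment):

    for i in $(seq 1 30); do
        oc start-build $NAME -n $NAMESPACE
    done
    # Then look for any build that failed with the
    # "older than the build" message:
    oc get builds -n $NAMESPACE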

Comment 18 errata-xmlrpc 2018-07-30 19:09:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816

Comment 23 Gabe Montero 2019-10-31 14:02:17 UTC
*** Bug 1719474 has been marked as a duplicate of this bug. ***