Bug 1547551
Summary: | The pod for this build already exists and is older than the build. | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Luiz Carvalho <lucarval> |
Component: | Build | Assignee: | Adam Kaplan <adam.kaplan> |
Status: | CLOSED ERRATA | QA Contact: | Wenjing Zheng <wzheng> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 3.7.0 | CC: | adam.kaplan, aos-bugs, bparees, fshaikh, hgomes, jminter, lionelve, lucarval, suchaudh, tatanaka, xiuwang |
Target Milestone: | --- | Keywords: | Reopened |
Target Release: | 3.10.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
The build controller was susceptible to incorrectly failing builds when time was not synchronised accurately between multiple masters. The controller logic has been improved to not depend on time synchronisation.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2019-05-03 18:03:02 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Luiz Carvalho
2018-02-21 14:19:02 UTC
There appear to be some etcd errors:

Feb 21 09:39:33 osbs-stage-mstr-01.ocp.osbs.upshift.eng.rdu2.redhat.com etcd[41576]: apply entries took too long [406.008943ms for 1 entries]
Feb 21 09:39:33 osbs-stage-mstr-01.ocp.osbs.upshift.eng.rdu2.redhat.com etcd[41576]: avoid queries with large range/delete range!
Feb 21 09:39:34 osbs-stage-mstr-01.ocp.osbs.upshift.eng.rdu2.redhat.com etcd[41576]: store.index: compact 1178072
Feb 21 09:39:34 osbs-stage-mstr-01.ocp.osbs.upshift.eng.rdu2.redhat.com etcd[41576]: finished scheduled compaction at 1178072 (took 2.254176ms)
Feb 21 09:39:35 osbs-stage-mstr-01.ocp.osbs.upshift.eng.rdu2.redhat.com etcd[41576]: apply entries took too long [406.920156ms for 1 entries]
Feb 21 09:39:35 osbs-stage-mstr-01.ocp.osbs.upshift.eng.rdu2.redhat.com etcd[41576]: avoid queries with large range/delete range!
Feb 21 09:39:39 osbs-stage-mstr-01.ocp.osbs.upshift.eng.rdu2.redhat.com etcd[41576]: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 3133; CANCEL")
Feb 21 09:39:39 osbs-stage-mstr-01.ocp.osbs.upshift.eng.rdu2.redhat.com etcd[41576]: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 3135; CANCEL")

But cluster health seems ok:

Cluster-Endpoints: https://10.0.10.20:2379
cURL Command: curl -X GET https://10.0.10.20:2379/v2/members
member 1f1f0203c9b14f3e is healthy: got healthy result from https://10.0.10.13:2379
member 583b291be906b37b is healthy: got healthy result from https://10.0.10.20:2379
member 7136e824e9edb4dd is healthy: got healthy result from https://10.0.10.21:2379
cluster is healthy

This is likely caused by the lastversion sequence in your buildconfig getting reset (perhaps you previously had a buildconfig w/ the same name and deleted it but the build pods were not deleted?)

Did you actually check if said pod does already exist? If so, where did it come from? When was it created, etcd?

I have not deleted any BuildConfig objects. The pod does exist. I'm not sure how it was created. The information seems misleading:

BUILD: osbs-test-hamburger-doc-distscope-restricted-au-db9f4-2
Status: The pod for this build already exists and is older than the build.
Started: Feb 21, 2018 6:52:03 AM
Duration: 0 seconds

POD: osbs-test-hamburger-doc-distscope-restricted-au-db9f4-2-build
Status: Completed, ran for 51 seconds
Build: Build: osbs-test-hamburger-doc-distscope-restricted-au-db9f4-2
State: Terminated at Feb 21, 2018 6:52:52 AM (Completed)

The logs for the pod, which would help me determine what created it, are missing. Note the build attribute on the pod object. It actually links to the correct build, but at the same time, the build states that a pod already exists. Also note the timestamps. It looks like the Build and Pod objects were created pretty much at the same time.

It looks like the masters on this cluster are not synced to NTP and there are a few seconds of drift between them. Please fix up the clocks, see if the problem recurs, then update the bug with your findings. Thanks.

After enabling NTP on all hosts in the cluster, the issue has gone away. Thanks, Jim!

*** Bug 1548952 has been marked as a duplicate of this bug. ***

https://github.com/openshift/origin/pull/18735 should make this codepath more robust in the face of clock desync between masters.

One of our customers is seeing this error message during builds in Jenkins. They say the clocks on their masters are synced. Do you have any idea what, other than a time sync issue, could cause this error?

Other possible causes are that the build pod really does exist from a prior attempt to build and the build sequence number has been reset in the buildconfig, or build objects are being created directly w/ old sequence numbers.

Set the master and node clocks out of sync, then ran start-build for dozens of builds; could not reproduce this bug. Moving this bug to verified.
oc v3.10.0-0.47.0
openshift v3.10.0-0.47.0
kubernetes v1.10.0+b81c8f8

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816

*** Bug 1719474 has been marked as a duplicate of this bug. ***
|
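
For readers trying to picture the failure mode in the Doc Text, here is a minimal Python sketch of why an "is the pod older than the build?" comparison can misfire when two masters stamp objects with different clocks. It is illustrative only and is not the build controller's code: the `Build` and `Pod` classes, the function names, the three-second skew, the UID strings, and the use of UTC are all assumptions made for the example. The ownership-based check is one clock-independent alternative; this sketch does not claim to reproduce what the linked pull request actually changed.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class Build:
    name: str
    uid: str
    created: datetime  # stamped by whichever master handled the create


@dataclass
class Pod:
    name: str
    created: datetime
    owner_uids: list = field(default_factory=list)  # UIDs of owning objects


def pod_predates_build(pod, build):
    """Timestamp comparison: only meaningful if all masters share one clock."""
    return pod.created < build.created


def pod_owned_by_build(pod, build):
    """Clock-independent check: does the pod list this build as an owner?"""
    return build.uid in pod.owner_uids


# Hypothetical scenario: master A's clock runs 3 seconds fast.
real_time = datetime(2018, 2, 21, 6, 52, 3, tzinfo=timezone.utc)
skew = timedelta(seconds=3)

# Master A creates the build and stamps it with its fast clock.
build = Build("example-2", uid="build-uid-1", created=real_time + skew)

# One real second later, master B (accurate clock) creates the build pod.
pod = Pod(
    "example-2-build",
    created=real_time + timedelta(seconds=1),
    owner_uids=[build.uid],
)

# A leftover pod from an unrelated, earlier build with the same name.
stale_pod = Pod(
    "example-2-build",
    created=real_time - timedelta(days=1),
    owner_uids=["some-other-build-uid"],
)

false_alarm = pod_predates_build(pod, build)   # skew makes the new pod look old
owned = pod_owned_by_build(pod, build)         # ownership still identifies it
stale_owned = pod_owned_by_build(stale_pod, build)
```

In this scenario the timestamp check flags a freshly created pod as pre-existing, while the ownership check accepts it and still rejects the genuinely stale pod, which matches the other causes discussed in the thread (a reset sequence number leaving an old pod behind).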