Bug 1891362

Summary: Wrong metrics count for openshift_build_result_total
Product: OpenShift Container Platform
Reporter: wewang <wewang>
Component: Build
Assignee: Adam Kaplan <adam.kaplan>
Status: CLOSED ERRATA
QA Contact: wewang <wewang>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.7
CC: aos-bugs, wzheng
Target Milestone: ---
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-02-24 15:28:00 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description wewang 2020-10-26 01:23:33 UTC
Description of problem:
The actual number of successful Docker builds is 3, but the metric openshift_build_result_total{result="success",strategy="docker"} reports 4.


Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2020-10-22-175439

How reproducible:
always

Steps to Reproduce:
[wewang@wangwen work]$ oc -n openshift-controller-manager exec controller-manager-zc695  -- curl -k -H "Authorization: Bearer $token" 'https://10.129.0.5:8443/metrics'   |grep "openshift_build_result_total"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
# HELP openshift_build_result_total [ALPHA] Counts the total number of finished builds across all namespaces by result and strategy
# TYPE openshift_build_result_total counter
openshift_build_result_total{result="failed",strategy="docker"} 3
openshift_build_result_total{result="failed",strategy="source"} 2
openshift_build_result_total{result="success",strategy="docker"} 4
openshift_build_result_total{result="success",strategy="source"} 2
100 68539    0 68539    0     0  5148k      0 --:--:-- --:--:-- --:--:-- 5148k
 
[wewang@wangwen work]$ oc get builds
NAME               TYPE     FROM          STATUS                       STARTED          DURATION
build-src1-1       Source   Git@57073c0   Complete                     26 minutes ago   1m31s
build-src2-1       Source   Git@57073c0   Complete                     26 minutes ago   1m24s
build-src3-1       Source   Git           Failed (FetchSourceFailed)   26 minutes ago   12s
build-docker-1-1   Docker   Git@57073c0   Complete                     25 minutes ago   1m5s
build-docker-2-1   Docker   Git@57073c0   Complete                     22 minutes ago   46s
build-docker-3-1   Docker   Git@57073c0   Complete                     22 minutes ago   49s
build-docker-4-1   Docker   Git           Failed (FetchSourceFailed)   22 minutes ago   3s
build-docker-5-1   Docker   Git           Failed (FetchSourceFailed)   22 minutes ago   3s

Actual results:
The metric counts do not match the actual number of successful builds.

Expected results:
The metric counts should match the actual number of successful builds.


Additional info:
The same issue occurs with openshift_build_result_total{result="failed",strategy="source"} and openshift_build_result_total{result="failed",strategy="docker"}.
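
To make the mismatch explicit, the following minimal Python sketch tallies the builds listed by `oc get builds` and compares them with the metric samples from the curl output. The data is copied from this report; the helper names (parse_metrics, count_builds, find_mismatches) are hypothetical, and the sketch assumes that a "Complete" status maps to result="success", that any "Failed" status maps to result="failed", and that the listed builds are the only ones the controller has counted.

```python
import re
from collections import Counter

# Metric samples copied from the curl output in this report.
METRICS = """\
openshift_build_result_total{result="failed",strategy="docker"} 3
openshift_build_result_total{result="failed",strategy="source"} 2
openshift_build_result_total{result="success",strategy="docker"} 4
openshift_build_result_total{result="success",strategy="source"} 2
"""

# (name, type, status) rows copied from the `oc get builds` output.
BUILDS = [
    ("build-src1-1", "Source", "Complete"),
    ("build-src2-1", "Source", "Complete"),
    ("build-src3-1", "Source", "Failed (FetchSourceFailed)"),
    ("build-docker-1-1", "Docker", "Complete"),
    ("build-docker-2-1", "Docker", "Complete"),
    ("build-docker-3-1", "Docker", "Complete"),
    ("build-docker-4-1", "Docker", "Failed (FetchSourceFailed)"),
    ("build-docker-5-1", "Docker", "Failed (FetchSourceFailed)"),
]

SAMPLE = re.compile(
    r'openshift_build_result_total\{result="(\w+)",strategy="(\w+)"\} (\d+)'
)


def parse_metrics(text):
    """Return {(result, strategy): value} for each metric sample in text."""
    return {(m.group(1), m.group(2)): int(m.group(3)) for m in SAMPLE.finditer(text)}


def count_builds(rows):
    """Tally builds by (result, strategy); assumes Complete means success."""
    counts = Counter()
    for _name, build_type, status in rows:
        result = "success" if status == "Complete" else "failed"
        counts[(result, build_type.lower())] += 1
    return counts


def find_mismatches(metrics, actual):
    """Return {(result, strategy): metric - actual} where the two differ."""
    return {
        key: value - actual.get(key, 0)
        for key, value in metrics.items()
        if value != actual.get(key, 0)
    }


if __name__ == "__main__":
    print(find_mismatches(parse_metrics(METRICS), count_builds(BUILDS)))
```

Under these assumptions, three of the four series (failed/docker, failed/source, and success/docker) are each one higher than the number of listed builds, while success/source matches.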

Comment 1 Adam Kaplan 2020-10-27 18:41:23 UTC
The metric appears to work well for successful builds. However, failed builds can frequently hit the "completed build" method calls more than once in their lifecycle (at least 1/3 of the time). This wasn't an issue until this metric was introduced - most other operations in the completed build steps are idempotent.

The "Failed" phase is unique in that the build container reports this phase transition, _not_ the build controller. We should enhance builds to use detailed failure conditions, which was alluded to in BUILD-73 [1]. Then we can have the build controller take over the "Running" -> "Failed" phase transition using the conditions reported by the build pod.


[1] https://issues.redhat.com/browse/BUILD-73
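
The overcounting described above can be illustrated with a small Python sketch. This is not the build controller's code and not the change proposed in this comment; it only shows, under stated assumptions, why a counter increment is not idempotent when the completion handler runs more than once for the same build, and how deduplicating on a build identifier would avoid that. The class names and the on_build_completed method are hypothetical.

```python
from collections import Counter


class NaiveBuildResultCounter:
    """Increments on every completion call, so repeated calls overcount."""

    def __init__(self):
        self.totals = Counter()

    def on_build_completed(self, build_uid, result, strategy):
        # No record of which builds were already counted.
        self.totals[(result, strategy)] += 1


class IdempotentBuildResultCounter:
    """Counts each build at most once, keyed by a (hypothetical) build UID."""

    def __init__(self):
        self.totals = Counter()
        self._seen = set()

    def on_build_completed(self, build_uid, result, strategy):
        if build_uid in self._seen:
            return  # Already counted; a repeated call is a no-op.
        self._seen.add(build_uid)
        self.totals[(result, strategy)] += 1


if __name__ == "__main__":
    naive = NaiveBuildResultCounter()
    guarded = IdempotentBuildResultCounter()
    # Simulate the completion handler firing twice for one failed build.
    for counter in (naive, guarded):
        counter.on_build_completed("uid-1", "failed", "docker")
        counter.on_build_completed("uid-1", "failed", "docker")
    print(naive.totals[("failed", "docker")], guarded.totals[("failed", "docker")])
```

An in-memory set like this would not survive a controller restart and grows with the number of builds, so it is only a way to visualize the problem, not a suggested fix.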

Comment 3 Adam Kaplan 2020-11-12 13:27:49 UTC
This metric has been removed in 4.7. It may be reintroduced in a future release.

Comment 4 wewang 2020-11-13 06:36:18 UTC
Got it.

Comment 7 errata-xmlrpc 2021-02-24 15:28:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0
security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633