Bug 1937535

Summary: Not all image pulls within OpenShift builds retry
Product: OpenShift Container Platform Reporter: Hongkai Liu <hongkliu>
Component: Build Assignee: Nalin Dahyabhai <nalin>
Status: CLOSED ERRATA QA Contact: XiuJuan Wang <xiuwang>
Severity: medium Docs Contact: Rolfe Dlugy-Hegwer <rdlugyhe>
Priority: unspecified    
Version: 4.7 CC: adam.kaplan, aos-bugs, gmontero, nalin, wking
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
When OpenShift builds interact with image registries, for example when pulling base images, intermittent communication issues can cause build failures. The current release increases the number of retries for these interactions. Now, OpenShift builds are more resilient when they encounter intermittent communication issues with image registries.
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-07-27 22:52:37 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1940052    

Description Hongkai Liu 2021-03-10 21:43:06 UTC
Description of problem:

A build is unstable and fails with this buildah error:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ci-tools/1772/pull-ci-openshift-ci-tools-master-images/1369684570110169088


2021/03/10 16:30:58 Build private-org-peribolos-sync failed, printing logs:
Caching blobs under "/var/cache/blobs".
Getting image source signatures
Copying blob sha256:af875c55da53e4fe440c266863aa55b31df88175d5b2eb9b3872bf796e99887c
Copying blob sha256:2d473b07cdd5f0912cd6f1a703352c82b512407db6b05b43f2553732b55df3bc
Copying blob sha256:67a658d3535469bd82dfdda6899da83dc51bfa0d11f63aa4cb014ddd280ae1ae
Copying blob sha256:73e99c44efe07b295629455f26c365cb00ff358b480af2cf1bc6bc428d94dabe
Copying blob sha256:92aabcd08403dce8bf1631319292135665aa22825f5398f2d9e36e67fa44c84c
Copying blob sha256:72f2c078316a3eab7bbaa6b6053d0e24d3c657caaabfc7b18fc19e27a18c461c
error: error creating buildah builder: Error writing blob: error storing blob to file "/var/tmp/storage478406705/1": error happened during read: read tcp 10.130.5.94:54490->172.217.204.128:443: read: connection reset by peer

Version-Release number of selected component (if applicable):
oc --context build02 get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0     True        False         5d3h    Cluster version is 4.7.0

Additional info:
Nalin has pinpointed the cause of the error: the pull is not retried when the connection to the registry fails.
https://github.com/openshift/builder/pull/222

We created this bug to make backports easier.

Comment 3 XiuJuan Wang 2021-03-17 10:07:20 UTC
Gabe,
I could reproduce this bug on a 4.8.0-0.nightly-2021-03-17-014745 cluster with this scenario:

step 1:
 Specify a private image as the source image and add a pull secret.

  source:
    git:
      uri: https://github.com/openshift/ruby-hello-world.git
    images:
    - from:
        kind: DockerImage
        name: 172.30.128.188:5000/busybox
      paths:
      - destinationDir: openshiftqedir
        sourcePath: /opt/app-root
      pullSecret:
        name: test
    type: Git
  strategy:
    sourceStrategy:
      env:
      - name: EXAMPLE
        value: sample-app
      from:
        kind: ImageStreamTag
        name: ruby:2.7
        namespace: openshift
    type: Source

step 2:
Trigger the build. The build failed while pulling the private source image, without any retry.

$ oc logs -f build/ruby-sample-build-1
Cloning "https://github.com/openshift/ruby-hello-world.git" ...
	Commit:	f476e11e538445e76470b0c63252b49e294a51d2 (Merge pull request #121 from vrutkovs/ruby-2.7)
	Author:	Ben Parees <bparees.github.com>
	Date:	Wed Mar 10 09:52:09 2021 -0500
Caching blobs under "/var/cache/blobs".
error: error creating buildah builder: Error initializing source docker://172.30.128.188:5000/busybox:latest: error pinging docker registry 172.30.128.188:5000: Get "https://172.30.128.188:5000/v2/": http: server gave HTTP response to HTTPS client


Successful scenario:
Trigger a build that pulls a private image, set an invalid secret at first, then quickly correct the secret.

  source:
    git:
      uri: http://github.com/openshift/rails-ex.git
    type: Git
  strategy:
    sourceStrategy:
      from:
        kind: ImageStreamTag
        name: mystream:latest
        namespace: rhf34
      pullSecret:
        name: test

$ oc logs -f build/rails-ex-8 
Cloning "http://github.com/openshift/rails-ex.git" ...
	Commit:	9e6fe17f934b87b9a399e2623d6c7dfcebd4b530 (Merge pull request #130 from pvalena/bundler)
	Author:	Pavel Valena <pvalena>
	Date:	Wed Sep 16 16:23:12 2020 +0200
Caching blobs under "/var/cache/blobs".
error trying to parse file /var/run/secrets/openshift.io/pull/.dockerconfigjson: illegal base64 data at input byte 28
Warning: Pull failed, retrying in 5s ...
Getting image source signatures
Copying blob sha256:0669b0daf1fba90642d105f3bc2c94365c5282155a33cc65ac946347a90d90d1
Copying config sha256:83aa35aa1c79e4b6957e018da6e322bfca92bf3b4696a211b42502543c242d6f
Writing manifest to image destination
Storing signatures
Generating dockerfile with builder image 172.30.128.188:5000/busybox@sha256:afe605d272837ce1732f390966166c2afff5391208ddd57de10942748694049d

Comment 4 Gabe Montero 2021-03-17 14:05:58 UTC
Hey XiuJuan

So I dove into 

error: error creating buildah builder: Error initializing source docker://172.30.128.188:5000/busybox:latest: error pinging docker registry 172.30.128.188:5000: Get "https://172.30.128.188:5000/v2/": http: server gave HTTP response to HTTPS client

The top-level error there corresponds to

https://github.com/openshift/builder/blob/c910b5cd6c0e0a284c544d3fd98d1ddf8167cbc7/pkg/build/builder/source.go#L451-L458

which is where Nalin added retry.

If you then work off the "Error initializing source", you get into the retry copy logic of containers image.  The thing is, that logic does not retry on just any error.  It distinguishes intermittent errors from ones that will persist.

For reference:  https://github.com/openshift/builder/blob/f9787dc13c7cff8ccbb6dd5d93a9bfddc2412ed0/vendor/github.com/containers/common/pkg/retry/retry.go#L45-L95


A server giving an "HTTP response to HTTPS client" is one of those persistent, permanent-failure errors.  So the lack of retry there is good / expected.

Based on that, and the retry you were able to identify, I'm marking this verified.

thanks

Comment 6 Rolfe Dlugy-Hegwer 2021-04-09 11:13:04 UTC
Supporting information for release notes:

Cause: Intermittent communication issues can occur when interacting with image registries.

Consequence: Certain interactions between OpenShift builds and image registries, for example pulling images as source, could result in build failures when those intermittent issues occurred.

Fix: Retries were added to image pulls for all permutations of interaction between OpenShift builds and image registries.

Result: OpenShift builds are now more resilient when they encounter intermittent communication issues with image registries.

Comment 8 Nalin Dahyabhai 2021-06-08 17:26:30 UTC
I suggest changing "source images" to "base images" in the doc text, since we're talking about what's usually called the base image in a Dockerfile, that's what we use it for in "Docker" strategy builds, and it's how we use the s2i builder image in "Source" strategy builds.

Comment 9 Rolfe Dlugy-Hegwer 2021-06-08 19:42:58 UTC
Thanks, Nalin. Updated.

Comment 12 errata-xmlrpc 2021-07-27 22:52:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438