Bug 1703156 - the build machine-os-content failed with reason DockerBuildFailed, Couldn't resolve host name for https://mirrors.fedoraproject.org
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.4.0
Assignee: Vadim Rutkovsky
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-04-25 16:19 UTC by bpeterse
Modified: 2020-01-30 21:20 UTC (History)

Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-30 21:20:55 UTC
Target Upstream Version:
Embargoed:


Attachments
Instances of this error over the past 24 hours (597.06 KB, image/png)
2019-04-26 06:16 UTC, W. Trevor King

Description bpeterse 2019-04-25 16:19:30 UTC
Description of problem:

The error:

```
could not wait for build: the build machine-os-content failed with reason DockerBuildFailed: Docker build strategy has failed

The downloaded packages were saved in cache until the next successful transaction.
You can remove cached packages by executing 'dnf clean packages'.
Error: Error downloading packages:
  Curl error (6): Couldn't resolve host name for https://m...=x86_64 [Could not resolve host: mirrors.fedoraproject.org]
error: build error: running 'set -x && yum install -y ostr...erlay RPMs" --branch=origin-ci-dev' failed with exit code 1
```
has happened approximately 1% of the time.  See example:

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22653/pull-ci-openshift-origin-master-e2e-aws-serial/5299

It seems unlikely that we should have issues reaching mirrors.fedoraproject.org.  It's not clear from the logs whether a retry is already built in.
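
On the retry question: the logs do not show whether the build retries, so the following is only a minimal sketch of what a retry with exponential backoff around a flaky step could look like. The function name `retry_with_backoff` and its parameters are hypothetical and are not taken from the Dockerfile or from any OpenShift tooling.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` until it succeeds or `attempts` is exhausted.

    The delay doubles after each failure: base_delay, 2 * base_delay, ...
    If every attempt fails, the last exception is re-raised.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            # Out of attempts: surface the original error to the caller.
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

As a usage illustration, the wrapped operation could be a name lookup such as `lambda: socket.getaddrinfo("mirrors.fedoraproject.org", 443)`, or a call that runs the package install step. A transient resolver failure would then cost a few seconds of delay instead of failing the whole build.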

Comment 1 Micah Abbott 2019-04-25 16:52:30 UTC
This is coming from https://github.com/openshift/origin/blob/master/images/os/Dockerfile

We are working towards replacing this with the use of `coreos-assembler` to perform the `machine-os-content` builds.  See this related PR which helps enable this work - https://github.com/coreos/coreos-assembler/pull/489

Comment 3 Jeff Ligon 2019-04-25 19:46:27 UTC
Not blocking the release for this, but if it gets fixed, so be it.

Comment 4 W. Trevor King 2019-04-26 06:16:53 UTC
Created attachment 1558927 [details]
Instances of this error over the past 24 hours

Five jobs failed with this error message today, but all of them were from a single PR [1].  The first failure was slow (16+ minutes [2]).  The remaining failures were all under one minute [3,4,5,6].  I'm pretty convinced that you got unlucky with one mirrors.fedoraproject.org connection, and then CI keeps replaying that cached failure (bug 1695507).  You should be able to recover by removing the project to clear the cache:

  $ oc delete project ci-op-n34w1184

and kicking the test again.

[1]: https://github.com/openshift/origin/pull/22653
[2]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22653/pull-ci-openshift-origin-master-e2e-aws-serial/5286
[3]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22653/pull-ci-openshift-origin-master-e2e-aws-serial/5290
[4]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22653/pull-ci-openshift-origin-master-e2e-aws-serial/5295
[5]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22653/pull-ci-openshift-origin-master-e2e-aws-serial/5296
[6]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22653/pull-ci-openshift-origin-master-e2e-aws-serial/5299
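
The reasoning in this comment (one slow initial failure followed by fast ones suggests a cached failure being replayed) can be sketched as a simple heuristic. This is an illustration only: the function name is hypothetical, and the 60-second default merely mirrors the "under one minute" observation above.

```python
from typing import List


def looks_like_cached_replay(
    durations_seconds: List[float],
    fast_threshold: float = 60.0,
) -> bool:
    """Return True when the first failure was slow and all later ones were fast.

    `durations_seconds` lists failure durations in chronological order.
    """
    if len(durations_seconds) < 2:
        # A single failure cannot show a replay pattern.
        return False
    first, rest = durations_seconds[0], durations_seconds[1:]
    return first >= fast_threshold and all(d < fast_threshold for d in rest)
```

A sequence with a first failure of 16 minutes followed by several sub-minute failures would match this pattern, which is the case where deleting the project to clear the cache is worth trying.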

Comment 6 Steve Milner 2019-06-24 20:58:16 UTC
(In reply to Micah Abbott from comment #1)
> This is coming from
> https://github.com/openshift/origin/blob/master/images/os/Dockerfile
> 
> We are working towards replacing this with the use of `coreos-assembler` to
> perform the `machine-os-content` builds.  See this related PR which helps
> enable this work - https://github.com/coreos/coreos-assembler/pull/489

I believe this is due to faux `machine-os-content` being generated for testing using https://github.com/openshift/imagebuilder to create an image. Is the idea to replace the use of `imagebuilder` in origin's context with `cosa` and the dev-overlay command?

Comment 7 Colin Walters 2019-06-24 21:24:01 UTC
I don't think the problem is related to imagebuilder, it's a generic DNS failure which could happen with any tool; we're currently running `yum` at build time and particularly with Fedora infrastructure that is known to be flaky.

Using cosa would avoid doing `yum install ostree`, though at a notable cost of downloading a rather larger image.

Then bigger picture the idea indeed is to use dev-overlay for this, but that's not directly related to the build flake.

Comment 8 Steve Milner 2019-06-25 13:29:26 UTC
(In reply to Colin Walters from comment #7)
> I don't think the problem is related to imagebuilder, it's a generic DNS
> failure which could happen with any tool; we're currently running `yum` at
> build time and particularly with Fedora infrastructure that is known to be
> flaky.
> 
> Using cosa would avoid doing `yum install ostree`, though at a notable cost
> of downloading a rather larger image.

That's where I was heading with my question :-)

> Then bigger picture the idea indeed is to use dev-overlay for this, but
> that's not directly related to the build flake.

For the time being, would moving the CI image off of Fedora and onto RHEL make sense? The bigger picture would then be having origin folks take advantage of dev-overlay _or_ RHCOS folks helping origin developers utilize cosa?

Comment 13 Micah Abbott 2019-11-08 15:50:48 UTC
Pushing to 4.4 and reassigning to Vadim since he owns GRPA-392

Comment 14 Colin Walters 2020-01-30 21:20:55 UTC
Not worth tracking as a bug

