Bug 1979966 - OCP builds always fail when run on RHEL7 nodes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Build
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: Nalin Dahyabhai
QA Contact: Jitendar Singh
URL:
Whiteboard:
Duplicates: 2023942
Depends On:
Blocks: 2026589 2037776
Reported: 2021-07-07 14:08 UTC by Stephen Benjamin
Modified: 2022-07-29 11:07 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2026589
Environment:
job=periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-workers-rhel7=all
Last Closed: 2022-03-12 04:35:46 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | containers storage pull 1049 | 0 | None | Merged | overlay.get: if we're ignoring metacopy=on, ignore it when it's global | 2022-01-05 04:58:34 UTC
Github | openshift builder pull 275 | 0 | None | Merged | Bug 1979966: Update containers/storage to address incorrect overlay options being set on rhel7 nodes | 2022-01-05 04:58:29 UTC
Red Hat Knowledge Base (Solution) | 6676981 | 0 | None | None | None | 2022-01-27 08:00:43 UTC
Red Hat Product Errata | RHSA-2022:0056 | 0 | None | None | None | 2022-03-12 04:36:15 UTC

Description Stephen Benjamin 2021-07-07 14:08:05 UTC
job:
periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-workers-rhel7 

is failing frequently in CI, see testgrid results:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-informing#periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-workers-rhel7

It looks like this started failing permanently around June 30.

Comment 1 Russell Teague 2021-07-07 15:27:24 UTC
RHEL scaleup completes but the step fails waiting for machineconfigpools/worker to update:

2021-07-07 13:48:47+00:00 - Waiting for worker machineconfigpool to update
+ oc wait machineconfigpool/worker --for=condition=Updated=True --timeout=10m
error: timed out waiting for the condition on machineconfigpools/worker


Also, before this step we attempt to delete the RHCOS nodes, and that step also times out waiting for the nodes to be deleted.

2021-07-07 13:36:45+00:00 - Waiting for CoreOS nodes to be removed
+ oc wait node --for=delete --timeout=10m --selector node.openshift.io/os_id=rhcos,node-role.kubernetes.io/worker
node/ip-10-0-136-97.us-west-1.compute.internal condition met
node/ip-10-0-139-212.us-west-1.compute.internal condition met
error: timed out waiting for the condition on nodes/ip-10-0-195-255.us-west-1.compute.internal

Comment 2 Russell Teague 2021-08-02 18:09:41 UTC
Needs to be prioritized.

Comment 3 Russell Teague 2021-08-24 17:49:47 UTC
Will review again for a future sprint.

Comment 4 Russell Teague 2021-08-24 19:23:47 UTC
Looking at current failures, the problem mentioned in comment 1 is no longer happening.  The job is now failing on many tests, mostly those tagged [sig-builds].

https://search.ci.openshift.org/?search=failed%3A.*sig-builds&maxAge=336h&context=1&type=build-log&name=nightly-4.9-e2e-aws-workers-rhel7&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 5 Matthew Staebler 2021-10-28 02:18:05 UTC
It seems like there may be problems with how images are being built on the rhel workers.

From https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-workers-rhel7/1453429013811826688,
~~~
Oct 27 19:48:50.266: INFO: Running 'oc --namespace=e2e-test-custom-build-mg99l --kubeconfig=/tmp/configfile033660499 logs -f build.build.openshift.io/custom-builder-image-1 --timestamps --v 10'
Oct 27 19:48:50.703: INFO: 2021-10-27T19:48:32.928390267Z Receiving source from STDIN as archive ...
2021-10-27T19:48:36.018068153Z time="2021-10-27T19:48:36Z" level=info msg="metacopy option not supported on this kernelmetacopy=on"
2021-10-27T19:48:36.028271597Z time="2021-10-27T19:48:36Z" level=info msg="Not using native diff for overlay, this may cause degraded performance for building images: failed to mount overlay: invalid argument"
2021-10-27T19:48:36.033946337Z I1027 19:48:36.033922       1 defaults.go:102] Defaulting to storage driver "overlay" with options [mountopt=metacopy=on].
2021-10-27T19:48:36.060829210Z Caching blobs under "/var/cache/blobs".
2021-10-27T19:48:36.062453705Z 
2021-10-27T19:48:36.062453705Z Pulling image registry.redhat.io/rhel8/buildah:latest ...
2021-10-27T19:48:37.137004711Z Getting image source signatures
2021-10-27T19:48:37.352238984Z Copying blob sha256:06038631a24a25348b51d1bfc7d0a0ee555552a8998f8328f9b657d02dd4c64c
2021-10-27T19:48:37.359880877Z Copying blob sha256:262268b65bd5f33784d6a61514964887bc18bc00c60c588bc62bfae7edca46f1
2021-10-27T19:48:40.364347355Z Copying blob sha256:b794e6c09d5c032e7e212bc66f7b125b429381e722d87502e48253ef580f54d8
2021-10-27T19:48:43.185523971Z Copying config sha256:d19c0a0e81fa7244281d1df2f85594408ceeac80101cac33fc115393dbdacc8e
2021-10-27T19:48:43.195644584Z Writing manifest to image destination
2021-10-27T19:48:43.197240176Z Storing signatures
2021-10-27T19:48:49.189933108Z Adding transient rw bind mount for /run/secrets/rhsm
2021-10-27T19:48:49.191279569Z STEP 1: FROM registry.redhat.io/rhel8/buildah:latest
2021-10-27T19:48:49.230217051Z time="2021-10-27T19:48:49Z" level=error msg="error unmounting /var/lib/containers/storage/overlay/d456bbc62bcb2af7a7f9ee6f146e8ee71d65bedff6b221fe92e23488b8b2f04e/merged: invalid argument"
2021-10-27T19:48:49.319969178Z error: build error: error mounting new container: error mounting build container "0ad7cb573de4467bd0e4980e266fdd05ceaccecc1495dd9558cb4d88aed66968": error creating overlay mount to /var/lib/containers/storage/overlay/d456bbc62bcb2af7a7f9ee6f146e8ee71d65bedff6b221fe92e23488b8b2f04e/merged, mount_data="metacopy=on,lowerdir=/var/lib/containers/storage/overlay/l/TCAZGT2TQQ6LTJXT34DEBHF2J6:/var/lib/containers/storage/overlay/l/MFETL4Y53FI7A3OLWO6L6LUXLR:/var/lib/containers/storage/overlay/l/MN4J7IGIFV3UR43ZVJ3BEOHATY,upperdir=/var/lib/containers/storage/overlay/d456bbc62bcb2af7a7f9ee6f146e8ee71d65bedff6b221fe92e23488b8b2f04e/diff,workdir=/var/lib/containers/storage/overlay/d456bbc62bcb2af7a7f9ee6f146e8ee71d65bedff6b221fe92e23488b8b2f04e/work": invalid argument
~~~

Comment 6 Matthew Staebler 2021-10-28 02:20:17 UTC
Could we enlist some assistance from the Build team in helping to diagnose what the issue may be with the failure to perform builds on rhel7 workers?

Comment 9 Adam Kaplan 2021-11-16 23:31:16 UTC
*** Bug 2023942 has been marked as a duplicate of this bug. ***

Comment 12 Nalin Dahyabhai 2021-11-26 17:07:24 UTC
Not completed during this sprint.

Comment 17 Adam Kaplan 2021-12-08 21:05:07 UTC
Root cause:

Recent upgrades to buildah and its related libraries cause buildah to set incorrect options for the overlayfs storage driver on RHEL 7 hosts (the `metacopy=on` mount option, which the RHEL 7 kernel does not support).
This currently only impacts OCP 4.9 clusters with RHEL7 worker nodes - 4.8 and earlier versions are not impacted.

Workaround:

Builds continue to function on RHCOS worker nodes if such nodes can be provisioned.
The RHEL7 worker nodes do not need to be torn down - developers can use the following NodeSelector on their BuildConfig objects to ensure that builds only run on RHCOS nodes [1]:

"node.openshift.io/os_id: rhcos"

This same NodeSelector can be applied to all builds cluster-wide using the buildOverrides configuration option [2].

[1] https://docs.openshift.com/container-platform/4.9/cicd/builds/advanced-build-operations.html#builds-assigning-builds-to-nodes_advanced-build-operations
[2] https://docs.openshift.com/container-platform/4.9/cicd/builds/build-configuration.html
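
The two forms of the workaround described above could look roughly like the following. This is only a sketch: the BuildConfig name and namespace are hypothetical placeholders, and the rest of the BuildConfig spec is assumed to stay as it already is. The cluster-wide form assumes the Build configuration resource named `cluster` described in [2].

```yaml
# Per-BuildConfig form: pin builds from this BuildConfig to RHCOS nodes.
# "my-app" and "my-project" are placeholder names, not taken from this bug.
apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: my-app
  namespace: my-project
spec:
  nodeSelector:
    node.openshift.io/os_id: rhcos
  # source, strategy, output and triggers stay unchanged
---
# Cluster-wide form: override the node selector for every build.
apiVersion: config.openshift.io/v1
kind: Build
metadata:
  name: cluster
spec:
  buildOverrides:
    nodeSelector:
      node.openshift.io/os_id: rhcos
```

The per-BuildConfig form only affects builds started from that object, while the cluster-wide override applies to all builds, so it should be removed again once the nodes run a fixed version.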

Comment 19 W. Trevor King 2021-12-08 21:29:42 UTC
Comment 17 dropped UpgradeBlocker, so I'm clearing ImpactStatementRequested.

Comment 21 Priti Kumari 2022-01-06 13:46:14 UTC
Verified OCP builds on RHEL worker nodes with OCP 4.10.0-0.nightly-2021-12-10-033652

========================

1. Created a project testing-rhel
2. Applied a BuildConfig with nodeSelector `node.openshift.io/os_id: rhel`
buildconfig.yaml:

```
apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: example
  namespace: testing-rhel
spec:
  nodeSelector:
    node.openshift.io/os_id: rhel
  source:
    git:
      ref: master
      uri: 'https://github.com/openshift/ruby-ex.git'
    type: Git
  strategy:
    type: Source
    sourceStrategy:
      from:
        kind: ImageStreamTag
        name: 'ruby:2.7'
        namespace: openshift
      env: []
  triggers:
    - type: ImageChange
      imageChange: {}
    - type: ConfigChange
```

3. The build completes successfully
$ oc get builds
NAME        TYPE     FROM          STATUS     STARTED          DURATION
example-1   Source   Git@01effef   Complete   54 seconds ago   54s

$ oc get bc
NAME      TYPE     FROM         LATEST
example   Source   Git@master   1

$ oc get pods -o wide
NAME              READY   STATUS      RESTARTS   AGE   IP            NODE                       NOMINATED NODE   READINESS GATES
example-1-build   0/1     Completed   0          69s   <ip>   jshjs-lshvl-w-a-l-rhel-1   <none>           <none>

$ oc logs pod/example-1-build
time="2022-01-06T10:09:53Z" level=info msg="metacopy [...]
"/var/cache/blobs".
Trying to pull image-registry.openshift-image-registry.svc:5000/openshift/ruby@sha256:19f0b4c21e1b5e77d5442515719a543d0dca5c6b7f57bfaeea8c5e100ae63232...
Getting image source signatures
Copying blob sha256:b46ca46c303b49d886a7585735ebd1dc8651e83d0fab5823300cf3a9fd2febc1
Copying blob sha256:ac08ca107ad9ed699cbd28339749dd6463a84c73aa1d468a4241385fc4ec3876
[...]
Writing manifest to image destination
Storing signatures
Generating dockerfile with builder image image-registry.openshift-image-registry.svc:5000/openshift/ruby@sha256:19f0b4c21e1b5e77d5442515719a543d0dca5c6b7f57bfaeea8c5e100ae63232
Adding transient rw bind mount for /run/secrets/rhsm
Adding transient rw bind mount for /run/secrets/redhat.repo
STEP 1/9: FROM image-registry.openshift-image-registry.svc:5000/openshift/ruby@sha256:19f0b4c21e1b5e77d5442515719a543d0dca5c6b7f57bfaeea8c5e100ae63232
time="2022-01-06T10:10:07Z" level=warning msg="Ignoring global metacopy option, not supported with booted kernel"
STEP 2/9: LABEL "io.openshift.build.image"="image-registry.openshift-image-registry.svc:5000/openshift/ruby@sha256:19f0b4c21e1b5e77d5442515719a543d0dca5c6b7f57bfaeea8c5e100ae63232"       "io.openshift.build.commit.author"="Honza Horak <hhorak>"       "io.openshift.build.commit.date"="Fri Aug 21 13:44:47 2020 +0200"       "io.openshift.build.commit.id"="01effef3a23935c1a83110d4b074b0738d677c44"       "io.openshift.build.commit.ref"="master"       "io.openshift.build.commit.message"="Merge pull request #35 from pvalena/bundler"       "io.openshift.build.source-location"="https://github.com/openshift/ruby-ex.git"
STEP 3/9: ENV OPENSHIFT_BUILD_NAME="example-1"     OPENSHIFT_BUILD_NAMESPACE="testing-rhel"     OPENSHIFT_BUILD_SOURCE="https://github.com/openshift/ruby-ex.git"     OPENSHIFT_BUILD_REFERENCE="master"     OPENSHIFT_BUILD_COMMIT="01effef3a23935c1a83110d4b074b0738d677c44"
STEP 4/9: USER root
STEP 5/9: COPY upload/src /tmp/src
STEP 6/9: RUN chown -R 1001:0 /tmp/src
STEP 7/9: USER 1001
STEP 8/9: RUN /usr/libexec/s2i/assemble
---> Installing application source ...
---> Building your Ruby application from source ...
---> Running 'bundle install --retry 2 --deployment --without development:test' ...
[DEPRECATED] The `--deployment` flag is deprecated because it relies on being remembered across bundler invocations, which bundler will no longer do in future versions. Instead please use `bundle config set --local deployment 'true'`, and stop using this flag
[DEPRECATED] The `--path` flag is deprecated because it relies on being remembered across bundler invocations, which bundler will no longer do in future versions. Instead please use `bundle config set --local path './bundle'`, and stop using this flag
[DEPRECATED] The `--without` flag is deprecated because it relies on being remembered across bundler invocations, which bundler will no longer do in future versions. Instead please use `bundle config set --local without 'development:test'`, and stop using this flag
Fetching gem metadata from https://rubygems.org/
Fetching gem metadata from https://rubygems.org/..
Fetching gem metadata from https://rubygems.org/..
Using bundler 2.2.24
Fetching nio4r 2.5.2
Fetching rack 2.2.3
Installing nio4r 2.5.2 with native extensions
Installing rack 2.2.3
Fetching puma 4.3.5
Installing puma 4.3.5 with native extensions
Bundle complete! 2 Gemfile dependencies, 4 gems now installed.
Gems in the groups 'development' and 'test' were not installed.
Bundled gems are installed into `./bundle`
---> Cleaning up unused ruby gems ...
Running `bundle clean --verbose` with bundler 2.2.24
Frozen, using resolution from the lockfile
STEP 9/9: CMD /usr/libexec/s2i/run
COMMIT temp.builder.openshift.io/testing-rhel/example-1:c434382c
time="2022-01-06T10:10:18Z" level=warning msg="Ignoring global metacopy option, not supported with booted kernel"
Getting image source signatures
Copying blob sha256:cc423b2000aec40199a4f4e1012f2e9b573d4ce6bc1ca416a598f8e1d45f3d13
Copying blob sha256:41d099875e8768dcadb9f7e388d68c50eb25f6160c8a3858b966d12d89e4d288
Copying blob sha256:3cd3b63408eccc3f9a1ffb740cf311d927927f94247e952af1c9b67c1ad2db4f
Copying blob sha256:3dca2e66497972abbd6a7796a701296ada6bb53013b52d4432bc0d3f1cf0e7bd
Copying blob sha256:83b76fb61d8095ec96901a654c78ecb24246378f905d96eb152348af72089f70
Copying blob sha256:f008aacb05a5e87c49ca50c5f5ac03b6c1b633f38249ed0617e984381b426c27
Copying config sha256:0c3c9936d566ecca29f4e8b92dfcceca341d632056ff056e628a1f612fe7f50b
Writing manifest to image destination
Storing signatures
--> 0c3c9936d56
Successfully tagged temp.builder.openshift.io/testing-rhel/example-1:c434382c
0c3c9936d566ecca29f4e8b92dfcceca341d632056ff056e628a1f612fe7f50b
Build complete, no image push requested

Comment 24 errata-xmlrpc 2022-03-12 04:35:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

