Bug 1908704

Summary: OpenShift Dockerfile build slowness
Product: OpenShift Container Platform Reporter: Vinu K <vkochuku>
Component: Build Assignee: Adam Kaplan <adam.kaplan>
Status: CLOSED NOTABUG QA Contact: Michael Nguyen <mnguyen>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 4.6 CC: abhinkum, adam.kaplan, aos-bugs, bbreard, imcleod, jligon, miabbott, mrbraga, nalin, nstielau, openshift-bugs-escalate, rbdiri, rcarrier, smilner, sople, travier
Target Milestone: --- Keywords: Reopened
Target Release: 4.8.0   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-03-05 17:38:39 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Build logs none

Description Vinu K 2020-12-17 11:47:24 UTC
Created attachment 1739955 [details]
Build logs

Description of problem:

An OpenShift Dockerfile build takes 40 seconds to complete a simple RUN instruction such as 'mkdir foo'. At build log level 10, the build appears to stall at the following line:

---
stdio is not a terminal, defaulting to not using a terminal
---

Version-Release number of selected component (if applicable):

OpenShift 4.6

How reproducible:

Hard to reproduce in another environment

Steps to Reproduce:

1. oc project foo

2. cat << EOF | oc new-build --dockerfile=- --to=bar
   FROM registry.access.redhat.com/ubi8/ubi
   RUN mkdir /tmp/data1
   RUN mkdir /tmp/data2
   RUN mkdir /tmp/data3
   ENTRYPOINT ["sleep", "infinity"]
   EOF

3. oc start-build bar --follow=true --wait=true --build-loglevel=10 | tee build-bar.log

Actual results:

Each RUN instruction in the Dockerfile takes 40 seconds to complete.

Expected results:

Each RUN instruction completes in about one second.

Additional info:

Build logs are attached.

Comment 2 Adam Kaplan 2020-12-22 15:00:01 UTC
Tested on GCP running OCP 4.9. Could not reproduce this issue - `mkdir` commands take no more than 1 second to complete.

Comment 4 Adam Kaplan 2020-12-23 15:36:46 UTC
Correction - test was running 4.6.9 on GCP.

Comment 22 Timothée Ravier 2021-01-25 17:25:01 UTC
As a comparison point, can they try building the same Dockerfile with podman directly on a node via oc debug node/... ?
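A minimal sketch of that comparison, assuming a debug shell on the node (`oc debug node/<node-name>` followed by `chroot /host`) and that `podman` can pull the base image from there. The Dockerfile is the one from the reproduction steps; the temporary working directory and the `bar-compare` tag are placeholders chosen for illustration.

```shell
# Run inside the node's debug shell, e.g. after:
#   oc debug node/<node-name>
#   chroot /host

# Write the Dockerfile from the reproduction steps into a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/Dockerfile" << 'EOF'
FROM registry.access.redhat.com/ubi8/ubi
RUN mkdir /tmp/data1
RUN mkdir /tmp/data2
RUN mkdir /tmp/data3
ENTRYPOINT ["sleep", "infinity"]
EOF

# Time an uncached build so every RUN step is actually executed.
# "bar-compare" is a hypothetical tag, not a name from this bug.
if command -v podman >/dev/null 2>&1; then
  time podman build --no-cache -t bar-compare "$workdir"
fi
```

If the RUN steps are also slow here, the delay is likely not specific to the OpenShift build pod.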

Comment 51 Adam Kaplan 2021-03-05 17:38:39 UTC
Root Cause:

Some customers run the Dynatrace OneAgent operator on their clusters. OneAgent by default enables automatic "deep" monitoring of all processes, which causes the performance of OpenShift Builds to degrade significantly [1]. Any fix to address the performance degradation would need to be provided by Dynatrace (in partnership with Red Hat if necessary).

Work Around:

OpenShift admins who install Dynatrace OneAgent can configure Dynatrace to exclude certain workloads from deep monitoring [2]. If desired, admins can add a monitoring rule that excludes all OpenShift Builds.

Admins can also use Tolerations and Node Selectors to isolate Builds from nodes that run Dynatrace OneAgent. This could be accomplished as follows:

1. Add node labels and taints to the build worker nodes
  a. Taint worker nodes to be used for builds with a desired key, value, and the `NoSchedule` effect:
    ```
    $ oc taint node <worker-node> build-node=true:NoSchedule
    ```
  b. Label these worker nodes with a desired key and value. These can be the same as above:
    ```
    $ oc label node <worker-node> build-node=true
    ```
2. Alternatively, add or update the labels and taints on a MachineSet, like so [3]:
  ```
  apiVersion: machine.openshift.io/v1beta1
  kind: MachineSet
  ...
  spec:
    template: # this is the template for the Machines to be provisioned
      metadata:
        labels:
          build-node: "true"
      ...
      spec:
        metadata: # this is metadata applied to all Nodes underlying the MachineSet
          labels:
            build-node: "true"
        taints: # taints applied to all Nodes underlying the MachineSet
        - effect: NoSchedule
          key: build-node
          value: "true"
      
  ```

3. Set up a cluster-wide BuildOverride that allows builds to tolerate the "build-node" taint and forces builds onto the labeled build-nodes [4].

```
$ oc edit build.config.openshift.io/cluster

spec:
  buildOverrides:
    nodeSelector:
      build-node: "true"
    tolerations:
    - effect: NoSchedule
      key: build-node
      operator: Exists
```

4. Deploy Dynatrace OneAgent via Operator Hub. The agents will not tolerate the custom "build-node" taint by default and therefore will not run on these nodes.
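Whether a build's toleration lines up with the "build-node" taint follows the Kubernetes matching rule: key and effect must agree, and the toleration must either use `operator: Exists` or carry the same value as the taint (the operator defaults to `Equal` when omitted). The shell function below is only an illustrative sketch of that rule; `toleration_matches` and its argument order are hypothetical and not part of `oc` or any OpenShift API.

```shell
# toleration_matches TOL_KEY TOL_OPERATOR TOL_VALUE TOL_EFFECT TAINT_KEY TAINT_VALUE TAINT_EFFECT
# Returns 0 if the toleration would tolerate the taint, 1 otherwise.
toleration_matches() {
  tol_key=$1 tol_op=${2:-Equal} tol_value=$3 tol_effect=$4
  taint_key=$5 taint_value=$6 taint_effect=$7

  # An empty toleration effect matches every taint effect.
  [ -z "$tol_effect" ] || [ "$tol_effect" = "$taint_effect" ] || return 1

  case $tol_op in
    # Exists ignores the value; an empty key with Exists matches any key.
    Exists) [ -z "$tol_key" ] || [ "$tol_key" = "$taint_key" ] ;;
    # Equal requires both key and value to be identical.
    Equal)  [ "$tol_key" = "$taint_key" ] && [ "$tol_value" = "$taint_value" ] ;;
    *)      return 1 ;;
  esac
}

# Taint from steps 1 and 2: build-node=true:NoSchedule
if toleration_matches build-node Exists "" NoSchedule build-node true NoSchedule; then
  echo "Exists toleration matches the build-node taint"
fi
if ! toleration_matches build-node "" "" NoSchedule build-node true NoSchedule; then
  echo "key-only toleration (default Equal, empty value) does not match"
fi
```

In practice this means a toleration that lists only a key and an effect would not cover a taint whose value is "true"; it needs either `operator: Exists` or `value: "true"`.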

[1] https://access.redhat.com/solutions/4978291
[2] https://www.dynatrace.com/support/help/shortlink/process-group-monitoring#enable-automatic-deep-monitoring
[3] https://docs.openshift.com/container-platform/4.7/machine_management/creating_machinesets/creating-machineset-aws.html
[4] https://docs.openshift.com/container-platform/4.7/cicd/builds/build-configuration.html

Comment 56 Red Hat Bugzilla 2023-09-15 00:53:12 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days