Bug 1934177 - knative-camel-operator CreateContainerError "container_linux.go:366: starting container process caused: chdir to cwd (\"/home/nonroot\") set in config.json failed: permission denied"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: Peter Hunt
QA Contact: Weinan Liu
URL:
Whiteboard:
Duplicates: 1944312
Depends On:
Blocks:
 
Reported: 2021-03-02 16:28 UTC by Marek Schmidt
Modified: 2021-07-27 22:51 UTC (History)
CC List: 8 users

Fixed In Version: runc-1.0.0-84.rhaos4.6.git7116f03
Doc Type: Bug Fix
Doc Text:
Cause: A change in the order in which runc sets up the workdir of a container.
Consequence: Container creation errors occurred if the workdir wasn't owned by the user running runc.
Fix: Update runc to attempt the chdir to the workdir multiple times, in case one attempt does not work.
Result: Container creations succeed regardless of whether the workdir is owned by the container user or the user running runc.
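
The fix described in the Doc Text (and in the title of the linked runc pull request, "retry chdir to fix EPERM") can be sketched roughly as follows. This is a minimal illustration in Python, not the actual runc patch, which is written in Go. The names `chdir_with_retry` and `switch_user`, and the injectable `chdir` parameter, are hypothetical and exist only for this example.

```python
import os
from typing import Callable


def chdir_with_retry(
    cwd: str,
    switch_user: Callable[[], None],
    chdir: Callable[[str], None] = os.chdir,
) -> None:
    """Enter cwd, trying once before and, if needed, once after switching user."""
    need_retry = True
    if cwd:
        try:
            # First attempt: still running as the user that launched the runtime.
            chdir(cwd)
            need_retry = False
        except PermissionError:
            # EACCES/EPERM is not fatal yet: the container user may still
            # be allowed in, so try again after the switch.
            pass
    # Hypothetical stand-in for the step that drops to the container user.
    switch_user()
    if cwd and need_retry:
        # Second attempt: now running as the container user. A failure
        # here propagates to the caller as a real error.
        chdir(cwd)
```

With this ordering, the workdir only needs to be accessible to one of the two users, which matches the "Result" line above. Errors other than permission errors on the first attempt are raised immediately.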
Clone Of:
Environment:
Last Closed: 2021-07-27 22:49:01 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | opencontainers runc pull 2894 | 0 | None | open | libct/init_linux: retry chdir to fix EPERM | 2021-04-06 17:53:21 UTC
Red Hat Product Errata | RHSA-2021:2438 | 0 | None | None | None | 2021-07-27 22:51:31 UTC

Description Marek Schmidt 2021-03-02 16:28:51 UTC
Description of problem:

knative-camel-operator can be installed on 4.6.15, but fails on 4.6.18 and 4.7.0. 

The camel-controller-manager pod gets into CreateContainerError state, with the following error:

container create failed: time="2021-03-02T12:02:28Z" level=error msg="container_linux.go:366: starting container process caused: chdir to cwd (\"/home/nonroot\") set in config.json failed: permission denied"

Version-Release number of selected component (if applicable):
4.6.18, 4.7.0

How reproducible:
Always

Steps to Reproduce:
1. On 4.6.18 or 4.7.0, install Serverless Operator from the Red Hat OperatorHub (currently 1.13.0)
2. Create KnativeServing in knative-serving namespace and KnativeEventing in knative-eventing namespace
3. Install "Knative Apache Camel Operator" (currently 0.18.0) from the community OperatorHub
4. Notice the operator installation fails, with the camel-controller-manager pod in CreateContainerError state.


Actual results:
camel-controller-manager pod is stuck in CreateContainerError with the following error:

container create failed: time="2021-03-02T12:02:28Z" level=error msg="container_linux.go:366: starting container process caused: chdir to cwd (\"/home/nonroot\") set in config.json failed: permission denied"

Expected results:
camel-controller-manager pod starts up normally

Additional info:

This seems to be a regression in 4.6.18 compared to 4.6.15, as the same operator with the same image worked fine there.

Possibly introduced by https://github.com/projectatomic/runc/commit/e541951c107025363752afe4fb483d3b8d71addd  ?

Comment 1 Marek Schmidt 2021-03-02 16:34:32 UTC
The image used by the knative-camel-operator is gcr.io/knative-releases/knative.dev/eventing-camel/cmd/controller@sha256:874b498fc53ee5060c4f897c3fdf193a457d7c51c6ae6acc336d57518e848882

Comment 2 Marek Schmidt 2021-03-02 22:06:57 UTC
Specifically, it seems the regression is between 4.6.16 (which also works) and 4.6.17 (on which it fails with CreateContainerError).

Comment 3 Peter Hunt 2021-03-03 14:43:58 UTC
Ah, it seems you've hit the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=1915397

Make sure the WORKDIR is accessible by the user the container runs as.

Comment 5 Marek Schmidt 2021-03-03 17:56:26 UTC
The image is an upstream image based on gcr.io/distroless/static:nonroot  

This affects any image based on gcr.io/distroless/static:nonroot that doesn't modify WORKDIR, e.g.

oc new-app quay.io/maschmid/helloworld:latest

which is just

FROM gcr.io/distroless/static:nonroot
ADD hello_world /hello_world
CMD ["/hello_world"]


This image works on 4.6.15, but doesn't on 4.7.0

Comment 6 Peter Hunt 2021-03-03 19:12:34 UTC
running
`id=$(podman pull -1 gcr.io/distroless/static:nonroot)` and then `podman inspect $id` returns:
[{
...
 "Config": {
            "User": "65532",
            "Env": [
                "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
                "SSL_CERT_FILE=/etc/ssl/certs/ca-certificates.crt"
            ],
            "WorkingDir": "/home/nonroot"
        },
...
}]

So the WORKDIR is actually modified, just not by your Dockerfile.
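
As a small illustration, the two fields that matter here can be pulled out of the inspect output programmatically. This is only a sketch: the JSON is trimmed to the fields quoted above (real `podman inspect` output contains many more), and `image_user_and_workdir` is a hypothetical helper name.

```python
import json
from typing import Tuple

# Trimmed to the fields quoted in this comment.
inspect_output = """
[{
  "Config": {
    "User": "65532",
    "Env": [
      "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
      "SSL_CERT_FILE=/etc/ssl/certs/ca-certificates.crt"
    ],
    "WorkingDir": "/home/nonroot"
  }
}]
"""


def image_user_and_workdir(raw: str) -> Tuple[str, str]:
    """Return (User, WorkingDir) from the first image in inspect-style JSON."""
    config = json.loads(raw)[0].get("Config", {})
    return config.get("User", ""), config.get("WorkingDir", "")


user, workdir = image_user_and_workdir(inspect_output)
print(f"user={user} workdir={workdir}")  # prints "user=65532 workdir=/home/nonroot"
```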

I am suspicious that `oc new-app` is running this container as a random uid, and that uid is not 65532. I would recommend running as that user, or doing something similar to what CNV did to work around this issue:
```
RUN chgrp -R 0 /home/nonroot && \
    chmod -R g=u /home/nonroot
```
for whatever group your container ends up running as
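
To see why that workaround helps, here is a rough Python model of the classic Unix permission check for entering a directory. It rests on several assumptions that are not confirmed in this report: that /home/nonroot is owned by 65532:65532 with mode 0700, and that the container ends up running as an arbitrary uid (1000620000 is a made-up value) with group 0. The helper names are hypothetical, and the model ignores root, capabilities, and ACLs.

```python
import stat
from typing import Iterable


def can_enter(mode: int, owner_uid: int, owner_gid: int,
              uid: int, gids: Iterable[int]) -> bool:
    """Check the search (execute) bit that chdir needs on a directory."""
    if uid == owner_uid:
        return bool(mode & stat.S_IXUSR)
    if owner_gid in set(gids):
        return bool(mode & stat.S_IXGRP)
    return bool(mode & stat.S_IXOTH)


def chmod_g_equals_u(mode: int) -> int:
    """Mimic `chmod g=u`: copy the owner's rwx bits onto the group bits."""
    user_bits = (mode & stat.S_IRWXU) >> 3
    return (mode & ~stat.S_IRWXG) | user_bits


# Assumed starting state of /home/nonroot (not confirmed by this report).
mode, owner_uid, owner_gid = 0o700, 65532, 65532
random_uid, random_gids = 1000620000, [0]  # hypothetical runtime identity

before = can_enter(mode, owner_uid, owner_gid, random_uid, random_gids)

# Apply the workaround: chgrp to group 0, then chmod g=u.
owner_gid = 0
mode = chmod_g_equals_u(mode)

after = can_enter(mode, owner_uid, owner_gid, random_uid, random_gids)
print(before, after)  # prints "False True"
```

Under these assumptions the random uid falls into the "other" class before the change and is denied, while afterwards it matches the directory's group and inherits the owner's permissions.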

Comment 7 Peter Hunt 2021-03-03 19:23:57 UTC
podman pull -q gcr.io/distroless/static:nonroot
should be the first command

Comment 8 Peter Hunt 2021-03-09 21:47:50 UTC
Does the workaround work for you?

Comment 9 Marek Schmidt 2021-03-10 12:44:49 UTC
As we don't have direct control over the image, we're trying to work around this by setting "runAsUser: 65532" in the operator:

https://github.com/operator-framework/community-operators/pull/3262

Comment 10 Peter Hunt 2021-03-15 17:00:34 UTC
That PR merged. Is that workaround sufficient, and can we close this?

Comment 11 Marek Schmidt 2021-03-19 08:19:05 UTC
Specifically for the knative-camel-operator the issue is fixed by our "workaround".

I'd leave it up to you if you want to track a general problem of making images based on gcr.io/distroless/static:nonroot "just work" on OpenShift like it did before 4.6.17.

(I'd consider this to be a serious regression, as this behavior can cause applications to break when upgrading to a new OCP micro release, but I understand that was an unfortunate tradeoff that had to be made to fix a different regression vs OCP 3.x)

Comment 12 Peter Hunt 2021-03-19 18:27:47 UTC
Yeah, I deem this to be an unfortunate trade-off. Since this behavior is more correct, the regression must be allowed to happen.

Comment 13 Peter Hunt 2021-04-06 17:53:24 UTC
I've had a change of heart. I believe we can fix this case because it *was* previously valid. I've attached the PR.
If it is accepted by upstream I will backport it to 4.5+

Comment 14 Peter Hunt 2021-04-06 17:54:56 UTC
*** Bug 1944312 has been marked as a duplicate of this bug. ***

Comment 15 Peter Hunt 2021-04-16 21:34:49 UTC
I have worked around the issue and submitted the patch to 4.5-4.8

Comment 24 errata-xmlrpc 2021-07-27 22:49:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

