Bug 1951812 - [master] [assisted operator] Assisted Service Postgres crashes msg: "mkdir: cannot create directory '/var/lib/pgsql/data/userdata': Permission denied"
Summary: [master] [assisted operator] Assisted Service Postgres crashes msg: "mkdir: c...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: assisted-installer
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Mat Kowalski
QA Contact: bjacot
URL:
Whiteboard: AI-Team-Platform
Depends On:
Blocks: 1967945
 
Reported: 2021-04-20 21:29 UTC by hanzhang
Modified: 2023-09-15 01:05 UTC
CC List: 13 users

Fixed In Version: OCP-Metal-v1.0.22.1
Doc Type: No Doc Update
Doc Text:
Clone Of:
Cloned As: 1967945
Environment:
Last Closed: 2021-10-18 17:30:04 UTC
Target Upstream Version:
Embargoed:


Attachments
oc get pod assisted-service-db6cb96b5-6k6nm -o yaml (15.58 KB, text/plain)
2021-05-27 15:00 UTC, Michael Hrivnak


Links
System ID Private Priority Status Summary Last Updated
Github openshift assisted-installer pull 301 0 None closed Bug 1951812: Remove label from AI namespace after cluster installation 2021-06-17 07:01:33 UTC
Github openshift assisted-service pull 1557 0 None closed OCPBUGSM-28017: Unify operator manifests and move them under config 2021-06-17 07:01:32 UTC
Red Hat Product Errata RHSA-2021:3759 0 None None None 2021-10-18 17:30:35 UTC

Internal Links: 1967945

Description hanzhang 2021-04-20 21:29:40 UTC
Description of problem:

When installing the assisted-service operator bundle with local-storage, the assisted-service pod keeps crashing, and the postgres container reports a permission error.

```
$ oc get all -n assisted-installer
NAME                                             READY   STATUS             RESTARTS   AGE
pod/assisted-service-68b444b4c4-hx9z4            0/2     CrashLoopBackOff   38         75m
$ oc logs assisted-service-68b444b4c4-hx9z4 -n assisted-installer -c postgres
mkdir: cannot create directory '/var/lib/pgsql/data/userdata': Permission denied

```

It turns out the postgres container expects the data directory to be owned by uid=gid=26.

Since the assisted-service-operator does not allow us to modify the assisted-service deployment, we have to manually ssh into the host machine and change the directory ownership to uid=gid=26.
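
For reference, the manual workaround looks roughly like this (the local PV path below is an example; the real path can be found via `oc describe pv <pv-name>` under the local source's `Path`):

```
# on the host backing the local PV, give the data directory to uid/gid 26 (postgres)
$ sudo chown -R 26:26 /mnt/local-storage/assisted-service/sdb
```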

This is a poor user experience; we should figure out a way to set up the permissions correctly.

Version-Release number of selected component (if applicable):
Latest (0.0.2 community bundle + latest assisted-service images on 04/20/2021)

How reproducible:

Tried 3 times within one week; happens 100% of the time.

Steps to Reproduce:
1. Install the community bundle (0.0.2) of assisted-service
2. Provide a custom PV/storage class (we used the local-storage operator) in the AgentServiceConfig CR (see the example below).
3. Wait for the PVCs to be generated and bound.
4. See the pods go into CrashLoopBackOff.
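
For reference, the AgentServiceConfig from step 2 looks roughly like this (a sketch; the sizes and storage class name are from our setup, field names per the AgentServiceConfig CRD, whose instance must be named `agent`):

```
apiVersion: agent-install.openshift.io/v1beta1
kind: AgentServiceConfig
metadata:
  name: agent
spec:
  databaseStorage:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi
    storageClassName: assisted-service
  filesystemStorage:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 100Gi
    storageClassName: assisted-service
```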

Actual results:

assisted-service pod CrashLoopBackOff 

Expected results:

assisted-service pod Running 


Additional info:

Comment 1 Michael Hrivnak 2021-04-21 20:39:38 UTC
It would help to exec into the pod and see the actual ownership of the directory.

$ ls -ld /var/lib/pgsql/data
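
Or non-interactively, without an exec session (assuming the container is named postgres):

$ oc -n assisted-installer exec <pod-name> -c postgres -- ls -ld /var/lib/pgsql/data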

It would also help to see the complete AgentServiceConfig resource.

Comment 2 Flavio Percoco 2021-04-22 09:56:44 UTC
(In reply to Michael Hrivnak from comment #1)
> It would help to exec into the pod and see the actual ownership of the
> directory.
> 
> $ ls -ld /var/lib/pgsql/data
> 
> It would also help to see the complete AgentServiceConfig resource.

bash-4.2$ ls -ld /var/lib/pgsql/data
drwxrwsr-x. 4 root 1000650000 4096 Apr 22 08:30 /var/lib/pgsql/data


I spent some time trying to reproduce this without success. 

The default reclaim policy for PVs is `Delete`, so we should not see this if the PVC is deleted and the assisted-service is wiped out. Once the PVC is created, the postgres container always gets the same PVC. I tried deleting the deployment multiple times and was not able to reproduce this issue. Perhaps some data is missing in the steps to reproduce?

Some questions:

* How was the local-storage configured?
* Can you confirm that the PV is being created with the `Delete` reclaim policy?
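
A quick way to check the reclaim policy (standard `oc`/`kubectl` custom columns):

```
$ oc get pv -o custom-columns=NAME:.metadata.name,RECLAIM:.spec.persistentVolumeReclaimPolicy,CLAIM:.spec.claimRef.name
```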

Comment 3 Flavio Percoco 2021-04-22 09:59:34 UTC
One more thing, this is how I installed local-storage and created the localvolume in my environment:

```
  oc adm new-project openshift-local-storage || true

  oc annotate project openshift-local-storage openshift.io/node-selector=''

  cat <<EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha2
kind: OperatorGroup
metadata:
  name: local-operator-group
  namespace: openshift-local-storage
spec:
  targetNamespaces:
    - openshift-local-storage
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: local-storage-operator
  namespace: openshift-local-storage
spec:
  installPlanApproval: Automatic
  name: local-storage-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF

  wait_for_crd "localvolumes.local.storage.openshift.io"
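  # (wait_for_crd is a helper defined elsewhere in this script; it blocks until the CRD is registered)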

  echo "Creating local volume and storage class..."
  cat <<EOCR | oc apply -f -
apiVersion: local.storage.openshift.io/v1
kind: LocalVolume
metadata:
  name: assisted-service
  namespace: openshift-local-storage
spec:
  logLevel: Normal
  managementState: Managed
  storageClassDevices:
    - devicePaths:
        - /dev/sdb
        - /dev/sdc
      storageClassName: assisted-service
      volumeMode: Filesystem
EOCR
```

Comment 4 Juan Manuel Parrilla Madrid 2021-04-22 12:49:04 UTC
Output:

bash-4.2$ ls /var/lib/pgsql/data/
lost+found
bash-4.2$ ls -l /var/lib/pgsql/data/
total 16
drwx------. 2 root root 16384 Apr 22 12:41 lost+found
bash-4.2$ echo "test" > /var/lib/pgsql/data/test
bash: /var/lib/pgsql/data/test: Permission denied
bash-4.2$ id
uid=26(postgres) gid=26(postgres) groups=26(postgres),0(root)

The postgres container runs as the postgres user (uid 26), while the PV is owned by root, hence the permission error.

Comment 5 Flavio Percoco 2021-04-22 13:01:29 UTC
Digging a bit more into @jparrill's environment, I noticed that the postgresql Pod is indeed different from what I see in my environment (specifically the securityContext):


This is from my environment (deployed with dev script, specifically https://github.com/openshift-metal3/dev-scripts/blob/master/assisted_deployment.sh )

```
    securityContext:
      capabilities:
        drop:
        - KILL
        - MKNOD
        - SETGID
        - SETUID
      runAsUser: 1000730000
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
```

This is from Juan's environment

```
    securityContext:
      capabilities:
        drop:
        - MKNOD
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
```

Still digging into it a bit more

Comment 6 Flavio Percoco 2021-04-22 14:09:00 UTC
Alright, in Juan's environment he had the following applied: 

`oc adm policy add-cluster-role-from-user cluster-admin -z default -n assisted-installer`

At first I thought that was OK, but then it clicked: granting cluster-admin permissions to the `default` service account (which is the one used by the operator and AI) allows the pod to run with its internal UID and override the default SCC.

Removing it solved the issue in his environment.

The question now is: Should we force `runAsUser` in the postgres container?
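
For context, forcing it would look roughly like this in the postgres container spec (a sketch, not the actual manifest; uid 26 is the postgres user inside the image):

```
    securityContext:
      runAsUser: 26  # postgres uid inside the image
```

Note that under the restricted SCC a pod requesting a fixed uid is rejected unless the service account is allowed to run with it, which is part of the trade-off being asked about here.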

Comment 7 hanzhang 2021-04-23 01:50:35 UTC
Hi Flavio, 

I had also granted the cluster-admin role to the assisted-installer default service account earlier to fix a missing-permission issue, and I forgot to restore the clusterrolebinding when reinstalling assisted-installer. I guess that is why the default SCC was not applied properly.

I created a new cluster without the cluster-admin patch and created 2 PVs using the local filesystem, and it works fine this time without issues.

I'll close this issue, and thanks for your investigation.

Thanks

Comment 8 Alex Krzos 2021-04-26 20:02:21 UTC
I am opening this back up because I don't think a complete resolution has been found at least for the environment I am working with.

We are creating a bare metal cluster using an on-prem assisted installer. We use an ignition override to plant a script that splits a disk into multiple LVM volumes, which then become persistent volumes via the local-storage operator. When we deploy assisted-installer into the pre-existing assisted-installer namespace, the assisted-service pod is left crashlooping because of the previously described permissions issue. If I create the resources in a new namespace (e.g. assisted-installer-2), then the pod comes up fine. It is unclear to me what is wrong or different about using the existing namespace the cluster came with out of the box.

Comment 9 melserng 2021-04-29 15:52:16 UTC
Hi,
I have the same issue. After creating a bare-metal cluster using on-prem AI and then installing the AI operator, I get the same permission error in the postgres container.

$ oc get pod -n assisted-installer
NAME                                         READY   STATUS             RESTARTS   AGE
assisted-installer-controller-6hmdn          0/1     Completed          0          42h
assisted-service-b77cc9d9f-fbg4b             0/2     CrashLoopBackOff   12         6m17s
assisted-service-operator-76bc5745fc-h29q4   1/1     Running            0          80m

$ oc -n assisted-installer logs assisted-service-b77cc9d9f-fbg4b postgres
mkdir: cannot create directory '/var/lib/pgsql/data/userdata': Permission denied

Comment 10 David Zager 2021-04-30 00:30:52 UTC
My suspicion is that @melserng and @akrzo are hitting the same problem that @fperco explained: the
assisted-installer-controller-6hmdn          0/1     Completed          0          42h
pod still exists in the assisted-installer namespace. The assisted-installer-controller likely needs elevated permissions to do the things it needs to, and may very well have cluster-admin privileges.

Moving the assisted-service-operator (and the assisted-service it deploys) away from the default service account (https://bugzilla.redhat.com/show_bug.cgi?id=1951636) should prevent this scenario in the future (assuming the elevated permissions on the default service account are still the problem).

IF it is the case that the default service account has cluster-admin, we likely need a new bug to: after cluster creation, remove the cluster-admin privileges from the default service account in assisted-installer namespace.

Is it possible to attempt removing the cluster-admin privilege from the default service account with `oc adm policy remove-cluster-role-from-user cluster-admin -z default -n assisted-installer`?
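
To first confirm the binding exists, something like this should work (the SERVICEACCOUNTS column of the wide output lists subjects as namespace/name):

$ oc get clusterrolebindings -o wide | grep 'assisted-installer/default'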

Comment 11 Flavio Percoco 2021-05-04 05:51:04 UTC
> Is it possible to attempt removing the cluster-admin privilege from the default service account with `oc adm policy remove-cluster-role-from-user cluster-admin -z default -n assisted-installer`?

@melserng are you able to try this?

We need to confirm if this is still an issue

Comment 12 Alex Krzos 2021-05-07 01:15:17 UTC
Our latest workaround is to delete the pre-existing assisted-installer namespace and then create a new assisted-installer namespace, so that we can keep using the expected namespace name.

Comment 13 Flavio Percoco 2021-05-10 12:03:54 UTC
https://github.com/openshift/assisted-service/pull/1557 landed; it should create a dedicated serviceaccount with the required permissions for Assisted Service. I will move this to ON_QA now for further/final testing. Hopefully, this issue won't regress.

Comment 14 Michael Hrivnak 2021-05-26 22:41:14 UTC
I'm seeing this with:
- SNO as the host cluster
- LSO
- Assisted Service Operator 0.0.3


$ oc logs assisted-service-db6cb96b5-6k6nm postgres
mkdir: cannot create directory '/var/lib/pgsql/data/userdata': Permission denied


$ oc get persistentvolumes
NAME                CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                 STORAGECLASS   REASON   AGE
local-pv-7ffcd3c8   1Gi        RWO            Delete           Bound    assisted-installer/postgres           db                      62m
local-pv-8c21f4f1   10Gi       RWO            Delete           Bound    assisted-installer/assisted-service   isos                    58m


$ oc get serviceaccounts
NAME                            SECRETS   AGE
assisted-installer-controller   2         47h
assisted-service                2         155m
builder                         2         47h
default                         2         47h
deployer                        2         47h

Comment 15 Michael Hrivnak 2021-05-27 14:58:16 UTC
More details:


$ oc get persistentvolumeclaims
NAME               STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS   AGE
assisted-service   Bound    local-pv-8c21f4f1   10Gi       RWO            isos           39m
postgres           Bound    local-pv-7ffcd3c8   1Gi        RWO            db             39m


Both LocalVolumes look like this:


apiVersion: local.storage.openshift.io/v1
kind: LocalVolume
metadata:
  name: db
  namespace: openshift-local-storage
spec:
  storageClassDevices:
  - devicePaths:
    - /dev/vdc
    fsType: xfs
    storageClassName: db
    volumeMode: Filesystem


Based on a suggested workaround, I did this procedure which got things working:
- delete AgentServiceConfig
- uninstall operator
- delete assisted-installer namespace
- reinstall operator
- create AgentServiceConfig

I don't know why that works. Prior to this, I tried doing those steps WITHOUT deleting the namespace, and that did not change the behavior.
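
In `oc` terms the procedure is roughly the following (a sketch only; the operator uninstall/reinstall steps depend on how the operator was installed, e.g. via an OLM Subscription):

```
$ oc delete agentserviceconfig agent      # the CR is cluster-scoped and must be named "agent"
$ oc delete subscription <assisted-service-sub> -n assisted-installer
$ oc delete namespace assisted-installer
$ oc new-project assisted-installer
# reinstall the operator, then recreate the AgentServiceConfig
```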

Comment 16 Michael Hrivnak 2021-05-27 15:00:42 UTC
Created attachment 1787642 [details]
oc get pod assisted-service-db6cb96b5-6k6nm -o yaml

This is the Pod representation while experiencing the permission denied error.

Comment 17 Mat Kowalski 2021-05-31 08:15:20 UTC
So far I don't fully understand why the problem would be related to installing the AI inside SNO and not in the hub cluster itself. What I see at the moment is a difference in security contexts between my env (created with dev-scripts)

```
  securityContext:
    fsGroup: 1000640000
    seLinuxOptions:
      level: s0:c25,c20
  serviceAccount: assisted-service
  serviceAccountName: assisted-service
```

and the one provided by Michael

```
  securityContext: {}
  serviceAccount: assisted-service
  serviceAccountName: assisted-service
```

Digging further to see whether this can be related...
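
A quick way to compare the two environments (full YAML from mine is in the next comment):

```
$ oc -n assisted-installer get pod <pod> -o jsonpath='{.spec.securityContext}'
```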

Comment 19 Mat Kowalski 2021-05-31 13:15:24 UTC
When installing assisted-service on the SNO cluster, the behaviour differs depending on whether we install it in the `assisted-installer` namespace (which already exists when the SNO is installed) or in any other namespace (e.g. `assisted-installer-2`): a different securityContext is applied to the pods created.

```
# oc -n assisted-installer get sa
NAME                            SECRETS   AGE
assisted-installer-controller   2         4h47m
assisted-service                2         3h15m
builder                         2         4h23m
default                         2         4h47m
deployer                        2         4h23m

# oc -n assisted-installer-2 get sa
NAME               SECRETS   AGE
assisted-service   2         114s
builder            2         2m4s
default            2         2m4s
deployer           2         2m4s

# oc -n assisted-installer get pod/assisted-service-b7dc8b8d7-hwlq2 -o yaml
[...]
  securityContext: {}
  serviceAccount: assisted-service
  serviceAccountName: assisted-service
[...]

# oc -n assisted-installer-2 get pod/assisted-service-b7dc8b8d7-kps9s -o yaml
[...]
    securityContext:
      capabilities:
        drop:
        - KILL
        - MKNOD
        - SETGID
        - SETUID
      runAsUser: 1000650000
[...]
  securityContext:
    fsGroup: 1000650000
    seLinuxOptions:
      level: s0:c26,c0
  serviceAccount: assisted-service
  serviceAccountName: assisted-service
[...]
```

The `assisted-installer` namespace also seems to have an additional pod that we do not have when working on a newly created namespace

```
# oc -n assisted-installer get pods
NAME                                        READY   STATUS      RESTARTS   AGE
assisted-installer-controller-j27q4         0/1     Completed   0          4h51m
assisted-service-b7dc8b8d7-hwlq2            0/2     Pending     0          108m
assisted-service-operator-d47d9c877-6pqms   1/1     Running     0          3h19m

# oc -n assisted-installer-2 get pods
NAME                                        READY   STATUS    RESTARTS   AGE
assisted-service-b7dc8b8d7-kps9s            0/2     Pending   0          6m44s
assisted-service-operator-d9f54f9b7-fxrkj   1/1     Running   0          7m18s
```

Because of all the above, the issue is now isolated to clusters initially created by the assisted-installer. Right now it looks like the `assisted-installer` namespace has some security-related configuration which prevents the assisted-service pods from getting the proper securityContext (this is why deleting and recreating the namespace, as in the workaround described above, solves the issue).

Comment 20 Mat Kowalski 2021-05-31 13:48:50 UTC
On the SNO cluster the `assisted-installer` namespace is running in a highly privileged mode, so Security Context Constraints protections are not applied there. This seems to be the reason why the UID of the postgres container does not match what we expect:

```
# oc describe namespace assisted-installer
Name:         assisted-installer
Labels:       kubernetes.io/metadata.name=assisted-installer
              name=assisted-installer
              olm.operatorgroup.uid/e8c94d42-63c9-4849-b410-9f42cb8b413b=
              openshift.io/run-level=0
[...]

# oc describe namespace assisted-installer-2
Name:         assisted-installer-2
Labels:       kubernetes.io/metadata.name=assisted-installer-2
              name=assisted-installer-2
              olm.operatorgroup.uid/89308c99-e2fa-42ec-9ea6-249508511613=
[...]
```

This is also the reason why recreating the namespace solves the issue: after recreation, the `run-level=0` label is not set.
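
A minimal manual check and fix along these lines (this mirrors what the eventual fix automates):

```
# show the namespace labels
$ oc get namespace assisted-installer --show-labels

# a trailing dash removes the label
$ oc label namespace assisted-installer openshift.io/run-level-
```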

Comment 21 Michael Hrivnak 2021-05-31 18:46:45 UTC
That makes sense. After cluster installation, we run something in the new cluster's "assisted-installer" namespace. Its elevated permissions cause an issue later for assisted-service and its operator.

We should probably use different namespaces so they don't conflict.

Comment 22 Mat Kowalski 2021-06-01 07:36:19 UTC
My personal opinion is that we should (1) recommend using another namespace but as well (2) stop using elevated permissions unless really needed.

The reasoning for (1) is that the namespace is used for assisted-service in both cases, but being a "child" of assisted-service (which is why SNO has this namespace to start with) is one thing, and being a parent spawning the next clusters is another. On the one hand it feels natural that, being the same service, it belongs to a single namespace; but keeping them separate may also spare us some issues and debugging time in the future, as it will make it clearer whether we are debugging something related to being deployed by assisted-service or something related to deploying further clusters.

As for (2), I'm testing right now whether the cluster is fully functional when we drop the run-level from its assisted-installer namespace. From a very brief look (and from checking the implications of setting a run-level) I don't see any strong reason why it should cause problems, but it needs a bit more testing.

Comment 23 Mat Kowalski 2021-06-01 14:27:07 UTC
After internal discussions, the solution should be one of the following:

* delete assisted-installer namespace after controller job finishes
* remove run-level label from the assisted-installer namespace after controller job finishes

Comment 24 Michael Hrivnak 2021-06-01 18:22:23 UTC
What's the reason to not use a separate namespace? Is the above just the near-term solution for the upcoming release, and we can still change namespace in a future release? Or is there some other reason?

Comment 25 Mat Kowalski 2021-06-04 09:10:04 UTC
We could use a separate namespace, but in principle the run-level label should be handled properly either way: currently we are using it to control the order in which the pods are started (which is not, I believe, the original purpose of this flag). For the next release we will handle it via 1966621.

The near-term solution to fix the problem that's happening here (i.e. when we use the same namespace) is simply to remove the label when it's no longer needed. In principle, whether or not to reuse the namespace is, I think, a matter of guidelines. There are reasons for and against it, but I think the current solution is a fair trade-off.

Comment 26 Michael Hrivnak 2021-06-04 15:14:57 UTC
AIUI the near-term solution still leaves us with a race condition. We may be able to mitigate it sufficiently, but it still exists. The namespace in question can't be used for installation of the assisted-service-operator until after the label is removed.

Using a different namespace would eliminate the race.

1966621 sounds like it would remove the point of conflict, making the race irrelevant.

There's no advantage to putting these different components in the same namespace, so in general isolating them in their own namespaces seems like a reasonable default.

Comment 27 Mat Kowalski 2021-06-04 20:20:24 UTC
As discussed in the PR [1], we should recommend not using the same namespace, or removing the label manually if required. The effort should go into solving 1966621 as the underlying issue.

[1] https://github.com/openshift/assisted-installer/pull/301#issuecomment-854975449

Comment 30 Flavio Percoco 2021-06-16 15:17:44 UTC
The PR [0] implements an interim solution for the 4.8 release. In order for the cherry-pick to be merged, this bug will have to be flagged as VERIFIED. Therefore, I believe we will need a new bug focusing on the long-term solution (not using the label, to begin with).


[0] https://github.com/openshift/assisted-installer/pull/301

Comment 31 Mat Kowalski 2021-06-16 15:19:59 UTC
Please note there is already a bug against master branch for complete removal of run-level label - https://bugzilla.redhat.com/show_bug.cgi?id=1966621

Comment 35 errata-xmlrpc 2021-10-18 17:30:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

Comment 36 Red Hat Bugzilla 2023-09-15 01:05:23 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

