Description of problem:

When installing the assisted-service operator bundle with local-storage, the assisted-service pod keeps crashing, and the postgres container reports a permission error.

```
$ oc get all -n assisted-installer
NAME                                    READY   STATUS             RESTARTS   AGE
pod/assisted-service-68b444b4c4-hx9z4   0/2     CrashLoopBackOff   38         75m

$ oc logs assisted-service-68b444b4c4-hx9z4 -n assisted-installer -c postgres
mkdir: cannot create directory '/var/lib/pgsql/data/userdata': Permission denied
```

It turns out the postgres container expects the data directory to be owned by uid=gid=26. Since the assisted-service operator does not allow us to modify the assisted-service deployment, we have to ssh into the host machine and manually change the directory ownership to uid=gid=26. This is not a good user experience, and we should find a better way to set up the permissions.

Version-Release number of selected component (if applicable):
Latest (0.0.2 community bundle + latest assisted-service images on 04/20/2021)

How reproducible:
Tried 3 times within one week, happens 100%.

Steps to Reproduce:
1. Install the community bundle (0.0.2) of assisted-service.
2. Provide a custom PV/storage class (we used the local-storage operator) in the AgentServiceConfig CR.
3. Wait for the PVC to be generated and bound.
4. See the pods go into CrashLoopBackOff.

Actual results:
assisted-service pod CrashLoopBackOff

Expected results:
assisted-service pod Running

Additional info:
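For reference, the manual workaround boils down to something like this on the host backing the postgres local PV (the mount path below is illustrative, use whatever directory your LocalVolume/PV actually points at; 26 is the uid/gid of the postgres user inside the image):

```
# path is an example only - adjust to the directory backing the postgres PV
$ sudo chown -R 26:26 /mnt/local-storage/assisted-service/sdb
```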
It would help to exec into the pod and see the actual ownership of the directory.

$ ls -ld /var/lib/pgsql/data

It would also help to see the complete AgentServiceConfig resource.
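For anyone gathering this, both can be collected without ssh-ing anywhere, along these lines (the `deploy/` shorthand assumes a recent oc; if the postgres container is crash-looping too fast to exec into, `oc debug` the pod instead):

```
$ oc -n assisted-installer exec deploy/assisted-service -c postgres -- ls -ld /var/lib/pgsql/data
$ oc get agentserviceconfig -o yaml
```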
(In reply to Michael Hrivnak from comment #1)
> It would help to exec into the pod and see the actual ownership of the
> directory.
>
> $ ls -ld /var/lib/pgsql/data
>
> It would also help to see the complete AgentServiceConfig resource.

bash-4.2$ ls -ld /var/lib/pgsql/data
drwxrwsr-x. 4 root 1000650000 4096 Apr 22 08:30 /var/lib/pgsql/data

I spent some time trying to reproduce this without success. The default reclaim policy for PVs is `Delete`, so we should not see this if the PVC is deleted and the assisted-service is wiped out. Once the PVC is created, the postgres container always gets the same PVC. I tried deleting the deployment multiple times and was not able to reproduce this issue. Perhaps some data is missing in the steps to reproduce?

Some questions:
* How was the local-storage configured?
* Can you confirm that the PV is being created with the `Delete` reclaim policy?
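To answer the reclaim-policy question, one way to check it per PV:

```
$ oc get pv -o custom-columns=NAME:.metadata.name,RECLAIM:.spec.persistentVolumeReclaimPolicy,CLAIM:.spec.claimRef.name
```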
One more thing, this is how I installed local-storage and created the localvolume in my environment:

```
oc adm new-project openshift-local-storage || true
oc annotate project openshift-local-storage openshift.io/node-selector=''

cat <<EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha2
kind: OperatorGroup
metadata:
  name: local-operator-group
  namespace: openshift-local-storage
spec:
  targetNamespaces:
    - openshift-local-storage
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: local-storage-operator
  namespace: openshift-local-storage
spec:
  installPlanApproval: Automatic
  name: local-storage-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF

wait_for_crd "localvolumes.local.storage.openshift.io"

echo "Creating local volume and storage class..."

cat <<EOCR | oc apply -f -
apiVersion: local.storage.openshift.io/v1
kind: LocalVolume
metadata:
  name: assisted-service
  namespace: openshift-local-storage
spec:
  logLevel: Normal
  managementState: Managed
  storageClassDevices:
    - devicePaths:
        - /dev/sdb
        - /dev/sdc
      storageClassName: assisted-service
      volumeMode: Filesystem
EOCR
```
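For completeness, an AgentServiceConfig consuming such a storage class would look roughly like this (a sketch; the sizes are placeholders and the field names follow the agent-install.openshift.io/v1beta1 API as I understand it):

```
apiVersion: agent-install.openshift.io/v1beta1
kind: AgentServiceConfig
metadata:
  name: agent
spec:
  databaseStorage:
    storageClassName: assisted-service
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi
  filesystemStorage:
    storageClassName: assisted-service
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 100Gi
```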
Output:

bash-4.2$ ls /var/lib/pgsql/data/
lost+found
bash-4.2$ ls -l /var/lib/pgsql/data/
total 16
drwx------. 2 root root 16384 Apr 22 12:41 lost+found
bash-4.2$ echo "test" > /var/lib/pgsql/data/test
bash: /var/lib/pgsql/data/test: Permission denied
bash-4.2$ id
uid=26(postgres) gid=26(postgres) groups=26(postgres),0(root)

The postgres container runs as the postgres user (uid/gid 26), while the PV mount is owned by root and not writable by that user.
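A useful extra data point here is which SCC admitted the pod; the `openshift.io/scc` annotation on the pod records it:

```
$ oc -n assisted-installer get pod <assisted-service-pod> -o yaml | grep 'openshift.io/scc'
```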
Digging a bit more into @jparrill's environment, I noticed that the postgresql pod is indeed different from what I see in my environment (specifically the securityContext).

This is from my environment (deployed with the dev script, specifically https://github.com/openshift-metal3/dev-scripts/blob/master/assisted_deployment.sh):

```
securityContext:
  capabilities:
    drop:
      - KILL
      - MKNOD
      - SETGID
      - SETUID
  runAsUser: 1000730000
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
```

This is from Juan's environment:

```
securityContext:
  capabilities:
    drop:
      - MKNOD
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
```

Still digging into it a bit more.
Alright, in Juan's environment the following had been applied:

`oc adm policy add-cluster-role-to-user cluster-admin -z default -n assisted-installer`

At first I thought that was fine, but then it clicked: granting cluster-admin to the `default` service account (which is the one used by the operator and AI) allows the pod to run with its internal UID and bypass the default SCC restrictions. Removing it solved the issue in his environment.

The question now is: should we force `runAsUser` in the postgres container?
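If we did force it, it would be a pod-level securityContext roughly like the snippet below in the assisted-service Deployment (a sketch only, not what the operator renders today). Note that a hard-coded UID would itself be rejected by the restricted SCC unless the service account is granted something like `nonroot` or `anyuid`, so it trades one SCC problem for another:

```
securityContext:
  runAsUser: 26   # uid of the postgres user inside the image
  fsGroup: 26     # makes the PV mount group-writable so userdata/ can be created
```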
Hi Flavio,

I had also granted the cluster-admin role to the assisted-installer `default` service account earlier to work around a missing-permission issue, and I forgot to remove that clusterrolebinding when reinstalling assisted-installer. I guess that is why the default SCC was not applied properly.

I created a new cluster without the cluster-admin patch, created 2 PVs using a local filesystem, and it works fine this time without issues. I'll close this issue, and thanks for your investigation.

Thanks
I am reopening this because I don't think a complete resolution has been found, at least for the environment I am working with.

We are creating a bare metal cluster using an on-prem assisted installer. We use an ignition override to plant a script that splits a disk into multiple LVM volumes, which then become persistent volumes via the local-storage operator. When we deploy the assisted installer into the pre-existing assisted-installer namespace, the assisted-service pod is left crash-looping because of the previously described permissions issue. If I create the resources in a new namespace (e.g. assisted-installer-2), the pod comes up fine.

It is unclear to me what is wrong or different about using the existing namespace the cluster came with out of the box.
Hi,

I am hitting the same issue. I created a bare metal cluster using the on-prem AI, then installed the AI operator and got the same permission error in the postgres container:

$ oc get pod -n assisted-installer
NAME                                         READY   STATUS             RESTARTS   AGE
assisted-installer-controller-6hmdn          0/1     Completed          0          42h
assisted-service-b77cc9d9f-fbg4b             0/2     CrashLoopBackOff   12         6m17s
assisted-service-operator-76bc5745fc-h29q4   1/1     Running            0          80m

$ oc -n assisted-installer logs assisted-service-b77cc9d9f-fbg4b postgres
mkdir: cannot create directory '/var/lib/pgsql/data/userdata': Permission denied
My suspicion is that @melserng and @akrzo are hitting the same problem that @fperco explained; the `assisted-installer-controller-6hmdn 0/1 Completed 0 42h` pod still present in the assisted-installer namespace points that way. The assisted-installer-controller likely needs elevated permissions to do what it does and may very well have cluster-admin privileges.

Moving the assisted-service-operator (and the assisted-service it deploys) away from the default service account (https://bugzilla.redhat.com/show_bug.cgi?id=1951636) should prevent this scenario in the future (assuming the elevated permissions on the default service account are still the problem).

If it is the case that the default service account has cluster-admin, we likely need a new bug to remove the cluster-admin privileges from the default service account in the assisted-installer namespace after cluster creation.

Is it possible to attempt removing the cluster-admin privilege from the default service account with `oc adm policy remove-cluster-role-from-user cluster-admin -z default -n assisted-installer`?
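For reference, a quick way to see whether that service account currently appears in a cluster-admin binding (`-o wide` lists bound service accounts as namespace/name):

```
$ oc get clusterrolebindings -o wide | grep 'assisted-installer/default'
```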
> Is it possible to attempt removing the cluster-admin privilege from the default service account with `oc adm policy remove-cluster-role-from-user cluster-admin -z default -n assisted-installer`? @melserng are you able to try this? We need to confirm if this is still an issue
Our latest workaround is simply to delete the pre-existing assisted-installer namespace and then create a new assisted-installer namespace, so that we can keep using the expected namespace name.
https://github.com/openshift/assisted-service/pull/1557 landed and that should create a specific serviceaccount with the required permissions for Assisted Service. I will move this to ON_QA now for further/final testing. Hopefully, this issue won't regress.
I'm seeing this with:
- SNO as the host cluster
- LSO
- Assisted Service Operator 0.0.3

$ oc logs assisted-service-db6cb96b5-6k6nm postgres
mkdir: cannot create directory '/var/lib/pgsql/data/userdata': Permission denied

$ oc get persistentvolumes
NAME                CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                  STORAGECLASS   REASON   AGE
local-pv-7ffcd3c8   1Gi        RWO            Delete           Bound    assisted-installer/postgres            db                      62m
local-pv-8c21f4f1   10Gi       RWO            Delete           Bound    assisted-installer/assisted-service    isos                    58m

$ oc get serviceaccounts
NAME                            SECRETS   AGE
assisted-installer-controller   2         47h
assisted-service                2         155m
builder                         2         47h
default                         2         47h
deployer                        2         47h
More details:

$ oc get persistentvolumeclaims
NAME               STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS   AGE
assisted-service   Bound    local-pv-8c21f4f1   10Gi       RWO            isos           39m
postgres           Bound    local-pv-7ffcd3c8   1Gi        RWO            db             39m

Both volumes look like this:

apiVersion: local.storage.openshift.io/v1
kind: LocalVolume
metadata:
  name: db
  namespace: openshift-local-storage
spec:
  storageClassDevices:
    - devicePaths:
        - /dev/vdc
      fsType: xfs
      storageClassName: db
      volumeMode: Filesystem

Based on a suggested workaround, I did this procedure, which got things working:
- delete the AgentServiceConfig
- uninstall the operator
- delete the assisted-installer namespace
- reinstall the operator
- create the AgentServiceConfig

I don't know why that works. Prior to the above, I tried doing those steps WITHOUT deleting the namespace, and that did not change the behavior.
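As a command-level sketch of that procedure (resource and subscription names below are illustrative and will differ per environment; the operator reinstall itself goes through OperatorHub or a Subscription):

```
oc delete agentserviceconfig agent
oc -n assisted-installer delete subscription <assisted-service-operator-subscription>
oc -n assisted-installer delete csv <assisted-service-operator-csv>
oc delete namespace assisted-installer
# ...reinstall the operator into a fresh assisted-installer namespace...
oc apply -f agent-service-config.yaml
```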
Created attachment 1787642 [details]
oc get pod assisted-service-db6cb96b5-6k6nm -o yaml

This is the Pod representation while experiencing the permission denied error.
So far I don't fully understand why the problem would be related to installing the AI inside the SNO cluster and not in the hub cluster itself. What I see at the moment is a difference in security contexts between my env (created with dev-scripts)

```
securityContext:
  fsGroup: 1000640000
  seLinuxOptions:
    level: s0:c25,c20
serviceAccount: assisted-service
serviceAccountName: assisted-service
```

and the one provided by Michael

```
securityContext: {}
serviceAccount: assisted-service
serviceAccountName: assisted-service
```

Checking more to find whether it can be related...
When installing assisted-service on the SNO cluster there is different behaviour depending on whether we install it in the `assisted-installer` namespace (which already exists when the SNO is installed) or any other (e.g. `assisted-installer-2`); namely, a different securityContext is applied to the pods created.

```
# oc -n assisted-installer get sa
NAME                            SECRETS   AGE
assisted-installer-controller   2         4h47m
assisted-service                2         3h15m
builder                         2         4h23m
default                         2         4h47m
deployer                        2         4h23m

# oc -n assisted-installer-2 get sa
NAME               SECRETS   AGE
assisted-service   2         114s
builder            2         2m4s
default            2         2m4s
deployer           2         2m4s

# oc -n assisted-installer get pod/assisted-service-b7dc8b8d7-hwlq2 -o yaml
[...]
  securityContext: {}
  serviceAccount: assisted-service
  serviceAccountName: assisted-service
[...]

# oc -n assisted-installer-2 get pod/assisted-service-b7dc8b8d7-kps9s -o yaml
[...]
    securityContext:
      capabilities:
        drop:
          - KILL
          - MKNOD
          - SETGID
          - SETUID
      runAsUser: 1000650000
[...]
  securityContext:
    fsGroup: 1000650000
    seLinuxOptions:
      level: s0:c26,c0
  serviceAccount: assisted-service
  serviceAccountName: assisted-service
[...]
```

The `assisted-installer` namespace also has an additional pod that we do not have when working in a newly created namespace:

```
# oc -n assisted-installer get pods
NAME                                        READY   STATUS      RESTARTS   AGE
assisted-installer-controller-j27q4         0/1     Completed   0          4h51m
assisted-service-b7dc8b8d7-hwlq2            0/2     Pending     0          108m
assisted-service-operator-d47d9c877-6pqms   1/1     Running     0          3h19m

# oc -n assisted-installer-2 get pods
NAME                                        READY   STATUS    RESTARTS   AGE
assisted-service-b7dc8b8d7-kps9s            0/2     Pending   0          6m44s
assisted-service-operator-d9f54f9b7-fxrkj   1/1     Running   0          7m18s
```

Because of all the above, the issue is now isolated to clusters initially created by the assisted installer. Right now it looks like the `assisted-installer` namespace has some security-related configuration that prevents the assisted-service pods from getting a proper securityContext (which is why deleting and recreating the namespace, as in the workaround described above, solves the issue).
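One way to compare what SCC admission assigned to each namespace is to look at the `openshift.io/sa.scc.*` annotations (and the labels) on the namespaces, e.g.:

```
# oc get namespace assisted-installer assisted-installer-2 -o yaml | grep -E 'name:|sa.scc|run-level'
```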
On the SNO cluster the assisted installer runs in a highly privileged mode, so Security Context Constraints protections are not applied in its namespace. This seems to be the reason why the UID of the postgres container does not match what we expect.

```
# oc describe namespace assisted-installer
Name:         assisted-installer
Labels:       kubernetes.io/metadata.name=assisted-installer
              name=assisted-installer
              olm.operatorgroup.uid/e8c94d42-63c9-4849-b410-9f42cb8b413b=
              openshift.io/run-level=0
[...]

# oc describe namespace assisted-installer-2
Name:         assisted-installer-2
Labels:       kubernetes.io/metadata.name=assisted-installer-2
              name=assisted-installer-2
              olm.operatorgroup.uid/89308c99-e2fa-42ec-9ea6-249508511613=
[...]
```

This is also why recreating the namespace solves the issue: after recreation the `run-level=0` label is no longer set.
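If that is the culprit, the label can also be dropped from the existing namespace without recreating it; a sketch (the trailing `-` removes the label, and the pods need to be recreated afterwards so SCC admission re-injects the securityContext):

```
# oc label namespace assisted-installer openshift.io/run-level-
# oc -n assisted-installer rollout restart deployment/assisted-service
```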
That makes sense. After cluster installation, we run something in the new cluster's "assisted-installer" namespace. Its elevated permissions cause an issue later for assisted-service and its operator. We should probably use different namespaces so they don't conflict.
My personal opinion is that we should (1) recommend using another namespace, but also (2) stop using elevated permissions unless really needed.

The reasoning for (1) is that the namespace is used for assisted-service in both cases, but being a "child" of assisted-service (which is why the SNO has this namespace to start with) is one thing, and being a parent spawning new clusters is another. On the one hand it feels natural that, being the same service, it belongs in a single namespace; on the other hand, keeping them separated may spare us issues and debugging time in the future, because it makes it clearer whether we are debugging something related to being deployed by assisted-service or something related to deploying further clusters.

As for (2), I'm currently testing whether the cluster is fully functional when we drop the run-level label from its assisted-installer namespace. From a very brief look (and from checking the implications of setting a run-level) I don't see any strong reason why it should cause problems, but it needs a bit more testing.
After internal discussions, the solution should be one of the following:
* delete the assisted-installer namespace after the controller job finishes
* remove the run-level label from the assisted-installer namespace after the controller job finishes
What's the reason to not use a separate namespace? Is the above just the near-term solution for the upcoming release, and we can still change namespace in a future release? Or is there some other reason?
We could use a separate namespace, but in principle the run-level label should be handled regardless, since we are currently using it to control the order in which the pods are started (which is not, I believe, the original purpose of this label). For the next release we will handle it via bug 1966621.

The near-term solution to the problem happening here (i.e. when we use the same namespace) is simply to remove the label when it's no longer needed. Whether or not to reuse the namespace is, I think, a matter of guidelines. There are reasons for and against it, but I think the current solution is a fair trade-off.
AIUI the near-term solution still leaves us with a race condition. We may be able to mitigate it sufficiently, but it still exists. The namespace in question can't be used for installation of the assisted-service-operator until after the label is removed. Using a different namespace would eliminate the race. 1966621 sounds like it would remove the point of conflict, making the race irrelevant. There's no advantage to putting these different components in the same namespace, so in general isolating them in their own namespaces seems like a reasonable default.
As discussed in the PR [1], we should recommend not to use the same namespace or to remove the label manually if required. The effort should go into solving 1966621 as an underlying issue. [1] https://github.com/openshift/assisted-installer/pull/301#issuecomment-854975449
The PR[0] implements an interim solution for the 4.8 release. In order for the cherry-pick to be merged, this bug will have to be flagged as VERIFIED. Therefore, I believe we will need a new bug that will focus on the long-term solution (not using the label, to begin with). [0] https://github.com/openshift/assisted-installer/pull/301
Please note there is already a bug against master branch for complete removal of run-level label - https://bugzilla.redhat.com/show_bug.cgi?id=1966621
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days