Bug 1850510
| Summary: | image-registry operator fail to use nfs-based pv | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Gurenko Alex <agurenko> |
| Component: | Documentation | Assignee: | Bob Furu <bfuru> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Wenjing Zheng <wzheng> |
| Severity: | medium | Docs Contact: | Vikram Goyal <vigoyal> |
| Priority: | unspecified | ||
| Version: | 4.4 | CC: | aos-bugs, bfuru, chuffman, jokerman, jsafrane, obulatov, piqin, wzheng |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | 4.6.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-10-26 15:41:38 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
Description
Gurenko Alex
2020-06-24 12:32:53 UTC
When the registry returns 500 Internal Server Error, it should have additional information in logs. Can you attach the registry logs (from image-registry pods)?

While trying to find the problem, here are additional findings: with no_root_squash there is an initial /registry permission denied error, which seems to be solved by changing it to all_squash. There are no additional requirements for NFS mentioned in the docs, but giving xx7 permissions on the NFS share seems to work, although a restart of the pod makes previous images unavailable due to a change in owner, which is <container_id>:root.

Looks like the major problem in the end was no_root_squash vs all_squash. With all_squash it works out of the box. Should we change this to a doc bug and update the documentation to say that the NFS export must be configured with the all_squash option? Right now the only option mentioned in the docs is "no_wdelay" when scaling, and moreover the example uses the no_root_squash parameter, which does not work - 3.4.4 step №2 (https://access.redhat.com/documentation/en-us/openshift_container_platform/4.4/html/registry/setting-up-and-configuring-the-registry#configuring-registry-storage-baremetal)

root_squash/no_root_squash/all_squash - none of them should affect the registry. It's a bug if files created by one pod cannot be read by another pod because of gid (fsGroup) or permissions. Please attach
* yamls of image-registry pods
* output of `oc -n openshift-image-registry exec pod/image-registry-5b677b9f9b-dw92l -- id` for every pod

Permissions shouldn't be <container_id>:root, they should be something like 1000090000:1000090000.

Here is what I have:
[kni@provisionhost-0-0 ~]$ oc get pods -o name
pod/cluster-image-registry-operator-c5f545597-dszjr
pod/image-registry-58c589cb44-4nbnb
pod/node-ca-6k475
pod/node-ca-7dvvx
pod/node-ca-8sbbr
pod/node-ca-bc92l
pod/node-ca-wzd7c
[kni@provisionhost-0-0 ~]$ for i in $(oc get pods -o name); do oc exec ${i} -- id; done
Defaulting container name to cluster-image-registry-operator.
Use 'oc describe pod/cluster-image-registry-operator-c5f545597-dszjr -n openshift-image-registry' to see all of the containers in this pod.
uid=1000160000(1000160000) gid=0(root) groups=0(root),1000160000
uid=1000160000(1000160000) gid=0(root) groups=0(root),1000160000
uid=1001(1001) gid=0(root) groups=0(root)
uid=1001(1001) gid=0(root) groups=0(root)
uid=1001(1001) gid=0(root) groups=0(root)
uid=1001(1001) gid=0(root) groups=0(root)
uid=1001(1001) gid=0(root) groups=0(root)
Probably the problem is that root is the primary group and the container_id group is only a secondary group? I'm trying to bring the environment to a semi-working state so you can have a look.
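To make the primary-versus-secondary group question easier to check, here is a minimal Python sketch that splits one line of `id` output into its parts. The helper names `parse_id` and `new_file_owner` are hypothetical and not part of any OpenShift tooling. The second helper assumes plain POSIX behavior, where a new file gets the process uid as owner and the primary gid as group unless the parent directory carries the setgid bit. That assumption would be consistent with the `<container_id>:root` ownership seen on the share.

```python
import re


def parse_id(line):
    """Parse one line of `id` output into uid, primary gid, and supplementary gids."""
    uid = int(re.search(r"uid=(\d+)", line).group(1))
    gid = int(re.search(r"\bgid=(\d+)", line).group(1))
    groups_field = re.search(r"groups=(\S+)", line).group(1)
    # Each entry looks like "0(root)" or a bare number such as "1000160000".
    groups = [int(re.match(r"\d+", entry).group(0)) for entry in groups_field.split(",")]
    supplementary = [g for g in groups if g != gid]
    return {"uid": uid, "gid": gid, "supplementary": supplementary}


def new_file_owner(identity):
    """Owner and group a new file would get, assuming no setgid bit on the directory."""
    return identity["uid"], identity["gid"]


registry_pod = parse_id("uid=1000160000(1000160000) gid=0(root) groups=0(root),1000160000")
print(registry_pod)
print(new_file_owner(registry_pod))
```

For the registry pod line above, the primary group is 0 (root) and the container-specific id appears only as a supplementary group.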
I tried with EmptyDir. I have similar results from `id`, but on the file system I have 1000090000:1000090000.
Question to the Storage team. The registry pods look like this:
spec:
  containers:
  - name: registry
    ...
    volumeMounts:
    - mountPath: /registry
      name: registry-storage
    ...
  securityContext:
    fsGroup: 1000090000
    seLinuxOptions:
      level: s0:c10,c0
  ...
  volumes:
  - name: registry-storage
    persistentVolumeClaim:
      claimName: image-registry-storage
Is having fsGroup enough to get the right permissions? How can this problem be fixed with NFS?
When using NFS, or any form of shared storage, it is strongly recommended to define a group on the export, and then add this group to the pods using supplemental groups instead of fsGroup. This is included in the NFS documentation at [1]. We should likely include these requirements in the registry documentation, or at least a link to the NFS section where it's mentioned.

In addition, you may wish to use `root_squash` [2]. This is required if you want arbitrary containers to read and write to the volume.

Let me know if there are any questions on the above. If this addresses your issue, then we can move forward with getting the documentation updated.

[1] https://docs.openshift.com/container-platform/4.4/storage/persistent_storage/persistent-storage-nfs.html#group-ids
[2] https://docs.openshift.com/container-platform/4.4/storage/persistent_storage/persistent-storage-nfs.html#export-settings

Created documentation PR to address this bug: https://github.com/openshift/openshift-docs/pull/24276

With this fix, a note is added at the end of the "Configuring registry storage for bare metal" procedure recommending the use of supplementalGroups. Additionally, the NFS example in step 2 is updated to include the use of both `no_wdelay` and `root_squash`.

@Christian - PTAL and let me know if this approach makes sense. Thanks.

SME reviewed and feedback applied in https://github.com/openshift/openshift-docs/pull/24276. Preview build links are in a PR comment.

Moving to ON_QA, peer review also requested.

Hello, Bob
Added some comments in the PR.

Thanks, Ping. Addressed your comments in the PR. Let me know if it is lgtm and I'll get this merged.

Thanks for updating the doc, I have only one concern. Right now the phrasing is: "If the storage type is NFS, and you want to scale up the registry Pod by setting replica>1 you must enable the no_wdelay and root_squash mount options." which to me reads like it should work with 1 replica, which is not 100% correct.
If the pod restarts for whatever reason, it will be assigned a new podID and the old registry would not be accessible without root_squash either, as we saw before, or am I wrong here? I would prefer to see something like: If you're going to use NFS, please make sure the following args are used for the share.

You are welcome, Alex, and thank you for sharing your concern. I've revised the wording, taking into account your suggestion.

> If the pod restarts for whatever reason it will be assigned a new
> podID and the old registry would not be accessible without root_squash either
> as we saw before or I'm wrong here?

@Christian and @Ping - PTAL at the revision I made to the PR (removed the reference for NFS with replica>1): https://github.com/openshift/openshift-docs/pull/24276/files#diff-5b1411d684f3cbd16c535854301f6b58R50-R52 Does this look correct, based on Alex's question?

Verified by QE, waiting for SME review before merge.

Approved by SME, QE, Docs. Content merged and manually cherrypicked to 4.3-4.6.

Verified live on docs.openshift.com. For example:
- https://docs.openshift.com/container-platform/4.4/registry/configuring_registry_storage/configuring-registry-storage-baremetal.html#registry-configuring-storage-baremetal_configuring-registry-storage-baremetal
- https://docs.openshift.com/container-platform/4.6/registry/configuring_registry_storage/configuring-registry-storage-vsphere.html#registry-configuring-storage-vsphere_configuring-registry-storage-vsphere

Closing BZ.

I would like to revisit the current solution, as I finally got time to re-test the updated solution with a fresh deployment. root_squash does not work out of the box: it needs to be either all_squash, or the image_registry needs to be adjusted in some way with a group, which is dynamic in the sense that it will be modified on a container restart/recreation.
With root_squash I'm getting:
sh: cd: registry/: Permission denied
which is logical, as the registry is mounted as:
drwx------. 2 65534 65534 4.0K Aug 11 12:58 registry
and id returns:
uid=1000110000(1000110000) gid=0(root) groups=0(root),1000110000
so the container user id is not squashed to nobody (65534), hence the problem with permissions.

Thanks for revisiting this, Alex. I've opened https://github.com/openshift/openshift-docs/pull/24800 and have requested QE and Eng to verify that changing to `all_squash` is sufficient. Moving to ON_QA.

Created https://github.com/openshift/openshift-docs/pull/25826 to address suggestions made by Christian. Awaiting SME, QE, peer review. @Christian and @Ping, PTAL - thanks!

QE review feedback applied, awaiting re-review.

SME review approved, merged to master (https://github.com/openshift/openshift-docs/pull/25826) but presently holding off on CP to 4.6 as it is bumping up against other recent doc changes. So the 4.6 CP PR is open and under review here: https://github.com/openshift/openshift-docs/pull/26512

4.6 manual CP PR merged, 4.5 CP completed, and manual CP of PR25826 for 4.4 created in https://github.com/openshift/openshift-docs/pull/26695 and merged.

All related image registry docs are now updated to remove NFS examples. Verified on docs.openshift.com:
- 4.6: https://docs.openshift.com/container-platform/4.6/registry/configuring_registry_storage/configuring-registry-storage-vsphere.html
- 4.5: https://docs.openshift.com/container-platform/4.5/registry/configuring_registry_storage/configuring-registry-storage-vsphere.html
- 4.4: https://docs.openshift.com/container-platform/4.4/registry/configuring_registry_storage/configuring-registry-storage-vsphere.html

Closing BZ. |
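For readers who want to see why `root_squash` did not help while `all_squash` did, the following Python sketch models the reported situation. It is a simplified illustration, not how an NFS server is implemented: it assumes the anonymous id is 65534 (as in the listing above), applies only the classic owner/group/other mode bits, and ignores ACLs, SELinux, and the server-side privileges of an unsquashed root user. All function names are hypothetical, and the group id 5555 with mode 0770 in the last scenario is an illustrative value, not one taken from this bug.

```python
NOBODY = 65534  # anonymous uid/gid seen in the listing; a server may be configured differently


def squash(uid, gids, option):
    """Map client credentials to the identity the server acts as (simplified)."""
    if option == "all_squash":
        return NOBODY, {NOBODY}
    if option == "root_squash":
        mapped_uid = NOBODY if uid == 0 else uid
        return mapped_uid, {NOBODY if g == 0 else g for g in gids}
    if option == "no_root_squash":
        return uid, set(gids)
    raise ValueError(f"unknown export option: {option}")


def can_enter(dir_uid, dir_gid, dir_mode, uid, gids):
    """Owner/group/other check for the execute (search) bit on a directory."""
    if uid == dir_uid:
        bits = (dir_mode >> 6) & 0o7
    elif dir_gid in gids:
        bits = (dir_mode >> 3) & 0o7
    else:
        bits = dir_mode & 0o7
    return bool(bits & 0o1)


def pod_can_enter(option, directory, uid, gids):
    """Apply the export option first, then the permission check."""
    dir_uid, dir_gid, dir_mode = directory
    mapped_uid, mapped_gids = squash(uid, gids, option)
    return can_enter(dir_uid, dir_gid, dir_mode, mapped_uid, mapped_gids)


# Values reported in the thread: drwx------ 65534:65534 and uid=1000110000 gid=0.
reported_dir = (65534, 65534, 0o700)
pod_uid, pod_gids = 1000110000, {0, 1000110000}

for option in ("no_root_squash", "root_squash", "all_squash"):
    print(option, pod_can_enter(option, reported_dir, pod_uid, pod_gids))

# Hypothetical group-owned export that the pod joins via a supplemental group.
group_dir = (0, 5555, 0o770)
print("group-owned export:", pod_can_enter("root_squash", group_dir, pod_uid, pod_gids | {5555}))
```

Under these assumptions only `all_squash` grants access to the reported directory, because the pod runs as a non-root uid that `root_squash` leaves unchanged. The last scenario suggests why a group defined on the export and added to the pod as a supplemental group would not depend on the uid assigned to the pod.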