Bug 2023657
| Summary: | inconsistent behaviours of adding ssh key on rhel node between 4.9 and 4.10 | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | Rio Liu <rioliu> |
| Component: | Machine Config Operator | Assignee: | mkenigsb |
| Sub component: | Machine Config Operator | QA Contact: | Rio Liu <rioliu> |
| Status: | CLOSED ERRATA | Severity: | high |
| Priority: | high | CC: | aos-bugs, jerzhang, mkenigsb, mkrejci, skumari, sregidor |
| Version: | 4.10 | Keywords: | Upgrades |
| Target Release: | 4.10.0 | Target Milestone: | --- |
| Hardware: | All | OS: | Unspecified |
| Type: | Bug | Last Closed: | 2022-03-10 16:28:41 UTC |
Description
Rio Liu, 2021-11-16 09:56:13 UTC
Thanks for the detailed report! I think this is a result of https://github.com/openshift/machine-config-operator/pull/2813. For some background: we never required a "core" user to exist on RHEL nodes. The old behaviour was "wrong" in the sense that it wrote an ssh key to nodes that don't even have the user, but it didn't really break anything. Now it gets explicitly rejected, which is unfortunate, since it may break upgrades on rhel nodes where previously it would make a dummy write. To not regress that, we have a few options, such as:

1. if the user does not exist, write anyway (using root), much like our existing behaviour
2. do not write to rhel nodes at all (but now we risk leaking "old" sshkeys we can no longer remove, since they wouldn't be managed)

@Matthew are you willing to take a look at this? Happy to discuss our options further.

Also @Rio I see you mention testing with RHEL7+4.10. I am pretty sure we no longer support rhel 7 workers in 4.10, and you must reprovision rhel8 workers. Should we update tests to reflect that?

What's the desired behavior? Ignore the ssh key on RHEL and don't create the home directory?

I am not sure. We can maybe discuss this during a team session? I am leaning towards preserving the original behaviour if the core user does not exist, but that does still retain a somewhat undesirable behaviour.

I'm not a huge fan of that because we're going to have to add a case for RHEL no matter what, and adding an if branch just to write ssh keys with root permissions that we don't want really doesn't make a whole lot of sense. Can you explain what you mean by leaking "old" ssh keys a bit more?

Sure. What happens today is that if you installed in 4.9 or before, the ssh keys are actually written to the RHEL nodes, although there is no core user.
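The distinction the thread keeps returning to (does a "core" user exist on the node?) comes down to a user-database lookup. A rough shell sketch follows; the `user_exists` helper is hypothetical and stands in for the MCO's actual Go-side lookup, it is not the operator's code:

```shell
# Sketch (not the MCO's actual implementation) of the user lookup behind the
# 4.10 change: RHCOS ships a "core" user, a default RHEL worker does not, and
# the failed lookup is what degrades the node in 4.10.
user_exists() {
  id -u "$1" >/dev/null 2>&1
}

if user_exists core; then
  echo "core exists: key can be written to ~core/.ssh/authorized_keys"
else
  echo "core missing: 4.10 rejects the update (pre-4.10 wrote the key as root anyway)"
fi
```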
This also implies that it would "work" if you happen to have a rhel worker with a "core" user (I think, although I could be wrong on this), although this is not required by default, and the sshkey gets written regardless. There's no guarantee that there are no users leveraging this (maybe nobody is, and that would be the best-case scenario). So when a user updates to 4.10 and then tries to update their sshkeys, the rhel nodes would actually perform an update that is a no-op. The old sshkey on the node would not get rotated if we choose to no longer manage it, but nothing would delete it today. Maybe some more nuanced options:

1. the MCO does not fail if you don't have a core user, and just writes it if you do
2. the MCO no longer manages rhel node sshkeys, and deletes any it may have written before the update (not sure how safe this is)

(In reply to Yu Qi Zhang from comment #1)
> Also @Rio I see you mention testing with RHEL7+4.10. I am pretty sure we no
> longer support rhel 7 workers in 4.10, and you must reprovision rhel8
> workers. Should we update tests to reflect that?

Hi Jerry, tested the same scenario on an ocp4.10+rhel8 cluster; the issue can be reproduced as well.
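The reproduction below applies a MachineConfig from `add-additional-ssh-authorized-keys.yaml`. The file's contents are not included in the report, so the following is an assumed minimal example (the public key value is a placeholder); the `99-add-ssh` name matches the object created in the transcript:

```yaml
# Assumed contents of add-additional-ssh-authorized-keys.yaml (not taken from
# the original report); the key below is a placeholder.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-add-ssh
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
    passwd:
      users:
        - name: core
          sshAuthorizedKeys:
            - ssh-ed25519 AAAAC3Nza... user@example.com
```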
```
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-11-15-034648   True        False         92m     Cluster version is 4.10.0-0.nightly-2021-11-15-034648

$ oc get node -o wide -l node-role.kubernetes.io/worker
NAME                                        STATUS   ROLES    AGE    VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                        KERNEL-VERSION                 CONTAINER-RUNTIME
ip-10-0-51-210.us-east-2.compute.internal   Ready    worker   3m     v1.22.1+4111b82   10.0.51.210   <none>        Red Hat Enterprise Linux 8.4 (Ootpa)                            4.18.0-348.2.1.el8_5.x86_64    cri-o://1.23.0-23.rhaos4.10.git407d866.el8
ip-10-0-55-140.us-east-2.compute.internal   Ready    worker   126m   v1.22.1+f773b8b   10.0.55.140   <none>        Red Hat Enterprise Linux CoreOS 410.84.202111112202-0 (Ootpa)   4.18.0-305.25.1.el8_4.x86_64   cri-o://1.23.0-12.rhaos4.10.git6ee64e9.el8
ip-10-0-60-51.us-east-2.compute.internal    Ready    worker   3m1s   v1.22.1+4111b82   10.0.60.51    <none>        Red Hat Enterprise Linux 8.4 (Ootpa)                            4.18.0-348.2.1.el8_5.x86_64    cri-o://1.23.0-23.rhaos4.10.git407d866.el8
ip-10-0-60-52.us-east-2.compute.internal    Ready    worker   130m   v1.22.1+f773b8b   10.0.60.52    <none>        Red Hat Enterprise Linux CoreOS 410.84.202111112202-0 (Ootpa)   4.18.0-305.25.1.el8_4.x86_64   cri-o://1.23.0-12.rhaos4.10.git6ee64e9.el8
ip-10-0-68-51.us-east-2.compute.internal    Ready    worker   130m   v1.22.1+f773b8b   10.0.68.51    <none>        Red Hat Enterprise Linux CoreOS 410.84.202111112202-0 (Ootpa)   4.18.0-305.25.1.el8_4.x86_64   cri-o://1.23.0-12.rhaos4.10.git6ee64e9.el8

$ oc create -f add-additional-ssh-authorized-keys.yaml
machineconfig.machineconfiguration.openshift.io/99-add-ssh created

$ oc get mc --sort-by metadata.creationTimestamp
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
99-master-ssh                                                                                 3.2.0             138m
99-worker-ssh                                                                                 3.2.0             138m
00-worker                                          0abf68a0c3206df0be0e13980da645d2c0ac9aa7   3.2.0             134m
01-master-container-runtime                        0abf68a0c3206df0be0e13980da645d2c0ac9aa7   3.2.0             134m
01-master-kubelet                                  0abf68a0c3206df0be0e13980da645d2c0ac9aa7   3.2.0             134m
01-worker-container-runtime                        0abf68a0c3206df0be0e13980da645d2c0ac9aa7   3.2.0             134m
99-worker-generated-registries                     0abf68a0c3206df0be0e13980da645d2c0ac9aa7   3.2.0             134m
99-master-generated-registries                     0abf68a0c3206df0be0e13980da645d2c0ac9aa7   3.2.0             134m
00-master                                          0abf68a0c3206df0be0e13980da645d2c0ac9aa7   3.2.0             134m
01-worker-kubelet                                  0abf68a0c3206df0be0e13980da645d2c0ac9aa7   3.2.0             134m
rendered-master-84f6fdac64f56ce17f188b0508c75a2b   0abf68a0c3206df0be0e13980da645d2c0ac9aa7   3.2.0             134m
rendered-worker-9ac8035c2db668db3d79a43a1f02895b   0abf68a0c3206df0be0e13980da645d2c0ac9aa7   3.2.0             134m
99-add-ssh                                                                                    3.2.0             35s
rendered-worker-79f61a75ef9db8bc23775504cc259d37   0abf68a0c3206df0be0e13980da645d2c0ac9aa7   3.2.0             30s

$ oc get mcp/worker
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
worker   rendered-worker-9ac8035c2db668db3d79a43a1f02895b   False     True       True       5              3                   3                     1                      135m

$ oc get mcp/worker -o yaml | yq -y '.status.conditions'
- lastTransitionTime: '2021-11-18T01:44:48Z'
  message: ''
  reason: ''
  status: 'False'
  type: RenderDegraded
- lastTransitionTime: '2021-11-18T03:58:24Z'
  message: ''
  reason: ''
  status: 'False'
  type: Updated
- lastTransitionTime: '2021-11-18T03:58:24Z'
  message: All nodes are updating to rendered-worker-79f61a75ef9db8bc23775504cc259d37
  reason: ''
  status: 'True'
  type: Updating
- lastTransitionTime: '2021-11-18T03:58:49Z'
  message: 'Node ip-10-0-60-51.us-east-2.compute.internal is reporting: "failed to retrieve UserID for username: core"'
  reason: 1 nodes are reporting degraded status on sync
  status: 'True'
  type: NodeDegraded
- lastTransitionTime: '2021-11-18T03:58:49Z'
  message: ''
  reason: ''
  status: 'True'
  type: Degraded
```

I think this issue is not related to the rhel os version. The key point here is that the core user does not exist on the rhel node.

Would it work even if you had a core user? If authorized_keys has permissions 600 and is owned by root, a core user couldn't even use it, right?
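The mode-600 point can be illustrated with a small sketch; a temp file stands in for the root-owned authorized_keys a pre-4.10 MCO would leave on a node without a core user (the path and key value are stand-ins, not from the report):

```shell
# Sketch: a pre-4.10 MCO write on a node without a "core" user leaves an
# authorized_keys file with mode 600 owned by the writer (root), so any
# later-created non-root "core" user could not read it.
tmp=$(mktemp)
echo 'ssh-ed25519 AAAA... placeholder' > "$tmp"
chmod 600 "$tmp"
stat -c '%a' "$tmp"   # prints 600: only the owning user can read the key
rm -f "$tmp"
```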
The precondition of this issue is updating the ssh key on a RHEL node. The concern is not whether the key can be used or not; it is that the original behavior does not block the upgrade operation, while the current logic in 4.10 will block an upgrade from a cluster like ocp4.9+RHEL worker, because the core user is not expected on a RHEL node.

Right, I was more responding to:

> 2. the MCO no longer manages rhel node sshkeys, and deletes any it may have written before the update (not sure how safe this is)

I was wondering if it's safe to delete a key a user does not have permission to read.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056