Summary: MCO behaviour when adding an ssh authorized key on a rhel node is inconsistent between ocp 4.9 and 4.10:

1. ocp4.9+rhel7: user core does not exist on the rhel node; a new mc can be applied to the rhel node; user core is not created, but a home dir is created for the ssh key
2. ocp4.10+rhel7: user core does not exist on the rhel node; a new mc cannot be applied; the mcp and node become degraded
3. upgrade from ocp4.9+rhel7 to ocp4.10+rhel7: the upgrade fails with the same error as in #2

Verify the add-ssh-auth-key behaviour on an ocp4.9+rhel7.9 cluster:

oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.7     True        False         37m     Cluster version is 4.9.7

oc get node -l node-role.kubernetes.io/worker -o wide
NAME                                        STATUS   ROLES    AGE   VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                        KERNEL-VERSION                 CONTAINER-RUNTIME
ip-10-0-56-244.us-east-2.compute.internal   Ready    worker   10m   v1.22.2+5e38c72   10.0.56.244   <none>        Red Hat Enterprise Linux Server 7.9 (Maipo)                     3.10.0-1160.45.1.el7.x86_64    cri-o://1.22.1-2.rhaos4.9.git63ca938.el7
ip-10-0-62-99.us-east-2.compute.internal    Ready    worker   10m   v1.22.2+5e38c72   10.0.62.99    <none>        Red Hat Enterprise Linux Server 7.9 (Maipo)                     3.10.0-1160.45.1.el7.x86_64    cri-o://1.22.1-2.rhaos4.9.git63ca938.el7
ip-10-0-63-175.us-east-2.compute.internal   Ready    worker   56m   v1.22.1+d8c4430   10.0.63.175   <none>        Red Hat Enterprise Linux CoreOS 49.84.202111022104-0 (Ootpa)   4.18.0-305.25.1.el8_4.x86_64   cri-o://1.22.0-77.rhaos4.9.gitd745cab.el8
ip-10-0-63-77.us-east-2.compute.internal    Ready    worker   52m   v1.22.1+d8c4430   10.0.63.77    <none>        Red Hat Enterprise Linux CoreOS 49.84.202111022104-0 (Ootpa)   4.18.0-305.25.1.el8_4.x86_64   cri-o://1.22.0-77.rhaos4.9.gitd745cab.el8
ip-10-0-69-115.us-east-2.compute.internal   Ready    worker   52m   v1.22.1+d8c4430   10.0.69.115   <none>        Red Hat Enterprise Linux CoreOS 49.84.202111022104-0 (Ootpa)   4.18.0-305.25.1.el8_4.x86_64   cri-o://1.22.0-77.rhaos4.9.gitd745cab.el8

Check whether user core exists on the rhel node:

oc debug node/ip-10-0-56-244.us-east-2.compute.internal
Starting pod/ip-10-0-56-244us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.56.244
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.2# cat /etc/passwd | grep -i core
sh-4.2# id core
id: core: no such user
sh-4.2# ls /home
ec2-user

User core does not exist on the rhel node.

Create a new mc for the worker pool to update the ssh auth key:

cat add-additional-ssh-authorized-keys.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-add-ssh
spec:
  config:
    ignition:
      version: 3.2.0
    passwd:
      users:
        - name: core
          sshAuthorizedKeys:
            - ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDPmGf/sfIYog1KaHj50H0vaDRITn4Wa8RN9bgc2jj6SejvxhAWZVc4BrRst6BdhGr34IowkZmz76ba9jfa4nGm2HNd+CGqf6KmUhwPjF9oJNjy3z5zT2i903OZii35MUnJl056YXgKYpN96WAD5LVOKop/+7Soxq4PW8TtVZeSpHiPNI28XiIdyqGLzJerhlgPLZBsNO0JcVH1DYLd/c4fh5GDLutszZH/dzAX5RmvN1P/cHie+BnkbgNx91NbrOLTrV5m3nY2End5uGDl8zhaGQ2BX2TmnMqWyxYkYuzNmQFprHMNCCpqLshFGRvCFZGpc6L/72mlpcJubzBF0t5Z mco_test

oc create -f add-additional-ssh-authorized-keys.yaml
machineconfig.machineconfiguration.openshift.io/99-add-ssh created

oc get mc --sort-by metadata.creationTimestamp
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
99-master-ssh                                                                                 3.2.0             68m
99-worker-ssh                                                                                 3.2.0             67m
00-master                                          33286190af0d4d340af5c61e603d185780e74b39   3.2.0             64m
01-master-kubelet                                  33286190af0d4d340af5c61e603d185780e74b39   3.2.0             64m
01-worker-container-runtime                        33286190af0d4d340af5c61e603d185780e74b39   3.2.0             64m
01-worker-kubelet                                  33286190af0d4d340af5c61e603d185780e74b39   3.2.0             64m
01-master-container-runtime                        33286190af0d4d340af5c61e603d185780e74b39   3.2.0             64m
99-master-generated-registries                     33286190af0d4d340af5c61e603d185780e74b39   3.2.0             64m
00-worker                                          33286190af0d4d340af5c61e603d185780e74b39   3.2.0             64m
99-worker-generated-registries                     33286190af0d4d340af5c61e603d185780e74b39   3.2.0             64m
rendered-master-85d30989eec2c74b3f956f0a68af5717   33286190af0d4d340af5c61e603d185780e74b39   3.2.0             64m
rendered-worker-e0d974bebe6d12406c20ff95c7d78c96   33286190af0d4d340af5c61e603d185780e74b39   3.2.0             64m
99-add-ssh                                                                                    3.2.0             14s
rendered-worker-9f03262499354f7a5b0fc3b2c14b36f1   33286190af0d4d340af5c61e603d185780e74b39   3.2.0             9s

The new ssh mc 99-add-ssh can be applied on the worker pool:

oc get mcp/worker
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
worker   rendered-worker-9f03262499354f7a5b0fc3b2c14b36f1   True      False      False      5              5                   5                     0                      68m

Confirm whether the worker pool contains the new mc:

oc get mcp/worker -o yaml | yq -y '.status.configuration'
name: rendered-worker-9f03262499354f7a5b0fc3b2c14b36f1
source:
- apiVersion: machineconfiguration.openshift.io/v1
  kind: MachineConfig
  name: 00-worker
- apiVersion: machineconfiguration.openshift.io/v1
  kind: MachineConfig
  name: 01-worker-container-runtime
- apiVersion: machineconfiguration.openshift.io/v1
  kind: MachineConfig
  name: 01-worker-kubelet
- apiVersion: machineconfiguration.openshift.io/v1
  kind: MachineConfig
  name: 99-add-ssh
- apiVersion: machineconfiguration.openshift.io/v1
  kind: MachineConfig
  name: 99-worker-generated-registries
- apiVersion: machineconfiguration.openshift.io/v1
  kind: MachineConfig
  name: 99-worker-ssh

Check ssh auth key info on the rhel node:

oc debug node/ip-10-0-56-244.us-east-2.compute.internal
Starting pod/ip-10-0-56-244us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.56.244
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.2# ls /home
core  ec2-user
sh-4.2# ls /home/core
sh-4.2# ls /home/core/.ssh/authorized_keys
/home/core/.ssh/authorized_keys
sh-4.2# cat /home/core/.ssh/authorized_keys
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDPmGf/sfIYog1KaHj50H0vaDRITn4Wa8RN9bgc2jj6SejvxhAWZVc4BrRst6BdhGr34IowkZmz76ba9jfa4nGm2HNd+CGqf6KmUhwPjF9oJNjy3z5zT2i903OZii35MUnJl056YXgKYpN96WAD5LVOKop/+7Soxq4PW8TtVZeSpHiPNI28XiIdyqGLzJerhlgPLZBsNO0JcVH1DYLd/c4fh5GDLutszZH/dzAX5RmvN1P/cHie+BnkbgNx91NbrOLTrV5m3nY2End5uGDl8zhaGQ2BX2TmnMqWyxYkYuzNmQFprHMNCCpqLshFGRvCFZGpc6L/72mlpcJubzBF0t5Z mco_test
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCWkwurd8TNAi+D7ffvyDdhGBSQtJx3/Yedlwvvha0q772vLlOAGlKCw4dajKy6qty1/GGQDgTJ17h3C9TEArI8ZqILnyydeY56DL+ELN3dtGBVof/N2qtW0+SmEnd1Mi7Qy5Tx4e/GVmB3NgX9szwNOVXhebzgBsXc9x+RtCVLPLC8J+qqSdTUZ0UfJsh2ptlQLGHmmTpF//QlJ1tngvAFeCOxJUhrLAa37P9MtFsiNk31EfKyBk3eIdZljTERmqFaoJCohsFFEdO7tVgU6p5NwniAyBGZVjZBzjELoI1aZ+/g9yReIScxl1R6PWqEzcU6lGo2hInnb6nuZFGb+90D openshift-qe
sh-4.2#
sh-4.2# cat /etc/passwd | grep -i core
sh-4.2# id core
id: core: no such user

User core is not created, but /home/core is created and the auth key is updated in the file /home/core/.ssh/authorized_keys.

Trigger upgrade from 4.9 to 4.10:

oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2021-11-15-034648 --force --allow-explicit-upgrade
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade to the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Updating to release image registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2021-11-15-034648

mcp/worker is degraded due to the error "failed to retrieve UserID for username: core":

oc get mcp/worker
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
worker   rendered-worker-9f03262499354f7a5b0fc3b2c14b36f1   False     True       True       5              0                   0                     1                      154m

oc get mcp/worker -o yaml | yq -y '.status'
conditions:
- lastTransitionTime: '2021-11-16T06:31:22Z'
  message: ''
  reason: ''
  status: 'False'
  type: RenderDegraded
- lastTransitionTime: '2021-11-16T08:47:54Z'
  message: ''
  reason: ''
  status: 'False'
  type: Updated
- lastTransitionTime: '2021-11-16T08:47:54Z'
  message: All nodes are updating to rendered-worker-62b315cbcf7d4ce752037af022820f9c
  reason: ''
  status: 'True'
  type: Updating
- lastTransitionTime: '2021-11-16T08:49:50Z'
  message: 'Node ip-10-0-62-99.us-east-2.compute.internal is reporting: "failed to retrieve UserID for username: core"'
  reason: 1 nodes are reporting degraded status on sync
  status: 'True'
  type: NodeDegraded
- lastTransitionTime: '2021-11-16T08:49:50Z'
  message: ''
  reason: ''
  status: 'True'
  type: Degraded
configuration:
  name: rendered-worker-9f03262499354f7a5b0fc3b2c14b36f1
  source:
  - apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    name: 00-worker
  - apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    name: 01-worker-container-runtime
  - apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    name: 01-worker-kubelet
  - apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    name: 99-add-ssh
  - apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    name: 99-worker-generated-registries
  - apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    name: 99-worker-ssh
degradedMachineCount: 1
machineCount: 5
observedGeneration: 4
readyMachineCount: 0
unavailableMachineCount: 1
updatedMachineCount: 0

oc get node/ip-10-0-62-99.us-east-2.compute.internal -o yaml | yq -y '.metadata.annotations'
csi.volume.kubernetes.io/nodeid: '{"ebs.csi.aws.com":"i-071ce7e8547223187"}'
machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
machineconfiguration.openshift.io/currentConfig: rendered-worker-9f03262499354f7a5b0fc3b2c14b36f1
machineconfiguration.openshift.io/desiredConfig: rendered-worker-62b315cbcf7d4ce752037af022820f9c
machineconfiguration.openshift.io/reason: 'failed to retrieve UserID for username: core'
machineconfiguration.openshift.io/ssh: accessed
machineconfiguration.openshift.io/state: Degraded
volumes.kubernetes.io/controller-managed-attach-detach: 'true'

Verify the add-ssh-auth-key behaviour on an ocp4.10+rhel7.9 cluster:

oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-11-14-184249   True        False         7m11s   Cluster version is 4.10.0-0.nightly-2021-11-14-184249

oc get node -l node-role.kubernetes.io/worker -o wide
NAME                                                        STATUS   ROLES    AGE     VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                         KERNEL-VERSION                 CONTAINER-RUNTIME
rioliu111603-skfnk-w-a-l-rhel-0                             Ready    worker   3m47s   v1.22.1+0474f31   10.0.128.6    <none>        Red Hat Enterprise Linux Server 7.9 (Maipo)                      3.10.0-1160.45.1.el7.x86_64    cri-o://1.23.0-12.rhaos4.10.git6ee64e9.el7
rioliu111603-skfnk-w-a-l-rhel-1                             Ready    worker   3m47s   v1.22.1+0474f31   10.0.128.5    <none>        Red Hat Enterprise Linux Server 7.9 (Maipo)                      3.10.0-1160.45.1.el7.x86_64    cri-o://1.23.0-12.rhaos4.10.git6ee64e9.el7
rioliu111603-skfnk-worker-a-p6kw8.c.openshift-qe.internal   Ready    worker   36m     v1.22.1+f773b8b   10.0.128.2    <none>        Red Hat Enterprise Linux CoreOS 410.84.202111112202-0 (Ootpa)   4.18.0-305.25.1.el8_4.x86_64   cri-o://1.23.0-12.rhaos4.10.git6ee64e9.el8
rioliu111603-skfnk-worker-b-7bvxj.c.openshift-qe.internal   Ready    worker   36m     v1.22.1+f773b8b   10.0.128.3    <none>        Red Hat Enterprise Linux CoreOS 410.84.202111112202-0 (Ootpa)   4.18.0-305.25.1.el8_4.x86_64   cri-o://1.23.0-12.rhaos4.10.git6ee64e9.el8
rioliu111603-skfnk-worker-c-v254k.c.openshift-qe.internal   Ready    worker   36m     v1.22.1+f773b8b   10.0.128.4    <none>        Red Hat Enterprise Linux CoreOS 410.84.202111112202-0 (Ootpa)   4.18.0-305.25.1.el8_4.x86_64   cri-o://1.23.0-12.rhaos4.10.git6ee64e9.el8

Check whether user core exists on the rhel node:

oc debug node/rioliu111603-skfnk-w-a-l-rhel-0
W1116 17:35:22.768924   86851 warnings.go:70] would violate "latest" version of "baseline" PodSecurity profile: host namespaces (hostNetwork=true, hostPID=true), hostPath volumes (volume "host"), privileged (container "container-00" must not set securityContext.privileged=true)
Starting pod/rioliu111603-skfnk-w-a-l-rhel-0-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.128.6
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.2# cat /etc/passwd | grep -i core
sh-4.2# ls /home
cloud-user
sh-4.2# id core
id: core: no such user

Create a new mc for the worker pool to update the ssh auth key:

oc create -f add-additional-ssh-authorized-keys.yaml
machineconfig.machineconfiguration.openshift.io/99-add-ssh created

oc get mc --sort-by metadata.creationTimestamp
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
99-worker-ssh                                                                                 3.2.0             47m
99-master-ssh                                                                                 3.2.0             47m
00-worker                                          0abf68a0c3206df0be0e13980da645d2c0ac9aa7   3.2.0             44m
01-master-container-runtime                        0abf68a0c3206df0be0e13980da645d2c0ac9aa7   3.2.0             44m
01-master-kubelet                                  0abf68a0c3206df0be0e13980da645d2c0ac9aa7   3.2.0             44m
01-worker-container-runtime                        0abf68a0c3206df0be0e13980da645d2c0ac9aa7   3.2.0             44m
00-master                                          0abf68a0c3206df0be0e13980da645d2c0ac9aa7   3.2.0             44m
99-master-generated-registries                     0abf68a0c3206df0be0e13980da645d2c0ac9aa7   3.2.0             44m
01-worker-kubelet                                  0abf68a0c3206df0be0e13980da645d2c0ac9aa7   3.2.0             44m
99-worker-generated-registries                     0abf68a0c3206df0be0e13980da645d2c0ac9aa7   3.2.0             44m
rendered-master-5610cc678fdd33547639bf3b164df970   0abf68a0c3206df0be0e13980da645d2c0ac9aa7   3.2.0             44m
rendered-worker-c03f11633f9d9bcb24383693788d1c13   0abf68a0c3206df0be0e13980da645d2c0ac9aa7   3.2.0             44m
99-add-ssh                                                                                    3.2.0             10s
rendered-worker-f37c6b9c94dd265c52f040b7d5198433   0abf68a0c3206df0be0e13980da645d2c0ac9aa7   3.2.0             5s

mcp/worker is degraded:

oc get mcp/worker
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
worker   rendered-worker-c03f11633f9d9bcb24383693788d1c13   False     True       True       5              0                   0                     1                      45m

oc get mcp/worker -o yaml | yq -y '.status'
conditions:
- lastTransitionTime: '2021-11-16T08:52:36Z'
  message: ''
  reason: ''
  status: 'False'
  type: RenderDegraded
- lastTransitionTime: '2021-11-16T09:36:53Z'
  message: ''
  reason: ''
  status: 'False'
  type: Updated
- lastTransitionTime: '2021-11-16T09:36:53Z'
  message: All nodes are updating to rendered-worker-f37c6b9c94dd265c52f040b7d5198433
  reason: ''
  status: 'True'
  type: Updating
- lastTransitionTime: '2021-11-16T09:36:58Z'
  message: 'Node rioliu111603-skfnk-w-a-l-rhel-0 is reporting: "failed to retrieve UserID for username: core"'
  reason: 1 nodes are reporting degraded status on sync
  status: 'True'
  type: NodeDegraded
- lastTransitionTime: '2021-11-16T09:36:58Z'
  message: ''
  reason: ''
  status: 'True'
  type: Degraded
configuration:
  name: rendered-worker-c03f11633f9d9bcb24383693788d1c13
  source:
  - apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    name: 00-worker
  - apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    name: 01-worker-container-runtime
  - apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    name: 01-worker-kubelet
  - apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    name: 99-worker-generated-registries
  - apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    name: 99-worker-ssh
degradedMachineCount: 1
machineCount: 5
observedGeneration: 3
readyMachineCount: 0
unavailableMachineCount: 0
updatedMachineCount: 0

oc get node/rioliu111603-skfnk-w-a-l-rhel-0 -o yaml | yq -y '.metadata.annotations'
csi.volume.kubernetes.io/nodeid: '{"pd.csi.storage.gke.io":"projects/openshift-qe/zones/us-central1-a/instances/rioliu111603-skfnk-w-a-l-rhel-0"}'
machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
machineconfiguration.openshift.io/currentConfig: rendered-worker-c03f11633f9d9bcb24383693788d1c13
machineconfiguration.openshift.io/desiredConfig: rendered-worker-f37c6b9c94dd265c52f040b7d5198433
machineconfiguration.openshift.io/reason: 'failed to retrieve UserID for username: core'
machineconfiguration.openshift.io/ssh: accessed
machineconfiguration.openshift.io/state: Degraded
volumes.kubernetes.io/controller-managed-attach-detach: 'true'
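A quick way to pull the same state/reason annotations across all worker nodes, instead of dumping each node's YAML, is shown below. This is only a sketch; the jsonpath escaping of the dotted annotation keys is the one assumption here, the annotation names themselves are the ones visible in the output above.

# list each worker node with its MCD state and degrade reason (if any)
oc get nodes -l node-role.kubernetes.io/worker \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.machineconfiguration\.openshift\.io/state}{"\t"}{.metadata.annotations.machineconfiguration\.openshift\.io/reason}{"\n"}{end}'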
Thanks for the detailed report! I think this is a result of https://github.com/openshift/machine-config-operator/pull/2813.

For some background: we never required the "core" user to exist on RHEL nodes. The old behaviour was "wrong" in the sense that it wrote an ssh key to nodes that don't even have the user, but it didn't really break anything. Now it gets explicitly rejected, which is... unfortunate, since it may break upgrades on rhel nodes where previously it would make a dummy write. To not regress that, we have a few options, such as:

1. if the user does not exist, write anyway (using root), much like our existing behaviour
2. do not write to rhel nodes at all (but now we risk leaking "old" sshkeys we can no longer remove, since they wouldn't be managed)

@Matthew are you willing to take a look at this? Happy to discuss our options further.

Also @Rio I see you mention testing with RHEL7+4.10. I am pretty sure we no longer support rhel 7 workers in 4.10, and you must reprovision rhel8 workers. Should we update tests to reflect that?
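For triage, the rejection should also show up in the machine-config-daemon logs on the degraded node. The commands below are only a sketch: the namespace, pod label, and container name are my assumptions about the MCO daemonset layout, not something taken from this report, and <mcd-pod> is a placeholder.

# find the MCD pod running on the degraded node (namespace/label are assumed)
oc -n openshift-machine-config-operator get pods -l k8s-app=machine-config-daemon \
  --field-selector spec.nodeName=ip-10-0-62-99.us-east-2.compute.internal -o name
# then look for the rejection in its logs (container name is assumed)
oc -n openshift-machine-config-operator logs <mcd-pod> -c machine-config-daemon | grep -i "UserID"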
What's the desired behavior? Ignore the ssh key on RHEL and don't create the home directory?
I am not sure. We can maybe discuss this during a team session? I am leaning towards preserving the original behaviour if the core user does not exist, but that does still retain a somewhat undesirable behaviour.
I'm not a huge fan of that because we're going to have to add a case for RHEL no matter what, and adding an if branch just to write ssh keys with root permissions that we don't want really doesn't make a whole lot of sense. Can you explain what you mean by leaking "old" ssh keys a bit more?
Sure. What happens today is that if you installed in 4.9 or before, the ssh keys are actually written to the RHEL nodes, even though there is no core user. This also implies that it would "work" if you happen to have a rhel worker with a "core" user (I think, although I could be wrong on this); the user is not required by default, and the sshkey gets written regardless. There's no guarantee that there are no users leveraging this (maybe nobody is, and that would be the best case scenario). So when a user updates to 4.10 and then tries to update their sshkeys, the rhel nodes would actually perform an update that is a no-op. The old sshkey on the node would not get rotated if we choose to no longer manage it, but nothing would delete it today.

Maybe some more nuanced options:

1. the MCO does not fail if you don't have a core user, and just writes the key if you do
2. the MCO no longer manages rhel node sshkeys, and deletes any it may have written before the update (not sure how safe this is)
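To make the "leaking" concern concrete, here is one way to spot-check whether a RHEL worker already carries an MCO-written key today. This is only a sketch reusing the debug pattern and file path from the report above; the node name and the grep pattern (the key comment "mco_test") are illustrative.

# check a RHEL worker for a key the MCO wrote before the update
oc debug node/ip-10-0-56-244.us-east-2.compute.internal -- \
  chroot /host grep mco_test /home/core/.ssh/authorized_keys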
(In reply to Yu Qi Zhang from comment #1)
> Also @Rio I see you mention testing with RHEL7+4.10. I am pretty sure we no
> longer support rhel 7 workers in 4.10, and you must reprovision rhel8
> workers. Should we update tests to reflect that?

Hi Jerry, I tested the same scenario on an ocp4.10+rhel8 cluster; the issue can be reproduced there as well.

oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-11-15-034648   True        False         92m     Cluster version is 4.10.0-0.nightly-2021-11-15-034648

oc get node -o wide -l node-role.kubernetes.io/worker
NAME                                        STATUS   ROLES    AGE    VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                         KERNEL-VERSION                 CONTAINER-RUNTIME
ip-10-0-51-210.us-east-2.compute.internal   Ready    worker   3m     v1.22.1+4111b82   10.0.51.210   <none>        Red Hat Enterprise Linux 8.4 (Ootpa)                             4.18.0-348.2.1.el8_5.x86_64    cri-o://1.23.0-23.rhaos4.10.git407d866.el8
ip-10-0-55-140.us-east-2.compute.internal   Ready    worker   126m   v1.22.1+f773b8b   10.0.55.140   <none>        Red Hat Enterprise Linux CoreOS 410.84.202111112202-0 (Ootpa)   4.18.0-305.25.1.el8_4.x86_64   cri-o://1.23.0-12.rhaos4.10.git6ee64e9.el8
ip-10-0-60-51.us-east-2.compute.internal    Ready    worker   3m1s   v1.22.1+4111b82   10.0.60.51    <none>        Red Hat Enterprise Linux 8.4 (Ootpa)                             4.18.0-348.2.1.el8_5.x86_64    cri-o://1.23.0-23.rhaos4.10.git407d866.el8
ip-10-0-60-52.us-east-2.compute.internal    Ready    worker   130m   v1.22.1+f773b8b   10.0.60.52    <none>        Red Hat Enterprise Linux CoreOS 410.84.202111112202-0 (Ootpa)   4.18.0-305.25.1.el8_4.x86_64   cri-o://1.23.0-12.rhaos4.10.git6ee64e9.el8
ip-10-0-68-51.us-east-2.compute.internal    Ready    worker   130m   v1.22.1+f773b8b   10.0.68.51    <none>        Red Hat Enterprise Linux CoreOS 410.84.202111112202-0 (Ootpa)   4.18.0-305.25.1.el8_4.x86_64   cri-o://1.23.0-12.rhaos4.10.git6ee64e9.el8

oc create -f add-additional-ssh-authorized-keys.yaml
machineconfig.machineconfiguration.openshift.io/99-add-ssh created

oc get mc --sort-by metadata.creationTimestamp
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
99-master-ssh                                                                                 3.2.0             138m
99-worker-ssh                                                                                 3.2.0             138m
00-worker                                          0abf68a0c3206df0be0e13980da645d2c0ac9aa7   3.2.0             134m
01-master-container-runtime                        0abf68a0c3206df0be0e13980da645d2c0ac9aa7   3.2.0             134m
01-master-kubelet                                  0abf68a0c3206df0be0e13980da645d2c0ac9aa7   3.2.0             134m
01-worker-container-runtime                        0abf68a0c3206df0be0e13980da645d2c0ac9aa7   3.2.0             134m
99-worker-generated-registries                     0abf68a0c3206df0be0e13980da645d2c0ac9aa7   3.2.0             134m
99-master-generated-registries                     0abf68a0c3206df0be0e13980da645d2c0ac9aa7   3.2.0             134m
00-master                                          0abf68a0c3206df0be0e13980da645d2c0ac9aa7   3.2.0             134m
01-worker-kubelet                                  0abf68a0c3206df0be0e13980da645d2c0ac9aa7   3.2.0             134m
rendered-master-84f6fdac64f56ce17f188b0508c75a2b   0abf68a0c3206df0be0e13980da645d2c0ac9aa7   3.2.0             134m
rendered-worker-9ac8035c2db668db3d79a43a1f02895b   0abf68a0c3206df0be0e13980da645d2c0ac9aa7   3.2.0             134m
99-add-ssh                                                                                    3.2.0             35s
rendered-worker-79f61a75ef9db8bc23775504cc259d37   0abf68a0c3206df0be0e13980da645d2c0ac9aa7   3.2.0             30s

oc get mcp/worker
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
worker   rendered-worker-9ac8035c2db668db3d79a43a1f02895b   False     True       True       5              3                   3                     1                      135m

oc get mcp/worker -o yaml | yq -y '.status.conditions'
- lastTransitionTime: '2021-11-18T01:44:48Z'
  message: ''
  reason: ''
  status: 'False'
  type: RenderDegraded
- lastTransitionTime: '2021-11-18T03:58:24Z'
  message: ''
  reason: ''
  status: 'False'
  type: Updated
- lastTransitionTime: '2021-11-18T03:58:24Z'
  message: All nodes are updating to rendered-worker-79f61a75ef9db8bc23775504cc259d37
  reason: ''
  status: 'True'
  type: Updating
- lastTransitionTime: '2021-11-18T03:58:49Z'
  message: 'Node ip-10-0-60-51.us-east-2.compute.internal is reporting: "failed to retrieve UserID for username: core"'
  reason: 1 nodes are reporting degraded status on sync
  status: 'True'
  type: NodeDegraded
- lastTransitionTime: '2021-11-18T03:58:49Z'
  message: ''
  reason: ''
  status: 'True'
  type: Degraded

I think this issue is not related to the rhel os version. The key point here is that the core user does not exist on the rhel node.
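For widening the test matrix, a quick way to confirm that same precondition (no core user) on every worker in a cluster is to loop the check already used above over all worker nodes. The loop below is just a sketch; the stderr redirect only hides the client-side "Starting/Removing debug pod" messages.

# check each worker node for the core user
for n in $(oc get nodes -l node-role.kubernetes.io/worker -o name); do
  echo "== $n"
  oc debug "$n" -- chroot /host id core 2>/dev/null
done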
Would it work even if you had a core user? If authorized_keys has permissions 600 and is owned by root, a core user couldn't even use it, right?
The precondition of this issue is updating the ssh key on a RHEL node. The concern is not whether the key can be used or not; it is that the original behavior did not block the upgrade operation, while the current logic in 4.10 will block an upgrade from a cluster like ocp4.9+RHEL worker, because the core user is not expected on a RHEL node.
Right, I was more responding to:

> 2. the MCO no longer manages rhel node sshkeys, and deletes any it may have written before the update (not sure how safe this is)

I was wondering if it's safe to delete a key a user does not have permission to read.
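For what it's worth, the ownership/mode question could be answered directly on one of the affected nodes. The command below is only a sketch against the file path shown earlier in this report (the node name is from the 4.9 reproduction; the stat format just prints owner, group, and octal mode).

# show owner:group, mode, and path of the key file the MCO wrote
oc debug node/ip-10-0-56-244.us-east-2.compute.internal -- \
  chroot /host stat -c '%U:%G %a %n' /home/core/.ssh/authorized_keys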
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056