Bug 2023657
| Summary: | inconsistent behaviours of adding ssh key on rhel node between 4.9 and 4.10 | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | Rio Liu <rioliu> |
| Component: | Machine Config Operator | Assignee: | mkenigsb |
| Sub component: | Machine Config Operator | QA Contact: | Rio Liu <rioliu> |
| Status: | CLOSED ERRATA | Severity: | high |
| Priority: | high | CC: | aos-bugs, jerzhang, mkenigsb, mkrejci, skumari, sregidor |
| Version: | 4.10 | Keywords: | Upgrades |
| Target Release: | 4.10.0 | Target Milestone: | --- |
| Hardware: | All | OS: | Unspecified |
| Type: | Bug | Last Closed: | 2022-03-10 16:28:41 UTC |
Description
Rio Liu, 2021-11-16 09:56:13 UTC
Thanks for the detailed report! I think this is a result of https://github.com/openshift/machine-config-operator/pull/2813. For some background: we never required a "core" user to exist on RHEL nodes. The old behaviour was "wrong" in the sense that it wrote an ssh key to nodes that don't even have the user, but it didn't really break anything. Now it gets explicitly rejected, which is unfortunate, since it may break upgrades on rhel nodes where previously it would make a dummy write. To not regress that, we have a few options, such as:

1. if the user does not exist, write anyway (using root), much like our existing behaviour
2. do not write to rhel nodes at all (but now we risk leaking "old" sshkeys we can no longer remove, since they wouldn't be managed)

@Matthew are you willing to take a look at this? Happy to discuss our options further.

Also @Rio I see you mention testing with RHEL7+4.10. I am pretty sure we no longer support rhel 7 workers in 4.10, and you must reprovision rhel8 workers. Should we update tests to reflect that?

What's the desired behavior? Ignore the ssh key on RHEL and don't create the home directory?

I am not sure. We can maybe discuss this during a team session? I am leaning towards preserving the original behaviour if the core user does not exist, but that does still retain a somewhat undesirable behaviour.

I'm not a huge fan of that because we're going to have to add a case for RHEL no matter what, and adding an if branch just to write ssh keys with root permissions that we don't want really doesn't make a whole lot of sense. Can you explain what you mean by leaking "old" ssh keys a bit more?

Sure. What happens today is that if you installed in 4.9 or before, the ssh keys are actually written to the RHEL nodes, although there is no core user.
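The distinction the thread keeps returning to (does a "core" user exist on the node?) comes down to a user-database lookup. A rough shell sketch follows; the `user_exists` helper is hypothetical and stands in for the MCO's actual Go-side lookup, it is not the operator's code:

```shell
# Sketch (not the MCO's actual implementation) of the user lookup behind the
# 4.10 change: RHCOS ships a "core" user, a default RHEL worker does not, and
# the failed lookup is what degrades the node in 4.10.
user_exists() {
  id -u "$1" >/dev/null 2>&1
}

if user_exists core; then
  echo "core exists: key can be written to ~core/.ssh/authorized_keys"
else
  echo "core missing: 4.10 rejects the update (pre-4.10 wrote the key as root anyway)"
fi
```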
This also implies that it would "work" if you happen to have a rhel worker with a "core" user (I think, although I could be wrong on this), although this is not required by default, and the sshkey gets written regardless. There's no guarantee that there are no users leveraging this (maybe nobody is, and that would be the best-case scenario). So when a user updates to 4.10 and then tries to update their sshkeys, the rhel nodes would actually perform an update that is a no-op. The old sshkey on the node would not get rotated if we choose to no longer manage it, but nothing would delete it today. Maybe some more nuanced options:

1. the MCO does not fail if you don't have a core user, and just writes it if you do
2. the MCO no longer manages rhel node sshkeys, and deletes any it may have written before the update (not sure how safe this is)

(In reply to Yu Qi Zhang from comment #1)
> Also @Rio I see you mention testing with RHEL7+4.10. I am pretty sure we no
> longer support rhel 7 workers in 4.10, and you must reprovision rhel8
> workers. Should we update tests to reflect that?

Hi Jerry, tested the same scenario on an ocp4.10+rhel8 cluster; the issue can be reproduced as well.
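The reproduction below applies a MachineConfig from `add-additional-ssh-authorized-keys.yaml`. The file's contents are not included in the report, so the following is an assumed minimal example (the public key value is a placeholder); the `99-add-ssh` name matches the object created in the transcript:

```yaml
# Assumed contents of add-additional-ssh-authorized-keys.yaml (not taken from
# the original report); the key below is a placeholder.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-add-ssh
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
    passwd:
      users:
        - name: core
          sshAuthorizedKeys:
            - ssh-ed25519 AAAAC3Nza... user@example.com
```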
```
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-11-15-034648   True        False         92m     Cluster version is 4.10.0-0.nightly-2021-11-15-034648

$ oc get node -o wide -l node-role.kubernetes.io/worker
NAME                                        STATUS   ROLES    AGE    VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                        KERNEL-VERSION                 CONTAINER-RUNTIME
ip-10-0-51-210.us-east-2.compute.internal   Ready    worker   3m     v1.22.1+4111b82   10.0.51.210   <none>        Red Hat Enterprise Linux 8.4 (Ootpa)                            4.18.0-348.2.1.el8_5.x86_64    cri-o://1.23.0-23.rhaos4.10.git407d866.el8
ip-10-0-55-140.us-east-2.compute.internal   Ready    worker   126m   v1.22.1+f773b8b   10.0.55.140   <none>        Red Hat Enterprise Linux CoreOS 410.84.202111112202-0 (Ootpa)   4.18.0-305.25.1.el8_4.x86_64   cri-o://1.23.0-12.rhaos4.10.git6ee64e9.el8
ip-10-0-60-51.us-east-2.compute.internal    Ready    worker   3m1s   v1.22.1+4111b82   10.0.60.51    <none>        Red Hat Enterprise Linux 8.4 (Ootpa)                            4.18.0-348.2.1.el8_5.x86_64    cri-o://1.23.0-23.rhaos4.10.git407d866.el8
ip-10-0-60-52.us-east-2.compute.internal    Ready    worker   130m   v1.22.1+f773b8b   10.0.60.52    <none>        Red Hat Enterprise Linux CoreOS 410.84.202111112202-0 (Ootpa)   4.18.0-305.25.1.el8_4.x86_64   cri-o://1.23.0-12.rhaos4.10.git6ee64e9.el8
ip-10-0-68-51.us-east-2.compute.internal    Ready    worker   130m   v1.22.1+f773b8b   10.0.68.51    <none>        Red Hat Enterprise Linux CoreOS 410.84.202111112202-0 (Ootpa)   4.18.0-305.25.1.el8_4.x86_64   cri-o://1.23.0-12.rhaos4.10.git6ee64e9.el8

$ oc create -f add-additional-ssh-authorized-keys.yaml
machineconfig.machineconfiguration.openshift.io/99-add-ssh created

$ oc get mc --sort-by metadata.creationTimestamp
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
99-master-ssh                                                                                 3.2.0             138m
99-worker-ssh                                                                                 3.2.0             138m
00-worker                                          0abf68a0c3206df0be0e13980da645d2c0ac9aa7   3.2.0             134m
01-master-container-runtime                        0abf68a0c3206df0be0e13980da645d2c0ac9aa7   3.2.0             134m
01-master-kubelet                                  0abf68a0c3206df0be0e13980da645d2c0ac9aa7   3.2.0             134m
01-worker-container-runtime                        0abf68a0c3206df0be0e13980da645d2c0ac9aa7   3.2.0             134m
99-worker-generated-registries                     0abf68a0c3206df0be0e13980da645d2c0ac9aa7   3.2.0             134m
99-master-generated-registries                     0abf68a0c3206df0be0e13980da645d2c0ac9aa7   3.2.0             134m
00-master                                          0abf68a0c3206df0be0e13980da645d2c0ac9aa7   3.2.0             134m
01-worker-kubelet                                  0abf68a0c3206df0be0e13980da645d2c0ac9aa7   3.2.0             134m
rendered-master-84f6fdac64f56ce17f188b0508c75a2b   0abf68a0c3206df0be0e13980da645d2c0ac9aa7   3.2.0             134m
rendered-worker-9ac8035c2db668db3d79a43a1f02895b   0abf68a0c3206df0be0e13980da645d2c0ac9aa7   3.2.0             134m
99-add-ssh                                                                                    3.2.0             35s
rendered-worker-79f61a75ef9db8bc23775504cc259d37   0abf68a0c3206df0be0e13980da645d2c0ac9aa7   3.2.0             30s

$ oc get mcp/worker
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
worker   rendered-worker-9ac8035c2db668db3d79a43a1f02895b   False     True       True       5              3                   3                     1                      135m

$ oc get mcp/worker -o yaml | yq -y '.status.conditions'
- lastTransitionTime: '2021-11-18T01:44:48Z'
  message: ''
  reason: ''
  status: 'False'
  type: RenderDegraded
- lastTransitionTime: '2021-11-18T03:58:24Z'
  message: ''
  reason: ''
  status: 'False'
  type: Updated
- lastTransitionTime: '2021-11-18T03:58:24Z'
  message: All nodes are updating to rendered-worker-79f61a75ef9db8bc23775504cc259d37
  reason: ''
  status: 'True'
  type: Updating
- lastTransitionTime: '2021-11-18T03:58:49Z'
  message: 'Node ip-10-0-60-51.us-east-2.compute.internal is reporting: "failed to retrieve UserID for username: core"'
  reason: 1 nodes are reporting degraded status on sync
  status: 'True'
  type: NodeDegraded
- lastTransitionTime: '2021-11-18T03:58:49Z'
  message: ''
  reason: ''
  status: 'True'
  type: Degraded
```

I think this issue is not related to the rhel os version. The key point here is that the core user does not exist on the rhel node.

Would it work even if you had a core user? If authorized_keys has permissions 600 and is owned by root, a core user couldn't even use it, right?
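The mode-600 point can be illustrated with a small sketch; a temp file stands in for the root-owned authorized_keys a pre-4.10 MCO would leave on a node without a core user (the path and key value are stand-ins, not from the report):

```shell
# Sketch: a pre-4.10 MCO write on a node without a "core" user leaves an
# authorized_keys file with mode 600 owned by the writer (root), so any
# later-created non-root "core" user could not read it.
tmp=$(mktemp)
echo 'ssh-ed25519 AAAA... placeholder' > "$tmp"
chmod 600 "$tmp"
stat -c '%a' "$tmp"   # prints 600: only the owning user can read the key
rm -f "$tmp"
```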
The precondition of this issue is updating the ssh key on a RHEL node. The concern is not whether the key can be used or not; it is that the original behavior does not block the upgrade operation, while the current logic in 4.10 will block an upgrade from a cluster like ocp4.9+RHEL worker, because the core user is not expected on a RHEL node.

Right, I was more responding to:

> 2. the MCO no longer manages rhel node sshkeys, and deletes any it may have written before the update (not sure how safe this is)

I was wondering if it's safe to delete a key a user does not have permission to read.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056