
Bug 1842603

Summary: SSH nodes annotation only happening during cluster upgrade
Product: OpenShift Container Platform
Reporter: Pedro Amoedo <pamoedom>
Component: Machine Config Operator
Assignee: MCO Team <team-mco>
Machine Config Operator sub component: Machine Config Operator
QA Contact: Jian Zhang <jiazha>
Status: CLOSED DUPLICATE
Severity: low
Priority: low
CC: alklein, aos-bugs, dornelas, jkyros, jswensso, kgarriso, mkrejci, pamoedom, rsandu, vkochuku
Version: 4.4
Keywords: Triaged
Last Closed: 2021-10-12 21:27:24 UTC
Type: Bug
Bug Blocks: 1903544    

Description Pedro Amoedo 2020-06-01 16:39:00 UTC
Description of problem:

The SSH annotation is no longer being applied in OCP 4.4.x; is this expected?

However, I can see that it is still possible to get the annotation if the nodes are accessed via SSH during a cluster upgrade.

NOTE: It can be an external SSH or an internal jump between nodes using "oc debug node" + "ssh -i key -l core <node>".

Version-Release number of selected component (if applicable):

OCP 4.4.4

How reproducible:
Always

Steps to Reproduce:
1. OCP 4.4.4 AWS IPI
2. Trigger a cluster upgrade
3. Wait until MCO is Progressing and perform random SSH attempts into the cluster nodes.

Actual results:

The nodes are only being marked as "accessed" in a corner-case scenario during the upgrade; this is not expected, right?

~~~
machine-config                             4.4.5     True        True          False      2d12h

...

[ec2-user@ip-10-0-15-118 ~]$ ssh -i key -l core ip-10-0-174-232.eu-west-3.compute.internal
[ec2-user@ip-10-0-15-118 ~]$ ssh -i key -l core ip-10-0-175-59.eu-west-3.compute.internal

...

$ oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" - "}{.metadata.annotations.machineconfiguration\.openshift\.io/ssh}{"\n"}{end}'
ip-10-0-130-229.eu-west-3.compute.internal -
ip-10-0-135-159.eu-west-3.compute.internal - accessed
ip-10-0-146-153.eu-west-3.compute.internal -
ip-10-0-159-195.eu-west-3.compute.internal -
ip-10-0-174-232.eu-west-3.compute.internal - accessed
ip-10-0-175-59.eu-west-3.compute.internal - accessed
~~~
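
For reference, the same check expressed as a minimal client-go sketch (the kubeconfig path below is a placeholder, not part of the report):

~~~
// Minimal sketch: print each node and its SSH annotation, equivalent to the
// jsonpath query above. Assumes a reachable kubeconfig at the given path.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

const sshAnnotation = "machineconfiguration.openshift.io/ssh"

func main() {
	// Placeholder path; point this at the cluster's kubeconfig.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, n := range nodes.Items {
		fmt.Printf("%s - %s\n", n.Name, n.Annotations[sshAnnotation])
	}
}
~~~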

Expected results:

Either no annotation at all (please confirm whether this is deprecated), or the nodes should be properly annotated in all scenarios (as in previous versions), not only in a very unlikely situation such as the middle of an upgrade.

Additional info:

https://github.com/openshift/machine-config-operator/blob/master/pkg/daemon/daemon.go#L628
https://github.com/openshift/machine-config-operator/blob/master/pkg/daemon/daemon.go#L278
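
For context, the linked daemon.go paths are where the MCD is expected to apply the annotation. As a rough, hedged sketch (not the actual MCO code), the observable effect described in this bug amounts to a merge patch on the Node object:

~~~
// Rough sketch only, not the MCO's code: what "marking a node as SSH-accessed"
// amounts to at the API level, assuming a client-go clientset is available.
package sshsketch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

const sshAnnotation = "machineconfiguration.openshift.io/ssh"

// MarkSSHAccessed sets machineconfiguration.openshift.io/ssh=accessed on the
// given node via a JSON merge patch.
func MarkSSHAccessed(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	patch := []byte(fmt.Sprintf(`{"metadata":{"annotations":{"%s":"accessed"}}}`, sshAnnotation))
	_, err := client.CoreV1().Nodes().Patch(ctx, nodeName, types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}
~~~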

Comment 1 Kirsten Garrison 2020-06-02 22:13:47 UTC
@Pedro, just to clarify:

We don't expect `oc debug node` to annotate; only SSHing should. Are you saying you are not seeing nodes annotated after SSHing?

Can you provide a must gather from the cluster?

Comment 2 Pedro Amoedo 2020-06-03 08:17:13 UTC
Hi Kirsten, thanks for your quick reply.

Please note that by "oc debug node" I'm referring to the following (oc debug node + internal SSH after):

~~~
$ oc debug node/ip-10-0-130-229.eu-west-3.compute.internal --image rhel7/rhel-tools
Starting pod/ip-10-0-130-229eu-west-3computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.130.229
If you don't see a command prompt, try pressing enter.
sh-4.2# vim key
sh-4.2# chmod 400 key
sh-4.2# ssh -i key -l core ip-10-0-135-159.eu-west-3.compute.internal
...
WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.4/architecture/architecture-rhcos.html

---
[core@ip-10-0-135-159 ~]$

~~~

In previous OCP versions, this "oc debug node" method didn't mark the nodes as accessed (as expected), but an external SSH marked them instantly.

Now, with newer 4.4.x versions, both methods mark the hosts as accessed, but surprisingly only if a cluster upgrade is in progress and you SSH in at the precise moment the daemon is running. How is that possible?

Regarding the must-gather, I don't have the same cluster available, but I'm spinning up a new one to reproduce the issue again and will provide you the logs ASAP.

Best Regards.

Comment 3 Antonio Murdaca 2020-06-03 09:19:06 UTC
(In reply to Pedro Amoedo from comment #2)
> Hi Kirsten, thanks for your quick reply.
> 
> Please note that by "oc debug node" I'm referring to the following (oc debug
> node + internal SSH after):
> 
> ~~~
> $ oc debug node/ip-10-0-130-229.eu-west-3.compute.internal --image
> rhel7/rhel-tools
> Starting pod/ip-10-0-130-229eu-west-3computeinternal-debug ...
> To use host binaries, run `chroot /host`
> Pod IP: 10.0.130.229
> If you don't see a command prompt, try pressing enter.
> sh-4.2# vim key
> sh-4.2# chmod 400 key
> sh-4.2# ssh -i key -l core ip-10-0-135-159.eu-west-3.compute.internal
> ...
> WARNING: Direct SSH access to machines is not recommended; instead,
> make configuration changes via `machineconfig` objects:
>  
> https://docs.openshift.com/container-platform/4.4/architecture/architecture-
> rhcos.html
> 
> ---
> [core@ip-10-0-135-159 ~]$
> 
> ~~~
> 
> In previous OCP versions, this "oc debug node" method didn't mark the
> nodes as accessed (as expected) but an external SSH marked them instantly.
> 
> Now, with newer 4.4.x versions, both methods are marking the hosts as

oc debug node isn't adding the SSH annotation - it literally can't, as the MCD watches systemd login sessions (see the sketch at the end of this comment), so something else is at play here (oc debug node starts a debug pod).

Also, oc debug node effectively defeated the usefulness of the SSH annotation, so it has been unreliable since ~4.1, and we had planned to either remove it or fix oc debug node to annotate as well. More info here: https://github.com/openshift/oc/issues/265

> accessed but surprisingly only if a cluster upgrade is in progress and you
> SSH in the precise moment the daemon is running, how is that possible?
> 
> Regarding the must-gather, I don't have the same cluster available but I'm
> raising a new one to reproduce again and provide you the logs ASAP.
> 
> Best Regards.
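
To make the "watches systemd login sessions" point above concrete, here is a minimal, hedged Go sketch (not the MCO's actual code) that lists logind sessions over the org.freedesktop.login1 D-Bus API; an SSH login for the "core" user shows up here while the session is open, whereas an `oc debug node` pod does not create a logind session:

~~~
// Hedged illustration, not the MCO's implementation: list the systemd-logind
// sessions that an SSH login creates, via the org.freedesktop.login1 D-Bus API.
package main

import (
	"fmt"

	"github.com/godbus/dbus/v5"
)

// session mirrors the (susso) tuples returned by ListSessions.
type session struct {
	ID   string
	UID  uint32
	User string
	Seat string
	Path dbus.ObjectPath
}

func main() {
	conn, err := dbus.SystemBus()
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Ask systemd-logind for all current login sessions; SSH logins appear here
	// for as long as the session stays open.
	logind := conn.Object("org.freedesktop.login1", "/org/freedesktop/login1")
	var sessions []session
	if err := logind.Call("org.freedesktop.login1.Manager.ListSessions", 0).Store(&sessions); err != nil {
		panic(err)
	}
	for _, s := range sessions {
		fmt.Printf("session %s: user=%s uid=%d seat=%q\n", s.ID, s.User, s.UID, s.Seat)
	}
}
~~~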

Comment 4 Pedro Amoedo 2020-06-03 11:08:53 UTC
Hi Antonio, let me explain the "oc debug node" method again:

1) Run "oc debug node/<hostname> --image rhel7/rhel-tools"; with this you have a pod running on top of one of the cluster nodes.
2) Create a key file locally with the same installation private key.
3) Run SSH from this temporary pod to another cluster node using that private key; this works, and the SSH session behaves like an external one, the only difference being the source IP (which is inside the cluster network).

Having said this, I have spun up a new cluster with the following specs:

1) OCP 4.4.4 AWS IPI with default installation values, nothing custom.
2) Extra bastion instance in the same VPC, on one of the public subnets.
3) SSH from this bastion (using the same installation private key) against all nodes in the cluster, and no SSH annotation was triggered:

~~~
$ oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" - "}{.metadata.annotations.machineconfiguration\.openshift\.io/ssh}{"\n"}{end}'
ip-10-0-135-160.eu-west-3.compute.internal - 
ip-10-0-136-200.eu-west-3.compute.internal - 
ip-10-0-152-129.eu-west-3.compute.internal - 
ip-10-0-156-233.eu-west-3.compute.internal - 
ip-10-0-167-58.eu-west-3.compute.internal - 
ip-10-0-170-197.eu-west-3.compute.internal -
~~~

NOTE: As an extra step, I have also run "oc debug node/ip-10-0-135-160.eu-west-3.compute.internal" + the aforementioned SSH inception method against all cluster nodes, with the same result: no annotation at all.

I'm attaching the must-gather log bundle to the BZ so you can take a look.

I'll also proceed with a cluster upgrade to version 4.4.5 ASAP, to corroborate my theory that the SSH annotation only happens while the MCO is progressing due to an upgrade, and I'll provide the new must-gather when finished.

Best Regards.

Comment 6 Pedro Amoedo 2020-06-03 12:13:08 UTC
[UPDATE]

I have triggered the upgrade to version 4.4.5 and waited precisely until only the MCO remained at the previous version 4.4.4, in Progressing state but still with all nodes Ready:

~~~
Every 10.0s: oc get clusterversion && echo && oc get co && echo && oc get nodes -o wide                                                                          p50: Wed Jun  3 13:50:16 2020

NAME	  VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.4     True        True          22m     Working towards 4.4.5: 83% complete

NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.4.5     True        False         False	  176m
cloud-credential                           4.4.5     True        False         False	  3h17m
cluster-autoscaler                         4.4.5     True        False         False	  3h5m
console                                    4.4.5     True        False         False	  13m
csi-snapshot-controller                    4.4.5     True        False         False	  8m47s
dns                                        4.4.5     True        False         False	  3h8m
etcd                                       4.4.5     True        False         False	  3h7m
image-registry                             4.4.5     True        False         False	  3h1m
ingress                                    4.4.5     True        False         False	  3h1m
insights                                   4.4.5     True        False         False	  3h5m
kube-apiserver                             4.4.5     True        False         False	  3h7m
kube-controller-manager                    4.4.5     True        False         False	  3h7m
kube-scheduler                             4.4.5     True        False         False	  3h7m
kube-storage-version-migrator              4.4.5     True        False         False	  3h1m
machine-api                                4.4.5     True        False         False	  3h8m
machine-config                             4.4.4     True        True          False	  3h8m    <----
marketplace                                4.4.5     True        False         False	  14m
monitoring                                 4.4.5     True        False         False	  179m
network                                    4.4.5     True        False         False	  3h9m
node-tuning                                4.4.5     True        False         False	  14m
openshift-apiserver                        4.4.5     True        False         False	  19m
openshift-controller-manager               4.4.5     True        False         False	  3h5m
openshift-samples                          4.4.5     True        False         False	  14m
operator-lifecycle-manager                 4.4.5     True        False         False	  3h8m
operator-lifecycle-manager-catalog         4.4.5     True        False         False	  3h8m
operator-lifecycle-manager-packageserver   4.4.5     True        False         False	  13m
service-ca                                 4.4.5     True        False         False	  3h9m
service-catalog-apiserver                  4.4.5     True        False         False	  3h9m
service-catalog-controller-manager         4.4.5     True        False         False	  3h9m
storage                                    4.4.5     True        False         False	  14m

NAME                                         STATUS   ROLES    AGE     VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION
             CONTAINER-RUNTIME
ip-10-0-135-160.eu-west-3.compute.internal   Ready    master   3h14m   v1.17.1   10.0.135.160   <none>        Red Hat Enterprise Linux CoreOS 44.81.202005062110-0 (Ootpa)   4.18.0-147.8.1.el
8_1.x86_64   cri-o://1.17.4-8.dev.rhaos4.4.git5f5c5e4.el8
ip-10-0-136-200.eu-west-3.compute.internal   Ready    worker   3h3m    v1.17.1   10.0.136.200   <none>        Red Hat Enterprise Linux CoreOS 44.81.202005062110-0 (Ootpa)   4.18.0-147.8.1.el
8_1.x86_64   cri-o://1.17.4-8.dev.rhaos4.4.git5f5c5e4.el8
ip-10-0-152-129.eu-west-3.compute.internal   Ready    worker   3h2m    v1.17.1   10.0.152.129   <none>        Red Hat Enterprise Linux CoreOS 44.81.202005062110-0 (Ootpa)   4.18.0-147.8.1.el
8_1.x86_64   cri-o://1.17.4-8.dev.rhaos4.4.git5f5c5e4.el8
ip-10-0-156-233.eu-west-3.compute.internal   Ready    master   3h14m   v1.17.1   10.0.156.233   <none>        Red Hat Enterprise Linux CoreOS 44.81.202005062110-0 (Ootpa)   4.18.0-147.8.1.el
8_1.x86_64   cri-o://1.17.4-8.dev.rhaos4.4.git5f5c5e4.el8
ip-10-0-167-58.eu-west-3.compute.internal    Ready    master   3h14m   v1.17.1   10.0.167.58    <none>        Red Hat Enterprise Linux CoreOS 44.81.202005062110-0 (Ootpa)   4.18.0-147.8.1.el
8_1.x86_64   cri-o://1.17.4-8.dev.rhaos4.4.git5f5c5e4.el8
ip-10-0-170-197.eu-west-3.compute.internal   Ready    worker   3h3m    v1.17.1   10.0.170.197   <none>        Red Hat Enterprise Linux CoreOS 44.81.202005062110-0 (Ootpa)   4.18.0-147.8.1.el
~~~

Performed various SSH attempts in a loop like this one:

~~~
[ec2-user@ip-10-0-41-229 ~]$ for i in `cat list`; do ssh -i key -l core $i "uptime"; done
 11:50:46 up  3:15,  0 users,  load average: 1.11, 1.20, 1.23
 11:50:46 up  3:04,  0 users,  load average: 1.81, 1.56, 1.20
 11:50:46 up  3:04,  0 users,  load average: 1.01, 0.66, 0.55
 11:50:47 up  3:16,  0 users,  load average: 1.78, 2.12, 1.44
 11:50:47 up  3:16,  0 users,  load average: 0.67, 0.65, 0.74
 11:50:47 up  3:04,  0 users,  load average: 0.96, 0.56, 0.53
~~~

Bingo! Here you have the annotations, as expected:

~~~
$ oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" - "}{.metadata.annotations.machineconfiguration\.openshift\.io/ssh}{"\n"}{end}'
ip-10-0-135-160.eu-west-3.compute.internal - accessed
ip-10-0-136-200.eu-west-3.compute.internal - accessed
ip-10-0-152-129.eu-west-3.compute.internal - accessed
ip-10-0-156-233.eu-west-3.compute.internal - accessed
ip-10-0-167-58.eu-west-3.compute.internal - accessed
ip-10-0-170-197.eu-west-3.compute.internal - accessed
~~~

NOTE: I'm still waiting for the upgrade process to finish, and I'll attach a new must-gather log so you can compare both if needed.

Best Regards.

Comment 8 Antonio Murdaca 2020-06-16 12:58:00 UTC
Adding UpcomingSprint as this won't make the current sprint. We'll try to work on this bug in the next sprint.

Comment 12 Micah Abbott 2021-02-07 20:30:33 UTC
*** Bug 1925049 has been marked as a duplicate of this bug. ***