Bug 1842603
| Summary: | SSH nodes annotation only happening during cluster upgrade | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Pedro Amoedo <pamoedom> |
| Component: | Machine Config Operator | Assignee: | MCO Team <team-mco> |
| Machine Config Operator sub component: | Machine Config Operator | QA Contact: | Jian Zhang <jiazha> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | low | | |
| Priority: | low | CC: | alklein, aos-bugs, dornelas, jkyros, jswensso, kgarriso, mkrejci, pamoedom, rsandu, vkochuku |
| Version: | 4.4 | Keywords: | Triaged |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-10-12 21:27:24 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1903544 | | |
@Pedro, just to clarify: we don't expect `oc debug node` to annotate; only SSHing should. Are you saying you are not seeing nodes annotated after SSHing? Can you provide a must-gather from the cluster?

Hi Kirsten, thanks for your quick reply.

Please note that by "oc debug node" I'm referring to the following (oc debug node + internal SSH after):
~~~
$ oc debug node/ip-10-0-130-229.eu-west-3.compute.internal --image rhel7/rhel-tools
Starting pod/ip-10-0-130-229eu-west-3computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.130.229
If you don't see a command prompt, try pressing enter.
sh-4.2# vim key
sh-4.2# chmod 400 key
sh-4.2# ssh -i key -l core ip-10-0-135-159.eu-west-3.compute.internal
...
WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.4/architecture/architecture-rhcos.html
---
[core@ip-10-0-135-159 ~]$
~~~
In previous OCP versions, this "oc debug node" method didn't mark the nodes as accessed (as expected), but an external SSH marked them instantly.

Now, with newer 4.4.x versions, both methods are marking the hosts as accessed, but surprisingly only if a cluster upgrade is in progress and you SSH in the precise moment the daemon is running. How is that possible?

Regarding the must-gather, I don't have the same cluster available, but I'm raising a new one to reproduce again and provide you the logs ASAP.

Best Regards.

(In reply to Pedro Amoedo from comment #2)
> Please note that by "oc debug node" I'm referring to the following (oc debug node + internal SSH after):
> [...]
> In previous OCP versions, this "oc debug node" method didn't mark the nodes as accessed (as expected), but an external SSH marked them instantly.
>
> Now, with newer 4.4.x versions, both methods are marking the hosts as accessed

oc debug node isn't adding SSH - it literally can't, as the MCD watches systemd login sessions, so something else is at play here (oc debug node starts a debug pod).

Also, oc debug node effectively defeated the usefulness of the SSH annotation, so it has been unreliable since ~4.1, and we had a plan to either remove it or fix oc debug node to annotate as well. More info here: https://github.com/openshift/oc/issues/265

> but surprisingly only if a cluster upgrade is in progress and you SSH in the precise moment the daemon is running, how is that possible?
>
> Regarding the must-gather, I don't have the same cluster available, but I'm raising a new one to reproduce again and provide you the logs ASAP.
>
> Best Regards.

Hi Antonio, let me explain again the "oc debug node" method:
1) Run "oc debug node/<hostname> --image rhel7/rhel-tools"; with this you have a pod running on top of one of the cluster nodes.
2) Create a key file locally (inside the pod) with the same installation private key.
3) Run SSH from this temporary pod to another cluster node using the private key; this works, and the SSH session is like an external one except for the source IP (which is inside the cluster network). A sketch of the session-detection mechanism mentioned above follows this list.
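On that point about the MCD watching systemd login sessions: a real SSH login creates a logind session, while a debug pod's exec never does. Below is a minimal, hypothetical sketch of such a watcher in Go, assuming a subscription to logind's SessionNew D-Bus signal via the godbus library; it is illustrative only and not the MCO's actual code:
~~~
package main

import (
	"fmt"
	"log"

	"github.com/godbus/dbus/v5"
)

func main() {
	// Connect to the system bus, where systemd-logind lives.
	conn, err := dbus.SystemBus()
	if err != nil {
		log.Fatalf("connecting to system bus: %v", err)
	}
	defer conn.Close()

	// Subscribe to SessionNew, emitted by logind for every new login
	// session (e.g. an SSH login). `oc debug node` starts a container
	// and never creates a logind session, so it is invisible here.
	if err := conn.AddMatchSignal(
		dbus.WithMatchInterface("org.freedesktop.login1.Manager"),
		dbus.WithMatchMember("SessionNew"),
	); err != nil {
		log.Fatalf("adding signal match: %v", err)
	}

	signals := make(chan *dbus.Signal, 16)
	conn.Signal(signals)

	for sig := range signals {
		// Signal body is (session_id string, object_path).
		fmt.Printf("new login session detected: %v\n", sig.Body)
		// A daemon would apply the ssh annotation to the node here.
	}
}
~~~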
Having said this, I have raised a new cluster with the following specs:
1) OCP 4.4.4 AWS IPI with default installation values, nothing custom.
2) Extra bastion instance in the same VPC, on one of the public subnets.
3) SSH from this bastion (using the same installation private key) against all nodes in the cluster, and no SSH annotation was triggered:
~~~
$ oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" - "}{.metadata.annotations.machineconfiguration\.openshift\.io/ssh}{"\n"}{end}'
ip-10-0-135-160.eu-west-3.compute.internal -
ip-10-0-136-200.eu-west-3.compute.internal -
ip-10-0-152-129.eu-west-3.compute.internal -
ip-10-0-156-233.eu-west-3.compute.internal -
ip-10-0-167-58.eu-west-3.compute.internal -
ip-10-0-170-197.eu-west-3.compute.internal -
~~~
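For reference, the same check can be scripted outside of jsonpath; here is a minimal client-go sketch that lists the nodes and prints the machineconfiguration.openshift.io/ssh annotation (the kubeconfig path and client setup are illustrative assumptions):
~~~
package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes the default kubeconfig location; adjust as needed.
	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, n := range nodes.Items {
		// An empty value means the node was never marked as accessed.
		fmt.Printf("%s - %s\n", n.Name, n.Annotations["machineconfiguration.openshift.io/ssh"])
	}
}
~~~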
NOTE: As an extra step, I have also run "oc debug node/ip-10-0-135-160.eu-west-3.compute.internal" + the aforementioned SSH inception method against all cluster nodes, with the same result: no annotation at all.
I'm attaching the must-gather log bundle to the BZ so you can take a look.
I'll also proceed with a cluster upgrade to version 4.4.5 ASAP to corroborate my theory that the SSH annotation only happens while the MCO is progressing due to an upgrade, and I'll provide the new must-gather when finished.
Best Regards.
[UPDATE]
I have triggered the upgrade to version 4.4.5 and waited until only the MCO remained at the previous version, 4.4.4, in Progressing state but still with all nodes Ready:
~~~
Every 10.0s: oc get clusterversion && echo && oc get co && echo && oc get nodes -o wide p50: Wed Jun 3 13:50:16 2020
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.4.4 True True 22m Working towards 4.4.5: 83% complete
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
authentication 4.4.5 True False False 176m
cloud-credential 4.4.5 True False False 3h17m
cluster-autoscaler 4.4.5 True False False 3h5m
console 4.4.5 True False False 13m
csi-snapshot-controller 4.4.5 True False False 8m47s
dns 4.4.5 True False False 3h8m
etcd 4.4.5 True False False 3h7m
image-registry 4.4.5 True False False 3h1m
ingress 4.4.5 True False False 3h1m
insights 4.4.5 True False False 3h5m
kube-apiserver 4.4.5 True False False 3h7m
kube-controller-manager 4.4.5 True False False 3h7m
kube-scheduler 4.4.5 True False False 3h7m
kube-storage-version-migrator 4.4.5 True False False 3h1m
machine-api 4.4.5 True False False 3h8m
machine-config 4.4.4 True True False 3h8m <----
marketplace 4.4.5 True False False 14m
monitoring 4.4.5 True False False 179m
network 4.4.5 True False False 3h9m
node-tuning 4.4.5 True False False 14m
openshift-apiserver 4.4.5 True False False 19m
openshift-controller-manager 4.4.5 True False False 3h5m
openshift-samples 4.4.5 True False False 14m
operator-lifecycle-manager 4.4.5 True False False 3h8m
operator-lifecycle-manager-catalog 4.4.5 True False False 3h8m
operator-lifecycle-manager-packageserver 4.4.5 True False False 13m
service-ca 4.4.5 True False False 3h9m
service-catalog-apiserver 4.4.5 True False False 3h9m
service-catalog-controller-manager 4.4.5 True False False 3h9m
storage 4.4.5 True False False 14m
NAME                                         STATUS   ROLES    AGE     VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                        KERNEL-VERSION                CONTAINER-RUNTIME
ip-10-0-135-160.eu-west-3.compute.internal   Ready    master   3h14m   v1.17.1   10.0.135.160   <none>        Red Hat Enterprise Linux CoreOS 44.81.202005062110-0 (Ootpa)   4.18.0-147.8.1.el8_1.x86_64   cri-o://1.17.4-8.dev.rhaos4.4.git5f5c5e4.el8
ip-10-0-136-200.eu-west-3.compute.internal   Ready    worker   3h3m    v1.17.1   10.0.136.200   <none>        Red Hat Enterprise Linux CoreOS 44.81.202005062110-0 (Ootpa)   4.18.0-147.8.1.el8_1.x86_64   cri-o://1.17.4-8.dev.rhaos4.4.git5f5c5e4.el8
ip-10-0-152-129.eu-west-3.compute.internal   Ready    worker   3h2m    v1.17.1   10.0.152.129   <none>        Red Hat Enterprise Linux CoreOS 44.81.202005062110-0 (Ootpa)   4.18.0-147.8.1.el8_1.x86_64   cri-o://1.17.4-8.dev.rhaos4.4.git5f5c5e4.el8
ip-10-0-156-233.eu-west-3.compute.internal   Ready    master   3h14m   v1.17.1   10.0.156.233   <none>        Red Hat Enterprise Linux CoreOS 44.81.202005062110-0 (Ootpa)   4.18.0-147.8.1.el8_1.x86_64   cri-o://1.17.4-8.dev.rhaos4.4.git5f5c5e4.el8
ip-10-0-167-58.eu-west-3.compute.internal    Ready    master   3h14m   v1.17.1   10.0.167.58    <none>        Red Hat Enterprise Linux CoreOS 44.81.202005062110-0 (Ootpa)   4.18.0-147.8.1.el8_1.x86_64   cri-o://1.17.4-8.dev.rhaos4.4.git5f5c5e4.el8
ip-10-0-170-197.eu-west-3.compute.internal   Ready    worker   3h3m    v1.17.1   10.0.170.197   <none>        Red Hat Enterprise Linux CoreOS 44.81.202005062110-0 (Ootpa)   4.18.0-147.8.1.el8_1.x86_64   cri-o://1.17.4-8.dev.rhaos4.4.git5f5c5e4.el8
~~~
Performed various SSH attempts in a loop like this one:
~~~
[ec2-user@ip-10-0-41-229 ~]$ for i in `cat list`; do ssh -i key -l core $i "uptime"; done
11:50:46 up 3:15, 0 users, load average: 1.11, 1.20, 1.23
11:50:46 up 3:04, 0 users, load average: 1.81, 1.56, 1.20
11:50:46 up 3:04, 0 users, load average: 1.01, 0.66, 0.55
11:50:47 up 3:16, 0 users, load average: 1.78, 2.12, 1.44
11:50:47 up 3:16, 0 users, load average: 0.67, 0.65, 0.74
11:50:47 up 3:04, 0 users, load average: 0.96, 0.56, 0.53
~~~
Bingo! Here you have the annotations, as expected:
~~~
$ oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" - "}{.metadata.annotations.machineconfiguration\.openshift\.io/ssh}{"\n"}{end}'
ip-10-0-135-160.eu-west-3.compute.internal - accessed
ip-10-0-136-200.eu-west-3.compute.internal - accessed
ip-10-0-152-129.eu-west-3.compute.internal - accessed
ip-10-0-156-233.eu-west-3.compute.internal - accessed
ip-10-0-167-58.eu-west-3.compute.internal - accessed
ip-10-0-170-197.eu-west-3.compute.internal - accessed
~~~
NOTE: I'm still waiting for the upgrade process to finish and I'll attach a new must-gather log so you can compare both if needed.
Best Regards.
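The behavior reproduced above suggests a plausible shape for the bug: the login check is only reachable from the daemon's config-sync path, so SSH logins are translated into annotations only while an upgrade keeps that path busy. The sketch below illustrates that shape; every function here is a hypothetical stand-in, not actual MCO code:
~~~
package main

import (
	"fmt"
	"time"
)

// pendingConfigChange stands in for "an upgrade is rolling out a new
// machine config to this node".
func pendingConfigChange() bool { return true }

// sshLoginSeen stands in for whatever primitive detects logins
// (e.g. logind sessions, as sketched earlier).
func sshLoginSeen() bool { return true }

func annotateNodeAccessed() {
	fmt.Println("annotating node: machineconfiguration.openshift.io/ssh=accessed")
}

func main() {
	for {
		// Buggy shape: the login check lives inside the config-change
		// branch, so in steady state (no upgrade in progress) logins
		// are never noticed. Moving the check outside this branch, or
		// making it event-driven, would restore the pre-4.4 behavior.
		if pendingConfigChange() {
			if sshLoginSeen() {
				annotateNodeAccessed()
			}
			// ...apply the new machine config, drain, reboot, etc.
		}
		time.Sleep(10 * time.Second)
	}
}
~~~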
Adding UpcomingSprint as this won't make the current sprint. We'll try to work on this bug in the next sprint.

*** Bug 1925049 has been marked as a duplicate of this bug. ***
Description of problem:

The SSH annotation is no longer happening in OCP 4.4.x; is this expected? However, I can see it's still possible to achieve the annotation if the nodes are accessed via SSH during a cluster upgrade.

NOTE: It could be an external SSH or an internal jump between nodes using "oc debug node" + "ssh -i key -l core <node>".

Version-Release number of selected component (if applicable):

OCP 4.4.4

How reproducible:

Always

Steps to Reproduce:
1. OCP 4.4.4 AWS IPI
2. Trigger a cluster upgrade
3. Wait until the MCO is Progressing and perform random SSH attempts into the cluster nodes.

Actual results:

The nodes are only being marked as "accessed" in a corner-case scenario during the upgrade; this is not expected, right?
~~~
machine-config    4.4.5    True    True    False    2d12h
...
[ec2-user@ip-10-0-15-118 ~]$ ssh -i key -l core ip-10-0-174-232.eu-west-3.compute.internal
[ec2-user@ip-10-0-15-118 ~]$ ssh -i key -l core ip-10-0-175-59.eu-west-3.compute.internal
...
$ oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" - "}{.metadata.annotations.machineconfiguration\.openshift\.io/ssh}{"\n"}{end}'
ip-10-0-130-229.eu-west-3.compute.internal -
ip-10-0-135-159.eu-west-3.compute.internal - accessed
ip-10-0-146-153.eu-west-3.compute.internal -
ip-10-0-159-195.eu-west-3.compute.internal -
ip-10-0-174-232.eu-west-3.compute.internal - accessed
ip-10-0-175-59.eu-west-3.compute.internal - accessed
~~~

Expected results:

Either no annotation at all (please confirm if this is deprecated), or properly annotating the nodes in all scenarios (like in previous versions), not only in a very unlikely situation such as the middle of an upgrade.

Additional info:

https://github.com/openshift/machine-config-operator/blob/master/pkg/daemon/daemon.go#L628
https://github.com/openshift/machine-config-operator/blob/master/pkg/daemon/daemon.go#L278
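For completeness, the annotation itself is an ordinary node annotation, so the expected behavior boils down to a patch like the following hedged client-go sketch (the node name and client setup are illustrative assumptions, not the daemon's actual code path):
~~~
package main

import (
	"context"
	"log"
	"os"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes the default kubeconfig location; adjust as needed.
	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Hypothetical node name, for illustration only.
	nodeName := "ip-10-0-135-160.eu-west-3.compute.internal"
	patch := []byte(`{"metadata":{"annotations":{"machineconfiguration.openshift.io/ssh":"accessed"}}}`)
	if _, err := client.CoreV1().Nodes().Patch(
		context.TODO(), nodeName, types.MergePatchType, patch, metav1.PatchOptions{},
	); err != nil {
		log.Fatal(err)
	}
}
~~~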