Description of problem:

I have a customer who raised this issue to me, and I can reproduce it in my lab. They are trying to monitor any SSH access to nodes, and as the results from my lab below show, this fails to work.

Version-Release number of selected component (if applicable):
OCP 4.7.24

How reproducible:
100%

Steps to Reproduce:

1. Check for any nodes accessed:

[kni@prov-0 ~]$ oc get nodes -o 'custom-columns=Node Name:.metadata.name,Machine Name:.metadata.annotations.machine\.openshift\.io/machine,SSHAccessed:.metadata.annotations.machineconfiguration\.openshift\.io/ssh'
Node Name                         Machine Name                                           SSHAccessed
master-0.ocp4-bare.andytest.lab   openshift-machine-api/ocp4-bare-dbnww-master-0         <none>
master-1.ocp4-bare.andytest.lab   openshift-machine-api/ocp4-bare-dbnww-master-1         <none>
master-2.ocp4-bare.andytest.lab   openshift-machine-api/ocp4-bare-dbnww-master-2         <none>
worker-1.ocp4-bare.andytest.lab   openshift-machine-api/ocp4-bare-dbnww-worker-0-ktcng   <none>
worker-2.ocp4-bare.andytest.lab   openshift-machine-api/ocp4-bare-dbnww-worker-0-fhhwn   <none>

2. Access a node, and run ls just to make sure it's working OK:

[kni@prov-0 ~]$ ssh core.andytest.lab
Warning: the ECDSA host key for 'master-0.ocp4-bare.andytest.lab' differs from the key for the IP address '192.168.2.50'
Offending key for IP in /home/kni/.ssh/known_hosts:6
Matching host key in /home/kni/.ssh/known_hosts:8
Are you sure you want to continue connecting (yes/no)? yes
Red Hat Enterprise Linux CoreOS 47.84.202108052031-0
  Part of OpenShift 4.7, RHCOS is a Kubernetes native operating system managed by the Machine Config Operator (`clusteroperator/machine-config`).

WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.7/architecture/architecture-rhcos.html

---
Last login: Fri Jul 30 10:00:16 2021 from 192.168.2.250
[core@master-0 ~]$ ls -al
total 16
drwx------. 4 core core 109 Jul 30 10:01 .
drwxr-xr-x. 3 root root  18 Jul 24 10:15 ..
-rw-------. 1 core core  50 Jul 30 10:01 .bash_history
-rw-r--r--. 1 core core  18 Mar 25 16:45 .bash_logout
-rw-r--r--. 1 core core 141 Mar 25 16:45 .bash_profile
-rw-r--r--. 1 core core 376 Mar 25 16:45 .bashrc
drwxr-xr-x. 3 core core  19 Jul 30 10:00 .local
drwx------. 2 core core  29 Aug 22 09:39 .ssh
[core@master-0 ~]$ exit
logout
Connection to master-0.ocp4-bare.andytest.lab closed.

3. Check the node after SSH access:

[kni@prov-0 ~]$ oc get nodes -o 'custom-columns=Node Name:.metadata.name,Machine Name:.metadata.annotations.machine\.openshift\.io/machine,SSHAccessed:.metadata.annotations.machineconfiguration\.openshift\.io/ssh'
Node Name                         Machine Name                                           SSHAccessed
master-0.ocp4-bare.andytest.lab   openshift-machine-api/ocp4-bare-dbnww-master-0         <none>
master-1.ocp4-bare.andytest.lab   openshift-machine-api/ocp4-bare-dbnww-master-1         <none>
master-2.ocp4-bare.andytest.lab   openshift-machine-api/ocp4-bare-dbnww-master-2         <none>
worker-1.ocp4-bare.andytest.lab   openshift-machine-api/ocp4-bare-dbnww-worker-0-ktcng   <none>
worker-2.ocp4-bare.andytest.lab   openshift-machine-api/ocp4-bare-dbnww-worker-0-fhhwn   <none>

Actual results:
The SSHAccessed annotation (machineconfiguration.openshift.io/ssh) is not set.

Expected results:
I expect the SSHAccessed annotation to be set, reflecting my access to the nodes.

Additional info:
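For tracking this across many nodes, the check in steps 1 and 3 can also be scripted against saved `oc get nodes -o json` output. A minimal sketch, assuming the JSON has been dumped to a file first; the inline sample below is fabricated to mimic a node without the SSH annotation, and `nodes.json` is just an illustrative filename:

```shell
#!/bin/sh
# Fabricated stand-in for: oc get nodes -o json > nodes.json
cat > nodes.json <<'EOF'
{"items":[{"metadata":{"name":"master-0.ocp4-bare.andytest.lab",
"annotations":{"machine.openshift.io/machine":"openshift-machine-api/ocp4-bare-dbnww-master-0"}}}]}
EOF

# The MCD records SSH logins under this annotation key; if the key never
# appears, the access was not recorded on the node object.
if grep -q 'machineconfiguration\.openshift\.io/ssh' nodes.json; then
  echo "ssh annotation present"
else
  echo "ssh annotation missing"
fi
```

On a live cluster the same grep could run directly over `oc get nodes -o json`, which is what the custom-columns query above is reading.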
From the must-gather, in pod machine-config-daemon-8h8ck on master-0.ocp4-bare.andytest.lab, it looks like that node might be having some connectivity issues:

2021-08-22T09:45:58.388270376Z E0822 09:45:58.384638    6405 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout
2021-08-22T09:46:00.523698766Z I0822 09:46:00.523244    6405 daemon.go:381] Node master-0.ocp4-bare.andytest.lab is part of the control plane
2021-08-22T09:46:01.170642137Z I0822 09:46:01.167861    6405 daemon.go:802] Current config: rendered-master-012acc289869be3f8becc00e86aec428
2021-08-22T09:46:01.170642137Z I0822 09:46:01.167889    6405 daemon.go:803] Desired config: rendered-master-754da41cc91abf2c7a0f19bc7e8745cf
2021-08-22T09:46:01.202362117Z I0822 09:46:01.201866    6405 update.go:1943] Disk currentConfig rendered-master-754da41cc91abf2c7a0f19bc7e8745cf overrides node's currentConfig annotation rendered-master-012acc289869be3f8becc00e86aec428
2021-08-22T09:46:01.216451220Z I0822 09:46:01.215039    6405 daemon.go:1085] Validating against pending config rendered-master-754da41cc91abf2c7a0f19bc7e8745cf
2021-08-22T09:46:01.298839675Z I0822 09:46:01.298727    6405 daemon.go:1096] Validated on-disk state
2021-08-22T09:46:01.512403291Z I0822 09:46:01.511434    6405 daemon.go:1151] Completing pending config rendered-master-754da41cc91abf2c7a0f19bc7e8745cf
2021-08-22T09:46:01.621500050Z I0822 09:46:01.621402    6405 update.go:1943] completed update for config rendered-master-754da41cc91abf2c7a0f19bc7e8745cf
2021-08-22T09:46:01.644233131Z I0822 09:46:01.642890    6405 daemon.go:1167] In desired config rendered-master-754da41cc91abf2c7a0f19bc7e8745cf
2021-08-22T09:57:47.016779861Z W0822 09:57:47.016703    6405 reflector.go:436] k8s.io/client-go/informers/factory.go:134: watch of *v1.Node ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
2021-08-22T09:57:47.016971691Z W0822 09:57:47.016720    6405 reflector.go:436] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: watch of *v1.MachineConfig ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding

If the MCD doesn't have a client connection to reach the node object, that would prevent the SSHAccessed annotation from being set on it. Looking at the MCD pods on the other nodes, they appear to be reporting connectivity errors there too.

In the host service logs, I see some OVS activity. Stop times are:

Aug 22 09:42:00.108073 master-0.ocp4-bare.andytest.lab systemd[1]: Stopped Open vSwitch Forwarding Unit.  (this is before the error)
Aug 22 09:50:53.934653 master-1.ocp4-bare.andytest.lab systemd[1]: Stopped Open vSwitch Forwarding Unit.
Aug 22 09:57:14.952266 master-2.ocp4-bare.andytest.lab systemd[1]: Stopped Open vSwitch Forwarding Unit.

I also see:

Aug 22 09:44:33.312315 master-0.ocp4-bare.andytest.lab systemd[1]: ovs-configuration.service: Succeeded.
Aug 22 09:44:33.313290 master-0.ocp4-bare.andytest.lab systemd[1]: Started Configures OVS with proper host networking configuration.
Aug 22 09:44:33.313877 master-0.ocp4-bare.andytest.lab systemd[1]: ovs-configuration.service: Consumed 262ms CPU time

Was this just a clean cluster build, or were other things done to it? Were you by chance testing or doing anything that would have affected connectivity before this occurred?
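The correlation above (an OVS stop shortly before the daemon's i/o timeout) is easier to see by filtering the two error signatures side by side. A rough sketch with grep; the inline excerpt is a fabricated stand-in for the merged must-gather logs, and `combined.log` is just an illustrative filename:

```shell
#!/bin/sh
# Fabricated excerpt standing in for the merged host-service and MCD logs
cat > combined.log <<'EOF'
Aug 22 09:42:00 master-0 systemd[1]: Stopped Open vSwitch Forwarding Unit.
Aug 22 09:44:33 master-0 systemd[1]: Started Configures OVS with proper host networking configuration.
Aug 22 09:45:58 master-0 machine-config-daemon: dial tcp 172.30.0.1:443: i/o timeout
Aug 22 09:57:47 master-0 machine-config-daemon: http2: client connection lost
EOF

# Keep only the OVS lifecycle events and the daemon's connection errors,
# so the ordering (OVS stop precedes the API timeouts) is easy to eyeball
grep -E 'Open vSwitch|OVS|i/o timeout|connection lost' combined.log
```

Against a real must-gather, the same filter would run over the node journal and the machine-config-daemon pod logs.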
*** Bug 1842603 has been marked as a duplicate of this bug. ***
Starting deprecation notice: https://github.com/openshift/openshift-docs/pull/54465
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira. https://issues.redhat.com/browse/OCPBUGS-8958