Description of problem:

After scaling up RHEL8 worker nodes and removing RHCOS nodes from the cluster, e2e-tests fail due to TargetDown.

openshift-sdn Alert Details:
16.67% of the sdn/sdn targets in openshift-sdn namespace have been unreachable for more than 15 minutes. This may be a symptom of network connectivity issues, down nodes, or failures within these components. Assess the health of the infrastructure and nodes running these targets and then contact support.

All SDN pods are reporting as running. All resources on the cluster appear to be functioning normally. Only the alert is any indication of an issue.

Version-Release number of selected component (if applicable):
4.9

How reproducible:
This is happening on all CI jobs using RHEL8 workers.

PR for adding RHEL8 worker support to CI: https://github.com/openshift/release/pull/19190

Example job failure: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/19190/rehearse-19190-pull-ci-openshift-openshift-ansible-master-e2e-aws-workers-rhel8/1410260856570122240

Test failure summary:
alert TargetDown fired for 829 seconds with labels: {job="crio", namespace="kube-system", service="kubelet", severity="warning"}
alert TargetDown fired for 829 seconds with labels: {job="kubelet", namespace="kube-system", service="kubelet", severity="warning"}
alert TargetDown fired for 829 seconds with labels: {job="machine-config-daemon", namespace="openshift-machine-config-operator", service="machine-config-daemon", severity="warning"}
alert TargetDown fired for 859 seconds with labels: {job="node-exporter", namespace="openshift-monitoring", service="node-exporter", severity="warning"}
alert TargetDown fired for 859 seconds with labels: {job="sdn", namespace="openshift-sdn", service="sdn", severity="warning"}

This issue can be reproduced in a development cluster.
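For anyone triaging a similar failure, the firing alerts can also be pulled from the cluster's Prometheus API directly. A minimal sketch, assuming the standard prometheus-k8s route in openshift-monitoring and a logged-in user whose token is authorized to query it:

# Sketch: list firing TargetDown alerts via the Prometheus API.
PROM_HOST=$(oc get route prometheus-k8s -n openshift-monitoring -o jsonpath='{.spec.host}')
TOKEN=$(oc whoami -t)
# Print the label sets of any TargetDown alerts currently known to Prometheus.
curl -sk -H "Authorization: Bearer ${TOKEN}" "https://${PROM_HOST}/api/v1/alerts" \
  | grep -o '"alertname":"TargetDown"[^}]*'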
Observed that the alerts do not start firing until after the RHCOS nodes are drained/removed from the cluster. Just having RHEL8 nodes in the cluster does not cause the condition.
Observed that when prometheus-k8s pods are running on the RHEL8 nodes, alerts are raised as noted in the 'Test failure summary' in the bug description. When the prometheus-k8s pods are moved back to RHCOS nodes, the alerts are cleared. The prometheus-k8s pods are not reporting any issues when running on the RHEL8 nodes.
Moving to the monitoring component based on the above observations.
Following analysis made by @spasquie:

> I've had a quick look at this one and from what I can tell, Prometheus considers the ip-10-0-159-119.ec2.internal node down because it can't scrape the metrics. It's visible from the targets API (see https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_rele[…]extra/artifacts/metrics/prometheus-targets.json) which says Get \"http://10.0.159.119:9537/metrics\": context deadline exceeded for the endpoint (it smells like a network connectivity issue, as Prometheus fails to receive metrics within 10s). I'd suggest having a look at the node logs: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_rele[…]a/artifacts/nodes/ip-10-0-159-119.ec2.internal/

I've also looked at the node journal, and I see the following log at the end:

> Jun 30 17:57:31.418776 ip-10-0-159-119.ec2.internal hyperkube[1637]: W0630 17:57:31.418619 1637 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins/csi-hostpath-e2e-provisioning-3358/csi.sock <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins/csi-hostpath-e2e-provisioning-3358/csi.sock: connect: connection refused". Reconnecting...
> Jun 30 17:57:31.418776 ip-10-0-159-119.ec2.internal hyperkube[1637]: I0630 17:57:31.418653 1637 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc006e6dee0, {TRANSIENT_FAILURE connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins/csi-hostpath-e2e-provisioning-3358/csi.sock: connect: connection refused"}
> Jun 30 17:57:31.418776 ip-10-0-159-119.ec2.internal hyperkube[1637]: E0630 17:57:31.418735 1637 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/csi-hostpath-e2e-provisioning-3358^9779380d-d9ca-11eb-b269-0a580a8005fa podName:04333bd3-fab0-4e74-9b0f-585dbdc35236 nodeName:}" failed. No retries permitted until 2021-06-30 17:59:33.418710717 +0000 UTC m=+2240.882179472 (durationBeforeRetry 2m2s). Error: "UnmountVolume.TearDown failed for volume \"test-volume\" (UniqueName: \"kubernetes.io/csi/csi-hostpath-e2e-provisioning-3358^9779380d-d9ca-11eb-b269-0a580a8005fa\") pod \"04333bd3-fab0-4e74-9b0f-585dbdc35236\" (UID: \"04333bd3-fab0-4e74-9b0f-585dbdc35236\") : kubernetes.io/csi: mounter.TearDownAt failed: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /var/lib/kubelet/plugins/csi-hostpath-e2e-provisioning-3358/csi.sock: connect: connection refused\""

It indicates a network connectivity issue with the problematic node instance. Let me assign this to the network team for their analysis.
Bumping the severity because this is impacting feature delivery.
This is probably not a networking issue:

1. More than just openshift-sdn targets are failing.
2. The connection failure you see is a unix domain socket, i.e. a file. That's not actually networking; it's an attempt to talk to the CSI driver.

My first guess is that some alerts need to be tweaked to take losing nodes into account.

One question: when the RHCOS nodes are removed, are the Node objects also removed?
In my tests to troubleshoot this issue, I have cordoned and drained the RHCOS nodes. As stated in comment 2, the alerts started firing. The RHCOS nodes were not removed from the cluster. When the RHCOS nodes were uncordoned and the RHEL nodes cordoned and drained, the alerts were cleared. I do not think this is related to losing nodes.

My guess was that either the wrong version of a necessary package was being installed on RHEL8, or that a file or resource is not in the place on RHEL8 where it would be expected on RHCOS or RHEL7. From a networking perspective, I wanted to make sure the correct packages were being installed for RHEL8.

Who owns these alerts?
The monitoring team owns the TargetDown alert, but the reason why the alert fires might be outside of our competency. Having said that, we can look into https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/19190/rehearse-19190-periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-workers-rhel8/1420025520803811328 which is an example of the failure with gathered data.

What I can already say from this run is that the node triggering the alert is ip-10-0-212-204.us-east-2.compute.internal. The node is reported as Ready, but for some reason Prometheus is failing to connect with "Get \"https://10.0.212.204:10250/metrics\": context deadline exceeded". If you have access to a cluster in the same configuration/situation, it'd be worth checking that you can curl the /metrics endpoint from within the prometheus-k8s-0 pod.
I can build a cluster that reproduces this behavior. I have one built now and will tear it down soon, but will build another tomorrow morning. In trying to diagnose the issue as you mention in comment 8, I have attempted to curl the /metrics endpoint, but I may be doing it wrong.

1. Find the node where the prometheus-k8s-0 pod is running.
$ oc get pod prometheus-k8s-0 --namespace openshift-monitoring -o yaml | grep nodeName

2. Connect to that node and find the prometheus container.
$ oc debug node/ip-10-0-172-36.us-east-2.compute.internal
sh-4.4# chroot /host
sh-4.4# crictl ps | grep -E 'prometheus\s' | awk '{print $1}'

3. Exec into that container and run curl to another worker node.
# crictl exec -it 363ec8ed31264 /bin/bash
bash-4.4$ curl https://10.0.197.254:10250/metrics

I'm getting certificate issues:

bash-4.4$ curl https://10.0.197.254:10250/metrics
curl: (60) SSL certificate problem: unable to get local issuer certificate
More details here: https://curl.haxx.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.

bash-4.4$ curl -k https://10.0.197.254:10250/metrics
Unauthorized
bash-4.4$

Let me know if I'm doing something wrong. Also, how do I tell which node is causing the alert to fire?
Thanks Russell. Can you try the following commands?

oc exec -n openshift-monitoring prometheus-k8s-0 -c prometheus -t -- curl -k https://x.x.x.x:10250/metrics
oc exec -n openshift-monitoring prometheus-k8s-1 -c prometheus -t -- curl -k https://x.x.x.x:10250/metrics

To find which IP address to connect to, you can open the Prometheus UI, go to the Targets page and find out which targets are down.
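If the Prometheus UI isn't convenient, the same information is available from the targets API. A rough sketch, assuming Prometheus serves its HTTP API on localhost:9090 inside the prometheus container (curl is available there, as used above):

# Sketch: show each target's scrape URL, health, and last scrape error.
oc exec -n openshift-monitoring prometheus-k8s-0 -c prometheus -- \
  curl -s http://localhost:9090/api/v1/targets \
  | grep -o '"scrapeUrl":"[^"]*"\|"health":"[^"]*"\|"lastError":"[^"]*"'

Targets whose health is not "up" are the ones driving TargetDown.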
I built a cluster and tested out the commands:

$ oc exec -n openshift-monitoring prometheus-k8s-0 -c prometheus -t -- curl -k https://10.0.129.155:10250/metrics
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:02:11 --:--:--     0
curl: (7) Failed to connect to 10.0.129.155 port 10250: Connection timed out
command terminated with exit code 7

$ oc exec -n openshift-monitoring prometheus-k8s-1 -c prometheus -t -- curl -k https://10.0.129.155:10250/metrics
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    12  100    12    0     0    923      0 --:--:-- --:--:-- --:--:--   923
Unauthorized

I can provide access to the cluster for troubleshooting today, but it will be gone tomorrow.
So we need to understand why prometheus-k8s-0 can't connect to 10.0.129.155. Is the node being seen as ready?
Yes, the node is Ready.

$ oc get nodes -o wide
NAME                                         STATUS   ROLES    AGE    VERSION           INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
ip-10-0-129-155.us-east-2.compute.internal   Ready    worker   98m    v1.21.1+051ac4f   10.0.129.155   <none>        Red Hat Enterprise Linux 8.4 (Ootpa)                           4.18.0-305.10.2.el8_4.x86_64   cri-o://1.21.2-7.rhaos4.8.gitdd89bfb.el8
ip-10-0-131-227.us-east-2.compute.internal   Ready    master   136m   v1.21.1+051ac4f   10.0.131.227   <none>        Red Hat Enterprise Linux CoreOS 48.84.202107282126-0 (Ootpa)   4.18.0-305.10.2.el8_4.x86_64   cri-o://1.21.2-7.rhaos4.8.gitdd89bfb.el8
ip-10-0-173-217.us-east-2.compute.internal   Ready    master   137m   v1.21.1+051ac4f   10.0.173.217   <none>        Red Hat Enterprise Linux CoreOS 48.84.202107282126-0 (Ootpa)   4.18.0-305.10.2.el8_4.x86_64   cri-o://1.21.2-7.rhaos4.8.gitdd89bfb.el8
ip-10-0-187-192.us-east-2.compute.internal   Ready    worker   98m    v1.21.1+051ac4f   10.0.187.192   <none>        Red Hat Enterprise Linux 8.4 (Ootpa)                           4.18.0-305.10.2.el8_4.x86_64   cri-o://1.21.2-7.rhaos4.8.gitdd89bfb.el8
ip-10-0-194-140.us-east-2.compute.internal   Ready    worker   98m    v1.21.1+051ac4f   10.0.194.140   <none>        Red Hat Enterprise Linux 8.4 (Ootpa)                           4.18.0-305.10.2.el8_4.x86_64   cri-o://1.21.2-7.rhaos4.8.gitdd89bfb.el8
ip-10-0-221-183.us-east-2.compute.internal   Ready    master   137m   v1.21.1+051ac4f   10.0.221.183   <none>        Red Hat Enterprise Linux CoreOS 48.84.202107282126-0 (Ootpa)   4.18.0-305.10.2.el8_4.x86_64   cri-o://1.21.2-7.rhaos4.8.gitdd89bfb.el8

And as you mentioned in Slack, the pod is running on the same node it is unable to reach:

$ oc get pod prometheus-k8s-0 --namespace openshift-monitoring -o yaml | grep nodeName
  nodeName: ip-10-0-129-155.us-east-2.compute.internal
After further discussion with the networking team it was determined to be a networking issue. https://coreos.slack.com/archives/CK1AE4ZCK/p1627673448016900
Any update on this bug?
I am looking into it. No progress yet. I am trying to add a RHEL8 node to my cluster and it's taking time to set up. I am following this article: https://docs.openshift.com/container-platform/4.8/machine_management/adding-rhel-compute.html

Is that the best way to add a RHEL8 instance on AWS?
Packets are being dropped in the iptables PREROUTING chain, with no clear reason why.

Here is a normal packet trace on RHCOS from pod -> local node port:

trace id a690161c ip raw PREROUTING verdict continue
trace id a690161c ip raw PREROUTING policy accept
trace id a690161c ip mangle PREROUTING verdict continue
trace id a690161c ip mangle PREROUTING policy accept
trace id a690161c ip nat PREROUTING packet: iif "tun0" ether saddr 0a:58:0a:83:00:17 ether daddr 6a:bf:49:0d:fb:ef ip saddr 10.131.0.23 ip daddr 10.0.216.33 ip dscp cs0 ip ecn not-ect ip ttl 64 ip id 59314 ip length 60 tcp sport 59999 tcp
trace id a690161c ip nat PREROUTING rule counter packets 24318 bytes 2648085 jump KUBE-SERVICES (verdict jump KUBE-SERVICES)
trace id a690161c ip nat KUBE-SERVICES rule fib daddr type local counter packets 2458 bytes 264399 jump KUBE-NODEPORTS (verdict jump KUBE-NODEPORTS)
trace id a690161c ip nat KUBE-NODEPORTS verdict continue
trace id a690161c ip nat KUBE-SERVICES verdict continue
trace id a690161c ip nat PREROUTING rule counter packets 18586 bytes 2020937 jump KUBE-PORTALS-CONTAINER (verdict jump KUBE-PORTALS-CONTAINER)
trace id a690161c ip nat KUBE-PORTALS-CONTAINER verdict continue
trace id a690161c ip nat PREROUTING rule fib daddr type local counter packets 17798 bytes 1960581 jump KUBE-NODEPORT-CONTAINER (verdict jump KUBE-NODEPORT-CONTAINER)
trace id a690161c ip nat KUBE-NODEPORT-CONTAINER verdict continue
trace id a690161c ip nat PREROUTING verdict continue
trace id a690161c ip nat PREROUTING policy accept
trace id a690161c ip mangle INPUT verdict continue
trace id a690161c ip mangle INPUT policy accept
...
...

Here is a packet trace on RHEL 8 from pod -> local node port:

trace id 4078a633 ip raw PREROUTING verdict continue
trace id 4078a633 ip raw PREROUTING policy accept
trace id 4078a633 ip mangle PREROUTING verdict continue
trace id 4078a633 ip mangle PREROUTING policy accept
trace id 4078a633 ip nat PREROUTING packet: iif "tun0" ether saddr 0a:58:0a:82:02:06 ether daddr ee:cd:bc:c2:83:b7 ip saddr 10.130.2.6 ip daddr 10.0.244.172 ip dscp cs0 ip ecn not-ect ip ttl 64 ip id 30075 ip length 60 tcp sport 59999 tcp dport 10250 tcp flags == syn tcp window 26733
trace id 4078a633 ip nat PREROUTING rule counter packets 976 bytes 95117 jump KUBE-SERVICES (verdict jump KUBE-SERVICES)
trace id 4078a633 ip nat KUBE-SERVICES rule fib daddr type local counter packets 486 bytes 46467 jump KUBE-NODEPORTS (verdict jump KUBE-NODEPORTS)
trace id 4078a633 ip nat KUBE-NODEPORTS verdict continue
trace id 4078a633 ip nat KUBE-SERVICES verdict continue
trace id 4078a633 ip nat PREROUTING rule counter packets 926 bytes 91163 jump KUBE-PORTALS-CONTAINER (verdict jump KUBE-PORTALS-CONTAINER)
trace id 4078a633 ip nat KUBE-PORTALS-CONTAINER verdict continue
trace id 4078a633 ip nat PREROUTING rule fib daddr type local counter packets 922 bytes 90659 jump KUBE-NODEPORT-CONTAINER (verdict jump KUBE-NODEPORT-CONTAINER)
trace id 4078a633 ip nat KUBE-NODEPORT-CONTAINER verdict continue
trace id 4078a633 ip nat PREROUTING verdict continue
trace id 4078a633 ip nat PREROUTING policy accept

The packet just gets dropped. It doesn't move on to the INPUT chain from PREROUTING.
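For reference, traces like the above can be reproduced with nftables' tracing. A sketch, assuming root on the node and that the ip raw PREROUTING chain exists (it appears in the traces above); 10250 is the kubelet port being probed:

# Mark packets to the kubelet port for tracing, at the top of raw/PREROUTING.
nft insert rule ip raw PREROUTING tcp dport 10250 meta nftrace set 1

# Stream the resulting "trace id ..." events.
nft monitor trace

# Clean up afterwards: find the rule's handle and delete it.
nft -a list chain ip raw PREROUTING
# nft delete rule ip raw PREROUTING handle <N>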
This does not look like an OpenShift SDN bug, because there is no rule explaining why the packet is dropped after PREROUTING; the PREROUTING chain's default policy is accept. This works on RHCOS but not on RHEL8. Both distros have the same kernel and iptables (nftables) version. I checked the tunable kernel parameters between RHCOS and RHEL8 and there don't seem to be any differences that could cause this issue. I checked whether rp_filter is enabled in strict mode on the interface, and it is not. I am stumped. I have asked for help and I am waiting for a response.
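For anyone repeating those checks, a sketch of the comparison described above, run as root on both an RHCOS node and a RHEL8 node:

# Dump all tunables on each node, then diff the two files.
sysctl -a 2>/dev/null | sort > /tmp/sysctl-$(hostname).txt

# rp_filter: 0 = off, 1 = strict, 2 = loose. Strict mode silently drops
# packets whose reverse route doesn't match the ingress interface.
sysctl net.ipv4.conf.all.rp_filter net.ipv4.conf.tun0.rp_filter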
I confirm it's working for Red Hat Enterprise Linux 8.3 and therefore it's a regression for Red Hat Enterprise Linux 8.4.
Tested 8.3 in CI [1] and was able to get several passing jobs [2]. [1] https://github.com/openshift/release/pull/19190 [2] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/19190/rehearse-19190-pull-ci-openshift-openshift-ansible-master-e2e-aws-workers-rhel8/1429628496895807488
I cannot see any differences for OpenShift SDN between RHEL 8.3 and 8.4. The OVS flow paths are the same. The iptables rules hit are the same (for the PREROUTING chain). For reference, I am testing against 8.3 and 8.4 AWS images; however, both instances have the same installed package versions and the same kernel version. I did not expect this: according to the release notes for 8.4, the kernel was updated to .305, but I see this version for 8.3 also. [1]

Russell, can we move this to the RHEL team? It doesn't look like an SDN bug.

[1]
Red Hat Enterprise Linux 8.4 (Ootpa) 4.18.0-305.12.1.el8_4.x86_64 (AMI image number: ami-0b0af3577fe5e3532)
Red Hat Enterprise Linux 8.3 (Ootpa) 4.18.0-305.12.1.el8_4.x86_64 (AMI image number: ami-01d12f05657cd01d3)
I'm fine with moving this to the RHEL team but I don't know what component to move it to. Reassign as you see fit.
(In reply to Martin Kennelly from comment #19)
[..]
> Here is a packet trace on RHEL 8 from pod -> local node port:
> trace id 4078a633 ip raw PREROUTING verdict continue
> trace id 4078a633 ip raw PREROUTING policy accept
> trace id 4078a633 ip mangle PREROUTING verdict continue
> trace id 4078a633 ip mangle PREROUTING policy accept
> trace id 4078a633 ip nat PREROUTING packet: iif "tun0" ether saddr
> 0a:58:0a:82:02:06 ether daddr ee:cd:bc:c2:83:b7 ip saddr 10.130.2.6 ip daddr
> 10.0.244.172 ip dscp cs0 ip ecn not-ect ip ttl 64 ip id 30075 ip length 60
> tcp sport 59999 tcp dport 10250 tcp flags == syn tcp window 26733
> trace id 4078a633 ip nat PREROUTING rule counter packets 976 bytes 95117
> jump KUBE-SERVICES (verdict jump KUBE-SERVICES)
> trace id 4078a633 ip nat KUBE-SERVICES rule fib daddr type local counter
> packets 486 bytes 46467 jump KUBE-NODEPORTS (verdict jump KUBE-NODEPORTS)
> trace id 4078a633 ip nat KUBE-NODEPORTS verdict continue
> trace id 4078a633 ip nat KUBE-SERVICES verdict continue
> trace id 4078a633 ip nat PREROUTING rule counter packets 926 bytes 91163
> jump KUBE-PORTALS-CONTAINER (verdict jump KUBE-PORTALS-CONTAINER)
> trace id 4078a633 ip nat KUBE-PORTALS-CONTAINER verdict continue
> trace id 4078a633 ip nat PREROUTING rule fib daddr type local counter
> packets 922 bytes 90659 jump KUBE-NODEPORT-CONTAINER (verdict jump
> KUBE-NODEPORT-CONTAINER)
> trace id 4078a633 ip nat KUBE-NODEPORT-CONTAINER verdict continue
> trace id 4078a633 ip nat PREROUTING verdict continue
> trace id 4078a633 ip nat PREROUTING policy accept
>
> Packet just gets dropped. Doesn't move onto INPUT chain from PREROUTING.

Was the packet meant for the host? Packets only go to the INPUT chain if they're meant for the host, e.g. the destination IP matches the host's.
> Was the packet meant for the host? Packets only go to the INPUT chain if they're meant for the host, e.g. IP matches the host's.

Yes. The target IP matched the host's IP.
(In reply to Martin Kennelly from comment #21)
> I confirm it's working for Red Hat Enterprise Linux 8.3 and therefore it's a
> regression for Red Hat Enterprise Linux 8.4.

This statement contradicts comment 23. In comment 23 you said the package/kernel versions were the same. That shouldn't be the case; they should definitely have different kernel versions. RHEL-8.3 is kernel-4.18.0-240.el8.

Is this issue always reproducible? Or only sometimes?
Can someone gather some statistics from a failing node?

1. cat /proc/net/stat/nf_conntrack
2. cat /proc/net/dev

I think the above may be available in a must-gather. It should be in a sosreport. This would be better, as it will get loads of other useful data.
> This statement contradicts comment 23. In comment 23 you said the package/kernel versions were the same. That shouldn't be the case. They should definitely have different kernel versions. RHEL-8.3 is kernel-4.18.0-240.el8.

I didn't understand when I wrote that comment that when I added a RHEL 8.3 node to an OCP 4.9 cluster, OCP updates numerous components/packages, including the kernel version, to match RHEL 8.4. So it works when OCP updates 8.3 -> 8.4, but doesn't work on RHEL 8.4. Therefore, a change in 8.4 that is not managed by OCP is breaking it.
> Is this issue always reproducible? Or only sometimes?

Always reproducible on RHEL 8.4.
(In reply to Martin Kennelly from comment #30)
> > This statement contradicts comment 23. In comment 23 you said the package/kernel versions were the same. That shouldn't be the case. They should definitely have different kernel versions. RHEL-8.3 is kernel-4.18.0-240.el8.
>
> I didn't understand when I wrote that comment that when I added a RHEL 8.3
> node to an OCP 4.9 cluster, OCP updates numerous components/packages,
> including the kernel version to match RHEL 8.4, so it works when OCP updates
> 8.3 -> 8.4 but doesn't work on RHEL 8.4. Therefore, a change in 8.4 that is
> not managed by OCP is breaking it.

Is there any chance you can isolate the component? i.e. only upgrade the kernel? Or selectively downgrade the kernel afterwards.
I'm building a test cluster with RHEL 8.4 and will pull the sosreport from the affected node.
These are the packages installed/upgraded when adding RHEL nodes to a cluster:
https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_node/defaults/main.yml#L12-L98

I can provide the install logs, which show the packages and exact versions installed/upgraded.
(In reply to Russell Teague from comment #34)
> These are the packages installed/upgraded when adding RHEL nodes to a
> cluster:
> https://github.com/openshift/openshift-ansible/blob/master/roles/
> openshift_node/defaults/main.yml#L12-L98

Notably missing from that list is the "iptables" userspace package. It may get pulled in by the other updates though, e.g. "iptables-services". Can you verify?

> I can provide the install logs which show the packages and exact versions installed/upgraded.

Yes, please.
The sosreport and scaleup log are attached. To find the packages/versions installed/upgraded, search the scaleup log for "TASK [openshift_node : Install openshift packages]". Each host will be listed in succession.

iptables-services is installed:

Installed Packages
iptables-libs.x86_64 1.8.4-17.el8 @anaconda
iptables-services.x86_64 1.8.4-17.el8 @rhel-8-for-x86_64-baseos-rpms
Fixing the component: this is openshift-sdn, which means iptables, not nftables.
(In reply to Eric Garver from comment #32)
> (In reply to Martin Kennelly from comment #30)
> > > This statement contradicts comment 23. In comment 23 you said the package/kernel versions were the same. That shouldn't be the case. They should definitely have different kernel versions. RHEL-8.3 is kernel-4.18.0-240.el8.
> >
> > I didn't understand when I wrote that comment that when I added a RHEL 8.3
> > node to an OCP 4.9 cluster, OCP updates numerous components/packages,
> > including the kernel version to match RHEL 8.4, so it works when OCP updates
> > 8.3 -> 8.4 but doesn't work on RHEL 8.4. Therefore, a change in 8.4 that is
> > not managed by OCP is breaking it.
>
> Is there any chance you can isolate the component? i.e. only upgrade the
> kernel? Or selectively downgrade the kernel afterwards.

Martin, can you try this?
I built two identical clusters, one using RHEL 8.3 worker nodes, one using RHEL 8.4 worker nodes. I pulled the installed package list from one node of each and attached them to the bug. A diff checker can be used to compare the two files and determine package version differences. If there are specific packages that should be downgraded to specific versions I can attempt to make those changes.
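For reproducibility, this is roughly how such package lists can be generated and compared; the exact query format here is illustrative, not necessarily the command used for the attachments:

# Dump installed packages on each node in a diff-friendly "name.arch version" form.
rpm -qa --queryformat '%{NAME}.%{ARCH} %{EVR}\n' | sort > packagelist-$(hostname).txt

# Compare the two lists side by side, showing only the differences.
diff -y --suppress-common-lines packagelist-8.3.txt packagelist-8.4.txt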
> Martin, can you try this?

I don't have the time currently until next week.

Russell, can you do this to speed this up?
(In reply to Martin Kennelly from comment #44)
> > Martin, can you try this?
> I don't have the time currently until next week.
>
> Russell, can you do this to speed this up?

Yes, I can attempt any package changes. With the kernel specifically, both nodes are running the same kernel version, so it doesn't make sense to downgrade the kernel on the 8.4 host to the original version on the 8.3 host. I provided the installed package lists so that I could get better direction on which packages to change instead of working through every package difference.
(In reply to Russell Teague from comment #45)
> (In reply to Martin Kennelly from comment #44)
> > > Martin, can you try this?
> > I don't have the time currently until next week.
> >
> > Russell, can you do this to speed this up?
>
> Yes, I can attempt any package changes. With the kernel specifically, both
> nodes are running the same kernel version so it doesn't make sense to
> downgrade the kernel on the 8.4 host to the original version on the 8.3
> host. I provided the installed package lists so that I could get better
> direction on which packages to change instead of working through every
> package difference.

kernel and iptables are a good start.
What version of kernel and iptables should be installed on the 8.4 host? As can be seen in the attached package lists, the kernel and iptables are the same version between both hosts.

8.3:
Linux ip-10-0-139-135.us-east-2.compute.internal 4.18.0-305.12.1.el8_4.x86_64 #1 SMP Mon Jul 26 08:06:24 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux
Installed Packages
iptables.x86_64 1.8.4-17.el8 @rhel-8-for-x86_64-baseos-rpms
iptables-libs.x86_64 1.8.4-17.el8 @rhel-8-for-x86_64-baseos-rpms
iptables-services.x86_64 1.8.4-17.el8 @rhel-8-for-x86_64-baseos-rpms

8.4:
Linux ip-10-0-154-44.us-east-2.compute.internal 4.18.0-305.12.1.el8_4.x86_64 #1 SMP Mon Jul 26 08:06:24 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux
Installed Packages
iptables.x86_64 1.8.4-17.el8 @rhel-8-for-x86_64-baseos-rpms
iptables-libs.x86_64 1.8.4-17.el8 @anaconda
iptables-services.x86_64 1.8.4-17.el8 @rhel-8-for-x86_64-baseos-rpms

The following is a concise list of package differences between the hosts (8.3 on the left, 8.4 on the right):

$ diff -y --suppress-common-lines ../aws-4b/packagelist-8.3-norepo.txt ../aws-4a/packagelist-8.4-norepo.txt
> NetworkManager-cloud-setup.x86_64 1:1.30.0-10.el8_4
bash.x86_64 4.4.19-12.el8 | bash.x86_64 4.4.20-1.el8_4
bind-export-libs.x86_64 32:9.11.20-5.el8 | bind-export-libs.x86_64 32:9.11.26-4.el8_4
brotli.x86_64 1.0.6-2.el8 | brotli.x86_64 1.0.6-3.el8
cloud-init.noarch 19.4-11.el8_3.2 | cloud-init.noarch 20.3-10.el8_4.2
cpio.x86_64 2.12-8.el8 | cpio.x86_64 2.12-10.el8
crontabs.noarch 1.11-16.20150630git.el8 | crontabs.noarch 1.11-17.20190603git.el8
crypto-policies.noarch 20200713-1.git51d1222.el8 | crypto-policies.noarch 20210209-1.gitbfb6bed.el8_3
crypto-policies-scripts.noarch 20200713-1.git51d1222.el8 | crypto-policies-scripts.noarch 20210209-1.gitbfb6bed.el8_3
curl.x86_64 7.61.1-14.el8_3.1 | curl.x86_64 7.61.1-18.el8
dbus.x86_64 1:1.12.8-12.el8_3 | dbus.x86_64 1:1.12.8-12.el8_4.2
dbus-common.noarch 1:1.12.8-12.el8_3 | dbus-common.noarch 1:1.12.8-12.el8_4.2
dbus-daemon.x86_64 1:1.12.8-12.el8_3 | dbus-daemon.x86_64 1:1.12.8-12.el8_4.2
dbus-libs.x86_64 1:1.12.8-12.el8_3 | dbus-libs.x86_64 1:1.12.8-12.el8_4.2
dbus-tools.x86_64 1:1.12.8-12.el8_3 | dbus-tools.x86_64 1:1.12.8-12.el8_4.2
dhcp-client.x86_64 12:4.3.6-41.el8 | dhcp-client.x86_64 12:4.3.6-44.el8
dhcp-common.noarch 12:4.3.6-41.el8 | dhcp-common.noarch 12:4.3.6-44.el8
dhcp-libs.x86_64 12:4.3.6-41.el8 | dhcp-libs.x86_64 12:4.3.6-44.el8
dmidecode.x86_64 1:3.2-6.el8 | dmidecode.x86_64 1:3.2-8.el8
dnf.noarch 4.2.23-4.el8 | dnf.noarch 4.4.2-11.el8
dnf-data.noarch 4.2.23-4.el8 | dnf-data.noarch 4.4.2-11.el8
dnf-plugin-subscription-manager.x86_64 1.27.18-1.el8_3 | dnf-plugin-subscription-manager.x86_64 1.28.13-2.el8
dnf-plugins-core.noarch 4.0.17-5.el8 | dnf-plugins-core.noarch 4.0.18-4.el8
elfutils-debuginfod-client.x86_64 0.180-1.el8 | elfutils-debuginfod-client.x86_64 0.182-3.el8
elfutils-default-yama-scope.noarch 0.180-1.el8 | elfutils-default-yama-scope.noarch 0.182-3.el8
elfutils-libelf.x86_64 0.180-1.el8 | elfutils-libelf.x86_64 0.182-3.el8
elfutils-libs.x86_64 0.180-1.el8 | elfutils-libs.x86_64 0.182-3.el8
ethtool.x86_64 2:5.0-2.el8 | ethtool.x86_64 2:5.8-5.el8
file.x86_64 5.33-16.el8 | file.x86_64 5.33-16.el8_3.1
file-libs.x86_64 5.33-16.el8 | file-libs.x86_64 5.33-16.el8_3.1
gawk.x86_64 4.2.1-1.el8 | gawk.x86_64 4.2.1-2.el8
glib2.x86_64 2.56.4-8.el8 | glib2.x86_64 2.56.4-9.el8
glibc.x86_64 2.28-127.el8_3.2 | glibc.x86_64 2.28-151.el8
glibc-common.x86_64 2.28-127.el8_3.2 | glibc-common.x86_64 2.28-151.el8
glibc-langpack-en.x86_64 2.28-127.el8_3.2 | glibc-langpack-en.x86_64 2.28-151.el8
gnutls.x86_64 3.6.14-7.el8_3 | gnutls.x86_64 3.6.14-8.el8_3
gpgme.x86_64 1.13.1-3.el8 | gpgme.x86_64 1.13.1-7.el8
grub2-common.noarch 1:2.02-90.el8 | grub2-common.noarch 1:2.02-99.el8
grub2-pc.x86_64 1:2.02-90.el8 | grub2-pc.x86_64 1:2.02-99.el8
grub2-pc-modules.noarch 1:2.02-90.el8 | grub2-pc-modules.noarch 1:2.02-99.el8
grub2-tools.x86_64 1:2.02-90.el8 | grub2-tools.x86_64 1:2.02-99.el8
grub2-tools-extra.x86_64 1:2.02-90.el8 | grub2-tools-extra.x86_64 1:2.02-99.el8
grub2-tools-minimal.x86_64 1:2.02-90.el8 | grub2-tools-minimal.x86_64 1:2.02-99.el8
hdparm.x86_64 9.54-2.el8 | hdparm.x86_64 9.54-3.el8
hwdata.noarch 0.314-8.6.el8 | hwdata.noarch 0.314-8.8.el8
ima-evm-utils.x86_64 1.1-5.el8 | ima-evm-utils.x86_64 1.3.2-12.el8
initscripts.x86_64 10.00.9-1.el8 | initscripts.x86_64 10.00.15-1.el8
insights-client.noarch 3.1.1-1.el8_3 | insights-client.noarch 3.1.3-2.el8_4
iproute.x86_64 5.3.0-5.el8 | iproute.x86_64 5.9.0-4.el8
iputils.x86_64 20180629-2.el8 | iputils.x86_64 20180629-7.el8
json-c.x86_64 0.13.1-0.2.el8 | json-c.x86_64 0.13.1-0.4.el8
kexec-tools.x86_64 2.0.20-34.el8_3.2 | kexec-tools.x86_64 2.0.20-46.el8
kmod.x86_64 25-16.el8_3.1 | kmod.x86_64 25-17.el8
kmod-libs.x86_64 25-16.el8_3.1 | kmod-libs.x86_64 25-17.el8
krb5-libs.x86_64 1.18.2-5.el8 | krb5-libs.x86_64 1.18.2-8.el8
libarchive.x86_64 3.3.2-9.el8 | libarchive.x86_64 3.3.3-1.el8
libblkid.x86_64 2.32.1-24.el8 | libblkid.x86_64 2.32.1-27.el8
libcomps.x86_64 0.1.11-4.el8 | libcomps.x86_64 0.1.11-5.el8
libcurl.x86_64 7.61.1-14.el8_3.1 | libcurl.x86_64 7.61.1-18.el8
libdb.x86_64 5.3.28-39.el8 | libdb.x86_64 5.3.28-40.el8
libdb-utils.x86_64 5.3.28-39.el8 | libdb-utils.x86_64 5.3.28-40.el8
libdnf.x86_64 0.48.0-5.el8 | libdnf.x86_64 0.55.0-7.el8
libfdisk.x86_64 2.32.1-24.el8 | libfdisk.x86_64 2.32.1-27.el8
libgcc.x86_64 8.3.1-5.1.el8 | libgcc.x86_64 8.4.1-1.el8
libgomp.x86_64 8.3.1-5.1.el8 | libgomp.x86_64 8.4.1-1.el8
libldb.x86_64 2.1.3-2.el8 | libldb.x86_64 2.2.0-2.el8
libmount.x86_64 2.32.1-24.el8 | libmount.x86_64 2.32.1-27.el8
libnfsidmap.x86_64 1:2.3.3-35.el8 | libnfsidmap.x86_64 1:2.3.3-41.el8
libpcap.x86_64 14:1.9.1-4.el8 | libpcap.x86_64 14:1.9.1-5.el8
libpwquality.x86_64 1.4.0-9.el8 | libpwquality.x86_64 1.4.4-3.el8
librepo.x86_64 1.12.0-2.el8 | librepo.x86_64 1.12.0-3.el8
librhsm.x86_64 0.0.3-3.el8 | librhsm.x86_64 0.0.3-4.el8
libseccomp.x86_64 2.4.3-1.el8 | libseccomp.x86_64 2.5.1-1.el8
libselinux.x86_64 2.9-4.el8_3 | libselinux.x86_64 2.9-5.el8
libselinux-utils.x86_64 2.9-4.el8_3 | libselinux-utils.x86_64 2.9-5.el8
libsemanage.x86_64 2.9-3.el8 | libsemanage.x86_64 2.9-6.el8
libsepol.x86_64 2.9-1.el8 | libsepol.x86_64 2.9-2.el8
libsmartcols.x86_64 2.32.1-24.el8 | libsmartcols.x86_64 2.32.1-27.el8
libsolv.x86_64 0.7.11-1.el8 | libsolv.x86_64 0.7.16-2.el8
libsss_autofs.x86_64 2.3.0-9.el8 | libsss_autofs.x86_64 2.4.0-9.el8
libsss_sudo.x86_64 2.3.0-9.el8 | libsss_sudo.x86_64 2.4.0-9.el8
libstdc++.x86_64 8.3.1-5.1.el8 | libstdc++.x86_64 8.4.1-1.el8
libuuid.x86_64 2.32.1-24.el8 | libuuid.x86_64 2.32.1-27.el8
libxml2.x86_64 2.9.7-8.el8 | libxml2.x86_64 2.9.7-9.el8
> lmdb-libs.x86_64 0.9.24-1.el8
lshw.x86_64 B.02.19.2-2.el8 | lshw.x86_64 B.02.19.2-5.el8
lsscsi.x86_64 0.30-1.el8 | lsscsi.x86_64 0.32-2.el8
nettle.x86_64 3.4.1-2.el8 | nettle.x86_64 3.4.1-4.el8_3
oddjob.x86_64 0.34.5-3.el8 | oddjob.x86_64 0.34.7-1.el8
oddjob-mkhomedir.x86_64 0.34.5-3.el8 | oddjob-mkhomedir.x86_64 0.34.7-1.el8
openldap.x86_64 2.4.46-15.el8 | openldap.x86_64 2.4.46-16.el8
openssl.x86_64 1:1.1.1g-12.el8_3 | openssl.x86_64 1:1.1.1g-15.el8_3
openssl-libs.x86_64 1:1.1.1g-12.el8_3 | openssl-libs.x86_64 1:1.1.1g-15.el8_3
pam.x86_64 1.3.1-11.el8 | pam.x86_64 1.3.1-14.el8
pciutils.x86_64 3.6.4-2.el8 | pciutils.x86_64 3.7.0-1.el8
pciutils-libs.x86_64 3.6.4-2.el8 | pciutils-libs.x86_64 3.7.0-1.el8
platform-python.x86_64 3.6.8-31.el8 | platform-python.x86_64 3.6.8-37.el8
platform-python-pip.noarch 9.0.3-18.el8 | platform-python-pip.noarch 9.0.3-19.el8
popt.x86_64 1.16-14.el8 | popt.x86_64 1.18-1.el8
procps-ng.x86_64 3.3.15-3.el8 | procps-ng.x86_64 3.3.15-6.el8
python3-asn1crypto.noarch 0.24.0-3.el8 <
python3-cryptography.x86_64 2.3-3.el8 | python3-cryptography.x86_64 3.2.1-4.el8
python3-dnf.noarch 4.2.23-4.el8 | python3-dnf.noarch 4.4.2-11.el8
python3-dnf-plugins-core.noarch 4.0.17-5.el8 | python3-dnf-plugins-core.noarch 4.0.18-4.el8
python3-gpg.x86_64 1.13.1-3.el8 | python3-gpg.x86_64 1.13.1-7.el8
python3-hawkey.x86_64 0.48.0-5.el8 | python3-hawkey.x86_64 0.55.0-7.el8
python3-libcomps.x86_64 0.1.11-4.el8 | python3-libcomps.x86_64 0.1.11-5.el8
python3-libdnf.x86_64 0.48.0-5.el8 | python3-libdnf.x86_64 0.55.0-7.el8
python3-librepo.x86_64 1.12.0-2.el8 | python3-librepo.x86_64 1.12.0-3.el8
python3-libs.x86_64 3.6.8-31.el8 | python3-libs.x86_64 3.6.8-37.el8
python3-libselinux.x86_64 2.9-4.el8_3 | python3-libselinux.x86_64 2.9-5.el8
python3-libsemanage.x86_64 2.9-3.el8 | python3-libsemanage.x86_64 2.9-6.el8
python3-libxml2.x86_64 2.9.7-8.el8 | python3-libxml2.x86_64 2.9.7-9.el8
python3-linux-procfs.noarch 0.6.2-2.el8 | python3-linux-procfs.noarch 0.6.3-1.el8
python3-magic.noarch 5.33-16.el8 | python3-magic.noarch 5.33-16.el8_3.1
python3-perf.x86_64 4.18.0-240.15.1.el8_3 | python3-perf.x86_64 4.18.0-305.el8
python3-pip-wheel.noarch 9.0.3-18.el8 | python3-pip-wheel.noarch 9.0.3-19.el8
python3-ply.noarch 3.9-8.el8 | python3-ply.noarch 3.9-9.el8
python3-rpm.x86_64 4.14.3-4.el8 | python3-rpm.x86_64 4.14.3-13.el8
python3-subscription-manager-rhsm.x86_64 1.27.18-1.el8_3 | python3-subscription-manager-rhsm.x86_64 1.28.13-2.el8
python3-syspurpose.x86_64 1.27.18-1.el8_3 | python3-syspurpose.x86_64 1.28.13-2.el8
python3-unbound.x86_64 1.7.3-14.el8 | python3-unbound.x86_64 1.7.3-15.el8
python3-urllib3.noarch 1.24.2-4.el8 | python3-urllib3.noarch 1.24.2-5.el8
qemu-guest-agent.x86_64 15:4.2.0-34.module+el8.3.0+9828+7aab3 | qemu-guest-agent.x86_64 15:4.2.0-48.module+el8.4.0+10368+630e
redhat-release.x86_64 8.3-1.0.el8 | redhat-release.x86_64 8.4-0.6.el8
redhat-release-eula.x86_64 8.3-1.0.el8 | redhat-release-eula.x86_64 8.4-0.6.el8
rh-amazon-rhui-client.noarch 3.0.39-1.el8 | rh-amazon-rhui-client.noarch 3.0.40-1.el8
rng-tools.x86_64 6.8-3.el8 | rhc.x86_64 1:0.1.4-1.el8_4
rpm.x86_64 4.14.3-4.el8 | rpm.x86_64 4.14.3-13.el8
rpm-build-libs.x86_64 4.14.3-4.el8 | rpm-build-libs.x86_64 4.14.3-13.el8
rpm-libs.x86_64 4.14.3-4.el8 | rpm-libs.x86_64 4.14.3-13.el8
rpm-plugin-selinux.x86_64 4.14.3-4.el8 | rpm-plugin-selinux.x86_64 4.14.3-13.el8
rpm-plugin-systemd-inhibit.x86_64 4.14.3-4.el8 | rpm-plugin-systemd-inhibit.x86_64 4.14.3-13.el8
rsyslog.x86_64 8.1911.0-6.el8 | rsyslog.x86_64 8.1911.0-7.el8
sqlite-libs.x86_64 3.26.0-11.el8 | sqlite-libs.x86_64 3.26.0-13.el8
squashfs-tools.x86_64 4.3-19.el8 | squashfs-tools.x86_64 4.3-20.el8
sssd-nfs-idmap.x86_64 2.3.0-9.el8 | sssd-nfs-idmap.x86_64 2.4.0-9.el8
subscription-manager.x86_64 1.27.18-1.el8_3 | subscription-manager.x86_64 1.28.13-2.el8
subscription-manager-rhsm-certificates.x86_64 1.27.18-1.el8_3 | subscription-manager-rhsm-certificates.x86_64 1.28.13-2.el8
> tpm2-tss.x86_64 2.3.2-3.el8
trousers.x86_64 0.3.14-4.el8 | trousers.x86_64 0.3.15-1.el8
trousers-lib.x86_64 0.3.14-4.el8 | trousers-lib.x86_64 0.3.15-1.el8
tuned.noarch 2.14.0-3.el8_3.2 | tuned.noarch 2.15.0-2.el8
unbound-libs.x86_64 1.7.3-14.el8 | unbound-libs.x86_64 1.7.3-15.el8
util-linux.x86_64 2.32.1-24.el8 | util-linux.x86_64 2.32.1-27.el8
yum.noarch 4.2.23-4.el8 | yum.noarch 4.4.2-11.el8
yum-utils.noarch 4.0.17-5.el8 | yum-utils.noarch 4.0.18-4.el8
zlib.x86_64 1.2.11-16.el8_2 | zlib.x86_64 1.2.11-17.el8
(In reply to Russell Teague from comment #47)
> What version of kernel and iptables should be installed on the 8.4 host?

8.3 GA: kernel-4.18.0-240.el8, iptables-1.8.4-15.el8
8.4 GA: kernel-4.18.0-305.el8, iptables-1.8.4-17.el8

Looks like all the servers are using the 8.4 packages. Can you downgrade to the 8.3 kernel and iptables?
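A sketch of one way to do that downgrade on the 8.4 host, assuming the 8.3 GA packages are still available from the configured repos (package versions per the list above):

# Install the 8.3 GA kernel alongside the running one and make it the default.
dnf install -y kernel-4.18.0-240.el8
grubby --set-default /boot/vmlinuz-4.18.0-240.el8.x86_64

# Downgrade the iptables userspace to the 8.3 GA build.
dnf downgrade -y iptables-1.8.4-15.el8 iptables-libs-1.8.4-15.el8 iptables-services-1.8.4-15.el8

reboot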
Downgraded kernel/iptables to kernel-4.18.0-240.el8, 1.8.4-15.el8 and the target down alerts are still firing.
(In reply to Russell Teague from comment #49)
> Downgraded kernel/iptables to kernel-4.18.0-240.el8, 1.8.4-15.el8 and the
> target down alerts are still firing.

That's unexpected, since comment 21 and comment 22 say this doesn't work on RHEL-8.3. Did you remember to reboot after the kernel downgrade?
Yes, I remembered to reboot.

[ec2-user@ip-10-0-153-162 ~]$ uname -a
Linux ip-10-0-153-162.us-east-2.compute.internal 4.18.0-240.el8.x86_64 #1 SMP Wed Sep 23 05:13:10 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
[ec2-user@ip-10-0-153-162 ~]$ cat /etc/redhat-release
Red Hat Enterprise Linux release 8.4 (Ootpa)
[ec2-user@ip-10-0-153-162 ~]$

Also, comment 21 and comment 22 say this _does_ work on RHEL-8.3.
(In reply to Martin Kennelly from comment #30)
> > This statement contradicts comment 23. In comment 23 you said the package/kernel versions were the same. That shouldn't be the case. They should definitely have different kernel versions. RHEL-8.3 is kernel-4.18.0-240.el8.
>
> I didn't understand when I wrote that comment that when I added a RHEL 8.3
> node to an OCP 4.9 cluster, OCP updates numerous components/packages,
> including the kernel version to match RHEL 8.4, so it works when OCP updates
> 8.3 -> 8.4 but doesn't work on RHEL 8.4. Therefore, a change in 8.4 that is
> not managed by OCP is breaking it.

I missed this detail. To make it clear:

RHEL-8.3: PASS
RHEL-8.4 upgraded from RHEL-8.3: PASS
RHEL-8.4: FAIL

Russell, the list you gave in comment 47 - is "8.3" in that comment really 8.4 upgraded from 8.3? If so, then that's the list of packages you can try downgrading. Some that jump out to me:
- curl
- iproute
- openssl

Alternatively, try doing a full `dnf update` on the nodes that are upgraded from RHEL-8.3 to RHEL-8.4. I find it odd that OCP isn't doing a full update and is only upgrading select packages.
RHEL-8.4 upgraded from RHEL-8.3: PASS <-- This has not been tested to my knowledge. (see below)

The statement "when OCP updates 8.3 -> 8.4" is not fully correct, because only select packages are updated. In comment 34 I provided a link to the list(s) of packages that are updated when installing OCP. It is not within the scope of RHEL compute scaleup in OCP to update all the packages on the node.

In comment 47, "8.3" refers to an 8.3 host that had OCP installed and select packages updated.

I will test again with RHEL-8.3 hosts fully upgraded. I will collect the package differences between these hosts.
(In reply to Russell Teague from comment #53)
> RHEL-8.4 upgraded from RHEL-8.3: PASS <-- This has not been tested to my
> knowledge. (see below)
>
> The statement "when OCP updates 8.3 -> 8.4" is not fully correct because
> only select packages are updated.

That was my point. It's a partial upgrade. It's neither RHEL-8.3 nor RHEL-8.4. I said "RHEL-8.4 upgraded from RHEL-8.3" but I meant "RHEL-8.3 partially upgraded to RHEL-8.4". Sorry for the confusion.

> In comment 34 I provided a link to the
> list(s) of packages that are updated when installing OCP.

That was useful to see.

> It is not within
> the scope of RHEL compute scaleup in OCP to update all the packages on the
> node.

I fail to see how a partial upgrade is in scope, but a full upgrade is out of scope.
I installed two openshift clusters, one with RHEL 8.3 workers and one with RHEL 8.4 workers. Both 8.3 and 8.4 hosts were fully upgraded based on current released versions for all packages. After installing openshift, these are the package differences between the two hosts:

Installed Packages (8.3-upgraded-to-8.4) | Installed Packages (8.4)
> NetworkManager-cloud-setup.x86_64 1:1.30.0-10.el8_4
grub2-tools-efi.x86_64 1:2.02-99.el8 <
python3-asn1crypto.noarch 0.24.0-3.el8 <
rh-amazon-rhui-client.noarch 3.0.39-1.el8 | rh-amazon-rhui-client.noarch 3.0.40-1.el8
> rhc.x86_64 1:0.1.4-1.el8_4
rng-tools.x86_64 6.8-3.el8 <

I confirmed the 8.3-upgraded-to-8.4 host was functioning properly with no openshift alerts firing. The 8.4 host was presenting the same issue as described in this bug. I will step through the packages above to make them the same as the upgraded 8.3 host to see if that has any effect on the issue, although I don't know why/how any of these would be related.
Found the culprit. After uninstalling NetworkManager-cloud-setup, the problem went away. I confirmed this by installing fresh RHEL 8.4 hosts, uninstalling NetworkManager-cloud-setup, rebooting, then installing openshift (worker scaleup) as normal. After draining the RHCOS nodes and running all pods on the RHEL nodes, there were no TargetDown alerts.

I need some help in tracking down what this package does, and why it is being included by default in RHEL 8.4 (at least in the public AWS AMI). Here are a couple of links I came across while doing some research:

https://networkmanager.pages.freedesktop.org/NetworkManager/NetworkManager/nm-cloud-setup.html
https://github.com/coreos/fedora-coreos-tracker/issues/320
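For anyone hitting this, a sketch of the workaround described above, run as root on the affected node. Disabling the nm-cloud-setup units instead of removing the package should be equivalent, though that is an assumption:

# Remove nm-cloud-setup entirely...
dnf remove -y NetworkManager-cloud-setup

# ...or keep the package installed but stop it from reconfiguring routing:
# systemctl disable --now nm-cloud-setup.service nm-cloud-setup.timer

reboot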
https://brew.engineering.redhat.com/brew/rpminfo?rpmID=9963538

"Installs a nm-cloud-setup tool that can automatically configure NetworkManager in cloud setups. Currently only EC2 is supported. This tool is still experimental."

The package was built with this build: NetworkManager-1.30.0-10.el8_4
https://brew.engineering.redhat.com/brew/buildinfo?buildID=1660864
Pinging the NM team. thaller, bgalvani, please see comment 56 and above if you need context.

tl;dr: The presence of NetworkManager-cloud-setup causes unexpected TargetDown alarms because some monitoring services cannot be reached.
Reassigning to NetworkManager.

The RHEL-8.4 images for AWS enable nm-cloud-setup by default. The idea is to automatically configure networking. Obviously, if that causes problems with containers / OpenShift, that's a severe issue.

It is clear what nm-cloud-setup does: it does what is implemented. But it's less clear how that is wrong and what it should do instead.

Russell, as these are "just VMs", would it be easily possible to share access to such a VM that exhibits the problem? Alternatively, could you please attach the output of:

ip -4 addr
ip -6 addr
ip -4 rule
ip -6 rule
ip -4 route show table all
ip -6 route show table all
I'm building a cluster and will provide access as well as attach the requested command output.
This should be fixed upstream with https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/974
Here is a scratch build with the fix: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=39868720

Any chance to test it with RHCOS?
This important issue is affecting OpenShift clusters on AWS where RHEL 8.4 nodes are deployed. RHEL 8.4 worker node support will be GA in OpenShift 4.9.

I have tested the scratch build by installing the following packages on RHEL 8.4 nodes. I observed local node prometheus targets reporting correctly when the prometheus pod is running on the local node. The original issue reported appears resolved.

- NetworkManager-1.32.10-3.el8.x86_64.rpm
- NetworkManager-cloud-setup-1.32.10-3.el8.x86_64.rpm
- NetworkManager-libnm-1.32.10-3.el8.x86_64.rpm
- NetworkManager-team-1.32.10-3.el8.x86_64.rpm
- NetworkManager-tui-1.32.10-3.el8.x86_64.rpm
- NetworkManager-ovs-1.32.10-3.el8.x86_64.rpm
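For reference, a sketch of applying those RPMs on a node, assuming they were downloaded to the current directory; the restart step is an assumption about what is needed to pick up the new daemon:

# Install the scratch-build RPMs listed above.
dnf install -y ./NetworkManager-*1.32.10-3.el8.x86_64.rpm

# Restart NetworkManager; a reboot also works and re-runs nm-cloud-setup
# from a clean state.
systemctl restart NetworkManager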
*** Bug 1995503 has been marked as a duplicate of this bug. ***
From Frank's email:

# mkdir -p /tmp/test
# echo 'testhah123' > /tmp/test/1
# cd /tmp/test
# podman run -dit --name my-apache-app -p 8080:80 -v "$PWD":/usr/local/apache2/htdocs/ httpd:2.4
# curl http://10.116.2.65:8080/1
testhah123

(With NetworkManager-cloud-setup installed, the curl failed; without nm-cloud-setup, the curl was ok.)

With nm-cloud-setup enabled (NetworkManager-cloud-setup-1.30.0-10.el8_4.x86_64), below is the route output:

[root@ip-10-116-2-65 test]# ip -4 route show table all|sort
10.116.2.0/24 dev eth0 proto kernel scope link src 10.116.2.65 metric 100
10.116.2.1 dev eth0 table 30400 proto static scope link metric 10
10.88.0.0/16 dev cni-podman0 proto kernel scope link src 10.88.0.1
broadcast 10.116.2.0 dev eth0 table local proto kernel scope link src 10.116.2.65
broadcast 10.116.2.255 dev eth0 table local proto kernel scope link src 10.116.2.65
broadcast 10.88.0.0 dev cni-podman0 table local proto kernel scope link src 10.88.0.1
broadcast 10.88.255.255 dev cni-podman0 table local proto kernel scope link src 10.88.0.1
broadcast 127.0.0.0 dev lo table local proto kernel scope link src 127.0.0.1
broadcast 127.255.255.255 dev lo table local proto kernel scope link src 127.0.0.1
default via 10.116.2.1 dev eth0 proto dhcp metric 100
default via 10.116.2.1 dev eth0 table 30400 proto static metric 10
local 10.116.2.65 dev eth0 table local proto kernel scope host src 10.116.2.65
local 10.88.0.1 dev cni-podman0 table local proto kernel scope host src 10.88.0.1
local 127.0.0.0/8 dev lo table local proto kernel scope host src 127.0.0.1
local 127.0.0.1 dev lo table local proto kernel scope host src 127.0.0.1
[root@ip-10-116-2-65 test]# curl http://10.116.2.65:8080/1
curl: (7) Failed to connect to 10.116.2.65 port 8080: Connection timed out

With nm-cloud-setup disabled:

# ip -4 route show table all|sort
10.116.2.0/24 dev eth0 proto kernel scope link src 10.116.2.65 metric 100
10.88.0.0/16 dev cni-podman0 proto kernel scope link src 10.88.0.1
broadcast 10.116.2.0 dev eth0 table local proto kernel scope link src 10.116.2.65
broadcast 10.116.2.255 dev eth0 table local proto kernel scope link src 10.116.2.65
broadcast 10.88.0.0 dev cni-podman0 table local proto kernel scope link src 10.88.0.1
broadcast 10.88.255.255 dev cni-podman0 table local proto kernel scope link src 10.88.0.1
broadcast 127.0.0.0 dev lo table local proto kernel scope link src 127.0.0.1
broadcast 127.255.255.255 dev lo table local proto kernel scope link src 127.0.0.1
default via 10.116.2.1 dev eth0 proto dhcp metric 100
local 10.116.2.65 dev eth0 table local proto kernel scope host src 10.116.2.65
local 10.88.0.1 dev cni-podman0 table local proto kernel scope host src 10.88.0.1
local 127.0.0.0/8 dev lo table local proto kernel scope host src 127.0.0.1
local 127.0.0.1 dev lo table local proto kernel scope host src 127.0.0.1
[root@ip-10-116-2-65 test]# curl http://10.116.2.65:8080/1
testhah123

The behavior in RHEL-8.5 is the same as in RHEL-8.4, but with nm-cloud-setup enabled in the fixed version (NetworkManager-cloud-setup-1.32.10-4.el8.x86_64), below is the route output:

[root@ip-10-116-2-122 test]# rpm -q NetworkManager-cloud-setup
NetworkManager-cloud-setup-1.32.10-4.el8.x86_64
[root@ip-10-116-2-122 test]# curl http://10.116.2.122:8080/1
testhaha
[root@ip-10-116-2-122 test]# ip -4 route show table all|sort
10.116.2.0/24 dev eth0 proto kernel scope link src 10.116.2.122 metric 100
10.88.0.0/16 dev cni-podman0 proto kernel scope link src 10.88.0.1
broadcast 10.116.2.0 dev eth0 table local proto kernel scope link src 10.116.2.122
broadcast 10.116.2.255 dev eth0 table local proto kernel scope link src 10.116.2.122
broadcast 10.88.0.0 dev cni-podman0 table local proto kernel scope link src 10.88.0.1
broadcast 10.88.255.255 dev cni-podman0 table local proto kernel scope link src 10.88.0.1
broadcast 127.0.0.0 dev lo table local proto kernel scope link src 127.0.0.1
broadcast 127.255.255.255 dev lo table local proto kernel scope link src 127.0.0.1
default via 10.116.2.1 dev eth0 proto dhcp metric 100
local 10.116.2.122 dev eth0 table local proto kernel scope host src 10.116.2.122
local 10.88.0.1 dev cni-podman0 table local proto kernel scope host src 10.88.0.1
local 127.0.0.0/8 dev lo table local proto kernel scope host src 127.0.0.1
local 127.0.0.1 dev lo table local proto kernel scope host src 127.0.0.1
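The key difference above is the extra policy routing (table 30400) that the broken nm-cloud-setup installs. A diagnostic sketch based on those outputs; the interpretation of the rule lookup is an assumption about the pre-fix behavior:

# Show the policy rules; with the broken version there are extra rules
# steering traffic into table 30400.
ip rule show

# Table 30400 only holds a gateway link route and a default route, with no
# "local" routes, so hairpin traffic to the node's own IP is sent toward
# the gateway instead of being delivered locally.
ip -4 route show table 30400

# Temporary experiment: flushing the table (or disabling nm-cloud-setup,
# see above) restores local delivery until nm-cloud-setup runs again.
ip route flush table 30400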
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: NetworkManager security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:4361