Description of problem:
Set up a cluster with the OVN-Kubernetes plugin and two RHEL 7.9 workers. On the RHEL 7.9 worker, the kubernetes service IP https://172.30.0.1:443 cannot be reached, although the backend endpoint IPs can be. See:

$ oc get svc
NAME               TYPE           CLUSTER-IP       EXTERNAL-IP                            PORT(S)     AGE
kubernetes         ClusterIP      172.30.0.1       <none>                                 443/TCP     6h12m
openshift          ExternalName   <none>           kubernetes.default.svc.cluster.local   <none>      5h53m
service-secure     ClusterIP      172.30.186.168   <none>                                 27443/TCP   73m
service-unsecure   ClusterIP      172.30.15.244    <none>                                 27017/TCP   73m

$ oc get ep
NAME               ENDPOINTS                                                       AGE
kubernetes         172.31.249.123:6443,172.31.249.212:6443,172.31.249.224:6443    6h12m
service-secure     10.129.2.12:8443,10.130.2.14:8443                               74m
service-unsecure   10.129.2.12:8080,10.130.2.14:8080                               74m

$ oc debug node/wewang-623-rwwrt-rhel-0
Creating debug namespace/openshift-debug-node-lnzdw ...
Starting pod/wewang-623-rwwrt-rhel-0-debug ...
To use host binaries, run `chroot /host`
Pod IP: 172.31.249.173
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.2# curl https://172.30.0.1:443
^C
sh-4.2# curl https://172.30.0.1:443 --connect-timeout 4          ---> service IP cannot be accessed
curl: (28) Connection timed out after 4000 milliseconds
sh-4.2# curl https://172.31.249.123:6443 --connect-timeout 4 -k  ---> backend works
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {
  },
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "details": {
  },
  "code": 403
}

It seems that only the kubernetes service IP 172.30.0.1 cannot be accessed on the RHEL 7.9 worker. A test pod and service that I created work well:

$ oc debug node/wewang-623-rwwrt-rhel-0
Creating debug namespace/openshift-debug-node-kg9m4 ...
Starting pod/wewang-623-rwwrt-rhel-0-debug ...
To use host binaries, run `chroot /host`
Pod IP: 172.31.249.173
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.2# curl https://172.30.186.168:27443 -k
Hello-OpenShift web-server-rc-hbf7h https-8443 default

Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-06-22-145219 with a RHEL 7.9 worker

openvswitch and kernel versions on the worker:
sh-4.2# rpm -qa | grep openv
openvswitch-selinux-extra-policy-1.0-17.el7fdp.noarch
openvswitch2.13-2.13.0-95.el7fdp.x86_64
sh-4.2# uname -a
Linux wewang-623-rwwrt-rhel-0 3.10.0-1160.31.1.el7.x86_64 #1 SMP Wed May 26 20:18:08 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:

Steps to Reproduce:
1. Set up an OVN cluster with a RHEL 7.9 worker.
2. From a debug shell on the worker (chroot /host), run: curl https://172.30.0.1:443 --connect-timeout 4
3. Run the same curl against one of the kubernetes endpoint IPs, e.g. https://172.31.249.123:6443.

Actual results:
curl to the kubernetes service IP 172.30.0.1:443 from the RHEL 7.9 worker times out.

Expected results:
The kubernetes service IP is reachable from the RHEL 7.9 worker, just like the backend endpoint IPs.

Additional info:
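A quick way to confirm that only the kubernetes service VIP is affected is to probe the VIP, its apiserver endpoints, and an ordinary ClusterIP service from the worker's host shell. A minimal sketch, assuming a root shell on the worker after `chroot /host` (the probe helper name and the 4-second timeout are illustrative choices, not from the report):

# Hypothetical helper; reports reachability only.
probe() {
  # curl exits non-zero on a connect timeout; -k skips certificate
  # verification and -o /dev/null discards the body, so a 403 from the
  # apiserver still counts as "reachable".
  if curl -sk --connect-timeout 4 -o /dev/null "https://$1"; then
    echo "$1 reachable"
  else
    echo "$1 NOT reachable"
  fi
}

probe 172.30.0.1:443            # kubernetes service VIP (times out on the bad worker)
for ep in 172.31.249.123:6443 172.31.249.212:6443 172.31.249.224:6443; do
  probe "$ep"                   # apiserver endpoints behind the VIP (these work)
done
probe 172.30.186.168:27443      # ordinary ClusterIP service for comparison (works)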
The problem looks to be that the return SYN/ACK packet is getting dropped during an upcall to vswitchd. I set up a server on the master node and then curl'ed it from the pod on the RHEL node. In the OVS logs we can see the packet being upcalled:

Jun 24 10:16:37 wewang-623-rwwrt-rhel-0 ovs-vswitchd[1526]: ovs|00196|dpif(handler16)|DBG|system@ovs-system: action upcall:
Jun 24 10:16:37 wewang-623-rwwrt-rhel-0 ovs-vswitchd[1526]: recirc_id(0x22),dp_hash(0),skb_priority(0),in_port(1),skb_mark(0),ct_state(0x2a),ct_zone(0xfa00),ct_mark(0),ct_label(0),ct_tuple4(src=172.31.249.173,dst=172.31.249.212,proto=6,tp_src=59310,tp_dst=1337),eth(src=0:50:56:ac:65:d9,dst=00:50:56:ac:e5:24),eth_type(0x0800),ipv4(src=172.31.249.212,dst=172.31.249.173,proto=6,tos=0,ttl=64,frag=no),tcp(src=1337,dst=59310),tcp_flags(syn|ack)

Then if we look at the dpctl flows for recirc_id 0x22:

recirc_id(0x22),in_port(1),ct_state(-new+est-rel+rpl-inv+trk),ct_label(0/0x3),eth(src=00:50:56:ac:65:d9,dst=00:50:56:ac:e5:24),eth_type(0x0800),ipv4(src=172.31.249.192/255.255.255.224,dst=172.31.249.173,proto=6,ttl=64,frag=no), packets:3789, bytes:280386, used:0.236s, flags:S., actions:userspace(pid=4294963116,slow_path(action))
recirc_id(0x22),in_port(1),ct_state(-new+est-rel+rpl-inv+trk),ct_label(0/0x3),eth(src=00:50:56:ac:67:86,dst=00:50:56:ac:e5:24),eth_type(0x0800),ipv4(src=172.31.249.0/255.255.255.128,dst=172.31.249.173,proto=6,ttl=64,frag=no), packets:3479, bytes:257446, used:0.734s, flags:S., actions:userspace(pid=4294963116,slow_path(action))
recirc_id(0x22),in_port(1),ct_state(-new+est-rel+rpl-inv+trk),ct_label(0/0x3),eth(src=00:50:56:ac:49:85,dst=00:50:56:ac:e5:24),eth_type(0x0800),ipv4(src=172.31.249.224/255.255.255.240,dst=172.31.249.173,proto=6,ttl=64,frag=no), packets:374, bytes:27676, used:1.488s, flags:S., actions:userspace(pid=4294963116,slow_path(action))

I'm wondering if this is due to the check packet length (check_pkt_len) action, and related to https://bugzilla.redhat.com/show_bug.cgi?id=1961506.

We lost the test cluster, so I was unable to try a workaround. Could you please retry with https://github.com/openshift/ovn-kubernetes/pull/584? Thanks.
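For reference, the OVS state quoted above can be collected on the worker roughly as follows. A sketch assuming a root shell on the node; note that the recirculation id (0x22 here) will differ from run to run:

# Enable debug logging for the dpif module so action upcalls like the
# ones above show up in the ovs-vswitchd log:
ovs-appctl vlog/set dpif:file:dbg

# Dump the kernel datapath flows; -m prints the full match including
# ct_state/ct_label, and the grep narrows it to the recirc id seen in
# the upcall:
ovs-appctl dpctl/dump-flows -m | grep 'recirc_id(0x22)'

# Show per-handler upcall statistics to confirm packets are being sent
# to userspace:
ovs-appctl upcall/show

# Restore the default log level afterwards:
ovs-appctl vlog/set dpif:file:info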
Checked on cluster 4.8.0-0.nightly-2021-06-24-222938 with PR 584 merged; it works well.

$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.0-0.nightly-2021-06-24-222938   True        False         False      55m
baremetal                                  4.8.0-0.nightly-2021-06-24-222938   True        False         False      85m
cloud-credential                           4.8.0-0.nightly-2021-06-24-222938   True        False         False      94m
cluster-autoscaler                         4.8.0-0.nightly-2021-06-24-222938   True        False         False      89m
config-operator                            4.8.0-0.nightly-2021-06-24-222938   True        False         False      90m
console                                    4.8.0-0.nightly-2021-06-24-222938   True        False         False      36m
csi-snapshot-controller                    4.8.0-0.nightly-2021-06-24-222938   True        False         False      89m
dns                                        4.8.0-0.nightly-2021-06-24-222938   True        False         False      85m
etcd                                       4.8.0-0.nightly-2021-06-24-222938   True        False         False      89m
image-registry                             4.8.0-0.nightly-2021-06-24-222938   True        False         False      83m
ingress                                    4.8.0-0.nightly-2021-06-24-222938   True        False         False      80m
insights                                   4.8.0-0.nightly-2021-06-24-222938   True        False         False      84m
kube-apiserver                             4.8.0-0.nightly-2021-06-24-222938   True        False         False      86m
kube-controller-manager                    4.8.0-0.nightly-2021-06-24-222938   True        False         False      87m
kube-scheduler                             4.8.0-0.nightly-2021-06-24-222938   True        False         False      87m
kube-storage-version-migrator              4.8.0-0.nightly-2021-06-24-222938   True        False         False      90m
machine-api                                4.8.0-0.nightly-2021-06-24-222938   True        False         False      86m
machine-approver                           4.8.0-0.nightly-2021-06-24-222938   True        False         False      89m
machine-config                             4.8.0-0.nightly-2021-06-24-222938   True        False         False      89m
marketplace                                4.8.0-0.nightly-2021-06-24-222938   True        False         False      89m
monitoring                                 4.8.0-0.nightly-2021-06-24-222938   True        False         False      80m
network                                    4.8.0-0.nightly-2021-06-24-222938   True        False         False      90m
node-tuning                                4.8.0-0.nightly-2021-06-24-222938   True        False         False      90m
openshift-apiserver                        4.8.0-0.nightly-2021-06-24-222938   True        False         False      80m
openshift-controller-manager               4.8.0-0.nightly-2021-06-24-222938   True        False         False      89m
openshift-samples                          4.8.0-0.nightly-2021-06-24-222938   True        False         False      85m
operator-lifecycle-manager                 4.8.0-0.nightly-2021-06-24-222938   True        False         False      89m
operator-lifecycle-manager-catalog         4.8.0-0.nightly-2021-06-24-222938   True        False         False      89m
operator-lifecycle-manager-packageserver   4.8.0-0.nightly-2021-06-24-222938   True        False         False      86m
service-ca                                 4.8.0-0.nightly-2021-06-24-222938   True        False         False      90m
storage                                    4.8.0-0.nightly-2021-06-24-222938   True        False         False      90m

$ oc rsh hello-8blgk
/ # curl https://172.30.0.1:443
curl: (60) SSL certificate problem: self signed certificate in certificate chain
More details here: https://curl.haxx.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.
/ # curl https://172.30.0.1:443 -k
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {
  },
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "details": {
  },
  "code": 403
}

Moving this bug to verified.
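For completeness, the same VIP can also be queried with the pod's service account token instead of anonymously, which avoids the 403. A minimal sketch using the standard in-pod service account mounts (the /version path is just a convenient endpoint to hit, not part of the original verification):

# Run inside any pod; these paths are mounted by default for the
# pod's service account.
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
CACERT=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt

# A working service VIP returns the apiserver's version JSON instead
# of timing out:
curl -s --cacert "$CACERT" -H "Authorization: Bearer $TOKEN" https://172.30.0.1:443/version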
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438