Bug 1975155 - Kubernetes service IP cannot be accessed for rhel worker
Summary: Kubernetes service IP cannot be accessed for rhel worker
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.8.0
Assignee: Andrew Stoycos
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-06-23 08:15 UTC by zhaozhanqi
Modified: 2021-07-27 23:13 UTC
CC: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 23:13:39 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Product Errata RHSA-2021:2438 (last updated 2021-07-27 23:13:53 UTC)

Description zhaozhanqi 2021-06-23 08:15:13 UTC
Description of problem:

Set up a cluster with the OVN plugin and two rhel7.9 workers. On the rhel7.9 workers, the kubernetes service IP https://172.30.0.1:443 cannot be accessed, although the kubernetes backend endpoint IPs can be reached directly. See:

$ oc get svc
NAME               TYPE           CLUSTER-IP       EXTERNAL-IP                            PORT(S)     AGE
kubernetes         ClusterIP      172.30.0.1       <none>                                 443/TCP     6h12m
openshift          ExternalName   <none>           kubernetes.default.svc.cluster.local   <none>      5h53m
service-secure     ClusterIP      172.30.186.168   <none>                                 27443/TCP   73m
service-unsecure   ClusterIP      172.30.15.244    <none>                                 27017/TCP   73m
$ oc get ep 
NAME               ENDPOINTS                                                     AGE
kubernetes         172.31.249.123:6443,172.31.249.212:6443,172.31.249.224:6443   6h12m
service-secure     10.129.2.12:8443,10.130.2.14:8443                             74m
service-unsecure   10.129.2.12:8080,10.130.2.14:8080                             74m


$ oc debug node/wewang-623-rwwrt-rhel-0
Creating debug namespace/openshift-debug-node-lnzdw ...
Starting pod/wewang-623-rwwrt-rhel-0-debug ...
To use host binaries, run `chroot /host`
Pod IP: 172.31.249.173
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.2# curl https://172.30.0.1:443                             
^C
sh-4.2# curl https://172.30.0.1:443 --connect-timeout 4    ---> service IP cannot be accessed 
curl: (28) Connection timed out after 4000 milliseconds


# curl https://172.31.249.123:6443 --connect-timeout 4 -k  ----> backend works
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {
    
  },
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "details": {
    
  },
  "code": 403


It seems that only the kubernetes service IP 172.30.0.1 cannot be accessed on the rhel7.9 worker. I created a test pod and service, and that service IP works fine:

$ oc debug node/wewang-623-rwwrt-rhel-0
Creating debug namespace/openshift-debug-node-kg9m4 ...
Starting pod/wewang-623-rwwrt-rhel-0-debug ...
To use host binaries, run `chroot /host`
Pod IP: 172.31.249.173
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.2# curl https://172.30.186.168:27443 -k
Hello-OpenShift web-server-rc-hbf7h https-8443 default
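
(Note: this report does not show how the test pod and service were created. A minimal sketch of an equivalent setup, with a placeholder pod/service name and the commonly used hello-openshift image, both assumptions here and not taken from this bug, would be:

$ oc run test-pod --image=docker.io/openshift/hello-openshift --port=8080      ---> pod name and image are assumptions
$ oc expose pod test-pod --name=test-svc --port=27017 --target-port=8080       ---> expose the pod as a ClusterIP service
$ oc get svc test-svc                                                          ---> note the CLUSTER-IP, then curl it from the rhel worker
)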

Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-06-22-145219

openvswitch and kernel versions on the rhel7.9 worker:
rpm -qa | grep openv
openvswitch-selinux-extra-policy-1.0-17.el7fdp.noarch
openvswitch2.13-2.13.0-95.el7fdp.x86_64

sh-4.2# uname -a
Linux wewang-623-rwwrt-rhel-0 3.10.0-1160.31.1.el7.x86_64 #1 SMP Wed May 26 20:18:08 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux


How reproducible:


Steps to Reproduce:
1. Set up an OVN cluster with rhel7.9 workers.
2. On a rhel7.9 worker, curl the kubernetes service IP: curl https://172.30.0.1:443 --connect-timeout 4
3. For comparison, curl one of the kubernetes endpoint IPs directly: curl https://172.31.249.123:6443 --connect-timeout 4 -k

Actual results:
The curl to the kubernetes service IP 172.30.0.1 times out on the rhel7.9 worker, while the backend endpoint IPs respond.

Expected results:
The kubernetes service IP is reachable from rhel7.9 workers, the same as other service IPs.

Additional info:

Comment 8 Tim Rozet 2021-06-24 15:27:53 UTC
The problem looks to be that the return SYN/ACK packet is getting dropped during an upcall to vswitchd. I set up a server on the master node and then curled it from the pod on the rhel node. In the OVS logs we can see the packet get upcalled:


Jun 24 10:16:37 wewang-623-rwwrt-rhel-0 ovs-vswitchd[1526]: ovs|00196|dpif(handler16)|DBG|system@ovs-system: action upcall:
Jun 24 10:16:37 wewang-623-rwwrt-rhel-0 ovs-vswitchd[1526]: recirc_id(0x22),dp_hash(0),skb_priority(0),in_port(1),skb_mark(0),ct_state(0x2a),ct_zone(0xfa00),ct_mark(0),ct_label(0),ct_tuple4(src=172.31.249.173,dst=172.31.249.212,proto=6,tp_src=59310,tp_dst=1337),eth(src=0:50:56:ac:65:d9,dst=00:50:56:ac:e5:24),eth_type(0x0800),ipv4(src=172.31.249.212,dst=172.31.249.173,proto=6,tos=0,ttl=64,frag=no),tcp(src=1337,dst=59310),tcp_flags(syn|ack)
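
(The dpif DBG messages above are not logged by default; presumably they were enabled on the node with something along these lines, the exact invocation is an assumption:

sh-4.2# ovs-appctl vlog/set dpif:dbg          ---> raise the dpif module to debug level so upcalls are logged
sh-4.2# ovs-appctl vlog/list | grep dpif      ---> confirm the new level
)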


Then if we look at the dpctl flows for recirc_id 0x22:

recirc_id(0x22),in_port(1),ct_state(-new+est-rel+rpl-inv+trk),ct_label(0/0x3),eth(src=00:50:56:ac:65:d9,dst=00:50:56:ac:e5:24),eth_type(0x0800),ipv4(src=172.31.249.192/255.255.255.224,dst=172.31.249.173,proto=6,ttl=64,frag=no), packets:3789, bytes:280386, used:0.236s, flags:S., actions:userspace(pid=4294963116,slow_path(action))
recirc_id(0x22),in_port(1),ct_state(-new+est-rel+rpl-inv+trk),ct_label(0/0x3),eth(src=00:50:56:ac:67:86,dst=00:50:56:ac:e5:24),eth_type(0x0800),ipv4(src=172.31.249.0/255.255.255.128,dst=172.31.249.173,proto=6,ttl=64,frag=no), packets:3479, bytes:257446, used:0.734s, flags:S., actions:userspace(pid=4294963116,slow_path(action))
recirc_id(0x22),in_port(1),ct_state(-new+est-rel+rpl-inv+trk),ct_label(0/0x3),eth(src=00:50:56:ac:49:85,dst=00:50:56:ac:e5:24),eth_type(0x0800),ipv4(src=172.31.249.224/255.255.255.240,dst=172.31.249.173,proto=6,ttl=64,frag=no), packets:374, bytes:27676, used:1.488s, flags:S., actions:userspace(pid=4294963116,slow_path(action))
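
(To retrace this on the node, the datapath flows can be dumped and filtered on the recirc id; the grep pattern here is just a sketch:

sh-4.2# ovs-appctl dpctl/dump-flows -m | grep 'recirc_id(0x22)'      ---> -m prints the full match, including ct_state/ct_label
)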

I'm wondering if this is due to the check_pkt_len action, and related to https://bugzilla.redhat.com/show_bug.cgi?id=1961506
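
(A quick, non-authoritative way to check whether check_pkt_len is in play on the node is to grep the br-int OpenFlow tables for that action:

sh-4.2# ovs-ofctl -O OpenFlow13 dump-flows br-int | grep -c 'check_pkt_len'      ---> count flows that use the check_pkt_len action
)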

We lost the test cluster so I was unable to try a workaround. Could you please retry with https://github.com/openshift/ovn-kubernetes/pull/584 ?

Thanks.

Comment 9 zhaozhanqi 2021-06-25 03:33:32 UTC
Checked on cluster 4.8.0-0.nightly-2021-06-24-222938 with PR 584 merged; it works well.
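
(For reference, one way to confirm which ovn-kubernetes commit a nightly payload carries; this command is a sketch and was not part of the original verification:

$ oc adm release info "$(oc get clusterversion version -o jsonpath='{.status.desired.image}')" --commits | grep ovn-kubernetes
)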


$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.0-0.nightly-2021-06-24-222938   True        False         False      55m
baremetal                                  4.8.0-0.nightly-2021-06-24-222938   True        False         False      85m
cloud-credential                           4.8.0-0.nightly-2021-06-24-222938   True        False         False      94m
cluster-autoscaler                         4.8.0-0.nightly-2021-06-24-222938   True        False         False      89m
config-operator                            4.8.0-0.nightly-2021-06-24-222938   True        False         False      90m
console                                    4.8.0-0.nightly-2021-06-24-222938   True        False         False      36m
csi-snapshot-controller                    4.8.0-0.nightly-2021-06-24-222938   True        False         False      89m
dns                                        4.8.0-0.nightly-2021-06-24-222938   True        False         False      85m
etcd                                       4.8.0-0.nightly-2021-06-24-222938   True        False         False      89m
image-registry                             4.8.0-0.nightly-2021-06-24-222938   True        False         False      83m
ingress                                    4.8.0-0.nightly-2021-06-24-222938   True        False         False      80m
insights                                   4.8.0-0.nightly-2021-06-24-222938   True        False         False      84m
kube-apiserver                             4.8.0-0.nightly-2021-06-24-222938   True        False         False      86m
kube-controller-manager                    4.8.0-0.nightly-2021-06-24-222938   True        False         False      87m
kube-scheduler                             4.8.0-0.nightly-2021-06-24-222938   True        False         False      87m
kube-storage-version-migrator              4.8.0-0.nightly-2021-06-24-222938   True        False         False      90m
machine-api                                4.8.0-0.nightly-2021-06-24-222938   True        False         False      86m
machine-approver                           4.8.0-0.nightly-2021-06-24-222938   True        False         False      89m
machine-config                             4.8.0-0.nightly-2021-06-24-222938   True        False         False      89m
marketplace                                4.8.0-0.nightly-2021-06-24-222938   True        False         False      89m
monitoring                                 4.8.0-0.nightly-2021-06-24-222938   True        False         False      80m
network                                    4.8.0-0.nightly-2021-06-24-222938   True        False         False      90m
node-tuning                                4.8.0-0.nightly-2021-06-24-222938   True        False         False      90m
openshift-apiserver                        4.8.0-0.nightly-2021-06-24-222938   True        False         False      80m
openshift-controller-manager               4.8.0-0.nightly-2021-06-24-222938   True        False         False      89m
openshift-samples                          4.8.0-0.nightly-2021-06-24-222938   True        False         False      85m
operator-lifecycle-manager                 4.8.0-0.nightly-2021-06-24-222938   True        False         False      89m
operator-lifecycle-manager-catalog         4.8.0-0.nightly-2021-06-24-222938   True        False         False      89m
operator-lifecycle-manager-packageserver   4.8.0-0.nightly-2021-06-24-222938   True        False         False      86m
service-ca                                 4.8.0-0.nightly-2021-06-24-222938   True        False         False      90m
storage                                    4.8.0-0.nightly-2021-06-24-222938   True        False         False      90m


$ oc rsh hello-8blgk
/ # curl https://172.30.0.1:443
curl: (60) SSL certificate problem: self signed certificate in certificate chain
More details here: https://curl.haxx.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.
/ # curl https://172.30.0.1:443 -k
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {
    
  },
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "details": {
    
  },
  "code": 403

Move this bug to verified.

Comment 12 errata-xmlrpc 2021-07-27 23:13:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

