Bug 1281509 - Flapping connection to kubes port 8443
Status: NEW
Product: OpenShift Container Platform
Classification: Red Hat
Component: Pod
Version: 3.1.0
Hardware: x86_64 Mac OS
Priority: medium Severity: low
Assigned To: Eric Paris
QA Contact: Jianwei Hou
Depends On:
Blocks:
Reported: 2015-11-12 11:19 EST by Jaroslav Henner
Modified: 2016-01-11 13:19 EST

Doc Type: Bug Fix
Type: Bug


Attachments
log (116.75 KB, text/x-vhdl)
2015-11-12 11:19 EST, Jaroslav Henner
strace for the transition between OK and Connection refused state (103.58 KB, application/x-xz)
2015-11-13 07:06 EST, Jaroslav Henner
strace sampling with higher frequency (197.87 KB, application/x-xz)
2015-11-13 07:54 EST, Jaroslav Henner

Description Jaroslav Henner 2015-11-12 11:19:27 EST
Created attachment 1093387 [details]
log

Description of problem:
I am getting periodic "Connection refused" errors when connecting to port 8443 (the Kubernetes/OpenShift API server), regardless of whether I use curl or oc.


watch -n1 curl -k https://atomic-experiment-1.novalocal:4001
curl: (7) Failed connect to atomic-experiment-1.novalocal:4001; Connection refused

oc status
The connection to the server atomic-experiment-1.novalocal:8443 was refused - did you specify the right host or port?


 ss -ptnl
State      Recv-Q Send-Q   Local Address:Port                  Peer Address:Port              
LISTEN     0      128                  *:53                               *:*                  
LISTEN     0      128                  *:22                               *:*                  
LISTEN     0      100          127.0.0.1:25                               *:*                  
LISTEN     0      128                  *:8443                             *:*                  
LISTEN     0      128                 :::22                              :::*                  
LISTEN     0      128                 :::7001                            :::*                  
LISTEN     0      100                ::1:25                              :::*                  
LISTEN     0      128                 :::4001                            :::*   


There seems to be no interesting information in the logs. I don't know how to proceed with debugging.
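
One possible next step is to check, per port, which local addresses actually have a listening socket. The sketch below is not part of the original report; `listeners_for_port` is a hypothetical helper name. It assumes the column layout of the `ss -tnl` output shown above, where the fourth field is Local Address:Port.

```shell
# listeners_for_port PORT
# Reads `ss -tnl` output on stdin and prints the local address of every
# LISTEN socket bound to PORT (hypothetical helper, for illustration only).
listeners_for_port() {
    awk -v port="$1" '
        $1 == "LISTEN" {
            # Field 4 is Local Address:Port; split it at the last colon so
            # that IPv6 addresses such as :: or ::1 stay intact.
            n = match($4, /:[0-9]+$/)
            if (n && substr($4, n + 1) == port) print substr($4, 1, n - 1)
        }'
}
```

For example, `ss -tnl | listeners_for_port 4001`. Applied to the listing above, it prints `::` for port 4001 and `*` for port 8443.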


Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux Server release 7.2 (Maipo)
atomic-openshift.x86_64              3.1.0.4-1.git.2.c5fa845.el7aos  @ose-devel 
atomic-openshift-clients.x86_64      3.1.0.4-1.git.2.c5fa845.el7aos  @ose-devel 
atomic-openshift-master.x86_64       3.1.0.4-1.git.2.c5fa845.el7aos  @ose-devel 



How reproducible:
always


Steps to Reproduce:
1. Deploy with openshift-ansible on RHEL 7.2.

Actual results:
The deployment fails at some point, depending on timing.

Expected results:
The deployment succeeds.


Additional info:
Comment 1 Jaroslav Henner 2015-11-12 11:20:31 EST
Note that I have manually set SELinux to enforcing mode on that machine.
Comment 2 Paul Weil 2015-11-12 11:26:31 EST
When curling, are you having the problem on the master itself, or is this a connection coming from outside the master?  That might help narrow down the possibilities.
Comment 3 Jaroslav Henner 2015-11-12 18:30:50 EST
(In reply to Paul Weil from comment #2)
> When curling, are you having the problem on the master itself, or is this a
> connection coming from outside the master?  That might help narrow down the
> possibilities.

It is on the master itself.

[cloud-user@atomic-experiment-1 ~]$ curl -k 'https://atomic-experiment-1.novalocal:4001/v2/keys/kubernetes.io/minions?quorum=false&recursive=true&sorted=true'
curl: (7) Failed connect to atomic-experiment-1.novalocal:4001; Connection refused
[cloud-user@atomic-experiment-1 ~]$ sudo iptables -S
-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT
-N OS_FIREWALL_ALLOW
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -p tcp -m state --state NEW -m tcp --dport 22 -j ACCEPT
-A INPUT -j OS_FIREWALL_ALLOW
-A INPUT -j REJECT --reject-with icmp-host-prohibited
-A FORWARD -j REJECT --reject-with icmp-host-prohibited
-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 4001 -j ACCEPT
-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 8443 -j ACCEPT
-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 53 -j ACCEPT
-A OS_FIREWALL_ALLOW -p udp -m state --state NEW -m udp --dport 53 -j ACCEPT
-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 24224 -j ACCEPT
-A OS_FIREWALL_ALLOW -p udp -m state --state NEW -m udp --dport 24224 -j ACCEPT
-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 2224 -j ACCEPT
-A OS_FIREWALL_ALLOW -p udp -m state --state NEW -m udp --dport 5404 -j ACCEPT
-A OS_FIREWALL_ALLOW -p udp -m state --state NEW -m udp --dport 5405 -j ACCEPT

[cloud-user@atomic-experiment-1 ~]$ ping atomic-experiment-1.novalocal
PING atomic-experiment-1.novalocal (172.16.80.5) 56(84) bytes of data.
64 bytes from atomic-experiment-1.novalocal (172.16.80.5): icmp_seq=1 ttl=64 time=0.040 ms

[cloud-user@atomic-experiment-1 ~]$ ip a
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether fa:16:3e:47:59:04 brd ff:ff:ff:ff:ff:ff
    inet 172.16.80.5/24 brd 172.16.80.255 scope global dynamic eth0
       valid_lft 95sec preferred_lft 95sec

[cloud-user@atomic-experiment-1 ~]$ cat /etc/resolv.conf 
# Generated by NetworkManager
search novalocal
nameserver 172.16.80.1
Comment 4 Jaroslav Henner 2015-11-13 07:06 EST
Created attachment 1093596 [details]
strace for the transition between OK and Connection refused state

[cloud-user@atomic-experiment-1 ~]$ while true; do curl -k 'https://atomic-experiment-1.novalocal:4001/v2/keys/kubernetes.io/minions?quorum=false&recursive=true&sorted=true'; date --rfc-3339=ns; sleep 1; done
curl: (7) Failed connect to atomic-experiment-1.novalocal:4001; Connection refused
2015-11-13 07:03:48.282369199-05:00
curl: (7) Failed connect to atomic-experiment-1.novalocal:4001; Connection refused
2015-11-13 07:03:49.295402188-05:00

...

curl: (7) Failed connect to atomic-experiment-1.novalocal:4001; Connection refused
2015-11-13 07:04:27.043723554-05:00
curl: (7) Failed connect to atomic-experiment-1.novalocal:4001; Connection refused
2015-11-13 07:04:28.070231983-05:00
curl: (58) NSS: client certificate not found (nickname not specified)
2015-11-13 07:04:29.558434241-05:00
curl: (58) NSS: client certificate not found (nickname not specified)
2015-11-13 07:04:30.850206177-05:00

...

curl: (58) NSS: client certificate not found (nickname not specified)
2015-11-13 07:04:39.755396698-05:00
curl: (58) NSS: client certificate not found (nickname not specified)
2015-11-13 07:04:41.085865080-05:00
curl: (7) Failed connect to atomic-experiment-1.novalocal:4001; Connection refused
2015-11-13 07:04:42.103618837-05:00
curl: (7) Failed connect to atomic-experiment-1.novalocal:4001; Connection refused
2015-11-13 07:04:43.128931207-05:00
...
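
To make the flapping period easier to read, the loop output above can be collapsed to one line per state change. This is a sketch that is not part of the original comment; `curl_transitions` is a hypothetical helper. It assumes the exact output format shown above: an optional `curl: (N) ...` error line followed by the timestamp printed by `date --rfc-3339=ns`. A sample with no error line is reported as `ok`.

```shell
# curl_transitions
# Reads the probe loop output on stdin and prints "<date> <time> <state>"
# each time the state changes. The state is the curl exit code taken from
# the "curl: (N)" line, or "ok" when no error line preceded the timestamp.
curl_transitions() {
    awk '
        BEGIN { state = "ok"; last = "" }
        /^curl: \([0-9]+\)/ {
            state = $2               # e.g. "(7)"
            gsub(/[()]/, "", state)  # strip the parentheses
            next
        }
        /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] / {
            if (state != last) { print $1, $2, state; last = state }
            state = "ok"             # reset for the next sample
        }'
}
```

Fed the excerpt above, and assuming the elided lines did not change state, it reports code 7 from 07:03:48, code 58 from 07:04:29, and code 7 again from 07:04:42. Code 58 means the TCP connection succeeded and the TLS handshake asked for a client certificate, so in this excerpt the port accepted connections for roughly 12 to 13 seconds.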
Comment 5 Jaroslav Henner 2015-11-13 07:54 EST
Created attachment 1093632 [details]
strace sampling with higher frequency

Another strace log, with some syscalls filtered out and a much higher curl polling frequency.

strace -e'!clock_gettime' -e '!futex,epoll_wait,clock_gettime,select'  -ttfFp 23143 2>&1 | tee strace

while true; do curl -k 'https://atomic-experiment-1.novalocal:4001/v2/keys/kubernetes.io/minions?quorum=false&recursive=true&sorted=true'; date --rfc-3339=ns; done
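
A plain TCP probe can take TLS out of the picture when sampling at high frequency, so that only the open/refused state of the port is recorded. The following is a sketch under assumptions, not the method used for the attachment: `tcp_state` and `watch_tcp` are hypothetical helper names, and the probe relies on bash's `/dev/tcp` redirection plus GNU `timeout`, `sleep`, and `date`.

```shell
# tcp_state HOST PORT
# Prints "open" if a TCP connection can be established within 2 seconds,
# "closed" otherwise (refused, filtered, or timed out).
tcp_state() {
    if timeout 2 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null; then
        echo open
    else
        echo closed
    fi
}

# watch_tcp HOST PORT [INTERVAL_SECONDS]
# Polls the port forever and prints a timestamped line only when the
# state changes. Stop it with Ctrl-C.
watch_tcp() {
    local host="$1" port="$2" interval="${3:-0.2}" last="" now
    while true; do
        now="$(tcp_state "$host" "$port")"
        if [ "$now" != "$last" ]; then
            echo "$(date --rfc-3339=ns) $now"
            last="$now"
        fi
        sleep "$interval"
    done
}
```

For example, `watch_tcp atomic-experiment-1.novalocal 4001` could run alongside the strace so that the timestamps of each transition can be matched against the syscall log.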
Comment 6 Jaroslav Henner 2015-11-16 12:50:04 EST
I have two deployments on QEOS: one with the problem and one without. One difference between them is that on the problematic one I changed the instance names after the nova boot.

Anyway, I will try to compare the two deployments to get more information.
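
One way to start such a comparison is to dump a few host facts on each machine and diff the results. This is only a sketch: `host_facts` is a hypothetical helper, and the facts chosen (node name, what that name resolves to, SELinux mode) are assumptions based on the differences mentioned in this bug, not a complete list.

```shell
# host_facts
# Prints a few key=value lines describing the host, suitable for diffing
# between a working and a problematic deployment.
host_facts() {
    local name
    name="$(uname -n)"
    echo "nodename=$name"
    # Addresses the node name resolves to (empty if it does not resolve).
    echo "resolves_to=$(getent hosts "$name" 2>/dev/null | awk '{print $1}' | sort -u | paste -sd, -)"
    # SELinux mode, or "unknown" if getenforce is not available.
    echo "selinux=$(getenforce 2>/dev/null || echo unknown)"
}
```

Running `host_facts > facts.txt` on both masters and then diffing the two files would show whether the renamed instance resolves its own name differently from the working one.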
Comment 7 Eric Paris 2015-12-17 12:33:55 EST
I apologize for the long delay in response. Are you still having problems? Were you able to find any more differences between the two deployments?
