Bug 1281509

Summary: Flapping connection to kubes port 8443
Product: OpenShift Container Platform
Reporter: Jaroslav Henner <jhenner>
Component: Node
Assignee: Eric Paris <eparis>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Jianwei Hou <jhou>
Severity: low
Docs Contact:
Priority: medium
Version: 3.1.0
CC: aos-bugs, jokerman, mmccomas
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Mac OS
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-12-18 15:19:50 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description: log; Flags: none
Description: strace for the transition between OK and Connection refused state; Flags: none
Description: strace sampling with higher frequency; Flags: none

Description Jaroslav Henner 2015-11-12 16:19:27 UTC
Created attachment 1093387 [details]
log

Description of problem:
I am getting periodic connection resets when trying to connect to port 8443 (the Kubernetes/OpenShift API server), regardless of whether I use curl or oc.


watch -n1 curl -k https://atomic-experiment-1.novalocal:4001
curl: (7) Failed connect to atomic-experiment-1.novalocal:4001; Connection refused

oc status
The connection to the server atomic-experiment-1.novalocal:8443 was refused - did you specify the right host or port?


 ss -ptnl
State      Recv-Q Send-Q   Local Address:Port                  Peer Address:Port              
LISTEN     0      128                  *:53                               *:*                  
LISTEN     0      128                  *:22                               *:*                  
LISTEN     0      100          127.0.0.1:25                               *:*                  
LISTEN     0      128                  *:8443                             *:*                  
LISTEN     0      128                 :::22                              :::*                  
LISTEN     0      128                 :::7001                            :::*                  
LISTEN     0      100                ::1:25                              :::*                  
LISTEN     0      128                 :::4001                            :::*   


There seems to be no interesting information in the logs. I don't know how to proceed with debugging.
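
One possible next step, sketched here in Python as an illustration rather than anything taken from this report, is to probe the port at the TCP level, without TLS or curl. That would separate a missing listener (refused) from a filtered or stalled one (timeout). The function name `probe` and the one-second timeout are assumptions made for the example.

```python
import errno
import socket


def probe(host, port, timeout=1.0):
    """Make one TCP connect attempt and classify the outcome.

    Returns 'open', 'refused', 'timeout', or 'error:<errno name>'.
    Hypothetical helper for illustration; not part of OpenShift or curl.
    """
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Nothing accepted the connection on that address and port.
        return "refused"
    except socket.timeout:
        # No answer within the timeout, for example because packets were dropped.
        return "timeout"
    except OSError as exc:
        # Any other failure, such as a name resolution error.
        return "error:" + errno.errorcode.get(exc.errno, str(exc.errno))
```

Calling `probe("atomic-experiment-1.novalocal", 8443)` and `probe("atomic-experiment-1.novalocal", 4001)` once a second and printing each result with a timestamp would show whether the two ports flap together.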


Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux Server release 7.2 (Maipo)
atomic-openshift.x86_64              3.1.0.4-1.git.2.c5fa845.el7aos  @ose-devel 
atomic-openshift-clients.x86_64      3.1.0.4-1.git.2.c5fa845.el7aos  @ose-devel 
atomic-openshift-master.x86_64       3.1.0.4-1.git.2.c5fa845.el7aos  @ose-devel 



How reproducible:
always


Steps to Reproduce:
1. deploy with openshift-ansible on RHEL 7.2
2. 
3.

Actual results:
The deploy fails at some point, depending on timing.

Expected results:
The deploy succeeds.


Additional info:

Comment 1 Jaroslav Henner 2015-11-12 16:20:31 UTC
Note that I have manually set SELinux to enforcing mode on that machine.

Comment 2 Paul Weil 2015-11-12 16:26:31 UTC
When curling, are you having the problem on the master itself, or is the connection coming from outside the master? That might help narrow down the possibilities.

Comment 3 Jaroslav Henner 2015-11-12 23:30:50 UTC
(In reply to Paul Weil from comment #2)
> When curling are you having problem on the master itself or is this a
> connection coming from outside the master?  That might help narrow down the
> possibilities.

It is the master itself.

[cloud-user@atomic-experiment-1 ~]$ curl -k 'https://atomic-experiment-1.novalocal:4001/v2/keys/kubernetes.io/minions?quorum=false&recursive=true&sorted=true'
curl: (7) Failed connect to atomic-experiment-1.novalocal:4001; Connection refused
[cloud-user@atomic-experiment-1 ~]$ sudo iptables -S
-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT
-N OS_FIREWALL_ALLOW
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -p tcp -m state --state NEW -m tcp --dport 22 -j ACCEPT
-A INPUT -j OS_FIREWALL_ALLOW
-A INPUT -j REJECT --reject-with icmp-host-prohibited
-A FORWARD -j REJECT --reject-with icmp-host-prohibited
-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 4001 -j ACCEPT
-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 8443 -j ACCEPT
-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 53 -j ACCEPT
-A OS_FIREWALL_ALLOW -p udp -m state --state NEW -m udp --dport 53 -j ACCEPT
-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 24224 -j ACCEPT
-A OS_FIREWALL_ALLOW -p udp -m state --state NEW -m udp --dport 24224 -j ACCEPT
-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 2224 -j ACCEPT
-A OS_FIREWALL_ALLOW -p udp -m state --state NEW -m udp --dport 5404 -j ACCEPT
-A OS_FIREWALL_ALLOW -p udp -m state --state NEW -m udp --dport 5405 -j ACCEPT

[cloud-user@atomic-experiment-1 ~]$ ping atomic-experiment-1.novalocal
PING atomic-experiment-1.novalocal (172.16.80.5) 56(84) bytes of data.
64 bytes from atomic-experiment-1.novalocal (172.16.80.5): icmp_seq=1 ttl=64 time=0.040 ms

[cloud-user@atomic-experiment-1 ~]$ ip a
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether fa:16:3e:47:59:04 brd ff:ff:ff:ff:ff:ff
    inet 172.16.80.5/24 brd 172.16.80.255 scope global dynamic eth0
       valid_lft 95sec preferred_lft 95sec

[cloud-user@atomic-experiment-1 ~]$ cat /etc/resolv.conf 
# Generated by NetworkManager
search novalocal
nameserver 172.16.80.1

Comment 4 Jaroslav Henner 2015-11-13 12:06:29 UTC
Created attachment 1093596 [details]
strace for the transition between OK and Connection refused state

[cloud-user@atomic-experiment-1 ~]$ while true; do curl -k 'https://atomic-experiment-1.novalocal:4001/v2/keys/kubernetes.io/minions?quorum=false&recursive=true&sorted=true'; date --rfc-3339=ns; sleep 1; done
curl: (7) Failed connect to atomic-experiment-1.novalocal:4001; Connection refused
2015-11-13 07:03:48.282369199-05:00
curl: (7) Failed connect to atomic-experiment-1.novalocal:4001; Connection refused
2015-11-13 07:03:49.295402188-05:00

...

curl: (7) Failed connect to atomic-experiment-1.novalocal:4001; Connection refused
2015-11-13 07:04:27.043723554-05:00
curl: (7) Failed connect to atomic-experiment-1.novalocal:4001; Connection refused
2015-11-13 07:04:28.070231983-05:00
curl: (58) NSS: client certificate not found (nickname not specified)
2015-11-13 07:04:29.558434241-05:00
curl: (58) NSS: client certificate not found (nickname not specified)
2015-11-13 07:04:30.850206177-05:00

...

curl: (58) NSS: client certificate not found (nickname not specified)
2015-11-13 07:04:39.755396698-05:00
curl: (58) NSS: client certificate not found (nickname not specified)
2015-11-13 07:04:41.085865080-05:00
curl: (7) Failed connect to atomic-experiment-1.novalocal:4001; Connection refused
2015-11-13 07:04:42.103618837-05:00
curl: (7) Failed connect to atomic-experiment-1.novalocal:4001; Connection refused
2015-11-13 07:04:43.128931207-05:00
...
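
The loop output above can be reduced to its state changes. The following Python sketch is an illustration, not part of the original report: it pairs each `curl: (N)` line with the timestamp printed after it and keeps only the samples where the exit code changes. `parse_samples` and `transitions` are hypothetical names, and successful requests (which print a response body rather than a `curl:` line) are ignored. In curl, exit code 7 means the connection could not be established, while 58 is a problem with the local client certificate, which implies that the TCP connection itself succeeded.

```python
import re
from datetime import datetime

# "curl: (7) Failed connect to ..." -> capture the exit code.
CURL_RE = re.compile(r"^curl: \((\d+)\)")
# Output of `date --rfc-3339=ns`, e.g. 2015-11-13 07:03:48.282369199-05:00
TS_RE = re.compile(r"^(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d)\.(\d+)([+-]\d\d:\d\d)$")


def parse_samples(text):
    """Yield (timestamp, curl_exit_code) for each curl error line followed by a timestamp."""
    code = None
    for line in text.splitlines():
        line = line.strip()
        match = CURL_RE.match(line)
        if match:
            code = int(match.group(1))
            continue
        match = TS_RE.match(line)
        if match and code is not None:
            base, frac, tz = match.groups()
            # datetime stores microseconds, so nanoseconds are truncated to 6 digits.
            stamp = datetime.strptime(
                f"{base}.{frac[:6]}{tz}", "%Y-%m-%d %H:%M:%S.%f%z"
            )
            yield stamp, code
            code = None


def transitions(samples):
    """Keep only the first sample of each run of identical exit codes."""
    runs = []
    for stamp, code in samples:
        if not runs or runs[-1][1] != code:
            runs.append((stamp, code))
    return runs
```

Applied to the lines shown in this comment, `transitions(parse_samples(output))` yields three runs: code 7 from 07:03:48, code 58 from 07:04:29, and code 7 again from 07:04:42.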

Comment 5 Jaroslav Henner 2015-11-13 12:54:15 UTC
Created attachment 1093632 [details]
strace sampling with higher frequency

Another strace log, with some calls filtered out and with curl run at a much higher sampling frequency.

strace -e'!clock_gettime' -e '!futex,epoll_wait,clock_gettime,select'  -ttfFp 23143 2>&1 | tee strace

while true; do curl -k 'https://atomic-experiment-1.novalocal:4001/v2/keys/kubernetes.io/minions?quorum=false&recursive=true&sorted=true'; date --rfc-3339=ns; done

Comment 6 Jaroslav Henner 2015-11-16 17:50:04 UTC
I have two deployments on QEOS, one with the problem and one without. One difference between them is that, on the problematic one, I changed the instance names after the nova boot.

Anyway, I will try to compare the two deployments to get more information.

Comment 7 Eric Paris 2015-12-17 17:33:55 UTC
I apologize for the long delay in response. Are you still having problems? Were you able to find any more differences between the two deployments?

Comment 8 Eric Paris 2017-12-18 15:19:50 UTC
I am closing this BZ due to inactivity. If you are still having problems, please reach out to support!