Bug 1573122

Summary: [3.7] OCP on Azure - if the kubelet can't reach the Azure API - marked as NotReady
Product: OpenShift Container Platform Reporter: Paul Dwyer <pdwyer>
Component: Node Assignee: Seth Jennings <sjenning>
Status: CLOSED ERRATA QA Contact: DeShuai Ma <dma>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 3.7.0 CC: aos-bugs, avagarwa, dkinkead, dma, erich, haowang, jchaloup, jenander, jokerman, jsafrane, mfojtik, mifiedle, mmccomas, mprashad, openshift-bugs-escalate, pdwyer, sjenning, stwalter, vilibert, vwalek, wmeng, xtian
Target Milestone: ---   
Target Release: 3.7.z   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Fixes an issue where a Node can stop reporting status if the connection to the Azure API is terminated uncleanly, resulting in a long timeout before the connection is re-established and blocking the status update loop.
Story Points: ---
Clone Of: 1554748 Environment:
Last Closed: 2018-05-18 03:54:46 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1554748    
Bug Blocks:    

Comment 19 DeShuai Ma 2018-05-10 08:48:54 UTC
Verified on v3.7.46

[root@dma37-master-etcd-nfs-1 ~]# oc version
oc v3.7.46
kubernetes v1.7.6+a08f5eeb62
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://dma37-master-etcd-nfs-1:8443
openshift v3.7.46
kubernetes v1.7.6+a08f5eeb62

1. Watch the node status on the master
[root@dma37-master-etcd-nfs-1 ~]# oc get no dma37-node-registry-router-1 -w
NAME                           STATUS    AGE       VERSION
dma37-node-registry-router-1   Ready     19m       v1.7.6+a08f5eeb62
dma37-node-registry-router-1   Ready     19m       v1.7.6+a08f5eeb62
dma37-node-registry-router-1   Ready     19m       v1.7.6+a08f5eeb62
dma37-node-registry-router-1   Ready     20m       v1.7.6+a08f5eeb62
dma37-node-registry-router-1   Ready     20m       v1.7.6+a08f5eeb62
dma37-node-registry-router-1   Ready     20m       v1.7.6+a08f5eeb62
dma37-node-registry-router-1   Ready     21m       v1.7.6+a08f5eeb62
dma37-node-registry-router-1   Ready     21m       v1.7.6+a08f5eeb62
dma37-node-registry-router-1   Ready     21m       v1.7.6+a08f5eeb62
dma37-node-registry-router-1   Ready     22m       v1.7.6+a08f5eeb62
dma37-node-registry-router-1   Ready     22m       v1.7.6+a08f5eeb62
dma37-node-registry-router-1   Ready     22m       v1.7.6+a08f5eeb62
dma37-node-registry-router-1   Ready     23m       v1.7.6+a08f5eeb62
dma37-node-registry-router-1   Ready     23m       v1.7.6+a08f5eeb62
dma37-node-registry-router-1   Ready     23m       v1.7.6+a08f5eeb62
dma37-node-registry-router-1   Ready     24m       v1.7.6+a08f5eeb62
dma37-node-registry-router-1   Ready     24m       v1.7.6+a08f5eeb62
dma37-node-registry-router-1   Ready     24m       v1.7.6+a08f5eeb62
dma37-node-registry-router-1   Ready     25m       v1.7.6+a08f5eeb62
dma37-node-registry-router-1   Ready     25m       v1.7.6+a08f5eeb62
dma37-node-registry-router-1   Ready     25m       v1.7.6+a08f5eeb62
dma37-node-registry-router-1   Ready     26m       v1.7.6+a08f5eeb62


2. On the node, block the connection to the Azure API, then watch the node log
[root@dma37-node-registry-router-1 ~]#  iptables -A OUTPUT -d management.azure.com -j DROP   
[root@dma37-node-registry-router-1 ~]# journalctl -f -u atomic-openshift-node.service |grep "Timeout after"
May 10 08:32:54 dma37-node-registry-router-1 atomic-openshift-node[29230]: W0510 08:32:54.204527   29230 kubelet_node_status.go:1007] Failed to set some node status fields: failed to get node address from cloud provider: Timeout after 10s
May 10 08:33:14 dma37-node-registry-router-1 atomic-openshift-node[29230]: W0510 08:33:14.225115   29230 kubelet_node_status.go:1007] Failed to set some node status fields: failed to get node address from cloud provider: Timeout after 10s
May 10 08:33:34 dma37-node-registry-router-1 atomic-openshift-node[29230]: W0510 08:33:34.249522   29230 kubelet_node_status.go:1007] Failed to set some node status fields: failed to get node address from cloud provider: Timeout after 10s
May 10 08:33:54 dma37-node-registry-router-1 atomic-openshift-node[29230]: W0510 08:33:54.274028   29230 kubelet_node_status.go:1007] Failed to set some node status fields: failed to get node address from cloud provider: Timeout after 10s
May 10 08:34:14 dma37-node-registry-router-1 atomic-openshift-node[29230]: W0510 08:34:14.293589   29230 kubelet_node_status.go:1007] Failed to set some node status fields: failed to get node address from cloud provider: Timeout after 10s
May 10 08:34:34 dma37-node-registry-router-1 atomic-openshift-node[29230]: W0510 08:34:34.318474   29230 kubelet_node_status.go:1007] Failed to set some node status fields: failed to get node address from cloud provider: Timeout after 10s
May 10 08:34:54 dma37-node-registry-router-1 atomic-openshift-node[29230]: W0510 08:34:54.341656   29230 kubelet_node_status.go:1007] Failed to set some node status fields: failed to get node address from cloud provider: Timeout after 10s
May 10 08:35:26 dma37-node-registry-router-1 atomic-openshift-node[29230]: W0510 08:35:26.402908   29230 kubelet_node_status.go:1007] Failed to set some node status fields: failed to get node address from cloud provider: Timeout after 10s
May 10 08:36:06 dma37-node-registry-router-1 atomic-openshift-node[29230]: W0510 08:36:06.424409   29230 kubelet_node_status.go:1007] Failed to set some node status fields: failed to get node address from cloud provider: Timeout after 10s
May 10 08:36:46 dma37-node-registry-router-1 atomic-openshift-node[29230]: W0510 08:36:46.454712   29230 kubelet_node_status.go:1007] Failed to set some node status fields: failed to get node address from cloud provider: Timeout after 10s
May 10 08:37:26 dma37-node-registry-router-1 atomic-openshift-node[29230]: W0510 08:37:26.486660   29230 kubelet_node_status.go:1007] Failed to set some node status fields: failed to get node address from cloud provider: Timeout after 10s
May 10 08:38:06 dma37-node-registry-router-1 atomic-openshift-node[29230]: W0510 08:38:06.552450   29230 kubelet_node_status.go:1007] Failed to set some node status fields: failed to get node address from cloud provider: Timeout after 10s
May 10 08:38:46 dma37-node-registry-router-1 atomic-openshift-node[29230]: W0510 08:38:46.572796   29230 kubelet_node_status.go:1007] Failed to set some node status fields: failed to get node address from cloud provider: Timeout after 10s

Comment 22 errata-xmlrpc 2018-05-18 03:54:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1576