Bug 1583129

Summary: [3.6] Error creating subnet for node - just error message in logs - OCP on Azure
Product: OpenShift Container Platform Reporter: Vladislav Walek <vwalek>
Component: Cloud Compute    Assignee: Jan Chaloupka <jchaloup>
Status: CLOSED ERRATA QA Contact: DeShuai Ma <dma>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 3.6.0    CC: aos-bugs, bjarolim, bleanhar, borisb, clasohm, dcbw, dkinkead, dma, erich, grodrigu, harold.wong, jawilson, jchaloup, jenander, jokerman, mfojtik, misalunk, mmccomas, mprashad, nnosenzo, openshift-bugs-escalate, pdwyer, rhowe, rvokal, sauchter, sjenning, smunilla, tatanaka, vwalek, wmeng, xtian
Target Milestone: ---   
Target Release: 3.6.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: atomic-openshift-3.6.173.0.117-1.git.1.2e5387f.el7 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1595599 1595605 1595608 1603662 (view as bug list) Environment:
Last Closed: 2018-08-09 22:10:03 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1603662    

Description Vladislav Walek 2018-05-28 09:10:01 UTC
Description of problem:

After applying the errata https://access.redhat.com/errata/RHBA-2018:1576 with bugfix 1573122 on version 3.7.46, the controller logs show an error message for each node:

subnets.go:192] Error creating subnet for node <node-name>, ip <node-name>: failed to parse node IP <node-name>

The error appears every 30s (which matches node-status-update-frequency exactly).
In the node logs, no error is shown.
The nodes are running properly without any downtime.
No Azure-related errors are in the logs.
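For reference, the 30s cadence corresponds to the kubelet's node-status-update-frequency setting; on OCP 3.x nodes it is typically configured via kubeletArguments in /etc/origin/node/node-config.yaml. An illustrative fragment only (the value shown is an assumption matching the reported 30s interval, not the customer's actual config):

```yaml
kubeletArguments:
  node-status-update-frequency:
    - "30s"
```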


Version-Release number of selected component (if applicable):
OpenShift Container Platform 3.7.46

How reproducible:
not reproducible

Comment 1 Vladislav Walek 2018-05-28 09:15:16 UTC
Correction:

The correct version is not OpenShift 3.7.46 but OpenShift 3.6.173.0.117.

Version-Release number of selected component (if applicable):
OpenShift Container Platform 3.6.173.0.117

Comment 2 Vladislav Walek 2018-05-28 09:16:06 UTC
The errata is actually https://access.redhat.com/errata/RHBA-2018:1579 and the bug is 1554748

Comment 3 Vladislav Walek 2018-05-29 08:57:47 UTC
The issue is also affecting the metrics:

E0529 05:30:05.000608       1 summary.go:376] Node node1 has no valid hostname and/or IP address: node1
E0529 05:30:05.000620       1 summary.go:376] Node node2 has no valid hostname and/or IP address: node2
I0529 05:30:23.231898       1 handlers.go:264] No metrics for container container1 in pod project/container1
I0529 05:30:23.231936       1 handlers.go:215] No metrics for pod project/container1

Comment 24 Greg Rodriguez II 2018-06-04 22:07:50 UTC
Dan/Seth,
Do you have any recommendations regarding any action Support can take to move this issue forward? The problem is urgent on the customer side.
Also, the customer is experiencing the issue reported in https://github.com/openshift/origin/issues/19044; would this be related, or useful information for this BZ?
Thank you!

Comment 55 Jan Chaloupka 2018-07-09 13:55:50 UTC
```
# oc get nodes -o yaml | grep address
addresses:
 - address: applejack
 - address: 10.0.3.6
...
```

```
subnets.go:192] Error creating subnet for node hoofdini, ip hoofdini: failed to parse node IP hoofdini
subnets.go:192] Error creating subnet for node seabreeze, ip seabreeze: failed to parse node IP seabreeze
subnets.go:192] Error creating subnet for node moondancer, ip moondancer: failed to parse node IP moondancer
```

Checking the code responsible for the error:
```go
func getNodeIP(node *kapi.Node) (string, error) {
	if len(node.Status.Addresses) > 0 && node.Status.Addresses[0].Address != "" {
		return node.Status.Addresses[0].Address, nil
	} else {
		return netutils.GetNodeIP(node.Name)
	}
}

```

The problem is that getNodeIP expects the IP to be the first item in the addresses list. However, on these nodes the IP is sorted as the second item, and the SDN plugin in 3.6 cannot handle that ordering.
Instead, the code should do something like:

```go
func getNodeIP(node *kapi.Node) (string, error) {
	for _, address := range node.Status.Addresses {
		if (address.Type == kapi.NodeInternalIP || address.Type == kapi.NodeExternalIP) && address.Address != "" {
			// Return the matching address, not Addresses[0],
			// which may be the hostname entry.
			return address.Address, nil
		}
	}
	return netutils.GetNodeIP(node.Name)
}
```

Comment 56 Nicolas Nosenzo 2018-07-09 14:03:22 UTC
@Jan, not sure if you got the customer update already. They're complaining that the hotfix didn't fix the issue (or only fixed it partially):

~~~
so currently the issue is only partly solved. As you can see from the attached logfiles, I still have the error messages in the controller logs.
But I can see that every node has an IP address:
# oc get nodes -o yaml | grep address
    addresses:
    - address: applejack
    - address: 10.0.3.6
    addresses:
    - address: cadance
    - address: 10.0.1.8
    addresses:
    - address: cloudchaser
    - address: 10.0.1.5
    addresses:
    - address: debugnode
    - address: 10.0.1.12
    addresses:
    - address: flurryheart
    - address: 10.0.1.11
    addresses:
    - address: fluttershy
    - address: 10.0.3.7
    addresses:
    - address: hoofbeard
    - address: 10.0.2.7
    addresses:
    - address: hoofdini
    - address: 10.0.2.9
    addresses:
    - address: moondancer
    - address: 10.0.1.4
    addresses:
    - address: rarity
    - address: 10.0.3.5
    addresses:
    - address: seabreeze
    - address: 10.0.1.7
    addresses:
    - address: soarin
    - address: 10.0.2.5
    addresses:
    - address: spitfire
    - address: 10.0.2.6
    addresses:
    - address: sweetiebelle
    - address: 10.0.1.10
    addresses:
    - address: thunderlane
    - address: 10.0.1.6
    addresses:
    - address: trenderhoof
    - address: 10.0.2.8
    addresses:
    - address: zecora
    - address: 10.0.2.4
    addresses:
    - address: zephyrbreeze
    - address: 10.0.1.9

But the active controller (of course I've restarted all services after installing the hotfix last week) is still producing errors in the logs:

# sudo systemctl status atomic-openshift-master-controllers.service
● atomic-openshift-master-controllers.service - Atomic OpenShift Master Controllers
   Loaded: loaded (/usr/lib/systemd/system/atomic-openshift-master-controllers.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2018-07-05 07:33:53 CEST; 4 days ago
     Docs: https://github.com/openshift/origin
 Main PID: 38478 (openshift)
   Memory: 441.7M
   CGroup: /system.slice/atomic-openshift-master-controllers.service
           └─38478 /usr/bin/openshift start master controllers --config=/etc/origin/master/master-config.yaml --loglevel=2 --listen=https://0.0.0.0:8444

Jul 09 12:32:49 applejack atomic-openshift-master-controllers[38478]: E0709 12:32:49.292508   38478 subnets.go:192] Error creating subnet for node z...zecora
Jul 09 12:32:49 applejack atomic-openshift-master-controllers[38478]: I0709 12:32:49.330708   38478 horizontal.go:538] Successfully updated status f...ar-sso
Jul 09 12:32:54 applejack atomic-openshift-master-controllers[38478]: E0709 12:32:54.434440   38478 subnets.go:192] Error creating subnet for node m...dancer
Jul 09 12:32:57 applejack atomic-openshift-master-controllers[38478]: I0709 12:32:57.157175   38478 stateful_set.go:420] Syncing StatefulSet ocpmon/...3 pods
Jul 09 12:32:57 applejack atomic-openshift-master-controllers[38478]: I0709 12:32:57.157264   38478 stateful_set.go:425] Succesfully synced Stateful...essful
Jul 09 12:32:57 applejack atomic-openshift-master-controllers[38478]: I0709 12:32:57.157344   38478 stateful_set.go:420] Syncing StatefulSet ocpmon/...3 pods
Jul 09 12:32:57 applejack atomic-openshift-master-controllers[38478]: I0709 12:32:57.157378   38478 stateful_set.go:425] Succesfully synced Stateful...essful
Jul 09 12:32:57 applejack atomic-openshift-master-controllers[38478]: I0709 12:32:57.370437   38478 horizontal.go:538] Successfully updated status f...ateway
Jul 09 12:33:06 applejack atomic-openshift-master-controllers[38478]: E0709 12:33:06.001183   38478 subnets.go:192] Error creating subnet for node s...ebelle
Jul 09 12:33:10 applejack atomic-openshift-master-controllers[38478]: E0709 12:33:10.155525   38478 subnets.go:192] Error creating subnet for node s...soarin
Hint: Some lines were ellipsized, use -l to show in full.

and of course hostsubnets are still in place:
oc get hostsubnet
NAME           HOST           HOST IP     SUBNET
applejack      applejack      10.0.3.6    10.128.2.0/23
cadance        cadance        10.0.1.8    10.130.4.0/23
cloudchaser    cloudchaser    10.0.1.5    10.128.0.0/23
debugnode      debugnode      10.0.1.12   10.128.10.0/23
flurryheart    flurryheart    10.0.1.11   10.129.6.0/23
fluttershy     fluttershy     10.0.3.7    10.131.2.0/23
hoofbeard      hoofbeard      10.0.2.7    10.128.8.0/23
hoofdini       hoofdini       10.0.2.9    10.130.6.0/23
moondancer     moondancer     10.0.1.4    10.129.0.0/23
rarity         rarity         10.0.3.5    10.129.2.0/23
seabreeze      seabreeze      10.0.1.7    10.130.0.0/23
soarin         soarin         10.0.2.5    10.130.2.0/23
spitfire       spitfire       10.0.2.6    10.129.4.0/23
sweetiebelle   sweetiebelle   10.0.1.10   10.128.6.0/23
thunderlane    thunderlane    10.0.1.6    10.131.0.0/23
trenderhoof    trenderhoof    10.0.2.8    10.131.6.0/23
zecora         zecora         10.0.2.4    10.128.4.0/23
zephyrbreeze   zephyrbreeze   10.0.1.9    10.131.4.0/23

But a simple wc shows that the logs are still flooded:
# zgrep "Error creating subnet" messages-2018070*.gz | wc -l
270001
~~~

Comment 59 Jan Chaloupka 2018-07-10 11:50:20 UTC
Revisiting the logs provided by the customer after applying the scratch-build with the hotfix:

02106255-debugnode_messages-20180623:Jun 23 01:30:44 debugnode atomic-openshift-node: W0623 01:30:26.110354 45092 kubelet_node_status.go:968] Failed to set some node status fields: failed to get node address from cloud provider: Timeout after 10s
02106255-debugnode_messages-20180623:Jun 23 01:31:02 debugnode atomic-openshift-node: I0623 01:31:02.621248 45092 kubelet_node_status.go:463] Node addresses []v1.NodeAddress{v1.NodeAddress{Type:"InternalIP", Address:"10.0.1.12"}, v1.NodeAddress{Type:"Hostname", Address:"debugnode"}} retrieved from cloudprovider in less than 10s, err=<nil>
02106255-debugnode_messages-20180623:Jun 23 01:31:02 debugnode atomic-openshift-node: I0623 01:31:02.621299 45092 kubelet_node_status.go:488] Ranging over node address([{"InternalIP" "10.0.1.12"} {"Hostname" "debugnode"}]): 10000000000 
02106255-debugnode_messages-20180623:Jun 23 01:31:02 debugnode atomic-openshift-node: I0623 01:31:02.621319 45092 kubelet_node_status.go:488] Ranging over node address([{"InternalIP" "10.0.1.12"} {"Hostname" "debugnode"}]): 10000000000 
02106255-debugnode_messages-20180623:Jun 23 01:31:02 debugnode atomic-openshift-node: I0623 01:31:02.621333 45092 kubelet_node_status.go:498] Using Node Hostname from cloudprovider: "debugnode"

The internal IP is positioned as the first item in the node addresses list, and the hostname as the second. So it is strange that the node object on the apiserver side contains the items in reverse order.

In theory, the PATCH operation (assuming the node previously contained only the hostname address) could append the internal IP after the hostname instead of replacing the entire list of node addresses. Though I am not familiar enough with the code to prove or disprove the claim.
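If the PATCH really were appending rather than replacing, one defensive option on the consumer side would be to normalize the ordering so IP-typed entries come first, as the 3.6 SDN code expects. This is purely a sketch: NodeAddress here is a hypothetical stand-in for v1.NodeAddress, and sortAddresses is not the actual kubelet or apiserver merge logic.

```go
package main

import (
	"fmt"
	"sort"
)

// Hypothetical mirror of v1.NodeAddress, for illustration only.
type NodeAddress struct {
	Type    string // "InternalIP", "ExternalIP", "Hostname", ...
	Address string
}

// addressRank orders IP-typed entries before hostnames.
func addressRank(t string) int {
	switch t {
	case "InternalIP":
		return 0
	case "ExternalIP":
		return 1
	default:
		return 2
	}
}

// sortAddresses stably reorders the list so IP entries precede the
// hostname, regardless of how the PATCH left them on the apiserver.
func sortAddresses(addrs []NodeAddress) {
	sort.SliceStable(addrs, func(i, j int) bool {
		return addressRank(addrs[i].Type) < addressRank(addrs[j].Type)
	})
}

func main() {
	// Reversed order as seen on the apiserver side.
	addrs := []NodeAddress{
		{Type: "Hostname", Address: "debugnode"},
		{Type: "InternalIP", Address: "10.0.1.12"},
	}
	sortAddresses(addrs)
	fmt.Println(addrs[0].Address) // 10.0.1.12
}
```

The proposed getNodeIP loop above makes this normalization unnecessary for the SDN, but a stable sort like this shows why consumers should never rely on list position in the first place.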

Comment 62 DeShuai Ma 2018-07-17 09:16:06 UTC
The fix has caused a regression bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1601749

Comment 67 DeShuai Ma 2018-08-06 09:54:45 UTC
Verified on v3.6.173.0.128.

Checked the master logs; there is no "Error creating subnet for node" error.
[root@ocp36-master ~]# oc version
oc v3.6.173.0.128
kubernetes v1.6.1+5115d708d7
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ocp36-master:8443
openshift v3.6.173.0.128
kubernetes v1.6.1+5115d708d7
[root@ocp36-master ~]# oc get node
NAME           STATUS    AGE       VERSION
ocp36-master   Ready     5d        v1.6.1+5115d708d7
ocp36-node1    Ready     5d        v1.6.1+5115d708d7
[root@ocp36-master ~]# journalctl -u atomic-openshift-master.service > master.log |grep -i "Error creating subnet"
[root@ocp36-master ~]#

Comment 69 errata-xmlrpc 2018-08-09 22:10:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2339