Bug 1583129
Summary: | [3.6] Error creating subnet for node - just error message in logs - OCP on Azure | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Vladislav Walek <vwalek> | |
Component: | Cloud Compute | Assignee: | Jan Chaloupka <jchaloup> | |
Status: | CLOSED ERRATA | QA Contact: | DeShuai Ma <dma> | |
Severity: | urgent | Docs Contact: | ||
Priority: | urgent | |||
Version: | 3.6.0 | CC: | aos-bugs, bjarolim, bleanhar, borisb, clasohm, dcbw, dkinkead, dma, erich, grodrigu, harold.wong, jawilson, jchaloup, jenander, jokerman, mfojtik, misalunk, mmccomas, mprashad, nnosenzo, openshift-bugs-escalate, pdwyer, rhowe, rvokal, sauchter, sjenning, smunilla, tatanaka, vwalek, wmeng, xtian | |
Target Milestone: | --- | |||
Target Release: | 3.6.z | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | atomic-openshift-3.6.173.0.117-1.git.1.2e5387f.el7 | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1595599 1595605 1595608 1603662 (view as bug list) | Environment: | ||
Last Closed: | 2018-08-09 22:10:03 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1603662 |
Description
Vladislav Walek
2018-05-28 09:10:01 UTC
to correct: the correct version is not openshift 3.7.46, but openshift 3.6.173.0.117. Version-Release number of selected component (if applicable): OpenShift Container Platform 3.6.173.0.117 The errata is https://access.redhat.com/errata/RHBA-2018:1579 and bug is 1554748 the issue is affecting the metrics: E0529 05:30:05.000608 1 summary.go:376] Node node1 has no valid hostname and/or IP address: node1 E0529 05:30:05.000620 1 summary.go:376] Node node2 has no valid hostname and/or IP address: node2 I0529 05:30:23.231898 1 handlers.go:264] No metrics for container container1 in pod project/container1 I0529 05:30:23.231936 1 handlers.go:215] No metrics for pod project/container1 Dan/Seth, Do you have any recommendations regarding any action that Support can make to move this issue forward? The problem being seen is urgent on the customer side. Also, customer is experiencing issue reported https://github.com/openshift/origin/issues/19044, would this be related or useful information for this BZ? Thank you! ``` # oc get nodes -o yaml | grep address addresses: - address: applejack - address: 10.0.3.6 ... ``` ``` subnets.go:192] Error creating subnet for node hoofdini, ip hoofdini: failed to parse node IP hoofdini subnets.go:192] Error creating subnet for node seabreeze, ip seabreeze: failed to parse node IP seabreeze subnets.go:192] Error creating subnet for node moondancer, ip moondancer: failed to parse node IP moondancer ``` Checking the code responsible for the error: ```go func getNodeIP(node *kapi.Node) (string, error) { if len(node.Status.Addresses) > 0 && node.Status.Addresses[0].Address != "" { return node.Status.Addresses[0].Address, nil } else { return netutils.GetNodeIP(node.Name) } } ``` The thing is the getNodeIP expects the IP to be first in the addresses list. However, the IP is sorted as the second item and the SDN plugin in 3.6 can't deal with the fact. Instead, the code should do: ```go func getNodeIP(node *kapi.Node) (string, error) { for _, address := range node.Status.Addresses { if (address.Type == kapi.NodeInternalIP || address.Type == kapi.NodeExternalIP) && address.Address != "" { return node.Status.Addresses[0].Address, nil } } return netutils.GetNodeIP(node.Name) } ``` @Jan, not sure if you got customer update alreaady. They're complaining the hotfix didnt fixed (or it did partially) the issue: ~~~ so currently the issue is only partly solved. As you can see from attached logfiles, i do still have the error messages in the controller logs. But i can see, that every node has an ipaddress: # oc get nodes -o yaml | grep address addresses: - address: applejack - address: 10.0.3.6 addresses: - address: cadance - address: 10.0.1.8 addresses: - address: cloudchaser - address: 10.0.1.5 addresses: - address: debugnode - address: 10.0.1.12 addresses: - address: flurryheart - address: 10.0.1.11 addresses: - address: fluttershy - address: 10.0.3.7 addresses: - address: hoofbeard - address: 10.0.2.7 addresses: - address: hoofdini - address: 10.0.2.9 addresses: - address: moondancer - address: 10.0.1.4 addresses: - address: rarity - address: 10.0.3.5 addresses: - address: seabreeze - address: 10.0.1.7 addresses: - address: soarin - address: 10.0.2.5 addresses: - address: spitfire - address: 10.0.2.6 addresses: - address: sweetiebelle - address: 10.0.1.10 addresses: - address: thunderlane - address: 10.0.1.6 addresses: - address: trenderhoof - address: 10.0.2.8 addresses: - address: zecora - address: 10.0.2.4 addresses: - address: zephyrbreeze - address: 10.0.1.9 But active controller (of course i've restarted all services after installing hotfix last week) is still producing errors in logs: # sudo systemctl status atomic-openshift-master-controllers.service ● atomic-openshift-master-controllers.service - Atomic OpenShift Master Controllers Loaded: loaded (/usr/lib/systemd/system/atomic-openshift-master-controllers.service; enabled; vendor preset: disabled) Active: active (running) since Thu 2018-07-05 07:33:53 CEST; 4 days ago Docs: https://github.com/openshift/origin Main PID: 38478 (openshift) Memory: 441.7M CGroup: /system.slice/atomic-openshift-master-controllers.service └─38478 /usr/bin/openshift start master controllers --config=/etc/origin/master/master-config.yaml --loglevel=2 --listen=https://0.0.0.0:8444 Jul 09 12:32:49 applejack atomic-openshift-master-controllers[38478]: E0709 12:32:49.292508 38478 subnets.go:192] Error creating subnet for node z...zecora Jul 09 12:32:49 applejack atomic-openshift-master-controllers[38478]: I0709 12:32:49.330708 38478 horizontal.go:538] Successfully updated status f...ar-sso Jul 09 12:32:54 applejack atomic-openshift-master-controllers[38478]: E0709 12:32:54.434440 38478 subnets.go:192] Error creating subnet for node m...dancer Jul 09 12:32:57 applejack atomic-openshift-master-controllers[38478]: I0709 12:32:57.157175 38478 stateful_set.go:420] Syncing StatefulSet ocpmon/...3 pods Jul 09 12:32:57 applejack atomic-openshift-master-controllers[38478]: I0709 12:32:57.157264 38478 stateful_set.go:425] Succesfully synced Stateful...essful Jul 09 12:32:57 applejack atomic-openshift-master-controllers[38478]: I0709 12:32:57.157344 38478 stateful_set.go:420] Syncing StatefulSet ocpmon/...3 pods Jul 09 12:32:57 applejack atomic-openshift-master-controllers[38478]: I0709 12:32:57.157378 38478 stateful_set.go:425] Succesfully synced Stateful...essful Jul 09 12:32:57 applejack atomic-openshift-master-controllers[38478]: I0709 12:32:57.370437 38478 horizontal.go:538] Successfully updated status f...ateway Jul 09 12:33:06 applejack atomic-openshift-master-controllers[38478]: E0709 12:33:06.001183 38478 subnets.go:192] Error creating subnet for node s...ebelle Jul 09 12:33:10 applejack atomic-openshift-master-controllers[38478]: E0709 12:33:10.155525 38478 subnets.go:192] Error creating subnet for node s...soarin Hint: Some lines were ellipsized, use -l to show in full. and of course hostsubnets are still in place: oc get hostsubnet NAME HOST HOST IP SUBNET applejack applejack 10.0.3.6 10.128.2.0/23 cadance cadance 10.0.1.8 10.130.4.0/23 cloudchaser cloudchaser 10.0.1.5 10.128.0.0/23 debugnode debugnode 10.0.1.12 10.128.10.0/23 flurryheart flurryheart 10.0.1.11 10.129.6.0/23 fluttershy fluttershy 10.0.3.7 10.131.2.0/23 hoofbeard hoofbeard 10.0.2.7 10.128.8.0/23 hoofdini hoofdini 10.0.2.9 10.130.6.0/23 moondancer moondancer 10.0.1.4 10.129.0.0/23 rarity rarity 10.0.3.5 10.129.2.0/23 seabreeze seabreeze 10.0.1.7 10.130.0.0/23 soarin soarin 10.0.2.5 10.130.2.0/23 spitfire spitfire 10.0.2.6 10.129.4.0/23 sweetiebelle sweetiebelle 10.0.1.10 10.128.6.0/23 thunderlane thunderlane 10.0.1.6 10.131.0.0/23 trenderhoof trenderhoof 10.0.2.8 10.131.6.0/23 zecora zecora 10.0.2.4 10.128.4.0/23 zephyrbreeze zephyrbreeze 10.0.1.9 10.131.4.0/23 But a simple wc shows, that the logs are still flooded: #zgrep "Error creating subnet" messages-2018070*.gz | wc -l 270001 ~~~~ Revisiting the logs provided by the customer after applying the scratch-build with the hotfix: 02106255-debugnode_messages-20180623:Jun 23 01:30:44 debugnode atomic-openshift-node: W0623 01:30:26.110354 45092 kubelet_node_status.go:968] Failed to set some node status fields: failed to get node address from cloud provider: Timeout after 10s 02106255-debugnode_messages-20180623:Jun 23 01:31:02 debugnode atomic-openshift-node: I0623 01:31:02.621248 45092 kubelet_node_status.go:463] Node addresses []v1.NodeAddress{v1.NodeAddress{Type:"InternalIP", Address:"10.0.1.12"}, v1.NodeAddress{Type:"Hostname", Address:"debugnode"}} retrieved from cloudprovider in less than 10s, err=<nil> 02106255-debugnode_messages-20180623:Jun 23 01:31:02 debugnode atomic-openshift-node: I0623 01:31:02.621299 45092 kubelet_node_status.go:488] Ranging over node address([{"InternalIP" "10.0.1.12"} {"Hostname" "debugnode"}]): 10000000000 02106255-debugnode_messages-20180623:Jun 23 01:31:02 debugnode atomic-openshift-node: I0623 01:31:02.621319 45092 kubelet_node_status.go:488] Ranging over node address([{"InternalIP" "10.0.1.12"} {"Hostname" "debugnode"}]): 10000000000 02106255-debugnode_messages-20180623:Jun 23 01:31:02 debugnode atomic-openshift-node: I0623 01:31:02.621333 45092 kubelet_node_status.go:498] Using Node Hostname from cloudprovider: "debugnode" The internal IP is positioned as the first item in the node addresses list, the hostname as the second. So it is strange the node on the apiserver side contains the items in the reversed order. In theory, the PATCH operation (assuming the node contained only the hostname address) could append the internal IP after the hostname instead of replacing the entire list of node addresses. Though, I am not so familiar with the code to prove or disprove the claim. The fix has cause to regression bug https://bugzilla.redhat.com/show_bug.cgi?id=1601749 Verify on v3.6.173.0.128 check master logs , no error log Error creating subnet for node [root@ocp36-master ~]# oc version oc v3.6.173.0.128 kubernetes v1.6.1+5115d708d7 features: Basic-Auth GSSAPI Kerberos SPNEGO Server https://ocp36-master:8443 openshift v3.6.173.0.128 kubernetes v1.6.1+5115d708d7 [root@ocp36-master ~]# oc get node NAME STATUS AGE VERSION ocp36-master Ready 5d v1.6.1+5115d708d7 ocp36-node1 Ready 5d v1.6.1+5115d708d7 [root@ocp36-master ~]# journalctl -u atomic-openshift-master.service > master.log |grep -i "Error creating subnet" [root@ocp36-master ~]# Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2339 |