Description of problem:
Using OCP 3.4 on top of OSP 10, the metrics deployment went smoothly, except that the heapster pod tries to get metrics from the nodes on the wrong interface when a node has more than one network interface:

--8<--
...
E0803 09:20:35.043086 1 kubelet.go:230] error while getting containers from Kubelet: failed to get all container stats from Kubelet URL "https://172.18.20.16:10250/stats/container/": Post https://172.18.20.16:10250/stats/container/: x509: certificate is valid for 172.18.10.18, not 172.18.20.16
E0803 09:20:35.066144 1 kubelet.go:230] error while getting containers from Kubelet: failed to get all container stats from Kubelet URL "https://172.18.20.9:10250/stats/container/": Post https://172.18.20.9:10250/stats/container/: x509: certificate is valid for 10.19.114.187, 172.18.10.12, not 172.18.20.9
...
-->8--

Version-Release number of selected component (if applicable):
openshift v3.4.1.44
kubernetes v1.4.0+776c994

How reproducible:
Deploy OCP 3.4.1 on hosts with more than 1 interface.

Steps to Reproduce:
1. Deploy OCP 3.4.1 on hosts with more than 1 interface
2. Deploy OCP metrics (using the deployer pod, as it is 3.4)
3. Check heapster logs

Actual results:
--8<--
...
E0803 09:20:35.043086 1 kubelet.go:230] error while getting containers from Kubelet: failed to get all container stats from Kubelet URL "https://172.18.20.16:10250/stats/container/": Post https://172.18.20.16:10250/stats/container/: x509: certificate is valid for 172.18.10.18, not 172.18.20.16
E0803 09:20:35.066144 1 kubelet.go:230] error while getting containers from Kubelet: failed to get all container stats from Kubelet URL "https://172.18.20.9:10250/stats/container/": Post https://172.18.20.9:10250/stats/container/: x509: certificate is valid for 10.19.114.187, 172.18.10.12, not 172.18.20.9
...
-->8--

Expected results:
Heapster should query the proper kubelet interface

Additional info:
The nodes have two different interfaces, and the yaml that defines the node contains:

--8<--
status:
  addresses:
  - address: 172.18.10.18
    type: InternalIP
  - address: 172.18.20.16
    type: InternalIP
-->8--

Where the "good one" is the 172.18.10.X. I've tried to delete the 172.18.20.X definition in the node, but it doesn't work: the IP comes back again.
I've seen that the only options for specifying IPs in kubernetes are ExternalIP or InternalIP[1] so, dead end.

The behavior seems random, because some graphs have a lot of missing metrics, so I assume the kubelet returns both interfaces in random order and heapster tries the first one.

I've been looking for a heapster parameter to modify this behavior at runtime but I couldn't find anything.

As a workaround, I've tried specifying the insecure=true parameter in the heapster rc, in the heapster command line parameter "source":

--8<--
- --source=kubernetes:https://kubernetes.default.svc:443?useServiceAccount=true&kubeletHttps=true&kubeletPort=10250&insecure=true
-->8--

After restarting the pod, the metrics are shown properly and there are no errors in the heapster pod.
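To see at a glance which address Heapster tried and which addresses each node certificate actually covers, the x509 errors quoted in the description can be summarized with a short script. This is only a rough sketch: the regular expression is inferred from the two log lines in this report, and `cert_mismatches` and `SAMPLE_LOG` are hypothetical names, not part of Heapster or OpenShift.

```python
import re

# Pattern inferred from the two error lines quoted in this report (an assumption,
# not a documented log format).
X509_RE = re.compile(
    r"x509: certificate is valid for (?P<valid>[0-9., ]+), not (?P<tried>[0-9.]+)"
)

# The two error lines from the description, shortened to the x509 part.
SAMPLE_LOG = """\
Post https://172.18.20.16:10250/stats/container/: x509: certificate is valid for 172.18.10.18, not 172.18.20.16
Post https://172.18.20.9:10250/stats/container/: x509: certificate is valid for 10.19.114.187, 172.18.10.12, not 172.18.20.9
"""


def cert_mismatches(log_text):
    """Return (tried_ip, [ips the certificate covers]) for each x509 error line."""
    results = []
    for line in log_text.splitlines():
        match = X509_RE.search(line)
        if match:
            valid = [ip.strip() for ip in match.group("valid").split(",") if ip.strip()]
            results.append((match.group("tried"), valid))
    return results


if __name__ == "__main__":
    for tried, valid in cert_mismatches(SAMPLE_LOG):
        print(f"tried {tried}, certificate covers {', '.join(valid)}")
```

In both sample lines the address Heapster tried is on the 172.18.20.X network, while the certificate only lists the 172.18.10.X address (plus the floating IP on one node).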
This is due to your nodes using incorrect certificates for the ip address that the master api returns for the nodes. If your nodes have multiple network interfaces, then your certificates for those nodes have to be valid for each of the network interfaces that a client would use for accessing the interface.
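The condition described here can be expressed as a simple check: every address the API reports for a node should appear among the IPs the node certificate is valid for. The sketch below uses the values from the first error line in the description; `uncovered_addresses` is a hypothetical helper, and the certificate IP list is passed in by hand rather than read from a real certificate.

```python
import ipaddress


def uncovered_addresses(node_addresses, cert_ips):
    """Return the node addresses that the certificate's IP list does not cover."""
    covered = {ipaddress.ip_address(ip) for ip in cert_ips}
    return [addr for addr in node_addresses if ipaddress.ip_address(addr) not in covered]


if __name__ == "__main__":
    # Addresses reported in status.addresses for the node in the description.
    node_addresses = ["172.18.10.18", "172.18.20.16"]
    # IPs the certificate is valid for, taken from the x509 error message.
    cert_ips = ["172.18.10.18"]
    print(uncovered_addresses(node_addresses, cert_ips))
```

Any address this returns is one that a client verifying certificates, such as Heapster, would reject.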
Eduardo, Do you have a nodeIP set in /etc/origin/node/node-config.yaml and if so is it 172.18.10.18 or 172.18.20.16? If not, can you stop the node, add "nodeIP: 172.18.20.16", `oc delete node`, then start the node so that it re-registers? And, if that works, then adding 'openshift_ip' variable to each of your nodes at install time will ensure that the value is set for each node.
(In reply to Scott Dodson from comment #2)
> Eduardo,
>
> Do you have a nodeIP set in /etc/origin/node/node-config.yaml and if so is
> it 172.18.10.18 or 172.18.20.16?
>
> If not, can you stop the node, add "nodeIP: 172.18.20.16", `oc delete node`,
> then start the node so that it re-registers?
>
> And, if that works, then adding 'openshift_ip' variable to each of your
> nodes at install time will ensure that the value is set for each node.

nodeIPs variables are set to the 172.18.10.X network automatically at installation time, and this is the "good one" interface:

[cloud-user@bastion ansible_files]$ ansible nodes -b -a "grep -i nodeip /etc/origin/node/node-config.yaml"
app-node-0.control.edu.example.com | SUCCESS | rc=0 >>
nodeIP: 172.18.10.18

master-1.control.edu.example.com | SUCCESS | rc=0 >>
nodeIP: 172.18.10.7

master-2.control.edu.example.com | SUCCESS | rc=0 >>
nodeIP: 172.18.10.10

master-0.control.edu.example.com | SUCCESS | rc=0 >>
nodeIP: 172.18.10.5

app-node-1.control.edu.example.com | SUCCESS | rc=0 >>
nodeIP: 172.18.10.17

app-node-2.control.edu.example.com | SUCCESS | rc=0 >>
nodeIP: 172.18.10.20

infra-node-0.control.edu.example.com | SUCCESS | rc=0 >>
nodeIP: 172.18.10.8

infra-node-2.control.edu.example.com | SUCCESS | rc=0 >>
nodeIP: 172.18.10.3

infra-node-1.control.edu.example.com | SUCCESS | rc=0 >>
nodeIP: 172.18.10.12

The 172.18.20.x IPs are meant to be used just for the flannel traffic[1].

[1] https://access.redhat.com/documentation/en-us/reference_architectures/2017/html-single/deploying_red_hat_openshift_container_platform_3.4_on_red_hat_openstack_platform_10/#reference_architecture_overview
Do other tasks that require access to the kubelet from the master fail? `oc exec`, `oc rsh`, `oc proxy` all route requests from the API server to the kubelet so try one of those.
They work:

[cloud-user@bastion ~]$ oc exec docker-registry-2-1326g uptime
 13:25:35 up 1 day, 22:28, 0 users, load average: 0.14, 0.20, 0.26
[cloud-user@bastion ~]$ oc rsh docker-registry-2-1326g
sh-4.2$ uptime
 13:25:40 up 1 day, 22:28, 0 users, load average: 0.12, 0.19, 0.26
sh-4.2$ exit
(In reply to Scott Dodson from comment #4)
> Do other tasks that require access to the kubelet from the master fail? `oc
> exec`, `oc rsh`, `oc proxy` all route requests from the API server to the
> kubelet so try one of those.

I believe they don't use the same mechanism and don't require more strict certificates in this case, so I would not count on this being a good test.
If you use 'oc get nodes -o yaml' what ip address/hostname are being shown in status.addresses? This is what heapster is using.
(In reply to Matt Wringe from comment #7)
> If you use 'oc get nodes -o yaml' what ip address/hostname are being shown
> in status.addresses?
>
> This is what heapster is using.

That's the point, it shows 2 IPs for each node:

--8<--
addresses:
- address: 172.18.10.10    <- proper one
  type: InternalIP
- address: 172.18.20.5     <- flannel traffic interface
  type: InternalIP
-->8--

And for the nodes that have a floating IP, it shows 3:

--8<--
addresses:
- address: 172.18.10.10    <- proper one
  type: InternalIP
- address: 172.18.20.5     <- flannel traffic interface
  type: InternalIP
- address: 10.19.114.198   <- floating ip (OSP based environment)
  type: ExternalIP
-->8--
Created attachment 1309068 [details] oc get nodes -o yaml
If the proper IP address is always the first IP address then I suggest heapster always uses the first address. The networking team did something similar for one of their bugs and I believe they added code to ensure that nodeIP is always the first address.
Is the ip address always returned in that order? Or can the order it's returned in be changed?

Either way, it shouldn't really matter. You have your cluster configured to be available across multiple ip addresses / hostnames. If your certificates are not valid for all of these addresses, then it means your cluster is not configured properly and heapster is properly rejecting the certificates since they are invalid.
(In reply to Matt Wringe from comment #11)
> Is the ip address returned always in that order? or can the order its
> returned be changed?
>
> Either way, it shouldn't really matter. You have your cluster configured to
> be available across multiple ip addresses / hostnames. If your certificates
> are not valid for all of these addresses, then it means your cluster is not
> configured properly and heapster is properly rejecting the certificates
> since they are invalid.

But those other IPs are configured automatically, even though the nodeip parameter in the node configuration file is properly set (some openstack autodiscovery? It seems to autodiscover the floating IPs on the nodes that have one...). Is there a way to force the kubelet to listen on just a specific interface? That's what the "nodeip" parameter should do, right?[1]

The thing is, "I think" heapster should just query the specific nodeip interface (IDK if there is such an API call or something). Maybe there could be a heapster parameter to specify a network, or, if the node has "autodiscovered" IPs, heapster could loop over them until it gets a proper answer.

Thanks

[1] https://kubernetes.io/docs/admin/kubelet/
Rajat, I think you worked on cleaning up issues with the router on nodes with multiple IP addresses. Can you provide any feedback on this bug?
From the Heapster side, Heapster will use the *last* of the valid InternalIP addresses, as a quirk of the logic which handles trying to figure out which of the IP addresses to use. We could patch this in our Heapster. I'm not sure there's good reasoning to suggest it to upstream. The relevant code is here: https://github.com/kubernetes/heapster/blob/master/metrics/sources/kubelet/kubelet.go#L313
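To make the difference between "last InternalIP wins" and the proposed "first InternalIP wins" concrete, here is a simplified model using the addresses from the `oc get nodes -o yaml` output quoted earlier in this bug. It is not the Heapster code linked above, only an illustration of the behavior as described in this comment; `pick_last_internal_ip` and `pick_first_internal_ip` are hypothetical names.

```python
# status.addresses for a node with a floating IP, as quoted earlier in this bug.
ADDRESSES = [
    {"address": "172.18.10.10", "type": "InternalIP"},   # nodeIP, the "proper one"
    {"address": "172.18.20.5", "type": "InternalIP"},    # flannel traffic interface
    {"address": "10.19.114.198", "type": "ExternalIP"},  # floating IP
]


def internal_ips(addresses):
    """Return the InternalIP addresses in the order the API lists them."""
    return [a["address"] for a in addresses if a["type"] == "InternalIP"]


def pick_last_internal_ip(addresses):
    """Model of the behavior described above: the last InternalIP wins."""
    ips = internal_ips(addresses)
    return ips[-1] if ips else None


def pick_first_internal_ip(addresses):
    """Model of the proposed change: the first InternalIP wins."""
    ips = internal_ips(addresses)
    return ips[0] if ips else None


if __name__ == "__main__":
    print("last: ", pick_last_internal_ip(ADDRESSES))
    print("first:", pick_first_internal_ip(ADDRESSES))
```

With this input the "last" rule selects the flannel address, which the node certificate does not cover, while the "first" rule selects the nodeIP. The second rule only helps if the nodeIP is reliably listed first.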
Since we already have two different use cases, I guess it would be nice to fix it either downstream or in both.
Hello,

Are we targeting any workaround for this one? As per the comment below, IHAC using multiple IPs for the node.

https://bugzilla.redhat.com/show_bug.cgi?id=1477989#c21

Let me know your thoughts on it.

Thanks.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days