Bug 1477989 - Heapster collecting metrics from the wrong networking interface of the node
Summary: Heapster collecting metrics from the wrong networking interface of the node
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Hawkular
Version: 3.4.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.11.z
Assignee: Ruben Vargas Palma
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-08-03 12:13 UTC by Eduardo Minguez
Modified: 2023-09-15 00:03 UTC
CC: 23 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-07 16:26:20 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
oc get nodes -o yaml (31.91 KB, text/plain), 2017-08-04 13:49 UTC, Eduardo Minguez

Description Eduardo Minguez 2017-08-03 12:13:27 UTC
Description of problem:
Using OCP 3.4 on top of OSP 10, the metrics deployment went smoothly, but the heapster pod tries to get metrics from the nodes on the wrong interface when a node has more than one network interface:

--8<--
...
E0803 09:20:35.043086       1 kubelet.go:230] error while getting containers from Kubelet: failed to get all container stats from Kubelet URL "https://172.18.20.16:10250/stats/container/": Post https://172.18.20.16:10250/stats/container/: x509: certificate is valid for 172.18.10.18, not 172.18.20.16
E0803 09:20:35.066144       1 kubelet.go:230] error while getting containers from Kubelet: failed to get all container stats from Kubelet URL "https://172.18.20.9:10250/stats/container/": Post https://172.18.20.9:10250/stats/container/: x509: certificate is valid for 10.19.114.187, 172.18.10.12, not 172.18.20.9
...
-->8--

Version-Release number of selected component (if applicable):
openshift v3.4.1.44
kubernetes v1.4.0+776c994


How reproducible:
Deploy OCP 3.4.1 on hosts with more than 1 interface.


Steps to Reproduce:
1. Deploy OCP 3.4.1 on hosts with more than 1 interface
2. Deploy OCP metrics (using deployer pod as it is 3.4)
3. Check heapster logs

Actual results:
--8<--
...
E0803 09:20:35.043086       1 kubelet.go:230] error while getting containers from Kubelet: failed to get all container stats from Kubelet URL "https://172.18.20.16:10250/stats/container/": Post https://172.18.20.16:10250/stats/container/: x509: certificate is valid for 172.18.10.18, not 172.18.20.16
E0803 09:20:35.066144       1 kubelet.go:230] error while getting containers from Kubelet: failed to get all container stats from Kubelet URL "https://172.18.20.9:10250/stats/container/": Post https://172.18.20.9:10250/stats/container/: x509: certificate is valid for 10.19.114.187, 172.18.10.12, not 172.18.20.9
...
-->8--

Expected results:
Heapster should query the proper kubelet interface


Additional info:

The nodes have two different interfaces, and the yaml that defines the node contains:

--8<--
status:
  addresses:
  - address: 172.18.10.18
    type: InternalIP
  - address: 172.18.20.16
    type: InternalIP
-->8--

The "good one" is 172.18.10.x. I've tried deleting the 172.18.20.x definition from the node, but it doesn't work; the IP comes back. The only options I've seen for specifying IPs in Kubernetes are ExternalIP and InternalIP[1], so that's a dead end.

It seems to be random behavior: some graphs have a lot of missing metrics, so I assume the kubelet returns both interfaces in random order and heapster tries the first one.

I've been looking for a heapster parameter to modify this behavior at runtime but I couldn't find anything.

As a workaround, I've tried specifying the insecure=true parameter in the heapster rc, in the "source" command-line argument:

--8<--
        - --source=kubernetes:https://kubernetes.default.svc:443?useServiceAccount=true&kubeletHttps=true&kubeletPort=10250&insecure=true
-->8--

After restarting the pod, the metrics show up smoothly and there are no errors in the heapster pod.

Comment 1 Matt Wringe 2017-08-03 15:55:15 UTC
This is due to your nodes using certificates that are not valid for the IP addresses that the master API returns for the nodes.

If your nodes have multiple network interfaces, then your certificates for those nodes have to be valid for each of the network interfaces that a client would use for accessing the interface.

Comment 2 Scott Dodson 2017-08-03 19:57:39 UTC
Eduardo,

Do you have a nodeIP set in /etc/origin/node/node-config.yaml and if so is it 172.18.10.18 or 172.18.20.16?

If not, can you stop the node, add "nodeIP: 172.18.20.16", `oc delete node`, then start the node so that it re-registers?



And, if that works, then adding 'openshift_ip' variable to each of your nodes at install time will ensure that the value is set for each node.

Comment 3 Eduardo Minguez 2017-08-04 08:14:08 UTC
(In reply to Scott Dodson from comment #2)
> Eduardo,
> 
> Do you have a nodeIP set in /etc/origin/node/node-config.yaml and if so is
> it 172.18.10.18 or 172.18.20.16?
> 
> If not, can you stop the node, add "nodeIP: 172.18.20.16", `oc delete node`,
> then start the node so that it re-registers?
> 
> 
> 
> And, if that works, then adding 'openshift_ip' variable to each of your
> nodes at install time will ensure that the value is set for each node.

The nodeIP variables are set to the 172.18.10.x network automatically at installation time, and this is the "good" interface:

[cloud-user@bastion ansible_files]$ ansible nodes -b -a "grep -i nodeip /etc/origin/node/node-config.yaml"
app-node-0.control.edu.example.com | SUCCESS | rc=0 >>
nodeIP: 172.18.10.18

master-1.control.edu.example.com | SUCCESS | rc=0 >>
nodeIP: 172.18.10.7

master-2.control.edu.example.com | SUCCESS | rc=0 >>
nodeIP: 172.18.10.10

master-0.control.edu.example.com | SUCCESS | rc=0 >>
nodeIP: 172.18.10.5

app-node-1.control.edu.example.com | SUCCESS | rc=0 >>
nodeIP: 172.18.10.17

app-node-2.control.edu.example.com | SUCCESS | rc=0 >>
nodeIP: 172.18.10.20

infra-node-0.control.edu.example.com | SUCCESS | rc=0 >>
nodeIP: 172.18.10.8

infra-node-2.control.edu.example.com | SUCCESS | rc=0 >>
nodeIP: 172.18.10.3

infra-node-1.control.edu.example.com | SUCCESS | rc=0 >>
nodeIP: 172.18.10.12

The 172.18.20.x IPs are meant to be used just for the flannel traffic[1].

[1] https://access.redhat.com/documentation/en-us/reference_architectures/2017/html-single/deploying_red_hat_openshift_container_platform_3.4_on_red_hat_openstack_platform_10/#reference_architecture_overview

Comment 4 Scott Dodson 2017-08-04 13:22:37 UTC
Do other tasks that require access to the kubelet from the master fail? `oc exec`, `oc rsh`, `oc proxy` all route requests from the API server to the kubelet so try one of those.

Comment 5 Eduardo Minguez 2017-08-04 13:26:15 UTC
They work:

[cloud-user@bastion ~]$ oc exec docker-registry-2-1326g uptime
 13:25:35 up 1 day, 22:28,  0 users,  load average: 0.14, 0.20, 0.26
[cloud-user@bastion ~]$ oc rsh docker-registry-2-1326g
sh-4.2$ uptime
 13:25:40 up 1 day, 22:28,  0 users,  load average: 0.12, 0.19, 0.26
sh-4.2$ exit

Comment 6 Matt Wringe 2017-08-04 13:36:07 UTC
(In reply to Scott Dodson from comment #4)
> Do other tasks that require access to the kubelet from the master fail? `oc
> exec`, `oc rsh`, `oc proxy` all route requests from the API server to the
> kubelet so try one of those.

I believe they don't use the same mechanism and don't require more strict certificates in this case, so I would not count on this being a good test.

Comment 7 Matt Wringe 2017-08-04 13:37:18 UTC
If you use 'oc get nodes -o yaml' what ip address/hostname are being shown in status.addresses?

This is what heapster is using.

Comment 8 Eduardo Minguez 2017-08-04 13:49:39 UTC
(In reply to Matt Wringe from comment #7)
> If you use 'oc get nodes -o yaml' what ip address/hostname are being shown
> in status.addresses?
> 
> This is what heapster is using.

That's the point: it shows multiple IPs for each node:

--8<--
    addresses:
    - address: 172.18.10.10    <- proper one
      type: InternalIP
    - address: 172.18.20.5     <- flannel traffic interface
      type: InternalIP
-->8--

And for the nodes that have a floating IP, it shows three:

--8<--
    addresses:
    - address: 172.18.10.10    <- proper one
      type: InternalIP
    - address: 172.18.20.5     <- flannel traffic interface
      type: InternalIP
    - address: 10.19.114.198   <- floating ip (OSP based environment)
      type: ExternalIP
-->8--

Comment 9 Eduardo Minguez 2017-08-04 13:49:59 UTC
Created attachment 1309068 [details]
oc get nodes -o yaml

Comment 10 Scott Dodson 2017-08-04 13:52:11 UTC
If the proper IP address is always the first one, then I suggest heapster always use the first address. The networking team did something similar for one of their bugs, and I believe they added code to ensure that nodeIP is always the first address.

Comment 11 Matt Wringe 2017-08-04 15:03:12 UTC
Are the IP addresses always returned in that order, or can the order change?

Either way, it shouldn't really matter. You have your cluster configured to be available across multiple ip addresses / hostnames. If your certificates are not valid for all of these addresses, then it means your cluster is not configured properly and heapster is properly rejecting the certificates since they are invalid.

Comment 12 Eduardo Minguez 2017-08-04 17:44:36 UTC
(In reply to Matt Wringe from comment #11)
> Is the ip address returned always in that order? or can the order its
> returned be changed?
> 
> Either way, it shouldn't really matter. You have your cluster configured to
> be available across multiple ip addresses / hostnames. If your certificates
> are not valid for all of these addresses, then it means your cluster is not
> configured properly and heapster is properly rejecting the certificates
> since they are invalid.

But those other IPs are configured automatically even though the nodeIP parameter in the node configuration file is properly set (some OpenStack autodiscovery? it seems to autodiscover the floating IPs on the nodes that have one)... is there a way to force the node to register just a specific interface? That's what the "nodeIP" parameter should do, right?[1]
The thing is, I think heapster should just query the specific nodeIP interface (I don't know if there is such an API call or something)... maybe add a heapster parameter to specify a network, or, if the node has "autodiscovered" IPs, loop over them until you get a proper answer.
Thanks

[1] https://kubernetes.io/docs/admin/kubelet/

Comment 13 Scott Dodson 2017-08-04 18:00:09 UTC
Rajat,

I think you worked on cleaning up issues with the router on nodes with multiple IP addresses. Can you provide any feedback on this bug?

Comment 21 Solly Ross 2018-02-05 21:08:12 UTC
From the Heapster side, Heapster will use the *last* of the valid InternalIP addresses, as a quirk of the logic that tries to figure out which of the IP addresses to use.

We could patch this in our Heapster.  I'm not sure there's good reasoning to suggest it to upstream.

The relevant code is here: https://github.com/kubernetes/heapster/blob/master/metrics/sources/kubelet/kubelet.go#L313

Comment 22 Dmitry Zhukovski 2018-02-06 08:28:33 UTC
Since we already have two different use cases, I guess it would be nice to fix this downstream, or in both places.

Comment 37 Jatan Malde 2019-02-15 05:00:31 UTC
Hello,

Are we targeting any workaround for this one? As per the comment below, I have a customer (IHAC) using multiple IPs for the nodes.

   https://bugzilla.redhat.com/show_bug.cgi?id=1477989#c21

Let me know your thoughts on it. 

Thanks.

Comment 49 Red Hat Bugzilla 2023-09-15 00:03:19 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

