Bug 1477989 - Heapster collecting metrics from the wrong networking interface of the node [NEEDINFO]
Status: NEW
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.4.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.8.0
Assigned To: Scott Dodson
QA Contact: Johnny Liu
Whiteboard: UpcomingRelease
Depends On:
Blocks:
 
Reported: 2017-08-03 08:13 EDT by Eduardo Minguez
Modified: 2017-10-17 15:32 EDT (History)
7 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
sdodson: needinfo? (rchopra)


Attachments
oc get nodes -o yaml (31.91 KB, text/plain)
2017-08-04 09:49 EDT, Eduardo Minguez
no flags Details

Description Eduardo Minguez 2017-08-03 08:13:27 EDT
Description of problem:
Using OCP 3.4 on top of OSP 10, the metrics deployment went smoothly, except that the heapster pod tries to get metrics from the nodes on the wrong interface when a node has more than one network interface:

--8<--
...
E0803 09:20:35.043086       1 kubelet.go:230] error while getting containers from Kubelet: failed to get all container stats from Kubelet URL "https://172.18.20.16:10250/stats/container/": Post https://172.18.20.16:10250/stats/container/: x509: certificate is valid for 172.18.10.18, not 172.18.20.16
E0803 09:20:35.066144       1 kubelet.go:230] error while getting containers from Kubelet: failed to get all container stats from Kubelet URL "https://172.18.20.9:10250/stats/container/": Post https://172.18.20.9:10250/stats/container/: x509: certificate is valid for 10.19.114.187, 172.18.10.12, not 172.18.20.9
...
-->8--
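The failure is ordinary certificate hostname/IP verification: the node certificate's subject alternative names do not include the address heapster dialed. A minimal Python sketch of that check (a simplified illustration, not heapster's actual TLS code; the SAN lists are taken from the log lines above, shaped like Python's `ssl.SSLSocket.getpeercert()` output):

```python
import ipaddress

def cert_valid_for(dialed_addr, san_entries):
    """Mimic the IP verification a TLS client performs: the address we
    dialed must appear among the certificate's IP subjectAltName entries,
    e.g. [("IP Address", "172.18.10.18")]."""
    dialed = ipaddress.ip_address(dialed_addr)
    for kind, value in san_entries:
        if kind == "IP Address" and ipaddress.ip_address(value) == dialed:
            return True
    return False

# SAN list from the second log line above
san = [("IP Address", "10.19.114.187"), ("IP Address", "172.18.10.12")]
print(cert_valid_for("172.18.20.9", san))   # address heapster dialed -> False
print(cert_valid_for("172.18.10.12", san))  # address the cert covers -> True
```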

Version-Release number of selected component (if applicable):
openshift v3.4.1.44
kubernetes v1.4.0+776c994


How reproducible:
Deploy OCP 3.4.1 on hosts with more than 1 interface.


Steps to Reproduce:
1. Deploy OCP 3.4.1 on hosts with more than 1 interface
2. Deploy OCP metrics (using deployer pod as it is 3.4)
3. Check heapster logs

Actual results:
--8<--
...
E0803 09:20:35.043086       1 kubelet.go:230] error while getting containers from Kubelet: failed to get all container stats from Kubelet URL "https://172.18.20.16:10250/stats/container/": Post https://172.18.20.16:10250/stats/container/: x509: certificate is valid for 172.18.10.18, not 172.18.20.16
E0803 09:20:35.066144       1 kubelet.go:230] error while getting containers from Kubelet: failed to get all container stats from Kubelet URL "https://172.18.20.9:10250/stats/container/": Post https://172.18.20.9:10250/stats/container/: x509: certificate is valid for 10.19.114.187, 172.18.10.12, not 172.18.20.9
...
-->8--

Expected results:
Heapster should query the proper kubelet interface


Additional info:

The nodes have two different interfaces, and the yaml that defines the node contains:

--8<--
status:
  addresses:
  - address: 172.18.10.18
    type: InternalIP
  - address: 172.18.20.16
    type: InternalIP
-->8--

Where the "good one" is 172.18.10.X. I've tried to delete the 172.18.20.X definition in the node object, but it doesn't work; the IP comes back again. As far as I've seen, the only options for specifying IPs in Kubernetes are ExternalIP or InternalIP[1], so that's a dead end.

The behavior seems random: some graphs have a lot of missing metrics, so I assume the kubelet returns both interfaces in random order and heapster tries the first one.

I've been looking for a heapster parameter to modify this behavior at runtime but I couldn't find anything.

As a workaround, I've tried specifying the insecure=true parameter in the heapster rc, in the heapster "source" command-line parameter:

--8<--
        - --source=kubernetes:https://kubernetes.default.svc:443?useServiceAccount=true&kubeletHttps=true&kubeletPort=10250&insecure=true
-->8--

After restarting the pod, the metrics show up smoothly and there are no errors in the heapster pod.
Comment 1 Matt Wringe 2017-08-03 11:55:15 EDT
This is due to your nodes using certificates that are not valid for the IP addresses that the master API returns for the nodes.

If your nodes have multiple network interfaces, then the certificates for those nodes have to be valid for each of the network interfaces that a client would use to access the node.
Comment 2 Scott Dodson 2017-08-03 15:57:39 EDT
Eduardo,

Do you have a nodeIP set in /etc/origin/node/node-config.yaml and if so is it 172.18.10.18 or 172.18.20.16?

If not, can you stop the node, add "nodeIP: 172.18.20.16", `oc delete node`, then start the node so that it re-registers?



And, if that works, then adding 'openshift_ip' variable to each of your nodes at install time will ensure that the value is set for each node.
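For reference, the install-time equivalent might look something like this in the Ansible inventory (a sketch only; the hostnames and addresses are borrowed from the output in comment 3, and the rest of the inventory is omitted):

```ini
# Ansible inventory (fragment): pin each node's registered IP at install time
[nodes]
app-node-0.control.edu.example.com    openshift_ip=172.18.10.18
infra-node-1.control.edu.example.com  openshift_ip=172.18.10.12
```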
Comment 3 Eduardo Minguez 2017-08-04 04:14:08 EDT
(In reply to Scott Dodson from comment #2)
> Eduardo,
> 
> Do you have a nodeIP set in /etc/origin/node/node-config.yaml and if so is
> it 172.18.10.18 or 172.18.20.16?
> 
> If not, can you stop the node, add "nodeIP: 172.18.20.16", `oc delete node`,
> then start the node so that it re-registers?
> 
> 
> 
> And, if that works, then adding 'openshift_ip' variable to each of your
> nodes at install time will ensure that the value is set for each node.

The nodeIP values are set to the 172.18.10.X network automatically at installation time, and that is the "good one" interface:

[cloud-user@bastion ansible_files]$ ansible nodes -b -a "grep -i nodeip /etc/origin/node/node-config.yaml"
app-node-0.control.edu.example.com | SUCCESS | rc=0 >>
nodeIP: 172.18.10.18

master-1.control.edu.example.com | SUCCESS | rc=0 >>
nodeIP: 172.18.10.7

master-2.control.edu.example.com | SUCCESS | rc=0 >>
nodeIP: 172.18.10.10

master-0.control.edu.example.com | SUCCESS | rc=0 >>
nodeIP: 172.18.10.5

app-node-1.control.edu.example.com | SUCCESS | rc=0 >>
nodeIP: 172.18.10.17

app-node-2.control.edu.example.com | SUCCESS | rc=0 >>
nodeIP: 172.18.10.20

infra-node-0.control.edu.example.com | SUCCESS | rc=0 >>
nodeIP: 172.18.10.8

infra-node-2.control.edu.example.com | SUCCESS | rc=0 >>
nodeIP: 172.18.10.3

infra-node-1.control.edu.example.com | SUCCESS | rc=0 >>
nodeIP: 172.18.10.12

The 172.18.20.x IPs are meant to be used just for the flannel traffic[1].

[1] https://access.redhat.com/documentation/en-us/reference_architectures/2017/html-single/deploying_red_hat_openshift_container_platform_3.4_on_red_hat_openstack_platform_10/#reference_architecture_overview
Comment 4 Scott Dodson 2017-08-04 09:22:37 EDT
Do other tasks that require access to the kubelet from the master fail? `oc exec`, `oc rsh`, `oc proxy` all route requests from the API server to the kubelet so try one of those.
Comment 5 Eduardo Minguez 2017-08-04 09:26:15 EDT
They work:

[cloud-user@bastion ~]$ oc exec docker-registry-2-1326g uptime
 13:25:35 up 1 day, 22:28,  0 users,  load average: 0.14, 0.20, 0.26
[cloud-user@bastion ~]$ oc rsh docker-registry-2-1326g
sh-4.2$ uptime
 13:25:40 up 1 day, 22:28,  0 users,  load average: 0.12, 0.19, 0.26
sh-4.2$ exit
Comment 6 Matt Wringe 2017-08-04 09:36:07 EDT
(In reply to Scott Dodson from comment #4)
> Do other tasks that require access to the kubelet from the master fail? `oc
> exec`, `oc rsh`, `oc proxy` all route requests from the API server to the
> kubelet so try one of those.

I believe they don't use the same mechanism and don't require more strict certificates in this case, so I would not count on this being a good test.
Comment 7 Matt Wringe 2017-08-04 09:37:18 EDT
If you use 'oc get nodes -o yaml', what IP addresses/hostnames are shown in status.addresses?

This is what heapster is using.
Comment 8 Eduardo Minguez 2017-08-04 09:49:39 EDT
(In reply to Matt Wringe from comment #7)
> If you use 'oc get nodes -o yaml' what ip address/hostname are being shown
> in status.addresses?
> 
> This is what heapster is using.

That's the point: it shows two IPs for each node:

--8<--
    addresses:
    - address: 172.18.10.10    <- proper one
      type: InternalIP
    - address: 172.18.20.5     <- flannel traffic interface
      type: InternalIP
-->8--

And for the nodes that have a floating IP, it shows three:

--8<--
    addresses:
    - address: 172.18.10.10    <- proper one
      type: InternalIP
    - address: 172.18.20.5     <- flannel traffic interface
      type: InternalIP
    - address: 10.19.114.198   <- floating ip (OSP based environment)
      type: ExternalIP
-->8--
Comment 9 Eduardo Minguez 2017-08-04 09:49 EDT
Created attachment 1309068 [details]
oc get nodes -o yaml
Comment 10 Scott Dodson 2017-08-04 09:52:11 EDT
If the proper IP address is always the first IP address then I suggest heapster always uses the first address. The networking team did something similar for one of their bugs and I believe they added code to ensure that nodeIP is always the first address.
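The "use the first address" suggestion can be sketched like this: walk status.addresses in order and take the first entry of the preferred type, which only works if nodeIP is guaranteed to be listed first (a simplified Python illustration, not heapster's actual Go code; the address list is taken from comment 8):

```python
def pick_node_address(addresses, preferred_type="InternalIP"):
    """Return the first address of the preferred type from a node's
    status.addresses list (as shown by `oc get nodes -o yaml`)."""
    for entry in addresses:
        if entry["type"] == preferred_type:
            return entry["address"]
    # fall back to the very first address of any type, if present
    return addresses[0]["address"] if addresses else None

# status.addresses as shown in comment 8
addresses = [
    {"address": "172.18.10.10", "type": "InternalIP"},   # nodeIP; cert is valid for this
    {"address": "172.18.20.5", "type": "InternalIP"},    # flannel traffic interface
    {"address": "10.19.114.198", "type": "ExternalIP"},  # OSP floating IP
]
print(pick_node_address(addresses))  # -> 172.18.10.10 (first InternalIP in the list)
```

If the order were ever reversed, this strategy would select the flannel address and hit the same certificate error, which is why the ordering guarantee matters.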
Comment 11 Matt Wringe 2017-08-04 11:03:12 EDT
Are the IP addresses always returned in that order, or can the order change?

Either way, it shouldn't really matter. You have your cluster configured to be available across multiple ip addresses / hostnames. If your certificates are not valid for all of these addresses, then it means your cluster is not configured properly and heapster is properly rejecting the certificates since they are invalid.
Comment 12 Eduardo Minguez 2017-08-04 13:44:36 EDT
(In reply to Matt Wringe from comment #11)
> Is the ip address returned always in that order? or can the order its
> returned be changed?
> 
> Either way, it shouldn't really matter. You have your cluster configured to
> be available across multiple ip addresses / hostnames. If your certificates
> are not valid for all of these addresses, then it means your cluster is not
> configured properly and heapster is properly rejecting the certificates
> since they are invalid.

But those other IPs are configured automatically even though the nodeIP parameter in the node configuration file is properly set (some OpenStack autodiscovery? It seems to autodiscover the floating IPs on the nodes that have one...). Is there a way to force the node to register just a specific interface? That's what the nodeIP parameter should do, right?[1]
The thing is, I think heapster should just query the specific nodeIP interface (I don't know if there is such an API call or something). Maybe a heapster parameter to specify a network, or, if the node has "autodiscovered" IPs, loop over them until one gives a proper answer...
Thanks

[1] https://kubernetes.io/docs/admin/kubelet/
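The fallback idea from this comment could be sketched as follows. Note this is purely hypothetical: heapster exposes no such option as far as this report shows, and `probe` here is a stand-in for an HTTPS stats request to <addr>:10250 with full certificate verification:

```python
def first_reachable_address(addresses, probe):
    """Try each advertised node address in turn and return the first one
    the probe accepts, instead of failing on the first certificate
    mismatch. `probe(addr)` returns True if the TLS handshake and
    request to that address succeed."""
    for addr in addresses:
        if probe(addr):
            return addr
    return None

# Simulated probe: only the address the node certificate covers succeeds
valid_for_cert = {"172.18.10.10"}
probe = lambda addr: addr in valid_for_cert

addrs = ["172.18.20.5", "172.18.10.10", "10.19.114.198"]
print(first_reachable_address(addrs, probe))  # -> 172.18.10.10
```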
Comment 13 Scott Dodson 2017-08-04 14:00:09 EDT
Rajat,

I think you worked on cleaning up issues with the router on nodes with multiple IP addresses. Can you provide any feedback on this bug?
