Bug 1758160 - [3.11] AWS: secondary IP address order still wrong
Summary: [3.11] AWS: secondary IP address order still wrong
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.1.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.11.z
Assignee: Dan Winship
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On: 1734319
Blocks:
Reported: 2019-10-03 12:54 UTC by Dan Winship
Modified: 2019-11-18 22:09 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1734319
Environment:
Last Closed: 2019-10-03 13:03:25 UTC
Target Upstream Version:
Embargoed:


Description Dan Winship 2019-10-03 12:54:53 UTC
+++ This bug was initially created as a clone of Bug #1734319 +++

Description of problem:

We tested 4.1.8, which should include the fix for bug 1696628, but there are still nodes with the wrong order of addresses.

Version-Release number of selected component (if applicable):
4.1.8

How reproducible:
Install a cluster with machineCIDR: 192.168.32.0/19

Steps to Reproduce:
1. Add additional subnets for all 3 AZs: 192.168.56.0/23, 192.168.58.0/23 and 192.168.60.0/23
2. Add an additional interface to each node
3. Reboot the whole cluster

Actual results:

$ oc get node
NAME                                              STATUS     ROLES            AGE   VERSION
ip-192-168-48-169.eu-central-1.compute.internal   Ready      infra,worker     20h   v1.13.4+ab8449285
ip-192-168-48-9.eu-central-1.compute.internal     Ready      master           20h   v1.13.4+ab8449285
ip-192-168-49-118.eu-central-1.compute.internal   NotReady   primary,worker   20h   v1.13.4+ab8449285
ip-192-168-50-9.eu-central-1.compute.internal     Ready      master           20h   v1.13.4+ab8449285
ip-192-168-51-0.eu-central-1.compute.internal     NotReady   infra,worker     20h   v1.13.4+ab8449285
ip-192-168-51-1.eu-central-1.compute.internal     Ready      primary,worker   20h   v1.13.4+ab8449285
ip-192-168-52-9.eu-central-1.compute.internal     Ready      master           20h   v1.13.4+ab8449285
ip-192-168-53-77.eu-central-1.compute.internal    NotReady   primary,worker   20h   v1.13.4+ab8449285
ip-192-168-53-97.eu-central-1.compute.internal    NotReady   infra,worker     20h   v1.13.4+ab8449285


$ oc get node ip-192-168-53-97.eu-central-1.compute.internal -o yaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    machine.openshift.io/machine: openshift-machine-api/t102-7g2zc-infra-m4large-eu-central-1c-7r6rk
    machineconfiguration.openshift.io/currentConfig: rendered-worker-d9f0d4c1d4ad58f19abd3eb7a491fa47
    machineconfiguration.openshift.io/desiredConfig: rendered-worker-d9f0d4c1d4ad58f19abd3eb7a491fa47
    machineconfiguration.openshift.io/state: Done
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2019-07-29T12:19:21Z"
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: m4.large
    beta.kubernetes.io/os: linux
    failure-domain.beta.kubernetes.io/region: eu-central-1
    failure-domain.beta.kubernetes.io/zone: eu-central-1c
    kubernetes.io/hostname: ip-192-168-53-97
    node-role.kubernetes.io/infra: ""
    node-role.kubernetes.io/worker: ""
    node.openshift.io/os_id: rhcos
    node.openshift.io/os_version: "4.1"
  name: ip-192-168-53-97.eu-central-1.compute.internal
  resourceVersion: "475138"
  selfLink: /api/v1/nodes/ip-192-168-53-97.eu-central-1.compute.internal
  uid: 18131f11-b1fb-11e9-8c9e-0627ac171b24
spec:
  providerID: aws:///eu-central-1c/i-0070510acd0211610
  taints:
  - effect: NoSchedule
    key: node.kubernetes.io/not-ready
    timeAdded: "2019-07-29T19:16:20Z"
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    timeAdded: "2019-07-30T07:16:07Z"
status:
  addresses:
  - address: 192.168.61.113
    type: InternalIP
  - address: 192.168.53.97
    type: InternalIP
  - address: ip-192-168-53-97.eu-central-1.compute.internal
    type: InternalDNS
  - address: ip-192-168-53-97.eu-central-1.compute.internal
    type: Hostname
  allocatable:
    attachable-volumes-aws-ebs: "39"
    cpu: 1500m
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 7548500Ki
    pods: "250"
  capacity:
    attachable-volumes-aws-ebs: "39"
    cpu: "2"
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 8162900Ki
    pods: "250"
  conditions:
  - lastHeartbeatTime: "2019-07-30T08:52:06Z"
    lastTransitionTime: "2019-07-29T12:19:21Z"
    message: kubelet has sufficient memory available
    reason: KubeletHasSufficientMemory
    status: "False"
    type: MemoryPressure
  - lastHeartbeatTime: "2019-07-30T08:52:06Z"
    lastTransitionTime: "2019-07-29T12:19:21Z"
    message: kubelet has no disk pressure
    reason: KubeletHasNoDiskPressure
    status: "False"
    type: DiskPressure
  - lastHeartbeatTime: "2019-07-30T08:52:06Z"
    lastTransitionTime: "2019-07-29T12:19:21Z"
    message: kubelet has sufficient PID available
    reason: KubeletHasSufficientPID
    status: "False"
    type: PIDPressure
  - lastHeartbeatTime: "2019-07-30T08:52:06Z"
    lastTransitionTime: "2019-07-29T19:16:01Z"
    message: 'runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady
      message:Network plugin returns error: cni config uninitialized'
    reason: KubeletNotReady
    status: "False"
    type: Ready
  daemonEndpoints:
    kubeletEndpoint:
      Port: 10250
  images:
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:922b6a35a8fccd340acfadc06ed4d61b3bb52ccbbde266308d0d4c9f33cbd5bf
    - <none>:<none>
    sizeBytes: 759456116
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:29f6d90c6b0130aeb4adfa554630feb56220cf2b253f63225e01c214739bd11f
    - <none>:<none>
    sizeBytes: 393935893
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:11503fe3d8ac3a25a7b58575f8c0a0d9923780c56765f8fb805274f6e6eeb3fc
    - <none>:<none>
    sizeBytes: 326787011
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6d80264c4d43421fa342c43d6fdc7f953e7d4086d82b1314c4c757d02ef57cc7
    - <none>:<none>
    sizeBytes: 309636152
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b06aa13d7d69c5e40f83b3c5ab448c23981debd037357befb94cd8b0a76539cd
    - <none>:<none>
    sizeBytes: 308816069
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9fb0aee3654a5a3085cf68e1733d954bce0fcd29259c472b90d96f0ab3b6cae5
    - <none>:<none>
    sizeBytes: 308244944
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:315edc515624d3ea914ca8ff29bb696ff867e2201567aed3b512663c7d180235
    - <none>:<none>
    sizeBytes: 307705316
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e615166bd1582c2df4d60953988bb46f569e5c8ccf3297c3adeaa0ca2a655589
    - <none>:<none>
    sizeBytes: 305536176
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ba9de202096a20fb2fd5d90c461b25d0f716b10c39f7cc97efdd6e8ecd238288
    - <none>:<none>
    sizeBytes: 290850353
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2bb650de48f0591ee1a96ad49ec1ae7fee6c63ced2841f2bc10a935344339524
    - <none>:<none>
    sizeBytes: 289813489
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9338f63ccf727b4f541148086a26d6cceb73112b1b5335c9deadfdb383a5ba44
    - <none>:<none>
    sizeBytes: 288704566
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:22ed5f9b584b127735cf3885cf68402a99e1eebc9be9ca5232136e9ceb0b0c6e
    - <none>:<none>
    sizeBytes: 288032413
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:00b5664ace1f77f884b2f3c74b5672b0024ada16799faf053ff342385bc1af7d
    - <none>:<none>
    sizeBytes: 272086341
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1df6aa92e871743fdf3d7531207db99242dcdfe2d7f2aee7ca4e444b98e73789
    - <none>:<none>
    sizeBytes: 269911675
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:be6439d5942d4749fe19c933613e785e087fc89d3588b0fb0a6d438e74602488
    - <none>:<none>
    sizeBytes: 269204107
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3d9537613c6069794d3522dc7c296f1460f3b92bf6227f24d278ab46a2be5110
    - <none>:<none>
    sizeBytes: 259800180
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4272ef33e1dd99f1579d50645cd92c30a06e3e7ec141626b82c7ede1b62299db
    - <none>:<none>
    sizeBytes: 254212177
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bba5dc11e50b1bbb5785e5585aad2d49db2e9418bdfc47e569b904e816880c9e
    - <none>:<none>
    sizeBytes: 251498500
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2c7d5f372ef8831f843e18098b08364ef4eecd7e3a20feabc9d8bdcc04f94a68
    - <none>:<none>
    sizeBytes: 240643246
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:38a2a4472fb0f56d61cd7f8a269b28c065a87eaf33f13d5df18c88533ded65ce
    - <none>:<none>
    sizeBytes: 240182053
  - names:
    - mtr.external.otc.telekomcloud.com/ocp4/mcs-cluster-health@sha256:74540595cd6b7c8a94ffc04e7ef95f084b149ac287046b06c6e25c67c0bb99f7
    - mtr.external.otc.telekomcloud.com/ocp4/mcs-cluster-health:v1.0.24
    sizeBytes: 146449258
  - names:
    - mtr.external.otc.telekomcloud.com/ocp4/mcs-letsencrypt@sha256:19d8e3a6f4b01240e647561485e81ab1ec37a62c72ec3a7dbd1bbe675f18e79b
    - mtr.external.otc.telekomcloud.com/ocp4/mcs-letsencrypt:v1.0.16
    sizeBytes: 139951428
  nodeInfo:
    architecture: amd64
    bootID: 82d3eb6c-2cee-43f8-8045-fe0c109b71c9
    containerRuntimeVersion: cri-o://1.13.10-0.1.dev.rhaos4.1.git9e2e1de.el8-dev
    kernelVersion: 4.18.0-80.4.2.el8_0.x86_64
    kubeProxyVersion: v1.13.4+ab8449285
    kubeletVersion: v1.13.4+ab8449285
    machineID: 951d900f24a84243806a44286a7fc01d
    operatingSystem: linux
    osImage: Red Hat Enterprise Linux CoreOS 410.8.20190724.0 (Ootpa)
    systemUUID: ec2dfcda-dd34-ba52-02ec-1e788553beb9
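
In the output above, the first InternalIP listed is 192.168.61.113, which lies in the additional subnet 192.168.60.0/23, while the node's name matches its other InternalIP, 192.168.53.97. All three additional /23 subnets fall inside the machineCIDR 192.168.32.0/19, so the machineCIDR alone does not tell the two addresses apart. The following is a minimal POSIX `sh` sketch that classifies each InternalIP against the additional subnets from the reproduction steps; the helper names `ip_to_int` and `in_cidr` are illustrative and not part of any OpenShift tool.

```sh
#!/bin/sh
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  old_ifs=$IFS
  IFS=.
  set -- $1          # split "a.b.c.d" into $1..$4
  IFS=$old_ifs
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Succeed if IPv4 address $1 lies inside the CIDR block $2 (e.g. 10.0.0.0/8).
# Assumes a shell with 64-bit arithmetic, so "/0" yields a zero mask.
in_cidr() {
  bits=${2#*/}
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$1") & mask )) -eq $(( $(ip_to_int "${2%/*}") & mask )) ]
}

# Additional subnets from the reproduction steps, and the node's
# InternalIPs in the order shown in status.addresses.
extra_subnets="192.168.56.0/23 192.168.58.0/23 192.168.60.0/23"
for addr in 192.168.61.113 192.168.53.97; do
  where="not in an additional subnet"
  for cidr in $extra_subnets; do
    if in_cidr "$addr" "$cidr"; then where="additional subnet $cidr"; fi
  done
  echo "$addr: $where"
done
```

Running it prints `192.168.61.113: additional subnet 192.168.60.0/23` followed by `192.168.53.97: not in an additional subnet`.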

Expected results:

Additional info:

--- Additional comment from Florin Peter on 2019-07-30 04:58:05 EDT ---



--- Additional comment from Dan Winship on 2019-07-30 06:35:48 EDT ---

> 2. Add to each node a additional interface

The specific problem that bug 1696628 fixed was this: if you had a *single* interface with multiple IPs, the cloud provider reported the IPs in the correct order, but they ended up listed on the node in a different order.

It's possible that there is a different bug for multiple interfaces, where the cloud provider is just not reporting the IPs in the correct order to begin with.

Can you get the output (from the node) of:

  ip link

  curl -s http://169.254.169.254/latest/meta-data/network/interfaces/macs/

  for MAC in $(curl -s http://169.254.169.254/latest/meta-data/network/interfaces/macs/); do
    curl -s http://169.254.169.254/latest/meta-data/network/interfaces/macs/$MAC/device-number
    echo ": $MAC"
  done

--- Additional comment from Florin Peter on 2019-07-30 06:43:20 EDT ---

Hey Dan,

thanks for picking this up, I really appreciate it ;)

[root@ip-192-168-53-97 core]# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 0a:e5:72:94:4e:cc brd ff:ff:ff:ff:ff:ff
3: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 0a:21:c9:18:5b:38 brd ff:ff:ff:ff:ff:ff
4: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 86:2d:75:71:3a:8b brd ff:ff:ff:ff:ff:ff
5: tun0: <BROADCAST,MULTICAST> mtu 8951 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 1e:fa:a6:8e:03:8c brd ff:ff:ff:ff:ff:ff
6: vxlan_sys_4789: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65000 qdisc noqueue master ovs-system state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether d6:c5:31:0b:11:d2 brd ff:ff:ff:ff:ff:ff
7: br0: <BROADCAST,MULTICAST> mtu 8951 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether a6:3c:4e:61:2f:40 brd ff:ff:ff:ff:ff:ff

[root@ip-192-168-53-97 core]# curl -s http://169.254.169.254/latest/meta-data/network/interfaces/macs/
0a:21:c9:18:5b:38/
0a:e5:72:94:4e:cc/

[root@ip-192-168-53-97 core]# for MAC in $(curl -s http://169.254.169.254/latest/meta-data/network/interfaces/macs/); do
>     curl -s http://169.254.169.254/latest/meta-data/network/interfaces/macs/$MAC/device-number
>     echo ": $MAC"
>   done
1: 0a:21:c9:18:5b:38/
0: 0a:e5:72:94:4e:cc/

--- Additional comment from Dan Winship on 2019-07-30 07:34:22 EDT ---

> 2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP mode DEFAULT group default qlen 1000
>     link/ether 0a:e5:72:94:4e:cc brd ff:ff:ff:ff:ff:ff
> 3: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP mode DEFAULT group default qlen 1000
>     link/ether 0a:21:c9:18:5b:38 brd ff:ff:ff:ff:ff:ff

> 1: 0a:21:c9:18:5b:38/
> 0: 0a:e5:72:94:4e:cc/

OK, so the AWS cloudprovider code returns IP addresses sorted by interface in the order that http://169.254.169.254/latest/meta-data/network/interfaces/macs/ returns them, which appears to be sorted lexicographically by MAC rather than sorted by device number. So you end up with the ens4 IPs before the ens3 ones.
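
To illustrate the ordering described above, here is a minimal POSIX `sh` sketch that walks the same instance metadata URLs but orders the interfaces by `device-number` (0 being the primary interface, ens3 on this node) before printing each interface's private IPs from the `local-ipv4s` metadata key. It only illustrates the intended ordering and is not the upstream fix, which lives in the Go code of the AWS cloud provider. The function names `imds_get`, `macs_by_device` and `ips_by_device` are hypothetical, and the sketch assumes plain unauthenticated (IMDSv1) metadata requests, as in the commands requested earlier.

```sh
#!/bin/sh
# Base URL of the instance metadata interface listing (assumes IMDSv1,
# i.e. plain GET requests without a session token).
IMDS=http://169.254.169.254/latest/meta-data/network/interfaces/macs

# Fetch one metadata path relative to $IMDS.
imds_get() { curl -s "$IMDS/$1"; }

# List interface MACs ordered by device-number (0 = primary interface)
# instead of in the lexicographic order of the "macs/" listing.
macs_by_device() {
  for mac in $(imds_get ""); do
    mac=${mac%/}     # the listing returns entries like "0a:21:...:38/"
    printf '%s %s\n' "$(imds_get "$mac/device-number")" "$mac"
  done | sort -n -k1,1 | awk '{print $2}'
}

# Print each interface's private IPv4 addresses, primary interface first.
ips_by_device() {
  for mac in $(macs_by_device); do
    printf '%s\n' "$(imds_get "$mac/local-ipv4s")"
  done
}
```

On the node from this bug, `macs_by_device` would list `0a:e5:72:94:4e:cc` (device 0, ens3) before `0a:21:c9:18:5b:38` (device 1, ens4), the reverse of the `macs/` listing.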

This can be fixed but will require another upstream patch and backport. As a workaround, you could delete and recreate your extra interfaces until you end up with MAC addresses that sort correctly.
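
For the workaround, whether the current MACs already "sort correctly" can be checked on the node. The sketch below (POSIX `sh`; the function name `macs_sort_correctly` is hypothetical) reads the output of the `device-number` loop above, lines such as `1: 0a:21:c9:18:5b:38/`, and succeeds only if sorting the MACs as strings yields the same order as sorting them by device number.

```sh
#!/bin/sh
# Succeed if ordering the interfaces by MAC (as strings) gives the same
# order as ordering them by device number. Input on stdin: lines of the
# form "N: MAC/", as printed by the device-number loop earlier in this bug.
macs_sort_correctly() {
  input=$(cat)
  # Order the "macs/" listing appears to use: lexicographic by MAC.
  by_mac=$(printf '%s\n' "$input" | LC_ALL=C sort -b -k2,2 | awk '{print $2}')
  # Order that matches the device numbering (0 = primary interface).
  by_dev=$(printf '%s\n' "$input" | sort -n -k1,1 | awk '{print $2}')
  [ "$by_mac" = "$by_dev" ]
}

# Example with the values reported above; prints
# "MAC order differs from device order".
if printf '1: 0a:21:c9:18:5b:38/\n0: 0a:e5:72:94:4e:cc/\n' | macs_sort_correctly; then
  echo "MAC order matches device order"
else
  echo "MAC order differs from device order"
fi
```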

--- Additional comment from Florin Peter on 2019-07-30 07:44:28 EDT ---

Thanks for the explanation.

Well, the workaround is not really helpful for us, as we add the additional interfaces automatically.
We need the extra interface because it is a security requirement from our security team, which means that we are currently blocked by this issue.

What do you think, is it possible to get this fixed and backported to 4.1?

--- Additional comment from Seth Jennings on 2019-07-30 11:02:27 EDT ---

aligning component with assignee

--- Additional comment from Dan Winship on 2019-07-30 11:37:26 EDT ---

not really POST yet; I linked to the upstream bug, not an Origin PR

--- Additional comment from Dan Winship on 2019-08-23 11:33:39 EDT ---

This needs an upstream commit that hasn't merged yet which will then need to be backported. It's not going to make it in time for 4.2.0. We can get it into a 4.2.z release as soon as it's fixed in master.

Comment 1 Dan Winship 2019-10-03 13:03:25 UTC
oops, no, we're only backporting this to 4.1

