Bug 1758158

Summary: [4.2] AWS: secondary IP address order still wrong
Product: OpenShift Container Platform
Reporter: Dan Winship <danw>
Component: Networking
Assignee: Dan Winship <danw>
Networking sub component: openshift-sdn
QA Contact: huirwang
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
CC: ajohn
Version: 4.1.z
Target Milestone: ---
Target Release: 4.2.z
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Story Points: ---
Clone Of: 1734319
Last Closed: 2019-11-26 18:46:13 UTC
Bug Depends On: 1734319

Description Dan Winship 2019-10-03 12:54:37 UTC
+++ This bug was initially created as a clone of Bug #1734319 +++

Description of problem:

We tested 4.1.8, which should include the fix for bug 1696628, but there are still nodes whose addresses are listed in the wrong order.

Version-Release number of selected component (if applicable):

How reproducible:
Install a cluster with machineCIDR:

Steps to Reproduce:
1. Add additional subnets for all 3 AZs
2. Add an additional interface to each node
3. Reboot the whole cluster

Actual results:

$ oc get node
NAME                                              STATUS     ROLES            AGE   VERSION
ip-192-168-48-169.eu-central-1.compute.internal   Ready      infra,worker     20h   v1.13.4+ab8449285
ip-192-168-48-9.eu-central-1.compute.internal     Ready      master           20h   v1.13.4+ab8449285
ip-192-168-49-118.eu-central-1.compute.internal   NotReady   primary,worker   20h   v1.13.4+ab8449285
ip-192-168-50-9.eu-central-1.compute.internal     Ready      master           20h   v1.13.4+ab8449285
ip-192-168-51-0.eu-central-1.compute.internal     NotReady   infra,worker     20h   v1.13.4+ab8449285
ip-192-168-51-1.eu-central-1.compute.internal     Ready      primary,worker   20h   v1.13.4+ab8449285
ip-192-168-52-9.eu-central-1.compute.internal     Ready      master           20h   v1.13.4+ab8449285
ip-192-168-53-77.eu-central-1.compute.internal    NotReady   primary,worker   20h   v1.13.4+ab8449285
ip-192-168-53-97.eu-central-1.compute.internal    NotReady   infra,worker     20h   v1.13.4+ab8449285

$ oc get node ip-192-168-53-97.eu-central-1.compute.internal -o yaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    machine.openshift.io/machine: openshift-machine-api/t102-7g2zc-infra-m4large-eu-central-1c-7r6rk
    machineconfiguration.openshift.io/currentConfig: rendered-worker-d9f0d4c1d4ad58f19abd3eb7a491fa47
    machineconfiguration.openshift.io/desiredConfig: rendered-worker-d9f0d4c1d4ad58f19abd3eb7a491fa47
    machineconfiguration.openshift.io/state: Done
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2019-07-29T12:19:21Z"
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: m4.large
    beta.kubernetes.io/os: linux
    failure-domain.beta.kubernetes.io/region: eu-central-1
    failure-domain.beta.kubernetes.io/zone: eu-central-1c
    kubernetes.io/hostname: ip-192-168-53-97
    node-role.kubernetes.io/infra: ""
    node-role.kubernetes.io/worker: ""
    node.openshift.io/os_id: rhcos
    node.openshift.io/os_version: "4.1"
  name: ip-192-168-53-97.eu-central-1.compute.internal
  resourceVersion: "475138"
  selfLink: /api/v1/nodes/ip-192-168-53-97.eu-central-1.compute.internal
  uid: 18131f11-b1fb-11e9-8c9e-0627ac171b24
spec:
  providerID: aws:///eu-central-1c/i-0070510acd0211610
  taints:
  - effect: NoSchedule
    key: node.kubernetes.io/not-ready
    timeAdded: "2019-07-29T19:16:20Z"
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    timeAdded: "2019-07-30T07:16:07Z"
status:
  addresses:
  - address:
    type: InternalIP
  - address:
    type: InternalIP
  - address: ip-192-168-53-97.eu-central-1.compute.internal
    type: InternalDNS
  - address: ip-192-168-53-97.eu-central-1.compute.internal
    type: Hostname
  allocatable:
    attachable-volumes-aws-ebs: "39"
    cpu: 1500m
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 7548500Ki
    pods: "250"
  capacity:
    attachable-volumes-aws-ebs: "39"
    cpu: "2"
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 8162900Ki
    pods: "250"
  conditions:
  - lastHeartbeatTime: "2019-07-30T08:52:06Z"
    lastTransitionTime: "2019-07-29T12:19:21Z"
    message: kubelet has sufficient memory available
    reason: KubeletHasSufficientMemory
    status: "False"
    type: MemoryPressure
  - lastHeartbeatTime: "2019-07-30T08:52:06Z"
    lastTransitionTime: "2019-07-29T12:19:21Z"
    message: kubelet has no disk pressure
    reason: KubeletHasNoDiskPressure
    status: "False"
    type: DiskPressure
  - lastHeartbeatTime: "2019-07-30T08:52:06Z"
    lastTransitionTime: "2019-07-29T12:19:21Z"
    message: kubelet has sufficient PID available
    reason: KubeletHasSufficientPID
    status: "False"
    type: PIDPressure
  - lastHeartbeatTime: "2019-07-30T08:52:06Z"
    lastTransitionTime: "2019-07-29T19:16:01Z"
    message: 'runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady
      message:Network plugin returns error: cni config uninitialized'
    reason: KubeletNotReady
    status: "False"
    type: Ready
  daemonEndpoints:
    kubeletEndpoint:
      Port: 10250
  images:
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:922b6a35a8fccd340acfadc06ed4d61b3bb52ccbbde266308d0d4c9f33cbd5bf
    - <none>:<none>
    sizeBytes: 759456116
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:29f6d90c6b0130aeb4adfa554630feb56220cf2b253f63225e01c214739bd11f
    - <none>:<none>
    sizeBytes: 393935893
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:11503fe3d8ac3a25a7b58575f8c0a0d9923780c56765f8fb805274f6e6eeb3fc
    - <none>:<none>
    sizeBytes: 326787011
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6d80264c4d43421fa342c43d6fdc7f953e7d4086d82b1314c4c757d02ef57cc7
    - <none>:<none>
    sizeBytes: 309636152
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b06aa13d7d69c5e40f83b3c5ab448c23981debd037357befb94cd8b0a76539cd
    - <none>:<none>
    sizeBytes: 308816069
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9fb0aee3654a5a3085cf68e1733d954bce0fcd29259c472b90d96f0ab3b6cae5
    - <none>:<none>
    sizeBytes: 308244944
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:315edc515624d3ea914ca8ff29bb696ff867e2201567aed3b512663c7d180235
    - <none>:<none>
    sizeBytes: 307705316
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e615166bd1582c2df4d60953988bb46f569e5c8ccf3297c3adeaa0ca2a655589
    - <none>:<none>
    sizeBytes: 305536176
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ba9de202096a20fb2fd5d90c461b25d0f716b10c39f7cc97efdd6e8ecd238288
    - <none>:<none>
    sizeBytes: 290850353
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2bb650de48f0591ee1a96ad49ec1ae7fee6c63ced2841f2bc10a935344339524
    - <none>:<none>
    sizeBytes: 289813489
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9338f63ccf727b4f541148086a26d6cceb73112b1b5335c9deadfdb383a5ba44
    - <none>:<none>
    sizeBytes: 288704566
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:22ed5f9b584b127735cf3885cf68402a99e1eebc9be9ca5232136e9ceb0b0c6e
    - <none>:<none>
    sizeBytes: 288032413
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:00b5664ace1f77f884b2f3c74b5672b0024ada16799faf053ff342385bc1af7d
    - <none>:<none>
    sizeBytes: 272086341
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1df6aa92e871743fdf3d7531207db99242dcdfe2d7f2aee7ca4e444b98e73789
    - <none>:<none>
    sizeBytes: 269911675
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:be6439d5942d4749fe19c933613e785e087fc89d3588b0fb0a6d438e74602488
    - <none>:<none>
    sizeBytes: 269204107
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3d9537613c6069794d3522dc7c296f1460f3b92bf6227f24d278ab46a2be5110
    - <none>:<none>
    sizeBytes: 259800180
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4272ef33e1dd99f1579d50645cd92c30a06e3e7ec141626b82c7ede1b62299db
    - <none>:<none>
    sizeBytes: 254212177
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bba5dc11e50b1bbb5785e5585aad2d49db2e9418bdfc47e569b904e816880c9e
    - <none>:<none>
    sizeBytes: 251498500
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2c7d5f372ef8831f843e18098b08364ef4eecd7e3a20feabc9d8bdcc04f94a68
    - <none>:<none>
    sizeBytes: 240643246
  - names:
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:38a2a4472fb0f56d61cd7f8a269b28c065a87eaf33f13d5df18c88533ded65ce
    - <none>:<none>
    sizeBytes: 240182053
  - names:
    - mtr.external.otc.telekomcloud.com/ocp4/mcs-cluster-health@sha256:74540595cd6b7c8a94ffc04e7ef95f084b149ac287046b06c6e25c67c0bb99f7
    - mtr.external.otc.telekomcloud.com/ocp4/mcs-cluster-health:v1.0.24
    sizeBytes: 146449258
  - names:
    - mtr.external.otc.telekomcloud.com/ocp4/mcs-letsencrypt@sha256:19d8e3a6f4b01240e647561485e81ab1ec37a62c72ec3a7dbd1bbe675f18e79b
    - mtr.external.otc.telekomcloud.com/ocp4/mcs-letsencrypt:v1.0.16
    sizeBytes: 139951428
  nodeInfo:
    architecture: amd64
    bootID: 82d3eb6c-2cee-43f8-8045-fe0c109b71c9
    containerRuntimeVersion: cri-o://1.13.10-0.1.dev.rhaos4.1.git9e2e1de.el8-dev
    kernelVersion: 4.18.0-80.4.2.el8_0.x86_64
    kubeProxyVersion: v1.13.4+ab8449285
    kubeletVersion: v1.13.4+ab8449285
    machineID: 951d900f24a84243806a44286a7fc01d
    operatingSystem: linux
    osImage: Red Hat Enterprise Linux CoreOS 410.8.20190724.0 (Ootpa)
    systemUUID: ec2dfcda-dd34-ba52-02ec-1e788553beb9
Expected results:

Additional info:

--- Additional comment from Florin Peter on 2019-07-30 04:58:05 EDT ---

--- Additional comment from Dan Winship on 2019-07-30 06:35:48 EDT ---

> 2. Add to each node a additional interface

The specific thing that bug 1696628 was fixing was that if you had a *single* interface with multiple IPs, the cloud provider would report the IPs in the correct order, but they would end up listed in the node in a different order.

It's possible that there is a different bug for multiple interfaces, where the cloud provider is just not reporting the IPs in the correct order to begin with.

Can you get the output (from the node) of:

  ip link

  curl -s http://169.254.169.254/latest/meta-data/network/interfaces/macs/

  for MAC in $(curl -s http://169.254.169.254/latest/meta-data/network/interfaces/macs/); do
    curl -s http://169.254.169.254/latest/meta-data/network/interfaces/macs/${MAC}device-number
    echo ": $MAC"
  done

--- Additional comment from Florin Peter on 2019-07-30 06:43:20 EDT ---

Hey Dan,

thanks for picking this up, I really appreciate it ;)

[root@ip-192-168-53-97 core]# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 0a:e5:72:94:4e:cc brd ff:ff:ff:ff:ff:ff
3: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 0a:21:c9:18:5b:38 brd ff:ff:ff:ff:ff:ff
4: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 86:2d:75:71:3a:8b brd ff:ff:ff:ff:ff:ff
5: tun0: <BROADCAST,MULTICAST> mtu 8951 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 1e:fa:a6:8e:03:8c brd ff:ff:ff:ff:ff:ff
6: vxlan_sys_4789: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65000 qdisc noqueue master ovs-system state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether d6:c5:31:0b:11:d2 brd ff:ff:ff:ff:ff:ff
7: br0: <BROADCAST,MULTICAST> mtu 8951 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether a6:3c:4e:61:2f:40 brd ff:ff:ff:ff:ff:ff

[root@ip-192-168-53-97 core]# curl -s http://169.254.169.254/latest/meta-data/network/interfaces/macs/

[root@ip-192-168-53-97 core]# for MAC in $(curl -s http://169.254.169.254/latest/meta-data/network/interfaces/macs/); do
>     curl -s http://169.254.169.254/latest/meta-data/network/interfaces/macs/${MAC}device-number
>     echo ": $MAC"
>   done
1: 0a:21:c9:18:5b:38/
0: 0a:e5:72:94:4e:cc/

--- Additional comment from Dan Winship on 2019-07-30 07:34:22 EDT ---

> 2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP mode DEFAULT group default qlen 1000
>     link/ether 0a:e5:72:94:4e:cc brd ff:ff:ff:ff:ff:ff
> 3: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP mode DEFAULT group default qlen 1000
>     link/ether 0a:21:c9:18:5b:38 brd ff:ff:ff:ff:ff:ff

> 1: 0a:21:c9:18:5b:38/
> 0: 0a:e5:72:94:4e:cc/

OK, so the AWS cloudprovider code returns IP addresses grouped by interface, in the order that the EC2 instance metadata service lists the interfaces, which appears to be lexicographic by MAC address rather than sorted by device number. So you end up with the ens4 IPs listed before the ens3 ones.
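The ordering difference can be illustrated with the device-number/MAC pairs from the output above (a minimal sketch, not the actual cloudprovider code; ens3 is device 0, ens4 is device 1):

```shell
# Sketch of the two orderings, using the device-number/MAC pairs reported
# above (not the actual cloudprovider code). ens3 = device 0, ens4 = device 1.
pairs='1 0a:21:c9:18:5b:38
0 0a:e5:72:94:4e:cc'

# Buggy ordering: lexicographic by MAC, as the macs/ metadata listing
# returns them -- ens4's MAC (0a:21:...) sorts before ens3's (0a:e5:...).
printf '%s\n' "$pairs" | awk '{print $2}' | sort

# Fixed ordering: numeric sort by device-number, so the primary interface
# ens3 (device 0) and its IPs come first.
printf '%s\n' "$pairs" | sort -n | awk '{print $2}'
```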

This can be fixed but will require another upstream patch and backport. As a workaround, you could delete and recreate your extra interfaces until you end up with MAC addresses that sort correctly.

--- Additional comment from Florin Peter on 2019-07-30 07:44:28 EDT ---

Thanks for the explanation.

Well, the workaround is not really helpful for us, as we add the additional interfaces automatically.
We need the extra interface to satisfy a requirement from our security team, which means we are currently blocked by this issue.

Do you think it is possible to get this fixed and backported for 4.1?

--- Additional comment from Seth Jennings on 2019-07-30 11:02:27 EDT ---

aligning component with assignee

--- Additional comment from Dan Winship on 2019-07-30 11:37:26 EDT ---

not really POST yet; I linked to the upstream bug, not an Origin PR

--- Additional comment from Dan Winship on 2019-08-23 11:33:39 EDT ---

This needs an upstream commit that hasn't merged yet which will then need to be backported. It's not going to make it in time for 4.2.0. We can get it into a 4.2.z release as soon as it's fixed in master.

Comment 4 errata-xmlrpc 2019-11-26 18:46:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.