Bug 2104784
Summary: | Some EgressIP was not correctly assigned to the egress node under some condition | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | huirwang |
Component: | Networking | Assignee: | Andreas Karis <akaris> |
Networking sub component: | ovn-kubernetes | QA Contact: | huirwang |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | high | ||
Priority: | medium | CC: | akaris, rravaiol |
Version: | 4.11 | ||
Target Milestone: | --- | ||
Target Release: | 4.12.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | No Doc Update | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2023-01-17 19:51:47 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
huirwang
2022-07-07 06:40:32 UTC
From the logs: 2022-07-07T04:16:43.231256399Z I0707 04:16:43.231215 1 controller.go:160] Dropping key '10.0.58.151' from the cloud-private-ip-config workqueue 2022-07-07T04:16:52.761114635Z I0707 04:16:52.761065 1 controller.go:182] Assigning key: 10.0.56.170 to cloud-private-ip-config workqueue 2022-07-07T04:16:52.764943395Z I0707 04:16:52.764914 1 cloudprivateipconfig_controller.go:271] CloudPrivateIPConfig: "10.0.56.170" will be added to node: "ip-10-0-51-130.us-east-2.compute.internal" 2022-07-07T04:16:52.772128989Z I0707 04:16:52.772096 1 cloudprivateipconfig_controller.go:295] Adding finalizer to CloudPrivateIPConfig: "10.0.56.170" 2022-07-07T04:16:52.772128989Z I0707 04:16:52.772117 1 controller.go:182] Assigning key: 10.0.56.170 to cloud-private-ip-config workqueue 2022-07-07T04:16:53.223669341Z E0707 04:16:53.223624 1 controller.go:165] error syncing '10.0.56.170': error assigning CloudPrivateIPConfig: "10.0.56.170" to node: "ip-10-0-51-130.us-east-2.compute.internal", err: PrivateIpAddressLimitExceeded: Number of private addresses will exceed limit. 2022-07-07T04:16:53.223669341Z status code: 400, request id: 0110a9ad-57f4-485a-a467-24866c3bdc59, requeuing in cloud-private-ip-config workqueue 2022-07-07T04:16:53.226454229Z I0707 04:16:53.226427 1 cloudprivateipconfig_controller.go:271] CloudPrivateIPConfig: "10.0.56.170" will be added to node: "ip-10-0-51-130.us-east-2.compute.internal" 2022-07-07T04:16:53.639710866Z E0707 04:16:53.639672 1 controller.go:165] error syncing '10.0.56.170': error assigning CloudPrivateIPConfig: "10.0.56.170" to node: "ip-10-0-51-130.us-east-2.compute.internal", err: PrivateIpAddressLimitExceeded: Number of private addresses will exceed limit. (...) 2022-07-07T04:18:19.657285557Z status code: 400, request id: 372b6951-6f2c-468b-8ec7-03c069a818e0, requeuing in cloud-private-ip-config workqueue 2022-07-07T04:19:41.583437183Z I0707 04:19:41.583392 1 cloudprivateipconfig_controller.go:271] CloudPrivateIPConfig: "10.0.56.170" will be added to node: "ip-10-0-51-130.us-east-2.compute.internal" 2022-07-07T04:19:42.039827916Z E0707 04:19:42.039785 1 controller.go:165] error syncing '10.0.56.170': error assigning CloudPrivateIPConfig: "10.0.56.170" to node: "ip-10-0-51-130.us-east-2.compute.internal", err: PrivateIpAddressLimitExceeded: Number of private addresses will exceed limit. 2022-07-07T04:19:42.039827916Z status code: 400, request id: 9c765988-b831-4b81-9e0f-e06f425ad784, requeuing in cloud-private-ip-config workqueue 2022-07-07T04:22:25.885270338Z I0707 04:22:25.885205 1 cloudprivateipconfig_controller.go:271] CloudPrivateIPConfig: "10.0.56.170" will be added to node: "ip-10-0-51-130.us-east-2.compute.internal" 2022-07-07T04:22:26.394088658Z I0707 04:22:26.394056 1 controller.go:160] Dropping key '10.0.56.170' from the cloud-private-ip-config workqueue [akaris@linux must-gather.local.4851834340043511279]$ Ok, according to the CloudProviderIntf: https://github.com/openshift/cloud-network-config-controller/blob/7a3c3c9ea200d11db5128663deade53f20105ac9/pkg/cloudprovider/cloudprovider.go#L51 ~~~ 51 // GetNodeEgressIPConfiguration retrieves the egress IP configuration for 52 // the node, following the convention the cloud uses. This means 53 // specifically that: the IP capacity can be either hard-coded and global 54 // for all instance types and IP families (GCP, Azure) or variable per 55 // instance and IP family (AWS), also: the interface is either keyed by name 56 // (GCP) or ID (Azure, AWS). Note: this function should only be called when 57 // no egress IPs have been added to the node, it will return an incorrect 58 // "egress IP capacity" otherwise ~~~ Having a look at the AWS doc: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html#AvailableIpPerENI ~~~ The following table lists the maximum number of network interfaces per instance type, and the maximum number of private IPv4 addresses and IPv6 addresses per network interface. The limit for IPv6 addresses is separate from the limit for private IPv4 addresses per network interface. Not all instance types support IPv6 addressing. ~~~ So, for xlarge instances, the total is 15, minus the number of assigned IP addresses (1 in this case). The code to determine this is here: https://github.com/openshift/cloud-network-config-controller/blob/7a3c3c9ea200d11db5128663deade53f20105ac9/pkg/cloudprovider/aws.go#L278 The node reports the correct capacity here with 14 for IPv4 ~~~ [{"interface":"eni-0dd3af220dd298c34","ifaddr":{"ipv4":"10.0.48.0/20"},"capacity":{"ipv4":14,"ipv6":15}}] ~~~ The question now is why did this already fail after 13 successful assignments ... I can't reproduce this. Can you build with the following PR: https://github.com/openshift/cloud-network-config-controller/pull/50 And reproduce it? The PR will print a list of all IPs that are assigned to the interface, hopefully this will shed some more light into this issue. * The PR will print a list of all IPs that are assigned to the interface. It will show in the logs of the cloudnetworkconfigcontroller, when the assignment fails, something like: E0711 17:12:20.527460 1 controller.go:165] error syncing '10.0.129.24': error assigning CloudPrivateIPConfig: "10.0.129.24" to node: "ip-10-0-138-80.us-west-1.compute.internal", err: PrivateIpAddressLimitExceeded: Number of private addresses will exceed limit. status code: 400, request id: 534845ac-7069-4b8a-8e5e-422ff279c1fa Tried to assign the following 16 IP addresses to the interface ( 10.0.138.80 10.0.129.12 10.0.129.13 10.0.129.30 10.0.129.11 10.0.129.14 10.0.129.15 10.0.129.16 10.0.129.17 10.0.129.18 10.0.129.19 10.0.129.20 10.0.129.21 10.0.129.22 10.0.129.23 10.0.129.24), requeuing in cloud-private-ip-config workqueue We definitely have issues with the removal of CloudPrivateIPs: ~~~ [akaris@linux 2104784]$ omg logs -n openshift-cloud-network-config-controller cloud-network-config-controller-868f8459dd-4pxwd | grep 10.0.134.71 2022-07-12T02:58:05.694389392Z I0712 02:58:05.694338 1 controller.go:182] Assigning key: 10.0.134.71 to cloud-private-ip-config workqueue 2022-07-12T02:58:05.699152680Z I0712 02:58:05.699098 1 cloudprivateipconfig_controller.go:271] CloudPrivateIPConfig: "10.0.134.71" will be added to node: "ip-10-0-131-179.us-west-1.compute.internal" 2022-07-12T02:58:05.705947222Z I0712 02:58:05.705924 1 cloudprivateipconfig_controller.go:295] Adding finalizer to CloudPrivateIPConfig: "10.0.134.71" 2022-07-12T02:58:05.706031699Z I0712 02:58:05.705929 1 controller.go:182] Assigning key: 10.0.134.71 to cloud-private-ip-config workqueue 2022-07-12T02:58:06.242678494Z I0712 02:58:06.242642 1 cloudprivateipconfig_controller.go:353] Added IP address to node: "ip-10-0-131-179.us-west-1.compute.internal" for CloudPrivateIPConfig: "10.0.134.71" 2022-07-12T02:58:06.250567807Z I0712 02:58:06.250516 1 controller.go:160] Dropping key '10.0.134.71' from the cloud-private-ip-config workqueue 2022-07-12T02:58:06.255387829Z I0712 02:58:06.255363 1 controller.go:160] Dropping key '10.0.134.71' from the cloud-private-ip-config workqueue 2022-07-12T03:03:04.746097253Z I0712 03:03:04.746059 1 controller.go:182] Assigning key: 10.0.134.71 to cloud-private-ip-config workqueue 2022-07-12T03:03:04.751467833Z I0712 03:03:04.751437 1 cloudprivateipconfig_controller.go:187] CloudPrivateIPConfig: "10.0.134.71" will be deleted from node: "ip-10-0-131-179.us-west-1.compute.internal" 2022-07-12T03:03:04.759119459Z I0712 03:03:04.759089 1 controller.go:182] Assigning key: 10.0.134.71 to cloud-private-ip-config workqueue 2022-07-12T03:03:05.346106239Z I0712 03:03:05.346085 1 cloudprivateipconfig_controller.go:242] CloudPrivateIPConfig: 10.0.134.71 object has been marked for complete deletion 2022-07-12T03:03:05.346106239Z I0712 03:03:05.346099 1 cloudprivateipconfig_controller.go:249] Cleaning up IP address and finalizer for CloudPrivateIPConfig: "10.0.134.71", deleting it completely 2022-07-12T03:03:05.356521574Z I0712 03:03:05.356500 1 controller.go:160] Dropping key '10.0.134.71' from the cloud-private-ip-config workqueue 2022-07-12T03:03:05.356705007Z I0712 03:03:05.356677 1 controller.go:182] Assigning key: 10.0.134.71 to cloud-private-ip-config workqueue 2022-07-12T03:03:05.414990428Z I0712 03:03:05.414954 1 cloudprivateipconfig_controller.go:421] CloudPrivateIPConfig: "10.0.134.71" in work queue no longer exists 2022-07-12T03:03:05.414990428Z I0712 03:03:05.414980 1 controller.go:160] Dropping key '10.0.134.71' from the cloud-private-ip-config workqueue 2022-07-12T03:03:05.614967217Z I0712 03:03:05.614929 1 cloudprivateipconfig_controller.go:421] CloudPrivateIPConfig: "10.0.134.71" in work queue no longer exists 2022-07-12T03:03:05.614967217Z I0712 03:03:05.614949 1 controller.go:160] Dropping key '10.0.134.71' from the cloud-private-ip-config workqueue 2022-07-12T03:10:41.982974676Z status code: 400, request id: 518e3246-1789-4c3a-9b87-02e6898d4775 Tried to assign the following 16 IP addresses to the interface ( 10.0.131.179 10.0.134.71 10.0.147.253 10.0.178.90 10.0.173.236 10.0.163.109 10.0.188.72 10.0.177.40 10.0.146.203 10.0.185.124 10.0.128.168 10.0.162.26 10.0.132.65 10.0.190.63 10.0.156.49 10.0.161.24), requeuing in cloud-private-ip-config workqueue 2022-07-12T03:10:42.451270498Z status code: 400, request id: 061931b4-3618-4143-afa3-afa6cf11c55e Tried to assign the following 16 IP addresses to the interface ( 10.0.131.179 10.0.134.71 10.0.147.253 10.0.178.90 10.0.173.236 10.0.163.109 10.0.188.72 10.0.177.40 10.0.146.203 10.0.185.124 10.0.128.168 10.0.162.26 10.0.132.65 10.0.190.63 10.0.156.49 10.0.161.24), requeuing in cloud-private-ip-config workqueue (...) 2022-07-12T03:40:31.860759774Z status code: 400, request id: 2dd4d823-1616-4aa2-a090-ef13c56629ca Tried to assign the following 16 IP addresses to the interface ( 10.0.131.179 10.0.134.71 10.0.132.65 10.0.131.110 10.0.131.11 10.0.131.111 10.0.131.112 10.0.131.113 10.0.131.115 10.0.131.17 10.0.131.114 10.0.131.15 10.0.131.18 10.0.131.16 10.0.131.12 10.0.131.13), requeuing in cloud-private-ip-config workqueue ~~~ There's a clear race between deletion and assignment. Whereas deletion releases a single IP address from the interface, the current assignment code takes all IPs on the interface and tries to reassign them. This lead to a race for example here, meaning that the release operation for 10.0.134.71 was reported as success, but the concurrent add operation of IP 10.0.186.7 reverted that delete operation: ~~~ 2022-07-12T03:03:04.746097253Z I0712 03:03:04.746059 1 controller.go:182] Assigning key: 10.0.134.71 to cloud-private-ip-config workqueue 2022-07-12T03:03:04.751467833Z I0712 03:03:04.751437 1 cloudprivateipconfig_controller.go:187] CloudPrivateIPConfig: "10.0.134.71" will be deleted from node: "ip-10-0-131-179.us-west-1.compute.internal" 2022-07-12T03:03:04.759119459Z I0712 03:03:04.759089 1 controller.go:182] Assigning key: 10.0.134.71 to cloud-private-ip-config workqueue 2022-07-12T03:03:04.917985954Z I0712 03:03:04.917933 1 controller.go:182] Assigning key: 10.0.173.53 to cloud-private-ip-config workqueue 2022-07-12T03:03:04.924395572Z I0712 03:03:04.924345 1 cloudprivateipconfig_controller.go:187] CloudPrivateIPConfig: "10.0.173.53" will be deleted from node: "ip-10-0-131-179.us-west-1.compute.internal" 2022-07-12T03:03:04.926813866Z I0712 03:03:04.926786 1 cloudprivateipconfig_controller.go:242] CloudPrivateIPConfig: 10.0.172.68 object has been marked for complete deletion 2022-07-12T03:03:04.926813866Z I0712 03:03:04.926802 1 cloudprivateipconfig_controller.go:249] Cleaning up IP address and finalizer for CloudPrivateIPConfig: "10.0.172.68", deleting it completely 2022-07-12T03:03:04.931928216Z I0712 03:03:04.931905 1 controller.go:182] Assigning key: 10.0.173.53 to cloud-private-ip-config workqueue 2022-07-12T03:03:04.936263208Z I0712 03:03:04.936245 1 controller.go:182] Assigning key: 10.0.172.68 to cloud-private-ip-config workqueue 2022-07-12T03:03:04.936382862Z I0712 03:03:04.936367 1 controller.go:160] Dropping key '10.0.172.68' from the cloud-private-ip-config workqueue 2022-07-12T03:03:04.941801797Z I0712 03:03:04.941775 1 cloudprivateipconfig_controller.go:421] CloudPrivateIPConfig: "10.0.172.68" in work queue no longer exists 2022-07-12T03:03:04.941801797Z I0712 03:03:04.941794 1 controller.go:160] Dropping key '10.0.172.68' from the cloud-private-ip-config workqueue 2022-07-12T03:03:04.952650783Z I0712 03:03:04.952021 1 controller.go:182] Assigning key: 10.0.186.7 to cloud-private-ip-config workqueue 2022-07-12T03:03:04.956496344Z I0712 03:03:04.956475 1 cloudprivateipconfig_controller.go:271] CloudPrivateIPConfig: "10.0.186.7" will be added to node: "ip-10-0-131-179.us-west-1.compute.internal" 2022-07-12T03:03:04.962999877Z I0712 03:03:04.962978 1 cloudprivateipconfig_controller.go:295] Adding finalizer to CloudPrivateIPConfig: "10.0.186.7" 2022-07-12T03:03:04.963024237Z I0712 03:03:04.963009 1 controller.go:182] Assigning key: 10.0.186.7 to cloud-private-ip-config workqueue 2022-07-12T03:03:05.344672951Z I0712 03:03:05.344635 1 controller.go:182] Assigning key: 10.0.144.60 to cloud-private-ip-config workqueue 2022-07-12T03:03:05.346106239Z I0712 03:03:05.346085 1 cloudprivateipconfig_controller.go:242] CloudPrivateIPConfig: 10.0.134.71 object has been marked for complete deletion 2022-07-12T03:03:05.346106239Z I0712 03:03:05.346099 1 cloudprivateipconfig_controller.go:249] Cleaning up IP address and finalizer for CloudPrivateIPConfig: "10.0.134.71", deleting it completely 2022-07-12T03:03:05.349547648Z I0712 03:03:05.349506 1 cloudprivateipconfig_controller.go:187] CloudPrivateIPConfig: "10.0.144.60" will be deleted from node: "ip-10-0-131-179.us-west-1.compute.internal" 2022-07-12T03:03:05.356521574Z I0712 03:03:05.356500 1 controller.go:160] Dropping key '10.0.134.71' from the cloud-private-ip-config workqueue 2022-07-12T03:03:05.356705007Z I0712 03:03:05.356677 1 controller.go:182] Assigning key: 10.0.134.71 to cloud-private-ip-config workqueue 2022-07-12T03:03:05.356860074Z I0712 03:03:05.356840 1 controller.go:182] Assigning key: 10.0.144.60 to cloud-private-ip-config workqueue 2022-07-12T03:03:05.414990428Z I0712 03:03:05.414954 1 cloudprivateipconfig_controller.go:421] CloudPrivateIPConfig: "10.0.134.71" in work queue no longer exists 2022-07-12T03:03:05.414990428Z I0712 03:03:05.414980 1 controller.go:160] Dropping key '10.0.134.71' from the cloud-private-ip-config workqueue 2022-07-12T03:03:05.500962957Z I0712 03:03:05.500923 1 cloudprivateipconfig_controller.go:353] Added IP address to node: "ip-10-0-131-179.us-west-1.compute.internal" for CloudPrivateIPConfig: "10.0.186.7" 2022-07-12T03:03:05.614967217Z I0712 03:03:05.614929 1 cloudprivateipconfig_controller.go:421] CloudPrivateIPConfig: "10.0.134.71" in work queue no longer exists 2022-07-12T03:03:05.614967217Z I0712 03:03:05.614949 1 controller.go:160] Dropping key '10.0.134.71' from the cloud-private-ip-config workqueue 2022-07-12T03:03:05.631178307Z I0712 03:03:05.631133 1 cloudprivateipconfig_controller.go:242] CloudPrivateIPConfig: 10.0.173.53 object has been marked for complete deletion 2022-07-12T03:03:05.631178307Z I0712 03:03:05.631161 1 cloudprivateipconfig_controller.go:249] Cleaning up IP address and finalizer for CloudPrivateIPConfig: "10.0.173.53", deleting it completely 2022-07-12T03:03:05.682396046Z I0712 03:03:05.682349 1 controller.go:182] Assigning key: 10.0.152.209 to cloud-private-ip-config workqueue 2022-07-12T03:03:05.818649332Z I0712 03:03:05.818610 1 controller.go:160] Dropping key '10.0.186.7' from the cloud-private-ip-config workqueue ~~~ I added some debug logging [0] Then, I ran: ~~~ [akaris@linux 2104784]$ while true; echo ===================; echo setup; do oc apply -f egressip11.yaml ; oc apply -f egressip12.yaml ; sleep 10 ; echo test ; oc apply -f egressip13.yaml & sleep 0.3; oc delete -f egressip12.yaml & sleep 10; echo teardown; oc delete -f egressip11.yaml ; oc delete -f egressip13.yaml; sleep 10; done =================== setup egressip.k8s.ovn.org/egressip11 created egressip.k8s.ovn.org/egressip12 created test [1] 210367 [2] 210380 egressip.k8s.ovn.org "egressip12" deleted egressip.k8s.ovn.org/egressip13 created [1]- Done oc apply -f egressip13.yaml [2]+ Done oc delete -f egressip12.yaml teardown egressip.k8s.ovn.org "egressip11" deleted egressip.k8s.ovn.org "egressip13" deleted ^C ~~~ And I can reproduce the problem (output after the last iteration): ~~~ W0712 12:38:47.594980 1 aws.go:243] akaris --------- Polling interface for node ip-10-0-132-121.us-west-1.compute.internal, assigned IPs: [10.0.132.121 10.0.129.12] ~~~ ====================== [0] ~~~ diff --git a/pkg/cloudprovider/aws.go b/pkg/cloudprovider/aws.go index 2f031ac..0e5bc1a 100644 --- a/pkg/cloudprovider/aws.go +++ b/pkg/cloudprovider/aws.go @@ -59,6 +59,7 @@ func (a *AWS) initCredentials() error { // AWS API is separated per family). If the IP is already existing: it returns an // AlreadyExistingIPError. func (a *AWS) AssignPrivateIP(ip net.IP, node *corev1.Node) error { + klog.Warningf("akaris --------- Addding IP %s to node %s", ip, node.Name) instance, err := a.getInstance(node) if err != nil { return err @@ -106,6 +107,7 @@ func (a *AWS) AssignPrivateIP(ip net.IP, node *corev1.Node) error { } else { for _, assignedIPv4 := range networkInterface.PrivateIpAddresses { if assignedIP := net.ParseIP(*assignedIPv4.PrivateIpAddress); assignedIP != nil && assignedIP.Equal(ip) { + klog.Warningf("akaris --------- Already existing IP %s on node %s", ip, node.Name) return AlreadyExistingIPError } keepIPs = append(keepIPs, assignedIPv4.PrivateIpAddress) @@ -116,6 +118,7 @@ func (a *AWS) AssignPrivateIP(ip net.IP, node *corev1.Node) error { PrivateIpAddresses: newIPs, } _, err = a.client.AssignPrivateIpAddresses(&inputV4) + klog.Warningf("akaris --------- Unassigning IP %s on node %s, inputV4: %v", ip, node.Name, inputV4) if err != nil { var newIPsStr string var keepIPsStr string @@ -140,6 +143,7 @@ func (a *AWS) AssignPrivateIP(ip net.IP, node *corev1.Node) error { // per-IP-family basis (since the AWS API is separated per family). If the IP // is non-existant: it returns an NonExistingIPError. func (a *AWS) ReleasePrivateIP(ip net.IP, node *corev1.Node) error { + klog.Warningf("akaris --------- Removing IP %s from node %s", ip, node.Name) instance, err := a.getInstance(node) if err != nil { return err @@ -177,6 +181,7 @@ func (a *AWS) ReleasePrivateIP(ip net.IP, node *corev1.Node) error { } } if len(deleteIPs) == 0 { + klog.Warningf("akaris --------- Non existing IP %s on node %s", ip, node.Name) return NonExistingIPError } inputV4 := ec2.UnassignPrivateIpAddressesInput{ @@ -185,8 +190,10 @@ func (a *AWS) ReleasePrivateIP(ip net.IP, node *corev1.Node) error { } _, err = a.client.UnassignPrivateIpAddresses(&inputV4) if err != nil { + klog.Warningf("akaris --------- Error in unassing IP %s on node %s", ip, node.Name) return err } + klog.Warningf("akaris --------- Unassigning IP %s on node %s, inputV4: %v", ip, node.Name, inputV4) return a.waitForCompletion(node, awsapi.StringValueSlice(deleteIPs), true) } } @@ -255,6 +262,7 @@ func (a *AWS) waitForCompletion(node *corev1.Node, ips []string, deleteOp bool) } } } + klog.Warningf("akaris --------- Polling interface for node %s, assigned IPs: %v", node.Name, assignedIPs) if deleteOp { return !sets.NewString(assignedIPs...).HasAny(ips...), nil } else { ~~~ Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399 |