Description of problem:
Changing the ExternalIP policy in CNO from rejectedCIDRs to allowedCIDRs takes about 250 seconds to take effect.

Version-Release number of selected component (if applicable):
4.3.0-0.nightly-2020-01-14-043441

How reproducible:
Always

Steps to Reproduce:
1. First set ExternalIP policy rejectedCIDRs/22.2.2.0/25 in CNO
2. Then change ExternalIP policy to allowedCIDRs/22.2.2.0/25 in CNO
3. Deploy a svc with externalIP 22.2.2.10

[root@dhcp-41-193 verification-tests]# oc login -u kubeadmin -p uw3Vq-x69vi-2K6vo-r3Quz
[root@dhcp-41-193 verification-tests]# oc get networks.config.openshift.io/cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Network
metadata:
  creationTimestamp: "2020-01-14T13:38:01Z"
  generation: 27
  name: cluster
  resourceVersion: "114200"
  selfLink: /apis/config.openshift.io/v1/networks/cluster
  uid: dfae356f-e3e2-4ffb-a83f-bf09515e0adc
spec:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  externalIP:
    policy:
      allowedCIDRs:
      - 22.2.2.0/24
      rejectedCIDRs:
      - 22.2.2.0/25
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
status:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  clusterNetworkMTU: 8951
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
[root@dhcp-41-193 verification-tests]#
[root@dhcp-41-193 verification-tests]# oc login -u testuser-0 -p VooLn7KehL7I
Login successful.

You have one project on this server: "test"

Using project "test".
[root@dhcp-41-193 verification-tests]# curl -s https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/networking/externalip_service1.json | sed s/10.5.0.1/22.2.2.10/g | oc create -f -
Error from server (Forbidden): error when creating "STDIN": services "service-unsecure" is forbidden: spec.externalIPs[0]: Forbidden: externalIP is not allowed
[root@dhcp-41-193 verification-tests]# oc get svc
No resources found.
#### Delete
[root@dhcp-41-193 verification-tests]# oc login -u kubeadmin -p uw3Vq-x69vi-2K6vo-r3Quz
Login successful.

You have access to 55 projects, the list has been suppressed. You can list all projects with 'oc projects'

Using project "test".
[root@dhcp-41-193 verification-tests]# oc edit networks.config.openshift.io/cluster
network.config.openshift.io/cluster edited
[root@dhcp-41-193 verification-tests]# oc get networks.config.openshift.io/cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Network
metadata:
  creationTimestamp: "2020-01-14T13:38:01Z"
  generation: 28
  name: cluster
  resourceVersion: "119069"
  selfLink: /apis/config.openshift.io/v1/networks/cluster
  uid: dfae356f-e3e2-4ffb-a83f-bf09515e0adc
spec:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  externalIP:
    policy:
      allowedCIDRs:
      - 22.2.2.0/24
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
status:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  clusterNetworkMTU: 8951
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
[root@dhcp-41-193 verification-tests]# oc login -u testuser-0 -p VooLn7KehL7I
Login successful.

You have one project on this server: "test"

Using project "test".
[root@dhcp-41-193 verification-tests]# for i in {1..1000}; do date; curl -s https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/networking/externalip_service1.json | sed s/10.5.0.1/22.2.2.10/g | oc create -f - ;sleep 30; done
Tue Jan 14 14:44:07 EST 2020
Error from server (Forbidden): error when creating "STDIN": services "service-unsecure" is forbidden: spec.externalIPs[0]: Forbidden: externalIP is not allowed
Tue Jan 14 14:44:38 EST 2020
Error from server (Forbidden): error when creating "STDIN": services "service-unsecure" is forbidden: spec.externalIPs[0]: Forbidden: externalIP is not allowed
Tue Jan 14 14:45:09 EST 2020
Error from server (Forbidden): error when creating "STDIN": services "service-unsecure" is forbidden: spec.externalIPs[0]: Forbidden: externalIP is not allowed
Tue Jan 14 14:45:39 EST 2020
Error from server (Forbidden): error when creating "STDIN": services "service-unsecure" is forbidden: spec.externalIPs[0]: Forbidden: externalIP is not allowed
Tue Jan 14 14:46:10 EST 2020
Error from server (Forbidden): error when creating "STDIN": services "service-unsecure" is forbidden: spec.externalIPs[0]: Forbidden: externalIP is not allowed
Tue Jan 14 14:46:41 EST 2020
Error from server (Forbidden): error when creating "STDIN": services "service-unsecure" is forbidden: spec.externalIPs[0]: Forbidden: externalIP is not allowed
Tue Jan 14 14:47:11 EST 2020
Error from server (Forbidden): error when creating "STDIN": services "service-unsecure" is forbidden: spec.externalIPs[0]: Forbidden: externalIP is not allowed
Tue Jan 14 14:47:42 EST 2020
service/service-unsecure created

Actual results:
It takes about 220 seconds before service/service-unsecure can be created.

Expected results:
service/service-unsecure can be created very quickly.

Additional info:
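The loop in the report sleeps 30 seconds between attempts, so the measured delay is only accurate to about 30 seconds. A minimal sketch of a tighter measurement follows; the helper name `time_until_success` is hypothetical and not part of `oc` or the QE test suite.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of oc): retry a command until it succeeds,
# then print how many seconds that took.
# Usage: time_until_success TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
time_until_success() {
  local timeout=$1 interval=$2 start now
  shift 2
  start=$(date +%s)
  # Keep retrying; the command's own output is discarded.
  until "$@" >/dev/null 2>&1; do
    now=$(date +%s)
    if [ $(( now - start )) -ge "$timeout" ]; then
      echo "timed out after $(( now - start ))s" >&2
      return 1
    fi
    sleep "$interval"
  done
  now=$(date +%s)
  echo $(( now - start ))
}
```

Assuming the service definition has been saved locally with the externalIP already substituted (the file name `svc.json` is an assumption), it could be called as `time_until_success 600 5 oc create -f svc.json`.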
Is this on 4.4 or 4.3? The bug has been filed against 4.4 but I see from the version field 4.3 nightly? Anyway, can you please give me an environment to reproduce this? Thanks
Unable to reproduce this on 4.6:

[ricky@localhost openshift-installer]$ for i in {1..1000}; do date; curl -s https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/networking/externalip_service1.json | sed s/10.5.0.1/22.2.2.10/g | oc create -f - ;sleep 30; done
Tue 14 Jul 2020 11:05:10 AM CEST
service/service-unsecure created

The service gets created almost immediately. Can you verify this on 4.6? If it doesn't happen there, I will check with Ben whether this is worth pursuing for 4.3, since 220 seconds may not be terrible and it's 3 versions back.
The issue still happens in 4.6.0-0.nightly-2020-07-14-092216 when QE runs the script. In the log below, creating the new service "service-unsecure" is forbidden 148 times, which lasts around 220 seconds. The script waits up to 300 seconds for the step to pass:

# features/step_definitions/meta_steps.rb:33
[14:29:36] INFO> {"kind":"Service","apiVersion":"v1","metadata":{"name":"service-unsecure","labels":{"name":"service-unsecure"}},"spec":{"ports":[{"name":"http","protocol":"TCP","port":27017,"targetPort":8080}],"externalIPs":["10.0.129.76"],"selector":{"name":"caddy-docker"}}}
[14:29:36] INFO> Shell Commands: oc create -f - --kubeconfig=/home/weliang/workdir/weliang-weliang/ocp4_testuser-0.kubeconfig

STDERR:
Error from server (Forbidden): error when creating "STDIN": services "service-unsecure" is forbidden: spec.externalIPs: Forbidden: externalIPs have been disabled
[14:29:36] INFO> Exit Status: 1
[14:33:03] INFO> last 4 messages repeated 148 times
[14:33:04] INFO> {"kind":"Service","apiVersion":"v1","metadata":{"name":"service-unsecure","labels":{"name":"service-unsecure"}},"spec":{"ports":[{"name":"http","protocol":"TCP","port":27017,"targetPort":8080}],"externalIPs":["10.0.129.76"],"selector":{"name":"caddy-docker"}}}
[14:33:04] INFO> Shell Commands: oc create -f - --kubeconfig=/home/weliang/workdir/weliang-weliang/ocp4_testuser-0.kubeconfig
service/service-unsecure created

In order to reproduce this issue, you need to:
1. First set ExternalIP policy rejectedCIDRs/22.2.2.0/25 in CNO
2. Deploy a svc with externalIP 22.2.2.10
3. The svc gets rejected
4. Remove the above svc
5. Then change ExternalIP policy to allowedCIDRs/22.2.2.0/25 in CNO
6. Deploy a svc with externalIP 22.2.2.10
7. The svc cannot be deployed until around 220 seconds have passed
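The "around 220 seconds" figure can be checked against the timestamps already posted in this bug. The sketch below uses two hypothetical helper functions, `to_seconds` and `elapsed_seconds`, and assumes both timestamps are from the same day in two-digit HH:MM:SS form.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for checking the delay reported in this bug.

# Convert HH:MM:SS (two-digit fields) to seconds since midnight.
to_seconds() {
  local t=$1 h m s
  h=${t%%:*}; t=${t#*:}
  m=${t%%:*}; s=${t#*:}
  # Strip one leading zero so values like "08" are not parsed as octal.
  echo $(( ${h#0} * 3600 + ${m#0} * 60 + ${s#0} ))
}

# Seconds between two same-day timestamps: elapsed_seconds START END
elapsed_seconds() {
  echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

elapsed_seconds 14:44:07 14:47:42   # original report, polled every 30 s
elapsed_seconds 14:29:36 14:33:04   # QE automation log on 4.6
```

This prints 215 for the original report and 208 for the 4.6 automation run, so both runs fall just short of the quoted 220 seconds.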
The error is creating the service, not making the network plumbing to make the plumbing work. This is forbidden because the kube-apiserver doesn't allow it. I think 4 minutes is quite reasonable; IMO this is a reasonable limitation rather than a bug. Anyway, this is done by the kube-apiserver operator [1], so I'll reassign it to them to fix if they consider it a bug.

[1] https://github.com/openshift/cluster-kube-apiserver-operator/blob/ac2f94c3216ee2eede35a7357782e3a1f2617fbd/pkg/operator/configobservation/network/observe_network.go#L122
> The error is creating the service, not making the network plumbing to make > the plumbing work. s/not making the network plumbing to make the plumbing work./not making the network plumbing to make the networking work./
Deploying new kube-apiserver configs takes O(250s) because the operator has to roll out through static pods and go through a 70s+ graceful shutdown procedure per instance. This works as intended.
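As a rough back-of-the-envelope check of that figure, the sketch below multiplies the per-instance shutdown time by the number of kube-apiserver instances. Two things here are assumptions and are not stated in this bug: that the cluster has three instances, and that they are restarted one at a time. The function name `estimate_rollout_seconds` is hypothetical.

```shell
#!/usr/bin/env bash
# Lower-bound estimate of the rollout time.
# ASSUMPTION: instances restart one after another, each paying the full
# graceful-shutdown cost before the next one starts.
# Usage: estimate_rollout_seconds INSTANCES SECONDS_PER_INSTANCE
estimate_rollout_seconds() {
  local instances=$1 per_instance=$2
  echo $(( instances * per_instance ))
}

# ASSUMPTION: three kube-apiserver instances; 70 s is the "70s+" from above.
estimate_rollout_seconds 3 70
```

Under those assumptions the lower bound is 210 seconds, which is in line with the 208 to 250 seconds observed in this bug once pod startup time is added on top.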
I am reopening this bug because I performed a few more tests based on this issue, and it seems that for a project admin user "policy: null" sometimes works and sometimes does not (see below), even after waiting for 300 seconds. This is not the case for the cluster admin user, for whom everything works fine.

>> Working :
~~~~~~~~~~~~~~~~~~~~~~~~~~~
$ oc login -u system:admin
Logged into "https://api.sharedocp4upi44.lab.upshift.rdu2.redhat.com:6443" as "system:admin" using existing credentials.
[...]

$ oc patch network cluster --type merge -p '{ "spec": { "externalIP": { "policy": null }}}'
network.config.openshift.io/cluster patched

$ oc edit network.config
apiVersion: config.openshift.io/v1
kind: Network
metadata:
  creationTimestamp: "2020-10-25T06:40:46Z"
  generation: 14
  name: cluster
  resourceVersion: "3300609"
  selfLink: /apis/config.openshift.io/v1/networks/cluster
  uid: d1d36699-5fb8-4f2f-94ac-f301d057246e
spec:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  externalIP: {}
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
status:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  clusterNetworkMTU: 1450
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16

$ oc login -u newuser -p password
Login successful.
[...]
$ oc get po
NAME                     READY   STATUS      RESTARTS   AGE
httpd-example-1-build    0/1     Completed   0          12m
httpd-example-1-deploy   0/1     Completed   0          11m
httpd-example-1-kdlbz    1/1     Running     0          11m

$ oc get svc
NAME            TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
httpd-example   ClusterIP   172.30.60.171   <none>        8080/TCP   12m

$ oc edit svc httpd-example        <-------- changed TYPE to LoadBalancer
service/httpd-example edited

$ oc get svc
NAME            TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
httpd-example   LoadBalancer   172.30.60.171   <pending>     8080:30472/TCP   13m

$ oc whoami
newuser

$ oc patch svc httpd-example -p '{"spec":{"externalIPs":["192.174.120.11"]}}'       // after waiting for 5 mins
service/httpd-example patched

$ oc get svc
NAME            TYPE           CLUSTER-IP      EXTERNAL-IP      PORT(S)          AGE
httpd-example   LoadBalancer   172.30.60.171   192.174.120.11   8080:30472/TCP   16m

$ oc login -u system:admin
Logged into "https://api.sharedocp4upi44.lab.upshift.rdu2.redhat.com:6443" as "system:admin" using existing credentials.

You have access to 70 projects, the list has been suppressed. You can list all projects with 'oc projects'

Using project "test".

$ oc get networks.config cluster -o go-template='{{.spec.externalIP}}{{"\n"}}'
map[]
~~~~~~~~~~~~~~~~~~~~~~~~~

>> Not Working :

Here I set "policy: null" and directly assigned an IP to the svc. I tried many times as the project admin user, but it did not work even after waiting for a long time.
~~~~~~~~~~~~~~~~~~~~~
$ oc new-project newwww

$ oc new-app httpd-example
--> Deploying template "openshift/httpd-example" to project newwww
[...]
$ oc get svc
NAME            TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
httpd-example   LoadBalancer   172.30.237.62   <pending>     8080:30328/TCP   4m53s
$
$ oc patch svc httpd-example -p '{"spec":{"externalIPs":["192.174.120.15"]}}'
Error from server (Forbidden): services "httpd-example" is forbidden: spec.externalIPs: Forbidden: externalIPs have been disabled

$ oc patch svc httpd-example -p '{"spec":{"externalIPs":["192.174.120.15"]}}'
Error from server (Forbidden): services "httpd-example" is forbidden: spec.externalIPs: Forbidden: externalIPs have been disabled
~~~~~~~~~~~~~~~~~~~~~

So the only concern is the inconsistent behaviour when a project admin user manually sets an externalIP on a svc with "policy: null": sometimes the svc is assigned the IP, and sometimes it is not, or it takes too long.
Hi Shubhag,

In my testing, setting policy=null from the default policy={} does not take effect at all. My test did not pass even after I changed the TYPE to LoadBalancer for an existing svc.

A new bug was submitted for the failure in my testing: Bug 1896880 - [ExternalIP] Setting policy=null from default policy={} not take effect
Hi Weibin,

Thanks for the information. I will keep track of this bug's status and will update the customer too.
The load balancer service resource is owned by the cloud-provider LB implementation. Move to routing team. The kube-apiserver is not filling the external IP field.
Hi team, when are we going to fix this bug? May I know the current status of this bug?
Sounds like we are waiting on https://bugzilla.redhat.com/show_bug.cgi?id=1896880 to be verified to see if 1896880 also resolves this bug.
We'll raise this for discussion on the next network architecture call.
Let me clarify the three externalIP bugs in more detail.

This bug, Bug 1793099 - Modifying ExternalIP policy in CNO take about 220 seconds to take effect, was closed once and then reopened because customers encountered the same issue.

During the discussion of bug 1793099, QE opened two more externalIP bugs:
1. Bug 1907505 - [ExternalIP] Only a user with cluster-admin privileges can create a policy object. This is a doc bug and is CLOSED CURRENTRELEASE.
2. Bug 1896880 - [ExternalIP] Setting policy=null from default policy={} not take effect. DEV thinks the feature works as designed, so we need to update the doc to correct the statement.
This bug is about a 220-second rollout time. Unfortunately, there's nothing we can do about this, since changing ExternalIPPolicy means restarting all of the apiservers. I know this is kind of annoying, but these fields aren't changed very often by end users. Either we should close this, or add a documentation line saying that it may take up to 5 minutes for the changes to take effect.
Closing based on comment 35.