Bug 1381025 - oc cluster up is hard coded to check docker config for 172.30/16 on oc cluster up breaking many people
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OKD
Classification: Red Hat
Component: oc
Version: 3.x
Hardware: Unspecified
OS: Unspecified
urgent
high
Target Milestone: ---
Assignee: Cesar Wong
QA Contact: Xingxing Xia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-10-02 13:18 UTC by Grant Shipley
Modified: 2016-10-18 14:58 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-10-18 14:58:16 UTC
Target Upstream Version:



Description Grant Shipley 2016-10-02 13:18:48 UTC
Description of problem:
Sometimes docker hands out a 172.17 subnet instead of 172.30 when starting the docker daemon. When running oc cluster up, this fails because the 172.30/16 check is hard coded on line 25 of:
https://github.com/openshift/origin/blob/master/pkg/bootstrap/docker/dockerhelper/helper.go

const openShiftInsecureCIDR = "172.30.0.0/16"
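
For illustration, the check amounts to asking whether one of the daemon's configured --insecure-registry CIDRs covers the hard-coded 172.30.0.0/16 range. A minimal sketch of that containment test using only the Go standard library (configuredCIDRs is a hypothetical stand-in for whatever the daemon was started with; this is not the actual origin code):

package main

import (
    "fmt"
    "net"
)

// Mirrors the hard-coded constant referenced above.
const openShiftInsecureCIDR = "172.30.0.0/16"

// coversServiceCIDR reports whether any configured --insecure-registry CIDR
// fully contains the hard-coded OpenShift CIDR. Illustrative sketch only.
func coversServiceCIDR(configuredCIDRs []string) (bool, error) {
    serviceIP, serviceNet, err := net.ParseCIDR(openShiftInsecureCIDR)
    if err != nil {
        return false, err
    }
    serviceBits, _ := serviceNet.Mask.Size()
    for _, c := range configuredCIDRs {
        _, ipnet, err := net.ParseCIDR(c)
        if err != nil {
            continue // skip malformed entries in this sketch
        }
        bits, _ := ipnet.Mask.Size()
        // The configured network must contain the 172.30.0.0 base address
        // and have an equal or shorter prefix (i.e. be at least as broad).
        if ipnet.Contains(serviceIP) && bits <= serviceBits {
            return true, nil
        }
    }
    return false, nil
}

func main() {
    ok, _ := coversServiceCIDR([]string{"172.17.0.0/16"}) // config from this report
    fmt.Println(ok)                                       // false: 172.30.0.0/16 is not covered
}

With a daemon configured only for a 172.17 insecure-registry range, as described here, the test returns false and cluster up refuses to proceed.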

Also, a lot of the Docker documentation references 172.17, so that subnet gets handed out as well.

Version-Release number of selected component (if applicable):
current

How reproducible:
Every time

Steps to Reproduce:
1. Use a docker daemon running on a 172.17 subnet
2. Add --insecure-registry 172.17/16 to your config
3. Start docker
4. Run oc cluster up
5. The check fails



Actual results:
You can't use oc cluster up when docker hands out any subnet other than 172.30, because of the hard-coded check.

Expected results:
We should check what network the docker daemon handed out and not hard code this value. According to the official docs, docker can use a range from 172.17 to 172.30 and it's not guaranteed to be in order. We should check for 172.0/8, or better yet, not hard code anything and figure it out at runtime. I am honestly shocked we haven't seen this before.

Additional info:

Comment 1 Fabiano Franz 2016-10-03 15:09:50 UTC
This will most likely require a backport to 3.3.1/1.3.1.

Comment 2 Cesar Wong 2016-10-03 15:31:37 UTC
So we do pass a default CIDR for services to the network plugin here:
https://github.com/openshift/origin/blob/master/pkg/cmd/server/start/network_args.go#L35

However, from what I can tell, only our own SDN pays attention to that. Grant, is this something you can reproduce at will, or does it only happen sometimes? What platform?

I'm not sure that it's Docker that decides what CIDR to use for services, but rather the kube proxy. It's also not clear from the code how it decides that. Adding Ben from the networking team to help out.

Comment 3 Grant Shipley 2016-10-03 15:37:15 UTC
Yes, I can reproduce this every single time. I don't know why I always get 172.17 as my docker0 subnet; all of the docs reference it as well. However, it seems most people I see doing demos have 172.30.

I have tried this on 10 VMs, all using Fedora 24. The VMs I have created are hosted on Windows and Linux. I have also tried before and after a yum update, with the same results.

Comment 4 Cesar Wong 2016-10-03 15:40:41 UTC
My docker0 is 172.17.0.1; however, all services are created on 172.30.*. This is on a RHEL machine.

Comment 5 Cesar Wong 2016-10-03 16:11:06 UTC
Ok, did a little bit more digging. I don't see how the service subnet can be anything other than 172.30.0.0/16 with a config created by cluster up:

1) As mentioned earlier, the default network args is initialized here:
https://github.com/openshift/origin/blob/master/pkg/cmd/server/start/network_args.go#L35

2) That same value is used to initialize the ServiceSubnet in the Kubernetes master config:
https://github.com/openshift/origin/blob/master/pkg/cmd/server/start/master_args.go#L451

3) Which is then used as the Kubernetes master ServiceClusterIPRange:
https://github.com/openshift/origin/blob/master/pkg/cmd/server/kubernetes/master_config.go#L79

4) And used to create the Kube master IP allocator (which is what assigns services IPs):
https://github.com/openshift/origin/blob/ffdeb1bb546339f62722f507e1a12bdb9701c4c2/vendor/k8s.io/kubernetes/pkg/master/master.go#L368-L386

Grant, are you actually seeing services that are not in the 172.30.0.0/16 range, or just the docker0 interface? The reason we need the --insecure-registry parameter is that the registry service (just like any other service) will be created with a cluster IP in the 172.30.0.0/16 range.
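
To make the outcome of that chain concrete: service cluster IPs are carved out of the configured ServiceClusterIPRange (172.30.0.0/16 by default), regardless of what subnet docker0 landed on. A toy allocator, purely illustrative and not the actual Kubernetes bitmap allocator, to show the mechanism:

package main

import (
    "fmt"
    "net"
)

// A toy service IP allocator: hands out sequential IPs from the configured
// service CIDR. Assumes an IPv4 CIDR; real allocation is more involved, but
// the point is that service IPs come from this range, not from docker0.
type serviceIPAllocator struct {
    network *net.IPNet
    next    net.IP
}

func newServiceIPAllocator(cidr string) (*serviceIPAllocator, error) {
    ip, network, err := net.ParseCIDR(cidr)
    if err != nil {
        return nil, err
    }
    next := make(net.IP, 4)
    copy(next, ip.To4())
    next[3]++ // start one past the network address
    return &serviceIPAllocator{network: network, next: next}, nil
}

func (a *serviceIPAllocator) allocate() (net.IP, error) {
    if !a.network.Contains(a.next) {
        return nil, fmt.Errorf("service CIDR %s exhausted", a.network)
    }
    ip := make(net.IP, 4)
    copy(ip, a.next)
    a.next[3]++ // naive last-octet increment; fine for a sketch
    return ip, nil
}

func main() {
    alloc, _ := newServiceIPAllocator("172.30.0.0/16") // the cluster up default
    for i := 0; i < 3; i++ {
        ip, _ := alloc.allocate()
        fmt.Println(ip) // 172.30.0.1, 172.30.0.2, 172.30.0.3
    }
}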

Comment 6 Grant Shipley 2016-10-03 16:17:09 UTC
Yes, if I docker exec -ti REGISTRY_POD bash and check the network, it has a 172.17 address. I can repeat this every time on Fedora 24, but it works on CentOS. I can do a BlueJeans call later today / this week if you want.

Comment 7 Cesar Wong 2016-10-03 16:21:24 UTC
The pod itself will not have the same IP as the service. Looking at a cluster I just brought up locally, the  docker-registry pod itself has an IP of 172.17.0.5. However, the service has an IP of 172.30.179.130. The service IP is what matters when pushing/pulling images.

Comment 8 Juan Vallejo 2016-10-03 17:18:41 UTC
I can confirm cewong's comment (https://bugzilla.redhat.com/show_bug.cgi?id=1381025#c7): I tried printing all available interfaces using the net package (net.Interfaces()), and looping through them I see an IP of `172.17.0.1` for the "docker0" interface. When doing `oc describe` on my docker-registry pod, it shows an IP address of `172.17.0.5`. My docker-registry service also has an IP in the 172.30 range. With all of this information in mind, I am still able to do `oc cluster up` successfully after starting the docker daemon with the following options:

--exec-opt native.cgroupdriver=systemd \
    --insecure-registry=172.30.0.0/16 \
    --insecure-registry=ci.dev.openshift.redhat.com:5000 \
    --selinux-enabled &> /tmp/docker &
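
For reference, the interface walk described above can be reproduced with a few lines of standard-library Go; this is only a sketch of that diagnostic, not code from origin:

package main

import (
    "fmt"
    "net"
    "strings"
)

// Print the IPv4 address of any docker bridge interface (e.g. docker0),
// mirroring the net.Interfaces() loop described in the comment above.
func main() {
    ifaces, err := net.Interfaces()
    if err != nil {
        panic(err)
    }
    for _, iface := range ifaces {
        if !strings.HasPrefix(iface.Name, "docker") {
            continue
        }
        addrs, err := iface.Addrs()
        if err != nil {
            continue
        }
        for _, addr := range addrs {
            if ipnet, ok := addr.(*net.IPNet); ok && ipnet.IP.To4() != nil {
                fmt.Printf("%s: %s\n", iface.Name, ipnet.String()) // e.g. docker0: 172.17.0.1/16
            }
        }
    }
}

Seeing 172.17.0.1 on docker0 here is expected; as the earlier comments note, it is the service IP range (172.30.x.x) that the --insecure-registry setting has to cover.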

Comment 9 Cesar Wong 2016-10-04 11:47:45 UTC
Grant, can you confirm that you are getting service IPs outside of the 172.30.0.0/16 range? If not, I think both OpenShift and cluster up are working as designed.

Comment 10 Grant Shipley 2016-10-04 18:27:56 UTC
You are correct:

hostname -i inside the docker container for the registry shows a 172.17 address

but oc get svc shows the registry having 172.30.130.253

So it must be something else going on with Fedora: I keep getting ErrImagePull errors because it can't connect to the registry for pulling, but it can push just fine.

It works 100% of the time for me on CentOS and fails 100% of the time on F24 installs. So I think we can close this as not a bug, since it must be something else.

Here is my flow:

Install:

sudo yum install wget docker git

Uncomment the following in /etc/sysconfig/docker:

INSECURE_REGISTRY='--insecure-registry 172.30.0.0/16'

Then:

systemctl stop firewalld
systemctl start docker
oc cluster up


when pulling an image:

------------
Oct 04 14:22:26 localhost.localdomain NetworkManager[745]: <info>  [1475605346.8776] device (veth7526ee7): link connected
Oct 04 14:22:26 localhost.localdomain docker[7141]: --> Waiting up to 10m0s for pods in deployment test-1 to become ready
Oct 04 14:22:26 localhost.localdomain audit: SELINUX_ERR op=security_compute_av reason=bounds scontext=system_u:system_r:svirt_lxc_net_t:s0:c5,c6 tcontext=system_u:system_r:docker_t:s0 tclass=process perms=getattr
Oct 04 14:22:27 localhost.localdomain docker[7141]: E1004 18:22:26.966331    8297 docker_manager.go:1537] Failed to create symbolic link to the log file of pod "test-1-56xs4_myproject(7fff201b-8a5f-11e6-8747-5254008aa548)" container "POD": symlink  /var/log/containers/test-1-56xs4_myproject_POD-8340d12d7464fc5c14b3586d6d69d13c51ec49ca0b8655312c360670e75f2b02.log: no such file or directory
Oct 04 14:22:27 localhost.localdomain docker[7141]: W1004 18:22:27.025940    8297 docker_manager.go:1999] Hairpin setup failed for pod "test-1-56xs4_myproject(7fff201b-8a5f-11e6-8747-5254008aa548)": open /sys/devices/virtual/net/veth7526ee7/brport/hairpin_mode: read-only file system
Oct 04 14:22:27 localhost.localdomain docker[7141]: time="2016-10-04T14:22:27.051019894-04:00" level=info msg="{Action=create, LoginUID=4294967295, PID=8297}"
Oct 04 14:22:27 localhost.localdomain audit[7141]: VIRT_CONTROL pid=7141 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:docker_t:s0 msg='vm-pid=? user=? auid=4294967295 exe=? hostname=? reason=api op=create vm=?  exe="/usr/bin/docker" hostname=? addr=? terminal=? res=success'
Oct 04 14:22:27 localhost.localdomain docker[7141]: time="2016-10-04T14:22:27.099069609-04:00" level=warning msg="Error getting v2 registry: Get https://172.30.112.56:5000/v2/: http: server gave HTTP response to HTTPS client"
Oct 04 14:22:27 localhost.localdomain docker[7141]: E1004 18:22:27.146416    8297 handler.go:278] unable to get fs usage from thin pool for device 212: no cached value for usage of device 212
Oct 04 14:22:27 localhost.localdomain docker[7141]: time="2016-10-04T18:22:27.191169505Z" level=debug msg="authorizing request" go.version=go1.6.3 http.request.host="172.30.112.56:5000" http.request.id=cab2661a-ae10-4b01-a195-b04f0652729a http.request.method=GET http.request.remoteaddr="192.168.0.203:41940" http.request.uri="/v2/" http.request.useragent="docker/1.10.3 go/go1.6.3 kernel/4.5.5-300.fc24.x86_64 os/linux arch/amd64" instance.id=b5052801-6f88-40af-a943-5a54984a57a2 
Oct 04 14:22:27 localhost.localdomain docker[7141]: time="2016-10-04T18:22:27.193733919Z" level=error msg="error authorizing context: authorization header required" go.version=go1.6.3 http.request.host="172.30.112.56:5000" http.request.id=cab2661a-ae10-4b01-a195-b04f0652729a http.request.method=GET http.request.remoteaddr="192.168.0.203:41940" http.request.uri="/v2/" http.request.useragent="docker/1.10.3 go/go1.6.3 kernel/4.5.5-300.fc24.x86_64 os/linux arch/amd64" instance.id=b5052801-6f88-40af-a943-5a54984a57a2

Comment 11 Cesar Wong 2016-10-18 14:58:16 UTC
Grant, did you try running 'iptables -F' on your Fedora box? It just occurred to me you may have been running into this.

For now closing this bug though, as we already have an issue for that: 
https://github.com/openshift/origin/issues/10139

