Bug 1294061

Summary: Cannot fetch "https://rubygems.org/" when docker/custom build with ruby-hello-world for ruby-22 image
Product: OpenShift Container Platform
Reporter: wewang <wewang>
Component: Networking
Assignee: Dan Williams <dcbw>
Status: CLOSED CURRENTRELEASE
QA Contact: Meng Bo <bmeng>
Severity: urgent
Priority: urgent
Version: 3.1.0
CC: aos-bugs, bleanhar, bparees, dcbw, eparis, haowang, jhonce, jokerman, mmccomas, rchopra, tdawson
Target Milestone: ---
Target Release: ---
Keywords: Regression, Reopened
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Clones: 1390478
Last Closed: 2016-01-29 20:57:57 UTC
Type: Bug

Description wewang 2015-12-24 10:54:30 UTC
Version-Release number of selected component (if applicable):
rhscl/ruby-22-rhel7     9416bc460d0c
openshift v3.1.1.0
kubernetes v1.1.0-origin-1107-g4c8e6f4
etcd 2.1.2

How reproducible:
sometimes

Steps to Reproduce:
1. Create a project
2. Create the ruby app from the template:
   $ oc new-app -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/build/ruby22rhel7-template-docker.json
3. Check the build:
   $ oc get builds
   NAME                    TYPE      FROM          STATUS    STARTED          DURATION
   ruby22-sample-build-1   Docker    Git@5868780   Failed    35 minutes ago   1m53s
4. Check the build log:
   $ oc build-logs ruby22-sample-build-1
 Step 0 : FROM rcm-img-docker01.build.eng.bos.redhat.com:5001/rhscl/ruby-22-rhel7
 ---> 9416bc460d0c
Step 1 : ENV "EXAMPLE" "sample-app"
 ---> Using cache
 ---> de75cf231580
Step 2 : USER default
 ---> Using cache
 ---> 20ee0d59a048
Step 3 : EXPOSE 8080
 ---> Using cache
 ---> 87fa42482f2c
Step 4 : ENV RACK_ENV production
 ---> Using cache
 ---> 484d27182af0
Step 5 : ENV RAILS_ENV production
 ---> Using cache
 ---> a669e61fbc01
Step 6 : COPY . /opt/app-root/src/
 ---> Using cache
 ---> 0f31b7bfc36c
Step 7 : RUN scl enable rh-ruby22 "bundle install"
 ---> Running in b2aff425852a
Fetching source index from https://rubygems.org/
Retrying source fetch due to error (2/3): Bundler::HTTPError Could not fetch specs from https://rubygems.org/
Retrying source fetch due to error (3/3): Bundler::HTTPError Could not fetch specs from https://rubygems.org/
Could not fetch specs from https://rubygems.org/
F1224 05:03:00.935644       1 builder.go:185] Error: build error: The command '/bin/sh -c scl enable rh-ruby22 "bundle install"' returned a non-zero code: 17


Actual results:
The build fails.

Expected results:
The build succeeds.

Additional info:
Even after adding a proxy, https://rubygems.org/ still cannot be fetched.

Comment 1 wewang 2015-12-24 11:05:52 UTC
1. It works with a source build; it can fetch https://rubygems.org/.

2. It does not work with a docker build; it cannot fetch https://rubygems.org/.

Comment 2 Ben Parees 2015-12-24 15:41:31 UTC
Ruby hello world can no longer be docker-built with the Ruby 2.0 image; we switched it to use the Ruby 2.2 SCL package, hence the error you are getting.

This was an intentional change.

Comment 3 Wang Haoran 2015-12-25 00:59:18 UTC
@wewang, this is not a bug; closing.

Comment 4 Wang Haoran 2015-12-31 05:18:08 UTC
Reopening this bug:
1. As the bug title says, we are testing against the ruby 2.2 image; the build log confirms this.
2. Current env info: 3.1.1.0 containerized install. The sti build works fine with the ruby-22 builder image; docker and custom builds fail.
3. The same builder image works fine for docker builds in a 3.1.0.4 rpm install env.

So, Ben, could you please help verify this and find the problem?

Comment 5 Ben Parees 2016-01-01 22:07:31 UTC
My apologies, I'm not sure why I thought you were using the 2.0 image.

I'm able to run this successfully on origin (but using the same rcm ruby image you used), and the error you're getting certainly seems like a networking one (unable to reach rubygems.org), though I understand you were successful with the source-type build.

I'd like you to start by trying again, just to rule out transient network issues. I'd also like to know whether you have this problem with origin installations (but using the RCM ruby-22-rhel7 image to build).

here's my log output from the build:
Step 0 : FROM rcm-img-docker01.build.eng.bos.redhat.com:5001/rhscl/ruby-22-rhel7
 ---> 9416bc460d0c
Step 1 : ENV "EXAMPLE" "sample-app"
 ---> Running in 946a0401b355
 ---> 19870ac8688f
Removing intermediate container 946a0401b355
Step 2 : USER default
 ---> Running in f1057f027cab
 ---> 65e0ae8f9fac
Removing intermediate container f1057f027cab
Step 3 : EXPOSE 8080
 ---> Running in dffab589eaec
 ---> 4dc91b8af03d
Removing intermediate container dffab589eaec
Step 4 : ENV RACK_ENV production
 ---> Running in fba4c440983b
 ---> 6956746fa073
Removing intermediate container fba4c440983b
Step 5 : ENV RAILS_ENV production
 ---> Running in 026f47dbf1c3
 ---> a097a875a602
Removing intermediate container 026f47dbf1c3
Step 6 : COPY . /opt/app-root/src/
 ---> c3ae348617b8
Removing intermediate container 9fe9b26d0581
Step 7 : RUN scl enable rh-ruby22 "bundle install"
 ---> Running in 62c3b35222df
Fetching gem metadata from https://rubygems.org/..........
Installing rake 10.3.2
Installing i18n 0.6.11
Installing json 1.8.3
Installing minitest 5.4.2
Installing thread_safe 0.3.4
Installing tzinfo 1.2.2
Installing activesupport 4.1.7
Installing builder 3.2.2
Installing activemodel 4.1.7
Installing arel 5.0.1.20140414130214
Installing activerecord 4.1.7
Installing mysql2 0.3.16
Installing rack 1.5.2
Installing rack-protection 1.5.3
Installing tilt 1.4.1
Installing sinatra 1.4.5
Installing sinatra-activerecord 2.0.3
Using bundler 1.7.8
Your bundle is complete!
Use `bundle show [gemname]` to see where a bundled gem is installed.
 ---> 242d39c8204b
Removing intermediate container 62c3b35222df
Step 8 : CMD scl enable rh-ruby22 ./run.sh
 ---> Running in 3151ed30cdb1
 ---> ac98d0e4a8b7
Removing intermediate container 3151ed30cdb1
Step 9 : USER root
 ---> Running in 9e2bd4e5d47f
 ---> 88d6943b2bdb
Removing intermediate container 9e2bd4e5d47f
Step 10 : RUN chmod og+rw /opt/app-root/src/db
 ---> Running in 2170eda54edd
 ---> 63c357bbde11

<truncated>

Comment 6 Wang Haoran 2016-01-03 15:19:33 UTC
Ben:
AFAIK, when we run a docker or sti build, a container is started that is not scheduled by kubernetes, and its network seems different from the pod network (network topology here: https://github.com/openshift/openshift-sdn/blob/master/isolation-node-interfaces-diagram.pdf); the container is bridged to the lbr0 interface. So I started a container on one node:
$ docker run -ti bmeng/hello-openshift /bin/bash
1. Then curl rubygems.org inside the container:
$ curl -k https://rubygems.org/
curl: (6) Could not resolve host: rubygems.org; Unknown error
2. The nameserver cannot be reached from inside the container:
bash-4.3# cat /etc/resolv.conf 
# Generated by NetworkManager
search openstacklocal nay.redhat.com
nameserver 10.11.5.19
bash-4.3# ping 10.11.5.19
PING 10.11.5.19 (10.11.5.19): 56 data bytes

^C
--- 10.11.5.19 ping statistics ---
3 packets transmitted, 0 packets received, 100% packet loss

The env is a containerized install on Atomic Host; when I do the same operation on an rpm install env, the curl works. So there seems to be a network problem, but I am not sure what the difference between source and docker builds is, or why the source build works and has no network problem.
I will paste the env info in the next private comment; could you please help check this?
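
For comparison, a minimal way to exercise both cases on the same node (a sketch; <pod-name> is a placeholder for any running pod, and the image is the one used above). Inside a pod scheduled by kubernetes, the fetch works:
$ oc rsh <pod-name>
bash-4.3# curl -k https://rubygems.org/ -o /dev/null && echo "pod network OK"

Inside a plain docker-run container bridged to lbr0, it fails as shown above:
$ docker run --rm -ti bmeng/hello-openshift /bin/bash
bash-4.3# curl -k https://rubygems.org/
curl: (6) Could not resolve host: rubygems.org; Unknown error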

Comment 8 Ben Parees 2016-01-03 16:15:42 UTC
Yes, this sounds like a networking issue. s2i builds have some special logic to pick up the cluster networking config when launching a container:
https://github.com/openshift/origin/pull/5372

It seems we may need to do something similar for Docker and Custom builds so those containers also have the right networking configuration.

Rajat, can you confirm that this is the right solution here (since you were tagged in the s2i PR)?
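
If I understand the s2i approach correctly, the idea is to launch the child container in the build container's own network namespace rather than on the default docker bridge. A rough shell equivalent (a sketch only; the container-ID discovery below is hypothetical, and the real logic lives in the PR above):

# Find this build container's own ID, e.g. from its cgroup path (hypothetical):
BUILD_CID=$(awk -F/ '/docker/ {print $NF; exit}' /proc/self/cgroup)
# Run the child container sharing the build container's network namespace:
docker run --rm --net=container:"$BUILD_CID" bmeng/hello-openshift \
    /bin/bash -c 'curl -k https://rubygems.org/ -o /dev/null && echo "network OK"'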

Comment 9 Ben Parees 2016-01-08 18:51:42 UTC
For custom builds, there's nothing we can do, since the custom image itself controls the network configuration of any docker operations it performs. (The custom container itself has the correct network settings because it's just a pod, but the container it creates by running docker build does not.)

For docker builds, I don't see a way for us to configure the network like you can for docker run, so I'm not convinced there is anything we can do to fix that from our side either.

I'd like the network and container teams to weigh in on this, though.

Comment 10 Wang Haoran 2016-01-11 09:41:20 UTC
The rpm install env also has the problem described in comment 6:
openshift v3.1.1.1
kubernetes v1.1.0-origin-1107-g4c8e6f4
etcd 2.1.2

Updating the priority to high for this network problem.

Comment 11 Dan Williams 2016-01-12 18:20:21 UTC
The issue appears to be that net.bridge.bridge-nf-call-iptables = 1 when it should be 0.  That is likely due to https://github.com/openshift/openshift-sdn/pull/244/files.
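
To check an affected node (using only standard sysctl commands; a broken node reports 1 for the first command):

# sysctl -n net.bridge.bridge-nf-call-iptables
1
# sysctl -w net.bridge.bridge-nf-call-iptables=0
net.bridge.bridge-nf-call-iptables = 0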

Comment 12 Dan Williams 2016-01-12 18:29:12 UTC
That openshift-sdn PR has been merged; we now need to wait for an openshift-sdn update to be merged into the origin repos.

Comment 13 Meng Bo 2016-01-13 10:43:56 UTC
Verified that it works in an origin env after modifying net.bridge.bridge-nf-call-iptables to 0 on the node.

Moving the bug back; please move it to ON_QA once the change is merged into the latest OSE rpm build.

Comment 15 Meng Bo 2016-01-14 06:21:28 UTC
The fix was not included in build 2016-01-13.2 with versions:
atomic-openshift-node-3.1.1.2-1.git.0.30f8d65.el7aos.x86_64
atomic-openshift-sdn-ovs-3.1.1.2-1.git.0.30f8d65.el7aos.x86_64

# sysctl -a | grep ^net.*iptables
net.bridge.bridge-nf-call-iptables = 1

Comment 16 Meng Bo 2016-01-14 08:56:18 UTC
Reducing the severity since the workaround can be applied by the user.
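
For anyone hitting this before the fix lands, the workaround spelled out (a sketch; the /etc/sysctl.d file name below is arbitrary, not mandated anywhere):

# sysctl -w net.bridge.bridge-nf-call-iptables=0
net.bridge.bridge-nf-call-iptables = 0
# echo 'net.bridge.bridge-nf-call-iptables = 0' > /etc/sysctl.d/99-bridge-nf.conf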

Comment 17 Brenton Leanhardt 2016-01-14 13:37:00 UTC
Looks like we missed the latest SDN update with yesterday's build: https://github.com/openshift/origin/pull/6650

There will be another build today, so I'll move this back to ON_QA once it's ready.

Comment 18 Meng Bo 2016-01-15 06:37:38 UTC
The fix has been merged into OSE build 2016-01-14.1 with rpms:
atomic-openshift-node-3.1.1.3-1.git.0.59b3b7b.el7aos.x86_64
atomic-openshift-sdn-ovs-3.1.1.3-1.git.0.59b3b7b.el7aos.x86_64

But I found that the value of net.bridge.bridge-nf-call-iptables is set back to 1, not 0, after the node service restarts.

# sysctl -a | grep bridge.*iptables
net.bridge.bridge-nf-call-iptables = 1

# sysctl -w net.bridge.bridge-nf-call-iptables=0
net.bridge.bridge-nf-call-iptables = 0

# sysctl -a | grep bridge.*iptables
net.bridge.bridge-nf-call-iptables = 0

# systemctl restart atomic-openshift-node

# sysctl -a | grep bridge.*iptables
net.bridge.bridge-nf-call-iptables = 1

Please help confirm what the problem is.

Comment 19 Dan Williams 2016-01-15 17:05:23 UTC
Ok, I think I know the issue.

Docker's libnetwork bridge driver (which is used because we tell docker to use lbr0) always sets bridge-nf-call-iptables=1 when it starts up.

openshift sets bridge-nf-call-iptables=0 when it starts up, but *only* if it thinks the SDN is not yet configured.  openshift also restarts docker when it starts up, but then terminates setup early when it thinks the SDN is configured, and that's long before it sets bridge-nf-call-iptables=0.

Comment 20 Dan Williams 2016-01-15 17:17:34 UTC
Although, openshift only restarts docker if the options changed, so I'm still investigating what's going on here.
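
One way to test the manual-docker-restart hypothesis on a node (sketch): record the value, restart docker by hand, and check whether it flips to 1.

# sysctl -n net.bridge.bridge-nf-call-iptables
# systemctl restart docker
# sysctl -n net.bridge.bridge-nf-call-iptables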

Comment 21 Eric Paris 2016-01-15 17:19:09 UTC
It is possible, and even likely, that he restarted docker by hand at some point. I have heard of an operations group that sometimes does so...

Comment 22 Dan Williams 2016-01-15 22:56:52 UTC
Final analysis:

1) Docker is not the cause; it will set bridge-nf-call-iptables=1, but only if inter-container communication (ICC) is disabled, and ICC defaults to enabled.

2) The actual issue is the upstream Kubernetes proxy code, which also sets bridge-nf-call-iptables=1 and runs after the SDN code, and thus always overwrites the value the SDN code set.
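
This ordering can be observed directly on a node (a sketch; the polling loop is only illustrative): poll the sysctl while the node service restarts, and the value should go to 0 when the SDN code runs, then back to 1 when the proxy starts.

# ( while true; do sysctl -n net.bridge.bridge-nf-call-iptables; sleep 0.2; done ) &
# systemctl restart atomic-openshift-node
# kill %1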

Comment 23 Eric Paris 2016-01-16 00:24:39 UTC
https://github.com/openshift/origin/pull/6686

Comment 24 Eric Paris 2016-01-16 02:14:05 UTC
https://github.com/openshift/origin/pull/6688

This replaces the above PR because of ordering difficulties related to 6686.

Comment 25 Eric Paris 2016-01-16 05:34:46 UTC
Moving to MODIFIED, as this is in HEAD and should make the next OSE 3.1.1 build.

Comment 26 Meng Bo 2016-01-18 06:34:03 UTC
Checked with the latest origin build: net.bridge.bridge-nf-call-iptables is set to 0 after the node starts.

# openshift version
openshift v1.1-806-gd95ec08
kubernetes v1.1.0-origin-1107-g4c8e6f4


# sysctl -a | grep ^net.bridge.bridge.*iptables
net.bridge.bridge-nf-call-iptables = 0
# sysctl -w net/bridge/bridge-nf-call-iptables=1
net.bridge.bridge-nf-call-iptables = 1
# systemctl restart openshift-node 
# sysctl -a | grep ^net.bridge.bridge.*iptables
net.bridge.bridge-nf-call-iptables = 0

@dcbw I'm curious why this was not a problem before the setup.sh refactor? At least, docker builds worked fine when 3.1.0 was released. Does it mean all the functions in setup.sh run earlier during node startup now that they have been rewritten in golang?

Comment 27 Meng Bo 2016-01-18 09:38:39 UTC
Just confirmed that the fix has been included in the latest OSE build, 2016-01-16.1.

Comment 28 Dan Williams 2016-01-18 15:31:46 UTC
(In reply to Meng Bo from comment #26)
> @dcbw I'm curious why this was not a problem before the setup.sh refactor?
> At least, docker builds worked fine when 3.1.0 was released. Does it mean
> all the functions in setup.sh run earlier during node startup now that they
> have been rewritten in golang?

It may have been due to 8971251ba2095bb6daace2e6396ff1a1f6882b27 (committed upstream Nov 10th, after 3.1 was released), which changed node initialization from being done in a goroutine to being synchronous. Before that change, init from a goroutine would usually have allowed RunProxy() to happen before node initialization completed, though it was not guaranteed.

Comment 29 Meng Bo 2016-01-19 06:00:13 UTC
Verified on openshift v3.1.1.5.