Bug 1279925

Summary: After installation, openshift-sdn didn't create /etc/openshift-sdn/config.env, and pods cannot be accessed
Product: OpenShift Container Platform
Reporter: Kenjiro Nakayama <knakayam>
Component: Networking
Assignee: Dan Williams <dcbw>
Status: CLOSED ERRATA
QA Contact: Meng Bo <bmeng>
Severity: medium
Priority: unspecified
Version: 3.0.0
CC: aos-bugs, bleanhar, danw, dcbw, eparis, jialiu, knakayam, pruan
Target Milestone: ---
Target Release: ---
Keywords: UpcomingRelease
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-01-26 19:17:14 UTC
Attachments:
- journalctl -l -u openshift-node 2>&1 | tee $(hostname -s)-openshift-node.log (flags: none)
- journalctl -l -u docker 2>&1 | tee $(hostname -s)-docker.log (flags: none)

Description Kenjiro Nakayama 2015-11-10 13:45:48 UTC
Version
---
~~~
  openshift-node-3.0.2.0-0.git.20.656dc3e.el7ose.x86_64
  openshift-3.0.2.0-0.git.20.656dc3e.el7ose.x86_64
  openshift-sdn-ovs-3.0.2.0-0.git.20.656dc3e.el7ose.x86_64
~~~

After an Ansible advanced installation, we can't access the pod.
---
- In the openshift-node log, these messages are output:
~~~
  Nov 09 15:34:30 foo.example.com openshift-node[69142]: /usr/libexec/kubernetes/kubelet-plugins/net/exec/redhat~openshift-ovs-subnet/openshift-ovs-subnet: line 34: /etc/openshift-sdn/config.env: No such file or directory
  Nov 09 15:34:30 foo.example.com openshift-node[69142]: , exit status 1
  ...
  Nov 09 15:34:30 foo.example.com openshift-node[69142]: E1109 15:34:30.974595   69142 pod_workers.go:111] Error syncing pod ee679cf1-86ab-11e5-8737-0050569a575a, skipping: exit status 1
~~~
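
The "line 34" failure is the pod-setup plugin aborting while trying to load the SDN environment file. A minimal sketch of that step, assuming (from the error text alone, not the actual plugin source) that line 34 sources the file:

~~~
  # Hypothetical reconstruction of the failing step (line 34) in the
  # openshift-ovs-subnet plugin: bash prints exactly this "No such file
  # or directory" error when told to source a missing file, and with
  # "set -e" that becomes the plugin's exit status 1 seen above.
  set -e
  source /etc/openshift-sdn/config.env
~~~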

- /etc/openshift-sdn/config.env was not created by openshift-node.

~~~
  # cat /etc/openshift-sdn/config.env
  cat: /etc/openshift-sdn/config.env: No such file or directory
~~~

- When openshift-node is restarted, it outputs:

~~~
  Nov 05 23:49:36 ose3-node1.example.com openshift-node[15178]: + setup_required
  Nov 05 23:49:36 ose3-node1.example.com openshift-node[15178]: +++ ip a s lbr0
  Nov 05 23:49:36 ose3-node1.example.com openshift-node[15178]: +++ awk '/inet / {print $2}'
  Nov 05 23:49:36 ose3-node1.example.com openshift-node[15178]: ++ echo 10.1.1.1/24
  Nov 05 23:49:36 ose3-node1.example.com openshift-node[15178]: + ip=10.1.1.1/24
  Nov 05 23:49:36 ose3-node1.example.com openshift-node[15178]: + '[' 10.1.1.1/24 '!=' 10.1.1.1/24 ']'
  Nov 05 23:49:36 ose3-node1.example.com openshift-node[15178]: + grep -q lbr0 /run/openshift-sdn/docker-network
  Nov 05 23:49:36 ose3-node1.example.com openshift-node[15178]: + return 1
  Nov 05 23:49:36 ose3-node1.example.com openshift-node[15178]: + echo 'SDN setup not required.'
  Nov 05 23:49:36 ose3-node1.example.com openshift-node[15178]: SDN setup not required.
  Nov 05 23:49:36 ose3-node1.example.com openshift-node[15178]: + exit 140
  Nov 05 23:49:36 ose3-node1.example.com openshift-node[15178]: I1105 23:49:36.254128   15178 iptables.go:299] running iptables -A [KUBE-PORTALS-CONTAINER -t nat -m comment --comment default/kubernetes: -p tcp -m tcp --dport 443 -d 192.168.0.1/32 -j REDIRECT --to-ports 39931]
~~~
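
The xtrace above shows why a restart does not recover: whether setup is needed is decided only from lbr0's address and the generated docker options file. A hedged reconstruction of the setup_required check from the trace alone (the real script may differ):

~~~
  # setup_required, reconstructed from the xtrace above (not verbatim):
  setup_required() {
      local ip
      ip=$(ip a s lbr0 | awk '/inet / {print $2}')
      # lbr0 already carries the expected subnet address...
      [ "$ip" != "10.1.1.1/24" ] && return 0
      # ...and the generated docker options file already mentions lbr0,
      # so the script reports that no setup is required.
      grep -q lbr0 /run/openshift-sdn/docker-network && return 1
      return 0
  }
  # Note that /etc/openshift-sdn/config.env is never checked, so a
  # half-finished first run looks identical to a completed one.
~~~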

- From the openshift-node log history, "systemctl restart docker.service" exited with status 1 after "Job for docker.service canceled.".
- This means that even though lbr0 had been created (with 10.1.1.1/24), /etc/openshift-sdn/config.env had not been.

~~~
  Nov 05 23:45:29 ose3-node1.example.com openshift-node[14414]: ++ grep 'state RELATED,ESTABLISHED'
  Nov 05 23:45:29 ose3-node1.example.com openshift-node[14414]: ++ awk '{print $1}'
  Nov 05 23:45:29 ose3-node1.example.com openshift-node[14414]: ++ iptables -nvL INPUT --line-numbers
  Nov 05 23:45:29 ose3-node1.example.com openshift-node[14414]: + lineno=1
  Nov 05 23:45:29 ose3-node1.example.com openshift-node[14414]: + iptables -I INPUT 1 -p udp -m multiport --dports 4789 -m comment --comment '001 vxlan incoming' -j ACCEPT
  Nov 05 23:45:29 ose3-node1.example.com openshift-node[14414]: + iptables -I INPUT 2 -i tun0 -m comment --comment 'traffic from docker for internet' -j ACCEPT
  Nov 05 23:45:29 ose3-node1.example.com openshift-node[14414]: ++ iptables -nvL FORWARD --line-numbers
  Nov 05 23:45:29 ose3-node1.example.com openshift-node[14414]: ++ grep 'reject-with icmp-host-prohibited'
  Nov 05 23:45:29 ose3-node1.example.com openshift-node[14414]: ++ tail -n 1
  Nov 05 23:45:29 ose3-node1.example.com openshift-node[14414]: ++ awk '{print $1}'
  Nov 05 23:45:29 ose3-node1.example.com openshift-node[14414]: + fwd_lineno=1
  Nov 05 23:45:29 ose3-node1.example.com systemd[1]: openshift-node.service: main process exited, code=exited, status=255/n/a
  Nov 05 23:45:29 ose3-node1.example.com systemd[1]: Unit openshift-node.service entered failed state.
  Nov 05 23:45:29 ose3-node1.example.com openshift-node[14414]: + iptables -I FORWARD 1 -d 10.1.0.0/16 -j ACCEPT
  Nov 05 23:45:29 ose3-node1.example.com openshift-node[14414]: + iptables -I FORWARD 1 -s 10.1.0.0/16 -j ACCEPT
  Nov 05 23:45:29 ose3-node1.example.com openshift-node[14414]: + [[ -z '' ]]
  Nov 05 23:45:29 ose3-node1.example.com openshift-node[14414]: + DOCKER_NETWORK_OPTIONS='-b=lbr0 --mtu=1450'
  Nov 05 23:45:29 ose3-node1.example.com openshift-node[14414]: + '[' -f /.dockerinit ']'
  Nov 05 23:45:29 ose3-node1.example.com openshift-node[14414]: + mkdir -p /run/openshift-sdn
  Nov 05 23:45:29 ose3-node1.example.com openshift-node[14414]: + cat
  Nov 05 23:45:29 ose3-node1.example.com openshift-node[14414]: + systemctl daemon-reload
  Nov 05 23:45:29 ose3-node1.example.com openshift-node[14414]: + systemctl restart docker.service
  Nov 05 23:45:29 ose3-node1.example.com openshift-node[14414]: Job for docker.service canceled.
  Nov 05 23:45:29 ose3-node1.example.com openshift-node[14414]: Error: exit status 1
  Nov 05 23:45:29 ose3-node1.example.com openshift-node[14414]: F1105 23:45:29.589264   14414 flatsdn.go:47] SDN Node failed: exit status 1
  Nov 05 23:45:30 ose3-node1.example.com systemd[1]: Stopped OpenShift Node.
  Nov 05 23:45:30 ose3-node1.example.com systemd[1]: Starting OpenShift Node...
~~~

- The docker log from the same time is:

~~~
  Nov 05 23:45:28 ose3-node1.example.com docker[14654]: time="2015-11-05T23:45:28.950938352-05:00" level=info msg="Listening for HTTP on unix (/var/run/docker.sock)"
  Nov 05 23:45:29 ose3-node1.example.com docker[14654]: time="2015-11-05T23:45:29.218764000-05:00" level=error msg="WARNING: No --storage-opt dm.thinpooldev specified, using loopback; this configuration is strongly discouraged for production use"
  Nov 05 23:45:29 ose3-node1.example.com docker[14654]: time="2015-11-05T23:45:29.354624311-05:00" level=info msg="[graphdriver] using prior storage driver \"devicemapper\""
  Nov 05 23:45:29 ose3-node1.example.com docker[14654]: time="2015-11-05T23:45:29.355411432-05:00" level=info msg="Option DefaultDriver: bridge"
  Nov 05 23:45:29 ose3-node1.example.com docker[14654]: time="2015-11-05T23:45:29.355435633-05:00" level=info msg="Option DefaultNetwork: bridge"
  Nov 05 23:45:29 ose3-node1.example.com docker[14654]: time="2015-11-05T23:45:29.364341575-05:00" level=warning msg="Running modprobe bridge nf_nat br_netfilter failed with message: modprobe: WARNING: Module br_netfilter not found.\n, error: exit status 1"
  Nov 05 23:45:29 ose3-node1.example.com docker[14654]: time="2015-11-05T23:45:29.368415055-05:00" level=info msg="Firewalld running: false"
  Nov 05 23:45:29 ose3-node1.example.com systemd[1]: Stopping Docker Application Container Engine...
  Nov 05 23:45:29 ose3-node1.example.com systemd[1]: docker.service: main process exited, code=exited, status=2/INVALIDARGUMENT
  Nov 05 23:45:29 ose3-node1.example.com systemd[1]: Stopped Docker Application Container Engine.
  Nov 05 23:45:29 ose3-node1.example.com systemd[1]: Unit docker.service entered failed state.
  Nov 05 23:45:29 ose3-node1.example.com systemd[1]: Starting Docker Application Container Engine...
  Nov 05 23:45:29 ose3-node1.example.com docker[14770]: time="2015-11-05T23:45:29.872330986-05:00" level=error msg="WARNING: No --storage-opt dm.thinpooldev specified, using loopback; this configuration is strongly discouraged for production use"
  Nov 05 23:45:29 ose3-node1.example.com docker[14770]: time="2015-11-05T23:45:29.875682832-05:00" level=info msg="Listening for HTTP on unix (/var/run/docker.sock)"
  Nov 05 23:45:29 ose3-node1.example.com docker[14770]: time="2015-11-05T23:45:29.926598698-05:00" level=info msg="[graphdriver] using prior storage driver \"devicemapper\""
  Nov 05 23:45:29 ose3-node1.example.com docker[14770]: time="2015-11-05T23:45:29.927363584-05:00" level=info msg="Option DefaultDriver: bridge"
  Nov 05 23:45:29 ose3-node1.example.com docker[14770]: time="2015-11-05T23:45:29.927378881-05:00" level=info msg="Option DefaultNetwork: bridge"
  Nov 05 23:45:29 ose3-node1.example.com docker[14770]: time="2015-11-05T23:45:29.930974678-05:00" level=warning msg="Running modprobe bridge nf_nat br_netfilter failed with message: modprobe: WARNING: Module br_netfilter not found.\n, error: exit status 1"
  Nov 05 23:45:29 ose3-node1.example.com docker[14770]: time="2015-11-05T23:45:29.934066815-05:00" level=info msg="Firewalld running: false"
  Nov 05 23:45:30 ose3-node1.example.com docker[14770]: time="2015-11-05T23:45:30.236771307-05:00" level=info msg="Loading containers: start."
~~~

Workaround
---
- Destroy lbr0 and let openshift-node create it again; then it works. (So a system reboot works as well.)

~~~
  # systemctl stop openshift-node
  # ip link set lbr0 down
  # brctl delbr lbr0
  # systemctl start openshift-node
~~~

Additional info
---
- This happened when we added "openshift_master_portal_net=192.168.128.0/18" to the Ansible installer. (But I am not sure whether that caused this issue.)

Comment 1 Kenjiro Nakayama 2015-11-10 13:50:20 UTC
Created attachment 1092276 [details]
journalctl -l -u openshift-node 2>&1 | tee $(hostname -s)-openshift-node.log

Comment 2 Kenjiro Nakayama 2015-11-10 13:50:59 UTC
Created attachment 1092278 [details]
journalctl -l -u docker 2>&1 | tee $(hostname -s)-docker.log

Comment 3 Kenjiro Nakayama 2015-11-10 14:02:57 UTC
I filed this as a networking issue, but it might be an installation issue.

One upstream user hit the same issue: https://github.com/openshift/openshift-ansible/issues/780#issuecomment-152872814

Comment 5 Dan Williams 2015-11-10 18:42:28 UTC
Are you able to retrieve the initial node logs, before the node was restarted?

Comment 6 Dan Winship 2015-11-10 18:53:17 UTC
The crucial bits:

  Nov 05 23:45:29 ose3-node1.example.com openshift-node[14414]: + systemctl restart docker.service
  Nov 05 23:45:29 ose3-node1.example.com openshift-node[14414]: Job for docker.service canceled.
  Nov 05 23:45:29 ose3-node1.example.com openshift-node[14414]: Error: exit status 1

  Nov 05 23:45:28 ose3-node1.example.com docker[14654]: time="2015-11-05T23:45:28.950938352-05:00" level=info msg="Listening for HTTP on unix (/var/run/docker.sock)"
  ...
  Nov 05 23:45:29 ose3-node1.example.com systemd[1]: Stopping Docker Application Container Engine...
  Nov 05 23:45:29 ose3-node1.example.com systemd[1]: docker.service: main process exited, code=exited, status=2/INVALIDARGUMENT
  Nov 05 23:45:29 ose3-node1.example.com systemd[1]: Stopped Docker Application Container Engine.
  Nov 05 23:45:29 ose3-node1.example.com systemd[1]: Unit docker.service entered failed state.
  Nov 05 23:45:29 ose3-node1.example.com systemd[1]: Starting Docker Application Container Engine...

So systemd is failing to cleanly stop docker, which is causing "systemctl restart docker.service" to fail, which causes openshift-sdn setup to bail out (which SHOULD cause openshift node setup to bail out but apparently we're not propagating that error...)

Maybe docker doesn't like it if you tell it to restart while it's not fully started up? (Or systemd doesn't like it?)

Given that docker *is* getting restarted successfully despite the errors, a simple workaround would be to change the openshift-sdn setup script to do "systemctl restart docker.service || true"...
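
Roughly, the suggested change (exact placement in the setup script is a guess):

~~~
  # Sketch of the suggested band-aid (hypothetical placement): tolerate
  # systemctl's spurious non-zero exit, since docker does come back up
  # despite the "Job for docker.service canceled." error.
  systemctl restart docker.service || true
~~~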

Comment 7 Dan Williams 2015-11-10 19:37:18 UTC
(In reply to Dan Winship from comment #6)
> So systemd is failing to cleanly stop docker, which is causing "systemctl
> restart docker.service" to fail, which causes openshift-sdn setup to bail
> out (which SHOULD cause openshift node setup to bail out but apparently
> we're not propagating that error...)

Errors in the Setup() part should be fatal:

	err = kc.StartNode(mtu)
	if err != nil {
		glog.Fatalf("SDN Node failed: %v", err)
	}

And indeed they are:

Nov 05 23:45:29 ose3-node1.example.com systemd[1]: openshift-node.service: main process exited, code=exited, status=255/n/a

But of course, by the time the script failed it had set up br0 and lbr0 and written the docker config, but not the .env file.  So when openshift-node got restarted, it thought everything was set up, but still no .env file.
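
Reconstructing the order of operations from the xtrace in the description (a sketch, not the verbatim script):

~~~
  # Order of operations reconstructed from the xtrace in the description
  # (hypothetical sketch):
  mkdir -p /run/openshift-sdn
  # docker options are written first...
  echo "DOCKER_NETWORK_OPTIONS='-b=lbr0 --mtu=1450'" > /run/openshift-sdn/docker-network
  systemctl daemon-reload
  systemctl restart docker.service   # ...and the script dies here on failure,
  # so config.env, which would only be written after this point, never appears:
  # br0/lbr0 and docker-network are left in place, which is exactly the state
  # that the setup-required check later mistakes for "already set up".
~~~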

Comment 8 Kenjiro Nakayama 2015-11-11 06:11:13 UTC
(In reply to Dan Williams from comment #5)
> Are you able to retrieve the initial node logs, before the node was
> restarted?

I think ose3-node1-openshift-node.log (which I attached to this bz) contains it. I'm sorry if I misunderstood your request.

Comment 9 Dan Williams 2015-11-11 17:17:44 UTC
(In reply to Kenjiro Nakayama from comment #8)
> (In reply to Dan Williams from comment #5)
> > Are you able to retrieve the initial node logs, before the node was
> > restarted?
> 
> I think ose3-node1-openshift-node.log (which I attached to this bz) contains
> it. I'm sorry if I misunderstood your demand.

Yeah, it does; I should have updated the bug to say that. I believe we've root-caused the bug already (see comment #6 and comment #7). Updating to OSE 3.1 should help greatly, though the bug may not be entirely fixed yet.

Comment 15 Dan Williams 2015-12-02 16:25:17 UTC
The fixes have been merged into OpenShift Origin and should show up in a release later than 3.1.

Comment 16 Johnny Liu 2015-12-30 02:17:41 UTC
Verified this bug with AtomicOpenShift/3.1/2015-12-19.3, and it passes.

The PRs are merged, and QE no longer encounters this issue.

Comment 18 errata-xmlrpc 2016-01-26 19:17:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:0070