Bug 1279925
| Summary: | After installation, openshift-sdn did not create /etc/openshift-sdn/config.env, and the pod cannot be accessed | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Kenjiro Nakayama <knakayam> |
| Component: | Networking | Assignee: | Dan Williams <dcbw> |
| Status: | CLOSED ERRATA | QA Contact: | Meng Bo <bmeng> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.0.0 | CC: | aos-bugs, bleanhar, danw, dcbw, eparis, jialiu, knakayam, pruan |
| Target Milestone: | --- | Keywords: | UpcomingRelease |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-01-26 19:17:14 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Kenjiro Nakayama, 2015-11-10 13:45:48 UTC
- Created attachment 1092276 [details]: `journalctl -l -u openshift-node 2>&1 | tee $(hostname -s)-openshift-node.log`
- Created attachment 1092278 [details]: `journalctl -l -u docker 2>&1 | tee $(hostname -s)-docker.log`
I filed this as a networking issue, but it might be an installation issue. One user upstream hit the same issue: https://github.com/openshift/openshift-ansible/issues/780#issuecomment-152872814

---

Are you able to retrieve the initial node logs, from before the node was restarted?

---

The crucial bits:

```
Nov 05 23:45:29 ose3-node1.example.com openshift-node[14414]: + systemctl restart docker.service
Nov 05 23:45:29 ose3-node1.example.com openshift-node[14414]: Job for docker.service canceled.
Nov 05 23:45:29 ose3-node1.example.com openshift-node[14414]: Error: exit status 1
Nov 05 23:45:28 ose3-node1.example.com docker[14654]: time="2015-11-05T23:45:28.950938352-05:00" level=info msg="Listening for HTTP on unix (/var/run/docker.sock)"
...
Nov 05 23:45:29 ose3-node1.example.com systemd[1]: Stopping Docker Application Container Engine...
Nov 05 23:45:29 ose3-node1.example.com systemd[1]: docker.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Nov 05 23:45:29 ose3-node1.example.com systemd[1]: Stopped Docker Application Container Engine.
Nov 05 23:45:29 ose3-node1.example.com systemd[1]: Unit docker.service entered failed state.
Nov 05 23:45:29 ose3-node1.example.com systemd[1]: Starting Docker Application Container Engine...
```

So systemd is failing to cleanly stop docker, which causes "systemctl restart docker.service" to fail, which causes openshift-sdn setup to bail out (which SHOULD cause openshift node setup to bail out, but apparently we're not propagating that error...). Maybe docker doesn't like being told to restart while it's not fully started up? (Or systemd doesn't like it?) Given that docker *is* getting restarted successfully despite the errors, a simple workaround would be to change the openshift-sdn setup script to do "systemctl restart docker.service || true"...

---

(In reply to Dan Winship from comment #6)
> So systemd is failing to cleanly stop docker, which is causing "systemctl
> restart docker.service" to fail, which causes openshift-sdn setup to bail
> out (which SHOULD cause openshift node setup to bail out but apparently
> we're not propagating that error...)

Errors in the Setup() part should be fatal:

```go
err = kc.StartNode(mtu)
if err != nil {
	glog.Fatalf("SDN Node failed: %v", err)
}
```

And indeed they are:

```
Nov 05 23:45:29 ose3-node1.example.com systemd[1]: openshift-node.service: main process exited, code=exited, status=255/n/a
```

But of course, by the time the script failed it had already set up br0 and lbr0 and written the docker config, but not the .env file. So when openshift-node got restarted, it thought everything was set up, yet there was still no .env file.

---

(In reply to Dan Williams from comment #5)
> Are you able to retrieve the initial node logs, before the node was
> restarted?

I think ose3-node1-openshift-node.log (which I attached to this bz) contains it. I'm sorry if I misunderstood your request.

---

(In reply to Kenjiro Nakayama from comment #8)
> (In reply to Dan Williams from comment #5)
> > Are you able to retrieve the initial node logs, before the node was
> > restarted?
>
> I think ose3-node1-openshift-node.log (which I attached to this bz) contains
> it. I'm sorry if I misunderstood your request.

Yeah, it does; I should have updated the bug to say so. I believe we've already root-caused the bug (see comment #6 and comment #7).

Updating to OSE 3.1 should help this greatly, though the bug may not be entirely fixed yet. The fixes have been merged to openshift origin and should show up in a release later than 3.1.

---

Verified this bug with AtomicOpenShift/3.1/2015-12-19.3, and PASS.
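---

For reference, a minimal sketch of what the workaround suggested in comment #6 might look like inside a node SDN setup script. This is an illustrative assumption, not the actual openshift-sdn script: the polling loop, the error message, and the idea of only finishing setup once docker is confirmed active are additions made here for clarity.

```bash
#!/bin/bash
# Illustrative sketch only (assumed structure, not the real openshift-sdn setup script).
set -e

# Comment #6's suggestion: don't let the spurious "Job for docker.service
# canceled." error abort the whole setup, since docker usually comes back anyway.
systemctl restart docker.service || true

# Assumed safeguard (not part of the original suggestion): wait until docker
# is actually active again before continuing, and fail loudly if it never is.
for _ in $(seq 1 30); do
    systemctl is-active --quiet docker.service && break
    sleep 1
done
if ! systemctl is-active --quiet docker.service; then
    echo "docker.service did not come back after restart" >&2
    exit 1
fi

# Only now, with docker confirmed up, would the script finish SDN setup and
# write its completion marker (the real script writes /etc/openshift-sdn/config.env).
```

The point of the extra check is to avoid the partial-setup state described in comment #7, where br0 and lbr0 exist and the docker config is written but /etc/openshift-sdn/config.env never gets created.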
The PRs are merged, and QE has not encountered this issue since. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2016:0070