Bug 1608505 - oc cluster up fails with error Error: failed to start the web console server
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: origin
Version: 29
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Jakub Čajka
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Keywords:
Duplicates: 1629431
Depends On:
Blocks: 1598406
Reported: 2018-07-25 16:27 UTC by Lukas Slebodnik
Modified: 2018-10-09 00:05 UTC (History)
Last Closed: 2018-10-09 00:05:21 UTC


Attachments:
Output of journalctl -u docker (456.71 KB, text/x-vhdl)
2018-07-25 16:28 UTC, Lukas Slebodnik

Description Lukas Slebodnik 2018-07-25 16:27:03 UTC
Description of problem:
I wanted to do some testing on rawhide but I did not get far.
I was not even able to set up the OpenShift cluster, which was a prerequisite for my testing.

Version-Release number of selected component (if applicable):
sh$ rpm -q docker origin-clients
docker-1.13.1-60.git9cb56fd.fc29.x86_64
origin-clients-3.9.0-4.fc29.x86_64

How reproducible:
Deterministic

Steps to Reproduce:
1. dnf install -y docker
2. echo -e "\n[registries.insecure]\nregistries = ['172.30.0.0/16']" > /etc/containers/registries.conf
3. echo 'STORAGE_DRIVER="overlay2"' >> /etc/sysconfig/docker-storage-setup
4. systemctl start docker
5. setenforce 0 # due to other bugs on rawhide
6. oc cluster up

Actual results:
sh# oc cluster up
Using nsenter mounter for OpenShift volumes
Using 127.0.0.1 as the server IP
Starting OpenShift using openshift/origin:v3.9.0 ...
-- Starting OpenShift container ...
   Creating initial OpenShift configuration
   Starting OpenShift using container 'origin'
   Waiting for API server to start listening
   OpenShift server started
-- Adding default OAuthClient redirect URIs ... OK
-- Installing registry ...
   scc "privileged" added to: ["system:serviceaccount:default:registry"]
-- Installing router ... OK
-- Importing image streams ... OK
-- Importing templates ... OK
-- Importing internal templates ... OK
-- Installing web console ... FAIL
   Error: failed to start the web console server: timed out waiting for the condition

Expected results:
The cluster is configured without any problems.

Additional info:

Comment 1 Lukas Slebodnik 2018-07-25 16:28 UTC
Created attachment 1470541 [details]
Output of journalctl -u docker

Comment 2 Jakub Čajka 2018-07-30 13:23:25 UTC
This seems to affect only rawhide (f27 and f28 with the rawhide origin (3.9) are not affected), and I have managed to reproduce it. It seems that the docker daemon fails to pull openshift/origin-pod:v3.9.0, based on `time="2018-07-25T12:13:05.554183434-04:00" level=error msg="Handler for GET /v1.26/images/openshift/origin-pod:v3.9.0/json returned error: No such image: openshift/origin-pod:v3.9.0"` in the log, although the image is available and pullable on the host.

As I'm planning to do the rebase to 3.10, I will revisit this issue after the rebase.

Comment 3 Lukas Slebodnik 2018-07-30 19:19:25 UTC
(In reply to Jakub Čajka from comment #2)
> This seems to affect only rawhide(f27, f28 with the rawhide origin(3.9) is
> not affected), I have managed to reproduce it. It seems that the docker
> daemon fails to pull openshift/origin-pod:v3.9.0 based on
> `time="2018-07-25T12:13:05.554183434-04:00" level=error msg="Handler for GET
> /v1.26/images/openshift/origin-pod:v3.9.0/json returned error: No such
> image: openshift/origin-pod:v3.9.0"` in log, although the image is available
> and pull-able on the host.
> 
> As I'm planning to do the rebase to 3.10, I will revisit this issue after
> the rebase.

Is there any ETA?

Comment 4 Jakub Čajka 2018-08-01 12:41:53 UTC
I have managed to reproduce it on rawhide with the origin 3.10 alpha release. It seems that something breaks networking/image pulling on the machine. This happens even with the same version of docker that runs on an unaffected f28 machine (anecdotally, the same version mentioned by the reporter) and with the firewall and SELinux disabled. It even happens with the default docker configuration (not using the commands from the reproducer).

I'm kind of out of ideas about what I can do to debug this or what could influence the networking/pulling.

Trying docker now, although I don't believe that it is the root cause. Folks, do you have any ideas?

Errors observed in the log (that are not present with a successful oc cluster up):

Aug 01 14:24:15 localhost.localdomain dockerd-current[810]: time="2018-08-01T14:24:15.749839355+02:00" level=error msg="Handler for GET /v1.26/images/openshift/origin-pod:v3.10/json returned error: No such image: openshift/origin-pod:v3.10"
Aug 01 14:24:15 localhost.localdomain dockerd-current[810]: time="2018-08-01T14:24:15.750494849+02:00" level=error msg="Handler for GET /v1.26/images/openshift/origin-pod:v3.10/json returned error: No such image: openshift/origin-pod:v3.10"
Aug 01 14:24:15 localhost.localdomain dockerd-current[810]: time="2018-08-01T14:24:15.767248526+02:00" level=warning msg="failed to retrieve docker-init version: unknown output format: tini version 0.18.0\n
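
To see at a glance which images the daemon reports as missing, the "No such image" errors above can be tallied per image. This is only a minimal sketch: the function name `missing_images` and the regular expression are illustrative assumptions based on the log lines quoted in this bug, not part of docker or origin.

```python
import re
from collections import Counter

# Pattern derived from the dockerd messages quoted in this bug (an assumption,
# not an official log format). The image reference ends at whitespace, a quote,
# or a backslash.
NO_SUCH_IMAGE = re.compile(r'returned error: No such image: (?P<image>[^\s"\\]+)')


def missing_images(log_lines):
    """Count 'No such image' errors per image reference in dockerd log lines."""
    counts = Counter()
    for line in log_lines:
        match = NO_SUCH_IMAGE.search(line)
        if match:
            counts[match.group("image")] += 1
    return counts
```

The function accepts any iterable of lines, so it could be fed the saved output of journalctl -u docker and compared between a failing and a successful run.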

Comment 5 Jan Kurik 2018-08-14 11:02:08 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 29 development cycle.
Changing version to '29'.

Comment 6 Jakub Čajka 2018-08-20 13:00:19 UTC
Hm... It seems that there is an issue with systemd. Updating systemd on f28 to the version from rawhide results in a timeout waiting for "https://127.0.0.1:8443/healthz?timeout=32s: dial tcp 127.0.0.1:8443: connect: connection refused ()". The port seems to be unreachable/closed (mind that the firewall is disabled along with SELinux).

Systemd folks, have there been changes in rawhide/f29 systemd that this could be attributed to?
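
When narrowing this down, it may help to separate "connection refused" (nothing is listening on the port) from a timeout or other network error (for example, dropped packets). The sketch below uses only the Python standard library; the function name `tcp_port_state` and its three return values are illustrative assumptions, not something oc provides.

```python
import socket


def tcp_port_state(host, port, timeout=2.0):
    """Classify a TCP port as 'open', 'refused', or 'unreachable'.

    'refused' means the host answered but nothing accepts connections on the
    port; 'unreachable' covers timeouts and any other socket-level error.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except OSError:
        # Includes timeouts, which are OSError subclasses.
        return "unreachable"
```

Calling `tcp_port_state("127.0.0.1", 8443)` while oc cluster up is waiting would show whether the API server port from the error message is accepting connections at all.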

Comment 7 Lukas Slebodnik 2018-08-21 12:48:17 UTC
(In reply to Jakub Čajka from comment #6)
> Hm... Seems that there is issue with systemd. Updating the systemd on the
> f28 to the version from rawhide results in timeout waiting for
> "https://127.0.0.1:8443/healthz?timeout=32s: dial tcp 127.0.0.1:8443:
> connect: connection refused ()". It seems to be unreachable/closed(mind that
> firewall is disabled along with selinux)
> 
> Systemd folks, have there been changes in rawhide/f29 systemd that this
> could be attributed to?

I can confirm that it works well with systemd-238-9.git0e0aa59.fc29.x86_64.

Comment 8 Jakub Čajka 2018-08-21 13:48:14 UTC
(In reply to Lukas Slebodnik from comment #7)
> (In reply to Jakub Čajka from comment #6)
> > Hm... Seems that there is issue with systemd. Updating the systemd on the
> > f28 to the version from rawhide results in timeout waiting for
> > "https://127.0.0.1:8443/healthz?timeout=32s: dial tcp 127.0.0.1:8443:
> > connect: connection refused ()". It seems to be unreachable/closed(mind that
> > firewall is disabled along with selinux)
> > 
> > Systemd folks, have there been changes in rawhide/f29 systemd that this
> > could be attributed to?
> 
> I can confirm that it works well with systemd-238-9.git0e0aa59.fc29.x86_64.

And it doesn't with 239-3.fc29, right?

Comment 9 Joe Doss 2018-08-26 22:57:09 UTC
I can confirm that it doesn't work with 239-3.fc29. Downgrading to systemd-238-9.git0e0aa59.fc29.x86_64 allows origin 3.10 to install correctly.

This might be related to https://bugzilla.redhat.com/show_bug.cgi?id=1568594 and https://bugzilla.redhat.com/show_bug.cgi?id=1558425
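
For anyone scripting a check across several machines, the systemd major version can be pulled out of the package strings quoted in this bug. This is a rough sketch: `systemd_major`, `matches_failing_reports`, and the threshold constant are hypothetical helpers, and the threshold only encodes what was reported here (238 worked, 239 did not); it says nothing about the root cause.

```python
import re

# Assumption taken from the reports in this bug: 238 was the last systemd
# version with which oc cluster up was reported to work.
LAST_REPORTED_WORKING = 238


def systemd_major(package):
    """Extract the major version from strings such as
    'systemd-238-9.git0e0aa59.fc29.x86_64' or '239-3.fc29'."""
    match = re.match(r"(?:systemd-)?(\d+)(?:[-.~]|$)", package.strip())
    if match is None:
        raise ValueError(f"not a systemd version string: {package!r}")
    return int(match.group(1))


def matches_failing_reports(package):
    """True if the version is newer than the last one reported as working."""
    return systemd_major(package) > LAST_REPORTED_WORKING
```

The input would typically be the output of rpm -q systemd on the machine being checked.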

Comment 10 Jakub Čajka 2018-08-29 10:08:07 UTC
For the record, steps to reproduce:
1. Clean f28 (or f29) install
2. Update to rawhide/f29 systemd 239 (downgrade f29 to 238)
3. systemctl disable firewalld
4. reboot
5. dnf install origin-clients docker
6. create /etc/docker/daemon.json with contents
{
    "insecure-registries" : [ "172.30.0.0/16" ]
}
7. systemctl start docker
8. oc cluster up
It should fail as mentioned above.
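
If /etc/docker/daemon.json already exists on the test machine, step 6 can be done by merging the key instead of overwriting the file. The following is a minimal sketch under that assumption; the helper name `with_insecure_registry` is hypothetical, and the function only transforms text, so reading and writing the file is left to the caller.

```python
import json


def with_insecure_registry(existing_text, registry="172.30.0.0/16"):
    """Return daemon.json content with `registry` listed under
    'insecure-registries', keeping any other keys already present."""
    config = json.loads(existing_text) if existing_text.strip() else {}
    registries = config.setdefault("insecure-registries", [])
    if registry not in registries:
        registries.append(registry)
    return json.dumps(config, indent=4) + "\n"
```

Passing an empty string produces the same configuration as in step 6, and running it twice on its own output does not add the registry a second time.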

Comment 11 Jakub Čajka 2018-09-17 06:20:11 UTC
*** Bug 1629431 has been marked as a duplicate of this bug. ***

Comment 12 Jakub Čajka 2018-09-26 13:24:03 UTC
Based on the discussion in https://pagure.io/atomic-wg/issue/510 it is an origin/runc issue.

Comment 13 Fedora Update System 2018-10-02 10:18:40 UTC
origin-3.11.0-0.alpha1.0.fc29 has been submitted as an update to Fedora 29. https://bodhi.fedoraproject.org/updates/FEDORA-2018-7ed03f9dcf

Comment 14 Fedora Update System 2018-10-02 21:18:37 UTC
origin-3.11.0-0.alpha1.0.fc29 has been pushed to the Fedora 29 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-7ed03f9dcf

Comment 15 Fedora Update System 2018-10-09 00:05:21 UTC
origin-3.11.0-0.alpha1.0.fc29 has been pushed to the Fedora 29 stable repository. If problems still persist, please make note of it in this bug report.

