Bug 1608505

Summary: oc cluster up fails with error Error: failed to start the web console server
Product: [Fedora] Fedora
Reporter: Lukas Slebodnik <lslebodn>
Component: origin
Assignee: Jakub Čajka <jcajka>
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 29
CC: adimania, admiller, amurdaca, dwalsh, ealfassa, fkluknav, ichavero, jcajka, joe, lnykryn, lsm5, marianne, mpatel, msekleta, nalin, santiago, ssahani, s, systemd-maint, tdawson, vbatts, zbyszek
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: origin-3.11.0-0.alpha1.0.fc30 origin-3.11.0-0.alpha1.0.fc29
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-10-09 00:05:21 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Bug Depends On:    
Bug Blocks: 1598406    
Attachments:
Description Flags
Output of journalctl -u docker none

Description Lukas Slebodnik 2018-07-25 16:27:03 UTC
Description of problem:
I wanted to do some testing on rawhide but did not get very far.
I could not even set up the OpenShift cluster that was a prerequisite for my testing.

Version-Release number of selected component (if applicable):
sh$ rpm -q docker origin-clients
docker-1.13.1-60.git9cb56fd.fc29.x86_64
origin-clients-3.9.0-4.fc29.x86_64

How reproducible:
Deterministic

Steps to Reproduce:
1. dnf install -y docker
2. echo -e "\n[registries.insecure]\nregistries = ['172.30.0.0/16']" > /etc/containers/registries.conf
3. echo 'STORAGE_DRIVER="overlay2"' >> /etc/sysconfig/docker-storage-setup
4. systemctl start docker
5. setenforce 0 # due to other BUGS on rawhide
6. oc cluster up

Actual results:
sh# oc cluster up
Using nsenter mounter for OpenShift volumes
Using 127.0.0.1 as the server IP
Starting OpenShift using openshift/origin:v3.9.0 ...
-- Starting OpenShift container ...
   Creating initial OpenShift configuration
   Starting OpenShift using container 'origin'
   Waiting for API server to start listening
   OpenShift server started
-- Adding default OAuthClient redirect URIs ... OK
-- Installing registry ...
   scc "privileged" added to: ["system:serviceaccount:default:registry"]
-- Installing router ... OK
-- Importing image streams ... OK
-- Importing templates ... OK
-- Importing internal templates ... OK
-- Installing web console ... FAIL
   Error: failed to start the web console server: timed out waiting for the condition

Expected results:
cluster configured without any problem

Additional info:

Comment 1 Lukas Slebodnik 2018-07-25 16:28 UTC
Created attachment 1470541 [details]
Output of journalctl -u docker

Comment 2 Jakub Čajka 2018-07-30 13:23:25 UTC
This seems to affect only rawhide (f27 and f28 with the rawhide origin (3.9) are not affected); I have managed to reproduce it. It seems that the docker daemon fails to pull openshift/origin-pod:v3.9.0, based on `time="2018-07-25T12:13:05.554183434-04:00" level=error msg="Handler for GET /v1.26/images/openshift/origin-pod:v3.9.0/json returned error: No such image: openshift/origin-pod:v3.9.0"` in the log, although the image is available and pullable on the host.

As I'm planning to do the rebase to 3.10, I will revisit this issue after the rebase.

Comment 3 Lukas Slebodnik 2018-07-30 19:19:25 UTC
(In reply to Jakub Čajka from comment #2)
> This seems to affect only rawhide(f27, f28 with the rawhide origin(3.9) is
> not affected), I have managed to reproduce it. It seems that the docker
> daemon fails to pull openshift/origin-pod:v3.9.0 based on
> `time="2018-07-25T12:13:05.554183434-04:00" level=error msg="Handler for GET
> /v1.26/images/openshift/origin-pod:v3.9.0/json returned error: No such
> image: openshift/origin-pod:v3.9.0"` in log, although the image is available
> and pull-able on the host.
> 
> As I'm planning to do the rebase to 3.10, I will revisit this issue after
> the rebase.

Is there any ETA?

Comment 4 Jakub Čajka 2018-08-01 12:41:53 UTC
I have managed to reproduce it on rawhide with the origin 3.10 alpha release. It seems that something breaks networking/image pulling on the machine. This happens even with the same version of docker as runs on an unaffected f28 machine (anecdotally the same as mentioned by the reporter) and with the firewall and SELinux disabled. It even happens with the default docker configuration (not using the commands from the reproducer).

I'm kind of out of ideas about what I can do to debug this or what could influence the networking/pulling.

Trying docker now, although I don't believe that it is the root cause. Folks, do you have any ideas?

Errors observed in the log (that are not present with a successful oc up):

Aug 01 14:24:15 localhost.localdomain dockerd-current[810]: time="2018-08-01T14:24:15.749839355+02:00" level=error msg="Handler for GET /v1.26/images/openshift/origin-pod:v3.10/json returned error: No such image: openshift/origin-pod:v3.10"
Aug 01 14:24:15 localhost.localdomain dockerd-current[810]: time="2018-08-01T14:24:15.750494849+02:00" level=error msg="Handler for GET /v1.26/images/openshift/origin-pod:v3.10/json returned error: No such image: openshift/origin-pod:v3.10"
Aug 01 14:24:15 localhost.localdomain dockerd-current[810]: time="2018-08-01T14:24:15.767248526+02:00" level=warning msg="failed to retrieve docker-init version: unknown output format: tini version 0.18.0\n
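To summarise errors like the ones above, a small filter over the docker journal may help. This is only a sketch: the function name missing_images is made up for illustration, and it assumes the "No such image" message format shown in the log lines quoted in this bug.

```shell
# Print each distinct image that dockerd reported as "No such image".
# Reads `journalctl -u docker` style output on stdin.
# missing_images is a hypothetical helper name, not part of docker or oc.
missing_images() {
    # Capture the image reference between "No such image: " and the
    # closing double quote of the msg="..." field, then deduplicate.
    sed -n 's/.*returned error: No such image: \([^"]*\)".*/\1/p' | sort -u
}
```

Possible usage: journalctl -u docker | missing_images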

Comment 5 Jan Kurik 2018-08-14 11:02:08 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 29 development cycle.
Changing version to '29'.

Comment 6 Jakub Čajka 2018-08-20 13:00:19 UTC
Hm... It seems that there is an issue with systemd. Updating systemd on f28 to the version from rawhide results in a timeout waiting for "https://127.0.0.1:8443/healthz?timeout=32s: dial tcp 127.0.0.1:8443: connect: connection refused ()". The port seems to be unreachable/closed (mind that the firewall is disabled along with SELinux).

Systemd folks, have there been changes in rawhide/f29 systemd that this could be attributed to?
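The health endpoint from the error above can also be probed by hand while the cluster is starting. The following is a minimal sketch, not what oc itself does: wait_for_healthz is a hypothetical helper, the attempt count and delay are arbitrary, and -k is used on the assumption that the cluster certificate is not trusted by the host.

```shell
# Poll a health endpoint until it answers or the attempts run out.
# Arguments: URL, number of attempts, delay in seconds between attempts.
# wait_for_healthz is a hypothetical helper name used for illustration.
wait_for_healthz() {
    url=$1
    attempts=$2
    delay=$3
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -k: skip certificate verification (assumed untrusted local CA)
        # -s: silent, -f: treat HTTP error responses as failures
        if curl -ksf "$url" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

Possible usage: wait_for_healthz https://127.0.0.1:8443/healthz 30 2 || echo "API server not reachable"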

Comment 7 Lukas Slebodnik 2018-08-21 12:48:17 UTC
(In reply to Jakub Čajka from comment #6)
> Hm... Seems that there is issue with systemd. Updating the systemd on the
> f28 to the version from rawhide results in timeout waiting for
> "https://127.0.0.1:8443/healthz?timeout=32s: dial tcp 127.0.0.1:8443:
> connect: connection refused ()". It seems to be unreachable/closed(mind that
> firewall is disabled along with selinux)
> 
> Systemd folks, have there been changes in rawhide/f29 systemd that this
> could be attributed to?

I can confirm that it works well with systemd-238-9.git0e0aa59.fc29.x86_64.

Comment 8 Jakub Čajka 2018-08-21 13:48:14 UTC
(In reply to Lukas Slebodnik from comment #7)
> (In reply to Jakub Čajka from comment #6)
> > Hm... Seems that there is issue with systemd. Updating the systemd on the
> > f28 to the version from rawhide results in timeout waiting for
> > "https://127.0.0.1:8443/healthz?timeout=32s: dial tcp 127.0.0.1:8443:
> > connect: connection refused ()". It seems to be unreachable/closed(mind that
> > firewall is disabled along with selinux)
> > 
> > Systemd folks, have there been changes in rawhide/f29 systemd that this
> > could be attributed to?
> 
> I can confirm that it works well with systemd-238-9.git0e0aa59.fc29.x86_64.

And it doesn't with 239-3.fc29, right?

Comment 9 Joe Doss 2018-08-26 22:57:09 UTC
I can confirm that it doesn't work with 239-3.fc29. Downgrading to systemd-238-9.git0e0aa59.fc29.x86_64 allows origin 3.10 to install correctly.

This might be related to https://bugzilla.redhat.com/show_bug.cgi?id=1568594 and https://bugzilla.redhat.com/show_bug.cgi?id=1558425
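For anyone checking which side of the 238/239 boundary a machine is on, the version can be pulled out of the rpm -q systemd output. This is a sketch only: both function names are made up, and treating every version from 239 upward as affected is an assumption, since only 239-3.fc29 was actually reported as failing here.

```shell
# Extract the upstream version from a systemd package string such as
# "systemd-239-3.fc29.x86_64" (the format printed by `rpm -q systemd`).
# systemd_major and is_affected_systemd are hypothetical helper names.
systemd_major() {
    printf '%s\n' "$1" | sed -n 's/^systemd-\([0-9][0-9]*\)-.*/\1/p'
}

# Succeed when the package is at or above 239, the first version
# reported as failing in this bug; 238 was reported as working.
# ASSUMPTION: versions newer than 239 behave like 239.
is_affected_systemd() {
    major=$(systemd_major "$1")
    [ -n "$major" ] && [ "$major" -ge 239 ]
}
```

Possible usage: is_affected_systemd "$(rpm -q systemd)" && echo "systemd 239 or newer, oc cluster up may fail"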

Comment 10 Jakub Čajka 2018-08-29 10:08:07 UTC
For the record, steps to reproduce:
1. Clean f28 (or f29) install
2. Update to rawhide/f29 systemd 239 (downgrade f29 to 238)
3. systemctl disable firewalld
4. reboot
5. dnf install origin-clients docker
6. Create /etc/docker/daemon.json with contents
{
    "insecure-registries" : [ "172.30.0.0/16" ]
}
7. systemctl start docker
8. oc cluster up
It should fail as mentioned above.
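Step 6 above can be scripted so the file contents are always identical between runs. A minimal sketch, assuming it is acceptable to overwrite any existing file; the function name write_insecure_registry_config is made up for illustration.

```shell
# Write the docker daemon configuration from step 6.
# Argument: target path (defaults to /etc/docker/daemon.json).
# NOTE: this overwrites the target file if it already exists.
# write_insecure_registry_config is a hypothetical helper name.
write_insecure_registry_config() {
    target=${1:-/etc/docker/daemon.json}
    cat > "$target" <<'EOF'
{
    "insecure-registries" : [ "172.30.0.0/16" ]
}
EOF
}
```

Possible usage (as root): write_insecure_registry_config && systemctl start docker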

Comment 11 Jakub Čajka 2018-09-17 06:20:11 UTC
*** Bug 1629431 has been marked as a duplicate of this bug. ***

Comment 12 Jakub Čajka 2018-09-26 13:24:03 UTC
Based on the discussion in https://pagure.io/atomic-wg/issue/510 it is an origin/runc issue.

Comment 13 Fedora Update System 2018-10-02 10:18:40 UTC
origin-3.11.0-0.alpha1.0.fc29 has been submitted as an update to Fedora 29. https://bodhi.fedoraproject.org/updates/FEDORA-2018-7ed03f9dcf

Comment 14 Fedora Update System 2018-10-02 21:18:37 UTC
origin-3.11.0-0.alpha1.0.fc29 has been pushed to the Fedora 29 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-7ed03f9dcf

Comment 15 Fedora Update System 2018-10-09 00:05:21 UTC
origin-3.11.0-0.alpha1.0.fc29 has been pushed to the Fedora 29 stable repository. If problems still persist, please make note of it in this bug report.