Bug 1498643 - Master controller service failed to start during installation
Summary: Master controller service failed to start during installation
Keywords:
Status: CLOSED DUPLICATE of bug 1491399
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Scott Dodson
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-10-04 20:23 UTC by Vikas Laad
Modified: 2017-10-06 13:16 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-10-05 20:19:47 UTC
Target Upstream Version:
Embargoed:


Attachments

Description Vikas Laad 2017-10-04 20:23:28 UTC
Description of problem:
Master controller service failed to start during installation

Oct 04 20:18:31 ip-172-31-17-67.us-west-2.compute.internal systemd[1]: atomic-openshift-master-controllers.service: main process exited, code=exited, status=255/n/a
Oct 04 20:18:31 ip-172-31-17-67.us-west-2.compute.internal systemd[1]: Scope libcontainer-40619-systemd-test-default-dependencies.scope has no PIDs. Refusing.
Oct 04 20:18:31 ip-172-31-17-67.us-west-2.compute.internal systemd[1]: Scope libcontainer-40619-systemd-test-default-dependencies.scope has no PIDs. Refusing.
Oct 04 20:18:31 ip-172-31-17-67.us-west-2.compute.internal systemd[1]: Created slice libcontainer_40619_systemd_test_default.slice.
Oct 04 20:18:31 ip-172-31-17-67.us-west-2.compute.internal systemd[1]: Starting libcontainer_40619_systemd_test_default.slice.
Oct 04 20:18:31 ip-172-31-17-67.us-west-2.compute.internal systemd[1]: Removed slice libcontainer_40619_systemd_test_default.slice.
Oct 04 20:18:31 ip-172-31-17-67.us-west-2.compute.internal systemd[1]: Stopping libcontainer_40619_systemd_test_default.slice.
Oct 04 20:18:31 ip-172-31-17-67.us-west-2.compute.internal atomic-openshift-master-controllers[40619]: container "atomic-openshift-master-controllers" does not exist
Oct 04 20:18:31 ip-172-31-17-67.us-west-2.compute.internal systemd[1]: atomic-openshift-master-controllers.service: control process exited, code=exited status=1
Oct 04 20:18:31 ip-172-31-17-67.us-west-2.compute.internal systemd[1]: Unit atomic-openshift-master-controllers.service entered failed state.
Oct 04 20:18:31 ip-172-31-17-67.us-west-2.compute.internal systemd[1]: atomic-openshift-master-controllers.service failed.
Oct 04 20:18:36 ip-172-31-17-67.us-west-2.compute.internal systemd[1]: atomic-openshift-master-controllers.service holdoff time over, scheduling restart.
Oct 04 20:18:36 ip-172-31-17-67.us-west-2.compute.internal systemd[1]: Starting atomic-openshift-master-controllers.service...
Oct 04 20:18:36 ip-172-31-17-67.us-west-2.compute.internal systemd[1]: Started atomic-openshift-master-controllers.service.
Oct 04 20:18:36 ip-172-31-17-67.us-west-2.compute.internal systemd[1]: Scope libcontainer-40627-systemd-test-default-dependencies.scope has no PIDs. Refusing.
Oct 04 20:18:36 ip-172-31-17-67.us-west-2.compute.internal systemd[1]: Scope libcontainer-40627-systemd-test-default-dependencies.scope has no PIDs. Refusing.
Oct 04 20:18:36 ip-172-31-17-67.us-west-2.compute.internal systemd[1]: Created slice libcontainer_40627_systemd_test_default.slice.
Oct 04 20:18:36 ip-172-31-17-67.us-west-2.compute.internal systemd[1]: Starting libcontainer_40627_systemd_test_default.slice.
Oct 04 20:18:36 ip-172-31-17-67.us-west-2.compute.internal systemd[1]: Removed slice libcontainer_40627_systemd_test_default.slice.
Oct 04 20:18:36 ip-172-31-17-67.us-west-2.compute.internal systemd[1]: Stopping libcontainer_40627_systemd_test_default.slice.
Oct 04 20:18:36 ip-172-31-17-67.us-west-2.compute.internal systemd[1]: Started libcontainer container atomic-openshift-master-controllers.
Oct 04 20:18:36 ip-172-31-17-67.us-west-2.compute.internal systemd[1]: Starting libcontainer container atomic-openshift-master-controllers.
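
(For reference, a minimal sketch of how the unit state and full controller logs on the affected master could be collected with standard systemd tooling; this assumes only the containerized atomic-openshift-master-controllers unit named in the journal excerpt above.)

# unit name taken from the journal excerpt above
systemctl status atomic-openshift-master-controllers
journalctl -u atomic-openshift-master-controllers --no-pager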


Version-Release number of the following components:
openshift-ansible version 84f27a8d66b8638c32e9dca5eec05df684d20773

rpm -q ansible
ansible-2.4.0.0-3.el7.noarch

ansible --version
ansible 2.4.0.0
  config file = /root/openshift-ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, May  3 2017, 07:55:04) [GCC 4.8.5 20150623 (Red Hat 4.8.5-14)]


How reproducible:
Always

Steps to Reproduce:
1. Install OpenShift on an Atomic Host by running the byo playbook (a sketch of the command is shown below)
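
(A minimal sketch of the reproduction command, assuming the /root/openshift-ansible checkout referenced elsewhere in this bug; the inventory file name is a placeholder and the playbook path follows the byo layout of that repository.)

# <inventory_file> is a placeholder for the attached inventory
cd /root/openshift-ansible
ansible-playbook -i <inventory_file> playbooks/byo/config.yml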

Actual results:
TASK [openshift_master : Wait for master controller service to start on first master] ****************************************
task path: /root/openshift-ansible/roles/openshift_master/tasks/main.yml:353
Pausing for 15 seconds
(ctrl+C then 'C' = continue early, ctrl+C then 'A' = abort)

Expected results:
The playbook should complete successfully.

Additional info:
Inventory and logs are attached.

Comment 4 Mike Fiedler 2017-10-05 20:05:45 UTC
See https://bugzilla.redhat.com/show_bug.cgi?id=1498934

Comment 5 Scott Dodson 2017-10-05 20:19:47 UTC
We don't manage the instances, and the instances need to be labeled prior to running the installer.

Comment 6 Scott Dodson 2017-10-05 20:24:02 UTC
Rob,

Just to clarify: if the instance is not tagged with the appropriate label, is there no configuration file for either the master or the node that would label the instance correctly? The stance we took is to rely on the master looking up the correct metadata, since that's the most important bit; adding this to a configuration file on the instance would just be another config value that needs to be kept in sync.

--
Scott

Comment 7 Robert Rati 2017-10-06 12:36:35 UTC
Correct.  The master will not label nodes under any circumstances, and it does not store that label information anywhere.  On startup the master looks on its own instance for one of two specific labels (there is an old method and a new one), and if it finds one it uses that label's value.  If it does not find the label, the controllers process either exits or prints a warning in its log file, depending on command line options.
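
(For illustration only, a sketch of applying the newer label scheme referenced in comment 8 with the AWS CLI; the instance ID, cluster ID, and tag value below are placeholders, not values from this bug.)

# i-0123456789abcdef0 and <clusterid> are hypothetical placeholders
aws ec2 create-tags \
    --resources i-0123456789abcdef0 \
    --tags "Key=kubernetes.io/cluster/<clusterid>,Value=owned"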

Comment 8 Scott Dodson 2017-10-06 13:16:18 UTC
Ok, I think ansible can gather the tags from the metadata API. What we'll do is query that API for all node and master hosts. If a node or master host has the AWS cloud provider configured but does not have a tag named "kubernetes.io/cluster/xxxx", we'll block the upgrade and install on 3.7 with a message that links to documentation. We need to get that documentation ready, explaining both how to properly label new installations and how to retroactively label existing installations. We'll do that in https://bugzilla.redhat.com/show_bug.cgi?id=1498643
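
(A sketch of the kind of pre-flight check described above, assuming the AWS CLI is available on the host; the instance ID is a placeholder and this is not the actual openshift-ansible implementation.)

# placeholder instance ID; an empty Tags list would mean the cluster tag is missing
aws ec2 describe-tags \
    --filters "Name=resource-id,Values=i-0123456789abcdef0" \
              "Name=key,Values=kubernetes.io/cluster/*"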

*** This bug has been marked as a duplicate of bug 1491399 ***

