Bug 1328977

Summary: atomic-openshift-master service masked/fails to restart after successful HA install of 3.2.0.17
Product: OpenShift Container Platform
Component: Installer
Version: 3.2.0
Hardware: x86_64
OS: Linux
Status: CLOSED NOTABUG
Severity: urgent
Priority: unspecified
Reporter: Mike Fiedler <mifiedle>
Assignee: Jason DeTiberus <jdetiber>
QA Contact: Ma xiaoqiang <xiama>
CC: aos-bugs, jialiu, jokerman, mifiedle, mmccomas, xtian
Doc Type: Bug Fix
Type: Bug
Target Milestone: ---
Target Release: ---
Last Closed: 2016-04-21 03:35:34 UTC

Description Mike Fiedler 2016-04-20 19:20:35 UTC
Description of problem:

After installing an HA configuration with 3.2.0.17, the atomic-openshift-master systemd unit was masked and could not be restarted.  I unmasked it and ran a restart (to pick up a master-config.yaml change), but the service would not start - see the log messages below.

The install was on AWS using the aos-ansible/playbooks/aws_install_prep.yml and openshift-ansible/playbooks/byo/config.yml playbooks (see detailed steps below; inventory files will be placed on the internal site).


Version-Release number of selected component (if applicable): 3.2.0.17.

How reproducible: always


Steps to Reproduce:
1. Install on AWS: 2 masters, 1 load balancer, 1 node.  I tried larger configurations and hit the same issue; this seems to be the simplest way to reproduce it.
2. Configure the playbooks for 3.2.0.17 (see below for playbook locations).
3. Run the playbooks - both run clean, and the installs succeed.
4. After the install the cluster seems OK: basic oc commands work, the router and registry deploy, and the master load balancer accepts logins and does its job.
5. Run `systemctl restart atomic-openshift-master` (as if master-config.yaml had changed, for instance).

Actual results:
# systemctl restart atomic-openshift-master
Failed to restart atomic-openshift-master.service: Unit atomic-openshift-master.service is masked.
# systemctl unmask atomic-openshift-master
Removed symlink /etc/systemd/system/atomic-openshift-master.service.
# systemctl restart atomic-openshift-master
Job for atomic-openshift-master.service failed because the control process exited with error code. See "systemctl status atomic-openshift-master.service" and "journalctl -xe" for details.

system log of failed restart below


Expected results:

The service restarts normally.  Instead, unmasking the service and restarting it produces the system log messages shown under Additional info below.


Additional info:

Apr 20 14:51:30 ip-172-31-6-18 systemd: Starting Atomic OpenShift Master...
Apr 20 14:51:31 ip-172-31-6-18 atomic-openshift-master: W0420 14:51:31.081811   38940 start_master.go:270] assetConfig.loggingPublicURL: Invalid value: "": required to view aggregated container logs in the console
Apr 20 14:51:31 ip-172-31-6-18 atomic-openshift-master: W0420 14:51:31.081899   38940 start_master.go:270] assetConfig.metricsPublicURL: Invalid value: "": required to view cluster metrics in the console
Apr 20 14:51:31 ip-172-31-6-18 atomic-openshift-master: E0420 14:51:31.943555   38940 aws.go:676] Tag "KubernetesCluster" not found; Kuberentes may behave unexpectedly.
Apr 20 14:51:31 ip-172-31-6-18 atomic-openshift-master: I0420 14:51:31.943601   38940 aws.go:683] AWS cloud - no tag filtering
Apr 20 14:51:31 ip-172-31-6-18 atomic-openshift-master: I0420 14:51:31.943618   38940 plugins.go:41] Registered credential provider "aws-ecr-key"
Apr 20 14:51:31 ip-172-31-6-18 atomic-openshift-master: I0420 14:51:31.943648   38940 master_config.go:145] Successfully initialized cloud provider: "aws" from the config file: "/etc/origin/cloudprovider/aws.conf"
Apr 20 14:51:32 ip-172-31-6-18 atomic-openshift-master: I0420 14:51:32.227696   38940 genericapiserver.go:81] Adding storage destination for group
Apr 20 14:51:32 ip-172-31-6-18 atomic-openshift-master: I0420 14:51:32.227738   38940 genericapiserver.go:81] Adding storage destination for group extensions
Apr 20 14:51:32 ip-172-31-6-18 atomic-openshift-master: I0420 14:51:32.227767   38940 start_master.go:383] Starting master on 0.0.0.0:8443 (v3.2.0.17)
Apr 20 14:51:32 ip-172-31-6-18 atomic-openshift-master: I0420 14:51:32.227774   38940 start_master.go:384] Public master address is https://ec2-54-186-44-98.us-west-2.compute.amazonaws.com:8443
Apr 20 14:51:32 ip-172-31-6-18 atomic-openshift-master: I0420 14:51:32.227789   38940 start_master.go:388] Using images from "registry.qe.openshift.com/openshift3/ose-<component>:v3.2.0.17"
Apr 20 14:51:32 ip-172-31-6-18 atomic-openshift-master: I0420 14:51:32.398217   38940 run_components.go:204] Using default project node label selector:
Apr 20 14:51:32 ip-172-31-6-18 atomic-openshift-node: I0420 14:51:32.583754   33946 iowatcher.go:103] Unexpected EOF during watch stream event decoding: unexpected EOF
Apr 20 14:51:32 ip-172-31-6-18 atomic-openshift-node: I0420 14:51:32.606866   33946 iowatcher.go:103] Unexpected EOF during watch stream event decoding: unexpected EOF
Apr 20 14:51:33 ip-172-31-6-18 atomic-openshift-master: W0420 14:51:33.616522   38940 controller.go:297] Resetting endpoints for master service "kubernetes" to &{{ } {kubernetes  default  5f2757aa-0727-11e6-8eac-028b2ba9cf7f 271 0 2016-04-20 14:40:36 -0400 EDT <nil> <nil> map[] map[]} [{[{172.31.6.18 <nil>} {172.31.6.19 <nil>}] [] [{https 8443 TCP} {dns 8053 UDP} {dns-tcp 8053 TCP}]}]}
Apr 20 14:51:33 ip-172-31-6-18 atomic-openshift-master: I0420 14:51:33.878803   38940 master.go:262] Started Kubernetes API at 0.0.0.0:8443/api/v1
Apr 20 14:51:33 ip-172-31-6-18 atomic-openshift-master: I0420 14:51:33.878840   38940 master.go:262] Started Kubernetes API Extensions at 0.0.0.0:8443/apis/extensions/v1beta1
Apr 20 14:51:33 ip-172-31-6-18 atomic-openshift-master: I0420 14:51:33.878847   38940 master.go:262] Started Origin API at 0.0.0.0:8443/oapi/v1
Apr 20 14:51:33 ip-172-31-6-18 atomic-openshift-master: I0420 14:51:33.878852   38940 master.go:262] Started OAuth2 API at 0.0.0.0:8443/oauth
Apr 20 14:51:33 ip-172-31-6-18 atomic-openshift-master: I0420 14:51:33.878856   38940 master.go:262] Started Web Console 0.0.0.0:8443/console/
Apr 20 14:51:33 ip-172-31-6-18 atomic-openshift-master: I0420 14:51:33.878861   38940 master.go:262] Started Swagger Schema API at 0.0.0.0:8443/swaggerapi/
Apr 20 14:51:33 ip-172-31-6-18 atomic-openshift-master: I0420 14:51:33.905200   38940 ensure.go:226] Ignoring bootstrap policy file because cluster policy found
Apr 20 14:51:34 ip-172-31-6-18 atomic-openshift-master: F0420 14:51:34.029810   38940 master.go:277] listen tcp4 0.0.0.0:8443: bind: address already in use
Apr 20 14:51:34 ip-172-31-6-18 systemd: atomic-openshift-master.service: main process exited, code=exited, status=255/n/a
Apr 20 14:51:34 ip-172-31-6-18 systemd: Failed to start Atomic OpenShift Master.
Apr 20 14:51:34 ip-172-31-6-18 systemd: Unit atomic-openshift-master.service entered failed state.
Apr 20 14:51:34 ip-172-31-6-18 systemd: atomic-openshift-master.service failed.
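The fatal `bind: address already in use` at the end of the log means something is already listening on 8443. A quick way to confirm what (a sketch, assuming iproute2's `ss` is available on the host; the function name is mine):

```shell
# Show which process already owns the master port (8443 by default).
# On an HA master this turns out to be the atomic-openshift-master-api
# process, which explains the bind failure above. Needs root for -p.
show_port_owner() {
  ss -tlnp "( sport = :${1:-8443} )"
}
```

Running `show_port_owner` on a master should list the process currently holding the socket.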

Comment 2 Mike Fiedler 2016-04-20 19:34:25 UTC
I failed to mention: non-HA installs (i.e. single master, no load balancer) using the same playbooks, including the AWS config, do not hit this issue.

Comment 3 Mike Fiedler 2016-04-20 20:24:40 UTC
Removing the cloudprovider config made no difference.

Comment 4 Ma xiaoqiang 2016-04-21 02:03:47 UTC
@Mike

I see '0.0.0.0:8443: bind: address already in use' in the error log, so I think the master-api service is still running on the server when you try to start the master service. It is not reasonable to start the atomic-openshift-master service on an HA master, so I think this is not a bug. Thanks.

Comment 5 Mike Fiedler 2016-04-21 02:10:18 UTC
The restart issue happens on all master nodes.  How are changes to master-config.yaml supposed to be picked up?  Normally (per the documentation), it is by restarting atomic-openshift-master.

Comment 6 Mike Fiedler 2016-04-21 02:11:29 UTC
Example of what I mean in comment 4:  setting metricsPublicURL or the default project node selector.

Comment 7 Jason DeTiberus 2016-04-21 03:35:34 UTC
For HA installations we mask the atomic-openshift-master service on purpose; the atomic-openshift-master-api and atomic-openshift-master-controllers units are used in its place.

Both the atomic-openshift-master-api and atomic-openshift-master-controllers services use the /etc/origin/master/master-config.yaml config file; however, the environment file for the controllers service (/etc/sysconfig/atomic-openshift-master-controllers) overrides some values via command-line parameters.

For an HA install, instead of issuing `systemctl restart atomic-openshift-master`, you should issue `systemctl restart atomic-openshift-master-api atomic-openshift-master-controllers`.
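Wrapped as a small helper (nothing beyond the unit names given above is assumed; the function name is mine), the HA restart becomes:

```shell
# Restart the two units that replace the masked atomic-openshift-master
# unit on an HA master. Run this on each master host.
restart_ha_master() {
  systemctl restart atomic-openshift-master-api atomic-openshift-master-controllers
}
```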

Comment 8 Johnny Liu 2016-04-21 07:20:58 UTC
An HA master install disables the atomic-openshift-master service and splits it into the atomic-openshift-master-api and atomic-openshift-master-controllers services; each service picks up a different part of the master config file, depending on what you changed.

If you are not sure which service to restart to pick up your change, I suggest restarting both services on each master.
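To apply that advice across a cluster, a hedged sketch (the host arguments and root ssh access are assumptions about your environment, not part of the install; the function name is mine):

```shell
# Restart both master units on every host passed as an argument, e.g.:
#   restart_all_masters master1.example.com master2.example.com
restart_all_masters() {
  for host in "$@"; do
    ssh "root@$host" systemctl restart \
      atomic-openshift-master-api atomic-openshift-master-controllers
  done
}
```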

I agree with comment 4: this should be NOTABUG, or maybe a docs bug.

Comment 9 Mike Fiedler 2016-04-21 09:48:02 UTC
Thanks for the clarification, and apologies for the fire drill.