Description of problem:
After installing an HA config in 3.2.0.17, the atomic-openshift-master systemd unit was masked and could not be restarted. I unmasked it and ran restart (to pick up a master-config.yaml change) and the service would not restart; see the log messages below. The install was on AWS using the aos-ansible/playbooks/aws_install_prep.yml and openshift-ansible/playbooks/byo/config.yml playbooks (see detailed steps below; inventory files will be placed on the internal site).

Version-Release number of selected component (if applicable):
3.2.0.17

How reproducible:
Always

Steps to Reproduce:
1. Install on AWS: 2 masters, 1 load balancer, 1 node. I tried larger configurations and hit the same issue; this seems to be the simplest way to repro.
2. Configure the playbooks for 3.2.0.17 (see below for the playbook location).
3. Run the playbooks. Both run clean, with successful installs.
4. After install the cluster seems OK: basic oc commands work, the router and registry deploy OK, and the master load balancer is accepting logins and doing its job.
5. Run systemctl restart atomic-openshift-master (as if master-config.yaml had changed, for instance).

Actual results:
# systemctl restart atomic-openshift-master
Failed to restart atomic-openshift-master.service: Unit atomic-openshift-master.service is masked.
# systemctl unmask atomic-openshift-master
Removed symlink /etc/systemd/system/atomic-openshift-master.service.
# systemctl restart atomic-openshift-master
Job for atomic-openshift-master.service failed because the control process exited with error code. See "systemctl status atomic-openshift-master.service" and "journalctl -xe" for details.

Expected results:
Service restarts normally.

Additional info:
Unmasking the service and restarting gives the following in the system log:
Apr 20 14:51:30 ip-172-31-6-18 systemd: Starting Atomic OpenShift Master...
Apr 20 14:51:31 ip-172-31-6-18 atomic-openshift-master: W0420 14:51:31.081811 38940 start_master.go:270] assetConfig.loggingPublicURL: Invalid value: "": required to view aggregated container logs in the console
Apr 20 14:51:31 ip-172-31-6-18 atomic-openshift-master: W0420 14:51:31.081899 38940 start_master.go:270] assetConfig.metricsPublicURL: Invalid value: "": required to view cluster metrics in the console
Apr 20 14:51:31 ip-172-31-6-18 atomic-openshift-master: E0420 14:51:31.943555 38940 aws.go:676] Tag "KubernetesCluster" not found; Kuberentes may behave unexpectedly.
Apr 20 14:51:31 ip-172-31-6-18 atomic-openshift-master: I0420 14:51:31.943601 38940 aws.go:683] AWS cloud - no tag filtering
Apr 20 14:51:31 ip-172-31-6-18 atomic-openshift-master: I0420 14:51:31.943618 38940 plugins.go:41] Registered credential provider "aws-ecr-key"
Apr 20 14:51:31 ip-172-31-6-18 atomic-openshift-master: I0420 14:51:31.943648 38940 master_config.go:145] Successfully initialized cloud provider: "aws" from the config file: "/etc/origin/cloudprovider/aws.conf"
Apr 20 14:51:32 ip-172-31-6-18 atomic-openshift-master: I0420 14:51:32.227696 38940 genericapiserver.go:81] Adding storage destination for group
Apr 20 14:51:32 ip-172-31-6-18 atomic-openshift-master: I0420 14:51:32.227738 38940 genericapiserver.go:81] Adding storage destination for group extensions
Apr 20 14:51:32 ip-172-31-6-18 atomic-openshift-master: I0420 14:51:32.227767 38940 start_master.go:383] Starting master on 0.0.0.0:8443 (v3.2.0.17)
Apr 20 14:51:32 ip-172-31-6-18 atomic-openshift-master: I0420 14:51:32.227774 38940 start_master.go:384] Public master address is https://ec2-54-186-44-98.us-west-2.compute.amazonaws.com:8443
Apr 20 14:51:32 ip-172-31-6-18 atomic-openshift-master: I0420 14:51:32.227789 38940 start_master.go:388] Using images from "registry.qe.openshift.com/openshift3/ose-<component>:v3.2.0.17"
Apr 20 14:51:32 ip-172-31-6-18 atomic-openshift-master: I0420 14:51:32.398217 38940 run_components.go:204] Using default project node label selector:
Apr 20 14:51:32 ip-172-31-6-18 atomic-openshift-node: I0420 14:51:32.583754 33946 iowatcher.go:103] Unexpected EOF during watch stream event decoding: unexpected EOF
Apr 20 14:51:32 ip-172-31-6-18 atomic-openshift-node: I0420 14:51:32.606866 33946 iowatcher.go:103] Unexpected EOF during watch stream event decoding: unexpected EOF
Apr 20 14:51:33 ip-172-31-6-18 atomic-openshift-master: W0420 14:51:33.616522 38940 controller.go:297] Resetting endpoints for master service "kubernetes" to &{{ } {kubernetes default 5f2757aa-0727-11e6-8eac-028b2ba9cf7f 271 0 2016-04-20 14:40:36 -0400 EDT <nil> <nil> map[] map[]} [{[{172.31.6.18 <nil>} {172.31.6.19 <nil>}] [] [{https 8443 TCP} {dns 8053 UDP} {dns-tcp 8053 TCP}]}]}
Apr 20 14:51:33 ip-172-31-6-18 atomic-openshift-master: I0420 14:51:33.878803 38940 master.go:262] Started Kubernetes API at 0.0.0.0:8443/api/v1
Apr 20 14:51:33 ip-172-31-6-18 atomic-openshift-master: I0420 14:51:33.878840 38940 master.go:262] Started Kubernetes API Extensions at 0.0.0.0:8443/apis/extensions/v1beta1
Apr 20 14:51:33 ip-172-31-6-18 atomic-openshift-master: I0420 14:51:33.878847 38940 master.go:262] Started Origin API at 0.0.0.0:8443/oapi/v1
Apr 20 14:51:33 ip-172-31-6-18 atomic-openshift-master: I0420 14:51:33.878852 38940 master.go:262] Started OAuth2 API at 0.0.0.0:8443/oauth
Apr 20 14:51:33 ip-172-31-6-18 atomic-openshift-master: I0420 14:51:33.878856 38940 master.go:262] Started Web Console 0.0.0.0:8443/console/
Apr 20 14:51:33 ip-172-31-6-18 atomic-openshift-master: I0420 14:51:33.878861 38940 master.go:262] Started Swagger Schema API at 0.0.0.0:8443/swaggerapi/
Apr 20 14:51:33 ip-172-31-6-18 atomic-openshift-master: I0420 14:51:33.905200 38940 ensure.go:226] Ignoring bootstrap policy file because cluster policy found
Apr 20 14:51:34 ip-172-31-6-18 atomic-openshift-master: F0420 14:51:34.029810 38940 master.go:277] listen tcp4 0.0.0.0:8443: bind: address already in use
Apr 20 14:51:34 ip-172-31-6-18 systemd: atomic-openshift-master.service: main process exited, code=exited, status=255/n/a
Apr 20 14:51:34 ip-172-31-6-18 systemd: Failed to start Atomic OpenShift Master.
Apr 20 14:51:34 ip-172-31-6-18 systemd: Unit atomic-openshift-master.service entered failed state.
Apr 20 14:51:34 ip-172-31-6-18 systemd: atomic-openshift-master.service failed.
Failed to mention: non-HA installs (i.e. single master, no load balancer) using the same playbooks, including the AWS config, do not experience the issue.
Removing the cloudprovider config made no difference.
@Mike I see '0.0.0.0:8443: bind: address already in use' in the error log, so I think the master-api service is still running on the server when you try to start the master service. It is not reasonable to start the atomic-openshift-master service on an HA master, so I think this is not a bug. thx
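To make the diagnosis above concrete: the fatal 'F0420 ... bind: address already in use' line is exactly what a second listener on an already-bound port produces. A minimal sketch of that failure mode (assumes python3 is available on the host; it uses an OS-assigned port rather than 8443, so it is safe to run anywhere and does not require the master-api service to be up):

```shell
# Reproduce the EADDRINUSE failure the master hit on 8443, on a
# throwaway port so nothing on the real system is touched.
python3 - <<'EOF'
import errno, socket

first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("0.0.0.0", 0))          # first listener grabs a free port
first.listen(1)
port = first.getsockname()[1]

second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    second.bind(("0.0.0.0", port))  # same port: fails like the master did
except OSError as e:
    assert e.errno == errno.EADDRINUSE
    print("bind: address already in use")
EOF
```

On the affected master, the already-running atomic-openshift-master-api process plays the role of the first listener here.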
The restart issue happens on all master nodes. How are changes to master-config.yaml supposed to be picked up? Normally (per the documentation), it is via atomic-openshift-master restart.
Example of what I mean in comment 4: setting metricsPublicURL or the default project node selector.
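For context, a sketch of the kind of master-config.yaml edit meant here. The assetConfig keys are taken from the warnings in the log above; the URLs and the selector value are placeholder assumptions, not values from this install:

```yaml
# /etc/origin/master/master-config.yaml (fragment, illustrative values only)
assetConfig:
  loggingPublicURL: "https://kibana.example.com"
  metricsPublicURL: "https://metrics.example.com/hawkular/metrics"
projectConfig:
  defaultNodeSelector: "region=primary"
```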
For HA installations we mask the atomic-openshift-master service on purpose; the atomic-openshift-master-api and atomic-openshift-master-controllers units are used in its place. Both services use the /etc/origin/master/master-config.yaml config file, but the environment file for the controllers service (/etc/sysconfig/atomic-openshift-master-controllers) overrides some values via command-line parameters. Instead of issuing `systemctl restart atomic-openshift-master` on an HA install, you should issue `systemctl restart atomic-openshift-master-api atomic-openshift-master-controllers`.
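For anyone checking which state the unit is in: a masked unit is simply a symlink to /dev/null under /etc/systemd/system (the "Removed symlink /etc/systemd/system/atomic-openshift-master.service" message in the report is `systemctl unmask` deleting exactly that link). A small sketch of what the installer's mask looks like on disk, demonstrated in a throwaway directory so it does not touch the real unit files:

```shell
# Masking is just a /dev/null symlink; recreate the shape of the
# installer's mask of atomic-openshift-master.service in a temp dir.
unitdir=$(mktemp -d)
ln -s /dev/null "$unitdir/atomic-openshift-master.service"

# A unit whose file resolves to /dev/null is masked:
if [ "$(readlink "$unitdir/atomic-openshift-master.service")" = "/dev/null" ]; then
    echo "atomic-openshift-master.service is masked"
fi
rm -rf "$unitdir"
```

On a real HA master, `systemctl is-enabled atomic-openshift-master` should likewise report "masked" until someone unmasks it.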
An HA master install disables the atomic-openshift-master service and splits it into the atomic-openshift-master-api and atomic-openshift-master-controllers services; each of the two picks up a different part of the master config file depending on what you changed. If you are not sure which service to restart to pick up your change, I suggest restarting both services on each master. I agree with comment 4: this should be NOTABUG, or maybe a docs bug.
Thanks for the clarification, and apologies for the fire drill.