Bug 1539691 - 3.9.0-0.31.0 - web console pod does not start because master is not schedulable
Summary: 3.9.0-0.31.0 - web console pod does not start because master is not schedulable
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.9.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.9.0
Assignee: Vadim Rutkovsky
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-01-29 13:05 UTC by Mike Fiedler
Modified: 2018-04-13 12:17 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-04-13 12:17:42 UTC
Target Upstream Version:
Embargoed:



Description Mike Fiedler 2018-01-29 13:05:07 UTC
Description of problem:

OpenShift and openshift-ansible 3.9.0-0.31.0

During the install, the webconsole pod is stuck in Pending with the following scheduling events:
Events:                          
  Type     Reason            Age               From               Message                                                              
  ----     ------            ----              ----               -------                                                              
  Warning  FailedScheduling  5s (x13 over 2m)  default-scheduler  0/5 nodes are available: 1 NodeUnschedulable, 4 MatchNodeSelector. 

Making the master schedulable allows the install to proceed.
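
As a workaround sketch (the hostname is this cluster's master from the inventory below; any master works), the node can be marked schedulable from the CLI:

# oc adm manage-node ec2-54-149-171-156.us-west-2.compute.amazonaws.com --schedulable=true

Alternatively, openshift_schedulable=true can be set on the master's entry in the [nodes] group of the inventory before running the install.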

see also:  https://bugzilla.redhat.com/show_bug.cgi?id=1535673


Version-Release number of the following components:

openshift-ansible-3.9.0-0.31.0.git.0.e0a0ad8.el7.noarch.rpm
openshift-ansible-docs-3.9.0-0.31.0.git.0.e0a0ad8.el7.noarch.rpm
openshift-ansible-playbooks-3.9.0-0.31.0.git.0.e0a0ad8.el7.noarch.rpm
openshift-ansible-roles-3.9.0-0.31.0.git.0.e0a0ad8.el7.noarch.rpm

How reproducible:  Always

Steps to Reproduce:
1.  Use openshift-ansible 3.9.0-0.31.0 to install OpenShift 3.9.0-0.31.0, as sketched below.
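
A minimal sketch of the install invocation (the inventory path is illustrative; the playbook path is the default location for the openshift-ansible 3.9 RPMs):

# ansible-playbook -i /path/to/inventory /usr/share/ansible/openshift-ansible/playbooks/deploy_cluster.yml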

Actual results:

The check that the web console is running stays stuck until the master is manually made schedulable.

TASK [openshift_web_console : Verify that the web console is running] **********
Monday 29 January 2018  12:50:43 +0000 (0:00:01.414)       0:11:37.128 ******** 
FAILED - RETRYING: Verify that the web console is running (120 retries left).
FAILED - RETRYING: Verify that the web console is running (119 retries left).
FAILED - RETRYING: Verify that the web console is running (118 retries left).
FAILED - RETRYING: Verify that the web console is running (117 retries left).
FAILED - RETRYING: Verify that the web console is running (116 retries left).
FAILED - RETRYING: Verify that the web console is running (115 retries left).
FAILED - RETRYING: Verify that the web console is running (114 retries left).
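
The task keeps retrying until the console deployment reports ready; the same state can be checked by hand with something like the following, using the namespace and deployment name the installer creates (seen later in this bug):

# oc get pods -n openshift-web-console -o wide
# oc describe deployment webconsole -n openshift-web-console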

Inventory (some info redacted) 

[OSEv3:children]
masters
nodes

etcd

[OSEv3:vars]

#The following parameters is used by post-actions
iaas_name=AWS
use_rpm_playbook=true
openshift_playbook_rpm_repos=[{'id': 'aos-playbook-rpm', 'name': 'aos-playbook-rpm', 'baseurl': 'http://download.eng.bos.redhat.com/rcm-guest/puddles/RHAOS/AtomicOpenShift/3.9/latest/x86_64/os', 'enabled': 1, 'gpgcheck': 0}]

update_is_images_url=registry.reg-aws.openshift.com:443

#The following parameters is used by openshift-ansible
ansible_ssh_user=root

openshift_cloudprovider_kind=aws

openshift_cloudprovider_aws_access_key=<redacted>


openshift_cloudprovider_aws_secret_key=<redacted>

openshift_master_default_subdomain_enable=true
openshift_master_default_subdomain=apps.0129-os8.qe.rhcloud.com

openshift_auth_type=allowall

openshift_master_identity_providers=[{'name': 'allow_all', 'login': 'true', 'challenge': 'true', 'kind': 'AllowAllPasswordIdentityProvider'}]



openshift_release=v3.9
openshift_deployment_type=openshift-enterprise
openshift_cockpit_deployer_prefix=registry.reg-aws.openshift.com:443/openshift3/
oreg_url=registry.reg-aws.openshift.com:443/openshift3/ose-${component}:${version}
oreg_auth_user={{ lookup('env','REG_AUTH_USER') }}
oreg_auth_password={{ lookup('env','REG_AUTH_PASSWORD') }}
openshift_docker_additional_registries=registry.reg-aws.openshift.com:443
openshift_docker_insecure_registries=registry.reg-aws.openshift.com:443
openshift_service_catalog_image_prefix=registry.reg-aws.openshift.com:443/openshift3/ose-
ansible_service_broker_image_prefix=registry.reg-aws.openshift.com:443/openshift3/ose-
ansible_service_broker_image_tag=v3.9
template_service_broker_prefix=registry.reg-aws.openshift.com:443/openshift3/ose-
template_service_broker_version=v3.9
openshift_web_console_prefix=registry.reg-aws.openshift.com:443/openshift3/ose-
openshift_enable_service_catalog=true
osm_cockpit_plugins=['cockpit-kubernetes']
osm_use_cockpit=false
openshift_docker_options=--log-opt max-size=100M --log-opt max-file=3 --signature-verification=false
use_cluster_metrics=true
openshift_master_cluster_method=native
openshift_master_dynamic_provisioning_enabled=true
openshift_hosted_router_registryurl=registry.reg-aws.openshift.com:443/openshift3/ose-${component}:${version}
openshift_hosted_registry_registryurl=registry.reg-aws.openshift.com:443/openshift3/ose-${component}:${version}
osm_default_node_selector=region=primary
openshift_registry_selector="region=infra,zone=default"
openshift_hosted_router_selector="region=infra,zone=default"
openshift_disable_check=disk_availability,memory_availability,package_availability,docker_image_availability,docker_storage,package_version
openshift_master_portal_net=172.24.0.0/14
openshift_portal_net=172.24.0.0/14
osm_cluster_network_cidr=172.20.0.0/14
osm_host_subnet_length=9
openshift_node_kubelet_args={"pods-per-core": ["0"], "max-pods": ["510"], "image-gc-high-threshold": ["80"], "image-gc-low-threshold": ["70"]}
debug_level=2
openshift_set_hostname=true
openshift_override_hostname_check=true
os_sdn_network_plugin_name=redhat/openshift-ovs-networkpolicy
openshift_hosted_router_replicas=1
openshift_hosted_registry_storage_kind=object
openshift_hosted_registry_storage_provider=s3
openshift_hosted_registry_storage_s3_accesskey=<redacted>
openshift_hosted_registry_storage_s3_secretkey=<redacted>
openshift_hosted_registry_storage_s3_bucket=aoe-svt-test
openshift_hosted_registry_storage_s3_region=us-west-2
openshift_hosted_registry_replicas=1
openshift_hosted_prometheus_deploy=true
openshift_prometheus_image_prefix=registry.reg-aws.openshift.com:443/openshift3/
openshift_prometheus_image_version=v3.9
openshift_prometheus_proxy_image_prefix=registry.reg-aws.openshift.com:443/openshift3/
openshift_prometheus_proxy_image_version=v3.9
openshift_prometheus_alertmanager_image_prefix=registry.reg-aws.openshift.com:443/openshift3/
openshift_prometheus_alertmanager_image_version=v3.9
openshift_prometheus_alertbuffer_image_prefix=registry.reg-aws.openshift.com:443/openshift3/
openshift_prometheus_alertbuffer_image_version=v3.9
openshift_metrics_install_metrics=false
openshift_metrics_image_prefix=registry.reg-aws.openshift.com:443/openshift3/
openshift_metrics_image_version=v3.9
openshift_metrics_cassandra_storage_type=dynamic
openshift_metrics_cassandra_pvc_size=25Gi
openshift_logging_install_logging=false
openshift_logging_image_prefix=registry.reg-aws.openshift.com:443/openshift3/
openshift_logging_image_version=v3.9
openshift_logging_storage_kind=dynamic
openshift_logging_es_pvc_size=50Gi
openshift_logging_es_pvc_dynamic=true
openshift_clusterid=mffiedler-39
openshift_image_tag=v3.9.0-0.31.0

[lb]


[etcd]
ec2-54-149-171-156.us-west-2.compute.amazonaws.com ansible_user=root ansible_ssh_user=root ansible_ssh_private_key_file="/home/slave3/workspace/Launch Environment Flexy/private/config/keys/id_rsa_perf" openshift_public_hostname=ec2-54-149-171-156.us-west-2.compute.amazonaws.com


[masters]
ec2-54-149-171-156.us-west-2.compute.amazonaws.com ansible_user=root ansible_ssh_user=root ansible_ssh_private_key_file="/home/slave3/workspace/Launch Environment Flexy/private/config/keys/id_rsa_perf" openshift_public_hostname=ec2-54-149-171-156.us-west-2.compute.amazonaws.com



[nodes]
ec2-54-149-171-156.us-west-2.compute.amazonaws.com ansible_user=root ansible_ssh_user=root ansible_ssh_private_key_file="/home/slave3/workspace/Launch Environment Flexy/private/config/keys/id_rsa_perf" openshift_public_hostname=ec2-54-149-171-156.us-west-2.compute.amazonaws.com openshift_node_labels="{'region': 'infra', 'zone': 'default'}" openshift_scheduleable=false

ec2-54-149-182-141.us-west-2.compute.amazonaws.com ansible_user=root ansible_ssh_user=root ansible_ssh_private_key_file="/home/slave3/workspace/Launch Environment Flexy/private/config/keys/id_rsa_perf" openshift_public_hostname=ec2-54-149-182-141.us-west-2.compute.amazonaws.com openshift_node_labels="{'region': 'infra', 'zone': 'default'}"

ec2-54-149-182-141.us-west-2.compute.amazonaws.com ansible_user=root ansible_ssh_user=root ansible_ssh_private_key_file="/home/slave3/workspace/Launch Environment Flexy/private/config/keys/id_rsa_perf" openshift_public_hostname=ec2-54-149-182-141.us-west-2.compute.amazonaws.com openshift_node_labels="{'region': 'infra', 'zone': 'default'}"

ec2-34-217-73-171.us-west-2.compute.amazonaws.com ansible_user=root ansible_ssh_user=root ansible_ssh_private_key_file="/home/slave3/workspace/Launch Environment Flexy/private/config/keys/id_rsa_perf" openshift_public_hostname=ec2-34-217-73-171.us-west-2.compute.amazonaws.com openshift_node_labels="{'region': 'primary', 'zone': 'default'}"
ec2-54-213-250-6.us-west-2.compute.amazonaws.com ansible_user=root ansible_ssh_user=root ansible_ssh_private_key_file="/home/slave3/workspace/Launch Environment Flexy/private/config/keys/id_rsa_perf" openshift_public_hostname=ec2-54-213-250-6.us-west-2.compute.amazonaws.com openshift_node_labels="{'region': 'primary', 'zone': 'default'}"
ec2-34-209-72-237.us-west-2.compute.amazonaws.com ansible_user=root ansible_ssh_user=root ansible_ssh_private_key_file="/home/slave3/workspace/Launch Environment Flexy/private/config/keys/id_rsa_perf" openshift_public_hostname=ec2-34-209-72-237.us-west-2.compute.amazonaws.com openshift_node_labels="{'region': 'primary', 'zone': 'default'}"



Expected results:

Successful install

Comment 1 Scott Dodson 2018-01-29 13:39:02 UTC
We need to make sure that masters are schedulable by default; I think that would address this.

Comment 2 Vadim Rutkovsky 2018-01-30 10:33:01 UTC
Created PR https://github.com/openshift/openshift-ansible/pull/6932

Comment 3 Scott Dodson 2018-02-01 15:16:38 UTC
Let's add a check to ensure that, if the console is deployed, the masters are not set to openshift_schedulable=false. openshift_sanitize_inventory is likely a good place for this.

Comment 4 Vadim Rutkovsky 2018-02-01 17:04:57 UTC
(In reply to Scott Dodson from comment #3)
> Let's add a check to ensure that, if the console is deployed, the masters are
> not set to openshift_schedulable=false. openshift_sanitize_inventory is
> likely a good place for this.

Created https://github.com/openshift/openshift-ansible/pull/6984 to address this
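
Until such a check exists, a quick manual sanity check (sketch): unschedulable nodes show SchedulingDisabled in the STATUS column, so the masters' state can be confirmed before the web console roles run with:

# oc get nodes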

Comment 5 Vadim Rutkovsky 2018-02-02 17:51:33 UTC
Fix is available in openshift-ansible-3.9.0-0.36.0.git.0.da68f13.el7

Comment 6 Johnny Liu 2018-02-05 08:48:06 UTC
"Taint master nodes" task is not merged into openshift-ansible-3.9.0-0.36.0.git.0.da68f13.el7.noarch yet.

Comment 7 Johnny Liu 2018-02-05 09:45:15 UTC
After going through the code, it seems the PR would introduce some other issues.

1. The service catalog would have no available node to deploy on.
By default, the installer labels the first master node with "openshift-infra=apiserver"; once the taint is added to all masters, the service catalog daemonset fails to deploy its pod.

2. By default, the installer deploys the logging fluentd daemonset on all nodes, including the masters; once the taint is added to all master nodes, no fluentd pod runs on the masters and logging cannot collect logs from them.

Comment 8 Vadim Rutkovsky 2018-02-05 09:57:05 UTC
(In reply to Johnny Liu from comment #7)
> After going through the code, it seems the PR would introduce some other
> issues.
> 
> 1. The service catalog would have no available node to deploy on.
> By default, the installer labels the first master node with
> "openshift-infra=apiserver"; once the taint is added to all masters, the
> service catalog daemonset fails to deploy its pod.
> 
> 2. By default, the installer deploys the logging fluentd daemonset on all
> nodes, including the masters; once the taint is added to all master nodes,
> no fluentd pod runs on the masters and logging cannot collect logs from
> them.

Good points, these will need to be discussed. It sounds like the service catalog and logging templates should add tolerations too.

(In reply to Johnny Liu from comment #6)
> "Taint master nodes" task is not merged into
> openshift-ansible-3.9.0-0.36.0.git.0.da68f13.el7.noarch yet.

Correct, tainting masters is still being discussed and is out of scope of this issue.
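
For illustration only, the kind of toleration mentioned above could be patched into a daemonset roughly like this; the daemonset name, namespace and master taint key below are assumptions for a default 3.9 install, not something the PRs here add:

# oc patch daemonset logging-fluentd -n logging --type=json -p \
  '[{"op": "add", "path": "/spec/template/spec/tolerations", "value": [{"key": "node-role.kubernetes.io/master", "operator": "Exists", "effect": "NoSchedule"}]}]'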

Comment 9 liujia 2018-02-23 05:52:52 UTC
@Vadim Rutkovsky 
The upgrade hit an issue related to the schedulable master: some application pods are scheduled on the master node after the upgrade. I think this is not the expected result.

# oc get node
NAME                               STATUS    ROLES     AGE       VERSION
qe-jliu-r-master-etcd-1            Ready     master    2h        v1.9.1+a0ce1bc657
qe-jliu-r-node-registry-router-1   Ready     <none>    2h        v1.9.1+a0ce1bc657


# oc get pod -o wide --all-namespaces |grep master
default                 registry-console-2-hlgln         1/1       Running     0          1h        10.129.0.4    qe-jliu-r-master-etcd-1
install-test            mongodb-1-psr9l                  1/1       Running     0          1h        10.129.0.5    qe-jliu-r-master-etcd-1
install-test            nodejs-mongodb-example-1-k56zh   1/1       Running     0          1h        10.129.0.18   qe-jliu-r-master-etcd-1
openshift-web-console   webconsole-54877f6577-g7tb8      1/1       Running     0          1h        10.129.0.2    qe-jliu-r-master-etcd-1
test                    mysql-1-ptblc                    1/1       Running     0          1h        10.129.0.19   qe-jliu-r-master-etcd-1

Not sure if this issue is in the scope of this bug, or whether I should track it in a new bug?
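
A quick way to audit everything running on that master (sketch; node name taken from the output above):

# oc adm manage-node qe-jliu-r-master-etcd-1 --list-pods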

Comment 10 Vadim Rutkovsky 2018-02-23 09:58:58 UTC
(In reply to liujia from comment #9)
> @Vadim Rutkovsky 
> The upgrade hit an issue related to the schedulable master: some application
> pods are scheduled on the master node after the upgrade. I think this is not
> the expected result.

Right, that's certainly not expected

> Not sure if this issue is in the scope of this bug, or whether I should
> track it in a new bug?

Let's file a new bug for this (and move this one to VERIFIED), as it's getting pretty complex to track. The new bug should be a blocker for 3.9.

Comment 11 Johnny Liu 2018-03-01 02:34:25 UTC
Verified this bug with openshift-ansible-3.9.1-1.git.0.9862628.el7.noarch, and PASS.

Now the master is schedulable and the web console is deployed successfully.
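
For reference, the verified state can be spot-checked with:

# oc get nodes
# oc get pods -n openshift-web-console

The master no longer shows SchedulingDisabled and the webconsole pod is Running.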

