Bug 1416032

Summary: Installer completes ok even though components fail
Product: OpenShift Container Platform Reporter: Marko Myllynen <myllynen>
Component: Installer Assignee: Scott Dodson <sdodson>
Status: CLOSED WONTFIX QA Contact: Johnny Liu <jialiu>
Severity: medium Docs Contact:
Priority: low    
Version: 3.4.0 CC: aos-bugs, jokerman, mmccomas
Target Milestone: ---   
Target Release: 3.10.0   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2018-04-16 14:44:55 UTC Type: Bug

Description Marko Myllynen 2017-01-24 12:30:43 UTC
Description of problem:
I ran ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/byo/config.yml and it completed without reporting any errors, but the metrics deployer pod has failed:

[root@master01 ~]# for prj in default logging openshift-infra ; do oc get pods -n $prj ; done
NAME                       READY     STATUS    RESTARTS   AGE
docker-registry-2-vzw4h    1/1       Running   0          22m
registry-console-1-cr9cc   1/1       Running   0          10m
router-1-inwnw             1/1       Running   0          25m
NAME                          READY     STATUS      RESTARTS   AGE
logging-curator-1-u3f1i       1/1       Running     0          14m
logging-deployer-8mh4h        0/1       Completed   0          17m
logging-es-7i4bhlpr-1-e6van   1/1       Running     0          14m
logging-fluentd-132mv         1/1       Running     0          12m
logging-fluentd-m1uxw         1/1       Running     0          12m
logging-fluentd-tas3o         1/1       Running     0          12m
logging-fluentd-wujxv         1/1       Running     0          12m
logging-fluentd-x6dkn         1/1       Running     0          12m
logging-fluentd-zlqz4         1/1       Running     0          12m
logging-kibana-1-twf1a        2/2       Running     0          14m
NAME                     READY     STATUS    RESTARTS   AGE
metrics-deployer-dh3qu   0/1       Error     0          27m
[root@master01 ~]# oc describe pod metrics-deployer-dh3qu -n openshift-infra
Name:			metrics-deployer-dh3qu
Namespace:		openshift-infra
Security Policy:	anyuid
Node:			node01.example.com/192.168.122.201
Start Time:		Tue, 24 Jan 2017 07:00:08 -0500
Labels:			component=deployer
			metrics-infra=deployer
			provider=openshift
Status:			Failed
IP:			10.1.1.2
Controllers:		<none>
Containers:
  deployer:
    Container ID:	docker://8604f99338e20660c4193ebf5a2a2e309ad5f5e2853d8833411886f6bb0d9d14
    Image:		registry.access.redhat.com/openshift3/metrics-deployer:3.4.0
    Image ID:		docker-pullable://registry.access.redhat.com/openshift3/metrics-deployer@sha256:123825bc4576cbc4b2a699ccbc6e61666d9b5cb76a544104010298e9efbb1f7e
    Port:		
    State:		Terminated
      Reason:		Error
      Exit Code:	255
      Started:		Tue, 24 Jan 2017 07:07:10 -0500
      Finished:		Tue, 24 Jan 2017 07:07:25 -0500
    Ready:		False
    Restart Count:	0
    Volume Mounts:
      /etc/deploy from empty (rw)
      /secret from secret (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from metrics-deployer-token-3e8am (ro)
    Environment Variables:
      PROJECT:				openshift-infra (v1:metadata.namespace)
      POD_NAME:				metrics-deployer-dh3qu (v1:metadata.name)
      IMAGE_PREFIX:			registry.access.redhat.com/openshift3/
      IMAGE_VERSION:			3.4.0
      MASTER_URL:			https://kubernetes.default.svc:443
      MODE:				deploy
      CONTINUE_ON_ERROR:		false
      REDEPLOY:				false
      IGNORE_PREFLIGHT:			false
      USE_PERSISTENT_STORAGE:		true
      DYNAMICALLY_PROVISION_STORAGE:	false
      HAWKULAR_METRICS_HOSTNAME:	metrics.example.com
      CASSANDRA_NODES:			1
      CASSANDRA_PV_SIZE:		10Gi
      METRIC_DURATION:			7
      USER_WRITE_ACCESS:		false
      HEAPSTER_NODE_ID:			nodename
      METRIC_RESOLUTION:		10s
      STARTUP_TIMEOUT:			500
Conditions:
  Type		Status
  Initialized 	True 
  Ready 	False 
  PodScheduled 	True 
Volumes:
  empty:
    Type:	EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:	
  secret:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	metrics-deployer
  metrics-deployer-token-3e8am:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	metrics-deployer-token-3e8am
QoS Class:	BestEffort
Tolerations:	<none>
Events:
  FirstSeen	LastSeen	Count	From				SubobjectPath			Type		Reason		Message
  ---------	--------	-----	----				-------------			--------	------		-------
  27m		27m		1	{default-scheduler }			Normal		Scheduled	Successfully assigned metrics-deployer-dh3qu to node01.example.com
  22m		22m		1	{kubelet node01.example.com}	spec.containers{deployer}	Normal		Pulling		pulling image "registry.access.redhat.com/openshift3/metrics-deployer:3.4.0"
  20m		20m		1	{kubelet node01.example.com}	spec.containers{deployer}	Normal		Pulled		Successfully pulled image "registry.access.redhat.com/openshift3/metrics-deployer:3.4.0"
  20m		20m		1	{kubelet node01.example.com}	spec.containers{deployer}	Normal		Created		Created container with docker id 8604f99338e2; Security:[seccomp=unconfined]
  20m		20m		1	{kubelet node01.example.com}	spec.containers{deployer}	Normal		Started		Started container with docker id 8604f99338e2
[root@master01 ~]# 
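
A post-install check along these lines could surface the failure that the playbook run did not report. This is only a minimal sketch and not part of openshift-ansible: it parses text in the column layout that oc get pods prints above, and the function name find_unhealthy_pods and the set of statuses treated as healthy are assumptions made for illustration.

```python
from typing import List, Tuple

# Assumption: these statuses count as healthy for a post-install check.
HEALTHY_STATUSES = {"Running", "Completed", "Succeeded"}


def find_unhealthy_pods(oc_output: str) -> List[Tuple[str, str]]:
    """Return (pod name, status) for every pod whose STATUS is not healthy.

    Expects the plain-text table printed by `oc get pods`:
    NAME  READY  STATUS  RESTARTS  AGE
    """
    unhealthy = []
    for line in oc_output.splitlines():
        fields = line.split()
        # Skip blank lines, header rows, and anything that is not a pod row.
        if len(fields) < 5 or fields[0] == "NAME":
            continue
        name, status = fields[0], fields[2]
        if status not in HEALTHY_STATUSES:
            unhealthy.append((name, status))
    return unhealthy


SAMPLE_OUTPUT = """\
NAME                     READY     STATUS    RESTARTS   AGE
router-1-inwnw           1/1       Running   0          25m
logging-deployer-8mh4h   0/1       Completed 0          17m
metrics-deployer-dh3qu   0/1       Error     0          27m
"""

if __name__ == "__main__":
    for pod, pod_status in find_unhealthy_pods(SAMPLE_OUTPUT):
        print("{}: {}".format(pod, pod_status))
```

The sketch looks only at the STATUS column, so a pod that is Running but not yet ready (for example 0/1) would not be flagged; a stricter check would also compare the two numbers in the READY column.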

/etc/ansible/hosts contains:

openshift_hosted_metrics_deploy=true
openshift_hosted_metrics_storage_kind=nfs
openshift_hosted_metrics_storage_access_modes=['ReadWriteOnce']
openshift_hosted_metrics_storage_host=nfs01.example.com
openshift_hosted_metrics_storage_nfs_directory=/srv/nfs
openshift_hosted_metrics_storage_nfs_options='*(rw,root_squash)'
openshift_hosted_metrics_storage_volume_name=metrics
openshift_hosted_metrics_storage_volume_size=10Gi
openshift_hosted_metrics_public_url=https://metrics.example.com/hawkular/metrics

Version-Release number of selected component (if applicable):
openshift-ansible-playbooks-3.4.44-1.git.0.efa61c6.el7.noarch

Comment 1 Marko Myllynen 2017-01-24 12:34:32 UTC
[root@master01 ~]# oc logs metrics-deployer-dh3qu -n openshift-infra | tail -n 20
+ '[' -n 1 ']'
+ oc config use-context deployer-context
switched to context "deployer-context".
+ case $deployer_mode in
+ '[' false '!=' true ']'
+ validate_preflight
+ set +x

PREFLIGHT CHECK FAILED
========================
validate_master_accessible: 
unable to access master url https://kubernetes.default.svc:443
See the error from 'curl https://kubernetes.default.svc:443' below for details:
curl: (28) timed out before SSL handshake

Deployment has been aborted prior to starting, as these failures often indicate fatal problems.
Please evaluate any error messages above and determine how they can be addressed.
To ignore this validation failure and continue, specify IGNORE_PREFLIGHT=true.

PREFLIGHT CHECK FAILED
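
For reference, the kind of reachability probe that validate_master_accessible performs can be approximated as below. This is a hedged sketch, not the deployer's actual code (the log shows the deployer using curl): the function name master_reachable, the default timeout, and the decision to skip certificate verification are assumptions chosen only to test whether a TLS handshake with the master completes in time.

```python
import socket
import ssl
from typing import Tuple


def master_reachable(host: str, port: int = 443, timeout: float = 5.0) -> Tuple[bool, str]:
    """Try a TCP connect plus TLS handshake; return (ok, reason)."""
    context = ssl.create_default_context()
    # Assumption: this is a reachability probe only, so certificate
    # verification is disabled. Do not reuse this context for real traffic.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    try:
        # The timeout applies to both the connect and the handshake.
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                return True, "TLS handshake completed ({})".format(tls.version())
    except OSError as exc:
        # Covers DNS failures, refused connections, timeouts, and SSL errors.
        return False, "{}: {}".format(type(exc).__name__, exc)


if __name__ == "__main__":
    ok, reason = master_reachable("kubernetes.default.svc", 443)
    print("reachable" if ok else "unreachable", "-", reason)
```

Run from inside a pod on the affected node, a probe like this would help distinguish a DNS or routing problem from a handshake that stalls, which is the case the curl error (28) above points to.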

Comment 3 Scott Dodson 2017-08-24 19:00:59 UTC
3.7 may add the ability to determine template success. If so, we'll update the installer to leverage that feature.

Comment 5 Scott Dodson 2018-04-16 14:44:55 UTC
Components should now report whether they succeeded. Unless a component is critical, its failure does not immediately halt the installation; instead, the output at the end of the run lists which components failed so that you can investigate.