Bug 1417668

Summary: podfying cfme: evmserver.service restart would cause cfme container to be killed and re-start
Product: Red Hat CloudForms Management Engine Reporter: Dafna Ron <dron>
Component: cfme-openshift-app Assignee: Franco Bladilo <fbladilo>
Status: CLOSED CURRENTRELEASE QA Contact: Einat Pacifici <epacific>
Severity: medium Docs Contact: Red Hat CloudForms Documentation <cloudforms-docs>
Priority: medium    
Version: 5.7.0 CC: dron, fbladilo, jhardy, lavenel
Target Milestone: GA   
Target Release: cfme-future   
Hardware: x86_64   
OS: Linux   
Whiteboard: container
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-07-02 09:15:06 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: Container Management Target Upstream Version:
Embargoed:

Description Dafna Ron 2017-01-30 15:44:12 UTC
Description of problem:

I restarted evmserverd.service on my CloudForms pod, and that caused the cfme container to be killed.
This is not urgent, since the container is re-created, but it is an annoyance: all of the logs, history, and other configuration I had set up on the pod were deleted.

Version-Release number of selected component (if applicable):

cfme-5.7.1.0-2.el7cf.x86_64

How reproducible:

100% 

Steps to Reproduce:
1. oc rsh <pod>
2. systemctl restart evmserverd.service
3. exit, then run: oc get pods
4. oc describe pod <pod>

Actual results:

The container is killed and re-created after the service restart.

Expected results:

A service restart on cfme should not kill the CloudForms container.

Additional info:

NAME                 READY     STATUS    RESTARTS   AGE
cloudforms-1-2xdsx   1/1       Running   1          3d
memcached-1-yfu12    1/1       Running   0          3d
postgresql-1-0ckis   1/1       Running   0          3d
[root@dafna-openshift-master01 ~]# oc rsh cloudforms-1-2xdsx 
sh-4.2# systemctl restart  evmserverd.service




sh-4.2# exit

[root@dafna-openshift-master01 ~]# oc get pods
NAME                 READY     STATUS    RESTARTS   AGE
cloudforms-1-2xdsx   0/1       Running   1          3d
memcached-1-yfu12    1/1       Running   0          3d
postgresql-1-0ckis   1/1       Running   0          3d
[root@dafna-openshift-master01 ~]# oc get pods
NAME                 READY     STATUS    RESTARTS   AGE
cloudforms-1-2xdsx   0/1       Running   2          3d
memcached-1-yfu12    1/1       Running   0          3d
postgresql-1-0ckis   1/1       Running   0          3d
[root@dafna-openshift-master01 ~]# oc log cloudforms-1-2xdsx
W0130 17:25:51.629513    5627 cmd.go:357] log is DEPRECATED and will be removed in a future version. Use logs instead.
[root@dafna-openshift-master01 ~]# oc logs cloudforms-1-2xdsx
[root@dafna-openshift-master01 ~]# oc describe pod cloudforms-1-2xdsx
Name:			cloudforms-1-2xdsx
Namespace:		dafna-test
Security Policy:	privileged
Node:			dafna-openshift-node01.qa.lab.tlv.redhat.com/10.35.97.112
Start Time:		Fri, 27 Jan 2017 14:06:50 +0200
Labels:			app=cloudforms
			deployment=cloudforms-1
			deploymentconfig=cloudforms
			name=cloudforms
Status:			Running
      Started:		Mon, 30 Jan 2017 17:24:43 +0200
    Last State:		Terminated
      Reason:		Error
      Exit Code:	137
      Started:		Mon, 30 Jan 2017 17:16:04 +0200
      Finished:		Mon, 30 Jan 2017 17:24:41 +0200
    Ready:		False
    Restart Count:	2
    Liveness:		http-get http://:80/ delay=480s timeout=3s period=10s #success=1 #failure=3
    Readiness:		http-get http://:80/ delay=200s timeout=3s period=10s #success=1 #failure=3
    Volume Mounts:
      /persistent from cfme-app-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-sc3e4 (ro)
    Environment Variables:
      APPLICATION_INIT_DELAY:		30
      DATABASE_SERVICE_NAME:		postgresql
      DATABASE_REGION:			0
      MEMCACHED_SERVICE_NAME:		memcached
      POSTGRESQL_USER:			root
      POSTGRESQL_PASSWORD:		smartvm
      POSTGRESQL_DATABASE:		vmdb_production
      POSTGRESQL_MAX_CONNECTIONS:	100
      POSTGRESQL_SHARED_BUFFERS:	64MB
Conditions:
  Type		Status
  Initialized 	True 
  Ready 	False 
  PodScheduled 	True 
Volumes:
  cfme-app-volume:
    Type:	PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:	cloudforms
    ReadOnly:	false
  default-token-sc3e4:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	default-token-sc3e4
QoS Class:	Burstable
Tolerations:	<none>
Events:
  FirstSeen	LastSeen	Count	From							SubobjectPath			Type		Reason		Message
  ---------	--------	-----	----							-------------			--------	------		-------
  10m		10m		1	{kubelet dafna-openshift-node01.qa.lab.tlv.redhat.com}	spec.containers{cloudforms}	Normal		Killing		Killing container with docker id 465a20de6cd3: pod "cloudforms-1-2xdsx_dafna-test(154e6279-e489-11e6-a555-001a4a169777)" container "cloudforms" is unhealthy, it will be killed and re-created.
  10m		10m		1	{kubelet dafna-openshift-node01.qa.lab.tlv.redhat.com}	spec.containers{cloudforms}	Normal		Created		Created container with docker id 37e33a567190; Security:[seccomp=unconfined]
  10m		10m		1	{kubelet dafna-openshift-node01.qa.lab.tlv.redhat.com}	spec.containers{cloudforms}	Normal		Started		Started container with docker id 37e33a567190
  11m		2m		4	{kubelet dafna-openshift-node01.qa.lab.tlv.redhat.com}	spec.containers{cloudforms}	Warning		Unhealthy	Liveness probe failed: HTTP probe failed with statuscode: 503
  11m		1m		14	{kubelet dafna-openshift-node01.qa.lab.tlv.redhat.com}	spec.containers{cloudforms}	Warning		Unhealthy	Readiness probe failed: HTTP probe failed with statuscode: 503
  3d		1m		14	{kubelet dafna-openshift-node01.qa.lab.tlv.redhat.com}	spec.containers{cloudforms}	Warning		Unhealthy	Readiness probe failed: Get http://10.128.0.67:80/: dial tcp 10.128.0.67:80: getsockopt: connection refused
  3d		1m		3	{kubelet dafna-openshift-node01.qa.lab.tlv.redhat.com}	spec.containers{cloudforms}	Normal		Pulled		Container image "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/cloudforms/cfme-openshift-app@sha256:693b7a3cb4dba5b54b8b3369e9f3c7b6ae398067bb2e0c4c2a0e4e8fdb9e91e9" already present on machine
  1m		1m		1	{kubelet dafna-openshift-node01.qa.lab.tlv.redhat.com}	spec.containers{cloudforms}	Normal		Killing		Killing container with docker id 37e33a567190: pod "cloudforms-1-2xdsx_dafna-test(154e6279-e489-11e6-a555-001a4a169777)" container "cloudforms" is unhealthy, it will be killed and re-created.
  1m		1m		1	{kubelet dafna-openshift-node01.qa.lab.tlv.redhat.com}	spec.containers{cloudforms}	Normal		Created		Created container with docker id b3313241aa9b; Security:[seccomp=unconfined]
  1m		1m		1	{kubelet dafna-openshift-node01.qa.lab.tlv.redhat.com}	spec.containers{cloudforms}	Normal		Started		Started container with docker id b3313241aa9b

Comment 2 Barak 2017-02-16 19:31:40 UTC
Franco,

Why is this happening?

Is it the liveness check?

Comment 3 Franco Bladilo 2017-03-27 20:40:07 UTC
Yes, it is the liveness probe. We might already have solved this issue (as a bonus) via:
 
https://github.com/ManageIQ/manageiq-pods/pull/112

I haven't tested it, and I must also admit I'm still unsure about the use case for restarting the EVM service within the container.
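
A rough sketch of the arithmetic behind the liveness kill, using the settings from the `oc describe pod` output in the description (delay=480s, period=10s, #failure=3). The function names are hypothetical, and the model ignores the probe timeout, kubelet scheduling jitter, and the termination grace period, so the numbers are estimates rather than the kubelet's exact behavior.

```python
from datetime import datetime
from typing import Tuple

# Liveness settings copied from the `oc describe pod` output in the description.
LIVENESS_DELAY_S = 480
PROBE_PERIOD_S = 10
FAILURE_THRESHOLD = 3


def downtime_window(period_s: int, failures: int) -> Tuple[int, int]:
    """Rough (min, max) seconds a running service can be down before
    `failures` consecutive probes have failed. The minimum applies when the
    outage starts just before a probe, the maximum when it starts just after."""
    return (failures - 1) * period_s, failures * period_s


def earliest_kill_after_start(delay_s: int, period_s: int, failures: int) -> int:
    """Earliest container age (seconds) at which the failure threshold can be
    reached, assuming the first probe fires right after the initial delay
    and every probe fails."""
    return delay_s + (failures - 1) * period_s


def lifetime_seconds(started: str, finished: str) -> int:
    """Seconds between two same-day HH:MM:SS timestamps."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(finished, fmt) - datetime.strptime(started, fmt)
    return int(delta.total_seconds())


if __name__ == "__main__":
    print(downtime_window(PROBE_PERIOD_S, FAILURE_THRESHOLD))
    print(earliest_kill_after_start(LIVENESS_DELAY_S, PROBE_PERIOD_S, FAILURE_THRESHOLD))
    # Started / Finished of the terminated container in the describe output.
    print(lifetime_seconds("17:16:04", "17:24:41"))
```

Under these assumptions a running EVM service can be unreachable for roughly 20 to 30 seconds before the container is killed, and a freshly started container has about 500 seconds to begin answering on port 80. The 517-second lifetime of the terminated container (Started 17:16:04, Finished 17:24:41) is consistent with that lower bound.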

Comment 4 Barak 2017-04-09 14:54:34 UTC
Restarting the EVM service is not a common use case, but I am sure it will be required for debugging in complex customer environments (it sometimes happens with high-profile customers), so this needs to be solved.

The question is: do we need to make sure that the pod is not rescheduled when we bring the EVM service down for a long period of time?

Given comment #3:
Dafna, can you please check whether the behavior you initially reported in this bug (on EVM service restart) still happens?

Comment 5 Dafna Ron 2017-04-10 12:23:10 UTC
There are some changes on the cfme side and on the podified side, but basically it can be reproduced by disabling evm-watchdog.service and restarting evmserverd.service.

You can see that the pod becomes 0/1 ready once we restart EVM:

[root@dafna-pods-master ~]# oc get pods
NAME                 READY     STATUS    RESTARTS   AGE
cloudforms-0         1/1       Running   0          4d
memcached-1-bby93    1/1       Running   0          4d
postgresql-1-9y9hd   1/1       Running   0          4d
[root@dafna-pods-master ~]# oc get pods
NAME                 READY     STATUS    RESTARTS   AGE
cloudforms-0         1/1       Running   0          4d
memcached-1-bby93    1/1       Running   0          4d
postgresql-1-9y9hd   1/1       Running   0          4d
[root@dafna-pods-master ~]# watch 'oc get pods'
[root@dafna-pods-master ~]# oc get pods
NAME                 READY     STATUS    RESTARTS   AGE
cloudforms-0         0/1       Running   0          4d
memcached-1-bby93    1/1       Running   0          4d
postgresql-1-9y9hd   1/1       Running   0          4d
[root@dafna-pods-master ~]# oc get pods
NAME                 READY     STATUS    RESTARTS   AGE
cloudforms-0         0/1       Running   0          4d
memcached-1-bby93    1/1       Running   0          4d
postgresql-1-9y9hd   1/1       Running   0          4d
[root@dafna-pods-master ~]# oc get pods
NAME                 READY     STATUS    RESTARTS   AGE
cloudforms-0         0/1       Running   0          4d
memcached-1-bby93    1/1       Running   0          4d
postgresql-1-9y9hd   1/1       Running   0          4d
[root@dafna-pods-master ~]# oc get pods
NAME                 READY     STATUS    RESTARTS   AGE
cloudforms-0         0/1       Running   0          4d
memcached-1-bby93    1/1       Running   0          4d
postgresql-1-9y9hd   1/1       Running   0          4d
[root@dafna-pods-master ~]# oc get pods
NAME                 READY     STATUS    RESTARTS   AGE
cloudforms-0         0/1       Running   0          4d
memcached-1-bby93    1/1       Running   0          4d
postgresql-1-9y9hd   1/1       Running   0          4d
[root@dafna-pods-master ~]# oc get pods
NAME                 READY     STATUS    RESTARTS   AGE
cloudforms-0         0/1       Running   0          4d
memcached-1-bby93    1/1       Running   0          4d
postgresql-1-9y9hd   1/1       Running   0          4d
[root@dafna-pods-master ~]# oc get pods
NAME                 READY     STATUS    RESTARTS   AGE
cloudforms-0         1/1       Running   0          4d
memcached-1-bby93    1/1       Running   0          4d
postgresql-1-9y9hd   1/1       Running   0          4d

Comment 6 Franco Bladilo 2017-04-10 13:12:13 UTC
Dafna,

That is the intended behavior: when you restart the EVM server, the pod becomes not ready because it is not responding successfully. The most important point, if you notice, is that the pod is no longer being killed and restarted by the liveness probe, which was the case before.
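
One way to check this distinction from a series of `oc get pods` samples is sketched below. It assumes the five-column layout shown in this bug (NAME, READY, STATUS, RESTARTS, AGE); the function name `summarize_pod` is hypothetical and not part of any OpenShift tooling.

```python
from typing import Dict, List


def summarize_pod(samples: List[str], pod: str) -> Dict[str, bool]:
    """Scan raw `oc get pods` outputs (oldest first) for one pod and report
    whether it ever went not-ready and whether its restart count grew."""
    ready_flags: List[bool] = []
    restarts: List[int] = []
    for sample in samples:
        for line in sample.splitlines():
            fields = line.split()
            # Expected columns: NAME READY STATUS RESTARTS AGE
            if len(fields) == 5 and fields[0] == pod:
                ready, total = fields[1].split("/")
                ready_flags.append(ready == total)
                restarts.append(int(fields[3]))
    return {
        "became_not_ready": not all(ready_flags),
        "restarted": bool(restarts) and restarts[-1] > restarts[0],
    }


if __name__ == "__main__":
    # Rows taken from the samples in comment 5.
    comment_5 = [
        "cloudforms-0         1/1       Running   0          4d",
        "cloudforms-0         0/1       Running   0          4d",
        "cloudforms-0         1/1       Running   0          4d",
    ]
    print(summarize_pod(comment_5, "cloudforms-0"))
```

For the comment 5 samples this reports that the pod became not ready without being restarted, whereas the samples in the original description (RESTARTS going from 1 to 2) would be flagged as a restart.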

Comment 7 Franco Bladilo 2017-04-10 13:42:30 UTC
Barak,

I must advise that, for debugging purposes, I would turn off the liveness check, as it could kill the container in the middle of a debugging session and be counterproductive.