Description of problem:
I restarted the evmserverd.service on my CloudForms pod and it triggered a kill of the cfme pod. This is not urgent, since the cfme container is re-created automatically, but it is an annoyance: all of the logs, history, and other configuration I had done on the pod were deleted.

Version-Release number of selected component (if applicable):
cfme-5.7.1.0-2.el7cf.x86_64

How reproducible:
100%

Steps to Reproduce:
1. oc rsh <pod>
2. systemctl restart evmserverd.service
3. log out -> oc get pods
4. oc describe pod <pod>

Actual results:
The container is killed and re-created after the service restart.

Expected results:
A service restart inside cfme should not kill the CloudForms container.

Additional info:

[root@dafna-openshift-master01 ~]# oc get pods
NAME                 READY     STATUS    RESTARTS   AGE
cloudforms-1-2xdsx   1/1       Running   1          3d
memcached-1-yfu12    1/1       Running   0          3d
postgresql-1-0ckis   1/1       Running   0          3d
[root@dafna-openshift-master01 ~]# oc rsh cloudforms-1-2xdsx
sh-4.2# systemctl restart evmserverd.service
sh-4.2# exit
[root@dafna-openshift-master01 ~]# oc get pods
NAME                 READY     STATUS    RESTARTS   AGE
cloudforms-1-2xdsx   0/1       Running   1          3d
memcached-1-yfu12    1/1       Running   0          3d
postgresql-1-0ckis   1/1       Running   0          3d
[root@dafna-openshift-master01 ~]# oc get pods
NAME                 READY     STATUS    RESTARTS   AGE
cloudforms-1-2xdsx   0/1       Running   2          3d
memcached-1-yfu12    1/1       Running   0          3d
postgresql-1-0ckis   1/1       Running   0          3d
[root@dafna-openshift-master01 ~]# oc log cloudforms-1-2xdsx
W0130 17:25:51.629513    5627 cmd.go:357] log is DEPRECATED and will be removed in a future version. Use logs instead.
[root@dafna-openshift-master01 ~]# oc logs cloudforms-1-2xdsx
[root@dafna-openshift-master01 ~]# oc describe pod cloudforms-1-2xdsx
Name:            cloudforms-1-2xdsx
Namespace:       dafna-test
Security Policy: privileged
Node:            dafna-openshift-node01.qa.lab.tlv.redhat.com/10.35.97.112
Start Time:      Fri, 27 Jan 2017 14:06:50 +0200
Labels:          app=cloudforms
                 deployment=cloudforms-1
                 deploymentconfig=cloudforms
                 name=cloudforms
Status:          Running
Started:         Mon, 30 Jan 2017 17:24:43 +0200
Last State:      Terminated
  Reason:        Error
  Exit Code:     137
  Started:       Mon, 30 Jan 2017 17:16:04 +0200
  Finished:      Mon, 30 Jan 2017 17:24:41 +0200
Ready:           False
Restart Count:   2
Liveness:        http-get http://:80/ delay=480s timeout=3s period=10s #success=1 #failure=3
Readiness:       http-get http://:80/ delay=200s timeout=3s period=10s #success=1 #failure=3
Volume Mounts:
  /persistent from cfme-app-volume (rw)
  /var/run/secrets/kubernetes.io/serviceaccount from default-token-sc3e4 (ro)
Environment Variables:
  APPLICATION_INIT_DELAY:      30
  DATABASE_SERVICE_NAME:       postgresql
  DATABASE_REGION:             0
  MEMCACHED_SERVICE_NAME:      memcached
  POSTGRESQL_USER:             root
  POSTGRESQL_PASSWORD:         smartvm
  POSTGRESQL_DATABASE:         vmdb_production
  POSTGRESQL_MAX_CONNECTIONS:  100
  POSTGRESQL_SHARED_BUFFERS:   64MB
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  cfme-app-volume:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  cloudforms
    ReadOnly:   false
  default-token-sc3e4:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-sc3e4
QoS Class:    Burstable
Tolerations:  <none>
Events:
  FirstSeen  LastSeen  Count  From  SubobjectPath  Type  Reason  Message
  ---------  --------  -----  ----  -------------  ----  ------  -------
  10m  10m  1   {kubelet dafna-openshift-node01.qa.lab.tlv.redhat.com}  spec.containers{cloudforms}  Normal   Killing    Killing container with docker id 465a20de6cd3: pod "cloudforms-1-2xdsx_dafna-test(154e6279-e489-11e6-a555-001a4a169777)" container "cloudforms" is unhealthy, it will be killed and re-created.
  10m  10m  1   {kubelet dafna-openshift-node01.qa.lab.tlv.redhat.com}  spec.containers{cloudforms}  Normal   Created    Created container with docker id 37e33a567190; Security:[seccomp=unconfined]
  10m  10m  1   {kubelet dafna-openshift-node01.qa.lab.tlv.redhat.com}  spec.containers{cloudforms}  Normal   Started    Started container with docker id 37e33a567190
  11m  2m   4   {kubelet dafna-openshift-node01.qa.lab.tlv.redhat.com}  spec.containers{cloudforms}  Warning  Unhealthy  Liveness probe failed: HTTP probe failed with statuscode: 503
  11m  1m   14  {kubelet dafna-openshift-node01.qa.lab.tlv.redhat.com}  spec.containers{cloudforms}  Warning  Unhealthy  Readiness probe failed: HTTP probe failed with statuscode: 503
  3d   1m   14  {kubelet dafna-openshift-node01.qa.lab.tlv.redhat.com}  spec.containers{cloudforms}  Warning  Unhealthy  Readiness probe failed: Get http://10.128.0.67:80/: dial tcp 10.128.0.67:80: getsockopt: connection refused
  3d   1m   3   {kubelet dafna-openshift-node01.qa.lab.tlv.redhat.com}  spec.containers{cloudforms}  Normal   Pulled     Container image "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/cloudforms/cfme-openshift-app@sha256:693b7a3cb4dba5b54b8b3369e9f3c7b6ae398067bb2e0c4c2a0e4e8fdb9e91e9" already present on machine
  1m   1m   1   {kubelet dafna-openshift-node01.qa.lab.tlv.redhat.com}  spec.containers{cloudforms}  Normal   Killing    Killing container with docker id 37e33a567190: pod "cloudforms-1-2xdsx_dafna-test(154e6279-e489-11e6-a555-001a4a169777)" container "cloudforms" is unhealthy, it will be killed and re-created.
  1m   1m   1   {kubelet dafna-openshift-node01.qa.lab.tlv.redhat.com}  spec.containers{cloudforms}  Normal   Created    Created container with docker id b3313241aa9b; Security:[seccomp=unconfined]
  1m   1m   1   {kubelet dafna-openshift-node01.qa.lab.tlv.redhat.com}  spec.containers{cloudforms}  Normal   Started    Started container with docker id b3313241aa9b
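For anyone triaging a similar report: Exit Code 137 above means the container received SIGKILL (128 + 9), which here was issued by the kubelet after three consecutive liveness probe failures (period=10s, #failure=3, per the Liveness line above). A minimal sketch for confirming this on a live pod, assuming the pod name from this report (the jsonpath fields are standard Kubernetes status fields, nothing CFME-specific):

  # Show the liveness probe the kubelet is enforcing on the cloudforms container
  oc get pod cloudforms-1-2xdsx -o jsonpath='{.spec.containers[0].livenessProbe}{"\n"}'

  # Confirm the previous instance was killed (exit code 137 = 128 + SIGKILL)
  oc get pod cloudforms-1-2xdsx -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'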
Franco, why is this happening? Is it the liveness check?
Yes, it is the liveness probe. Now, we may have solved this issue (as a bonus) via: https://github.com/ManageIQ/manageiq-pods/pull/112. I haven't tested it, but I must also admit I'm still unsure about the use case for restarting the EVM service within the container.
Well, the use case of restarting the EVM service is not that common, but I am sure it will be required for debugging in a complex customer environment (this happens sometimes with high-profile customers), so this needs to be solved. The question is: do we need to make sure that the pod is not rescheduled when we bring the EVM service down for a long period of time? Given comment #3, Dafna, can you please check whether the initial behavior you reported in this bug (the evm service restart) is still happening?
There have been some changes on the cfme side and on the podified side, but it can basically still be reproduced by disabling the evm-watchdog.service and restarting the evmserverd.service. You can see that the pod becomes 0/1 ready once we restart EVM:

[root@dafna-pods-master ~]# oc get pods
NAME                 READY     STATUS    RESTARTS   AGE
cloudforms-0         1/1       Running   0          4d
memcached-1-bby93    1/1       Running   0          4d
postgresql-1-9y9hd   1/1       Running   0          4d
[root@dafna-pods-master ~]# watch 'oc get pods'
[root@dafna-pods-master ~]# oc get pods
NAME                 READY     STATUS    RESTARTS   AGE
cloudforms-0         0/1       Running   0          4d
memcached-1-bby93    1/1       Running   0          4d
postgresql-1-9y9hd   1/1       Running   0          4d

(the same 0/1 output repeated across several more checks, with RESTARTS staying at 0, until the pod eventually returned to ready)

[root@dafna-pods-master ~]# oc get pods
NAME                 READY     STATUS    RESTARTS   AGE
cloudforms-0         1/1       Running   0          4d
memcached-1-bby93    1/1       Running   0          4d
postgresql-1-9y9hd   1/1       Running   0          4d
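For reference, the reproduction from the transcript above, collected into one interactive sequence (pod name taken from this environment; adjust as needed):

  oc rsh cloudforms-0                   # shell into the CFME container
  systemctl stop evm-watchdog.service   # keep the watchdog from restarting EVM mid-test
  systemctl restart evmserverd.service  # restart the EVM server
  exit                                  # back on the master
  oc get pods -w                        # watch cloudforms-0 drop to 0/1 READY, then return to 1/1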
Dafna, that is the intended behavior: as you restart the EVM server, the pod becomes not ready because it is not responding successfully. The most important piece, if you notice, is that the pod is no longer getting killed/restarted by the liveness probe, which was the case before.
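A quick way to tell the two cases apart when verifying this (only standard oc flags involved): readiness failures flip READY to 0/1 while RESTARTS stays put; liveness failures increment RESTARTS and leave a Killing event behind.

  oc get pods -w                                            # READY 1/1 -> 0/1 with RESTARTS unchanged: readiness only
  oc describe pod cloudforms-0 | grep -i -A3 'last state'   # a Terminated / exit code 137 entry here would mean a liveness kill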
Barak, I must advise that for debugging purposes I would turn off the liveness check, as it could potentially kill the container in the middle of a debugging session and become counterproductive.
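For the record, a sketch of how one could turn the liveness check off temporarily with `oc set probe`, assuming the app is still managed by a deploymentconfig named cloudforms (this triggers a new rollout, and the probe should be restored once debugging is done; probe values taken from the pod description earlier in this bug):

  # Drop the liveness probe for the duration of the debugging session
  oc set probe dc/cloudforms --liveness --remove

  # Restore it afterwards
  oc set probe dc/cloudforms --liveness --get-url=http://:80/ \
      --initial-delay-seconds=480 --timeout-seconds=3 --period-seconds=10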