Bug 1575507

Summary: Upgrade hangs on task [Upgrade all storage] because the etcd service is missing
Product: OpenShift Container Platform
Reporter: liujia <jiajliu>
Component: Cluster Version Operator
Assignee: Vadim Rutkovsky <vrutkovs>
Status: CLOSED CURRENTRELEASE
QA Contact: liujia <jiajliu>
Severity: high
Docs Contact:
Priority: high
Version: 3.10.0
CC: aos-bugs, dmoessne, gpei, jiajliu, jokerman, mmccomas, vrutkovs, wmeng
Target Milestone: ---
Target Release: 3.10.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: During the upgrade from etcd running as an RPM/system container service to a static pod install, an incorrect health check command was used.
Consequence: The upgrade failed.
Fix: The health check command now checks whether etcd is available in the static pod.
Result: The upgrade proceeds.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-12-20 21:42:21 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description liujia 2018-05-07 07:00:46 UTC
Description of problem:
The upgrade hangs on task [Upgrade all storage] because the etcd service is missing. The original RPM etcd was stopped and removed, but the new etcd static pod was not deployed successfully. Restarting the node service is required to recover.
PLAY [Pre master upgrade - Upgrade all storage] ********************************************************************************
TASK [Gathering Facts] ********************************************************************************
ok: [qe-jliu-rp39-master-etcd-1]
TASK [Upgrade all storage] **************************************************************************************************************************************************
.....

Moreover, task [Verify cluster is healthy] did not actually verify etcd: even though etcd was not running, the task continued.
...
TASK [etcd : Deploy etcd static pod] ********************************************************************************
...
TASK [etcd : Remove temp directory] ********************************************************************************
...
TASK [etcd : Verify cluster is healthy] ********************************************************************************
changed: [qe-jliu-rp39-master-etcd-1] => {"attempts": 1, "changed": true, "cmd": ["/usr/local/bin/master-exec", "etcd", "etcd", "etcdctl", "--cert-file", "/etc/etcd/peer.crt", "--key-file", "/etc/etcd/peer.key", "--ca-file", "/etc/etcd/ca.crt", "-C", "https://qe-jliu-rp39-master-etcd-1:2379", "cluster-health"], "delta": "0:00:00.065427", "end": "2018-05-07 01:22:19.931088", "rc": 0, "start": "2018-05-07 01:22:19.865661", "stderr": "Component etcd is stopped or not running", "stderr_lines": ["Component etcd is stopped or not running"], "stdout": "", "stdout_lines": []}
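
Note that the command above exited with rc 0 even though stderr says "Component etcd is stopped or not running", which is why the task passed. A stricter manual check (illustrative only, not the actual playbook task) would require the expected healthy output instead of trusting the return code, e.g.:

# /usr/local/bin/master-exec etcd etcd etcdctl \
    --cert-file /etc/etcd/peer.crt --key-file /etc/etcd/peer.key \
    --ca-file /etc/etcd/ca.crt -C https://qe-jliu-rp39-master-etcd-1:2379 \
    cluster-health | grep 'cluster is healthy' || echo 'etcd health check failed'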


# systemctl status etcd.service 
● etcd.service
   Loaded: masked (/dev/null; bad)
   Active: inactive (dead) since Mon 2018-05-07 01:21:49 EDT; 1h 3min ago
 Main PID: 15332 (code=killed, signal=TERM)

May 07 01:21:49 qe-jliu-rp39-master-etcd-1 etcd[15332]: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = body closed by handler")
May 07 01:21:49 qe-jliu-rp39-master-etcd-1 systemd[1]: Stopped Etcd Server.

# docker ps |grep etcd

# docker images
REPOSITORY                                                      TAG                 IMAGE ID            CREATED             SIZE
registry.reg-aws.openshift.com:443/openshift3/ose-node          v3.10               71b1a582da1e        5 days ago          1.61 GB
registry.reg-aws.openshift.com:443/openshift3/ose-deployer      v3.9.27             3395208c0d39        10 days ago         1.23 GB
registry.reg-aws.openshift.com:443/openshift3/ose-web-console   v3.9.27             4b36f9fa8d8b        10 days ago         466 MB
registry.reg-aws.openshift.com:443/openshift3/ose-pod           v3.9.27             9852c33ee54b        10 days ago         214 MB

# ls /etc/origin/node/pods/etcd.yaml 
/etc/origin/node/pods/etcd.yaml
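
The static pod manifest is in place, so as a workaround restarting the node service makes the kubelet sync the manifest and start the etcd static pod. Illustrative commands (the node service on this install is atomic-openshift-node):
# systemctl restart atomic-openshift-node
# docker ps | grep etcd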

Version-Release number of the following components:
openshift-ansible-3.10.0-0.33.0.git.0.0aa585e.el7.noarch

How reproducible:
always

Steps to Reproduce:
1. RPM install OCP with co-located etcd on the master hosts
2. Upgrade the above OCP
3.

Actual results:
Upgrade hangs at task [Upgrade all storage].

Expected results:
Upgrade succeeds.

Additional info:

Comment 1 Vadim Rutkovsky 2018-05-07 08:19:42 UTC
Should be fixed by https://github.com/openshift/openshift-ansible/pull/8239

Comment 5 liujia 2018-05-07 09:20:27 UTC
Got it. I will give it a try. I just noticed that the etcd.yaml file alone seems not to be enough when the issue is hit. After I restarted the node service, the etcd static pod ran.

Comment 6 Vadim Rutkovsky 2018-05-07 13:12:29 UTC
Fix is available in openshift-ansible-3.10.0-0.35.0

Comment 7 liujia 2018-05-08 06:04:38 UTC
Version:
openshift-ansible-3.10.0-0.36.0.git.0.521f0ef.el7.noarch

The upgrade still failed because the etcd service was not running when verifying that etcd is healthy.
TASK [etcd : Verify cluster is healthy] *************************************************************************************************************************************
FAILED - RETRYING: Verify cluster is healthy (30 retries left).
...
FAILED - RETRYING: Verify cluster is healthy (1 retries left).
fatal: [x.x.x.x]: FAILED! => {"attempts": 30, "changed": true, "cmd": ["/usr/local/bin/master-exec", "etcd", "etcd", "etcdctl", "--cert-file", "/etc/etcd/peer.crt", "--key-file", "/etc/etcd/peer.key", "--ca-file", "/etc/etcd/ca.crt", "-C", "https://qe-jliu-rpm39-master-etcd-1:2379", "cluster-health"], "delta": "0:00:00.072569", "end": "2018-05-07 23:40:03.077367", "rc": 0, "start": "2018-05-07 23:40:03.004798", "stderr": "Component etcd is stopped or not running", "stderr_lines": ["Component etcd is stopped or not running"], "stdout": "", "stdout_lines": []}

Checked that the etcd static pod was not running.
# docker ps |grep etcd

# docker images
REPOSITORY                                                                  TAG                 IMAGE ID            CREATED             SIZE
registry.reg-aws.openshift.com:443/openshift3/ose-node                      v3.10               71b1a582da1e        6 days ago          1.61 GB
registry.reg-aws.openshift.com:443/openshift3/ose-deployer                  v3.9.27             3395208c0d39        11 days ago         1.23 GB
registry.reg-aws.openshift.com:443/openshift3/ose-web-console               v3.9.27             4b36f9fa8d8b        11 days ago         466 MB
registry.reg-aws.openshift.com:443/openshift3/ose-service-catalog           v3.9.27             927cc8489f53        11 days ago         288 MB
registry.reg-aws.openshift.com:443/openshift3/ose-pod                       v3.9.27             9852c33ee54b        11 days ago         214 MB
registry.reg-aws.openshift.com:443/openshift3/ose-template-service-broker   v3.9.27             6a62667d32b2        11 days ago         300 MB


SyncLoop (SYNC): 1 pods; master-etcd-qe-jliu-rpm39-master-etcd-1_kube-system(e41c2a57b9d15b28cb8a240c7780597f)
:38.817312  104561 kubelet_pods.go:1372] Generating status for "master-etcd-qe-jliu-rpm39-master-etcd-1_kube-system(e41c2a57b9d15b28cb8a240c7780597f)"
:38.817361  104561 kubelet_pods.go:1337] pod waiting > 0, pending
:38.817397  104561 kubelet_node_status.go:294] Setting node annotation to enable volume controller attach/detach
:38.825072  104561 config.go:99] Looking for [api file], have seen map[file:{}]
:38.827100  104561 generic.go:183] GenericPLEG: Relisting
:38.852862  104561 kubelet_node_status.go:350] Adding node label from cloud provider: beta.kubernetes.io/instance-type=n1-standard-1
:38.852893  104561 kubelet_node_status.go:361] Adding node label from cloud provider: failure-domain.beta.kubernetes.io/zone=us-central1-a
:38.852902  104561 kubelet_node_status.go:365] Adding node label from cloud provider: failure-domain.beta.kubernetes.io/region=us-central1
:38.864832  104561 kubelet_node_status.go:445] Recording NodeHasSufficientDisk event message for node qe-jliu-rpm39-master-etcd-1
:38.864864  104561 kubelet_node_status.go:445] Recording NodeHasSufficientMemory event message for node qe-jliu-rpm39-master-etcd-1
:38.864876  104561 kubelet_node_status.go:445] Recording NodeHasNoDiskPressure event message for node qe-jliu-rpm39-master-etcd-1
:38.864888  104561 kubelet_node_status.go:445] Recording NodeHasSufficientPID event message for node qe-jliu-rpm39-master-etcd-1
:38.865012  104561 status_manager.go:340] Ignoring same status for pod "master-etcd-qe-jliu-rpm39-master-etcd-1_kube-system(e41c2a57b9d15b28cb8a240c7780597f)", status: {Phas
:38.865246  104561 kubelet.go:1613] No need to create a mirror pod, since node "qe-jliu-rpm39-master-etcd-1" has been removed from the cluster
:38.865291  104561 volume_manager.go:347] Waiting for volumes to attach and mount for pod "master-etcd-qe-jliu-rpm39-master-etcd-1_kube-system(e41c2a57b9d15b28cb8a240c778059
:38.865322  104561 server.go:428] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"qe-jliu-rpm39-master-etcd-1", UID:"qe-jliu-rpm39-master-etcd-1", APIVersion:"", R
:38.865340  104561 server.go:428] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"qe-jliu-rpm39-master-etcd-1", UID:"qe-jliu-rpm39-master-etcd-1", APIVersion:"", R
:38.865355  104561 server.go:428] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"qe-jliu-rpm39-master-etcd-1", UID:"qe-jliu-rpm39-master-etcd-1", APIVersion:"", R
:38.865370  104561 server.go:428] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"qe-jliu-rpm39-master-etcd-1", UID:"qe-jliu-rpm39-master-etcd-1", APIVersion:"", R
:38.925048  104561 config.go:99] Looking for [api file], have seen map[file:{}]

Comment 9 liujia 2018-05-09 07:39:59 UTC
This blocks the RPM upgrade test.

Comment 11 Scott Dodson 2018-05-10 13:10:25 UTC
Please also attach the full contents of `journalctl --no-pager -u atomic-openshift-node`

Comment 16 Vadim Rutkovsky 2018-05-11 15:34:34 UTC
That looks like an ASB / TSB migration issue. The upgrade succeeds if ASB is not installed.

Do you mind opening a new bug for that?

Comment 17 liujia 2018-05-14 05:35:40 UTC
@Vadim Rutkovsky
Discussed comment 12 with gpei; no need to file a new bug for that issue, which was caused by low resources on the VMs.

Back to comment 7. I still hit the issue on openshift-ansible-3.10.0-0.38.0.git.7.848b045.el7.noarch (the same with the latest openshift-ansible-3.10.0-0.41.0.git.0.88119e4.el7.noarch).

The upgrade failed at task [etcd : Verify cluster is healthy]. When the issue was hit, no etcd pod was running and no etcd image had been pulled.

# docker ps|grep etcd
# docker images|grep etcd

After restarting the node service, etcd was running.
# docker ps|grep etcd
77e7486831f8        registry.access.redhat.com/rhel7/etcd@sha256:128649895440ef04261b5198f3b558fd368be62123114fa45e5cacd589d18c67   "/bin/sh -c '#!/bi..."   25 seconds ago      Up 18 seconds                           k8s_etcd_master-etcd-qe-jliu-rp39a-master-etcd-1_kube-system_e41c2a57b9d15b28cb8a240c7780597f_0
32d5ed507ba1        registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.10                                                     "/usr/bin/pod"           38 seconds ago      Up 33 seconds                           k8s_POD_master-etcd-qe-jliu-rp39a-master-etcd-1_kube-system_e41c2a57b9d15b28cb8a240c7780597f_0
[root@qe-jliu-rp39a-master-etcd-1 ~]# docker ps|grep etcd
77e7486831f8        registry.access.redhat.com/rhel7/etcd@sha256:128649895440ef04261b5198f3b558fd368be62123114fa45e5cacd589d18c67   "/bin/sh -c '#!/bi..."   29 minutes ago      Up 29 minutes                           k8s_etcd_master-etcd-qe-jliu-rp39a-master-etcd-1_kube-system_e41c2a57b9d15b28cb8a240c7780597f_0
32d5ed507ba1        registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.10                                                     "/usr/bin/pod"           30 minutes ago      Up 29 minutes                           k8s_POD_master-etcd-qe-jliu-rp39a-master-etcd-1_kube-system_e41c2a57b9d15b28cb8a240c7780597f_0

Checked that, before the node upgrade, the etcd pod failed to be created because the node tried to pull the ose-pod:v3.10.0-0.41.0 image, which could not be found.
Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-web-console", Name:"webconsole-7d56f5fd7f-dsmlp", UID:"1f164e39-571a-11e8-9203-42010af00025", APIVersion:"v1", ResourceVersion:"4287", FieldPath:""}): type: 'Warning' reason: 'FailedCreatePodSandBox' Failed create pod sandbox: rpc error: code = Unknown desc = Error: image openshift3/ose-pod:v3.10.0-0.41.0 not found

After restarting the node service, the etcd pod was running well because ose-pod:v3.10 was available.

But I have no idea why the node tried to pull the ose-pod:v3.10.0-0.41.0 image instead of ose-pod:v3.10 during the upgrade. The image tag is v3.10 in both the node config and the configmap.

# oc get cm node-config-master -o yaml -n openshift-node|grep -A 1 imageConfig:
    imageConfig:
      format: registry.reg-aws.openshift.com:443/openshift3/ose-${component}:v3.10

# cat /etc/origin/node/node-config.yaml |grep -A 1 imageConfig
imageConfig:
  format: registry.reg-aws.openshift.com:443/openshift3/ose-${component}:v3.10

# cat /etc/origin/master/master-config.yaml|grep -A 1 imageConfig
imageConfig:
  format: registry.reg-aws.openshift.com:443/openshift3/ose-${component}:${version}

See the inventory file (hosts), upgrade log (upgrade.log), and node service log (node.log) in the attachments.
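
For reference, the image tag that the kubelet actually requested can be pulled out of the attached node log with a simple grep (illustrative check, not part of the playbooks):
# journalctl --no-pager -u atomic-openshift-node | grep 'ose-pod'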

Comment 21 liujia 2018-05-14 07:27:06 UTC
Because the root cause is an inconsistent ose-pod image tag across several configs, I tried upgrading to an older version instead (enabling the v3.10.0-0.38.0 repos on the hosts). The upgrade then proceeded successfully, so this issue no longer blocks the upgrade test. But even with that workaround there are still other issues: the installer still tries to pull openshift3/ose-deployer:v3.10.0-0.41.0 instead of openshift3/ose-deployer:v3.10, which causes some apps to fail to deploy.
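
A quick way to confirm whether the exact tag the installer asks for actually exists in the registry is to try pulling it directly (illustrative check, using the registry and tag from the node log above):
# docker pull registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.10.0-0.41.0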

Comment 22 Vadim Rutkovsky 2018-05-14 09:30:07 UTC
The node would use the `openshift_image_tag` value, which generally should be set to v3.10 (as the logs suggest).

Could you verify that /etc/sysconfig/atomic-openshift-node has the correct IMAGE_VERSION? I see Ansible has set it to v3.10, but it could have been changed later on.

Comment 23 liujia 2018-05-14 09:36:58 UTC
(In reply to Vadim Rutkovsky from comment #22)
> The node would use the `openshift_image_tag` value, which generally should be
> set to v3.10 (as the logs suggest).
> 
> Could you verify that /etc/sysconfig/atomic-openshift-node has the correct
> IMAGE_VERSION? I see Ansible has set it to v3.10, but it could have been
> changed later on.

# cat /etc/sysconfig/atomic-openshift-node |grep IMAGE
IMAGE_VERSION=v3.10

Comment 24 Vadim Rutkovsky 2018-05-16 09:33:23 UTC
Tried this a couple more times and couldn't reproduce the missing pod image.

Is this still happening? How often is this reproducible?

I'd suggest closing this issue, as it has gotten sidetracked multiple times.

Comment 25 liujia 2018-05-16 10:29:21 UTC
@Vadim Rutkovsky

I did not hit it in recent tests because the ose-pod:v3.10.0-x.x.x image was available.

But the root cause of the issue is why the installer tries to pull openshift3/ose-deployer:v3.10.0-x.x.x instead of v3.10, which is what is set in the node config and configmap (refer to comment 17).

And now, I checked that with the latest installer (openshift-ansible-3.10.0-0.46.0.git.0.85c3afd.el7.noarch), this appears to have been updated by some PR: the node config and configmap have been changed to match the master config, using ${version} instead of v3.10.

# oc get cm node-config-master -o yaml -n openshift-node|grep -A 1 imageConfig:
    imageConfig:
      format: registry.reg-aws.openshift.com:443/openshift3/ose-${component}:${version}

# cat /etc/origin/node/node-config.yaml |grep -A 1 imageConfig
imageConfig:
  format: registry.reg-aws.openshift.com:443/openshift3/ose-${component}:${version}

# cat /etc/origin/master/master-config.yaml|grep -A 1 imageConfig
imageConfig:
  format: registry.reg-aws.openshift.com:443/openshift3/ose-${component}:${version}

So I think this issue can be verified now.

Comment 26 liujia 2018-05-17 08:32:36 UTC
The original issue is fixed now. Verified on openshift-ansible-3.10.0-0.46.0.git.0.85c3afd.el7.noarch.