Bug 1575507
| Summary: | Upgrade hang on task [Upgrade all storage] due to missing etcd service | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | liujia <jiajliu> |
| Component: | Cluster Version Operator | Assignee: | Vadim Rutkovsky <vrutkovs> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | liujia <jiajliu> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 3.10.0 | CC: | aos-bugs, dmoessne, gpei, jiajliu, jokerman, mmccomas, vrutkovs, wmeng |
| Target Milestone: | --- | | |
| Target Release: | 3.10.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Cause: during the upgrade from RPM- or system-container-based etcd to a static pod installation, an incorrect health check command was used. Consequence: the upgrade failed. Fix: the health check command now verifies that etcd is available in the static pod (see the sketch after this table). Result: the upgrade proceeds. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-12-20 21:42:21 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
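The Doc Text above is terse, so here is a minimal sketch of the failure mode and the shape of the fix. The `master-exec` invocation and the etcdctl flags are copied from the failed task output quoted in the comments below; the guard around them is a hypothetical illustration, not the literal openshift-ansible change.

```sh
# Hypothetical illustration of the fixed health check, not the actual
# openshift-ansible task. Per the logs below, master-exec exits 0 and merely
# prints "Component etcd is stopped or not running" to stderr when the etcd
# static pod container is absent, so a check that trusts the exit code alone
# retries 30 times and then fails without surfacing the real problem.
if docker ps --format '{{.Names}}' | grep -q '^k8s_etcd_master-etcd'; then
  /usr/local/bin/master-exec etcd etcd etcdctl \
    --cert-file /etc/etcd/peer.crt \
    --key-file /etc/etcd/peer.key \
    --ca-file /etc/etcd/ca.crt \
    -C https://qe-jliu-rpm39-master-etcd-1:2379 cluster-health  # host name from the logs below
else
  echo "etcd static pod is not running yet" >&2
  exit 1
fi
```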
Description (liujia, 2018-05-07 07:00:46 UTC)
Should be fixed by https://github.com/openshift/openshift-ansible/pull/8239

---

Got it. Then I will have a try.

---

I noticed that the etcd.yml file alone does not seem to be enough when the issue is hit. After I restarted the node service, the etcd static pod ran.

---

The fix is available in openshift-ansible-3.10.0-0.35.0.

---

Version: openshift-ansible-3.10.0-0.36.0.git.0.521f0ef.el7.noarch

The upgrade still failed because the etcd service was not running when verifying that etcd is healthy:

```
TASK [etcd : Verify cluster is healthy] *************************************************************************************************************************************
FAILED - RETRYING: Verify cluster is healthy (30 retries left).
...
FAILED - RETRYING: Verify cluster is healthy (1 retries left).
fatal: [x.x.x.x]: FAILED! => {"attempts": 30, "changed": true, "cmd": ["/usr/local/bin/master-exec", "etcd", "etcd", "etcdctl", "--cert-file", "/etc/etcd/peer.crt", "--key-file", "/etc/etcd/peer.key", "--ca-file", "/etc/etcd/ca.crt", "-C", "https://qe-jliu-rpm39-master-etcd-1:2379", "cluster-health"], "delta": "0:00:00.072569", "end": "2018-05-07 23:40:03.077367", "rc": 0, "start": "2018-05-07 23:40:03.004798", "stderr": "Component etcd is stopped or not running", "stderr_lines": ["Component etcd is stopped or not running"], "stdout": "", "stdout_lines": []}
```

I checked that the etcd static pod was not running:

```
# docker ps | grep etcd
# docker images
REPOSITORY                                                                  TAG      IMAGE ID      CREATED      SIZE
registry.reg-aws.openshift.com:443/openshift3/ose-node                     v3.10    71b1a582da1e  6 days ago   1.61 GB
registry.reg-aws.openshift.com:443/openshift3/ose-deployer                 v3.9.27  3395208c0d39  11 days ago  1.23 GB
registry.reg-aws.openshift.com:443/openshift3/ose-web-console              v3.9.27  4b36f9fa8d8b  11 days ago  466 MB
registry.reg-aws.openshift.com:443/openshift3/ose-service-catalog          v3.9.27  927cc8489f53  11 days ago  288 MB
registry.reg-aws.openshift.com:443/openshift3/ose-pod                      v3.9.27  9852c33ee54b  11 days ago  214 MB
registry.reg-aws.openshift.com:443/openshift3/ose-template-service-broker  v3.9.27  6a62667d32b2  11 days ago  300 MB
```

Node log excerpt (timestamps truncated in the original paste):

```
SyncLoop (SYNC): 1 pods; master-etcd-qe-jliu-rpm39-master-etcd-1_kube-system(e41c2a57b9d15b28cb8a240c7780597f)
:38.817312 104561 kubelet_pods.go:1372] Generating status for "master-etcd-qe-jliu-rpm39-master-etcd-1_kube-system(e41c2a57b9d15b28cb8a240c7780597f)"
:38.817361 104561 kubelet_pods.go:1337] pod waiting > 0, pending
:38.817397 104561 kubelet_node_status.go:294] Setting node annotation to enable volume controller attach/detach
:38.825072 104561 config.go:99] Looking for [api file], have seen map[file:{}]
:38.827100 104561 generic.go:183] GenericPLEG: Relisting
:38.852862 104561 kubelet_node_status.go:350] Adding node label from cloud provider: beta.kubernetes.io/instance-type=n1-standard-1
:38.852893 104561 kubelet_node_status.go:361] Adding node label from cloud provider: failure-domain.beta.kubernetes.io/zone=us-central1-a
:38.852902 104561 kubelet_node_status.go:365] Adding node label from cloud provider: failure-domain.beta.kubernetes.io/region=us-central1
:38.864832 104561 kubelet_node_status.go:445] Recording NodeHasSufficientDisk event message for node qe-jliu-rpm39-master-etcd-1
:38.864864 104561 kubelet_node_status.go:445] Recording NodeHasSufficientMemory event message for node qe-jliu-rpm39-master-etcd-1
:38.864876 104561 kubelet_node_status.go:445] Recording NodeHasNoDiskPressure event message for node qe-jliu-rpm39-master-etcd-1
:38.864888 104561 kubelet_node_status.go:445] Recording NodeHasSufficientPID event message for node qe-jliu-rpm39-master-etcd-1
:38.865012 104561 status_manager.go:340] Ignoring same status for pod "master-etcd-qe-jliu-rpm39-master-etcd-1_kube-system(e41c2a57b9d15b28cb8a240c7780597f)", status: {Phas
:38.865246 104561 kubelet.go:1613] No need to create a mirror pod, since node "qe-jliu-rpm39-master-etcd-1" has been removed from the cluster
:38.865291 104561 volume_manager.go:347] Waiting for volumes to attach and mount for pod "master-etcd-qe-jliu-rpm39-master-etcd-1_kube-system(e41c2a57b9d15b28cb8a240c778059
:38.865322 104561 server.go:428] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"qe-jliu-rpm39-master-etcd-1", UID:"qe-jliu-rpm39-master-etcd-1", APIVersion:"", R
:38.865340 104561 server.go:428] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"qe-jliu-rpm39-master-etcd-1", UID:"qe-jliu-rpm39-master-etcd-1", APIVersion:"", R
:38.865355 104561 server.go:428] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"qe-jliu-rpm39-master-etcd-1", UID:"qe-jliu-rpm39-master-etcd-1", APIVersion:"", R
:38.865370 104561 server.go:428] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"qe-jliu-rpm39-master-etcd-1", UID:"qe-jliu-rpm39-master-etcd-1", APIVersion:"", R
:38.925048 104561 config.go:99] Looking for [api file], have seen map[file:{}]
```

This blocks the RPM upgrade test.

---

Please also attach the full contents of `journalctl --no-pager -u atomic-openshift-node`.

---

That looks like an ASB / TSB migration issue; the upgrade succeeds if ASB is not installed. Do you mind opening a new bug for that?

---

@Vadim Rutkovsky Discussed comment 12 with gpei; there is no need to file a new bug for that issue, which was caused by low resources on the VMs.

Back to comment 7: I still hit the issue on openshift-ansible-3.10.0-0.38.0.git.7.848b045.el7.noarch (and likewise on the latest openshift-ansible-3.10.0-0.41.0.git.0.88119e4.el7.noarch). The upgrade failed at task [etcd : Verify cluster is healthy]. When the issue was hit, no etcd pod was running and no etcd image had been pulled:

```
# docker ps | grep etcd
# docker images | grep etcd
```

After restarting the node service, etcd was running:

```
# docker ps | grep etcd
77e7486831f8  registry.access.redhat.com/rhel7/etcd@sha256:128649895440ef04261b5198f3b558fd368be62123114fa45e5cacd589d18c67  "/bin/sh -c '#!/bi..."  25 seconds ago  Up 18 seconds  k8s_etcd_master-etcd-qe-jliu-rp39a-master-etcd-1_kube-system_e41c2a57b9d15b28cb8a240c7780597f_0
32d5ed507ba1  registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.10  "/usr/bin/pod"  38 seconds ago  Up 33 seconds  k8s_POD_master-etcd-qe-jliu-rp39a-master-etcd-1_kube-system_e41c2a57b9d15b28cb8a240c7780597f_0
[root@qe-jliu-rp39a-master-etcd-1 ~]# docker ps | grep etcd
77e7486831f8  registry.access.redhat.com/rhel7/etcd@sha256:128649895440ef04261b5198f3b558fd368be62123114fa45e5cacd589d18c67  "/bin/sh -c '#!/bi..."  29 minutes ago  Up 29 minutes  k8s_etcd_master-etcd-qe-jliu-rp39a-master-etcd-1_kube-system_e41c2a57b9d15b28cb8a240c7780597f_0
32d5ed507ba1  registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.10  "/usr/bin/pod"  30 minutes ago  Up 29 minutes  k8s_POD_master-etcd-qe-jliu-rp39a-master-etcd-1_kube-system_e41c2a57b9d15b28cb8a240c7780597f_0
```

I checked that, before the node upgrade, the etcd pod could not be created because the node tried to pull the ose-pod:v3.10.0-0.41.0 image, which could not be found.
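For reference, the workaround used at several points above (restarting the node service so the kubelet re-reads the static pod manifests and recreates the etcd pod) amounts to the following. The manifest directory is an assumption inferred from the etcd.yml file mentioned earlier; the unit name and the other commands appear verbatim in this thread.

```sh
# Workaround: kick the node service so the kubelet re-syncs static pods.
ls /etc/origin/node/pods/                # assumed manifest location; expect etcd.yaml here
systemctl restart atomic-openshift-node  # restart the node service
sleep 30
docker ps | grep etcd                    # expect the etcd and pod-infra containers
journalctl --no-pager -u atomic-openshift-node | tail -n 50   # recent node logs
```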
The kubelet event log confirms the missing image:

```
Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-web-console", Name:"webconsole-7d56f5fd7f-dsmlp", UID:"1f164e39-571a-11e8-9203-42010af00025", APIVersion:"v1", ResourceVersion:"4287", FieldPath:""}): type: 'Warning' reason: 'FailedCreatePodSandBox' Failed create pod sandbox: rpc error: code = Unknown desc = Error: image openshift3/ose-pod:v3.10.0-0.41.0 not found
```

After restarting the node service, the etcd pod ran fine because ose-pod:v3.10 was available. But I have no idea why the node tried to pull the ose-pod:v3.10.0-0.41.0 image instead of ose-pod:v3.10 during the upgrade; the image tag is v3.10 in both the node config and the configmap:

```
# oc get cm node-config-master -o yaml -n openshift-node | grep -A 1 imageConfig:
  imageConfig:
    format: registry.reg-aws.openshift.com:443/openshift3/ose-${component}:v3.10
# cat /etc/origin/node/node-config.yaml | grep -A 1 imageConfig
imageConfig:
  format: registry.reg-aws.openshift.com:443/openshift3/ose-${component}:v3.10
# cat /etc/origin/master/master-config.yaml | grep -A 1 imageConfig
imageConfig:
  format: registry.reg-aws.openshift.com:443/openshift3/ose-${component}:${version}
```

See the inventory file (hosts), the upgrade log (upgrade.log), and the node service log (node.log) in the attachments.

---

Since the root cause is an inconsistent ose-pod image tag across several configs, I tried upgrading to an older version (enabling the v3.10.0-0.38.0 repos on the hosts). The upgrade then proceeded successfully, so this issue no longer blocks the upgrade test. Even with that workaround, though, the installer still tried to pull openshift3/ose-deployer:v3.10.0-0.41.0 instead of openshift3/ose-deployer:v3.10, which caused some applications to fail to deploy.

---

The node would use the `openshift_image_tag` value, which generally should be set to v3.10 (as the logs suggest). Could you verify that /etc/sysconfig/atomic-openshift-node has the correct IMAGE_VERSION? I see Ansible set it to v3.10, but it could have been changed later on.

---

(In reply to Vadim Rutkovsky from comment #22)
> Could you verify /etc/sysconfig/atomic-openshift-node has correct
> IMAGE_VERSION?

```
# cat /etc/sysconfig/atomic-openshift-node | grep IMAGE
IMAGE_VERSION=v3.10
```

---

Tried this a couple more times and could not reproduce the missing pod image. Is this still happening? How often is it reproducible? I'd suggest closing this issue, as it got sidetracked multiple times.
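For context on the tag mismatch discussed above: the imageConfig `format` value is a template expanded per component, so a node resolving `${version}` to v3.10.0-0.41.0 requests a different image than one pinned to v3.10. A minimal shell illustration of the two expansions seen in this bug (the real substitution happens inside OpenShift, not via sed):

```sh
format='registry.reg-aws.openshift.com:443/openshift3/ose-${component}:${version}'

expand() {  # expand <component> <tag>: substitute both placeholders in the template
  printf '%s\n' "$format" | sed -e "s/\${component}/$1/" -e "s/\${version}/$2/"
}

expand pod v3.10           # the tag pinned in node-config -> image exists
expand pod v3.10.0-0.41.0  # the tag the node actually requested -> pull failed
```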
---

@Vadim Rutkovsky I did not hit it in recent tests because the ose-pod:v3.10.0-x.x.x image was available. The root cause question, though, was why the installer tried to pull openshift3/ose-deployer:v3.10.0-x.x.x instead of the v3.10 tag set in the node config and configmap (refer to comment 17). On the latest installer (openshift-ansible-3.10.0-0.46.0.git.0.85c3afd.el7.noarch), the node config and configmap have since been changed, apparently by some PR, to match the master config and use ${version} rather than v3.10:

```
# oc get cm node-config-master -o yaml -n openshift-node | grep -A 1 imageConfig:
  imageConfig:
    format: registry.reg-aws.openshift.com:443/openshift3/ose-${component}:${version}
# cat /etc/origin/node/node-config.yaml | grep -A 1 imageConfig
imageConfig:
  format: registry.reg-aws.openshift.com:443/openshift3/ose-${component}:${version}
# cat /etc/origin/master/master-config.yaml | grep -A 1 imageConfig
imageConfig:
  format: registry.reg-aws.openshift.com:443/openshift3/ose-${component}:${version}
```

So I think this issue can be verified now. The original issue is fixed. Verified on openshift-ansible-3.10.0-0.46.0.git.0.85c3afd.el7.noarch.
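For anyone re-verifying, the check above reduces to confirming that all three imageConfig sources agree; collected from the greps used in this thread:

```sh
# All three should report the same format string (now using ${version}).
grep -A 1 imageConfig /etc/origin/node/node-config.yaml
grep -A 1 imageConfig /etc/origin/master/master-config.yaml
oc get cm node-config-master -n openshift-node -o yaml | grep -A 1 imageConfig
```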