Description of problem:
Upgrade against containerized OCP failed at task [openshift_node : Wait for node to be ready], but the node service is actually running well.

fatal: [x.x.x.x]: FAILED! => {"attempts": 24, "changed": false, "results": {"cmd": "/usr/local/bin/oc get node qe-jliu-ha1-me-1 -o json -n default", "results": [{"apiVersion": "v1", "kind": "Node", "metadata": {"annotations": {"volumes.kubernetes.io/controller-managed-attach-detach": "true"}, "creationTimestamp": "2018-02-23T06:10:16Z", "labels": {"beta.kubernetes.io/arch": "amd64", "beta.kubernetes.io/instance-type": "n1-standard-1", "beta.kubernetes.io/os": "linux", "failure-domain.beta.kubernetes.io/region": "us-central1", "failure-domain.beta.kubernetes.io/zone": "us-central1-a", "kubernetes.io/hostname": "qe-jliu-ha1-me-1", "node-role.kubernetes.io/master": "true", "role": "node"}, "name": "qe-jliu-ha1-me-1", "resourceVersion": "87607", "selfLink": "/api/v1/nodes/qe-jliu-ha1-me-1", "uid": "37716b1f-1860-11e8-a246-42010af00005"}, "spec": {"externalID": "2340195980930061460", "providerID": "gce://openshift-gce-devel/us-central1-a/qe-jliu-ha1-me-1", "unschedulable": true}, "status": {"addresses": [{"address": "10.240.0.4", "type": "InternalIP"}, {"address": "35.202.164.251", "type": "ExternalIP"}, {"address": "qe-jliu-ha1-me-1", "type": "Hostname"}], "allocatable": {"cpu": "1", "memory": "3521452Ki", "pods": "250"}, "capacity": {"cpu": "1", "memory": "3623852Ki", "pods": "250"}, "conditions": [{"lastHeartbeatTime": null, "lastTransitionTime": "2018-02-23T06:10:16Z", "message": "openshift-sdn cleared kubelet-set NoRouteCreated", "reason": "RouteCreated", "status": "False", "type": "NetworkUnavailable"}, {"lastHeartbeatTime": "2018-02-23T07:39:21Z", "lastTransitionTime": "2018-02-23T07:40:03Z", "message": "Kubelet stopped posting node status.", "reason": "NodeStatusUnknown", "status": "Unknown", "type": "OutOfDisk"}, {"lastHeartbeatTime": "2018-02-23T07:39:21Z", "lastTransitionTime": "2018-02-23T07:40:03Z", "message": "Kubelet stopped posting node status.", "reason": "NodeStatusUnknown", "status": "Unknown", "type": "MemoryPressure"}, {"lastHeartbeatTime": "2018-02-23T07:39:21Z", "lastTransitionTime": "2018-02-23T07:40:03Z", "message": "Kubelet stopped posting node status.", "reason": "NodeStatusUnknown", "status": "Unknown", "type": "DiskPressure"}, {"lastHeartbeatTime": "2018-02-23T07:39:21Z", "lastTransitionTime": "2018-02-23T07:40:03Z", "message": "Kubelet stopped posting node status.", "reason": "NodeStatusUnknown", "status": "Unknown", "type": "Ready"}], "daemonEndpoints": {"kubeletEndpoint": {"Port": 10250}}, "images": [{"names": ["registry.reg-aws.openshift.com:443/openshift3/openvswitch@sha256:db244a025139cb4a5c07b225c69f3a2f9f80c803b9fea5155f7fd270d1deb7a2", "registry.reg-aws.openshift.com:443/openshift3/openvswitch:v3.8.32"], "sizeBytes": 1492455641}, {"names": ["registry.reg-aws.openshift.com:443/openshift3/node@sha256:6522d285a759069f4c956c8e97dfe9b6a7bfa56fe367684198c2ae85a6388254", "registry.reg-aws.openshift.com:443/openshift3/node:v3.8.32"], "sizeBytes": 1490767522}, {"names": ["registry.reg-aws.openshift.com:443/openshift3/openvswitch@sha256:e3a4efd23e3c8d09694ca6b55d1f6e09522d0d932afccc228fdc003cd99d4c84", "registry.reg-aws.openshift.com:443/openshift3/openvswitch:v3.7.31"], "sizeBytes": 1291033078}, {"names": ["registry.reg-aws.openshift.com:443/openshift3/node@sha256:d32eb46ceb5d2d06e36aa7c30fad531e3f3cb513762143e59e01e05d025300c5", "registry.reg-aws.openshift.com:443/openshift3/node:v3.7.31"], "sizeBytes": 1289344961}, {"names": ["registry.reg-aws.openshift.com:443/openshift3/ose@sha256:e1aaa7fd1150cfc661f12f52a54b70ae0b8e65aee7d98fe12c3c5909b71e944a", "registry.reg-aws.openshift.com:443/openshift3/ose:v3.8", "registry.reg-aws.openshift.com:443/openshift3/ose:v3.8.32"], "sizeBytes": 1275825650}, {"names": ["registry.reg-aws.openshift.com:443/openshift3/ose@sha256:186466dbc98bf0fd349e034b9236dd829be9d4c951074b434a842316fe22d6f9", "registry.reg-aws.openshift.com:443/openshift3/ose:v3.9", "registry.reg-aws.openshift.com:443/openshift3/ose:v3.9.0"], "sizeBytes": 1254280268}, {"names": ["registry.reg-aws.openshift.com:443/openshift3/ose@sha256:bf047055665628ee3dae30400c54a0b87229b3d4997c89c8cb29852317fcd95e", "registry.reg-aws.openshift.com:443/openshift3/ose:v3.7", "registry.reg-aws.openshift.com:443/openshift3/ose:v3.7.31"], "sizeBytes": 1088450170}, {"names": ["registry.access.redhat.com/rhel7/etcd@sha256:7af27eb689307db36cab833d302cdc46fab5850b90c155cb74b14ed7fec62e60", "registry.access.redhat.com/rhel7/etcd:latest"], "sizeBytes": 250063370}], "nodeInfo": {"architecture": "amd64", "bootID": "477ef20c-6ca6-486d-b9db-79648ba47178", "containerRuntimeVersion": "docker://1.13.1", "kernelVersion": "3.10.0-693.11.1.el7.x86_64", "kubeProxyVersion": "v1.8.5+440f8d36da", "kubeletVersion": "v1.8.5+440f8d36da", "machineID": "6578704f71144944bcf05068370a5315", "operatingSystem": "linux", "osImage": "Red Hat Enterprise Linux Server 7.4 (Maipo)", "systemUUID": "C388BCEC-8DA0-7EFE-DA70-0B6445CF8C1B"}}}], "returncode": 0}, "state": "list"}

# oc get node
NAME                STATUS                        ROLES     AGE       VERSION
qe-jliu-ha1-me-1    NotReady,SchedulingDisabled   master    3h        v1.8.5+440f8d36da
qe-jliu-ha1-me-2    Ready                         master    3h        v1.8.5+440f8d36da
qe-jliu-ha1-me-3    Ready                         master    3h        v1.8.5+440f8d36da
qe-jliu-ha1-nrr-1   Ready                         <none>    3h        v1.7.6+a08f5eeb62
qe-jliu-ha1-nrr-2   Ready                         <none>    3h        v1.7.6+a08f5eeb62

Version-Release number of the following components:
openshift-ansible-3.9.0-0.50.0.git.0.bb78b91.el7.noarch

How reproducible:
Sometimes

Steps to Reproduce:
1. Container install OCP v3.7
2. Upgrade v3.7 to v3.9

Actual results:
Upgrade failed.

Expected results:
Upgrade succeeds.

Additional info:
Met this issue in both non-HA and HA environments.
For the HA env, it failed at task [openshift_node : Wait for node to be ready] on the first master host. Docker was upgraded during the OCP upgrade. Please refer to the output in jenkins job 1438.
For the non-HA env, it failed at task [openshift_node : Wait for node to be ready] on a dedicated node host. Docker was not upgraded during the OCP upgrade. Please refer to the output in jenkins job 1440.
I don't see any obvious reason for this to fail. It appears this is failing on the second part of the double upgrade; the first round of node upgrades on masters went well. I'm going to try to replicate this.
I was able to replicate this issue in an HA deployment. Noticed the following error messages on the host that is failing (first master):

Feb 26 13:43:42 ip-172-18-8-1.ec2.internal atomic-openshift-node[105444]: E0226 13:43:42.102124 105515 remote_runtime.go:115] StopPodSandbox "c42b9448030e432262d46ee66ea69a62ea9a69a6b61d9c3ace1d08e95445c01e" from runtime service failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod "docker-registry-1-wmlfn_default" network: cni config uninitialized
Feb 26 13:43:42 ip-172-18-8-1.ec2.internal atomic-openshift-node[105444]: E0226 13:43:42.102151 105515 kuberuntime_gc.go:153] Failed to stop sandbox "c42b9448030e432262d46ee66ea69a62ea9a69a6b61d9c3ace1d08e95445c01e" before removing: rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod "docker-registry-1-wmlfn_default" network: cni config uninitialized
Started node in the terminal directly:

I0226 14:03:05.734743 111572 node.go:352] Starting openshift-sdn pod manager
I0226 14:03:05.737484 111572 node.go:395] openshift-sdn network plugin ready
I0226 14:03:05.739959 111572 network.go:95] Using iptables Proxier.
W0226 14:03:05.742995 111572 proxier.go:468] clusterCIDR not specified, unable to distinguish between internal and external traffic
...
W0226 14:04:04.822958 111572 docker_sandbox.go:340] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "docker-registry-1-wmlfn_default": CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "c42b9448030e432262d46ee66ea69a62ea9a69a6b61d9c3ace1d08e95445c01e"
E0226 14:04:04.823988 111572 remote_runtime.go:115] StopPodSandbox "c42b9448030e432262d46ee66ea69a62ea9a69a6b61d9c3ace1d08e95445c01e" from runtime service failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod "docker-registry-1-wmlfn_default" network: cni config uninitialized
E0226 14:04:04.824012 111572 kuberuntime_gc.go:153] Failed to stop sandbox "c42b9448030e432262d46ee66ea69a62ea9a69a6b61d9c3ace1d08e95445c01e" before removing: rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod "docker-registry-1-wmlfn_default" network: cni config uninitialized

More details attached.
Created attachment 1400986 [details]
node log
Also noticed via docker ps that the latest ose-pod image is being run:

...openshift.com:443/openshift3/ose-pod:latest "/usr/bin/pod"
Updated docker and restarted the docker service, no change. Rebooted the host, and now it looks like the node is finally talking:

[root@ip-172-18-8-1 ~]# oc get nodes
NAME                            STATUS                     ROLES     AGE       VERSION
ip-172-18-10-155.ec2.internal   Ready                      master    2h        v1.8.5+440f8d36da
ip-172-18-4-42.ec2.internal     Ready                      <none>    1h        v1.7.6+a08f5eeb62
ip-172-18-7-250.ec2.internal    Ready                      master    2h        v1.8.5+440f8d36da
ip-172-18-8-1.ec2.internal      Ready,SchedulingDisabled   master    2h        v1.9.1+a0ce1bc657
ip-172-18-8-24.ec2.internal     Ready                      <none>    1h        v1.7.6+a08f5eeb62
Manually marking the node as schedulable via oc adm manage-node ip-172-18-8-1.ec2.internal --schedulable=True works as intended; pods are scheduled on the node.
After performing the above steps and rerunning the upgrade, it fails on the same host in the same place. There are no useful error messages, which leads me to believe the previous error messages were not applicable to this particular problem. Tried restarting docker, iptables, openvswitch, and the atomic-openshift-* services; no change.
Created attachment 1401055 [details]
atomic-openshift-node.service debugging logs

loglevel 8 (max debug) from atomic-openshift-node.service.
Debug logging attached (Attachment #2 [details] / Comment 12). Scott thinks this might be a network/cni problem.
Still hitting the issue on openshift-ansible-3.9.0-0.53.0.git.0.f8f01ef.el7.noarch.
In my testing I'm not able to reproduce this. Upgrading the control plane does abort, but it does so while deploying the service catalog, well after the node has been updated to 3.9 and the node goes Ready. I believe there's an SELinux policy version dependency issue being tracked here: https://bugzilla.redhat.com/show_bug.cgi?id=1548677 Which version of selinux-policy-targeted is installed on the hosts where this upgrade fails? On my host it's selinux-policy-targeted-3.13.1-166.el7_4.7.noarch, which is the latest 7.4 errata.
Hi Scott, you need to disable the service catalog to work around bz 1547803. Then you will be able to reproduce it with openshift-ansible-3.9.0-0.53.0.git.0.f8f01ef.el7.noarch. This issue seems different from the OVS 2.9 issue caused by selinux-policy, because when the upgrade failed at task [openshift_node : Wait for node to be ready], the node service was actually running well. BTW, both before and after the upgrade, the SELinux policy was selinux-policy-targeted-3.13.1-166.el7_4.7.noarch. I tried to reproduce it on the latest build 3.9.1, but hit another issue (bz1548352) again. It seems this issue and bz1548352 now happen in turn, so the container upgrade never succeeds.
This blocks container upgrades on RHEL.
Triage for why the main kubelet processes are not running. The startup call chain:

runKubeletInProcess()
  cmd.Run()
    Run()
      run()
        RunKubelet()
          startKubelet()
            go func (kl *Kubelet) Run   <-- we know we get this far

Run() starts a bunch of goroutines that constitute the kubelet. We know we get as far as (kl *Kubelet) Run() because we see "server.go:818] Started kubelet" in the logs. We also know that we call CreateAndInitKubelet from RunKubelet() because we see the Starting event (BirthCry) from the node. k.StartGarbageCollection() is called at the end of that function, which explains why we see GC happening but nothing else. However, (kl *Kubelet) Run doesn't have any error paths where it could bail out, and we see no activity from any of the goroutines it would have started.
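To make that concrete, here is a minimal, self-contained Go sketch of the pattern (an illustration, not the upstream kubelet source): once an early synchronous step inside Run blocks, none of the goroutines it would start ever run, even though "Started kubelet" has already been logged and the GC goroutine started earlier keeps going. blockingStep is a hypothetical stand-in for whatever call never returns.

package main

import (
	"fmt"
	"time"
)

// blockingStep stands in for whatever early, synchronous call inside
// (kl *Kubelet) Run never returns.
func blockingStep() { select {} }

// run mimics the shape of (kl *Kubelet) Run: no error path to bail out on,
// so if a step blocks, the goroutines below are simply never started.
func run() {
	blockingStep()
	go func() { fmt.Println("syncLoop goroutine running") }()      // never reached
	go func() { fmt.Println("statusManager goroutine running") }() // never reached
}

func main() {
	// Garbage collection is started before Run is called (from
	// CreateAndInitKubelet), which is why GC activity shows up in the
	// logs while nothing else does.
	go func() {
		for {
			fmt.Println("GC goroutine running")
			time.Sleep(500 * time.Millisecond)
		}
	}()
	go run()
	fmt.Println("Started kubelet") // logged even though Run is stuck
	time.Sleep(2 * time.Second)    // observe: only GC output ever appears
}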
I attempted a bit of bisecting by rolling back 3.9 image versions in the order below. The first version I got to that worked was actually a kube-1.8-based build. So I went to 0.15.0, which worked and reported v1.9.0-beta1. I then worked my way back up, looking to see it fail again. The odd thing is that when I got back to 0.20.0 it worked too, whereas on the way down it had failed! Then I moved back to v3.9.1 and it too worked!

# BAD
#IMAGE_VERSION=v3.9.1
#IMAGE_VERSION=v3.9.0-0.45.0
#IMAGE_VERSION=v3.9.0-0.39.0
#IMAGE_VERSION=v3.9.0-0.30.0
#IMAGE_VERSION=v3.9.0-0.20.0
# Good but actually kube-1.8
#IMAGE_VERSION=v3.9.0-0.9.0
# Good but v1.9.0-beta1
#IMAGE_VERSION=v3.9.0-0.15.0
# First v1.9.1+a0ce1bc657
#IMAGE_VERSION=v3.9.0-0.17.0
#IMAGE_VERSION=v3.9.0-0.18.0
#IMAGE_VERSION=v3.9.0-0.19.0

Tomorrow let's try to upgrade the control plane again; hopefully the second master will exhibit the same problems and we can dig back into it as a clean host.
Created attachment 1403171 [details]
incomplete-kubelet.log

Logs from the incomplete startup of the kubelet.
*** Bug 1548352 has been marked as a duplicate of this bug. ***
Debug update: the issue is with the cgroups Apply call:
https://github.com/kubernetes/kubernetes/blob/release-1.9/pkg/kubelet/cm/cgroup_manager_linux.go#L478
which lands inside libcontainer, here:
https://github.com/openshift/origin/blob/master/vendor/github.com/opencontainers/runc/libcontainer/cgroups/systemd/apply_systemd.go#L299
The operation never completes, and the status channel is what the whole thing hangs on. Nothing in the kubelet moves forward because of this. A node reboot somehow fixes it.
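For anyone following along, here is a simplified sketch of the hanging code path (condensed from the linked apply_systemd.go; not the verbatim source, and error handling is trimmed). libcontainer asks systemd over dbus to create a transient unit for the cgroup and then blocks on the completion channel; if the completion signal never arrives, the receive blocks forever and the kubelet with it.

package cgroupsketch

import (
	"fmt"

	systemdDbus "github.com/coreos/go-systemd/dbus"
)

// applyTransientUnit is a simplified sketch of the Apply path linked above.
// theConn is the shared dbus connection to systemd.
func applyTransientUnit(theConn *systemdDbus.Conn, unitName string, properties []systemdDbus.Property) error {
	statusChan := make(chan string)

	// Ask systemd, over dbus, to create a transient unit for the pod's
	// cgroup. systemd reports job completion on statusChan.
	if _, err := theConn.StartTransientUnit(unitName, "replace", properties, statusChan); err != nil {
		return fmt.Errorf("StartTransientUnit(%s): %v", unitName, err)
	}

	// This receive is where the kubelet wedges: in the containerized env
	// the completion signal never arrives, so nothing further in the
	// kubelet can make progress.
	<-statusChan
	return nil
}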
This could be caused by a recent pick of mine: https://github.com/openshift/origin/pull/18106
Still trying to figure out exactly how this happens, i.e. what is different about the containerized env.
It seems we are not receiving a dbus signal confirming that the StartTransientUnit call was successful. Not sure how dbus works from within the containerized env or how we might fix this. Vikas, can you urgently look at this and see if there might be a way to work around it without reverting and reintroducing the other issue?
Opened this at runc as a stop gap: https://github.com/opencontainers/runc/pull/1754
Debugging further to find out why dbus does not send the signal.
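The stop gap in that PR is essentially to bound the wait. A sketch of the approach as I read it (the exact timeout value and messages may differ from the PR): replace the blocking <-statusChan receive from the sketch in my earlier comment with a select that times out, warns, and continues instead of wedging the caller. Additional imports assumed: time and github.com/sirupsen/logrus; ResetFailedUnit is the go-systemd call that clears a failed unit.

	// Drop-in replacement for the blocking `<-statusChan` receive in the
	// earlier sketch; same theConn, unitName, and statusChan in scope.
	select {
	case s := <-statusChan:
		close(statusChan)
		if s != "done" {
			// Reset the failed unit so a later retry can start cleanly.
			theConn.ResetFailedUnit(unitName)
			return fmt.Errorf("error creating systemd unit `%s`: got `%s`", unitName, s)
		}
	case <-time.After(time.Second):
		// Don't block the kubelet forever if the dbus completion signal
		// is lost; warn and continue.
		logrus.Warnf("Timed out while waiting for StartTransientUnit(%s) completion signal from dbus. Continuing...", unitName)
	}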
I picked the fix to Origin master and 3.9:
https://github.com/openshift/origin/pull/18876
https://github.com/openshift/origin/pull/18877
On hold pending upstream approval.
PRs have approval. Should be in the queue soon.
3.9 is merged, master is in the submit queue.
Version: openshift-ansible-3.9.7-1.git.0.60d5c90.el7.noarch

Steps:
1. Container install OCP v3.7 on RHEL hosts
2. Run upgrade

Upgrade succeeded.