Bug 1548358 - Containerized upgrade failed at TASK [openshift_node : Wait for node to be ready] for container on rhel
Summary: Containerized upgrade failed at TASK [openshift_node : Wait for node to be re...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.9.0
Assignee: Vikas Choudhary
QA Contact: liujia
URL:
Whiteboard:
Duplicates: 1548352 (view as bug list)
Depends On:
Blocks: 1548352
 
Reported: 2018-02-23 10:06 UTC by liujia
Modified: 2018-06-18 18:29 UTC
CC List: 10 users

Fixed In Version: openshift-ansible-3.9.7-1.git.0.60d5c90.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-06-18 17:42:55 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
node log (28.36 KB, text/plain), 2018-02-26 19:09 UTC, Michael Gugino
atomic-openshift-node.service debugging logs (1.50 MB, text/x-vhdl), 2018-02-26 22:07 UTC, Michael Gugino
incomplete-kubelet.log (33.72 KB, text/plain), 2018-03-02 19:51 UTC, Seth Jennings

Description liujia 2018-02-23 10:06:28 UTC
Description of problem:
Upgrade against containerized OCP failed at task [openshift_node : Wait for node to be ready], but the node service is actually running well.

fatal: [x.x.x.x]: FAILED! => {"attempts": 24, "changed": false, "results": {"cmd": "/usr/local/bin/oc get node qe-jliu-ha1-me-1 -o json -n default", "results": [{"apiVersion": "v1", "kind": "Node", "metadata": {"annotations": {"volumes.kubernetes.io/controller-managed-attach-detach": "true"}, "creationTimestamp": "2018-02-23T06:10:16Z", "labels": {"beta.kubernetes.io/arch": "amd64", "beta.kubernetes.io/instance-type": "n1-standard-1", "beta.kubernetes.io/os": "linux", "failure-domain.beta.kubernetes.io/region": "us-central1", "failure-domain.beta.kubernetes.io/zone": "us-central1-a", "kubernetes.io/hostname": "qe-jliu-ha1-me-1", "node-role.kubernetes.io/master": "true", "role": "node"}, "name": "qe-jliu-ha1-me-1", "resourceVersion": "87607", "selfLink": "/api/v1/nodes/qe-jliu-ha1-me-1", "uid": "37716b1f-1860-11e8-a246-42010af00005"}, "spec": {"externalID": "2340195980930061460", "providerID": "gce://openshift-gce-devel/us-central1-a/qe-jliu-ha1-me-1", "unschedulable": true}, "status": {"addresses": [{"address": "10.240.0.4", "type": "InternalIP"}, {"address": "35.202.164.251", "type": "ExternalIP"}, {"address": "qe-jliu-ha1-me-1", "type": "Hostname"}], "allocatable": {"cpu": "1", "memory": "3521452Ki", "pods": "250"}, "capacity": {"cpu": "1", "memory": "3623852Ki", "pods": "250"}, "conditions": [{"lastHeartbeatTime": null, "lastTransitionTime": "2018-02-23T06:10:16Z", "message": "openshift-sdn cleared kubelet-set NoRouteCreated", "reason": "RouteCreated", "status": "False", "type": "NetworkUnavailable"}, {"lastHeartbeatTime": "2018-02-23T07:39:21Z", "lastTransitionTime": "2018-02-23T07:40:03Z", "message": "Kubelet stopped posting node status.", "reason": "NodeStatusUnknown", "status": "Unknown", "type": "OutOfDisk"}, {"lastHeartbeatTime": "2018-02-23T07:39:21Z", "lastTransitionTime": "2018-02-23T07:40:03Z", "message": "Kubelet stopped posting node status.", "reason": "NodeStatusUnknown", "status": "Unknown", "type": "MemoryPressure"}, {"lastHeartbeatTime": "2018-02-23T07:39:21Z", "lastTransitionTime": "2018-02-23T07:40:03Z", "message": "Kubelet stopped posting node status.", "reason": "NodeStatusUnknown", "status": "Unknown", "type": "DiskPressure"}, {"lastHeartbeatTime": "2018-02-23T07:39:21Z", "lastTransitionTime": "2018-02-23T07:40:03Z", "message": "Kubelet stopped posting node status.", "reason": "NodeStatusUnknown", "status": "Unknown", "type": "Ready"}], "daemonEndpoints": {"kubeletEndpoint": {"Port": 10250}}, "images": [{"names": ["registry.reg-aws.openshift.com:443/openshift3/openvswitch@sha256:db244a025139cb4a5c07b225c69f3a2f9f80c803b9fea5155f7fd270d1deb7a2", "registry.reg-aws.openshift.com:443/openshift3/openvswitch:v3.8.32"], "sizeBytes": 1492455641}, {"names": ["registry.reg-aws.openshift.com:443/openshift3/node@sha256:6522d285a759069f4c956c8e97dfe9b6a7bfa56fe367684198c2ae85a6388254", "registry.reg-aws.openshift.com:443/openshift3/node:v3.8.32"], "sizeBytes": 1490767522}, {"names": ["registry.reg-aws.openshift.com:443/openshift3/openvswitch@sha256:e3a4efd23e3c8d09694ca6b55d1f6e09522d0d932afccc228fdc003cd99d4c84", "registry.reg-aws.openshift.com:443/openshift3/openvswitch:v3.7.31"], "sizeBytes": 1291033078}, {"names": ["registry.reg-aws.openshift.com:443/openshift3/node@sha256:d32eb46ceb5d2d06e36aa7c30fad531e3f3cb513762143e59e01e05d025300c5", "registry.reg-aws.openshift.com:443/openshift3/node:v3.7.31"], "sizeBytes": 1289344961}, {"names": ["registry.reg-aws.openshift.com:443/openshift3/ose@sha256:e1aaa7fd1150cfc661f12f52a54b70ae0b8e65aee7d98fe12c3c5909b71e944a", 
"registry.reg-aws.openshift.com:443/openshift3/ose:v3.8", "registry.reg-aws.openshift.com:443/openshift3/ose:v3.8.32"], "sizeBytes": 1275825650}, {"names": ["registry.reg-aws.openshift.com:443/openshift3/ose@sha256:186466dbc98bf0fd349e034b9236dd829be9d4c951074b434a842316fe22d6f9", "registry.reg-aws.openshift.com:443/openshift3/ose:v3.9", "registry.reg-aws.openshift.com:443/openshift3/ose:v3.9.0"], "sizeBytes": 1254280268}, {"names": ["registry.reg-aws.openshift.com:443/openshift3/ose@sha256:bf047055665628ee3dae30400c54a0b87229b3d4997c89c8cb29852317fcd95e", "registry.reg-aws.openshift.com:443/openshift3/ose:v3.7", "registry.reg-aws.openshift.com:443/openshift3/ose:v3.7.31"], "sizeBytes": 1088450170}, {"names": ["registry.access.redhat.com/rhel7/etcd@sha256:7af27eb689307db36cab833d302cdc46fab5850b90c155cb74b14ed7fec62e60", "registry.access.redhat.com/rhel7/etcd:latest"], "sizeBytes": 250063370}], "nodeInfo": {"architecture": "amd64", "bootID": "477ef20c-6ca6-486d-b9db-79648ba47178", "containerRuntimeVersion": "docker://1.13.1", "kernelVersion": "3.10.0-693.11.1.el7.x86_64", "kubeProxyVersion": "v1.8.5+440f8d36da", "kubeletVersion": "v1.8.5+440f8d36da", "machineID": "6578704f71144944bcf05068370a5315", "operatingSystem": "linux", "osImage": "Red Hat Enterprise Linux Server 7.4 (Maipo)", "systemUUID": "C388BCEC-8DA0-7EFE-DA70-0B6445CF8C1B"}}}], "returncode": 0}, "state": "list"}

# oc get node
NAME                STATUS                        ROLES     AGE       VERSION
qe-jliu-ha1-me-1    NotReady,SchedulingDisabled   master    3h        v1.8.5+440f8d36da
qe-jliu-ha1-me-2    Ready                         master    3h        v1.8.5+440f8d36da
qe-jliu-ha1-me-3    Ready                         master    3h        v1.8.5+440f8d36da
qe-jliu-ha1-nrr-1   Ready                         <none>    3h        v1.7.6+a08f5eeb62
qe-jliu-ha1-nrr-2   Ready                         <none>    3h        v1.7.6+a08f5eeb62


Version-Release number of the following components:
openshift-ansible-3.9.0-0.50.0.git.0.bb78b91.el7.noarch

How reproducible:
sometimes

Steps to Reproduce:
1. Containerized install of OCP v3.7
2. Upgrade v3.7 to v3.9

Actual results:
Upgrade failed.

Expected results:
Upgrade succeeds.

Additional info:
Hit this issue in both non-HA and HA environments.
In the HA env, it failed at task [openshift_node : Wait for node to be ready] on the first master host. Docker was upgraded during the OCP upgrade. Please refer to the output of jenkins job 1438.

In the non-HA env, it failed at task [openshift_node : Wait for node to be ready] on a dedicated node host. Docker was not upgraded during the OCP upgrade. Please refer to the output of jenkins job 1440.

Comment 4 Michael Gugino 2018-02-23 16:02:18 UTC
I don't see any obvious reason for this to fail.  It appears this is failing on the second part of the double upgrade; the first round of node upgrades on masters went well.

I'm going to try to replicate this.

Comment 5 Michael Gugino 2018-02-26 18:52:19 UTC
I was able to replicate this issue in HA deployment.

Noticed the following error messages on the host that is failing (first master):

Feb 26 13:43:42 ip-172-18-8-1.ec2.internal atomic-openshift-node[105444]: E0226 13:43:42.102124  105515 remote_runtime.go:115] StopPodSandbox "c42b9448030e432262d46ee66ea69a62ea9a69a6b61d9c3ace1d08e95445c01e" from runtime service failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod "docker-registry-1-wmlfn_default" network: cni config uninitialized
Feb 26 13:43:42 ip-172-18-8-1.ec2.internal atomic-openshift-node[105444]: E0226 13:43:42.102151  105515 kuberuntime_gc.go:153] Failed to stop sandbox "c42b9448030e432262d46ee66ea69a62ea9a69a6b61d9c3ace1d08e95445c01e" before removing: rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod "docker-registry-1-wmlfn_default" network: cni config uninitialized

Comment 6 Michael Gugino 2018-02-26 19:08:47 UTC
Started node in the terminal directly:

I0226 14:03:05.734743  111572 node.go:352] Starting openshift-sdn pod manager
I0226 14:03:05.737484  111572 node.go:395] openshift-sdn network plugin ready
I0226 14:03:05.739959  111572 network.go:95] Using iptables Proxier.
W0226 14:03:05.742995  111572 proxier.go:468] clusterCIDR not specified, unable to distinguish between internal and external traffic

...

W0226 14:04:04.822958  111572 docker_sandbox.go:340] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "docker-registry-1-wmlfn_default": CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "c42b9448030e432262d46ee66ea69a62ea9a69a6b61d9c3ace1d08e95445c01e"
E0226 14:04:04.823988  111572 remote_runtime.go:115] StopPodSandbox "c42b9448030e432262d46ee66ea69a62ea9a69a6b61d9c3ace1d08e95445c01e" from runtime service failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod "docker-registry-1-wmlfn_default" network: cni config uninitialized
E0226 14:04:04.824012  111572 kuberuntime_gc.go:153] Failed to stop sandbox "c42b9448030e432262d46ee66ea69a62ea9a69a6b61d9c3ace1d08e95445c01e" before removing: rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod "docker-registry-1-wmlfn_default" network: cni config uninitialized

More details attached.

Comment 7 Michael Gugino 2018-02-26 19:09:19 UTC
Created attachment 1400986 [details]
node log

Comment 8 Michael Gugino 2018-02-26 19:32:36 UTC
Also, noticed via docker ps that the latest ose-pod image is being run:

...openshift.com:443/openshift3/ose-pod:latest   "/usr/bin/pod"

Comment 9 Michael Gugino 2018-02-26 19:40:51 UTC
Updated docker and restarted the docker service, no change.

Rebooted the host; now it looks like the node is finally talking:

[root@ip-172-18-8-1 ~]# oc get nodes
NAME                            STATUS                     ROLES     AGE       VERSION
ip-172-18-10-155.ec2.internal   Ready                      master    2h        v1.8.5+440f8d36da
ip-172-18-4-42.ec2.internal     Ready                      <none>    1h        v1.7.6+a08f5eeb62
ip-172-18-7-250.ec2.internal    Ready                      master    2h        v1.8.5+440f8d36da
ip-172-18-8-1.ec2.internal      Ready,SchedulingDisabled   master    2h        v1.9.1+a0ce1bc657
ip-172-18-8-24.ec2.internal     Ready                      <none>    1h        v1.7.6+a08f5eeb62

Comment 10 Michael Gugino 2018-02-26 19:45:12 UTC
Manually marking node as schedulable via

oc adm manage-node ip-172-18-8-1.ec2.internal --schedulable=True

works as intended; pods are scheduled on the node.

Comment 11 Michael Gugino 2018-02-26 21:42:21 UTC
After performing the above steps and rerunning the upgrade, it fails on the same host in the same place. There are no useful error messages, which leads me to believe the previous error messages were not applicable to this particular problem.

Tried restarting docker, iptables, openvswitch, and the atomic-openshift-* services, no change.

Comment 12 Michael Gugino 2018-02-26 22:07:31 UTC
Created attachment 1401055 [details]
atomic-openshift-node.service debugging logs

loglevel 8 (max debug) from atomic-openshift-node.service.

Comment 13 Michael Gugino 2018-02-26 22:09:11 UTC
Debug logging attached (Attachment #2 [details] / Comment 12).

Scott thinks this might be a network/cni problem.

Comment 14 liujia 2018-02-27 06:50:17 UTC
Still hit the issue on openshift-ansible-3.9.0-0.53.0.git.0.f8f01ef.el7.noarch

Comment 15 Scott Dodson 2018-02-28 02:49:23 UTC
In my testing I'm not able to reproduce this; upgrading the control plane does abort, but that happens when deploying the service catalog, well after the node has been updated to 3.9 and has gone ready.

I believe there's an selinux policy version dependency issue being tracked here https://bugzilla.redhat.com/show_bug.cgi?id=1548677

Which version of selinux-policy-targeted is installed on the hosts where this upgrade fails? On my host it's selinux-policy-targeted-3.13.1-166.el7_4.7.noarch which is the latest 7.4 errata.

Comment 16 liujia 2018-02-28 07:51:50 UTC
Hi Scott

You need to disable the service catalog to work around bz 1547803. Then you can reproduce it with openshift-ansible-3.9.0-0.53.0.git.0.f8f01ef.el7.noarch.

This issue seems different from the ovs 2.9 issue caused by selinux-policy, because when the upgrade failed at task [openshift_node : Wait for node to be ready], the node service was actually running well. BTW, both before and after the upgrade, the selinux policy was selinux-policy-targeted-3.13.1-166.el7_4.7.noarch.

I tried to reproduce it on the latest build (3.9.1), but hit another issue (bz1548352) again.

It seems this issue and bz1548352 now happen in turn, which means the containerized upgrade never succeeds.

Comment 17 liujia 2018-02-28 08:07:12 UTC
This blocks containerized upgrades on RHEL.

Comment 18 Seth Jennings 2018-03-02 00:00:36 UTC
Triage for why main kubelet processes are not running:

runKubeletInProcess()
cmd.Run()
Run()
run()
RunKubelet()
startKubelet()
go func (kl *Kubelet) Run() <-- we know we get this far (see below)
Run() starts a bunch of goroutines that constitute the kubelet

We know we get as far as (kl *Kubelet) Run() because we see server.go:818] Started kubelet in the logs.

We also know that we call CreateAndInitKubelet from RunKubelet() because we see the Starting event (BirthCry) from the node. k.StartGarbageCollection() is called from the end of that function which explains why we see GC happening but nothing else.

However, (kl *Kubelet) Run() doesn't have any error paths where it could bail out, and we see no activity from any of the goroutines it would have started.
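
To picture that failure shape, here is a toy Go program (not kubelet code; all names below are made up for illustration): it logs "started", kicks off garbage collection in its own goroutines, then blocks on a synchronous setup step, so the remaining goroutines never start even though GC keeps logging.

package main

import (
	"log"
	"time"
)

// startGarbageCollection stands in for k.StartGarbageCollection(): its
// goroutines keep running regardless of what happens to the rest of startup.
func startGarbageCollection() {
	go func() {
		for range time.Tick(10 * time.Second) {
			log.Println("container GC pass") // keeps showing up in the logs
		}
	}()
}

// run stands in for (kl *Kubelet) Run(): it logs that it started, then blocks
// on a synchronous step before it can launch its remaining goroutines.
func run(blockingSetup func()) {
	log.Println("Started kubelet") // analogous to the server.go:818 log line
	blockingSetup()                // if this never returns...
	go syncLoop()                  // ...this goroutine is never started
}

func syncLoop() {
	for {
		log.Println("sync loop iteration") // never seen in the bad runs
		time.Sleep(10 * time.Second)
	}
}

func main() {
	startGarbageCollection()
	run(func() { select {} }) // stand-in for the call that hangs
}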

Comment 19 Scott Dodson 2018-03-02 01:35:55 UTC
I attempted a bit of bisecting by rolling back 3.9 image versions in the order below. The first version I got to that worked was actually a kube 1.8 based build. So I went to 0.15.0, which worked and reported v1.9.0-beta1. I then worked my way back up looking to see it fail again. The odd thing is that when I got back to 0.20.0 it worked too, whereas on the way down it had failed! Then I moved back to v3.9.1 and it too worked!


# BAD
#IMAGE_VERSION=v3.9.1
#IMAGE_VERSION=v3.9.0-0.45.0
#IMAGE_VERSION=v3.9.0-0.39.0
#IMAGE_VERSION=v3.9.0-0.30.0
#IMAGE_VERSION=v3.9.0-0.20.0

# Good but actually kube-1.8
#IMAGE_VERSION=v3.9.0-0.9.0

# Good but v1.9.0-beta1
#IMAGE_VERSION=v3.9.0-0.15.0 

# First v1.9.1+a0ce1bc657
#IMAGE_VERSION=v3.9.0-0.17.0
#IMAGE_VERSION=v3.9.0-0.18.0
#IMAGE_VERSION=v3.9.0-0.19.0

Tomorrow let's try to upgrade the control plane again; hopefully the second master will exhibit the same problems and we can dig back into it on a clean host.

Comment 20 Seth Jennings 2018-03-02 19:51:41 UTC
Created attachment 1403171 [details]
incomplete-kubelet.log

Logs from the incomplete startup of the kubelet.

Comment 21 Scott Dodson 2018-03-05 15:03:50 UTC
*** Bug 1548352 has been marked as a duplicate of this bug. ***

Comment 22 Rajat Chopra 2018-03-06 23:07:11 UTC
Debug update:

Issue with cgroups apply:
https://github.com/kubernetes/kubernetes/blob/release-1.9/pkg/kubelet/cm/cgroup_manager_linux.go#L478

Inside libcontainer, here:
https://github.com/openshift/origin/blob/master/vendor/github.com/opencontainers/runc/libcontainer/cgroups/systemd/apply_systemd.go#L299


The operation never completes, and the status channel is what the whole thing hangs on. Nothing in the kubelet moves forward because of this. A node reboot somehow fixes it.
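
For readers not following the links, the shape of the hang is roughly as follows (a paraphrased sketch, not the actual libcontainer code; the unit name and empty property list are placeholders): the systemd cgroup driver asks dbus to start a transient unit and then blocks on the completion channel, so if the dbus signal never arrives, Apply() never returns.

package main

import (
	"log"

	systemdDbus "github.com/coreos/go-systemd/dbus"
)

// applyTransientUnit mirrors the blocking pattern: ask systemd (over dbus) to
// start a transient unit, then wait on the completion channel.
func applyTransientUnit(conn *systemdDbus.Conn, unitName string, props []systemdDbus.Property) error {
	statusChan := make(chan string, 1)
	if _, err := conn.StartTransientUnit(unitName, "replace", props, statusChan); err != nil {
		return err
	}
	// This is where the node gets stuck: the job-complete signal is never
	// delivered, so this receive blocks forever.
	status := <-statusChan
	log.Printf("transient unit %s finished with status %q", unitName, status)
	return nil
}

func main() {
	conn, err := systemdDbus.New() // connect to systemd over the system bus
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	// Placeholder unit name and (empty) properties, purely for illustration.
	if err := applyTransientUnit(conn, "bz1548358-example.slice", nil); err != nil {
		log.Fatal(err)
	}
}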

Comment 23 Seth Jennings 2018-03-07 01:37:32 UTC
Could be caused by a recent pick of mine:
https://github.com/openshift/origin/pull/18106

Still trying to figure out exactly how this happens, i.e. what is different about the containerized env.

Comment 24 Seth Jennings 2018-03-07 01:46:15 UTC
Seems that we are not getting a dbus signal that the StartTransientUnit was successful.

Not sure how dbus works from within the containerized env or how we might fix this.

Vikas, can you urgently look at this and see if there might be a way to work around this without reverting and reintroducing the other issue?

Comment 25 Vikas Choudhary 2018-03-07 14:29:21 UTC
Opened this at runc as a stop-gap: https://github.com/opencontainers/runc/pull/1754

Debugging further to find out why dbus does not send the signal.
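
For illustration only (this is not the actual change in the runc PR above, and the timeout value is assumed), the kind of stop-gap a bounded wait gives looks roughly like this:

package main

import (
	"log"
	"time"
)

// waitTransientUnit bounds the wait on the dbus completion channel so a lost
// signal degrades into a timeout instead of blocking forever.
func waitTransientUnit(statusChan <-chan string, timeout time.Duration) (string, bool) {
	select {
	case s := <-statusChan:
		return s, true // systemd reported the job result
	case <-time.After(timeout):
		return "", false // no signal; caller can re-check the unit state instead of hanging
	}
}

func main() {
	ch := make(chan string, 1) // nothing is ever sent, simulating the lost signal
	if status, ok := waitTransientUnit(ch, 3*time.Second); ok {
		log.Printf("job finished: %s", status)
	} else {
		log.Print("timed out waiting for the dbus signal; falling back")
	}
}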

Comment 26 Seth Jennings 2018-03-07 15:27:08 UTC
Cherry-picked this to Origin master and 3.9:
https://github.com/openshift/origin/pull/18876
https://github.com/openshift/origin/pull/18877

On hold pending upstream approval

Comment 27 Seth Jennings 2018-03-08 15:23:43 UTC
PRs have approval.  Should be in the queue soon.

Comment 28 Seth Jennings 2018-03-08 18:22:23 UTC
3.9 is merged, master is in the submit queue.

Comment 30 liujia 2018-03-12 08:42:53 UTC
Version:
openshift-ansible-3.9.7-1.git.0.60d5c90.el7.noarch

Steps:
1. Containerized install of OCP v3.7 on RHEL hosts
2. Run the upgrade

Upgrade succeeded.

