Bug 1585070

Summary: Node service fails to start if cpu_manager_state conflicts with the newly configured policy
Product: OpenShift Container Platform
Reporter: DeShuai Ma <dma>
Component: Node
Assignee: Ryan Phillips <rphillips>
Status: CLOSED ERRATA
QA Contact: Sunil Choudhary <schoudha>
Severity: medium
Docs Contact:
Priority: medium
Version: 3.10.0
CC: aos-bugs, jokerman, mmariyan, mmccomas, rphillips, schoudha
Target Milestone: ---
Keywords: Reopened
Target Release: 3.11.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-06-26 09:07:51 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description DeShuai Ma 2018-06-01 09:11:11 UTC
Description of problem:
The node service fails to start if the policy recorded in cpu_manager_state conflicts with the newly configured cpu-manager-policy value.

Version-Release number of selected component (if applicable):
openshift v3.10.0-0.56.0
kubernetes v1.10.0+b81c8f8

How reproducible:
Always

Steps to Reproduce:
1. Check the default cpu_manager_state on the node
# cat /var/lib/origin/openshift.local.volumes/cpu_manager_state
{"policyName":"none","defaultCpuSet":""}

2. Update kubeletArguments to set cpu-manager-policy=static
kubeletArguments:
  feature-gates:
  - RotateKubeletClientCertificate=true,RotateKubeletServerCertificate=true,CPUManager=true
  cpu-manager-policy:
  - static
  cpu-manager-reconcile-period:
  - 5s
  kube-reserved:
  - cpu=500m

3. Restart the node service
# systemctl restart atomic-openshift-node.service

Actual results:
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: I0601 08:27:00.707976   20784 cpu_manager.go:114] [cpumanager] detected CPU topology: &{4 4 4 map[0:{0 0} 1:{1 1} 2:{2 2} 3:{3 3}]}
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: I0601 08:27:00.708008   20784 cpu_assignment.go:163] [cpumanager] takeByTopology: claiming socket [0]
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: I0601 08:27:00.708016   20784 policy_static.go:99] [cpumanager] reserved 1 CPUs ("0") not available for exclusive assignment
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: I0601 08:27:00.708033   20784 state_mem.go:36] [cpumanager] initializing new in-memory state store
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: panic: [cpumanager] state file: unable to restore state from disk (policy configured "static" != policy from state file "none")
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: Panicking because we cannot guarantee sane CPU affinity for existing containers.
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: Please drain this node and delete the CPU manager state file "/var/lib/origin/openshift.local.volumes/cpu_manager_state" before restarting Kubelet.
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: goroutine 1 [running]:
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/cm/cpumanager/state.NewFileState(0xc42060fc40, 0x39, 0x4f27ec0, 0x6, 0x39, 0xc421288a01)
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: /builddir/build/BUILD/atomic-openshift-git-0.c304575/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/cm/cpumanager/state/state_file.go:57 +0x345
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/cm/cpumanager.NewManager(0x7ffe29523526, 0x6, 0x12a05f200, 0xc421382e38, 0xc421393890, 0x7ffe295238f9, 0x27, 0xc4201bda00, 0x100, 0x100, ...)
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: /builddir/build/BUILD/atomic-openshift-git-0.c304575/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/cm/cpumanager/cpu_manager.go:140 +0x24e
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/cm.NewContainerManager(0x8c46620, 0xc4204a6a78, 0x8c41e80, 0xc420ebcea0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: /builddir/build/BUILD/atomic-openshift-git-0.c304575/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/cm/container_manager_linux.go:278 +0xb0e
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: github.com/openshift/origin/vendor/k8s.io/kubernetes/cmd/kubelet/app.run(0xc42017b800, 0xc421090000, 0xc420e99a28, 0x1)
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: /builddir/build/BUILD/atomic-openshift-git-0.c304575/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/cmd/kubelet/app/server.go:647 +0xc15
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: github.com/openshift/origin/vendor/k8s.io/kubernetes/cmd/kubelet/app.Run(0xc42017b800, 0xc421090000, 0x0, 0x0)
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: /builddir/build/BUILD/atomic-openshift-git-0.c304575/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/cmd/kubelet/app/server.go:387 +0xfc
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: github.com/openshift/origin/vendor/k8s.io/kubernetes/cmd/kubelet/app.NewKubeletCommand.func1(0xc420cd5680, 0xc4209bcd80, 0x47, 0x48)
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: /builddir/build/BUILD/atomic-openshift-git-0.c304575/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/cmd/kubelet/app/server.go:232 +0x362
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: github.com/openshift/origin/vendor/github.com/spf13/cobra.(*Command).execute(0xc420cd5680, 0xc4209bcd80, 0x47, 0x48, 0xc420cd5680, 0xc4209bcd80)
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: /builddir/build/BUILD/atomic-openshift-git-0.c304575/_output/local/go/src/github.com/openshift/origin/vendor/github.com/spf13/cobra/command.go:757 +0x2c1
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: github.com/openshift/origin/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xc4209eaf00, 0xc420fe3680, 0xc4203be700, 0xc420a070f0)
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: /builddir/build/BUILD/atomic-openshift-git-0.c304575/_output/local/go/src/github.com/openshift/origin/vendor/github.com/spf13/cobra/command.go:843 +0x334
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: github.com/openshift/origin/vendor/github.com/spf13/cobra.(*Command).Execute(0xc4209eaf00, 0x9, 0xc4209eaf00)
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: /builddir/build/BUILD/atomic-openshift-git-0.c304575/_output/local/go/src/github.com/openshift/origin/vendor/github.com/spf13/cobra/command.go:791 +0x2b
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: main.main()
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: /builddir/build/BUILD/atomic-openshift-git-0.c304575/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/cmd/hyperkube/main.go:63 +0x23e
Jun 01 08:27:00 qe-dma310-node-registry-router-1 systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20829]: container "atomic-openshift-node" does not exist
Jun 01 08:27:00 qe-dma310-node-registry-router-1 systemd[1]: atomic-openshift-node.service: control process exited, code=exited status=1
Jun 01 08:27:00 qe-dma310-node-registry-router-1 systemd[1]: Failed to start atomic-openshift-node.service.


Expected results:
3. The node service should restart successfully.

Additional info:
After manually removing the file /var/lib/origin/openshift.local.volumes/cpu_manager_state, the node service restarts successfully.
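
The workaround could be scripted as a check run before restarting the node service. What follows is only a minimal sketch, not part of the product: the helper name `remove_conflicting_state` is hypothetical, and it assumes the state file is JSON with a `policyName` key, as in the output of step 1. Note that the panic message asks for the node to be drained before the file is deleted; the sketch does not do that.

```python
import json
import os


def remove_conflicting_state(path, configured_policy):
    """Delete the state file if it records a different policy.

    Hypothetical helper for illustration; returns True only when a
    file was actually removed.
    """
    try:
        with open(path) as f:
            content = f.read()
    except FileNotFoundError:
        return False  # no state file, nothing to clean up
    if not content.strip():
        return False  # empty file is left alone
    # Assumption: the file is JSON with a "policyName" key (see step 1).
    recorded = json.loads(content).get("policyName")
    if recorded == configured_policy:
        return False  # policies agree, keep the file
    os.remove(path)
    return True
```

For example, `remove_conflicting_state("/var/lib/origin/openshift.local.volumes/cpu_manager_state", "static")` would remove the file from step 1, because it records the policy "none".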

Comment 1 Seth Jennings 2018-06-01 18:49:13 UTC
This is actually intentional:
https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/cpumanager/state/state_file.go#L57

However, there is a case at https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/cpumanager/state/state_file.go#L80 where, if the file is empty or doesn't exist, the kubelet creates the file and does not return an error.
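
The behavior described here can be modeled roughly as follows. This is a simplified Python sketch of the logic, not the kubelet's Go code: `restore_state` and `PolicyMismatch` are hypothetical names, the exception stands in for the panic, and the contents of the freshly written state are an assumption.

```python
import json
import os


class PolicyMismatch(Exception):
    """Stands in for the kubelet panic in this sketch."""


def restore_state(path, configured_policy):
    """Return the state dict, creating the file if missing or empty."""
    content = ""
    if os.path.exists(path):
        with open(path) as f:
            content = f.read()
    if not content.strip():
        # Missing or empty file: write a fresh state, no error.
        # Assumption: an empty defaultCpuSet is a valid initial value.
        state = {"policyName": configured_policy, "defaultCpuSet": ""}
        with open(path, "w") as f:
            json.dump(state, f)
        return state
    state = json.loads(content)
    if state.get("policyName") != configured_policy:
        # Existing state from another policy: refuse to continue.
        raise PolicyMismatch(
            'policy configured "%s" != policy from state file "%s"'
            % (configured_policy, state.get("policyName"))
        )
    return state
```

In this model, a state file written under the "none" policy makes a later call with "static" fail, which matches the reported restart failure.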

What content does your state file have before you restart the kubelet and it panics?

This is not a release blocker so moving to 3.11 for now.  I'll move back if we can find a timely fix.

Ryan, can you follow up on this one?

Comment 2 Ryan Phillips 2018-06-04 13:27:20 UTC
Looking into this.

Comment 4 Ryan Phillips 2018-06-13 16:20:10 UTC
There is a new upstream PR (that will most likely be slated for OpenShift 3.12) that refactors the cpu_manager_state module. The refactor removes the panic and changes the code to decrease code redundancy and improve consistency.

A backport is potentially risky, so we will likely wait for the upstream changes in OpenShift 3.12.

https://github.com/kubernetes/kubernetes/pull/59214

Comment 6 Ryan Phillips 2019-06-06 15:24:20 UTC
We are going to fix this a different way for 3.11.

PR: https://github.com/openshift/openshift-ansible/pull/11669

Comment 10 errata-xmlrpc 2019-06-26 09:07:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1605