Description of problem: The node service fails to start when the existing cpu_manager_state file conflicts with the newly configured CPU manager policy.

Version-Release number of selected component (if applicable):
openshift v3.10.0-0.56.0
kubernetes v1.10.0+b81c8f8

How reproducible: Always

Steps to Reproduce:
1. Check the default cpu_manager_state on the node:
# cat /var/lib/origin/openshift.local.volumes/cpu_manager_state
{"policyName":"none","defaultCpuSet":""}

2. Set cpu-manager-policy to static in the node configuration:
kubeletArguments:
  feature-gates:
  - RotateKubeletClientCertificate=true,RotateKubeletServerCertificate=true,CPUManager=true
  cpu-manager-policy:
  - static
  cpu-manager-reconcile-period:
  - 5s
  kube-reserved:
  - cpu=500m

3. Restart the node service:
# systemctl restart atomic-openshift-node.service

Actual results:
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: I0601 08:27:00.707976 20784 cpu_manager.go:114] [cpumanager] detected CPU topology: &{4 4 4 map[0:{0 0} 1:{1 1} 2:{2 2} 3:{3 3}]}
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: I0601 08:27:00.708008 20784 cpu_assignment.go:163] [cpumanager] takeByTopology: claiming socket [0]
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: I0601 08:27:00.708016 20784 policy_static.go:99] [cpumanager] reserved 1 CPUs ("0") not available for exclusive assignment
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: I0601 08:27:00.708033 20784 state_mem.go:36] [cpumanager] initializing new in-memory state store
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: panic: [cpumanager] state file: unable to restore state from disk (policy configured "static" != policy from state file "none")
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: Panicking because we cannot guarantee sane CPU affinity for existing containers.
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: Please drain this node and delete the CPU manager state file "/var/lib/origin/openshift.local.volumes/cpu_manager_state" before restarting Kubelet.
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: goroutine 1 [running]:
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/cm/cpumanager/state.NewFileState(0xc42060fc40, 0x39, 0x4f27ec0, 0x6, 0x39, 0xc421288a01)
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: /builddir/build/BUILD/atomic-openshift-git-0.c304575/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/cm/cpumanager/state/state_file.go:57 +0x345
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/cm/cpumanager.NewManager(0x7ffe29523526, 0x6, 0x12a05f200, 0xc421382e38, 0xc421393890, 0x7ffe295238f9, 0x27, 0xc4201bda00, 0x100, 0x100, ...)
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: /builddir/build/BUILD/atomic-openshift-git-0.c304575/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/cm/cpumanager/cpu_manager.go:140 +0x24e
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/cm.NewContainerManager(0x8c46620, 0xc4204a6a78, 0x8c41e80, 0xc420ebcea0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: /builddir/build/BUILD/atomic-openshift-git-0.c304575/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/cm/container_manager_linux.go:278 +0xb0e
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: github.com/openshift/origin/vendor/k8s.io/kubernetes/cmd/kubelet/app.run(0xc42017b800, 0xc421090000, 0xc420e99a28, 0x1)
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: /builddir/build/BUILD/atomic-openshift-git-0.c304575/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/cmd/kubelet/app/server.go:647 +0xc15
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: github.com/openshift/origin/vendor/k8s.io/kubernetes/cmd/kubelet/app.Run(0xc42017b800, 0xc421090000, 0x0, 0x0)
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: /builddir/build/BUILD/atomic-openshift-git-0.c304575/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/cmd/kubelet/app/server.go:387 +0xfc
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: github.com/openshift/origin/vendor/k8s.io/kubernetes/cmd/kubelet/app.NewKubeletCommand.func1(0xc420cd5680, 0xc4209bcd80, 0x47, 0x48)
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: /builddir/build/BUILD/atomic-openshift-git-0.c304575/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/cmd/kubelet/app/server.go:232 +0x362
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: github.com/openshift/origin/vendor/github.com/spf13/cobra.(*Command).execute(0xc420cd5680, 0xc4209bcd80, 0x47, 0x48, 0xc420cd5680, 0xc4209bcd80)
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: /builddir/build/BUILD/atomic-openshift-git-0.c304575/_output/local/go/src/github.com/openshift/origin/vendor/github.com/spf13/cobra/command.go:757 +0x2c1
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: github.com/openshift/origin/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xc4209eaf00, 0xc420fe3680, 0xc4203be700, 0xc420a070f0)
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: /builddir/build/BUILD/atomic-openshift-git-0.c304575/_output/local/go/src/github.com/openshift/origin/vendor/github.com/spf13/cobra/command.go:843 +0x334
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: github.com/openshift/origin/vendor/github.com/spf13/cobra.(*Command).Execute(0xc4209eaf00, 0x9, 0xc4209eaf00)
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: /builddir/build/BUILD/atomic-openshift-git-0.c304575/_output/local/go/src/github.com/openshift/origin/vendor/github.com/spf13/cobra/command.go:791 +0x2b
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: main.main()
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20773]: /builddir/build/BUILD/atomic-openshift-git-0.c304575/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/cmd/hyperkube/main.go:63 +0x23e
Jun 01 08:27:00 qe-dma310-node-registry-router-1 systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Jun 01 08:27:00 qe-dma310-node-registry-router-1 atomic-openshift-node[20829]: container "atomic-openshift-node" does not exist
Jun 01 08:27:00 qe-dma310-node-registry-router-1 systemd[1]: atomic-openshift-node.service: control process exited, code=exited status=1
Jun 01 08:27:00 qe-dma310-node-registry-router-1 systemd[1]: Failed to start atomic-openshift-node.service.

Expected results:
3. The node service should restart successfully.

Additional info:
After manually removing the file /var/lib/origin/openshift.local.volumes/cpu_manager_state, the node service restarts successfully.
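For reference, the recovery procedure implied by the panic message above is roughly the following (the node name is a placeholder; adjust for your environment):

# oc adm drain <node-name> --ignore-daemonsets
# rm /var/lib/origin/openshift.local.volumes/cpu_manager_state
# systemctl restart atomic-openshift-node.service
# oc adm uncordon <node-name>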
This is actually intentional: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/cpumanager/state/state_file.go#L57

However, there is a case at https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/cpumanager/state/state_file.go#L80 where, if the file is empty or doesn't exist, the kubelet creates the file and does not return an error. What content does your state file have before you restart the kubelet and it panics?

This is not a release blocker, so moving to 3.11 for now. I'll move it back if we can find a timely fix. Ryan, can you follow up on this one?
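To make the two code paths above concrete, here is a simplified, self-contained Go sketch of the decision logic described in this comment. It is not the upstream kubelet implementation; the function and type names are illustrative, and only the behavior (re-create on missing/empty file, panic on policy mismatch) follows the linked code and the panic seen in the log.

package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// stateFileData mirrors the on-disk JSON shown in step 1 of the report,
// e.g. {"policyName":"none","defaultCpuSet":""}.
type stateFileData struct {
	PolicyName    string `json:"policyName"`
	DefaultCPUSet string `json:"defaultCpuSet"`
}

// restoreOrCreate illustrates the two paths discussed above: a missing or
// empty state file is silently re-created, while a policy mismatch is fatal.
func restoreOrCreate(path, configuredPolicy string) stateFileData {
	raw, err := os.ReadFile(path)
	if err != nil || len(raw) == 0 {
		// File empty or missing: write a fresh state file and continue (no error).
		fresh := stateFileData{PolicyName: configuredPolicy}
		if out, merr := json.Marshal(fresh); merr == nil {
			_ = os.WriteFile(path, out, 0o644)
		}
		return fresh
	}

	var saved stateFileData
	if err := json.Unmarshal(raw, &saved); err != nil {
		panic(fmt.Sprintf("[cpumanager] state file: could not parse %q: %v", path, err))
	}

	if saved.PolicyName != configuredPolicy {
		// The case hit in this bug: "none" on disk vs. "static" configured.
		panic(fmt.Sprintf("[cpumanager] state file: unable to restore state from disk "+
			"(policy configured %q != policy from state file %q)",
			configuredPolicy, saved.PolicyName))
	}
	return saved
}

func main() {
	state := restoreOrCreate("/var/lib/origin/openshift.local.volumes/cpu_manager_state", "static")
	fmt.Printf("restored state: %+v\n", state)
}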
Looking into this.
There is a new upstream PR (most likely slated for OpenShift 3.12) that refactors the cpu_manager_state module. The refactor removes the panic and reworks the code to reduce redundancy and improve consistency. A backport is potentially risky, so we will likely wait for the upstream changes in OpenShift 3.12. https://github.com/kubernetes/kubernetes/pull/59214
We are going to fix this a different way for 3.11. PR: https://github.com/openshift/openshift-ansible/pull/11669
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:1605