1880932 – Pod with kata-runtime won't start, QEMU: "vhost_user_dev init failed, Operation not permitted"

Bug 1880932 - Pod with kata-runtime won't start, QEMU: "vhost_user_dev init failed, Operation not permitted"

Summary: Pod with kata-runtime won't start, QEMU: "vhost_user_dev init failed, Operati...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	sandboxed-containers
Sub Component:
Version:	4.5
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	4.6.0
Assignee:	Jens Freimann
QA Contact:	Cameron Meadors
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1884276 1889306
TreeView+	depends on / blocked

Reported:	2020-09-21 07:41 UTC by Jens Freimann
Modified:	2020-12-11 07:23 UTC (History)
CC List:	7 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1884276 (view as bug list)
Environment:
Last Closed:	2020-10-27 16:42:58 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
kubelet logs (2.21 MB, application/zip) 2020-09-21 20:40 UTC, Cameron Meadors	no flags	Details
crio logs (262.54 KB, application/zip) 2020-09-21 20:44 UTC, Cameron Meadors	no flags	Details
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2020:4196	0	None	None	None	2020-10-27 16:43:22 UTC

Comment 1 Cameron Meadors 2020-09-21 18:53:11 UTC

I just ran into this bug while testing with Openshift 4.6.  I am calling this a blocker and I cannot actually deploy a pod using kata.

Comment 2 Cameron Meadors 2020-09-21 20:38:51 UTC

I tried with selinux in enforcing and permissive.  I saw no difference.  I will attach the kubelet logs.

Comment 3 Cameron Meadors 2020-09-21 20:40:37 UTC

Created attachment 1715593 [details]
kubelet logs

Comment 4 Cameron Meadors 2020-09-21 20:44:25 UTC

Created attachment 1715594 [details]
crio logs

Comment 6 Jens Freimann 2020-09-22 07:09:31 UTC

crio config:

sh-4.4# crio config
INFO Using default capabilities: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_NET_RAW, CAP_SETGID, CAP_SETUID, CAP_SETPCAP, CAP_NET_BIND_SERVICE, CAP_SYS_CHROOT, CAP_KILL
# The CRI-O configuration file specifies all of the available configuration
# options and command-line flags for the crio(8) OCI Kubernetes Container Runtime
# daemon, but in a TOML format that can be more easily modified and versioned.
#
# Please refer to crio.conf(5) for details of all configuration options.

# CRI-O supports partial configuration reload during runtime, which can be
# done by sending SIGHUP to the running process. Currently supported options
# are explicitly mentioned with: 'This option supports live configuration
# reload'.

# CRI-O reads its storage defaults from the containers-storage.conf(5) file
# located at /etc/containers/storage.conf. Modify this storage configuration if
# you want to change the system's defaults. If you want to modify storage just
# for CRI-O, you can change the storage configuration options here.
[crio]

# Path to the "root directory". CRI-O stores all of its data, including
# containers images, in this directory.
#root = "/var/lib/containers/storage"

# Path to the "run directory". CRI-O stores all of its state in this directory.
#runroot = "/var/run/containers/storage"

# Storage driver used to manage the storage of images and containers. Please
# refer to containers-storage.conf(5) to see all available storage drivers.
#storage_driver = "overlay"

# List to pass options to the storage driver. Please refer to
# containers-storage.conf(5) to see all available storage options.
#storage_option = [
#]

# The default log directory where all logs will go unless directly specified by
# the kubelet. The log directory specified must be an absolute directory.
log_dir = "/var/log/crio/pods"

# Location for CRI-O to lay down the temporary version file.
# It is used to check if crio wipe should wipe containers, which should
# always happen on a node reboot
version_file = "/var/run/crio/version"

# Location for CRI-O to lay down the persistent version file.
# It is used to check if crio wipe should wipe images, which should
# only happen when CRI-O has been upgraded
version_file_persist = "/var/lib/crio/version"

# The crio.api table contains settings for the kubelet/gRPC interface.
[crio.api]

# Path to AF_LOCAL socket on which CRI-O will listen.
listen = "/var/run/crio/crio.sock"

# IP address on which the stream server will listen.
stream_address = ""

# The port on which the stream server will listen. If the port is set to "0", then
# CRI-O will allocate a random free port number.
stream_port = "10010"

# Enable encrypted TLS transport of the stream server.
stream_enable_tls = false

# Path to the x509 certificate file used to serve the encrypted stream. This
# file can change, and CRI-O will automatically pick up the changes within 5
# minutes.
stream_tls_cert = ""

# Path to the key file used to serve the encrypted stream. This file can
# change and CRI-O will automatically pick up the changes within 5 minutes.
stream_tls_key = ""

# Path to the x509 CA(s) file used to verify and authenticate client
# communication with the encrypted stream. This file can change and CRI-O will
# automatically pick up the changes within 5 minutes.
stream_tls_ca = ""

# Maximum grpc send message size in bytes. If not set or <=0, then CRI-O will default to 16 * 1024 * 1024.
grpc_max_send_msg_size = 16777216

# Maximum grpc receive message size. If not set or <= 0, then CRI-O will default to 16 * 1024 * 1024.
grpc_max_recv_msg_size = 16777216

# The crio.runtime table contains settings pertaining to the OCI runtime used
# and options for how to set up and manage the OCI runtime.
[crio.runtime]

# A list of ulimits to be set in containers by default, specified as
# "<ulimit name>=<soft limit>:<hard limit>", for example:
# "nofile=1024:2048"
# If nothing is set here, settings will be inherited from the CRI-O daemon
#default_ulimits = [
#]

# default_runtime is the _name_ of the OCI runtime to be used as the default.
# The name is matched against the runtimes map below.
default_runtime = "runc"

# If true, the runtime will not use pivot_root, but instead use MS_MOVE.
no_pivot = false

# decryption_keys_path is the path where the keys required for
# image decryption are stored. This option supports live configuration reload.
decryption_keys_path = "/etc/crio/keys/"

# Path to the conmon binary, used for monitoring the OCI runtime.
# Will be searched for using $PATH if empty.
conmon = "/usr/libexec/crio/conmon"

# Cgroup setting for conmon
conmon_cgroup = "pod"

# Environment variable list for the conmon process, used for passing necessary
# environment variables to conmon or the runtime.
conmon_env = [
"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
]

# Additional environment variables to set for all the
# containers. These are overridden if set in the
# container image spec or in the container runtime configuration.
default_env = [
"NSS_SDB_USE_CACHE=no",
]

# If true, SELinux will be used for pod separation on the host.
selinux = true

# Path to the seccomp.json profile which is used as the default seccomp profile
# for the runtime. If not specified, then the internal default seccomp profile
# will be used. This option supports live configuration reload.
seccomp_profile = ""

# Used to change the name of the default AppArmor profile of CRI-O. The default
# profile name is "crio-default". This profile only takes effect if the user
# does not specify a profile via the Kubernetes Pod's metadata annotation. If
# the profile is set to "unconfined", then this equals to disabling AppArmor.
# This option supports live configuration reload.
apparmor_profile = "crio-default"

# Cgroup management implementation used for the runtime.
cgroup_manager = "systemd"

# List of default capabilities for containers. If it is empty or commented out,
# only the capabilities defined in the containers json file by the user/kube
# will be added.
default_capabilities = [
"CHOWN",
"DAC_OVERRIDE",
"FSETID",
"FOWNER",
"NET_RAW",
"SETGID",
"SETUID",
"SETPCAP",
"NET_BIND_SERVICE",
"SYS_CHROOT",
"KILL",
]

# List of default sysctls. If it is empty or commented out, only the sysctls
# defined in the container json file by the user/kube will be added.
default_sysctls = [
]

# List of additional devices. specified as
# "<device-on-host>:<device-on-container>:<permissions>", for example: "--device=/dev/sdc:/dev/xvdc:rwm".
#If it is empty or commented out, only the devices
# defined in the container json file by the user/kube will be added.
additional_devices = [
]

# Path to OCI hooks directories for automatically executed hooks. If one of the
# directories does not exist, then CRI-O will automatically skip them.
hooks_dir = [
"/etc/containers/oci/hooks.d",
]

# List of default mounts for each container. **Deprecated:** this option will
# be removed in future versions in favor of default_mounts_file.
default_mounts = [
]

# Path to the file specifying the defaults mounts for each container. The
# format of the config is /SRC:/DST, one mount per line. Notice that CRI-O reads
# its default mounts from the following two files:
#
# 1) /etc/containers/mounts.conf (i.e., default_mounts_file): This is the
# override file, where users can either add in their own default mounts, or
# override the default mounts shipped with the package.
#
# 2) /usr/share/containers/mounts.conf: This is the default file read for
# mounts. If you want CRI-O to read from a different, specific mounts file,
# you can change the default_mounts_file. Note, if this is done, CRI-O will
# only add mounts it finds in this file.
#
#default_mounts_file = ""

# Maximum number of processes allowed in a container.
pids_limit = 1024

# Maximum sized allowed for the container log file. Negative numbers indicate
# that no size limit is imposed. If it is positive, it must be >= 8192 to
# match/exceed conmon's read buffer. The file is truncated and re-opened so the
# limit is never exceeded.
log_size_max = -1

# Whether container output should be logged to journald in addition to the kuberentes log file
log_to_journald = false

# Path to directory in which container exit files are written to by conmon.
container_exits_dir = "/var/run/crio/exits"

# Path to directory for container attach sockets.
container_attach_socket_dir = "/var/run/crio"

# The prefix to use for the source of the bind mounts.
bind_mount_prefix = ""

# If set to true, all containers will run in read-only mode.
read_only = false

# Changes the verbosity of the logs based on the level it is set to. Options
# are fatal, panic, error, warn, info, debug and trace. This option supports
# live configuration reload.
log_level = "info"

# Filter the log messages by the provided regular expression.
# This option supports live configuration reload.
log_filter = ""

# The UID mappings for the user namespace of each container. A range is
# specified in the form containerUID:HostUID:Size. Multiple ranges must be
# separated by comma.
uid_mappings = ""

# The GID mappings for the user namespace of each container. A range is
# specified in the form containerGID:HostGID:Size. Multiple ranges must be
# separated by comma.
gid_mappings = ""

# The minimal amount of time in seconds to wait before issuing a timeout
# regarding the proper termination of the container. The lowest possible
# value is 30s, whereas lower values are not considered by CRI-O.
ctr_stop_timeout = 30

# **DEPRECATED** this option is being replaced by manage_ns_lifecycle, which is described below.
# manage_network_ns_lifecycle = true

# manage_ns_lifecycle determines whether we pin and remove namespaces
# and manage their lifecycle
manage_ns_lifecycle = true

# The directory where the state of the managed namespaces gets tracked.
# Only used when manage_ns_lifecycle is true.
namespaces_dir = "/var/run"

# pinns_path is the path to find the pinns binary, which is needed to manage namespace lifecycle
pinns_path = ""

# The "crio.runtime.runtimes" table defines a list of OCI compatible runtimes.
# The runtime to use is picked based on the runtime_handler provided by the CRI.
# If no runtime_handler is provided, the runtime will be picked based on the level
# of trust of the workload. Each entry in the table should follow the format:
#
#[crio.runtime.runtimes.runtime-handler]
# runtime_path = "/path/to/the/executable"
# runtime_type = "oci"
# runtime_root = "/path/to/the/root"
#
# Where:
# - runtime-handler: name used to identify the runtime
# - runtime_path (optional, string): absolute path to the runtime executable in
# the host filesystem. If omitted, the runtime-handler identifier should match
# the runtime executable name, and the runtime executable should be placed
# in $PATH.
# - runtime_type (optional, string): type of runtime, one of: "oci", "vm". If
# omitted, an "oci" runtime is assumed.
# - runtime_root (optional, string): root directory for storage of containers
# state.

[crio.runtime.runtimes.kata-oc]
runtime_path = "/usr/bin/containerd-shim-kata-v2"
runtime_type = "vm"
runtime_root = "/run/vc"

[crio.runtime.runtimes.runc]
runtime_path = ""
runtime_type = "oci"
runtime_root = "/run/runc"

# Kata Containers is an OCI runtime, where containers are run inside lightweight
# VMs. Kata provides additional isolation towards the host, minimizing the host attack
# surface and mitigating the consequences of containers breakout.

# Kata Containers with the default configured VMM
#[crio.runtime.runtimes.kata-runtime]

# Kata Containers with the QEMU VMM
#[crio.runtime.runtimes.kata-qemu]

# Kata Containers with the Firecracker VMM
#[crio.runtime.runtimes.kata-fc]

# The crio.image table contains settings pertaining to the management of OCI images.
#
# CRI-O reads its configured registries defaults from the system wide
# containers-registries.conf(5) located in /etc/containers/registries.conf. If
# you want to modify just CRI-O, you can change the registries configuration in
# this file. Otherwise, leave insecure_registries and registries commented out to
# use the system's defaults from /etc/containers/registries.conf.
[crio.image]

# Default transport for pulling images from a remote container storage.
default_transport = "docker://"

# The path to a file containing credentials necessary for pulling images from
# secure registries. The file is similar to that of /var/lib/kubelet/config.json
global_auth_file = "/var/lib/kubelet/config.json"

# The image used to instantiate infra containers.
# This option supports live configuration reload.
pause_image = "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bb64c463c4b999fbd124a772e3db56f27c74b1e033052c737b69510fc1b9e590"

# The path to a file containing credentials specific for pulling the pause_image from
# above. The file is similar to that of /var/lib/kubelet/config.json
# This option supports live configuration reload.
pause_image_auth_file = "/var/lib/kubelet/config.json"

# The command to run to have a container stay in the paused state.
# When explicitly set to "", it will fallback to the entrypoint and command
# specified in the pause image. When commented out, it will fallback to the
# default: "/pause". This option supports live configuration reload.
pause_command = "/usr/bin/pod"

# Path to the file which decides what sort of policy we use when deciding
# whether or not to trust an image that we've pulled. It is not recommended that
# this option be used, as the default behavior of using the system-wide default
# policy (i.e., /etc/containers/policy.json) is most often preferred. Please
# refer to containers-policy.json(5) for more details.
signature_policy = ""

# List of registries to skip TLS verification for pulling images. Please
# consider configuring the registries via /etc/containers/registries.conf before
# changing them here.
#insecure_registries = "[]"

# Controls how image volumes are handled. The valid values are mkdir, bind and
# ignore; the latter will ignore volumes entirely.
image_volumes = "mkdir"

# List of registries to be used when pulling an unqualified image (e.g.,
# "alpine:latest"). By default, registries is set to "docker.io" for
# compatibility reasons. Depending on your workload and usecase you may add more
# registries (e.g., "quay.io", "registry.fedoraproject.org",
# "registry.opensuse.org", etc.).
#registries = [
# ]

# The crio.network table containers settings pertaining to the management of
# CNI plugins.
[crio.network]

# The default CNI network name to be selected. If not set or "", then
# CRI-O will pick-up the first one found in network_dir.
# cni_default_network = ""

# Path to the directory where CNI configuration files are located.
network_dir = "/etc/kubernetes/cni/net.d/"

# Paths to directories where CNI plugin binaries are located.
plugin_dirs = [
"/var/lib/cni/bin",
"/usr/libexec/cni",
]

# A necessary configuration for Prometheus based metrics retrieval
[crio.metrics]

# Globally enable or disable metrics support.
enable_metrics = true

# The port on which the metrics server will listen.
metrics_port = 9537

Comment 10 Dr. David Alan Gilbert 2020-09-22 09:01:24 UTC

It's a bit confusing because there seem to be at least two different things going on:

> [ID: 00000001] tmpdir(virtiofsd-I5XLfP): Operation not permitted

that's during the sandboxing when it does:

    char template[] = "virtiofsd-XXXXXX";
    char *tmpdir;
...
    tmpdir = mkdtemp(template);
    if (!tmpdir) {
        fuse_log(FUSE_LOG_ERR, "tmpdir(%s): %m\n", template);
        exit(1);
    }

but those selinux warnings seem to be related to .. something else, maybe just the journal stuff from --syslog?

So I think I'd go hunt down why we can't create the tmpdir

Comment 12 Danilo de Paula 2020-09-22 18:06:26 UTC

I would suggest bring this up with qemu-kvm engineers.

I had an 'operation not permitted' problem when dealing with the guest-agent-container, but the reason was that guest-agent has to be configured in a way to share some /dev devices, and the guest needs to run on privileged mode.

Comment 14 Jens Freimann 2020-10-06 14:09:29 UTC

A patch was posted by stefanha on qemu-devel, subject "[PATCH] virtiofsd: avoid /proc/self/fd tempdir"

It gets rid of the temporary directy, this means it fixes our problem and we don't
need an additonal fix of the SELinux rules.

Fabiano and I tested it and verified that it solves the problem we see here.

Comment 16 Cameron Meadors 2020-10-22 13:06:52 UTC

I have verified that this issue has been fixed.  Tested with Openshift 4.6 rc4 and release-4.6 branch of kata-operator.

Comment 19 errata-xmlrpc 2020-10-27 16:42:58 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

Note You need to log in before you can comment on or make changes to this bug.