Bug 1760294

Summary: kernel: seccomp: wrong return value for blocked syscalls on s390x
Product: Red Hat Enterprise Linux 7
Reporter: Jan Staněk <jstanek>
Component: kernel
Assignee: Vladis Dronov <vdronov>
kernel sub component: Memory Management
QA Contact: Ping Fang <pifang>
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
CC: arozansk, brueckner, bugproxy, cye, hannsj_uhl, lmiksik, longman, mm-maint, omosnace, prudo, qcai, vdronov, vondruch
Version: 7.7
Keywords: Patch
Target Milestone: rc
Target Release: 7.8
Hardware: s390x
OS: Linux
Fixed In Version: kernel-3.10.0-1113.el7
Last Closed: 2020-03-31 19:33:42 UTC
Type: Bug
Bug Blocks: 1689150, 1713152    

Description Jan Staněk 2019-10-10 10:37:16 UTC
Newer versions of NodeJS (v10.16.0, v12) use a libuv that calls the `statx` syscall when it is available. In an s390x RHEL7-based container, this produces inconsistent errors when querying the status of a symlink.

According to upstream (https://github.com/nodejs/node/issues/29916), this is a seccomp/kernel-level issue.

See bug#1759152 and its clone bug#1760184 for the original issue and reproducer.

Version-Release number of selected component (if applicable):
Linux ibm-z-06.xxx.redhat.com 3.10.0-1062.el7.s390x

Comment 2 Jan Staněk 2019-10-10 13:04:45 UTC
Additional info from bug#1759152:

> It may not be a kernel issue but instead a Docker problem. It has been addressed in this ticket:
> https://github.com/docker/for-linux/issues/208

So if it is a docker/podman issue, feel free to reassign – although I found it strange that it would manifest only on s390x if that was the case.

Comment 4 Jan Staněk 2019-10-14 09:02:31 UTC
In case it helps, here is a simple reproducer, adapted from the stat() example in the man pages by joransiu.com:

> You can actually reproduce the issue with a simple C testcase running on a Ubuntu docker image, and query against a symbolic link... after a few runs, you should observe inconsistent results being printed out.   This was the testcase I have in my notes.. but haven't had a chance to test it again... about to board a flight in 5 minutes!  Hope this helps!
 

#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <fcntl.h>            /* Definition of AT_* constants */
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

struct uv__statx_timestamp {
  int64_t tv_sec;
  uint32_t tv_nsec;
  int32_t unused0;
};


struct uv__statx {
  uint32_t stx_mask;
  uint32_t stx_blksize;
  uint64_t stx_attributes;
  uint32_t stx_nlink;
  uint32_t stx_uid;
  uint32_t stx_gid;
  uint16_t stx_mode;
  uint16_t unused0;
  uint64_t stx_ino;
  uint64_t stx_size;
  uint64_t stx_blocks;
  uint64_t stx_attributes_mask;
  struct uv__statx_timestamp stx_atime;
  struct uv__statx_timestamp stx_btime;
  struct uv__statx_timestamp stx_ctime;
  struct uv__statx_timestamp stx_mtime;
  uint32_t stx_rdev_major;
  uint32_t stx_rdev_minor;
  uint32_t stx_dev_major;
  uint32_t stx_dev_minor;
  uint64_t unused1[14];
};

/* Prototype kept from the original notes; the call below goes through
 * syscall() by number, so this declaration is not used. */
int statx(int dirfd, const char *pathname, int flags,
          unsigned int mask, struct uv__statx *statxbuf);

int main(int argc, char *argv[]) {
  int dirfd = AT_FDCWD;
  int flags = AT_SYMLINK_NOFOLLOW;  /* query the link itself, not its target */
  int mode = 0xFFF;                 /* passed as the statx mask argument */
  struct uv__statx sb;

  if (argc != 2) {
    fprintf(stderr, "Usage: %s <pathname>\n", argv[0]);
    exit(EXIT_FAILURE);
  }

  printf("Path: %s\n", argv[1]);
  /* 379 is the syscall number reserved for statx on s390x */
  if (syscall(379, dirfd, argv[1], flags, mode, &sb) == -1) {
    perror("statx");
    exit(EXIT_FAILURE);
  }

  printf("File type:                ");

  switch (sb.stx_mode & S_IFMT) {
    case S_IFBLK:  printf("block device\n");            break;
    case S_IFCHR:  printf("character device\n");        break;
    case S_IFDIR:  printf("directory\n");               break;
    case S_IFIFO:  printf("FIFO/pipe\n");               break;
    case S_IFLNK:  printf("symlink\n");                 break;
    case S_IFREG:  printf("regular file\n");            break;
    case S_IFSOCK: printf("socket\n");                  break;
    default:       printf("unknown?\n");                break;
  }
  return 0;
}

Comment 5 Vladis Dronov 2019-10-21 17:36:39 UTC
as mentioned in "man 2 statx":

VERSIONS
       statx() was added to Linux in kernel 4.11.

so, RHEL-7 does not implement a statx() syscall on any arch, including s390x. double-check:

[src/rhel7]$ git grep statx | grep -v -e ^tools/ -e ^redhat/
arch/s390/include/uapi/asm/unistd.h:/* Number 379 is reserved for sys_statx */

the reproducer above on a bare-metal system expectedly says:

# uname -r
3.10.0-1062.el7.s390x

# ./statx statx.c 
Path: statx.c
statx: Function not implemented
ret -1 errno 38

the reproducer above in an s390x RHEL7-based podman container with "--security-opt=seccomp=unconfined" expectedly says:

(app-root)./statx statx.c 
Path: statx.c
statx: Function not implemented
ret -1 errno 38

the expected strace is:

syscall_379(0xffffffffffffff9c, 0x3ffffb9f8e2, 0x100, 0xfff, 0x3ffffb9efd8, 0x3ffffb9f318) = -1 (errno 38)

the reproducer above in an s390x RHEL7-based podman container (default seccomp config is /usr/share/containers/seccomp.json) says:

(app-root)./statx statx.c 
Path: statx.c
ret 1 errno 0
File type:                mode: 0 mask: 3ff
unknown?

with the strace:

13:31:26 exit(-100)                     = ?
13:31:26 <... exit resumed> strace: _exit returned!
)           = ?

Comment 6 Vladis Dronov 2019-10-21 18:12:37 UTC
normally, a syscall blocked by seccomp should make syscall() return -1 with errno set to 1 (EPERM).

compare with RHEL7 x86_64, podman with "--security-opt=seccomp=unconfined":

(app-root)./statx statx.c 
Path: statx.c
statx: Function not implemented
ret -1 errno 38

syscall_379(0xffffff9c, 0x7fff29c6f8e2, 0x100, 0xfff, 0x7fff29c6e970, 0x7fff29c6eb78) = -1 (errno 38)

compare with RHEL7 x86_64, podman with default seccomp config:

(app-root)./statx statx.c 
Path: statx.c
statx: Operation not permitted
ret -1 errno 1

syscall_379(0xffffff9c, 0x7ffdd0fc68e2, 0x100, 0xfff, 0x7ffdd0fc4460, 0x7ffdd0fc4668) = -1 (errno 38)

statfs() blocked by seccomp:

(app-root)./statx statx.c 
Path: statx.c
statx: Operation not permitted
ret -1 errno 1

statfs(0xffffff9c, 0x7fff1addd8e2) = -1 ENOSYS (Function not implemented)


also, s390x RHEL8-based podman container with statx disabled by seccomp says:

(app-root)./statx statx.c 
Path: statx.c
statx: Operation not permitted
ret -1 errno 1

statx(AT_FDCWD, "statx.c", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW, STATX_ALL, 0x3ffe09fe708) = -1 EPERM (Operation not permitted)

need to look at the seccomp code that handles SCMP_ACT_ERRNO (the default action in seccomp.json) in RHEL-7 s390x.

Comment 8 Vladis Dronov 2019-10-22 00:34:03 UTC
test: 3.10.0-1104.el7.scmpfx.s390x - OK

(app-root)./statx statx.c 
Path: statx.c
statx: Operation not permitted
ret -1 errno 1

syscall_379(0xffffffffffffff9c, 0x3ffff8068e2, 0x100, 0xfff, 0x3ffff805d28, 0x3ffff806068) = -1 (errno 1)

test with statfs() disabled by seccomp - OK

(app-root)./statx statx.c 
Path: statx.c
statx: Operation not permitted
ret -1 errno 1

statfs(0xffffffffffffff9c, 0x3ffff8658e2) = -1 EPERM (Operation not permitted)

Comment 9 Vít Ondruch 2019-10-22 07:49:29 UTC
So do I understand correctly that this really is a kernel issue on s390x in the implementation of seccomp?

Comment 11 Vladis Dronov 2019-10-22 09:44:19 UTC
(In reply to Vít Ondruch from comment #9)
> So do I understand correctly that this really is a kernel issue on s390x in the implementation of seccomp?

in short - yes. still, the statx() syscall is not present on RHEL-7, so libuv shouldn't call it anyway.

in a regular setup such a call would return -38 (-ENOSYS), but the current implementation of libseccomp (i guess)
on RHEL-7 makes the kernel return -1 (-EPERM) for blocked calls, so userspace cannot tell whether a syscall was
blocked or just returned an ordinary error.

Comment 12 Ondrej Mosnacek 2019-11-07 12:40:18 UTC
See BZ1762578#c12: the seccomp.json in podman/docker whitelists the statx() syscall, but it is still blocked because libseccomp is missing the string-to-number mapping (and thus drops it from the whitelist). Once this is fixed in libseccomp, statx() should return ENOSYS in containers as expected. (I assume libuv falls back to something else when it gets an ENOSYS error.)

Comment 13 Vladis Dronov 2019-11-09 08:11:01 UTC
all 'seccomp after ptrace' upstream patches (except unsupported archs like um or tile):

$ git l --oneline | grep 'seccomp after ptrace'
+ 1addc57e111b powerpc/ptrace: run seccomp after ptrace
+ 0208b9445bc0 s390/ptrace: run seccomp after ptrace
- a5cd110cb836 arm64/ptrace: run seccomp after ptrace (arch is not supported)
- 0f3912fd934c arm/ptrace: run seccomp after ptrace (arch is not supported)
* 93e35efb8de4 x86/ptrace: run seccomp after ptrace

so indeed, we need to add 93e35efb8de4 ("x86/ptrace: run seccomp after ptrace")

strace test: OK

08:26:12 syscall_0x1df(0xffffff9c, 0x7ffdbb643457, 0x100, 0xfff, 0x7ffdbb642f70, 0x7ffdbb643178) = -1 ENOSYS (Function not implemented)
08:26:12 read(0, "3\n", 1024)           = 2
08:26:13 dup(2)                         = 3
08:26:13 fcntl(3, F_GETFL)              = 0x8402 (flags O_RDWR|O_APPEND|O_LARGEFILE)
08:26:13 brk(NULL)                      = 0x15a9000
08:26:13 brk(0x15ca000)                 = 0x15ca000
08:26:13 brk(NULL)                      = 0x15ca000

strace+seccomp test: FAIL

08:24:42 syscall_0x1df(0xffffff9c, 0x7ffe7842d8d5, 0x100, 0xfff, 0x7ffe7842ba40, 0x7ffe7842bc48) = -1 ENOSYS (Function not implemented)
08:24:45 read(0, 0x7f8fa5044000, 1024)  = -1 ENOSYS (Function not implemented)
08:24:45 dup(2)                         = -1 ENOSYS (Function not implemented)
08:24:45 fcntl(3, F_GETFL)              = -1 ENOSYS (Function not implemented)
08:24:45 brk(NULL)                      = -1 ENOSYS (Function not implemented)
08:24:45 brk(0xe2f000)                  = -1 ENOSYS (Function not implemented)
08:24:45 brk(NULL)                      = -1 ENOSYS (Function not implemented)
08:24:45 fstat(3, 0x7ffe7842ad90)       = -1 ENOSYS (Function not implemented)

strace+seccomp+93e35efb8de4 test: OK

03:05:55 syscall_0x1df(0xffffff9c, 0x7ffe4ac578d5, 0x100, 0xfff, 0x7ffe4ac56c10, 0x7ffe4ac56e18) = -1 EPERM (Operation not permitted)
03:05:55 read(0, "3\n", 1024)           = 2
03:06:04 dup(2)                         = 3
03:06:04 fcntl(3, F_GETFL)              = 0x8002 (flags O_RDWR|O_LARGEFILE)
03:06:04 brk(NULL)                      = 0x1d17000
03:06:04 brk(0x1d38000)                 = 0x1d38000
03:06:04 brk(NULL)                      = 0x1d38000

Comment 20 Vladis Dronov 2019-11-20 06:49:20 UTC
*** Bug 1772147 has been marked as a duplicate of this bug. ***

Comment 23 Jan Stancek 2019-11-24 09:39:19 UTC
Patch(es) committed on kernel-3.10.0-1113.el7

Comment 28 errata-xmlrpc 2020-03-31 19:33:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:1016