Description of problem:

Nested bind mounts start to trigger a kernel soft lockup and eventually a deadlock at around 20 iterations, during either mount or umount:

# for i in `seq 1 20`; do mount -o bind /root/ /mnt/; done
# for i in `seq 1 20`; do umount /mnt/; done

[  361.301885] NMI backtrace for cpu 0
[  361.302352] CPU: 0 PID: 29 Comm: kworker/0:1 Not tainted 3.10.0-327.10.1.el7.x86_64 #1
[  361.303062] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
[  361.303882] Workqueue: events qxl_fb_work [qxl]
[  361.304379] task: ffff88013943d080 ti: ffff8801395ec000 task.ti: ffff8801395ec000
[  361.305057] RIP: 0010:[<ffffffff811e0b33>]  [<ffffffff811e0b33>] prune_super+0x23/0x170
[  361.305784] RSP: 0000:ffff8801395ef5c0  EFLAGS: 00000206
[  361.306318] RAX: 0000000000000080 RBX: ffff8801394243b0 RCX: 0000000000000000
[  361.306973] RDX: 0000000000000000 RSI: ffff8801395ef710 RDI: ffff8801394243b0
[  361.307628] RBP: ffff8801395ef5e8 R08: 0000000000000000 R09: 0000000000000040
[  361.308282] R10: 0000000000000000 R11: 0000000000000220 R12: ffff8801395ef710
[  361.308939] R13: ffff880139424000 R14: ffff8801395ef710 R15: 0000000000000000
[  361.309599] FS:  0000000000000000(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000
[  361.310320] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  361.310893] CR2: 00007fea0c55dc3d CR3: 00000000b770d000 CR4: 00000000003406f0
[  361.311561] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  361.312227] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  361.312894] Stack:
[  361.313223]  0000000000000400 ffff8801395ef710 ffff8801394243b0 0000000000000258
[  361.313950]  0000000000000000 ffff8801395ef688 ffffffff8117c46b 0000000000000000
[  361.314858]  ffff8801395ef630 ffffffff811d5e21 ffff8801395ef720 0000000000000036
[  361.315615] Call Trace:
[  361.315995]  [<ffffffff8117c46b>] shrink_slab+0xab/0x300
[  361.316559]  [<ffffffff811d5e21>] ? vmpressure+0x21/0x90
[  361.317114]  [<ffffffff8117f6a2>] do_try_to_free_pages+0x3c2/0x4e0
[  361.317727]  [<ffffffff8117f8bc>] try_to_free_pages+0xfc/0x180
[  361.318309]  [<ffffffff811735bd>] __alloc_pages_nodemask+0x7fd/0xb90
[  361.318927]  [<ffffffff811b4429>] alloc_pages_current+0xa9/0x170
[  361.319513]  [<ffffffff811be9ec>] new_slab+0x2ec/0x300
[  361.320045]  [<ffffffff8163220f>] __slab_alloc+0x315/0x48f
[  361.320596]  [<ffffffff811e064c>] ? get_empty_filp+0x5c/0x1a0
[  361.321161]  [<ffffffff811c0fb3>] kmem_cache_alloc+0x193/0x1d0
[  361.321735]  [<ffffffff811e064c>] ? get_empty_filp+0x5c/0x1a0
[  361.322295]  [<ffffffff811e064c>] get_empty_filp+0x5c/0x1a0
[  361.322845]  [<ffffffff811e07ae>] alloc_file+0x1e/0xf0
[  361.323362]  [<ffffffff81182773>] __shmem_file_setup+0x113/0x1f0
[  361.323940]  [<ffffffff81182860>] shmem_file_setup+0x10/0x20
[  361.324496]  [<ffffffffa039f5ab>] drm_gem_object_init+0x2b/0x40 [drm]
[  361.325103]  [<ffffffffa0422c3d>] qxl_bo_create+0x7d/0x190 [qxl]
[  361.325680]  [<ffffffffa042798c>] ? qxl_release_list_add+0x5c/0xc0 [qxl]
[  361.326299]  [<ffffffffa0424066>] qxl_alloc_bo_reserved+0x46/0xb0 [qxl]
[  361.326912]  [<ffffffffa0424fde>] qxl_image_alloc_objects+0xae/0x140 [qxl]
[  361.327544]  [<ffffffffa042556e>] qxl_draw_opaque_fb+0xce/0x3c0 [qxl]
[  361.328145]  [<ffffffffa0421ee2>] qxl_fb_dirty_flush+0x1a2/0x260 [qxl]
[  361.328754]  [<ffffffffa0421fb9>] qxl_fb_work+0x19/0x20 [qxl]
[  361.329306]  [<ffffffff8109d5db>] process_one_work+0x17b/0x470
[  361.329865]  [<ffffffff8109e3ab>] worker_thread+0x11b/0x400
[  361.330393]  [<ffffffff8109e290>] ? rescuer_thread+0x400/0x400
[  361.330940]  [<ffffffff810a5acf>] kthread+0xcf/0xe0
[  361.331419]  [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140
[  361.332005]  [<ffffffff81645998>] ret_from_fork+0x58/0x90
[  361.332513]  [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140
[  361.333093] Code: 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 49 89 f6 41 55 4c 8d af 50 fc ff ff 41 54 53 4c 8b 46 08 48 89 fb <4d> 85 c0 74 09 f6 06 80 0f 84 2f 01 00 00 48 8b 83 80 fc ff ff
[  361.335906] Kernel panic - not syncing: hung_task: blocked tasks

Version-Release number of selected component (if applicable):
kernel-3.10.0-327.10.1.el7.x86_64

How reproducible:
always
This test is insane. Are you aware that the mount table expands by a power of two with each mount command? For example, on a live system, your test results in a mount table with over 1 million entries:

# for i in `seq 1 20`; do mount -o bind /root/ /mnt/; done
# mount | wc -l
1048606
#

If I unmount them all and then run each command manually, you can see the progression:

# mount | wc -l
31
# mount -o bind /root/ /mnt/
# mount | wc -l
32
# mount -o bind /root/ /mnt/
# mount | wc -l
34
# mount -o bind /root/ /mnt/
# mount | wc -l
38
# mount -o bind /root/ /mnt/
# mount | wc -l
46
# mount -o bind /root/ /mnt/
# mount | wc -l
62
#
# mount | tail -31
/dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
#

In the supplied vmcore, the mount table contains over 4 million entries (actually 4194334), which means the mount command was actually run 22
times. For example, if I continue running the mount command on my live system from the "seq 1 20" point shown above:

# mount -o bind /root/ /mnt/
# mount | wc -l
2097182
# mount -o bind /root/ /mnt/
# mount | wc -l
4194334
#

Here is the beginning of the 4 million mount table entries in the vmcore:

crash> mount
     MOUNT           SUPERBLK     TYPE        DEVNAME      DIRNAME
ffff88013b026100 ffff880139a68800 rootfs      rootfs       /
ffff88013568ac00 ffff880132dbb000 sysfs       sysfs        /sys
ffff88013568a600 ffff880139a6b000 proc        proc         /proc
ffff88013568a400 ffff880139420000 devtmpfs    devtmpfs     /dev
ffff88013568a300 ffff880132eb0000 securityfs  securityfs   /sys/kernel/security
ffff880136aec100 ffff880132dbb800 tmpfs       tmpfs        /dev/shm
ffff880136aec200 ffff880139423800 devpts      devpts       /dev/pts
ffff880136aec300 ffff880132dbc000 tmpfs       tmpfs        /run
ffff880136aec400 ffff880132dbc800 tmpfs       tmpfs        /sys/fs/cgroup
ffff880136aec500 ffff880132dbd000 cgroup      cgroup       /sys/fs/cgroup/systemd
ffff880136aec600 ffff880132dbd800 pstore      pstore       /sys/fs/pstore
ffff880132e5c700 ffff880132dbe800 cgroup      cgroup       /sys/fs/cgroup/cpu,cpuacct
ffff880132e5c800 ffff880132dbe000 cgroup      cgroup       /sys/fs/cgroup/perf_event
ffff880132e5c900 ffff880132dbf000 cgroup      cgroup       /sys/fs/cgroup/freezer
ffff880132e5ca00 ffff880132dbf800 cgroup      cgroup       /sys/fs/cgroup/memory
ffff880132e5cb00 ffff880132e20000 cgroup      cgroup       /sys/fs/cgroup/hugetlb
ffff880132e5cc00 ffff880132e20800 cgroup      cgroup       /sys/fs/cgroup/devices
ffff880132e5cd00 ffff880132e21000 cgroup      cgroup       /sys/fs/cgroup/net_cls
ffff880132e5ce00 ffff880132e21800 cgroup      cgroup       /sys/fs/cgroup/blkio
ffff880132e5cf00 ffff880132e22000 cgroup      cgroup       /sys/fs/cgroup/cpuset
ffff88003544ed00 ffff88003565d800 configfs    configfs     /sys/kernel/config
ffff8801356af900 ffff880034f0b000 xfs         /dev/sda1    /
ffff88003542c800 ffff88003565a800 rpc_pipefs  rpc_pipefs   /var/lib/nfs/rpc_pipefs
ffff8800357a1d00 ffff880139425000 selinuxfs   selinuxfs    /sys/fs/selinux
ffff880034b0a000 ffff880034848800 autofs      systemd-1    /proc/sys/fs/binfmt_misc
ffff8800356d6100 ffff880139424800 mqueue      mqueue       /dev/mqueue
ffff8800354f5d00 ffff880034849000 hugetlbfs   hugetlbfs    /dev/hugepages
ffff8801397a0f00 ffff880139a6e000 debugfs     debugfs      /sys/kernel/debug
ffff880132f62700 ffff880034f0c800 nfsd        nfsd         /proc/fs/nfsd
ffff8800b9c3ac00 ffff880034f0b000 xfs         /dev/sda1    /var/lib/docker/devicemapper
ffff8800356d8100 ffff8800b1168800 tmpfs       tmpfs        /run/user/1000
ffff8800348dc600 ffff880034f0b000 xfs         /dev/sda1    /tmp
ffff8800bb6a3d00 ffff880034f0b000 xfs         /dev/sda1    /tmp/root
ffff8800bb6a3800 ffff880034f0b000 xfs         /dev/sda1    /root
ffff8800a7c6fe00 ffff880034f0b000 xfs         /dev/sda1    /tmp/root
ffff8800354f5b00 ffff880034f0b000 xfs         /dev/sda1    /root
ffff8800354f5600 ffff880034f0b000 xfs         /dev/sda1    /tmp/root
ffff8800b74a9200 ffff880034f0b000 xfs         /dev/sda1    /root
ffff880034b13800 ffff880034f0b000 xfs         /dev/sda1    /tmp/root
ffff880034b13400 ffff880034f0b000 xfs         /dev/sda1    /root
ffff880034b13500 ffff880034f0b000 xfs         /dev/sda1    /tmp/root
ffff8800349a3b00 ffff880034f0b000 xfs         /dev/sda1    /root
ffff8800349a3800 ffff880034f0b000 xfs         /dev/sda1    /tmp/root
ffff8800349a3400 ffff880034f0b000 xfs         /dev/sda1    /root
ffff8800b74c6c00 ffff880034f0b000 xfs         /dev/sda1    /tmp/root
ffff8800b74c6100 ffff880034f0b000 xfs         /dev/sda1    /root
ffff88013573e000 ffff880034f0b000 xfs         /dev/sda1    /tmp/root
ffff88013573e100 ffff880034f0b000 xfs         /dev/sda1    /root
ffff88013573e300 ffff880034f0b000 xfs         /dev/sda1    /tmp/root
ffff8800b9dfb800 ffff880034f0b000 xfs         /dev/sda1    /root
ffff8800b9dfb000 ffff880034f0b000 xfs         /dev/sda1    /tmp/root
ffff8800b9dfb600 ffff880034f0b000 xfs         /dev/sda1    /root
ffff8800b9dfbd00 ffff880034f0b000 xfs         /dev/sda1    /tmp/root
ffff8800b9dfbe00 ffff880034f0b000 xfs         /dev/sda1    /root
ffff8800b9dfb300 ffff880034f0b000 xfs         /dev/sda1    /tmp/root
ffff8800354d1500 ffff880034f0b000 xfs         /dev/sda1    /root
ffff8800354d1700 ffff880034f0b000 xfs         /dev/sda1    /tmp/root
...

The test system has 2 cpus and ~4GB of memory.
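The doubling is a direct consequence of shared mount propagation: each bind of /root onto /mnt also propagates into every existing peer copy, so each mount command adds 2^(i-1) new entries. A quick arithmetic sanity check of the counts above (a sketch; `base=31` is the entry count before the first bind, taken from the session shown):

```shell
# With shared propagation, mount command i adds 2^(i-1) entries,
# so after n binds the table holds base + 2^n - 1 entries.
base=31
count() { echo $(( base + (1 << $1) - 1 )); }

count 1    # 32       -- matches the manual progression above
count 5    # 62
count 20   # 1048606  -- the "over 1 million entries" after seq 1 20
count 22   # 4194334  -- the vmcore's table, i.e. 22 mount commands
```

This is also how we can tell from the vmcore alone that the mount command ran exactly 22 times.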
The mounts cause the slab cache to balloon until it consumes 86% of memory:

crash> kmem -i
                 PAGES        TOTAL      PERCENTAGE
    TOTAL MEM   970581       3.7 GB         ----
         FREE    21617      84.4 MB    2% of TOTAL MEM
         USED   948964       3.6 GB   97% of TOTAL MEM
       SHARED      705       2.8 MB    0% of TOTAL MEM
      BUFFERS        0            0    0% of TOTAL MEM
       CACHED     4201      16.4 MB    0% of TOTAL MEM
         SLAB   841799       3.2 GB   86% of TOTAL MEM

   TOTAL SWAP        0            0         ----
    SWAP USED        0            0  100% of TOTAL SWAP
    SWAP FREE        0            0    0% of TOTAL SWAP

 COMMIT LIMIT   485290       1.9 GB         ----
    COMMITTED    97824     382.1 MB   20% of TOTAL LIMIT
crash>

And the system is trying desperately to free memory: the runnable tasks on the system are trying to allocate memory, 15 of which have ended up in shrink_slab() and called cond_resched() because they cannot free any memory:

crash> foreach RU bt | grep -e PID -e shrink_slab
PID: 0      TASK: ffffffff81951440  CPU: 0   COMMAND: "swapper/0"
PID: 0      TASK: ffff880139b62280  CPU: 1   COMMAND: "swapper/1"
PID: 29     TASK: ffff88013943d080  CPU: 0   COMMAND: "kworker/0:1"
 #5 [ffff8801395ef5f0] shrink_slab at ffffffff8117c46b
PID: 34     TASK: ffff880132ce0b80  CPU: 1   COMMAND: "khungtaskd"
PID: 35     TASK: ffff880132ce1700  CPU: 1   COMMAND: "kswapd0"
 #5 [ffff880132cebca8] shrink_slab at ffffffff8117c46b
PID: 323    TASK: ffff8800356e1700  CPU: 1   COMMAND: "kworker/1:2"
 #5 [ffff88013283fa48] shrink_slab at ffffffff8117c46b
PID: 414    TASK: ffff88003494ae00  CPU: 1   COMMAND: "systemd-journal"
 #3 [ffff880135bb3980] shrink_slab at ffffffff8117c59c
PID: 476    TASK: ffff8800bb602e00  CPU: 0   COMMAND: "auditd"
 #3 [ffff8800bb64f980] shrink_slab at ffffffff8117c59c
PID: 563    TASK: ffff8800bb607300  CPU: 0   COMMAND: "systemd-logind"
 #3 [ffff8800b9c9f980] shrink_slab at ffffffff8117c59c
PID: 567    TASK: ffff880132ebae00  CPU: 0   COMMAND: "NetworkManager"
 #3 [ffff8800b9d2b980] shrink_slab at ffffffff8117c59c
PID: 579    TASK: ffff880138205c00  CPU: 1   COMMAND: "gssproxy"
 #3 [ffff8800b9ca3980] shrink_slab at ffffffff8117c59c
PID: 581    TASK: ffff880138205080  CPU: 0   COMMAND: "irqbalance"
 #3 [ffff8801397bf980] shrink_slab at ffffffff8117c59c
PID: 610    TASK: ffff880132f35080  CPU: 1   COMMAND: "crond"
 #3 [ffff8800b9def980] shrink_slab at ffffffff8117c59c
PID: 1135   TASK: ffff8800b9e32e00  CPU: 1   COMMAND: "tuned"
 #3 [ffff880134593980] shrink_slab at ffffffff8117c59c
PID: 1495   TASK: ffff8800b760c500  CPU: 1   COMMAND: "master"
 #3 [ffff88003497f980] shrink_slab at ffffffff8117c59c
PID: 1511   TASK: ffff8800bb5b2280  CPU: 0   COMMAND: "pickup"
PID: 1512   TASK: ffff8800bb5b0b80  CPU: 1   COMMAND: "qmgr"
 #3 [ffff880034c07980] shrink_slab at ffffffff8117c59c
PID: 2041   TASK: ffff88013544a280  CPU: 1   COMMAND: "docker"
PID: 2069   TASK: ffff8800b771dc00  CPU: 0   COMMAND: "docker"
 #3 [ffff8801352c3980] shrink_slab at ffffffff8117c59c
PID: 2217   TASK: ffff8800b7609700  CPU: 0   COMMAND: "sshd"
 #3 [ffff8800b76d3980] shrink_slab at ffffffff8117c59c
crash>

There is no "fix" for a root user who is abusing the system in this manner. There is no way the system can recover, because there is no memory left to be reclaimed or allocated. And, as expected, the same thing happens on an upstream kernel, so it is not even a "rhel-7.3" issue. In my opinion, this BZ should be CLOSED/NOTABUG.
hello. vladis from prodsec is here.

1) testing this potential flaw from a security point of view for rhel-7: the mount table indeed expands by a power of two with each mount command when run by root:

# uname -r
3.10.0-327.18.2.el7.x86_64
# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-327.18.2.el7.x86_64 ... namespace.enable=1
# cat /proc/mounts | wc -l
29
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
30
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
32
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
36
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
44
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
60
# tail -6 /proc/mounts
/dev/mapper/rhel_rhel7-root /mnt xfs rw,relatime,attr2,inode64,noquota 0 0
/dev/mapper/rhel_rhel7-root /root xfs rw,relatime,attr2,inode64,noquota 0 0
/dev/mapper/rhel_rhel7-root /mnt xfs rw,relatime,attr2,inode64,noquota 0 0
/dev/mapper/rhel_rhel7-root /root xfs rw,relatime,attr2,inode64,noquota 0 0
/dev/mapper/rhel_rhel7-root /mnt xfs rw,relatime,attr2,inode64,noquota 0 0
/dev/mapper/rhel_rhel7-root /root xfs rw,relatime,attr2,inode64,noquota 0 0

but just one entry is added per mount when run inside root's mount namespace:

# unshare -m
# cat /proc/mounts | wc -l
60
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
61
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
62
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
63
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
64
# tail -6 /proc/mounts
/dev/mapper/rhel_rhel7-root /root xfs rw,relatime,attr2,inode64,noquota 0 0
/dev/mapper/rhel_rhel7-root /root xfs rw,relatime,attr2,inode64,noquota 0 0
/dev/mapper/rhel_rhel7-root /mnt xfs rw,relatime,attr2,inode64,noquota 0 0
/dev/mapper/rhel_rhel7-root /mnt xfs rw,relatime,attr2,inode64,noquota 0 0
/dev/mapper/rhel_rhel7-root /mnt xfs rw,relatime,attr2,inode64,noquota 0 0
/dev/mapper/rhel_rhel7-root /mnt xfs rw,relatime,attr2,inode64,noquota 0 0
# logout
# cat /proc/mounts | wc -l
60

a mount or mount+user namespace cannot be entered by a non-root user, and a user namespace entered by a non-root user does not allow mounting:

# su - testuser
$ unshare -m
unshare: unshare failed: Operation not permitted
$ unshare -U -r -m
unshare: unshare failed: Operation not permitted
$ unshare -U -r
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
mount: permission denied
# logout

so exponential mount table growth is possible by root only; user/mount namespaces do not allow entering or mounting, and so this is not a security flaw (a root user has many simpler ways to destroy the system).

2) on the other hand, this behavior is not reproducible in rhel-6:

# uname -r
2.6.32-573.22.1.el6.x86_64
# cat /proc/mounts | wc -l
10
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
11
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
12
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
13
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
14
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
15
# tail /proc/mounts
tmpfs /dev/shm tmpfs rw,relatime 0 0
/dev/vda3 / ext4 rw,relatime,barrier=1,data=ordered 0 0
/proc/bus/usb /proc/bus/usb usbfs rw,relatime 0 0
/dev/vda1 /boot ext4 rw,relatime,barrier=1,data=ordered 0 0
none /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0
/dev/vda3 /mnt ext4 rw,relatime,barrier=1,data=ordered 0 0
/dev/vda3 /mnt ext4 rw,relatime,barrier=1,data=ordered 0 0
/dev/vda3 /mnt ext4 rw,relatime,barrier=1,data=ordered 0 0
/dev/vda3 /mnt ext4 rw,relatime,barrier=1,data=ordered 0 0
/dev/vda3 /mnt ext4 rw,relatime,barrier=1,data=ordered 0 0

so in some sense this exponential growth of the mount table is a kind of regression.
3) on the third hand, the same behavior is present in the latest upstream kernel: the mount table also expands by a power of two with each mount command when run by root:

# uname -r
4.6.0-vladis
# cat /proc/mounts | wc -l
30
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
31
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
33
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
37
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
45
# tail -5 /proc/mounts
/dev/mapper/fedora_feraw-root /root ext4 rw,seclabel,relatime,data=ordered 0 0
/dev/mapper/fedora_feraw-root /mnt ext4 rw,seclabel,relatime,data=ordered 0 0
/dev/mapper/fedora_feraw-root /root ext4 rw,seclabel,relatime,data=ordered 0 0
/dev/mapper/fedora_feraw-root /mnt ext4 rw,seclabel,relatime,data=ordered 0 0
/dev/mapper/fedora_feraw-root /root ext4 rw,seclabel,relatime,data=ordered 0 0

and again, just one entry is added per mount when run inside a mount namespace entered from a non-root user:

# su - testuser
$ unshare -U -r -m
# cat /proc/mounts | wc -l
45
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
46
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
47
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
48
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
49
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
50
# tail -6 /proc/mounts
/dev/mapper/fedora_feraw-root /root ext4 rw,seclabel,relatime,data=ordered 0 0
/dev/mapper/fedora_feraw-root /mnt ext4 rw,seclabel,relatime,data=ordered 0 0
/dev/mapper/fedora_feraw-root /mnt ext4 rw,seclabel,relatime,data=ordered 0 0
/dev/mapper/fedora_feraw-root /mnt ext4 rw,seclabel,relatime,data=ordered 0 0
/dev/mapper/fedora_feraw-root /mnt ext4 rw,seclabel,relatime,data=ordered 0 0
/dev/mapper/fedora_feraw-root /mnt ext4 rw,seclabel,relatime,data=ordered 0 0

so i do not think this exponential growth of the mount table is easily fixable in rhel-7, as (afaiu) we first need to fix it upstream (and get Linus to accept it).
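The contrast between the two sessions above comes down to mount propagation: under the shared propagation that systemd sets up on the host, each bind is replicated into every peer of /mnt and /root, while inside a fresh mount namespace (whose mounts are private) each bind adds exactly one entry. A toy shell model of the two counting behaviors (an illustration only, not a reproducer; the starting count of 29 is taken from the rhel-7 session above):

```shell
# shared: each bind also propagates into all existing peer copies -> doubling
# private: each bind adds a single mount table entry -> linear
shared=29; peers=1
private=29
for i in 1 2 3 4 5; do
  shared=$(( shared + peers ))   # new entries = current peer count
  peers=$(( peers * 2 ))         # every bind doubles the peer group
  private=$(( private + 1 ))     # one entry per bind, no propagation
done
echo "shared after 5 binds:  $shared"   # 60 -- matches the root session
echo "private after 5 binds: $private"  # 34 -- linear, as inside unshare -m
```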
This bug now revolves around the user namespace. http://seclists.org/oss-sec/2016/q3/75

First of all, this patch alone from Al Viro helps significantly with the soft lockup/NMI watchdog under a huge mount table:

=================
On Thu, Jul 14, 2016 at 05:11:13PM -0400, CAI Qian wrote:
> Tested it on this large memory machine. consumed 1.5G memory to create 8388640
> entries in the mount table. Immediately afterwards, NMI watchdog/soft-lockup
> kicked in and the kernel is dead.

Cute... Doesn't look like an OOM, though - more like hash chains growing long enough for hash insertions to happen while we'd been searching. Which triggers repeated lookup, etc. There's your livelock...

Actually, it's not even hash insertions; clone_mnt() does

	lock_mount_hash();
	list_add_tail(&mnt->mnt_instance, &sb->s_mounts);
	unlock_mount_hash();

and _that_ is an overkill. We don't need write_seqlock(&mount_lock) for that; read_seqlock_excl() is enough. No need to bump the count on that one. I suspect that this alone will make the situation with livelocks much better; it's not the only place where we would be fine with read_seqlock_excl(), but it's the easiest source of rapid bumps. Won't do anything about OOM, obviously...

diff --git a/fs/namespace.c b/fs/namespace.c
index 419f746..5ced8e3 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1013,9 +1013,9 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
 	mnt->mnt.mnt_root = dget(root);
 	mnt->mnt_mountpoint = mnt->mnt.mnt_root;
 	mnt->mnt_parent = mnt;
-	lock_mount_hash();
+	read_seqlock_excl(&mount_lock);
 	list_add_tail(&mnt->mnt_instance, &sb->s_mounts);
-	unlock_mount_hash();
+	read_sequnlock_excl(&mount_lock);
 	if ((flag & CL_SLAVE) ||
 	    ((flag & CL_SHARED_TO_SLAVE) && IS_MNT_SHARED(old))) {
===================

The patch does seem to improve the system response time a lot once the mount count is over 8 million. Although it still triggered the NMI/soft lockup during bind mount, the system seems to run fine afterwards.
No significant deadlock anymore. As the mount table grows bigger, those NMI/soft lockups happen more often. Eventually, the NMI/soft lockup runs in a loop and the CPU can no longer process anything once 9G of memory is consumed by those bind mounts.
Secondly, we might need something like the patch below, which limits the number of mounts per mount namespace, so the admin can set an upper limit. https://lists.linuxfoundation.org/pipermail/containers/2016-July/037216.html
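For reference, the proposal linked above was later merged upstream as a per-mount-namespace limit exposed via the `fs.mount-max` sysctl (default 100000). Assuming a kernel that carries that patch, the admin-facing knob looks roughly like this (a configuration sketch; the limit value is arbitrary):

```shell
# Check the current per-namespace mount limit (kernels with the mount-max patch)
sysctl fs.mount-max
cat /proc/sys/fs/mount-max

# Lower the cap so a runaway bind-mount loop fails with ENOSPC
# long before it can exhaust memory (50000 chosen for illustration)
sysctl -w fs.mount-max=50000
```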
hello. vladis from prodsec is here again, analyzing the newly suggested exploit possibilities.

1) exploit described in http://seclists.org/oss-sec/2016/q3/65

> # docker run -it -v /mnt/:/mnt/:shared --cap-add=SYS_ADMIN rhel7 /bin/bash
> (inside container) # for i in `seq 1 20`; do mount -o bind /mnt/1 /mnt/2; done

still requires a privileged user giving the container --cap-add=SYS_ADMIN. as before, a privileged user has much simpler ways to do bad things to a system.

2) exploit with user namespaces described in http://seclists.org/oss-sec/2016/q3/75

> $ unshare -r -m --propagation shared
> # for i in `seq 1 30`; do mount -o bind ~/src/ ~/dst/; done

testing this on the latest RHEL-7.3-20160825.1-Server (beta) still shows that unprivileged mount namespaces are not enabled in rhel-7.3:

> $ uname -r
> 3.10.0-495.el7.x86_64
>
> $ unshare -r -m
> unshare: unshare failed: Operation not permitted
>
> $ unshare -r -m --propagation shared
> unshare: unshare failed: Operation not permitted

making the described exploit impossible. so, under the current constraints of rhel-7, it is probably unexploitable. yes, it is still exploitable on fedora/upstream kernels; bz1356472 tracks that.

3) still, there was a discussion about whether we can be sure that unprivileged mount namespaces will remain disabled in future rhel-7. containers/docker presses quite hard to get these features in, so we cannot be sure. it was therefore suggested to keep the y-stream rhel-7 tracker bz1322495 (this one) and probably cc: Eric Biederman. if this is solved, the tracker bzs for rhel-7/kernel-rt, mrg-2/realtime-kernel and rhel-7/arm-kernel should be created too.
per Andrew Vagin: https://lkml.org/lkml/2016/8/28/269
Patch(es) committed on kernel repository and an interim kernel build is undergoing testing
Patch(es) available on kernel-3.10.0-644.el7
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:1842