Bug 1322495 - CVE-2016-6213 kernel: user namespace: unlimited consumption of kernel mount resources [rhel-7.4]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel
Version: 7.4
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Aristeu Rozanski
QA Contact: Wang Shu
URL:
Whiteboard:
Depends On:
Blocks: CVE-2016-6213
 
Reported: 2016-03-30 15:01 UTC by Qian Cai
Modified: 2017-08-02 00:35 UTC
CC List: 7 users

Fixed In Version: kernel-3.10.0-644.el7
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-08-01 20:07:33 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:1842 normal SHIPPED_LIVE Important: kernel security, bug fix, and enhancement update 2017-08-01 18:22:09 UTC

Description Qian Cai 2016-03-30 15:01:15 UTC
Description of problem:
Nested bind mounts start to trigger kernel soft lockups, and eventually a deadlock, starting from 20 iterations, during either mount or umount.

# for i in `seq 1 20`; do mount -o bind /root/ /mnt/; done
# for i in `seq 1 20`; do umount /mnt/; done

[  361.301885] NMI backtrace for cpu 0
[  361.302352] CPU: 0 PID: 29 Comm: kworker/0:1 Not tainted 3.10.0-327.10.1.el7.x86_64 #1
[  361.303062] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
[  361.303882] Workqueue: events qxl_fb_work [qxl]
[  361.304379] task: ffff88013943d080 ti: ffff8801395ec000 task.ti: ffff8801395ec000
[  361.305057] RIP: 0010:[<ffffffff811e0b33>]  [<ffffffff811e0b33>] prune_super+0x23/0x170
[  361.305784] RSP: 0000:ffff8801395ef5c0  EFLAGS: 00000206
[  361.306318] RAX: 0000000000000080 RBX: ffff8801394243b0 RCX: 0000000000000000
[  361.306973] RDX: 0000000000000000 RSI: ffff8801395ef710 RDI: ffff8801394243b0
[  361.307628] RBP: ffff8801395ef5e8 R08: 0000000000000000 R09: 0000000000000040
[  361.308282] R10: 0000000000000000 R11: 0000000000000220 R12: ffff8801395ef710
[  361.308939] R13: ffff880139424000 R14: ffff8801395ef710 R15: 0000000000000000
[  361.309599] FS:  0000000000000000(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000
[  361.310320] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  361.310893] CR2: 00007fea0c55dc3d CR3: 00000000b770d000 CR4: 00000000003406f0
[  361.311561] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  361.312227] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  361.312894] Stack:
[  361.313223]  0000000000000400 ffff8801395ef710 ffff8801394243b0 0000000000000258
[  361.313950]  0000000000000000 ffff8801395ef688 ffffffff8117c46b 0000000000000000
[  361.314858]  ffff8801395ef630 ffffffff811d5e21 ffff8801395ef720 0000000000000036
[  361.315615] Call Trace:
[  361.315995]  [<ffffffff8117c46b>] shrink_slab+0xab/0x300
[  361.316559]  [<ffffffff811d5e21>] ? vmpressure+0x21/0x90
[  361.317114]  [<ffffffff8117f6a2>] do_try_to_free_pages+0x3c2/0x4e0
[  361.317727]  [<ffffffff8117f8bc>] try_to_free_pages+0xfc/0x180
[  361.318309]  [<ffffffff811735bd>] __alloc_pages_nodemask+0x7fd/0xb90
[  361.318927]  [<ffffffff811b4429>] alloc_pages_current+0xa9/0x170
[  361.319513]  [<ffffffff811be9ec>] new_slab+0x2ec/0x300
[  361.320045]  [<ffffffff8163220f>] __slab_alloc+0x315/0x48f
[  361.320596]  [<ffffffff811e064c>] ? get_empty_filp+0x5c/0x1a0
[  361.321161]  [<ffffffff811c0fb3>] kmem_cache_alloc+0x193/0x1d0
[  361.321735]  [<ffffffff811e064c>] ? get_empty_filp+0x5c/0x1a0
[  361.322295]  [<ffffffff811e064c>] get_empty_filp+0x5c/0x1a0
[  361.322845]  [<ffffffff811e07ae>] alloc_file+0x1e/0xf0
[  361.323362]  [<ffffffff81182773>] __shmem_file_setup+0x113/0x1f0
[  361.323940]  [<ffffffff81182860>] shmem_file_setup+0x10/0x20
[  361.324496]  [<ffffffffa039f5ab>] drm_gem_object_init+0x2b/0x40 [drm]
[  361.325103]  [<ffffffffa0422c3d>] qxl_bo_create+0x7d/0x190 [qxl]
[  361.325680]  [<ffffffffa042798c>] ? qxl_release_list_add+0x5c/0xc0 [qxl]
[  361.326299]  [<ffffffffa0424066>] qxl_alloc_bo_reserved+0x46/0xb0 [qxl]
[  361.326912]  [<ffffffffa0424fde>] qxl_image_alloc_objects+0xae/0x140 [qxl]
[  361.327544]  [<ffffffffa042556e>] qxl_draw_opaque_fb+0xce/0x3c0 [qxl]
[  361.328145]  [<ffffffffa0421ee2>] qxl_fb_dirty_flush+0x1a2/0x260 [qxl]
[  361.328754]  [<ffffffffa0421fb9>] qxl_fb_work+0x19/0x20 [qxl]
[  361.329306]  [<ffffffff8109d5db>] process_one_work+0x17b/0x470
[  361.329865]  [<ffffffff8109e3ab>] worker_thread+0x11b/0x400
[  361.330393]  [<ffffffff8109e290>] ? rescuer_thread+0x400/0x400
[  361.330940]  [<ffffffff810a5acf>] kthread+0xcf/0xe0
[  361.331419]  [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140
[  361.332005]  [<ffffffff81645998>] ret_from_fork+0x58/0x90
[  361.332513]  [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140
[  361.333093] Code: 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 49 89 f6 41 55 4c 8d af 50 fc ff ff 41 54 53 4c 8b 46 08 48 89 fb <4d> 85 c0 74 09 f6 06 80 0f 84 2f 01 00 00 48 8b 83 80 fc ff ff
[  361.335906] Kernel panic - not syncing: hung_task: blocked tasks

Version-Release number of selected component (if applicable):
kernel-3.10.0-327.10.1.el7.x86_64

How reproducible:
always
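
If the box survives a shorter run, note that each umount only peels off the
topmost mount, so the whole stack has to be unmounted in a loop; a minimal
cleanup sketch, assuming the system is still responsive:

  # peel stacked bind mounts off /mnt one at a time until none remain
  while umount /mnt/ 2>/dev/null; do :; done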

Comment 3 Dave Anderson 2016-05-09 16:21:01 UTC
This test is insane.  

Are you aware that the mount table expands by a power of two
with each mount command?  For example, on a live system, your
test results in a mount table with over 1 million entries:

  # for i in `seq 1 20`; do mount -o bind /root/ /mnt/; done
  # mount | wc -l
  1048606
  # 

If I unmount them all and then run each command manually, you
can see the progression:
  
  # mount | wc -l
  31
  # mount -o bind /root/ /mnt/
  # mount | wc -l
  32
  # mount -o bind /root/ /mnt/
  # mount | wc -l
  34
  # mount -o bind /root/ /mnt/
  # mount | wc -l
  38
  # mount -o bind /root/ /mnt/
  # mount | wc -l
  46
  # mount -o bind /root/ /mnt/
  # mount | wc -l
  62
  #
  
  # mount | tail -31
  /dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
  /dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
  /dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
  /dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
  /dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
  /dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
  /dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
  /dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
  /dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
  /dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
  /dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
  /dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
  /dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
  /dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
  /dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
  /dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
  /dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
  /dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
  /dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
  /dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
  /dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
  /dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
  /dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
  /dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
  /dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
  /dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
  /dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
  /dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
  /dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
  /dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
  /dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
  # 
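
The doubling is a consequence of shared mount propagation: systemd mounts
"/" as shared, each bind of a shared mount joins the same peer group, and
every new mount on /mnt is then replicated to the peers created by the
earlier iterations (which is why the listing alternates between /mnt and
/root).  The propagation flags can be inspected directly; a sketch,
assuming a util-linux recent enough for findmnt's PROPAGATION column:

  # "shared" mounts replicate new child mounts to all of their peers
  findmnt -o TARGET,PROPAGATION /
  findmnt -o TARGET,PROPAGATION /mnt

  # making the mountpoint private beforehand stops the replication
  mount --make-private /mnt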
  
In the supplied vmcore, the mount table contains over 4 million entries
(4194334, to be exact), which means the mount command was actually run 22
times.  For example, if I continue running the mount command on my live
system from the "seq 1 20" point shown above:
  
  # mount -o bind /root/ /mnt/
  # mount | wc -l
  2097182
  # mount -o bind /root/ /mnt/
  # mount | wc -l
  4194334
  #
  
Here is the beginning of the 4 million mount table entries in the vmcore:
  
  crash> mount
       MOUNT           SUPERBLK     TYPE   DEVNAME   DIRNAME
  ffff88013b026100 ffff880139a68800 rootfs rootfs    /         
  ffff88013568ac00 ffff880132dbb000 sysfs  sysfs     /sys      
  ffff88013568a600 ffff880139a6b000 proc   proc      /proc     
  ffff88013568a400 ffff880139420000 devtmpfs devtmpfs /dev      
  ffff88013568a300 ffff880132eb0000 securityfs securityfs /sys/kernel/security
  ffff880136aec100 ffff880132dbb800 tmpfs  tmpfs     /dev/shm  
  ffff880136aec200 ffff880139423800 devpts devpts    /dev/pts  
  ffff880136aec300 ffff880132dbc000 tmpfs  tmpfs     /run      
  ffff880136aec400 ffff880132dbc800 tmpfs  tmpfs     /sys/fs/cgroup
  ffff880136aec500 ffff880132dbd000 cgroup cgroup    /sys/fs/cgroup/systemd
  ffff880136aec600 ffff880132dbd800 pstore pstore    /sys/fs/pstore
  ffff880132e5c700 ffff880132dbe800 cgroup cgroup    /sys/fs/cgroup/cpu,cpuacct
  ffff880132e5c800 ffff880132dbe000 cgroup cgroup    /sys/fs/cgroup/perf_event
  ffff880132e5c900 ffff880132dbf000 cgroup cgroup    /sys/fs/cgroup/freezer
  ffff880132e5ca00 ffff880132dbf800 cgroup cgroup    /sys/fs/cgroup/memory
  ffff880132e5cb00 ffff880132e20000 cgroup cgroup    /sys/fs/cgroup/hugetlb
  ffff880132e5cc00 ffff880132e20800 cgroup cgroup    /sys/fs/cgroup/devices
  ffff880132e5cd00 ffff880132e21000 cgroup cgroup    /sys/fs/cgroup/net_cls
  ffff880132e5ce00 ffff880132e21800 cgroup cgroup    /sys/fs/cgroup/blkio
  ffff880132e5cf00 ffff880132e22000 cgroup cgroup    /sys/fs/cgroup/cpuset
  ffff88003544ed00 ffff88003565d800 configfs configfs /sys/kernel/config
  ffff8801356af900 ffff880034f0b000 xfs    /dev/sda1 /         
  ffff88003542c800 ffff88003565a800 rpc_pipefs rpc_pipefs /var/lib/nfs/rpc_pipefs
  ffff8800357a1d00 ffff880139425000 selinuxfs selinuxfs /sys/fs/selinux
  ffff880034b0a000 ffff880034848800 autofs systemd-1 /proc/sys/fs/binfmt_misc
  ffff8800356d6100 ffff880139424800 mqueue mqueue    /dev/mqueue
  ffff8800354f5d00 ffff880034849000 hugetlbfs hugetlbfs /dev/hugepages
  ffff8801397a0f00 ffff880139a6e000 debugfs debugfs  /sys/kernel/debug
  ffff880132f62700 ffff880034f0c800 nfsd   nfsd      /proc/fs/nfsd
  ffff8800b9c3ac00 ffff880034f0b000 xfs    /dev/sda1 /var/lib/docker/devicemapper
  ffff8800356d8100 ffff8800b1168800 tmpfs  tmpfs     /run/user/1000
  ffff8800348dc600 ffff880034f0b000 xfs    /dev/sda1 /tmp      
  ffff8800bb6a3d00 ffff880034f0b000 xfs    /dev/sda1 /tmp/root 
  ffff8800bb6a3800 ffff880034f0b000 xfs    /dev/sda1 /root     
  ffff8800a7c6fe00 ffff880034f0b000 xfs    /dev/sda1 /tmp/root 
  ffff8800354f5b00 ffff880034f0b000 xfs    /dev/sda1 /root     
  ffff8800354f5600 ffff880034f0b000 xfs    /dev/sda1 /tmp/root 
  ffff8800b74a9200 ffff880034f0b000 xfs    /dev/sda1 /root     
  ffff880034b13800 ffff880034f0b000 xfs    /dev/sda1 /tmp/root 
  ffff880034b13400 ffff880034f0b000 xfs    /dev/sda1 /root     
  ffff880034b13500 ffff880034f0b000 xfs    /dev/sda1 /tmp/root 
  ffff8800349a3b00 ffff880034f0b000 xfs    /dev/sda1 /root     
  ffff8800349a3800 ffff880034f0b000 xfs    /dev/sda1 /tmp/root 
  ffff8800349a3400 ffff880034f0b000 xfs    /dev/sda1 /root     
  ffff8800b74c6c00 ffff880034f0b000 xfs    /dev/sda1 /tmp/root 
  ffff8800b74c6100 ffff880034f0b000 xfs    /dev/sda1 /root     
  ffff88013573e000 ffff880034f0b000 xfs    /dev/sda1 /tmp/root 
  ffff88013573e100 ffff880034f0b000 xfs    /dev/sda1 /root     
  ffff88013573e300 ffff880034f0b000 xfs    /dev/sda1 /tmp/root 
  ffff8800b9dfb800 ffff880034f0b000 xfs    /dev/sda1 /root     
  ffff8800b9dfb000 ffff880034f0b000 xfs    /dev/sda1 /tmp/root 
  ffff8800b9dfb600 ffff880034f0b000 xfs    /dev/sda1 /root     
  ffff8800b9dfbd00 ffff880034f0b000 xfs    /dev/sda1 /tmp/root 
  ffff8800b9dfbe00 ffff880034f0b000 xfs    /dev/sda1 /root     
  ffff8800b9dfb300 ffff880034f0b000 xfs    /dev/sda1 /tmp/root 
  ffff8800354d1500 ffff880034f0b000 xfs    /dev/sda1 /root     
  ffff8800354d1700 ffff880034f0b000 xfs    /dev/sda1 /tmp/root 
  ... 

The test system has 2 CPUs and ~4GB of memory.  The mounts cause the
slab cache to balloon, consuming 86% of memory:
  
  crash> kmem -i
                   PAGES        TOTAL      PERCENTAGE
      TOTAL MEM   970581       3.7 GB         ----
           FREE    21617      84.4 MB    2% of TOTAL MEM
           USED   948964       3.6 GB   97% of TOTAL MEM
         SHARED      705       2.8 MB    0% of TOTAL MEM
        BUFFERS        0            0    0% of TOTAL MEM
         CACHED     4201      16.4 MB    0% of TOTAL MEM
           SLAB   841799       3.2 GB   86% of TOTAL MEM
  
     TOTAL SWAP        0            0         ----
      SWAP USED        0            0  100% of TOTAL SWAP
      SWAP FREE        0            0    0% of TOTAL SWAP
  
   COMMIT LIMIT   485290       1.9 GB         ----
      COMMITTED    97824     382.1 MB   20% of TOTAL LIMIT
  crash>
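
The same pressure can be watched building up on a live system while the
reproducer runs; a sketch (mnt_cache being the slab cache that backs
struct mount on these kernels):

  # track mount-table size, total slab usage, and the mount slab itself
  watch -n1 'wc -l < /proc/mounts; grep Slab: /proc/meminfo; grep ^mnt_cache /proc/slabinfo'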

And the system is desperately trying to free memory: the runnable
tasks on the system are all trying to allocate memory,
15 of them have ended up in shrink_slab(), and they have called
cond_resched() because they cannot free any memory:
  
  crash> foreach RU bt | grep -e PID -e shrink_slab 
  PID: 0      TASK: ffffffff81951440  CPU: 0   COMMAND: "swapper/0"
  PID: 0      TASK: ffff880139b62280  CPU: 1   COMMAND: "swapper/1"
  PID: 29     TASK: ffff88013943d080  CPU: 0   COMMAND: "kworker/0:1"
   #5 [ffff8801395ef5f0] shrink_slab at ffffffff8117c46b
  PID: 34     TASK: ffff880132ce0b80  CPU: 1   COMMAND: "khungtaskd"
  PID: 35     TASK: ffff880132ce1700  CPU: 1   COMMAND: "kswapd0"
   #5 [ffff880132cebca8] shrink_slab at ffffffff8117c46b
  PID: 323    TASK: ffff8800356e1700  CPU: 1   COMMAND: "kworker/1:2"
   #5 [ffff88013283fa48] shrink_slab at ffffffff8117c46b
  PID: 414    TASK: ffff88003494ae00  CPU: 1   COMMAND: "systemd-journal"
   #3 [ffff880135bb3980] shrink_slab at ffffffff8117c59c
  PID: 476    TASK: ffff8800bb602e00  CPU: 0   COMMAND: "auditd"
   #3 [ffff8800bb64f980] shrink_slab at ffffffff8117c59c
  PID: 563    TASK: ffff8800bb607300  CPU: 0   COMMAND: "systemd-logind"
   #3 [ffff8800b9c9f980] shrink_slab at ffffffff8117c59c
  PID: 567    TASK: ffff880132ebae00  CPU: 0   COMMAND: "NetworkManager"
   #3 [ffff8800b9d2b980] shrink_slab at ffffffff8117c59c
  PID: 579    TASK: ffff880138205c00  CPU: 1   COMMAND: "gssproxy"
   #3 [ffff8800b9ca3980] shrink_slab at ffffffff8117c59c
  PID: 581    TASK: ffff880138205080  CPU: 0   COMMAND: "irqbalance"
   #3 [ffff8801397bf980] shrink_slab at ffffffff8117c59c
  PID: 610    TASK: ffff880132f35080  CPU: 1   COMMAND: "crond"
   #3 [ffff8800b9def980] shrink_slab at ffffffff8117c59c
  PID: 1135   TASK: ffff8800b9e32e00  CPU: 1   COMMAND: "tuned"
   #3 [ffff880134593980] shrink_slab at ffffffff8117c59c
  PID: 1495   TASK: ffff8800b760c500  CPU: 1   COMMAND: "master"
   #3 [ffff88003497f980] shrink_slab at ffffffff8117c59c
  PID: 1511   TASK: ffff8800bb5b2280  CPU: 0   COMMAND: "pickup"
  PID: 1512   TASK: ffff8800bb5b0b80  CPU: 1   COMMAND: "qmgr"
   #3 [ffff880034c07980] shrink_slab at ffffffff8117c59c
  PID: 2041   TASK: ffff88013544a280  CPU: 1   COMMAND: "docker"
  PID: 2069   TASK: ffff8800b771dc00  CPU: 0   COMMAND: "docker"
   #3 [ffff8801352c3980] shrink_slab at ffffffff8117c59c
  PID: 2217   TASK: ffff8800b7609700  CPU: 0   COMMAND: "sshd"
   #3 [ffff8800b76d3980] shrink_slab at ffffffff8117c59c
  crash> 
  
There is no "fix" for a root user who is abusing the system in
this manner.  There's no way the system can recover because there
is no memory to be reclaimed/allocated.   And as expected, the same
thing happens on an upstream kernel as well, so it's not even a
"rhel-7.3" issue.

In my opinion, this BZ should be CLOSED/NOTABUG.

Comment 6 Vladis Dronov 2016-05-23 17:18:04 UTC
Hello, Vladis from prodsec here.

1) Testing this potential flaw from a security point of view on rhel-7: the mount table indeed expands by a power of two with each mount command when run by root:

# uname -r
3.10.0-327.18.2.el7.x86_64

# cat /proc/cmdline 
BOOT_IMAGE=/vmlinuz-3.10.0-327.18.2.el7.x86_64 ... namespace.enable=1

# cat /proc/mounts |wc -l
29
# mount -o bind /root/ /mnt/ && cat /proc/mounts |wc -l
30
# mount -o bind /root/ /mnt/ && cat /proc/mounts |wc -l
32
# mount -o bind /root/ /mnt/ && cat /proc/mounts |wc -l
36
# mount -o bind /root/ /mnt/ && cat /proc/mounts |wc -l
44
# mount -o bind /root/ /mnt/ && cat /proc/mounts |wc -l
60
# tail -6 /proc/mounts 
/dev/mapper/rhel_rhel7-root /mnt xfs rw,relatime,attr2,inode64,noquota 0 0
/dev/mapper/rhel_rhel7-root /root xfs rw,relatime,attr2,inode64,noquota 0 0
/dev/mapper/rhel_rhel7-root /mnt xfs rw,relatime,attr2,inode64,noquota 0 0
/dev/mapper/rhel_rhel7-root /root xfs rw,relatime,attr2,inode64,noquota 0 0
/dev/mapper/rhel_rhel7-root /mnt xfs rw,relatime,attr2,inode64,noquota 0 0
/dev/mapper/rhel_rhel7-root /root xfs rw,relatime,attr2,inode64,noquota 0 0

but just one entry is added per mount when run inside root's unshared mount namespace:

# unshare -m
# cat /proc/mounts | wc -l
60
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
61
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
62
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
63
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
64
# tail -6 /proc/mounts 
/dev/mapper/rhel_rhel7-root /root xfs rw,relatime,attr2,inode64,noquota 0 0
/dev/mapper/rhel_rhel7-root /root xfs rw,relatime,attr2,inode64,noquota 0 0
/dev/mapper/rhel_rhel7-root /mnt xfs rw,relatime,attr2,inode64,noquota 0 0
/dev/mapper/rhel_rhel7-root /mnt xfs rw,relatime,attr2,inode64,noquota 0 0
/dev/mapper/rhel_rhel7-root /mnt xfs rw,relatime,attr2,inode64,noquota 0 0
/dev/mapper/rhel_rhel7-root /mnt xfs rw,relatime,attr2,inode64,noquota 0 0
# logout
# cat /proc/mounts | wc -l
60
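
The difference is propagation again: the single-entry growth shows the
mounts copied into the new namespace are no longer shared (a modern
util-linux unshare makes the copied tree private by default).  Where the
--propagation option is available, the shared behavior can be brought
back inside the namespace explicitly; a sketch, with the option itself
being an assumption about the installed unshare version:

  # default: private propagation, one mount-table entry per mount
  unshare -m

  # opt back in to shared propagation to get the doubling inside
  # the namespace as well
  unshare -m --propagation shared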

A mount namespace, or a mount+user namespace, cannot be entered by a non-root user, and a user namespace entered by a non-root user does not allow mounting:

# su - testuser

$ unshare -m
unshare: unshare failed: Operation not permitted

$ unshare -U -r -m
unshare: unshare failed: Operation not permitted

$ unshare -U -r
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
mount: permission denied
# logout

So exponential mount table growth is possible for root only; user/mount namespaces do not allow entering or mounting, and so this is not a security flaw (a root user has many simpler ways to destroy the system).

2) On the other hand, this behavior is not reproducible on rhel-6:

# uname -r
2.6.32-573.22.1.el6.x86_64
# cat /proc/mounts | wc -l
10
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
11
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
12
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
13
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
14
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
15
# tail /proc/mounts
tmpfs /dev/shm tmpfs rw,relatime 0 0
/dev/vda3 / ext4 rw,relatime,barrier=1,data=ordered 0 0
/proc/bus/usb /proc/bus/usb usbfs rw,relatime 0 0
/dev/vda1 /boot ext4 rw,relatime,barrier=1,data=ordered 0 0
none /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0
/dev/vda3 /mnt ext4 rw,relatime,barrier=1,data=ordered 0 0
/dev/vda3 /mnt ext4 rw,relatime,barrier=1,data=ordered 0 0
/dev/vda3 /mnt ext4 rw,relatime,barrier=1,data=ordered 0 0
/dev/vda3 /mnt ext4 rw,relatime,barrier=1,data=ordered 0 0
/dev/vda3 /mnt ext4 rw,relatime,barrier=1,data=ordered 0 0

So in some sense this exponential growth of the mount table is a kind of regression.

3) On the third hand, the same behavior is present in the latest upstream kernel: the mount table also expands by a power of two with each mount command when run by root:

# uname -r
4.6.0-vladis

# cat /proc/mounts |wc -l
30
# mount -o bind /root/ /mnt/ && cat /proc/mounts |wc -l
31
# mount -o bind /root/ /mnt/ && cat /proc/mounts |wc -l
33
# mount -o bind /root/ /mnt/ && cat /proc/mounts |wc -l
37
# mount -o bind /root/ /mnt/ && cat /proc/mounts |wc -l
45
# tail -5 /proc/mounts
/dev/mapper/fedora_feraw-root /root ext4 rw,seclabel,relatime,data=ordered 0 0
/dev/mapper/fedora_feraw-root /mnt ext4 rw,seclabel,relatime,data=ordered 0 0
/dev/mapper/fedora_feraw-root /root ext4 rw,seclabel,relatime,data=ordered 0 0
/dev/mapper/fedora_feraw-root /mnt ext4 rw,seclabel,relatime,data=ordered 0 0
/dev/mapper/fedora_feraw-root /root ext4 rw,seclabel,relatime,data=ordered 0 0 

And again, just one entry is added per mount when run inside a mount namespace entered by a non-root user:

# su - testuser
$ unshare -U -r -m
# cat /proc/mounts |wc -l
45
# mount -o bind /root/ /mnt/ && cat /proc/mounts |wc -l
46
# mount -o bind /root/ /mnt/ && cat /proc/mounts |wc -l
47
# mount -o bind /root/ /mnt/ && cat /proc/mounts |wc -l
48
# mount -o bind /root/ /mnt/ && cat /proc/mounts |wc -l
49
# mount -o bind /root/ /mnt/ && cat /proc/mounts |wc -l
50
# tail -6 /proc/mounts
/dev/mapper/fedora_feraw-root /root ext4 rw,seclabel,relatime,data=ordered 0 0
/dev/mapper/fedora_feraw-root /mnt ext4 rw,seclabel,relatime,data=ordered 0 0
/dev/mapper/fedora_feraw-root /mnt ext4 rw,seclabel,relatime,data=ordered 0 0
/dev/mapper/fedora_feraw-root /mnt ext4 rw,seclabel,relatime,data=ordered 0 0
/dev/mapper/fedora_feraw-root /mnt ext4 rw,seclabel,relatime,data=ordered 0 0
/dev/mapper/fedora_feraw-root /mnt ext4 rw,seclabel,relatime,data=ordered 0 0

So, I do not think this exponential growth of the mount table is easily fixable in rhel-7, as (afaiu) we first need to fix this upstream (and get them [Linus] to accept it).

Comment 8 Qian Cai 2016-08-02 14:42:57 UTC
This bug now revolves around user namespaces.
http://seclists.org/oss-sec/2016/q3/75

First of all, this patch alone from Al Viro helps significantly with the soft-lockup/NMI watchdog problems seen with very high mount table entry counts.

=================
On Thu, Jul 14, 2016 at 05:11:13PM -0400, CAI Qian wrote:
> Tested it on this large memory machine. consumed 1.5G memory to create 8388640
> entries in the mount table. Immediately afterwards, NMI watchdog/soft-lockup
> kicked in and the kernel is dead.

Cute...  Doesn't look like an OOM, though - more like hash chains growing long
enough for hash insertions to happen while we'd been searching.  Which
triggers repeated lookup, etc.  There's your livelock...

Actually, it's not even hash insertions; clone_mnt() does
        lock_mount_hash();
        list_add_tail(&mnt->mnt_instance, &sb->s_mounts);
        unlock_mount_hash();
and _that_ is an overkill.  We don't need write_seqlock(&mount_lock) for
that; read_seqlock_excl() is enough.  No need to bump the count on that
one.  I suspect that this alone will make the situation with livelocks
much better; it's not the only place where we would be fine with
read_seqlock_excl(), but it's the easiest source of rapid bumps.  Won't do
anything about OOM, obviously...

diff --git a/fs/namespace.c b/fs/namespace.c
index 419f746..5ced8e3 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1013,9 +1013,9 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
 	mnt->mnt.mnt_root = dget(root);
 	mnt->mnt_mountpoint = mnt->mnt.mnt_root;
 	mnt->mnt_parent = mnt;
-	lock_mount_hash();
+	read_seqlock_excl(&mount_lock);
 	list_add_tail(&mnt->mnt_instance, &sb->s_mounts);
-	unlock_mount_hash();
+	read_sequnlock_excl(&mount_lock);
 
 	if ((flag & CL_SLAVE) ||
 	    ((flag & CL_SHARED_TO_SLAVE) && IS_MNT_SHARED(old))) {
===================

The patch does seem to improve the system response time a lot once the mount count is over 8 million. Although it still triggered NMI/soft-lockup warnings during the bind mounts, the system seems to run fine afterwards; there is no significant deadlock anymore. As the mount table grows bigger, those NMI/soft-lockup warnings happen more often. Eventually, the soft lockups run in a loop and the CPU is not able to process anything else, by which point 9G of memory is consumed by those bind mounts.

Comment 9 Qian Cai 2016-08-02 14:46:48 UTC
Secondly, we might need something like the proposal below to limit the number of mount namespaces, so that an admin can set an upper limit; the knobs that eventually landed are sketched after the link.

https://lists.linuxfoundation.org/pipermail/containers/2016-July/037216.html
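
For reference, the mitigation that eventually landed upstream (and which
the kernel-3.10.0-644.el7 fix corresponds to) is a per-mount-namespace
cap on the number of mounts, plus per-user-namespace caps on namespace
creation.  A quick look at the knobs; the sysctl names follow the
upstream implementation and are assumed to match the RHEL backport:

  # per-mount-namespace limit on the number of mounts ("mnt: Add a per
  # mount namespace limit on the number of mounts", default 100000);
  # once the cap is hit, further mounts fail instead of eating memory
  cat /proc/sys/fs/mount-max
  sysctl -w fs.mount-max=50000

  # per-user-namespace limit on how many mount namespaces may be created
  cat /proc/sys/user/max_mount_namespaces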

Comment 10 Vladis Dronov 2016-08-30 13:04:27 UTC
Hello, Vladis from prodsec here again, analyzing the newly suggested exploit possibilities.

1) exploit described in http://seclists.org/oss-sec/2016/q3/65

> # docker run -it -v /mnt/:/mnt/:shared --cap-add=SYS_ADMIN rhel7 /bin/bash
> (inside container) # for i in `seq 1 20`; do mount -o bind /mnt/1 /mnt/2; done

This still requires a privileged user giving the container --cap-add=SYS_ADMIN; as before, a privileged user has much simpler ways to do bad things to a system.

2) exploit with user namespaces described in http://seclists.org/oss-sec/2016/q3/75

> $ unshare -r -m --propagation shared
> # for i in `seq 1 30`; do mount -o bind ~/src/ ~/dst/; done

Testing this on the latest RHEL-7.3-20160825.1-Server (beta) still shows that unprivileged mount namespaces are not enabled in rhel-7.3:

> $ uname -r
> 3.10.0-495.el7.x86_64
> 
> $ unshare -r -m
> unshare: unshare failed: Operation not permitted
> 
> $ unshare -r -m --propagation shared
> unshare: unshare failed: Operation not permitted

This makes the described exploit impossible, so under the current constraints of rhel-7 it is probably unexploitable. Yes, it is still exploitable on fedora/upstream kernels; bz1356472 covers that.
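
For completeness, whether a given RHEL 7 build permits unprivileged user
namespaces can be checked directly; a sketch, where the
namespace.unpriv_enable boot parameter and the user.max_user_namespaces
sysctl are the gates used by later RHEL 7 kernels (treat the names as
assumptions for other builds):

  # unprivileged user namespaces need both the boot parameter and a
  # non-zero sysctl value
  grep -o 'namespace.unpriv_enable=[^ ]*' /proc/cmdline
  cat /proc/sys/user/max_user_namespaces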

3) Still, there was a discussion about whether we can be sure that unprivileged mount namespaces will remain disabled in future rhel-7. The containers/docker side presses quite hard to get these features in, so we cannot be sure. It was therefore suggested to keep the y-stream rhel-7 tracker bz1322495 (this one) and probably cc: Eric Biederman.

If this is solved, the tracker bzs for rhel-7/kernel-rt, mrg-2/realtime-kernel, and rhel-7/arm-kernel should be created too.

Comment 13 Vladis Dronov 2016-10-03 13:43:29 UTC
per Andrew Vagin: https://lkml.org/lkml/2016/8/28/269

Comment 17 Rafael Aquini 2017-04-03 18:39:23 UTC
Patch(es) committed on kernel repository and an interim kernel build is undergoing testing

Comment 19 Rafael Aquini 2017-04-07 15:49:04 UTC
Patch(es) available on kernel-3.10.0-644.el7

Comment 25 errata-xmlrpc 2017-08-01 20:07:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:1842


