1557332 – mesa: Missing allocation failure checks in i965_dri.so

Keywords:
Status:	CLOSED EOL
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	mesa
Sub Component:
Version:	27
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Adam Jackson
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2018-03-16 12:44 UTC by Florian Weimer
Modified:	2018-11-30 22:18 UTC
CC List:	13 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2018-11-30 22:18:45 UTC
Type:	Bug
Embargoed:
Dependent Products:

Description Florian Weimer 2018-03-16 12:44:38 UTC

On my GNOME system, a bunch of functions call brw_bo_map with NULL for the bo argument.

intel_batchbuffer.c contains this code in intel_batchbuffer_reset:

   batch->batch.bo = brw_bo_alloc(bufmgr, "batchbuffer", BATCH_SZ, 4096);
   if (!batch->batch.cpu_map) {
      batch->batch.map =
         brw_bo_map(brw, batch->batch.bo, MAP_READ | MAP_WRITE);
   }
   batch->map_next = batch->batch.map;

Here the allocation apparently failed and batch->batch.bo is NULL.
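
A minimal sketch of the kind of check that appears to be missing here. Everything named sketch_* below (the structs, the allocator, the map function, and the sketch_force_alloc_failure test hook) is a hypothetical stand-in so the example compiles on its own. None of it is the Mesa code, and this is not a proposed patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the driver types; NOT the Mesa definitions. */
struct sketch_bo {
   size_t size;
   void *cpu_map;
};

struct sketch_batch {
   struct sketch_bo *bo;
   void *map;
   void *map_next;
};

/* Test hook (assumption): when true, the allocator reports failure. */
bool sketch_force_alloc_failure = false;

struct sketch_bo *
sketch_bo_alloc(size_t size)
{
   if (sketch_force_alloc_failure)
      return NULL;

   struct sketch_bo *bo = calloc(1, sizeof(*bo));
   if (bo == NULL)
      return NULL;

   bo->cpu_map = malloc(size);
   if (bo->cpu_map == NULL) {
      free(bo);
      return NULL;
   }
   bo->size = size;
   return bo;
}

/* Dereferences bo, so it must never be called with NULL. */
void *
sketch_bo_map(struct sketch_bo *bo)
{
   return bo->cpu_map;
}

void
sketch_bo_free(struct sketch_bo *bo)
{
   if (bo == NULL)
      return;
   free(bo->cpu_map);
   free(bo);
}

/* Returns false instead of mapping a NULL buffer when allocation fails. */
bool
sketch_batch_reset(struct sketch_batch *batch, size_t size)
{
   batch->bo = sketch_bo_alloc(size);
   if (batch->bo == NULL) {
      batch->map = NULL;
      batch->map_next = NULL;
      return false;
   }

   batch->map = sketch_bo_map(batch->bo);
   batch->map_next = batch->map;
   return true;
}
```

A caller would then have to decide what to do with a false return (for example, flag the context as lost); that policy is outside the scope of this sketch.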

Another crash happened in the brw_bo_map call in brw_map_buffer_range:

   void *map = brw_bo_map(brw, intel_obj->buffer, access);

Here, an earlier call to alloc_buffer_object probably failed:

   intel_obj->buffer = brw_bo_alloc(brw->bufmgr, "bufferobj", size, 64);

I don't know why brw_bo_alloc fails so frequently on this machine.  I'm trying to run with vm.overcommit_memory=2, which may be why this triggers more often for me than for other users.

mesa-dri-drivers-17.3.6-1.fc27.x86_64
kernel-4.15.6-300.fc27.x86_64

Comment 1 Florian Weimer 2018-04-04 21:03:06 UTC

I had another desktop crash, but with vm.overcommit_memory=0.  This time, the screen just froze.  The log data is inconclusive as to whether it is the same issue.  I will keep running with vm.overcommit_memory=0 and see if I can get a better log the next time.

Comment 2 Rob Clark 2018-06-25 19:41:39 UTC

(In reply to Florian Weimer from comment #0)
> On my GNOME system, a bunch of functions call brw_bo_map with NULL for the
> bo argument.
> 
> intel_batchbuffer.c contains this code in intel_batchbuffer_reset:
> 
>    batch->batch.bo = brw_bo_alloc(bufmgr, "batchbuffer", BATCH_SZ, 4096);
>    if (!batch->batch.cpu_map) {
>       batch->batch.map =
>          brw_bo_map(brw, batch->batch.bo, MAP_READ | MAP_WRITE);
>    }
>    batch->map_next = batch->batch.map;
> 
> Here the allocation apparently failed and batch->batch.bo is NULL.
> 
> Another crash happened in the brw_bo_map call in brw_map_buffer_range:
> 
>    void *map = brw_bo_map(brw, intel_obj->buffer, access);
> 
> Here, an earlier call to alloc_buffer_object probably failed:
> 
>    intel_obj->buffer = brw_bo_alloc(brw->bufmgr, "bufferobj", size, 64);
> 
> I don't know why brw_bo_alloc fails so frequently on this machine.  I'm
> trying to run with vm.overcommit_memory=2, maybe this is what triggers more
> often than for other users.
> 
> mesa-dri-drivers-17.3.6-1.fc27.x86_64
> kernel-4.15.6-300.fc27.x86_64

i915 kinda by design wants to allocate all your memory and then relies on the shrinker on the kernel side to let buffers go when under memory pressure.  This might not play well with disabled/limited overcommit.

Comment 3 Rob Clark 2018-06-25 19:51:53 UTC

Some suggestions/comments from Chris Wilson on #dri-devel:

<robclark> ickle, btw, ever played w/ non-default vm.overcommit_memory settings?  https://bugzilla.redhat.com/show_bug.cgi?id=1557332
<ickle> robclark: gem uses VM_NORESERVE for its shmemfs objects, afaik they aren't accounted until actual page allocation
<ickle> I suspect that's just plain old malloc returning NULL
<robclark> hmm, could be.. but because of bo cache, I guess?
<ickle> it'll be tight, but should all be shrinkable
<ickle> keep an eye in dmesg for page allocation fails or resort to strace / drm.debug=0x1
<ickle> just to isolate an ioctl returning -ENOMEM vs malloc

The shmemfs objects are the actual GPU buffers; the malloc'd things would be the userspace struct that brw_bo_alloc() allocates (something rather small, but I guess in this case the shrinker somehow doesn't get a chance to free up enough memory for malloc to allocate another chunk of pages?).
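
To make the malloc-vs-ioctl distinction concrete, here is a rough sketch of a two-stage allocation that records which stage failed. All sketch_* names, the alloc_failure enum, and the function-pointer hooks are hypothetical and exist only so the example is self-contained; the real brw_bo_alloc() is structured differently and does not take such hooks.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Which stage of the allocation failed. */
enum alloc_failure {
   ALLOC_OK = 0,
   ALLOC_FAIL_MALLOC,  /* userspace bookkeeping struct could not be allocated */
   ALLOC_FAIL_KERNEL   /* the stand-in for the kernel call returned an error */
};

/* Hypothetical userspace handle; the real driver struct has more fields. */
struct sketch_handle {
   unsigned int gem_handle;
   size_t size;
};

/* Hooks (assumption) so both stages can be replaced by fakes.
 * do_malloc must return memory that can be released with free().
 * do_create returns 0 on success or a negative errno value. */
typedef void *(*sketch_malloc_fn)(size_t);
typedef int (*sketch_create_fn)(size_t size, unsigned int *handle_out);

struct sketch_handle *
sketch_alloc_classified(size_t size, sketch_malloc_fn do_malloc,
                        sketch_create_fn do_create,
                        enum alloc_failure *why, int *err_out)
{
   *why = ALLOC_OK;
   *err_out = 0;

   /* Stage 1: small userspace struct (plain heap allocation). */
   struct sketch_handle *h = do_malloc(sizeof(*h));
   if (h == NULL) {
      *why = ALLOC_FAIL_MALLOC;
      *err_out = ENOMEM;
      return NULL;
   }

   /* Stage 2: backing storage, normally requested from the kernel. */
   int ret = do_create(size, &h->gem_handle);
   if (ret != 0) {
      *why = ALLOC_FAIL_KERNEL;
      *err_out = -ret;
      free(h);
      return NULL;
   }

   h->size = size;
   return h;
}

/* Fakes for experimenting with each failure path. */
void *sketch_failing_malloc(size_t n) { (void)n; return NULL; }

int sketch_failing_create(size_t size, unsigned int *out)
{
   (void)size; (void)out;
   return -ENOMEM;
}

int sketch_ok_create(size_t size, unsigned int *out)
{
   (void)size;
   *out = 1;
   return 0;
}
```

Logging the `why` value at the failure site would give the same information that the strace / drm.debug suggestion above is meant to isolate.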

Comment 4 Florian Weimer 2018-06-25 20:15:34 UTC

I didn't see any kernel messages before the allocation failure.

I'm also not convinced that there would be any kind of OOM with vm.overcommit_memory=2.  Does the shrinker even run before allocations fail?

Comment 5 Florian Weimer 2018-06-25 20:16:41 UTC

Regarding VM_NORESERVE, if it is anything like MAP_NORESERVE, then it does account against the commit limit.

Comment 6 Rob Clark 2018-06-26 00:32:30 UTC

(In reply to Florian Weimer from comment #4)
> I didn't see any kernel messages before the allocation failure.
> 
> I'm also not convinced that there would be any kind of OOM with
> vm.overcommit_memory=2.  Does the shrinker even run before allocations fail?

The whole assumption behind the intel driver's approach of greedily keeping "freed" buffers around in the bo cache for re-use[1] is that the shrinker runs before allocation fails.  I can certainly see how this could fall down when vm.overcommit_memory is changed from the default, if the shrinker doesn't run before expanding the heap fails.

[1] ie. setting up and tearing down vm mappings is expensive, and to be avoided if you care about moar-fps
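
For readers unfamiliar with the idea, a toy version of a buffer cache that keeps "freed" buffers for re-use might look like the following. It is a sketch only: the sketch_* names, the fixed slot count, and the first-fit lookup are all assumptions made for illustration, and the real bo cache in the driver is organized differently. sketch_cache_purge() plays the role that the kernel shrinker plays for purgeable buffers.

```c
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_CACHE_SLOTS 8

struct sketch_buf {
   size_t size;
   void *mem;
};

struct sketch_cache {
   struct sketch_buf *slots[SKETCH_CACHE_SLOTS];
   int count;
};

/* Re-use a cached buffer of at least `size` bytes, else allocate a new one.
 * Returns NULL if a new allocation is needed and fails. */
struct sketch_buf *
sketch_cache_alloc(struct sketch_cache *c, size_t size)
{
   for (int i = 0; i < c->count; i++) {
      if (c->slots[i]->size >= size) {
         struct sketch_buf *hit = c->slots[i];
         /* Fill the hole with the last cached entry. */
         c->slots[i] = c->slots[--c->count];
         return hit;
      }
   }

   struct sketch_buf *b = malloc(sizeof(*b));
   if (b == NULL)
      return NULL;
   b->mem = malloc(size);
   if (b->mem == NULL) {
      free(b);
      return NULL;
   }
   b->size = size;
   return b;
}

/* "Free" a buffer: keep it cached if there is room, else really free it. */
void
sketch_cache_release(struct sketch_cache *c, struct sketch_buf *b)
{
   if (c->count < SKETCH_CACHE_SLOTS) {
      c->slots[c->count++] = b;
      return;
   }
   free(b->mem);
   free(b);
}

/* Drop every cached buffer; returns how many were freed. */
int
sketch_cache_purge(struct sketch_cache *c)
{
   int freed = c->count;
   for (int i = 0; i < c->count; i++) {
      free(c->slots[i]->mem);
      free(c->slots[i]);
   }
   c->count = 0;
   return freed;
}
```

The point of the toy is that memory held in the cache still counts as allocated until something calls the purge step, which is where the interaction with a strict commit limit comes from.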

Comment 7 Rob Clark 2018-06-26 17:41:36 UTC

So thinking about this a bit more, I suspect that to deal w/ overcommit limits, i915 would need to somehow update the process accounting when unused buffers are marked with MADVISE(madv=DONTNEED).  I suppose the shrinker doesn't actually run before malloc fails to expand the heap, since what is hit is an accounting limit, not the system actually running out of memory.  But when unused buffers are cached in userspace (and are candidates to be purged by the shrinker on the kernel side), they should somehow not count against overcommit limits.
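
One way to watch the accounting side of this on Linux is to compare the CommitLimit and Committed_AS fields of /proc/meminfo (the limit is only enforced with vm.overcommit_memory=2). The helper below is a small sketch for doing that; the sketch_* function names are made up for this example, and it makes no claim about how i915 buffers show up in those numbers.

```c
#include <stdio.h>
#include <string.h>

/* Find a line of the form "Key:   12345 kB" in meminfo-style text and
 * return the number, or -1 if the key is absent or malformed. */
long
sketch_meminfo_kb(const char *text, const char *key)
{
   size_t klen = strlen(key);
   const char *line = text;

   while (line != NULL && *line != '\0') {
      if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
         long value;
         if (sscanf(line + klen + 1, "%ld", &value) == 1)
            return value;
         return -1;
      }
      /* Advance to the start of the next line, if any. */
      line = strchr(line, '\n');
      if (line != NULL)
         line++;
   }
   return -1;
}

/* Computes CommitLimit - Committed_AS in kB (may be negative).
 * Returns 0 on success, -1 if either field is missing. */
int
sketch_commit_headroom_kb(const char *text, long *headroom_out)
{
   long limit = sketch_meminfo_kb(text, "CommitLimit");
   long used = sketch_meminfo_kb(text, "Committed_AS");

   if (limit < 0 || used < 0)
      return -1;
   *headroom_out = limit - used;
   return 0;
}

/* Reads /proc/meminfo into buf (Linux only); returns 0 on success. */
int
sketch_read_meminfo(char *buf, size_t buflen)
{
   FILE *f = fopen("/proc/meminfo", "r");
   if (f == NULL)
      return -1;
   size_t n = fread(buf, 1, buflen - 1, f);
   fclose(f);
   buf[n] = '\0';
   return 0;
}
```

Sampling the headroom shortly before and after a failed allocation would help tell a commit-limit failure apart from genuine memory exhaustion.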

Comment 8 Ben Cotton 2018-11-27 14:52:27 UTC

This message is a reminder that Fedora 27 is nearing its end of life.
On 2018-Nov-30 Fedora will stop maintaining and issuing updates for
Fedora 27. It is Fedora's policy to close all bug reports from releases
that are no longer maintained. At that time this bug will be closed as
EOL if it remains open with a Fedora 'version' of '27'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 27 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 9 Ben Cotton 2018-11-30 22:18:45 UTC

Fedora 27 changed to end-of-life (EOL) status on 2018-11-30. Fedora 27 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.