On my GNOME system, a bunch of functions call brw_bo_map with NULL for the
bo argument.

intel_batchbuffer.c contains this code in intel_batchbuffer_reset:

176    batch->batch.bo = brw_bo_alloc(bufmgr, "batchbuffer", BATCH_SZ, 4096);
177    if (!batch->batch.cpu_map) {
178       batch->batch.map =
179          brw_bo_map(brw, batch->batch.bo, MAP_READ | MAP_WRITE);
180    }
181    batch->map_next = batch->batch.map;

Here the allocation apparently failed and batch->batch.bo is NULL.

Another crash happened in the brw_bo_map call in brw_map_buffer_range:

471    void *map = brw_bo_map(brw, intel_obj->buffer, access);

Here, an earlier call to alloc_buffer_object probably failed:

 99    intel_obj->buffer = brw_bo_alloc(brw->bufmgr, "bufferobj", size, 64);

I don't know why brw_bo_alloc fails so frequently on this machine. I'm
trying to run with vm.overcommit_memory=2; maybe this is why it triggers
more often for me than for other users.

mesa-dri-drivers-17.3.6-1.fc27.x86_64
kernel-4.15.6-300.fc27.x86_64
I had another desktop crash, but with vm.overcommit_memory=0. This time the
screen just froze. The log data is inconclusive as to whether it is the same
issue. I will keep running with vm.overcommit_memory=0 and see if I can get a
better log next time.
(In reply to Florian Weimer from comment #0)
> I don't know why brw_bo_alloc fails so frequently on this machine. I'm
> trying to run with vm.overcommit_memory=2, maybe this is what triggers more
> often than for other users.

i915, somewhat by design, wants to allocate all your memory and then use the
shrinker on the kernel side to let buffers go when under memory pressure.
This might not play well with disabled/limited overcommit.
Some suggestions/comments from Chris Wilson on #dri-devel:

<robclark> ickle, btw, ever played w/ non-default vm.overcommit_memory settings? https://bugzilla.redhat.com/show_bug.cgi?id=1557332
<ickle> robclark: gem uses VM_NORESERVE for its shmemfs objects, afaik they aren't accounted until actual page allocation
<ickle> I suspect that's just plain old malloc returning NULL
<robclark> hmm, could be.. but because of bo cache, I guess?
<ickle> it'll be tight, but should all be shrinkable
<ickle> keep an eye in dmesg for page allocation fails or resort to strace / drm.debug=0x1
<ickle> just to isolate an ioctl returning -ENOMEM vs malloc

The shmemfs objects are the actual GPU buffers; the malloc'd things would be
the userspace struct that brw_bo_alloc() allocates (something rather small,
but I guess somehow in this case the shrinker doesn't get a chance to free up
enough memory so that malloc could allocate another chunk of pages?).
I didn't see any kernel messages before the allocation failure. I'm also not convinced that there would be any kind of OOM with vm.overcommit_memory=2. Does the shrinker even run before allocations fail?
Regarding VM_NORESERVE, if it is anything like MAP_NORESERVE, then it does account against the commit limit.
(In reply to Florian Weimer from comment #4)
> I didn't see any kernel messages before the allocation failure.
>
> I'm also not convinced that there would be any kind of OOM with
> vm.overcommit_memory=2. Does the shrinker even run before allocations fail?

The whole assumption behind the intel driver's approach of greedily keeping
"freed" buffers around in the bo cache to re-use them [1] relies on the
shrinker running before allocation fails. I can certainly see how this could
fall down when changing vm.overcommit_memory from the default, if the
shrinker doesn't run before expanding the heap fails.

[1] i.e. setting up and tearing down VM mappings is expensive, and to be
avoided if you care about moar-fps
Thinking about this a bit more: I suspect that, to deal with overcommit
limits, i915 would need to somehow update the process accounting when unused
buffers are marked MADVISE(madv=DONTNEED). I suppose the shrinker doesn't
actually run before malloc fails to expand the heap, since this is a process
accounting limit, not actually running out of memory. But when unused buffers
are cached in userspace (while being candidates to be purged by the shrinker
on the kernel side), they should somehow not count against the overcommit
limit.
This message is a reminder that Fedora 27 is nearing its end of life. On
2018-Nov-30, Fedora will stop maintaining and issuing updates for Fedora 27.
It is Fedora's policy to close all bug reports from releases that are no
longer maintained. At that time this bug will be closed as EOL if it remains
open with a Fedora 'version' of '27'.

Package Maintainer: if you wish for this bug to remain open because you plan
to fix it in a currently maintained version, simply change the 'version' to a
later Fedora version.

Thank you for reporting this issue; we are sorry that we were not able to fix
it before Fedora 27 reached end of life. If you would still like to see this
bug fixed and are able to reproduce it against a later version of Fedora, you
are encouraged to change the 'version' to a later Fedora version before this
bug is closed, as described in the policy above.

Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a more
recent Fedora release includes newer upstream software that fixes bugs or
makes them obsolete.
Fedora 27 changed to end-of-life (EOL) status on 2018-11-30. Fedora 27 is no
longer maintained, which means that it will not receive any further security
or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora, please feel free to reopen it against that version. If you are unable
to reopen this bug, please file a new report against the current release. If
you experience problems, please add a comment to this bug.

Thank you for reporting this bug, and we are sorry it could not be fixed.