Bug 1384096

Summary: user's login session sometimes fails to start because no permission on DRI
Product: [Fedora] Fedora
Reporter: Ian Collier <imc>
Component: gdm
Assignee: Ray Strode [halfline] <rstrode>
Status: CLOSED DUPLICATE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 24
CC: normand, rstrode
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-10-13 11:17:07 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Attachments:
Description | Flags
Xorg.0.log of the user while unsuccessfully logging in | none
Xorg.0.log of root attempting to start X on the console (and failing) | none

Description Ian Collier 2016-10-12 14:06:09 UTC
Created attachment 1209622
Xorg.0.log of the user while unsuccessfully logging in

Every so often, in an unpredictable fashion, gdm (or some system process) gets
into a state where users can't log in: when the correct password is entered,
the system tries to start the session and fails, then returns to the login
screen.

The user's Xorg.0.log file says things like:

 vesa: Ignoring device with a bound kernel driver
 (EE) modeset(0): drmSetMaster failed: Permission denied
 (EE) AddScreen/ScreenInit failed for driver 0

(bearing in mind the driver for this system should be intel(4) not
modesetting(4))

whereas if one logs on to the console as root and tries to start X,
it says things like:

 (EE) intel(0): [drm] failed to set drm interface version: Permission denied [13].
 (EE) intel(0): Failed to claim DRM device.
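The error lines quoted in the two excerpts above can be pulled out of a full Xorg.0.log mechanically. The following is only a minimal sketch: the function name `xorg_errors` is hypothetical, and it assumes that error lines carry the `(EE)` marker, optionally preceded by a bracketed timestamp such as `[    12.345]`.

```python
import re

# Assumption: an error line is "(EE) message", optionally prefixed by a
# bracketed timestamp like "[    12.345]".
EE_LINE = re.compile(r"^(?:\[\s*\d+\.\d+\]\s*)?\(EE\)\s*(.*)$")


def xorg_errors(log_text):
    """Return the message part of every (EE) line in an Xorg log."""
    errors = []
    for line in log_text.splitlines():
        match = EE_LINE.match(line.strip())
        # Skip bare "(EE)" lines that carry no message.
        if match and match.group(1):
            errors.append(match.group(1))
    return errors
```

Running this over both the user's log and root's log makes it easier to compare which driver failed and with which error in each case.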

The problem seems to be related to this:

# cat /sys/kernel/debug/dri/0/clients 
             command   pid dev master a   uid      magic
           <unknown>  1004   0   y    y     0          0
            Xwayland  1520   0   n    y    42          1
            Xwayland  1520   0   n    y    42          2
            Xwayland  1520   0   n    y    42          3

There is no process 1004 running on the system.  However, if one
kills process 1520 then gdm restarts and the ghost of process 1004
disappears:

# cat /sys/kernel/debug/dri/0/clients 
             command   pid dev master a   uid      magic
      systemd-logind  4955   0   n    y     0          0
            Xwayland  6780   0   n    y    42          1
            Xwayland  6780   0   n    y    42          2

At that point, users are again able to log in successfully.
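The check described above (a DRM client entry whose pid no longer exists) could be automated roughly as follows. This is a sketch under assumptions, not a tested tool: the helper names `parse_dri_clients` and `stale_clients` are hypothetical, the column layout is assumed to match the output shown above, and reading the real file needs root and a mounted debugfs.

```python
import os


def parse_dri_clients(text):
    """Parse the table printed by /sys/kernel/debug/dri/<n>/clients."""
    clients = []
    for line in text.splitlines():
        # Split from the right so a command name containing spaces survives.
        fields = line.strip().rsplit(None, 6)
        if len(fields) != 7 or not (fields[1].isdigit() and fields[5].isdigit()):
            continue  # header row, blank line, or unexpected layout
        command, pid, _dev, master, _auth, uid, _magic = fields
        clients.append({
            "command": command,
            "pid": int(pid),
            "master": master == "y",
            "uid": int(uid),
        })
    return clients


def stale_clients(clients, pid_exists=None):
    """Return the entries whose pid is no longer a running process."""
    if pid_exists is None:
        # Assumption: a live process has a /proc/<pid> directory (Linux).
        pid_exists = lambda pid: os.path.exists(f"/proc/{pid}")
    return [c for c in clients if not pid_exists(c["pid"])]
```

As root, one might call `stale_clients(parse_dri_clients(open("/sys/kernel/debug/dri/0/clients").read()))`. A stale entry with `master` set would correspond to the "ghost" of process 1004 seen here.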

Comment 1 Ian Collier 2016-10-12 14:07:40 UTC
Created attachment 1209623
Xorg.0.log of root attempting to start X on the console (and failing)

Comment 2 Ian Collier 2016-10-12 16:55:32 UTC
Digging further... in the logs we have a record of pid 1004 crashing - 
it was systemd-logind.

The question is why the pid didn't disappear from the DRI clients list when
the process crashed.

Further mystery: we have recently installed Fedora 24 on 85 machines, and
16 of them have recorded a systemd-logind crash at exactly 18:50:00 on
several different dates.  We don't have anything scheduled at 18:50.  The
closest is a cron job that runs at 18:35; it runs a command that calls
systemd-inhibit, then may issue a shutdown for 19:00, and sleeps for
1800 seconds.  None of these machines did in fact shut down at 19:00.

Anyway, if systemd-logind is crashing then maybe this is in fact a systemd
bug.  However, it would be nice if gdm could restart properly when
systemd-logind crashes.

Comment 3 Ian Collier 2016-10-13 11:16:11 UTC
Right, the systemd crash is Bug 1371596, which has just been fixed, so
hopefully this will no longer be an issue once all our machines have been
rebooted to restart their systemd.

Comment 4 Ian Collier 2016-10-13 11:17:07 UTC

*** This bug has been marked as a duplicate of bug 1371596 ***