Bug 1464294 - Gnome-Shell deadly lockup with btrfs file system if certain extensions are installed.
Status: NEW
Product: Fedora
Classification: Fedora
Component: gnome-shell
Version: 27
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Assigned To: Owen Taylor
QA Contact: Fedora Extras Quality Assurance
Keywords: Reopened

Reported: 2017-06-22 21:48 EDT by Leslie Satenstein
Modified: 2018-02-01 05:06 EST
CC: 8 users

Last Closed: 2017-09-26 05:04:51 EDT
Type: Bug


Attachments
Systemd journal from test (1.16 MB, text/plain)
2017-11-04 15:37 EDT, Ulrich Beckmann
Repeated test of comment#9 with No_COW attributes set (1.29 KB, text/plain)
2017-11-18 08:01 EST, Ulrich Beckmann

Description Leslie Satenstein 2017-06-22 21:48:59 EDT
Description of problem:

BACKGROUND
I have been testing Fedora 26 Beta 1-4 and also the nightlies. Installation is on physical hardware (SSD and spinning disks, not a virtual system).

An ext4 developer recommended that SSDs be formatted with btrfs rather than ext4 (a Red Hat recommendation). The rationale is to extend SSD life: btrfs does copy-on-write (COW) and avoids journalling.

ENDBACKGROUND


BTRFS file system
While setting up the GNOME extension TaskBar (by zpydr), the GNOME session (both Wayland and Xorg) locks up solidly. A restore from backup is the only way to recover (killall -u locked_up_user does not restore the session).

EXT4 File System
There is NO PROBLEM with any of the GNOME extensions if the file system is ext4 or xfs.
 



Version-Release number of selected component (if applicable):

All current software as of June 22, 2017

How reproducible:

Set up /home on a small btrfs-formatted partition.
Install gnome-tweak-tool.
Install the TaskBar extension by zpydr.
Change its alignment settings (if you can).

Steps to Reproduce:
1.
2.
3.

Actual results:

GNOME locks up and cannot be recovered by rebooting or by a forced logoff.

Expected results:

Logging the user back in after a lockup must recover the session.


Additional info:

I have run multiple tests with the extension's author to show that the fault is not in his software.

This could be a showstopper, or at least warrants a warning about the problem in the installation guide.
Comment 1 Leslie Satenstein 2017-07-05 19:59:45 EDT
I just set up 3 different Fedora 26 candidates (all from July 2, 2017), using

https://kojipkgs.fedoraproject.org/compose/26/latest-Fedora-26/compose/Workstation/x86_64/iso/Fedora-Workstation-Live-x86_64-26-1.4.iso

My testing consisted of three Fedora 26 GNOME installations on ext4 and three on btrfs.

All my observations above were reconfirmed: GNOME and btrfs do not get along well.

BTRFS Problems with Extensions
Generally, I test with gcc 7 (does not work) and with the Firefox browser, and I add extensions: I first download the GNOME Tweak Tool and then add the extensions I used with Fedora 22, 23, 24, and 25. With Fedora 22-25 everything worked, with no GUI lockups or other problems.

With the Fedora 26 candidate, some extensions will not install, and some lock up the system as soon as they land in /home/user/.local/share/gnome-shell/extensions.
When the system locks up, the keyboard lights go out and I have to reboot.
When I examine the gear menu on the login screen (the session selection between GNOME and GNOME on Xorg), the default setting is a third entry, "awesome". If I do not reset the login session to GNOME, a system lockup occurs.
Even with GNOME and some reliable extensions, lockups occur.


Ext4 problems with Extensions
None

Why btrfs? Theodore Ts'o suggested in a publication that btrfs was better for SSDs than ext4. Journalling shortens the life of SSDs, and btrfs's COW cuts I/O in half, which is what is wanted for an SSD.
Comment 2 Leslie Satenstein 2017-08-01 01:30:27 EDT
Definitely a gnome-shell btrfs problem.

Test with the attachment.
Comment 3 Leslie Satenstein 2017-09-02 22:20:43 EDT
With the GNOME 3.26 candidate and Fedora 27, the problem is less severe.

There is definitely an incompatibility when gnome-shell resides on a btrfs file system.

This problem occurs with the extensions distributed with F27 as well as with the extension that I provided. You can use that extension for debugging GNOME.

The above extension, which crashes on installation under GNOME 3.24 (Fedora 26), does not crash with the GNOME 3.26 beta under the Fedora 27 beta, except while setting its execution parameters (configuration).

My observation is that the schema is ignored during configuration, causing out-of-bounds values and the crash.

Once installed, the extension functions as designed. GNOME Shell is flawed.
Comment 4 Leslie Satenstein 2017-09-22 10:15:52 EDT
Activities Configurator can crash if the host file system is btrfs.

A good candidate to test gnome-shell is

TaskBar by zpydr

This second extension installs cleanly and works fine with
ext4, xfs, LVM, LVM thin, and f2fs,

but not with btrfs.

The problem with GNOME on btrfs has been present since F25.
Comment 5 Leslie Satenstein 2017-09-26 05:04:51 EDT
Works for me with

Fedora-Workstation-netinst-x86_64-27-20170925.n.0.iso

TaskBar by zpydr
Activities Configurator by nls1729
Comment 6 Leslie Satenstein 2017-09-28 19:47:48 EDT
Occasionally, over a few days of testing, attempting to log on to the system fails.
I switched to a virtual terminal as root. When the problem did occur, the top command (run as root) showed gnome-shell in a tight loop at 99.5% CPU.
Killing the process allowed a second logon to succeed.

After that, logging out and back in did not fail until the next boot.

The failures happen only after a reboot, and randomly, meaning not on every boot.

Just providing this additional information in case someone else raises a bug report.
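The check described in comment 6 (spotting a looping gnome-shell from a root virtual terminal) can be scripted roughly as follows. This is a sketch assuming procps-ng ps; the function name find_busy_shells and its default 90% threshold are hypothetical. Note that ps reports %CPU averaged over the process lifetime rather than top's instantaneous value, so a shell that has only just started looping may show a lower figure.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list gnome-shell processes whose %CPU exceeds a
# threshold (default 90). Assumes procps-ng ps; run as root from a
# virtual terminal, as described in comment 6.
find_busy_shells() {
    local threshold="${1:-90}"
    # pid=, user=, pcpu= print those columns without a header line.
    # ps %CPU is a lifetime average, unlike the live value shown by top.
    ps -C gnome-shell -o pid=,user=,pcpu= |
        awk -v t="$threshold" '$3 + 0 > t + 0 { print $1, $2, $3 }'
}
```

For example, find_busy_shells 90 prints one "PID USER %CPU" line per matching process; the reporter then killed the process (or the whole session with killall -u USER) and logged in again.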
Comment 7 Leslie Satenstein 2017-10-25 17:14:39 EDT
The problem has returned with the Fedora 27 beta.
How to demonstrate:
Create an installation of the F27 beta using btrfs.
Apply all updates (as of Oct 25).
Then install GNOME extensions as follows:
Install Activities Configurator. Did setting up its parameters lock up the system (keyboard lost)?

If not, install Gno-menu. Did setting up its parameters lock up the system (keyboard lost)?

If not, install TaskBar by zpydr. Did setting up its parameters lock up the system (keyboard lost)?

Ditto for OpenWeather by Jens Lody (it is also distributed within the ISO).

I could go on, but I just want to advise you that after a reboot, one cannot log into Fedora with GNOME or GNOME on Xorg.

TEST2
Install with /home on an ext4 or xfs file system. You will not have any issues installing the four extensions listed, or any others.

I have researched this further: IF ~/.config/dconf/user is on a btrfs file system, GNOME will lock up during setup of parameters or randomly thereafter.

If ~/.config/dconf/user is placed on an ext4 file system while the rest of /home is on btrfs (I used ln -s to relocate the dconf directory), GNOME works properly.

It took me days of testing to determine that ~/.config/dconf/user was somehow being corrupted when it was on btrfs.
 
This is a GNOME bug or a btrfs bug. The four extensions listed above work on any file system other than btrfs.
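Since the finding above hinges on which file system actually holds ~/.config/dconf/user, a quick check can help when comparing setups. This is a sketch using GNU stat's file-system mode (stat -f with %T); the function names dconf_fs_type and check_dconf_location are hypothetical. GNU stat reports ext4 as "ext2/ext3".

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the file system type holding the dconf
# database (default: ~/.config/dconf/user). GNU stat prints ext4 as
# "ext2/ext3" in this mode.
dconf_fs_type() {
    local db="${1:-$HOME/.config/dconf/user}"
    # Path resolution follows a symlinked dconf directory (the ln -s
    # workaround), so this reports where the file really lives.
    stat -f -c %T "$db"
}

# Prints a warning and returns 1 if the database sits on btrfs;
# returns 2 if stat fails (for example, the file does not exist).
check_dconf_location() {
    local fstype
    fstype=$(dconf_fs_type "$1") || return 2
    if [ "$fstype" = "btrfs" ]; then
        echo "warning: $fstype holds the dconf database"
        return 1
    fi
    echo "dconf database is on $fstype"
}
```

Run check_dconf_location as the affected user with no argument to check the default location.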

Comment 8 陳鐸元 2017-11-03 02:43:07 EDT
It may be related to https://bugzilla.redhat.com/show_bug.cgi?id=1469129
Comment 9 Ulrich Beckmann 2017-11-04 15:37 EDT
Created attachment 1347877
Systemd journal from test

When I tried to reproduce the issue, I switched TaskBar to a non-existent bottom panel, and the system stalled and woke up after an unknown time. My system is a Sony Vaio E series notebook with a 4-core Intel i5, 8 GB RAM, AMD/ATI graphics, and no SSD.

The systemd journal was flooded with messages (5000 of 7700 lines in total). Of course, I can't say whether it is btrfs-related. I could not identify any error message referring to SELinux, I/O, or the file system.

Please search for tty2 to spot the incident. I then tried to log in as root.

Ulrich Beckmann
Comment 10 Leslie Satenstein 2017-11-05 22:41:41 EST
Hi Ulrich

GNOME support tried to pass the problem off as an extension bug. I responded that the extension runs flawlessly on everything except all-btrfs systems.

Other extensions experienced similar problems.  

I reported the problem last June, as you can see above.

In the past 5 days I noticed updates to glibc and other GNOME components. Since then, the problems with gnome-shell have diminished greatly.
I am able to use the same unmodified TaskBar on the Fedora 27 beta with btrfs for both / and /home.

In a few days, I will wipe a partition and install a fresh Fedora 27 GNOME system using btrfs for both / and /home. If I encounter no issues, I will mark this bug as "works for me". Until then...
Comment 11 Leslie Satenstein 2017-11-08 14:33:45 EST
Hi Owen

This is a pernicious bug. This is what I do to prove it is a gnome-shell bug:

a) Create a brand-new Fedora 27 beta installation. The final F27 is out in 6 days.

b) I select btrfs as the default file system. That puts
/ and /home on btrfs; /home is a subvolume of the btrfs root.

c) I log in as my user, leslie.
I install TaskBar by zpydr (other extensions cause problems too, but this one has many additional settings).
During the setup of TaskBar (e.g., spacing between icons, what to display, etc.), GNOME Shell locks up the keyboard and mouse.

I reboot.
I try to log in to F27 and get a grey screen.
gnome-shell is looping at 105% CPU (quad-core CPU). I switch to a root terminal and run killall -u leslie (leslie is my user).
I log in again and get in.

The repeated rejection of the first login is a problem, as is the lockup during setup in c) above.
Testing /home on ext4

d) On the same drive, I have an ext4 partition, /home2. I tar /home, untar it onto /home2, and swap the /home and /home2 entries in /etc/fstab.

e) Now Fedora 27 works from a cold boot, and all setup issues have disappeared.

In plain words, /home with the GNOME 3.26.1 shell cannot safely reside on a btrfs file system.

f) MORE DETAILED TESTS.

Reverting to /home on btrfs, I moved /home/leslie/.config/dconf to an ext4 file system and linked it back with a symbolic link (ln -s).

The crashing and boot problems stop occurring. The problem is solved in the same way as in step d) above.
I have been chasing this problem and walking through problematic extensions since the beginning of 3.24.
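Step d) above (copying /home onto the ext4 partition mounted at /home2) can be sketched as follows. This assumes bash and GNU tar, and that it runs as root so that ownership is restored on extraction; copy_home_tree is a hypothetical name. Swapping the /home and /home2 entries in /etc/fstab is left as a manual edit.

```bash
#!/usr/bin/env bash
# Hypothetical helper for step d): copy the contents of SRC (default
# /home) into DST (default /home2, assumed to be the mounted ext4
# partition). Run as root so GNU tar also restores file ownership.
copy_home_tree() (
    # The body runs in a subshell, so pipefail does not leak to the
    # caller, and a failure on either side of the pipe is reported.
    set -o pipefail
    src="${1:-/home}"
    dst="${2:-/home2}"
    # -p on extraction keeps permission bits exactly (ignores umask).
    tar -C "$src" -cf - . | tar -C "$dst" -xpf -
)
```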
Comment 12 Leslie Satenstein 2017-11-10 23:31:31 EST
With the patches/fixes/updates to GNOME 3.26.1 as of November 10, 2017,
most of the problems have disappeared.

The extensions that fail do so not in static operation but when a setting is changed,

for example,   turning on/off an option
               adding some spacing or width for an icon
               changing the size of a field.

One can do one or two actions and then, POW, the lockup occurs: the keyboard is LOST, and the mouse pointer moves on the screen but its buttons do not function. Other times there is no keyboard and a black screen.

After several reboots, with some settings changed on each boot, the extension is stable and works from then on.

And to repeat one more time.

Transferring /home to an ext4 or xfs file system solves the extension setup problem.

And by the way, the problems are present with the GNOME extensions furnished with Fedora 27 via dnf.

A second experiment I did was to relocate the file
~/.config/dconf/user to ext4 using ln -s, and the problem disappears.

Please refer to bug 1469129 for a similar problem and discussion.
Comment 13 Leslie Satenstein 2017-11-15 10:14:06 EST
The weeks between June 22 and November 10 saw a very large number of glib and gnome-shell updates.
I am now able, with one or two reboots (I should not have to reboot), to install any extension onto a "btrfs only" system.

I discovered that if I put one critical file onto an ext4 file system, I can do a clean installation of all GNOME extensions without having to reboot, and logging in after a fresh power-on works. That file is

~/.config/dconf/user

I relocated that file to an ext4 file system using a soft link, even though all Fedora partitions on my SSD but one are btrfs formatted.

My steps were simple because on Fedora, /boot is created on an ext4 file system. This is what I did to solve the problem:

As root.
1) mkdir     -p  /boot/leslie/.config/dconf
Note again, /boot is an ext4 file system thanks to grub2 requirements.
2) cp -ra /home/leslie/.config/dconf/user  /boot/leslie/.config/dconf 
3)  mv  /home/leslie/.config/dconf  /home/leslie/.config/dconf.bak  #just in case.
4) ln -s /boot/leslie/.config/dconf     home/leslie/.config/dconf    
5) chown -R leslie:leslie /boot/leslie
6) Log in as leslie. Success.
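The six steps above can be collected into one function. This is a sketch, assuming it is run as root while the user is logged out; relocate_dconf and its parameters are hypothetical. It uses a leading / for the link target in step 4 (see comment 14) and chowns the directory created in step 1. Comment 15 regards this workaround as a poor solution.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the comment 13 workaround: move a user's dconf
# database to an ext4 location (default /boot, as in the comment) and
# symlink the old directory to it. Run as root with the user logged out.
relocate_dconf() {
    local user="$1" group="$2"
    local home_root="${3:-/home}"     # btrfs-hosted home directories
    local target_root="${4:-/boot}"   # assumed to be on ext4
    local src="$home_root/$user/.config/dconf"
    local dst="$target_root/$user/.config/dconf"

    # 1) Create the directory on the ext4 file system.
    mkdir -p "$dst" || return 1
    # 2) Copy the dconf database, preserving its attributes.
    cp -a "$src/user" "$dst/" || return 1
    # 3) Keep the original directory as a backup, just in case.
    mv "$src" "$src.bak" || return 1
    # 4) Point the old location at the new one (absolute path).
    ln -s "$dst" "$src" || return 1
    # 5) Give the user ownership of the relocated tree.
    chown -R "$user:$group" "$target_root/$user" || return 1
}
```

For the reporter's setup this would be called as relocate_dconf leslie leslie.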

Does btrfs behave differently from ext4 with regard to writes, reads, and return codes, or is the problem still unsolved?
Comment 14 Leslie Satenstein 2017-11-15 10:41:06 EST
In step 4), home/leslie is missing a leading /; it should be /home/leslie.
Comment 15 Ulrich Beckmann 2017-11-17 10:58:17 EST
Leslie,

Sorry to say, your workaround looks like a poor solution to me.

Referring to my post https://forums.fedoraforum.org/showthread.php?315877-The-beta-is-almost-the-final-Just-act-as-it-is-the-final&p=1797585#post1797585
I can now confirm that openSUSE sets xfs as the default file system for /home.
If you choose btrfs for /home, it sets the No_COW file attribute on all Akonadi and Baloo database files under ~/.local/share/.

So I think it is not a btrfs issue but an issue of a poor btrfs setup. ~/.config/dconf should definitely have the No_COW attribute (chattr +C ...).

Regards
Ulrich
Comment 16 Ulrich Beckmann 2017-11-18 08:01 EST
Created attachment 1354697
Repeated test of comment#9 with No_COW attributes set

I repeated my test from comment #9 with the No_COW attribute set on ~/.config/dconf.

I can now confirm: it works, and the system stays responsive the whole time.

Best regards,
Ulrich
Comment 17 Leslie Satenstein 2017-11-18 20:52:17 EST
Hi Ulrich

Thank you for comments 9, 15, and 16.
I must read about setting No_COW on btrfs.

Should the COW issue be handled by the user/Linux setup, or should
GNOME Shell handle it?

Should I set No_COW for /home?
That would of course include all /home subdirectories.

Does it also mean no journalling for /home?
Comment 18 Ulrich Beckmann 2017-11-19 11:23:22 EST
No, only ~/.config/dconf. Otherwise you would be disregarding your initial statement.
No_COW disables COW (copy-on-write). I am not sure, but journalling will/must work instead.

Proceed as follows while the user is not logged in:

# cd /home/user_name/.config
# mv dconf dconf.old
# mkdir dconf
# chattr +C dconf
# chown user_name:group_name dconf

Then log in, and GNOME will create a new .config/dconf/user file, as in the attachment of comment #16. chattr on an existing file won't work.

The change should ideally be made upstream, by the GNOME project. I have now learnt that my backup media should also be formatted with btrfs; otherwise the flag might be lost during backup.

Furthermore, I am not a GNOME user. I don't know what other data (zeitgeist?) should be treated the same way. Also, I would not place a virtual machine or database files on btrfs.
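The commands above can be wrapped in a function. This is a sketch, assuming it is run as root while the user is logged out; nocow_dconf and its parameters are hypothetical. The chattr(1) manual notes that on btrfs the C flag should be set on new or empty files, which is why the directory is recreated empty: the user database GNOME creates inside it at the next login inherits the flag.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of comment 18: recreate the dconf directory empty,
# mark it No_COW (+C), and hand it back to the user. Run as root with the
# user logged out; +C only has an effect on btrfs.
nocow_dconf() {
    local user="$1" group="$2"
    local config_dir="${3:-/home/$user/.config}"

    # Keep the old directory (and its database) as a fallback.
    mv "$config_dir/dconf" "$config_dir/dconf.old" || return 1
    mkdir "$config_dir/dconf" || return 1
    # Files created later in this directory inherit the No_COW flag.
    chattr +C "$config_dir/dconf" || return 1
    chown "$user:$group" "$config_dir/dconf" || return 1
}
```

Afterwards, lsattr -d /home/user_name/.config/dconf should show a C among the attribute flags.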

Ulrich
Comment 19 Leslie Satenstein 2017-11-24 17:25:24 EST
I am waiting for a decision. Is it a user setting (I don't think so, as .config is a hidden folder), or is it a gnome-shell problem that must be resolved when the file system is btrfs?

I chose to move /home to an xfs partition, bypassing the problem.
Comment 20 Daniel Playfair Cal 2017-11-26 21:10:23 EST
I've been trying to find the cause of what I think is this issue.

My root file system is btrfs, and I use gnome-shell 3.26.1; the problem has been happening to me since at least 3.24.

The underlying problem is that gsettings (backed by the dconf backend) emits "changed" events for settings keys in various circumstances where the values of the relevant settings have not changed.

Meanwhile, lots of JS code in gnome-shell and extensions assumes that this will not be the case. In some circumstances there is infinite recursion, for example when a handler for the "changed" event causes a "changed" event to be dispatched for the same key (even if the value of that key never actually changes).

In one case, the spurious event occurs when JS code sets a setting to a value that it already holds. There is a patch here to address this problem: https://bugzilla.gnome.org/show_bug.cgi?id=790640.
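The idea behind avoiding this first case (never writing a key that already holds the target value) can be illustrated from the command line with the gsettings tool. This is a sketch of the idea only, not the upstream patch; set_if_changed is a hypothetical name, and it assumes the new value is passed in the serialized form that gsettings get prints (for example 'bottom' including the quotes). The check and the write are not atomic, and, as noted above, such a guard does not help when dconf itself reports every key as changed.

```bash
#!/usr/bin/env bash
# Hypothetical helper: write a GSettings key only when its value really
# changes, so this call never performs a no-op write. NEW_VALUE must use
# the serialized GVariant form printed by "gsettings get" (assumption).
set_if_changed() {
    local schema="$1" key="$2" new_value="$3"
    local current
    current=$(gsettings get "$schema" "$key") || return 1
    if [ "$current" = "$new_value" ]; then
        return 0    # already set: skip the write entirely
    fi
    gsettings set "$schema" "$key" "$new_value"
}
```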

In another case, dconf emits a "changed" event for all keys in the database because it reaches a state where it has lost track of which keys have changed since a client subscribed to the "changed" events: https://bugzilla.gnome.org/show_bug.cgi?id=790640. I suspect that this case may be the one that is related to BTRFS. In this case, I think an infinite loop is possible regardless of whether JS code refrains from doing anything when "changed" handlers are called without a change.
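One rough way to look for such spurious notifications from a terminal is to watch a schema with gsettings monitor and flag notifications that repeat the last value seen for the same key. This sketch assumes gsettings monitor prints one "key: value" line per notification; flag_spurious_changes is a hypothetical name. Because a key only has a baseline after its first notification, a spurious first event for a key is not flagged.

```bash
#!/usr/bin/env bash
# Hypothetical diagnostic filter: read "key: value" lines on stdin and
# report notifications whose value equals the previous one for that key.
flag_spurious_changes() {
    awk '
    {
        i = index($0, ": ")
        if (i == 0) next                  # not a "key: value" line
        key = substr($0, 1, i - 1)
        val = substr($0, i + 2)
        if ((key in last) && last[key] == val)
            print "spurious: " key " = " val
        last[key] = val
    }'
}
```

For example: gsettings monitor org.gnome.shell | flag_spurious_changes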

For me, all these problems are difficult to reproduce (probably because of race conditions), but most of the problems I have found happen much more frequently when running gnome-shell under valgrind with the TaskBar extension enabled.

Some examples of resulting bugs: https://bugzilla.gnome.org/show_bug.cgi?id=782688, https://bugzilla.gnome.org/show_bug.cgi?id=786186, https://bugzilla.gnome.org/show_bug.cgi?id=788110
