Description of problem:
I've got a setup where 4 users run the default GNOME desktop from X terminals (based on ThinStation and PXES), so everything is remote X: XDMCP requests go to gdm, which serves up the login prompt, and everything from there on, including the window manager, runs remotely on the one shared system. These users mostly stick to Thunderbird, Firefox and OpenOffice.

What's happening is that metacity freezes/hangs quite regularly (some users get it twice per day, some once every other day; no specific pattern I can discern). By freeze I mean a complete freeze, literally as if the process had received a SIGSTOP. To fix this, I simply kill the frozen metacity, at which point gnome-session (I believe) restarts it and everything returns to normal. I've configured a little tool called "emap" to kill metacity on a specific key sequence, so users can unfreeze themselves when this happens by simply hitting the key combo.

This has been occurring for months now. Given my workaround I'd almost forgotten about it and never got around to filing a bug, but having to educate a new user about the problem in the last few days reminded me it was time to file this. I don't personally use this setup as my desktop; I use Fedora Core, not RHEL4. But I did run FC3 for quite some time, which used exactly the same version of metacity as RHEL4, and without X going over the network in my case, metacity never had a problem. So this seems to be something directly related to metacity being remotely displayed.

The system this is occurring on is fully up to date, and the network between the X terminals and the server is a perfectly healthy 100Mb/s Ethernet (all systems are on the same subnet, as a matter of fact). The X terminals are made up of all sorts of different hardware, and the freeze occurs on all of them.
Version-Release number of selected component (if applicable):
metacity-2.8.6-2.8

How reproducible:
It occurs daily, but other than using the system for several hours, there's nothing I've been able to pinpoint that specifically triggers the problem. When this happens, it seems users are often in Thunderbird, but that may just be because they spend most of their time using Thunderbird.

Steps to Reproduce:
1. Use system purely over a remote X setup for several hours

Actual results:
Complete and unrecoverable (without killing the process) freeze of metacity

Expected results:
No freezing.

Additional info:
If you can log in to the computer with the frozen metacity, could you try and attach gdb to it and get a backtrace?
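One hedged way to do that, assuming gdb is installed on the box and metacity is still running (the output path is arbitrary):

```shell
# Sketch only: attach gdb to the hung metacity non-interactively and
# dump a backtrace, then detach so the process is left untouched.
pid=$(pidof metacity)
gdb --batch -p "$pid" -ex 'bt' -ex 'detach' > /tmp/metacity-bt.txt 2>&1
```

The `--batch` invocation avoids needing an interactive gdb session over the remote display.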
Here's the backtrace from a frozen metacity process:

#0  0x005117a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x005e7c6d in poll () from /lib/tls/libc.so.6
#2  0x007bcf5f in g_main_context_acquire () from /usr/lib/libglib-2.0.so.0
#3  0x007bd264 in g_main_loop_run () from /usr/lib/libglib-2.0.so.0
#4  0x0806fb7a in main ()
We've had exactly the same problem since switching from RHEL3 to RHEL4. Our users log in from Windows PCs using Hummingbird Exceed, and mainly use terminal sessions, nedit and emacs. A user's windows may freeze a couple of times a day. I have not been able to pinpoint the cause; the debug option in /etc/X11/gdm/gdm.conf showed nothing relevant.

What the user sees is this: windows lose their borders and cannot be moved, and if multiple desktops are in use, windows from the other desktops suddenly appear. However, the terminal and edit sessions themselves still operate, and you can create new windows from the command line. You can simulate the effect of this bug (for a second or two) by killing metacity on a working system.

Version used: metacity-2.8.6-2.8
I've found the source of the problem from my perspective, and it doesn't actually seem to be a metacity problem: it's a kernel iptables problem. I have a classic iptables setup that allows all outgoing traffic unfiltered and lets return traffic back in through the usual "RELATED,ESTABLISHED" state entry. metacity, like other X processes, establishes connections from the server to the client, so the return traffic is allowed in through that state entry.

When metacity freezes (and at this point I'm assuming other X processes freeze as well, but the users must simply kill and restart those apps and hence have never complained; that's pure speculation), it's because although the metacity TCP session is still established (as reported by netstat), its iptables state entry (as reported by /proc/net/ip_conntrack) is no longer present. Somehow iptables is removing the state entry even though the TCP session is still established (and hence no FIN or RST packet was received). The denial of return traffic is what causes the app to freeze, and if you look at the backtrace I provided above, it's sitting in poll(), confirming that it's waiting for traffic that will of course never come.

To work around this and confirm that my troubleshooting is correct, I added the following iptables rule (making sure it shows up after the state-matching rule, so it only hits when the state rule doesn't):

iptables -A INPUT -p tcp -s 192.168.2.0/23 --sport 6000 --dport 32768: ! --syn -j ACCEPT

It seems that all X processes use local ports higher than 32768, so I used this to narrow the rule down a little and minimize the potential security exposure of having to add it. Looking at the amount of traffic hitting this rule over time, I'm seeing about 1.5% of all return X traffic being allowed through this rule rather than the state-matching one.
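The rule ordering described above can be sketched as follows. The 192.168.2.0/23 subnet and the >=32768 port range come from the comment; the surrounding default-deny INPUT layout is an assumption about the "classic" setup and may differ on a real system:

```shell
#!/bin/sh
# Hedged sketch of the rule ordering described above, not the exact
# production ruleset.

# Normal case: return traffic matches the conntrack state entry.
iptables -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT

# Workaround: when conntrack has dropped the entry for a still-established
# X session, let non-SYN return traffic from the X terminals' display port
# (6000) back in. Placed AFTER the state rule so it only matches packets
# the state rule missed.
iptables -A INPUT -p tcp -s 192.168.2.0/23 --sport 6000 --dport 32768: \
    ! --syn -j ACCEPT

# Everything else falls through to the default-deny policy.
```

The `! --syn` match keeps new inbound connections out, so only mid-stream return packets slip through the workaround rule.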
Since applying this workaround, the metacity freezing problem does indeed seem to be gone (it's always difficult to get good information from the end users, but none of them remember it freezing lately). I've just set something up to track metacity restarts, so in a few weeks I should be able to confirm that this is indeed the case (though I'm fairly confident of it already). The iptables rule I added should of course never match anything (unless someone is using something like nmap to generate bogus packets, which definitely isn't what's happening here). The fact that it's matching a million packets per week (1.5% of all X traffic) is a bit concerning as far as the reliability of iptables in the RHEL4 kernel goes.
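One hedged way to "track metacity restarts" as mentioned above: poll the PID and log whenever it changes, meaning gnome-session has respawned the process after a kill. The helper name, log path and polling interval here are made up for illustration, not from the original setup:

```shell
#!/bin/sh
# Hypothetical restart tracker: log a line whenever metacity's PID
# changes between polls.

# restart_happened PREV CUR -> success iff both PID strings are
# non-empty and differ (i.e. metacity was respawned, not just absent)
restart_happened() {
    [ -n "$1" ] && [ -n "$2" ] && [ "$1" != "$2" ]
}

# Only loop when explicitly asked, so the file can also be sourced.
if [ "${1:-}" = "--watch" ]; then
    last=""
    while sleep 60; do
        cur=$(pidof metacity)
        if restart_happened "$last" "$cur"; then
            echo "$(date '+%F %T') metacity restarted: $last -> $cur" \
                >> /var/log/metacity-restarts.log
        fi
        last="$cur"
    done
fi
```

Counting lines in the log over a few weeks would then give the restart rate per user session.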
I added a similar iptables rule seven days ago and we have not seen this problem since. Many thanks for the suggestion.
Reassigning to kernel as should have been done a while ago, since the diagnosis looks like an iptables problem. (Sorry this didn't get taken care of at the time.)
We're experiencing something similar, but AFAIK our problem is related to restarting iptables. By default, iptables-config instructs the iptables service to unload its modules on restart and then reload them. If connections were opened and allowed through the RELATED,ESTABLISHED rule, unloading ip_conntrack leaves all of those connections untracked; they are then treated as new and rejected with "icmp-host-prohibited" by the default firewall rules. Setting IPTABLES_MODULES_UNLOAD=no in /etc/sysconfig/iptables-config should stop the module from being unloaded, and probably stop this behaviour.

Can anyone confirm whether their system shows something like "kernel: ip_conntrack version 2.4 (8192 buckets, 65536 max) - 228 bytes per conntrack" in /var/log/messages shortly before metacity disappears?

Regards,
Pablo

PS: happens on EL 5.5 too
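For reference, the setting mentioned above goes in /etc/sysconfig/iptables-config; a sketch of the change (the comment wording is mine):

```shell
# /etc/sysconfig/iptables-config
# Keep ip_conntrack and related modules loaded across
# "service iptables restart", so established connections do not lose
# their state entries and suddenly count as NEW traffic.
IPTABLES_MODULES_UNLOAD="no"
```

Whether the module was re-initialised at some point can be checked by grepping /var/log/messages for the "ip_conntrack version" line quoted above.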
I don't use /etc/init.d/iptables to manage iptables; I do it by hand, if you will, from /etc/rc.d/rc.local (calling iptables directly for every rule), and I've disabled the "iptables" service. As a result, I'm pretty confident the iptables modules are never being unloaded and reloaded on me, so that's not the root cause of the problem, just another scenario that can lead to TCP sessions being established without being tracked by ip_conntrack.

I'm no longer running EL4; I'm on 5.5 as well, and 3% of my X traffic is still not being tracked correctly by ip_conntrack. Hence, the iptables rule I provided in my previous comment is still the necessary workaround to make my environment usable. I'm curious to see whether the problem persists when I upgrade to EL6 in the near future.

I'm wondering if this bug should be reassigned to RHEL 5.5? Though I suspect the bug still applies to EL4, I can only confirm that it applies to 5.5 at this point.
Thank you for submitting this issue for consideration in Red Hat Enterprise Linux. The release for which you requested us to review is now End of Life. Please see https://access.redhat.com/support/policy/updates/errata/ If you would like Red Hat to reconsider your feature request for an active release, please re-open the request via the appropriate support channels and provide additional supporting details about the importance of this issue.