Created attachment 415503 [details]
kernel log

Description of problem:
Upgraded to 5.5 from 5.4. All of the configuration was made on 5.3, upgraded to 5.4, and now upgraded to 5.5. When installing some large software (e.g. comsol40) on a gfs2 filesystem, the installation hangs and we get the attached messages from the kernel. Access to the gfs2 filesystem itself does not hang, but the system load increases and the machine becomes non-responsive as the number of D-state processes grows after the first crash.

Version-Release number of selected component (if applicable):
5.5
2.6.18-194.3.1.el5
gfs2-utils-0.1.62-20.el5

How reproducible:
100%

Steps to Reproduce:
1. Start an installation of e.g. comsol40.
2. Watch the kernel messages.

Actual results:
The gfs2 file system crashes with the attached messages.

Expected results:
The installation should finish.

Additional info:
Maybe I am wrong that the problem is with gfs2!
Created attachment 415513 [details]
Full crash details
The message: BUG: soft lockup - CPU#1 stuck for 10s! [aisexec:2430] may indicate there is a communications problem or an openais problem. If openais is stuck, dlm can't pass cluster traffic across the network, and that could be why GFS2 got stuck waiting for all its glocks. Just a theory. I'm going to cc some cluster folks to get their input. In cases like this, where lots of processes seem stuck in glock_wait, I'd also recommend that you download, compile and run this tool: http://people.redhat.com/rpeterso/Experimental/RHEL5.x/gfs2/gfs2_hangalyzer.c The program has instructions in comments at the top, and should be run only with rsa keys set up so it doesn't ask for passwords a hundred times.
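For reference, a rough sketch of building and running it (assuming it compiles with plain gcc; the node name "tweety1" is just an example here, and the comments at the top of the source are the authoritative instructions):

  # build the analyzer from the downloaded source
  gcc -o gfs2_hangalyzer gfs2_hangalyzer.c

  # set up passwordless ssh to each cluster node so the tool doesn't prompt
  ssh-keygen -t rsa
  ssh-copy-id root@tweety1

  # run it, pointing at the node to analyze
  ./gfs2_hangalyzer -n tweety1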
Created attachment 415657 [details]
new crash log

New crash log. This time it starts with "CPU locked!"
Hi Bob. I ran the program (with the necessary modifications, like ln -s /mounts /gfs2) and the results (until the monitored node crashed) are as follows:

[root@tweety-2 ~]# ./gfs2_hangalyzer -n tweety1
exec: /usr/bin/ssh ssh -l root tweety1 "/bin/grep 'clusternode name' /etc/cluster/cluste
ssh: : Name or service not known
/bin/cat: /gfs2/bin/glocks: No such file or directory
/bin/cat: /gfs2/BOINC/glocks: No such file or directory
/bin/cat: /gfs2/conf/glocks: No such file or directory
/bin/cat: /gfs2/DBs/glocks: No such file or directory
/bin/cat: /gfs2/Doms/glocks: No such file or directory
/bin/cat: /gfs2/FileServer/glocks: No such file or directory
/bin/cat: /gfs2/fileservice/glocks: No such file or directory
/bin/cat: /gfs2/hs_err_pid497.log/glocks: Not a directory
/bin/cat: /gfs2/html/glocks: No such file or directory
/bin/cat: /gfs2/jboss-4.2.3.GA/glocks: No such file or directory
/bin/cat: /gfs2/log/glocks: No such file or directory
/bin/cat: /gfs2/lost+found/glocks: No such file or directory
/bin/cat: /gfs2/opt/glocks: No such file or directory
/bin/cat: /gfs2/tars/glocks: No such file or directory
/bin/cat: /gfs2/test/glocks: Not a directory
/bin/cat: /gfs2/tmp/glocks: No such file or directory
/bin/cat: /gfs2/var/glocks: No such file or directory
bash: -c: line 0: syntax error near unexpected token `('
bash: -c: line 0: `/bin/cat /dlm/(null)'
(the pair of lines above repeats about ten more times)
Write failed: Broken pipe
Looks like you need to mount debugfs at /sys/kernel/debug. You can do that manually, or add this line to /etc/fstab and then run "mount -a":

debugfs    /sys/kernel/debug    debugfs    defaults    0 0

Then re-run the tool.
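For example, the manual mount on each node would be something like this (the ls at the end is just a sanity check that GFS2's debugfs files are present):

  # mount debugfs by hand
  mount -t debugfs none /sys/kernel/debug

  # or, after adding the fstab line above
  mount -a

  # there should now be one directory per mounted GFS2 filesystem,
  # each containing a glocks file
  ls /sys/kernel/debug/gfs2/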
Ok, here we are. 1. I started again a new installation of the software that triggered the event. It ran all the way to the end without any kernel logs. 2. I did a fourth installation. This time the kernel logged messages, however I saw that a) the node got frozen but did not hang/blow up, b) the installation continued after some time, c) a number of other applications based on this file system crashed when they tried to write to it. I attach "crash_without_hang" with the kernel logs from this case, and "gfs2_hangalyzer_output_crash_no_hang" with the screen output of the tool. All this was done prior to your comment above. I will also try with the debugfs option.
Created attachment 415679 [details]
Kernel logs - the server did not crash... it just froze

In this case the node did not crash. The installation continued after a couple of minutes; however, any application trying to write to this filesystem during the freeze crashed completely.
Created attachment 415681 [details] the results in the -NO SERVER CRASH- case.
I think I managed to get it. Installation on tweety1, monitoring from tweety2. In "locks" I have included: i) the output of a random run of gfs2_hangalyzer from tweety2, ii) below it, a run of gfs2_hangalyzer from when tweety1 got into the "frozen" condition, and iii) below that, a big block of kernel messages (this time it looks like drbd played a role). Until now I could not replicate the "CPU stuck" message; I only got kernel messages related to gfs (as in the previous attachments). This is the only time that drbd also output messages, but it was also the only time that I could somehow replicate the problem and catch some logs with gfs2_hangalyzer. Thank you.
Created attachment 415722 [details]
Messages during the somehow-replicated freeze

Included are the kernel messages and the results of gfs2_hangalyzer.
Also, tweety1 no longer has access to the gfs2 filesystem. Any action (like 'ls') causes more glocks and frozen processes.
Thanks for the information. Some of the call trace information leads me to believe that GFS2 might simply be waiting for DRBD. Other information may indicate GFS2 is at fault. It's hard to tell from the output unless I can see what all the processes are doing.

So bottom line: it's still not clear to me what is going on here. I'd like to see a complete picture of the state of both nodes at the time of failure. Therefore I recommend that you do the following:

1. Modify your /etc/cluster/cluster.conf file to temporarily add a large post_fail_delay value so the nodes don't fence (see the sketch at the end of this comment).
2. Make sure your /etc/sysctl.conf on both nodes includes the following:

   kernel.sysrq = 1
   kernel.printk = 7
   kernel.panic_on_oops = 0

3. Reboot both nodes and restart the cluster.
4. Start with a fresh file system, so mkfs.gfs2 a new one (saving off anything you need, of course).
5. Recreate the failure.
6. Just after the failure, but before either machine is rebooted, collect the following information:
   a. Run the gfs2_hangalyzer tool again and save its output.
   b. Capture a sysrq-t output by doing this command on both nodes:
      echo t > /proc/sysrq-trigger
   c. Run the "dmesg" command on both nodes and save both. For example:
      dmesg > /tmp/dmesg.tweety1

I'd like to see what all the processes are doing, not just a few, so I'd like the sysrq-t information from both nodes right after the failure. I'd like to see what the tool has to say about the hang from the same time. If possible, I'd like to get instructions on how to reproduce the failure in our labs.
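For step 1, something along these lines should work; the delay value here is just an arbitrary large example, and your cluster.conf may already contain a fence_daemon line with other attributes that you would edit instead:

  <!-- in /etc/cluster/cluster.conf, inside the <cluster> element -->
  <fence_daemon post_join_delay="3" post_fail_delay="600"/>

After editing, bump the config_version attribute on the <cluster> element and propagate the change to both nodes as you normally do. Note that the sysrq-t output in step 6b goes to the kernel log, so run the dmesg capture in step 6c after it.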
Hi Bob, I think it would be better if I did everything except point 4. In case the problem has anything to do with one upgrade after the other (this gfs2 was created on 5.2), then with a fresh file system we might not see the problem. I will run one sequence without point 4 (until trapping the error), and one with point 4 (again until, hopefully, trapping the error). If it has anything to do with an "old" created gfs2, it looks like this way we will catch the error. Thank you, and I will come back ASAP. Theophanis
Adding Linbit contacts as it looks like it might be drbd related.
Theophanis, do you have a support agreement with Red Hat? If so, then we need to get support in the loop on this. Can you confirm that you are using RHEL and not CentOS?
Hello all, I could not replicate the issue ever again, neither with the existing GFS2 nor with a new one. I suggest we close this report; if I ever encounter it again and more data is available, I will file a new bug report. Thank you
Hi Theophanis, Thanks for your input. I'll close this for now.