Description of problem:
Customer noticed that mcelog processes were hanging rather than completing. Investigation revealed that most of the hung mcelog processes were waiting on mce_read_sem in the kernel, and the process holding this lock was in synchronize_kernel. To return from synchronize_kernel, an RCU callback needed to complete, and that callback was nearly 4000 entries away from the head of the RCU list on that CPU.
Version-Release number of selected component (if applicable):
How reproducible:
Highly sporadic but recurring on customer side. No other known reproducers.
Steps to Reproduce:
1. Run mcelog in cron
2. Wait for a large number of blocked mcelog processes to accumulate.
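To tell when step 2 has been reached, something like the following could be used to list mcelog processes stuck in uninterruptible sleep. This is only a rough sketch and not part of the customer's setup: the helper names count_blocked and snapshot are made up for illustration, it assumes a procps-style ps that accepts "-eo pid,stat,wchan:32,comm", and it assumes Python 2.7 or later is available on the box. The wait channel is whatever ps reports and may not name the semaphore itself.

```python
import subprocess


def count_blocked(ps_output, name="mcelog"):
    """Return (pid, wchan) pairs for processes called `name` in state D.

    `ps_output` is the text printed by `ps -eo pid,stat,wchan:32,comm`
    (hypothetical helper, assumes that exact column order).
    """
    blocked = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or truncated lines
        pid, stat, wchan, comm = fields[0], fields[1], fields[2], fields[3]
        # STAT starting with "D" means uninterruptible sleep.
        if comm == name and stat.startswith("D"):
            blocked.append((int(pid), wchan))
    return blocked


def snapshot():
    """Run ps once and return its output as text (assumes procps ps)."""
    return subprocess.check_output(
        ["ps", "-eo", "pid,stat,wchan:32,comm"],
        universal_newlines=True,
    )
```

Calling count_blocked(snapshot()) from a periodic job and logging the length of the result would show whether blocked mcelog processes are piling up between cron runs.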
Event posted on 08-03-2010 12:01pm EDT by fhirtz
Any luck, observations, thoughts? This has been quite quiet on our side
since the last test failure
This event sent from IssueTracker by fhirtz
(In reply to comment #2)
> Event posted on 08-03-2010 12:01pm EDT by fhirtz
> Any luck, observations, thoughts? This has been quite quiet on our side
> since the last test failure
I haven't seen anything like this -- can we get an sosreport from them as well?
Created attachment 436536
Attached. Let me know if you have questions or need anything further.
Any thoughts on this?
What type of load are they running? Are they seeing anything else in dmesg?