Bug 1301817 - journal: Unexpected aio error on several nodes [NEEDINFO]

Status: CLOSED NOTABUG
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: RADOS
Version: 1.3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 1.3.4
Assigned To: Loic Dachary
QA Contact: ceph-qe-bugs
Reported: 2016-01-25 22:55 EST by Shinobu KINJO
Modified: 2017-07-30 11:14 EDT
Last Closed: 2017-07-18 21:24:16 EDT
Type: Bug
Doc Type: Bug Fix
Flags: kdreyer: needinfo? (skinjo)


Comment 7 Loic Dachary 2016-01-27 10:23:24 EST
@sam do you have a name in mind ? My usual contact for kernel related stuff is Ilya :-)
Comment 8 Loic Dachary 2016-01-27 10:25:21 EST
@skinjo@redhat.com would it be possible to get access to a cluster where the problem can be reproduced? Even if it's infrequent, having just one way to reproduce it would be enough to make progress. As it is, we're no closer to getting it resolved; we've only confirmed that the problem exists and that we have no theory to explain why.
Comment 9 Josh Durgin 2016-01-27 11:47:44 EST
Loic, do you know if the original instance also used a raid controller in JBOD mode, and if so which model? Perhaps there is some firmware bug common to the two cases.
Comment 26 Loic Dachary 2016-01-28 10:15:54 EST
@josh 

> do you know if the original instance also used a raid controller in JBOD mode, and if so which model? Perhaps there is some firmware bug common to the two cases.

I don't know.
Comment 29 Loic Dachary 2016-01-29 04:24:00 EST
@Shinobu KINJO

In order to correlate the problem with previous occurrences, it would be useful to get the kernel log (dmesg) as well as a log of the load average an hour before and after the crash. 

I'm exploring ideas and will update you when / if I figure something out ;-)
Comment 30 Shinobu KINJO 2016-01-29 05:50:56 EST
@Loic,

Understood.

I'm taking this issue seriously; I had already tried to figure out exactly what was going on when this was reported:

  http://tracker.ceph.com/issues/12100

Please let me know at any time if your team requires anything more.

Rgds,
Shinobu
Comment 31 Loic Dachary 2016-01-29 10:26:43 EST
@Shinobu Sam suggested that you activate debug with:

debug osd = 0/20
debug filestore = 0/20
debug ms = 0/1
debug journal = 0/20

which we call "in-memory logging". This means your on-disk logs won't grow. The OSD will be marginally slower because it logs into memory, but it only keeps the last 10,000 messages. Should a crash happen, it will dump these messages to the logs, and they will likely contain the information we need to better understand what is going on.
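As a sketch, the four settings above would normally live in the [osd] section of ceph.conf; in the `0/20` form, the first number is the level written to the log file and the second is the level gathered in the in-memory ring buffer:

```ini
# ceph.conf sketch: log level 0 to disk, gather level 20 in memory
# (values copied from the suggestion above)
[osd]
debug osd = 0/20
debug filestore = 0/20
debug ms = 0/1
debug journal = 0/20
```

The same values can also be applied at runtime without restarting the daemons, e.g. with `ceph tell osd.* injectargs '--debug_osd 0/20'`.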
Comment 32 Shinobu KINJO 2016-01-29 16:49:06 EST
@Loic,

I will suggest that again.

But since the cluster is in production, the customer is worried about performance degradation, so it will probably take some time to get a core from them.

Thank you for paying attention to this.

Rgds,
Shinobu
Comment 33 Loic Dachary 2016-02-02 00:05:59 EST
@Shinobu for the record, Ilya figured out the root cause of the problem ( http://tracker.ceph.com/issues/9570#note-37 ) and suggested a workaround (upgrade to a kernel newer than 3.13).
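A minimal sketch of checking the suggested workaround on an affected node: is the running kernel newer than 3.13? This assumes the base kernel version precedes the first '-' in `uname -r`, and uses `sort -V` for the version comparison; the function name is hypothetical.

```shell
# Succeeds (exit 0) only when the given version sorts strictly after 3.13.
newer_than_3_13() {
    [ "$(printf '%s\n' 3.13 "$1" | sort -V | head -n1)" = "3.13" ] && [ "$1" != "3.13" ]
}

# Strip the distro suffix (e.g. "3.10.0-327.el7.x86_64" -> "3.10.0").
base_version="$(uname -r | cut -d- -f1)"
if newer_than_3_13 "$base_version"; then
    echo "kernel $base_version: newer than 3.13, workaround already in place"
else
    echo "kernel $base_version: 3.13 or older, upgrade suggested"
fi
```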
Comment 34 Shinobu KINJO 2016-02-02 00:44:19 EST
@Loic,

Thank you, and your team (especially Ilya!) for analysis and confirmation.

Rgds,
Shinobu
Comment 37 Josh Durgin 2017-07-18 21:24:16 EDT
This seems to have been a kernel bug.
