[sldev] Crash reports analysis [http-texture builds]
Philippe Bossut (Merov Linden)
merov at lindenlab.com
Wed Apr 29 23:23:43 PDT 2009
Hi,
I've been digging through the crash reports to evaluate how we're
doing. Here's some data; I'll give some conclusions at the end.
Crashers over the past week
-----------------------------------
Looking at the crashers on the "Second Life OSS" builds since we
started doing them, this is what I see:
Total: 30 crashers
- VWR-12775: 12
- OpenJPEG: 3
- VWR-13065: 2
- VWR-13066: 2
- VWR-12827: 2
- Unusable stacks: 9
Clearly, the notorious VWR-12775
(LLTextureFetchWorker::callbackDecoded crasher) is the one hurting
reliability the most. If you're unlucky, you can crash over and over:
I noticed that you can hit that crash just reading the cache at
launch.
Looking from the beginning of the month, I get 32 such crashers out of
108. The "CommunityDeveloper" channel shows 6 out of 17 crashers.
That's a fairly consistent 30 to 35% of crashes due to this problem.
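For anyone who wants to redo the arithmetic, here is a minimal sketch that computes the VWR-12775 share from the counts quoted in this message. The labels and the `share` helper are made up for illustration; only the numbers come from the reports above.

```python
# Crash counts quoted in this message: (VWR-12775 crashes, total crashes).
# The dictionary keys are informal labels, not official channel names.
samples = {
    "Second Life OSS, past week": (12, 30),
    "Second Life OSS, since start of month": (32, 108),
    "CommunityDeveloper": (6, 17),
}


def share(hits, total):
    """Return the percentage of crashes attributed to one issue."""
    return 100.0 * hits / total


shares = {name: share(hits, total) for name, (hits, total) in samples.items()}

for name, pct in shares.items():
    print(f"{name}: {pct:.1f}%")
```

The past-week sample comes out somewhat higher than the other two, but with only 30 crashes in it that is not surprising.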
VWR-12827 is the LLVertexBuffer one and was fixed with the merge of 1.23.
Second Life OSS Build 1.23.0 2166
------------------------------------------------
This is the most recent "1.23 + http-texture" merge build and is only
a couple of days old.
Total: 12 crashers
- VWR-12775: 1
- OpenJPEG: 2
- VWR-13065: 2
- VWR-13066: 1
- Unusable stacks: 6
There's no obvious pattern except that VWR-12827 disappeared (as
expected) but VWR-13065 emerged (badNetworkHandler exception). Still,
there isn't enough data yet to see anything new. I have anecdotal
evidence that VWR-12775 is still very prevalent and that most of those
crashes are not reported. I sure can repro that one easily.
Conclusion
---------------
VWR-12775 LLTextureFetchWorker::callbackDecoded() is the bad guy. I've
actually seen a couple of different stacks, but they all trace back to
the same problem. Several people on this list commented in the PJIRA
and even proposed patches (thanks Robin!). Discussing this with the
lead dev though, he thinks we'd better get to the root cause of why
the fetch worker gets into a state that is not supposed to even exist
(in the intent of the code at least), rather than just writing it off
as a race condition, detecting the weird state and avoiding the crash.
So right now I'm tracing when this state happens (and it happens
plenty, most of the time without getting into a crashing tangle) and
writing unit tests to ensure the logic is sound and well understood.
Writing unit tests for threaded code is a little tricky (actually, we
don't have an example of this in our still skimpy set of unit tests,
so I'm blazing new trails here...) but is worthwhile. At the same
time, I'm tasked with updating the doc
(http://wiki.secondlife.com/wiki/HTTP_Texture) as I go. That keeps me
busy, which explains (but doesn't excuse) why I was not super
responsive on the list today. Apologies for this.
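To give an idea of what a unit test for threaded state logic can look like, here is a generic sketch in Python. It is not the viewer's code: `FetchWorker`, its state names and the `race` helper are all hypothetical, and the real LLTextureFetchWorker states are not described in this message. The idea is simply to release several threads at once against the same transition and check that exactly one of them wins, whatever the scheduling.

```python
import threading


class FetchWorker:
    """Hypothetical stand-in for a fetch worker (not the viewer's class)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.state = "INIT"  # state names are assumptions for illustration

    def try_advance(self, expected, new_state):
        """Atomically move from `expected` to `new_state`.

        Returns True on success, False if the worker was in another state.
        """
        with self._lock:
            if self.state != expected:
                return False
            self.state = new_state
            return True


def race(worker, expected, new_state, n_threads=8):
    """Let n_threads attempt the same transition at once; count the winners."""
    barrier = threading.Barrier(n_threads)  # line all threads up first
    wins = []

    def attempt():
        barrier.wait()
        if worker.try_advance(expected, new_state):
            wins.append(threading.get_ident())

    threads = [threading.Thread(target=attempt) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(wins)


worker = FetchWorker()
winners = race(worker, "INIT", "LOADING")
print(winners, worker.state)
```

Because the check and the update happen under the same lock, the result does not depend on timing, which is what makes the test repeatable. Dropping the lock from `try_advance` makes the check-then-set race possible, although a test like this may not trigger it on every run.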
The OpenJPEG crashers are concerning. There is unfortunately little
info in the stack trace and I haven't experienced that one myself.
What's strange is that I looked into the other viewers' crash logs
(thousands of them) and it doesn't surface at all in their crashers.
Maybe of note: our 2 crashers happened on Vista. Does this ring a bell
for anyone?
Keep testing and keep the logs (with symbols!) coming. This is
extremely useful. Any ideas on the crashes reported above will be very
much appreciated.
Cheers,
- Merov