[sldev] Crash reports analysis [http-texture builds]

Philippe Bossut (Merov Linden) merov at lindenlab.com
Wed Apr 29 23:23:43 PDT 2009


Hi,

So I've been digging through the crash reports, trying to evaluate
how we're doing. Here's some data; I'll give some conclusions at the
end.

Crashers over the past week
-----------------------------------
Looking into the crashers on the "Second Life OSS" builds since we
started doing them, this is what I see:
Total : 30 crashers
- VWR-12775 : 12
- OpenJPEG : 3
- VWR-13065: 2
- VWR-13066 : 2
- VWR-12827: 2
- Unusable stacks: 9

Clearly, the notorious VWR-12775
(LLTextureFetchWorker::callbackDecoded crasher) is the one hurting
reliability the most. If you're unlucky, you can crash over and over:
I noticed that you can hit that crash just reading the cache at
launch.
If I look from the beginning of the month, I get 32 such crashers out
of 108. Looking into the "CommunityDeveloper" channel shows 6 out of
17 crashers. That's a fairly consistent 30 to 35% of crashes due to
this problem.
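
For anyone who wants to redo the arithmetic, here is a minimal Python
sketch that computes the VWR-12775 share from the counts quoted in
this mail. The dictionary labels and the share() helper are made up
for illustration; only the numbers come from the reports.

```python
# (VWR-12775 crashers, total crash reports) as quoted in this mail.
# The labels are illustrative only, not official channel names.
reports = {
    "Second Life OSS, past week": (12, 30),
    "Second Life OSS, since start of month": (32, 108),
    "CommunityDeveloper": (6, 17),
}


def share(hits, total):
    """Return the percentage of reports attributed to one crash signature."""
    return 100.0 * hits / total


shares = {label: share(hits, total) for label, (hits, total) in reports.items()}

for label, pct in shares.items():
    print(f"{label}: {pct:.1f}%")
```

The two month-long samples land at about 29.6% and 35.3%; the
past-week sample alone is higher, at 40.0%.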

VWR-12827 is the LLVertexBuffer one and was fixed with the merge of
1.23.

Second Life OSS Build 1.23.0 2166
------------------------------------------------
This is the most recent "1.23 + http-texture" merge build and is only  
a couple of days old.
Total : 12 crashers
- VWR-12775 : 1
- OpenJPEG : 2
- VWR-13065: 2
- VWR-13066 : 1
- Unusable stacks: 6

There's no obvious pattern except that VWR-12827 disappeared (as
expected) but VWR-13065 emerged (badNetworkHandler exception). Still,
there's not enough data to see anything new. I have anecdotal data
that VWR-12775 is still very prevalent and that most of those crashes
are not reported. I sure can repro that one easily.

Conclusion
---------------
VWR-12775 LLTextureFetchWorker::callbackDecoded() is the bad guy. I've
actually seen a couple of different stacks, but they all trace back to
the same problem. Several people on this list commented in the PJIRA
and even proposed patches (thanks Robin!). Discussing this with the
lead dev though, he thinks we'd better get to the root cause of why
the fetch worker gets into a state that is not supposed to even exist
(in the intent of the code at least), rather than just writing it off
as a race condition, detecting the weird state and avoiding the crash.
So right now I'm tracing when this state happens (and it happens
plenty, most of the time without ending in a crash) and writing unit
tests to ensure the logic is sound and well understood. Writing unit
tests for threaded code is a little tricky (actually, we don't have an
example of this in our still skimpy set of unit tests, so I'm blazing
new trails here...) but is worthwhile. At the same time, I'm tasked
with updating the doc
(http://wiki.secondlife.com/wiki/HTTP_Texture) as I go. That keeps me
busy, which explains (but doesn't excuse) why I was not super
responsive on the list today. Apologies for this.

The OpenJPEG crashers are concerning. There is unfortunately little
info in the stack trace and I haven't experienced that one myself.
What's strange is that I looked into the other viewers' crash logs
(thousands of them) and it doesn't surface at all in their crashers.
Possibly of note: our 2 crashers happened on Vista. Does this ring a
bell for anyone?

Keep testing and keep the logs (with symbols!) coming. This is
extremely useful. Any ideas on the crashes reported above will be very
much appreciated.

Cheers,
- Merov

