[sldev] Preferred CAPS timeouts and retries?

John Hurliman jhurliman at wsu.edu
Mon May 21 23:27:25 PDT 2007


I should have clarified that a bit: *if* I can get to the EventQueueGet 
capability, everything is fine and runs perfectly. (There are 502 errors, 
but they are only treated as debug output, since the client reconnects 
immediately and that always seems to work.) The main problem is creating 
the initial CAPS connection to the URL that is handed to the client 
by the login server, a teleport, or walking over a sim border. 
That initial connection seems to be somewhat prone to failure. In 
the past, failure rates have been as bad as maybe 25% at the wrong times 
(both with the official client and libsl), but now they are around just a 
few percent. However, when the initial connection to the CAPS server fails 
(before anything is received from it), it generally means that the next 
several connections are going to fail too. People have reported as many 
as 20 and 40 failures in a row before the server finally wakes up and 
hands over all the capabilities, such as EventQueueGet. From monitoring 
logs, it seems the official client has the same problem every once in a 
while, and if you identify these moments and try to teleport or cross a 
border right then, it is a guaranteed disconnect or client 
disorientation. That initial connection is where I am trying to fine-tune 
the numbers, so that clients don't have to attempt to reconnect 40 times 
before something works, or so that they give up after maybe 10 attempts 
and can assume things aren't working for the moment.
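
To make the trade-off concrete, here is a minimal sketch of a capped 
retry loop with a growing delay between attempts. It is only an 
illustration, not how libsl or the official client does it: the names 
connect_with_retries, backoff_delays and CapsUnavailable are made up, 
the attempt callable stands in for whatever code requests the seed 
capability URL, and the default numbers are placeholders (10 attempts 
is just the figure floated above; the delays are guesses).

```python
import time


class CapsUnavailable(Exception):
    """Raised when every attempt to reach the capability URL failed."""


def backoff_delays(max_attempts, base_delay=1.0, max_delay=30.0):
    """Return the wait in seconds before each retry: doubling, then capped."""
    return [min(base_delay * (2 ** i), max_delay)
            for i in range(max_attempts - 1)]


def connect_with_retries(attempt, max_attempts=10, base_delay=1.0,
                         max_delay=30.0, sleep=time.sleep):
    """Call attempt() until it succeeds or max_attempts is used up.

    attempt is a hypothetical zero-argument callable that returns the
    capability map on success and raises OSError (which includes
    TimeoutError) when the connection fails or times out.
    """
    delays = backoff_delays(max_attempts, base_delay, max_delay)
    last_error = None
    for n in range(max_attempts):
        try:
            return attempt()
        except OSError as exc:
            last_error = exc
            # Wait before the next try, but not after the final failure.
            if n < len(delays):
                sleep(delays[n])
    raise CapsUnavailable(
        f"gave up after {max_attempts} attempts") from last_error
```

Spacing the attempts out means a struggling server is hit less often 
than with back-to-back retries, and the cap gives the caller a clear 
point at which to assume things aren't working for the moment.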

The only documentation I've seen for EventQueueGet is our debugging 
output at http://www.libsecondlife.org/wiki/EventQueueGet, which I think 
is outdated now, as the XML structure has changed at least once. When I 
get into a documenting spree I try to put things on the SL wiki these 
days, but I have been attempting to avoid any more CAPS programming for now.

John Hurliman


Ryan Williams wrote:
> John Hurliman wrote:
>> It seems like someone has been doing upgrades on the CAPS servers 
>> these days and they perform a lot better than they used to (the 
>> turning point seemed to be shortly after group IMs were converted to 
>> CAPS and everyone complained). The 502 errors are back but that's not 
>> actually a problem, at least we can connect to the servers now most 
>> of the time. The other few times when an initial connection doesn't 
>> succeed, it seems to be because of a server problem and the next 
>> couple of connection attempts will also likely fail. I've seen up to 
>> 20 failed connections before a successful one (using 30 second 
>> timeouts), and one libsecondlife user is reporting 40 failed 
>> connections followed by a successful one (using 60 second timeouts).
>>
> For the EventQueueGet capability, 30 second timeouts with a 502 are 
> normal and expected.  We're using Comet 
> (http://alex.dojotoolkit.org/?p=545) to push messages from the 
> simulator to the viewer over HTTP.  This functionality has been mostly 
> unused, so most of the time the viewer just waits for nothing, causing 
> a timeout and 502, and then it retries. This way the viewer always has an 
> HTTP connection open to the simulator.  When the simulator actually 
> does have a message, it sends it down the pipe right away, and then I 
> think you don't get a 502.  As you've noticed, the case where the 
> simulator has a message is increasingly common, and will become more 
> so as we build out our LLSD messaging infrastructure.  We should 
> probably make an effort to not emit a 502 for expected behavior like 
> this, but as I understand it, that will be tricky since Squid is 
> acting as an intermediary and we might not be able to change its 
> behavior for this one case.
>
> You shouldn't be seeing this for other caps, though, so do tell if 
> other ones are.  The capability services don't have a special amount 
> of load, to my knowledge, so you shouldn't be seeing performance 
> problems from them.
>
> Has there been other discussion about the caps?  It seems like the 
> Event Queue is a topic that needs more documentation.
>
> -RYaN
>
>> Is 60 seconds too short a timeout for establishing the initial 
>> CAPS connection? What is a preferable timeout? After a certain number 
>> of retries should we just give up to prevent the already half-dead 
>> server from getting hit even more, or just keep trying until it works 
>> (although only some bots have the luxury of waiting around for 40 
>> minutes before teleport and group IMs will start functioning)?
>>
>> John Hurliman
>


