[sldev] Preferred CAPS timeouts and retries?

Ryan Williams rdw at lindenlab.com
Tue May 22 10:27:05 PDT 2007


John Hurliman wrote:
> I should have clarified that a bit, *if* I can get to the 
> EventQueueGet capability everything is fine and runs perfectly (there 
> are 502 errors but only treated as debug output really as the client 
> will reconnect immediately and it always seems to work). The main 
> problem is creating the initial CAPS connection, to the URL that is 
> handed to the client from either the login server, a teleport, or 
> walking over a sim border. That initial connection seems to be 
> somewhat prone to failure, and in the past has been as bad as maybe 
> 25% failure rates at the wrong times (both with the official client 
> and libsl), but now is around just a few percent. However, when the 
> initial connection to the CAPS server (before anything is received 
> from it) fails, it generally means that the next several connections 
> are going to fail and people have reported as many as 20 and 40 
> failures in a row before the server finally wakes up and hands over 
> all the capabilities such as EventQueueGet. By monitoring logs it 
> seems the official client has the same problem every once in a while 
> and if you identify these moments and try and teleport or cross a 
> border at that moment it is a guaranteed disconnect or client 
> disorientation. That initial connection is where I am trying to fine 
> tune the numbers so clients don't have to attempt to reconnect 40 
> times before something works, or maybe give up after 10 and they can 
> assume things aren't working for the moment.
>
So, fetching the seed capability isn't returning an immediate response?  
Hmmm..... we're going to have to poke at this.
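
A minimal sketch of the "give up after N attempts" idea raised above, assuming
the client wraps its seed-capability request in a callable. The names
fetch_with_retries and GaveUp, and the defaults (10 attempts, 1 second base
delay, 30 second cap), are illustrative assumptions, not values recommended
anywhere in this thread. Exponential backoff is swapped in here so a
struggling server is not hit at a constant rate.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class GaveUp(Exception):
    """Raised when every attempt failed (hypothetical name)."""


def fetch_with_retries(
    fetch: Callable[[], T],
    max_attempts: int = 10,      # assumed cap, not an official number
    base_delay: float = 1.0,     # seconds before the first retry
    max_delay: float = 30.0,     # upper bound on any single wait
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() until it succeeds or max_attempts is reached."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return fetch()
        except OSError as exc:
            # urllib.error.URLError and socket timeouts derive from OSError.
            last_error = exc
            if attempt + 1 < max_attempts:
                # Double the wait each time, capped at max_delay.
                sleep(min(max_delay, base_delay * 2 ** attempt))
    raise GaveUp(f"no response after {max_attempts} attempts") from last_error
```

The caller would pass something like a lambda around urllib.request.urlopen
with its own per-request timeout, and treat GaveUp as "things aren't working
for the moment".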

> All of the documentation I've seen for EventQueueGet is our debugging 
> output at http://www.libsecondlife.org/wiki/EventQueueGet which I 
> think is outdated now as the XML structure has changed at least once. 
> If I get into a documenting spree I try and put things on the SL wiki 
> these days, but have been attempting to avoid any more CAPS 
> programming for now.

Ah, OK, you already know about it then, sorry for the redundancy.  :-)

-RYaN

>
> John Hurliman
>
>
> Ryan Williams wrote:
>> John Hurliman wrote:
>>> It seems like someone has been doing upgrades on the CAPS servers 
>>> these days and they perform a lot better than they used to (the 
>>> turning point seemed to be shortly after group IMs were converted to 
>>> CAPS and everyone complained). The 502 errors are back but that's 
>>> not actually a problem, at least we can connect to the servers now 
>>> most of the time. The other few times when an initial connection 
>>> doesn't succeed, it seems to be because of a server problem and the 
>>> next couple of connection attempts will also likely fail. I've seen 
>>> up to 20 failed connections before a successful one (using 30 second 
>>> timeouts), and one libsecondlife user is reporting 40 failed 
>>> connections followed by a successful one (using 60 second timeouts).
>>>
>> For the EventQueueGet capability, 30 second timeouts with a 502 are 
>> normal and expected.  We're using Comet 
>> (http://alex.dojotoolkit.org/?p=545) to push messages from the 
>> simulator to the viewer over HTTP.  This functionality has been 
>> mostly unused, so most of the time the viewer just waits for nothing, 
>> causing a timeout and 502, and then it retries. Thus the viewer 
>> always has an HTTP connection open to the simulator.  When the 
>> simulator actually does have a message, it sends it down the pipe 
>> right away, and then I think you don't get a 502.  As you've noticed, 
>> the case where the simulator has a message is increasingly common, 
>> and will become more so as we build out our LLSD messaging 
>> infrastructure.  We should probably make an effort to not emit a 502 
>> for expected behavior like this, but as I understand it, that will be 
>> tricky since Squid is acting as an intermediary and we might not be 
>> able to change its behavior for this one case.
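
The polling behavior described above can be sketched as a loop that treats a
502 as "nothing to deliver yet" and simply polls again. This is only an
illustration: event_stream and the poll callable are hypothetical names, and
the actual EventQueueGet request and response format is not covered in this
thread, so it is left to whatever poll does.

```python
from typing import Callable, Iterator, Optional, Tuple


def event_stream(
    poll: Callable[[], Tuple[int, Optional[bytes]]],
    max_polls: Optional[int] = None,
) -> Iterator[bytes]:
    """Yield event bodies from a long-poll endpoint.

    poll() performs one blocking request and returns (status, body).
    max_polls bounds the loop for testing; None means poll forever.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        status, body = poll()
        if status == 200 and body is not None:
            # The simulator had a message; hand it to the caller.
            yield body
        elif status == 502:
            # Idle timeout reported by the proxy: expected, so reconnect.
            continue
        else:
            # Anything else is treated as a real error in this sketch.
            raise RuntimeError(f"unexpected status {status}")
```

Keeping the HTTP call behind poll means the 502 handling can be exercised
without a live simulator.
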
>>
>> You shouldn't be seeing this for other caps, though, so do tell us if 
>> other ones show it.  The capability services aren't under unusual 
>> load, to my knowledge, so you shouldn't be seeing performance 
>> problems from them.
>>
>> Has there been other discussion about the caps?  It seems like the 
>> Event Queue is a topic that needs more documentation.
>>
>> -RYaN
>>
>>> Is 60 seconds too short of a timeout for establishing the initial 
>>> CAPS connection? What is a preferable timeout? After a certain 
>>> number of retries should we just give up to prevent the already 
>>> half-dead server from getting hit even more, or just keep trying 
>>> until it works (although only some bots have the luxury of waiting 
>>> around for 40 minutes before teleport and group IMs will start 
>>> functioning)?
>>>
>>> John Hurliman
>>> _______________________________________________
>>> Click here to unsubscribe or manage your list subscription:
>>> /index.html
>>
>
