[sldev] Preferred CAPS timeouts and retries?
Ryan Williams
rdw at lindenlab.com
Tue May 22 10:27:05 PDT 2007
John Hurliman wrote:
> I should have clarified that a bit: *if* I can get to the
> EventQueueGet capability, everything is fine and runs perfectly (there
> are 502 errors, but they are only treated as debug output, since the
> client reconnects immediately and it always seems to work). The main
> problem is creating the initial CAPS connection, to the URL that is
> handed to the client from either the login server, a teleport, or
> walking over a sim border. That initial connection seems to be
> somewhat prone to failure, and in the past has been as bad as maybe
> 25% failure rates at the wrong times (both with the official client
> and libsl), but now is around just a few percent. However, when the
> initial connection to the CAPS server (before anything is received
> from it) fails, it generally means that the next several connections
> are going to fail and people have reported as many as 20 and 40
> failures in a row before the server finally wakes up and hands over
> all the capabilities such as EventQueueGet. By monitoring logs it
> seems the official client has the same problem every once in a while
> and if you identify these moments and try and teleport or cross a
> border at that moment it is a guaranteed disconnect or client
> disorientation. That initial connection is where I am trying to fine
> tune the numbers so clients don't have to attempt to reconnect 40
> times before something works, or maybe give up after 10 and they can
> assume things aren't working for the moment.
>
So, fetching the seed capability isn't returning an immediate response?
Hmmm..... we're going to have to poke at this.
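For the "give up after 10" idea John raises above, a minimal client-side sketch might look like the following. Nothing here is an official recommendation: the names fetch_with_retries and SeedCapUnavailable, the exponential backoff, and the delay values are all assumptions made for illustration; only the cap of 10 attempts comes from the message above.

```python
import time


class SeedCapUnavailable(Exception):
    """Raised when every attempt to fetch the seed capability failed."""


def fetch_with_retries(fetch, max_attempts=10, base_delay=1.0,
                       max_delay=30.0, sleep=time.sleep):
    """Call fetch() until it succeeds or max_attempts is reached.

    fetch is any zero-argument callable (hypothetical here) that returns
    the response, or raises OSError on a connection failure or timeout.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return fetch()
        except OSError as err:
            last_error = err
            if attempt + 1 < max_attempts:
                # Assumed policy: exponential backoff (1s, 2s, 4s, ...),
                # capped at max_delay, so a struggling server gets a pause.
                sleep(min(base_delay * 2 ** attempt, max_delay))
    raise SeedCapUnavailable(
        "gave up after %d attempts" % max_attempts) from last_error
```

A caller would pass in whatever function performs the actual HTTP request to the seed capability URL, and treat SeedCapUnavailable as "things aren't working for the moment" rather than retrying indefinitely.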
> All of the documentation I've seen for EventQueueGet is our debugging
> output at http://www.libsecondlife.org/wiki/EventQueueGet which I
> think is outdated now as the XML structure has changed at least once.
> If I get into a documenting spree I try to put things on the SL wiki
> these days, but have been attempting to avoid any more CAPS
> programming for now.
Ah, OK, you already know about it then, sorry for the redundancy. :-)
-RYaN
>
> John Hurliman
>
>
> Ryan Williams wrote:
>> John Hurliman wrote:
>>> It seems like someone has been doing upgrades on the CAPS servers
>>> these days and they perform a lot better than they used to (the
>>> turning point seemed to be shortly after group IMs were converted to
>>> CAPS and everyone complained). The 502 errors are back but that's
>>> not actually a problem, at least we can connect to the servers now
>>> most of the time. The other few times when an initial connection
>>> doesn't succeed, it seems to be because of a server problem and the
>>> next couple of connection attempts will also likely fail. I've seen
>>> up to 20 failed connections before a successful one (using 30 second
>>> timeouts), and one libsecondlife user is reporting 40 failed
>>> connections followed by a successful one (using 60 second timeouts).
>>>
>> For the EventQueueGet capability, 30 second timeouts with a 502 are
>> normal and expected. We're using Comet
>> (http://alex.dojotoolkit.org/?p=545) to push messages from the
>> simulator to the viewer over HTTP. This functionality has been
>> mostly unused, so most of the time the viewer just waits for nothing,
>> causing a timeout and 502, and then it retries. This way the viewer
>> always has an HTTP connection open to the simulator. When the
>> simulator actually does have a message, it sends it down the pipe
>> right away, and then I think you don't get a 502. As you've noticed,
>> the case where the simulator has a message is increasingly common,
>> and will become more so as we build out our LLSD messaging
>> infrastructure. We should probably make an effort to not emit a 502
>> for expected behavior like this, but as I understand it, that will be
>> tricky since Squid is acting as an intermediary and we might not be
>> able to change its behavior for this one case.
>>
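The polling pattern described above can be sketched roughly as follows. This is an illustration only, not the viewer's actual code: run_event_queue, poll_once, and handle_event are hypothetical names, and treating a delivered message as a 200 response is an assumption (the text above only says a 502 is probably not returned in that case).

```python
def run_event_queue(poll_once, handle_event, max_polls):
    """Long-poll loop: issue the next request as soon as the last one ends.

    poll_once is a hypothetical callable that performs one blocking
    request against the EventQueueGet capability and returns a
    (status, body) tuple.
    """
    idle_timeouts = 0
    for _ in range(max_polls):
        status, body = poll_once()
        if status == 200:
            # Assumed: the simulator had a message and sent it right away.
            handle_event(body)
        elif status == 502:
            # Expected: nothing arrived before the timeout; just poll again.
            idle_timeouts += 1
        else:
            raise RuntimeError("unexpected status %d" % status)
    return idle_timeouts
```

The point of the sketch is that a 502 on this one capability is counted and ignored rather than treated as a failure, while any other status is surfaced to the caller.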
>> You shouldn't be seeing this for other caps, though, so do tell if
>> you see it on other ones. The capability services aren't under any
>> unusual load, to my knowledge, so you shouldn't be seeing performance
>> problems from them.
>>
>> Has there been other discussion about the caps? It seems like the
>> Event Queue is a topic that needs more documentation.
>>
>> -RYaN
>>
>>> Is 60 seconds too short of a timeout for establishing the initial
>>> CAPS connection? What is a preferable timeout? After a certain
>>> number of retries should we just give up to prevent the already
>>> half-dead server from getting hit even more, or just keep trying
>>> until it works (although only some bots have the luxury of waiting
>>> around for 40 minutes before teleport and group IMs will start
>>> functioning)?
>>>
>>> John Hurliman
>>