IMs and group IMs (was Re: [sldev] Frequent bugs with difficult repros)

Kelly Linden kelly at lindenlab.com
Sun Mar 2 09:55:00 PST 2008


Here is what I know.

1.14 did not introduce the architectural changes we referred to as 
"Distributed Messaging", which I think are what is being discussed here.
1.14 patch notes: http://secondlife.wikia.com/wiki/Version_1.14.0

1.9 (Build 15, Mar 14, 2006) did:
http://secondlife.wikia.com/wiki/Version_1.9.0

Unless you meant the change for groups, which was more recent, perhaps 
1.14 but I didn't see it mentioned in the notes.

How the old architecture worked:
There was a single server, called the userserver, that maintained an open 
connection to every user connected to the service.  In March 2006 (2 
years ago) our peak concurrency was under 5,500.  At this time we had 
already done some hacks (such as increasing the keep alive time on the 
connections) just to keep the server working.  We estimated that 
somewhere around 6k - 7k connections the userserver would simply fail to 
function at all, constantly dropping connections as they timed out 
etc.   5k persistent connections is a lot for any server, so we had to 
move to some form of distributed server.

To be clear: There is not a snowball's chance in hell the userserver as 
it existed then could work with today's concurrency no matter what kind 
of computer you put it on, within reason.

The first change moved everything *except* group chat off the userserver 
and into the regions.   Sure, we could have gone some different ways 
here.  We could have created a distributed farm of userservers, but 
really that has most of the same difficulties as our current 
implementation, except it would require a separate, additional, farm of 
computers.  So we went with regions handling the forwarding of messages 
- and yes there were some bugs, and I'm sure there are still a few, but 
I'm not convinced that wouldn't be the case with a userserver rewrite.  
The amount of code change needed would have been about equal.
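
To illustrate the idea of regions handling the forwarding of messages, here 
is a minimal sketch.  It makes no claim to match the actual implementation: 
the Region and Grid classes, the presence lookup and all method names are 
hypothetical, chosen only to show that no single server needs to hold every 
user's connection.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    """A region that holds connections only for the agents inside it."""
    name: str
    inboxes: dict = field(default_factory=dict)  # agent name -> list of IMs

    def deliver(self, to_agent, text):
        """Hand the message to a locally connected agent, if present."""
        if to_agent not in self.inboxes:
            return False
        self.inboxes[to_agent].append(text)
        return True


class Grid:
    """Hypothetical presence lookup shared by all regions (an assumption)."""

    def __init__(self):
        self.regions = {}   # region name -> Region
        self.presence = {}  # agent name -> region name

    def login(self, agent, region_name):
        """Record that an agent is connected to the given region."""
        region = self.regions.setdefault(region_name, Region(region_name))
        region.inboxes[agent] = []
        self.presence[agent] = region_name

    def send_im(self, to_agent, text):
        """Forward an IM to whichever region currently hosts the recipient."""
        region_name = self.presence.get(to_agent)
        if region_name is None:
            return False  # recipient not connected anywhere
        return self.regions[region_name].deliver(to_agent, text)
```

Each region only tracks its own agents, so the per-server connection count 
is bounded by the region's population rather than by total concurrency.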

As part of the first change, the userserver became real lazy about 
maintaining its connections, more resilient to lost connections and only 
handled group chat.  We went this route because truly distributed group 
chat is Hard.  Now the userserver is gone and group chat lives on a 
relatively small number of central servers.  It is not handled by the 
regions.  However, it is still distributed and tricky, and yes has 
bugs.  Could the userserver have handled the current load with beefier 
hardware, in its limited group-chat-only state?  Perhaps.  Could it have 
handled 75k? 100k?  Probably not.  We really need "infinitely scalable" 
solutions (ugh, sounds like marketing speak) and solutions that require 
bigger, better hardware eventually cap out (and usually end up costing a 
lot).  Solutions that just require you to add more hardware don't (as 
much).  So group chat is now a service distributed across multiple hosts 
and can be easily distributed across more.  I'm sure I don't have to 
enumerate the shortcomings of the current design; we know it isn't 
perfect and we do actively discuss other solutions .....
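
As a rough illustration of a service that scales by adding hosts, the sketch 
below assigns each group to one of several chat hosts by hashing the group 
ID.  This is an assumption made for the example only: the post does not say 
how groups are actually mapped to hosts, and the function and host names are 
hypothetical.

```python
import hashlib


def host_for_group(group_id, hosts):
    """Pick the host responsible for a group by hashing the group ID."""
    if not hosts:
        raise ValueError("at least one chat host is required")
    # Use a stable hash so every caller agrees on the same host.
    digest = hashlib.sha1(group_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(hosts)
    return hosts[index]


hosts = ["chat1", "chat2", "chat3"]  # hypothetical host names
print(host_for_group("example-group-id", hosts))
```

With plain modulo hashing, changing the number of hosts moves most groups to 
a different host; consistent hashing is a common way to limit that churn.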

As for Jabber, it is definitely something we would like to investigate 
and potentially move to.   As part of the transition of group chat, the 
methods for group chat were abstracted out to a clean API that should 
make such investigation and possible transition much easier than would 
have otherwise been the case.   Aside from the spare time of a few 
engineers, I don't think this is currently an active project or on the 
roadmap for the immediate future.

I wasn't party to all of the decisions and designs made, so I hope I have 
all my facts straight.  I did work on the original distributed messaging 
project (1.9), but not the group chat one.

 - Kelly

