IMs and group IMs (was Re: [sldev] Frequent bugs with difficult repros)
Kelly Linden
kelly at lindenlab.com
Sun Mar 2 09:55:00 PST 2008
Here is what I know.
1.14 did not introduce the architectural changes we referred to as
"Distributed Messaging", which I think is what is being discussed here.
1.14 patch notes: http://secondlife.wikia.com/wiki/Version_1.14.0
1.9 (Build 15, Mar 14, 2006) did:
http://secondlife.wikia.com/wiki/Version_1.9.0
Unless you meant the change for groups, which was more recent. That was
perhaps 1.14, but I didn't see it mentioned in the notes.
How the old architecture worked:
There was a single server called the user server that maintained an open
connection to every user connected to the service. In March 2006 (2
years ago) our peak concurrency was under 5,500. At this time we had
already done some hacks (such as increasing the keep alive time on the
connections) just to keep the server working. We estimated that at
somewhere around 6k to 7k connections the userserver would simply fail to
function at all, constantly dropping connections as they timed out,
etc. 5k persistent connections is a lot for any server, so we had to
move to some form of distributed server.
To be clear: There is not a snowball's chance in hell the userserver as
it existed then could work with today's concurrency no matter what kind
of computer you put it on, within reason.
The first change moved everything *except* group chat off the userserver
and into the regions. Sure, we could have gone some different ways
here. We could have created a distributed farm of userservers, but
really that has most of the same difficulties as our current
implementation, except it would require a separate, additional, farm of
computers. So we went with regions handling the forwarding of messages
- and yes there were some bugs, and I'm sure there are still a few, but
I'm not convinced that wouldn't be the case with a userserver rewrite.
The amount of code change needed would have been about equal.
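
To make the "regions forward the messages" idea concrete, here is a
rough sketch. It is not the actual simulator code: the Region class, the
presence table, and every name in it are made up for illustration, and
how presence is really tracked is not covered in this mail. The point is
only that each region holds connections for its own users and hands an
IM to whichever region hosts the recipient.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    """Hypothetical region host; holds connections only for its own users."""
    name: str
    inboxes: dict = field(default_factory=dict)  # user -> delivered messages

    def connect(self, user, presence):
        # Record locally that the user is here, and publish that fact
        # in the (assumed) shared presence table.
        self.inboxes[user] = []
        presence[user] = self

    def send_im(self, sender, recipient, text, presence):
        """Forward an IM to whichever region currently hosts the recipient."""
        target = presence.get(recipient)
        if target is None:
            return False  # recipient not online; this sketch just drops it
        target.deliver(recipient, f"{sender}: {text}")
        return True

    def deliver(self, user, message):
        self.inboxes[user].append(message)


# A plain dict stands in for the presence lookup (an assumption).
presence = {}
region_a, region_b = Region("region-a"), Region("region-b")
region_a.connect("alice", presence)
region_b.connect("bob", presence)
delivered = region_a.send_im("alice", "bob", "hello", presence)
```

No single host in this sketch has to keep a connection open to every
user, which is the property the old userserver lacked.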
As part of the first change, the userserver became real lazy about
maintaining its connections, more resilient to lost connections and only
handled group chat. We went this route because truly distributed group
chat is Hard. Now the userserver is gone and group chat lives on a
relatively small number of central servers. It is not handled by the
regions. However, it is still distributed and tricky, and yes, it has
bugs. Could the userserver have handled the current load with beefier
hardware, in its limited group-chat-only state? Perhaps. Could it have
handled 75k? 100k? Probably not. We really need "infinitely scalable"
solutions (ug, sounds like market speak) and solutions that require
bigger better hardware eventually cap out (and usually end up costing a
lot). Solutions that just require you to add more hardware don't (as
much). So group chat is now a service distributed across multiple hosts
and can be easily distributed across more. I'm sure I don't have to
enumerate the shortcomings of the current design; we know it isn't
perfect and we do actively discuss other solutions ...
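
As an illustration of "add more hardware" scaling for group chat, here
is a minimal sketch of one common way to spread groups across hosts:
rendezvous (highest-random-weight) hashing. This mail does not say how
groups are actually assigned to hosts, so the technique, the host names,
and the group IDs below are all assumptions.

```python
import hashlib


def host_for_group(group_id, hosts):
    """Pick the host with the highest hash score for this group."""
    def score(host):
        digest = hashlib.sha256(f"{host}:{group_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(hosts, key=score)


# Hypothetical host names and group IDs, for illustration only.
hosts = ["chat1", "chat2", "chat3"]
groups = [f"group-{i}" for i in range(1000)]

before = {g: host_for_group(g, hosts) for g in groups}
after = {g: host_for_group(g, hosts + ["chat4"]) for g in groups}

# Groups whose assignment changed when the fourth host was added.
moved = [g for g in groups if before[g] != after[g]]
```

With this scheme, a group only changes hosts if the new host wins it, so
adding a host relocates a fraction of the groups and leaves the rest
where they were.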
As for Jabber, it is definitely something we would like to investigate
and potentially move to. As part of the transition of group chat the
methods for group chat were abstracted out to a clean API that should
make such investigation and possible transition much easier than would
have otherwise been the case. Aside from the spare time of a few
engineers, I don't think this is currently an active project or on the
roadmap for the immediate future.
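
To show why an abstracted group chat API helps with this kind of
investigation, here is a small sketch of callers talking to an interface
rather than to a specific backend. The interface, its method names, and
the in-memory backend are hypothetical and are not the real internal
API; a Jabber-backed class would be one more implementation of the same
interface.

```python
from abc import ABC, abstractmethod


class GroupChatBackend(ABC):
    """Hypothetical interface that callers use for group chat."""

    @abstractmethod
    def join(self, group_id, user):
        ...

    @abstractmethod
    def send(self, group_id, sender, text):
        ...


class InMemoryBackend(GroupChatBackend):
    """Toy backend; a different transport would implement the same methods."""

    def __init__(self):
        self.members = {}    # group_id -> set of users
        self.delivered = []  # (recipient, sender, text) tuples

    def join(self, group_id, user):
        self.members.setdefault(group_id, set()).add(user)

    def send(self, group_id, sender, text):
        # Deliver to every member except the sender; return who got it.
        recipients = sorted(self.members.get(group_id, set()) - {sender})
        for user in recipients:
            self.delivered.append((user, sender, text))
        return recipients


def announce(backend, group_id, sender, text):
    # Caller code depends only on the interface, not on the backend class.
    return backend.send(group_id, sender, text)


backend = InMemoryBackend()
for user in ("alice", "bob", "carol"):
    backend.join("builders", user)
recipients = announce(backend, "builders", "alice", "meeting at noon")
```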
I wasn't party to all of the decisions and designs made, but I hope I have
all my facts straight. I did work on the original distributed messaging
project (1.9), but not the group chat one.
- Kelly