IMs and group IMs (was Re: [sldev] Frequent bugs with difficult repros)

Henri Beauchamp sldev at free.fr
Mon Mar 3 11:36:24 PST 2008
On Sun, 02 Mar 2008 09:55:00 -0800, Kelly Linden wrote:

> Here is what I know.
> 
> 1.14 did not introduce the architectural changes we referred to as 
> "Distributed Messaging" which is being talked about here, I think.
> 1.14 patch notes: http://secondlife.wikia.com/wiki/Version_1.14.0
> 
> 1.9 (Build 15, Mar 14, 2006) did:
> http://secondlife.wikia.com/wiki/Version_1.9.0
> 
> Unless you meant the change for groups, which was more recent, perhaps 
> 1.14 but I didn't see it mentioned in the notes.

Yes, I was speaking about the change for groups which -is- documented in
the blog post I linked to in my email:
http://blog.secondlife.com/2007/04/02/preview-of-second-life-11411-now-up-on-the-beta-test-grid/

Quote:
"The biggest changes, however, are hopefully (!) invisible, and involve
the deprecating the overloaded back-end system responsible for group IM
and Group chat and replacing them with more distributed and scalable
counterparts."

And if you look at the blog after this date, you will see all the
mentions of the group-related issues. In fact, it is at this very
point that LL should have rolled back the changes and gone back to work
on the new architecture... It would probably have saved us many months
of trouble. Anyone can make mistakes (making mistakes is even one
of the best ways to learn), but not recognizing a mistake and insisting
on the wrong way is pretty unforgivable.

> How the old architecture worked:
> There was a single server called the user server that maintained an open 
> connection to every user connected to the service.  In March 2006 (2 
> years ago) our peak concurrency was under 5,500.  At this time we had 
> already done some hacks (such as increasing the keep alive time on the 
> connections) just to keep the server working.  We estimated that 
> somewhere around 6k - 7k connections the userserver would simply fail to 
> function at all, constantly dropping connections as they timed out 
> etc.   5k persistent connections is a lot for any server, so we had to 
> move to some form of distributed server.
> 
> To be clear: There is not a snowballs chance in hell the userserver as 
> it existed then could work with today's concurrency no matter what kind 
> of computer you put it on, within reason.

Computers acting as "concentrators" (not sure of the right term to use
in English) would have solved this issue. More on this below.

> The first change moved everything *except* group chat off the userserver 
> and into the regions.   Sure, we could have gone some different ways 
> here.  We could have created a distributed farm of userservers, but 
> really that has most of the same difficulties as our current 
> implementation, except it would require a separate, additional, farm of 
> computers.  So we went with regions handling the forwarding of messages 
> - and yes there were some bugs, and I'm sure there are still a few, but 
> I'm not convinced that wouldn't be the case with a userserver rewrite.  
> The amount of code change needed would have been about equal.

I did not notice any serious trouble with (non-group) IMs on SL (if we
set aside the friends list troubles, but I'm not even sure this list is
handled by the IM servers), so your approach for this kind of IM (user to
user) seems reasonable enough.

> As part of the first change, the userserver became real lazy about 
> maintaining its connections, more resilient to lost connections and only 
> handled group chat.  We went this route because truly distributed group 
> chat is Hard.

And here, I fully agree with you. I'd even say it's impossible to
have reliable group chats over distributed servers when the network
supporting the communications is non-deterministic.

One of the approaches you could possibly consider, if you insist on
keeping distributed servers for group chats, would be to use a
dedicated, deterministic private network between servers (token ring
comes to mind...), but don't encapsulate this kind of network in a
VPN over the Internet, of course! This means point-to-point leased
lines and would most probably be very uneconomical.

> Now the userserver is gone and group chat lives on a 
> relatively small number of central servers.  It is not handled by the 
> regions.  However, it is still distributed and tricky, and yes has 
> bugs.

Indeed.

> Could the userserver have handled the current load with beefier 
> hardware, in its limited group-chat-only state?  Perhaps.  Could it have 
> handled 75k? 100k?  Probably not.  We really need "infinitely scalable" 
> solutions (ug, sounds like market speak) and solutions that require 
> bigger better hardware eventually cap out (and usually end up costing a 
> lot).  Solutions that just require you to add more hardware don't (as 
> much).  So group chat is now a service distributed across multiple hosts 
> and can be easily distributed across more.  I'm sure I don't have to 
> enumerate the short comings of the current design, we know it isn't 
> perfect and we do actively discuss other solutions .....

OK, now consider this solution: you could have kept your central server
for the group chats, but instead of letting clients connect directly to
it, you would have used concentrators. By "concentrator", I mean a simple
computer (it may even be very cheap and not very powerful) which acts
as a frontend for the main server. Add a load balancer between the
Internet (the users) and the concentrators, and connect your
concentrators to the server via a private network.
Now, your server only has to deal with a number of connections equal to
the number of concentrators, and you can easily scale up your
architecture by simply adding more concentrators. With 10 concentrators
(about 10k connections each), you could probably handle as many as
100k simultaneous live connections.
Even better: you can delegate some of the pre-processing and/or
post-processing work to the concentrators, so as to lighten the load
on your central server even more.
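
To make the idea more concrete, here is a rough sketch in Python. It is
only an illustration under my own assumptions, not LL's code: all the
class and method names are made up, plain Python objects stand in for
the sockets and the private network, and the load balancer is a naive
round-robin. The point it shows is that the central server only ever
holds one link per concentrator, whatever the number of users.

```python
from collections import defaultdict


class CentralServer:
    """Hypothetical central group chat server: it only talks to concentrators."""

    def __init__(self):
        self.concentrators = []                # one upstream link per concentrator
        self.group_members = defaultdict(set)  # group name -> set of user names
        self.user_home = {}                    # user name -> concentrator holding the user

    def attach(self, concentrator):
        self.concentrators.append(concentrator)

    def register(self, user, concentrator):
        self.user_home[user] = concentrator

    def join(self, group, user):
        self.group_members[group].add(user)

    def post(self, group, sender, text):
        # Batch the recipients per concentrator, so that the server emits
        # one message per concentrator instead of one per user.
        batches = defaultdict(list)
        for user in self.group_members[group]:
            home = self.user_home.get(user)
            if home is not None:
                batches[home].append(user)
        for concentrator, users in batches.items():
            concentrator.deliver(users, group, sender, text)


class Concentrator:
    """Hypothetical frontend: holds the user connections, relays to the server."""

    def __init__(self, name, server):
        self.name = name
        self.server = server
        self.inboxes = {}  # user name -> received messages (stands in for a socket)
        server.attach(self)

    def connect(self, user):
        self.inboxes[user] = []
        self.server.register(user, self)

    def send(self, group, sender, text):
        # Example of pre-processing delegated to the concentrator:
        # trim the text and drop empty messages before they reach the server.
        text = text.strip()
        if text:
            self.server.post(group, sender, text)

    def deliver(self, users, group, sender, text):
        # Fan-out to the individual users happens here, not on the server.
        for user in users:
            if user in self.inboxes:
                self.inboxes[user].append((group, sender, text))


class LoadBalancer:
    """Naive round-robin balancer (an assumption; any policy would do)."""

    def __init__(self, concentrators):
        self.concentrators = concentrators
        self.next_index = 0

    def connect(self, user):
        chosen = self.concentrators[self.next_index % len(self.concentrators)]
        self.next_index += 1
        chosen.connect(user)
        return chosen
```

With 10 Concentrator objects behind the LoadBalancer, the CentralServer
keeps 10 entries in its concentrators list whether 1,000 or 100,000
users are connected; only the inboxes on the concentrators grow.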

I bet this kind of architecture would have worked without a glitch
since day one...

> As for jabber it is definitely something we would like to investigate 
> and potentially move to.   As part of the transition of group chat the 
> methods for group chat were abstracted out to a clean API that should 
> make such investigation and possible transition much easier than would 
> have otherwise been the case.   Aside from the spare time of a few 
> engineers, I don't think this is currently an active project or on the 
> roadmap for the immediate future.

Jabber and IRC are indeed things to look closely at, to see what solutions
have already been found to some of the problems you encounter in SL.

Yet, these are in no way complete and perfectly fitting solutions:
the concept of SL-like "groups" does not really exist in either IRC or
Jabber:
IRC has chatrooms, but a chatroom doesn't have a fixed list of
users and won't initiate comms to the users when traffic arises.
IRC doesn't have friends lists either.
Jabber may have friends lists and you can open a group chat too, but
there are still many differences between these group chats and the
groups in SL.


Finally, there is something you could probably do to at least make
group chat a tad more reliable: timestamp each message at the
server level (of course, the servers' clocks will have to be properly
synced: NTP is your friend) and use the timestamps in the viewer to
dynamically reorder the messages when needed (swap the lines in the
group IM window, and highlight the out-of-order message so that the
user notices it), and to ignore late messages when they arrive after
the session was closed by the user (this would avoid having group chats
reopening spuriously, which is a cause for many spam complaints in
groups).
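
A minimal sketch of what the viewer side could look like, in Python.
Again, this is only an illustration: the GroupChatWindow and ChatLine
names are hypothetical, and I assume that the time at which the user
closes the session is expressed on the same (NTP-synced) clock as the
server timestamps, so that both can be compared.

```python
import bisect
from dataclasses import dataclass, field


@dataclass(order=True)
class ChatLine:
    timestamp: float                      # server-side timestamp (only field compared)
    sender: str = field(compare=False)
    text: str = field(compare=False)
    out_of_order: bool = field(default=False, compare=False)  # to be highlighted


class GroupChatWindow:
    """Hypothetical viewer-side group IM session."""

    def __init__(self):
        self.lines = []        # ChatLine objects, kept sorted by timestamp
        self.closed_at = None  # close time, assumed comparable to server timestamps

    def close(self, now):
        self.closed_at = now
        self.lines.clear()

    def receive(self, timestamp, sender, text):
        """Return True if the message is displayed, False if it is ignored."""
        if self.closed_at is not None:
            if timestamp <= self.closed_at:
                # Straggler sent before the user closed the session:
                # ignore it instead of reopening the window.
                return False
            self.closed_at = None  # genuinely new traffic reopens the session
        # A message older than the newest displayed line arrived late.
        late = bool(self.lines) and timestamp < self.lines[-1].timestamp
        # insort keeps the list sorted; equal timestamps keep arrival order.
        bisect.insort(self.lines, ChatLine(timestamp, sender, text, late))
        return True
```

A message flagged with out_of_order is the one the window would
highlight after inserting it at its proper place.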

> I wasn't party to all of the decisions and designs made, I hope I have 
> all my facts straight.  I did work on the original distributed messaging 
> project (1.9), but not the group chat one.

I want to thank you for all your explanations. It is a good thing that
people like you do care about explaining and sharing. This list should
be used more often as a "brainstorming" place (this has been the case
for a few topics already, and I hope it will be the case for more in the
future).

Regards,

Henri.