IMs and group IMs (was Re: [sldev] Frequent bugs with difficult repros)

Henri Beauchamp sldev at free.fr
Sun Mar 2 06:59:57 PST 2008


On Sun, 02 Mar 2008 06:11:37 -0700, Lawson English wrote:

> Henri Beauchamp wrote:
>
> > On Fri, 29 Feb 2008 17:11:05 -0700, Lawson English wrote:
>.../...
> >> As far as I know, the main issues with IM in Second Life have nothing to 
> >> do with UDP and would be just as bad if they were implemented using TCP, 
> >> as long as the same underlying design remains unchanged.
> >>     
> >
> > The IMs and group IMs issue is indeed independent of the protocol. IMs
> > used to work just fine in UDP before LL made one of its most catastrophic
> > mistakes by changing the architecture for IM communications in v1.14.1
> > ( http://blog.secondlife.com/2007/04/02/preview-of-second-life-11411-now-up-on-the-beta-test-grid/ )
> > to try and lighten the load on the servers.
> >
> > Despite a HUGE amount of bug reports and complaints, LL stubbornly kept
> > the changes they made and we are still now, 11 months after the problems
> > were identified, suffering the same out of order IMs, spuriously reopening
> > group sessions, failure to open group sessions and send group notices,
> > failures to set invisible or visible status in friends list (you are marked
> > either visible or not, but your friends see the opposite), failures to load
> > friends or groups lists after a cache emptying, etc, etc, etc...
> >
> > LL, please, STOP this madness and return to the OLD, reliable architecture,
> > simply using better pipes and more powerful servers to keep the IMs and
> > groups CENTRALIZED!
> 
> The concurrency record is now 65K simultaneous users. What was it 11 
> months ago and why do you think that the old way would work better with 
> the current load?

The maximum number of simultaneous users was around 6-8K, IIRC, but as I
said, this is not really a problem as long as you are "using better pipes
and more powerful servers" (or even a cluster of servers).

The point is that trying to run a service which must be perfectly synced
grid-wide (the problems seen are obviously synchronization problems) in a
decentralized way, using non-deterministic links to communicate between
nodes (IP is non-deterministic as far as transmission delays are
concerned), is simply flawed by design and will never work properly,
unless you deliberately slow everything down to a crawl (queuing messages
for a minimum time, hoping this time will be long enough to wait for all
the out-of-order messages and reorder them).
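
To illustrate the trade-off, here is a minimal sketch of such a "hold and
reorder" queue. It is NOT how LL's servers work: the HoldBackQueue class,
the replay() helper, the per-message sequence numbers and the millisecond
timings are all hypothetical, made up for the example.

```python
class HoldBackQueue:
    """Hold each message for hold_ms, then release it in sequence order.

    Hypothetical illustration: assumes every message carries a sequence
    number assigned by its sender.
    """

    def __init__(self, hold_ms):
        self.hold_ms = hold_ms
        self._pending = []  # entries: (seq, arrival_ms, text)

    def receive(self, seq, text, now_ms):
        self._pending.append((seq, now_ms, text))

    def release(self, now_ms):
        # Sequence numbers of messages that have waited long enough.
        expired = [seq for seq, arrived, _ in self._pending
                   if now_ms - arrived >= self.hold_ms]
        if not expired:
            return []
        cutoff = max(expired)
        # Anything buffered with a lower sequence number goes out first.
        ready = sorted(m for m in self._pending if m[0] <= cutoff)
        self._pending = [m for m in self._pending if m[0] > cutoff]
        return [(seq, text) for seq, _, text in ready]


def replay(hold_ms, arrivals, step_ms=100):
    """Replay (arrival_ms, seq, text) events; return the delivery order.

    Arrival times are assumed to be multiples of step_ms.
    """
    queue = HoldBackQueue(hold_ms)
    delivered = []
    end = max(t for t, _, _ in arrivals) + hold_ms
    for now in range(0, end + step_ms, step_ms):
        for t, seq, text in arrivals:
            if t == now:
                queue.receive(seq, text, now)
        delivered.extend(seq for seq, _ in queue.release(now))
    return delivered


# Message 2 overtakes message 1 on the network by 300 ms.
arrivals = [(0, 2, "world"), (300, 1, "hello")]
long_hold = replay(1000, arrivals)   # [1, 2]: in order, but 1 s of added lag
short_hold = replay(200, arrivals)   # [2, 1]: fast, but still out of order
```

The hold time has to exceed the worst-case delay difference between any
two nodes, which on IP links is not bounded: either everybody pays the
latency, or some messages still show up out of order.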

The proof that the old way would work better (with proportionally bigger
servers and links) is that it did work perfectly before v1.14.1 was
introduced (i.e. when the load was the same for the new and the old system),
unlike the new way, which has failed lamentably since day one.

In any case, there is a sane principle in computing: when a change to a
piece of software breaks it, you must immediately revert the change
and work more on it before trying to introduce it again, instead of
stubbornly trying to tinker with the new code on the production system.
This is a principle which is alas very far from being a golden rule at LL.

LL has now been trying, and failing, for 11 months: it is about time
they admitted their mistake and dealt with the consequences by revising the
design of the IMs and groups service. The current design might (more or
less) work on LL's own local network, but I can assure you it's an
everyday nightmare for us overseas users!

Not to mention that the IMs and groups service should be designed to scale
with the needs of the users as well as with their numbers: 25 groups
per user is now obviously too few, and LL should seriously consider
increasing this number like they did in the past (going from 15 to 25
groups).

Henri.

