[sldev] Grid redundancy in face of disaster (Re: Navigations and Landmark Project)

Joshua Bell josh at lindenlab.com
Wed Apr 23 09:13:13 PDT 2008


Soft wrote:
> Any simulator (machine) on the simulator VPN, with the correct
> certificates, and with the correct simulator binary can bring up any
> region (island/"sim"). 
Modulo different hardware classes (and simulator-to-CPU ratio), as Jason 
pointed out in a fork of this thread. That's also easily tweaked in our 
management tool, but pricing for the host capabilities (i.e. Quality of 
Service) comes into it.
> Pragmatically, we limit machines to specific
> data centers. This is done because adjacent regions chatter back and
> forth considerably and there's no sense in sending that traffic across
> the country. 
>   
Specifically, it makes region crossings much less prone to 
rubber-banding since the handoff is faster.
> The address move project that I believe you're thinking about was for
> changing simulators' IP addresses. That's a bit more touchy
This was problematic because, prior to 1.21 Server (which we'll deploy 
some day, I swear...), the redundancy worked like this: sim hosts whose 
simulators weren't hosting regions would poll the central database 
asking "got anything for me to run?" - we call these "spares". Too many 
spares and you have too many machines polling the database. Too few 
spares (or too infrequent polling) and you don't have enough reserve 
capacity to handle the demand when you shut down a few hundred sim 
hosts for maintenance work.
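
The trade-off above can be put in rough numbers. The sketch below is a 
toy model, not the actual grid code: the function name, its parameters 
and the assumption that all spares poll on the same tick are 
hypothetical. It counts how many database queries idle spares generate 
and how many displaced regions are still waiting for a host.

```python
def simulate_polling(num_spares, poll_interval_s, displaced_regions, duration_s):
    """Toy model of spares polling a central database (hypothetical).

    Returns (db_queries, regions_still_waiting) after duration_s seconds.
    """
    spares = num_spares
    pending = displaced_regions
    queries = 0
    ticks = int(duration_s // poll_interval_s)
    for _ in range(ticks):
        queries += spares                # every idle spare asks the database once
        claimed = min(spares, pending)   # each claiming spare takes one region
        spares -= claimed                # a host running a region is no longer a spare
        pending -= claimed
    return queries, pending


if __name__ == "__main__":
    # Plenty of spares: regions recover at once, but the database is hammered.
    print(simulate_polling(1000, 10, 50, 60))   # (5750, 0)
    # Too few spares: the database is quiet, but 30 regions stay down.
    print(simulate_polling(20, 10, 50, 60))     # (20, 30)
```

Stretching the poll interval cuts the query count, but it also delays 
the moment a spare notices there is work to do.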

The 1.21 Server update replaces this polling with a service based around 
an HTTP "long poll", so we can in theory have an arbitrary number of 
spares, and the glorious future of "half the grid dies, but the grid 
just routes around the problem" has arrived. Turning this service on 
will be done slowly with lots of monitoring and will happen after the 
1.21 rollout itself, to avoid fireworks.
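
For readers unfamiliar with the term, here is a minimal sketch of an 
HTTP long poll using only the Python standard library. It is not the 
1.21 service: the /assignment path, the JSON shape, the in-memory queue 
and every name below are assumptions for illustration. The server holds 
each request open until there is a region to hand out or a hold time 
expires, so an idle spare costs one parked connection instead of a 
stream of repeated queries.

```python
import http.client
import json
import queue
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical in-memory stand-in for "regions that need a host".
assignments = queue.Queue()
HOLD_SECONDS = 2.0  # how long the server parks a request before replying "nothing yet"


class LongPollHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Block until work arrives or the hold time runs out.
        try:
            region = assignments.get(timeout=HOLD_SECONDS)
        except queue.Empty:
            region = None
        body = json.dumps({"region": region}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def wait_for_region(host, port, timeout_s=HOLD_SECONDS + 5):
    """One long-poll request from a spare; returns a region name or None."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout_s)
    try:
        conn.request("GET", "/assignment")
        return json.loads(conn.getresponse().read())["region"]
    finally:
        conn.close()


def demo():
    server = ThreadingHTTPServer(("127.0.0.1", 0), LongPollHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address[:2]
    # Work shows up 0.2 s after the spare has started waiting.
    threading.Timer(0.2, assignments.put, args=("Region A",)).start()
    try:
        return wait_for_region(host, port)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

A real spare would loop, issuing a fresh request whenever the previous 
one returns with no region.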

Joshua


