[sldev] Grid redundancy in face of disaster (Re: Navigations and Landmark Project)

Dale Mahalko dmahalko at gmail.com
Tue Apr 22 08:51:20 PDT 2008


On Tue, Apr 22, 2008 at 12:41 AM, SignpostMarv Martin
<me at signpostmarv.name> wrote:
> Storing all your eggs in one basket is a very, very bad idea.
>
> Do take into account that a sizeable chunk of Second Life is not too far
> from a major fault line.
>
> Decentralisation FTW!


Technically the LL grid is already capable of being decentralized and
redundant to survive such regional catastrophes. Redundant sim
distribution and sparing is doable, using the base architecture that
LL already has in place now.

Sims are currently assigned to run only from a specific machine in a
specific colo at a specific network address, but they do not have to
be like that. Any server is capable of running any simstate, and if
storage of simstates is decentralized across the country then a sim can
be run from any colo if a recent copy of the simstate is available.
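
As a rough illustration of "a recent copy of the simstate is
available", here is a minimal Python sketch of picking the newest
usable copy among the colos that are still up. The function name, the
(colo, snapshot_time) tuples and the age limit are all made up for the
example. I don't know how LL actually stores or timestamps simstates.

```python
def freshest_copy(replicas, live_colos, max_age_s, now):
    """Return the newest usable (colo, snapshot_time) pair, or None.

    replicas:   hypothetical list of (colo_name, snapshot_time) tuples
                describing where copies of one sim's state are stored.
    live_colos: set of colo names that are still reachable.
    max_age_s:  oldest snapshot age, in seconds, we are willing to restart from.
    now:        current time in the same units as snapshot_time.
    """
    usable = [
        (colo, taken)
        for colo, taken in replicas
        if colo in live_colos and now - taken <= max_age_s
    ]
    # Newest snapshot wins; None if no surviving colo has a recent enough copy.
    return max(usable, key=lambda pair: pair[1], default=None)
```

If the California colo is gone and Texas holds a copy that is a minute
old, the sim restarts from the Texas copy. If no surviving colo has a
copy newer than the limit, the function returns None and somebody has
to decide whether an older state is acceptable.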

If powered-down reserve servers are permitted to exist in colos around
the country, a local earthquake, fire, flood, hurricane, etc, need not
take out large portions of SL. If LL's main sim facility goes down in
California due to a quake, start all those sims up again in Texas and
Chicago and Miami on the reserve servers.

About an hour after a major disaster, clients need only be told to
reconnect to the other colos to find those same sims again, while the
main LL server facility might be going up in flames or whatever.



The core grid systems are potentially capable of this already, but
right now I believe moving sims to run on different servers is a manual
process (recall the recent address move project that took a few days).

A properly redundant grid would need an automated "sim concurrency"
mechanism able to monitor the colo facilities, and to quickly boot up
a recent simstate on a new server and rapidly inform all clients to
use that newly assigned network address.
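
To make the idea concrete, here is a minimal Python sketch of the
reassignment step such a mechanism might perform once a colo has been
declared dead. Everything in it is hypothetical: the Colo and Sim
classes, the fail_over function, and the "directory" dict standing in
for whatever service tells clients which address to connect to. It
leaves out the hard parts (detecting the failure reliably and loading
the simstate onto the new server).

```python
from dataclasses import dataclass, field


@dataclass
class Colo:
    name: str
    alive: bool = True
    # Addresses of powered-down reserve servers available at this colo.
    spare_servers: list = field(default_factory=list)


@dataclass
class Sim:
    name: str
    colo: str
    address: str


def fail_over(sims, colos, directory):
    """Move every sim hosted in a dead colo onto a spare server elsewhere.

    colos maps colo name -> Colo. directory maps sim name -> address and
    is updated so clients can be pointed at the new location.
    Returns the names of the sims that were moved.
    """
    moved = []
    for sim in sims:
        if colos[sim.colo].alive:
            continue  # sim is still running where it was
        # First live colo that still has a reserve server free.
        target = next(
            (c for c in colos.values() if c.alive and c.spare_servers), None
        )
        if target is None:
            break  # reserve capacity exhausted; remaining sims stay down
        sim.address = target.spare_servers.pop(0)
        sim.colo = target.name
        directory[sim.name] = sim.address
        moved.append(sim.name)
    return moved
```

Note that the sketch simply stops when the spares run out, which is
why the number of reserve servers at each colo matters: it decides how
much of a lost facility can come back.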

Also, I don't know whether the current asset system design distributes
simstates across the country to asset servers at other colos. That
would be needed to permit a hot restart of sims whose servers suddenly
went down permanently, possibly thousands of miles away.

But with the current grid architecture, it is definitely possible to
work around and tolerate regional server facility disasters quickly,
without severely impacting the sims that were running from the
destroyed facility.

It mainly requires an investment in reserve server racks at each colo,
wide distribution of simstates across regional asset servers at each
colo, and a distributed grid management and monitoring system.

-Scalar Tardis / Dale Mahalko

