[opensource-dev] SL server issues in the past 4 months

Henri Beauchamp sldev at free.fr
Mon Apr 8 06:27:00 PDT 2019


Greetings,

I'm writing this email because I'm getting tired of having to pile up
workarounds in my viewer code for all the SL server issues that have
appeared in succession, every few weeks, since mid-December, and I
think it is time for you folks at LL to seriously review the latest
changes you made server-side.

1.- Bogus attachment kill messages on region change.

This issue appeared in mid-December 2018 and results in derezzed (but still
active) attachments on region change (sim border crossings and TPs alike).

I instrumented my viewer code so as to be able to trace the problem. Here is
how to witness it occurring in real time:

Get the Cool VL Viewer, disable the workaround for lost attachments
(un-check Advanced -> Network -> Ignore bogus kill-attachment messages),
enable the "Attachments" debug tag (from Advanced -> Consoles -> Debug
tags) and the debug console, then TP and/or cross region boundaries.
Then just watch the debug messages: they let you track every event
dealing with your attachments, thanks to the special debug code I added
to understand and fix this issue.

You will see that, often (but not always), the departure sim sends
"kill object" messages for your attachments (which causes them to de-rez
in your viewer) while you are already in the arrival region; in this case,
the arrival region usually sends a re-parenting message for your
attachments (they get parented to your avatar again and thus re-rez).
Sadly, the re-parenting message is often not received at all, or is
incomplete, or is even received before the kill message from the departure
region (race condition), causing some attachments not to re-rez in your
viewer.
Interestingly, the arrival sim always gets the correct, full list of your
attachments: you will notice that, even when not rezzed, they are still
active (their scripts still work). This is also why your attachments often
reappear when you then TP or cross a boundary to yet another sim (the
region still transmits them correctly to the next sim).

To make things even more complicated, the bake server (which keeps a copy of
the COF) seems to receive the kill and re-parent messages from the sims as
well, and therefore reflects the same bad state for your attachments in its
copy of the COF; this is why my workaround for that bug (which simply
ignores the bogus kill object messages sent from the departure sim) also
triggers a full COF resync (wearables + attachments) and a rebake.
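
The logic of that workaround can be sketched roughly as follows. This is
only an illustration in Python, not the actual Cool VL Viewer code: the
KillFilter class, its fields and the region names are all hypothetical,
and the test used to decide that a kill message is "bogus" (it targets one
of the agent's attachments and comes from a sim that is not the agent's
current region) is an assumption based on the description above.

```python
from dataclasses import dataclass, field


@dataclass
class KillFilter:
    """Hypothetical filter for incoming "kill object" messages."""
    agent_region: str            # region the agent is currently in
    attachment_ids: set          # ids of objects attached to the agent
    resync_requested: bool = False
    ignored: list = field(default_factory=list)

    def should_apply_kill(self, sender_region: str, object_id: str) -> bool:
        """Return False when the kill message must be ignored."""
        is_attachment = object_id in self.attachment_ids
        from_other_sim = sender_region != self.agent_region
        if is_attachment and from_other_sim:
            # Assumed "bogus" case: the departure sim kills an attachment
            # while the agent is already in the arrival region.
            self.ignored.append(object_id)
            # Flag that a full COF resync and a rebake should follow.
            self.resync_requested = True
            return False
        return True


f = KillFilter(agent_region="arrival", attachment_ids={"att-1"})
print(f.should_apply_kill("departure", "att-1"))   # ignored kill
print(f.should_apply_kill("departure", "tree-9"))  # ordinary object
print(f.resync_requested)
```

Kill messages for ordinary in-world objects, and kills sent by the agent's
own region, are still applied; only the cross-region attachment case is
dropped.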

What I do not understand in the first place is why the Hell the departure
sim sends kill object messages at all to the departing avatar for its
attachments! The attachments follow the avatar and therefore do not change
parent. Even their position, being relative to the parent avatar, does not
change. Same thing for the bake server, which apparently receives the same
messages while it should not: the avatar outfit did not change at all, and
even if a scripted object got detached on arrival as a result of a
scripted changed() LSL event, that event would still occur in the arrival
sim and be sent from it to both the viewer and the bake server. The departure
sim should never have anything to do with objects attached to the avatar!


2.- Bogus rebakes with bad body textures.

For about one week now, the above workaround (which worked fine for almost
4 months) has been partly defeated, because the rebake it triggers gets the
bake server to return bogus textures for the body parts (sometimes a layer,
such as a tattoo, is missing; other times it is as if the avatar were not
wearing any skin texture).
Of course, the user can still rebake manually to get it fixed (well, at least
in my viewer, since I'm not even sure you can still rebake in LL's official
viewer), but this is extremely annoying and pretty much inexplicable to me
(nothing looks wrong viewer-side, just bad baked textures arriving).
So, I was in the process of coding yet another workaround (a double rebake
after the bogus kill-attachment message is ignored) when I decided to write
this email, because this is just getting *ridiculous*!


3.- Failed event polls.

For about 3 weeks, and almost constantly since last week, I have been
getting failed/retried event polls, which never happened before. Here is a
log of one such failed poll:

DEBUG: LLCoreHttpUtil::HttpCoroHandler::onCompleted: Error Http_499 - Cannot access url: https://sim10685.agni.lindenlab.com:12043/cap/287c7dfd-63d1-a74c-6670-17f4f2d1d5c3 - Reason: Malformed response contents
INFO: LLSDXMLParser::parse: XML_STATUS_ERROR parsing:<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
INFO: LLSDXMLParser::parse: XML_STATUS_ERROR parsing:<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
DEBUG: LLCoreHttpUtil::HttpCoroHandler::onCompleted: Returned body:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>502 Proxy Error</title>
</head><body>
<h1>Proxy Error</h1>
<p>The proxy server received an invalid
response from an upstream server.<br />
The proxy server could not handle the request <em><a href="http://localhost:13011/agent/b43c4b76-3816-49ce-933d-e1a4eef3226e/event-get">POST&nbsp;http://localhost:13011/agent/b43c4b76-3816-49ce-933d-e1a4eef3226e/event-get</a></em>.<p>
Reason: <strong>Error reading from remote server</strong></p></p>
</body></html>
WARNING: LLEventPollImpl::eventPollCoro: Event poll <13> Retrying in 65 seconds; error count is now 10

Most of the time, these failed event polls happen for neighbouring sims,
but they also sometimes happen for the agent sim, meaning that, should the
count of failed retries reach 10, the agent gets disconnected.
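
To illustrate why such a reply counts as a failed poll, here is a rough
sketch in Python (not the viewer's actual code). The looks_like_llsd()
helper and the PollState class are hypothetical; the limit of 10 is the one
quoted above, while resetting the counter on a good reply and retrying
forever for neighbouring sims are assumptions made for this sketch.

```python
MAX_ERRORS = 10  # limit mentioned in the message above


def looks_like_llsd(body: str) -> bool:
    """Crude check: an LLSD XML reply starts with an <llsd> element,
    optionally preceded by an XML declaration."""
    text = body.lstrip()
    if text.startswith("<?xml"):
        end = text.find("?>")
        if end == -1:
            return False
        text = text[end + 2:].lstrip()
    return text.startswith("<llsd")


class PollState:
    """Hypothetical per-region event poll bookkeeping."""

    def __init__(self, is_agent_region: bool):
        self.is_agent_region = is_agent_region
        self.error_count = 0

    def on_response(self, body: str) -> str:
        if looks_like_llsd(body):
            self.error_count = 0  # assumption: a good reply resets the count
            return "ok"
        # An HTML proxy error page (as in the log above) lands here.
        self.error_count += 1
        if self.is_agent_region and self.error_count >= MAX_ERRORS:
            return "disconnect"
        return "retry"


proxy_error = ('<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n'
               '<html><head><title>502 Proxy Error</title></head></html>')
poll = PollState(is_agent_region=True)
for _ in range(MAX_ERRORS):
    outcome = poll.on_response(proxy_error)
print(outcome, poll.error_count)
```

The point is simply that the body returned by the proxy is an HTML page
rather than LLSD, so every such reply adds to the error count, and only
the agent sim's poll can end in a disconnection in this sketch.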

Looking at the returned body, you can see a reference to "localhost"
(which I assume is an attempt by the server to access a service running
on itself), and to ports in the 120xx and 130xx ranges, which are UDP
ports... Would it mean that you forgot to remove calls on your servers
that attempt to access UDP services which got shut down on them?


4.- Failed TPs

It has been years since I last saw this many failed TPs resulting in timeouts
and disconnections... I mean, I barely got one in a blue moon for the past
5 years (at least), and I'm now seeing one or several every day!


I hope my observations will help you guys get things back on track
(because it is really badly needed).


Regards,

Henri.



