Stackless Python

Talk among developers, and propose and discuss general development planning, features, and so on in this forum.
pheonixstorm
Elite
Posts: 1567
Joined: Tue Jan 26, 2010 2:03 am

Stackless Python

Post by pheonixstorm »

For now this is a placeholder for EVE data I had to dig up. Discuss the possibilities for the VS client and server if you wish. I would like to see VS using Stackless eventually, maybe along with threads in 3.2 (does 2.7 have threads?). And if PyPy ever gets Stackless support in, I would like to also use PyPy on the server end, maybe the client as well, since it's a JIT Python compiler and should increase our Python speeds. Anyway, here's what I was after.
CarbonIO and BlueNet: Next Level Network Technology
Reported by CCP Zulu | 2011.06.20 20:27:35

Most folks familiar with EVE know that it's built on Python, Stackless Python to be specific. Stackless is a micro-threading framework built on top of Python allowing for millions of threads to be live, without a lot of additional cost from just being Python. It is still Python and that means dealing with the Global Interpreter Lock (hereafter known as the damn stinkin' GIL, or just GIL for short).

The GIL is a serial lock that makes sure only a single thread of execution can access the Python interpreter (and all of its data) at a time. So as much as Stackless Python feels like a multi-threaded implementation, with the trappings of task separation, channels, schedulers, shared memory and such, it can only run a single task at any given time. This is not unlike some of the co-operative multitasking operating systems of the past. There is power in this model: the framework promises all executing threads that they will not be preemptively ended by the system (apart from when the system suspects a micro-thread of being in violation of the Halting problem). This allows game logic code to make a lot of assumptions about global program state, sparing us the complexity of writing the acres of game logic that make up EVE Online in a callback-driven, asynchronous manner. In fact most of the high-level logic in CCP products looks naively, procedurally synchronous, which helps a lot with the speed of development and maintainability of the code.
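To make that concrete, here is a minimal sketch of the cooperative style, assuming the standard Stackless Python API (stackless.tasklet, stackless.channel). It is only an illustration, not CCP code:

import stackless

def producer(ch):
    # Looks like plain synchronous code: send() blocks only this tasklet,
    # and the scheduler runs other tasklets until a receiver is ready.
    for i in range(3):
        ch.send("packet %d" % i)

def consumer(ch):
    for _ in range(3):
        data = ch.receive()  # blocks cooperatively, never preempted
        print("got", data)

ch = stackless.channel()
stackless.tasklet(producer)(ch)
stackless.tasklet(consumer)(ch)
stackless.run()  # run the scheduler until all tasklets have finished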


The down side of all of this is that some of the framework code in EVE Online is written in Python and is thus victim to the GIL. This includes any C-level module that wants to access Python data: the GIL must be acquired before it so much as looks at a string.


All tasks working within Python must acquire the singular GIL to legally process, effectively limiting all Python to a single thread of execution.
(please excuse the primitive graph, I'm a programmer not an artist)


Bottom line: Code written in Stackless Python can only execute as fast as your fastest CPU core can go. A 4- or 8-CPU big-iron server will burn up a single CPU and leave the others idle, unless we spin up more nodes to harness the other CPUs. That works well for the large portion of EVE's logic which is stateless or only mildly reliant on shared state, but it presents challenges for logic which is heavily reliant on shared state, such as space simulation and walking around space stations.


This is not a problem as long as a single CPU can handle your load, and for sane uses of Python this is true. I shouldn't have to tell you that we are hitting the point at which a single CPU can't process a large fleet fight, despite the work of teams like Gridlock in squeezing out every drop of optimization that they can. CPUs are not getting any faster. They are getting more cache, fatter busses and better pipelines, but not really any faster at doing what EVE needs done. The way of the near (and perhaps far) future is to 'go wide' and run on multiple CPUs simultaneously.


Overall, the proliferation of multi-core CPUs is good for EVE in the long term: the basic clustering method that EVE uses will keep working well even for computers with more than 30-60 cores, the point at which routing between cores eclipses the benefit of letting a process's threads flow between them. But right now we are in the spot where the framework parts of EVE, such as networking and general IO, could benefit a lot from being liberated from the GIL.


Multi-core superscalar hardware is good news for modern MMOs, which are well-suited to this paradigm and trivially parallelizable, but not-so-good news for Python-dependent EVE. It's not news: now more than ever, performance-critical EVE systems which do not need the benefits of rapid development and iteration in Python need to decouple themselves from the GIL, and soon. CarbonIO is a giant leap in that direction.


CarbonIO is the natural evolution of StacklessIO. Under the hood it is a completely new engine, written from scratch with one overriding goal tattooed on its forehead: marshal network traffic off the GIL and allow any C++ code to do the same. That second part is the big deal, and it took the better part of a year to make it happen.


Backing up for just one minute: StacklessIO (http://www.eveonline.com/devblog.asp?a=blog&bid=584) was a quantum leap forward for Stackless Python networking. By making network operations "Stackless aware", a blocking operation is offloaded to a non-GIL-locked thread, which completes its wait while Stackless continues processing and then re-acquires the GIL to notify Stackless that the operation has completed. This lets the receive run in parallel, allowing communications to flow at OS-level speeds and be fed into Python as fast as they can be consumed.
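The general shape of that trick, in plain Python rather than the real StacklessIO internals (the names below are made up for illustration): the blocking wait runs on a worker thread, which drops the GIL while it sits in recv(), and a completion callback fires once the data is ready.

import concurrent.futures

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def recv_async(sock, nbytes, on_complete):
    # The blocking recv() happens off the main thread; the interpreter lock
    # is released while the worker waits in the OS call.
    future = _pool.submit(sock.recv, nbytes)
    # When the wait finishes, hand the received bytes back to the game logic.
    future.add_done_callback(lambda f: on_complete(f.result()))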


StacklessIO completes Python requests without holding the GIL


CarbonIO takes that a step further. By running a multi-threaded communications engine entirely outside the GIL, the interactivity between Python and the system is decoupled completely. It can send and receive without Python asking it to.


That bears repeating: CarbonIO can send and receive data without Python asking it to. Concurrently. Without the GIL.


When a connection is established through CarbonIO a WSARecv() call is immediately issued and data begins to accumulate. That data is decrypted, uncompressed, and parsed into packets by a pool of threads that operate concurrently to any Python processing. These packets are then queued up and wait for Python to ask for them.
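In rough Python terms (a sketch of the pattern only; the real engine is C++ on top of I/O completion ports), the receive side behaves like a free-running reader feeding a queue that Python drains at its leisure:

import queue
import threading

ready_packets = queue.Queue()  # packets waiting for Python to ask for them

def reader_loop(conn):
    # Runs on a worker thread: the blocking recv() releases the GIL, so this
    # loop accumulates data concurrently with whatever Python is doing.
    while True:
        raw = conn.recv(4096)
        if not raw:
            break
        # In the real engine, decryption, decompression and packet framing
        # happen here, still off the GIL.
        ready_packets.put(raw)

def start_reader(conn):
    threading.Thread(target=reader_loop, args=(conn,), daemon=True).start()

def next_packet():
    # "Python asking for a packet": usually the data is already queued,
    # so this returns immediately.
    return ready_packets.get()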


When Python decides it wants a packet, it calls down to CarbonIO, which probably already has the data ready. That means data is popped off the queue and returned without ever yielding the GIL. It's effectively instant, or at least nano-secondly fast. That's the first big gain of CarbonIO, its ability to do parallel read-ahead.


The second big gain is on sends. Data is queued to a worker thread in its raw form, and then Python is sent on its way. The compression, encryption, packetization and the actual WSASend() call all occur off the GIL in another thread, which allows the OS to schedule it concurrently on another CPU. A method is provided to allow C++ code to do the same. That requires no special architectural overhaul; StacklessIO could have done it too, but without the whole picture it would have been pointless.
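The send side is the mirror image; again a plain-Python sketch of the pattern rather than CarbonIO itself:

import queue
import threading
import zlib

outgoing = queue.Queue()

def send(payload):
    # Python's only job: drop the raw payload on the queue and move on.
    outgoing.put(payload)

def sender_loop(conn):
    # Worker thread: compression (and, in the real engine, encryption and
    # packetization) plus the actual socket send all happen off the GIL.
    while True:
        payload = outgoing.get()
        conn.sendall(zlib.compress(payload))

def start_sender(conn):
    threading.Thread(target=sender_loop, args=(conn,), daemon=True).start()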

Now back up a minute. "Already has the data ready." Hmm. What if we were to install a C++ callback hook that allowed a non-Python module to get that data without ever touching Machonet? Welcome to BlueNet.


CarbonIO runs its recv-loop continuously, and can notify a C++ module of arriving data without Python intervention.

Machonet is a large collection of functionality which marshals, routes, manages sessions, handles packet scheduling/delivery and all the other glue that holds EVE together. It is a Python module, so all data must pass through the cursed GIL at some point, on every EVE node. No matter how fast a C++ module is, it is still beholden to that bottleneck. This discourages a lot of potential C++ optimizations, since any gain would be chewed up acquiring the GIL and thunking into Machonet.

But not anymore.

Now a C++ system can receive and send packets through BlueNet and never care about or need to acquire the GIL. This was originally coded for Incarna, which needs to send a significantly higher volume of movement packets to support a walk-around MMO. In space you can cheat, but not so with close-up anthropomorphic motion. Early projections showed that even with a modest tick-rate, Incarna would bring the Tranquility cluster to its knees. BlueNet solves that problem by routing traffic in parallel with the GIL, to and from C++-native systems (like Physics). This is faster because the data stays in its native, bare-metal form and is not double-thunked for every operation: a huge savings.


How does this work? BlueNet maintains a read-only copy of all necessary Machonet structures; in addition, a very small (8-10 byte) out-of-band header is prepended to all packets. This header contains routing information. When a packet arrives, BlueNet can inspect this out-of-band data and sensibly route it, either by forwarding it to another node or delivering it to a registered C++ application locally. Forwarding happens in place, off the GIL; Machonet/Python is never invoked. This means our proxies can route BlueNet packets entirely in parallel, without the overhead of thunking to Python or de-pickling/re-pickling the data. How effective is this? We're not sure yet, but it's somewhere between "unbelievably stunning" and "impossibly amazing" reductions in lag and machine load for systems where this is applicable. Seriously, we can't publish findings yet; we don't believe them ourselves.
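As a sketch of the idea (the real header layout is CCP's; the field names and sizes below are invented for illustration), a tiny fixed-size routing header is enough to decide a packet's fate without ever parsing the Python payload:

import struct

# Hypothetical 10-byte out-of-band header: destination node, source node,
# service id -- just enough to route without touching the pickled payload.
HEADER = struct.Struct("!IIH")  # 4 + 4 + 2 = 10 bytes

def route(packet, local_node_id, local_handlers, forward_to_node, fallback):
    dest, src, service = HEADER.unpack_from(packet, 0)
    if dest != local_node_id:
        # Proxy case: forward in place, off the GIL; Machonet is never invoked.
        forward_to_node(dest, packet)
    elif service in local_handlers:
        # Deliver the payload to a locally registered (C++-style) handler.
        local_handlers[service](packet[HEADER.size:])
    else:
        # Anything unrecognized falls back to the normal Python/Machonet path.
        fallback(packet)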


In addition, CarbonIO has a large number of ground-up optimizations, mostly small gains that add up to faster overall code. A few are worth mentioning:


Work Grouping

It is difficult to be specific, but CarbonIO goes out of its way to "group" work together for processing. In a nutshell, certain operations have a fixed transactional overhead associated with them. Network engines have this in spades, but it applies to all significant programming, really. Through some pretty careful trickery and a special kind of cheating it's possible to group a lot of this work together so many operations can be performed while paying the overhead only once, like grouping logical packets together to be sent out in a single TCP/IP MTU (which EVE has always done). CarbonIO takes this idea several orders deeper. An easy example is GIL acquisition aggregation.


The first thread to try and acquire the GIL establishes a queue, so that any other threads trying to acquire it simply add their wake-up call to the end of the queue and go back to the pool. When the GIL finally does yield, the first thread drains the entire queue, without releasing/re-acquiring the GIL for each and every IO wakeup. On a busy server this happens a lot, and has been a pretty big win.
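An illustrative analogy in pure Python (the real mechanism lives in C, around the GIL itself): completions queue up, and whichever thread happens to hold the expensive lock drains everything queued so far, paying the acquire/release cost once instead of once per wake-up.

import threading
from collections import deque

pending = deque()                 # I/O wake-ups waiting to be delivered
pending_guard = threading.Lock()  # protects the queue itself (cheap)
big_lock = threading.Lock()       # stand-in for the GIL (expensive)

def post_wakeup(callback):
    with pending_guard:
        pending.append(callback)
    # Only one thread pays for the big lock; the others have already queued up.
    if big_lock.acquire(blocking=False):
        try:
            while True:
                with pending_guard:
                    if not pending:
                        break
                    cb = pending.popleft()
                cb()  # deliver the completion under a single acquisition
        finally:
            big_lock.release()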


openSSL integration

CarbonIO implements SSL with OpenSSL and can engage this protocol without locking the GIL. The library is used only as a BIO pair; all the data routing is handled by CarbonIO using completion ports. This will allow us to keep making EVE more and more secure, and even allows some of the account management features of the web site to be moved over to the EVE client for convenience.
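Python's own ssl module exposes the same BIO-pair idea, which makes it a convenient way to illustrate the approach: the engine, not a socket, decides when encrypted bytes move, so the handshake and record processing can be driven from whatever thread the I/O engine likes. A sketch only, not CCP's code:

import ssl

ctx = ssl.create_default_context()
incoming = ssl.MemoryBIO()  # ciphertext arriving from the network is written here
outgoing = ssl.MemoryBIO()  # ciphertext to be sent is read from here
tls = ctx.wrap_bio(incoming, outgoing, server_hostname="example.com")

try:
    tls.do_handshake()       # no peer data yet, so this asks for more
except ssl.SSLWantReadError:
    hello = outgoing.read()  # handshake bytes for the I/O engine to put on the wire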


Compression integration

CarbonIO can compress/decompress with zlib or snappy on a per-packet basis, again independent of the GIL.
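In practice, per-packet compression is just this (zlib is in the standard library; snappy would come from the third-party python-snappy binding), and CPython's zlib can already release the GIL while it works on larger buffers:

import zlib

packet = b"example payload" * 100
wire = zlib.compress(packet, 6)          # in CarbonIO this runs on a worker thread
assert zlib.decompress(wire) == packet   # and the inverse on the receiving side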


Field Trials

Data collected over a 24-hour run of a busy proxy server (~1600 peak users, a typical weekday) showed the overall CPU% usage drop dramatically, as well as the per-user CPU%. This reflects gains from the overall paradigm of CarbonIO, which is to reduce transactional overhead. As the server becomes busy, the effectiveness of these optimizations begins to be dominated by the sheer number of transactions that must be performed, but at peak load it still showed a substantial gain.


CPU% per user over a 24-hour run

Raw CPU% used over the same 24-hour period


Sol nodes benefit much less from these modifications since their main job is game mechanics rather than network activity, but we still saw measurable improvement in the 8%-10% range.


It is important to note that none of the BlueNet options were engaged in these trials: no C-routing, no off-GIL compression/encryption. These were meant to be in-place, process-alike tests which were fearlessly attempted on live loads to 'proof' the code once we had thrashed it as much as we could in closed testing. The gains we see here are gravy compared to the performance gains we will see on modules that use the new functionality.


Bottom Line

What this all means is that the EVE server is now better positioned to take advantage of modern server hardware, keeping it busier and doing more work, which in turn pushes the upper limit on single-system activity. By moving as much code away from the GIL as possible, it leaves more room for the code that must use it. In addition, since less code is competing for the GIL, the overall system efficiency goes up. With the addition of BlueNet and some very optimal code paths, the door is now open. How far it swings and what gains we end up getting remains to be seen, but at the very least we can say a major bottleneck has been removed.
Because of YOU Arbiter, MY kids? can't get enough gas. OR NIPPLE! How does that mkae you feeeel? ~ Halo
pheonixstorm
Elite
Posts: 1567
Joined: Tue Jan 26, 2010 2:03 am

Re: Stackless Python

Post by pheonixstorm »

This next post covers an older technology that was replaced by the above: StacklessIO.
StacklessIO

For the past two years we have been developing new technology, called StacklessIO, to increase the performance of our network communication infrastructure in EVE. This new network layer reduces network latency and improves performance in high-volume situations, e.g., in fleet-fights and market hubs such as Jita.

On 16 September we successfully deployed StacklessIO to Tranquility. We noticed an astounding, yet expected, measurable difference.

Normally Jita reaches a maximum of about 800-900 pilots on Sundays. On the Friday following the deployment of StacklessIO there were close to 1,000 concurrent pilots in Jita and on Saturday the maximum number reached 1,400. This is more than have ever been in Jita at the same time. Jita could become rather unresponsive at 800-900 pilots but on Sunday it was quite playable and very responsive with 800 pilots. It should continue to be snappier and more responsive in the future.

The Measurements

This spring we saw the fruits of our R&D work when we deployed StacklessIO to Singularity and began measuring the difference.

Confirming a suspicion we had had for a long time, the Core Server Group team, led by CCP porkbelly, proved that StacklessIO vastly outperformed the old network technology. They also demonstrated that the old technology could sometimes, under extreme lab conditions, delay network packets in an arbitrary manner for a significant amount of time.

Later, CCP Atlas of the EVE Software Group showed that those symptoms also happened in the wild with the old technology, although on a smaller scale: network responses to client requests could in some cases be delayed for a few minutes on highly loaded nodes in the cluster. In particular, we measured client network communication to the node that hosts Jita.

Since the client and server clocks are synchronised, we called a remote service on the server from the client, the server responded with the global time, and we measured the server and client deltas. We also called that same service directly on the server node to measure the service call's processing time, which turned out to be negligible.
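Roughly, the measurement looked like this (hypothetical names; the real call goes through EVE's RPC layer): with synchronised clocks, the difference between when the client issued the call and the clock reading the server stamped on its reply approximates how long the request took to reach the server-side service.

import time

def measure_request_delay(call_remote_get_time):
    # call_remote_get_time() stands for a blocking RPC that returns the
    # server's clock reading taken when the service actually ran.
    sent_at = time.time()
    server_time = call_remote_get_time()
    return server_time - sent_at  # seconds spent in transit and queued server-side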

What we discovered in our tests is that the server delta was almost identical to the client's received delta, so the delay was due to the remote service call taking a long time to reach the server-side service, most likely somewhere in the network layer on the server. The values on the graphs below are in seconds.

This is a Sunday profile and is very specific to Jita. This was one of the primary reasons why Jita could sometimes become fairly unresponsive on Sunday evenings. It was not uncommon for client requests to take up to 1-2 minutes to reach the service layer on the server, and the requests would be delayed seemingly at random: of two requests made in succession, the first one could be delayed for minutes while the second one would get a response almost immediately. From a player's perspective this would manifest itself as lag and strange client behaviour, as requests were delayed and completed by the server well out of order.

By comparison, here is Jita with approximately the same number of players, around 800 pilots in local, after the deployment of StacklessIO.

It's very apparent that StacklessIO does not demonstrate any of the earlier issues. There is only one small spike and two small bumps, but we must keep in mind that such isolated occurrences could be caused by general network issues on the internet. Since the client/server network communication has to travel through the internet, some delays are to be expected depending on general internet health and the particular ISP.

There are no systemic issues anymore as with the old network technology and StacklessIO provides all-around superior performance.

One of the other measurements we did was to ping all nodes in the cluster from a single node to measure network latency within the server cluster. The values in the tables below are in seconds.

Ping Pre-StacklessIO
Time    Minimum   Maximum   Average   Stddev
16:00   0.00065   3.22      0.042     0.032
21:00   0.00064   4.36      0.068     0.056
22:00   0.00065   1.21      0.027     0.027
23:00   0.00064   4.36      0.027     0.028
00:00   0.00065   1.01      0.020     0.017

Ping StacklessIO
Time    Minimum   Maximum   Average   Stddev
16:00   0.00064   2.00      0.014     0.021
21:00   0.00064   1.02      0.014     0.018
22:00   0.00064   0.25      0.009     0.011
23:00   0.00064   1.93      0.014     0.021
00:00   0.00064   1.06      0.010     0.014

From the table we notice that the minimum values are the same before and after. The lowest maximum is approximately the same but overall the maximum values are lower with StacklessIO by approximately a factor of 2 and they are more consistent.

The average values are lower overall with StacklessIO by a factor of 3 and the standard deviation is lower by a factor of 2. Below is a visual representation of the average values.

At 1,400 pilots on Saturday, the node hosting Jita ran out of memory and crashed. As crazy as it may sound, this was very exciting, since we had never before been in a position to have that problem. We immediately turned our attention to solving that challenge and are making significant progress. I will provide information on that specific effort in a dev blog later.

But we have already made good progress on memory optimisation as part of the StacklessIO technology effort; e.g., memory usage on the proxy servers in the cluster has been reduced significantly.

The two tall peaks are memory issues we encountered in the first days after deploying StacklessIO. A task force was put into action and it reduced the memory usage by 50% compared to pre-StacklessIO values.

The graphs and measurements above show primarily statistics for Jita but the benefits of StacklessIO apply everywhere. We measured Jita in particular because we could rely on activity and regular load in Jita for measurements. StacklessIO should have a positive impact on your playing experience, no matter where you are in the EVE universe and no matter what you are doing.
It's high time we could do something like this in VS.

For the missing images visit:
CarbonIO http://www.eveonline.com/en/incarna/art ... technology
StacklessIO http://www.eveonline.com/devblog.asp?a=blog&bid=584
Last edited by pheonixstorm on Sat Nov 26, 2011 9:57 am, edited 1 time in total.
Reason: Forgot eve links to devblog articles
Because of YOU Arbiter, MY kids? can't get enough gas. OR NIPPLE! How does that mkae you feeeel? ~ Halo
pheonixstorm
Elite
Posts: 1567
Joined: Tue Jan 26, 2010 2:03 am

Re: Stackless Python

Post by pheonixstorm »

A little more digging on StacklessIO brought this out:
Hello there.
We have promised for some time to bring the code out and it is still our plan, but I am still streamlining and adding final touches.

It is basically an asynch IO scheduler for tasklets, written in C++ for Windows, with some minor twists. We use it primarily for socket IO but also for DB and file. A patched version of socketmodule is part of this. Once it's ready, I'll bring out the code.

And yes, as Hilmar explained, StacklessIO was an in-house ad-hoc codename for the project that somehow got upgraded :)

Cheers,
Kristján
If CCP released it I haven't found it yet, but I hope it's true :D

And I just found the blog for the CCP Python guru.
Last edited by pheonixstorm on Sat Nov 26, 2011 10:16 am, edited 1 time in total.
Reason: Add blog link for ccp python guru
Because of YOU Arbiter, MY kids? can't get enough gas. OR NIPPLE! How does that mkae you feeeel? ~ Halo
klauss
Elite
Posts: 7243
Joined: Mon Apr 18, 2005 2:40 pm
Location: LS87, Buenos Aires, República Argentina

Re: Stackless Python

Post by klauss »

Stackless' coroutine model is very tough for coders to follow. Just imagine what it would be like for non-coders or regular modders.

I'd say we should stick to regular python.

It does have threads. Python's not the issue regarding threads, it's the engine's generalized thread-unsafety.
Oíd mortales, el grito sagrado...
Call me "Menes, lord of Cats"
Wing Commander Universe
pheonixstorm
Elite
Posts: 1567
Joined: Tue Jan 26, 2010 2:03 am

Re: Stackless Python

Post by pheonixstorm »

My thinking on this was for backend work, away from the general modder-friendly stuff and most of the server backend. If we build a nice framework that can be accessed by modders in a friendly way (see my rants about nwnscript), we could rebuild the mission director (who messes with it anyway?) to run all the mission events while leaving files such as the campaign files much easier for modders to alter.

The current mission setup isn't too difficult to follow, but it lacks documentation or even a tutorial on how to create game campaigns or add in new NPC missions.
Because of YOU Arbiter, MY kids? can't get enough gas. OR NIPPLE! How does that mkae you feeeel? ~ Halo
www2
Venturer
Posts: 537
Joined: Sat May 14, 2005 10:51 am
Location: milkyway->the sol system->earth->Europe->The Nederland->Soud Holland->Leiden
Contact:

Re: Stackless Python

Post by www2 »

@klauss
As long as we make a good API for the mission/mod files, Stackless Python is no problem.
And as pheonixstorm says, we're better off using Stackless as a backend than a front end.
All Your Base Are Belong To Us