scary thread thread

Development directions, tasks, and features being actively implemented or pursued by the development team.
safemode
Developer
Posts: 2150
Joined: Mon Apr 23, 2007 1:17 am
Location: Pennsylvania

scary thread thread

Post by safemode »

Ok, threading is very dangerous as the code currently exists. We do a lot of thread-unsafe things. There may be one thing we can thread that would be beneficial and avoid all that nastiness. Currently, we simulate (on a different scale) a number of nearby systems. This is dependent on a config variable (your ram setting). The simulation of these systems is largely unrelated to the current system the player is in. The majority of activity from these systems to the current system is news of "battles" and other goings-on that are decided by some kind of chance algorithm.

I'm thinking it should be possible to have a config variable directly control how many systems get co-simulated. If we can comfortably simulate 4 systems with minimal cpu and ram impact now, then by threading the simulation of these other systems we could simulate a dozen of the nearest systems. Very little would need to change: just a lock in the basecomputer controlling the writing and reading of news.

We could also complicate the simulation of these systems to make them more realistic/interesting. Users without multiple cores could continue under the current method (only 2 or 4 systems on the other thread), but multi-core users would be able to increase that to their heart's content.

I think this is a pretty safe and largely unobtrusive way to offload some simulating from the current frame. The main thread would never get stuck waiting for the other thread, and the other thread can always wait until it's safe to lock the news without affecting anything.
jackS
Minister of Information
Posts: 1895
Joined: Fri Jan 31, 2003 9:40 pm
Location: The land of tenure (and diaper changes)

Post by jackS »

Unless I'm misremembering, it's actually going to be more complicated than that. The simulation that the news and such is dependent on is the python-level Flightgroup-granularity simulation, which is done on all star systems. The configurable "remembered" star systems variable is different.

IIRC, the other systems loaded into memory are actually being simulated in normal physics manner, but at greatly reduced frequency (and hence fidelity, but you can't see it, so it's not a big deal). To properly multithread the simulations, we'd need to synchronize on every access by any vessel to any jumppoint in any system that's being simulated. This doesn't make things in any way intractable, but I don't think it's as simple as one might hope for.

On a separate note, I'm not certain how highly to prioritize this among the set of things that would benefit from additional threading, because the simulation burden is currently mostly imperceptible. What does tend to be noticeable (and has been the historical limiting factor in how many systems we can remember) is the memory footprint for multiple system simulation. On the multi-threading front, I'm of the opinion that this is certainly lower priority than creating a pre-loader thread to mitigate delays on jumping or other mass-load activities.

Post by safemode »

Images are where we use all our memory; we could probably simulate every system in our universe at the same time and not use much ram. Textures are where all our ram goes.

Out of the 600MB VS uses on my system, I'm sure over 500 of that is just textures.


Ok, so the simulations are more complicated than I thought they'd be. Fine. So let's pretend we've cleaned up the "system" class so that it doesn't use static or global vars. That means none of the classes inside it would be allowed static vars either. Pretend that cleanup was done.

Each system in the universe (ram limited) gets simulated on its own thread from the start of the game.

The universe object controls the priority of each thread (here we let the OS do our scheduling). This won't be a problem, because the current system gets a much higher priority.

Now, all the other systems each get a universe-level lock. The universe holds these locks to control synchronization. Basically, a system will be allowed one physics frame for all its units and then has to wait for the universe to release its lock.
The universe does not do the same for the current system; it is allowed to cycle as fast as the computer will allow.
In this way, the universe can control the level of fidelity simulated on a per-system basis: it can control how often it holds a system's lock.

Now, what about news and jumping, you say?
When a system has a ship jumping to another system (other to other, other to current, or current to other when it's not the player), the system requests a lock, pushes the Unit onto a queue of the destination system, releases the lock, and moves on. This lock is held only as long as it takes to push a pointer onto a queue or pop one off, so it's fast enough not to matter.
When that destination system is set to run again, it first processes its jump queue. It holds the lock, pops a unit pointer, unlocks, looks at the unit's current system, places the unit at the correct jumppoint based on that, sets the new current system for that unit, then repeats back at the jump queue. When it's done, it runs its physics frame on all the units. To the unit, the jump was instant, and since we're not viewing these systems, it doesn't matter that it wasn't. Everything is kept in sync by the universe, so one system won't get stuck never getting processed or anything.

The special case: the player jumps to one of these systems.
The universe holds a special lock that is shared by all systems; every unit during a physics frame checks this lock and will halt at that point (beginning of frame), essentially pausing all systems at the start of their next unit physics frame. This is basically instant. We do the same thing a system does when the jumper isn't the player, but then we push the current system onto the list of "other" systems and move the system we jumped to onto our "current" system thread, by changing their priorities.
Now we do what we normally do when we jump to a new system as far as loading and removing graphics.
We unlock the special shared lock, then we allow things to continue as normal.

News can have a similar locked queue.
chuck_starchaser
Elite
Posts: 8014
Joined: Fri Sep 05, 2003 4:03 am
Location: Montreal

Post by chuck_starchaser »

I'd totally agree with JackS; if the simulations represent little cpu loading, there's not much sense splitting them into another thread. No benefit accrued from multicore or otherwise.
Moreover, preloading files is something that would be enormously beneficial, something I've banged my head against the wall thinking about on several occasions, and it seems to me that doing it in a separate thread (or threads) ought to be the way to go. And it's a pretty easily separable task: no producer/consumer problems; no dangers of race conditions I can think of...
Xit
Bounty Hunter
Posts: 186
Joined: Mon Aug 06, 2007 2:34 am
Location: Cambs

Post by Xit »

Will that still be the case in the future though, when the economy is more complicated, and seamless planetary flight is implemented? If the economy was to be generated dynamically, across several systems, how much would that extra simulation benefit from being in another thread?

Post by safemode »

Physics may be lightweight, but it has an absolute time limit. That means that long before you get to full cpu use in VS, we'll have huge latency problems. That's the idea behind threading the simulation: you can scale the simulation to huge numbers of units without affecting the simulation time.

Let me explain it like this.

If we have N seconds between graphics frames to maintain at least 30fps, then each update_physics() call to a unit has to take a fraction of N, i. Each update to the economy has to be a fraction of N, j. Every update to the player's unit takes a fraction of N, k. Now, we can pack M physics frames between graphics updates until N is used up. So as i, j, or k increase in cpu time (due to increased complexity and number of units in game), M decreases. It also increases the likelihood that we'll overrun N when a single unit takes an unexpectedly long time to do its thing, because each unit will have less time to do its work in.

Right now, we don't usually have enough going on for all the units to get their turn and use up all the cpu. In the future, with thousands of units, that won't be the case. In order to scale, we're going to have to find a way to decrease latency (the risk of a single unit overrunning N), and without adding preemption to VS, we need to offload it onto another thread.

Basically it's about scaling in the future: if we're going to increase the time a single function takes, we'll have to make sure it doesn't leave us missing graphics frames.


I'm not against preloading systems, but we don't jump into systems very often. I'd prefer to use threads on something that will help overall performance.

Post by safemode »

I think creating an OPCODE tree is very expensive, time-wise. I've started to notice much greater pauses when a new type of unit is put into a system during gameplay, and I think that has to do with OPCODE somewhat (and a lot with Unit itself too).

I have an idea for safely threading the creation of the tree.

Basically, I'd have a pthread-compatible (C-style) function in csOPCODECollider that gets called via pthread_create when geometryInitialize is run. Everything in geometryInitialize would be put into this pthread-compatible function. geometryInitialize would thus return practically immediately.

Collide() would then check whether the thread has exited yet; if not, it waits using pthread_join. Once the geometryInitialize pthread function exits, it continues.

So it blocks only if absolutely necessary, and for much less time than we'd block on insertion into the system. I could even set it up to copy the vertices first before spawning the thread, and we'd have zero chance of races and no need to lock anything outside of the csOPCODECollider class, because nothing is accessed outside of it, and nothing inside of it is accessed directly. I can block all functions until the thread exits, and the only time you'd ever notice that is if the unit was destroyed or attacked the moment it was spawned. After that, the geometryInitialize pthread function will have finished long ago.


I'm gonna do a test run of this tomorrow and see if it makes any difference when a unit is spawned in the system. I think it will, as optimized collision tree generation is not cheap.

I'm excited now; I found a use for a little threading that may provide real results with no downsides.
Ed Sweetman endorses this message.

Post by chuck_starchaser »

Are you using Boost threads?
The project's been kind of abandoned for years; but I tried what's there, a year ago or so, and it works like a charm; --and is portable.
(And if I managed to figure it out and compile it, any alley cat can do it ;-))
Might save you some headaches (and #ifdef pollution).

Post by safemode »

Looks good. I'll use that.


This is gonna be cool. You can probably say goodbye to a good slice of the time it takes to spawn a new unit, especially the more complex ones. For simplicity, and to be guaranteed race-free, I'll still copy the vertices in the main thread, then spawn off after that for the rest of initialization; basically the model build. It's highly unlikely that we'd unload the geometry of a unit right after we spawn it (maybe because it was destroyed), but we'll err on the side of caution.

Post by chuck_starchaser »

So, you're going to make one-shot threadlets that are spawned with a name, say "llama.basic", and the thread does all the mesh and texture loading, and when it finishes it sends a message back with a pointer to a struct it allocates with the info on where everything is, then exits? Not sure if I read you correctly, but if I did, it would be a good idea for these threads to load meshes and textures to system memory, rather than to the videocard, and let the main thread do the final videocard transfer. Several reasons:
1) You could later on have these little threads spawned by another, background, low priority thread that predicts unit needs, like if you're heading for a jump point, a few seconds before jumping, it begins to figure out what's needed for the other side.
2) Multithreaded access to the videocard is probably dangerous.
3) Moving data from system memory to the videocard probably takes a fraction of the time it takes to load the data from disk.
4) During the transition, you could evict everything from the videocard's memory, to have a clean slate; and move all the new data in from system memory faster than a single frame's time.

Post by safemode »

You misread what I was talking about.

We make copies of the vertices that make up a mesh, just like we did with the RAPID collider, because VS's mesh construct uses Vectors, and that's not compatible with Opcode::Points. So we copy the mesh when we create a csOPCODECollider object. Then we build the collider tree at the same time.

That is all I am threading: the generation of the collider tree. I'm not threading the copying of the mesh data, because there is a small chance the unit could be destroyed and the mesh unloaded while I'm still copying it. So copying of the mesh is done on the main process, then I spawn a thread to build the tree. I don't wait for it, but I do block until it's done if it's still working when we call Collide or GetVertex or any other function that hits the model data.

None of this has anything to do with the data that gets sent to GL.

Post by chuck_starchaser »

Ah, sorry. BTW, if you want to copy the vertices in the thread, you could do so by using a shared_ptr, so the vector data can't be deleted until the thread discards it. Just mentioning. (shared_ptr is another boost thing, and is thread-safe)

Post by safemode »

That function copies the data, I believe, which is what it would be trying to avoid... so I don't think it would save much time, and it'd end up using more memory by having a third copy of the mesh for a short time.


Basically these threads will exist for less than a second, then die. My goal is simply to reduce the time a unit takes to spawn. This time is very small for a single unit, but usually we push entire flight groups into the universe and system at a time (20-30 units). This can easily hit our 1/60 of a second magic time limit.

If I can take the time it takes to generate an optimized collide tree out of that equation, then we win, no matter if that particular process takes slightly longer than the current mode, because it's no longer blocking.

So you can imagine, say, the loading of a system having at times 20 threads active at once for fractions of a second as entire flight groups are loaded everywhere. Very short-lived, possibly inefficient in overall cpu use, but I believe it will free up the main process significantly. I'll be implementing it tonight, hopefully, and trying it out.

Post by chuck_starchaser »

No, no; it doesn't copy anything; it's just a superb implementation of an extrinsic reference counter through a smart pointer type.
http://www.boost.org/libs/smart_ptr/shared_ptr.htm
But if copying vertex data in the thread is not important, then it's not important; I was just letting you know there's an easy way to do it.

Post by safemode »

I misread it; I assumed it was something I would only need to set on one end. This is something I would need to set outside of the code I'm going to thread, meaning I'd need to change all the code that uses it so that the data would be seen as a shared_ptr object.

If I wanted to do something like that, I'd just hold a lock around the destructor of the bsp_polygon object, without changing its type. I don't need to worry about concurrent access, because you don't write to it once it's made.

But like I said, I'm gonna go for simplicity and less intrusiveness, and just start my thread after that access to shared data.

Post by chuck_starchaser »

No; when you do an allocation via new() and get a pointer, you stick that pointer into a shared_ptr instead of a raw pointer. If it's a vector<Vector> v, then yes, you'd have to change that to a vector<Vector>* pv as a first step, change the code, then change to a boost::shared_ptr<vector<Vector> >; but you were talking about deallocation, so I guess you have those vertices on the heap already. Then you could give a copy of the shared_ptr to the thread. Only once all shared_ptrs to a common object are destroyed or go out of scope does the object get deallocated. No intrinsics; it's completely extrinsic and transparent. But if you don't need a reference-counted object, then you don't need it; I just wanted to clarify.

Post by safemode »

It kind of sounds like shared_ptr makes itself a base class to whatever object you pass it, so that it can overload its destructor. Interesting.

Though I'm not sure I want to alter any outside code for this little change. I'd like to make the threading as non-intrusive as possible for now. I don't want to have parts of its functionality all over the code base, or in this case, even just in one other place in the code base.

Post by safemode »

Well, I implemented it.

Unfortunately, it doesn't give the effect I had hoped. Apparently, even though tree generation is somewhat expensive, we're smart enough to only do it on new models... so even in a very long run, you'll only see maybe a couple hundred calls. And the pauses that occur (apparently as new units are spawned) are not affected by it.


Basically it comes down to the fact that geometryInitialize didn't really take up that much time, and it's only called once for a given type of mesh. Thus the impact of parallelizing it (which can be done very well) on gameplay is minimal :(

Post by chuck_starchaser »

Damn!
You wrote: And the pauses that occur (apparently as new units are spawned) are not affected by it.
Any chance the new unit spawning could be parallelized?
Like having a low-priority thread that's normally blocking but, whenever you need a ship, you send a message to the thread with a name. The thread loads the mesh and textures to ram, puts pointers to them in an allocated struct, sends a message back with a pointer to the struct, and checks the next message, or blocks again if there's no message.

And whatever happened to the idea of compressing the mesh and the textures?
I'm thinking, all the files in each unit folder could be gzipped/tarred.
So the thread would look for a file called a.tar.gz or whatever in the folder. If it's not there, it loads the plain files. If it's there, it extracts them to ram.

(Yeah, I think I'd use "a" as the name for all of them. Why waste a column in units.csv for a filename, when the name of the ship is in the folder under units already? Frankly, I'd get rid of the mesh names in units.csv and give them all the same name, and then get rid of the texture names in the mesh files, and standardize the names, like d.dds for diffuse, s.dds for specular, a.dds for ambient/lightmap/glow, x.dds for damage, n.dds for normal map, g.dds for gloss/shininess/hardness, t.dds for detail. If they are there, they are used; if they are not there, they are not. Simple.)

Post by safemode »

Part of the issue is that there are lists that get made everywhere... so it's not like you can just thread the creation of an object and not immediately have to block the main process until the thread is finished, so you can push that object where you want it.

That's not the only thing.

Anything with GL directives can't be threaded. This also happens to be the area with the most cpu usage.

I just spent hours staring at the code in universe_, starsystem_, galaxy_ and unit_, and I could find _nothing_ worthwhile to thread that wouldn't require rewriting a crapload of stuff.

Very depressing.



It's obvious that the overhead of creating a thread is too great for anything that would create a new thread each frame.

That leaves persistent waiting loops... RegenShields is the closest candidate I've found that satisfies a good deal of the issues. But all in all, I think this is a wash.

If there were a way to update physics on non-current systems (or hook the place this occurs), then I could block if the threads were active, and set them off if they're sleeping, keeping them in a function that I can have exit when the system either gets deleted or loaded as current. Of course, these other systems probably add no real cpu usage to the game. The real cpu usage is in Unit::Update; subunitPhysics would be a good one too, but this is infested with graphics code and just plain linear-only stuff. I'm heading off to sleep now; maybe we'll think of something by tomorrow.


At least the threads work and are easy as hell to use.

Post by chuck_starchaser »

safemode wrote: Part of the issue is that there are lists that get made everywhere... so it's not like you can just thread the creation of an object and not immediately have to block the main process until the thread is finished, so you can push that object where you want it.
Well, I don't know enough about the engine to contribute much on this, but it seems to me you're saying that creation and linkage of a new unit are mixed. This is BAD. There should be a single point of linkage. The lifetime of a unit should go like:
  • load (z-lib extract the bfxm's and dds's from a single file) (threaded)
  • create (shove the stuff to the videocard)
  • link (update lists, whatever)
  • use
  • unlink
  • destroy
All the linkage stuff should be done by calls from within a single function. When you started on all these optimizations you were actually doing refactoring. I think this is a good opportunity for more of it; and it's also close to Unit, which you were planning to refactor recently.
Anything with GL directives can't be threaded. This also happens to be the area with the most cpu usage.
I know. That's why I was suggesting that the little threads extract the mesh and textures to memory, rather than directly to the videocard. You might say, why bother putting so little work in a thread? Because the real drag is disk i/o. Disk i/o is the kind of thing that unpredictably stops you cold for milliseconds at a time. Perfect stuff for a thread.
But it certainly needs some refactoring of the code. Adding linkage to the newly created unit needs to be first separated from the load, then moved elsewhere; --namely, to where you check for returning thread messages.
It's obvious that the overhead of creating a thread is too great for anything that would create a new thread each frame.

That leaves persistent waiting loops... RegenShields is the closest candidate I've found that satisfies a good deal of the issues. But all in all, I think this is a wash.
Any kind of code that can be stopped waiting for a resource is a perfect candidate for a thread. Disk i/o foremost.

Post by safemode »

As for packaging images up together... there's not a whole lot of benefit you'd gain from it: you only need to read textures in from disk when you don't have them already loaded, which is basically only on the first request for that unit on most systems.

The real gains would be seen in generating a star system (there's a function for it) and in separating the physics frame from the graphics draw. That way we could begin the next physics simulation while the draw function is executing.


There are only a couple of ways you can do this.

Set up Draw to be run in an infinite loop, with a cond.wait() at the beginning. The calling function would have a cond.notify_one() at the spot where we would normally have called Draw. It would then immediately begin executing updatePhysics. In the event the calling object is destroyed, we'd set a killed variable to true and notify_one the condition variable, causing Draw to exit. The iffy part there is copying the data it needs that may be invalidated by updatePhysics; the idea is to keep the current iteration of updatePhysics unable to alter the current execution of Draw.

The other way to do it would be to make Galaxy smart, and put the adjacent systems that it has to load and simulate each on their own thread. It would then execute them simultaneously and sync them, blocking only themselves of course, not the main current system. This way we can simulate adjacent systems with less impact on the current system (including loading them, so pre-caching would be possible for jumps). Adjacent systems would be blocked by the current system's simulation, such that all can be executing at the same time, but if the adjacent ones finish first, they wait until the current system is done its cycle. Then they all blast off again. We'd only block the current system during a jump, where we are swapping the current system for an adjacent one. This way we can make a better jump animation that takes up the entire screen and runs at full framerate while we load up the second system on the adjacent system's thread. Once it's done, we do the swap, drop out of the animation, and reveal the new system, loaded in a snap.


Scratch "Galaxy"; this is all done in star_system_generic.

Post by safemode »

I don't think you realize just how little reading from disk comes into play with VS. It's practically non-existent. We read once, during initial load, and that's it. And both Windows and Linux will cache fs access, making that even less of a factor. Think about how fast we read in the textures needed to load a system from scratch (worst-case scenario): it takes less than 20 seconds from starting VS to being in a system flying. That's reading the textures in and doing all the other grunt work on a single process.


Now, once that's done and you're in game, you have basically all graphics-oriented functions taking up your time, plus updatePhysics. The graphics-oriented functions aren't spending their time loading textures... they're doing transformations and blocking on video card transfers via openGL. The worst performance hit in the video card is bouncing textures from system ram to video ram, not from hdd to video ram. We cache everything possible. Profile VS and see for yourself: you'd see the cpu usage in ReadDDS. My profile shows 548 calls, and each took 0.00 seconds.

The idea of precaching the system you're going to jump to isn't to avoid reading textures from disk; that's not really a problem. The OS has cached things fairly well for us already, and on top of that, most of the textures are already loaded. What we don't have loaded wouldn't take more than a second or two to read in. What really lags things during a jump is getting those textures drawn in the new system, and likely all the copy/construction operations that go along with swapping an adjacent system for the current one.


I think our safest bet is to try to get physics updates to run at the same time as draws, and/or to get adjacent systems' physics updates running in parallel, though blockable by the current system for when we are jumping.

Post by chuck_starchaser »

Well, maybe you're right; but you mentioned unit creation in the same sentence with the hiccups...
And the pauses that occur (apparently as new units are spawned) are not affected by [parallelizing the collide tree initialization].
Besides, while now you're saying there's so much caching done for us by the OS, you were saying earlier,
My goal is simply to reduce the time a unit takes to spawn. This time is very small for a single unit, but usually we push entire flight groups into the universe and system at a time (20-30 units). This can easily hit our 1/60 of a second magic time limit.
And I see no contradiction, per se; --after all, these flight groups may not involve any new unit types, and so their meshes and textures may already be in local or at least system memory--, but perhaps some experiment is needed to establish whether large flight-group spawnings of already-loaded ships, or spawnings involving new ship types, are responsible.

The OS may do a lot of file caching on our behalf; but it does exactly zero PRE-caching. AFAIK, the hiccups don't happen every time a flight group is created, so it may be that they only happen when the flight group(s) include(s) units that haven't been spawned before, and therefore need to be loaded from disk. I think your argument about how little file i/o goes on is not germane to diagnosing an occasional phenomenon. Occasional hiccups would naturally be the consequence of some kind of latency, NOT of an amortized slowness. Reading a unit from disk involves reading the mesh and typically 4 textures. That's 5 disk accesses. Even at 5 milliseconds each, that's 25 milliseconds, which is more than 17 milliseconds (the 1/60th of a second magic number). So you could expect a minimum of one dropped frame per unit type that needs to be loaded from disk (in addition to per-unit code execution latencies).

I'm not trying to push a particular theory; just trying to clarify my thoughts. It could be something else. In PU we get a lot of hiccups related to disk writes, due to our stdout.txt being set at a pretty high verbosity. (You could argue that disk writes are cached and made to happen behind reads; but, while normally this would be true, I think stdxxx.txt writes are explicitly flush()'d, to make sure important stuff is not missed in the case of a crash.) But anyhow, it just goes to show that disk i/o may very well be the cause of this type of problem; and note that however verbose stdout may be set at, it doesn't even compare in terms of i/o burst sizes to loading textures...

EDIT:
Maybe one experiment to do would be to log disk i/o events to stdout.txt, then play the game and, the moment it pauses, hit ctrl-alt-delete, kill vegastrike.exe, and see if some disk i/o was the last thing to happen before the kill. Might as well log spawns, too. Might as well log times every quarter-second too.

Post by safemode »

I was tired last night; rambling.

Let me just say that I don't believe we'll be able to parallelize Unit, not without some serious rewrites. It's just too serial in nature, and the APIs are too jumbled.


I think the only hope we have of parallelizing VS (as it exists now) is putting each system on its own thread: synchronizing them to the current system on a frame-by-frame basis, but allowing them all to execute at the same time as each other and as the current system.

We may also be able to put the UnitFactory on its own thread, so that Units are generated in parallel with processing the game. We would lock the unit list prior to insertion, then unlock. That should be extremely fast and not noticeable. I think that can avoid other issues we see everywhere else in the game. Basically, call our UnitFactory thread and forget about it: we pass it the list we want the unit put in, and it gets processed like normal whenever it ends up getting inserted into the list.

in the future:
We can add a precache function to mimic the act of becoming the current system, loading into GL the textures that aren't already there, but not drawing them. Then we'd need to change how we swap, so that we handle the fact that things are precached.

Perhaps new code could be made to take advantage of threads.

I also want to look into processing AI in parallel with drawing. Basically, Unit execution would happen like this:

thread.lock(ai)
unit physics update
if (threadunitai_done)
    cond.notify_one(aidone);
thread unit ai (lock at beginning of function, unlock at end, if done wait)
unit draw update

The thread doesn't block, so the draw update executes at the same time. This shouldn't be a problem: Draw depends on physics not running at the same time, and AI doesn't alter how the unit is drawn. But AI does alter the next physics frame, so we wait until it is done if it's still executing.

"Thread unit ai" is actually a call to push our unit's AI onto a queue serviced by a thread in an infinite loop. We sit and pop the queue; if there's nothing, we wait until signalled by the cond variable to start again.

We likely do things to the unit we create currently, however, other than just inserting it into a list, so we'd have to deal with that, because we don't want to have to block the thread so soon.