scary thread thread

Post by **safemode** » Fri Mar 07, 2008 3:29 pm

in python you can disable 99% of the IO buffer issues.

Even at your estimates for disk reading, we only have a hundred or so units. If VS had to load every unit in a row (and we don't do that currently) we'd drop 250 milliseconds. These pauses are well over a quarter second long. And we don't load anywhere near 100 units at a time, and we stagger loading and processing frame by frame.

for instance, i have a thread where me and pyramid go over how much video ram unit and background textures take. If you had no uncompressed textures at all, it would take 256MB to hold 85 units. Obviously that leaves no additional ram at all, which is unrealistic for a 256MB vid card anyway.

Now i'm not ruling out disk io as contributing but i think it may be coincidental. We do a lot of stuff when we have to load a new unit, mesh generation, collider generation, texture loading to GL, adding to system etc. I just think those other things make up a much greater time span than reading from disk.

But, guessing aside: I've looked at trying to thread image loading, but we can't. We use the image data right after we try to load it. If we don't have the image data, we fail at loading, and bad things happen. So it's impossible to thread image loading on demand. We'd have to create our own precache process, guess what textures we'd need, and then create a lookup hash for normal image loading to check before trying to load the image manually.
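The precache-plus-lookup-hash idea can be sketched roughly as below. This is only an illustration under assumptions: `TextureCache`, `ImageData` and the loader callback are hypothetical names, not Vega Strike code, and the sketch is not thread-safe by itself (a real precache thread would need a lock around the map or a hand-off queue).

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for decoded image data; not a Vega Strike type.
struct ImageData {
    std::vector<unsigned char> pixels;
};

class TextureCache {
public:
    using Loader = std::function<ImageData(const std::string&)>;

    explicit TextureCache(Loader loader) : loader_(std::move(loader)) {
        assert(loader_ != nullptr);
    }

    // Called by the precache pass for textures it guesses will be needed.
    void precache(const std::string& path) {
        if (cache_.find(path) == cache_.end())
            cache_.emplace(path, loader_(path));
    }

    // Called by the normal loading path: check the hash first and fall
    // back to a blocking load when the guess was wrong.
    const ImageData& get(const std::string& path) {
        auto it = cache_.find(path);
        if (it == cache_.end()) {
            ++misses_;
            it = cache_.emplace(path, loader_(path)).first;
        }
        return it->second;
    }

    // Number of times the normal path had to load from disk itself.
    int misses() const { return misses_; }

private:
    Loader loader_;
    std::unordered_map<std::string, ImageData> cache_;
    int misses_ = 0;
};
```

The miss counter is there because the whole scheme only pays off if the guesses are mostly right; every miss is still a serial load.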

Anything whose results we need right after we call the function, we can't thread. We'd get no benefit.

Post by **chuck_starchaser** » Fri Mar 07, 2008 3:38 pm

safemode wrote:I was tired last night. rambling.

let me just say that I dont believe we'll be able to parallelize Unit, not without some serious rewrites. It's just too serial in nature and api's are too jumbled.

k

I think the only hope we have at parallelizing VS (as it exists now) is putting each system on its own thread. Synchronizing them to the current system on a frame-by-frame basis, but allowing them to all execute at the same time as each other and the current system.

How about just 2/3 threads:
Current system on main thread.
Next system (when about to jump) on second thread.
All other systems on a third thread; --i.e.: all the low rez, dyn universe simulations executing on a separate thread.
Though, I remember JackS saying that those low rez simulations take negligible time.

We may also be able to put the UnitFactory on its own thread, so that Units are generated on a thread in parallel to processing the game.

Well, NOW you're talking... Isn't this, technically, what we were talking about unit creation, all along? Sorry, I'm not that familiar with the engine.

I also want to look into processing AI in parallel with drawing. Basically Unit execution would happen like this.
thread.lock(ai)
unit physics update
if(theadunitai_done)
cond.notify_one(aidone);
thread unit ai (lock at beginning of function, unlock at end, if done wait).
unit draw update.

Might as well, in that case, do it like...

Code:

block if previous ai frame not finished
read results from ai previous frame
give ai data to chew on for next frame
unit physics update
unit draw update.

No locks; just an unlikely block in dual core/more likely block in single core.

thread doesn't block, so draw update executes at the same time. This shouldn't be a problem, because draw depends on physics not running at the same time, ai doesn't alter how the unit is drawn. but ai does alter the next physics frame, so we wait until it is done if it's still executing.

Yep. Except that, by reading output from previous frame, followed immediately by feeding it input for the next frame, --though, admittedly, at the cost of 1 frame's worth of ai latency--, allows full parallelization when running in multi-core platforms.
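The "read last frame's AI output, then immediately feed it next frame's input" ordering could look something like this. It is a minimal sketch, not engine code: `PipelinedUnit`, `AiInput`, `AiOutput` and `run_ai` are made-up names, and `std::async` stands in for whatever thread mechanism would really be used.

```cpp
#include <cassert>
#include <future>

// Hypothetical per-unit data; the field names are illustrative only.
struct AiInput  { float target_distance; };
struct AiOutput { float throttle; };

// Placeholder AI step: full throttle while the target is far away.
inline AiOutput run_ai(AiInput in) {
    return AiOutput{in.target_distance > 100.0f ? 1.0f : 0.0f};
}

class PipelinedUnit {
public:
    void frame(float target_distance) {
        assert(target_distance >= 0.0f);
        // Block only if the previous AI frame has not finished,
        // then read the results of that previous AI frame.
        if (pending_.valid())
            orders_ = pending_.get();
        // Give the AI its data for the next frame; it runs in parallel
        // with the physics and draw updates below.
        pending_ = std::async(std::launch::async, run_ai, AiInput{target_distance});
        // Physics update, using AI orders that are one frame old.
        speed_ += orders_.throttle;
        // The draw update would go here.
    }

    float speed() const { return speed_; }

private:
    std::future<AiOutput> pending_;
    AiOutput orders_{0.0f};
    float speed_ = 0.0f;
};
```

The first call to `frame()` has no AI result to read yet, so physics runs with zero throttle; that is the one frame of AI latency being traded for parallelism.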

thread unit ai, this is actually a call to push our unit's ai onto a queue running on a thread in an infinite loop. We sit and pop the queue, if nothing then we wait until signalled by the cond variable to start again.

Safemode, cond variables are infamous for their unsafety. Just use messages. They are faster, safer, and easier to understand and debug. Well, if the data size that's needed to be shared is too big or complex for fully messaging it, you could duplicate it, have a and b copies, where in frame n the main thread works with a and the ai thread works with b; then they swap ownership, so that, in frame n+1, the main thread works with b and the ai thread works with a. Then, the only lock is on a single bit (well, a boolean variable) expressing the swap state. Otherwise you'll be going through gazillions of data locks, and that's insane, and totally inappropriate for real-time apps. Unless I'm misunderstanding...

Post by **safemode** » Fri Mar 07, 2008 4:20 pm

chuck_starchaser wrote:
safemode wrote:I was tired last night. rambling.

let me just say that I dont believe we'll be able to parallelize Unit, not without some serious rewrites. It's just too serial in nature and api's are too jumbled.
k

I think the only hope we have at parallelizing VS (as it exists now) is putting each system on its own thread. Synchronizing them to the current system on a frame-by-frame basis, but allowing them to all execute at the same time as each other and the current system.
How about just 2/3 threads:
Current system on main thread.
Next system (when about to jump) on second thread.
All other systems on a third thread; --i.e.: all the low rez, dyn universe simulations executing on a separate thread.
Though, I remember JackS saying that those low rez simulations take negligible time.

They don't take much time, but it gives us a launching platform for pre-loading the current system while the current system can play the wormhole animation.

We may also be able to put the UnitFactory on its own thread, so that Units are generated on a thread in parallel to processing the game.
Well, NOW you're talking... Isn't this, technically, what we were talking about unit creation, all along? Sorry, I'm not that familiar with the engine.

I also want to look into processing AI in parallel with drawing. Basically Unit execution would happen like this.
thread.lock(ai)
unit physics update
if(theadunitai_done)
cond.notify_one(aidone);
thread unit ai (lock at beginning of function, unlock at end, if done wait).
unit draw update.
Might as well, in that case, do it like...
Code:
block if previous ai frame not finished
read results from ai previous frame
give ai data to chew on for next frame
unit physics update
unit draw update.
No locks; just an unlikely block in dual core/more likely block in single core.

I don't want to copy data, so we can't run ai code parallel to physics; we'd lose data from collisions. AI code affects physics, so we'd have to block. Physics affects drawing, so we'd have to block drawing for physics. But it's then clear that we can run AI at the same time as drawing.

Don't forget, we only process the player unit and its target rapidly; the rest get processed in a list. So it's very unlikely that a unit's AI is still being processed by the time it gets another frame. So it likely will never block at the beginning of another frame.

that's why it's gotta be block on ai;physics; push ai to thread; graphics.

Then we'd process down the list, pushing unit after unit to the queue while the ai thread processes them asynchronously to the main thread. The ai thread just keeps popping and processing, releasing the ai lock for that unit and moving on. If it reaches the end before we get an ai push, it has to wait for one, hence the condition objects. The main thread has to wake the ai thread up when we push another ai process, in case it's sleeping.
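A queue like that, with the ai thread sleeping on a condition variable until the main thread pushes more work, might be sketched as follows. `AiWorker` is a hypothetical name and the jobs are plain callables; the per-unit ai lock from the post is left out to keep it short.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

class AiWorker {
public:
    AiWorker() : thread_([this] { loop(); }) {}

    ~AiWorker() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_one();
        thread_.join();  // remaining jobs are drained before the thread exits
    }

    // Main thread: queue one unit's ai and wake the worker if it sleeps.
    void push(std::function<void()> job) {
        assert(job != nullptr);
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(job));
        }
        wake_.notify_one();
    }

private:
    // Ai thread: pop and process; wait on the condition when the queue is empty.
    void loop() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
                if (jobs_.empty())
                    return;  // only reached when stopping and nothing is left
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // run the ai outside the queue lock
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::function<void()>> jobs_;
    bool stopping_ = false;
    std::thread thread_;  // declared last so the members above exist before it starts
};
```

The predicate passed to `wait` is what guards against spurious and missed wakeups; the lock is only held while touching the queue, never while a unit's ai runs.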

thread doesn't block, so draw update executes at the same time. This shouldn't be a problem, because draw depends on physics not running at the same time, ai doesn't alter how the unit is drawn. but ai does alter the next physics frame, so we wait until it is done if it's still executing.
Yep. Except that, by reading output from previous frame, followed immediately by feeding it input for the next frame, --though, admittedly, at the cost of 1 frame's worth of ai latency--, allows full parallelization when running in multi-core platforms.

The physics frame needs direction from ai, and you want to make it have a delayed reaction. Ai already operates on the previous physics frame. Now you want to make the physics frame operate on a previous ai frame, which means the next ai has lost contact with the current state of the universe. That means the physics frame is operating on input based on how the universe was 2 frames ago.

thread unit ai, this is actually a call to push our unit's ai onto a queue running on a thread in an infinite loop. We sit and pop the queue, if nothing then we wait until signalled by the cond variable to start again.
Safemode, cond variables are infamous for their unsafety. Just use messages. They are faster, safer, and easier to understand and debug. Well, if the data size that's needed to be shared is too big or complex for fully messaging it, you could duplicate it, have a and b copies, where in frame n the main thread works with a and the ai thread works with b; then they swap ownership, so that, in frame n+1, the main thread works with b and the ai thread works with a. Then, the only lock is on a single bit (well, a boolean variable) expressing the swap state. Otherwise you'll be going through gazillions of data locks, and that's insane, and totally inappropriate for real-time apps.

Locking costs almost nothing if you don't have to wait, and you'd have to wait the same time as a block anyway, as it's serial now. I've used rw_locks before and they're quite fast and lightweight, and they wrap up the functionality of waiting until we are ready for them to continue.

Checking a boolean would also need a while loop, a sleep statement based on some guess as to the optimal time to wait, and a break when the boolean changes. Now you have to extend that boolean to handle multiple threads (not in this particular case, but if you want to avoid locks in any realistic setup, we'd have more than one thread), so you have to make a ref counter and keep things in order so that when you unlock, you unlock the correct thread. Wow, that looks just like a lock. I prefer not to rewrite basic data constructs.

And like I said, physics uses ai output to do its work. If you parallelize it, physics is using data that is one frame old, and that data was created from a physics frame that was one frame older than that. So physics is working with instructions that are 2 frames old.

Post by **safemode** » Fri Mar 07, 2008 4:59 pm

I'm not trying to be combative. This inability to thread _anything_ worthwhile has been completely pissing me off these past couple days. I get like 4 hours of sleep a night because I'm trying to find some way to thread anything that would save any time, and every time I'm hit with the fact that almost everything is strictly serial in nature. And I'm very unwilling to have units out of sync with the universe.

It's really frustrating when you have the potential right here, but nothing is written in a way for you to utilize it.

AI is inherently parallel. But in-game, we don't have a message based ai system. We have a construct that reads in state data, makes decisions based on it and writes out new state data that the physics code then acts on. If we change things from under it, it's now using invalid state data. Copying keeps you from crashing the game, but the data is no less invalid.

That was the idea behind my real faction AIs: a message based ai that communicated with its units via micro-missions, processed in parallel to the game and completely asynchronous because it had no physics and no graphics associated with it. It would receive events from its units and act on them, divvying out micro-missions in a queue based fashion.

by the way. It's good to have the arguing. I think better when there's some opposition.

I'm interested in some other areas where threading could be used.

I am pretty sure I ruled out threading image loading. Without a magic way of knowing ahead of time what we need, loading images is 100% serial once we demand it.

We have possible threading of unit level ai functions (either across graphics updates or across both physics and graphics updates).

We have the possibility of maybe simulating other starsystems on another set of threads (possibly controlled by a config variable). Though I've looked at that code and we'd need to rewrite quite a bit to make this happen.

possibly threading of the conversion of an adjacent system to be the current system. This i haven't looked at completely, but it is probably ugly like everything is in the starsystem files.

edit:
If we could ensure that the unit's AI only does atomic operations on the Unit it is a part of, we could ignore blocking altogether (except when we are rolling back to push another ai frame). If AI read and wrote things atomically on demand (not making local copies of anything) then we wouldn't be out of sync with the physics frame. But the physics frame would also have to do everything that it shared with the ai code atomically, and without local copies of that data.

In this setup, we are mixing the reading and writing of values, but if it's all atomic, then it shouldn't matter that a value read a few ms ago is different now. We just have to make sure we watch the conditionals. That's how the real world works, and it should behave correctly in game if we don't hit conditional issues. Just an idea anyway. AI _should_ be parallel to its physics and graphics frames, but obviously it's going to take some work to make it so, and we should be able to do it without copies.
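A very small sketch of the "everything shared is atomic, no local copies" idea, using `std::atomic`. The struct and function names are hypothetical, and it assumes each shared value has exactly one writer (ai writes `throttle`, physics writes `speed`).

```cpp
#include <atomic>
#include <cassert>

// Hypothetical shared per-unit values; each has a single writer.
struct SharedUnitState {
    std::atomic<float> speed{0.0f};     // written by physics only
    std::atomic<float> throttle{0.0f};  // written by ai only
};

// AI: reads the live speed on demand and writes its decision atomically.
inline void ai_step(SharedUnitState& s, float speed_limit) {
    assert(speed_limit > 0.0f);
    // "Watch the conditionals": speed may change right after this load,
    // so the decision can be based on a value that is already stale.
    s.throttle.store(s.speed.load() < speed_limit ? 1.0f : 0.0f);
}

// Physics: the load-then-store pair is not one atomic operation, which is
// only acceptable because physics is assumed to be the sole writer of speed.
inline void physics_step(SharedUnitState& s) {
    s.speed.store(s.speed.load() + s.throttle.load());
}
```

Individual loads and stores are safe here, but any decision that depends on two values being consistent with each other (a check followed by an action) is exactly the "conditional issue" mentioned above.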

Post by **shadow_slicer** » Fri Mar 07, 2008 5:52 pm

I'm not sure I understand your reluctance to add an extra frame of latency to the AI responses. Sure, the AI would effectively be operating based on the universe state 2 frames previous, but this data is likely to be less than 33ms old (though if the frame rate drops to 10 FPS, it goes up to 200ms). In any case this is still as fast as or faster than human reaction time (humans are clocked at around 150-300 ms). I haven't looked at the AI code, and it could make assumptions that the extra latency violates, but I'd be surprised if the effect was noticeable.
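For reference, the arithmetic behind those numbers: data that is N frames old at a given frame rate is N * 1000 / fps milliseconds old. The 33ms figure corresponds to 2 frames at about 60 FPS (the 60 FPS is my reading, it is not stated above). `latency_ms` is just an illustrative helper.

```cpp
#include <cassert>

// Age in milliseconds of data that is `frames_behind` frames old
// when the simulation runs at `fps` frames per second.
inline double latency_ms(int frames_behind, double fps) {
    assert(fps > 0.0);
    return frames_behind * 1000.0 / fps;
}
```

With this, `latency_ms(2, 60.0)` is about 33.3 and `latency_ms(2, 10.0)` is 200. Note this assumes every unit gets a frame on every pass.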

Additionally this could allow us to avoid copying anything. If we used two buffers to hold physics data, the graphics thread, physics thread and AI thread could all read the 'old' data buffer while the physics thread puts the next frame's position data in the 'new' data buffer, and the AI puts the next frame's AI control inputs into its part of the 'new' data buffer. Then we would only need synchronization to ensure everything is done before we swap buffers.
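A rough sketch of that two-buffer scheme, assuming made-up `PhysState`/`AiState` fields and a `World` class that are not from the engine. Both threads read only the 'old' slot, each writes only its own part of the 'new' slot, and the only synchronization is waiting for both to finish before flipping the index (a graphics thread would likewise only read the 'old' slot).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical per-unit state; field names are illustrative only.
struct PhysState { float velocity; };
struct AiState   { float throttle; };

struct UnitBuffers {
    std::array<PhysState, 2> phys;  // one 'old' and one 'new' slot
    std::array<AiState, 2>   ai;
};

class World {
public:
    explicit World(std::size_t count) : units_(count) {}

    void step() {
        const int prev = current_;      // read-only for this frame
        const int next = 1 - current_;  // each field has exactly one writer
        std::thread ai([this, prev, next] {
            for (UnitBuffers& u : units_)
                u.ai[next].throttle = u.phys[prev].velocity < 10.0f ? 1.0f : 0.0f;
        });
        std::thread physics([this, prev, next] {
            for (UnitBuffers& u : units_)
                u.phys[next].velocity = u.phys[prev].velocity + u.ai[prev].throttle;
        });
        ai.join();        // synchronization point: everything must be done...
        physics.join();
        current_ = next;  // ...before 'new' becomes 'old'
    }

    // Latest completed physics value for unit i.
    float velocity(std::size_t i) const {
        assert(i < units_.size());
        return units_[i].phys[current_].velocity;
    }

private:
    std::vector<UnitBuffers> units_;  // value-initialized, so all zeros
    int current_ = 0;
};
```

Spawning two threads per frame is only for brevity; a real version would keep the threads alive and use a barrier. The sketch also assumes every unit is stepped each frame, which is not how VS staggers its units.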

Post by **safemode** » Fri Mar 07, 2008 6:25 pm

Your suggestion about buffers is still copying, it just gives you a different method of swapping. It still puts the physics code one frame behind. And this is significant because we don't simulate everything all in a row and start over again. Units get simulated asynchronously, some get a frame rapidly, some less so. So a frame can mean a lot for units not necessarily on your screen, or not targeted.

It's not really about latency, it's about acting on the universe with wrong data. The copy could in fact destroy changed data fed back by the physics update; there is no guarantee right now that communication of input and output is one way.

You can't just assume a frame is some tiny fraction of a second though, that only works for some units.

edit: i hate typing in qwerty stupid spelling errors.

Post by **chuck_starchaser** » Fri Mar 07, 2008 9:31 pm

safemode wrote:your suggestion about buffers is still copying,

Only once; during initialization. From then on, there's only pointer-swapping. Instead of dozens or hundreds of locks, you have a single lock at the swap point.

it just gives you a different method of swapping. It still puts the physics code one frame behind. And this is significant because we don't simulate everything all in a row and start over again. Units get simulated asynchronously, some get a frame rapidly, some less so. So a frame can mean a lot for units not necessarily on your screen, or not targeted.

It's still not significant, because if a unit is updated once every 10 minutes, so be it; it'll just arrive to the other thread at 10 minutes and 17 milliseconds.

Come to think of it, I smell a fowl rat, here...

How does all this asynchronous simulation work?

In standard (synchronous) simulation, one needs a previous and next frames; so all computations for the next frame are being changed based on previous frame data, which is NOT changing during the frame. Once the next frame is done, the roles are reversed, and the next frame becomes the previous, and what was the previous becomes the next. Otherwise you get aliasing, which is when some interactions cascade faster than others, depending on the order in which units are processed.
How does the vs engine solve this problem in an asynchronous simulation system?

Or... (UNTHINKABLE...KNOCKING ON WOOD...CROSSING FINGERS) ... Does it?

Post by **chuck_starchaser** » Fri Mar 07, 2008 10:29 pm

Bah, never mind; in Vegastrike there are no interactions other than collisions, anyways, so aliasing is not an issue. It's still highly "incorrect", to be a bit pedantic, in simulations, for physical updates to be written back to the same and only data set. But, be it as it may...
Anyways, I'll try and come up with an ascii diagram of how the "buffers" work, at least in my mind.

Post by **chuck_starchaser** » Fri Mar 07, 2008 11:23 pm

Here we go:

The bigger rectangles left and right are "memory areas" (not really, just data sets; but it may help to think of them as data areas). I put bogus addresses on top of them just to be clear at what level of abstraction we're talking by showing that the addresses don't change between even and odd frames.

The smaller rectangles inside represent processing. The direction of the arrows go from reading to processing to writing.

Code:

                 ON EVEN FRAMES
                 ==============
   0x00a000                            0x00b000
   --------                            --------
  |        |                          |        |
  |   ai   |                          |   ai   |
  |  frmA  |    rd       ------       |  frmB  |
  |        | ---------> |  AI  |  wr  |        |
  | (prev) |            | THRD | ---> | (next) |
  |        |\ rd     -> | CODE |      |        |
   --------  \      /    ------        --------
              \    /
               \  /
                \/
                /\
   0x070000    /  \                    0x090000
   --------   /    \                   --------
  |        | / rd   \    ------       |        |
  |  phys  |/        -> | PHYS |  wr  |  phys  |
  |  frmA  |    rd      | THRD | ---> |  frmB  |
  |        | ---------> | CODE |      |        |
  | (prev) |             ------       | (next) |
  |        |                          |        |
  |        |                          |        |
  |        |                          |        |
  |        |                          |        |
   --------                            --------




                 SYNCH-SWAP FRAMES
                 =================
For both physics and AI, previous becomes next;
next becomes previous. The swap needs to be a
synchronizing handshake.




                 ON ODD FRAMES
                 =============
   0x00a000                            0x00b000
   --------                            --------
  |        |                          |        |
  |   ai   |                          |   ai   |
  |  frmA  |       ------         rd  |  frmB  |
  |        |  wr  |  AI  | <--------- |        |
  | (next) | <--- | THRD |            | (prev) |
  |        |      | CODE | <-     rd /|        |
   --------        ------    \      /  --------
                              \    /
                               \  /
                                \/
                                /\
   0x070000                    /  \    0x090000
   --------                   /    \   --------
  |        |       ------    /      \ |        |
  |  phys  |  wr  | PHYS | <-     rd \|  phys  |
  |  frmA  | <--- | THRD |            |  frmB  |
  |        |      | CODE | <--------- |        |
  | (next) |       ------         rd  | (prev) |
  |        |                          |        |
  |        |                          |        |
  |        |                          |        |
  |        |                          |        |
   --------                            --------

Notice how there is no need to lock any data, as the only data being shared, at any given time, are the read-only ("previous") buffers.

Naturally, all this asynchronous simulation crap is bound to complicate things, but not much. All it means is that the diagram above needs to be applied on a per-atom basis. That is, each atom needs to have its own A and B frames, and we must ensure that, during parallel execution of physics and AI, the two agree which frame is "previous" (read-only) and which frame is "next" (write only), for every atom in the universe. IOW, each atom has to have its own thread-safe boolean bit to indicate swap state. There must be a simpler solution, I'm sure. YES THERE IS. Only one bit needs to be thread safe at the synchronization point. The swap state in each atom is toggled at the end of each processing pass without regards for synchronization.

Is this clearer than before?

Of course, I had to bring the 2-plane, standard simulation into play, because, well, it just shows how correct it is: not only is it the only way to prevent aliasing, but it's the only way to parallelize, as well.
Notice that although there's a doubling of the data, the amount of reading and writing is exactly the same as in the present system; we merely switched from reading and writing into the same data set to reading from one and writing to another, and then switching directions.

As for delays, shadow_slicer said it better than I could. I'll just add to it: Think of it as two neural delays: Time it takes from senses to brain, and time it takes commands to travel from the brain to your hands. Physics to AI latency simulates the first; AI to Physics latency simulates the second.
Any questions?

Post by **safemode** » Sat Mar 08, 2008 12:02 am

Like I said previously, however, there's no guarantee that there is a separate "input" and "output"; they could be the same thing. Meaning, ai's input variables could be its output variables, which physics uses and then writes back to as well.

under serial mode :
ai reads in variable a. a = 1; it looks at other variables and decides to make a = 2; It exits and then physics runs. Physics reads in a and does its thing; it then writes back to a because of a collision, a = -1; then around comes ai again and we cycle.

under threaded mode with buffers like you said

ai reads in a previous copy of a, which starts off at 1.
Physics reads in a previous copy of a, which starts off at 1.
physics gets a collision; its copy of a is now -2;
ai changes its local value of a to 2;

Now they swap.
physics's a is all of a sudden 2 and ai's is -2; cycle continues

Depending on what a represents, this could be a big deal.

My first suggestion retains the serial nature of ai and physics as it stands in VS, thus the value of a is not corrupted.

Ideally, we would make sure none of these situations occur: we would make sure that ai's input didn't mix with its output, and that physics' output and input didn't mix either. Ideally the unit's ai functions would be executed from physics updates via queued requests, and our ai thread would simply traverse its list of units with their lists of ai requests in an infinite loop. I don't think we have an ideal situation here though.

edit PS:
I think i'll give it a night to rest in my head before coming back to some coding. maybe it's best to focus on other things for now besides threading. Lots of prep work to do for 0.5 still.

Post by **chuck_starchaser** » Sat Mar 08, 2008 12:31 am

If the code is that messy, might as well feed it to the albatross and start again from scratch. Is "a" a physical variable, or an ai variable? There is no variable or quantity in all universes that both physics AND ai should write to. None. Totally unjustified. AI sends commands to ship systems; it doesn't directly move the ships. And physical phenomena can be "seen" by ai's, or send messages to ai's, but they don't directly manipulate their virtual "neurons". Physics and ai only take input from each other, but write to their own variables, respectively; or else the code is a basket case.

Post by **safemode** » Sat Mar 08, 2008 12:49 am

the physics code doesn't do callbacks etc. it's perfectly plausible. threading was never really considered.

ideally, ai should be parallel ..i just doubt it is in vs right now. Nothing else is.

Post by **chuck_starchaser** » Sat Mar 08, 2008 1:23 am

Please read thoroughly and carefully, and preferably twice; --I'm giving you a full solution for thread-separating physics and ai, here; and the same principle could be used for other things.

Well, I didn't say it should be parallel; I said it should be parallelizable by having A and B images of the data and doing the processing as shown in the ascii flowchart.

Code:

                 ON EVEN FRAMES
                 ==============
   0x00a000                            0x00b000
   --------                            --------
  |        |                          |        |
  |   ai   |                          |   ai   |
  |  frmA  |    rd       ------       |  frmB  |
  |        | ---------> |  AI  |  wr  |        |
  | (prev) |            | THRD | ---> | (next) |
  |        |\ rd     -> | CODE |      |        |
   --------  \      /    ------        --------
              \    /
               \  /
                \/
                /\
   0x070000    /  \                    0x090000
   --------   /    \                   --------
  |        | / rd   \    ------       |        |
  |  phys  |/        -> | PHYS |  wr  |  phys  |
  |  frmA  |    rd      | THRD | ---> |  frmB  |
  |        | ---------> | CODE |      |        |
  | (prev) |             ------       | (next) |
  |        |                          |        |
  |        |                          |        |
  |        |                          |        |
  |        |                          |        |
   --------                            --------




                 SYNCH-SWAP FRAMES
                 =================
For both physics and AI, previous becomes next;
next becomes previous. The swap needs to be a
synchronizing handshake.




                 ON ODD FRAMES
                 =============
   0x00a000                            0x00b000
   --------                            --------
  |        |                          |        |
  |   ai   |                          |   ai   |
  |  frmA  |       ------         rd  |  frmB  |
  |        |  wr  |  AI  | <--------- |        |
  | (next) | <--- | THRD |            | (prev) |
  |        |      | CODE | <-     rd /|        |
   --------        ------    \      /  --------
                              \    /
                               \  /
                                \/
                                /\
   0x070000                    /  \    0x090000
   --------                   /    \   --------
  |        |       ------    /      \ |        |
  |  phys  |  wr  | PHYS | <-     rd \|  phys  |
  |  frmA  | <--- | THRD |            |  frmB  |
  |        |      | CODE | <--------- |        |
  | (next) |       ------         rd  | (prev) |
  |        |                          |        |
  |        |                          |        |
  |        |                          |        |
  |        |                          |        |
   --------                            --------

But I assumed, as I should have been able to assume, that physics and ai had their own separate sets of variables they write to, and that they don't write to one another. And chances are the code is not as bad as you imply. Grab a good sleep, and a huge cappuccino in the morning, and then take a second look. It may be that in some places ai's output writes to physics' input, say. This is not too serious, --though it'd be rather apathetic--, but it can be fixed by deleting those writes and making physics read ai's output, instead. And vice versa. What each writes to should be its own turf; and nothing outside of it.

AI data IS or represents AI state, which is separate from physics.
Similarly for physics.

So, each outputs to its own state: to the "next" image of it, during a given frame; which by the next frame, after swap(), becomes the "previous" state. Then, the following frame, Physics and AI both can read from their own "previous" state, AND each-other's "previous" state; and there's no need for locks, since "previous" states are read-only for the duration of a frame's computations.
So, the first thing is for Physics and AI to read from each other; never write to each other.

Once such things are fixed, it's a matter of creating two images of the ai data, and two images of physics data. This could be done on a per-unit basis. That is,

Code: Select all

struct physics_data { ... };
struct ai_data { ... };
class unit
{
  //two physics data images; and two ai data images
  physics_data phdA, phdB;
  ai_data aiA, aiB;
  //swappable pointers to data members:
  physics_data const * previous_pd;
  physics_data       * next_pd;
  ai_data  const * previous_ai;
  ai_data        * next_ai;
  //toggle indicates processing direction between the two images; --i.e.: it
  //determines whether A or B are "previous" or "next"
  char toggle;
  //done can be set by physics and ai, but cleared by neither; (it is
  //cleared by swap() only)
  char done;
  void clear_done(){ done = 0; }
public:
  void set_done(){ done = 1; }
  void swap()
  {
    ++toggle;
    toggle &= 1;
    if( toggle )
    {
      previous_pd = &phdA;
      next_pd = &phdB;
      previous_ai = &aiA;
      next_ai = &aiB;
    }
    else
    {
      previous_pd = &phdB;
      next_pd = &phdA;
      previous_ai = &aiB;
      next_ai = &aiA;
    }
    clear_done();
  }
  //just a thought exercise... ideally we could even do something like...
  physics_data       * pd(){ return next_pd; }
  physics_data const * pd() const { return previous_pd; }
  ai_data       * aid(){ return next_ai; }
  ai_data const * aid() const { return previous_ai; }
  //.........
};

NOTE: What level or kind of "unit" this is, I don't know; I'm not familiar with the class hierarchy; just fix as necessary.

Those last overloads would allow us to completely ignore the underlying implementation. You could use pd() to access physics data and aid() to access ai data without regard for whether you're reading from or writing to it, as if there were a single data set; but reads automatically would use the const versions of the functions, which return the (currently) "previous" image pointer; and writes automatically would use the non-const versions, which return the (currently) "next" image pointer.

From outside unit, we wouldn't even suspect there are two sets of data; but reads come from "previous", while writes go to "next", under the hood.

But we'd better forget about those const/non-const function overloads; it's elegant but too risky...
Just have

Code: Select all

  physics_data       * pd_wr(){ return next_pd; }
  physics_data const * pd_rd() const { return previous_pd; }
  ai_data       * ai_wr(){ return next_ai; }
  ai_data const * ai_rd() const { return previous_ai; }
  //.........
};

and make things explicit. Otherwise, pd()->x++ would not know to read x from previous->x, increment it, and write to next->x. It would just increment next->x, because overload selection depends on whether the unit object is const, not on whether the access is a read or a write. Likewise, you could have a non-const physics_data* pointer and initialize it from pd(); the compiler would select the non-const pd(), which returns "next", and then nothing stops you from reading through that pointer. Too dangerous... But a good thought exercise to better understand the principle, perhaps; that's why I showed the hypothetical overloads.
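To make the hazard concrete, here is a minimal sketch of the overloaded pd() pair on a cut-down class. The names unit_demo, previous_x(), next_x() and overload_hazard_demo() are hypothetical and exist only for this illustration; physics_data is reduced to a single int.

```cpp
#include <cassert>

struct physics_data { int x; };

// Hypothetical stand-in for unit, keeping only the two overloads.
class unit_demo
{
  physics_data phdA, phdB;
  physics_data const * previous_pd;
  physics_data       * next_pd;
public:
  // "previous" starts with x == 10, "next" with x == 0
  unit_demo() : phdA{10}, phdB{0}, previous_pd(&phdA), next_pd(&phdB) {}
  physics_data       * pd(){ return next_pd; }
  physics_data const * pd() const { return previous_pd; }
  int previous_x() const { return previous_pd->x; }
  int next_x() const { return next_pd->x; }
};

// Returns true if the overloads behave the way the post warns about.
bool overload_hazard_demo()
{
  unit_demo u;
  // u is non-const, so both the read and the write of x++ go through the
  // non-const pd(), i.e. through "next"; "previous" is never consulted.
  u.pd()->x++;
  bool incremented_next = ( u.next_x() == 1 ) && ( u.previous_x() == 10 );
  // Even a pure read picks the non-const overload on a non-const object.
  int read_value = u.pd()->x;
  bool read_came_from_next = ( read_value == 1 );
  // Only a const access path selects the const overload.
  unit_demo const & cu = u;
  bool const_read_from_previous = ( cu.pd()->x == 10 );
  return incremented_next && read_came_from_next && const_read_from_previous;
}
```

With explicit pd_rd()/pd_wr() names the same mistake has to be typed out on purpose, which is the point of dropping the overloads.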

But yeah, too much abstraction is not good, anyways; makes code hard to understand.

When physics and ai both finish, they shake hands, and one of the two is in charge of calling swap() on all units that need to be swapped (those that were processed this frame, --their simulation atoms permitting); rinse and repeat. And, before you ask, no; you can't swap() them as you go, or else we'd have to put a lock on every static toggle. (Consider this: would physics OR ai call swap()? swap() should be called once both have processed that unit. Moreover, the physical data of that unit may be read while processing other units, and while doing so, swap() shouldn't have been called yet, if we are to avoid aliasing.)

For a practical solution, I'd suggest a mutable "done" flag in unit that both physics and ai can set, but neither can clear. After both threads finish, all units that have been processed (atoms allowing) will have the done flag set, and now we probably need to iterate through all units to update distance sorting indexes and whatnot, right? So we could piggy-back swap()'s onto such traversal. Just add...

Code: Select all

if( unit_iter->done ) unit_iter->swap();

and add a line to swap(),

Code: Select all

done=0;
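As a rough sketch of the whole frame cycle, assuming C++11 std::thread and using join() as the synchronizing handshake: mini_unit, physics_pass(), ai_pass() and run_frame() are hypothetical names, and one int per image stands in for the real data. It uses one flag per thread rather than a single shared "done", because two threads writing the same plain variable would be a data race under the C++11 memory model.

```cpp
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical cut-down unit: one int per image stands in for the real data.
struct mini_unit
{
  int phys[2] = { 0, 0 };
  int ai[2]   = { 0, 0 };
  int toggle  = 0;            // index of the current "previous" image
  bool physics_done = false;  // written by the physics thread only
  bool ai_done      = false;  // written by the ai thread only
  void swap(){ toggle ^= 1; physics_done = ai_done = false; }
};

void physics_pass( std::vector<mini_unit> & units )
{
  for( mini_unit & u : units )
  {
    int prev = u.toggle, next = prev ^ 1;
    // reads "previous" images only (its own and ai's); writes its own "next"
    u.phys[next] = u.phys[prev] + u.ai[prev];
    u.physics_done = true;
  }
}

void ai_pass( std::vector<mini_unit> & units )
{
  for( mini_unit & u : units )
  {
    int prev = u.toggle, next = prev ^ 1;
    u.ai[next] = u.ai[prev] + 1;
    u.ai_done = true;
  }
}

void run_frame( std::vector<mini_unit> & units )
{
  std::thread ai_thread( ai_pass, std::ref( units ) );
  physics_pass( units );   // physics runs on the calling thread
  ai_thread.join();        // the handshake: both passes are finished here
  // single-threaded traversal; swap() piggy-backs on it
  for( mini_unit & u : units )
    if( u.physics_done && u.ai_done ) u.swap();
}
```

No locks are needed inside the passes: during a frame nothing writes to a "previous" image or to toggle, and each thread writes only its own "next" image and its own flag.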

EDIT:
Before you say this is too different from what's there, or something along the lines, let me say this is the ONLY sane solution. To whatever extent it necessitates changing the existing code, it's for all the good reasons, and a mighty opportunity to refactor and clean-up the existing code. This is how proper simulation is supposed to work.

"There's no guarantee that there is an 'input' and 'output'" is not a valid argument. There OUGHT to be such guarantees for any efficient multithreading implementation AND for proper simulation code organization; --even if multithreading wasn't being considered.
As I said to you once before, locks ought to be a LAST RESORT, in multithreading. The paradigm embedded in this solution is to make sure that only read-only data are ever shared between threads, between synchronization points. This is the RIGHT way. It minimizes the need for such synchronization points (to only once per frame).

EDIT2:
Further optimization
The next step would be to have all physics data and all ai data separately bunched together in memory.
Instead of unit having embedded structs, it would just have pointers to elements in large, statically allocated respective arrays.
IOW, there'd be somewhere in a cpp file at file-scope

Code: Select all

physics_data pdpoolA[1000], pdpoolB[1000];
ai_data aipoolA[1000], aipoolB[1000];

then, instead of,

Code: Select all

class unit
{
  //two physics data images; and two ai data images
  physics_data phdA, phdB;
  ai_data aiA, aiB;
  //swappable pointers to data members:
  physics_data const * previous_pd;
  physics_data       * next_pd;
  ai_data  const * previous_ai;
  ai_data        * next_ai;

we'd have reference variables initialized once in the constructor's initialization list.

Code: Select all

class unit
{
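  //references to this unit's two physics data images and two ai data images
  //in the pools
  physics_data & phdA, & phdB;
  ai_data & aiA, & aiB;
  //swappable pointers to the images; mutable so that the const swap() can
  //reassign them:
  mutable physics_data const * previous_pd;
  mutable physics_data       * next_pd;
  mutable ai_data  const * previous_ai;
  mutable ai_data        * next_ai;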
  //toggle indicates processing direction between the two images; --i.e.: it
  //determines whether A or B are "previous" or "next"
  mutable char toggle;
  //modified can be set by physics and/or ai, but can be cleared by neither one;
  // --it is only cleared by swap()
  mutable bool physics_modified;
  mutable bool ai_modified;
  void clear_modified() const { physics_modified = ai_modified = false; }
  void physics_set_modified() const { physics_modified = true; }
  void ai_set_modified() const { ai_modified = true; }
public:
  void swap() const
  {
    char modified = 0;
    if( physics_modified ) ++modified;
    if( ai_modified ){ ++modified; ++modified; }
    switch( modified )
    {
    case 0:
      return;
    case 1:
      //physics modified this unit but ai did not
      assert( !"physics modified this unit but ai did not" );
      return;
    case 2:
      //ai modified this unit but physics did not
      assert( !"ai modified this unit but physics did not" );
      return;
    default:
      break;
    }
    ++toggle;
    toggle &= 1;
    if( toggle )
    {
      previous_pd = &phdA;
      next_pd = &phdB;
      previous_ai = &aiA;
      next_ai = &aiB;
    }
    else
    {
      previous_pd = &phdB;
      next_pd = &phdA;
      previous_ai = &aiB;
      next_ai = &aiA;
    }
    clear_modified();
  }
  physics_data       * pd_wr(){ physics_set_modified(); return next_pd; }
  physics_data const * pd_rd() const { return previous_pd; }
  ai_data       * ai_wr(){ ai_set_modified(); return next_ai; }
  ai_data const * ai_rd() const { return previous_ai; }
  //.........
};
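The class above has reference members but shows no constructor. A minimal sketch of the initialization list follows, assuming each unit is handed an index into the pools; pooled_unit, POOL_SIZE and the slot parameter are hypothetical, and how slots get assigned is not something the post specifies.

```cpp
#include <cassert>
#include <cstddef>

struct physics_data { float velocity; };
struct ai_data { float throttle; };

// Pool size is the placeholder figure used in the post.
const std::size_t POOL_SIZE = 1000;
physics_data pdpoolA[POOL_SIZE], pdpoolB[POOL_SIZE];
ai_data aipoolA[POOL_SIZE], aipoolB[POOL_SIZE];

class pooled_unit
{
  physics_data & phdA, & phdB;
  ai_data & aiA, & aiB;
  physics_data const * previous_pd;
  physics_data       * next_pd;
  ai_data const * previous_ai;
  ai_data       * next_ai;
public:
  // References must be bound here; they cannot be reseated afterwards.
  // Initializers are listed in declaration order, which is the order in
  // which the members are actually initialized.
  explicit pooled_unit( std::size_t slot )
  : phdA( pdpoolA[slot] ), phdB( pdpoolB[slot] )
  , aiA( aipoolA[slot] ), aiB( aipoolB[slot] )
  , previous_pd( &phdA ), next_pd( &phdB )
  , previous_ai( &aiA ), next_ai( &aiB )
  {}
  physics_data       * pd_wr(){ return next_pd; }
  physics_data const * pd_rd() const { return previous_pd; }
  ai_data       * ai_wr(){ return next_ai; }
  ai_data const * ai_rd() const { return previous_ai; }
};
```

Taking the address of a reference member yields the address of the pool element it refers to, so the previous/next pointers end up pointing straight into the pools.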

By having such pools, we could later optimize by writing through cache for the memory areas being written but not read in a given frame.
EDIT: Sorry, I'm not sure this would help. Might help with AMD but not with Intel; not sure. Never mind. One thing that would definitely help, er... is NEEDED, is "Assume No Aliasing". Otherwise the compiler wouldn't know that we're reading from read-only variables and writing to write-only variables --i.e.: wouldn't know that the previous/next pointers don't point to the same variables-- and this would hand-cuff the pipeline. The execution speed difference with assume no aliasing should be immense.
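One way to state the no-aliasing promise per pointer, instead of through a global compiler switch, is the __restrict qualifier. This is a compiler extension (accepted by GCC, Clang and MSVC), not standard C++, and using it here is an assumption of this sketch rather than something the post prescribes. integrate() is a hypothetical helper.

```cpp
#include <cassert>

struct physics_data { float position; float velocity; };

// __restrict promises the compiler that 'previous' and 'next' never refer to
// the same object, so a store through 'next' cannot change anything that is
// read through 'previous'.
void integrate( physics_data const * __restrict previous,
                physics_data       * __restrict next,
                float delta_time )
{
  next->velocity = previous->velocity;
  // previous->velocity does not have to be reloaded after the store above
  next->position = previous->position + delta_time * previous->velocity;
}
```

The promise is only valid because the two images really are distinct objects; passing the same pointer for both arguments would be undefined behavior.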

I made other improvements to the code. I renamed the "done" flag to "modified", which better reflects its semantic meaning. And I made the flag, as well as the toggle, mutable. Why? Because these variables, philosophically speaking, are not truly a part of the class; they are more like helper variables; and I wanted to make swap() a const function, because swapping's effect is of no business to clients of the class, and one should be able to call it through a const iterator, if necessary. But this is not important, admittedly, just a bit pedantic.

Something more important I added is a condition in swap(). If the unit has not been modified, it doesn't swap. This way, all we have to do between frames is traverse all units calling swap(), and only those that were modified in the previous frame will actually swap themselves; for the rest it will have no effect, so we avoid placing code outside unit that knows anything about unit's internals.

I also added set_modified() calls to the bodies of pd_wr() and ai_wr(), which you might call a "bug-fix".

(Well, except for the fact that, to do so is inefficient, as the function is called for every write; so we might want the end of the physics update to call set_modified() just once.)

EDIT3:
Usage
Using the physics and ai data is the same as it should be presently, except for the addition of "_rd"/"_wr" to the functions that return data pointers. Thus,

Code: Select all

void unit::physics_update( size_t present_time_milliseconds )
{
  //writing current time to "next" frame
  pd_wr()->time = present_time_milliseconds;
  //subtract previous frame's time to get delta
  float delta_time = float( present_time_milliseconds - pd_rd()->time );
  //forces ...
  //example of reading a previous output of ai's thread
  //(k stands for a thrust scale factor, assumed to be defined elsewhere)
  float thrust = k * ai_rd()->throttle;
  //acceleration ...
  float accel = thrust / this->mass;
  //velocity
  float frame_velocity = pd_rd()->velocity + delta_time * accel;
  pd_wr()->velocity = frame_velocity;
  float average_frame_velocity = frame_velocity;
  average_frame_velocity += pd_rd()->velocity;
  average_frame_velocity *= 0.5f;
  //position:
  pd_wr()->position = pd_rd()->position + delta_time * average_frame_velocity;
  physics_set_modified();
}
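For checking the arithmetic in physics_update above, here is the same velocity and position update as a free function that can be run on its own. physics_state and step() are hypothetical names, and delta_time is assumed to be in whatever time unit accel is expressed in.

```cpp
#include <cassert>

struct physics_state { float position; float velocity; };

// Reads only 'previous', returns the "next" state; mirrors the
// read-from-previous / write-to-next rule.
physics_state step( physics_state const & previous, float accel, float delta_time )
{
  physics_state next;
  // end-of-frame velocity
  next.velocity = previous.velocity + delta_time * accel;
  // average of start-of-frame and end-of-frame velocity
  float average_frame_velocity = 0.5f * ( next.velocity + previous.velocity );
  // position advances by the average velocity over the frame
  next.position = previous.position + delta_time * average_frame_velocity;
  return next;
}
```

Starting from position 0 and velocity 10, with accel 2 and delta_time 0.5, the new velocity is 11, the average velocity is 10.5, and the new position is 5.25.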

EDIT4:
Another assumption here is that all units modified by physics are modified by ai, and all units modified by ai are also modified by physics, in a given frame. So, I modified the code for physics and ai to use separate bits in modified, and put a switch statement in swap() to assert( false ) whenever one but not the other is modified.
EDIT5:
Er... Better have two separate booleans; otherwise they could asynchronously clear each other's bits if they try to set modified at the same time...
FIXED.

EDIT6:
You might want to rename "swap()" to "end_frame()" or something like that... something more meaningful from the perspective of a client of the class, than its internal details of implementation.

EDIT7:
Which of physics or ai should be in the main thread, which in the separate?
On a single core it doesn't matter.
On a dual core, it doesn't matter either, BUT, the two threads should be load-balanced to make best use of both cores. So, whichever of the two takes less time to execute, I'd make it do other duties after it finishes executing its frame and before the synchronization handshake.

Post by **m3t00** » Tue Mar 25, 2008 2:04 pm

With the original game running from a CD, disk i/o was horrible. I used to load the entire game to a RAM disk and run it from there. Just wanted to point out that the entire PR install directory is using less than 250 MB, so don't hesitate to "prefetch" _everything_ if disk i/o is a bottleneck or a pain-in-the-ass to make allowances for. Most people wouldn't notice or care about the extra memory usage if it makes the game better.

Interesting discussion that mostly went over my head. Thanks for making a cool game better.

Post by **safemode** » Tue Mar 25, 2008 3:46 pm

Most OSes use all "free" memory for disk cache. HDD I/O shouldn't be a problem unless you don't have enough RAM to cache it, in which case you wouldn't benefit much from the game doing the same thing the OS would be doing.

I see no point in pre-caching everything. What could benefit is preloading (as in initializing and pushing to GL, etc.) things that the game is going to use before it uses them; this is not the same as precaching, however. In one instance we have a cache; in the other we do not.