Author Topic: Graphic, Sound, & Coding Tips / Tricks / Effects / Etc. Tools for development (Read 21993 times)

Bonknuts · « **Reply #45 on:** October 13, 2015, 04:55:24 AM »

Touko, can you explain a little more?

touko · « **Reply #46 on:** October 13, 2015, 05:15:55 AM »

Quote from: Bonknuts on October 13, 2015, 04:55:24 AM

Touko, can you explain a little more?

I'm going to try.

The idea is to keep a copy (call it SAT2) in VRAM of your main SAT (call it SAT1, also in VRAM).
First, you copy SAT1 to SAT2 each frame with DMA (VRAM -> VRAM).
Next, in your engine, you sort all your sprites. For example, if sprite1 should come in front of sprite2, you DMA the 4 words of sprite1 (from SAT2) to an earlier position in SAT1 than sprite2 (which must be copied too).
When all your sprites are sorted (and of course your DMA list is complete), you do a manual VRAM->SATB DMA once all the transfers in your DMA list are done.
Of course, all DMA transfers must be in the DMA list (except the VRAM->SATB one), with the SAT1-to-SAT2 copy first, and SATB auto DMA must be off.

The goal is to use DMA, not the CPU, for copying the sprites' attributes.
The CPU is used for sorting the sprites and building the DMA list in RAM.

Quote

Before sorting (first DMA transfer in your list):
SAT1 SAT2
spr3 spr3
spr2 spr2
spr1 spr1

After sorting, one transfer per sprite in your list (here we want spr1 in front of spr2):
SAT1 SAT2
spr3 spr3
spr1 spr2
spr2 spr1

Of course, if you are using meta-sprites (with aligned sprites) this will work better.
I don't know if I'm being clear (but to me it is).
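The scheme above can be sketched in Python as a model of the bookkeeping, not as PCE code. VRAM is a plain list of words, a DMA list is a list of (source, destination, length) tuples, and the names `build_dma_list`, `run_dma_list`, `SAT1` and `SAT2` as well as the addresses are assumptions made for the illustration.

```python
WORDS_PER_ENTRY = 4  # one SAT entry is 4 words (8 bytes)

def build_dma_list(order, sat1, sat2, entries):
    """Return (src, dst, length_in_words) VRAM->VRAM transfers.

    order[slot] is the number of the sprite that should end up in that
    SAT1 slot after sorting.
    """
    # First transfer in the list: copy the whole SAT1 into SAT2.
    dma = [(sat1, sat2, entries * WORDS_PER_ENTRY)]
    for slot, sprite in enumerate(order):
        if slot != sprite:  # sprites that did not move need no transfer
            dma.append((sat2 + sprite * WORDS_PER_ENTRY,
                        sat1 + slot * WORDS_PER_ENTRY,
                        WORDS_PER_ENTRY))
    return dma

def run_dma_list(vram, dma):
    """Stand-in for the hardware: perform each transfer in list order."""
    for src, dst, length in dma:
        vram[dst:dst + length] = vram[src:src + length]

# Hypothetical layout: three sprites, each entry tagged with its own number.
SAT1, SAT2, ENTRIES = 0x000, 0x100, 3
vram = [0] * 0x200
for i in range(ENTRIES):
    vram[SAT1 + i * 4:SAT1 + i * 4 + 4] = [i] * 4

order = [0, 2, 1]  # swap the last two sprites
dma_list = build_dma_list(order, SAT1, SAT2, ENTRIES)
run_dma_list(vram, dma_list)
```

The CPU only touches `order` and the small `dma_list`; the attribute words themselves are moved by the transfers, which is the point of the approach.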

Bonknuts · « **Reply #47 on:** October 14, 2015, 07:22:07 AM »

What's the savings in cpu cycles? Something slow like (unoptimized; straight code)... LDA [vector],y -> STA port -> INY would be 15 cpu cycles per byte or 120 cpu cycles per SAT entry (8bytes).

I had a linked list system and SAT in local ram with embedded opcodes (ST1/ST2). It was 5 cycles a byte + one JMP ~return~. The overhead was one JMP [table,x]. The table was a list of JMP $address to jump to the start of the ST1/ST2 list. That's 44 cycles for a single SAT copy into vram, without the overhead of calling/sorting. Of course, the downside is a bloated SAT array in local ram. And accessing the SAT was a little bit more complex (but still doable and optimizable). Let's see, JMP [Addr,x] is 7 cycles, JMP is 4 cycles, so that bumps it up to 44+7+4 = 55 cycles per SAT entry. 455/55 = 8 SAT entries per line.

So 8 SAT entries per scanline. We know V-V DMA is 336bytes per scanline in 10mhz mode (going by the other thread), so the max theoretical SAT updates per scanline via DMA is 42. It's going to be lower, but even if it was something like 30 realistically.. that's way faster than 8.
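For anyone who wants to replay the arithmetic, here is a small Python sketch using only the figures quoted in this post (15 and 5 cycles per byte, 7 and 4 cycles for the jumps, 455 CPU cycles per scanline, 336 bytes per scanline for V-V DMA). The variable names are just for the illustration.

```python
CYCLES_PER_SCANLINE = 455      # CPU cycles per scanline, as quoted above
SAT_ENTRY_BYTES = 8            # one SAT entry is 8 bytes
DMA_BYTES_PER_SCANLINE = 336   # V-V DMA figure quoted from the other thread

def entries_per_scanline(cycles_per_entry):
    """Whole SAT entries the CPU can copy in one scanline."""
    return CYCLES_PER_SCANLINE // cycles_per_entry

# Straight LDA/STA/INY loop: 15 cycles per byte.
naive_cycles = 15 * SAT_ENTRY_BYTES

# Embedded ST1/ST2 opcodes: 5 cycles per byte, plus JMP return (4),
# plus JMP [table,x] (7) and the JMP it lands on (4).
embedded_cycles = 5 * SAT_ENTRY_BYTES + 4 + 7 + 4

naive_rate = entries_per_scanline(naive_cycles)
embedded_rate = entries_per_scanline(embedded_cycles)
dma_rate = DMA_BYTES_PER_SCANLINE // SAT_ENTRY_BYTES

print(naive_cycles, embedded_cycles, naive_rate, embedded_rate, dma_rate)
```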

I think you've got a winner here Touko! So keep a linked list in local ram with a reference to a DMA table. 8 bytes takes 16 cpu vdc cycles. At 10mhz, you wouldn't even need to poll the status flag. Nice.

The nice thing too, is that this DMA approach lends itself nicely to meta-sprite objects too. A single DMA call could handle all meta-cell entries for that object.

touko · « **Reply #48 on:** October 14, 2015, 07:40:35 AM »

Quote

I had a linked list system and SAT in local ram with embedded opcodes (ST1/ST2).

For me it's in my DMA list.

It's this embedded list that gave me the idea.

Quote

The nice thing too, is that this DMA approach lends itself nicely to meta-sprite objects too. A single DMA call could handle all meta-cell entries for that object.

Yes, it's the main idea; this is why I took a brawler as an example.

Quote

At 10mhz, you wouldn't even need to poll the status flag.

I use interrupts, so there's no need for the status flag.

For now my SATB DMA interrupt starts my DMA list, which continues and finishes by itself.

elmer · « **Reply #49 on:** October 14, 2015, 08:40:17 AM »

I can totally understand that getting your SAT sorted for display is quick with this system ... but haven't you just dramatically increased the complexity of the code that actually updates the sprite positions (and palette if you're going to flash it)?

Are you expecting position updates to be written directly to VRAM, or are you still expecting to update a RAM-based SAT and then copy that to VRAM each frame (before doing the sorting)?

Bonknuts · « **Reply #50 on:** October 14, 2015, 01:58:06 PM »

I almost always use sprite sheets. I translate/transform the entry from the sprite sheet (referenced by an object) into a SAT format (single entry or meta). For instance, I might have an object that is made up of 3 sprite entries in the SAT. While I have that object array in ram, it's not defined in a SAT structure. And whether there's a change to the object or not, that object is always translated into a SAT entry every frame as long as it's in that array. The object in the array has attributes like palette number, X/Y position, bounding box, frame number, etc. But those objects still need to be translated into a proper SAT entry or entries. So if the destination is vram instead of local ram, it wouldn't be any more complex to write to vram instead. At least for my setup.

elmer · « **Reply #51 on:** October 14, 2015, 03:20:39 PM »

OK, now I'm really confused! :oops:

It's been a long time since I've actually written a game on a sprite-based machine, so please forgive me if I'm missing something simple.

Quote

While I have that object array in ram, it's not defined in a SAT structure. And whether there's a change to the object or not, that object is always translated into a SAT entry every frame as long as it's in that array.

That makes perfect sense ... AFAIK that was always the most common way of doing things.

But if you're doing that, then isn't the sort normally part of the translation phase? i.e. you normally just "render" the objects in the order that you want them to appear in the SAT, and write the SAT directly to RAM/VRAM in the correct order.

If you want a part of an object to be behind something else, then you'd just have the object place 2 different "render" calls into the list of meta-sprites to be translated.

Now on the PCE, unlike the SNES & Genesis, you don't need to write the SAT to local RAM because you don't have to wait until hsync/vsync to write to VRAM. (Non-programmers really don't understand just how incredibly wonderful the PCE's design is.)

I'm having a hard time figuring out when you'd want to sort-and-copy single-or-multiple SAT entries unless you're doing something like uploading a whole bunch of a level's sprite data-and-animated-SAT-entries semi-permanently into VRAM and then compositing each frame's SAT from the previously-uploaded SAT-fragments.

Can you help me understand what usage case that I'm not thinking of here?

touko · « **Reply #52 on:** October 15, 2015, 12:47:03 AM »

Quote

So if the destination is vram instead of local ram, it wouldn't be any more complex to write to vram instead. At least for my setup.

Exactly. Writing a sprite attribute to a SAT in RAM or in VRAM is the same approach; only the destination differs. You write to RAM or through the port, and the code is the same.

This is why, with the possibility to write to VRAM at any time, using a RAM buffer for the SAT is pretty useless IMO.
Even better, if you change a sprite's attributes you only need to set the right location of your sprite in VRAM, then write to ports $0002/$0003 consecutively and let the auto-increment do the job.
I use that for my meta-sprite routine; it takes 700/800 cycles max for a 4-sprite meta-sprite, and everything is in VRAM.
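To make the "set the location, then let the auto-increment do the job" idea concrete, here is a deliberately simplified Python model. `VdcModel` and `write_sat_entry` are hypothetical names, the SAT base address is arbitrary, and the model only covers what is described above: a low-byte write is latched, and a high-byte write commits the word and advances the write address.

```python
class VdcModel:
    """Toy model of the VDC data port; not a full emulation."""

    def __init__(self, size=0x8000, increment=1):
        self.vram = [0] * size
        self.write_address = 0
        self.increment = increment
        self._low = 0

    def set_write_address(self, address):
        self.write_address = address

    def write_low(self, value):
        # Models a write to $0002: the low byte is only latched.
        self._low = value & 0xFF

    def write_high(self, value):
        # Models a write to $0003: commit the word, then auto-increment.
        self.vram[self.write_address] = ((value & 0xFF) << 8) | self._low
        self.write_address = (self.write_address + self.increment) % len(self.vram)


def write_sat_entry(vdc, sat_base, index, y, x, pattern, attributes):
    """Write one 4-word SAT entry with a single address setup."""
    vdc.set_write_address(sat_base + index * 4)
    for word in (y, x, pattern, attributes):
        vdc.write_low(word & 0xFF)
        vdc.write_high(word >> 8)


vdc = VdcModel()
SAT_BASE = 0x7F00  # assumed SAT location in VRAM
write_sat_entry(vdc, SAT_BASE, 2, y=64, x=32, pattern=0x0100, attributes=0x1180)
```

Updating only one field works the same way: point the write address at that word of the entry and write a single low/high pair.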

Bonknuts · « **Reply #53 on:** October 15, 2015, 06:32:03 AM »

Quote from: elmer on October 14, 2015, 03:20:39 PM

I'm having a hard time figuring out when you'd want to sort-and-copy single-or-multiple SAT entries unless you're doing something like uploading a whole bunch of a level's sprite data-and-animated-SAT-entries semi-permanently into VRAM and then compositing each frame's SAT from the previously-uploaded SAT-fragments.

Can you help me understand what usage case that I'm not thinking of here?

I can't think of any good examples off hand. Because normally, if all sprites are objects and all objects have to be built as sprites per frame, then you can easily sort objects simply by way of a reference list (single byte array). Do that before the object->sat process and there's no need to sort afterwards.

Maybe an example would be where some objects are only rendered into a SAT entry once and stay in that format, and their attributes are manipulated in a simple way, so updating those changes in vram isn't so bad (maybe only X or Y position, or cell #.. something like that). For example if an object is always a 32x64 sprite and all that changes is cell#/X/Y, then it can easily be kept in SAT format. If priority issues need to be evaluated by other objects that are constantly being rebuilt, the DMA list sort would probably be the faster/better option.

I've mixed and matched stuff like this before; debris, bullets, clipping/overlay.. stuff that really only needs X/Y or basic stuff updated in SAT format. Basically because the object->sat process (lots of indirection and redirection if there are quite a few frames and phases for an object) can eat up a decent amount of cpu cycles. I like sprite sheets (frame tables) because they are so easy to design animation "cells" for objects. It's neat, clean and organized, but the downside is processing time. Cheat where you can.

Touko: you can eliminate an extra frame delay by setting up the VDC to do two frames inside a VCE frame. It's tricky but it's doable. I've set it up so that the start of the display does SATB DMA instead of at v-int. That would give you your vblank time for DMAing and the start of the display SATB DMA for syncing the update for the frame to be shown. Normally it's not a problem, but this DMA list thing does make that an issue on a stock frame setup.

touko · « **Reply #54 on:** October 15, 2015, 08:34:26 PM »

Quote

you can eliminate an extra frame delay by setting up the VDC to do two frames inside a VCE frame. It's tricky but it's doable

Why would you have an extra frame delay??

If you disable the auto SATB DMA and do it manually after your DMA list is complete, you shouldn't have this delay.

Bonknuts · « **Reply #55 on:** October 16, 2015, 06:16:12 AM »

When I tested doing manual SATB DMA, during vblank, it didn't happen until the next v-int. Guess I'll have to revisit/retest that.

touko · « **Reply #56 on:** October 16, 2015, 08:14:29 AM »

Quote from: Bonknuts on October 16, 2015, 06:16:12 AM

When I tested doing manual SATB DMA, during vblank, it didn't happen until the next v-int. Guess I'll have to revisit/retest that.

Aaaah, it's possible then ..
I always thought it was like other DMA: you have the entire VBLANK for that, and if it's not finished, it resumes next vblank ..

elmer · « **Reply #57 on:** October 16, 2015, 09:14:09 AM »

Quote from: touko on October 16, 2015, 08:14:29 AM

I always thought it was like other DMA: you have the entire VBLANK for that, and if it's not finished, it continues next vblank ..

The VDC manual says ...

For VRAM-SATB block transfers, 256 words are transferred at the beginning of a vertical blanking period. It is triggered by access to the high byte of the VRAM-SATB block transfer source address register (DVSSR). If the register is set, a block transfer operation will start at the beginning of the following vertical blanking period.

Charles MacDonald's pcetech.txt also makes it sound like it's an only-at-the-start-of-vblank trigger.

Hahaha ... am I going to get the chance to use my Inigo Montoya quote again?

touko · « **Reply #58 on:** October 16, 2015, 07:25:13 PM »

Yes, I know that, but in my mind it was only for auto DMA ..

Damn ..

Quote

Hahaha ... am I going to get the chance to use my Inigo Montoya quote again?

Bonknuts · « **Reply #59 on:** October 19, 2015, 08:20:33 AM »

Ok, so I've been crunching numbers all weekend in relation to dynamic tiles. Basically, I was watching some snes game longplays and tried to come up with ways to replicate some of the same effects.

I had this problem, where I wanted to do large area patterns in a faux BG layer - but I didn't want to update the tilemap at all. I only wanted to update the dynamic tile buffer. This presented a challenge, because all I had were 8 horizontally shifted frames. In other words, I would need a complete rotation in order for this to work (not just an 8 pixel rotation). Not only is this problematic, but storing these frames as ST1/ST2 opcodes immediately doubles their size. For a large half screen pattern - that's just not doable.

The first solution I had was to allow a wider buffer (horizontally) than the intended target. Since I use ST1/ST2 opcodes to draw a bitmap line into VDC ram, with an RTS at the end, I'm limited to the width of that stored bitmap line (see below). I figured I could start in the middle of the line, and then restart the line again. The craziness comes in here; I would use the VDC interrupt as a timer for the cpu. Basically, put the cpu on a timer leash and when that timer runs out - reposition the PC back to the parent routine. This keeps the cpu from writing too much on the second call of the same line (to complete the rotation). Not only is this overly complex, it's also exotically dangerous. Kinda.

So a better solution is to break down the large pattern of bitmap lines into segments of smaller horizontal sections (and lines as well). As in, there would be multiple breaks in a single bitmap line (as RTSs), and you call each segment in sequence. The vram pointer doesn't need to be re-adjusted because the writes are still sequential in relation to vram. For example, say I have a pattern that is 15 tiles wide (120px). If I broke that down into 5 segments, it would be 24px line segments. The buffer only needs to be (n+1) segments wide, for overflow handling. So 120+24 = 144px wide buffer in vram. This allows you to start in one of three positions inside a segment (a 24px segment is 3 tiles): 0, 1, 2. So for a full horizontal rotation of a 15 tile (120px) wide pattern, you need the segment offset, the offset inside the segment, and the frame rotation number.
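The segment arithmetic can be sketched in Python. This is one reading of the scheme, with the assumption that the "frame rotation number" is the 0-7 pixel shift picking one of the 8 pre-shifted frames; the function names are made up for the illustration.

```python
TILE_PX = 8  # tile width in pixels

def buffer_width_px(pattern_px, segments):
    """Pattern width plus one extra segment for overflow handling."""
    segment_px = pattern_px // segments
    return pattern_px + segment_px

def split_rotation(offset_px, pattern_px, segments):
    """Split a horizontal rotation in pixels into
    (segment offset, tile offset inside the segment, frame rotation number)."""
    segment_px = pattern_px // segments
    offset_px %= pattern_px                      # rotation wraps around
    segment, within = divmod(offset_px, segment_px)
    tile_in_segment, frame = divmod(within, TILE_PX)
    return segment, tile_in_segment, frame

width = buffer_width_px(120, 5)       # 15 tiles wide, 5 segments
parts = split_rotation(47, 120, 5)    # rotate the pattern by 47 pixels
print(width, parts)
```

With the numbers from the post, a 47px rotation starts in segment 1, at tile position 2 of that segment, using shifted frame 7.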

This gives me a little bit of overhead; one JSR/RTS set per line segment, but the approach is much cleaner and less wasteful for vram. I still have a smaller buffer for over-run, but not nearly as big.

So to give some ideas for clarification here, I'll explain a few details.

1) To draw a bitmap line into vram, you need to set the vram pointer autoincrement to 32+. Since only a WORD can be written into vram at a time (two 8-pixel planes), you'll have to do a second pass to write the full 4bit color image (assuming 4bit color is the goal). 32+ means increment by 32 WORDS, so you'll need to organize the tiles (interleave). Here is where another optimization comes in: you can draw a bitmap line into a buffer of tiles @ ( n*8 )+offset. So if you start at line 0 of the buffer and keep writing past the end of it (with autoincrement of 32+), you'll end up writing into the next tile row of the buffer at line 8, and then line 16, etc - until you reach the end. You then reposition the offset in vram to point to line 1, and that'll draw lines 1, 9, 17, etc. So you save cycles by not having to constantly set the VRAM pointer every scanline. You only need to do this 8 times. But that's just for the first 2bit planes. You need to do this again with the proper offset into the second 2bit planes of the tile. So the data needs to be organized in a very specific way in rom, with embedded opcodes, but the result is very optimal in terms of speed.
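Here is a Python sketch of the address pattern that point 1 describes. It assumes a buffer that is `width` columns wide, where the tile rows follow each other in vram so that one column step and one tile-row step are both multiples of the 32-word stride; that layout, and the names `line_addresses` and `decode`, are assumptions for the illustration. The 16-word tile format (8 words for planes 0/1, then 8 words for planes 2/3) is the standard BG tile layout.

```python
STRIDE = 32  # vram auto-increment, in words

def line_addresses(base, line_in_tile, plane_pair, count):
    """Word addresses hit by `count` consecutive writes with +32 increment.

    plane_pair 0 is the first 2bit planes, 1 is the second 2bit planes,
    which sit 8 words further into each 16-word tile.
    """
    start = base + line_in_tile + 8 * plane_pair
    return [start + i * STRIDE for i in range(count)]

def decode(base, address, width):
    """Map a word address back to (buffer line, column, plane pair)."""
    slot, within = divmod(address - base, STRIDE)
    tile_row, column = divmod(slot, width)
    plane_pair, line_in_tile = divmod(within, 8)
    return tile_row * 8 + line_in_tile, column, plane_pair

WIDTH = 3  # a tiny 3-column buffer keeps the output readable
addresses = line_addresses(0, 1, 0, 2 * WIDTH)
lines = [decode(0, a, WIDTH)[0] for a in addresses]
print(addresses, lines)
```

One pointer setup at line 1 produces writes to line 1 and then, simply by writing past the end of the row, to line 9, which is the 1, 9, 17 pattern from the post.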

2) If the amount of data to write to vram is quite large, you'll probably be racing the display because there won't be enough time in vblank for really large patterns. This is ok, though. There are 455 cpu cycles per VCE scanline. As long as you're under that, you'll be fine... kinda. If you followed along with the above method, you'll see that it requires multiple passes, and then a second pass for the second bitplane. Even if your data being written to vram, in the form of a scanline, is faster than the beam - the order of the written data presents the problem. One solution is to split the bitmap buffer in vram into two halves: an upper bitmap and a lower bitmap. And start this drawing process earlier in vblank, or put a status bar at the top of the screen to effectively increase screen drawing time. Maybe two halves isn't enough. Maybe more are needed. Or maybe just do 16-24 lines full color, and then switch methods. The idea here is to get enough buffered room between the beam and where you are writing in that bitmap position. Remember, if your data is stored as scanlines, you have a very flexible method and control in how you write to that vram bitmap.

The idea here is to do a large pattern of dynamic tiles at 60fps with no double buffering. Of course double buffering will work, assuming the copy fits in the vblank vram-vram time frame, but with a large buffer already, that means eating further into your vram space.

The whole idea of doing dynamic tiles as scanlines, means you no longer have limitations of tiles. You can do sine wave effects or line scrolls because you have direct control over the X position of that line being transferred into vram. You can Y-scale and do vertical effects as well. You can also draw parts of an image in reverse order (vertical flipping/mirroring, etc). And maybe the best of all, you can do transparency effects on that pseudo layer. Remember, you have to write the image as two 2bit images. Since the hardware puts them together, you have hardware assisted compositing. Sure, the colors are low (4 colors for one plane, and 3 colors of transparency to overlay) - but you also have the aid of palettes to assign to any 8x8 area. If you're ambitious, you could do the tilemap reposition trick to get 8x4 or 8x2 palette association.
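The "hardware assisted compositing" point can be illustrated in Python. Each pixel's 4-bit palette index is just the two 2-bit images side by side, so the palette itself can encode the blend. Treating the low two planes as the base image and the high two as the overlay, averaging as the blend, and the names `composite_index` and `build_palette` are all assumptions for this sketch.

```python
def composite_index(base, overlay):
    """Combine two 2-bit pixel values into one 4-bit palette index."""
    return ((overlay & 3) << 2) | (base & 3)

def build_palette(base_colors, overlay_colors):
    """Build the 16 palette entries for one base/overlay pair.

    base_colors: 4 RGB tuples (3 bits per channel, values 0-7).
    overlay_colors: 3 RGB tuples for overlay values 1..3; overlay 0 means
    "no overlay", so those four entries keep the plain base colors.
    """
    palette = []
    for overlay in range(4):
        for base in range(4):
            color = base_colors[base]
            if overlay:
                tint = overlay_colors[overlay - 1]
                color = tuple((c + t) // 2 for c, t in zip(color, tint))
            palette.append(color)
    return palette

base_colors = [(0, 0, 0), (2, 2, 2), (4, 4, 4), (6, 6, 6)]
overlay_colors = [(7, 0, 0), (0, 7, 0), (0, 0, 7)]
palette = build_palette(base_colors, overlay_colors)
blended = palette[composite_index(1, 2)]  # base color 1 under overlay 2
print(blended)
```

Since every 8x8 area can pick its own palette, different regions can use different base/overlay color sets without touching the pixel data.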

This is just one approach. I have other dynamic tile approaches with different abilities. Some of them allow object drawing into vram, overlapping edges on tiles without using sprites, etc. But there's not enough time to do that in 60fps, so those effects would be 30fps. Hopefully I can demonstrate these effects in a demo of sorts (playable with a simple engine).

Just to note: I'm using the bitmap line approach because it offers way more control over the pattern/pseudo layer than the tile write approach. I could have easily arranged a column of tiles in vram to show as a row format in the tilemap (which is a super easy setup/approach), but because of how the tiles are composed of two 2bit tiles, doing independent vertical scrolling on this fake BG layer becomes much more complex. A bitmap line approach easily allows for independent vertical and horizontal scrolling of the fake BG layer, but also allows hsync style effects (horizontal), vertical effects, as well as transparency or split layer effects.
