Author Topic: Graphic, Sound, & Coding Tips / Tricks / Effects / Etc. Tools for development (Read 25515 times)

touko · « **Reply #15 on:** January 06, 2015, 07:31:22 PM »

Quote

If it's a brawler, you could probably get away with interleaving updates on specific frame intervals. I mean, it's pretty rare that frame updates need to be 60hz for a character. Someone else had mentioned this (something like 4 frame slots, so frame/pixel animation was limited to 15fps).

Of course, you're right.
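The interleaving idea from the quote can be sketched roughly like this. It is a host-side Python illustration, not 6280 code, and `SLOTS` and `due_for_update` are made-up names; the only thing taken from the thread is "4 frame slots, so 15fps".

```python
SLOTS = 4  # assumed number of frame slots, as mentioned in the quote


def due_for_update(object_index: int, frame: int) -> bool:
    """True when this object's animation frame may be uploaded on this video frame."""
    return frame % SLOTS == object_index % SLOTS


def updates_per_second(refresh_hz: int = 60) -> float:
    """Animation rate each object gets when updates are spread over SLOTS frames."""
    return refresh_hz / SLOTS
```

With 8 objects, only two of them would upload new pixel data on any given frame, so the per-frame transfer cost drops to about a quarter.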

Quote

Though if it's SGX, and you're using both sprite planes from both VDCs as a single virtual/pseudo SATB (cause you need to process sprite priority layering) - then I can see where you would need to update between both VDCs at a faster rate. I.e. it's fairly dynamic as to which VDC would receive the frame update (or even need a redundant update; the character moved from VDC2 SATB to VDC1 SATB and VDC1 needs the frame that VDC2 already had in memory).

Hehe, yes, that's my problem: I want to maximise the number of sprites on screen. That's not difficult for most types of games, but for a brawler with Y ordering it's slightly difficult to manage across two separate layers.

Y ordering will be done with a dynamic sprite list, like a linked list; for now this is the best I've found.
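A Y-ordered linked list of this kind might look something like the sketch below. It is only a Python illustration of the data structure; `SpriteNode` and `YSortedList` are hypothetical names, and nothing here reflects how the actual engine stores its sprites.

```python
class SpriteNode:
    def __init__(self, name, y):
        self.name, self.y = name, y
        self.prev = self.next = None


class YSortedList:
    """Doubly linked list kept in ascending Y order (back-to-front draw order)."""

    def __init__(self):
        self.head = None

    def insert(self, node):
        if self.head is None or node.y < self.head.y:
            node.prev, node.next = None, self.head
            if self.head is not None:
                self.head.prev = node
            self.head = node
            return
        cur = self.head
        while cur.next is not None and cur.next.y <= node.y:
            cur = cur.next
        node.prev, node.next = cur, cur.next
        if cur.next is not None:
            cur.next.prev = node
        cur.next = node

    def remove(self, node):
        if node.prev is not None:
            node.prev.next = node.next
        else:
            self.head = node.next
        if node.next is not None:
            node.next.prev = node.prev
        node.prev = node.next = None

    def move(self, node, new_y):
        """Re-link a sprite after its Y coordinate changed."""
        self.remove(node)
        node.y = new_y
        self.insert(node)

    def names(self):
        out, cur = [], self.head
        while cur is not None:
            out.append(cur.name)
            cur = cur.next
        return out
```

Walking the list from the head gives the draw order, which could then be dealt out to whichever sprite layer has room.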

Quote

But being a brawler, surely you could fit most enemy frames (as ST1/ST2) in local memory to keep those as fast updates. Since most of the game data is going to sit in AC ram anyway, you have lots of room for stuff in local ram. 32k of ram should be plenty for a brawler engine, and 16k of work ram (8k of that being original sys ram). That leaves you with 216k. And most brawlers break up a stage into subsections, which you could take advantage of and replace/update enemy opcode sprites in local ram (from AC ram). Same for bosses. You can use mini 'transition' scenes to hide this 'loading'. Or do it as a background process in an area that's lite on enemies. Etc.

Now I think I'll do something like that. For now I don't have any gfx, and I don't think the sprite sizes will be close to FF, but more like Double Dragon, with many more different moves for the players and about 10 enemies on screen (and 2-player co-op, of course).
I calculated about 50% of max CPU use for an 8.5 KB/frame transfer. That should be enough, and it leaves me ~60000 cycles for game logic. The drawback is the need to double-buffer the sprite data in VRAM, which is not really a problem with SGX, but can be tedious with Y ordering and the 2 sprite layers.
I can probably use SCD RAM for the players' compiled sprites and AC for the rest, with a good organisation of the sprite data to avoid transferring empty tiles.
The sprite buffer can be cleared with fast VDC DMA and a DMA list driven by interrupt (which I already have).
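As a rough sanity check of those numbers, here is the arithmetic in Python. The clock rate (about 7.16 MHz), the rounded 60 Hz refresh, and the 17 + 6n cycle cost of a block transfer instruction are assumptions on my side, not figures from the post.

```python
CPU_HZ = 7_159_090            # assumed: 21.47727 MHz master clock / 3
FRAMES_PER_SECOND = 60        # assumed: rounded NTSC refresh rate
TRANSFER_BYTES = int(8.5 * 1024)

cycles_per_frame = CPU_HZ // FRAMES_PER_SECOND
transfer_budget = cycles_per_frame // 2          # the "50% of max CPU" figure
logic_budget = cycles_per_frame - transfer_budget
cycles_per_byte = transfer_budget / TRANSFER_BYTES


def block_transfer_cycles(n: int) -> int:
    """Assumed cost of one block transfer instruction moving n bytes."""
    return 17 + 6 * n
```

Under these assumptions a frame has about 119,000 cycles, half of that is just under 60,000, and the transfer has a little under 7 cycles per byte to work with, which is in the same range as the assumed block transfer cost.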

Quote

If you can't tell, I've thought a lot about this before (SGX+AC+brawler)

I see, lol. You describe exactly the same problems I have.

This has been on my mind for a while. No code or anything for now, only theory and some basics, and I dream of what this combo could allow ..

touko · « **Reply #16 on:** February 06, 2015, 04:14:11 AM »

Hi, I converted an LZ4 decompressor to the 6280.
It's a real-time decompressor with a good compression/speed ratio, and it can compress all kinds of files.
For example, I compressed the first level of my game chuck no rise: the original file is 23 KB, and it's down to 13 KB compressed.
A sprite pattern (not very optimised) of about 7 KB is 1.79 KB compressed.
My routine is functional and works very well, but the bank rollover is not finished yet. For now it also lacks the part to decompress directly into VRAM, and it uses a simple copy (lda=>sta) rather than a block transfer instruction.
Yes, you can boil this decompression down to plain byte copies; this is why it's so fast.
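For readers who haven't looked at the format, here is a minimal Python sketch of a decoder for the raw LZ4 block format (token byte, literals, 2-byte little-endian offset, match). It is not the 6280 routine described above, it does no error checking, and `lz4_block_decompress` is just an illustrative name.

```python
def lz4_block_decompress(src: bytes) -> bytes:
    """Decode one raw LZ4 block (no frame header, no checksums)."""
    out = bytearray()
    i = 0
    while i < len(src):
        token = src[i]
        i += 1
        # Literal run length: high nibble, extended by extra bytes when it is 15.
        lit_len = token >> 4
        if lit_len == 15:
            while True:
                extra = src[i]
                i += 1
                lit_len += extra
                if extra != 255:
                    break
        out += src[i:i + lit_len]
        i += lit_len
        if i >= len(src):
            break  # the last sequence carries literals only
        # Match: offset counts back from the current end of the output.
        offset = src[i] | (src[i + 1] << 8)
        i += 2
        match_len = token & 0x0F
        if match_len == 15:
            while True:
                extra = src[i]
                i += 1
                match_len += extra
                if extra != 255:
                    break
        match_len += 4  # minimum match length is 4
        start = len(out) - offset
        for k in range(match_len):
            out.append(out[start + k])  # byte by byte, so overlaps work
    return bytes(out)
```

Both the literal run and the match really are plain byte copies, which is where the speed comes from.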

Has anyone already tested this in a game?

Example, code, algorithm, and benchmarks for the Apple IIGS (65816):
http://www.brutaldeluxe.fr/products/crossdevtools/lz4/

lz4 creator's blog:
http://fastcompression.blogspot.fr/2011/05/lz4-explained.html

elmer · « **Reply #17 on:** February 06, 2015, 06:02:43 AM »

Quote from: touko on February 06, 2015, 04:14:11 AM

Has anyone already tested this in a game?

Yes, it's very suitable for games. We were using various LZ77 variants (the same family as LZ4) all through the 1980s and 1990s for compressing game data.

I've been meaning to look up LZ4 for a while now to see what the fuss is about, thanks for the links to the explanation.

The trick with all the LZ77-variants is what scheme you use to store the literal/match lengths and offsets ... LZ4 seems to offer a good balance of compression and performance.

I can't see that you'd want to decompress directly to VRAM when you need to keep the sliding window of previous data around for copying the "match" bytes ... but you can certainly play with the code to achieve that effect if you want to.

If you're short of decompression space, what we usually did was to just split larger data into separately-compressed blocks of a fixed size (say 8KB), and then decompress a block at a time. That's what I did to make a cacheable-filesystem-on-a-cartridge for a few N64 games.
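The block-splitting approach can be sketched as below. This is a Python illustration only: `zlib` stands in for whatever LZ77-style compressor is actually used, and `pack_blocks`, `read_block` and `read_range` are hypothetical helper names.

```python
import zlib

BLOCK_SIZE = 8 * 1024  # fixed uncompressed block size, as in "say 8KB"


def pack_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Compress each fixed-size block independently of the others."""
    return [zlib.compress(data[i:i + block_size])
            for i in range(0, len(data), block_size)]


def read_block(blocks, index: int) -> bytes:
    """Decompress a single block; only one block-sized buffer is needed."""
    return zlib.decompress(blocks[index])


def read_range(blocks, start: int, length: int,
               block_size: int = BLOCK_SIZE) -> bytes:
    """Fetch an arbitrary byte range by decompressing only the blocks it touches."""
    out = bytearray()
    first = start // block_size
    last = (start + length - 1) // block_size
    for index in range(first, last + 1):
        out += read_block(blocks, index)
    skip = start - first * block_size
    return bytes(out[skip:skip + length])
```

The price is slightly worse compression, because matches cannot reach back across a block boundary.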

touko · « **Reply #18 on:** February 06, 2015, 06:38:40 AM »

Thanks for the feedback ;-)

Quote

I can't see that you'd want to decompress directly to VRAM when you need to keep the sliding window of previous data around for copying the "match" bytes ... but you can certainly play with the code to achieve that effect if you want to.

It's easy (I think), because the matches and literals stay in the source file in RAM, not in VRAM; only the destination will be ..

Quote

If you're short of decompression space, what we usually did was to just split larger data into separately-compressed blocks of a fixed size (say 8KB), and then decompress a block at a time. That's what I did to make a filesystem-on-a-cartridge for a few N64 games.

I want to avoid copying data into a buffer first and then into VRAM, for data that needs to end up there, like sprite patterns, and instead decompress it directly into VRAM.
As you can access VRAM at any time, why not do it? ;-)

elmer · « **Reply #19 on:** February 06, 2015, 07:25:44 AM »

Quote from: touko on February 06, 2015, 06:38:40 AM

I want to avoid copying data into a buffer first and then into VRAM, for data that needs to end up there, like sprite patterns, and instead decompress it directly into VRAM.
As you can access VRAM at any time, why not do it? ;-)

Why not?? ... because I'm afraid that you've misunderstood the LZ4 algorithm. Here's a quote from one of the pages that you linked ...

Quote

With the offset and the match length, the decoder can now proceed to copy the repetitive data from the already decoded buffer.

That's how all LZ77-variants work ... by exploiting the repetitive nature of the decompressed data.

You must keep a sliding window buffer of the decompressed data available to copy from.

The match offset and match length in the compressed data refer to offsets and lengths in the decompressed data.

Now ... you certainly can modify the algorithm to use a sliding window in the compressed data instead of the decompressed data and thus enable decompression directly into VRAM ... but your compression-ratio will almost-certainly suffer very, very badly (I tried this back in the 1980's).

You are welcome to try this yourself ... your data may be different enough that it will work ... but don't be surprised if it doesn't!

elmer · « **Reply #20 on:** February 06, 2015, 09:19:13 AM »

Quote from: elmer on February 06, 2015, 06:02:43 AM

Yes, it's very suitable for games. We were using various LZ77 variants (the same family as LZ4) all through the 1980s and 1990s for compressing game data.

I've been meaning to look up LZ4 for a while now to see what the fuss is about, thanks for the links to the explanation.

Since one of touko's links actually gave a testsuite, I thought that I'd run my old SWD compressor on it. Just like LZ4, my compressor's LZ77-style-encoding is designed for fast game-time decompression.

I'm insufferably happy to present the following results (compressed size in bytes) ...

Test File LZ4 SWD
---------------------------
ANGELFISH 6,505 5,799
ASTRONUT 23,517 21,426
BEHEMOTH 14,799 14,068
BIG 2,800 2,571
BUTTERFLY 8,862 8,137
CD 6,651 6,164
CLOWN 18,873 16,934
COGITO 7,659 9,666
COTTAGE 15,297 13,628
FIGHTERS 13,099 12,182
FLOWER 13,217 12,338
JAZZ 9,970 9,074
KNIFE 14,807 13,707
LORI 20,258 18,610
MAX 8,640 8,171
OWL 18,471 15,347
RED.DRAGON 20,592 18,903
TAJ 16,303 13,953
TUT 12,548 11,476

touko · « **Reply #21 on:** February 06, 2015, 08:07:11 PM »

Quote

You must keep a sliding window buffer of the decompressed data available to copy from.

The match offset and match length in the compressed data refer to offsets and lengths in the decompressed data.

Ah OK, I see. It's not a problem, because you have 2 independent VRAM pointers, 1 for read and 1 for write.
You can easily point at the destination data already in VRAM (with the read pointer), copy your match byte into the A reg, and write it repeatedly to VRAM (with the write one). I treat VRAM like a buffer; don't forget we have unlimited access to VRAM, and not only in vblank.
And as the write pointer is auto-incremented, you do not have to set it each time or do a 16-bit increment of the destination as we do in RAM.

Of course, you cannot use block transfer instructions in this case.
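The two-pointer idea can be modelled in a few lines of Python. This is a deliberately simplified, word-granular model: `Vram`, `read_ptr` and `write_ptr` are hypothetical names, and the real VDC's read prefetch behaviour is not modelled at all.

```python
class Vram:
    """Word-addressed memory with independent auto-incrementing pointers."""

    def __init__(self, words: int):
        self.mem = [0] * words
        self.read_ptr = 0    # stands in for the VRAM read address
        self.write_ptr = 0   # stands in for the VRAM write address

    def read_word(self) -> int:
        value = self.mem[self.read_ptr]
        self.read_ptr += 1   # auto-increment after every read
        return value

    def write_word(self, value: int) -> None:
        self.mem[self.write_ptr] = value & 0xFFFF
        self.write_ptr += 1  # auto-increment after every write


def copy_match_words(vram: Vram, offset_words: int, length_words: int) -> None:
    """Copy a match out of data that was already decompressed into VRAM."""
    vram.read_ptr = vram.write_ptr - offset_words  # only the read pointer is set
    for _ in range(length_words):
        vram.write_word(vram.read_word())
```

Only the read pointer has to be set per match; the write pointer just keeps running forward through the output.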

Quote

Now ... you certainly can modify the algorithm to use a sliding window in the compressed data instead of the decompressed data and thus enable decompression directly into VRAM ... but your compression-ratio will almost-certainly suffer very, very badly (I tried this back in the 1980's).

The LZ4 algorithm is very simple, and I already have a version that's faster than the 65C02 one (for RAM decompression only).
My version is based on this one:
http://pferrie.host22.com/misc/appleii.htm
It will inevitably increase the code size (that's already the case), but it should be faster than copying the data twice, I think.

Quote

You are welcome to try this yourself ... your data may be different enough that it will work ... but don't be surprised if it doesn't!

I'll try, but for now the difficulty is how to manage the 2 options, RAM and VRAM, efficiently without convoluted code, not whether writing directly to VRAM is possible (because it is).

Quote

I'm insufferably happy to present the following results (compressed size in bytes) ...

Test File LZ4 SWD
---------------------------
ANGELFISH 6,505 5,799
ASTRONUT 23,517 21,426
BEHEMOTH 14,799 14,068
BIG 2,800 2,571
BUTTERFLY 8,862 8,137
CD 6,651 6,164
CLOWN 18,873 16,934
COGITO 7,659 9,666
COTTAGE 15,297 13,628
FIGHTERS 13,099 12,182
FLOWER 13,217 12,338
JAZZ 9,970 9,074
KNIFE 14,807 13,707
LORI 20,258 18,610
MAX 8,640 8,171
OWL 18,471 15,347
RED.DRAGON 20,592 18,903
TAJ 16,303 13,953
TUT 12,548 11,476

Wow, your compressor is better than LZ4 in every case. And what about the speed?

elmer · « **Reply #22 on:** February 07, 2015, 04:45:55 AM »

Quote from: touko on February 06, 2015, 08:07:11 PM

Ah OK, I see. It's not a problem, because you have 2 independent VRAM pointers, 1 for read and 1 for write.

Excellent! Yes, it'll work directly to/from VRAM. ... I'm still not used to the intricacies of the PCE's VDC and was thinking about other (much more limited) machines.

Just remember that you are copying a string of bytes from the previous data, so it's a sequence of read/write pairs and not just read-once, write-many.
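The difference between the two copy styles is easy to see in a small Python sketch (illustrative names only, operating on a plain byte buffer rather than VRAM):

```python
def copy_match(out: bytearray, offset: int, length: int) -> None:
    """LZ77-style match: every byte is read from earlier output, then written."""
    src = len(out) - offset
    for k in range(length):
        out.append(out[src + k])  # one read/write pair per byte


def fill_run(out: bytearray, length: int) -> None:
    """Read-once, write-many: only equivalent to a match with offset 1."""
    value = out[-1]
    out.extend([value] * length)
```

With offset 1 the two give the same result, but for any larger offset the match reproduces a whole string, so each byte needs its own read.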

That's going to get ugly very quickly with even/odd byte boundaries ... so what I'd suggest is to hack up a customized version of LZ4 that processes 16-bit words instead of 8-bit bytes, it'll be a much better match for the VRAM data that way and avoid lots of ugly code.

I seem to remember losing a few % of compression when I tried that on the Gameboy, but it'll make your life much easier ... I think that it's a good trade off for your usage.

Quote

My version is based on this one:
http://pferrie.host22.com/misc/appleii.htm
It will inevitably increase the code size (that's already the case), but it should be faster than copying the data twice, I think.

His code is written for clarity and not speed, so you can definitely do better.

Quote

Wow, your compressor is better than LZ4 in every case. And what about the speed?

It is almost-certainly a bit slower, because I bit-pack the offset/length encodings, but in my experience most of the compressed data is single-byte literals which should be just as fast (or faster) than LZ4.

I'll have to clean up the code a bit and release it on github, and then you can run some tests!

Remember ... there is always a tradeoff between compression and speed, that's why LZ4 is so fast ... it uses a very simple encoding for the runs/offsets/lengths.

My encoding is a bit more complex, and usually gets an extra few % of compression, but not always ... you can see that SWD is actually considerably larger than LZ4 in one of the tests.

It all depends upon the data, and LZ4 is more resilient to different data sets than my encoding, which was originally hand-tuned for the character/map/sprite data in one specific game.

The test suite that the AppleII guys used is, IMHO, not a very good representation of the character/map/sprite data used on the PCE/Genesis/SNES/Gameboy ... it contains way too many runs of single-color or simple-pattern pixels.

touko · « **Reply #23 on:** February 07, 2015, 06:08:10 AM »

Quote

Just remember that you are copying a string of bytes from the previous data, so it's a sequence of read/write pairs and not just read-once, write-many.

Yes, that's the case for literals, but not for matches, no?

Quote

That's going to get ugly very quickly with even/odd byte boundaries ... so what I'd suggest is to hack up a customized version of LZ4 that processes 16-bit words instead of 8-bit bytes, it'll be a much better match for the VRAM data that way and avoid lots of ugly code.

Of course, I'm not sure that copying directly into VRAM will be practical, and like I said, the 2 cases (RAM/VRAM) are not easy to handle together and (maybe) imply dirty code and an increase in decompressor code size.

The buffer in RAM is by far the simplest solution, but it's not optimal in terms of speed.

Quote

I seem to remember losing a few % of compression when I tried that on the Gameboy, but it'll make your life much easier ... I think that it's a good trade off for your usage.

I'm not ruling out any solution.

Quote

His code is written for clarity and not speed, so you can definitely do better.

Exactly, and for size too, but definitely not for speed.

Quote

I'll have to clean up the code a bit and release it on github, and then you can run some tests!

Thanks so much

Quote

Remember ... there is always a tradeoff between compression and speed, that's why LZ4 is so fast ... it uses a very simple encoding for the runs/offsets/lengths.

You're right. I'm not a fan of compression in general, and I'm looking for a good compromise between size and speed; I don't like to spend too many cycles decompressing data.

Quote

It all depends upon the data, and LZ4 is more resilient to different data sets than my encoding, which was originally hand-tuned for the character/map/sprite data in one specific game.

This is why I went with LZ4: it's not too bad for all kinds of data, and easy to implement.
But yours is very good too.

Quote

The test suite that the AppleII guys used is, IMHO, not a very good representation of the character/map/sprite data used on the PCE/Genesis/SNES/Gameboy ... it contains way too many runs of single-color or simple-pattern pixels.

Of course. I ran some tests on my PCE graphics data, mainly tilesets and sprites, and it was very good for my use: not the best, of course, but with a factor of 2 to 2.5 in most cases.

Bonknuts · « **Reply #24 on:** February 07, 2015, 06:32:47 AM »

Wow, that's a really simple compression algorithm (LZ4). I love looking and taking apart different compression schemes (they all have their own advantage).

Planar graphics never compress that well compared to packed pixel. I wonder how it does with 4-bit packed pixel nibbles. Gate of Thunder uses LZSS and has the sprites (and IIRC, tiles) all in packed pixel format. The compression algorithm knows ahead of time whether the graphic data is a 16x16 native sprite cell or an 8x8 tile cell, and has an internal counter that, when expired, converts the decompressed graphics data back into PCE format and writes it to vram. On top of that, it does this in real time as the game engine is playing along. I didn't fully investigate how the game engine does this, but making a time-sliced background 'process' isn't too difficult. Definitely something you can do if the game is set up in such a fashion that you have 'lead time' before the graphics are due - thus decompress them in the background process over quite a few frames as the normal game logic is running.
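To make the packed-to-planar step concrete, here is a Python sketch for one 8x8 tile. The input layout (two pixels per byte, high nibble first) and the function name are assumptions for illustration, and the output layout assumed is planes 0/1 interleaved per row in the first 16 bytes, planes 2/3 in the second 16. It is not taken from Gate of Thunder's code.

```python
def packed_to_planar_tile(packed: bytes) -> bytes:
    """Convert one 8x8 tile from 4-bit packed pixels to a 4-plane layout."""
    if len(packed) != 32:
        raise ValueError("an 8x8 tile at 4 bits per pixel is 32 bytes")
    tile = bytearray(32)
    for row in range(8):
        planes = [0, 0, 0, 0]
        for col in range(8):
            byte = packed[row * 4 + col // 2]
            pixel = (byte >> 4) if col % 2 == 0 else (byte & 0x0F)
            for p in range(4):
                if pixel & (1 << p):
                    planes[p] |= 0x80 >> col  # leftmost pixel is bit 7
        tile[row * 2] = planes[0]
        tile[row * 2 + 1] = planes[1]
        tile[16 + row * 2] = planes[2]
        tile[16 + row * 2 + 1] = planes[3]
    return bytes(tile)
```

Even in this form you can see the overhead: every pixel is touched four times, once per plane.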

I've used LZss, pucrunch, and packfire for PCE. All with circular buffer to decode directly to vram. With Pucrunch, I was able to get really good results with 512k and 1024k window sizes. But man.. it's slow. Especially with the packed pixel to planar counter/conversion implemented ;>_>

Some other later gen PCE CD games that use LZss prime half or more of the 'window' with a special set of values every time, before the decompression process starts. The compression algorithm knows this ahead of time and can reference this (usually great for tilemap data and such).

Thinking about all of this in the context of CD ram, reserving a larger decompression buffer can negate better compression savings (because you're taking away 'storage' ram for 'work' ram to create a local decompression area). I think in this context, decompressing directly to vram can save more overall CDRAM space, even with a slightly worse compression ratio/scheme. Of course, it's all really relative to what you need for your project.
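The window-priming trick is the same idea as a preset dictionary. A small Python sketch using `zlib`'s `zdict` parameter as a stand-in (the PCE games use their own LZss code, so this only illustrates the principle; the helper names are made up):

```python
import zlib


def compress_primed(data: bytes, preset: bytes) -> bytes:
    """Compress with the window pre-filled with `preset`."""
    comp = zlib.compressobj(zdict=preset)
    return comp.compress(data) + comp.flush()


def decompress_primed(blob: bytes, preset: bytes) -> bytes:
    """The decompressor must prime its window with exactly the same bytes."""
    decomp = zlib.decompressobj(zdict=preset)
    return decomp.decompress(blob) + decomp.flush()
```

Data that would otherwise be all literals can be encoded as matches into the preset from the very first byte.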

elmer: I'm looking forward to your 'SWD' compression tools when you release them.

elmer · « **Reply #25 on:** February 07, 2015, 07:08:53 AM »

Quote from: touko on February 07, 2015, 06:08:10 AM

Yes, that's the case for literals, but not for matches, no?

No, I'm afraid that you often copy multiple bytes from the match position, that's how it gets its good compression.

Quote

Of course, I'm not sure that copying directly into VRAM will be practical

Because you have both read and write pointers to VRAM that actually auto-increment ... it will be blindingly fast compared to the regular-RAM version. But doing the compression with 16-bit words instead of 8-bit bytes is likely to hurt the compression quite a bit.

You can still do it with bytes, but you'll probably end up with 4 different routines to cope with the various combinations of even/odd source/destination.

elmer · « **Reply #26 on:** February 07, 2015, 07:53:57 AM »

Quote from: Bonknuts on February 07, 2015, 06:32:47 AM

Wow, that's a really simple compression algorithm (LZ4). I love looking and taking apart different compression schemes (they all have their own advantage).

They're fun aren't they! I like it that LZ4 actually implements a run-length for the literal data, I'd always meant to try that, but never got around to it.

Quote

Planar graphics never compress that well compared to packed pixel. I wonder how it does with 4-bit packed pixel nibbles.

Very, very true. I expect that it'll do extremely well with packed data ... but OMG, the terrible overhead!!!!

Quote

Gate of Thunder uses LZSS and has the sprites (and IIRC, tiles) all in packed pixel format. ...

That's cool, I certainly didn't know that. The background process is a very nice solution to the problem of unpacking the pixels if you don't need an as-fast-as-possible data-rate.

Quote

Some other later gen PCE CD games that use LZss prime half or more of the 'window' with a special set of values every time, before the decompression process starts.

That's a cool trick ... especially if you have the preload data in ROM or VRAM somewhere. It's always fun to hear what ideas people came up with to wring the best performance out of a machine.

Quote

elmer: I'm looking forward to your 'SWD' compression tools when you release them.

It's really just another LZ77/LZSS variant. I always mix up LZ77 and LZSS since they're basically the same thing in my mind ... LZSS is such a trivial (but useful) improvement to the LZ77 concept.

As the wikipedia page on LZSS says ...

Quote

Many popular archivers like PKZip, ARJ, RAR, ZOO, LHarc use LZSS rather than LZ77 as the primary compression algorithm; the encoding of literal characters and of length-distance pairs varies, with the most common option being Huffman coding.

My first Amiga games used Huffman-encoded LZSS as the article suggests, but it was a bit slow and also a pain because you had to include the Huffman table along with the compressed data.

When I had to do a Gameboy game, I ran all the data through the LZSS/Huffman encoding and took a look at the bit-lengths of each length/offset encoding used. After a bit of eyeballing and tweaking I came up with a static encoding of the lengths/offsets that gave approx 80% as good results, but was trivial to decode in Z80/6502 assembler.

SWD data is encoded as LZSS length/offset pairs.

Lengths are encoded ...
1 : 0 dddddddd
2 : 10
3-5 : 11 xx
6-20 : 11 00 xxxx
21-275 : 11 00 0000 xxxxxxxx

Offsets are encoded ...
$0001-$0020 : 00 x xxxx
$0021-$00A0 : 01 xxx xxxx
$00A1-$02A0 : 10 x xxxx xxxx
$02A1-$06A0 : 11 xx xxxx xxxx

In order to avoid too many bit-shifts, bytes of the encoded bit-stream are interleaved with data bytes, so that literal values and the low 8-bits of long lengths and offsets can be read directly from the compressed stream without any shifting.

With this encoding, any 2-byte or longer match is a win ... whereas with LZ4 the minimum match is 4-bytes.
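Reading the offset table above literally, a decoder for the offset field could look like this Python sketch. It treats the input as a plain MSB-first bit string, which is an assumption: the real stream interleaves whole data bytes with the bit-stream, and the names `OFFSET_CLASSES`, `decode_offset` and `offset_cost_bits` are made up here.

```python
# 2-bit prefix -> (number of payload bits, smallest offset in that class)
OFFSET_CLASSES = {
    "00": (5, 0x0001),
    "01": (7, 0x0021),
    "10": (9, 0x00A1),
    "11": (10, 0x02A1),
}


def decode_offset(bits: str, pos: int = 0):
    """Return (offset, next_position) for the offset field starting at `pos`."""
    width, base = OFFSET_CLASSES[bits[pos:pos + 2]]
    value = int(bits[pos + 2:pos + 2 + width], 2)
    return base + value, pos + 2 + width


def offset_cost_bits(offset: int) -> int:
    """Encoded size of an offset in bits, prefix included."""
    for width, base in OFFSET_CLASSES.values():
        if base <= offset < base + (1 << width):
            return 2 + width
    raise ValueError("offset outside the $0001-$06A0 range")
```

Nearby offsets cost 7 bits and the farthest ones 12, compared with the fixed 16-bit offset in LZ4, which is part of why short matches can still pay off in this scheme.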

Bonknuts · « **Reply #27 on:** February 07, 2015, 03:10:44 PM »

I think you could keep the compression scheme byte based for LZ4. Yeah, you need to work out some case logic, but something like this could handle the bulk of it:

Quote

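Maybe for something like setting up the 'read' address, you could shift out the byte offset into a word base offset, and then take the 'carry' and shift it into the index register. This would automatically set up your even/odd offset for the read pointer.

lsr .sm1+1      ; byte offset -> word offset: shift the high byte first ...
ror .sm0+1      ; ... then the low byte; carry = even/odd byte select
cla
rol a           ; A = carry (0 = even byte, 1 = odd byte)
tax             ; X = starting index into the data port ($0002/$0003)

st0 #$01        ; select VDC register $01 (VRAM read address)
.sm0
st1 #$00        ; low byte of the read address (self-modified operand)
.sm1
st2 #$00        ; high byte of the read address (self-modified operand)
lda #$02
sta <vdc_reg    ; shadow copy of the selected register (presumably for interrupt code)
st0 #$02        ; select VDC register $02 (VRAM data read/write)

.loop
lda $0002,x     ; read the low ($0002) or high ($0003) data byte
sta $0002,y     ; write the low or high data byte

txa
eor #$01        ; toggle the read byte select
tax

tya
eor #$01        ; toggle the write byte select
tay

dec <counter
bne .loop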

I didn't show setting up Y, but it should be continuous - since you're writing forward with this compression scheme. Same for the VRAM write pointer. That's set at the start of the block of data to decompress. It's the read pointer that needs to be modified, hence the above code.

Starting off with reading from $0002 or $0003 handles the even/odd byte offset reading issue (by indexing on the base read address $0002). I mean, you're never skipping bytes - just starting with an even or odd byte offset.

Though a jump table with multiple renditions of the same code, but handling/priming the starting offset read/write, would definitely be faster (you wouldn't have to deal with indexing, and modifying those index regs) - at the expense of some code space.
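For anyone who wants to play with the even/odd logic off-target, here is a small Python model of the loop above. It is a simplified sketch under assumptions: the data port is modelled as "low byte is latched, high byte commits the word and bumps the pointer", the read pointer advances after the high byte is read, no read prefetch is modelled, and a match offset of 1 byte is not handled. `VdcPort` and `copy_match_bytes` are made-up names.

```python
class VdcPort:
    """Simplified VDC data port: index 0 = low byte, index 1 = high byte."""

    def __init__(self, words: int):
        self.vram = [0] * words
        self.marr = 0    # read address, in words
        self.mawr = 0    # write address, in words
        self.latch = 0   # pending low byte of the next word to be written

    def read(self, x: int) -> int:
        word = self.vram[self.marr]
        if x == 0:
            return word & 0xFF
        self.marr += 1   # pointer advances after the high byte
        return word >> 8

    def write(self, y: int, value: int) -> None:
        if y == 0:
            self.latch = value
        else:
            self.vram[self.mawr] = self.latch | (value << 8)
            self.mawr += 1


def copy_match_bytes(vdc: VdcPort, dst_byte: int, offset: int, count: int) -> None:
    """Byte-granular match copy through the word-wide port (offset >= 2)."""
    src_byte = dst_byte - offset
    vdc.marr = src_byte >> 1   # word address, like the lsr/ror pair
    x = src_byte & 1           # even/odd select for reads
    y = dst_byte & 1           # even/odd select for writes (continuous)
    for _ in range(count):
        vdc.write(y, vdc.read(x))
        x ^= 1
        y ^= 1
```

The write side never needs its pointer touched; only the parity of the current output position has to be carried along between copies.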

elmer · « **Reply #28 on:** February 07, 2015, 04:13:27 PM »

Quote from: Bonknuts on February 07, 2015, 03:10:44 PM

I think you could keep the compression scheme byte based for LZ4. Yeah, you need to work out some case logic, but something like this could handle the bulk of it:

NICE!!!! Those auto-incrementing VDC registers really take a lot of the niggling-cr*p out of the inner loop. The PCE is such a beautifully designed piece of hardware.

But, really ... you know that you can get those eor's out of the inner loop if you really want to!

Bonknuts · « **Reply #29 on:** February 08, 2015, 07:30:44 AM »

Yeah, you can optimize out those eor's. Just to show something as an approach.

Yeah, the PCE architecture is pretty simple and clean. Being able to read and write to vram during active display is pretty nice IMO. It might not have a fast local to vram DMA like the SNES and Genesis, but in a good amount of cases open vram access can balance that out (games like Sapphire with large area animation updates show this off).

On a related note.. (source code layout optimization?)
I used to think the lack of a bigger linear PC address range (local to the cpu) was a design hindrance, but then I realized that all my optimizations were local anyway, and macros for 'far jsr' makes the code structure help lend itself to a more linear like layout (kinda. In the source it looks that way). I typically have a layout of 8k I/O, 16k of ram, 16k of code, 16k of data, and 8k of fixed library.

I have multiple vector banks, with the top 4k with repeated code/data, and the lower 4k with different stuffs - with the lower 4k being usually tables for speeding up code/etc - relative to the subroutine called. The upper 4k always has the code (along with the macro) to do the far calls and far returns, while always having the fixed lib funcs and video/timer interrupt routines, etc. So you get a 16k code+4k fast table mapping, and still have 16k for other 'data'. Or call an 8k code+24k data, etc. Or 8k code + another 8k code, etc. It works out pretty well. I'm usually not concerned with wasting a little bit of fat on code, since code generally takes up a small percentage compared to data.

Do you guys ever map anything in the typical I/O bank area? After working on nes2pce stuff, I've found myself mapping other banks to this area (MPR0). Interrupt routines that need access to the I/O bank can map that bank in for that interval. I mean, if a specific subroutine isn't writing/reading vram or writing to the sound hardware, why not map something else there? Matter of fact, having done nes2pce stuff - I don't find it odd to map the I/O bank to something like the 4000-5fff or 6000-7fff range either ($6002,$6003, $7403, $6404, etc). It gives you another 8k of address range to work with otherwise (ram, data, code, etc).
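The address arithmetic behind that remapping is simple enough to sketch in Python. It assumes the usual 8 KB pages selected by the top 3 bits of the logical address and bank $FF as the hardware I/O bank; the function names and the example bank numbers are hypothetical.

```python
PAGE_BITS = 13     # 8 KB pages
IO_BANK = 0xFF     # assumed bank number of the hardware I/O page


def physical_address(mpr, logical: int) -> int:
    """Translate a 16-bit logical address through the 8 mapping registers."""
    page = (logical >> PAGE_BITS) & 7
    return (mpr[page] << PAGE_BITS) | (logical & 0x1FFF)


def vdc_data_port_addresses(mpr):
    """Logical addresses where the VDC low data port ($0002 in the bank) shows up."""
    return [(page << PAGE_BITS) | 0x0002
            for page in range(8) if mpr[page] == IO_BANK]
```

With the I/O bank mapped into the fourth page, the data port appears at $6002, which matches the addresses mentioned above.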
