PCEngineFans.com - The PC Engine and TurboGrafx-16 Community Forum 
		Tech and Homebrew => Turbo/PCE Game/Tool Development => Topic started by: Bonknuts on June 03, 2013, 02:05:49 PM
		
			
			- 
				These two processor architectures are polar opposites, not to mention the 'bitness' aspect as they are usually defined (32-bit, 16-bit, and 8-bit). This is a discussion for a more detailed look at both CPUs, in relation to the MD and PCE, beyond general assumptions and popular/common beliefs. This is not a system vs. system comparison. If a CPU offers an advantage to the supported/specific hardware, then that's valid. Talk and examples of similar CPUs that share either of the two architectures are also valid. And while game examples are valid, I'd rather they not be the primary source of comparison; maybe more to reinforce a point. The primary discussion should focus on code-related examples, if possible.
 
 So, 68000 vs huc6280.
 
 
 Please, no trolling. And if the trolls do come a knocking, please don't feed them. Best to ignore them.
- 
				A strange thing: 68K coders often do not seem to take register initialization into account in their comparisons..
 I know they are used to working with registers, but values do not appear by magic..
- 
				Maybe a bit OT, but were there also other CPUs that were often used during the 16-bit era? 
 Like in arcade hardware etc.
- 
				I don't know if the ARM2 can be considered as belonging to the 16-bit era!!
			
- 
				A strange thing: 68K coders often do not seem to take register initialization into account in their comparisons..
 I know they are used to working with registers, but values do not appear by magic..
 
 
 Isn't that address registers rather than data registers? Caveat; the only 68k I coded assembler for was the 68020.  Many years back.
- 
				Isn't that address registers rather than data registers? Caveat; the only 68k I coded assembler for was the 68020.  Many years back.
 
 Same, because addresses must be loaded into a register before use.
 I think for this era, the big advantage of the 6280 over the 68k is really its 8-bit architecture and fast RAM read/write.
 With this you can easily speed up 16-bit operations by changing only the LSB or the MSB.
 
 For example: adding 256 to a 16-bit variable.
 On the 6280 you can do:
 inc var + 1  ; 5 cycles max, 4 cycles min (if var is in zero page); more efficient than on the 68k
 In a game this case is very common.
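
 A minimal Python sketch of the same idea, modeling the 16-bit variable as two little-endian bytes in RAM. The names `ram`, `VAR`, `inc_byte`, `read_word` and `write_word` are made up for illustration; this only models the arithmetic, not the cycle counts.

```python
# Model of a 16-bit little-endian variable stored in byte-addressable RAM.
ram = bytearray(0x2000)
VAR = 0x0010  # hypothetical address of the variable's low byte (LSB)

def inc_byte(addr):
    """Model of INC on a single byte: wraps at 8 bits, no carry out."""
    ram[addr] = (ram[addr] + 1) & 0xFF

def read_word(addr):
    """Read a 16-bit little-endian word."""
    return ram[addr] | (ram[addr + 1] << 8)

def write_word(addr, value):
    """Write a 16-bit little-endian word."""
    ram[addr] = value & 0xFF
    ram[addr + 1] = (value >> 8) & 0xFF

write_word(VAR, 0x12FF)
inc_byte(VAR + 1)          # same effect as "inc var + 1": adds 256
result = read_word(VAR)    # the LSB ($FF) is untouched, only the MSB changed
```

 Because only the MSB is touched, there is no carry to propagate from the LSB, which is why a single 8-bit instruction is enough.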
 
- 
				A strange thing: why do so many 6280 instructions take more cycles than on the 65xx??
 I don't understand why at least zero page accesses are not the same  :-k..
- 
				Possibly because 6280 Stack and ZP go through an extra layer of indirection by the memory mapper...?
			
- 
				Aaah, maybe..
 Passing through the MMU may cause cycle penalties, seems logical to me..
 
 Chris, I know being a father is a difficult task, but would you be interested in doing a Bad Apple demo on the PCE??  :mrgreen:
- 
				Why not make something like HuVideo, fullscreen instead?
 
 By the way, how many tiles can you upload from ROM/Syscard RAM to the video processor's internal memory per VBlank? Is it theoretically possible to create a "system card" with special hardware to generate a fullscreen video frame in its RAM to be uploaded?
 
 Say that there's a 3D processor or something that would calculate matrices, vectors, etc. and generate a fullscreen render, to be uploaded to VRAM every frame... is it possible to do it fullscreen or only a part of the screen would be possible due to bulk transfer speeds (tix instructions)?
- 
				Why not make something like HuVideo, fullscreen instead?
 
 By the way, how many tiles can you upload from ROM/Syscard RAM to the video processor's internal memory per VBlank? Is it theoretically possible to create a "system card" with special hardware to generate a fullscreen video frame in its RAM to be uploaded?
 
 Say that there's a 3D processor or something that would calculate matrices, vectors, etc. and generate a fullscreen render, to be uploaded to VRAM every frame... is it possible to do it fullscreen or only a part of the screen would be possible due to bulk transfer speeds (tix instructions)?
 
 
 
 If you embed the graphics as ST1/ST2 instructions, that's the fastest method to transfer to VRAM. Normally TIA is 6 cycles per byte, but since the VDC sits in the hardware bank's first 2k of address space, it gets the +1 cycle penalty per memory access. (It's a mystery, because the VDC never asserts RDY for that access. It does assert /RDY, but that's for memory slot alignment during active display, and it only lasts a fraction of a CPU cycle; /RDY works in master clock steps, not CPU cycles.) So TIA to the VDC is 7 cycles a byte. ST1/ST2 is 5 cycles a byte (4 + 1 penalty = 5). There are ~119436 CPU cycles in a 1/60 s frame, so the "theoretical" max transfer is 23.887k per frame. Not enough to do 60fps FULL screen, but definitely enough to do 30fps full screen. Of course, you can compress, or save 5 cycles out of 10 in a word transfer if the LSB is the same as the LSB of the previous WORD (VDC latch trick: just use ST2 and write only the MSB; the old LSB value is still held in the latch). So technically it's higher than 24k per frame. It all depends on redundant LSBs.
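
 The arithmetic above can be checked with a quick Python sketch. The cycle figures are the ones quoted in the post; the 256x224, 4 bits per pixel screen size is my own assumption for what "full screen" means, and like the "theoretical" number it pretends the whole frame is spent on the transfer.

```python
CYCLES_PER_FRAME = 119436    # CPU cycles per 1/60 s frame, as quoted above
ST_CYCLES_PER_BYTE = 5       # ST1/ST2: 4 + 1 penalty
TIA_CYCLES_PER_BYTE = 7      # TIA to the VDC: 6 + 1 penalty

def max_bytes_per_frame(cycles_per_byte, cycles=CYCLES_PER_FRAME):
    """Upper bound on bytes moved if every cycle of the frame goes to the transfer."""
    return cycles // cycles_per_byte

# Assumption: a 256x224 display at 4 bits per pixel.
SCREEN_BYTES = 256 * 224 * 4 // 8

st_budget = max_bytes_per_frame(ST_CYCLES_PER_BYTE)
tia_budget = max_bytes_per_frame(TIA_CYCLES_PER_BYTE)

fits_60fps = st_budget >= SCREEN_BYTES        # one frame per full update
fits_30fps = 2 * st_budget >= SCREEN_BYTES    # two frames per full update

print(st_budget, tia_budget, SCREEN_BYTES, fits_60fps, fits_30fps)
```

 Under those assumptions a full update needs 28672 bytes, which is more than one frame's ST1/ST2 budget but well within two.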
 
 You could add hardware support that would basically insert those opcodes as every other byte (switching between the two). Though I've done it without hardware. I made a transparency demo that used them: http://www.pcedev.net/demos/transparency/test0.zip .
 
 The transparency is realtime (Amiga planar style). That is to say, it uses the VDC's 4-bit planar mode to do hardware-assisted transparency effects. The demo was never polished and finished, but you can still see the results. I had this crazy idea: instead of using fixed entry points into code lists, each ending in an RTS instruction, for the video embedded in ST1/ST2 opcodes, one would use the hsync interrupt or the TIMER interrupt to set a time limit on the transfer. You push the return address on a special stack or somewhere in RAM, jump to the address, and let the CPU run hog wild. When the interrupt counter completes, you manually replace the return address on the real stack with the special/saved address and return to that instead. Of course, you'll need a buffer in VRAM for overspill, since neither the hsync interrupt nor the timer interrupt is fine enough cycle-wise to stop accurately on an exact instruction. Crazy? Yup, but it's totally doable.
 
 Anyway, the reason why HuVideo is not full screen is the CDROM hardware. I mean, sure, the transfer rate from the CDROM itself is limiting, but the interface to the CDROM is very limiting. The CPU must constantly poll ports and such. Bytes have to be manually copied (and not at full speed: you can't read out a byte from a selected sector faster than ~24 cycles). There's not a lot of free time, and what little there is gets spent either sending audio packets to the ADPCM memory port (which is pretty damn slow, but at least you can write to ADPCM memory while it's playing) or writing frame updates to the VDC. I guess if you want low-res full screen, you could use a hsync interrupt to repeat every other scanline (an early version of HuVideo used in John Madden does this for the opening stadium intro).
- 
				Aaah, maybe..
 Passing through the MMU may cause cycle penalties, seems logical to me..
 
 Chris, I know being a father is a difficult task, but would you be interested in doing a Bad Apple demo on the PCE??  :mrgreen:
 
 
 The 6502 and 65c02 have penalty cycles for crossing a 256-byte page boundary. If an instruction operand lands on the start of a new page, there's a penalty cycle. Indexing might have one as well. On the 6280 there are no penalty cycles; instead, those cycles are always included in the opcode timing. Kinda sucks, because you can make some real nice optimizations on the 65x, but it also simplifies cycle counting on the 6280. This used to bother me a little bit, but then I realized that the clock speed of the 6280 is greater than any 65x. It's not like you're going to be comparing to a 7.16 MHz 65x anyway. But yeah...
 
 Possibly because 6280 Stack and ZP go through an extra layer of indirection by the memory mapper...? LDA abs,x and similar instructions also take 1 cycle more than on the 65x, and that ABS location can be anywhere in the CPU's 16-bit logical address range. The '816 has cycle penalties if you use the DP register to point to a non-default bank, but the 6280 doesn't seem to care. I've mapped all of system card RAM to $2000, tested, and found no difference. Though in my NES2PCE stuff, I did find something interesting. I mirror RAM to the $0000 range as well. If you use the base 64k of CD RAM for this, it will corrupt itself. If you use the $f7 bank or any of the 192k of SCD RAM, then it's completely fine. Even my Duo and SuperCDROM^2, with the built-in system card, had this issue. Weird stuff.
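
 To make the LDA abs,x comparison concrete, here is a small Python model. The cycle counts (4, plus 1 on a page crossing, for the 6502; a flat 5 for the 6280) are taken from commonly published opcode tables and should be treated as assumptions to check against hardware docs; the function names are made up.

```python
def lda_abs_x_cycles_6502(base, x):
    """6502-style timing: 4 cycles, +1 when base + X leaves the base's 256-byte page."""
    crossed = (base & 0xFF) + (x & 0xFF) > 0xFF
    return 4 + (1 if crossed else 0)

def lda_abs_x_cycles_6280(base, x):
    """6280-style timing: the extra cycle is always paid, so the cost is constant."""
    return 5

# Same page: the 6502 is one cycle cheaper. Page crossed: both cost the same.
same_page = (lda_abs_x_cycles_6502(0x2000, 0x10), lda_abs_x_cycles_6280(0x2000, 0x10))
cross_page = (lda_abs_x_cycles_6502(0x20F0, 0x10), lda_abs_x_cycles_6280(0x20F0, 0x10))
```

 The trade-off is per-cycle only; the clock speeds differ, so this says nothing about wall-clock time.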
- 
				Tom, have I already said that I like your explanations so much?? ;-) ..
 
 And this transparency effect is awesome..
 Some opcodes like smb/rmb take 2 more cycles on the 6280.
 
 To use ST1/ST2 for transferring data, you need self-modifying code, no???
 And you lose the benefit of the ST0/ST1 transfer (maybe I have not understood the trick)!!
 
 I have read a topic where you and Charles discussed packed sprites, if I'm right!!
 And one of you talked about this technique.
- 
				
 Some opcodes like smb/rmb take 2 more cycles on the 6280.
 
 
 Probably because of the address decoding and included page penalty cycle.
 To use ST1/ST2 for transferring data, you need self-modifying code, no???
 And you lose the benefit of the ST0/ST1 transfer (maybe I have not understood the trick)!!
 I didn't use self-modifying code, but you could (if you have enough RAM; 8k isn't enough). My demo is hardcoded/embedded. I wrote a util that took data files and created embedded ASM files for use with PCEAS. But I've done a lot more than just transparency with that method.
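
 For readers wondering what such a util might look like, here is a hypothetical Python sketch (not the actual tool). It turns little-endian 16-bit words into a list of `st1`/`st2` immediates ending in `rts`, and optionally applies the latch trick described earlier in the thread by skipping `st1` when the LSB repeats. The function name and the exact assembler formatting are assumptions, and the generated code assumes the VDC is already set up for VRAM writes before it is called.

```python
def words_to_st_asm(data, use_latch_trick=True):
    """Convert little-endian 16-bit words into ST1/ST2 assembly lines.

    ST1 writes the low byte, ST2 writes the high byte. With the latch trick,
    ST1 is skipped whenever the low byte equals the previous word's low byte.
    """
    if len(data) % 2 != 0:
        raise ValueError("data must contain whole 16-bit words")
    lines = []
    prev_lsb = None  # nothing latched yet, so the first word always emits ST1
    for i in range(0, len(data), 2):
        lsb, msb = data[i], data[i + 1]
        if not use_latch_trick or lsb != prev_lsb:
            lines.append(f"    st1 #${lsb:02X}")
        lines.append(f"    st2 #${msb:02X}")
        prev_lsb = lsb
    lines.append("    rts")
    return lines

asm = words_to_st_asm(bytes([0x34, 0x12, 0x34, 0x56, 0x00, 0x56]))
print("\n".join(asm))
```

 In this example the second word reuses the $34 low byte, so only its `st2` is emitted. Since the data lives inside the instruction stream in ROM, no self-modifying code is needed.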
 
 
 
 I have read a topic where you and Charles discussed packed sprites, if I'm right!!
 And one of you talked about this technique.
 
 
 Ehh? I don't remember. Charles and I talked about many awesome things, but I tend to forget things if I don't save them :/
- 
				Pardon if this has been asked before, but how hard would it be to port a 68k game to HUC6280 and vice versa?
			
- 
				 I didn't use self modifying code, but you could (if you have enough ram; 8k isn't enough). My demo is hardcoded/embedded. I wrote a util that took data files and created embedded ASM files for use with PCEAS. But I've done a lot more than just transparency with that method. 
 
 Ok, i'll try to understand how it works, thanks ..
 
  Ehh? I don't remember. Charles and I talked about many awesome things, but I tend to forget things if I don't save them :/ 
 
 It was in this thread:
 http://forums.magicengine.com/en/viewtopic.php?t=1615
 
 ;-) ..
- 
				Link didn't work. It just brought me to the main forum page.
			
- 
				Oops, updated ;-)
			
- 
				Pardon if this has been asked before, but how hard would it be to port a 68k game to HUC6280 and vice versa?
 
 
 Just the code itself? Not really hard at all, as long as you know both processors fairly well, or at least know the target processor of the port pretty well (I'm thinking 68k->65x/6280). 68k is fairly straightforward. 65x/6280 can get convoluted when the code gets really optimized for speed. Without good notes/comments, I can see porting convoluted 65x/6280 code to 68k being a pain in the ass. Regular 65x/6280 code, though, shouldn't be a problem. I have early source code for Art of Fighting on the PCE ACD. It shows some of the 68k code they were directly porting in the comments. Me personally, I like to write comments as C code, for a quick explanation of what the code is doing. You have a project in mind?
 
 
 
 
- 
				touko: Continuing the discussion...
 
 I was looking into doing fast multiplication on the PCE. Stef mentioned the 68k can get 70 cycles for 16bit * 16bit -> 32bit.
 
 I looked up some routines and came across the old C-64 code for fast mul. I've seen this before, a number of years back, but I never had a need for it. Almost all the multiplications in my code had one operand as a constant and were optimized as such. But for something else that I started, I needed variable values for both A and B.
 
 The fast mul routine is based on a*b = f(a+b)-f(a-b), where f(x)=x^2/4. If you break the multiplication down into 8-bit steps, a+b is a 9-bit result, so f(a+b) needs a 512-entry, WORD-wide LUT. This reduces the multiply to simple additions and subtractions (albeit 16-bit add/sub operations). I have a few ideas for speeding this up further, but I have to write the code out and compare cycles.
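
 A minimal Python sketch of the quarter-square idea, to show that the table lookup really reproduces the product. The names `SQ4`, `mul8` and `mul16` are made up; taking the absolute value of a-b is my own simplification (real 65x routines handle the sign in other ways), and the 16-bit version simply combines four 8x8 partial products, which may not match how an optimized routine would be laid out.

```python
# 512-entry table of f(x) = floor(x^2 / 4); a + b for 8-bit inputs is at most 510.
SQ4 = [(x * x) // 4 for x in range(512)]

def mul8(a, b):
    """8-bit x 8-bit -> 16-bit product using two table lookups and one subtract.

    a + b and a - b always have the same parity, so the fractions dropped by
    the floor cancel out and the result is exact.
    """
    return SQ4[a + b] - SQ4[abs(a - b)]

def mul16(a, b):
    """16-bit x 16-bit -> 32-bit product built from four 8x8 partial products."""
    a_lo, a_hi = a & 0xFF, a >> 8
    b_lo, b_hi = b & 0xFF, b >> 8
    return (mul8(a_lo, b_lo)
            + ((mul8(a_lo, b_hi) + mul8(a_hi, b_lo)) << 8)
            + (mul8(a_hi, b_hi) << 16))

print(hex(mul16(0xFFFF, 0xFFFF)))
```

 Each `mul8` costs two lookups and a subtraction, so the 16-bit case comes to eight lookups plus the adds needed to combine the partial products.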
 
- 
				Ah yes, I read that on Sega-16 ;-) ..
 But comparing 68k cycles with 6280 cycles directly is not fair..
 With 70 cycles for the mul opcode on the 68k, Stef doesn't count the register init; he counted only the mul instruction...
 I think the whole process (load each value + mul) is closer to 80/90 cycles..