I found a few HuCard games doing it, Street Fighter 2 in particular (it does it for two sample channels). It's faster than the normal bit-packed samples that other PCE games tend to use. I've never seen the scheme Touko used.
 End of sample is convenient if you have the free bit for it, but it's not necessary. But that is awesome that you fitted it in there. I still like the padded end of sample block method (whatever that block is - 116 samples is the block in this case), where you check in vsync int.
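For anyone following along, the padded-block idea can be sketched offline in Python. This is only an illustration, not code from either player: the function names and the pad value of 0 are assumptions, and the only number taken from the thread is the 116-sample block. The point is that the per-sample interrupt never tests for end of sample; the once-per-frame check just counts blocks.

```python
BLOCK = 116  # samples per block, the figure quoted in the thread


def pad_to_block(samples, block=BLOCK, pad_value=0):
    """Pad a sample list so its length is a whole number of blocks.

    pad_value is a hypothetical "silence" level; the right value depends
    on how the player treats the DAC.
    """
    remainder = len(samples) % block
    if remainder:
        samples = samples + [pad_value] * (block - remainder)
    return samples


def play(samples, block=BLOCK):
    """Simulate playback: the end check runs once per block (per frame),
    never once per sample."""
    padded = pad_to_block(samples, block)
    blocks_left = len(padded) // block
    position = 0
    output = []
    while blocks_left:          # the per-frame (vsync) check
        for _ in range(block):  # the per-sample work, with no end test
            output.append(padded[position])
            position += 1
        blocks_left -= 1
    return output
```

With a 300-sample input the padded length comes out to 348, i.e. three full blocks, so the cost is at most 115 wasted bytes per sample.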
 I honestly thought the SF2 method would be faster. It looks simpler IMO (fewer steps/shifts).
Does SF2 do the unpacking during the TIRQ, or does it buffer up an entire 116-byte frame?  

One of the things that I liked about the 3-in-2-bytes scheme was that it avoided the need to check for bank-overflow in the middle of a packet. With 8-samples-in-5-bytes, I don't think that you can do that.
That's not a problem if you're mapping in 2 banks to build the 116-byte buffer during vblank, but it's expensive to do in a TIRQ.
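The bank-overflow point is just arithmetic on the 8KB bank size, and a quick Python check makes it concrete. This is a sketch that assumes the sample data starts on a bank boundary; the function name is made up for illustration.

```python
BANK_SIZE = 8192  # bytes per HuC6280 bank


def count_straddling_packets(packet_size, total_bytes, bank_size=BANK_SIZE):
    """Count packets whose first and last byte fall in different banks,
    assuming the data starts at offset 0 of a bank."""
    count = 0
    for start in range(0, total_bytes - packet_size + 1, packet_size):
        end = start + packet_size - 1
        if start // bank_size != end // bank_size:
            count += 1
    return count


five_banks = 5 * BANK_SIZE
print(count_straddling_packets(2, five_banks))  # 2-byte packets
print(count_straddling_packets(5, five_banks))  # 5-byte packets
```

Because 8192 is a multiple of 2 but not of 5, the 2-byte packets never cross a bank edge, while the 5-byte packets cross at four of every five bank boundaries. Those crossings are where a mid-packet bank check would be needed.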
Plus ... I kinda like having that bit in there to mark the end-of-sample, or to signal a loop-point.  
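As a rough illustration of where that spare bit comes from: three 5-bit samples use 15 of the 16 bits in a 2-byte packet. The layout below (flag in bit 15, samples in the low 15 bits, low byte stored first) is an assumption made for the sketch, not necessarily the layout Touko's data uses.

```python
def pack3(s0, s1, s2, flag=False):
    """Pack three 5-bit samples plus one flag bit into 2 bytes.

    Hypothetical layout: bits 0-4 = s0, 5-9 = s1, 10-14 = s2, bit 15 = flag.
    """
    for s in (s0, s1, s2):
        if not 0 <= s <= 31:
            raise ValueError("samples must be 5-bit (0..31)")
    word = s0 | (s1 << 5) | (s2 << 10) | (int(flag) << 15)
    return bytes((word & 0xFF, word >> 8))  # low byte first


def unpack3(packet):
    """Inverse of pack3: returns (s0, s1, s2, flag)."""
    word = packet[0] | (packet[1] << 8)
    return (word & 31, (word >> 5) & 31, (word >> 10) & 31, bool(word >> 15))
```

The flag could mean end-of-sample or loop-point, whichever the player wants. Either way, three samples take 2 bytes instead of 3.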

Your double-buffered scheme is something that I'll need to look at seriously and compare the timings.
Mapping stuff into TAM #4 isn't needed, because Touko is already relying on his sample data being aligned on an even boundary, so I removed it, and got rid of the X and Y registers which were wasting cycles.
I didn't catch that he was doing that. I shaved the 6-sample buffer method down to 8.9%, but the bigger buffer doesn't save enough to overcome its overhead (keeping a pointer into the playback buffer) compared with the non-pointer method you did with a 3-sample cache system.
Ahhh ... I missed that you'd already tried the 6-sample-at-once method.
Yep, I was afraid that the overhead from keeping track of which sample you're playing would outweigh the savings in the bank switching.
A block comment in assembly! I've never seen anyone do that - haha. Your code is clean, clear, and easy to understand. And fast. I'm impressed.
Thanks, that means something, coming from you!  

And Touko was right ... the overhead of the banking for every sample in the simple playback code that I posted earlier really *was* significant.
I'm very surprised to see it, but the packed playback code is actually a tiny bit faster than the simple playback code.
Now I'm not sure if that will extend to multi-channel playback, but the idea of getting a 30% space-savings on sample size for no CPU playback cost is pretty amazing!
In other words, you expressed interest in the soft mix player I made - the one with 4 channels on two PCE channels. That wouldn't use any compression, because the sample depth is higher than 5-bit. Would that mean you're not interested in it?
Well, *I'm* certainly interested in seeing the code.  

Having a high-quality sample alternative for use on Title Screens and places like that would be nice.
It all depends upon the CPU and memory costs.
The only thing that I don't really like is losing stereo panning.
Having a drum-roll pan across the stereo field, or having an effect come from the left or right, are both common occurrences in games.