(Mis) Using Perl 6 Grammars : Decompressing Zelda 3 GFX
(Mis) Using Perl 6 Grammars : Decompressing Zelda 3 GFX
Grammars combined with actions allows to parse strings and produce something from it. It's not far fetched to say that any compressed data follow a structure that is likely 'parsed' by the corresponding decompression algorithm.
So why not using Perl 6 grammar for this kind of work? Especially a compression I am familiar with.
A look into Nintendo compression
Nintendo used the same base compression format for their SNES games with some variant depending on the game. It's pretty straight forward and it's easy on the ~2Mhz of the SNES CPU.
It goes like that:
<a byte that contains an header>
the header is split into 2 parts :
the 3 left most bits form a command number
The 5 other bits code the lenght associated with the command
<An arbitrary number of bytes associated with the command>
<repeat until the header is \xFF>
The commands depend on the game, but the commands for Zelda 3 are pretty simple
- 0 Copy : Basicly a no compression command: copy the following bytes
- 1 Byte repeat : Repeat the following byte lenght time
- 2 Word repeat : Repeat the following word (2 bytes for the SNES) lenght / 2 time
- 3 Byte Increment: Repeat the same Byte length time but increment it too
- 4 Repeat existing : Copy the already uncompressed data from the given offset (2 byte)
Other games added variant like the last one but do a OR/XOR/AND on the data.
Grammar
When I presented you the compression it easily translates to this grammar
TOP : <chunk>+? <end>
end : \xFF
chunk : <header> <data>
header : <command> <lenght>
command : 0 | 1 | 2 | 3 | 4
lenght : \d
data : ???
There are some issues with this:
- The size of the data depend on the command and the length
- How to work at bits level for the header
- For more fun, if the command is 7 the header is 2 bytes (for a bigger length)
- Perl 6 Grammar works on strings not binary data
Working with binary data
Since grammars does not really support pure binary data you have to pass your data that you store originally in a buf as latin1
encoded string. Like MyGrammar.parse($buf.decode("latin1"))
then you can access the value of a 'byte' by doing .ord
on the character of the string.
Validate the header format
Grammars are created with regex, so we need to look if you can have conditionnal match on values aside just pattern. It's pretty easy, you add <?{ some condition }>
after your match and if the condition checks the match occurs.
token normal-header {
<mheader> . <?{ $/.ord < 0b11100000}>
}
The extended header is a bit more lengthy
token extended-header {
$<mheader> = . ** 2 <?{ ($/.comb[0].ord +& 0b11100000) == 0b11100000 }>
}
You can note that I don't validate the possible commands in these 2 tokens, it will be done later.
Adding value to the ast
It's possible to add code in your token with {}
, like you can add { say "I am in token foo" }
to have some sort of debug (but use a proper module for that). You can also add value to the ast with this. This partially solve our issue with parsing data since we needed to capture the command and length to be able to properly validate the data
token.
token normal-header {
$<mheader> = . <?{ $/.ord < 0b11100000 }>
{ make {
cmd => Command($<mheader>.ord +> 5),
lenght => $<mheader>.ord +& 0b00011111 + 1;
} }
}
The make
add 2 entries in a hash to the normal-header
token. I defined an enum for the commands to make them more readable. You can note that I add 1 to the lenght since 0 is 1.
You then need to propagate these added entries to the header
token too, since we will use then in the chunk
token.
token header {
[ <normal-header> | <extended-header> ]
{ make ($<normal-header> // $<extended-header> ).ast
}
}
Grammar allows us to pass arguments to a token with this syntax <token(arguments)>
Our chunk token look like that.
token chunk {
<header>
{}
<data(|$<header>.ast<cmd lenght>)>
}
The empty {}
is needed, otherwise, the ast does not get refreshed apparently.
Validate commands
Since we can overload token like methods in a class with the keyword multi
, we just have to make a data
token per command that works in a similar way.
multi token data(CMD_Copy, $lenght) {
. ** {$lenght}
}
multi token data($cmd where $cmd == CMD_ByteRepeat | CMD_ByteInc, $lenght) {
.
}
multi token data($cmd where $cmd == CMD_WordRepeat | CMD_CopyExisting, $lenght) {
. ** 2
}
Testing
I already have some tests written for my C implementation of this (you can find them at https://github.com/Skarsnik/sneshacking/tree/master/src). I tried some of them, but here an example of a run with the current grammar.
my $buf = buf8.new(BUILD_SHEADER(CMD_ByteRepeat, 3), 42, BUILD_SHEADER(CMD_Copy, 4), 1, 2, 3, 4, BUILD_SHEADER(CMD_WordRepeat, 2), 11, 22, 0xFF);
say AlttpDecompression.parse($buf.decode("latin1"));
This output :
「"*^D^B^C^D^EA^L^Wÿ」
chunk => 「"*」
header => 「"」
normal-header => 「"」
mheader => 「"」
data => 「*」
chunk => 「^D^B^C^D^E」
header => 「^D」
normal-header => 「^D」
mheader => 「^D」
data => 「^B^C^D^E」
chunk => 「A^L^W」
header => 「A」
normal-header => 「A」
mheader => 「A」
data => 「^L^W」
stop => 「ÿ」
This is not really useful aside to see if the token is correctly parsed, so I added {say $<header>.ast<cmd lenght>, $<data>.Str.chars}
at the end of the chunk token
(CMD_ByteRepeat 3)1
(CMD_Copy 4)4
(CMD_WordRepeat 2)2
Decompressing data
Now that we have a grammar that can validate Zelda 3 compressed datas, let move on to decompress them since it's not really useful to just validate datas.
We could probably already do this inside the grammar with the {}
and make block, I choose to use Action
. Action are easy to implement, you just create a class with methods named after the token in your grammar. We don't need to have a method for each token and in our case only the 'chunk' token is interesting since it contains all the informations to build the decompressed data.
Since I am not sure how you are supposed to return something else than amodified ast with actions, I just added a result
attribute and passed an instancied Action to the grammar instead of just the class.
class DecompressAction {
has buf8 $.result .= new;
method chunk($/) {
self.process-cmd($/<header>.ast<cmd>, $/<header>.ast<lenght>, $/<data>.Str.encode("latin1"));
}
Then each process-cmd
multi method builds data into result
.
The copy is pretty straight forward.
multi method process-cmd(CMD_Copy, $lenght, $data) {
$!result.append($data)
}
multi method process-cmd(CMD_CopyExisting, $lenght, $data) {
$!result.append($!result.subbuf($data.read-uint16(0, LittleEndian), $lenght))
}
This is one is more interesting because if we wanted to decompress the map data we would just need to change the endian used to decode the offset without changing the rest of the decompression. (And, yes there is 2 copy of the same decompression routine on the ROM with just 2 instructions changed for this...)
Let's try with the previous compressed string I showed previously.
my $buf = buf8.new(BUILD_SHEADER(CMD_ByteRepeat, 3), 42, BUILD_SHEADER(CMD_Copy, 4), 1, 2, 3, 4, BUILD_SHEADER(CMD_WordRepeat, 2), 11, 22, 0xFF);
my $data = AlttpDecompression.parse($buf.decode("latin1"), :actions(DecompressAction.new)).actions.result;
my $testdata = buf8.new(42, 42, 42, 1, 2, 3, 4, 11, 22);
say $data eq $testdata;
I added some debug so we can see how that flows
skarsnik@DESKTOP-UIA12T1:/mnt/f/Project/Perl6/grammar$ perl6 alttpcompression.p6
To decompress : Buf[uint8]:0x<22 2a 03 01 02 03 04 41 0b 16 ff>
Decompressed : Buf[uint8]:0x<2a 2a 2a 01 02 03 04 0b 16>
(CMD_ByteRepeat 3)1
Chunk : CMD_ByteRepeat 3 : Blob[uint8]:0x<2a>
$.result : Buf[uint8]:0x<2a 2a 2a>
(CMD_Copy 4)4
Chunk : CMD_Copy 4 : Blob[uint8]:0x<01 02 03 04>
$.result : Buf[uint8]:0x<2a 2a 2a 01 02 03 04>
(CMD_WordRepeat 2)2
Chunk : CMD_WordRepeat 2 : Blob[uint8]:0x<0b 16>
$.result : Buf[uint8]:0x<2a 2a 2a 01 02 03 04 0b 16>
True
Real Data
Testing with carefully crafted data is nice, but let see how this decompression perform with real game data. I already have a tool to extract and inject GFX from ROM file that include this compression. Normally you should look at a table that tell you where each each part of the GFX (tileset) is located in the rom (using SNES address, not file address), but for the sake of simplicify I will take data from a know location. This is tileset that is used for some of the Link actions and his shields.
[at 0xc0d64, 3bpp]
This is real scale and with a fake gray palette. The compressed data are located at 0xC0D64 + 0x200 (some ROM file have a special header) and is 1500 bytes. The size should not matter since we have a way to know the end of the compressed string but the grammar will not validate.
Here the result inside my tool when opening the output produced by this script. It's exactly the same as before, so I guess it's a success :)
Conclusion
Yes you can parse binary format with Perl 6 grammar, but as you can see it's lot of workaround to access the raw binary data. It was a fun mini project, it made me fix some bugs on my tool that work with SNES GFX (it was bad at opening standalone gfx file)
But if I had to write more snes related software in Perl 6, I will probably bind with NativeCall
what I already wrote in C since it's already well tested.
You can find the whole code at https://gist.github.com/Skarsnik/b2850dbebabeb4c539444598419d0248
Tool used for looking at the GFX can be found at SNESTilesKitten
Leave a comment