Writing a SNES assembler compiler/disassembler - Day 3

Assembling the Assembler

Grammar fix

When I started implementing the compiler part, I noticed that the grammar does not really work, especially once you introduce newlines. If I parse a file with 3 instructions, the \n sometimes gets swallowed and the asm-comment token is too greedy.

Let's change the ws rule to only capture horizontal whitespace (spaces and tabs) and introduce an eol token. This also makes the grammar clearer about what we are matching.

    token TOP { <.eol>* <thing>+ $ }
    token thing {
        || <asm-comment> <.eol>
        || <one-instruction> <.eol>
        || <instruction-line> <.ws> <.asm-comment>? <.eol>
    }
    token instruction-line { <instruction> (<.ws> ':' <.ws> <instruction>)+ }
    token one-instruction { <instruction> <.ws> <.asm-comment>? }
    token asm-comment { ';' <-[\n]>* }
    token ws { <!ww> \h* }
    token eol { \n[\h*\n]* }
}
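To see the ws/eol split in isolation, here is a minimal sketch. MiniAsm and its statement, mnemonic and operand tokens are hypothetical stand-ins, not the real grammar: it only knows two mnemonics and decimal operands.

```raku
# Hypothetical mini-grammar, only meant to show the ws / eol split.
grammar MiniAsm {
    token TOP         { <.eol>* <statement>+ $ }
    token statement   { <instruction> <.ws> <asm-comment>? <.eol> }
    token instruction { <mnemonic> [ \h+ <operand> ]? }
    token mnemonic    { 'lda' | 'rtl' }
    token operand     { \d+ }
    token asm-comment { ';' <-[\n]>* }
    # ws only eats horizontal whitespace, so it can never swallow a newline
    token ws          { <!ww> \h* }
    # eol eats one newline plus any blank lines that follow it
    token eol         { \n [ \h* \n ]* }
}

my $source = "lda 42 ; boring\n\nrtl ; I love comment\nrtl ; meh\n";
my $match  = MiniAsm.parse($source);

# Print each instruction without its trailing comment
for $match<statement>.list -> $statement {
    say $statement<instruction>.Str;
}
```

Since the comment token stops at \n and only eol consumes newlines, each statement should end exactly at its own line, blank lines included.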

Capturing thing

With grammars in Raku, you associate an actions class with your grammar to transform the AST, or just to do what you want when a token matches.

It's a really simple mechanism: you define a class with methods that have the same names as your tokens, and that's all. You don't need a method for every token, and for our purpose we mainly need to catch the <instruction> tokens. The catch for our code: since we use a proto token for <instruction> with many generated tokens like token instruction:sym<TRB-DIRECT-PAGE>, defining an instruction method does not work. The method must match the name of the generated token. Fortunately we don't need to generate a method for every token, since Raku offers a pseudo-method called FALLBACK that catches all calls to methods that are not defined on our class.

The method looks like this:

    method FALLBACK ($name, $match) {
        if $name ~~ /^instruction':sym<'(.+?)['-'(.*?)]**0..1'>'/ {
            my $int = CompilableInstruction.new;
            $int.name = $/[0].Str;
            $int.text = $match.Str;
            my $parsed = $match.target.substr(0, $match.pos).trim-trailing;
            with $/[1] {
                $int.addressing = ASM65816::AddressingMode::<<$/[1]>>;
                $int.lenght = %map-instructions{$int.name}{$int.addressing}.lenght;
                $int.operand = $int.lenght == 1 ?? -1 !! $!last-value;
            }
            else {
                $int.addressing = IMPLIED;
                $int.operand = -1;
                $int.lenght = 1;
            }
            $int.op-code = %map-instructions{$int.name}{$int.addressing}.op-code;
            @!see-instructions.push($int);
        }
    }
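To make the dispatch easier to follow, here is a small sketch of FALLBACK on its own, outside of any grammar. NameRecorder is a hypothetical class and the methods are called by hand with a fake match; it only shows how the token name gets split into an instruction name and an addressing mode.

```raku
# Hypothetical class: records what FALLBACK extracts from each method name.
class NameRecorder {
    has @.seen;

    # Called for every method that NameRecorder does not define itself.
    method FALLBACK ($name, $match) {
        if $name ~~ /^ 'instruction:sym<' (.+?) [ '-' (.*?) ]? '>' $ / {
            # No second capture means no addressing mode in the name
            my $mode = $1.defined ?? ~$1 !! 'IMPLIED';
            @!seen.push(~$0 ~ ' / ' ~ $mode);
        }
    }
}

my $recorder = NameRecorder.new;
for 'instruction:sym<TRB-DIRECT-PAGE>', 'instruction:sym<RTL>', 'TOP' -> $name {
    # Call a method whose name is only known at runtime
    $recorder."$name"('fake match');
}
.say for $recorder.seen;
```

Names that do not look like an instruction token (TOP here) still reach FALLBACK but are simply ignored.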

The instructions are stored in a CompilableInstruction class to make them more interesting to use.

I removed my generated array of instructions and replaced it with a %map-instructions hash that maps instruction information by 'name' and addressing mode. This will be very useful to report which addressing modes an instruction supports in case of error.

    BIT => {
        DIRECT-PAGE => Instruction.new(:inst("BIT"), :addressing(DIRECT-PAGE), :description("Test Bits"), :op-code(36), :alias(""), :lenght(2)),
        ABSOLUTE => Instruction.new(:inst("BIT"), :addressing(ABSOLUTE), :description("Test Bits"), :op-code(44), :alias(""), :lenght(3)),
        DP-INDEXED-X => Instruction.new(:inst("BIT"), :addressing(DP-INDEXED-X), :description("Test Bits"), :op-code(52), :alias(""), :lenght(2)),
        ABSOLUTE-INDEXED-X => Instruction.new(:inst("BIT"), :addressing(ABSOLUTE-INDEXED-X), :description("Test Bits"), :op-code(60), :alias(""), :lenght(3)),
        IMMEDIATE => Instruction.new(:inst("BIT"), :addressing(IMMEDIATE), :description("Test Bits"), :op-code(137), :alias(""), :lenght(2)),
    },
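As a rough idea of how such a map could feed an error message, here is a sketch. It makes several assumptions: %modes-for, supported-modes and mode-error are hypothetical names, the keys are plain strings instead of the AddressingMode enum, and the values are bare op-codes rather than Instruction objects.

```raku
# Simplified stand-in for %map-instructions: name => { mode => op-code }
my %modes-for =
    BIT => { 'DIRECT-PAGE' => 0x24, 'ABSOLUTE' => 0x2C, 'IMMEDIATE' => 0x89 },
    RTL => { 'IMPLIED' => 0x6B };

# List the addressing modes known for an instruction (empty if unknown)
sub supported-modes(Str $name) {
    return () unless %modes-for{$name}:exists;
    %modes-for{$name}.keys.sort.List
}

# Build the kind of message the map makes possible
sub mode-error(Str $name, Str $mode) {
    my @supported = supported-modes($name);
    "$name does not support $mode; supported: " ~ @supported.join(', ')
}

say mode-error('BIT', 'IMPLIED');
```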

Today I learned - Be mindful of $/

You can see that I named the Match argument of FALLBACK $match. In most grammar examples it is named $/, like the implicit variable that gets set when you use a regex. But that is a small trap: I use a regex to match and capture parts of the method name, so $/ gets reassigned, and you get a runtime error telling you that you assigned to a read-only variable (method/sub arguments are read-only by default).
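The trap can be reproduced without a grammar. This is a sketch with two hypothetical subs that differ only in the name of their second parameter; the exact error message may vary between Rakudo versions, so it is printed rather than assumed.

```raku
# Second parameter is named $/, so the regex below tries to overwrite it
sub with-slash($name, $/)     { $name ~~ / (\w+) /; ~$0 }

# Second parameter has another name, so the sub gets its own fresh $/
sub with-named($name, $match) { $name ~~ / (\w+) /; ~$0 }

say with-named('lda', 'ignored');

my $result = try with-slash('lda', 'ignored');
my $error  = $!;
say $error ?? "with-slash failed: {$error.message}" !! 'with-slash worked';
```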

Getting the value

We also need to capture the operand of the instruction. We could get it at the grammar level and add the value to the AST when parsing a byte, word or long, but then you need to propagate the change through the AST for each parent token up to the instruction. There is a simpler solution: you may have noticed that I set the operand from an attribute called last-value. That's because I simply catch each byte, word and long token in the actions class and assign its value to this attribute.

    method long($/) {
        self.word($/);
        $!value-type = ValueType::long;
    }
    method byte ($/) {
        self.word($/);
        $!value-type = ValueType::byte;
    }
    method word ($cap) {
        $!last-value = $cap.Str.starts-with('$') ?? $cap.Str.substr(1).parse-base(16) !! $cap.Str.parse-base(10);
        $!value-type = ValueType::word;
    }
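The conversion done in word can be tried on its own. In this sketch, parse-operand is a hypothetical helper that applies the same rule to a plain string: a leading $ means hexadecimal, anything else is read as decimal.

```raku
# Same rule as the word method, applied to a plain string
sub parse-operand(Str $text --> Int) {
    $text.starts-with('$')
        ?? $text.substr(1).parse-base(16)
        !! $text.parse-base(10)
}

say parse-operand('$2A');    # 42
say parse-operand('42');     # 42
say parse-operand('$FFFF');  # 65535
```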

I actually don't use the ValueType for now, but maybe that will be useful later.

Showing the result

The next step is to write a Str/gist method on the CompilableInstruction class to get a nice output, so we can display the @.instructions from the actions class. This array is built from @!see-instructions when we encounter a <one-instruction> or <instruction-line> token. This is probably not needed now, but my previous grammar was causing some instructions to be seen twice on a bad match.

Let's assemble one of my small test files.

    lda 42 ; boring
    rtl ; I love comment
    rtl ; meh

I also wrote a sub to display the value according to the addressing mode so the Str method can 'rebuild' the text.

The program output looks like this. So far so good.

    A5:2A LDA $2A
    6B: RTL
    6B: RTL
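A Str/gist pair producing this kind of listing could be sketched as follows. ShownInstruction is a hypothetical, cut-down class, not the real CompilableInstruction, and the way operands wider than one byte are displayed is an assumption.

```raku
# Hypothetical, reduced version of an instruction that knows how to print itself
class ShownInstruction {
    has Str $.name;
    has Int $.op-code;
    has Int $.operand = -1;
    has Int $.length  = 1;   # total size in bytes, op-code included

    method Str {
        my $op = $!op-code.fmt('%02X');
        # Instructions without an operand only show their op-code
        return $op ~ ': ' ~ $!name if $!length == 1;
        # Two hex digits per operand byte
        my $value = $!operand.fmt('%0' ~ 2 * ($!length - 1) ~ 'X');
        $op ~ ':' ~ $value ~ ' ' ~ $!name ~ ' $' ~ $value
    }

    # say uses gist, so make it agree with Str
    method gist { self.Str }
}

say ShownInstruction.new(name => 'LDA', op-code => 0xA5, operand => 42, length => 2);
say ShownInstruction.new(name => 'RTL', op-code => 0x6B);
```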

To generate the bytecode I just added an assemble method to my instruction class that returns a buf8 containing the bytes built from the instruction information. It's actually pretty straightforward, since we just need to take the op-code and append the operand encoded in little-endian order (LittleEndian).

    # There is no write-uint24 method :)
    method !encode-long(int32 $value) returns buf8 {
        my $toret = buf8.new();
        $toret.append($value +& 0xFF);
        $toret.append(($value +> 8) +& 0xFF);
        $toret.append(($value +> 16) +& 0xFF);
        $toret
    }

    method assemble returns buf8 {
        my buf8 $toret = buf8.new();
        $toret.write-uint8(0, $.op-code);
        $toret.write-uint8(1, $.operand, LittleEndian) if $.lenght == 2;
        $toret.write-uint16(1, $.operand, LittleEndian) if $.lenght == 3;
        $toret.append(self!encode-long($.operand)) if $.lenght == 4;
        $toret;
    }
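To check the byte order by eye, here is a standalone sketch of the same idea. The encode and hex-dump subs are hypothetical helpers that shift and mask every operand byte instead of using the write-uint methods; the op-codes in the example are the standard 65816 ones for LDA direct page, BIT absolute and JSL long.

```raku
# Op-code first, then the operand bytes, least significant byte first
sub encode(Int $op-code, Int $operand, Int $length --> buf8) {
    my $bytes = buf8.new($op-code);
    for ^($length - 1) -> $i {
        $bytes.append(($operand +> (8 * $i)) +& 0xFF);
    }
    $bytes
}

# Render a buffer as space-separated hex bytes
sub hex-dump(buf8 $bytes) {
    $bytes.list.map(*.fmt('%02X')).join(' ')
}

say hex-dump(encode(0xA5, 0x2A, 2));      # A5 2A
say hex-dump(encode(0x2C, 0x1234, 3));    # 2C 34 12
say hex-dump(encode(0x22, 0x123456, 4));  # 22 56 34 12
```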

The next step will be to test if the produced output is valid. So we will need to write tests and compare with other assemblers.


About Sylvain Colinet

I blog about Perl 6.