Currently one of my tasks is creation a scripting engine for .NET which would satisfy the following demands:
- Extendable syntaxis (ability to add new, possibly very exotic, syntaxical constructions).
- Dynamical typization (in order to simplify the scripting writing process, because language is oriented on non-programmers as well).
- Full execution control (for external control over the script execution).
Although the best solution would be to find an existing implementation and use it, but googling lead to either Lua/Perl/Python interopearbility (which doesn't satisfy the 2nd criterion) or to use of the .NET classes for dynamical compilation of the code.
After some time I did almost all the work by myself and here are some advices, which are hopefully useful (unless you had a compilers theory course and a good practice and already know much more):
- For quick introduction in scripting read correspondent chapters in books on games programming. They will give the baseline terminology and the overall scheme of what-to-do given a script text. Books on compilers theory are good, but if you need a hands-on approach then they probably contain too much theoretical information.
- Do not try to solve all problems at once like "Ok, I've got a script and now I'm simply going to process and execute the raw source. Nothing serious really, because I can easily see assignments, function calls, conditions etc., so writing a program, which would do the same, is a piece of cake". This is a suicide, because a problem is much harder than it seems. Just think about code like: a = sqrt(abs(c - d[i + 1])).
- Simplified (a bit) source processing procedure includes: (1) removal of comments; (1a) concatenation of the whole script into a single string; (2) tokenization (breaking the script into atomic entries like operators, brackets, identifiers etc.); (3) creation of a parse tree (http://en.wikipedia.org/wiki/Parse_tree), which define a kind of 'execution atoms', and the order at which they should be processed; (4) optimization of the parse tree.
- It's convenient to create a special class for tokens with fields like Contents and Type. The latter defines what kind of token it is (keyword, operator, bracket and so on). This will make life much easier on the parse tree building step.
- BNFs (http://en.wikipedia.org/wiki/Backus-Naur_Form) are really-really-really useful things, especially when describing a formalized grammar. You can represent a BNF as collection of arrays consisting of tokens. Big advantage is that BNFs has inherent recursion property and allow reuse of one BNFs as part of other BNFs (so the good KISS and DRY principles hold).
- BNFs should be applied for building of a parse tree given a collection of tokens. You just start from the very beginning with BNF for the entire script and use recursion. The complex part is that it's not known beforehand, which BNF suits current collection of tokens the best, so you'll have to try any suitable BNF you have, which start from the same token. But there is a way to make the problem easier, see advices below.
- Handling recursively embedded BNF is not a simple thing and your brains will boil, so be very careful with this. Use loop and class invariants (http://en.wikipedia.org/wiki/Loop_invariant) for functions to control the process. But you'll be surprised how little code it takes to make things work.
- Use dynamic programming (http://en.wikipedia.org/wiki/Dynamic_programming) approach for finding BNF, which suits current list of tokens best. This means that if you found that some subset of tokens is described by some BNF then you can mark the first token in this subset so that not to consider it again and again when trying other BNFs. Otherwise you'll end up redoing some work over and over and it's very likely that you'll get a stack overflow error.
- Probably the most amasing thing about BNFs is that when you need to extend a syntaxis, you'll just need to extend the set of BNFs used during a parse tree building procedure. And taking that BNFs can be read from some file, it's even possible to extend the grammar *without* recompilation of the program (in case you've written everything correctly, use loop and class invariants!).
- The drawback of using BNFs is that arithmetic and logical exressions are written in a 'direct' representation, not in the Polish notation (http://en.wikipedia.org/wiki/Polish_notation), hence optimization of the tree is required, because otherwise you'll have to make the transform to the Polish notation at each scripting execution, which is really unwise.
- Parse tree is enough for execution of the script.
- It's convenient to keep additional information in parse tree nodes about their BNF (if any). This information can be used to handle execution of conditoinals etc. Moreover you can specify special external handlers, which are called by BNF names. Hence it's possible to obtain a control over script execution, which may be convenient for setting external break points or specific execution rules.
- Script execution is guided in terms of the program state (http://en.wikipedia.org/wiki/Program_state), which include (among others) global and local variables data. You can add any extra information you need for the program state (call stack, current parse tree node, environment variables etc.).