This guide explains how to create external parsers for SemanticMerge (and Plastic SCM semantic features) so that you can add support for custom languages or programming languages not yet supported out of the box.
Traditional diff and merge tools handle blocks of text. Semantic does something different; it works at a structural level. (By Semantic, we mean both SemanticMerge and the Plastic SCM built-in semantic features.)
To diff and merge code, Semantic needs to understand its structure. This is much simpler than it might sound, because it only needs to locate the different declarations (typically methods and classes) and find how they relate to each other.
The following figure illustrates what the Semantic technology needs to locate inside the source code files to work:
As you can see, all that Semantic needs to know is:
- There is a Socket class and where it is located (beginning and end).
- The Socket class contains two methods.
- There is a method Connect() inside Socket and the lines where it is located.
- The same for Disconnect().
Given this very simple structure, Semantic will be able to properly calculate the diffs and merges.
It does not need to delve into the bodies of the methods. It does not need to know if there are if-else clauses, for-loops and so on. All Semantic needs to know is where the namespaces, classes and methods are (or functions depending on what the specific language is).
Semantic comes with built-in parsers for several programming languages: C#, Java, VB.NET, C and C++ are supported natively.
Built-in parsers convert the source code into a structure that Semantic can use to calculate diffs and merges.
It is possible to extend Semantic to support additional languages. To do that, an external parser must be created.
An external parser is just a command line application that receives the files to parse as input from Semantic and returns the parsed structures as output. You can create the external parser using your preferred programming language and framework; since the API is purely command line-based, there are no limitations.
This guide explains how to create external parsers and how to configure Semantic to use them.
The API that Semantic uses to invoke external parsers is very simple and is based on the command line.
As the following figure shows, Semantic will start your program and send the input using the stdin, then wait for your program to output OK or KO on the stdout.
The input will always be in the form of pairs of paths plus an encoding between them:
The command line based API is extremely simple, so that you can write parsers in any language. For instance, you can use Go code to create a Go parser; all you need is a command line Go application to be invoked by Semantic.
The following diagram explains how a typical parsing session works, so that you can learn the exact input your program must handle and the output it is expected to provide.
Please note that your parser will always be invoked as follows:
yourparserapplication.exe shell path_to_the_flag_file
Where shell is always the first argument. It means that you are expected to keep running until you receive "end" on the stdin (at this point shell is the only supported mode, so it is redundant, but it has been included for future compatibility).
Then there is path_to_the_flag_file: once your program is ready to receive files to parse, make sure you write to this file. Semantic monitors it while waiting for the external parser to be ready. This synchronization mechanism makes sense when the parser program takes some time to start up, which can happen, for instance, with applications written in Java, where the runtime must be initialized.
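The session described above can be sketched as a small loop. This is only an illustration under stated assumptions, not the code of the sample parser: it assumes that the source path, the encoding and the output path each arrive on their own line of stdin, and that any content written to the flag file is enough to signal readiness (the guide only says the file must be written to). The names run_shell and parse are hypothetical.

```python
import sys
from typing import Callable, TextIO


def run_shell(flag_path: str, stdin: TextIO, stdout: TextIO,
              parse: Callable[[str, str, str], bool]) -> None:
    """Serve parse requests until "end" (or end of input) arrives on stdin."""
    # Signal readiness by writing to the flag file. The content "READY" is an
    # assumption: the guide does not say what should be written.
    with open(flag_path, "w") as flag:
        flag.write("READY")

    while True:
        source_path = stdin.readline()
        if not source_path or source_path.strip() == "end":
            break
        # Assumption: the encoding and the output path follow, one per line.
        encoding = stdin.readline().strip()
        output_path = stdin.readline().strip()
        try:
            ok = parse(source_path.strip(), encoding, output_path)
        except Exception:
            ok = False  # any failure is reported as KO
        stdout.write("OK\n" if ok else "KO\n")
        stdout.flush()  # Semantic is waiting for the answer


if __name__ == "__main__":
    # Invocation described in the guide: <parser> shell path_to_the_flag_file
    if len(sys.argv) == 3 and sys.argv[1] == "shell":
        # Placeholder parser that always fails; plug in a real one here.
        run_shell(sys.argv[2], sys.stdin, sys.stdout,
                  lambda src, enc, out: False)
```

Passing the streams and the parse callback as parameters keeps the loop easy to exercise without launching Semantic.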
The goal of the external parser is to create a declarations tree file. This declarations tree allows Semantic to handle the source code file because it provides the information about how it is structured and where each declaration is located.
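Regardless of the language being parsed, your program only needs to create a tree based on the following four types of objects:
- File: the root node; there is only one per file.
- Container: a declaration that contains other declarations, such as a class.
- Terminal node: a declaration with no children, such as a method. (Some terminal nodes, such as include, import and using clauses, or functions in C, would be directly contained by the file.)
- Parsing error: a problem found while parsing the file.
The following figure shows additions to the previous diagram with the members of each of the four types of objects: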
In summary, the parsing required by Semantic is just about locating the different declarations.
The only tricky part is correctly calculating the spans (in character count instead of lines and columns), which is explained in the next section using an example.
This tutorial explains how to create the declarations tree that the Semantic tools need in order to handle your programming language. It does not cover how to create an actual parser, since that is out of the scope of this guide.
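The relation between character spans and line/column locations can be sketched as follows. The conventions used here are inferred from the trees shown in this guide and should be treated as assumptions: lines are 1-based, the start column is 0-based, the end column is the number of characters the declaration takes on its last line, and spans are inclusive character offsets. SAMPLE is a reconstruction of the tutorial file based on the declaration texts in the log excerpt of this guide; its three-space indentation and CRLF line endings were chosen so that the counts match the published spans.

```python
import bisect

# Reconstructed tutorial file (indentation and line endings are assumptions).
SAMPLE = (
    "class Socket\r\n"
    "{\r\n"
    "   void Connect(string server)\r\n"
    "   {\r\n"
    "      SocketLibrary.Connect(mSocket, server);\r\n"
    "   }\r\n"
    "\r\n"
    "   void Disconnect()\r\n"
    "   {\r\n"
    "      SocketLibrary.Disconnect(mSocket);\r\n"
    "   }\r\n"
    "}"
)


def line_starts(text):
    """Return the character offset at which each line begins."""
    starts = [0]
    for offset, char in enumerate(text):
        if char == "\n":
            starts.append(offset + 1)
    return starts


def location_span(text, first, last):
    """Map an inclusive character span to ([line, col], [line, col])."""
    starts = line_starts(text)
    start_line = bisect.bisect_right(starts, first)  # 1-based line number
    end_line = bisect.bisect_right(starts, last)
    start = [start_line, first - starts[start_line - 1]]
    # End column counts the characters used on the last line (see lead-in).
    end = [end_line, last - starts[end_line - 1] + 1]
    return start, end


if __name__ == "__main__":
    print(location_span(SAMPLE, 17, 109))   # Connect
    print(location_span(SAMPLE, 110, 185))  # Disconnect
```

Counting every character, including the carriage return and line feed at the end of each line, is what makes the offsets line up.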
We will use a very simple file in a C#-like language and explain the resulting declarations tree that Semantic needs to understand the code.
The code used in this example is as follows:
Both SemanticMerge and Plastic SCM ask your external parser to create a declarations tree from a given source code file.
The tree structure for the previous source code is a YAML file which looks as follows:
---
type: file
name: FooSocket.csharp
locationSpan : {start: [1, 0], end: [12, 1]}
footerSpan : [0,-1]
parsingErrorsDetected : false
children:

  - type : class
    name : Socket
    locationSpan : {start: [1, 0], end: [12, 1]}
    headerSpan : [0, 16]
    footerSpan : [186, 186]
    children :

      - type : method
        name : Connect
        locationSpan : {start: [3, 0], end: [7,2]}
        span : [17, 109]

      - type : method
        name : Disconnect
        locationSpan : {start: [8,0], end: [11,6]}
        span : [110, 185]
What you have seen above is the expected result. The next sections explain how to calculate each of the members of the nodes in the declarations tree.
The first node is the file node; there will be only one per file. These are the members to fill in:
---
type: file
name: the path of the file
locationSpan : {start: [line, column], end: [line, column]}
footerSpan : [0,-1]
parsingErrorsDetected : false
children :
In our case, this is very simple if we check the lines and columns of the file:
The location span will be set as follows:
locationSpan : {start: [1, 0], end: [12, 1]}
Which spans the entire file.
So far, it wasn't that hard :-)
To parse the Connect() method we need to fill in the following members in the YAML structure:
- type : method
  name : Connect
  locationSpan : {start: [line, column], end: [line, column]}
  span : [start_char, end_char]
And again, we need to perform a char count calculation.
The following figure explains how the spans of the method are calculated (together with the previously calculated ones for the class):
And then, the resulting structure is:
- type : method
  name : Connect
  locationSpan : {start: [3, 0], end: [7,2]}
  span : [17, 109]
Parsing the Disconnect() method follows the same logic as the previous one: identify the lines and columns for locationSpan and count the chars for span.
And the resulting YAML for the method is:
- type : method
  name : Disconnect
  locationSpan : {start: [8, 0], end: [11, 6]}
  span : [110, 185]
At this point, the entire file is parsed and the output will be ready to be consumed by Semantic.
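Once the spans are known, writing the tree is plain string formatting. The following is a minimal sketch that uses no YAML library and copies the layout of the tree shown in this tutorial. The function names and the dictionary keys (location, header, footer, span) are illustrative, and names are written unquoted, which assumes they contain nothing that YAML would misread.

```python
def fmt_location(location):
    """Format ((line, col), (line, col)) as a locationSpan value."""
    (start_line, start_col), (end_line, end_col) = location
    return f"{{start: [{start_line}, {start_col}], end: [{end_line}, {end_col}]}}"


def render_child(node, indent):
    """Render a container or terminal node as a YAML list item."""
    pad = " " * indent
    lines = [
        f"{pad}- type : {node['type']}",
        f"{pad}  name : {node['name']}",
        f"{pad}  locationSpan : {fmt_location(node['location'])}",
    ]
    if "children" in node:  # container: header, footer and nested nodes
        lines.append(f"{pad}  headerSpan : {list(node['header'])}")
        lines.append(f"{pad}  footerSpan : {list(node['footer'])}")
        lines.append(f"{pad}  children :")
        for child in node["children"]:
            lines.extend(render_child(child, indent + 4))
    else:  # terminal node: a single span
        lines.append(f"{pad}  span : {list(node['span'])}")
    return lines


def render_file(name, location, children, errors_detected=False):
    """Render the file node and everything below it as YAML text."""
    lines = [
        "---",
        "type: file",
        f"name: {name}",
        f"locationSpan : {fmt_location(location)}",
        "footerSpan : [0, -1]",
        f"parsingErrorsDetected : {'true' if errors_detected else 'false'}",
        "children :",
    ]
    for child in children:
        lines.extend(render_child(child, 2))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    socket_class = {
        "type": "class", "name": "Socket",
        "location": ((1, 0), (12, 1)), "header": (0, 16), "footer": (186, 186),
        "children": [
            {"type": "method", "name": "Connect",
             "location": ((3, 0), (7, 2)), "span": (17, 109)},
            {"type": "method", "name": "Disconnect",
             "location": ((8, 0), (11, 6)), "span": (110, 185)},
        ],
    }
    print(render_file("FooSocket.csharp", ((1, 0), (12, 1)), [socket_class]))
```

The values in the usage part are the ones calculated in this tutorial.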
The following screenshot shows how SemanticDiff can handle a couple of files parsed by an external parser:
The file on the left matches the one parsed during the tutorial, while the one on the right is a modification that moves a method and modifies it.
As you can see, Semantic can track the moved method in the custom language, and find differences inside its body.
Visual Diff also works correctly as expected:
Finally, this is the sample parser running inside Plastic SCM in the Pending changes view:
For SemanticDiff and SemanticMerge, it is very simple to invoke Semantic with an external parser:
semanticmergetool.exe -s sample\source.code -d sample\destination.code -ep=codeparser.exe
Where codeparser is the sample parser we created for this tutorial. You can find the sample parser here.
For Plastic SCM, you have to create an externalparsers.conf file, as described in the Plastic SCM configuration section of this guide.
The sample file we used for the example didn't have any comments. Let's now add a few comments to see how we should handle them:
As you see, we are using a pseudo-C# language, not a real one.
We added a couple of include statements that were not present before, and some comments.
Let's now find out how to build the parser by using character positions:
There are some interesting points to highlight:
- The comment before the Socket class and the comment before the Connect method: I consider both as part of their corresponding declarations.
- The start of Disconnect: it starts at the character just after the end of the line where Connect ends. In this case there are no comments there, but if there were, they would be considered part of Disconnect.
- The footer of the Socket container, which applies the same concept I used for Disconnect: the footer starts just after the line where the last method ends, and anything there that is not another node is simply part of the footer.
I'm including the comments just above the declarations as part of the declaration because I consider they will be moved together with the method, and because comments normally go before the declaration. That said, custom parsers could use different techniques. The key thing is that every single character in the file is mapped so that Semantic knows how to handle it.
The resulting declarations tree in YAML is as follows:
---
type: file
name: sample\source.code
locationSpan : {start: [1,0], end: [22,1]}
footerSpan : [0,-1]
parsingErrorsDetected : false
children:

  - type : include
    name : sockets
    locationSpan : {start: [1, 0], end: [1, 18]}
    span : [0, 17]

  - type : include
    name : system
    locationSpan : {start: [2, 0], end: [2,17]}
    span : [18, 34]

  - type : class
    name : Socket
    locationSpan : {start: [3,0], end: [22,1]}
    headerSpan : [35, 94]
    footerSpan : [354, 430]
    children :

      - type : method
        name : Connect
        locationSpan : {start: [7, 0], end: [13,6]}
        span : [95, 275]

      - type : method
        name : Disconnect
        locationSpan : {start: [14,0], end: [18,6]}
        span : [276, 353]
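Since every single character must be mapped, a quick self-check on the spans can catch counting mistakes before Semantic does. The helper below is a hypothetical sketch, not part of Semantic: it takes inclusive (start, end) spans and reports gaps and overlaps. The file length of 431 characters used in the example is an assumption derived from the last footer offset (430) in the tree above.

```python
def find_unmapped(spans, file_length):
    """Return (gaps, overlaps) for a list of inclusive (start, end) spans."""
    gaps, overlaps = [], []
    expected = 0  # next character offset that still needs an owner
    for start, end in sorted(spans):
        if start > expected:
            gaps.append((expected, start - 1))
        elif start < expected:
            overlaps.append((start, min(end, expected - 1)))
        expected = max(expected, end + 1)
    if expected < file_length:
        gaps.append((expected, file_length - 1))
    return gaps, overlaps


if __name__ == "__main__":
    spans = [
        (0, 17),     # include sockets
        (18, 34),    # include system
        (35, 94),    # Socket header
        (95, 275),   # Connect
        (276, 353),  # Disconnect
        (354, 430),  # Socket footer
    ]
    print(find_unmapped(spans, 431))
```

An empty pair of lists means the header, the children and the footer tile the file without holes.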
Here is how SemanticDiff looks displaying the previous sample, plus another file based on the previous:
You can find the sample parser code for this scenario here.
You can learn more about different techniques to handle comments in this Forum thread.
What if you find errors during parsing? You can fill in the parsingError collection to notify the user that something went wrong and maybe the declarations tree can't be correctly calculated.
Let's look at the following example in the pseudo-programming language we used in this tutorial:
As you can see there is a ; missing at the end of line 5 (highlighted in the picture).
The declarations tree to handle this case can be as follows:
---
type: file
name: sample\source.code
locationSpan : {start: [1,0], end: [12,1]}
footerSpan : [0,-1]
parsingErrorsDetected : true
children:

  - type : class
    name : Socket
    locationSpan : {start: [1,0], end: [12,1]}
    headerSpan : [0, 16]
    footerSpan : [186, 186]
    children :

      - type : method
        name : Connect
        locationSpan : {start: [3, 0], end: [7,2]}
        span : [17, 109]

      - type : method
        name : Disconnect
        locationSpan : {start: [8,0], end: [11,6]}
        span : [110, 185]

parsingError:
  - location: [5,45]
    message: "Missing ; at the end of the line"
And then, if you launch SemanticDiff, you will get the following warning:
SemanticDiff and SemanticMerge warn you that the parser found issues, and give you the choice to launch the text-based tool instead.
You can find the sample code for this example here.
While you develop your external parser, you will want to launch Semantic (or Plastic) to check if everything works as expected.
First, you need to launch Semantic or Plastic SCM using your parser. To do that, check the section Configure external parsers.
Sometimes, issues will happen if you forget to fill in the declarations tree correctly. If that happens, besides checking the error messages, looking at the logs might be very useful.
When you run SemanticDiff or SemanticMerge, check semantic.log.txt and then filter the log lines containing ExternalParser. You will see something as follows:
2016-11-06 21:39:30,730 DEBUG [10460] ExternalParser - Obtaining the descriptor file from the source file "sample\source.code"
2016-11-06 21:39:30,733 DEBUG [10460] ExternalParser - DescriptorFile: ---
type: file
name: sample\source.code
locationSpan : {start: [1,0], end: [17,30]}
footerSpan : [189, 316]
parsingErrorsDetected : false
children:

  - type : class
    name : Socket
    locationSpan : {start: [1,0], end: [12,1]}
    headerSpan : [0, 16]
    footerSpan : [186, 186]
    children :

      - type : method
        name : Connect
        locationSpan : {start: [3, 0], end: [7,2]}
        span : [17, 109]

      - type : method
        name : Disconnect
        locationSpan : {start: [8,0], end: [11,6]}
        span : [110, 185]

2016-11-06 21:39:30,820 DEBUG [10460] ExternalParser - Obtaining the syntax tree from the descriptor file
2016-11-06 21:39:30,821 DEBUG [10460] ExternalParser - Processing the file node "sample\source.code"
2016-11-06 21:39:30,823 DEBUG [10460] ExternalParser - FooterText of the declaration: "\r\nthis is the bottom of the file\r\nand some languages don't fail\r\nto parse if there are extra\r\nlines after the the code ends"
2016-11-06 21:39:30,826 DEBUG [10460] ExternalParser - Processing the container node "Socket" ("class" type)
2016-11-06 21:39:30,826 DEBUG [10460] ExternalParser - HeaderText of the declaration: "class Socket\r\n{\r\n"
2016-11-06 21:39:30,826 DEBUG [10460] ExternalParser - FooterText of the declaration: "}"
2016-11-06 21:39:30,827 DEBUG [10460] ExternalParser - Processing the terminal node "Connect" ("method" type)
2016-11-06 21:39:30,827 DEBUG [10460] ExternalParser - Text of the declaration: " void Connect(string server)\r\n {\r\n SocketLibrary.Connect(mSocket, server);\r\n }\r\n\r\n"
2016-11-06 21:39:30,827 DEBUG [10460] ExternalParser - Processing the terminal node "Disconnect" ("method" type)
2016-11-06 21:39:30,827 DEBUG [10460] ExternalParser - Text of the declaration: " void Disconnect()\r\n {\r\n SocketLibrary.Disconnect(mSocket);\r\n }\r\n"
The interesting thing is that you will see how your tree is loaded and how each declaration is handled. If something goes wrong, you will find out whether the tree couldn't be loaded at all, whether there is a YAML error, or simply which declaration was the last one the system managed to process.
In case you are using Plastic SCM, check the plastic.log.txt log file.
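Filtering the log can be scripted. The sketch below assumes the log layout shown in the excerpt above: every entry starts with a timestamp, and a multi-line message (such as the dumped descriptor file) continues on the following lines without one. The function name is hypothetical.

```python
import re

# An entry starts with a timestamp such as "2016-11-06 21:39:30,730 ".
ENTRY_START = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} ")


def external_parser_entries(log_text):
    """Return the ExternalParser log entries, keeping multi-line messages whole."""
    entries = []
    for line in log_text.splitlines():
        if ENTRY_START.match(line):
            entries.append(line)
        elif entries:
            entries[-1] += "\n" + line  # continuation of the previous entry
    # Only the first line of an entry carries the logger name.
    return [entry for entry in entries
            if " ExternalParser - " in entry.split("\n", 1)[0]]
```

To use it, read semantic.log.txt (or plastic.log.txt) into a string and print each returned entry. Keeping continuation lines attached means the dumped declarations tree stays readable in the filtered output.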
In both cases, check the log configuration file (semantic.log.conf or plastic.log.conf) and make sure the following logger (we use standard log4net) is enabled:
<logger name="ExternalParser">
  <level value="DEBUG" />
</logger>
You can use the -ep argument when launching SemanticDiff and SemanticMerge to specify an external parser as follows:
semanticmergetool.exe -s sample\source.code -d sample\destination.code -ep=codeparser.exe
There is no way yet in SemanticDiff and SemanticMerge to associate parsers with file extensions (unlike in Plastic SCM).
It is possible to predefine a single external parser to be used in all sessions, so that the -ep parameter doesn't have to be specified. To do that, just go to Configuration/General and then:
This setting actually modifies the configuration file semanticmergetool.conf, which stores all the arguments to be passed to the tool by default.
virtualMachine: {-vm | --virtualmachine}=<path to the Java Virtual Machine executable>
To configure Plastic SCM to use your external parser, you have to create an externalparsers.conf file in your Plastic SCM configuration directory.
The path will be: C:\Users\<your-user-name>\AppData\Local\plastic4\externalparsers.conf
Example:
C:\Users\pablo\AppData\Local\plastic4\externalparsers.conf
Then, for each external parser you want to use, specify the extension that it will handle and the full path of your parser.
Example:
.code=C:\Users\pablo\wkspaces\semantic-external-parsers\codeparser\bin\debug\codeparser.exe
Each parser is associated with a file extension, and you can define as many parsers as you need.
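If you maintain several entries, a small sanity check of your own file can catch typos early. The sketch below assumes one .extension=path entry per line, as in the example above. It is a hypothetical helper and does not reflect how Plastic SCM itself reads the file.

```python
def parse_external_parsers(conf_text):
    """Map each file extension to the parser path declared for it."""
    parsers = {}
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        # Split on the first "=" only, so the path may contain "=" itself.
        extension, separator, path = line.partition("=")
        if not separator or not extension.startswith(".") or not path:
            raise ValueError(f"malformed entry: {raw!r}")
        parsers[extension] = path
    return parsers


if __name__ == "__main__":
    sample = (".code=C:\\Users\\pablo\\wkspaces\\semantic-external-parsers"
              "\\codeparser\\bin\\debug\\codeparser.exe\n")
    print(parse_external_parsers(sample))
```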
Why did we decide to use YAML instead of plain text, XML or JSON for external parsers to specify the declarations tree?
YAML has the following advantages: