External parsers


Goal

This guide explains how to create external parsers for SemanticMerge (and Plastic SCM semantic features) so that you can add support for custom languages or programming languages not yet supported out of the box.


How SemanticMerge handles source code

Traditional diff and merge tools handle blocks of text. Semantic does something different; it works at a structural level. (By Semantic, we mean both SemanticMerge and the Plastic SCM built-in semantic features.)

To diff and merge code, Semantic needs to understand its structure. This is much simpler than it might sound: it only needs to locate the different declarations (typically methods and classes) and find how they relate to each other.

The following figure illustrates what the Semantic technology needs to locate inside the source code files to work:

What Semantic needs to work

As you can see, all that Semantic needs to know is:

  • That there is a Socket class and where it is located (beginning and end).
  • That the Socket class contains two methods.
  • That there is a method Connect() inside Socket and the lines where it is located.
  • Same for Disconnect().

Given this very simple structure, Semantic will be able to properly calculate the diffs and merges.

It does not need to delve into the bodies of the methods. It does not need to know if there are if-else clauses, for-loops and so on. All Semantic needs to know is where the namespaces, classes and methods are (or functions depending on what the specific language is).


Built-in parsers

Semantic comes with built-in parsers for several programming languages: C#, Java, VB.NET, C and C++ are supported natively.

Built-in parsers convert the source code into a structure that Semantic can use to calculate diffs and merges.


External parsers

It is possible to extend Semantic to support additional languages. To do that an external parser must be created.

An external parser is just a command line application that receives the files to parse as input from Semantic and returns the parsed structures as output. You can create the external parser using your preferred programming language and framework. Since the API is purely command line-based, there are no limitations.

This guide explains how to create external parsers and how to configure Semantic to use them.


The external parser API


Overview

The API that Semantic uses to invoke external parsers is very simple and is based on the command line.

As the following figure shows, Semantic will start your program and send the input using the stdin, then wait for your program to output OK or KO on the stdout.

External parser - API

The input always comes as a pair of paths with an encoding between them (take a look at the complete supported encoding list):

  • The first path is always the path of the file to parse.
  • The second path is where Semantic wants your program to write the resulting YAML file with the parsed structure.

Your program is expected to continue running, processing pairs of paths, until it receives the end string on stdin. This way, when Semantic needs to parse multiple files (2 files for diffing, 3 for merging), your application doesn't have to be launched multiple times, which saves precious time with runtimes that are slow to start (like Java, NodeJS and so on).

The command line-based API is extremely simple so that you can write parsers in any language. For instance, you can use Go code to create a Go parser: all you need is a command line Go application that Semantic can invoke.
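The loop described above can be sketched in Python as follows. This is only a minimal sketch under assumptions: it supposes that each item (file path, encoding, output path) arrives on its own line and that one OK or KO line is written per file. The names run_session and parse_file are hypothetical, they are not part of Semantic.

```python
def run_session(stdin, stdout, parse_file):
    """Serve parse requests until the line "end" arrives.

    Assumed wire format: three lines per request (file to parse,
    encoding, output path), answered with one "OK" or "KO" line.
    parse_file is a hypothetical callback that does the real parsing.
    """
    while True:
        source_path = stdin.readline()
        # Stop on the "end" marker, or if the input stream is closed.
        if not source_path or source_path.strip() == "end":
            break
        encoding = stdin.readline().strip()
        output_path = stdin.readline().strip()
        try:
            parse_file(source_path.strip(), encoding, output_path)
            stdout.write("OK\n")
        except Exception:
            # Report any failure as KO so the caller is not left waiting.
            stdout.write("KO\n")
        stdout.flush()
```

In a real parser you would call it as run_session(sys.stdin, sys.stdout, your_parse_function). Passing the streams as arguments keeps the loop easy to exercise with in-memory streams.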


A sample parsing session

The following diagram explains how a typical parsing session works, so that you can learn the exact input your program must handle and the output it is expected to provide.

Sample parsing session

Please note that your parser will always be invoked as follows:

yourparserapplication.exe shell path_to_the_flag_file

Where shell is always the first argument. It means you are expected to continue running until you receive "end" on stdin. (At this point shell is the only supported mode, so it is redundant, but it has been included for future compatibility.)

Then there is path_to_the_flag_file: once your program is ready to receive files to parse, make sure you write to this file. Semantic monitors it while waiting for the external parser to be ready. This synchronization mechanism makes sense when the parser program takes some time to start up, which can happen, for instance, with applications written in Java, where the runtime must be initialized first.
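The startup handshake could look like this in Python. Take it as a sketch: the guide only says that the parser must write to the flag file, so the content written here ("READY") is an arbitrary placeholder, and signal_ready and main are hypothetical names.

```python
import sys
from pathlib import Path


def signal_ready(flag_path):
    """Write to the flag file so the caller knows the parser is ready.

    The text "READY" is an assumption: the guide does not say what
    the file should contain, only that it must be written.
    """
    Path(flag_path).write_text("READY", encoding="utf-8")


def main(argv):
    # Expected invocation: <parser> shell <path_to_the_flag_file>
    if len(argv) != 3 or argv[1] != "shell":
        print("usage: parser shell <path_to_the_flag_file>", file=sys.stderr)
        return 1
    # Do any slow initialization before this point, then signal readiness.
    signal_ready(argv[2])
    # From here on, the parser reads requests from stdin until "end".
    return 0
```

A script would typically end with sys.exit(main(sys.argv)).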


The structure of the declarations tree

The goal of the external parser is to create a declarations tree file. This declarations tree allows Semantic to handle the source code file because it provides the information about how it is structured and where each declaration is located.

Independently of the language being parsed, your program only needs to create a tree based on the following four types of objects:

Parsed tree basic structure

  • You need to create one file object.
  • The file object can contain terminal nodes directly (for example, include, import and using clauses, or functions in C).
  • Or it can contain containers (namespaces, classes, and so on), which in turn can contain terminal nodes (typically methods and properties).
  • Optionally the file can contain parsing errors.

The following figure shows additions to the previous diagram with the members of each of the four types of objects:

Declarations tree objects with properties

In summary, the parsing required by Semantic is just about locating the different declarations.

The only tricky part is correctly calculating the spans (in character count instead of lines and columns). The next section explains this using an example.

Remarks:
  • The parsingError list is optional. Do not include it if there are no errors.
  • The children collection is also optional. Do not include the children field if the collection has no children.
  • When there is no footerSpan it must be defined as: footerSpan : [0,-1].
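To make the remarks above concrete, here is a small Python sketch that renders a tree held in plain dictionaries as YAML text. The dictionary layout and the function names are our own assumptions for illustration, and names are written unquoted, so a name containing characters that are special in YAML (such as ": " or " #") would need extra care.

```python
import json


def fmt_span(pair):
    """Format a character span such as [17, 109]."""
    return "[%d, %d]" % (pair[0], pair[1])


def fmt_location(loc):
    """Format {"start": [line, col], "end": [line, col]} in flow style."""
    return "{start: [%d, %d], end: [%d, %d]}" % (
        loc["start"][0], loc["start"][1], loc["end"][0], loc["end"][1])


def render_node(node, depth):
    """Return the YAML lines for a container or terminal node."""
    pad = "  " * depth
    lines = [
        pad + "- type : " + node["type"],
        pad + "  name : " + node["name"],
        pad + "  locationSpan : " + fmt_location(node["locationSpan"]),
    ]
    if "span" in node:
        # Terminal node: a single span covers the whole declaration.
        lines.append(pad + "  span : " + fmt_span(node["span"]))
    else:
        # Container node: header and footer spans wrap the children.
        lines.append(pad + "  headerSpan : " + fmt_span(node["headerSpan"]))
        lines.append(pad + "  footerSpan : " + fmt_span(node["footerSpan"]))
    if node.get("children"):
        # The children key is omitted entirely when the list is empty.
        lines.append(pad + "  children :")
        for child in node["children"]:
            lines.extend(render_node(child, depth + 1))
    return lines


def render_file(tree):
    """Return the whole declarations tree as YAML text."""
    errors = tree.get("parsingError", [])
    lines = [
        "---",
        "type: file",
        "name: " + tree["name"],
        "locationSpan : " + fmt_location(tree["locationSpan"]),
        # [0, -1] is the documented marker for "no footer".
        "footerSpan : " + fmt_span(tree.get("footerSpan", [0, -1])),
        "parsingErrorsDetected : " + ("true" if errors else "false"),
    ]
    if tree.get("children"):
        lines.append("children:")
        for child in tree["children"]:
            lines.extend(render_node(child, 1))
    if errors:
        # The parsingError list is only written when there are errors.
        lines.append("parsingError:")
        for error in errors:
            lines.append("  - location: " + fmt_span(error["location"]))
            # json.dumps yields a double-quoted string, which YAML accepts.
            lines.append("    message: " + json.dumps(error["message"]))
    return "\n".join(lines) + "\n"
```

Writing the YAML by hand like this avoids a dependency on a YAML library. If you already use one, dumping the same dictionaries with it is an equally reasonable choice.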

Tutorial: how to create a declarations tree, step by step

This tutorial explains how to create a declarations tree that the Semantic tools expect to handle your programming language. It will not cover how to create an actual parser since this is out of scope of this guide.

We will use a very simple file in a C#-like language and explain the resulting declarations tree that Semantic needs to understand the code.

The code used in this example is as follows:

Source lines and chars


Expected declarations tree

Both SemanticMerge and Plastic SCM ask your external parser to create a declarations tree from a given source code file.

The tree structure for the previous source code is a YAML file which looks as follows:

---
type: file
name: FooSocket.csharp
locationSpan : {start: [1, 0], end: [12, 1]}
footerSpan : [0,-1]
parsingErrorsDetected : false
children:

  - type : class
    name : Socket
    locationSpan : {start: [1, 0], end: [12, 1]}
    headerSpan : [0, 16]
    footerSpan : [186, 186]
    children :

    - type : method
      name : Connect
      locationSpan : {start: [3, 0], end: [7,2]}
      span : [17, 109]

    - type : method
      name : Disconnect
      locationSpan : {start: [8,0], end: [11,6]}
      span : [110, 185]

What you have seen above is the expected result. The next sections explain how to calculate each of the members of the nodes in the declarations tree.


File node - calculate locationSpan

The first node is the file node; there will be only one per file. These are the members to fill in:

---
type: file
name: the path of the file
locationSpan : {start: [line, column], end: [line, column]}
footerSpan : [0,-1]
parsingErrorsDetected : false
children :

In our case, this is very simple if we check the lines and columns of the file:

Source lines and columns only

The location span will be set as follows:

locationSpan : {start: [1, 0], end: [12, 1]}

Which spans the entire file.

So far, it wasn't that hard :-)


First container - Socket class - calculate headerSpan and footerSpan

Our file only has one child, the Socket class.

The external parser will add only one child to the file, and here are the members to calculate:

---
type: file
name: source.code
locationSpan : {start: [1,0], end: [12,1]}
footerSpan : [0,-1]
parsingErrorsDetected : false
children:

  - type : class
    name : Socket
    locationSpan : {start: [line, column], end: [line, column]}
    headerSpan : [start_char, end_char]
    footerSpan : [start_char, end_char]
    children :

In this case, since the file only has one class, the locationSpan of the class matches the one of the file, so it will be as follows:

    locationSpan : {start: [1, 0], end: [12, 1]}

Now comes the most difficult part: calculating the headerSpan and footerSpan, because both are expressed as character positions instead of line numbers and columns.

Look at the following figure to see how the headerSpan and footerSpan are calculated:

Source socket class header and footer span

Once you "number" each character, calculating both spans is straightforward.

Counting the characters of your source file is not something you want to do by hand. That's why we created the charposition Windows application to help you create your first parser manually.

Now, it is important to note that you don't need to count characters manually once you create a real parser, since this is something you will do in code :-)
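As an illustration of doing the count in code, the following Python sketch computes the character span that covers a range of whole lines. The SOURCE text is our reconstruction of the tutorial file from the declaration texts printed in the debug log in the "How to debug your parser" section, so treat its exact whitespace as an assumption. line_range_span is a hypothetical helper name.

```python
# Tutorial file, reconstructed from the debug log (Windows line endings).
SOURCE = (
    "class Socket\r\n"
    "{\r\n"
    "   void Connect(string server)\r\n"
    "   {\r\n"
    "      SocketLibrary.Connect(mSocket, server);\r\n"
    "   }\r\n"
    "\r\n"
    "   void Disconnect()\r\n"
    "   {\r\n"
    "      SocketLibrary.Disconnect(mSocket);\r\n"
    "   }\r\n"
    "}"
)


def line_range_span(text, first_line, last_line):
    """Return the inclusive [start, end] character span that covers
    lines first_line..last_line (1-based), line endings included."""
    lines = text.splitlines(keepends=True)
    start = sum(len(line) for line in lines[:first_line - 1])
    length = sum(len(line) for line in lines[first_line - 1:last_line])
    return [start, start + length - 1]


header_span = line_range_span(SOURCE, 1, 2)       # class header
connect_span = line_range_span(SOURCE, 3, 7)      # Connect plus blank line
disconnect_span = line_range_span(SOURCE, 8, 11)  # Disconnect
footer_span = line_range_span(SOURCE, 12, 12)     # closing brace
```

With this text the four spans come out as [0, 16], [17, 109], [110, 185] and [186, 186], the same values used in the tree. When reading a real file in Python, open it with newline="" so that "\r\n" is not translated to "\n", which would shift every position.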

This is how the parsed tree in YAML format looks so far:

---
type: file
name: source.code
locationSpan : {start: [1,0], end: [12,1]}
footerSpan : [0,-1]
parsingErrorsDetected : false
children:

  - type : class
    name : Socket
    locationSpan : {start: [1, 0], end: [12, 1]}
    headerSpan : [0, 16]
    footerSpan : [186, 186]
    children :

Parse the Connect() method

To parse the Connect() method we need to fill in the following members in the YAML structure:

    - type : method
      name : Connect
      locationSpan : {start: [line, column], end: [line, column]}
      span : [start_char, end_char]

And again, we need to perform a char count calculation.

The following figure explains how the spans of the method are calculated (together with the previously calculated ones for the class):

Source Connect method span

And then, the resulting structure is:

    - type : method
      name : Connect
      locationSpan : {start: [3, 0], end: [7,2]}
      span : [17, 109]

Note: Check how the method starts where the class headerSpan ends. There can't be holes between chars in the file or the parser won't work.
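A quick way to catch holes before handing the tree to Semantic is to check that the spans cover every character. The following Python sketch does that. find_gaps is a hypothetical helper, and it assumes you pass it the header, footer and terminal spans of the whole tree together with the total number of characters in the file.

```python
def find_gaps(spans, total_chars):
    """Return the inclusive [start, end] ranges not covered by any span.

    spans is a list of inclusive [start, end] pairs. Empty spans such
    as [0, -1] (the "no footer" marker) are ignored.
    """
    gaps = []
    expected = 0  # first character that has not been covered yet
    for start, end in sorted(spans):
        if end < start:
            continue
        if start > expected:
            gaps.append([expected, start - 1])
        expected = max(expected, end + 1)
    if expected < total_chars:
        gaps.append([expected, total_chars - 1])
    return gaps
```

For the tutorial file, find_gaps([[0, 16], [17, 109], [110, 185], [186, 186]], 187) returns an empty list. If Connect started at 18 instead of 17, the result would be [[17, 17]].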

Parse the Disconnect() method

Parsing the Disconnect() method follows the same logic as the previous one: identify the lines and columns for locationSpan and count the chars for "span".

Source Disconnect method span

And the resulting YAML for the method is:

    - type : method
      name : Disconnect
      locationSpan : {start: [8, 0], end: [11, 6]}
      span : [110, 185]

At this point, the entire file is parsed and the output will be ready to be consumed by Semantic.
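Once you have the character span of a declaration, the locationSpan can be derived from it. The Python sketch below uses a convention that we inferred from the numbers in this tutorial, not one stated by the guide: lines are 1-based, the start column is the 0-based column of the first character, and the end column is the 0-based column of the last character plus one. location_span is a hypothetical helper name.

```python
def location_span(text, start, end):
    """Convert an inclusive character span into a locationSpan dict.

    Assumes lines end in "\n" or "\r\n". The column convention is
    inferred from the tutorial's examples (see the lead-in).
    """
    def line_and_column(offset):
        # Index of the first character of the line that holds offset.
        line_start = text.rfind("\n", 0, offset) + 1
        # Lines are 1-based: count the line breaks before offset.
        line = text.count("\n", 0, offset) + 1
        return line, offset - line_start

    start_line, start_col = line_and_column(start)
    end_line, end_col = line_and_column(end)
    return {"start": [start_line, start_col],
            "end": [end_line, end_col + 1]}
```

Applied to the tutorial file (with Windows line endings), the span [17, 109] gives start [3, 0] and end [7, 2], and [110, 185] gives start [8, 0] and end [11, 6], which match the values above.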


How the parsed file looks inside SemanticDiff

The following screenshot shows how SemanticDiff can handle a couple of files parsed by an external parser:

Initial sample inside Semantic

The file on the left matches the one parsed during the tutorial, while the one on the right is a modification that moves a method and modifies it.

As you can see, Semantic can track the moved method in the custom language, and find differences inside its body.

Visual Diff also works correctly as expected:

Initial sample inside Semantic - Visual Diff

Finally, this is the sample parser running inside Plastic SCM in the Pending changes view:

Initial sample inside Plastic


How to invoke Semantic with an external parser

For SemanticDiff and SemanticMerge, it is very simple to invoke Semantic with an external parser:

semanticmergetool.exe -s sample\source.code -d sample\destination.code -ep=codeparser.exe

Where codeparser is the sample parser we created for this tutorial. You can find the sample parser here.

For Plastic SCM, you have to create an externalparsers.conf file as follows:

C:\Users\pablo\AppData\Local\plastic4>type externalparsers.conf
.code=C:\Users\pablo\wkspaces\semantic-external-parsers\codeparser\bin\debug\codeparser.exe

How to handle comments

The sample file we used for the example didn't have any comments. Let's now add a few comments to see how we should handle them:

Handling comments source file

As you see, we are using a pseudo-C# language, not a real one.

We added a couple of include statements that were not present before, and some comments.

Let's now find out how to build the parser by using charposition:

Handling comments spans explained

There are some interesting points to highlight:

  • I'm associating the comments above a declaration to the declaration itself. This happens both for the comment before the Socket class and the comment before the Connect method. I consider both as part of their corresponding declarations.
  • I'm considering that a method starts on the line right after the one where the previous method ends. You can see this in Disconnect: it starts at the character right after the end of the line where Connect ends. In this case there are no comments there, but if there were, they would be considered part of Disconnect.
  • Check the footerSpan of the class. I consider the last comment as part of the footer of the Socket container applying the same concept I used for Disconnect. The footer starts just after the line where the last method ends, and anything there not being another node will be just part of the footer.

I'm including the comments just above the declarations as part of the declaration because I consider they will be moved together with the method, and because comments normally go before the declaration. By the way, custom parsers could use different techniques. The key thing is that every single character in the file is mapped so that Semantic knows how to handle it.
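The rule used here (each declaration starts right after the previous one ends, so comments in between belong to the next declaration) can be sketched in Python as follows. It is just one possible technique. contiguous_spans is a hypothetical helper, and it assumes your parser already knows the offset of the last character of each declaration, in file order.

```python
def contiguous_spans(first_offset, declaration_ends):
    """Build gap-free spans for sibling declarations.

    first_offset is the character right after the container header.
    declaration_ends is a list of (name, end_offset) pairs in file order,
    where end_offset is the last character of the declaration's last line.
    Anything between two declarations (comments, blank lines) ends up
    at the beginning of the declaration that follows it.
    """
    spans = []
    start = first_offset
    for name, end in declaration_ends:
        spans.append((name, [start, end]))
        start = end + 1
    return spans
```

With the numbers of this sample, contiguous_spans(95, [("Connect", 275), ("Disconnect", 353)]) yields [95, 275] for Connect and [276, 353] for Disconnect. The class footer then starts at 354, right after the last span.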

The resulting declarations tree in YAML is as follows:

---
type: file
name: sample\source.code
locationSpan : {start: [1,0], end: [22,1]}
footerSpan : [0,-1]
parsingErrorsDetected : false
children:
  - type : include
    name : sockets
    locationSpan : {start: [1, 0], end: [1, 18]}
    span : [0, 17]
  - type : include
    name : system
    locationSpan : {start: [2, 0], end: [2,17]}
    span : [18, 34]
  - type : class
    name : Socket
    locationSpan : {start: [3,0], end: [22,1]}
    headerSpan : [35, 94]
    footerSpan : [354, 430]
    children :
    - type : method
      name : Connect
      locationSpan : {start: [7, 0], end: [13,6]}
      span : [95, 275]
    - type : method
      name : Disconnect
      locationSpan : {start: [14,0], end: [18,6]}
      span : [276, 353]

Here is how SemanticDiff looks displaying the previous sample, plus another file based on the previous:

Handling comments SemanticDiff

You can find the sample parser code for this scenario here.

You can learn more about different techniques to handle comments in this Forum thread.


How to handle parsing errors

What if you find errors during parsing? You can fill in the parsingError collection to notify the user that something went wrong and maybe the declarations tree can't be correctly calculated.

Let's look at the following example in the pseudo-programming language we used in this tutorial:

Parsing error - Missing semicolon

As you can see there is a ; missing at the end of line 5 (highlighted in the picture).

The declarations tree to handle this case can be as follows:

---
type: file
name: sample\source.code
locationSpan : {start: [1,0], end: [12,1]}
footerSpan : [0,-1]
parsingErrorsDetected : true
children:
  - type : class
    name : Socket
    locationSpan : {start: [1,0], end: [12,1]}
    headerSpan : [0, 16]
    footerSpan : [186, 186]
    children :
    - type : method
      name : Connect
      locationSpan : {start: [3, 0], end: [7,2]}
      span : [17, 109]
    - type : method
      name : Disconnect
      locationSpan : {start: [8,0], end: [11,6]}
      span : [110, 185]
parsingError:
  - location: [5,45]
    message: "Missing ; at the end of the line"

And then, if you launch SemanticDiff, you will get the following warning:

Parsing error - Warning message

SemanticDiff and SemanticMerge warn you that the parser found issues and give you the choice to launch the text-based tool instead.

You can find the sample code for this example here.


How to handle the file footerSpan

So far all the examples had a single class, and the class perfectly matched the file, so the end of the file and the end of the main class were the same.

That's why all the declarations trees so far specify the file node roughly as follows:

---
type: file
name: sample\source.code
locationSpan : {start: [1,0], end: [12,1]}
footerSpan : [0,-1]
parsingErrorsDetected : false

Here the footerSpan is always set to [0, -1], which means there is no footer.

But, let's now consider the following case:

File footer sample explanation

There are a few lines after the Socket class ends. You could certainly consider these lines as part of the Socket class (the footerSpan of the Socket class indeed), but you could also consider it just as part of the file footerSpan. In the old days of Pascal (still relevant for Delphi programmers), every unit ended with end; and it didn't matter what you wrote after that, the parser would simply ignore it. You can achieve the same thanks to footerSpan in the file.

The entire declarations tree will be as follows:

---
type: file
name: sample\source.code
locationSpan : {start: [1,0], end: [17,30]}
footerSpan : [189, 316]
parsingErrorsDetected : false
children:
  - type : class
    name : Socket
    locationSpan : {start: [1,0], end: [12,1]}
    headerSpan : [0, 16]
    footerSpan : [186, 186]
    children :
    - type : method
      name : Connect
      locationSpan : {start: [3, 0], end: [7,2]}
      span : [17, 109]
    - type : method
      name : Disconnect
      locationSpan : {start: [8,0], end: [11,6]}
      span : [110, 185]

You can find the code of this example here.


How to debug your parser

While you develop your external parser, you will want to launch Semantic (or Plastic) to check if everything works as expected.

First, you need to launch Semantic or Plastic SCM using your parser. To do that, check the section Configure external parsers.

Sometimes, issues will happen because the declarations tree is not filled in correctly. If that happens, besides checking the error messages, looking at the logs might be very useful.

When you run SemanticDiff or SemanticMerge, check semantic.log.txt and then filter the log lines containing ExternalParser. You will see something as follows:

2016-11-06 21:39:30,730 DEBUG [10460] ExternalParser - Obtaining the descriptor file from the source file "sample\source.code"
2016-11-06 21:39:30,733 DEBUG [10460] ExternalParser - DescriptorFile:
---
type: file
name: sample\source.code
locationSpan : {start: [1,0], end: [17,30]}
footerSpan : [189, 316]
parsingErrorsDetected : false
children:
  - type : class
    name : Socket
    locationSpan : {start: [1,0], end: [12,1]}
    headerSpan : [0, 16]
    footerSpan : [186, 186]
    children :
    - type : method
      name : Connect
      locationSpan : {start: [3, 0], end: [7,2]}
      span : [17, 109]
    - type : method
      name : Disconnect
      locationSpan : {start: [8,0], end: [11,6]}
      span : [110, 185]

2016-11-06 21:39:30,820 DEBUG [10460] ExternalParser - Obtaining the syntax tree from the descriptor file
2016-11-06 21:39:30,821 DEBUG [10460] ExternalParser - Processing the file node "sample\source.code"
2016-11-06 21:39:30,823 DEBUG [10460] ExternalParser - FooterText of the declaration: "\r\nthis is the bottom of the file\r\nand some languages don't  fail\r\nto parse  if  there  are extra\r\nlines after the the code  ends"
2016-11-06 21:39:30,826 DEBUG [10460] ExternalParser - Processing the container node "Socket" ("class" type)
2016-11-06 21:39:30,826 DEBUG [10460] ExternalParser - HeaderText of the declaration: "class Socket\r\n{\r\n"
2016-11-06 21:39:30,826 DEBUG [10460] ExternalParser - FooterText of the declaration: "}"
2016-11-06 21:39:30,827 DEBUG [10460] ExternalParser - Processing the terminal node "Connect" ("method" type)
2016-11-06 21:39:30,827 DEBUG [10460] ExternalParser - Text of the declaration: "   void Connect(string server)\r\n   {\r\n      SocketLibrary.Connect(mSocket, server);\r\n   }\r\n\r\n"
2016-11-06 21:39:30,827 DEBUG [10460] ExternalParser - Processing the terminal node "Disconnect" ("method" type)
2016-11-06 21:39:30,827 DEBUG [10460] ExternalParser - Text of the declaration: "   void Disconnect()\r\n   {\r\n      SocketLibrary.Disconnect(mSocket);\r\n   }\r\n"

The interesting thing is that you will see how your tree is loaded and how each declaration is handled. If something goes wrong, you will find out whether the tree can't be loaded at all, whether there is a YAML error, or simply up to which declaration the system managed to process.
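If you prefer to filter the log from a script, a few lines of Python are enough. This is a sketch with assumptions: external_parser_lines is a hypothetical helper, the log is read as UTF-8 with undecodable bytes replaced (the actual log encoding is not stated in this guide), and only lines that contain the marker are kept, so the multi-line descriptor dump that follows "DescriptorFile:" is not included.

```python
def external_parser_lines(log_path, marker="ExternalParser"):
    """Return the log lines that mention the given logger name."""
    with open(log_path, encoding="utf-8", errors="replace") as log:
        return [line.rstrip("\r\n") for line in log if marker in line]
```

For example, external_parser_lines("semantic.log.txt") returns the ExternalParser entries of a SemanticMerge session, provided the log file is in the current directory.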

In case you are using Plastic SCM, check the plastic.log.txt log file.

In both cases, check the log configuration file (semantic.log.conf or plastic.log.conf) and make sure the following logger (we use standard log4net) is enabled:

<logger name="ExternalParser">
    <level value="DEBUG" />
</logger>

Configure external parsers


SemanticMerge and SemanticDiff

You can use the -ep argument when launching SemanticDiff and SemanticMerge to specify an external parser as follows:

semanticmergetool.exe -s sample\source.code -d sample\destination.code -ep=codeparser.exe

There is no way in SemanticDiff and SemanticMerge to specify parsers associated to file extensions yet (unlike what happens in Plastic SCM).

It is possible to predefine a single external parser to be used in all sessions so the -ep param doesn't have to be specified. To do that, just go to Configuration/General and then:

SemanticMerge - Configure external parser

This setting actually modifies the configuration file semanticmergetool.conf which stores all the arguments to be passed to the tool by default.

Remark: In case you are building a parser in Java, there is a -vm argument worth knowing about:

virtualMachine:      {-vm | --virtualmachine}=<path to the Java Virtual Machine executable>

Plastic SCM

To configure Plastic SCM to use your external parser, you have to create an externalparsers.conf file in your Plastic SCM configuration directory.

The path will be: C:\Users\<your-user-name>\AppData\Local\plastic4\externalparsers.conf

Example:

C:\Users\pablo\AppData\Local\plastic4\externalparsers.conf

Then, for each external parser you want to use, specify the extension that it will handle and the full path of your parser.

Example:

.code=C:\Users\pablo\wkspaces\semantic-external-parsers\codeparser\bin\debug\codeparser.exe

Each parser is associated to a file extension and you can define as many parsers as you need.


Why YAML?

Why did we decide to use YAML instead of plain text, XML or JSON for external parsers to specify the declarations tree?

YAML has the following advantages:

  • There are YAML parsers for all programming languages.
  • YAML is a superset of JSON. So, if someone wants to write the descriptor in JSON, we will be able to parse it with the YAML parser.
  • It is human readable.
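Based on the second point, a parser could emit the descriptor with a JSON library instead of building YAML by hand. The Python sketch below relies on the statement above that JSON descriptors are accepted. We have not verified it against every Semantic version, so take it as an assumption to test with your setup. write_descriptor and the dictionary layout are hypothetical names for illustration.

```python
import json


def write_descriptor(tree, output_path):
    """Write the declarations tree as JSON, which is also valid YAML."""
    with open(output_path, "w", encoding="utf-8") as out:
        json.dump(tree, out, indent=2)


# Example tree: a file whose only child is one terminal node.
EXAMPLE_TREE = {
    "type": "file",
    "name": "source.code",
    "locationSpan": {"start": [1, 0], "end": [1, 18]},
    "footerSpan": [0, -1],
    "parsingErrorsDetected": False,
    "children": [
        {
            "type": "include",
            "name": "sockets",
            "locationSpan": {"start": [1, 0], "end": [1, 18]},
            "span": [0, 17],
        }
    ],
}
```

Calling write_descriptor(EXAMPLE_TREE, output_path) with the output path received from Semantic writes the file. Note that json.dump writes false for Python's False, which matches the boolean spelling used in the YAML samples of this guide.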


Last updated

October 19, 2017
  • Now, the built-in parsers support C++ too.

June 1, 2017
  • We added encoding support for the External parsers.

December 13, 2016
  • Launching the External parsers guide.