File Association Databases – what and why

By JNaude on August 14, 2012 in Blog

At the core of the Scineric Workspace architecture lies something called “File Association Databases”. A database is a well-structured XML file that stores a set of rules and definitions which Scineric uses to manage things in a design. These files adhere to the IP-XACT standard as far as the standard allows.

This blog post digs into the reasoning behind these databases and explains how and why they are a very powerful concept when applied to file management. Let’s start at the very beginning.

Classification of different file types

Identifying and classifying files is nothing new. It has been around for many years and there are a number of proven approaches to identifying a file. Once a file’s content has been identified it can be classified in various ways; one common approach is to use something called a MIME type. MIME types allow you to classify a file using a standardised system of identifiers. For example, an HTML document is classified as text/html and a GIF image is classified as image/gif. Many applications use this type of information to determine what to do with a file. To illustrate: when you open a C++ file in a text editor that supports syntax highlighting, the editor will try to determine the file type and then use the appropriate syntax highlighting for it. When you open a VHDL file in the same editor, the syntax highlighting will be different.
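
To make the idea concrete, here is a minimal sketch using the `mimetypes` module from Python's standard library. It is only an illustration of suffix-based MIME lookup, not how any particular editor does it; the `classify` helper is a name made up for this example, and it looks at the file name only, not the file's content.

```python
import mimetypes

# A standalone MimeTypes instance is populated from Python's built-in table.
mime_db = mimetypes.MimeTypes()


def classify(filename):
    """Return the MIME type guessed from the file name, or None if unknown."""
    mime_type, _encoding = mime_db.guess_type(filename)
    return mime_type


print(classify("index.html"))  # text/html
print(classify("logo.gif"))    # image/gif
```

A file whose suffix is not in the table simply comes back as `None`, so the application has nothing to act on.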

There is a whole database of known MIME types and it is typically stored on your system somewhere, allowing applications to query it. If a file you are interested in is not in the standard list of MIME types, you can apply to get it registered. Below is a screenshot showing the MIME type management functionality in the Qt Creator IDE. Notice that it supports multiple approaches when identifying a file (file patterns, magic headers etc.).

All of this works well, and has been working well, for a long time. However, one of the shortcomings of the standard MIME database is that it is limited: it does not contain MIME type definitions for files specific to specialised fields. For example, firmware-related files produced by the Xilinx toolset are not found in the database. I don’t believe the solution is for a company like Xilinx to register MIME types for all the files that its tools spit out. It’s simply not practical. Instead, in some cases vendors decide to document the most important files that their tools produce. The problem with such a list is that it becomes outdated as new versions of a tool produce new types of files and stop producing some of the files in the previous version’s list.

It is clear that there is a gap here. We need a way to identify and classify files produced by the tools we use on a daily basis, be they vendor tools or our own tools. Before we look at the approach Scineric Workspace takes to solve these problems, it’s probably a good idea to ask ourselves whether it really matters to solve them.

Why does it matter?

Let’s look at one example where it does matter: revision control.

If you are not familiar with revision control systems, a quick read through this will get you on the same page. If you are, you probably already know where I’m going with this. In either case, let’s have a closer look.

It’s good practice to keep only the necessary information in your version-controlled repository: things like source files and important metadata about those files. The reason is that files checked into a version control system remain there forever (or at least for the lifetime of the repository), unless you use advanced commands of “your favourite version control system here” that allow you to remove a file across all revisions. If you delete a file and step the revision number of the repository, the file is gone in the latest revision. However, it is still present in the previous revision, so that you can restore your working copy to that earlier point in time.

Naturally there are different opinions on things that should stay in and out of a repository, and these opinions will most likely be different from team to team. However, most people will agree that generated items should stay unversioned unless there is a good reason for it.

If we now take a step back to our discussion on the classification of the files produced by the tools we use on a daily basis, it’s easy to see that some of those files need to end up in the repository while others don’t. You also need to be able to classify these files in the field that you are working in, in order to make sure generated files in that field do not end up in the repository. All the version control systems that I’m aware of allow you to specify a wildcard-matched list of files that must be ignored by version control operations. For example, a software engineer will use something like this: *.a *.dll *.so etc., meaning that object code and generated libraries must be ignored. A firmware engineer on the other hand will use a list like this: *.ngc *.edf *.bit etc. The screenshot below shows the ignore list set using the svn:ignore property in the Subversion version control system.
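
As a rough illustration of what wildcard matching against such a list means, here is a small Python sketch using the standard `fnmatch` module. The two lists are the examples from the text; the `is_ignored` helper is a hypothetical name, and real version control systems each have their own matching rules, so treat this as an approximation only.

```python
from fnmatch import fnmatchcase

# Example ignore lists taken from the text above.
SOFTWARE_IGNORE = ["*.a", "*.dll", "*.so"]
FIRMWARE_IGNORE = ["*.ngc", "*.edf", "*.bit"]


def is_ignored(filename, patterns):
    """Return True if the file name matches at least one wildcard pattern."""
    return any(fnmatchcase(filename, pattern) for pattern in patterns)


print(is_ignored("top.bit", FIRMWARE_IGNORE))    # True
print(is_ignored("top.vhd", FIRMWARE_IGNORE))    # False
print(is_ignored("libfoo.so", SOFTWARE_IGNORE))  # True
```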

The second thing that most people will agree on is that everyone using a repository must at least use the same ignore list. Otherwise you will not check in your .dll files, but your team mate will check in theirs, and when you update your working copy you end up checking out their generated files.

Now that we have looked at one example of why it is important, let’s start to look at possible solutions.

Possible solutions

I know that there are many people out there who have their own solutions to these problems. I followed a recent discussion titled “Can anyone recomend an FPGA source code configuration management tool?” related to this problem in the “FPGA – Field Programmable Gate Array” LinkedIn group with great interest. Many talked about their version control system, followed by some comments on the advantages and disadvantages. However, later in the discussion some people rightly pointed out that the real problem is not the version control system being used, but knowing which files to ignore and which files not to ignore.

Someone said that they just sort all files by size and delete the largest and most obviously unnecessary files. If possible, you can also force the process that generates your files to put them in a different directory than your source files and you just ignore the complete generated files directory. Yet another solution is to create your own ignore list of files that should be ignored in your field, but maintaining and sharing this list again becomes a problem.

Sure, these solutions work, but they are not ideal. Now let’s take a step forward and look at the solution proposed by, and built into, Scineric Workspace.

Scineric’s solution: File Association Databases

Scineric’s solution is to pack a list of file associations together in a format that can be easily shared. We call such a list a File Association Database.

Amongst other things, this database serves as a place to define and store information about files used within a field of interest. For each file association we can store a description and some extended properties. At present files are identified using their file name suffixes, and this approach has worked well enough so far. The screenshot below shows the associations in a database dedicated to files produced by the Xilinx design tools.

Apart from the file associations, the database also specifies a structure to which files must be hooked when viewing the files in a design. For example, .vhd and .v files should be grouped as design files, while generated files must be grouped accordingly. Below is a screenshot of the design structure defined by the HDL Design database, the default for firmware designs.
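
The two ideas above (suffix-based associations and a structure that files are hooked to) can be sketched in a few lines of Python. This is purely an illustration under my own assumptions: the class names, field names, node names and the "Unclassified" fallback are invented for the example and are not Scineric's actual data model or API.

```python
from dataclasses import dataclass, field
from pathlib import PurePath


@dataclass
class FileAssociation:
    suffix: str          # file name suffix used to identify the file
    description: str     # human-readable description
    node: str            # structure node the file is hooked to (assumed name)
    properties: dict = field(default_factory=dict)  # extended properties


class FileAssociationDatabase:
    def __init__(self, name, associations):
        self.name = name
        self._by_suffix = {a.suffix.lower(): a for a in associations}

    def lookup(self, filename):
        """Return the association for a file name, or None if it is unknown."""
        return self._by_suffix.get(PurePath(filename).suffix.lower())

    def group_by_node(self, filenames):
        """Hook each file to its structure node; unknown files are set aside."""
        grouped = {}
        for filename in filenames:
            association = self.lookup(filename)
            node = association.node if association else "Unclassified"
            grouped.setdefault(node, []).append(filename)
        return grouped


hdl_design = FileAssociationDatabase("HDL Design (sketch)", [
    FileAssociation(".vhd", "VHDL source file", "Design Files"),
    FileAssociation(".v", "Verilog source file", "Design Files"),
    FileAssociation(".bit", "Generated bitstream", "Generated Files",
                    {"generated": True}),
])

print(hdl_design.group_by_node(["top.vhd", "alu.v", "top.bit", "notes.txt"]))
```

Keying on the suffix keeps the lookup trivial, which matches the remark that suffix-based identification has worked well enough so far.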

Over time the capabilities and amount of information stored in these databases have grown steadily and today they don’t only store a list of file associations and the structure of a design. Here is a short list of other useful things stored in a database:

Folder associations store rules for folders (directories), which are managed by a design.
The structure of a design also stores a complete mapping to a directory structure that the design uses to manage its working folder. For example, we can specify that all files hooked to NODE A are stored in a folder called “/node_a_files” and all files hooked to NODE B are stored in a folder called “/node_b_files”.
You can store item grouping information. Thus, you can create a grouping called “Xilinx Reports” and indicate that all Xilinx report files (map, par, mrp, etc.) are included in that grouping.
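
The directory mapping and grouping ideas in the list above could look something like the following Python sketch. The node and folder names come from the example in the text; the dictionary layout, the helper names and the assumption that the report types map to the suffixes `.map`, `.par` and `.mrp` are mine.

```python
from pathlib import PurePosixPath

# Structure node -> folder inside the design's working folder (from the text).
NODE_FOLDERS = {
    "NODE A": "/node_a_files",
    "NODE B": "/node_b_files",
}

# Grouping name -> suffixes that belong to it (suffix spelling is assumed).
GROUPS = {
    "Xilinx Reports": {".map", ".par", ".mrp"},
}


def target_path(node, filename):
    """Return where a file hooked to the given node would be stored."""
    return str(PurePosixPath(NODE_FOLDERS[node]) / filename)


def groups_for(filename):
    """Return the names of all groupings that include this file."""
    suffix = PurePosixPath(filename).suffix
    return sorted(name for name, suffixes in GROUPS.items() if suffix in suffixes)


print(target_path("NODE A", "top.vhd"))  # /node_a_files/top.vhd
print(groups_for("top.mrp"))             # ['Xilinx Reports']
```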

Last but not least, you can extend databases with other databases. This allows us to create an extendible database and extend it using multiple databases which add (or overwrite) information in the database being extended. This is a very powerful feature. For example, the HDL Design database can be extended by numerous vendor databases. Scineric ships with extensions for Xilinx, Altera and Mentor Graphics tools by default as shown below.
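
Conceptually, "adds or overwrites" behaves like merging dictionaries in order. The Python sketch below shows that idea only; the `extend` function, the suffix-to-description layout and the sample entries are assumptions made for illustration, not Scineric's implementation.

```python
def extend(base, *extensions):
    """Return a new mapping: the base plus each extension, applied in order."""
    merged = dict(base)
    for extension in extensions:
        # New suffixes are added; suffixes already present are overwritten.
        merged.update(extension)
    return merged


hdl_design = {".vhd": "VHDL source", ".v": "Verilog source"}
vendor_extension = {".bit": "Generated bitstream",
                    ".v": "Verilog source or netlist"}

combined = extend(hdl_design, vendor_extension)
print(combined)
```

Because `extend` copies the base first, the database being extended is left untouched and can be reused with a different set of extensions.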

Nothing stops any other vendor from creating an extension database for their product.

Creating, sharing and distributing databases

As we’ve seen when we looked at possible solutions, many people have come up with their own processes to manage things. Below is an example to illustrate this:

We have two firmware development teams. Team A uses a version control ignore list that works for them, while team B uses a different ignore list that works for them. Now let’s say team A uses Xilinx EDK but not Xilinx System Generator. Team B on the other hand uses System Generator but not EDK. Naturally team A will have outputs of EDK in their ignore list, and team B will have outputs of System Generator in theirs.

In this example, nobody’s list is “wrong”; it’s just limited to the scope of the work that the team is doing. Scineric’s databases provide a platform where both teams can share their knowledge. Each team can extend the HDL Design database and re-use the lessons learned by the other team without even thinking about it. In other words, Scineric provides a platform to use the knowledge and power of the crowd in your designs, and to easily share your knowledge with the crowd.
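
In the simplest possible terms, pooling the two teams' knowledge amounts to taking the union of their lists. The sketch below shows this in Python; the helper name is hypothetical, and `*.edk_out` and `*.sysgen_out` are made-up placeholder patterns, not real output suffixes of EDK or System Generator.

```python
def merge_ignore_lists(*lists):
    """Combine ignore lists, dropping duplicates but keeping first-seen order."""
    return list(dict.fromkeys(pattern for lst in lists for pattern in lst))


# Placeholder patterns: "*.edk_out" and "*.sysgen_out" are invented here.
team_a = ["*.ngc", "*.bit", "*.edk_out"]
team_b = ["*.ngc", "*.bit", "*.sysgen_out"]

shared = merge_ignore_lists(team_a, team_b)
print(shared)  # ['*.ngc', '*.bit', '*.edk_out', '*.sysgen_out']
```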

The features below facilitate this type of interaction:

Databases are simple XML files and as a result you can email someone your database and they can immediately use it in their designs.
When you save your design, the database (and its extensions) are saved inside your design’s IP-XACT component file. Hence, you can send your design to someone else. When they open your design, Scineric will check if they already have the required databases and install them if necessary.
Databases have versions. When you edit a database, its version steps. When a team member updates the database, all team members benefit from the changes. By default, this happens behind the scenes without you noticing; alternatively, you can enable notifications to give you complete control over these database updates.
You can lock a database to indicate that it should not be modified.
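
To give a feel for how versioning, locking and XML storage fit together, here is a small Python sketch built on the standard `xml.etree.ElementTree` module. Everything about it is an assumption for illustration: the class, the `LockedError` exception and especially the XML element and attribute names are invented and do not reflect the IP-XACT schema or Scineric's real file format.

```python
import xml.etree.ElementTree as ET


class LockedError(Exception):
    """Raised when someone tries to modify a locked database."""


class Database:
    def __init__(self, name):
        self.name = name
        self.version = 1
        self.locked = False
        self.associations = {}  # suffix -> description

    def add_association(self, suffix, description):
        if self.locked:
            raise LockedError(f"database '{self.name}' is locked")
        self.associations[suffix] = description
        self.version += 1  # every edit steps the version

    def to_xml(self):
        """Serialise to a plain XML string (hypothetical element names)."""
        root = ET.Element("database", name=self.name, version=str(self.version))
        for suffix, description in sorted(self.associations.items()):
            ET.SubElement(root, "association", suffix=suffix).text = description
        return ET.tostring(root, encoding="unicode")

    @classmethod
    def from_xml(cls, text):
        """Rebuild a database from the XML produced by to_xml()."""
        root = ET.fromstring(text)
        db = cls(root.get("name"))
        db.version = int(root.get("version"))
        for node in root.findall("association"):
            db.associations[node.get("suffix")] = node.text
        return db


xilinx = Database("Xilinx (sketch)")
xilinx.add_association(".bit", "Generated bitstream")
received = Database.from_xml(xilinx.to_xml())  # e.g. after emailing the file
print(received.version, received.associations)
```

Comparing the `version` of a received database with the installed one is then enough to decide whether an update is needed.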

When you create a new database, you can tag it with publisher information as shown below. This also allows users of your database to know who to contact with questions or suggestions about the database.

Over time, databases created by experts in a field can build up to a comprehensive resource of crowdsourced rules about files and processes in the applicable field. The possibilities are endless.

Making it work without you thinking about it

I’ve seen in my team that most people use Scineric without even knowing that they are using a database under the hood. This is a good thing; it means that the internal complexities related to managing databases are conveniently hidden from the normal user. However, the power user will be able to fine-tune the way his/her databases manage files. Best of all, the normal user will be able to reuse these fine-tuned databases without doing any work.

To end this post, let’s look at the way you add new designs to Scineric.

The left column lists the types of design currently available after inspecting all loaded plugins. When you make a selection in the left column, the middle column will show the databases available for that design type. The above screenshot shows that there are currently only two databases to choose from, Default and Software. When we select “Firmware Design” we will get a different list of databases to choose from.

When you go ahead and select an option, you will be prompted with a wizard allowing you to customize the new design. All wizards start with a variation of the page shown below.

The page shows you the Database you selected, and the target working folder (and therefore also the repository location) where the new design will be created. Notice that the location of this target folder is defined by the database you have selected. In the above example I specified that the Software database must use a repository called HelloWorld for demonstration purposes.

What’s next

We have been fine-tuning this solution for a long time and we are continuously finding new things that we want to add to it. What we have currently is an extendible platform to enable crowdsourced file management, and we would love to see how you use it.

Please let us know about your experiences, feedback and suggestions.