Extractor: a source code extractor

1. Overview

Extractor is a tool for extracting information from C++ or Java source files.

If you write technical documents such as program documentation, how-tos (like this one) or other mostly software-oriented documents, you very commonly need to show some source code. To limit the chance of documentation errors, these source code examples should be real-world examples: they should be extracted from real example code or production code, and they should not be cut and pasted into the document or written directly in the document. Doing that leads to untested and unchecked examples with a high risk of being wrong or outdated.

AsciiDoctor is capable of including source code files into the document.

But real-world examples are mostly too long to be included in full. The document is much more readable if you can focus on the important parts of the source code and leave out the irrelevant parts. This is what extractor is for:

  • extracting parts (snippets) from real world source code.

  • omitting lines of the source code

  • generating auto-callouts from comments of the source-code

  • highlighting parts of the source-code

2. Prerequisites

2.1. Compiler

You need a C++14-capable compiler (both GCC and Clang will do).

2.2. Libraries

The following system libraries are required:

2.3. Tools

Not strictly needed, but helpful: GNU Source Highlight.

And of course: AsciiDoctor

3. Installation

3.1. Download the source

Download the source from SourceForge.

3.2. Build

On Linux you simply type:

Listing 1. Command 1.
$ make all

3.3. Install

At the moment there is no special install step: you have to run extractor from the source directory, where it has been built. Or you can manually copy it to other places, because its internal library has been statically linked during build.

4. Usage

The following example will show the basic idea behind extractor.

4.1. Simple Example

Suppose you want to document the following very simple C++ example:

Listing 2. Lines from the file test.cc
#include <iostream>
#include <iomanip>
int main() {
    return EXIT_SUCCESS;
}

You might want to describe the part of the source file that contains the include directives separately. It would therefore be nice to extract that part of the file, like this:

Listing 3. Lines from the file test.cc [Snippet: include]
#include <iostream>
#include <iomanip>

In a later section of your documentation you want to emphasize the int main() function:

Listing 4. Lines from the file test.cc [Snippet: main]
int main() {
    return EXIT_SUCCESS;
}

For this to be possible without manually copying anything from your sources, you have to mark these parts directly in your sources. These parts are called snippets.

Listing 5. A simple source snippet definition
//[<name> (1)
...
//]  (2)
1 Start of snippet name
2 End of snippet name

Source snippets resemble the AsciiDoctor feature of include tags, but they have to be strictly nested: they must not overlap.

With this you can annotate your real source code with the necessary snippet definitions:

Listing 6. The file test.cc with the source snippets include and main
//[include
#include <iostream>
#include <iomanip>
//]
//[main
int main()
{
    return EXIT_SUCCESS;
}
//]

Then you run the extractor for your file test.cc:

Listing 7. Command 2.
$ extractor test.cc

The result is the file test.extractor with the following contents:

Listing 8. The contents of the snippet database file
Snippet [ all [ ( 0 , 6 ) ] exclude [  ] ]
Snippet [ include [ ( 0 , 2 ) ] exclude [  ] ]
Snippet [ main [ ( 2 , 6 ) ] exclude [  ] ]
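
To make the numbers in the database file less mysterious, here is a minimal C++14 sketch of how such ranges could be derived. It is not extractor's actual implementation: the function name collectSnippets and the reading of ( 0 , 2 ) as a half-open line range, counted after the snippet definition lines have been removed, are assumptions that merely agree with the output shown above.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Half-open line range [first, second), counted in the lines that remain
// once the snippet definition lines are removed (assumed interpretation).
using Range = std::pair<int, int>;

// Hypothetical helper: scans the source text and collects the ranges
// covered by each snippet name. A stack keeps track of nested snippets.
std::map<std::string, std::vector<Range>> collectSnippets(const std::string& source) {
    std::map<std::string, std::vector<Range>> result;
    std::vector<std::pair<std::string, int>> open;  // (name, start line)
    std::istringstream in(source);
    std::string line;
    int lineNo = 0;
    while (std::getline(in, line)) {
        const auto pos = line.find_first_not_of(" \t");
        const std::string t = pos == std::string::npos ? std::string() : line.substr(pos);
        if (t.rfind("//[", 0) == 0) {
            // The snippet name is the first word after the marker.
            std::istringstream rest(t.substr(3));
            std::string name;
            rest >> name;
            open.emplace_back(name, lineNo);
        } else if (t.rfind("//]", 0) == 0) {
            // The sketch assumes balanced markers.
            assert(!open.empty() && "unbalanced //] marker");
            if (open.empty()) continue;
            result[open.back().first].emplace_back(open.back().second, lineNo);
            open.pop_back();
        } else {
            ++lineNo;  // only ordinary lines are counted
        }
    }
    // Snippets that are never closed are silently dropped here.
    result["all"].emplace_back(0, lineNo);
    return result;
}
```

Applied to the annotated test.cc from above, this sketch yields include = (0, 2), main = (2, 6) and all = (0, 6).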

At the moment this file isn’t very useful (it becomes very handy once you automate building the whole documentation). But if you look carefully into the directory containing test.cc, you’ll find a newly created directory named .extractor:

total 12
-rw-r--r-- 1 lmeier lmeier 197 Apr 22 12:58 test.cc.all
-rw-r--r-- 1 lmeier lmeier 175 Apr 22 12:58 test.cc.include
-rw-r--r-- 1 lmeier lmeier 173 Apr 22 12:58 test.cc.main

These files contain the snippets, e.g. the file test.cc.main obviously contains the snippet main of file test.cc. The file test.cc.all contains the full file test.cc but without the snippet definitions.

In your AsciiDoctor documentation files you can include these snippet files. It is especially useful to define some attributes such as srcbase, srcdir and extractordir:

include::{srcbase}/{srcdir}/{extractordir}/test.cc.main[]

The content of the snippet file is AsciiDoc:

.Zeilen aus der Datei link:{srcbase}/{srcdir}/test.cc.html[`test.cc`,window="_new"] [Snippet: main]
[source,cpp,indent=0]
----
int main() {
    return EXIT_SUCCESS;
}
----

Please note that this file contains AsciiDoc syntax: in this simple case it generates a caption with a link to the file containing the displayed snippet. The generated caption text is German ("Zeilen aus der Datei" means "lines from the file"). If you click on the link, you should get a new browser window showing the original code, highlighted via source-highlight, as an HTML file (see also Using make to generate the snippets).

If the snippet contains auto-callouts, these are also collected into the snippet file (see also Auto-Callouts).

If you use the above include-macro in your documentation you’ll get the following result:

Listing 9. Lines from the file test.cc [Snippet: main]
int main() {
    return EXIT_SUCCESS;
}

Ok, that’s the simple story.

4.2. Source Annotations

There are several source annotations which extractor understands: snippets (simple or compound), omitted lines, auto callouts and highlighting (marking).

4.2.1. Source snippets

Source snippets are divided into simple and compound snippets.

4.2.1.1. Simple snippets

As stated above simple snippets are defined by the special comments in the source files. Snippets must not be overlapping, but they can (and usually should) be nested:

Listing 10. Nested simple snippet definitions
//[<snippet1> (1)
...
//[<snippet2> (2)
...
//] (3)
...
//]  (4)
1 Start of snippet1
2 Start of snippet2
3 End of snippet2
4 End of snippet1

The following source code gives an example of two nested snippets with names pragma and Abc:

Listing 11. The file nested.cc
#include <cstdlib>

//[pragma
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wunused"
//[Abc
class Abc {
public:
    Abc() {}
private:
    int mX = 0;
};
//]
#pragma GCC diagnostic pop
//]
int main() {
    return EXIT_SUCCESS;
}

To include them, use the following line in your adoc file for the outer snippet named pragma:

include::{srcbase}/{srcdir}/{extractordir}/nested.cc.pragma[]

Use the next line for the inner snippet named Abc:

include::{srcbase}/{srcdir}/{extractordir}/nested.cc.Abc[]

For the outer snippet pragma you get:

Listing 12. Lines from the file nested.cc [Snippet: pragma]
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wunused"
class Abc {
public:
    Abc() {}
private:
    int mX = 0;
};
#pragma GCC diagnostic pop

Please note that the definition lines of the inner snippet Abc are excluded.

For the inner snippet Abc you get:

Listing 13. Lines from the file nested.cc [Snippet: Abc]
class Abc {
public:
    Abc() {}
private:
    int mX = 0;
};

4.2.1.2. Continuation of simple snippets

Simple snippets can be continued as in the following example:

Listing 14. The file simple.cc with a continued simple snippet
#include <iostream>

int main()
{
	int x = 0;

	//[out
	std::cout << __PRETTY_FUNCTION__ << std::endl;
	//]

	x = 42;

	//[out
	std::cout << __cplusplus << std::endl;
	//]

	return x;
}

This will produce a continued snippet:

Listing 15. Lines from the file simple.cc [Snippet: out]
std::cout << __PRETTY_FUNCTION__ << std::endl;

// ... lines omitted ...

std::cout << __cplusplus << std::endl;

Please note that the text between the parts of the continued snippet (here: // ... lines omitted ...) can be suppressed (see Command Line Options). If you do that using the option --sm, you get the following output:

Listing 16. Lines from the file simple.cc [Snippet: out]
std::cout << __PRETTY_FUNCTION__ << std::endl;
std::cout << __cplusplus << std::endl;
Please see also Missing Features.
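
The way the parts of a continued snippet are put together can be sketched as follows. This is only an illustration under assumptions: the function name extractContinued, the half-open ranges and the boolean parameter (standing in for the effect of --sm) are not taken from extractor's code, and the blank lines around the delimiter in the rendered output are not reproduced.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Half-open range [first, second) of line indices belonging to one part.
using Part = std::pair<std::size_t, std::size_t>;

// Hypothetical helper: concatenates all parts of a continued snippet and
// puts a delimiter line between two parts unless skipDelimiter is set.
std::vector<std::string> extractContinued(const std::vector<std::string>& lines,
                                          const std::vector<Part>& parts,
                                          bool skipDelimiter) {
    std::vector<std::string> out;
    for (std::size_t i = 0; i < parts.size(); ++i) {
        const Part& p = parts[i];
        assert(p.first <= p.second && p.second <= lines.size());
        if (i > 0 && !skipDelimiter) {
            out.push_back("// ... lines omitted ...");
        }
        out.insert(out.end(),
                   lines.begin() + static_cast<std::ptrdiff_t>(p.first),
                   lines.begin() + static_cast<std::ptrdiff_t>(p.second));
    }
    return out;
}
```

With two parts and skipDelimiter set to false the result has one delimiter line between the parts; with true the parts follow each other directly, as in the second listing above.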

4.2.1.3. Compound snippets

Sometimes it is useful to exclude one or more nested snippets from an outer snippet. This can be done by subtracting one or more of the nested snippets from the outer one. You can use the following compound snippet definition syntax:

Listing 17. Defining a compound snippet
//[outer -inner (1)

//[inner (2)

//] (3)

//]  (4)
1 Definition of the snippet outer without the nested snippet inner
2 Definition of the nested snippet inner
3 End of snippet inner
4 End of snippet outer

As said above you can freely nest the snippets as in the following example:

Listing 18. The file test.cc with simple and compound source snippets
#include <iostream>

//[mainx -ret -out
//[mainout -ret
//[mainret -out
//[main
int main()
{
	//[out
	std::cout << __PRETTY_FUNCTION__ << std::endl;
	std::cout << __cplusplus << std::endl;
	//]
	//[ret
	return 0;
	//]
}
//]
//]
//]
//]

Snippet called all
There is one implicitly defined snippet called all. It includes the whole source file but (normally) with the snippet definitions themselves removed.

If you include the snippet all you get the whole file test.cc with the snippet definition lines removed:

Listing 19. Lines from the file test.cc
#include <iostream>
int main() {
    std::cout << __PRETTY_FUNCTION__ << std::endl;
    std::cout << __cplusplus << std::endl;
    return 0;
}

The snippet main displays as follows:

Listing 20. Lines from the file test.cc [Snippet: main]
int main() {
    std::cout << __PRETTY_FUNCTION__ << std::endl;
    std::cout << __cplusplus << std::endl;
    return 0;
}

The snippet mainret excludes the inner snippet out from the outer snippet main. So you get the following result:

Listing 21. Lines from the file test.cc [Snippet: mainret]
int main() {
// ...
    return 0;
}

The excluded snippet itself is displayed as // ... by default, but you can customize that (see Using exclude Texts).

The same is true for the snippet mainout, which subtracts ret:

Listing 22. Lines from the file test.cc [Snippet: mainout]
int main() {
    std::cout << __PRETTY_FUNCTION__ << std::endl;
    std::cout << __cplusplus << std::endl;
// ...
}

For maximum flexibility you can remove more than one inner snippet from an outer snippet. This is shown with the snippet mainx:

Listing 23. Lines from the file test.cc [Snippet: mainx]
int main() {
// ...
}

4.2.1.4. Using exclude Texts

By default, a text that looks like a C++ comment is shown where a snippet is excluded from an enclosing outer snippet. If you want to show some special text instead, use an exclude-text definition.

Listing 24. Defining exclude text
//[outer -inner (1)

//[inner : The alternative exclude text (2)

//]
//]
1 Definition of the snippet outer without the nested snippet inner
2 The nested snippet inner with the alternative exclude text "The alternative exclude text"

Below is an example defining exclude-texts:

Listing 25. The file test2.cc with exclude texts
#include <iostream>

//[mainout -ret
//[mainret -out
//[main
int main()
{
    //[out : The output statements are not shown
    std::cout << __PRETTY_FUNCTION__ << std::endl;
    std::cout << __cplusplus << std::endl;
    //]
    //[ret : The return-statement omitted
    return 0;
    //]
}
//]
//]
//]

With these exclude texts the snippets are displayed as follows:

Listing 26. Lines from the file test2.cc [Snippet: mainout]
int main() {
    std::cout << __PRETTY_FUNCTION__ << std::endl;
    std::cout << __cplusplus << std::endl;
// The return-statement omitted
}

Listing 27. Lines from the file test2.cc [Snippet: mainret]
int main() {
// The output statements are not shown
    return 0;
}

4.2.2. Omitted lines

Sometimes individual lines distract the reader’s attention and should be excluded from the source shown. For this purpose extractor lets you mark lines to be omitted.

Individual lines can be omitted with a special marker //-:

Listing 28. The file omit.cc with omitted lines
#include <iostream>

//[main
int main()
{
	int x = 0;

	std::cout << __PRETTY_FUNCTION__ << std::endl; //-
	x = 42;
	std::cout << __cplusplus << std::endl; //-

	return x;
}
//]

Below you see the snippet main with some lines omitted:

Listing 29. Lines from the file omit.cc [Snippet: main] (some lines not shown)
int main() {
    int x = 0;
    x = 42;
    return x;
}
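
The effect of the //- marker can be sketched in a few lines of C++14. The names isOmitted and dropOmitted are hypothetical, and the sketch assumes that the marker is the last thing on the line (apart from trailing blanks), as in omit.cc above; how extractor itself recognizes the marker may differ.

```cpp
#include <cassert>
#include <string>
#include <vector>

// True if the line ends with the omit marker "//-" (trailing blanks ignored).
bool isOmitted(const std::string& line) {
    const auto end = line.find_last_not_of(" \t\r");
    if (end == std::string::npos || end < 2) return false;
    return line.compare(end - 2, 3, "//-") == 0;
}

// Hypothetical helper: keeps every line that is not marked with "//-".
std::vector<std::string> dropOmitted(const std::vector<std::string>& lines) {
    std::vector<std::string> out;
    for (const std::string& line : lines) {
        if (!isOmitted(line)) out.push_back(line);
    }
    return out;
}

// Usage example: the marked trace line disappears from the output.
void demoDropOmitted() {
    const std::vector<std::string> in = {"int x = 0;", "trace(x); //-", "return x;"};
    const std::vector<std::string> out = dropOmitted(in);
    assert(out.size() == 2 && out[1] == "return x;");
}
```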

4.2.3. Auto-Callouts

Auto-callouts are a really nice feature.

Callouts in general are a really good way to annotate portions of some code. But there are problems with the normal way of defining callouts:

  • If you use the include macro, you have to split the callout across two places: the source file and the documentation file.

  • The callout text is only visible in the documentation, not in the source file.

  • Using snippets together with callouts causes problems with the numbering of the callouts.

These problems are avoided using the so called auto-callouts:

  • These are fully specified in the source file, so the reader of the original source file still gets the callout text.

  • There is only one place to define the callout: the source file.

  • All auto-callouts are numbered automatically, according to the snippets included.

The following example contains three auto-callouts. You simply omit the callout number and write:

int x = 0; // <> Initialization of _variable_ `x` with value `0`

Listing 30. The file callout.cc with auto-callouts
#include <iostream>

//[main
int main()
{
    //[a
    int x = 0; // <> Initialization of _variable_ `x` with value `0`
    //]
	std::cout << __PRETTY_FUNCTION__ << std::endl;
    //[b
	x = 42; // <> Copy-assignment
    //]
	std::cout << __cplusplus << std::endl;
    //[c
    return x; // <> returning the value to the caller
    //]
}
//]

With auto-callouts the following snippet file is generated. As you can see, the callouts are numbered and the text of the source comments is transferred to the callout definitions.

Listing 31. Snippet file ready for inclusion into the document
.Zeilen aus der Datei link:{srcbase}/{srcdir}/callout.cc.html[`callout.cc`,window="_new"] [Snippet: main]
[source,cpp,indent=0]
----
int main() {
    int x = 0; (1)
    std::cout << __PRETTY_FUNCTION__ << std::endl;
    x = 42; (2)
    std::cout << __cplusplus << std::endl;
    return x; (3)
}
----
<1> Initialization of _variable_ `x` with value `0`
<2> Copy-assignment
<3> returning the value to the caller

Using this feature you can leave the text for the callouts bundled with the source itself. The code author therefore sees the very same callout text in the comment as the documentation reader.

Listing 32. Lines from the file callout.cc [Snippet: main]
int main() {
    int x = 0; (1)
    std::cout << __PRETTY_FUNCTION__ << std::endl;
    x = 42; (2)
    std::cout << __cplusplus << std::endl;
    return x; (3)
}
1 Initialization of variable x with value 0
2 Copy-assignment
3 returning the value to the caller

If you include different snippets the callout numbering will be adapted:

Listing 33. Lines from the file callout.cc [Snippet: a]
int x = 0; (1)
1 Initialization of variable x with value 0

Listing 34. Lines from the file callout.cc [Snippet: b]
x = 42; (1)
1 Copy-assignment

Listing 35. Lines from the file callout.cc [Snippet: c]
return x; (1)
1 returning the value to the caller
Please see also Missing Features.
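
The per-snippet renumbering shown above can be sketched like this. The names CalloutResult and numberCallouts are hypothetical, and the sketch assumes that the callout comment is written exactly as // <> and that everything after it on the line is the callout text; it is not extractor's actual implementation.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct CalloutResult {
    std::vector<std::string> code;   // code lines, callouts replaced by <n>
    std::vector<std::string> notes;  // notes[i] is the text of callout <i+1>
};

// Hypothetical helper: numbers the auto-callouts of one snippet from 1
// and moves their texts into a separate list.
CalloutResult numberCallouts(const std::vector<std::string>& lines) {
    static const std::string marker = "// <>";
    CalloutResult r;
    for (const std::string& line : lines) {
        const auto pos = line.find(marker);
        if (pos == std::string::npos) {
            r.code.push_back(line);
            continue;
        }
        // Everything after the marker, without leading blanks, is the text.
        std::string text = line.substr(pos + marker.size());
        const auto first = text.find_first_not_of(" \t");
        text = first == std::string::npos ? std::string() : text.substr(first);
        r.notes.push_back(text);
        r.code.push_back(line.substr(0, pos) + "<" + std::to_string(r.notes.size()) + ">");
    }
    assert(r.code.size() == lines.size());  // one output line per input line
    return r;
}
```

Because the counter starts at 1 for every call, running the sketch on the lines of snippet b alone numbers its only callout <1>, just like in the listings above.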

4.2.4. Highlighting (Marking)

There is a bug here: this feature doesn’t work if AStyle formats the output.

Listing 36. Lines from the file hlite.cc
#include <algorithm>
#include <memory>
int main() {
    int x = 0;
    int y = 0;
    return x;
}
int foo() {}

4.3. Formatting

4.4. Language support

4.5. Command Line Options

extractor has the following command line options:

The options of extractor
201406
CommandLineOption[ h , help ]
CommandLineOption[ v , verbose ]
CommandLineOption[ se , skipEmptyLines , skipemptylines ]
CommandLineOption[ sb , skipBlockComments , skipblockcomments ]
CommandLineOption[ ss , skipSnippetDefs , skipsnippetdefs ]
CommandLineOption[ sc , skipCallouts , skipcallouts ]
CommandLineOption[ sm , skipMultiSnippetDeliminter , skipdelimiter ]
CommandLineOption[ sx , skipExcludeMarker , skipexclude ]
CommandLineOption[ sh , skipHighlighting , skiphighlight ]
CommandLineOption[ ee , enableEmptyLines , enableemptylines ]
CommandLineOption[ eb , enableBlockComments , enableblockcomments ]
CommandLineOption[ es , enableSnippetDefs , enablesnippetdefs ]
CommandLineOption[ ec , enableCallouts , enablecallouts ]
CommandLineOption[ eh , enableHighlighting , enablehighlight ]
CommandLineOption[ io , includeOmitted , includeomitted ]
CommandLineOption[ in , indent , indentlevel ]
CommandLineOption[ l , lang , language ]
CommandLineOption[ a , astyle , astyleoptions ]
CommandLineOption[ o , output ]
CommandLineOption[ d , subdir ]
CommandLineOption[ x , nosnippets , filteronly ]
CommandLineOption[ n , linenums , linenumbers ]
Option  Description
------  -----------
-h      Help message.
-v      Be more verbose.
--se    (default) Don’t print empty lines in the snippet output, to make it more condensed.
--ee    Opposite of --se.
--sb    (default) Don’t include block comments (like /* ... */ for C++ and Java) in the snippet output. This is especially useful to exclude copyright or license headers in source files.
--eb    Opposite of --sb.
--ss    (default) Don’t include the snippet definition lines themselves in the snippet files.
--es    Opposite of --ss.
--sc    (default) Don’t include callouts (see Auto-Callouts) in the snippet files.
--ec    Opposite of --sc.
--io    Include omitted lines (see Omitted lines).
-l      Set the language (see AsciiDoctor).
-a      Set additional options for AStyle.
-o      Set the output filename for the snippet database file (default: <basefilename>.extract).
-d      Set the directory for generated snippet files (default: .extractor).
-x      Don’t generate snippet files; just pass the source through extractor, e.g. to exclude the snippet definition lines.
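
The skip/enable pairs can be read as switches over a small set of boolean settings. The following sketch only illustrates that reading: the struct FilterOptions and the function parseFlags are hypothetical, the defaults are copied from the option descriptions above, and the rule that the later flag of a pair wins is an assumption, not documented behaviour.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical settings record; the defaults follow the option descriptions.
struct FilterOptions {
    bool skipEmptyLines    = true;   // --se (default), --ee switches it off
    bool skipBlockComments = true;   // --sb (default), --eb switches it off
    bool skipSnippetDefs   = true;   // --ss (default), --es switches it off
    bool skipCallouts      = true;   // --sc (default), --ec switches it off
    bool includeOmitted    = false;  // --io switches it on
};

// Applies the flags from left to right, so the later flag of a pair wins
// (an assumption of this sketch). All other arguments are ignored here.
FilterOptions parseFlags(const std::vector<std::string>& args) {
    FilterOptions o;
    for (const std::string& a : args) {
        if      (a == "--se") o.skipEmptyLines = true;
        else if (a == "--ee") o.skipEmptyLines = false;
        else if (a == "--sb") o.skipBlockComments = true;
        else if (a == "--eb") o.skipBlockComments = false;
        else if (a == "--ss") o.skipSnippetDefs = true;
        else if (a == "--es") o.skipSnippetDefs = false;
        else if (a == "--sc") o.skipCallouts = true;
        else if (a == "--ec") o.skipCallouts = false;
        else if (a == "--io") o.includeOmitted = true;
    }
    return o;
}

// Usage example: keep block comments and include omitted lines.
void demoParseFlags() {
    const std::vector<std::string> args = {"--eb", "--io"};
    const FilterOptions o = parseFlags(args);
    assert(!o.skipBlockComments && o.includeOmitted && o.skipEmptyLines);
}
```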

5. Other useful tools

5.1. Automating the documentation process

The snippets are best generated when building the source itself: if the source changes, the snippets must change too. Automating this whole process would be best.

5.1.1. Using make to generate the snippets

make is a content-agnostic build tool, so it can be used to automate the documentation generation process.

If the project is already built with GNU make, all the needed information can be generated very simply with a few additional rules.

Listing 37. Rules for make (Makefile) to generate the source snippets and the snippet database file
EXTRACTOR = extractor (1)
EXTRACTDIR = .extractor (2)

%.cc.extract: %.cc (3)
	$(EXTRACTOR) -lcpp -aA2 -o$@ -d$(EXTRACTDIR) $< (4)

%.h.extract: %.h (5)
	$(EXTRACTOR) -lcpp -aA2 -o$@ -d$(EXTRACTDIR) $<

%.java.extract: %.java (6)
	$(EXTRACTOR) -ljava -aA2 -o$@ -d$(EXTRACTDIR) $<
1 The EXTRACTOR-variable contains the path to the extractor-executable
2 Use this directory to put the snippets in
3 Rule to generate the snippets-database file (e.g. test.cc.extract) from a source file (e.g. test.cc)
4 Use cpp as the language setting and A2 styling for AStyle
5 Same rule for header files
6 Same rule for Java files

With the following rules you can generate the needed html-files for the links in the snippet-files:

Listing 38. Rules for make (Makefile) to generate highlighted versions of the source files using GNU source-highlight
SRCHI = source-highlight (1)

%.cc.html: %.cc
	$(EXTRACTOR) -x --eb --io $< | $(SRCHI) -scpp > $@ (2)

%.h.html: %.h
	$(EXTRACTOR) -x --eb --io $< | $(SRCHI) -scpp > $@

%.java.html: %.java
	$(EXTRACTOR) -x --eb --io $< | $(SRCHI) -sjava > $@
1 source-highlight is used to produce the html versions of the source files
2 extractor with special options is used to eliminate the snippet definitions but include otherwise omitted lines

5.2. Kate

5.2.1. AsciiDoc syntax definition

5.2.2. Text snippets

5.3. Editor integration

6. Missing Features

As you work with extractor you’ll surely notice that there are (many) missing features. The main reason is that these features aren’t relevant for me at the moment. If you need one of the following, it should be simple to add. Please inform me if you are working on a patch.

  • setting the text between continued snippets (// ... lines omitted ...).

  • customization of the texts shown in captions if snippets are used.

  • customization of the texts shown in captions if omitted lines are used.