UWC Developer Documentation

Get the Source

The UWC source is kept in Subversion.
You can check it out anonymously from:
http://svn.atlassian.com/svn/public/contrib/confluence/universal-wiki-converter
If you want to modify it, obtain a Subversion login ID from jnolen@atlassian.com and check it out from here instead:
https://svn.atlassian.com/svn/public/contrib/confluence/universal-wiki-converter



Download the latest version of the distribution

How to check out and build

svn co https://svn.atlassian.com/svn/public/contrib/confluence/universal-wiki-converter

Use https if you have Subversion write access and want to check things back in. Otherwise you can do an anonymous checkout over http:
svn co http://svn.atlassian.com/svn/public/contrib/confluence/universal-wiki-converter

Building the project

ant -p         lists all the available targets
ant            creates a development build that can be run from here:
	cd universal-wiki-converter/devel/target/uwc
	. ./run_uwc_devel.sh
	This gives a quicker devel cycle than using the package target.
ant package    creates the full uwc.jar and packages up the entire distribution under universal-wiki-converter/dist

What are the developer features I can take advantage of?

  • A wizard-like UI already exists which walks users through the steps necessary to convert their wiki.
  • The UI and framework take care of locating pages to convert and then sending the converted pages and attachments to Confluence.
  • There are already a few examples of exactly what needs to be written to successfully convert another wiki to Confluence.
  • The UI allows users to easily swap out converters that could be causing unexpected results. This has proven useful.
  • Every library you should need to create a set of wiki converters is already in the stack.
  • The ANT build system is up and running.
  • The system uses regular expressions. In many cases someone has already written the start of a set of regular expressions to convert their own wiki. The UWC was written with the intention of making it easy to leverage these pre-existing efforts.
  • Several UWC-specific developer frameworks are available.

How to convert another wiki.

Have a look at this video: UWC Developer Video - 11 min.

  1. I'd recommend locating a wiki syntax page for your origin wiki and bookmarking it, since you'll be visiting it frequently.
  2. I'd recommend creating a test page which shows all the most popular syntaxes for the origin wiki that you'll be converting. This is a file you can then run directly through the UWC to test.
  3. If you can get your hands on some (or all) of the source files you'll be converting, they are very handy to have around.
  4. Look for other existing converters from which you can borrow regular expressions. There might be some here under Confluence content converters or there might not.
  5. Under uwc/devel/conf you'll see a few files such as converter.pmwiki.properties. Copy one of these to something like converter.mediawiki.properties.
  6. Then you can either start adding PERL regular expressions directly to converter.mediawiki.properties, or you can extend the BaseConverter class to run things through your own engine. (I really need to add something which will allow adding expressions for Java's built-in regex package directly to the property files, since that engine is proving more robust and powerful than the PERL stuff.)
  7. There is an ANT build file which will build the project for you under the devel/target/uwc directory.

It's a bit sketchy but that's the gist. Feel free to email me any questions - brendan@atlassian.com

Version control and checking in your changes

If you make improvements, please contribute them back. You'll need to check out the project via the https link above as opposed to the http link. You'll also need to acquire a Subversion login/password from Jonathan Nolen - jnolen@atlassian.com.

The source code is licensed under the Apache License, version 2.

Architecture

How it works:

The Universal Wiki Converter is a client-side application with a rich GUI. It converts files containing wiki markup from the origin wiki and then sends the converted pages directly to Confluence via XMLRPC.

What the framework provides:
  1. A GUI interface
    1. GUI allows user to select pages to convert
    2. GUI allows user to dynamically specify Confluence settings
    3. GUI provides a regular expression test tool for rapid development of regular expressions against these specific regex engines (all regular expression engine implementations seem to be slightly different)
    4. GUI provides a percent-complete progress bar while files are being converted and sent to Confluence
    5. A feedback window with which to send feedback to the user
  2. Currently one regular expression engine. The regular expression engines are pluggable. You need only extend 'BaseConverter' or have a look at the PERLConverter to see how (~5 lines of code).
  3. The BaseConverter can be extended to implement 'Java converter' classes to handle the trickier cases where a regular expression isn't quite up to the task.
  4. All the XMLRPC code necessary to send pages to Confluence and upload attachments to their pages
  5. The ability to dynamically massage page names.
  6. Several of the systems have unit tests both to verify functionality and provide sample code
  7. Lots of sample regular expressions.

Wiki Exporters

The UWC reads in individual files on the hard drive. The names of those files become the wiki page names. The contents of the files are expected to be the page content, including the markup that gets converted by the UWC.

In many cases wikis either store their contents in this format already or have built in features to export their content to this format of files.

However some wikis do not have facilities built in. Their data must be retrieved directly from a database or an XML file or converted into this format. In such cases the developer can define an 'exporter' for a wiki or multiple exporters.

There are two main components to an 'exporter':
1) A Java class which implements the com.atlassian.uwc.exporters.Exporter interface. This drives the behavior of the exporter.
2) An exporter properties file named exporter.some-existing-converter-name.properties, located in the conf/ directory.

The UWC detects all such property files and:

  • v45 and earlier: lists them on the UWC's 'exporter' tab.
  • v46 and later: will enable the export button when the associated wiki is chosen in the drop-down menu.

The exporter properties file's contents will be passed into the class implementing the Exporter interface as a Map. The developer will thereby have access to all those settings in the properties file.

The only required property in the exporter.name.properties file is:

  1. Exporter fully qualified class name
    exporter.class=com.atlassian.uwc.exporters.SomeWikiExporter
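To make the "properties file as a Map" idea concrete, here is a minimal sketch using only java.util.Properties. The class ExporterPropertiesDemo, the helper toMap, and the databaseName setting are all invented for illustration; the UWC's own loading code may well differ.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

public class ExporterPropertiesDemo {

    // Hypothetical helper: parses properties-file text into a sorted Map of settings.
    static Map<String, String> toMap(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, String> settings = new TreeMap<>();
        for (String name : props.stringPropertyNames()) {
            settings.put(name, props.getProperty(name));
        }
        return settings;
    }

    public static void main(String[] args) {
        // "databaseName" is a made-up setting; only exporter.class is required.
        String text = "exporter.class=com.atlassian.uwc.exporters.SomeWikiExporter\n"
                + "databaseName=wikidb\n";
        Map<String, String> settings = toMap(text);
        System.out.println("Exporter class: " + settings.get("exporter.class"));
        System.out.println("All settings: " + settings);
    }
}
```

An exporter that receives such a Map can look up whatever extra settings it needs (connection details, output directory and so on) by key.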

One example of the Exporter class is the MediaWikiExporter. It queries the database via JDBC and retrieves the MediaWiki's contents. Those contents are then written out to individual files corresponding to wiki pages, which is the format the UWC expects.

Another example might be to convert an XML export file into individual files.
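Whatever the source of the data, the end result of an export is one file per page. The sketch below shows only that final step, under stated assumptions: PageFileWriterSketch and writePages are hypothetical names, the class does not implement the real Exporter interface, and the sample pages stand in for rows fetched from a database or an XML export.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

public class PageFileWriterSketch {

    // Writes one file per page into outputDir; the file name doubles as the page name.
    // Assumes titles are already safe to use as file names (no path separators etc.).
    static void writePages(Map<String, String> pages, Path outputDir) {
        try {
            Files.createDirectories(outputDir);
            for (Map.Entry<String, String> page : pages.entrySet()) {
                Path file = outputDir.resolve(page.getKey());
                Files.write(file, page.getValue().getBytes(StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Invented sample data: page title -> raw wiki markup.
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("Home", "Welcome to '''my wiki'''");
        pages.put("Installation Guide", "== Steps ==");

        Path dir = Files.createTempDirectory("uwc-export-demo");
        writePages(pages, dir);
        System.out.println("Wrote " + pages.size() + " page files to " + dir);
    }
}
```

The resulting directory is what you would then point the UWC at when selecting pages to convert.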

Tips

  • To test new regular expression changes to the converter.wiki.properties file you do NOT need to restart the UWC. Simply reselect the target wiki with the "choose wiki" button and any changes to your regex file will be picked up. If this does not seem to be working it is probably because you are changing the converter.tikiwiki.properties (or whichever) file in a different location than where the UWC is picking it up.
  • It seems that the UWC running Java 6 might be a little faster than jdk1.5.
  • If you're developing with IDEA or Eclipse and running through the debugger in most cases code changes can be recompiled and reloaded by the IDE without restarting the UWC.
  • *Regression Testing*
    • It is a good idea to create a file which demonstrates all of the syntax you are trying to convert. New changes can unexpectedly affect things that used to work. This is generally not too time consuming, and regression testing is key.
    • Having an input file such as SampleTikiwiki-Input2.txt and an output file containing the correctly converted text, such as SampleTikiwiki-Expected2.txt, is helpful both to you and to other developers. Please check such files in under sampleData/<wikiName>
    • Within SampleTikiwiki-Input2.txt it can be helpful to wrap text that describes the syntax changes or shows the target Confluence syntax in whatever that wiki's equivalent of {code} tags is. That way the text you don't really want the converter messing with comes through untouched. Otherwise sample text will usually get changed, so the expected output file is not quite as clear as you'd like. This of course works best if you've written the converter for the tag which translates to {code} in Confluence.
  • I've recently started writing all the regular expressions using Java's built in regex. This regex engine has proven extremely powerful, flexible and reliable. Additionally there is a nice demo/testing util checked in which makes developing the regular expressions much easier. I can't remember exactly where I found it online, but I've checked it in under the uwc/devel/tools dir and you can run it as demonstrated below.
    • cd C:\projects\universal-wiki-converter-public\devel\tools\javaregex\classes (or whatever location you've checked things out to)
    • java regexdemo.AppRegexDemo
  • As you work through issues it's a good idea to track them all in a single file. Then you can use that file for regression testing.
  • Regression testing - keep handy a 'test' file with all the syntax you're testing as well as a successful conversion of that file. Much like unit testing you'll feel much more confident refactoring as you can always run your test file through again and then do a diff against the 'successful' output file. I recommend some sort of visual diff tool. I use JEdit's jdiff plugin.
  • There are at least three 'kinds' of conversion regular expressions I find myself writing.
    1) conversions of things which shouldn't be touched by other regular expressions. These include links, code blocks, and attachments, among others
    2) escapes - when the original wiki uses characters that aren't meaningful in that wiki but ARE meaningful in Confluence, you have to escape those characters or Confluence will treat them as formatting and the page will generally look strange
    3) other conversions which just kind of stack up against each other...bold, italics, tables
    What's working well is to order the above conversion types as shown - 1) 2) 3). For the first type you generally want to tokenize the matches by using the TokenMap class or the built-in tokenizing replacement. This way you convert something and it then gets tokenized, so that it won't be touched by any other conversion until it is 'de-tokenized' at the end.
  • Regular Expression Reference Links:
  • When you're testing your converter, you do not need to import to Confluence. After you click Convert Pages to Confluence Syntax, a popup will ask you if you want to send the pages to Confluence. You can click No, and instead examine your converted pages in the output/output directory.
  • It is very helpful to develop a 'test file' which distills all of the syntax you are trying to convert along with its correct conversion. The problem is that after you convert this file it's not always easy to know what you're looking for, because if you show the correctly converted text in the file it will probably get changed into something else.

So what is very helpful is to essentially say, "hey converter, don't touch this". What makes sense is to look at how the origin wiki puts the equivalent of Confluence {code} tags around a block of text (so it won't be messed with), create that converter, and then wrap your sample text the same way.

So for PmWiki you have this syntax:
[@
some code you don't want parsed here
@]

The equivalent converter for this is:
PmWiki.0040_code-block.java-regex-tokenize=\[@(.*?)@\]{replace-multiline-with}{code}$1{code}

Newline Tip

To match and replace ABC with a newline character:

SomeWiki.newline-replace-example.java-regex=ABC{replace-with}NEWLINE

The text NEWLINE (in upper case) now resolves to a system dependent newline character.
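As a rough picture of what that resolution amounts to, the following sketch swaps the literal text NEWLINE in a replacement string for the platform line separator before running the replacement. NewlineReplaceDemo and apply are invented names, and this is not the UWC's actual code.

```java
public class NewlineReplaceDemo {

    // Resolves NEWLINE in the replacement, then performs an ordinary regex replace.
    static String apply(String input, String regex, String replacement) {
        String resolved = replacement.replace("NEWLINE", System.lineSeparator());
        return input.replaceAll(regex, resolved);
    }

    public static void main(String[] args) {
        // Mirrors the property above: search for ABC, replace with a newline.
        String result = apply("firstABCsecond", "ABC", "NEWLINE");
        System.out.println(result);
    }
}
```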

Anatomy of a Conversion properties file

Here I'm going to describe naming conventions, particularly ones with meaning, and some special classes that can be used to help you with your conversions.

Property names

Your property name will look like:

Example property

Wikitype.xxxx-syntax_description.suffix

Let's go through that in order.

  • The Wikitype section of the property is arbitrary, but for consistency, name it after the wiki you are converting from
  • The xxxx section is a number. This is useful for keeping the converters in an understandable order. Converters are run in ascending ASCII order of their property names, so, provided your Wikitype is the same for all converters, these numbers determine the order in which the converters run
  • The syntax_description is just for ease of identifying what the converter does
  • The suffix will tell the ConverterEngine what type of property this is. Choices are:
    • class - Use this one if the converter will use a Java class that extends BaseConverter. See Classes.
    • java-regex - Use this one if the converter will do a simple search and replace with a Java regular expression. See regular expressions for more info.
    • perl - Use this one if the converter will use perlish search and replace syntax. See regular expressions for more info.
    • java-regex-tokenizer - Use this one if the converter will do a search and replace, and then tokenize the results so that they are no longer available to later conversions. See Tokenizing classes for more info.
    • A non-converter property. See Nonconverter properties.

Converters are run in ASCII alphabetical order by property name:

      MyWiki.0100-stuff will get run before
      MyWiki.0200-stuff which will get run before
      MyWiki.0200-xyz
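The ordering rule is easy to check for yourself. This small sketch sorts some property names with Java's natural String ordering, which matches ASCII order for plain ASCII names. ConverterOrderDemo and runOrder are hypothetical names, and I'm assuming the sort is applied to the property name as written; it also shows why zero-padding the number matters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ConverterOrderDemo {

    // Returns the names in the order a plain ASCII sort would run them.
    static List<String> runOrder(List<String> propertyNames) {
        List<String> sorted = new ArrayList<>(propertyNames);
        Collections.sort(sorted); // natural String order, character by character
        return sorted;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "MyWiki.0200-xyz",
                "MyWiki.0100-stuff",
                "MyWiki.0200-stuff");
        System.out.println(runOrder(names));

        // Without zero-padding, "1000" sorts before "200" because '1' < '2'.
        List<String> unpadded = Arrays.asList("MyWiki.200-early", "MyWiki.1000-late");
        System.out.println(runOrder(unpadded));
    }
}
```

Keeping every number the same width (0100, 0200, 1000) avoids that second surprise.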

Property values

Property values for syntax converters are either classes or regular expressions.
Property values for non-converter properties are tailored to the property in question (booleans, settings, classnames, etc.)

Classes

If it's a class, the property value should point to a Java class that extends BaseConverter.

Example

MyWiki.0100-converting_stuff.class=com.atlassian.uwc.converters.ConvertingStuff

This class extends com.atlassian.uwc.converters.BaseConverter. The entry method is convert.
Basically, you should:

  • get the original text
  • Do something to it, maybe with a regular expression maybe not
  • set the page's converted text
    public void convert(Page page) {
       String input = page.getOriginalText();
       String converted = doSomething(input);
       page.setConvertedText(converted);
    }
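For a version of the convert method above that you can run outside the UWC, here is a self-contained sketch. The nested Page class is only a stand-in with the two accessors used above (plus a getter so the result can be printed), and BoldConverter is an invented example that turns MediaWiki-style '''bold''' into Confluence *bold*. A real converter would extend BaseConverter and use the UWC's own Page class.

```java
public class ClassConverterSketch {

    // Minimal stand-in for the UWC Page class, reduced to what this example needs.
    static class Page {
        private final String originalText;
        private String convertedText;

        Page(String originalText) { this.originalText = originalText; }
        String getOriginalText() { return originalText; }
        void setConvertedText(String text) { this.convertedText = text; }
        String getConvertedText() { return convertedText; }
    }

    // Hypothetical converter following the get / transform / set pattern.
    static class BoldConverter {
        public void convert(Page page) {
            String input = page.getOriginalText();
            // '''text''' becomes *text*; the reluctant quantifier keeps matches short.
            String converted = input.replaceAll("'''(.+?)'''", "*$1*");
            page.setConvertedText(converted);
        }
    }

    public static void main(String[] args) {
        Page page = new Page("This is '''important''' text.");
        new BoldConverter().convert(page);
        System.out.println(page.getConvertedText()); // prints: This is *important* text.
    }
}
```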
    



regular expressions

If it's a regular expression, provide a search and replace string. If you are using the java-regex converter type, use the delimiter {replace-with} between your search and replace strings.
The following java-regex example takes characters surrounded by <nowiki> tags and replaces those tags with Confluence noformat macros.

java-regex example

Mediawiki.0200-re_noformat.java-regex=<nowiki>((?s).*?)</nowiki>{replace-with}{noformat}$1{noformat}
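If you want to try a java-regex value before putting it in the properties file, something like the following sketch will do. It splits the value on the {replace-with} delimiter and hands both halves to java.util.regex. JavaRegexPropertyDemo and apply are invented names; this is my approximation of the behavior described above, not the UWC engine itself.

```java
import java.util.regex.Pattern;

public class JavaRegexPropertyDemo {

    static final String DELIMITER = "{replace-with}";

    // Splits "search{replace-with}replacement" and applies it to the input.
    static String apply(String propertyValue, String input) {
        int split = propertyValue.indexOf(DELIMITER);
        if (split < 0) {
            throw new IllegalArgumentException("Missing " + DELIMITER + " in: " + propertyValue);
        }
        String search = propertyValue.substring(0, split);
        String replacement = propertyValue.substring(split + DELIMITER.length());
        return Pattern.compile(search).matcher(input).replaceAll(replacement);
    }

    public static void main(String[] args) {
        // Same value as the Mediawiki.0200-re_noformat property above.
        String value = "<nowiki>((?s).*?)</nowiki>{replace-with}{noformat}$1{noformat}";
        String input = "Before <nowiki>''not italic''\nsecond line</nowiki> after";
        System.out.println(apply(value, input));
    }
}
```

Note the (?s) inside the group: it lets the dot match newlines, so a <nowiki> block that spans several lines is still converted.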


Here's a perl example. It looks like a perl substitution (s/search/replace/flags). This one converts DokuWiki underlined text (__text__) to Confluence underlined text (+text+).

perl example
DokuWiki.1underlined.perl=s/__([^_]+)__/+$1+/g



Tokenizing classes

Let's say you want to convert something, but then not allow any further conversions. For example, converting the contents of <code> tags to {code} tags. You would then use the java-regex-tokenizer type. This would perform the search and replace and then tokenize those converted sections so that they were protected from further conversion.

java-regex-tokenizer example

Mediawiki.0095-re_code.java-regex-tokenizer=\<code\>((?s).*?)\<\/code\>{replace-with}{code}$1{code}

Tokenizer properties have a convenience option for when the developer wants "dotall" and "multiline" modes in effect. Use {replace-multiline-with} instead of {replace-with}:

replace-multiline-with compiles with DOTALL and MULTILINE

Mediawiki.0095-re_code.java-regex-tokenizer=\<code\>(.*?)\<\/code\>{replace-multiline-with}{code}$1{code}
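In plain java.util.regex terms, those two modes are the Pattern.DOTALL and Pattern.MULTILINE flags. The sketch below (MultilineFlagsDemo is an invented name, and the regex is simplified by dropping the optional backslashes) shows that the same pattern only matches a multi-line code block when DOTALL is on.

```java
import java.util.regex.Pattern;

public class MultilineFlagsDemo {

    private static final String REGEX = "<code>(.*?)</code>";
    private static final String REPLACEMENT = "{code}$1{code}";

    // Compiled the way the text above describes {replace-multiline-with}.
    static String convertWithFlags(String input) {
        Pattern pattern = Pattern.compile(REGEX, Pattern.DOTALL | Pattern.MULTILINE);
        return pattern.matcher(input).replaceAll(REPLACEMENT);
    }

    // Default flags: the dot stops at line breaks.
    static String convertWithoutFlags(String input) {
        return Pattern.compile(REGEX).matcher(input).replaceAll(REPLACEMENT);
    }

    public static void main(String[] args) {
        String input = "<code>line one\nline two</code>";
        System.out.println("With flags:    " + convertWithFlags(input));
        System.out.println("Without flags: " + convertWithoutFlags(input));
    }
}
```

With the flags the block comes out wrapped in {code}; without them the input is returned unchanged because the dot cannot cross the line break.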

To detokenize, you would add the following class to the end of your converter.properties:

If you use tokenizers, you must end your properties with this:

Mediawiki.2000-detokenize.class=com.atlassian.uwc.converters.DetokenizerConverter
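To show why tokenizing protects converted text, here is a minimal sketch of the tokenize, convert, detokenize cycle. Everything in it is an assumption made for illustration: TokenizeSketch is not the UWC's TokenMap or DetokenizerConverter, and the ~UWCTOKENn~ token format is made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenizeSketch {

    // token -> already-converted text that later converters must not touch
    private final Map<String, String> tokens = new LinkedHashMap<>();

    // Converts each match to macro + group(1) + macro, then hides it behind a token.
    String tokenize(String input, String regex, String macro) {
        Matcher matcher = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String token = "~UWCTOKEN" + tokens.size() + "~";
            tokens.put(token, macro + matcher.group(1) + macro);
            matcher.appendReplacement(out, Matcher.quoteReplacement(token));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    // Final step: swap every token back for the text it was protecting.
    String detokenize(String input) {
        String result = input;
        for (Map.Entry<String, String> entry : tokens.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        TokenizeSketch sketch = new TokenizeSketch();
        String text = "'''bold''' <code>'''raw'''</code>";

        text = sketch.tokenize(text, "<code>(.*?)</code>", "{code}"); // 1) protect code
        text = text.replaceAll("'''(.+?)'''", "*$1*");                // 2) later converter
        text = sketch.detokenize(text);                               // 3) restore

        System.out.println(text); // prints: *bold* {code}'''raw'''{code}
    }
}
```

The bold converter in step 2 never sees the contents of the code block, which is exactly the protection the tokenizer type is meant to give.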




Nonconverter properties

Non-converter properties are used to handle settings that would affect the conversion, but are not technically converters. They are often used to turn on or customize optional features.

Non-converter properties belong on top

We recommend that non-converter properties be set at the beginning of the properties file.

hierarchy

UWC Hierarchy Builder Framework
Description - The hierarchy framework provides functionality to allow the UWC to set parent-child relationships between pages.
Example

MyWiki.0001.switch.hierarchy-builder=UseBuilder
MyWiki.0002.classname.hierarchy-builder=com.atlassian.uwc.hierarchies.FilepathHierarchy

page histories

UWC Page History Framework
Description - The page histories framework provides the ability to maintain version histories for pages.
Example

MyWiki.0001.switch.page-history-preservation=true
MyWiki.0002.suffix.page-history-preservation=-#.txt

disabling illegal pagenames framework

UWC Illegal Pagenames Framework - Disabling
Description - The disabling illegal pagenames framework feature provides a way to turn off the default illegal pagenames handling.

Careful!

Allowing illegal pagenames to be uploaded to your Confluence could produce unknown behavior.

Example

Mywiki.0001.illegal-handling=false

auto detect spacekeys

UWC Auto Detect Spacekeys Framework
Description - The Auto Detect Spacekeys framework will detect and create spaces on the fly for your new Confluence pages.
Example

Mywiki.0001.autodetect-spacekeys=true

Filename extension stripping class

If you do not want the pages that Confluence imports to have the filename extension in the page title, add this class to the end of your converter.properties:

This will strip out filename extensions from page titles

Mediawiki.1000-remove-extension.class=com.atlassian.uwc.converters.ChopPageExtensionsConverter
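The effect on a page title is roughly what the sketch below does. This is a guess at the behavior for illustration only, not the source of ChopPageExtensionsConverter; ChopExtensionSketch and chopExtension are invented names, and the real class may treat edge cases (several dots, leading dots) differently.

```java
public class ChopExtensionSketch {

    // Drops everything from the last dot onward, unless the dot is the first character.
    static String chopExtension(String title) {
        int dot = title.lastIndexOf('.');
        return dot > 0 ? title.substring(0, dot) : title;
    }

    public static void main(String[] args) {
        System.out.println(chopExtension("Home.txt"));           // prints: Home
        System.out.println(chopExtension("Release.Notes.wiki")); // prints: Release.Notes
        System.out.println(chopExtension("README"));             // prints: README
    }
}
```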




Conversion examples

UWC Conversion Line Examples

Important Classes

TokenMap -
This is a helper class to create, store and retrieve tokens.

Certain elements such as links and code can be quite tricky to convert. One issue is that you need to escape text in some places but not others (like inside links).

Use this class for anything where you want to avoid syntax from being escaped. VERY HELPFUL.

Devel cycle notes:

  • To develop, just run 'ant' or 'ant all', which are the same. This does not create a dist, but it does build all the classes and create everything under 'target/uwc'.
  • During the devel cycle you can run 'target/uwc/run_uwc_devel.sh'. This lets you run the UWC without packaging up the whole distribution.

Other Notes

After learning the value of tokenizing I'm wondering how JavaCC might be leveraged to make development of new wiki conversions faster (actual runtime speed is of little relative importance as long as we're not doing something silly).

Don't use Base64 encoding or decoding. It is too slow to be practical.

Getting Help

Want to ask a developer a question? Try out the UWC Forum.

  1. Feb 27, 2007

    Eric Sorenson says:

    Here's a link I found and posted on the old Doc page; I found myself needing it again so maybe it'll help somebody else. 

     If you want to add your own regex converters but aren't sure what exactly is supported by the 'perl' engine, the underlying code is from the Jakarta ORO project and the regex flavor is documented here: http://jakarta.apache.org/oro/api/org/apache/oro/text/perl/Perl5Util.html

     -Eric