Universal Wiki Converter - Original Docs and Discussion

These are the original docs and discussion. If you are looking for the most recent docs, please visit the current site.

Universal Wiki Converter

Goal: To provide a tool which can handle the majority of wiki conversion work with selectable regular expressions via a GUI. Optionally, bits of additional wiki-specific conversion functionality (such as attachment handling) can be added by implementing a few well-defined interfaces.

User documentation including walk through videos

Source

The UWC is in Subversion.
You can easily check it out here:
http://svn.atlassian.com/svn/public/contrib/confluence/universal-wiki-converter
or if you might want to modify it you'll want to get it here:
https://svn.atlassian.com/svn/public/contrib/confluence/universal-wiki-converter

Latest version of the distribution

Updates:

April 20th, 2006

The UWC is now in what I'd call alpha. Rolf Staflin has provided a working Doku converter, progress bars, a huge number of insightful refactorings, some attachment handling code and Javadoc. A PmWiki converter is being tested.

The framework itself provides things like:

  • A GUI
  • selectable pages
  • Attachment handling
  • A unit tested XMLRPC wrapper to send pages and attachments to Confluence
  • documentation - honest to goodness Javadoc...architecture doc stuff is in the works
  • User documentation including walk through videos - in progress but moving.
April 20th, 2006 - The base framework is nearing a first cut completion. Also check out the new UWC Documentation
Jan 25th, 2006 - Two min. video update.

This is still a work in progress and not yet in a usable state.
Many parts of this are now working. In fact you can select and convert several pages exported from another wiki system via regular expressions and insert them into a specified Confluence space. The UWC will support any number of arbitrary regular expression engines.
Things working: I started to list these, but watching the video is much more helpful.
Current issues:

  • The SOAP interface seems a little flaky. It works for a few pages but then fails without any clear error messages. I think alternative page insertion options should be explored. One possibility would be a Confluence plugin that can communicate with the UWC, possibly via RMI, or an export could be created as the JSP Wiki converter does. Another would be to make the UWC itself a plugin.
  • I'm not 100% sure how much conversion the one regular expression engine currently included can handle. This isn't too much of an issue, as the UWC can plug in other regular expression and replacement engines in addition to custom Java conversion classes.
  • As of yet there is no good general strategy for handling attachments other than implementing interfaces specific to each popular wiki.
    • One possibility might be to create an interface listing all the pages and then making it easy (pre-upload) to match attachments with pages. Then a batch process would kick off all the uploads. You'd end up with attachments in the right places. The usefulness of this would be limited, but in some cases it might be enough.

Process description:

  1. A wiki exporter sends page contents to a standard format - 1 file per page. In several cases wikis can already do this; in other cases a small Java class implementing an interface might be required.
  2. Wiki pages to be imported are selected. This allows the user to select only the pages they want imported. This way the user can easily leave some pages out or import some pages later. You could even combine pages from different wikis!
  3. User enters Confluence access settings. This tells the UWC where to send the pages.
  4. By selecting a converter group, the user can specify pre-existing sets of regular expressions that will be run in serial to convert the selected pages to Confluence. Converters can be regular expressions and/or custom classes. This will allow a user to write an extra regular expression or two in case their wiki has customizations, or leave out a converter that is misbehaving.
  5. Mix and match the converters to do the job.
  6. Push the big red button. Finally send those pages to Confluence. Some safety will be built in to only overwrite existing pages if the user is really really sure it's OK.
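
The "run in serial" part of step 4 can be sketched roughly as below. This is only an illustration under stated assumptions, not the UWC's actual code: the class and method names (ConverterPipelineSketch, regex, convert, twikiHeadings) are made up here, and the two heading patterns are written for java.util.regex rather than for the ORO Perl engine, so they differ from the ones in the properties file.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;

/** Hypothetical sketch: runs an ordered group of converters over one page. */
final class ConverterPipelineSketch {

    /** Builds a converter that applies one regex replacement to the whole page. */
    static UnaryOperator<String> regex(String pattern, String replacement) {
        Pattern compiled = Pattern.compile(pattern, Pattern.MULTILINE);
        return text -> compiled.matcher(text).replaceAll(replacement);
    }

    /** Applies every converter in insertion order, feeding each result to the next. */
    static String convert(String page, Map<String, UnaryOperator<String>> converters) {
        String result = page;
        for (UnaryOperator<String> converter : converters.values()) {
            result = converter.apply(result);
        }
        return result;
    }

    /** Illustrative TWiki heading rules (assumed patterns, not the shipped ones). */
    static Map<String, UnaryOperator<String>> twikiHeadings() {
        Map<String, UnaryOperator<String>> group = new LinkedHashMap<>();
        // "---++ Title" becomes "h2. Title"; the lookahead rejects deeper headings.
        group.put("TWiki.h2", regex("^---\\+\\+(?!\\+)[ \\t]*(.*)$", "h2. $1"));
        // "---+ Title" becomes "h1. Title".
        group.put("TWiki.h1", regex("^---\\+(?!\\+)[ \\t]*(.*)$", "h1. $1"));
        return group;
    }

    public static void main(String[] args) {
        String page = "---+ Title\n---++ Section\nBody text";
        System.out.println(convert(page, twikiHeadings()));
    }
}
```

Keeping the group in an ordered map is one way to let a user drop a misbehaving converter by name, or add an extra one, without touching the rest.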

Mini-FAQ:

What is a Converter?
It's a key=value pair which matches up a regular expression engine with a regular expression. All the origin wiki pages you select are then run through this regular expression when you click 'Send to Confluence'.

Each of these lines is a converter.

Nomenclature is very important here and I'm just ironing this out. I realize I need to do a little refactoring of the Java code to match these definitions.

What exactly does each part of a converter line mean in the user interface?

TWiki.h2.perl=s/^---++(\[^+\])/h2. $1/
^^^^^ ^^ ^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^
group
      ^^
      key
         ^^^^
         Converter engine to use - something implementing the Converter interface
              ^^^^^^^^^^^^^^^^^^^^^^^^
              regular expression used by Converter engine

How do I tell the Universal Wiki Converter which converter engine to use with a regular expression?
That's the last part of the 'converter key' -> It's the .perl part in the example above, or it could be .java-regex or .my-custom-thing or whatever. Just so long as it references something which implements the Converter interface.
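
To make the parts of a converter line concrete, here is a minimal sketch of how such a line could be split into group, key, engine and expression. It assumes the layout group.key.engine=expression described above; the class name ConverterLineSketch and its fields are hypothetical and are not taken from the UWC source.

```java
/** Hypothetical sketch: splits one converter line into the parts named above. */
final class ConverterLineSketch {
    final String group;      // e.g. "TWiki"
    final String key;        // e.g. "h2"
    final String engine;     // e.g. "perl"
    final String expression; // everything after the first '='

    private ConverterLineSketch(String group, String key, String engine, String expression) {
        this.group = group;
        this.key = key;
        this.engine = engine;
        this.expression = expression;
    }

    static ConverterLineSketch parse(String line) {
        // Split on the first '=' only, since the expression itself may contain '='.
        int equals = line.indexOf('=');
        if (equals < 0) {
            throw new IllegalArgumentException("Missing '=' in: " + line);
        }
        String name = line.substring(0, equals);
        String expression = line.substring(equals + 1);

        // The group is the first dotted segment and the engine is the last one.
        int firstDot = name.indexOf('.');
        int lastDot = name.lastIndexOf('.');
        if (firstDot < 0 || firstDot == lastDot) {
            throw new IllegalArgumentException("Expected group.key.engine in: " + line);
        }
        return new ConverterLineSketch(
                name.substring(0, firstDot),
                name.substring(firstDot + 1, lastDot),
                name.substring(lastDot + 1),
                expression);
    }

    public static void main(String[] args) {
        ConverterLineSketch c = parse("TWiki.h2.perl=s/^---++(\\[^+\\])/h2. $1/");
        System.out.println(c.group + " | " + c.key + " | " + c.engine + " | " + c.expression);
    }
}
```

The engine segment is what would be used to look up something implementing the Converter interface; that lookup is left out here because its real shape is not documented on this page.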

What is a converter group?
This is meant to be a group of converters which logically represent how to convert a whole wiki....say TWiki.

Why bother at all with individual converters? Why not just hide those?
Because the beauty of this is that you can grab a whole pre-defined set of converters, but if one little thing is missing you can add that, or get rid of a piece that doesn't work well for you.

How was the UI built?
Using IDEA's Form Builder thing.

I know this is kind of lame if it prohibits anyone without IDEA (i.e. an Eclipse master) who wants to update the UI, but if that's the case I'll generate the Java code and check it in. However if you do have IDEA it should be ultra super easy for now. Unless someone without IDEA wants to do a major UI overhaul I'd like to leave that piece in IDEA Form Builder.

How will you handle attachments in a generalized way?
Haven't really thought that one through yet. Suggestions?

Ultimately in some cases it might be possible to create a whole new conversion process for a type of wiki without writing any code other than regular expressions.

The UWC readme.txt

 
  1. Feb 09, 2006

    Guy Fraser says:

    Would it not be better to have a "pluginable" system whereby definitions and converters for various wikis can be developed and easily added?

    Each converter could convert between Wiki format X and a "verbose" general data format. That way you could convert between any wiki quickly and easily simply by creating a converter plugin.

    You could also have "connector" plugins that would automate common tasks such as getting or adding pages, etc. Each connector would provide a standard API as far as the app is concerned, allowing standardised controls within the app such as "list pages" and "add page", etc. The connector would simply change the request in to the relevant calls to the target wiki.

    This could also be used as a simple mechanism to clone wiki's as well - eg. you could use the same input and output converter and connector to duplicate information in a Confluence space to several different Confluence installations.

    Converters and connectors could later be developed to allow MS office documents to be treated as if there were a wiki, etc.

    1. Feb 09, 2006

      Guy Fraser says:

      Something like this: !wikiconv.gif align=center!

      1. Feb 09, 2006

        Brendan Patterson says:

        Neat diagram! What tool did you use to make that?

        1. Feb 09, 2006

          Guy Fraser says:

          Microsoft Visio - not the most graphically appealing tool but great for throwing things like that together.

          1. Feb 09, 2006

            Brendan Patterson says:

            Ahh. That's what I used for mine, but yours is so much more professional looking. Maybe I just need to learn to use it.

  2. Feb 09, 2006

    Brendan Patterson says:

    Hi Guy,

    Those are a lot of great ideas. Wow, you added your comments before I was even done editing this page. Thanks for the insights.

    I need to add the word 'pluggable' or pluginable. This system does allow pluggable regular expression engines, pluggable Java converters or the converters can simply be regular expression strings. That is definitely part of the goal. The Converters are picked up at runtime, not compile time....so you can add them on the fly if you want to.

    Another primary goal is that non-programmer types should be able to come and use this to convert their own wiki once it already has the converters needed for that wiki.

    The 'verbose general data format' might as well be Confluence's wiki markup. If you want to write converters to move from Confluence to another wiki....well that works right now. It generates all the pages as text files, it just doesn't insert them automatically into the 'other' wiki.

    I think it's important to keep in mind too that this tool is something a group will probably only ever use one time. It will be very important that one time, but probably never used again. It's very much a one-shot tool.

    Your idea about treating MS Office documents as if they were wiki pages you're importing is really fascinating. That I had not thought of. The fact is that if you implemented the Converter interface to convert an MS Office doc to Confluence wiki markup that would work right now.

    Now time to actually finish editing this page.

    1. Feb 09, 2006

      Guy Fraser says:

      I like the idea of the Wizard you mentioned in your video - that would certainly make such a tool far more accessible.

      I'd ideally like something where you could:

      • Connect to one or more wiki's by clicking an "Add" button, choose the wiki (or file format, etc) and then specify any additional details that might be required (eg. login details)
      • Build a task list from a pre-defined list of available tasks and then define the source(s) and target(s) of the tasks (or the whole task list might be easier).
      • Save it as a profile.

      That way people could start contributing connectors/converters (they'd probably be bundled in the same plugin) and even saved profiles such as "this will convert the contents of TikiWiki to Confluence" - the result is that end-users would be able to download the latest package (including converters) and then grab the relevant profile they want to use.

      It would be pretty neat if the UWC could be driven from the command line or an API - you could then automate tasks. For example: when a new Word document appears in a certain folder, a batch process could kick in and automatically send it in to Confluence. When the Confluence version of the document is changed, an event listener could send the changes back to the Word document on disk.

      If the app could be a server-side action, you could even provide a pluggable exporter to confluence pages - eg. Export as Word, PDF, x wiki, y wiki, z file format, etc.

  3. Feb 09, 2006

    Eric Sorenson says:

    Hi Brendan, nice work! I guess this is the "next generation" wiki importer that Jonathan Nolan mentioned on the other Wiki Importer page. As I mention over there, I'm right in the midst of converting our TWiki installation to Confluence, so if this is where the importer development energy is headed, I'm ready to pitch in. I've just gotten it running under Linux; here's a screenshot:

     |linux-uwc.jpg|

    I have fixed the build_uwc.sh and run_uwc.sh scripts; I guess I should get an SVN account to upload them? Next I'll be working to get TWiki imports going. TWiki already stores its pages as individual files, so it should be straightforward to convert them... mostly.

    1. Feb 09, 2006

      Brendan Patterson says:

      Wohoo! Most excellent! Yes please grab an SVN account so you can commit your changes back into the repository. Email Jonathan?

      The regular expressions in there already were actually written by Laura Kolker for Perl to convert individual TikiWiki pages. But I think they need to be tweaked for this engine.

      I also need to document the Converter string format, and generally add more documents. I guess I wasn't thinking anyone would see this right away since nothing has been posted yet... little did I know eyes are everywhere.

      1. Feb 09, 2006

        Eric Sorenson says:

        OK I've sent in my request for an account.

        Am I right in divining that the Perl regex flavor used is from Jakarta-ORO? I found this document which suggests that although the _syntax_ is familiar to Perl weens (and I include myself in this group), the underlying _engine_ may differ in implementation, so some of the more esoteric perl-flavor regex tricks may not work.  Have to watch out for that.

        It's very nice that it uses a runtime list from the properties file. It doesn't look like it ought to, but does anything inside the code need to change to enable a new namespace in converter.properties? It starts out:

        TWiki.h2.perl=s/^---++(\[^+\])/h2. $1/
        TWiki.h1.perl=s/^---+(\[^+\])/h1. $1/
        ...
        1. Feb 09, 2006

          Brendan Patterson says:

          Thanks for asking these questions....this falls under the category...."stuff in my head that I really should have written down and probably haven't yet".

          Perl is the regex flavor used from Jakarta-ORO. However ANY regex engine can be added by just implementing the Converter.java interface (umm and currently adding a couple lines of code to the ConverterEngine....but that will get fixed).

          You are right that you can't just copy and paste Perl regular expressions into this ORO engine. That's what I tried to do.

          So if you find a better regex engine that lets you do replacements with a single s/search/replace/global type of string (or any kind of one line expression) please let me know and I'll try to add it or tell you more specifically how to do so.

          And you are also correct that nothing need be added to start a new namespace. The namespace will soon get used by the 'Converter Group' button so that you'll just be able to add all the expressions for say TWiki with a single click.

          More about the converter key below.

          Mini-FAQ:

          What is a converter in the UI?
          It's a key=value pair which matches up a regular expression engine with a regular expression. All the origin wiki pages you select are then run through this regular expression when you click 'Send to Confluence'.

          Each of these lines is a converter.

          Nomenclature is very important here and I'm just ironing this out. I realize I need to do a little refactoring of the Java code to match these definitions.

          • What exactly does each part of a converter line mean in the user interface?
            TWiki.h2.perl=s/^---++(\[^+\])/h2. $1/
            ^^^^^ ^^ ^^^^ ^^^^^^^^^^^^^^^^^^^^^^^
            group 
                  ^^
                  key - a good key would be the shortest possible description of what this converter does or 'fixes' 
                     ^^^^ 
                     Converter engine to use - something implementing the Converter interface
                          ^^^^^^^^^^^^^^^^^^^^^^^^
                          regular expression used by Converter engine

            How do I tell the Universal Wiki Converter which converter engine to use with a regular expression?
            That's the last part of the 'converter key' -> It's the .perl part in the example above, or it could be .java-regex or .my-custom-thing or whatever. Just so long as it references something which implements the Converter interface.

          What is a converter group?
          This is meant to be a group of converters which logically represent how to convert a whole wiki....say TWiki. So basically it's a whole bunch of regular expressions (or other custom things) that together convert a page of text to the Confluence format (or whatever format is your goal).

          Why bother at all with individual converters? Why not just hide those?
          Because the beauty of this is that you can grab a whole pre-defined set of converters, but if one little thing is missing you can add that, or get rid of a piece that doesn't work well for you. If people have customized their TWiki or TikiWikis or MediaWikis or whatever then they can add a couple more converters to get those bits converted.

          How was the UI built?
          Using IDEA's Form Builder thing.

          I also should mention that if you want to mess with the UI it uses IDEA's Form Builder currently. I know this is kind of lame if it prohibits anyone without IDEA (an Eclipse master) who wants to update the UI, but if that's the case I'll generate the Java code and check it in. However if you do have IDEA it should be ultra super easy for now. Unless someone without IDEA wants to do a major UI overhaul I'd like to leave that piece in IDEA Form Builder.

  4. Mar 17, 2006

    Rolf Staflin says:

    Hi!

    I'm in the process of converting a site from DokuWiki to Confluence. In order to do that I will need to add some functionality to the UWC and thought I'd discuss my ideas before I start:

    File selection

    The site I'm converting has tons of files in maybe a hundred folders. Selecting them a folder at a time would be a pain to say the least. I need to add some way of selecting the top folder, providing a file name filter and then having UWC traverse the file tree and add all the files matching the filter.

    Attachments

    I need to be able to attach images and other files to the pages, and this functionality doesn't seem to be in the UWC yet. For images, I'm thinking of writing a custom converter class that can run after all image tags have been converted to Confluence markup (since that would make it reusable by others).

    I will need a new page class that would hold page title, contents and attachments, and I will rewrite RemotePageWriter so that it uses the new page class. The image converter would pick out the image names from the converted page, somehow retrieve the image file and add it to the page object. When RemotePageWriter uploads a page it can then upload the attachments as well.

    What do you think? Are there better ways to do this? Am I on the right track?

    1. Apr 04, 2006

      Rolf Staflin says:

      Well, since no one has replied in the two weeks since I posted the above I'm going ahead.

      I've just checked in some changes that allow you to select directories along with files in the "Include Wiki Pages" file selection dialog. If you add a directory, all files in it will be converted recursively. You can specify a file name filter in the settings dialog to control which files are converted. DokuWiki uses regular text files, so I use the pattern ".txt". The page file list shows the directory, not the files, so that it's easy to remove the directory if it's selected by mistake.

      Incidentally, I have not had any SOAP problems yet. I run UWC on the same machine as Confluence (Win XP SP2) and I've converted 464 files in one go without any errors.

      Oh, and I got rid of the "Cancel" option in the send confirmation dialog.
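
The recursive directory selection with a file name filter described in this comment could look roughly like the sketch below. It is only an illustration using the standard java.nio.file API, not the code that was checked in; the class name PageFileCollectorSketch and the suffix-style filter are assumptions.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical sketch: collects page files under a directory, recursively. */
final class PageFileCollectorSketch {

    /** Returns every regular file below root whose name ends with the given suffix. */
    static List<Path> collect(Path root, String suffix) throws IOException {
        // Files.walk visits the whole tree; the stream must be closed after use.
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith(suffix))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Directory to scan; defaults to the current directory.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        collect(root, ".txt").forEach(System.out::println);
    }
}
```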

      1. Apr 04, 2006

        Brendan Patterson says:

        Rolf, that is amazing stuff! Nice work! Thanks for all of it. Sorry for the delayed response... I'd been on vacation for a couple of weeks and was playing catch-up.

        I'm glad to hear that you've had no SOAP issues. Maybe I was doing something wrong or maybe the recent version of Confluence is more stable. I was planning on trying XML-RPC, but will rerun my SOAP tests.

        Your idea to write a converter to insert attachments is very insightful and shows that you understand this wacky design (doh! why didn't I think of that?). I'm going to pursue that strategy as well.

        I think for attachments we may also want to add something to the UI that points to where they are stored. The tricky part is that each wiki I've seen handles attachments differently. But that only means each 'attachment converter' will just need to be tailored to that wiki. No big deal.

  5. Mar 19, 2006

    Matt Ryall says:

    There's an outstanding feature request to develop an abstract syntax tree to represent Confluence documents. That is, Confluence wiki markup would be parsed into a series of objects representing the document, and from there converted into HTML/PDF/etc. We would need an HTML parser as well, to support the Confluence RTE.

    http://jira.atlassian.com/browse/CONF-5430

    If we did this, I imagine the universal converter could use these objects as an intermediate format, avoiding the complex and buggy bunch of regular expressions which wiki text formatters typically use. I thought contributors here might be interested in voting for this enhancement.

    1. Apr 10, 2006

      Brendan Patterson says:

      Thanks Matt. That would be a nice feature and I voted for it too. The XMLRPC facilities to import pages and attachments and create spaces are turning out to be pretty good, but at some point perhaps such a syntax tree would make importing other non-wiki structures much easier also.

  6. Aug 04, 2006

    Todd Chapin says:

    This tool is a really great idea. I'm psyched to have found Confluence, as I want our company to use and support one wiki platform. Currently, we're on a mix of TWiki and MediaWiki.

    So, I tried it out last night, and one thing that might be helpful for people who are not wholly familiar with Confluence: I had no idea what "space key to insert text" meant. Then, when I figured that out, after a few different attempts at uploading, I realized that I had to precreate the space on Confluence. So, all is well on that front. My data is now on Confluence.

    This test is a migration from TWiki to Confluence (for some background). Now, I find that all of my page titles have been renamed w/ part of the Space / Web name in the page title, but none of the links in the pages were modified in the same way. Why did the TWiki converter do that? Is there a problem with pages in different spaces having the same names? Which filter do I tweak to either stop the converter from modifying the page titles (or I can just do that manually once all of the pages are processed, but not uploaded) (and changing the page titles is my preferred solution), or have the links within the pages all be renamed?

    Thanks.

    Todd

    1. Aug 04, 2006

      Brendan Patterson says:

      Todd thanks for the info.

      I just improved the descriptions in the tool to hopefully help clarify the space key. That'll be in version 23 (no fancy numbering system here).

      As for the links to different TWiki webs.....I haven't really tried to handle that yet. The customer I wrote this for just had one web, though handling different webs will probably be desired by many people.

      Can you give me some more specifics? What is happening to the titles? And what do you want?

      I'm not sure what is happening to your page titles, unfortunately, because I haven't played much with multiple webs. The page titles are created from the file names, so it doesn't occur to me why they'd be getting changed.

      But the good news is that messing with the page titles will be easier than messing with the link converters as those are fairly numerous and mildly complicated.