DITA to WordPress Import Tool

This ‘plugin’ is a DITA to WordPress importer. Specifically, it is a WordPress import module which takes the two-pane ‘Web Help’ output from the DITA Open Toolkit and imports the hierarchy of XHTML pages into WordPress. It imports images too, though not as WordPress attachments.

This tool was written as part of an online help project in my last job. As an add-on to WordPress to be distributed to customers, it was licensed under the GNU GPL Version 2 with the explicit understanding of my employers.

I have retained their copyright notice, as it was written for them, though the concept, ideas and implementation are all mine.


There is also a zip file for you to download containing the sample DITA web help files that come with the DITA-OT.

Feedback is welcome. Please use the comment box below.


Here are the contents of the readme, almost verbatim:

 

It was written to import the XHTML output of the DITA[1] Open Toolkit[2], a tool which takes XML topics in DITA format and converts them to a number of formats, including PDF, Win Help, and XHTML. It grabs what it needs from the body tag of each page.

It is very rough and specific to the in-house requirements of Northgate (my last company). It also works on WPMU.

It uses PHP 5’s XML manipulation, and at least one part requires MySQL 5 (for sub-selects). It also has some quirky stuff in it. For instance, importing 1200 files in one go on Windows always used to time out (the PHP timeout calculation on Windows uses wall-clock time, not CPU time), so it can be restarted and it will process from where it left off.

I mentioned it and DITA over in this post [3] on WP-Docs as part of this conversation [4].

It expects the XHTML output from producing ‘web help’ with DITA-OT 1.4. This is a hierarchical tree of XHTML files with a top-level, two-pane frame index file: a table of contents in one pane and the help topics in the other.

It imports those help topics, grabbing the contents of the body tag and doing some manipulation to get everything to work in WordPress as well as satisfy the original requirements.

It uses a staging table (automatically created) and can be re-run to update the same topics (if you regenerate them). It can also be re-run to continue processing if there is a failure half way through.
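As a rough illustration of how a staging table makes an import both repeatable and resumable, here is a minimal sketch in Python using SQLite. It is not the importer's own code (which is PHP against MySQL); the table layout, the `done` flag and the function names are all assumptions made for the example.

```python
import sqlite3


def open_staging(conn):
    # Hypothetical schema: one row per topic file, keyed by its slug.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS staging ("
        " slug TEXT PRIMARY KEY,"
        " title TEXT NOT NULL,"
        " body TEXT NOT NULL,"
        " done INTEGER NOT NULL DEFAULT 0)"
    )


def stage_topic(conn, slug, title, body):
    # Re-staging a regenerated topic replaces its row and clears the done flag.
    conn.execute(
        "INSERT OR REPLACE INTO staging (slug, title, body, done)"
        " VALUES (?, ?, ?, 0)",
        (slug, title, body),
    )


def process_pending(conn, handler, limit):
    # Handle up to `limit` unfinished rows, marking each one done as it
    # completes, so a later run continues from where this one stopped.
    rows = conn.execute(
        "SELECT slug, title, body FROM staging"
        " WHERE done = 0 ORDER BY slug LIMIT ?",
        (limit,),
    ).fetchall()
    for slug, title, body in rows:
        handler(slug, title, body)
        conn.execute("UPDATE staging SET done = 1 WHERE slug = ?", (slug,))
        conn.commit()
    return len(rows)
```

Because each row is committed as soon as it is handled, a run that dies part-way leaves the remaining rows flagged as unfinished, and calling `process_pending` again picks them up.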

Basic processing is as follows:

You supply the path to the top of the DITA output tree (where index.html is generated). If under WPMU, you also supply the blog into which you want to import.

It then loads all the files it can find (explicitly ignoring index.html) into the staging table.
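The file discovery step can be sketched as a recursive directory walk. The following Python is only an illustration of the idea, not the PHP the importer uses; the function name is hypothetical, and it assumes that only the top-level index.html is skipped and that topic files end in .html or .htm.

```python
import os


def find_topic_files(root):
    """Return paths (relative to root) of the HTML files below root,
    skipping the top-level index.html frameset."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # walk subdirectories in a stable order
        for name in sorted(filenames):
            if not name.lower().endswith((".html", ".htm")):
                continue
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            if rel == "index.html":
                continue  # assumption: only the top-level index is ignored
            found.append(rel.replace(os.sep, "/"))
    return found
```

Returning paths relative to the root keeps the directory structure available for later steps such as link rewriting and working out each page's parent.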

The load process does the following:
* it converts the paths of any links to other files
* it strips out the empty anchor tags that DITA-OT generates (it adds ids *and* empty anchors as fragment targets!)
* it takes the meta tag ‘description’ and uses that as the excerpt
* it takes the meta tag ‘keywords’ and pops the keywords into a page meta tag
* it looks for some specific internal meta tags and saves those (deleted, replaced-by, and prodname)
* it then finds image references, copies the images to the blog directory and adjusts the paths in the HTML (for WPMU it puts them in the correct blog files directory, for standard WP it puts them in the blog root!)
* it then removes some DITA-OT specific stuff we didn’t want (the short description for related links – though it leaves the links)
* it finds a specific span that ought to be a heading and turns it into one (h3)
* it finds the parent of the page (if there is one) and stores it so the hierarchy will work
* it extracts the cleaned body contents ready to be the page contents
* it grabs the HTML page title and uses that as the WP page title
* it uses the DITA id as the slug for the page
* we also had a requirement that the DITA id of the page match the HTML filename; I’ve made that optional, in which case it uses the filename as the slug
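A few of the steps above (title, description, keywords, empty anchors, body contents and slug) can be sketched in Python with the standard XML parser. This is an illustration only, not the importer's PHP. It assumes that the page parses as XML without undeclared entities, that the DITA id appears as the `id` attribute of the body tag, and that an "empty anchor" is an `a` element with no `href`, no text and no children. All names are hypothetical.

```python
import xml.etree.ElementTree as ET


def _drop_keep_tail(parent, child):
    # Remove child from parent but keep any text that followed it.
    siblings = list(parent)
    idx = siblings.index(child)
    tail = child.tail or ""
    if idx == 0:
        parent.text = (parent.text or "") + tail
    else:
        prev = siblings[idx - 1]
        prev.tail = (prev.tail or "") + tail
    parent.remove(child)


def extract_topic(xhtml, filename, slug_from_filename=False):
    root = ET.fromstring(xhtml)
    for el in root.iter():
        el.tag = el.tag.rsplit("}", 1)[-1]  # drop any XHTML namespace
    meta = {m.get("name", "").lower(): m.get("content", "")
            for m in root.iter("meta")}
    body = root.find("body")
    # Strip empty anchors that only exist as fragment targets.
    for parent in list(body.iter()):
        for child in list(parent):
            empty = not (child.text or "").strip() and len(child) == 0
            if child.tag == "a" and "href" not in child.attrib and empty:
                _drop_keep_tail(parent, child)
    inner = (body.text or "") + "".join(
        ET.tostring(c, encoding="unicode") for c in body)
    stem = filename.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    return {
        "title": (root.findtext("head/title") or "").strip(),
        "excerpt": meta.get("description", ""),
        "keywords": meta.get("keywords", ""),
        "body": inner,
        # Assumption: the DITA id is the id attribute on <body>.
        "slug": stem if slug_from_filename else body.get("id", stem),
    }


SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><head>
<meta name="description" content="How to change the oil."/>
<meta name="keywords" content="oil, garage"/>
<title>Changing the oil</title></head>
<body id="changingtheoil"><a name="changingtheoil"></a><h1>Changing the oil</h1><p><a name="step1"/>Drain it.</p></body></html>"""

topic = extract_topic(SAMPLE, "tasks/changingtheoil.html")
```

Link rewriting, image copying and the span-to-heading conversion are left out to keep the sketch short; they would be further passes over the same parsed tree.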

The next step of the import looks to see which, if any, of the imported pages are updates to existing ones (the id/filename will match an existing slug). For those it does an update rather than an insert, and it records any updated pages that had comments so they can be squirted into a post about updates (an internal requirement).
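The update-or-insert decision comes down to matching staged slugs against existing ones. A minimal Python sketch follows, assuming the existing pages are available as a mapping from slug to comment count; the function name and data shapes are hypothetical, and the real importer does this with SQL against the staging table.

```python
def plan_import(staged_slugs, existing_comment_counts):
    """Split staged slugs into updates and inserts, and list the updated
    pages that already have comments (to mention in an 'updates' post)."""
    updates, inserts, announce = [], [], []
    for slug in staged_slugs:
        if slug in existing_comment_counts:
            updates.append(slug)
            if existing_comment_counts[slug] > 0:
                announce.append(slug)
        else:
            inserts.append(slug)
    return updates, inserts, announce


updates, inserts, announce = plan_import(
    ["intro", "setup", "faq"],
    {"intro": 0, "setup": 3},
)
```

Here `intro` and `setup` would be updated, `faq` inserted, and only `setup` flagged because it already has comments.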

Then it will process those updates. By the way, the WP revision stuff works — it will create a new revision for each time you update the page.

Next it inserts the new pages.

Then it has to flush the rewrite rules. We had great problems with internal links and rewrite rules – so there is probably a bit of belt and braces stuff going on.

Next it revisits all the pages resolving the parents correctly — so that the hierarchy is created properly. And then it has to flush the rewrite rules again (the paths have changed).
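Parents have to be resolved in a second pass because a page's parent may not have a WordPress ID until every page has been inserted. A small Python sketch of that second pass, with hypothetical names and data shapes:

```python
def resolve_parents(page_ids, parent_slugs):
    """page_ids: slug -> page ID assigned at insert time.
    parent_slugs: slug -> parent slug (or None for a top-level page).
    Returns page ID -> parent page ID."""
    resolved = {}
    for slug, page_id in page_ids.items():
        parent_slug = parent_slugs.get(slug)
        # 0 is what WordPress stores in post_parent for a top-level page.
        resolved[page_id] = page_ids.get(parent_slug, 0)
    return resolved


parents = resolve_parents(
    {"guide": 10, "install": 11, "linux": 12},
    {"guide": None, "install": "guide", "linux": "install"},
)
```

With the sample data, page 11 gets parent 10, page 12 gets parent 11, and page 10 stays at the top level.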

Finally it calls update_guids: more belt and braces.

There is an option to empty your posts table before importing. You would not normally want to do that! There is another option to delete the pages which are still referenced in the staging table: a clean-up step after a failed import that is less drastic than emptying everything. Hopefully it is not needed now.

In the source file itself there are a couple more settings you can adjust. There are two different debug levels: set $debug and/or $debug_extra to true.

The loop size (how many records to process at once) is adjustable (default 75).
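Processing in fixed-size loops can be pictured as simple chunking. The sketch below is Python for illustration only; the generator name is hypothetical, and the default of 75 is the one mentioned above.

```python
def batches(items, size=75):
    """Yield successive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


sizes = [len(chunk) for chunk in batches(list(range(200)))]
```

A smaller loop size means each pass finishes sooner, which matters when the whole run is at risk of hitting a script timeout.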

And there is even an option to import posts instead of pages — this is experimental and probably wouldn’t work. For instance it needs to detect category meta tags in the XHTML and add them to the post. The add to post code is half there.

[1] dita.xml.org/book/getting-started
[2] dita.xml.org/wiki/the-dita-open-toolkit
[3] comox.textdrive.com/pipermail/wp-docs/2009-January/001890.html
[4] comox.textdrive.com/pipermail/wp-docs/2009-January/001862.html

== Installation ==

Copy this file to the wp-admin/import folder. It is not a plugin.

16 thoughts on “DITA to WordPress Import Tool”

  1. twincascos said:

    I’ve just tested this with the test site and as you probably know it worked quite well. Do you have plans to add any functionality to the plugin? I’m not sure what I would have to do to prepare a regular html site for this to work. The dita.list file with the demo site for example, how do I create one? is it needed? The demo site also is a frame set with toc, that’s easy enough to create for an html site in prep for import, but again, is it needed to be a frame set?
    Also the dita.list file refs: user.input.dir=/home/mike/Desktop/DITA/DITA-OT1.4/samples
    which I guess isn’t needed.
    Lastly, is there any way that you could have the images moved to the wp upload dir, maybe all into a new folder called dita_import?

    • mike said:

      Hi.
      I’m glad you were able to get the importer working.
      The dita.list file is simply an artifact of the DITA process; it is not used. Similarly, the index.html and toc.html files are generated by the DITA process to create a complete (if plain) web help system; they are not used by the importer either.
      The importer very specifically expects the XHTML output of the DITA process. The XHTML files must be valid as XML files, which is why I specify XHTML, and I expect the URLs between pages must be relative ones. Otherwise I don’t think there is anything special. If things like the short description are not present, it will probably just carry on.

      As for the images, it would be best if they were added as WordPress attachments, but so far I haven’t looked into that. I may have a look later today at making it copy them to wp-content/uploads as a starting point.

  2. Pingback: Merging Worlds: DITA and WordPress | I'd Rather Be Writing - Tom Johnson

  3. chrys said:

    Thanks. Haven’t taken it out of the package you provided – but, I’m anxious to watch your magic.

  4. Bob Kauten said:

    Hi Mike,
    You’ve done a great job in creating a DITA-import utility. I watched it work in Tom Johnson’s video.
    I’m learning to use DITA, and your work is a very valuable tool.
    I’m wondering if there is sensitivity to the path of the web sample which you included. I’ve found that ditahelp.php can find the web sample folder if I place it in wp-content. In that case, I type ../wp-content/web in the DITA help directory text field. ditahelp.php does find the directory, but it hangs:
    * Clearing staging table.
    * Processing directory: ../wp-content/web

    Should I place the web sample directory elsewhere? I’m using Yahoo webhosting. ditahelp.php doesn’t seem to recognize the DITA help directory path if I place it farther away. I don’t have access to the actual path on the yahoo server, which is why I expressed the path in relative terms.

    Thank you for any advice you can offer. Again, great work.
    Bob Kauten

    • mike said:

      Hi Bob,
      I saw your other comment about PHP versions being the problem. But you do raise an important point.
      The path in which you upload your DITA output must be accessible to the user or process the web server runs as. For shared hosting that pretty much means a publicly visible web directory, e.g. wp-content.
      But if you have more control or options, e.g. a dedicated server or a local or internal server, then you can be more flexible.

  5. Bob Kauten said:

    Hi Mike,
    I found the script.log for ditahelp.php, and it says:

    “PHP Fatal error: Cannot instantiate non-existent class: recursivedirectoryiterator in /blog/wp-admin/import/ditahelp.php on line 360”

    The RecursiveDirectoryIterator class was introduced in PHP 5.0, and Yahoo webhosting is running PHP Version 4.3.11. I believe that explains the difficulty.

    Thanks,
    Bob Kauten

    • mike said:

      Hi Bob,
      It does mention in the readme and the page above that it uses PHP 5. I’m afraid it will only work with that version. Though it is possible to do the directory walking a different way in PHP 4, the XML parsing stuff won’t work, so there’s no point.

  6. Lisa Dyer said:

    Hi Mike,

    Congratulations on creating a great tool for the DITA community! If you’re interested in distributing your DITA2Wordpress tool (or at least having it listed as a resource) under the open-source DITA2Wiki Project on SourceForge, please contact me at lisa dot dyer at lombardi dot com.

    sourceforge.net/projects/dita2wiki/

    The DITA2Wiki Project promotes best practices and tools for marrying DITA with wikis and other Web 2.0 apps. Distributions currently include the DITA2Confluence tool (which bypasses the DITA OT to generate wiki output directly from DITA).

    Cheers,

    - lisa

  7. bal said:

    Hi,

    I’m using this DITA importer tool. Every step matches the demo video, but in the end the content does not appear in my blog. How do I rectify this?

    Thanks in advance
    Bal

    • mike said:

      Hi Bal,
      The import tool creates pages. Do you have a theme which lists pages? Try switching to the default theme to see whether they appear. They should be in the sidebar on the home page.

  8. Pingback: dita2wordpress – Import Tool installieren und anwenden » Ditalog 0.1

  9. Pingback: Wordpress feat. DITA: Publication d'articles à partir de sources XML | Docster

  10. Pingback: What are you doing with DITA? | DITA Chicks Blog

  11. Wim Hooghwinkel said:

    Hi Mike,

    I downloaded and tested your tool and it works as expected. I was wondering if there are any new developments or new insights in using this tool?

    • Mike Little said:

      Hi Wim,
      No, I haven’t worked on it for years. I don’t work with DITA any more, so the need is not there.
