Office Importer

Last modified by Vincent Massol on 2024/02/26 17:58

 XWiki
 Feature
 Completed
 

Description

This document appears to be outdated and will be updated as time permits.

This page holds all the information about my Google Summer of Code project, Office Converter. The goal is to create an XWiki plugin that converts office documents (such as MS Word doc, MS Excel, and OpenOffice odt) to XWiki syntax and inserts the result into an XWiki page. As an intermediate step, the office document is first converted to clean, tidy HTML; an XHTML parser then converts that HTML to XWiki syntax. The project is unfinished and unstable, and this page is for development. The target release depends on XE 1.6, which has not been released yet. To try the newest version, build the latest XWiki from SVN, download the officeconverter for XE 1.6, and follow the README in the zip file.

Proposal

Introduction

  • use the OpenOffice runtime as a server to convert documents to HTML
  • clean the HTML
  • parse the HTML into XWiki syntax
  • integrate these features into XWiki; see Office Importer below

Integration mock-up

The features below are usable only when the Office Converter plugin is installed. After discussion with Vincent, we decided that the office converter will be integrated as a plugin plus an application. That is:

  • an XWiki plugin for converting office documents to other formats, such as PDF, HTML, and XWiki syntax
  • an office import application that lets users import an office document into an XWiki page

The office import application should look like this:

OfficeImporterPage.png

The two features below will not be supported in this release, because they depend on other XWiki modules and I do not have enough time before the GSoC deadline. I will discuss them on the dev list and implement them in the future.

Current State

Features

  • convert an office document to HTML and save the HTML to an XWiki page
  • handle XWiki syntax in the HTML content and escape special characters in the HTML content
  • supported document types: doc, xls, ppt, odt, odp, ods
  • convert ppt and odp files to a zip file and display the zip in an iframe in an XWiki page
  • handle the images in an office document: pictures are uploaded to the XWiki page as attachments
  • integrate into XWiki as an XWiki plugin
  • provide an XWiki application for importing office documents, in which the user can select convert2html or convert2xwikisyntax
  • an unfinished convert2xwikisyntax feature, to be finished in the next version

Quick Start

Install

xwiki.plugins=[...], com.xpn.xwiki.plugin.officeconverter.OfficeConverterPlugin

Start Server

  • Start XWiki as you always do.
  • Start OpenOffice as a server on the computer.
    • On Windows this is a little complicated; please refer to http://www.artofsolving.com/node/11 for details.
    • Alternatively, find the soffice executable (often in c:/program files/openoffice-2.3), go to that directory on the command line, and run:
soffice -headless -accept="socket,host=127.0.0.1,port=8100;urp;" -nofirststartwizard

    • On Linux, the simplest way is to start it from the command line with the following options:
soffice -headless -accept="socket,host=127.0.0.1,port=8100;urp;" -nofirststartwizard
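
Before running a conversion it can help to confirm that something is actually listening on the port given in the -accept option. The following is a minimal sketch in Python, not part of the plugin; the function name is hypothetical and the host and port defaults are taken from the commands above. It only checks that a TCP connection can be opened, not that the listener is really OpenOffice.

```python
import socket


def office_server_is_listening(host="127.0.0.1", port=8100, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper: it only proves that some process accepts
    connections on that port, not that it speaks the OpenOffice
    UNO remote protocol.
    """
    try:
        # create_connection raises OSError (for example, connection
        # refused or timed out) when nothing is listening.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling office_server_is_listening() with no arguments checks 127.0.0.1:8100, the address used in the soffice commands above.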

Use the plugin in xwiki

  • Import the office import application into XWiki.
  • Go to Import.WebHome to convert an office document.
  • Select the source file and enter the space and page name of the target XWiki page.
  • Select convert2xhtml or convert2xwiki.
  • Click the "convert" button.
  • On success, click the "result" link to see the new page.

Warning

  • The source file should have a normal filename with the correct extension.
  • The target XWiki page should not already exist. Otherwise, you will be shown "not allowed to view the page".
  • If you do not have edit rights on the target page, you will also be shown "not allowed to view the page".
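
The first warning can be checked before uploading. This is a small illustrative sketch in Python, not the plugin's own validation; the function name is hypothetical, and the set of extensions is copied from the supported document types listed under Features.

```python
import os

# Extensions listed under "Features" on this page.
SUPPORTED_EXTENSIONS = {"doc", "xls", "ppt", "odt", "odp", "ods"}


def has_supported_extension(filename):
    """Return True if filename ends with one of the supported extensions.

    The comparison is case-insensitive, so "Slides.PPT" is accepted.
    """
    _, ext = os.path.splitext(filename)
    return ext.lower().lstrip(".") in SUPPORTED_EXTENSIONS
```

A file with no extension, or with an extension outside the list, is rejected by this check; whether the plugin itself behaves the same way is not stated on this page.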

ToDo List and plan

Use HtmlCleaner instead of JDOM filters to clean the HTML

Time: 10 hours
Predicted begin: 2008.08.16
Predicted end: 2008.08.17
Task:

  • clean the HTML so that it is well formed
  • remove the head tag (note: the head tag can be handled by xhtmlparser)
  • replace the <img> tag with {image}
  • remove empty links
  • replace deprecated XHTML tags (if possible)
    • Problem: HtmlCleaner cannot simply replace a tag, so this is a little hard.
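
The cleaning tasks above can be sketched with Python's standard html.parser module. This is an illustration only: the plugin itself is written in Java and uses HtmlCleaner, and the class and function names here are hypothetical. It assumes the image macro takes the form {image:filename}, which this page does not spell out, and it drops the head element even though the note above says xhtmlparser may be able to handle it.

```python
from html import escape
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr"}


def render_start(tag, attrs, self_closing=False):
    """Serialize a start tag with its attributes."""
    parts = [tag]
    for name, value in attrs:
        parts.append(name if value is None
                     else '%s="%s"' % (name, escape(value, quote=True)))
    return "<%s%s>" % (" ".join(parts), "/" if self_closing else "")


class OfficeHtmlCleanerSketch(HTMLParser):
    """Drops <head>, turns <img> into {image:...}, removes empty links."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []                  # cleaned output fragments
        self.in_head = False           # True while inside <head>
        self.link_start = None         # index in self.out of the open <a>
        self.link_has_content = False  # text or image seen inside the <a>

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        if self.in_head:
            return
        if tag == "img":
            # Assumption: the macro is written as {image:filename}.
            src = dict(attrs).get("src") or ""
            self.out.append("{image:%s}" % src.rsplit("/", 1)[-1])
            self.link_has_content = True
        elif tag in VOID_TAGS:
            self.out.append(render_start(tag, attrs, self_closing=True))
        else:
            if tag == "a":
                self.link_start = len(self.out)
                self.link_has_content = False
            self.out.append(render_start(tag, attrs))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False
            return
        if self.in_head or tag in VOID_TAGS or tag == "img":
            return
        if tag == "a" and self.link_start is not None:
            if self.link_has_content:
                self.out.append("</a>")
            else:
                # Empty link: discard everything emitted since its <a>.
                del self.out[self.link_start:]
            self.link_start = None
            return
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if self.in_head:
            return
        if data.strip():
            self.link_has_content = True
        self.out.append(escape(data, quote=False))


def clean_office_html(html):
    parser = OfficeHtmlCleanerSketch()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)
```

Because the parser is event based, replacing a tag is just a matter of emitting different text in handle_starttag, which sidesteps the tag-replacement problem noted above. Replacing deprecated tags is not covered by this sketch.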
Write test cases for the conversion

Time: 10 hours
Predicted begin: 2008.08.17
Predicted end: 2008.08.18
Task:

  • refactor the test framework of the office converter test cases
  • make small test input files (MS Word, Excel, PowerPoint, and OpenOffice) and verify the output
  • test HtmlCleaner (I had to implement the filter and fix some bugs in HtmlCleaner, so this is behind schedule)
  • test the typeformat, util, and other classes
Insert task convert2html

see here

  • implement a convert2html feature
  • clean the code
  • write javadoc
  • write readme
    • feature list
    • quick start for how to use it

Time: 5 hours
Predicted begin: 2008.08.19
Predicted end: 2008.08.19
Convert xhtml to xwiki syntax 2.0

Main Task:

  • Write test cases for WikimodelXHTMLParser. Consider all the basic tags in XHTML.
  • Submit patches to wikimodel and xwiki-core-rendering so that WikimodelXHTMLParser + XWikiSyntaxRendering work well for all the test cases.

Time: about 8 days
Predicted begin: 2008.08.18
Predicted end: 2008.08.26
Detailed plan:
Name | Predicted time | Predicted begin | Predicted end | Test cases | Problems
Base text format | About 1 day | 2008.08.18 | 2008.08.19
<b>
<strong>
<i>
<u>
<s>
<strike>
<em>
<del>
<ins>
<sup>
<sub>
<p> (existed)
title or section level(existed)
<hr>
<br>
Problem: if a tag is deprecated in XHTML, how should it be handled? That would be the role of the HTML cleaner, so I need to do it in the "html clean" step. Add a TagHandler in wikimodel's XhtmlHandler, and add blocks and a parser method in xwiki-core-rendering.
List | About 2 days | 2008.08.19 | 2008.08.21
<html>
<ol>
<li>Item 1
 <ol>
  <li>Item 2
   <ul class="star">
    <li>Item 3</li>
   </ul>
  </li>
  <li>Item 4</li>
 </ol>
</li>
<li>Item 5</li>
</ol>
<ul class="star">
<li>Item 1
 <ul class="star">
  <li>Item 2
   <ul class="star">
    <li>Item 3</li>
   </ul>
  </li>
  <li>Item 4</li>
 </ul>
</li>
<li>Item 5</li>
<li>Item 6</li>
</ul>
</html>
Problem: this is hard to fix. I need to see what happens in wikimodel's xhtmlparser.
Links | About 2 days | 2008.08.21 | 2008.08.23
<a href="http://www.xwiki.org">xwiki</a>
Problem: this is hard too. If it cannot be solved in the parser, I will use a filter to replace links with XWiki syntax when cleaning the HTML.
Table | About 2 days | 2008.08.23 | 2008.08.25
<html>
<body>
<table>
<tr>
<th>1.1</th>
<th>1.2</th>
</tr>
<tr>
<th>2.1</th>
<th>2.2</th>
</tr>
</table>
</body>
</html>
Problem: even harder, because tables are handled by a macro in the new rendering. Can I just add a simple temporary tableblock solution?
Image | 5 hours | 2008.08.25 | 2008.08.25
<img src="imgurl"/>
Problem: just ignore it, as I replace the tag with {image}.
Attribute | 10 hours | 2008.08.25 | 2008.08.26
<p align="center" color="red">middle</p>
Problem: use the style, but how? I need to find out.
class
<span class="underline">test</span>
Problem: maybe ignore it and treat the element the same as one without a class.
font
<font size="1" style="font-size: 8pt">test</font>
Problem: ignore it, or do something else?
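
To make the target of the base text format and link test cases concrete, here is a rough Python sketch of the tag mapping. It is not WikimodelXHTMLParser and does not reflect its API; the class and function names are hypothetical. The markers are the XWiki syntax 2.0 ones (** bold, // italic, __ underline, -- strikeout, ^^ superscript, ,, subscript, ---- horizontal line, [[label>>target]] link). Mapping ins to underline and del to strikeout is an assumption, and lists, tables, attributes, and escaping of wiki characters in text are deliberately left out.

```python
from html.parser import HTMLParser

# XHTML inline tag -> XWiki syntax 2.0 marker (same marker opens and closes).
INLINE_MARKERS = {
    "b": "**", "strong": "**",
    "i": "//", "em": "//",
    "u": "__", "ins": "__",          # assumption: <ins> treated as underline
    "s": "--", "strike": "--", "del": "--",
    "sup": "^^", "sub": ",,",
}


class XWiki20Sketch(HTMLParser):
    """Converts a small subset of XHTML to XWiki syntax 2.0 text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.hrefs = []  # stack of href values for open <a> tags

    def handle_starttag(self, tag, attrs):
        if tag in INLINE_MARKERS:
            self.out.append(INLINE_MARKERS[tag])
        elif tag == "a":
            self.hrefs.append(dict(attrs).get("href") or "")
            self.out.append("[[")
        elif tag == "hr":
            self.out.append("----\n\n")
        elif tag == "br":
            self.out.append("\n")

    def handle_endtag(self, tag):
        if tag in INLINE_MARKERS:
            self.out.append(INLINE_MARKERS[tag])
        elif tag == "a" and self.hrefs:
            self.out.append(">>%s]]" % self.hrefs.pop())
        elif tag == "p":
            # Paragraphs are separated by a blank line.
            self.out.append("\n\n")

    def handle_data(self, data):
        # Text is copied as-is; wiki-significant characters are not escaped.
        self.out.append(data)


def xhtml_to_xwiki20(xhtml):
    parser = XWiki20Sketch()
    parser.feed(xhtml)
    parser.close()
    return "".join(parser.out).strip()
```

For the link test case above, xhtml_to_xwiki20('<a href="http://www.xwiki.org">xwiki</a>') gives [[xwiki>>http://www.xwiki.org]].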
Make ppt and odp work
ppt and odp files generate multiple HTML pages, so it is still open how to assemble them into one XWiki page.

Time: about 1 day
Predicted begin: 2008.08.27
Predicted end: 2008.08.28

Test the project on Windows
As I use Linux for development, I need to test whether the project works well on Windows.

Time: 5 hours
Predicted begin: 2008.08.28
Predicted end: 2008.08.29
If you are using Windows, maybe you can help me test it. Thanks.

Documents and package
Javadoc and Readme.

Time: 5 hours
Predicted begin: 2008.08.29
Predicted end: 2008.08.30

Old Plan

These are the function points of office import. I will give a detailed plan and timeline for every section soon. The plan is still a draft. Any suggestions and discussion are much appreciated.

core of plugin July 8 - July 12

Actually, this work will last until the end of the project, as the core code needs to change to meet the high-level API.

  • Todo
    • Clean up the code. Provide a low-level API and a high-level API, so that the plugin can be used both in XWiki pages and in other parts of XWiki (to be detailed).
    • Handle conflicts with XWiki syntax (maybe this is the job of xhtmlparser).

Integration with xwiki

Develop an application July 13 - July 15

  • upload a file
  • select the target page
  • convert the document to the page

    Problems
  • how to upload a file using fileuploadplugin and get the byte[] of the file
  • whether to create a new page or insert into an existing page

Source Code

This project has just started and has only produced the initial code. Any suggestions are appreciated. Please add comments to this page to discuss.

Build

This project uses Maven 2 as its project management tool. Get the source code and run "mvn install" to build the plugin package. Because it depends on some libraries that have not been released yet, you need to build those dependencies first if you want to try the latest version.

POM File

Please see pom.xml

Reference Libraries

Libraries that Office Importer depends on.

Sample Imports (Known issues)

This section presents some of the results we obtained with Office Importer, along with the original documents for comparison. The purpose of this comparison is to identify the shortcomings and difficulties of importing office documents into wiki pages. Each of the following pages analyses the result of a single document import operation. Please note that all the documents mentioned below were taken from Scribd and that all of them were imported with the filtering mode set to None.

Support

For any questions or problems, please send an email to [email protected] (subscription required) or to me at daning106(at)gmail.com


 


Tags: gsoc
    
