Office Importer
Description
This page holds all the information about my Google Summer of Code project, Office Converter. The project is to create an xwiki plugin that converts office documents such as MS Doc, MS Excel and OpenOffice Odt to xwiki syntax and inserts the result into an xwiki page. The intermediate step of the conversion is to convert the office document to clean, tidy html code; an xhtml parser then converts the html to xwiki syntax. To use it, follow the README in the zip file.
Proposal
Introduction
- use the openoffice runtime as a server to convert documents to html code
- clean the html code
- parse the html to xwiki syntax
- integrate those features into xwiki. See Office Importer below.
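The steps above form a simple pipeline. The sketch below only illustrates how the stages chain together; the class name `ImportPipeline` and its three stage parameters are hypothetical and are not taken from the plugin's source code.

```java
import java.util.function.Function;

/** Hypothetical sketch of the import pipeline; none of these names come from the plugin. */
final class ImportPipeline {
    private final Function<byte[], String> officeToHtml; // stage 1: conversion done by the openoffice server
    private final Function<String, String> htmlCleaner;  // stage 2: tidy the generated html
    private final Function<String, String> htmlToXWiki;  // stage 3: parse xhtml and emit xwiki syntax

    ImportPipeline(Function<byte[], String> officeToHtml,
                   Function<String, String> htmlCleaner,
                   Function<String, String> htmlToXWiki) {
        this.officeToHtml = officeToHtml;
        this.htmlCleaner = htmlCleaner;
        this.htmlToXWiki = htmlToXWiki;
    }

    /** Runs the three stages in order and returns the markup to store in the page. */
    String run(byte[] officeDocument) {
        return officeToHtml.andThen(htmlCleaner).andThen(htmlToXWiki).apply(officeDocument);
    }
}
```

Keeping each stage behind its own function makes it possible to test the cleaning and parsing steps without a running openoffice server.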
Integration mock-up
The features below are usable only when the Office Converter Plugin is installed. After discussion with Vincent, we decided that the integration for the office converter will be plugin + application. That is,
- an xwiki plugin for converting office documents to many document formats, like pdf, html and xwiki syntax.
- an office import application for users to import office documents into xwiki pages
The office import application should look like below:
- Import from WYSIWYG
- mock up demo:
- Preview Office document
- mock up demo:
Current State
Features
- convert an office document to html code and save the html code to an xwiki page
- handle xwiki syntax in html content and escape special characters in the html content
- supported document types: doc, xls, ppt, odt, odp, ods
- support converting ppt and odp to a zip file and displaying the zip in an iframe in an xwiki page
- handle the images in office documents: upload pictures into the xwiki page as attachments
- integrate into xwiki as an xwiki plugin
- provide an xwiki application to import office documents, in which the user can select convert2html or convert2xwikisyntax
- an unfinished convert2xwikisyntax feature, to be finished in the next version.
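As a rough illustration of the supported types listed above, the following sketch classifies a file name by its extension and flags presentations, which are packaged as a zip and shown in an iframe. The class `OfficeFormats`, the `Kind` enum and both method names are hypothetical; the plugin's real classes may be organised differently.

```java
import java.util.Locale;

/** Hypothetical helper: classifies a file name by the extensions listed in the feature list. */
final class OfficeFormats {
    enum Kind { TEXT, SPREADSHEET, PRESENTATION, UNSUPPORTED }

    /** Returns the document family for a file name, based only on its extension. */
    static Kind kindOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return Kind.UNSUPPORTED; // no extension at all
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        switch (ext) {
            case "doc": case "odt": return Kind.TEXT;
            case "xls": case "ods": return Kind.SPREADSHEET;
            case "ppt": case "odp": return Kind.PRESENTATION;
            default: return Kind.UNSUPPORTED;
        }
    }

    /** Presentations are the types that go through the zip + iframe path. */
    static boolean usesZipAndIframe(String fileName) {
        return kindOf(fileName) == Kind.PRESENTATION;
    }
}
```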
Quick Start
Install
- the latest XE 1.6 from svn trunk is required.
- install openoffice (>=2.3) on the computer on which xwiki will run. Refer to http://www.openoffice.org
- copy all the libs mentioned below to XWIKI_WEB_HOME/WEB-INF/lib/
- All the required libraries can be downloaded
- slf4j-api-1.4.3.jar
- slf4j-jdk14-1.4.3.jar
- jodconverter-2.2.1.jar http://sourceforge.net/project/showfiles.php?group_id=91849
- jurt-2.3.0.jar
- juh-2.3.0.jar
- ridl-2.3.0.jar
- unoil-2.3.0.jar
- htmlcleaner-2.0.jar http://htmlcleaner.sourceforge.net
- add the office converter plugin in xwiki.cfg
- Edit your WEB-INF/xwiki.cfg file as follows:
Start Server
- start xwiki as you always do.
- start openoffice as a server on that computer.
- If you are using windows, it's a little complicated. Please refer to http://www.artofsolving.com/node/11 to find out how.
- Or just find the soffice executable (often in c:/program files/openoffice-2.3), go to that path on the command line, and run
- If you are on linux, the simplest way is to start it from the command line with the following options:
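The exact command is not reproduced on this page. As a sketch only, the helper below builds the headless invocation commonly documented for JODConverter 2.x; the port 8100 (JODConverter's usual default), the class name `SofficeCommand` and the choice of flags are assumptions, so check them against your openoffice version.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that assembles a headless openoffice command line. */
final class SofficeCommand {
    /** Builds the argument list for an openoffice instance listening on localhost. */
    static List<String> build(String sofficePath, int port) {
        List<String> cmd = new ArrayList<>();
        cmd.add(sofficePath);                 // e.g. "soffice" if it is on the PATH
        cmd.add("-headless");                 // run without a GUI
        cmd.add("-accept=socket,host=127.0.0.1,port=" + port + ";urp;"); // UNO socket listener
        cmd.add("-nofirststartwizard");       // skip the first-run dialog
        return cmd;
    }
}
```

The list can be passed to `new ProcessBuilder(SofficeCommand.build("soffice", 8100)).start()` to launch the server from Java, or joined with spaces and typed into a shell (quote the `-accept` argument there, because it contains semicolons).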
Use the plugin in xwiki
- Import to xwiki
- go to Import.WebHome to convert an office document
- select the source file, and enter the target xwiki page's space and page name.
- select convert2xhtml or convert2xwiki
- click the "convert" button
- on success, you can click the "result" link to see the new page.
Warning
- The source file should have a normal filename with the correct extension.
- The target xwiki page should not already exist. Otherwise, you will be shown "not allowed to view the page."
- If you don't have edit rights on the target page, you will be shown "not allowed to view the page."
ToDo List and plan
Use htmlcleaner instead of jdom filters to clean html.
Time: 10 hours
Predicted begin: 2008.08.16
Predicted end: 2008.08.17
Task:
- clean the html code so that it is well formed
- remove head tag (the head tag can be handled by xhtmlparser)
- replace image tags with {image}
- remove empty links
- replace deprecated tags of xhtml (if possible)
- pb: HTMLCleaner can't simply replace a tag, so this is a little hard.
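To make the cleaning tasks above concrete, here is a minimal regex-based sketch. It does not use the HtmlCleaner API and is not the plugin's implementation; it only approximates three of the tasks on already well-formed input. The class name `HtmlCleanupSketch` and the choice of `<del>` as the replacement for the deprecated `<strike>` and `<s>` tags are assumptions.

```java
import java.util.regex.Pattern;

/** Hypothetical regex approximation of the cleaning tasks; a real cleaner should work on a parsed tree. */
final class HtmlCleanupSketch {
    // (?is): case-insensitive, and '.' also matches line breaks
    private static final Pattern HEAD = Pattern.compile("(?is)<head\\b[^>]*>.*?</head>");
    private static final Pattern EMPTY_LINK = Pattern.compile("(?is)<a\\b[^>]*>\\s*</a>");
    private static final Pattern STRIKE_OPEN = Pattern.compile("(?i)<(?:strike|s)\\b[^>]*>");
    private static final Pattern STRIKE_CLOSE = Pattern.compile("(?i)</(?:strike|s)\\s*>");

    static String clean(String html) {
        String out = HEAD.matcher(html).replaceAll("");        // drop the whole head element
        out = EMPTY_LINK.matcher(out).replaceAll("");          // drop links with no visible content
        out = STRIKE_OPEN.matcher(out).replaceAll("<del>");    // assumed replacement for deprecated tags
        out = STRIKE_CLOSE.matcher(out).replaceAll("</del>");
        return out;
    }
}
```

Note that the empty-link rule also removes named anchors such as `<a name="x"></a>`, which may or may not be wanted.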
Write test cases for the conversion
Time: 10 hours
Predicted begin: 2008.08.17
Predicted end: 2008.08.18
Task:
- refactor the test framework of the office converter test cases
- make small test input files (MS word, excel, powerpoint and openoffice) and verify the output
- test the HtmlCleaner (had to implement the filter and fix some bugs in htmlcleaner, so this is off track)
- test the typeformat, util, and other classes
Insert task convert2html
see here
- implement a convert2html feature
- clean the code
- write javadoc
- write readme
- feature list
- quick start for how to use it
Time: 5 hours
Predicted begin: 2008.08.19
Predicted end: 2008.08.19
Convert xhtml to xwiki syntax 2.0
Main Task:
- Write test cases for WikimodelXHTMLParser. Consider all the basic tags in xhtml.
- submit patches to wikimodel and xwiki-core-rendering to make WikimodelXHTMLParser + XWikiSyntaxRendering work well for all the test cases.
Time: about 8 days
Predicted begin: 2008.08.18
Predicted end: 2008.08.26
Detailed plan for this:
| Name | Predicted time | Predicted begin | Predicted end | Test cases | Problems |
|---|---|---|---|---|---|
| | About 1 day | 2008.08.18 | 2008.08.19 | <b> <strong> <i> <u> <s> <strike> <em> <del> <ins> <sup> <sub> <p> (already exists), title or section level (already exists), <hr> <br> | If a tag is deprecated in xhtml, how should it be dealt with? That would be the role of the HTML cleaner, so I need to do it in the "html clean" step. Add a TagHandler in wikimodel's XhtmlHandler and add blocks and parser methods in xwiki-core-rendering. |
| List | About 2 days | 2008.08.19 | 2008.08.21 | <html> <ol> <li>Item 1 <ol> <li>Item 2 <ul class="star"> <li>Item 3</li> </ul> </li> <li>Item 4</li> </ol> </li> <li>Item 5</li> </ol> <ul class="star"> <li>Item 1 <ul class="star"> <li>Item 2 <ul class="star"> <li>Item 3</li> </ul> </li> <li>Item 4</li> </ul> </li> <li>Item 5</li> <li>Item 6</li> </ul> </html> | This is hard to fix. Need to see what happens in wikimodel's xhtmlparser. |
| | About 2 days | 2008.08.21 | 2008.08.23 | <a href="http://www.xwiki.org">xwiki</a> | This is hard too. If it can't be solved in the parser, I will use a filter to replace links with xwiki syntax when cleaning the html. |
| Table | About 2 days | 2008.08.23 | 2008.08.25 | <html> <body> <table> <tr> <th>1.1</th> <th>1.2</th> </tr> <tr> <th>2.1</th> <th>2.2</th> </tr> </table> </body> </html> | Even harder, because it's handled by a macro in the new rendering. Can I just add a simple temporary tableblock solution? |
| Image | 5 hours | 2008.08.25 | 2008.08.25 | <img src="imgurl"/> | Just ignore, as I replace it with {image}. |
| attribute | 10 hours | 2008.08.25 | 2008.08.26 | <p align="center" color="red">middle</p> | Use the style, but how? Need to find out. |
| class | | | | <span class="underline">test</span> | Maybe ignore, and treat it the same as without the class. |
| font | | | | <font size="1" style="font-size: 8pt">test</font> | Ignore? Or something else. |
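For the simplest rows of the plan (inline tags and links), the expected output can be illustrated with a small converter that targets XWiki Syntax 2.0 markers (`**`, `//`, `__`, `--`, `^^`, `,,` and `[[label>>target]]`). This is a regex sketch, not the WikimodelXHTMLParser; the class name `InlineXhtmlToXWiki20` and the tag-to-marker mapping for `<ins>` and `<del>` are assumptions, and it does not escape wiki syntax that already appears in the text.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;

/** Hypothetical sketch: maps attribute-free inline xhtml tags and simple links to XWiki Syntax 2.0. */
final class InlineXhtmlToXWiki20 {
    private static final Map<String, String> MARKERS = new LinkedHashMap<>();
    static {
        MARKERS.put("b", "**");   MARKERS.put("strong", "**");
        MARKERS.put("i", "//");   MARKERS.put("em", "//");
        MARKERS.put("u", "__");   MARKERS.put("ins", "__");
        MARKERS.put("s", "--");   MARKERS.put("strike", "--"); MARKERS.put("del", "--");
        MARKERS.put("sup", "^^"); MARKERS.put("sub", ",,");
    }

    static String convert(String xhtml) {
        // Links first: <a href="target">label</a> becomes [[label>>target]]
        String out = xhtml.replaceAll("(?is)<a\\s+href=\"([^\"]*)\"\\s*>(.*?)</a>", "[[$2>>$1]]");
        // Then each opening or closing inline tag becomes its wiki marker
        for (Map.Entry<String, String> e : MARKERS.entrySet()) {
            String marker = Matcher.quoteReplacement(e.getValue());
            out = out.replaceAll("(?i)</?" + e.getKey() + "\\s*>", marker);
        }
        return out;
    }
}
```

Tags that carry attributes, nested lists and tables are deliberately left out; those are the cases the plan marks as hard and that need a real parser.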
Make ppt and odp work
Time: about 1 day
Predicted begin: 2008.08.27
Predicted end: 2008.08.28
Test the project on windows
Time: 5 hours
Predicted begin: 2008.08.28
Predicted end: 2008.08.29
If you are using Windows, maybe you can help me test it. Thanks.
Documents and package
Time: 5 hours
Predicted begin: 2008.08.29
Predicted end: 2008.08.30
Old Plan
core of plugin July 8 - July 12
Actually, this work will last until the end of the project, as the core code needs to change to meet the high level api.
- Todo
- Clean up the code. Provide a low level api and a high level api, so that the plugin can be used both in xwiki pages and in other parts of xwiki. (to be detailed)
- handle conflicts with the xwiki syntax (maybe it's the job of xhtmlparser)
Integration with xwiki
Develop an application July 13 - July 15
- upload a file
- select the target page
- convert the document to the page
Problems:
- how to upload a file using fileuploadplugin and get the byte[] of the file
- a new page, or insert into an existing page
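Independently of how fileuploadplugin exposes the upload, once an `InputStream` for the uploaded file is available its content can be read into a `byte[]` with plain Java. The sketch below shows only that step; the class name `UploadBytes` is hypothetical and no fileuploadplugin call is shown, since its API is not described on this page.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper: reads an uploaded file's stream fully into memory. */
final class UploadBytes {
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        // read() returns -1 at end of stream
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toByteArray();
    }
}
```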
Source Code
This project has just started and has only produced the initial code. Any suggestions are appreciated. Please add comments to this page to discuss.
- svn for office converter plugin: https://svn.xwiki.org/svnroot/xwiki/sandbox/xwiki-plugin-officeimporter
- svn for office import application: http://svn.xwiki.org/svnroot/xwiki/sandbox/xwiki-application-officeimporter
Build
This project uses maven2 as its project management tool. You can get the source code and type "mvn install" to get the plugin package.
But as it depends on some libs which are not released yet, you need to build the dependencies if you want to try the latest version.
- get the latest code from the svn for these libs below
- xwiki-core
- xwiki-core-rendering
- xwiki-core-xml
- org.wikimodel.wem
- Patch them as these issues:
- install these libs above to your maven repository
- if you want to test the project with "mvn test" or "mvn install", you should start the openoffice as a server
- if you want to build it without test, you should run "mvn install -Dmaven.test.skip=true"
POM File
Please see pom.xml
Reference Libraries
Libraries that Office Importer depends on.
Sample Imports (Known issues)
This section presents some of the results we obtained with Office Importer along with the original documents for comparison. The purpose of this comparison is to identify the shortcomings and difficulties of importing office documents into wiki pages. Each of the following pages analyses the result of a single document import operation. Please note that all the documents mentioned below were taken from Scribd and that all of them were imported with the filtering mode set to None.
Support
For any questions or problems, please send an email to [email protected] (you need to subscribe) or to me at daning106(at)gmail.com