Office Importer

Last modified by Vincent Massol on 2024/11/19 16:12

Product: XWiki

Type: Feature

Status: Completed

Participants

Asiri Rathnayake

Issues: http://jira.xwiki.org/jira/browse/XWIKI/component/10740

Description

This document appears to be outdated and will be updated as time permits.

This page holds all the information about my Google Summer of Code project, Office Converter. The project creates an XWiki plugin that converts office documents such as MS Word (doc), MS Excel (xls) and OpenOffice (odt) files to XWiki syntax and inserts the result into an XWiki page. The intermediate step of the conversion turns the office document into clean, tidy HTML code; an XHTML parser then converts that HTML to XWiki syntax.

Warning: This project is unfinished and unstable. This page is for development. The target release depends on XE 1.6, which is not released yet. To try the newest version of this project, build the latest version of XWiki from svn, download the officeconverter for XE 1.6, and follow the README in the zip file.

Proposal

Introduction

Use the OpenOffice runtime as a server to convert documents to HTML code.
Clean the HTML code.
Parse the HTML into XWiki syntax.
Integrate these features into XWiki; see Office Importer below.

Integration mock-up

The features below are usable only when the Office Converter Plugin is installed. After discussion with Vincent, we decided that the office converter integration will be plugin + application. That is:

an XWiki plugin for converting office documents to several formats, such as PDF, HTML and XWiki syntax.
an office import application that lets users import office documents into XWiki pages.

The office import application should look like the mock-ups below.

Warning: These two features will not be supported in this release, as they depend on other modules in XWiki and I do not have enough time before the GSoC deadline. I will discuss them on the dev list and implement them in the future.

Import from WYSIWYG
- mock up demo: OfficeImporterWYSIWYG.png
Preview Office document
- mock up demo: OfficeImporterPreview.png

Current State

Features

Convert an office document to HTML code and save the HTML code to an XWiki page.
Handle XWiki syntax in the HTML content and escape special characters in the HTML content.
Supported document types: doc, xls, ppt, odt, odp, ods.
Convert ppt and odp files to a zip file and display the zip in an iframe in an XWiki page.
Handle the images in office documents: pictures are uploaded to the XWiki page as attachments.
Integrated into XWiki as an XWiki plugin.
Provide an XWiki application to import office documents, with a choice between convert2html and convert2xwikisyntax.
An unfinished convert2xwikisyntax feature, to be finished in the next version.

Quick Start

Install

The latest XE 1.6 from svn trunk is required.
Install OpenOffice (>= 2.3) on the computer on which XWiki will run. See http://www.openoffice.org
Copy all the libraries listed below to XWIKI_WEB_HOME/WEB-INF/lib/
- All the dependent libraries can be downloaded here. The required libraries include:
  - slf4j-api-1.4.3.jar
  - slf4j-jdk14-1.4.3.jar
  - jodconverter-2.2.1.jar http://sourceforge.net/project/showfiles.php?group_id=91849
  - jurt-2.3.0.jar
  - juh-2.3.0.jar
  - ridl-2.3.0.jar
  - unoil-2.3.0.jar
  - htmlcleaner-2.0.jar http://htmlcleaner.sourceforge.net
Copy the office importer plugin library to XWIKI_WEB_HOME/WEB-INF/lib/
Add the office converter plugin to xwiki.cfg
- Edit your WEB-INF/xwiki.cfg file as follows:

xwiki.plugins=[...], com.xpn.xwiki.plugin.officeconverter.OfficeConverterPlugin

Start Server

Start XWiki as you always do.
Start OpenOffice as a server on the same computer.
- On Windows it is a little more complicated; please refer to http://www.artofsolving.com/node/11 for details.
- Alternatively, find the soffice executable (often in c:/program files/openoffice-2.3), change to that directory on the command line and run:

soffice -headless -accept="socket,host=127.0.0.1,port=8100;urp;" -nofirststartwizard

- On Linux, the simplest way is to start it from the command line with the following options:

soffice -headless -accept="socket,host=127.0.0.1,port=8100;urp;" -nofirststartwizard
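Before running a conversion it can help to confirm that the OpenOffice server is actually listening. The snippet below is a minimal sketch, not part of the plugin: it only checks that something accepts TCP connections on the host and port given in the -accept option above. The function name `office_server_reachable` is made up for illustration.

```python
import socket


def office_server_reachable(host="127.0.0.1", port=8100, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened.

    This only proves that a process is listening on the port; it does not
    verify that the listener is really an OpenOffice instance.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    if office_server_reachable():
        print("OpenOffice server port is open")
    else:
        print("Nothing is listening on 127.0.0.1:8100; start soffice first")
```

If the check fails, the soffice command above has most likely not been started, or it was started with a different host or port.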

Use the plugin in xwiki

Import the office import application into XWiki.
Go to Import.WebHome to convert an office document.
Select the source file and enter the target XWiki page's space and page name.
Select convert2xhtml or convert2xwiki.
Click the "convert" button.
On success, click the "result" link to see the new page.

Warning

The source file should have a normal filename with the correct extension.
The target XWiki page should not already exist; otherwise you will be told that you are not allowed to view the page.
If you do not have edit rights on the target page, you will also see "not allowed to view the page".
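The first warning can be checked before uploading. The following is a small sketch under the assumption that "correct extension" means one of the document types listed under Features (doc, xls, ppt, odt, odp, ods); the helper name `has_supported_extension` is hypothetical and is not part of the plugin's API.

```python
from pathlib import PurePath

# Document types listed in the Features section of this page.
SUPPORTED_EXTENSIONS = {"doc", "xls", "ppt", "odt", "odp", "ods"}


def has_supported_extension(filename):
    """Return True if the filename ends in one of the supported extensions."""
    suffix = PurePath(filename).suffix  # e.g. ".doc", or "" if there is none
    return suffix[1:].lower() in SUPPORTED_EXTENSIONS


if __name__ == "__main__":
    for name in ("report.doc", "slides.PPT", "notes.docx", "README"):
        print(name, has_supported_extension(name))
```

The comparison is case-insensitive, so slides.PPT is accepted, while files without an extension or with an unlisted one are rejected.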

ToDo List and plan

Use htmlcleaner to clean html instead of jdom filters.

Time: 10 hours
Predicted begin: 2008.08.16
Predicted end: 2008.08.17
Task:

~~clean html code to well format~~
~~remove head tag~~ (the head tag can be handled by the xhtmlparser)
replace the img tag with {image}
~~remove empty link~~
replace deprecated XHTML tags (if possible)
- Problem: HTMLCleaner cannot simply replace a tag, so this is a little hard.
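Two of the tasks above (replacing image tags with {image} and removing empty links) can be illustrated with a short regex-based sketch. This is not how the plugin does it: the plugin uses HtmlCleaner in Java, and a real cleaner should work on a parsed tree rather than on regular expressions. The `{image:filename}` form of the macro and the function name `clean_html` are assumptions made for this example.

```python
import re
from posixpath import basename

# <img ... src="..." ...> with a double-quoted src attribute.
IMG_TAG = re.compile(r'<img\b[^>]*?\bsrc="([^"]*)"[^>]*>', re.IGNORECASE)
# <a ...></a> with nothing but optional whitespace inside.
EMPTY_LINK = re.compile(r'<a\b[^>]*>\s*</a>', re.IGNORECASE)


def clean_html(html):
    """Drop empty links and replace img tags with an {image:...} macro."""
    html = EMPTY_LINK.sub("", html)
    # Keep only the file name, since images are uploaded as page attachments.
    return IMG_TAG.sub(lambda m: "{image:%s}" % basename(m.group(1)), html)


if __name__ == "__main__":
    sample = '<p><a name="x"></a>Logo: <img src="images/logo.png" alt="logo"/></p>'
    print(clean_html(sample))
```

Links that contain text are left untouched, so they can still be handled later by the parser or by a dedicated filter.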

Write test cases for the conversion

Time: 10 hours
Predicted begin: 2008.08.17
Predicted end: 2008.08.18
Task:

~~refactor the test framework of office converter test cases~~
~~make small test input files (MS Word, Excel, PowerPoint and OpenOffice) and verify the output~~
test the HtmlCleaner (the filter still has to be implemented and some bugs in htmlcleaner fixed, so this is off track)
~~test the typeformat, util, and other classes~~

Insert task convert2html

see here

~~implement a convert2html feature~~
~~clean the code~~
~~write javadoc~~
~~write readme~~
- ~~feature list~~
- ~~quick start for how to use it~~
Time: 5 hours
Predicted begin: 2008.08.19
Predicted end: 2008.08.19

Convert xhtml to xwiki syntax 2.0

Main Task:

Write test cases for WikimodelXHTMLParser, covering all the basic XHTML tags.
Submit patches to wikimodel and xwiki-core-rendering so that WikimodelXHTMLParser + XWikiSyntaxRendering work well for all the test cases.
Time: about 8 days
Predicted begin: 2008.08.18
Predicted end: 2008.08.26
Detailed plan:

Name	Predicted time	Predicted begin	Predicted end	Test cases	Problems
~~Base text format~~	About 1 day	2008.08.18	2008.08.19

<b>
<strong>
<i>
<u>
<s>
<strike>
<em>
<del>
<ins>
<sup>
<sub>
<p> (already exists)
title or section level (already exists)
<hr>
<br>

	If a tag is deprecated in XHTML, how should it be handled? That is the role of the HTML cleaner, so it needs to be done in the "html clean" step. Add a TagHandler in wikimodel's XhtmlHandler and add blocks and a parser method in xwiki-core-rendering.
List	About 2 days	2008.08.19	2008.08.21

<html>
<ol>
 <li>Item 1
  <ol>
   <li>Item 2
    <ul class="star">
     <li>Item 3</li>
    </ul>
   </li>
   <li>Item 4</li>
  </ol>
 </li>
 <li>Item 5</li>
</ol>
<ul class="star">
 <li>Item 1
  <ul class="star">
   <li>Item 2
    <ul class="star">
     <li>Item 3</li>
    </ul>
   </li>
   <li>Item 4</li>
  </ul>
 </li>
 <li>Item 5</li>
 <li>Item 6</li>
</ul>
</html>

This is hard to fix. I need to see what happens in wikimodel's xhtmlparser.
~~Links~~	About 2 days	2008.08.21	2008.08.23

<a href="http://www.xwiki.org">xwiki</a>

This is hard too. If it cannot be solved in the parser, I will use a filter to replace links with XWiki syntax when cleaning the HTML.
Table	About 2 days	2008.08.23	2008.08.25

<html>
<body>
<table>
<tr>
 <th>1.1</th>
 <th>1.2</th>
</tr>
<tr>
 <th>2.1</th>
 <th>2.2</th>
</tr>
</table>
</body>
</html>

Even harder, because tables are handled by a macro in the new rendering. Could I just add a simple temporary table block solution?
Image	5 hours	2008.08.25	2008.08.25

<img src="imgurl"/>

Just ignore it, as I replace it with {image}.
attribute	10 hours	2008.08.25	2008.08.26

middle

use the style, but how? Need to find out.
class

test

Maybe ignore it and treat the element the same as one without a class.
font

test

Ignore? Or something else.
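To make the expected output of the "Base text format" and "List" rows concrete, here is a toy converter written with Python's standard html.parser. It is only an illustration of the target syntax, not the WikimodelXHTMLParser approach described above. The tag-to-marker mapping and the list prefixes (1., 11., 1*., *, **) follow my reading of XWiki Syntax 2.0 and should be verified against the syntax guide; mapping ins to underline and del to strikethrough is an assumption, and the names `XWiki20Sketch` and `xhtml_to_xwiki20` are made up.

```python
import re
from html.parser import HTMLParser

# Inline XHTML tags and the XWiki Syntax 2.0 marker assumed for each.
INLINE = {
    "b": "**", "strong": "**",
    "i": "//", "em": "//",
    "u": "__", "ins": "__",
    "s": "--", "strike": "--", "del": "--",
    "sup": "^^", "sub": ",,",
}


class XWiki20Sketch(HTMLParser):
    """Toy converter for inline formatting, br, hr and nested lists."""

    def __init__(self):
        super().__init__()
        self.out = []    # output fragments, joined at the end
        self.lists = []  # stack of open lists: "1" for ol, "*" for ul

    def handle_starttag(self, tag, attrs):
        if tag in INLINE:
            self.out.append(INLINE[tag])
        elif tag in ("ol", "ul"):
            self.lists.append("1" if tag == "ol" else "*")
        elif tag == "li":
            prefix = "".join(self.lists)
            if "1" in prefix:
                prefix += "."  # numbered and mixed prefixes end with a dot
            self.out.append("\n" + prefix + " ")
        elif tag == "br":
            self.out.append("\n")
        elif tag == "hr":
            self.out.append("\n----\n")

    def handle_endtag(self, tag):
        if tag in INLINE:
            self.out.append(INLINE[tag])
        elif tag in ("ol", "ul") and self.lists:
            self.lists.pop()

    def handle_data(self, data):
        # Collapse whitespace runs; line edges are trimmed later.
        self.out.append(re.sub(r"\s+", " ", data))


def xhtml_to_xwiki20(xhtml):
    parser = XWiki20Sketch()
    parser.feed(xhtml)
    parser.close()
    lines = (re.sub(r" {2,}", " ", line).strip()
             for line in "".join(parser.out).splitlines())
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    print(xhtml_to_xwiki20("<p>Hello <b>bold</b> and <i>italic</i></p>"))
    print(xhtml_to_xwiki20("<ol><li>Item 1<ul><li>Item 2</li></ul></li></ol>"))
```

Tags that are not listed (p, html, a, table, img and so on) are silently ignored here, as are class and style attributes, which mirrors the open questions in the table rather than answering them.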

Make ppt and odp work

~~ppt and odp generate multiple HTML pages, so how should they be assembled into one XWiki page?~~

Time: about 1 day
Predicted begin: 2008.08.27
Predicted end: 2008.08.28

Test the project on windows

~~As I use Linux for development, I need to test whether the project works well on Windows.~~

Time: 5 hours
Predicted begin: 2008.08.28
Predicted end: 2008.08.29
If you are using Windows, maybe you can help me test it. Thanks.

Documents and package

~~Javadoc and Readme.~~

Time: 5 hours
Predicted begin: 2008.08.29
Predicted end: 2008.08.30

Old Plan

These are the function points of office import. I will soon give a detailed plan and timeline for every section. The plan is still a draft; any suggestions and discussion are much appreciated.

core of plugin July 8 - July 12

Actually, this work will last until the end of the project, as the core code needs to change to meet the high-level API.

Todo
- Clean up the code. Provide a low-level API and a high-level API, so that the plugin can be used both in XWiki pages and in other parts of XWiki (details to follow).
- Handle conflicts with XWiki syntax (maybe this is the job of the xhtmlparser).

Integration with xwiki

Develop an application July 13 - July 15

upload a file
select the target page
convert the document to the page
Problems

~~how to upload a file using fileuploadplugin and get the byte[] of the file~~
~~create a new page or insert into an existing page~~

Source Code

This project has just started and has only produced the initial code. Any suggestions are appreciated; please add comments to this page to discuss.

svn for office converter plugin: https://svn.xwiki.org/svnroot/xwiki/sandbox/xwiki-plugin-officeimporter
svn for office import application: http://svn.xwiki.org/svnroot/xwiki/sandbox/xwiki-application-officeimporter

Build

This project uses maven2 as its project management tool. Get the source code and run "mvn install" to build the plugin package.
However, as it depends on some libraries that are not released yet, you need to build those dependencies if you want to try the latest version.

Get the latest code from svn for the libraries below:
- xwiki-core
- xwiki-core-rendering
- xwiki-core-xml
- org.wikimodel.wem
Patch them according to these issues:
- http://code.google.com/p/wikimodel/issues/detail?id=34
- http://jira.xwiki.org/jira/browse/XWIKI-2568
Install the libraries above into your Maven repository.
If you want to test the project with "mvn test" or "mvn install", start OpenOffice as a server first.
If you want to build it without tests, run "mvn install -Dmaven.test.skip=true".

POM File

Please see pom.xml

Reference Libraries

Libraries that Office Importer depends on.

Sample Imports (Known issues)

This section presents some of the results we obtained with Office Importer, along with the original documents for comparison. The purpose of the comparison is to identify the shortcomings and difficulties of importing office documents into wiki pages. Each of the following pages analyses the result of a single document import operation. Please note that all the documents mentioned below were taken from Scribd and that all of them were imported with the filtering mode set to None.

Support

For any questions or problems, please send an email to [email protected] (subscription required) or to me at daning106(at)gmail.com
