Wiki source code of Office Importer

Last modified by Vincent Massol on 2024/02/26 17:58

{{warning}}
This document appears to be outdated; it will be updated as time permits.
{{/warning}}

{{toc start="2"/}}

This page holds all the information about my [[Google Summer of Code project>>http://code.google.com/soc/2008/]] - [[Office Converter>>http://dev.xwiki.org/xwiki/bin/view/GoogleSummerOfCode/OfficeImport2008]]. The goal of the project is to create an XWiki plugin that converts office documents such as MS Word, MS Excel and OpenOffice ODT documents to XWiki syntax and inserts the result into an XWiki page. The intermediate step of the conversion turns the office document into clean, tidy HTML; an XHTML parser then converts that HTML to XWiki syntax.

{{warning}}
This project is unfinished and unstable, and this page is for development. The target release of this project depends on XE 1.6, which has not been released yet. So if you want to try the newest version of this project, you need to build the latest version of XWiki from svn, download the [[officeconverter for XE 1.6>>attach:officeconverter0.0.3forXE1.6.zip]] and follow the README. If you want to try the office importer in XE 1.5, you can download the [[version for XE 1.5>>attach:officeconverter0.0.3SNAPSHOTforXE.1.5.zip]] and follow the README in the zip file.
{{/warning}}

== Proposal ==

=== Introduction ===

* use the OpenOffice runtime as a server to convert the document to HTML
* clean the HTML code
* parse the HTML into XWiki syntax
* integrate those features into XWiki; see below [[||anchor="mock-up"]]

=== Integration mock-up ===

The features below are usable only if the Office Converter Plugin is installed. After discussion with Vincent, we decided that the integration of the office converter will be plugin + application. That is,

* an XWiki plugin for converting office documents to several formats, such as PDF, HTML and XWiki syntax
* an office import application that lets users import office documents into XWiki pages

The office import application should look like this:

[[image:OfficeImporterPage.png]]

{{warning}}The two features below will not be supported in this release, as they depend on other modules in XWiki and I do not have enough time before the GSoC deadline. I will discuss them on the dev list and implement them in the future.{{/warning}}

* Import from WYSIWYG
** mock-up demo: attach:OfficeImporterWYSIWYG.png
* Preview Office document
** mock-up demo: attach:OfficeImporterPreview.png

== Current State ==

=== Features ===

* convert an office document to HTML and save the HTML to an XWiki page
* handle XWiki syntax in the HTML content and escape special characters in it
* supported document types: doc, xls, ppt, odt, odp, ods
* convert ppt and odp files to a zip file and display the zip in an iframe in an XWiki page
* handle the images in the office document: pictures are uploaded to the XWiki page as attachments
* integrate into XWiki as an XWiki plugin
* provide an XWiki application for importing office documents, which lets you choose between convert2html and convert2xwikisyntax
* an unfinished convert2xwikisyntax feature, to be finished in the next version

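The escaping feature listed above can be illustrated with a small sketch. It is written in Python purely for illustration (the plugin itself is Java) and assumes the target is XWiki Syntax 2.0, where the tilde is the escape character. The function name and the set of characters are hypothetical and are not taken from the plugin code.

{{code language="python"}}
# Characters that can start or delimit markup in XWiki Syntax 2.0.
# This set is an illustrative assumption, not the plugin's actual list.
SPECIAL_CHARS = set("~*/_-#^,=[]{}|\\()")


def escape_wiki_text(text):
    """Prefix every potentially special character with the escape character '~'."""
    return "".join("~" + ch if ch in SPECIAL_CHARS else ch for ch in text)


if __name__ == "__main__":
    print(escape_wiki_text("2 * 3 = 6 [approx]"))
{{/code}}

Escaping every candidate character is deliberately conservative: the output is harder to read, but text coming from the office document cannot be mistaken for wiki markup.
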
== Quick Start ==

=== Install ===

* the latest XE 1.6 from svn trunk is required.
* install OpenOffice (>=2.3) on the computer on which XWiki will run. Refer to [[http://www.openoffice.org]]
* copy all the libs mentioned below to XWIKI_WEB_HOME/WEB-INF/lib/
** All the dependent libraries can be downloaded [[here>>attach:libs.zip]]. The required libraries include:
*** slf4j-api-1.4.3.jar
*** slf4j-jdk14-1.4.3.jar
*** jodconverter-2.2.1.jar [[http://sourceforge.net/project/showfiles.php?group_id=91849]]
*** jurt-2.3.0.jar
*** juh-2.3.0.jar
*** ridl-2.3.0.jar
*** unoil-2.3.0.jar
*** htmlcleaner-2.0.jar [[http://htmlcleaner.sourceforge.net]]
* copy the [[office importer plugin lib>>attach:xwiki-plugin-officeconverter-0.0.3.jar]] to XWIKI_WEB_HOME/WEB-INF/lib/
* add the office converter plugin to xwiki.cfg
** Edit your WEB-INF/xwiki.cfg file as follows:

{{code}}
xwiki.plugins=[...], com.xpn.xwiki.plugin.officeconverter.OfficeConverterPlugin
{{/code}}

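If you prefer to script the xwiki.cfg change, the following Python sketch shows one way to do it. It assumes the property is written as xwiki.plugins= with no spaces around the equals sign, and that long values are continued with a trailing backslash. The function name add_plugin is hypothetical; review the result before replacing the real file.

{{code language="python"}}
def add_plugin(cfg_text, plugin_class):
    """Return cfg_text with plugin_class appended to the xwiki.plugins property."""
    lines = cfg_text.splitlines()
    out = []
    found = False
    i = 0
    while i < len(lines):
        line = lines[i]
        if not found and line.strip().startswith("xwiki.plugins="):
            found = True
            # Collect the value, following trailing-backslash continuation lines.
            value = line.strip()[len("xwiki.plugins="):]
            while value.endswith("\\") and i + 1 < len(lines):
                i += 1
                value = value[:-1] + lines[i].strip()
            plugins = [p.strip() for p in value.rstrip("\\").split(",") if p.strip()]
            if plugin_class not in plugins:
                plugins.append(plugin_class)
            # Write one plugin per line, using the same continuation style.
            out.append("xwiki.plugins=" + ",\\\n  ".join(plugins))
        else:
            out.append(line)
        i += 1
    if not found:
        # No plugin list yet: create one.
        out.append("xwiki.plugins=" + plugin_class)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = "xwiki.plugins=\\\n  com.example.SomePlugin\n"  # hypothetical existing entry
    print(add_plugin(sample, "com.xpn.xwiki.plugin.officeconverter.OfficeConverterPlugin"))
{{/code}}

The function is idempotent: running it twice does not add the plugin a second time.
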
=== Start Server ===

* start XWiki as you always do.
* start OpenOffice as a server on the same computer.
** On Windows this is a little complicated; please refer to http://www.artofsolving.com/node/11 for details.
** Alternatively, find the soffice executable (often in c:/program files/openoffice-2.3), go to that directory on the command line and run:

{{code}}
soffice -headless -accept="socket,host=127.0.0.1,port=8100;urp;" -nofirststartwizard
{{/code}}

* On Linux, the simplest way is to start it from the command line with the following options:

{{code}}
soffice -headless -accept="socket,host=127.0.0.1,port=8100;urp;" -nofirststartwizard
{{/code}}

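Before importing anything, it can help to confirm that the OpenOffice server is actually listening. The Python sketch below only checks that a TCP connection to the host and port from the command above (127.0.0.1, 8100) can be opened; it does not speak the UNO protocol. The function name is hypothetical.

{{code language="python"}}
import socket


def is_listening(host="127.0.0.1", port=8100, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    state = "reachable" if is_listening() else "not reachable"
    print("OpenOffice server on 127.0.0.1:8100 is " + state)
{{/code}}
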
=== Use the plugin in XWiki ===

* Import [[the office import application>>attach:officeimporter-application0.0.3.xar]] into XWiki
* go to Import.WebHome to convert an office document
* select the source file and enter the target XWiki page's space and page name.
* select convert2xhtml or convert2xwiki
* click the "convert" button
* on success, you can click the "result" link to see the new page.

Warning

* The source file should have a normal filename with the correct extension.
* The target XWiki page should not already exist. Otherwise, you will be shown "not allowed to view the page".
* If you don't have edit rights on the target page, you will be shown "not allowed to view the page".

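The first warning can be checked up front. The Python sketch below validates a file name against the document types listed under Features (doc, xls, ppt, odt, odp, ods). The function name is hypothetical, and the check looks only at the extension, not at the file content.

{{code language="python"}}
import os

# Extensions taken from the feature list on this page.
SUPPORTED_EXTENSIONS = {"doc", "xls", "ppt", "odt", "odp", "ods"}


def check_source_filename(filename):
    """Return the lower-cased extension, or raise ValueError if it is unsupported."""
    base = os.path.basename(filename)
    stem, ext = os.path.splitext(base)
    ext = ext.lstrip(".").lower()
    if not stem or ext not in SUPPORTED_EXTENSIONS:
        raise ValueError("unsupported or missing extension: %r" % filename)
    return ext


if __name__ == "__main__":
    print(check_source_filename("Quarterly Report.DOC"))
{{/code}}
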
== ToDo List and plan ==

=== Use HtmlCleaner instead of JDOM filters to clean the HTML ===

Time: 10 hours
Predicted begin: 2008.08.16
Predicted end: 2008.08.17
Task:

{{html clean="false" wiki="true"}}
* <del>clean the HTML into well-formed code</del>
* <del>remove head tag</del> the head tag can be handled by the xhtmlparser.
* replace the <img> tag with {image}
* <del>remove empty link <a/></del>
* replace deprecated XHTML tags (if possible)
** problem: HtmlCleaner can't simply replace a tag, so this is a little hard.

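Two of the cleaning tasks above (replacing img tags with the image macro and removing empty links) can be sketched as follows. This is a Python illustration using regular expressions, not the HtmlCleaner API that the plugin uses. The output form, the image macro followed by a colon and the file name, is an assumption based on the old XWiki 1.0 macro syntax, and the function name is hypothetical.

{{code language="python"}}
import re
from posixpath import basename

# An img tag with a quoted src attribute; the src value is captured.
IMG_TAG = re.compile(r'<img\b[^>]*?\bsrc\s*=\s*["\']([^"\']+)["\'][^>]*?/?>', re.IGNORECASE)
# An empty link: either self-closed or with nothing but whitespace inside.
EMPTY_LINK = re.compile(r'<a\b[^>]*/>|<a\b[^>]*>\s*</a>', re.IGNORECASE)


def clean_fragment(html):
    """Drop empty links and replace each img tag by {image:<file name>}."""
    html = EMPTY_LINK.sub("", html)
    return IMG_TAG.sub(lambda m: "{image:%s}" % basename(m.group(1)), html)


if __name__ == "__main__":
    print(clean_fragment('<p><img src="pics/a.png" alt="x"/> <a/>text</p>'))
{{/code}}

Regular expressions are enough for a demonstration, but a real cleaner should work on the parsed tree, since attribute order, quoting and nesting vary widely in HTML exported from office documents.
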
===== Write test cases for the conversion =====

Time: 10 hours
Predicted begin: 2008.08.17
Predicted end: 2008.08.18
Task:

* <del>refactor the test framework of the office converter test cases</del>
* <del>make small test input files (MS Word, Excel, PowerPoint and OpenOffice) and verify the output</del>
* test HtmlCleaner (I have to implement the filter and fix some bugs in HtmlCleaner, so this is off track)
* <del>test the typeformat, util, and other classes</del>

===== Inserted task <ins>convert2html</ins> =====

see [[here>>http://xwiki.markmail.org/search/?q=#query:+page:1+mid:2u2to6ywsqqcx42b+state:results]]

* <del>implement a convert2html feature</del>
* <del>clean the code</del>
* <del>write javadoc</del>
* <del>write readme</del>
** <del>feature list</del>
** <del>quick start for how to use it</del>
<p/>
Time: 5 hours
Predicted begin: 2008.08.19
Predicted end: 2008.08.19

===== Convert xhtml to xwiki syntax 2.0 =====

Main Task:

* Write test cases for WikimodelXHTMLParser, covering all the basic XHTML tags.
* Submit patches to [[wikimodel>>http://code.google.com/p/wikimodel]] and xwiki-core-rendering to make WikimodelXHTMLParser + XWikiSyntaxRendering work well for all the test cases.
<p/>
Time: about 8 days
Predicted begin: 2008.08.18
Predicted end: 2008.08.26
Detailed plan:

|=Name|=Predicted time|=Predicted begin|=Predicted end|=Test cases|=Problems
|<del>Base text format</del>|About 1 day|2008.08.18|2008.08.19|{{code}}
<b>
<strong>
<i>
<u>
<s>
<strike>
<em>
<del>
<ins>
<sup>
<sub>
<p> (already exists)
title or section level (already exists)
<hr>
<br>
{{/code}}|If a tag is deprecated in XHTML, like <u>, how should it be handled? That would be the role of the HTML cleaner, so I need to do it in the "html clean" step. Add a TagHandler in wikimodel's XhtmlHandler and add blocks and parser methods in xwiki-core-rendering.
|List|About 2 days|2008.08.19|2008.08.21|{{code}}
<html>
<ol>
  <li>Item 1
    <ol>
      <li>Item 2
        <ul class="star">
          <li>Item 3</li>
        </ul>
      </li>
      <li>Item 4</li>
    </ol>
  </li>
  <li>Item 5</li>
</ol>
<ul class="star">
  <li>Item 1
    <ul class="star">
      <li>Item 2
        <ul class="star">
          <li>Item 3</li>
        </ul>
      </li>
      <li>Item 4</li>
    </ul>
  </li>
  <li>Item 5</li>
  <li>Item 6</li>
</ul>
</html>
{{/code}}|This is hard to fix. I need to see what happens in wikimodel's xhtmlparser.
|<del>Links</del>|About 2 days|2008.08.21|2008.08.23|{{code}}<a href="http://www.xwiki.org">xwiki</a>{{/code}}|This is hard too. If it can't be solved in the parser, I will use a filter to replace links with XWiki syntax when cleaning the HTML.
|Table|About 2 days|2008.08.23|2008.08.25|{{code}}
<html>
<body>
<table>
  <tr>
    <th>1.1</th>
    <th>1.2</th>
  </tr>
  <tr>
    <th>2.1</th>
    <th>2.2</th>
  </tr>
</table>
</body>
</html>
{{/code}}|Even harder, because tables are handled by a macro in the new rendering. Could I just add a simple temporary tableblock solution?
|Image|5 hours|2008.08.25|2008.08.25|{{code}}<img src="imgurl"/>{{/code}}|Just ignore it, as I replace <img> with {image}.
|attribute|10 hours|2008.08.25|2008.08.26|{{code}}<p align="center" color="red">middle</p>{{/code}}|Use the style, but how? I need to find out.
|class| | | |{{code}}<span class="underline">test</span>{{/code}}|Maybe ignore it and treat the element the same as one without a class.
|font| | | |{{code}}<font size="1" style="font-size: 8pt">test</font>{{/code}}|Ignore? Or something else.

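To make the first and third rows of the plan more concrete, here is a minimal Python sketch that maps the basic inline tags and links to XWiki Syntax 2.0 markup. It uses the standard library html.parser module, not WikimodelXHTMLParser, and the class and function names are hypothetical. Lists, tables and attributes are left out on purpose, since those are the hard cases noted in the table.

{{code language="python"}}
from html.parser import HTMLParser

# XWiki Syntax 2.0 markers for the inline tags from the "Base text format" row.
INLINE_MARKUP = {
    "b": "**", "strong": "**",
    "i": "//", "em": "//",
    "u": "__", "ins": "__",
    "s": "--", "strike": "--", "del": "--",
    "sup": "^^", "sub": ",,",
}


class XWikiInlineRenderer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.hrefs = []  # stack of link targets waiting for their closing tag

    def handle_starttag(self, tag, attrs):
        if tag in INLINE_MARKUP:
            self.parts.append(INLINE_MARKUP[tag])
        elif tag == "a":
            self.hrefs.append(dict(attrs).get("href") or "")
            self.parts.append("[[")
        elif tag == "br":
            self.parts.append("\n")
        elif tag == "hr":
            self.parts.append("\n----\n")

    def handle_endtag(self, tag):
        if tag in INLINE_MARKUP:
            self.parts.append(INLINE_MARKUP[tag])
        elif tag == "a" and self.hrefs:
            self.parts.append(">>" + self.hrefs.pop() + "]]")
        elif tag == "p":
            self.parts.append("\n\n")

    def handle_data(self, data):
        self.parts.append(data)


def xhtml_to_xwiki(fragment):
    """Convert a small XHTML fragment to XWiki Syntax 2.0 text."""
    renderer = XWikiInlineRenderer()
    renderer.feed(fragment)
    renderer.close()
    return "".join(renderer.parts).strip()


if __name__ == "__main__":
    print(xhtml_to_xwiki('<p>Visit <a href="http://www.xwiki.org">xwiki</a>, it is <b>bold</b>.</p>'))
{{/code}}

Deprecated tags such as u and strike are mapped directly here for brevity; in the plan above they would be normalized earlier, in the HTML cleaning step.
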
===== Make ppt and odp work =====

<del>As ppt and odp files generate multiple HTML pages, how should they be assembled into one XWiki page?</del>
Time: about 1 day
Predicted begin: 2008.08.27
Predicted end: 2008.08.28

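The feature list says that presentations are converted to a zip file that is shown in an iframe. The packaging half of that idea can be sketched in Python with the standard zipfile module. The file names below are hypothetical examples, and the sketch does not cover attaching the archive to a page or rendering it.

{{code language="python"}}
import io
import zipfile


def zip_slides(files):
    """Pack a mapping of file name to bytes into an in-memory zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # Sort the names so the archive layout is reproducible.
        for name in sorted(files):
            archive.writestr(name, files[name])
    return buffer.getvalue()


if __name__ == "__main__":
    data = zip_slides({
        "slide0.html": b"<html><body>Slide 1</body></html>",  # hypothetical names
        "slide1.html": b"<html><body>Slide 2</body></html>",
    })
    print(len(data), "bytes")
{{/code}}
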
===== Test the project on Windows =====

<del>As I develop on Linux, I need to test whether the project works well on Windows.</del>
Time: 5 hours
Predicted begin: 2008.08.28
Predicted end: 2008.08.29
If you are using Windows, maybe you can help me test it. Thanks.

===== Documents and package =====

<del>Javadoc and Readme.</del>
Time: 5 hours
Predicted begin: 2008.08.29
Predicted end: 2008.08.30

== <del>Old Plan</del> ==

{{info}}These are the function points of the office import. I'll give a detailed plan and timeline for every section soon. The plan is still a draft. Any suggestions and discussion are much appreciated.{{/info}}

=== core of plugin July 8 - July 12 ===

Actually, this work will last until the end of the project, as the core code needs to change to meet the high-level API.

* Todo
** Clean up the code. Provide a low-level API and a high-level API, so that the plugin can be used both in XWiki pages and in other parts of XWiki. (to be detailed)
** handle conflicts with XWiki syntax (maybe that's the job of the xhtmlparser)

=== Integration with xwiki ===

==== Develop an application July 13 - July 15 ====

* upload a file
* select the target page
* convert the document to the page
<p/>
problem

* <del>how to upload a file using fileuploadplugin and get the byte[] of the file.</del>
* <del>a new page or insert into the existing page</del>

== Source Code ==

This project has just started and has only produced initial code. Any suggestions are appreciated. Please add comments to this page to discuss.

* svn for office converter plugin: [[https://svn.xwiki.org/svnroot/xwiki/sandbox/xwiki-plugin-officeimporter]]
* svn for office import application: [[http://svn.xwiki.org/svnroot/xwiki/sandbox/xwiki-application-officeimporter]]

== Build ==

This project uses Maven 2 as its project management tool. You can get the source code and type "mvn install" to build the plugin package.
But as it depends on some libs which are not released yet, you need to build those dependencies if you want to try the latest version.

* get the latest code from svn for the libs below
** xwiki-core
** xwiki-core-rendering
** xwiki-core-xml
** org.wikimodel.wem
* Patch them as described in these issues:
** http://code.google.com/p/wikimodel/issues/detail?id=34
** http://jira.xwiki.org/jira/browse/XWIKI-2568
* install the libs above into your Maven repository
* if you want to test the project with "mvn test" or "mvn install", you should start OpenOffice as a server
* if you want to build it without tests, you should run "mvn install -Dmaven.test.skip=true"

== POM File ==

Please see [[pom.xml>>http://svn.xwiki.org/svnroot/xwiki/sandbox/xwiki-plugin-officeimporter/pom.xml]]

== Reference Libraries ==

Libraries that Office Importer depends on.

* [[JODConverter>>http://www.artofsolving.com/opensource/jodconverter]]
* [[HtmlCleaner>>http://htmlcleaner.sourceforge.net/]]
* [[OpenOffice>>http://www.openoffice.com]]

== Sample Imports (Known issues) ==

This section presents some of the results we obtained with Office Importer along with the original documents for comparison. The purpose of this comparison is to identify the shortcomings and difficulties of importing office documents into wiki pages. Each of the following pages analyses the result of a single document import operation. Please note that all the documents mentioned below were taken from [[Scribd>>http://www.scribd.com/]] and that all of them were imported with the filtering mode set to **None**.

* [[73234.xls>>73234XLS]]
* [[1622132.xls>>1622132XLS]]
* [[5927315.doc>>5927315DOC]]

== Support ==

For any questions or problems, please send an email to [email protected] (you need to subscribe first) or to me at daning106(at)gmail.com

{{/html}}
