Sample Import: 73234XLS

Last modified by Ecaterina Moraru (Valica) on 2013/11/11 13:47




The following importer defects are noticeable in the result of importing 73284.xls:

  • The sizing of content: in general, the resulting wiki page appears considerably larger than the original document. There are several reasons for this:
    • Stripping of style tags: The original html content has the following style rule defined in the html header:
  BODY,DIV,TABLE,THEAD,TBODY,TFOOT,TR,TH,TD,P { font-family:"Arial"; font-size:x-small }


 Since the Office Importer does not process these style tags, the font-size:x-small rule is lost and the imported content is rendered with noticeably larger fonts.

    • <col> elements used by OO: The OO server used to convert office documents into html generates tables containing column definitions like the following:
 <col width="87" />
 <col width="152" />
 <col width="50" />
Even though this is valid xhtml, our xhtml parser currently ignores these elements. This can be further illustrated with the following rendering test case, which passes:
<table><colgroup><col width="87" /><col width="152" /><col width="50" /></colgroup><tbody><tr><td/><td/><td/></tr></tbody></table>
As a result, the width information of the table is lost and the resulting table is rendered wider than the original.
    • Placement of images: In the original document, non-textual components such as charts are not placed inside table cells (of the spreadsheet) but float on top of the document. This is not true of the generated html, where the images are embedded inside table cells chosen according to their position. This also causes some sizing and alignment problems.
  • Spurious Table Cells: In the wiki result, the second table appears to have several table cells out of order. This happens for spreadsheets containing a lot of images (like charts). The html generated by the OO server contains tables that are out of shape and include many spurious table cells, because it tries to embed the images into the tables.
  • Horizontal Lines: The html generated by the OO server for spreadsheets contains <hr /> elements used to separate multiple spreadsheets. Open Question: Should these horizontal lines be filtered from spreadsheets while cleaning?
  • Italic Headings: The headings generated by the OO server look like this:
<H1>Sheet 1: <EM>Survey sheet</EM></H1>
Thus the resulting headings are partly in italics. Open Question: Should these headings be corrected in spreadsheets while cleaning?
  • Missing vertical orientations: The second table of the original document has some table cells where the cell content (text) is vertically oriented, but this is completely missing in the resulting wiki page. This is because the OO server cannot produce html that renders vertically oriented text.
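
To make the lost column widths concrete, here is a minimal sketch that reads the <col> widths back out of the rendering test case shown in the sizing item. It is written in Python with the standard library only and is not the Office Importer's own parser; the function name column_widths is hypothetical, and it assumes the input is already well-formed xhtml with lowercase tags.

```python
import xml.etree.ElementTree as ET


def column_widths(table_xhtml: str) -> list:
    """Return the numeric width of every <col> in a table, in document order.

    Hypothetical helper for illustration; assumes well-formed xhtml.
    """
    table = ET.fromstring(table_xhtml)
    widths = []
    for col in table.iter("col"):
        width = col.get("width")
        # Skip missing or non-numeric values (for example percentages).
        if width is not None and width.isdigit():
            widths.append(int(width))
    return widths


# The same table used in the rendering test case.
SAMPLE = (
    '<table><colgroup><col width="87" /><col width="152" />'
    '<col width="50" /></colgroup>'
    "<tbody><tr><td/><td/><td/></tr></tbody></table>"
)

print(column_widths(SAMPLE))
```

A parser that kept these values could carry them over to the rendered table (for instance as per-column width parameters) instead of dropping them.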

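If the two open questions above were answered with "yes", the cleaning step could look roughly like the following sketch. It is only an illustration in Python (standard library only), not an existing Office Importer filter: the names clean_spreadsheet_xhtml and drop_rules are hypothetical, and it assumes the html has already been converted to well-formed xhtml with lowercase tags. Only the simple shapes shown above are handled; deeper nesting is not attempted.

```python
import xml.etree.ElementTree as ET

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def _append_text(parent, index, text):
    """Attach text at the position where child number `index` starts."""
    if not text:
        return
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + text


def _unwrap(parent, child):
    """Replace `child` by its text and children, keeping its tail text."""
    index = list(parent).index(child)
    grandchildren = list(child)
    parent.remove(child)
    _append_text(parent, index, child.text)
    for offset, grandchild in enumerate(grandchildren):
        parent.insert(index + offset, grandchild)
    _append_text(parent, index + len(grandchildren), child.tail)


def clean_spreadsheet_xhtml(xhtml: str, drop_rules: bool = True) -> str:
    """Drop <hr/> separators (optional) and italics directly inside headings."""
    root = ET.fromstring(xhtml)
    for parent in list(root.iter()):
        for child in list(parent):
            if drop_rules and child.tag == "hr":
                index = list(parent).index(child)
                parent.remove(child)
                # Keep any text that followed the removed separator.
                _append_text(parent, index, child.tail)
            elif parent.tag in HEADINGS and child.tag in ("em", "i"):
                _unwrap(parent, child)
    return ET.tostring(root, encoding="unicode")


SAMPLE = (
    "<div><h1>Sheet 1: <em>Survey sheet</em></h1>"
    "<table><tr><td/></tr></table><hr/>"
    "<h1>Sheet 2: <em>Data</em></h1></div>"
)

print(clean_spreadsheet_xhtml(SAMPLE))
```

Passing drop_rules=False keeps the horizontal lines and only normalizes the headings, so the two questions can be decided independently.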

Created by Ecaterina Moraru (Valica) on 2013/11/11 12:52
