Summary: | FILEOPEN: Calc confused by unclosed HTML tags | ||
---|---|---|---|
Product: | LibreOffice | Reporter: | Tristan Miller <psychonaut> |
Component: | Spreadsheet | Assignee: | Not Assigned <libreoffice-bugs> |
Status: | NEW --- | QA Contact: | |
Severity: | normal | ||
Priority: | medium | CC: | psychonaut |
Version: | 3.4.2 release | ||
Hardware: | All | ||
OS: | All | ||
Whiteboard: | |||
Attachments: |
Sample HTML file as described in first comment
Proposed unit test for this bug |
Description
Tristan Miller
2011-08-19 03:11:22 UTC
[This is an automated message.]

This bug was filed before the changes to Bugzilla on 2011-10-16, so it started out as NEW without ever being explicitly confirmed. For this reason the bug has been changed to state NEEDINFO.

To move this bug from NEEDINFO back to NEW, please check whether the bug still persists with the 3.5.0 beta1 or beta2 prereleases.

Details on how to test the 3.5.0 beta1 can be found at:
http://wiki.documentfoundation.org/QA/BugHunting_Session_3.5.0.-1

More detail on this bulk operation:
http://nabble.documentfoundation.org/RFC-Operation-Spamzilla-tp3607474p3607474.html

I checked this in LO 3.5 beta: Calc still does not import the whole table, only up to cell A2. The import in Writer/Web is fine. Setting status to NEW.

Created attachment 55960 [details]
Sample HTML file as described in first comment

Confirming that the problem still exists with LibreOffice 3.6.0.4.

The HTML importer is only confused by this unclosed anchor tag. I've tried other tags like <div>, <span> or <font>, and the import works fine. <a name="foo"> also works. The problem only occurs with <a href="eu">.

One solution would be to manually end the open anchor when the next </td> is found, but that's a bit of spaghetti:

--- a/editeng/source/editeng/eehtml.cxx
+++ b/editeng/source/editeng/eehtml.cxx
@@ -319,6 +319,7 @@ void EditHTMLParser::NextToken( int nToken )
     case HTML_TABLEHEADER_OFF:
     case HTML_TABLEDATA_OFF:
     {
+        AnchorEnd();
         if ( nInCell )
             nInCell--;
     }

A far better solution for all malformed HTML documents would be to clean them up in a first step. This could be done as described at http://www.mostthingsweb.com/2013/02/parsing-html-with-c/

Do we want to include tidy in our project? In my opinion this could be a huge benefit.

Created attachment 87404 [details] [review]
Proposed unit test for this bug |
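
For illustration only, the "end the open anchor at the next </td>" idea from the comment above can be sketched as a standalone pre-processing pass over the raw HTML. This is not LibreOffice code and not the proposed patch: `closeDanglingAnchors` and `startsWithAt` are hypothetical names, the scan is a naive string match rather than a real HTML tokenizer, and it treats every `<a ...>` alike, even though the report says only `<a href=...>` triggers the problem.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not a LibreOffice function): case-insensitive check
// that `html` contains the lowercase literal `tag` starting at `pos`.
static bool startsWithAt(const std::string& html, std::size_t pos,
                         const std::string& tag)
{
    if (pos + tag.size() > html.size())
        return false;
    for (std::size_t i = 0; i < tag.size(); ++i)
    {
        if (std::tolower(static_cast<unsigned char>(html[pos + i])) != tag[i])
            return false;
    }
    return true;
}

// Hypothetical sketch: copy `html` to the result, inserting "</a>" in front
// of any </td> or </th> that is reached while an anchor is still open.
std::string closeDanglingAnchors(const std::string& html)
{
    std::string out;
    bool anchorOpen = false;

    for (std::size_t i = 0; i < html.size(); ++i)
    {
        if (html[i] == '<')
        {
            if (startsWithAt(html, i, "</a>"))
            {
                anchorOpen = false;          // anchor closed properly
            }
            else if (startsWithAt(html, i, "<a ") || startsWithAt(html, i, "<a>"))
            {
                anchorOpen = true;           // anchor opened
            }
            else if (anchorOpen && (startsWithAt(html, i, "</td>") ||
                                    startsWithAt(html, i, "</th>")))
            {
                out += "</a>";               // cell ends with anchor still open
                anchorOpen = false;
            }
        }
        out += html[i];
    }
    return out;
}
```

With this sketch, `closeDanglingAnchors("<td><a href=\"eu\">x</td>")` returns `<td><a href="eu">x</a></td>`, and input whose anchors are already closed passes through unchanged. A real cleanup step, such as the tidy-based approach suggested above, would handle far more cases than this single rule.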