pdf - Difference between iTextSharp 4.1.6 and 5.x versions
We are developing a PDF parser to be used along with our system. The requirement is that we store all the information in PDF documents and are able to reproduce each document as such (with minimal changes from the original document).
We did some googling and found iTextSharp to be the best match for our purpose. We are developing our project using .NET.
As you might have guessed from the title, we need a comparison between specific versions of iTextSharp (4.1.6 vs 5.x). We know that 4.1.6 is the last version of iTextSharp released under the LGPL/MPL license. The 5.x versions are AGPL.
We would like to have a comparison between the versions before choosing between the LGPL version and purchasing a license for the AGPL version (we don't want to publish our code).
I did some browsing through the revision changes in iTextSharp, but I would like to know whether any content exists that makes a good comparison between the versions.
thanks in advance!
I'm the CEO of iText Software, so, just like Michaël, who already answered in the comment section, I'm at the same time an authoritative source and a biased source.
There's a simple comparison chart on the iText web site: http://itextpdf.com/functionalitycomparison
This chart doesn't cover text extraction, so allow me to list the relevant improvements since iText 5.
You've already found this page: http://itextpdf.com/salesfaq
In case you wonder about bug fixes and performance improvements regarding text parsing, this is a more exhaustive list:
5.0.0: Text extraction: major overhaul to perform calculations in user space. This allows the parser to correctly determine line breaks, even if the text or page is rotated.
5.0.1: Refactored the callback method signature so that it won't need to change as the render callback API evolves.
5.0.1: Refactoring to make it easier for outside users to interact with the content stream processor. Refactored the render listener so that text and image event listening occurs in the same interface (reduces a lot of non-value-add complexity).
5.0.1: New filtering functionality for text renderers.
5.0.1: Additional utility method for previewing PDF content.
5.0.1: Added a much more advanced text renderer listener that can reconstruct page content based on the physical location of text on the page.
5.0.1: Added support for XObject form processing (text added via PdfTemplate can now be parsed).
5.0.1: Added rudimentary support for XObject image callbacks.
5.0.1: Bug fix: text extraction wasn't correct for certain page orientations.
5.0.1: Bug fix: matrices were being concatenated in the wrong order.
5.0.1: PdfTextExtractor: changed the default render listener (new location-aware strategy).
5.0.1: Getters for GraphicsState.
5.0.2: Major refactoring of the interface to the text extraction functionality: for instance, introduction of the class PdfReaderContentParser.
5.0.2: CMapAwareDocumentFont: tweaks to make processing of quasi-invalid PDF files more robust.
5.0.2: PdfContentReaderTool: null pointer handling, plus a few well-placed flush calls.
5.0.2: PdfContentReaderTool: show details on resource entries.
5.0.2: PdfContentStreamProcessor: adjustment so that embedded images don't cause parsing problems, and improvements to EI detection.
5.0.2: LocationTextExtractionStrategy: fixed anti-parallel algorithm, plus accounting for negative inter-character offsets. Changed the text extraction strategy so that it builds out the text model first, then computes concatenation requirements.
5.0.2: Adjustments to the LineSegment implementation; optimization of changes made by Bruno to text extraction; for example, introduction of the class MarkedContentInfo.
5.0.3: Added method to get the area of an image in user units.
5.0.3: Improve parsing of inline images.
5.0.3: Adding a check for begin/end sequences when parsing a ToUnicode stream.
5.0.4: Content streams in arrays should be parsed as if they were separated by whitespace.
5.0.4: Expose CTM.
5.0.4: Refactor to pull inline image processing into its own class. Added parsing of image data if there is no filter applied (there are some PDFs where there is no white space between the end of the image data and the EI operator). Ultimately, it would be best to actually parse the image data, but that would require a pretty big refactoring of the iText decoders (to work with streams instead of byte[] of known lengths).
5.0.4: Handle multi-stage filters; correct bug that pulled whitespace as the first byte of the inline image stream.
5.0.4: Applying stream filters to inline images.
5.0.4: PdfReader: expose filter decoder for arbitrary byte arrays (instead of only streams).
5.0.6: CMapParser: fix to read broken ToUnicode CMaps.
5.0.6: Handle malformed embedded images.
5.0.6: CMapAwareDocumentFont: some PDFs have a diff map bigger than 256 characters.
5.0.6: Performance: cache the fonts used in text extraction.
5.1.2: PRTokeniser: made the algorithm to find startxref more memory efficient.
5.1.2: RandomAccessFileOrArray: improved handling of huge files that can't be mapped.
5.1.2: CMapAwareDocumentFont: fix NPE if the mapping doesn't get initialized (I'd rather wind up with junk characters than throw an unexpected exception down the road).
5.1.3: Refactoring of how filters are applied to streams; adjusted the parser so that it can handle multi-stage filters.
5.1.3: Images: allow correct decoding of 1bpc bitmask images.
5.1.3: Images: add JBIG2 streams to pass through.
5.1.3: Images: handle null and indirect references in decode parameters, throw an exception if unable to decode an image.
5.2.0: Improve error messages and improve handling of 0-sized files and attempts to read past the end of the file.
5.2.0: Removed the restriction that using memory mapping requires the file to be smaller than ~2GB.
5.2.0: Avoid NullPointerException in RandomAccessFileOrArray.
5.2.0: Made utility method in PdfContentStreamProcessor private and clarified the stateful nature of the class.
5.2.0: LocationTextExtractionStrategy: bounds checking on string lengths and refactoring to make the code easier to read.
5.2.0: Improve handling of color space dictionaries in images.
5.2.0: Improve handling of quasi-improper inline image content.
5.2.0: Don't decode inline image streams until we absolutely need them.
5.2.0: Avoid NullPointerException if the resource dictionary isn't provided.
5.3.0: LocationTextExtractionStrategy: the old comparison approach caused runtime exceptions in Java 7.
5.3.3: Incorporate text-rise parameter.
5.3.3: Expose glyph-by-glyph information.
5.3.3: Bugfix: text to user space transformation was being applied multiple times for sub-TextRenderInfo objects.
5.3.3: Bugfix: correct baseline calculation so that it doesn't include final character spacing.
5.3.4: Added low-level filtering hook to LocationTextExtractionStrategy.
5.3.5: Fixed bug in PRTokeniser: handle the case where a number is at the end of the stream.
5.3.5: Replaced StringBuffer with StringBuilder in PRTokeniser for performance reasons.
5.4.2: Added isChunkAtWordBoundary() method to LocationTextExtractionStrategy to check if a space character should be inserted between the previous chunk and the current one.
5.4.2: Added getCharSpaceWidth() method to LocationTextExtractionStrategy to get the width of a space character.
5.4.2: Added getText() method to LocationTextExtractionStrategy to get the text of the current chunk.
5.4.2: Added appendTextChunk() method to SimpleTextExtractionStrategy to expose the append process so that subclasses can add text from outside the text parse operation.
5.4.5: Added MultiFilteredRenderListener class to the PDF parser.
5.4.5: Added GlyphRenderListener and GlyphTextRenderListener classes for processing each glyph rather than processing chunks of text.
5.4.5: Added method getMcid() in TextRenderInfo.
5.4.5: Fixed resource leak when there are many inline images in a content stream.
5.5.0: CMapAwareDocumentFont: if the font space width isn't defined, use the default width of the font.
5.5.0: PdfContentReader: avoid exception when displaying an empty dictionary.

There are things you won't be able to do if you don't upgrade. For instance, you won't be able to do the things described in these slides: http://www.slideshare.net/itextpdf/itext-summit-2014-talk-unstructured-pdf
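To give a rough idea of what "reconstructing page content based on the physical location of text" means in the 5.0.1 and 5.4.2 entries, here is a minimal, self-contained sketch in plain Java. It is not iText's implementation and uses no iText classes: the names LocationSortSketch, Chunk and extract, and the half-space-width threshold, are assumptions made up for this illustration. The idea is that chunks arrive in content stream order, get sorted top to bottom and left to right, and a space is inserted only where the horizontal gap suggests a word boundary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative only: hypothetical names, not part of iText or iTextSharp.
class LocationSortSketch {

    /** A piece of text plus its horizontal extent and baseline in user space. */
    static final class Chunk {
        final String text;
        final float startX, endX, baselineY;

        Chunk(String text, float startX, float endX, float baselineY) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
            this.baselineY = baselineY;
        }
    }

    /** Rebuilds reading order from positions; spaceWidth is the assumed width of a space. */
    static String extract(List<Chunk> chunks, float spaceWidth) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // Higher y is higher on the page in PDF user space, so sort by descending
        // baseline first, then by ascending x. Comparing plain keys keeps the
        // ordering consistent, which List.sort requires.
        sorted.sort(Comparator.comparingDouble((Chunk c) -> -c.baselineY)
                .thenComparingDouble(c -> c.startX));

        StringBuilder sb = new StringBuilder();
        Chunk prev = null;
        for (Chunk c : sorted) {
            if (prev != null) {
                if (c.baselineY != prev.baselineY) {
                    // Simplification: a different baseline means a new line.
                    sb.append('\n');
                } else if (c.startX - prev.endX > spaceWidth / 2f) {
                    // A gap wider than half a space is treated as a word boundary.
                    sb.append(' ');
                }
            }
            sb.append(c.text);
            prev = c;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Chunks deliberately listed out of reading order.
        List<Chunk> chunks = Arrays.asList(
                new Chunk("world", 50f, 80f, 700f),
                new Chunk("Second", 10f, 50f, 680f),
                new Chunk("Hello", 10f, 40f, 700f),
                new Chunk("line", 55f, 75f, 680f));
        System.out.println(extract(chunks, 5f));
    }
}
```

Running main prints "Hello world" on the first line and "Second line" on the second. A real parser also has to deal with rotated text, text rise and character spacing, which is what many of the entries in the list above are about.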
If you look at the roadmap for iText, you'll see that we'll invest even more time on text extraction in the future: http://www.slideshare.net/itextpdf/itext-summit-2014-keynote-talk
In all honesty: using a 5-year-old version wouldn't only mean reinventing the wheel, it would also mean falling into every pitfall we've fallen into in the last 5 years. I can assure you that buying a license will be less expensive.
pdf licensing itextsharp itext pdf-parsing