Frequently Asked Question:

Text extraction from landscape PDF

Question

I'm using GetPageText, but the text is not extracted in the correct sequence or with the correct spacing. Using DAExtractPageText, I can see that it reads the lines in an unsorted order. Is there a way to sort the extracted text so that it matches how it is presented in the PDF?

Answer

Every PDF is different and text extraction is inherently difficult.

Text objects can be placed into a PDF in any order, much like the pieces of a jigsaw puzzle can be put down in any sequence. Most PDFs follow a left-to-right, top-to-bottom order, but this is not guaranteed. If the PDF is fairly straightforward, you can collect the text and bounding box information returned from DAExtractPageText with option 3 or 4 and then sort the rectangles based on their x,y positions.
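
The sorting step might look something like the Python sketch below. It is only an illustration and not part of the library: the TextBlock class, the sort_reading_order and to_text helpers, and the line_tolerance value are all hypothetical, and parsing the output of DAExtractPageText into blocks is left out. It also assumes that the y coordinate grows upward from the bottom of the page, so a larger top value means the text sits higher on the page.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    """Hypothetical container for one extracted piece of text and its box."""
    text: str
    left: float
    top: float
    right: float
    bottom: float


def sort_reading_order(blocks, line_tolerance=2.0):
    """Group blocks into lines (top to bottom), then sort each line left to right.

    Assumes y grows upward, so the block with the largest `top` comes first.
    Blocks whose `top` is within `line_tolerance` of the first block in the
    current line are treated as being on the same line.
    """
    ordered = sorted(blocks, key=lambda b: -b.top)
    lines = []
    for block in ordered:
        if lines and abs(lines[-1][0].top - block.top) <= line_tolerance:
            lines[-1].append(block)
        else:
            lines.append([block])
    return [sorted(line, key=lambda b: b.left) for line in lines]


def to_text(lines):
    """Join the blocks of each line with spaces and the lines with newlines."""
    return "\n".join(" ".join(b.text for b in line) for line in lines)


# Made-up blocks, deliberately out of order.
blocks = [
    TextBlock("world", 60, 700, 90, 688),
    TextBlock("Second", 10, 680, 50, 668),
    TextBlock("Hello", 10, 700.5, 50, 688.5),
    TextBlock("line", 60, 680, 85, 668),
]
print(to_text(sort_reading_order(blocks)))
```

The tolerance is there because text on the same visual line often has slightly different y values, so an exact comparison would split one line into several. On a rotated page, such as some landscape documents, the coordinates may need to be swapped or transformed before sorting.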