Frequently Asked Question:
How do I retrieve all URLs and related text from a PDF?
Unfortunately, because of the way PDFs work, the text of a link and the hotspot itself are not related in any way.
Text and graphics can appear anywhere on the page, while the hotspot links are annotations floating on a layer above the page.
Getting all the links is therefore a two-step process, and you will need to write some logic of your own to correlate the two sets of information and match each URL to the text of its link.
Here's the procedure to use:
Step 1. Get the URLs and locations of the hotspot links
QP.LoadFromFile(...)
QP.SelectPage(...)
For X = 1 to QP.AnnotationCount
    // Read the URL and the hotspot rectangle of annotation X
    URL = QP.GetAnnotStrProperty(X, 111)
    Left = QP.GetAnnotDblProperty(X, 105)
    Top = QP.GetAnnotDblProperty(X, 106)
    Width = QP.GetAnnotDblProperty(X, 107)
    Height = QP.GetAnnotDblProperty(X, 108)
    // Store this information in an array
Next X
Step 2. Get the location of blocks of text on the page
QP.SelectPage(...)
PageText = QP.GetPageText(3)
// Split PageText into rows, one row per block of text
// Each row contains these fields, in this order:
//   Font Name, Text Color, Text Size, X1, Y1, X2, Y2, X3, Y3, X4, Y4, Text
// Store this information in a second array
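The row-splitting part of Step 2 could be sketched in Python as follows. This is only an illustration and not part of the library: it assumes that the string returned by GetPageText(3) has already been obtained and that each row is a comma-separated line with the fields listed above, optionally with quoted strings. The function name parse_text_blocks and the dictionary keys are hypothetical.

```python
import csv
import io


def parse_text_blocks(page_text):
    """Parse rows of the form
    Font Name, Text Color, Text Size, X1, Y1, X2, Y2, X3, Y3, X4, Y4, Text
    into a list of dictionaries (assumed row layout, see the note above)."""
    blocks = []
    for row in csv.reader(io.StringIO(page_text, newline="")):
        if len(row) < 12:
            continue  # skip blank or malformed rows
        coords = [float(value) for value in row[3:11]]
        blocks.append({
            "font": row[0],
            "color": row[1],
            "size": float(row[2]),
            # Pair the eight numbers up as four (x, y) corner points
            "corners": list(zip(coords[0::2], coords[1::2])),
            # Rejoin in case unquoted text itself contained commas
            "text": ",".join(row[11:]),
        })
    return blocks
```

Each resulting dictionary holds the four corner points of one block of text, which is the form needed for the comparison in Step 3.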
Step 3. Compare the information in the two arrays to match URLs with blocks of text
A good approach might be to expand the rectangle of the hotspot link by a certain percentage and then check whether the four corner points (X1, Y1) through (X4, Y4) of each block of text fall inside the expanded rectangle. Both sets of coordinates need to be in the same coordinate system for this comparison to work.
There is no guarantee that the individual words of the link text will be returned as a single block, so multiple rows of GetPageText output may fall within the hotspot rectangle. Multiple blocks of text may also not be returned in "visual" order.
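The comparison in Step 3 might look something like the Python sketch below. It is an illustration under stated assumptions, not a definitive implementation: links are assumed to be dictionaries with the hypothetical keys url, left, top, width and height, text blocks are assumed to be dictionaries with the hypothetical keys corners (four (x, y) points) and text, and the origin is assumed to be at the bottom-left of the page so that the hotspot extends downward from Top. The 10% expansion is an arbitrary starting value.

```python
def hotspot_rect(left, top, width, height, expand=0.10):
    """Return (x_min, y_min, x_max, y_max), grown by `expand` on every side.
    Assumes a bottom-left origin, so the rectangle extends downward from `top`."""
    dx, dy = width * expand, height * expand
    return (left - dx, top - height - dy, left + width + dx, top + dy)


def contains(rect, point):
    """True if the point lies inside the rectangle (edges included)."""
    x_min, y_min, x_max, y_max = rect
    x, y = point
    return x_min <= x <= x_max and y_min <= y <= y_max


def match_links(links, blocks, expand=0.10):
    """Pair each URL with the text of every block whose four corners
    all lie inside the expanded hotspot rectangle."""
    result = []
    for link in links:
        rect = hotspot_rect(link["left"], link["top"],
                            link["width"], link["height"], expand)
        hits = [b for b in blocks
                if all(contains(rect, corner) for corner in b["corners"])]
        # Blocks may arrive out of visual order: sort top to bottom,
        # then left to right (bottom-left origin assumed)
        hits.sort(key=lambda b: (-max(y for _, y in b["corners"]),
                                 min(x for x, _ in b["corners"])))
        result.append((link["url"], " ".join(b["text"] for b in hits)))
    return result
```

Sorting by the exact top edge only works when blocks on the same line share the same Y values; with real documents it may be necessary to group blocks whose Y values differ by less than some tolerance before sorting.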