Anybody have a recommendation for PDF (OCR) conversion to text?

21st Jun 2023 22:36 #1
davexnet

Member

Join Date
Mar 2008

Location
United States
Hello,
I'm looking for something that runs locally on the PC rather than uploading the document to a website.
Preferably free.
Thanks.

Quote
22nd Jun 2023 00:53 #2
johns0

I'm a Super Moderator

Join Date
Jun 2002

Location
canada
Try Foxit Reader.

I think,therefore i am a hamster.

Quote
22nd Jun 2023 10:10 #3
TreeTops

Member

Join Date
May 2010

Location
Oregon
It doesn't look like Foxit Reader can convert PDF to Text.

https://cdn01.foxitsoftware.com/pub/foxit/datasheet/reader/en_us/Foxit-PDF-Reader.pdf

I had to upgrade to Foxit PhantomPDF in order to convert a PDF to Text.

Extraordinary claims require extraordinary evidence -Carl Sagan

Quote
22nd Jun 2023 10:44 #4
johns0

I'm a Super Moderator

Join Date
Jun 2002

Location
canada
I have the regular Foxit Reader and it converts PDF to text.

I think,therefore i am a hamster.

Quote
22nd Jun 2023 11:02 #5
davexnet

Member

Join Date
Mar 2008

Location
United States
Thanks everybody for your input. What I'm trying to do is convert some old ebook PDFs that appear to be literally picture scans of the original book pages.

I'd like to make the text machine-readable so I stand a chance of converting it to EPUB.
Here's an example:

https://s3.us-west-1.wasabisys.com/luminist/EB/B/Barlett%20-%20Vanguard%20of%20Venus.pdf

Quote
22nd Jun 2023 12:51 #6
TreeTops

Member

Join Date
May 2010

Location
Oregon
Good luck with that. Please let us know how it goes. I predict you will have to do some work on it after trying to convert.

Extraordinary claims require extraordinary evidence -Carl Sagan

Quote
22nd Jun 2023 16:09 #7
pandy

Member

Join Date
Sep 2008
Probably the best approach for you is to export the PDF pages to images and then use OCR software to convert the images to text.
For choosing OCR software, you can follow, for example, https://www.techradar.com/best/best-ocr-software and similar recommendations.

This may be interesting too: https://en.wikipedia.org/wiki/Comparison_of_optical_character_recognition_software?useskin=vector

Open-source Tesseract seems to be quite good. Personally, my best experience was with ABBYY FineReader, but that was many years ago (versions 5 and 6, something like 20 years ago).
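
For anyone who prefers command-line tools, the two steps (PDF to images, then images to text) could be sketched roughly like this. This is only an illustration under assumptions: it expects the Poppler tool `pdftoppm` and the `tesseract` executable to be on PATH, the 300 dpi setting and English language are guesses that may need tuning, and the function names are made up for this example.

```python
import subprocess
from pathlib import Path


def pdftoppm_command(pdf_path, out_prefix, dpi=300):
    """Command that renders every PDF page to PNG files named <out_prefix>-<page>.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(out_prefix)]


def tesseract_command(image_path, lang="eng"):
    """Command that OCRs one image; Tesseract adds .txt to the output base name."""
    image_path = Path(image_path)
    return ["tesseract", str(image_path), str(image_path.with_suffix("")), "-l", lang]


def ocr_pdf(pdf_path, work_dir):
    """Run both steps. Needs pdftoppm and tesseract installed and on PATH."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    # Step 1: export the PDF pages to images
    subprocess.run(pdftoppm_command(pdf_path, work_dir / "page"), check=True)
    # Step 2: OCR each exported image into a .txt file next to it
    for image in sorted(work_dir.glob("page-*.png")):
        subprocess.run(tesseract_command(image), check=True)
```

Calling `ocr_pdf("book.pdf", "ocr_out")` (both names are placeholders) should leave one .txt file per page in the work folder, which can then be checked and merged by hand.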

Quote
24th Jun 2023 12:50 #8
Subtitles

Member

Join Date
Mar 2021

Location
Israel
Originally Posted by davexnet

Thanks everybody for your input. What I'm trying to do is convert some old ebook PDFs that appear to be literally picture scans of the original book pages.

I'd like to make the text machine-readable so I stand a chance of converting it to EPUB.
Here's an example:

https://s3.us-west-1.wasabisys.com/luminist/EB/B/Barlett%20-%20Vanguard%20of%20Venus.pdf

For an eBook made of images, the best OCR software is ABBYY FineReader. It is a bit pricey, but maybe they have a free trial that you can use.
Please see the two attached pages that I OCRed for you.
You might need to check for errors, but I think it is pretty good as is.

Attached Thumbnails

Last edited by Subtitles; 24th Jun 2023 at 12:59.

Quote
24th Jun 2023 14:01 #9
davexnet

Member

Join Date
Mar 2008

Location
United States
Originally Posted by Subtitles

For an eBook made of images, the best OCR software is ABBYY FineReader. It is a bit pricey, but maybe they have a free trial that you can use.
Please see the two attached pages that I OCRed for you.
You might need to check for errors, but I think it is pretty good as is.

Thanks for the recommendation; any included images are a bonus to me and low priority.
I'd be happy with a good text-only conversion. Once I get a txt version, I'd probably start with Calibre to see what it can do with it. Big learning process for me.

Quote
24th Jun 2023 15:44 #10
_Al_

Member

Join Date
Feb 2011
I gave it a shot using Python. There are lots of dependencies, so it's not that easy to set up, but I'm posting it here as an alternative.

I got this from that posted pdf:

Code:

The VANGUARD of VENUS by LANDELL BARLETT Presented With the Compliments of ee Copyrighted 1928 EXPERIMENTER PUBLISHING CO., 230 Fifth Avenue, New York

That "ee" probably means it failed to OCR the "Amazing Stories" logo, whose characters are not all the same height and, more importantly, overlap each other; perhaps that was too difficult to deal with.
Python code:

Code:

from pdf2image import convert_from_path
import numpy as np
##import cv2
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
pages = convert_from_path(
    'Barlett - Vanguard of Venus.pdf',
    poppler_path = r'F:\downloads\poppler-0.68.0\bin'
    )
#dealing only with first page if more for now
img = np.array(pages[0])
##cv2.imshow('img', img)
##cv2.waitKey(0)
text = pytesseract.image_to_string(img)
print(text)

Requirements:
numpy: pip install numpy
opencv (optional, only if you want to view the images or adjust them for better OCR): pip install opencv-python
pdf2image: pip install pdf2image
poppler for Windows: https://blog.alivate.com.au/poppler-windows/
Tesseract OCR engine (Windows installer): https://github.com/UB-Mannheim/tesseract/wiki
pytesseract Python module: pip install pytesseract
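
With this many pieces, a quick sanity check before running the OCR might save some head-scratching. The following is just a sketch and not part of the script above: `check_ocr_setup` is a made-up helper name, and it only tests whether the Python modules can be found and whether `tesseract` and `pdftoppm` (the Poppler tool that pdf2image calls) are on PATH. If you pass explicit locations via `tesseract_cmd` and `poppler_path` as in the script, the PATH check can report a tool as missing even though everything works.

```python
import importlib.util
import shutil

# Module names as imported in the script above (cv2 is optional, so not listed)
REQUIRED_MODULES = ["numpy", "pdf2image", "pytesseract"]
# Executables: tesseract is the OCR engine, pdftoppm ships with Poppler
REQUIRED_TOOLS = ["tesseract", "pdftoppm"]


def check_ocr_setup(modules=REQUIRED_MODULES, tools=REQUIRED_TOOLS):
    """Return {name: True/False} for each Python module and executable."""
    status = {}
    for name in modules:
        # find_spec returns None when the module is not installed
        status[name] = importlib.util.find_spec(name) is not None
    for name in tools:
        # shutil.which returns None when the executable is not on PATH
        status[name] = shutil.which(name) is not None
    return status


if __name__ == "__main__":
    for name, ok in check_ocr_setup().items():
        print(f"{name:12} {'found' if ok else 'MISSING'}")
```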
Last edited by _Al_; 24th Jun 2023 at 16:16.
Quote
24th Jun 2023 16:26 #11
davexnet

Member

Join Date
Mar 2008

Location
United States
Thanks for this Python approach; I already have Python installed, as I wanted to try Whisper AI, which needed it.
I got it working, but I barely know anything about Python. This could be an interesting project.
I've got to learn some basics before I can tackle the problem!

Quote

24th Jun 2023 18:33 #12

_Al_

Member

Oh, there were actually 28 pages or so.

I thought there was only one page plus an empty one in that PDF. Running the script below on the whole file produced the attached output text:

Code:

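from pdf2image import convert_from_path
import numpy as np
import pytesseract

# Path to the Tesseract executable (adjust to your install)
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

# Render every PDF page to an image (poppler does the actual rendering)
print('processing pdf to images ...')
pages = convert_from_path(
    r'D:\downloads2\Barlett - Vanguard of Venus.pdf',
    poppler_path = r'F:\downloads\poppler-0.68.0\bin'
    )
print('total pages:', len(pages))

text_list = []  # raw text per page, kept for any further processing
# utf-8 so characters outside the Windows default code page can be written
with open('output.txt', 'w', encoding='utf-8') as f:
    print('ocr for page:')
    for page_number, page in enumerate(pages, 1):
        print(page_number)
        text = pytesseract.image_to_string(np.array(page))
        # append a "-N-" page marker after each page's text
        page_text = f'{text}                               -{page_number}-\n\n\n'
        f.write(page_text)
        text_list.append(text)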
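
If the goal is an ebook, the hard line breaks from the scanned pages will probably get in the way. Here is a rough cleanup sketch that could be applied to each entry of `text_list`. The function name `unwrap_ocr_text` is made up for this example, and the rules are assumptions: a hyphen at the end of a line is treated as a split word (so a real hyphenated word like "well-known" broken across lines would lose its hyphen), and blank lines are treated as paragraph breaks.

```python
import re


def unwrap_ocr_text(text):
    """Join hard-wrapped OCR lines into paragraphs separated by blank lines."""
    # Re-join words hyphenated across a line break: "van-\nguard" -> "vanguard"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Blank (or whitespace-only) lines are taken as paragraph boundaries
    paragraphs = re.split(r"\n\s*\n", text)
    # Inside a paragraph, collapse newlines and repeated spaces to single spaces
    cleaned = [" ".join(p.split()) for p in paragraphs]
    # Drop paragraphs that ended up empty and separate the rest by one blank line
    return "\n\n".join(p for p in cleaned if p)
```

The result still needs proofreading, but it should be a friendlier starting point for something like Calibre than the raw page text.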

Attached Files

output.txt (72.0 KB, 50 views)

Quote

24th Jun 2023 18:50 #13
johns0

I'm a Super Moderator

Join Date
Jun 2002

Location
canada
When you see a page number but no text for that page, it means it skipped the graphics.
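
To list those pages without scrolling through the file, something like the following could work. It is a sketch that assumes the output was written exactly in the format used by the script in this thread (page text, a run of spaces, a "-N-" marker, then three newlines); `find_blank_pages` is a made-up name, and OCR text that happens to contain a similar marker would confuse it.

```python
import re

# One page as written by the script: the OCR text, then a run of spaces,
# then the "-N-" marker followed by three newlines.
PAGE_PATTERN = re.compile(r"(.*?) {10,}-(\d+)-\n\n\n", re.DOTALL)


def find_blank_pages(content):
    """Return the page numbers whose OCR text is empty or whitespace only."""
    blank = []
    for match in PAGE_PATTERN.finditer(content):
        page_text, page_number = match.group(1), int(match.group(2))
        if not page_text.strip():
            blank.append(page_number)
    return blank
```

Read output.txt into a string and pass it to `find_blank_pages`; the numbers it returns are the pages worth checking against the PDF for illustrations or failed OCR.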

I think,therefore i am a hamster.

Quote
25th Jun 2023 02:32 #14
davexnet

Member

Join Date
Mar 2008

Location
United States
Thanks everybody, I've got some stuff to digest

Quote
25th Sep 2023 03:52 #15
Subtitles

Member

Join Date
Mar 2021

Location
Israel
Originally Posted by Mg88

Sorry to bump this, but I don't want to duplicate the topic. How did it all end for you?

It all depends on what you want to OCR.
I have tested several OCR programs, and I find ABBYY FineReader to be the best for books and magazines, as you can select "Exact Copy" in addition to other options.
I think there is a trial option, so try before you decide to buy.
For OCRing images, the best solution is to convert the images to PDF files using the Windows built-in picture printing with "Microsoft Print to PDF", upload them to Google Drive, and open the PDF files with Google Docs. There is a limit of about 2MB per PDF file. If nothing happens, make the PDF file smaller and try again.
With images you will get only the text.
The other solutions mentioned above are good, but you might find yourself doing a lot of corrections, and that takes time.

Quote
