Hello -
I'm looking for something that will run on the PC, not something that uploads the document to a website.
Preferably free -
thanks
-
-
It doesn't look like Foxit Reader can convert PDF to Text.
https://cdn01.foxitsoftware.com/pub/foxit/datasheet/reader/en_us/Foxit-PDF-Reader.pdf
I had to upgrade to Foxit PhantomPDF in order to convert a PDF to text.
Extraordinary claims require extraordinary evidence -Carl Sagan -
I have the regular Foxit Reader and it converts PDF to text.
I think,therefore i am a hamster. -
Thanks everybody for your input. What I'm trying to do is convert some old ebook PDFs that appear to be
literally picture scans of the original book pages.
I'd like to make the text machine-readable so I stand a chance of converting it to EPUB.
Here's an example
https://s3.us-west-1.wasabisys.com/luminist/EB/B/Barlett%20-%20Vanguard%20of%20Venus.pdf -
Good luck with that. Please let us know how it goes. I predict you will have to do some work on it after trying to convert.
-
Probably the best approach for you is to export the PDF pages as images and then run OCR software on them to convert the images to text.
To choose an OCR tool, you can follow, for example, https://www.techradar.com/best/best-ocr-software and similar recommendations.
This comparison may be useful too: https://en.wikipedia.org/wiki/Comparison_of_optical_character_recognition_software?useskin=vector
The open-source Tesseract seems to be quite good. Personally, my best experience was with ABBYY FineReader, but that was many years ago (versions 5 and 6, something like 20 years ago). -
Last edited by Subtitles; 24th Jun 2023 at 12:59.
-
-
I gave it a shot using Python. There are lots of dependencies, so it's not that easy to set up, but I'm posting it here as an alternative.
I got this from the posted PDF:
Code:
The VANGUARD of VENUS by LANDELL BARLETT Presented With the Compliments of ee Copyrighted 1928 EXPERIMENTER PUBLISHING CO., 230 Fifth Avenue, New York
Python code:
Code:
from pdf2image import convert_from_path
import numpy as np
##import cv2
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
pages = convert_from_path(
    'Barlett - Vanguard of Venus.pdf',
    poppler_path = r'F:\downloads\poppler-0.68.0\bin'
)
# dealing only with first page if more for now
img = np.array(pages[0])
##cv2.imshow('img', img)
##cv2.waitKey(0)
text = pytesseract.image_to_string(img)
print(text)
numpy: pip install numpy
opencv (only if you want to view the image or adjust images for better OCR): pip install opencv-python
pdf2image: pip install pdf2image
poppler-windows: https://blog.alivate.com.au/poppler-windows/
Tesseract OCR engine (Windows installer): https://github.com/UB-Mannheim/tesseract/wiki
then the Python pytesseract module: pip install pytesseract
Last edited by _Al_; 24th Jun 2023 at 16:16.
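Since the setup involves several separate pieces, a quick check of what is already in place can save some guessing. The following is only a sketch: the program names `pdftoppm` (the Poppler tool that pdf2image runs) and `tesseract` are assumptions about a typical install, and the check looks only at PATH, so programs passed explicitly via `poppler_path` or `tesseract_cmd`, as in the script above, may be reported as missing even though the script works.

```python
import importlib.util
import shutil

# Assumed names for a typical setup, not taken from the post:
# pdf2image runs Poppler's "pdftoppm"; pytesseract runs "tesseract".
REQUIRED_MODULES = ("numpy", "pdf2image", "pytesseract")
REQUIRED_PROGRAMS = ("pdftoppm", "tesseract")


def check_setup():
    """Return {name: True/False} for each Python module and external program."""
    status = {}
    for module in REQUIRED_MODULES:
        # find_spec returns None when a top-level module is not installed
        status[module] = importlib.util.find_spec(module) is not None
    for program in REQUIRED_PROGRAMS:
        # shutil.which searches PATH only
        status[program] = shutil.which(program) is not None
    return status


if __name__ == "__main__":
    for name, found in check_setup().items():
        print(f"{name:12} {'found' if found else 'MISSING'}")
```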
-
Thanks for this Python approach; I already have Python installed because I wanted to try Whisper AI, which needs it.
I got it working but I barely know anything about Python. This could be an interesting project -
I've got to learn some basics before I can tackle the problem! -
Oh, there were actually 28 pages or so.
I thought there was only one page plus an empty one in that PDF. The script below goes through all the pages and yielded the included output text:
Code:
from pdf2image import convert_from_path
import numpy as np
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
print('processing pdf to images ...')
pages = convert_from_path(
    r'D:\downloads2\Barlett - Vanguard of Venus.pdf',
    poppler_path = r'F:\downloads\poppler-0.68.0\bin'
)
print('total pages:', len(pages))
text_list = []
# utf-8 so characters outside the default Windows code page do not break the write
with open('output.txt', 'w', encoding='utf-8') as f:
    print('ocr for page:')
    for page_number, page in enumerate(pages, 1):
        print(page_number)
        text = pytesseract.image_to_string(np.array(page))
        # each page's text is followed by a -N- page marker
        page_text = f'{text} -{page_number}-\n\n\n'
        f.write(page_text)
        text_list.append(text)
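For longer scans, converting every page to an image up front can use a lot of memory. One possible variation, sketched here under assumptions, is to convert and OCR a few pages at a time using pdf2image's `first_page` and `last_page` arguments together with `pdfinfo_from_path` for the page count. The function names `page_ranges` and `ocr_pdf_in_chunks` and the chunk size of 5 are made up for illustration, and on Windows `tesseract_cmd` would still need to be set as in the script above if Tesseract is not on PATH.

```python
def page_ranges(total_pages, chunk_size=5):
    """Yield (first, last) 1-based inclusive page ranges covering the document."""
    for first in range(1, total_pages + 1, chunk_size):
        yield first, min(first + chunk_size - 1, total_pages)


def ocr_pdf_in_chunks(pdf_path, out_path, poppler_path=None, chunk_size=5):
    """OCR a PDF a few pages at a time so only one chunk of images is in memory."""
    # Imported here so page_ranges() works without the OCR stack installed
    from pdf2image import convert_from_path, pdfinfo_from_path
    import pytesseract

    total = pdfinfo_from_path(pdf_path, poppler_path=poppler_path)["Pages"]
    with open(out_path, "w", encoding="utf-8") as f:
        for first, last in page_ranges(total, chunk_size):
            pages = convert_from_path(
                pdf_path,
                first_page=first,
                last_page=last,
                poppler_path=poppler_path,
            )
            for page_number, page in enumerate(pages, first):
                text = pytesseract.image_to_string(page)  # accepts a PIL image
                # same -N- page marker as the script above
                f.write(f"{text} -{page_number}-\n\n\n")
```

With the 28-page example and a chunk size of 10, `page_ranges(28, 10)` gives the ranges 1-10, 11-20 and 21-28.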
-
When you see a page number with no text, it means the OCR skipped the graphics on that page.
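To find those pages without scrolling through output.txt, the `text_list` built by the script above can be scanned for entries with no text. This is a small sketch; the helper name `blank_pages` is made up for illustration.

```python
def blank_pages(text_list):
    """Return 1-based page numbers whose OCR text is empty or whitespace only."""
    return [n for n, text in enumerate(text_list, 1) if not text.strip()]


if __name__ == "__main__":
    sample = ["Title page", "\n\n", "Chapter I"]
    print(blank_pages(sample))  # [2]
```

Those page numbers are the ones worth checking against the original scan, since they are probably illustrations or blank pages.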
-
It all depends on what you want to OCR.
I have tested several OCR programs and I find ABBYY FineReader to be the best for books and magazines, as you can select "Exact Copy" in addition to other options.
I think there is a trial version, so try it before you decide to buy.
For OCRing images, the best solution is to convert the images to PDF files using the Windows built-in "Microsoft Print to PDF" printer (from the Print Pictures dialog), upload them to Google Drive, and open the PDF files with Google Docs. There is a size limit of about 2 MB per PDF file. If nothing happens, make the PDF file smaller and try again.
With images you will get only the text.
The other solutions mentioned above are good, but you might find yourself making a lot of corrections, and that takes time.
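Some of those corrections are mechanical and can be scripted before the manual pass. The sketch below rejoins words that were hyphenated across line breaks and unwraps lines inside a paragraph, which helps when the goal is reflowable text for an ebook. The function name `tidy_ocr_text` is made up for illustration, and the hyphen rule is a blunt assumption: it will also join genuine hyphenated compounds that happen to break at a line end.

```python
import re


def tidy_ocr_text(text):
    """Rejoin hyphenated line breaks and unwrap lines within paragraphs."""
    # "van-\nguard" -> "vanguard" (may also join genuine hyphenated compounds)
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # A single newline inside a paragraph becomes a space; blank lines are kept
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces or tabs
    return re.sub(r"[ \t]+", " ", text).strip()


if __name__ == "__main__":
    raw = "The van-\nguard of\nVenus\n\nChapter two"
    print(tidy_ocr_text(raw))
```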