VideoHelp Forum
  1. Member
    Join Date
    Mar 2008
    Location
    United States
    Hello -
    I'm looking for a PDF-to-text converter that runs on the PC, rather than uploading the document to a website.
    Preferably free.
    Thanks
  2. I'm a Super Moderator johns0's Avatar
    Join Date
    Jun 2002
    Location
    canada
    Try Foxit Reader.
    I think, therefore I am a hamster.
  3. It doesn't look like Foxit Reader can convert PDF to Text.

    https://cdn01.foxitsoftware.com/pub/foxit/datasheet/reader/en_us/Foxit-PDF-Reader.pdf

    I had to upgrade to Foxit PhantomPDF in order to convert a PDF to Text.
    Extraordinary claims require extraordinary evidence -Carl Sagan
  4. I'm a Super Moderator johns0's Avatar
    Join Date
    Jun 2002
    Location
    canada
    I have the regular Foxit Reader and it converts PDF to text.
  5. Member
    Join Date
    Mar 2008
    Location
    United States
    Thanks everybody for your input. What I'm trying to do is convert some old ebook PDFs that appear to be
    literally picture scans of the original book pages.

    I'd like to make the text machine-readable so I stand a chance of converting it to EPUB.
    Here's an example:

    https://s3.us-west-1.wasabisys.com/luminist/EB/B/Barlett%20-%20Vanguard%20of%20Venus.pdf
  6. Good luck with that. Please let us know how it goes. I predict you will have to do some work on it after trying to convert.
    Probably the best approach for you is to export the PDF pages to images and then run OCR software on them to convert the images to text.
    For choosing OCR software, see for example https://www.techradar.com/best/best-ocr-software and similar recommendations.

    This may be of interest too: https://en.wikipedia.org/wiki/Comparison_of_optical_character_recognition_software?useskin=vector

    The open-source Tesseract seems to be quite good. Personally, my best experience was with ABBYY FineReader, but that was many years ago (versions 5 and 6, something like 20 years ago).
  8. Member
    Join Date
    Mar 2021
    Location
    Israel
    Originally Posted by davexnet
    Thanks everybody for your input. What I'm trying to do is convert some old ebook pdf's that appear to be
    literally picture scans of the original book pages.

    I'd like to make the text machine readable so I stand a chance of converting it to epub.
    Here's an example

    https://s3.us-west-1.wasabisys.com/luminist/EB/B/Barlett%20-%20Vanguard%20of%20Venus.pdf
    For an eBook made of images, the best OCR software is ABBYY FineReader. It is a bit pricey, but they may have a free trial that you can use.
    Please see the attached 2 pages that I OCRed for you.
    You might need to check for errors, but I think it is pretty good as is.
    Attached: Barlett - Vanguard of Venus (2 Pages).pdf

    Last edited by Subtitles; 24th Jun 2023 at 12:59.
  9. Member
    Join Date
    Mar 2008
    Location
    United States
    Originally Posted by Subtitles
    For an eBook made of images, the best OCR software is Abbyy FineReader. It is a bit pricy but maybe they have a free trial that you can use.
    Please see attached 2 pages that I OCRed for you.
    You might need to check for errors but I think it is pretty good as is.
    Thanks for the recommendation; any included images are a bonus to me and low priority.
    I'd be happy with a good text-only conversion. Once I get a txt version, I'd probably start with Calibre
    to see what it can do with it. It's a big learning process for me.
  10. I gave it a shot using Python. There are lots of dependencies, so it's not that easy to set up, but I'm posting it here as an alternative.

    I got this from the posted PDF:
    Code:
    The VANGUARD
    of VENUS
    
    by LANDELL BARLETT
    
    Presented With the
    Compliments of
    
    ee
    
    Copyrighted 1928
    
    EXPERIMENTER PUBLISHING CO., 230 Fifth Avenue, New York
    That "ee" probably means it failed to OCR the "Amazing Stories" logo, which has characters of differing heights and, more importantly, overlapping characters; perhaps that was too difficult to deal with.
    Python code:
    Code:
    from pdf2image import convert_from_path
    import numpy as np
    ## import cv2  # optional, only needed to preview or pre-process the image
    import pytesseract

    # Point pytesseract at the Tesseract executable (adjust to your install location)
    pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

    # Render the PDF pages to a list of PIL images (poppler does the rendering)
    pages = convert_from_path(
        'Barlett - Vanguard of Venus.pdf',
        poppler_path=r'F:\downloads\poppler-0.68.0\bin'
        )
    # Only the first page is processed for now
    img = np.array(pages[0])
    ## cv2.imshow('img', img)
    ## cv2.waitKey(0)
    text = pytesseract.image_to_string(img)
    print(text)
    Requirements:
    numpy: pip install numpy
    opencv (only if you want to view an image or adjust images for better OCR): pip install opencv-python
    pdf2image: pip install pdf2image
    poppler-windows: https://blog.alivate.com.au/poppler-windows/
    Tesseract OCR itself, installed from: https://github.com/UB-Mannheim/tesseract/wiki
    then the Python pytesseract module: pip install pytesseract
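    Since the setup has several moving parts, a quick check of what is already installed may save some head-scratching. The following is only a sketch using the Python standard library; the function name find_missing is made up for this example. It looks for the external programs on PATH only, so a program reported as missing may still work if you pass its full path explicitly, as the script above does with tesseract_cmd and poppler_path.

```python
import importlib.util
import shutil

# Python modules imported by the OCR script (names as used in `import` statements)
REQUIRED_MODULES = ["numpy", "pdf2image", "pytesseract"]
# External programs: tesseract does the OCR, pdftoppm is the poppler tool
# that pdf2image calls to render PDF pages to images
REQUIRED_PROGRAMS = ["tesseract", "pdftoppm"]


def find_missing(modules=REQUIRED_MODULES, programs=REQUIRED_PROGRAMS):
    """Return (missing_modules, missing_programs) without importing anything."""
    missing_modules = [m for m in modules if importlib.util.find_spec(m) is None]
    # shutil.which only searches PATH, so programs installed elsewhere show up as missing
    missing_programs = [p for p in programs if shutil.which(p) is None]
    return missing_modules, missing_programs


if __name__ == "__main__":
    mods, progs = find_missing()
    print("missing Python modules:", mods or "none")
    print("programs not found on PATH:", progs or "none")
```

    Anything listed under missing Python modules can be installed with pip as described above; programs not found on PATH either need installing or need their full path given in the script.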
    Last edited by _Al_; 24th Jun 2023 at 16:16.
  11. Member
    Join Date
    Mar 2008
    Location
    United States
    Thanks for this Python approach; I already have Python installed, as I wanted to try the Whisper AI, which needs it.
    I got it working, but I barely know anything about Python. This could be an interesting project -
    I've got to learn some basics before I can tackle the problem!
  12. Oh, there were actually 28 pages or so. I thought there was only one page plus an empty one in that PDF. Using the script below yielded the attached output text:
    Code:
    from pdf2image import convert_from_path
    import numpy as np
    import pytesseract

    pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
    print('processing pdf to images ...')
    # Render every page of the PDF to a PIL image
    pages = convert_from_path(
        r'D:\downloads2\Barlett - Vanguard of Venus.pdf',
        poppler_path=r'F:\downloads\poppler-0.68.0\bin'
        )
    print('total pages:', len(pages))
    text_list = []  # per-page text, kept in memory for later use
    # utf-8 so characters outside the default Windows code page do not raise an error
    with open('output.txt', 'w', encoding='utf-8') as f:
        print('ocr for page:')
        for page_number, page in enumerate(pages, 1):
            print(page_number)
            text = pytesseract.image_to_string(np.array(page))
            # Append a "-N-" page marker after each page's text
            page_text = f'{text}                               -{page_number}-\n\n\n'
            f.write(page_text)
            text_list.append(text)
  13. I'm a Super Moderator johns0's Avatar
    Join Date
    Jun 2002
    Location
    canada
    When you see a page number but no text with it, it means it skipped the graphics.
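    To list those pages without scrolling through the whole file, something like the sketch below might help. It assumes the output was written by the script in the previous post, i.e. each page's text is followed by a long run of spaces and a -N- marker; the function name pages_without_text is made up for this example. An empty page only means Tesseract returned no text for it, which is probably an illustration but could also be a page it failed to read.

```python
import re

# Page marker as written by the script above: a long run of spaces, then -N-
PAGE_MARKER = re.compile(r" {20,}-(\d+)-\n")


def pages_without_text(ocr_output):
    """Return the page numbers whose OCR text is empty or whitespace only."""
    empty_pages = []
    start = 0  # index where the current page's text begins
    for match in PAGE_MARKER.finditer(ocr_output):
        page_text = ocr_output[start:match.start()]
        # strip() also removes the form-feed character Tesseract may emit
        if not page_text.strip():
            empty_pages.append(int(match.group(1)))
        start = match.end()
    return empty_pages


if __name__ == "__main__":
    # Small synthetic sample in the same layout as output.txt
    marker = " " * 31 + "-{}-\n\n\n"
    sample = ("The VANGUARD\nof VENUS\n" + marker.format(1)
              + "\x0c" + marker.format(2)
              + "Chapter text\n" + marker.format(3))
    print(pages_without_text(sample))
```

    For the real file, read it with open('output.txt', encoding='utf-8').read() (or without the encoding argument if the file was written with the system default) and pass the result to pages_without_text.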
  14. Member
    Join Date
    Mar 2008
    Location
    United States
    Thanks everybody, I've got some stuff to digest.
  15. Member
    Join Date
    Mar 2021
    Location
    Israel
    Originally Posted by Mg88
    Sorry to bump this, but I don't want to duplicate the topic. How did it all end for you?
    It all depends on what you want to OCR.
    I have tested several OCR programs, and I find ABBYY FineReader to be the best for books and magazines, as you can select "Exact Copy" in addition to other options.
    I think there is a trial option, so try it before you decide to buy.
    For OCRing images, the best solution is to convert the images to PDF files using the "Microsoft Print to PDF" printer from Windows' built-in Print Pictures dialog, upload them to Google Drive, and open the PDF files with Google Docs. There is a limit of about 2MB per PDF file. If it does nothing, make the PDF file smaller and try again.
    With images you will get only the text.
    The other solutions mentioned above are good, but you might find yourself doing a lot of corrections, and that takes time.
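    Some of those corrections can be automated before the text goes into an EPUB tool. The sketch below is only an illustration using the Python standard library, and the function name reflow_ocr_text is made up for this example. It re-joins words split by a hyphen at the end of a line and merges hard-wrapped lines into paragraphs, assuming paragraphs are separated by blank lines. It will also wrongly join words that really are hyphenated across a line break, so the result still needs proofreading.

```python
import re


def reflow_ocr_text(text):
    """Join hard-wrapped OCR lines into paragraphs separated by blank lines."""
    # Re-join words split by a hyphen at the end of a line: "inter-\nesting" -> "interesting"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Paragraphs are assumed to be separated by one or more blank lines
    paragraphs = re.split(r"\n\s*\n", text)
    # Within a paragraph, collapse line breaks and runs of whitespace to single spaces
    cleaned = [" ".join(p.split()) for p in paragraphs]
    # Drop paragraphs that ended up empty
    return "\n\n".join(p for p in cleaned if p)


if __name__ == "__main__":
    raw = ("It was an inter-\nesting story about\nthe planet.\n\n"
           "The second para-\ngraph starts here.\n")
    print(reflow_ocr_text(raw))
```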