VideoHelp Forum




  1. Member
    Join Date
    Jun 2002
    Location
    United States
    I am doing some research using a web site that has many different old newspapers scanned to .pdf files with OCR. They usually have many years' worth of each daily paper online, with each page a separate .pdf file. This is the web site:

    http://fultonhistory.com/Fulton.html

    The site has its own online search engine that you can use to search for words or specific phrases in the entire archive (hundreds of thousands of separate .pdf files). However, there is no way to limit the search to a particular newspaper, or to a specific year or range of years.

    However, they have the scanned pages online directly in a very organized way. Each particular title has its own web page. For example, here is the web page for the Syracuse NY Evening Telegram newspaper:

    http://fultonhistory.com/my%20photo%20albums/All%20Newspapers/Syracuse%20NY%20Evening%...ram/index.html

    Is there a way to use software or some other online search engine such as Google, but to limit the results to just the .pdf files which are below the link I posted above for that particular newspaper?

    Actually, it would be fine to go to one of the folder links for a particular year on that page above, such as the one for 1922:

    http://fultonhistory.com/my%20photo%20albums/All%20Newspapers/Syracuse%20NY%20Evening%...pdf/index.html

    and search the actual .pdf files linked there. That would limit the search to that paper and just 1922.

    Well, what it might really do is limit the search to that paper for just the beginning of 1922. A big daily newspaper has so many pages, and so many separate .pdf files, that they are split up onto about 10 separate web pages. You have to navigate to the page with the next group using the left and right arrow icons at the top of the page. Hit the right arrow and you go to the second page for 1922:

    http://fultonhistory.com/my%20photo%20albums/All%20Newspapers/Syracuse%20NY%20Evening%...df/index2.html

    and so forth.
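
    The page naming described above (index.html, then index2.html, and so forth) can be listed with a short script. This is only a sketch: the folder URL and the page count are parameters you supply yourself, the example.com address is a placeholder, and the assumption that later pages continue as index3.html, index4.html, etc. is inferred from the first two names rather than confirmed.

```python
def index_pages(folder_url, page_count):
    """Return index.html, index2.html, ..., index<page_count>.html under folder_url.

    Assumes the first page has no number and later pages are numbered from 2,
    as the first two page names in the post suggest.
    """
    base = folder_url.rstrip("/") + "/"
    names = ["index.html"] + [f"index{n}.html" for n in range(2, page_count + 1)]
    return [base + name for name in names]


# Placeholder folder URL, not the real (truncated) path from the post.
for url in index_pages("http://example.com/some%20folder", 3):
    print(url)
```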


    Any ideas?

    Thanks!
    Last edited by Toastie; 11th Nov 2012 at 10:47. Reason: adding more specific info
  2. Try specifying site:fultonhistory.com filetype:pdf in your Google search (the operator takes a colon, not an equals sign). Or use the advanced search option at Google.

    http://www.google.com/advanced_search
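
    If you want to build such a search link rather than type it, something like the following should work. It is a minimal sketch: the function name is made up for illustration, and it assumes the usual https://www.google.com/search?q=... address form with the site: and filetype: operators placed in the query text.

```python
from urllib.parse import urlencode


def google_search_url(terms, site, filetype=None):
    """Build a Google search URL restricted to one site and, optionally, one file type."""
    parts = [f'"{terms}"', f"site:{site}"]
    if filetype:
        parts.append(f"filetype:{filetype}")
    # urlencode percent-encodes the quotes and colons and turns spaces into "+".
    return "https://www.google.com/search?" + urlencode({"q": " ".join(parts)})


print(google_search_url("search term", "fultonhistory.com", "pdf"))
```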
    Last edited by jagabo; 11th Nov 2012 at 11:24.
  3. Member vhelp's Avatar
    Join Date
    Mar 2001
    Location
    New York
    yes, there is a way, but it requires some html and php, and maybe some scripting.

    what you do is view the source of their main search page (in opera, it is Ctrl-F3) and have a look at the code to see how they "post" the data. note whether the form is plain html or php, or more likely a script, and where the data is being posted to. if it is some .js script, copy that (because it may be a combination of html/php and script), open it, and review how they submit/post. then make your own copy and customize it to your search criteria. add that to a custom html page (which you can keep on your hdd), open it in your browser, and search all you want!
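
    as a rough illustration of the "look at how they post" step, here is a small sketch that lists each form's action, method, and field names from saved page source. the sample markup is made up for the example and is NOT the real site's form; the class and function names are also just illustrative.

```python
from html.parser import HTMLParser


class FormFinder(HTMLParser):
    """Collect each <form>'s action, method, and named input fields."""

    def __init__(self):
        super().__init__()
        self.forms = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.forms.append({
                "action": attrs.get("action", ""),
                "method": (attrs.get("method") or "get").lower(),
                "fields": [],
            })
        elif tag in ("input", "select", "textarea") and self.forms and attrs.get("name"):
            # Simplification: a named field is attached to the most recent form seen.
            self.forms[-1]["fields"].append(attrs["name"])


def find_forms(html_text):
    finder = FormFinder()
    finder.feed(html_text)
    return finder.forms


# Hypothetical markup for illustration only.
SAMPLE = """
<form action="/search" method="POST">
  <input type="text" name="query">
  <input type="hidden" name="zone" value="all">
  <input type="submit" value="Go">
</form>
"""
print(find_forms(SAMPLE))
```

    once you know the action and the field names, you can write your own small form (or script) that submits the same fields with your own values.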
  4. Member
    Join Date
    Jun 2002
    Location
    United States
    Thanks to both of you!

    I copied the shortcut to a sample .pdf file into Notepad. It had tons of those "%20" or "%2520" things in it. I cut and pasted it into a utility on this web page:

    http://www.blooberry.com/indexdot/html/topics/urlencoding.htm

    to convert it to a normal path. Once I had that, I could clearly see the directory/subdirectory structure being used to store the actual .pdf files. They had some screwy things like double spaces in the sub-folder names. Then I could use that in Google like so:

    "search term" site:"path to the folder containing the .pdf files"

    It works!!!
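
    The same decoding can be done locally instead of with a web utility. The following is a sketch, assuming the "%2520" pieces are double-encoded spaces ("%25" is an encoded "%", so "%2520" decodes to "%20" and then to a space). The function names are made up, dropping the "http://" prefix before the site: operator is an assumption, and only the untruncated start of the path from the first post is used as sample input.

```python
from urllib.parse import unquote


def fully_unquote(url, max_rounds=5):
    """Percent-decode repeatedly so double-encoded text like %2520 becomes a space.

    Caveat: a folder name containing a literal "%" would be over-decoded.
    """
    for _ in range(max_rounds):
        decoded = unquote(url)
        if decoded == url:
            break
        url = decoded
    return url


def site_query(term, decoded_url):
    """Build a query in the form: "search term" site:"path to the folder"."""
    path = decoded_url.split("://", 1)[-1]  # drop the scheme (assumption)
    return f'"{term}" site:"{path}"'


folder = fully_unquote("http://fultonhistory.com/my%2520photo%2520albums/All%2520Newspapers/")
print(folder)
print(site_query("search term", folder))
```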
    Last edited by Toastie; 11th Nov 2012 at 12:08. Reason: fix typos
  5. Member vhelp's Avatar
    Join Date
    Mar 2001
    Location
    New York
    i don't think i trust that website. i had it up in one of my tabs. after a while i started noticing my internet status icon (task tray) was always on...sending/receiving. i thought i was still d/l'ing, but i wasn't. that went on for the last 10 minutes. as soon as i closed the tab, it stopped. so, i'm just saying, you never know.
  6. Member vhelp's Avatar
    Join Date
    Mar 2001
    Location
    New York
    use quotation marks ("") when you see spaces in there like that.

    i believe the term is urlencoded, when you see things showing as %nn in there (a percent sign followed by two hex digits). those snippets have to be urldecoded (or cleaned) if you are trying to put together how things work, or else things will get confusing. i wrote a tool to clean most of this but it is too crude to post here. maybe there is something nicer to download. do a google search for url cleaner or url decode or something like that.

    EDIT: oh, i see, you already found it.
    see this resource for a detailed explanation: http://en.wikipedia.org/wiki/Percent-encoding


    here's a list i put together when i was learning how to scrape google search results, not sure if it is complete though:

    %22 = '"'
    %23 = "#"
    %25 = "%"
    %26 = "&"
    %28 = "(" - casual
    %29 = ")" - casual
    %2C = ","
    %2F = "/"
    %3A = ":"
    %3C = "<"
    %3D = "="
    %3E = ">"
    %3F = "?"
    %40 = "@"
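
    a list like this can also be generated rather than typed by hand. here is a small sketch using python's standard urllib.parse module; the character set is just the ones from the list plus the space (%20) mentioned earlier in the thread, and the function name is made up.

```python
from urllib.parse import quote, unquote

# Characters from the list above, plus the space seen earlier as %20.
CHARS = ' "#%&(),/:<=>?@'


def percent_table(chars=CHARS):
    """Map each character's percent-encoded form to the character itself."""
    # safe="" forces "/" to be encoded too; by default quote() leaves it alone.
    return {quote(ch, safe=""): ch for ch in chars}


for code, ch in percent_table().items():
    print(f"{code} = {ch!r}")

# Decoding goes the other way: unquote("%3A") gives ":".
```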
  7. Member
    Join Date
    Jun 2002
    Location
    United States
    Thanks for spending the time to help out.


