I am doing some research using a web site that has many different old newspapers scanned to .pdf files with OCR. They usually have many years' worth of each daily paper online, with each page a separate .pdf file. This is the web site:
The site has its own online search engine you can use to search for words or specific phrases in the entire archive (hundreds of thousands of separate .pdf files). However, there is no way to limit the search to a particular newspaper, or to a specific year or range of years.
However they have the scanned pages online directly in a very organized way. Each particular title has its own web page. For example here is the web page for the Syracuse NY Evening Telegram newspaper:
Is there a way to use software or some other online search engine such as Google, but to limit the results to just the .pdf files which are below the link I posted above for that particular newspaper?
Actually it would be fine to go to one of the folder links for a particular year on that page above, such as the one for 1922:
and search the actual .pdf files linked there. That would limit the search to that paper and just 1922.
Well, what it might really do is limit the search to that paper for just the beginning of 1922. A big daily newspaper has so many pages and so many separate .pdf files that they are split up onto about 10 separate web pages. I mean you have to navigate to the page with the next group using the left and right arrow icons at the top of the page. Hit the right arrow and you go to the second page for 1922:
and so forth.
Last edited by Toastie; 11th Nov 2012 at 10:47. Reason: adding more specific info
yes, there is a way, but it requires some html and php. maybe some scripting.
what you do is snip (view the source of) their main search page (in opera, it is Ctrl-F3), and have a look at the code to see how they "post" the data. note whether they use html or php, though probably script, and where their data is posted. if it is some .js script, copy that (because it may be a combination of html/php and script), open it and review how they submit/post, then make your own copy and customize it to your search criteria. then add that to your custom html page (that you can put on your hdd), run that in your browser and search all you want!
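for the "have a look at the code" step, here is a rough python sketch that lists each form's action, method and field names from a saved copy of a search page. SAMPLE_PAGE, /search.php and the field names are made up for illustration; the real site's markup is not known here and will differ.

```python
from html.parser import HTMLParser


class FormInspector(HTMLParser):
    """Collect each form's action, method, and named input fields."""

    def __init__(self):
        super().__init__()
        self.forms = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            # A form without a method attribute submits with GET by default.
            self.forms.append({
                "action": attrs.get("action", ""),
                "method": (attrs.get("method") or "get").lower(),
                "fields": [],
            })
        elif tag in ("input", "select", "textarea") and self.forms:
            # Only named fields are sent when the form is submitted.
            name = attrs.get("name")
            if name:
                self.forms[-1]["fields"].append(name)


# Hypothetical stand-in for the saved search page; the real markup will differ.
SAMPLE_PAGE = """
<form action="/search.php" method="post">
  <input type="text" name="query">
  <input type="hidden" name="folder" value="">
  <input type="submit" value="Search">
</form>
"""

inspector = FormInspector()
inspector.feed(SAMPLE_PAGE)
for form in inspector.forms:
    print(form["method"], form["action"], form["fields"])
```

once you know the action, the method and the field names, you have what you need to build your own copy of the form. if the page submits through a .js script instead, this will not show it and you would have to read the script itself.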
Thanks to both of you!
I copied the shortcut to a sample .pdf file to notepad - it had tons of those "%20" or "%2520" things in it. I cut and pasted it into a utility on this web page:
to convert it to a normal path. Once I had that I could clearly see the directory/subdirectory structure being used to store the actual .pdf files. They had some screwy things like double spaces in the sub-folder names. Then I could use that in Google like so:
"search term" site:"path to the folder containing the .pdf files"
Last edited by Toastie; 11th Nov 2012 at 12:08. Reason: fix typos
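The decode-then-search steps from this post could be sketched in Python roughly as follows. This is only an illustration: example.com and the folder names are invented stand-ins for the real archive, and decode_fully and google_site_query are hypothetical helper names. Decoding is repeated because "%2520" is a space that was encoded twice ("%25" is the percent sign itself).

```python
from urllib.parse import unquote, urlsplit


def decode_fully(text, max_rounds=5):
    """Percent-decode repeatedly so double-encoded text like %2520 becomes a space."""
    for _ in range(max_rounds):
        decoded = unquote(text)
        if decoded == text:
            break
        text = decoded
    return text


def google_site_query(term, pdf_url):
    """Build a query in the form: "search term" site:"path to the folder"."""
    parts = urlsplit(pdf_url)
    # Decode only the path, then drop the file name to keep the folder.
    folder = decode_fully(parts.path).rsplit("/", 1)[0]
    return f'"{term}" site:"{parts.netloc}{folder}"'


# Hypothetical shortcut; example.com stands in for the real archive.
shortcut = (
    "http://example.com/Newspapers%2520%2520A/"
    "Some%2520Paper%25201922/page%25200001.pdf"
)
print(google_site_query("search term", shortcut))
```

One caveat: decoding until nothing changes assumes the folder names never contain a literal percent sequence of their own. If they might, a single unquote pass would be the safer choice.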
i don't think i trust that website. i had it up in one of my tabs. after a while i started noticing my internet status (task tray) was always on...sending/receiving. i thought i was still d/l'ing, but i wasn't. that went on for the last 10 minutes. as soon as i closed the tab, it stopped. so, i'm just saying, you never know.
use quotation marks ("") when you see spaces in there like that.
i believe the term is urlencoded, when you see things showing as %nn in there. those snippets have to be urldecoded (or cleaned) if you are trying to put together how things work, or else things will get confusing. i wrote a tool to snip most of this but it is too crude to post here. maybe there is something nicer to download. do a google search for url cleaner or url decode or something like that.
EDIT: oh, i see, you already found it.
see this resource for a detailed explanation: http://en.wikipedia.org/wiki/Percent-encoding
here's a list i put together when i was learning how to scrape google search results, not sure if complete though:
%22 = "
%23 = "#"
%25 = "%"
%26 = "&"
%28 = "(" - casual
%29 = ")" - casual
%2C = ","
%2F = "/"
%3A = ":"
%3C = "<"
%3D = "="
%3E = ">"
%3F = "?"
%40 = "@"
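a list like the one above can also be generated instead of typed by hand. a small python sketch using the standard urllib.parse module; CHARS is just the set of characters from the list, so add whatever else you need:

```python
from urllib.parse import quote, unquote

# The characters from the list above; extend as needed.
CHARS = '"#%&(),/:<=>?@'

# safe="" makes quote() encode every character given, including "/".
table = {quote(ch, safe=""): ch for ch in CHARS}

for code, ch in table.items():
    print(f"{code} = {ch}")

# Going the other way: each code decodes back to its character.
assert all(unquote(code) == ch for code, ch in table.items())
```

this also explains the "%2520" things: the "%" of a "%20" got encoded again as "%25", so those need decoding twice to get back to a space.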
Thanks for spending the time to help out.