I am doing some research using a web site that has many different old newspapers scanned to .pdf files with OCR. They usually have many years' worth of each daily paper online, with each page a separate .pdf file. This is the web site:
http://fultonhistory.com/Fulton.html
The site has its own online search engine you can use to search for words or specific phrases in the entire archive (hundreds of thousands of separate .pdf files). However, there is no way to limit the search to a particular newspaper, or to a specific year or range of years.
However they have the scanned pages online directly in a very organized way. Each particular title has its own web page. For example here is the web page for the Syracuse NY Evening Telegram newspaper:
http://fultonhistory.com/my%20photo%20albums/All%20Newspapers/Syracuse%20NY%20Evening%...ram/index.html
Is there a way to use software or some other online search engine such as Google, but to limit the results to just the .pdf files which are below the link I posted above for that particular newspaper?
Actually it would be fine to go to one of the folder links for a particular year on that page above, such as the one for 1922:
http://fultonhistory.com/my%20photo%20albums/All%20Newspapers/Syracuse%20NY%20Evening%...pdf/index.html
and search the actual .pdf files linked there. That would limit the search to that paper and just 1922.
Well, what it might really do is limit the search to that paper for just the beginning of 1922. A big daily newspaper has so many pages and so many separate .pdf files that they are split up onto about 10 separate web pages. I mean you have to navigate to the page with the next group using the left and right arrow icons at the top of the page. Hit the right arrow and you go to the second page for 1922:
http://fultonhistory.com/my%20photo%20albums/All%20Newspapers/Syracuse%20NY%20Evening%...df/index2.html
and so forth.
Any ideas?
Thanks!
-
Last edited by Toastie; 11th Nov 2012 at 10:47. Reason: adding more specific info
-
Try specifying site:fultonhistory.com filetype:pdf in your google search. Or use the advanced search option at google.
http://www.google.com/advanced_search
Last edited by jagabo; 11th Nov 2012 at 11:24.
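If you want to build that kind of query programmatically, here is a minimal Python sketch. The helper name `google_query_url` and the example phrase are made up for illustration; the only things taken from the post are the `site:` and `filetype:` operators.

```python
from urllib.parse import urlencode

def google_query_url(phrase, site, filetype=None):
    """Build a Google search URL limited to one site (hypothetical helper)."""
    # Quote the phrase so Google treats it as an exact phrase.
    parts = [f'"{phrase}"', f"site:{site}"]
    if filetype:
        parts.append(f"filetype:{filetype}")
    # urlencode percent-encodes the query so it is safe to paste as a URL.
    return "https://www.google.com/search?" + urlencode({"q": " ".join(parts)})

# Example phrase is an assumption, not from the thread.
print(google_query_url("evening telegram", "fultonhistory.com", "pdf"))
```

Pasting the printed URL into a browser should be equivalent to typing the query into the search box by hand.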
-
yes, there is a way, but it requires some html and php, maybe some scripting.
what you do is view the source of their main search page (in opera, it is Ctrl-F3) and have a look at the code to see how they "post" the data. note whether the form is handled by html, php, or (more likely) a script, and where the data is posted to. if it is some .js script, copy that too (it may be a combination of html/php and script), open it, and review how they submit/post. then make your own copy, customize it to your search criteria, add it to a custom html page (which you can keep on your hdd), run that in your browser, and search all you want!
-
Thanks to both of you!
I copied the shortcut to a sample .pdf file to notepad - it had tons of those "%20" or "%2520" things in it. I cut and pasted it into a utility on this web page:
http://www.blooberry.com/indexdot/html/topics/urlencoding.htm
to convert it to a normal path. Once I had that I could clearly see the directory/subdirectory structure being used to store the actual .pdf files. They had some screwy things like double spaces in the sub-folder names. Then I could use that in Google like so:
"search term" site:"path to the folder containing the .pdf files"
It works!!!
Last edited by Toastie; 11th Nov 2012 at 12:08. Reason: fix typos
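The decoding step above can also be done locally instead of through an online utility. This is only a rough sketch: the path in `encoded` is a made-up example (not a real folder on the site), and `fully_decode` is a hypothetical helper name. A "%2520" is a "%20" that was encoded a second time, so the sketch decodes repeatedly until nothing changes.

```python
from urllib.parse import unquote

def fully_decode(url, max_rounds=5):
    """Percent-decode repeatedly until the text stops changing."""
    for _ in range(max_rounds):
        decoded = unquote(url)
        if decoded == url:
            break
        url = decoded
    return url

# Made-up path for illustration only; note the double space before 1922.
encoded = "example.com/my%2520photo%2520albums/Some%20Paper%20%201922/"
path = fully_decode(encoded)
print(path)

# Same query shape as in the post: quoted phrase, then site: plus the path.
print(f'"search term" site:"{path}"')
```

One caveat: if a real folder name contains a literal "%" followed by two hex digits, repeated decoding would mangle it, so check the result against the site before relying on it.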
-
i don't think i trust that website. i had it up in one of my tabs. after a while i started noticing my internet status (task tray) was always on...sending/receiving. i thought i was still d/l'ing, but i wasn't. that went on for the last 10 minutes. as soon as i closed the tab, it stopped. so, i'm just saying, you never know.
-
use quotation marks ("") when you see spaces in there like that.
i believe the term is urlencoded, when you see things showing as %n in there. those snippets have to be urldecoded (or cleaned) if you are trying to put together how things work, or else things will get confusing. i wrote a tool to snip most of this but it is too crude to post here. maybe there is something nicer to download. do a google search for url cleaner or url decode or something like that.
EDIT: oh, i see, you already found it.
see this resource for a detailed explanation: http://en.wikipedia.org/wiki/Percent-encoding
here's a list i put together when i was learning how to scrape google search results, not sure if complete though:
%22 = "
%23 = "#"
%25 = "%"
%26 = "&"
%28 = "(" - casual
%29 = ")" - casual
%2C = ","
%2F = "/"
%3A = ":"
%3C = "<"
%3D = "="
%3E = ">"
%3F = "?"
%40 = "@"
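A list like this can be checked, or extended, with a few lines of Python rather than by hand. This is just a sketch using the standard library's `urllib.parse`; the `codes` list simply repeats the entries above, and the sample string in the last line is made up.

```python
from urllib.parse import quote, unquote

# The codes from the list above (duplicates left out).
codes = ["%22", "%23", "%25", "%26", "%28", "%29", "%2C", "%2F",
         "%3A", "%3C", "%3D", "%3E", "%3F", "%40"]

# Decode each code to the character it stands for.
for code in codes:
    print(code, "=", unquote(code))

# Going the other way: safe="" makes quote() encode "/" as well.
print(quote("a b/c?d", safe=""))
```

The last line shows the reverse direction, turning a plain string back into its percent-encoded form.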