how to search a specific web page??

11th Nov 2012 10:38 #1
Toastie

Member

Join Date
Jun 2002

Location
United States
I am doing some research using a web site that has many different old newspapers scanned to .pdf files with OCR. They usually have many years' worth of each daily paper online, with each page a separate .pdf file. This is the web site:

http://fultonhistory.com/Fulton.html

The site has its own online search engine you can use to search for words or specific phrases in the entire archive (hundreds of thousands of separate .pdf files). However, there is no way to limit the search to a particular newspaper, or to a specific year or range of years.

However they have the scanned pages online directly in a very organized way. Each particular title has its own web page. For example here is the web page for the Syracuse NY Evening Telegram newspaper:

http://fultonhistory.com/my%20photo%20albums/All%20Newspapers/Syracuse%20NY%20Evening%...ram/index.html

Is there a way to use software or some other online search engine such as Google, but to limit the results to just the .pdf files which are below the link I posted above for that particular newspaper?

Actually it would be fine to go to one of the folder links for a particular year on that page above, such as the one for 1922:

http://fultonhistory.com/my%20photo%20albums/All%20Newspapers/Syracuse%20NY%20Evening%...pdf/index.html

and search the actual .pdf files linked there. That would limit the search to that paper and just 1922.

Well, what it might really do is limit the search to that paper for just the beginning of 1922. A big daily newspaper has so many pages and so many separate .pdf files that they are split up onto about 10 separate web pages. I mean you have to navigate to the page with the next group using the left and right arrow icons at the top of the page. Hit the right arrow and you go to the second page for 1922:

http://fultonhistory.com/my%20photo%20albums/All%20Newspapers/Syracuse%20NY%20Evening%...df/index2.html

and so forth.
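
The page naming described above (index.html, then index2.html, and so forth) could be sketched as a small helper. This is only a sketch: it assumes the numbering simply continues as index3.html, index4.html and so on, which the post implies but does not confirm, and the URL below is a made-up placeholder, not the real (truncated) path.

```python
def index_pages(first_page_url, page_count):
    """List index.html, index2.html, ... index<page_count>.html for one folder."""
    base, _, name = first_page_url.rpartition("/")
    if name != "index.html":
        raise ValueError("expected a URL ending in /index.html")
    pages = [first_page_url]
    # The first page has no number; later pages are numbered from 2 upward.
    for n in range(2, page_count + 1):
        pages.append(f"{base}/index{n}.html")
    return pages


# Hypothetical folder URL, for illustration only.
pages = index_pages("http://example.com/Some%20Paper/1922/index.html", 3)
for page in pages:
    print(page)
```

The page count has to be supplied by hand (the post says "about 10" for a big daily paper in 1922).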

Any ideas?

Thanks!

Last edited by Toastie; 11th Nov 2012 at 10:47. Reason: adding more specific info

11th Nov 2012 11:17 #2
jagabo

Member

Join Date
Dec 2005
Try specifying site:fultonhistory.com filetype:pdf in your Google search. Or use the advanced search option at Google.

http://www.google.com/advanced_search
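
For anyone who wants to build such a search link in a script, here is a minimal sketch that combines a quoted phrase with the site: and filetype: operators. The function name and the sample phrase are made up for illustration; only the operators and the site name come from the suggestion above.

```python
from urllib.parse import urlencode


def google_search_url(phrase, site, filetype=None):
    """Build a Google search URL limited to one site and, optionally, one file type."""
    parts = [f'"{phrase}"', f"site:{site}"]
    if filetype:
        parts.append(f"filetype:{filetype}")
    # urlencode percent-encodes the quotes and colons and turns spaces into "+".
    return "https://www.google.com/search?" + urlencode({"q": " ".join(parts)})


# Hypothetical search phrase, for illustration only.
url = google_search_url("evening telegram", "fultonhistory.com", "pdf")
print(url)
```

Pasting the printed link into a browser should be equivalent to typing the same operators into the search box by hand.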

Last edited by jagabo; 11th Nov 2012 at 11:24.

11th Nov 2012 11:30 #3
vhelp

Member

Join Date
Mar 2001

Location
New York
yes, there is a way, but it requires some html and php, maybe some scripting.

what you do is grab the source of their main search page (in opera, it is Ctrl-F3) and have a look at the code to see how they "post" the data. note whether they use html or php (though probably script) and where the data is being posted. if it is some .js script, copy that too (because it may be a combination of html/php and script), open it and review how they submit/post, then make your own copy and customize it to your search criteria. then add that to a custom html page (that you can put on your hdd), run that in your browser and search all you want!

11th Nov 2012 12:06 #4
Toastie

Thanks to both of you!

I copied the shortcut to a sample .pdf file to notepad - it had tons of those "%20" or "%2520" things in it. I cut and pasted it into a utility on this web page:

http://www.blooberry.com/indexdot/html/topics/urlencoding.htm

to convert it to a normal path. Once I had that I could clearly see the directory/subdirectory structure being used to store the actual .pdf files. They had some screwy things like double spaces in the sub-folder names. Then I could use that in Google like so:

"search term" site:"path to the folder containing the .pdf files"

It works!!!
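
The same decoding can be done locally instead of through a web utility. Below is a rough sketch using Python's standard urllib.parse module. The shortcut URL is a made-up placeholder (the real paths are not shown in full in this thread), and decoding in a loop is just one way to cope with double-encoded "%2520" sequences.

```python
from urllib.parse import unquote, urlsplit


def decode_fully(url, max_rounds=5):
    """Percent-decode repeatedly until nothing changes (%2520 -> %20 -> space)."""
    for _ in range(max_rounds):
        decoded = unquote(url)
        if decoded == url:
            break
        url = decoded
    return url


def folder_of(url):
    """Return host plus folder path (no scheme, no file name) of a decoded URL."""
    parts = urlsplit(url)
    folder = parts.path.rsplit("/", 1)[0]
    return parts.netloc + folder


# Hypothetical shortcut with single- and double-encoded spaces.
shortcut = "http://example.com/Some%2520Paper/1922%20%20pdf/page%200001.pdf"
readable = decode_fully(shortcut)
print(readable)
print(f'"search term" site:"{folder_of(readable)}"')
```

Note that repeated decoding would also mangle a path that genuinely contains a literal "%" followed by two hex digits, so it is worth eyeballing the result before using it.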

Last edited by Toastie; 11th Nov 2012 at 12:08. Reason: fix typos

11th Nov 2012 12:12 #5
vhelp

i don't think i trust that website. i had it up in one of my tabs. after a while i started noticing my internet status (task tray) was always on...sending/receiving. i thought i was still d/l'ing, but i wasn't. that went on for the last 10 minutes. as soon as i closed the tab, it stopped. so, i'm just saying, you never know.

11th Nov 2012 12:22 #6
vhelp

use quotation marks ("") when you see spaces in there like that.

i believe the term is urlencoded, when you see things showing as %nn in there. those snippets have to be urldecoded (or cleaned) if you are trying to put together how things work, or else things will get confusing. i wrote a tool to clean up most of this but it is too crude to post here. maybe there is something nicer to download. do a google search for url cleaner or url decode or something like that.

EDIT: oh, i see, you already found it.
see this resource for a detailed explanation: http://en.wikipedia.org/wiki/Percent-encoding

here's a list i put together when i was learning how to scrape google search results, not sure if complete though:

%22 = '"'
%23 = "#"
%25 = "%"
%26 = "&"
%28 = "(" - casual
%29 = ")" - casual
%2C = ","
%2F = "/"
%3A = ":"
%3C = "<"
%3D = "="
%3E = ">"
%3F = "?"
%40 = "@"
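
A list like this can be checked, or regenerated, with a few lines of Python. This is a small sketch using the standard urllib.parse module; the list of codes is copied from the post above, and everything else is illustrative.

```python
from urllib.parse import quote, unquote

# Codes taken from the list above.
codes = ["%22", "%23", "%25", "%26", "%28", "%29", "%2C",
         "%2F", "%3A", "%3C", "%3D", "%3E", "%3F", "%40"]

# unquote() turns each %XX code back into the character it stands for.
table = {code: unquote(code) for code in codes}
for code, char in table.items():
    print(code, "=", char)

# quote() goes the other way; safe="" makes it encode "/" as well.
round_trip_ok = all(quote(char, safe="") == code for code, char in table.items())
print("round trip ok:", round_trip_ok)
```

The same unquote() call works on a whole URL, so there is no need to translate the codes one by one.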

11th Nov 2012 12:26 #7
Toastie

Thanks for spending the time to help out.
