VideoHelp Forum




Results 1 to 13 of 13
  1. VH Wanderer Ai Haibara
    Join Date
    Jan 2006
    Location
    Somewhere on VideoHelp...
    Are there any (hopefully simple) utilities to brute-force strip Javascript from an HTML file - or better yet, a (local) directory with more than one HTML file? I vaguely recall seeing a few such programs a long, long time ago... back in the early 90s, I think.
    But all I can find now are mostly scripts, one shareware webadmin utility that I'm not sure would even do exactly what I want, here, and a single freeware program that - unfortunately - also strips out all HTML code as well as Javascript. I'm looking to keep the HTML as it is, because most HTML-to-text utilities or anything else that strips HTML code seem to lose the formatting as well, most of the time.
    If cameras add ten pounds, why would people want to eat them?
  2. Member zoobie
    Join Date
    Feb 2005
    Location
    Florida
    Just delete it yourself?
    Of course, it's there for functionality...
    It may lose formatting because the stylesheet is mistakenly deleted
  3. VH Wanderer Ai Haibara
    Join Date
    Jan 2006
    Location
    Somewhere on VideoHelp...
    Yeah, I can do that. But it can get rather tedious for a large directory full of HTML pages.

    It's mostly for stripping Javascript ad redirect code (the 'Click Here to Continue' redirect ads) or IntelliText ads from archived webpages, as while my HTML viewer's decent enough, it doesn't allow me to switch off Javascript. I'm not worried about stylesheets, really.
  4. Member AlanHK
    Join Date
    Apr 2006
    Location
    Hong Kong
    Originally Posted by Ai Haibara
    Yeah, I can do that. But it can get rather tedious for a large directory full of HTML pages.

    It's mostly for stripping Javascript ad redirect code (the 'Click Here to Continue' redirect ads) or IntelliText ads from archived webpages, as while my HTML viewer's decent enough, it doesn't allow me to switch off Javascript. I'm not worried about stylesheets, really.
    Are you sure you can't turn off Javascript? It's one click in Opera or Firefox (with Noscript).

    Anyway, a simple way to deactivate is with a text editor.
    For instance, I use Ultraedit.
    I can do "change in files" and select a folder, and it will do a search-and-replace in every file in that folder.

    So you could change
    <script
    to
    <!--
    and
    /script>
    to
    -->

    which would turn all the scripts into invisible comments.

    There are many text-edit utilities that can do similar operations, going back to "sed" the unix stream editor and DOS clones of that.
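
    For anyone without UltraEdit, the same two replacements can be sketched in a few lines of Python. This is only a rough sketch under some assumptions: the function names comment_out_scripts and process_folder are made up for illustration, it only looks at .htm/.html files directly inside the folder, and it rewrites them in place, so try it on a copy first.

```python
import re
from pathlib import Path

# Case-insensitive matches for the two strings being replaced.
OPEN_TAG = re.compile(r"<script", re.IGNORECASE)
CLOSE_TAG = re.compile(r"/script\s*>", re.IGNORECASE)


def comment_out_scripts(html: str) -> str:
    """Replace '<script' with '<!--' and '/script>' with '-->'."""
    html = OPEN_TAG.sub("<!--", html)
    return CLOSE_TAG.sub("-->", html)


def process_folder(folder: str) -> int:
    """Rewrite each .htm/.html file in `folder` in place; return how many changed."""
    changed = 0
    for path in Path(folder).iterdir():
        if not path.is_file() or path.suffix.lower() not in {".htm", ".html"}:
            continue
        # latin-1 maps every byte to one character, so decoding and
        # re-encoding leaves all untouched bytes exactly as they were.
        original = path.read_bytes().decode("latin-1")
        updated = comment_out_scripts(original)
        if updated != original:
            path.write_bytes(updated.encode("latin-1"))
            changed += 1
    return changed
```

    One caveat: many older pages already wrap the script body in <!-- ... --> to hide it from ancient browsers. In those files the first --> ends the comment early, so spot-check a few pages before trusting the result.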
  5. VH Wanderer Ai Haibara
    Join Date
    Jan 2006
    Location
    Somewhere on VideoHelp...
    I'm sure. I'm using a stand-alone smaller HTML viewer, in this case, and it doesn't have the option. (I archive a lot of text, HTML and RTF files (okay, I'll say it - it's fanfiction ), and I didn't want to load a large browser just to view HTML pages in an archive.)

    Darn, I forgot all about search and replace. I've even got a few batch files already set up with an old global-search-and-replace console utility to replace some common SmartQuotes in a file with their ASCII equivalents. I'd still prefer to try removing the Javascript, though. Maybe it'll save some space.
  6. Member AlanHK
    Join Date
    Apr 2006
    Location
    Hong Kong
    Originally Posted by Ai Haibara
    Darn, I forgot all about search and replace. I've even got a few batch files already set up with an old global-search-and-replace console utility to replace some common SmartQuotes in a file with their ASCII equivalents. I'd still prefer to try removing the Javascript, though. Maybe it'll save some space.
    If they're all from the same site, every page will have the same scripts. So you can s&r for the whole script and just delete.

    I'm sure there is a perl utility that could parse the scripts out in a more general way. Look at some perl scripting newsgroups or sites and ask there if you want to get into that; I'm not a perl guru.


    it's fanfiction
    Perhaps then a simpler method: if it's just paragraphs of text with no formatting, you can copy-and-paste to a text file from a browser. There are utilities that do that as a command line (lynx, I think, has an option).
    See http://www.w3.org/Tools/html2things.html

    And here's a utility that does exactly what you want: http://www.jafsoft.com/detagger/remove-markup.html, but it costs $30.
  7. VH Wanderer Ai Haibara
    Join Date
    Jan 2006
    Location
    Somewhere on VideoHelp...
    Originally Posted by AlanHK
    If they're all from the same site, every page will have the same scripts. So you can s&r for the whole script and just delete.
    They're not all from the same site. However, if I were just to do a global search and replace on all files using the tags you mentioned above, that would probably work, of course.

    Originally Posted by AlanHK
    I'm sure there is a perl utility that could parse the scripts out in a more general way. Look at some perl scripting newsgroups or sites and ask there if you want to get into that; I'm not a perl guru.
    Neither am I. In searching, I found a number of Perl and PHP scripts that supposedly do it, but that doesn't help me, much. Now, if it was Python, I could experiment with it a little...
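
    Since Python came up, here is a minimal sketch of actually removing the script blocks rather than commenting them out. It assumes a plain regular expression is good enough for archived pages (it is not a real HTML parser), and the names strip_scripts and strip_tree are made up for illustration. It leaves inline handlers such as onclick attributes and javascript: links alone, and it edits files in place, so work on a copy.

```python
import re
from pathlib import Path

# An opening <script ...> tag, everything up to the nearest closing
# </script>, and the closing tag itself. DOTALL lets "." cross line
# breaks; IGNORECASE also catches <SCRIPT>.
SCRIPT_BLOCK = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)


def strip_scripts(html: str) -> str:
    """Return `html` with every <script>...</script> block removed."""
    return SCRIPT_BLOCK.sub("", html)


def strip_tree(root: str) -> int:
    """Strip scripts from all .htm/.html files under `root`; return how many changed."""
    changed = 0
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.suffix.lower() not in {".htm", ".html"}:
            continue
        # latin-1 round-trips arbitrary bytes, so the rest of the file
        # is written back unchanged whatever its real encoding is.
        text = path.read_bytes().decode("latin-1")
        cleaned = strip_scripts(text)
        if cleaned != text:
            path.write_bytes(cleaned.encode("latin-1"))
            changed += 1
    return changed
```

    Calling strip_tree on the archive folder walks its subfolders too. Because the whole block is deleted, this also recovers the disk space the scripts were taking up.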

    Originally Posted by AlanHK
    Perhaps then a simpler method: if it's just paragraphs of text with no formatting, you can copy-and-paste to a text file from a browser. There are utilities that do that as a command line (lynx, I think, has an option).
    See http://www.w3.org/Tools/html2things.html

    And here's a utility that does exactly what you want: http://www.jafsoft.com/detagger/remove-markup.html, but it costs $30.
    Most of it does have both formatting and styles, which I do want to keep. I've been experimenting with HTML-to-text converters for a while, and many of the ones I've tried end up removing both the formatting and styles (I'm guessing that's most likely because the original pages probably did all their formatting with tags). Some even crashed on the Javascript.
    There were one or two that did what I wanted, which was to keep the italics/bold/etc. by converting them to a 7-bit equivalent... but they were among the ones that also lost the formatting. Maybe I should experiment with those, again. If only I had more knowledge about scripts...
  8. Member Alex_ander
    Join Date
    Oct 2006
    Location
    Russian Federation
    I remember using Advanced Replacer (shareware) by PearlFox 2 or 3 years ago for removing any text between HTML tags in multiple pages at once (that's what you want). Looks like it is not supported now (garbage on former home page). But it still can be googled and downloaded (can't remember whether trial works). Description here:

    http://www.freedownloadscenter.com/Utilities/Text_Search_and_Replace_Tools/Advanced_Replacer.html

    With the script %anything% you can easily remove banners from your pages.
  9. Member AlanHK
    Join Date
    Apr 2006
    Location
    Hong Kong
    Originally Posted by Ai Haibara
    There were one or two that did what I wanted, which was to keep the italics/bold/etc. by converting them to a 7-bit equivalent... but they were among the ones that also lost the formatting. Maybe I should experiment with those, again. If only I had more knowledge about scripts...
    Well, you could do a S&R for <B> and <I> and <P> tags and convert them to something like {B}, {I}, {P}.
    Then run a HTML strip; then convert the {} back to <>. But if they used funky <font..> tags or styles, as likely if any MS app was used to generate them, you'd lose that.
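
    The protect, strip, restore sequence can be sketched in Python as below. It is a rough illustration only: the name strip_all_but_basic is made up, the "HTML strip" step is a crude regex stand-in for whatever stripper you actually use, and any attributes on the kept tags are dropped along the way.

```python
import re

# Tags to protect; everything else is treated as strippable markup.
KEEP = ("b", "i", "p")
_names = "|".join(KEEP)

# <b>, </B>, <p class="x"> ... become {b}, {/B}, {p}. The \b stops
# <br> or <pre> from being mistaken for <b> or <p>.
_PROTECT = re.compile(r"<(/?(?:%s))\b[^>]*>" % _names, re.IGNORECASE)
# Any remaining <...> tag.
_ANY_TAG = re.compile(r"<[^>]+>")
# {b}, {/B}, {p} ... go back to angle brackets.
_RESTORE = re.compile(r"\{(/?(?:%s))\}" % _names, re.IGNORECASE)


def strip_all_but_basic(html: str) -> str:
    """Remove all tags except bare bold, italic and paragraph tags."""
    protected = _PROTECT.sub(r"{\1}", html)
    stripped = _ANY_TAG.sub("", protected)
    return _RESTORE.sub(r"<\1>", stripped)
```

    Text that already contains something like {b} literally would come out as a tag, so a less common placeholder may be safer for some archives.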
  10. Member zoobie
    Join Date
    Feb 2005
    Location
    Florida
    curious to know what HTML "viewer" you're using
    doesn't sound very popular...
  11. VH Wanderer Ai Haibara
    Join Date
    Jan 2006
    Location
    Somewhere on VideoHelp...
    Well, it's slightly obscure... I think. It's the "Universal Viewer"/ATViewer (http://www.uvviewsoft.com/ ), which I believe was primarily written for use with the Total Commander shell, IIRC. It's a decent webpage/RTF/other viewer (well, certainly MUCH better than the ones I had been using). I haven't tried using it for text, though, since I'd already been using WnBrowse for several years.
    (I only use UV/ATV as a simple single-screen viewer, and not as a file-system browser... and was using it before they brought out a separate 'pay' version, so I've only needed the 'free' version.)

    I suppose it could have some deeply buried method for turning off Javascript (it uses the MSIE engine for HTML, though IIRC, you can get a plugin that uses the Gecko engine, instead. Maybe I ought to see if I could turn off Javascript in that...)

    Originally Posted by AlanHK
    Well, you could do a S&R for <B> and <I> and <P> tags and convert them to something like {B}, {I}, {P}.
    Then run a HTML strip; then convert the {} back to <>. But if they used funky <font..> tags or styles, as likely if any MS app was used to generate them, you'd lose that.
    Not to mention the additional tag soup Word adds. :/ Doesn't Word also convert a lot of things to entities, too? (the &nbsp;-type character representations, for anyone else reading this)
  12. Member AlanHK
    Join Date
    Apr 2006
    Location
    Hong Kong
    Originally Posted by Ai Haibara
    Not to mention the additional tag soup Word adds. :/ Doesn't Word also convert a lot of things to entities, too? (the &nbsp;-type character representations, for anyone else reading this)
    I would expect an HTML-to-ASCII converter would handle these correctly.

    (I like the function in Dreamweaver: "Fix Word HTML".)

    You might also look at HTML Tidy http://www.w3.org/People/Raggett/tidy/
    Running this first should simplify and clean up the code considerably.
    Some options:
    -clean, -c: replace FONT, NOBR and CENTER tags by CSS (clean: yes)
    -raw: output values above 127 without conversion to entities
    drop-font-tags: discard <FONT> and <CENTER> tags
    hide-comments * (perhaps this, combined with my previous suggestion of converting script to comment tags, would get rid of them completely)
  13. VH Wanderer Ai Haibara
    Join Date
    Jan 2006
    Location
    Somewhere on VideoHelp...
    Oh, I already have a number of utilities to convert those, with no problem... even a version of Tidy, somewhere. I'm not worried about that; it was more of a side comment as to a few of the things Microsoft apps do to HTML files.

    (I don't remember if FrontPage also does that, though... never really tried using it beyond a couple of brief experiments with the stripped-down version they once 'included' with IE.)


