VideoHelp Forum
  1. Hi, my goal is to run a script that removes the duplicated lines from incremental-style subtitles and merges what remains into a plain-text transcript containing only the subtitle text (no subtitle numbers, timings, formatting commands, or control characters).

    An example to help you help me.

    I want this SRT file with incremental subtitles (two rows of subtitles are shown at all times: the currently spoken subtitle and the previous one for slow readers, similar to the real-time captions you see on TV):

    Code:
    18
    00:00:23,999 --> 00:00:24,009
    we've already discussed research and
    
    19
    00:00:24,009 --> 00:00:26,460
    we've already discussed research and
    prevalence barriers and assessment shoes
    
    20
    00:00:26,460 --> 00:00:26,470
    prevalence barriers and assessment shoes
    
    21
    00:00:26,470 --> 00:00:28,769
    prevalence barriers and assessment shoes
    and now we'll discuss where you can send
    
    22
    00:00:28,769 --> 00:00:28,779
    and now we'll discuss where you can send
    
    23
    00:00:28,779 --> 00:00:31,649
    and now we'll discuss where you can send
    a client what kind of treatment you can
    
    24
    00:00:31,649 --> 00:00:31,659
    a client what kind of treatment you can
    
    25
    00:00:31,659 --> 00:00:35,000
    a client what kind of treatment you can
    use and what approaches are available
    
    26
    00:00:35,000 --> 00:00:35,010
    use and what approaches are available
    
    27
    00:00:35,010 --> 00:00:37,560
    use and what approaches are available
    within the general population
    
    28
    00:00:37,560 --> 00:00:37,570
    within the general population
    
    29
    00:00:37,570 --> 00:00:38,970
    within the general population
    there are many different approaches to
    
    30
    00:00:38,970 --> 00:00:38,980
    there are many different approaches to


    To become this:
    Code:
    we've already discussed research and prevalence barriers and assessment shoes and now we'll discuss where you can send a client what kind of treatment you can use and what approaches are available within the general population there are many different approaches to
    All the text is merged onto one line with no overlapping duplicates. I'd prefer to use command-line tools available on macOS if possible; I use MacPorts and have GNU Core Utilities installed. I can use GUI software, or Windows 10 in a virtual machine, for a one-off quick fix, but I'd like an automated bash script or similar that I can trigger on macOS.

    My scripting abilities are quite rudimentary, even if I dabble from time to time. My first thought is to regex out the subtitle numbers and time codes; that should be easy enough even for me. But how to set up the array, how to compare complete lines against partial lines up to maybe 5 or 6 subtitles forward or back in the array, and then how to concatenate whatever is left is well over my head. I would much appreciate guidance on figuring this out.

    I've attached the sample subtitle as seen in the screenshot for the convenience of anyone who wants to play around with this and help me out.

    Thanks
  2. Member
    Here is a great idea for removing duplicate lines with regex.
    https://www.regular-expressions.info/duplicatelines.html
  3. Originally Posted by JVRaines
    Here is a great idea for removing duplicate lines with regex.
    https://www.regular-expressions.info/duplicatelines.html
    Thanks. Did some work on this today. This is what I've got so far for regex processing the subtitle file:

    Code:
    Regex to delete empty lines:  /((\r\n|\n|\r)$)|(^(\r\n|\n|\r))|^\s*$/gm
    Regex to select the subtitle number:  ^[0-9]+$
    Regex to select timings:  ^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]\,[0-9][0-9][0-9] --> [0-9][0-9]:[0-9][0-9]:[0-9][0-9]\,[0-9][0-9][0-9]$
    Regex to collapse consecutive duplicate lines (replace the match with \1):  ^(.*)(\r?\n\1)+$
    Join all lines: To be completed
    So far I've been testing these in Sublime Text 3. There, the duplicate-line regex just selects all of the lines, but I'm guessing that is fine since I'm not working in a shell.
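
    A rough sketch of how the first three regexes might carry over to the shell, assuming a POSIX-style grep -E and tr (both ship with macOS). The function name strip_srt_markup is made up for illustration, and the patterns are adapted to POSIX extended regex syntax rather than copied verbatim from the list above:

```shell
#!/usr/bin/env bash
# Hypothetical helper (the name is illustrative, not from any tool):
# reads an SRT file on stdin and prints only the caption text lines.
strip_srt_markup() {
  tr -d '\r' |                  # normalise CRLF line endings to LF
    grep -Ev '^[0-9]+$' |       # drop the subtitle numbers
    grep -Ev '^[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} --> [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}$' |  # drop the timing lines
    grep -Ev '^[[:space:]]*$'   # drop empty or whitespace-only lines
}
```

    Usage would be something like strip_srt_markup < sample.srt, where sample.srt is a placeholder file name. One caveat: a caption line consisting only of digits would be dropped along with the subtitle numbers.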

    Now to put these regexes together using command line tools. Ed maybe. Hmm. Will carry on the project tomorrow.
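
    For the last two steps, one possible sketch: once the numbers, timings and blank lines are gone, each caption line in the sample is repeated on consecutive lines, so the backreference regex is doing roughly what uniq does (collapsing adjacent repeats), and paste can join what is left. This assumes the input has already been stripped down to caption text; merge_captions is a hypothetical name:

```shell
#!/usr/bin/env bash
# Hypothetical helper (the name is illustrative): reads caption text lines
# on stdin, collapses adjacent repeats, and prints everything on one line.
merge_captions() {
  uniq |                  # keep one copy of each run of identical adjacent lines
    paste -s -d ' ' -     # join the remaining lines with single spaces
}
```

    Because uniq only compares neighbouring lines, a phrase that is genuinely spoken again later in the video is kept. With the sample from the first post, the stripped caption lines piped through merge_captions should come out as the single line shown there.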
    Last edited by adamlogan; 21st Jun 2018 at 03:37.