the slacker's guide to mass downloading (most of) the internet, automation

22nd Jul 2024 10:32 #1

2nHxWW6GkN1l916N3ayz8HQoi

Feels Good Man

Send a message via ICQ to 2nHxWW6GkN1l916N3ayz8HQoi

Send a message via AIM to 2nHxWW6GkN1l916N3ayz8HQoi

Send a message via MSN to 2nHxWW6GkN1l916N3ayz8HQoi

Send a message via Yahoo to 2nHxWW6GkN1l916N3ayz8HQoi

Hello. If you know Python (version 3+) enough to edit existing scripts but you don't consider yourself an expert (otherwise why would you waste time with guides since you can code everything by yourself) and you fit in at least one of these categories then this guide is meant for you:

- You have a favorite site that you'd like to download a lot of videos at once and regularly, either for future new shows or whatever else. There's no public service available for it and yt-dlp doesn't have an extractor built in for your site (either because it's not popular enough or it uses DRM).

- You want a specific fresh resource from a site. For example: a m3u8 URL with fresh tokens.

- You want other non-video fixed resources mass downloaded. For example, thumbnails, subtitles, etc.

- You want pretty much anything else downloaded at the press of a single key.

First of all, make sure that what you want to be automated can at least be handled manually by you. It's not like you can automate fairplay/playready DRM downloading when they aren't even cracked publicly yet. So have some realistic expectations. The minimum necessary requirement for you to be able to automate a task is for you to be capable of doing it manually. That doesn't mean that all manual tasks can be automated (that easily): captcha problems are one example. However, it's a good starting point to attempt automation. In addition, we're gonna focus only on tasks from websites, so no Android apps or anything else since the workflow is entirely different (I think? No idea since I didn't bother with those).

The proposed guide teaches how to design a service from scratch by making it as minimalist as possible and without being tied to anything. The purpose is to establish a methodology, a set of steps, that can be followed if you want to analyze and create a mass downloader for a specific site/scenario (basically to teach you how to think when it comes to designing one).

I don't know if these exact steps are followed by users who write scripts of this kind, but since until now no one bothered to write the list of steps they take, I thought a guide would be appreciated. Because writing a service is the hardest of all the possible scenarios, if you know how to write one, then you know how to attempt the other mentioned problems as well (fresh m3u8 URLs, etc.). Don't forget, if you ever find the downloader to be lacking certain fancy things (output name formatting, season querying, etc.), you can always add them yourself.

I should point out that if all you're interested in is just the final product and you have no patience reading and doing things by yourself gradually, then this is not the guide for you. If you like learning things then you may continue reading.

That being said, let's start writing a downloader for a random site. For example:
https://www.tv5mondeplus.com

A) Browsing the site

Ok, so you found a site. Great. Now you'll have to browse it and just look at how it's structured.

[Attachment 80781 - Click to enlarge]

Some things can be noticed:
- you don't need an account to watch the content, which is good since it makes your task easier
- the site is structured in movies, TV shows and podcasts
- after hitting play on a video, the small DRM icon in the page URL indicates that the videos use DRM and the podcasts as well (at first glance all of them)
- the current language is tied to the page URL (and the language decides how the content is translated)

So there are individual videos and multiple series of videos. To solve the task we're gonna split it into 2 easier and solvable problems: individual video automation and batch video automation.

B) Individual video automation

Before starting, we need to establish what the script will receive and what its output will be. Since the script deals with videos, we will consider the input to be simple video URLs and the output to be the downloaded content. Similar to how you manually use a download command to obtain a video using any popular tool (yt-dlp/N_m3u8DL-RE/etc), the script is gonna have to do the same thing.

I'll consider N_m3u8DL-RE to be the chosen tool for downloading because we have DRM content. The script will have to find the relevant information that needs to be passed to it. The reason why the input is made up of URLs is to make it easier to automate everything. Because of that we now need to see what kind of URLs send you directly to the video itself. By browsing the site again and hitting play on all types of content we obtain these kinds of URLs (notice the format):

https://www.tv5mondeplus.com/fr/series-et-films-tv/comedie/la-maison-bleue-s-1-e4-le-p...triot-act/play
(URL pointing to the episode of a series)
https://www.tv5mondeplus.com/fr/cinema/policier-et-suspense/goodbye-morocco
(URL pointing to a movie)
https://www.tv5mondeplus.com/fr/podcast/subcategory/dingue-14590722_74079A/play
(URL pointing to a podcast)

Warning, if the previous URLs don't work for you or if at any point you end up with other resource URLs (manifest mpd for example) then that's because tv5mondeplus has content tied to your region. To advance in this tutorial, just pick something that looks like what I posted/described and follow the detailed steps.

Some things can be noticed:
- most video URLs end in "/play"
- only video URLs where the player is full screen by default end in "/play"
- the podcasts have a black screen so most likely only audio is loaded
- the URLs that point to movies aren't full-screen and don't end in "/play", even after you hit the play button

Now that we have some URLs to test, we're gonna start writing the script for only one of them. Create an empty file called "video.py" and also the file "video_urls.txt". Considering the site uses DRM, it is expected that you know already how to obtain decryption keys manually and already possess a CDM in WVD format. If that is not the case, you may start reading the sticky threads (in particular @angela's trilogy) or you can continue reading the guide if you're only interested in understanding the methodology behind automation.

B) 1. Establishing a template script for future use (regardless of DRM site)

Moving on, the URL is
https://www.tv5mondeplus.com/fr/series-et-films-tv/comedie/la-maison-bleue-s-1-e4-le-p...triot-act/play

We're gonna start with a script that gives us keys manually by using the pssh and license request + https://curlconverter.com . You can copy it from GitHub
https://github.com/devine-dl/pywidevine?tab=readme-ov-file#usage
and adapt it for your URL. From now on, any information that is tied to my session will be replaced with "redacted", regardless of whether it contains sensitive information or not. It would be good if you knew how to use an IDE (like PyCharm Community free edition for example), but if you don't have any idea then Notepad++ is also good enough since you'll be mostly copying generated code with few lines written by you in between.
Code:
from pywidevine.cdm import Cdm
from pywidevine.device import Device
from pywidevine.pssh import PSSH
import requests

pssh = PSSH("AAAAXHBzc2gAAAAA7e+LqXnWSs6jyCfc1R0h7QAAADwIARIQNGGGJyYlRZ6Ta6prMepl5hoIdXNwLWNlbmMiGE5HR0dKeVlsUlo2VGE2cHJNZXBsNWc9PSoAMgA=")
device = Device.load("device_wvd_file.wvd")
cdm = Cdm.from_device(device)
session_id = cdm.open()
challenge = cdm.get_license_challenge(session_id, pssh)
params = {
    'contentId': '106935860_74079A',
    'keyId': '34618627-2625-459e-936b-aa6b31ea65e6',
    'ls_session': 'ey...REDACTED',
}
data = challenge
licence = requests.post(
    'https://rbm-tv5monde.live.ott.irdeto.com/licenseServer/widevine/v1/rbm-tv5monde/license',
    params=params,
    data=data,
)
licence.raise_for_status()
cdm.parse_license(session_id, licence.content)

for key in cdm.get_keys(session_id):
    print(f"[{key.type}] {key.kid.hex}:{key.key.hex()}")
cdm.close(session_id)
When we run it using:
Code:
python video.py
We get the output:
Code:
[SIGNING] 00000000000000000000000000000000:b0f94a4d42f03747752aaf69f0a87854a29e3b6a4bf07ecb86540e22d95095ede62a91bca73fe1bb46616c7109a074d82e667b2c83609ac48385e156dcf6a651
[CONTENT] 346186272625459e936baa6b31ea65e6:6fa1de620815a211d21cf3b31fc20030
Which is good since that's the key. Now we're gonna make the following changes to this basic script:
- create separate variables for some of the unknowns: manifest, pssh, and video title (for the moment you can put any random name you want)
- instead of printing the keys, we're gonna generate the N_m3u8DL-RE command (with a list of predefined parameters that start the download directly) and launch it as if it was from the terminal (it's worth noting that if you want to append 2 texts in Python, you simply add them using the + symbol, and if something is not a str object just wrap it in str)
- the code responsible for extracting the keys is gonna be moved to a separate function called "get_download_command" that will receive a single parameter "source_url"
- that function is then gonna be applied to a list of source URLs extracted from the txt file "video_urls.txt" (for the moment write the single input URL to that txt file)
- add some informative prints that display the progress and any errors encountered (try/except is useful when displaying errors)

You can use https://text-compare.com to compare previous script stages to see what was added and where (it's best to read the code from end to beginning because of the strategy we use). That being said, with these modifications, the script becomes:
Code:
from pywidevine.cdm import Cdm
from pywidevine.device import Device
from pywidevine.pssh import PSSH
import requests
import subprocess

def get_download_command(source_url):
    manifest = "https://vod.tv5mondeplus.com/tv5monde/tv5mondeplus/assets/106935860_74079A/materials/YsJdQr6JU4_74079A/vod-idx-6.ism/.mpd"
    pssh = "AAAAXHBzc2gAAAAA7e+LqXnWSs6jyCfc1R0h7QAAADwIARIQNGGGJyYlRZ6Ta6prMepl5hoIdXNwLWNlbmMiGE5HR0dKeVlsUlo2VGE2cHJNZXBsNWc9PSoAMgA="
    video_title = "video"

    pssh = PSSH(pssh)
    device = Device.load("device_wvd_file.wvd")
    cdm = Cdm.from_device(device)
    session_id = cdm.open()
    challenge = cdm.get_license_challenge(session_id, pssh)
    params = {
        'contentId': '106935860_74079A',
        'keyId': '34618627-2625-459e-936b-aa6b31ea65e6',
        'ls_session': 'ey...REDACTED',
    }
    data = challenge
    licence = requests.post(
        'https://rbm-tv5monde.live.ott.irdeto.com/licenseServer/widevine/v1/rbm-tv5monde/license',
        params=params,
        data=data,
    )
    licence.raise_for_status()
    cdm.parse_license(session_id, licence.content)

    keys = ""
    for key in cdm.get_keys(session_id):
        if key.type == "CONTENT":
            keys += " --key " + key.kid.hex + ":" + key.key.hex()
    cdm.close(session_id)
    download_command = 'N_m3u8DL-RE "' + manifest + '"'+ keys + ' -ss all -sv best -sa best --no-log -mt --save-name "' + video_title + '" -M format=mkv'
    return download_command

with open('video_urls.txt', 'r') as file:
    source_urls = file.read().splitlines()

for source_url in source_urls:
    source_url = source_url.strip()
    try:
        command = get_download_command(source_url)
        print("Video done: ", source_url, " Download Command: ", command)
        print("----")

        subprocess.run(command, shell=True)
    except Exception as e:
        print("Failed to get: " + source_url + ". Reason: " + str(e))
And the content of "video_urls.txt" is:
Code:
https://www.tv5mondeplus.com/fr/series-et-films-tv/comedie/la-maison-bleue-s-1-e4-le-patriot-act/play
This can be considered a template. The only thing that changes is the input URL and the license call request which needs to be updated since it's hardcoded. But starting from this template you can fully automate any site that uses Widevine DRM. Now when you run the script, the command is launched directly in the terminal and downloads the video. Which means the key we got is also valid.

[Attachment 80784 - Click to enlarge]

B) 2. Extending the script

For now on, you can comment (add the # prefix) the line containing "subprocess.run". It's be gonna be left like that until we solve the task since we don't want to download the video every time we're testing something. One of the previous script changes was to separate the unknown variables. I consider a variable to be an unknown if it's a value pulled out of thin air and one that's not tied (yet) to one of the following:
- the source URL
- a fixed request that's the same regardless of scenario: this comes into play when you're trying to generate tokens that may or may not be tied to an account

Since these resources are obtained in the browser requests (otherwise the video wouldn't load), then that means there's a chain of requests that tie the source URL to the final resource of interest. The task is to find that chain just by knowing what the input and output are. So far the unknown variables are:
- the manifest
- the pssh
- the video title: this may seem an optional variable since you can just put a randomly generated name for mass downloading so you don't have conflicting names, but I wanted to at least add a name that can help you distinguish what you downloaded
- the license URL and its parameters: the license URL can be made a fixed variable as a separated URL however you still need to find its fresh parameters since the session parameter can expire and you'll get "401 Unauthorized"

Why would you bother with such a thing when you can leave the variables as they are? Because:
- the script may work only for that content since everything is fixed
- the script may not work after a certain time because some variables can expire
You're trying to code something usable (by you and others as well, regardless of region). Even if it may work if you leave some variables like they currently are, it's best to justify any magic value because it's safer in the long run.

One of the first changes you can make is getting rid of redundant variables. As we all know, the pssh and manifest are connected. After downloading the manifest and doing a simple Ctrl+F search "cenc:pssh" you will find the pssh value. That means you can extract the pssh just by knowing the manifest URL. With the right question, you can even ask ChatGPT to help you write code (not only for this but for pretty much any small task where you need help).

[Attachment 80793 - Click to enlarge]

I'll include in the next script a modified version of that response that should work on any manifest that contains the pssh directly in its content. So the manifest and pssh problem got reduced to simply knowing the manifest URL. Since the video title is not that important, it will be ignored until it's found by accident (you'll see what I mean later). Even if it's not found you can always adapt it from the source URL directly (by getting rid of the http prefix, keeping the last part of the URL path, etc.).

Now is the time to figure out how to get the license URL fresh parameters and that's where things get interesting and where an established methodology might be helpful. If you know how to do it for a case like this then you'll also learn how to get fresh m3u8 tokens, fresh livestreams, etc. For the sake of consistency, it is recommended for you to use Firefox (at least for this guide) since that's what I'll use.

Open Firefox on an incognito window. The reason is that we want to "force" the site to do all the necessary requests from scratch. If you access a site multiple times, it might save in cache/cookies/whatever some variables. And if they're saved then some requests may be skipped when the page is loading. It is in our interest to catch ALL the possible relevant requests since we have to analyze them.

After you open a new page incognito, open a tab to a static image (the Google logo is good enough). We want to inspect the network requests before loading the page and we don't need any unrelated previous requests (and also some sites keep spamming background requests, especially for tracking you). Make sure "Persist Logs" is not checked. Some might say that opening a tab on a static image is overkill since you're not persisting the logs, but I think it's good starting practice since in some cases you'll want the logs persisted.

[Attachment 80794 - Click to enlarge]

Hit enter on that URL and you'll see a lot of requests appearing. Wait for the video to load and at least 2-3 seconds of content to be played successfully (the actual video, not the site logo). Then pause it and export ALL of the requests as a HAR collection which you'll name "video.har" (make sure you have nothing filtered in the requests). That file can be opened later in Notepad++ and contains mostly plaintext.

[Attachment 80795 - Click to enlarge]
(the video is black screen because in Firefox you can't screenshot DRM content directly)

As I said previously, some sites keep spamming requests in the background, and this can slow down your session when you inspect them. Since you obtained the HAR file then you're pretty much done with that video. That file contains all the information you need to know about downloading that video. Some of it might be expired, but for now, you don't need to download the video, you only need to learn how to by analyzing the chain of requests. If you ever need a fresh request to continue the chain, you can get a new one by capturing the network requests again and focusing only on what you need.

You can close the tab and open a fresh one on a static image tab (incognito or not, it doesn't matter now) where no new requests are spammed. Inspect network requests and import the previously downloaded HAR file either by dragging and dropping it or clicking the wheel button and then importing.

[Attachment 80796 - Click to enlarge]

You can keep the imported HAR window and the HAR file opened in Notepad++ side by side to switch between them more easily. Find the license request in the HAR monitor by filtering the requests. Since the URL can be fixed, we'll need only the fresh parameters. The license requests are done by POST with 3 parameters: contentId, keyId, and ls_session. Pick the one that may have a more distinct value (it will help you in finding it faster).

Since ls_session can expire, we're gonna copy that value. You don't need to copy it fully, 30-40 characters is good to make it distinct enough. The value I'm gonna copy is eyJ0eXAiOiJKV1QiLCJraWQiOiI1OT (for you it might be different). Now leave the HAR browser and go to the HAR Notepad. Click on the first line of the text file (that's important and should be done every time you're searching for something new since it's gonna show you the first result of your search). Now look for your copied value.

[Attachment 80797 - Click to enlarge]

It finds the value on line 11650 (again, for you it will be different). Then you have to scroll to the left side. You can see the response is in the "content" subsection which is good to know. You can find the resources you're looking for also passed as headers, both are valid ways and can be accessed in Python. Then you scroll up until you reach the section that contains this resource. In this case, it's "response" which is what you need to find.

If you found instead "request", or worse, you haven't even found anything at all in the first place, then that's very bad unless you find yourself in one of these acceptable scenarios:
- What you're looking for is a URL or something else that can be separated into smaller, searchable, components
For example for a URL like .../path1/path2/... you can split it by using "/" as a delimiter and look for the relevant components by searching each one separately.
- What you're looking for is a GraphQL hash value. You can completely ignore those values.

However, if you aren't in any of the previous cases...
[[Skip this until you know what to do in a good scenario. You can come back if you encounter this problem]]
It means that what you were looking for is either:
a) Encoded as base64. Take the URL of the first request that you found in HAR Notepad and search it in the HAR browser. Then start checking one by one (starting from the bottom and going upwards) all of the previous requests that took place before what you found. Take their JSON response, format it, and check all the base64 responses that you find there by decoding them using
https://www.base64decode.org

Some responses may contain base64 within base64 so multiple decodings are needed. I tried writing a script for this that receives a HAR file as input and finds + decodes + replaces all of the base64 strings in a recursive way until nothing is found anymore but I had no luck. If someone knows a fast way to achieve this instead of going request by request manually then feel free to leave a message (if you're willing to share the knowledge of course).

Warning: if the base64 string you're trying to decode contains the character "." in it, then divide the string into parts by using "." as a delimiter and decode each part separately. You can't decode the entire base64 value correctly if it contains "." . It may work sometimes but that's just luck.

b) Already obtained/generated somehow and sent as a request. That means there's a small piece of code that creates that variable somewhere. You can start debugging the javascript code and see what kind of stuff is happening in the page code. However, this is very tricky depending on the context.

-----
Regardless of the reason, you can continue searching for your value until you find it inside a "response" but if you didn't find it like that in your first search, the chances of success are diminishing. As I said previously, each resource can be obtained through a chain of requests, so the origin of any resource needs to belong to a "response".

-----
Another bad scenario (unrelated to the previous cases) might happen when you're finding too many places for what you searched and none of them are relevant. For example, if you try to search for something like "x" then obviously the text is too short and not distinct enough. You can try looking for another variable or make the previous one more distinct by changing it.

[[You can stop skipping]]

In our case, we had no problems, so if we scroll even further up we find the section "request" which contains the URL that obtained what we searched: "https://api.tv5mondeplus.com/v2/customer/TV5MONDE/businessunit...". Now you right-click and copy the link. Then you go back to the HAR browser and search for the link you copied. If you find nothing then:
- decode the link by going to https://www.urldecoder.org and search the new value
- or copy only a part of that URL that's not encoded weirdly and is distinct enough and search again

[Attachment 80799 - Click to enlarge]

If you find yourself with multiple requests as a result, then go to each one and check the response content/headers and see if it contains what you looked for (don't forget to enable the raw content). Then when you're sure you found the right request, copy it as a posix curl. Then you go to https://curlconverter.com and copy the generated Python code (don't bother with the imports).

[Attachment 80801 - Click to enlarge]

In your current script, you can add the exit(0) command right before your current unknown variables are declared. This is useful because you simply test what is new, instead of running the entire script, including what you already know works. Your task is now dropping that generated code and trying to connect it with as many unknown variables as possible.

So the partially modified script should look like this:
Code:
... imports ...

def get_download_command(source_url):
    ... dropped curl converter request ...
    print(response.json())
    exit(0)
    
    ... manifest and other unknowns declared ...
    ... rest of the method ...
... rest of the script ...
Now you need to reduce the headers and also the params. Keep the minimum amount that gets you the response. Most headers and some parameters are irrelevant. The fewer headers/parameters you have, the fewer new possible unknown variables are being added to your list of tasks. There are some rare sites and scenarios where a request may work temporarily with reduced headers, but the completed chain fails. This happens usually when tokens come into play.

Some tokens may not offer you the same privileges compared to others when you edit some headers. But that's rare and it can be ignored for now. If you ever stumble on this issue just go back to each request in your chain, put all of the original headers back, and see what you can reduce when you have the full chain.

In this case, only the authorization matters for the headers. The parameters are:
Code:
{
   "ifa":"REDACTED",
   "ifaType":"sessionid",
   "deviceType":"desktop",
   "width":"REDACTED",
   "height":"REDACTED",
   "pageUrl":"https://www.tv5mondeplus.com/fr/series-et-films-tv/comedie/la-maison-bleue-s-1-e4-le-patriot-act/play",
   "domain":"www.tv5mondeplus.com",
   "mute":"false",
   "autoplay":"true",
   "supportedFormats":"dash,hls,mss,mp3",
   "supportedDrms":"widevine"
}
Judging by their names, only the last 2 should be relevant (and maybe deviceType?). You can remove 1 by 1 and run the script until something crashes. In this case, even the last 2 are optional, but removing Widevine will get you a lot of garbage, like Playready and other DRM providers which are useless. So the last 2 can remain.

Another trick to reduce the variables is replacing some values with random garbage. For example if "ifa" was needed which is some kind of session ID, then you could try running the script with its value set to "ifa_random_value_12345". Most sites (and I really mean "most") check for the existence of specific variables/headers/etc, but they don't validate them (which is hilarious). If this trick works then it means the variable isn't tied to an account or content or anything else. Which is good since you can ignore it and leave it fixed.

There are multiple ways to reduce these variables and you can do it however you like. I'm just choosing a way to continue the guide since I'm only interested in getting something working. That being said, after running the script with your reduced headers, you'll get a response. You have to copy it and format it to be able to read it easier. You could take the response directly from the browser request and format it, but I prefer taking it from the script. Because you edited headers/parameters, you might get a slightly different response.

For formatting you can use:
- https://jsonformatter.org
Only for JSON.
- https://jsonformatter.curiousconcept.com
For both JSON and Python dictionaries.

If one fails, try the other. I normally use the output from "curiousconcept" and copy it to "formatter.org" due to preference. You can choose what you think looks better for you. The visual difference between a JSON and a dictionary is the syntax:
Code:
#JSON
{"value1": null, "value2": "value"}
#Python dictionary - notice the quotes as well
{"value1": None, 'value2': 'value'}
[Attachment 80802 - Click to enlarge]

Now you can see what the API response is after you used only the authorization and the 2 parameters. A simple search gives you the ls_session which you used originally as a reference point. And as a bonus you also find it appended to the license URL. So you don't even need to hardcode the license (even if you could). And that's a good thing since there are sites that are changing from time to time their presumably constant license URLs.

So how do you access that license URL? First you add the line "response = response.json()" which will get you a dictionary. As you saw in that formatted JSON, there are only 2 data structures:
- the [ ] which means a list
To access the content of a list L = [11, 22, 33, 44, 55] you write L[index], the index starting from zero. For example, if I wanted the first element, 11, I'd write L[0]. The second element, 22, L[1]. And so on. To go over a list and print all the values you write:
Code:
for element in L:
    print(element)
- the { } which means a dictionary, key => value
To access the content of a dictionary D = {"k1": "v1", "k2": "v2", "k3": "v3"} you write D["key"]. For example, if I want the value of "k2" I'd write D["k2"]. To print all the elements of a dictionary, you write:
Code:
for k, v in D.items():
    print(k, v)
In addition to these 2 data structures if you understand what continue/break does, then that's almost all you need when it comes to connecting 2 curlconverter dropped requests. To understand the difference run this code:
Code:
#code1 free
for e in [1, 2, 3, 4, 5]:
    print(e)

print("----")

#code2 break
for e in [1, 2, 3, 4, 5]:
    if e == 3:
        break
    print(e)

print("----")

#code3 continue
for e in [1, 2, 3, 4, 5]:
    if e == 3:
        continue
    print(e)
Knowing this, the license URL can be obtained using:
Code:
response["formats"][0]["drm"]["com.widevine.alpha"]["licenseServerUrl"]
So the license URL and its parameters problem was solved. But we're not gonna stop here. Let's reach the maximum potential for each newly obtained request in the current chain. After all, the fewer requests we have, the faster the final script is. After a simple search, you can find the other unknown variable, manifest.
Code:
{
  ...
  "formats": [
    {
      "drm": { ... },
      "format": "DASH",
      "mediaLocator": "https://vod.tv5mondeplus.com/tv5monde/tv5mondeplus/assets/106935860_74079A/materials/YsJdQr6JU4_74079A/vod-idx-6.ism/.mpd"
    }
  ]
  ...
}
The only unsolved unknown remains "video_title". You can't find anything else relevant in the current JSON. That's not a problem. Now let's see what new unknowns got introduced. The authorization header is obviously one of them. The 2 parameters can be left hardcoded. Let's take a look at the request URL (without parameters):
https://api.tv5mondeplus.com/v2/customer/TV5MONDE/businessunit/TV5MONDEplus/entitlemen...60_74079A/play

It looks like a normal URL except for this part "106935860_74079A" which looks like randomized garbage. That means a new unknown was introduced here and has to be taken care of. You can name this variable "play_id". That doesn't mean that you should be doubtful of any numbers found in a URL. For example, v2 is clearly the API version so it depends on the context. You can make the API version to be a separate variable and obtain that as well, but I prefer leaving it in the URL. If the version of the API changes, then its response will most likely change as well and the script will crash regardless.

From 4 variables, we solved 3 and got 2 new ones. That means we are at 3 now. It doesn't matter if you end up with more or fewer variables. What matters is extending the chain of requests to bring you closer to the source. The current script is:
Code:
from pywidevine.cdm import Cdm
from pywidevine.device import Device
from pywidevine.pssh import PSSH
import requests
import subprocess
import re

def get_download_command(source_url):
    bearer = "Zrz...REDACTED"
    play_id = "106935860_74079A"
    video_title = "video"

    headers = {
        'Authorization': 'Bearer ' + bearer,
    }

    params = {
        'supportedFormats': 'dash,hls,mss,mp3',
        'supportedDrms': 'widevine',
    }

    response = requests.get(
        'https://api.tv5mondeplus.com/v2/customer/TV5MONDE/businessunit/TV5MONDEplus/entitlement/' + play_id + '/play',
        params=params,
        headers=headers,
    )
    response = response.json()
    
    license_url = response["formats"][0]["drm"]["com.widevine.alpha"]["licenseServerUrl"]
    manifest = response["formats"][0]["mediaLocator"]
    pattern = re.compile(r'<[^<>]*cenc:pssh[^<>]*>(.*?)</[^<>]*cenc:pssh[^<>]*>')
    pssh = pattern.findall(requests.get(manifest).text)
    pssh = sorted(pssh, key=lambda p: len(p))[0]
    
    pssh = PSSH(pssh)
    device = Device.load("device_wvd_file.wvd")
    cdm = Cdm.from_device(device)
    session_id = cdm.open()
    challenge = cdm.get_license_challenge(session_id, pssh)
    data = challenge
    licence = requests.post(
        license_url,
        data=data,
    )
    licence.raise_for_status()
    cdm.parse_license(session_id, licence.content)

    keys = ""
    for key in cdm.get_keys(session_id):
        if key.type == "CONTENT":
            keys += " --key " + key.kid.hex + ":" + key.key.hex()
    cdm.close(session_id)
    download_command = 'N_m3u8DL-RE "' + manifest + '"'+ keys + ' -ss all -sv best -sa best --no-log -mt --save-name "' + video_title + '" -M format=mkv'
    return download_command

with open('video_urls.txt', 'r') as file:
    source_urls = file.read().splitlines()

for source_url in source_urls:
    source_url = source_url.strip()
    try:
        command = get_download_command(source_url)
        print("Video done: ", source_url, " Download Command: ", command)
        print("----")

        #subprocess.run(command, shell=True)
    except Exception as e:
        print("Failed to get: " + source_url + ". Reason: " + str(e))
B) 3. Continuing extending the script for new sets of variables

We're gonna continue with play_id and repeat the steps we've done previously. A simple search in the HAR Notepad gives us all the relevant information quickly (around line ~8326).
Code:
{
        "request": {
          "bodySize": 239,
          "method": "POST",
          "url": "https://www.tv5mondeplus.com/api/graphql/v1/",
          ...
        },
        "response": {
          "status": 200,
          ...
          "content": {
            "mimeType": "application/json",
            "size": 5003,
            "text": "{\"data\": {\"lookupContent\": {\"id\": \"106935860_74079A:fr\", ...
A quick HAR browser search gives us:

[Attachment 80803 - Click to enlarge]

After generating and copying the code from curlconverter, it turns out that you can completely remove the cookies and headers, leaving only the json data. Copying the output response to json formatter, the only relevant information is:
Code:
{
  "data": {
    "lookupContent": {
      "id": "106935860_74079A:fr",
      "title": "Le Patriot Act",
      ...
      "episodeNumber": 4,
      ...
    }
  }
}
That means we have both the play_id and video_title solved. We can at least use the episode index for the series episodes even if the season index is missing. It's helpful if you're downloading large collections of videos because they're at least sorted. To get the play_id you can split the string from the "id" value using the .split() method. To convert any string to a valid filename you can use slugify
https://stackoverflow.com/questions/295135/turn-a-string-into-a-valid-filename#answer-29942164

You can use Stackoverflow (not only ChatGPT) for any Python questions. The variable that remained unsolved is the bearer token. Let's see if any new unknowns were introduced. The json of the curlconverter request is:
Code:
json_data = {
    'operationName': 'VODContentDetails',
    'variables': {
        'contentId': 'redbee:la-maison-bleue-s-1-e4-le-patriot-act:fr',
    },
    'extensions': {
        'persistedQuery': {
            'version': 1,
            'sha256Hash': 'e396131572f170605ea7a9f13139568323ab9e398a350741952e690be38efb30',
        },
    },
}
Version, like many other things related to it, can be left as it is. However, the hash value at first glance looks like something that needs to be made a variable. To prevent you from wasting time, that's not the case in this scenario. A quick search in the HAR notepad finds the sha value directly made as a request so that means it was generated. That's not a problem in this case because all hash values regarding GraphQL are fixed so you can ignore them.

"contentId" on the other hand is another story. It's made of 3 substrings:
- redbee: this one can be left as it is since it's first found as part of a request URL in the HAR Notepad
- la-maison-bleue-s-1-e4-le-patriot-act: obviously the content itself
- fr: the language of the content.

Let's see how "fr" affects the response by switching it to "en". We now get 2 different responses:
Code:
#fr
... 'title': 'Le Patriot Act', ...
#en
... 'title': 'The Patriot Act', ...
So the language is relevant to the chain of requests (at least for the video title). We can simply leave it hardcoded to French, but if you're giving a URL to the script that contains /en/ in its content, then you would expect an English title, not French. We're gonna make it a separate variable that is tied to the source URL. That means we have 3 variables now: the bearer token and the 2 "contentId" substrings.

A simple search in the HAR Notepad++ for "la-maison-bleue-s-1-e4-le-patriot-act" returns the URL
https://www.tv5mondeplus.com/fr/series-et-films-tv/comedie/la-maison-bleue-s-1-e4-le-p...triot-act/play

which is the input. And that URL also contains the language /fr/. The URL can be split using the delimiter "/". So the new 2 variables got solved instantly and all that remains is the bearer. Just to be sure that video_title has a valid value, I'm gonna use part of the input URL if something goes wrong. A search in the HAR Notepad for a part of the bearer token gives us this information:
Code:
{
        "request": {
          "bodySize": 188,
          "method": "POST",
          "url": "https://api.tv5mondeplus.com/v2/customer/TV5MONDE/businessunit/TV5MONDEplus/auth/anonymous",
          ...
        },
        "response": {
          "status": 200,
          ...
          "content": {
            "mimeType": "application/json",
            "size": 285,
            "text": "{\n  \"sessionToken\" : \"Zrz...REDACTED...
Take the anonymous URL and search it in the HAR browser. In my case, it resulted in 2 requests and only the second one had the bearer token I was looking for. Copy the curl and use curlconverter. Copy the generated code and paste it into the script. The headers are useless so they can be removed completely. The variable json_data contains:
Code:
json_data = {
    'device': {
        'deviceId': 'REDACTED',
        'width': REDACTED,
        'height': REDACTED,
        'type': 'WEB',
        'name': 'REDACTED',
    },
    'deviceId': 'REDACTED',
}
If you remove "device" it's gonna throw the error "device must not be null" so we're gonna leave it an empty JSON. Removing "deviceId" throws the same error but since we don't want to leave magic unexplained values in the code, we're gonna change its value to "deviceId" which works and proves they don't even validate what they receive. With all these changes, the script becomes:
Code:
from pywidevine.cdm import Cdm
from pywidevine.device import Device
from pywidevine.pssh import PSSH
import requests
import subprocess
import re
from slugify import slugify

def get_download_command(source_url):
    video_slug = source_url.split("/")[6]
    language = source_url.split("/")[3]

    json_data = {
        'device': {
        },
        'deviceId': 'deviceId',
    }

    response = requests.post(
        'https://api.tv5mondeplus.com/v2/customer/TV5MONDE/businessunit/TV5MONDEplus/auth/anonymous',
        json=json_data,
    )
    response = response.json()
    bearer = response["sessionToken"]
    
    json_data = {
        'operationName': 'VODContentDetails',
        'variables': {
            'contentId': 'redbee:' + video_slug + ':' + language,
        },
        'extensions': {
            'persistedQuery': {
                'version': 1,
                'sha256Hash': 'e396131572f170605ea7a9f13139568323ab9e398a350741952e690be38efb30',
            },
        },
    }

    response = requests.post(
        'https://www.tv5mondeplus.com/api/graphql/v1/',
        json=json_data
    )
    response = response.json()

    play_id = response["data"]["lookupContent"]["id"]
    play_id = play_id.split(":")[0]
    video_title = response["data"]["lookupContent"]["title"]
    video_title = slugify(video_title)
    if video_title == "":
        video_title = slugify(video_slug)
    episode_index = str(response["data"]["lookupContent"]["episodeNumber"])
    video_title = episode_index + "-" + video_title
    
    headers = {
        'Authorization': 'Bearer ' + bearer,
    }

    params = {
        'supportedFormats': 'dash,hls,mss,mp3',
        'supportedDrms': 'widevine',
    }

    response = requests.get(
        'https://api.tv5mondeplus.com/v2/customer/TV5MONDE/businessunit/TV5MONDEplus/entitlement/' + play_id + '/play',
        params=params,
        headers=headers,
    )
    response = response.json()
    
    license_url = response["formats"][0]["drm"]["com.widevine.alpha"]["licenseServerUrl"]
    manifest = response["formats"][0]["mediaLocator"]
    pattern = re.compile(r'<[^<>]*cenc:pssh[^<>]*>(.*?)</[^<>]*cenc:pssh[^<>]*>')
    pssh = pattern.findall(requests.get(manifest).text)
    pssh = sorted(pssh, key=lambda p: len(p))[0]
    
    pssh = PSSH(pssh)
    device = Device.load("device_wvd_file.wvd")
    cdm = Cdm.from_device(device)
    session_id = cdm.open()
    challenge = cdm.get_license_challenge(session_id, pssh)
    data = challenge
    licence = requests.post(
        license_url,
        data=data,
    )
    licence.raise_for_status()
    cdm.parse_license(session_id, licence.content)

    keys = ""
    for key in cdm.get_keys(session_id):
        if key.type == "CONTENT":
            keys += " --key " + key.kid.hex + ":" + key.key.hex()
    cdm.close(session_id)
    download_command = 'N_m3u8DL-RE "' + manifest + '"'+ keys + ' -ss all -sv best -sa best --no-log -mt --save-name "' + video_title + '" -M format=mkv'
    return download_command

with open('video_urls.txt', 'r') as file:
    source_urls = file.read().splitlines()

for source_url in source_urls:
    source_url = source_url.strip()
    try:
        command = get_download_command(source_url)
        print("Video done: ", source_url, " Download Command: ", command)
        print("----")

        #subprocess.run(command, shell=True)
    except Exception as e:
        print("Failed to get: " + source_url + ". Reason: " + str(e))
After testing the generated command, the video plays well so everything is good.

You can read the second part of the guide here.

Last edited by 2nHxWW6GkN1l916N3ayz8HQoi; 22nd Jul 2024 at 10:56.

--[----->+<]>.++++++++++++.---.--------.
[*drm mass downloader: widefrog*]~~~~~~~~~~~[*how to make your own mass downloader: guide*]

Quote

22nd Jul 2024 10:33 #2

2nHxWW6GkN1l916N3ayz8HQoi

Feels Good Man

B) 4. Testing, fixing, and improving the script

You just finished the script but it's not the end. After all, you only tested it for a single URL. Even if the chain of requests is complete, it's still possible you missed something. To test and find errors faster, you should temporarily add the instruction "raise e" in the "except" block to stop the script and print the problem code line.

At the start of section B) there were 3 URLs mentioned and we only tested the first one. Replace the content "video_urls.txt" with the second URL only.
Code:
https://www.tv5mondeplus.com/fr/cinema/policier-et-suspense/goodbye-morocco
You'll get the error:

episode_index = str(response["data"]["lookupContent"]["episodeNumber"])
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
KeyError: 'episodeNumber'

The problem is that "episodeNumber" is not available as a key in that dictionary. To check more, you can add: print(response) and exit(0) before that line problem and just look at the content of the response. After formatting it, you can't find "episodeNumber" anywhere on that JSON. Which makes sense since we're trying to download a movie, not an episode. The fix is to make use of the .get() method which can specify a default value in case the key isn't found. I'll consider 1 to be the default value. Now the script works and only one URL remains to be tested.
Code:
https://www.tv5mondeplus.com/fr/podcast/subcategory/dingue-14590722_74079A/play
You'll get the error:

license_url = response["formats"][0]["drm"]["com.widevine.alpha"]["licenseServerUrl"]
~~~~~~~~~~~~~~~~~~~~~~^^^^^^^
KeyError: 'drm'

After formatting the response you can see no license URL is present in it. That means the podcast is most likely not using DRM. A quick look at the manifest shows that it's an mp3 file that's not even encrypted. Therefore we can't even use N_m3u8DL-RE to directly download. You can force it by making a temporary m3u8 file that points to that mp3 URL but it's overkill. Instead, we're just going to generate a yt-dlp download command. To check if the license is present just put that existing code in a small try/except block and if it fails then set it to "None". And if there's no license, then there's no point in trying to get pssh/decryption keys.

Now the script works for this podcast as well and also with all 3 URLs in the text file all at once. I tested it for other URLs and nothing crashed so I'll stop here with the testing (you can get rid now of the temporary "raise e" line). I advise you after finishing a script to test it for at least 5-10 random URLs. You never know what scenario you missed. One last problem that remains to be changed is about optimization. When you found the request that generated the bearer token, that request wasn't tied to the source URL at all. Instead, it uses a fixed endpoint.

So does it make sense to be running that code for all URLs from that text file? That bearer token is only used to access their API and isn't tied to the content at all. A better solution would be to run it only once and use it for all URLs. Which is what we're gonna do. That being said, the new and improved script becomes:
Code:
from pywidevine.cdm import Cdm
from pywidevine.device import Device
from pywidevine.pssh import PSSH
import requests
import subprocess
import re
from slugify import slugify

json_data = {
    'device': {
    },
    'deviceId': 'deviceId',
}

response = requests.post(
    'https://api.tv5mondeplus.com/v2/customer/TV5MONDE/businessunit/TV5MONDEplus/auth/anonymous',
    json=json_data,
)
response = response.json()
bearer = response["sessionToken"]

def get_download_command(source_url):
    video_slug = source_url.split("/")[6]
    language = source_url.split("/")[3]
    
    json_data = {
        'operationName': 'VODContentDetails',
        'variables': {
            'contentId': 'redbee:' + video_slug + ':' + language,
        },
        'extensions': {
            'persistedQuery': {
                'version': 1,
                'sha256Hash': 'e396131572f170605ea7a9f13139568323ab9e398a350741952e690be38efb30',
            },
        },
    }

    response = requests.post(
        'https://www.tv5mondeplus.com/api/graphql/v1/',
        json=json_data
    )
    response = response.json()

    play_id = response["data"]["lookupContent"]["id"]
    play_id = play_id.split(":")[0]
    video_title = response["data"]["lookupContent"]["title"]
    video_title = slugify(video_title)
    if video_title == "":
        video_title = slugify(video_slug)
    episode_index = str(response["data"]["lookupContent"].get("episodeNumber", 1))
    video_title = episode_index + "-" + video_title
    
    headers = {
        'Authorization': 'Bearer ' + bearer,
    }

    params = {
        'supportedFormats': 'dash,hls,mss,mp3',
        'supportedDrms': 'widevine',
    }

    response = requests.get(
        'https://api.tv5mondeplus.com/v2/customer/TV5MONDE/businessunit/TV5MONDEplus/entitlement/' + play_id + '/play',
        params=params,
        headers=headers,
    )
    response = response.json()
    
    try:
        license_url = response["formats"][0]["drm"]["com.widevine.alpha"]["licenseServerUrl"]
    except:
        license_url = None
    manifest = response["formats"][0]["mediaLocator"]
    
    if license_url is None:
        return 'yt-dlp "' + manifest + '" -o "' + video_title + '.%(ext)s"'
    
    pattern = re.compile(r'<[^<>]*cenc:pssh[^<>]*>(.*?)</[^<>]*cenc:pssh[^<>]*>')
    pssh = pattern.findall(requests.get(manifest).text)
    pssh = sorted(pssh, key=lambda p: len(p))[0]
    
    pssh = PSSH(pssh)
    device = Device.load("device_wvd_file.wvd")
    cdm = Cdm.from_device(device)
    session_id = cdm.open()
    challenge = cdm.get_license_challenge(session_id, pssh)
    data = challenge
    licence = requests.post(
        license_url,
        data=data,
    )
    licence.raise_for_status()
    cdm.parse_license(session_id, licence.content)

    keys = ""
    for key in cdm.get_keys(session_id):
        if key.type == "CONTENT":
            keys += " --key " + key.kid.hex + ":" + key.key.hex()
    cdm.close(session_id)
    download_command = 'N_m3u8DL-RE "' + manifest + '"'+ keys + ' -ss all -sv best -sa best --no-log -mt --save-name "' + video_title + '" -M format=mkv'
    return download_command

with open('video_urls.txt', 'r') as file:
    source_urls = file.read().splitlines()

for source_url in source_urls:
    source_url = source_url.strip()
    try:
        command = get_download_command(source_url)
        print("Video done: ", source_url, " Download Command: ", command)
        print("----")

        #subprocess.run(command, shell=True)
    except Exception as e:
        print("Failed to get: " + source_url + ". Reason: " + str(e))
And that's pretty much it about individual video automation.

C) Batch video automation

By following the same steps, we have to decide what the input/output is. Consider this scenario, you have a script that only downloads individual videos and you want to download an entire series. What do you do? You go to the series URL, take all the episode URLs, and use the previous script on all of those episodes. That means the batch script is gonna have to receive a series URL and give back a list of episode URLs. The downloading isn't an important part here since it's already solved.

This entire part can be skipped if you could use a browser addon like:
https://addons.mozilla.org/en-US/firefox/addon/link-gopher
However, I specifically chose this site since the addon doesn't work for seasons/episodes (no idea why). And even if it did, you should still know how to do it without any additional tools.

The input of the script is going to be made up of URLs that contain many other video URLs. By using the 3 previous URLs, their corresponding URLs are:
https://www.tv5mondeplus.com/fr/series-et-films-tv/comedie/la-maison-bleue
(URL pointing to an entire series divided into seasons and episodes)
https://www.tv5mondeplus.com/fr/cinema/policier-et-suspense/goodbye-morocco
(URL pointing to a movie)
https://www.tv5mondeplus.com/fr/podcast/subcategory/dingue
(URL pointing to an entire podcast series divided into episodes)

By browsing the site a little more and comparing the series page content, some things can be noticed:
- the seasons of a series don't necessarily start from 1
- the seasons aren't always sorted
- a podcast series has only episodes
- a movie URL always contains only 1 video

To improve the flexibility of the script, I decided to use the same URL for a movie as a collection, and also as an individual video. That's because when you press play on that content you stay on that page and also because when you grab a list of random shows from the homepage of the site, it wouldn't be okay if you had to manually check the type of that show (movie/series). So the batch script is just gonna return a list of length 1 that only contains that movie URL and the downloading part is gonna be handled separately.

The script can now be started and only the first URL is gonna be used for now. Create the file "series.py" and the file "series_urls.txt".

C) 1. Establishing a template script for future use (regardless of site)

To speed things up, we could take the template script that was created in the section B)1. A few changes will be made:
- we're gonna use "video_urls.txt" as the output file where all the extracted episode URLs will be written
- the output file is gonna contain all of the URLs from all of the series URLs from the input file
- the output file will be cleared whenever the script starts to make sure only the current episode URLs are written there
- as information, the number of extracted URLs is enough
- "get_download_command" will be renamed to "get_series" and will keep the same parameter
- if the template had the license request hardcoded, we're gonna remove it now since we don't care about downloading and instead return a fixed list of length 1 where the only element is a random episode from the series URL
- since no commands are launched you can get rid of the subprocess import and also of the pywidevine imports since no decryption happens

To add an element to a list you can use the append method. To combine two lists you can add them using the symbol "+". The content of "series_urls.txt" is:
Code:
https://www.tv5mondeplus.com/fr/series-et-films-tv/comedie/la-maison-bleue
We're gonna pick a random episode that's not displayed by default. If season 1 is always displayed when you click on that link, then go to season 2 and choose an episode. I'm gonna pick
https://www.tv5mondeplus.com/fr/series-et-films-tv/comedie/la-maison-bleue-s-2-e5-hero...-national/play
And that will be the list returned. The template script becomes:
Code:
import requests

def get_series(source_url):
    episodes = []
    
    episodes.append("https://www.tv5mondeplus.com/fr/series-et-films-tv/comedie/la-maison-bleue-s-2-e5-heros-national/play")
    return episodes

with open("video_urls.txt", "w") as file:
    pass
with open('series_urls.txt', 'r') as file:
    source_urls = file.read().splitlines()

for source_url in source_urls:
    source_url = source_url.strip()
    try:
        series = get_series(source_url)
        with open("video_urls.txt", "a") as file:
            for url in series:
                file.write(url + "\n")

        print("Series done: ", source_url, " Obtained: ", len(series), " URLs")
        print("-----")
    except Exception as e:
        print("Failed to get series: " + source_url + ". Reason: " + str(e))
The only thing that will change is the input URL and the random episode you picked. By using the template, you can modify it for any site when it comes to extracting lists of URLs. Now after running you'll get the episode URL saved in "video_urls.txt" and you can see how many URLs were extracted.

[Attachment 80838 - Click to enlarge]

C) 2. Extending the script

It may seem weird that you put only one episode URL to be returned when the series has a lot more but that's irrelevant. If you only focus on getting that URL, you'll also know how to obtain the other remaining URLs since they're all found in the same place. Similar to how you previously created a chain of requests that connected the original license request from curlconverter with the source URL, you're gonna have to do the same thing now. Only now, you're connecting a list of URLs with a URL.

Open Firefox on an incognito window, open a static image, open network requests, make sure "Persist Logs" is disabled, and load the series URL. Since the episode we picked was from season 2, we're gonna focus on it. Important: wait for the page to load completely, change to season 2, scroll down slowly until you reach the last episode, and make sure everything loads fully. You can now stop capturing the requests and you can export the HAR file with the name "series.har".

[Attachment 80846 - Click to enlarge]

The reason why we focus on a season that wasn't by default shown on the page is because some parts of the content can be found on the HTML page and sometimes in API responses. By picking a season that wasn't default, we forced the site to use API responses. I generally prefer to build a chain of requests that don't depend on HTML. Of course that isn't always the case but such cases are rare.

Side by side, import that HAR file in your browser (on a static image tab with empty network requests) and also on Notepad++. We're gonna start searching in the HAR Notepad for the episode URL we picked. Nothing is found. Let's try removing redundant information from that URL. The URL becomes:
Code:
series-et-films-tv/comedie/la-maison-bleue-s-2-e5-heros-national
I removed the site name, language, and the play since those can be added later. A search in Notepad still brings nothing. That means we should divide the string into 3 parts using "/" as a delimiter. However, before doing this let's change our strategy by comparing the series URL with the episode URL.
Code:
https://www.tv5mondeplus.com/fr/series-et-films-tv/comedie/la-maison-bleue
https://www.tv5mondeplus.com/fr/series-et-films-tv/comedie/la-maison-bleue-s-2-e5-heros-national/play
They're almost identical except for the last part.
Code:
la-maison-bleue
la-maison-bleue-s-2-e5-heros-national
That means if we found in Notepad "la-maison-bleue-s-2-e5-heros-national" then we could build the episode URL by using the series URL. The "/play" part is irrelevant since it's appended to all videos (as we saw when we browsed the site). After searching for this, we find:
Code:
{
        "request": {
          ...
          "url": "https://api.tv5mondeplus.com/v1/customer/TV5MONDE/businessunit/TV5MONDEplus/content/asset/la-maison-bleue...
         ...
        }
        "response": {
          "status": 200,
          ...
          "content": {
            ...
            "text": ... \"slugs\" : [ \"la-maison-bleue-s-2-e5-heros-national\"
This is a good thing since we can now start building the chain of requests. After you search in the HAR browser with that request URL, you should get only 1 request which you then use on curlconverter. That site is a great tool for generating code however it does have its limitations as you can see in this image.

[Attachment 80857 - Click to enlarge]

In a happy scenario, you would get dictionaries for both headers and parameters. The reason why you want dictionaries is that you can easily edit them and only keep what is truly necessary. In this case, you would have to edit the URL and keep it in the same format which is kinda tiresome. Luckily ChatGPT can help you.

[Attachment 80858 - Click to enlarge]

After combining the curlconverter code with the revised ChatGPT one, you can add it to your script right before that append method. You can also add an exit(0) to stop running the code after the dropped request. It's time to reduce the headers/params and to see if we can connect it to what we need. The headers are completely useless and can be deleted. From the parameters only "allowedCountry" seems useless and we're gonna let the others because they describe the content we want.

After formatting the JSON response the only relevant information seems to be this:
Code:
{
  ...
  "seasons": [
    ...
    {
      ...
      "episodes": [
        ...
        {
          ...
          "slugs": [
            "la-maison-bleue-s-2-e5-heros-national"
          ],
          ...
        },
        ...
      ],
      ...
    }
  ],
  ...
}
So not only are we getting all of the episodes for a specific season, but we are also getting all of the seasons. This is surprising since I expected the response to be paginated and also page/limit parameters to be needed. The choice can be justified if all of their shows have few episodes/seasons but if they didn’t even think about this, then it's a bad design choice. Either way, it makes our task easier.

Why does pagination matter? Well consider a random forum and remove the pagination. Whenever you browse the pages of a forum, ALL of their pages are also loaded in the background each time you change pages. It is a terrible choice that can slow down the site's performance. I know this is an exaggerated scenario but it makes it easier to understand why pagination is important.

So how do you generate the episode URL? Well, you can take the source URL, split it using "/" as a delimiter, replace the last element with the episode slug, append play for good measure, and join it back into a URL by using the .join() method. You can access the last element of a list by using the index -1.

You successfully connected the response to the episodes we wanted. Let's see if new unknowns got introduced. Since the headers are gone and the parameters are fixed, that leaves us with the endpoint URL.
Code:
https://api.tv5mondeplus.com/v1/customer/TV5MONDE/businessunit/TV5MONDEplus/content/asset/la-maison-bleue
The part "la-maison-bleue" is the only thing that should be taken care of. And just by looking at the source URL you can already find it at its end. So you're done building a chain of requests from the list of episodes to the source URL. The script becomes:
Code:
import requests

def get_series(source_url):
    series_slug = source_url.split("/")[-1]
    episodes = []
    
    params = {
        'fieldSet': 'ALL',
        'types': 'MOVIE,TV_SHOW,PODCAST',
        'onlyPublished': 'true',
        'includeEpisodes': 'true',
        'client': 'json'
    }

    response = requests.get(
        'https://api.tv5mondeplus.com/v1/customer/TV5MONDE/businessunit/TV5MONDEplus/content/asset/' + series_slug,
        params=params,
    )
    response = response.json()
    
    for season in response["seasons"]:
        for episode in season["episodes"]:
            episode_url = source_url.split("/")
            episode_url[-1] = episode["slugs"][0]
            episode_url.append("play")
            episode_url = "/".join(episode_url)
            
            episodes.append(episode_url)
    
    return episodes

with open("video_urls.txt", "w") as file:
    pass
with open('series_urls.txt', 'r') as file:
    source_urls = file.read().splitlines()

for source_url in source_urls:
    source_url = source_url.strip()
    try:
        series = get_series(source_url)
        with open("video_urls.txt", "a") as file:
            for url in series:
                file.write(url + "\n")

        print("Series done: ", source_url, " Obtained: ", len(series), " URLs")
        print("-----")
    except Exception as e:
        print("Failed to get series: " + source_url + ". Reason: " + str(e))
After running the script, it's gonna generate 20 URLs for the entire series. Just to be safe, click on 2-3 random links and see if you're getting redirected to the correct episode.

C) 3. Testing, fixing, and improving the script

Since the movie URL is a special case, we're gonna ignore it for now. Instead, we're going to test the podcast URL, so replace the content of "series_urls.txt".
Code:
https://www.tv5mondeplus.com/fr/podcast/subcategory/dingue
Don't forget to add the instruction "raise e", just to be sure we get a more descriptive error message. Running the script gives you zero URLs extracted. Weirdly, nothing was extracted and no error was thrown. Let's investigate by printing the response before generating the episode URLs. Go to the podcast series page and pick a random episode. I'll pick:
https://www.tv5mondeplus.com/fr/podcast/subcategory/dingue-14336781_74079A/play

Now let's search for "dingue-14336781_74079A" in the formatted JSON response. Nothing is found. That's kinda bad. It means the podcast series has a different chain of requests that's separated from the usual TV shows. The only good thing is that it returns an empty list because we can edit and fix the script by extending it.
Code:
... chain of requests ...
if result is valid then return result
... otherwise new chain of requests ...
Go to the podcast series and export a HAR file called "podcasts.har". Import it into your browser and open it with Notepad++ as well (close the old ones since we won't use them anymore). Let's build the chain for podcasts. That random podcast episode can be found in the HAR Notepad with this information:
Code:
{
        "request": {
          "bodySize": 217,
          "method": "POST",
          "url": "https://www.tv5mondeplus.com/api/graphql/v1/",
          ...
        },
        "response": {
          "status": 200,
          ...
          "content": {
            "mimeType": "application/json",
            "size": 222565,
            "text": "{\"data\": ... {\"id\": \"dingue-14336781_74079A:fr\", ...
I found four GraphQL requests in the HAR browser but you can find the right one easily by looking at the raw response and checking if it has the right episode. Using curlconverter with that request will give you cookies, headers, and json_data. After checking and reducing what was possible, the cookies and headers are completely useless. Before seeing what new unknowns were introduced by json_data, let's see how we can connect the response with what we wanted: podcast episode URLs.

After formatting the JSON, the relevant information is:
Code:
{
  "data": {
    "lookupContent": {
      ...
      "episodes": {
        "items": [
          ...
          {
            "id": "dingue-14775861_74079A:fr",
            ...
          },
          ...
        ],
        ...
      },
      ...
    }
  }
}
That means the episodes can now be accessed and generated in the same way as the normal TV shows. Let's see what unknowns got introduced now by going backward:
Code:
{
    'operationName': 'VODContentEpisodes',
    'variables': {
        'contentId': 'PODCAST:dingue_74079A:fr',
    },
    'extensions': {
        'persistedQuery': {
            'version': 1,
            'sha256Hash': '3ef37000cba42e64e4f2505fa4fa48d42f84e4335039c4e82f5ca24c11db0676',
        },
    },
}
This is similar to the "video.py" script. The hash can be ignored and "PODCAST" as well since it's the content's type. Because the script will only generate URLs, the language is irrelevant. But since it's not that hard to put it into a variable and take the value from the source URL, I'll separate it. Then only "dingue_74079A" remains which is another variable. After searching it on "podcasts.har" Notepad, you'll get this:
Code:
{
        "request": {
          "bodySize": 0,
          "method": "GET",
          "url": "https://www.tv5mondeplus.com/fr/podcast/subcategory/dingue",
          ...
        },
        "response": {
          "status": 407,
          ...
          "content": {
            "mimeType": "text/html; charset=utf-8",
            "size": 28098,
            "text": "<!DOCTYPE html>... image\" content=\"https://assets.tv5mondeplus.com/imagescaler002/tv5monde/tv5mondeplus/assets/dingue_74079A/posters/...
This is a valid case from which we can start building the requests, especially since it can be found directly in the HTML of the source URL. Up until now we always went with the first requests in the Notepad search and we can do that now as well. However, I'd like to show that you can have multiple different chains of requests that can solve the same task (plus I don't like parsing HTML responses since the solution isn't that elegant when compared to API responses).

Let's hit enter and search for another request in Notepad (just focus on the index of the current line and see if it changes, you can have the same value multiple times in the current line). The next request we find is still HTML content so we're going to ignore it. The next one is in a JSON:
Code:
      {
        "request": {
          "bodySize": 0,
          "method": "GET",
          "url": "https://www.tv5mondeplus.com/_next/data/20240711090914/fr/podcast/subcategory/dingue.json?categoryId=podcast&subcategoryId=subcategory&assetId=dingue",
          ...
        },
        "response": {
          "status": 200,
          ...
          "content": {
            "mimeType": "application/json",
            "size": 16850,
            "text": "{\"pageProps\": ... \"image\":{\"url\":\"https://assets.tv5mondeplus.com/imagescaler002/tv5monde/tv5mondeplus/assets/dingue_74079A/posters/...
Even though this is a JSON, I'm going to ignore this one as well. Just to entertain for a second this idea and to show another solution, if you were to go down this route then you'd have to obtain a request for "20240711090914" (at least!). Searching that value will bring you to:
Code:
      {
        "request": {
          "bodySize": 0,
          "method": "GET",
          "url": "https://www.tv5mondeplus.com/fr/podcast/subcategory/dingue",
          ...
        },
        "response": {
          "status": 407,
          ...
          "content": {
            "mimeType": "text/html; charset=utf-8",
            "size": 28098,
            "text": "<!DOCTYPE html>...<script src=\"/_next/static/20240711090914/_buildManifest.js\" ...
And that's the source URL again and yet another possible chain of requests. There are many solutions (and that's a good thing since we can choose something easy to code).

Moving back to the "dingue_74079A" request (the one with the image path), hitting enter again will throw you to a different line with a different request so that means in the previous JSON you could only find "dingue_74079A" as part of a resource image path. I'd like to find a JSON where "dingue_74079A" is in its own specific place (like how we found everything until now). The current result is perfect:
Code:
      {
        "request": {
          "bodySize": 208,
          "method": "POST",
          "url": "https://www.tv5mondeplus.com/api/graphql/v1/",
          ...
        },
        "response": {
          "status": 200,
          ...
          "content": {
            "mimeType": "application/json",
            "size": 2628,
            "text": "{\"data\": {\"lookupContent\": {\"id\": \"dingue_74079A:fr\", ...
From the list of filtered GraphQL requests, the first one has "dingue_74079A" in its raw response, so we're gonna copy it to curlconverter. After cleaning the generated request, the cookies and headers are useless. Let's see what unknowns are added:
Code:
{
    'operationName': 'VODContentDetails',
    'variables': {
        'contentId': 'redbee:dingue:fr',
    },
    'extensions': {
        'persistedQuery': {
            'version': 1,
            'sha256Hash': 'e396131572f170605ea7a9f13139568323ab9e398a350741952e690be38efb30',
        },
    },
}
The hash and redbee are fixed. The language is solved already. So all that remains is "dingue" which is solved instantly since it's the last part of the source URL. After running the modified "series.py" script, it extracts all of the 39 podcast episodes. Finally, let's test it for the special movie case.
Code:
https://www.tv5mondeplus.com/fr/cinema/policier-et-suspense/goodbye-morocco
It returns zero URLs which is a good thing. All it takes now is checking if at the end we have valid output, and if not then just handle the special case. In addition, we can put some code in try/except to handle the special case there as well to make sure unexpected behavior doesn't happen. We could've tested if a URL was a movie by comparing some sections from the URL. However, I thought it was easier to just handle it as a separate case while using the API responses.

The script should now work for TV shows, podcasts, and movies. I tested it for other 10 random URLs and you should do the same. Luckily, I managed to catch a rare scenario with this random show:
https://www.tv5mondeplus.com/fr/environnement/nature-et-animaux/vies-de-chiens

The script throws the error:

episode_url[-1] = episode["slugs"][0]
~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range

So let's put that line in a try/except block and print the problem episode while adding "raise" to stop the script fully.
Code:
...
try:
    episode_url[-1] = episode["slugs"][0]
except:
    print(episode)
    raise
...
Take the printed dictionary and format it. The relevant information seems to be:
Code:
{
  "assetId": "107316121_74079A",
  ...
  "episode": "1",
  ...
  "season": "2",
  ...
  "slugs": [],
  ...
}
The problem makes sense now since the episode doesn't even have a slug assigned. Let's check how it looks in the browser since we know both the season and episode number. That episode's URL is:
Code:
https://www.tv5mondeplus.com/fr/environnement/nature-et-animaux/107316121_74079A/play
Let's see what a normal episode URL should look like by picking a random episode from the first season.
Code:
https://www.tv5mondeplus.com/fr/environnement/nature-et-animaux/vies-de-chiens-s-1-e1-liberte/play
There's a clear difference between "107316121_74079A" and "vies-de-chiens-s-1-e1-liberte". If you go back to the formatted JSON you can clearly see that "107316121_74079A" is the asset ID. Which is perfect since we can fix it now. If the slug can't be extracted, just go for the asset ID. After making this change, it works even for this rare scenario. To be sure, I tested it for another 5 URLs and it went well.

Finally, after getting rid of the "raise" instructions, the script becomes:
Code:
import requests

def get_series(source_url):
    series_slug = source_url.split("/")[-1]
    episodes = []
    
    params = {
        'fieldSet': 'ALL',
        'types': 'MOVIE,TV_SHOW,PODCAST',
        'onlyPublished': 'true',
        'includeEpisodes': 'true',
        'client': 'json'
    }

    response = requests.get(
        'https://api.tv5mondeplus.com/v1/customer/TV5MONDE/businessunit/TV5MONDEplus/content/asset/' + series_slug,
        params=params,
    )
    response = response.json()
    
    for season in response["seasons"]:
        for episode in season["episodes"]:
            episode_url = source_url.split("/")
            try:
                episode_url[-1] = episode["slugs"][0]
            except:
                episode_url[-1] = episode["assetId"]
            episode_url.append("play")
            episode_url = "/".join(episode_url)
            
            episodes.append(episode_url)
    
    if len(episodes) > 0:
        return episodes
    
    language = source_url.split("/")[3]
    series_slug = source_url.split("/")[-1]
    
    json_data = {
        'operationName': 'VODContentDetails',
        'variables': {
            'contentId': 'redbee:' + series_slug + ':' + language,
        },
        'extensions': {
            'persistedQuery': {
                'version': 1,
                'sha256Hash': 'e396131572f170605ea7a9f13139568323ab9e398a350741952e690be38efb30',
            },
        },
    }

    response = requests.post(
        'https://www.tv5mondeplus.com/api/graphql/v1/', 
        json=json_data
    )
    response = response.json()
    
    content_id = response["data"]["lookupContent"]["id"].split(":")[0]
    
    json_data = {
        'operationName': 'VODContentEpisodes',
        'variables': {
            'contentId': 'PODCAST:' + content_id + ':' + language,
        },
        'extensions': {
            'persistedQuery': {
                'version': 1,
                'sha256Hash': '3ef37000cba42e64e4f2505fa4fa48d42f84e4335039c4e82f5ca24c11db0676',
            },
        },
    }

    response = requests.post(
        'https://www.tv5mondeplus.com/api/graphql/v1/', 
        json=json_data
    )
    response = response.json()
    
    try:
        contents = response["data"]["lookupContent"]["episodes"]["items"]
    except:
        return [source_url]
    
    for episode in contents:
        episode_url = source_url.split("/")
        episode_url[-1] = episode["id"].split(":")[0]
        episode_url.append("play")
        episode_url = "/".join(episode_url)
            
        episodes.append(episode_url)
    
    if len(episodes) == 0:
        return [source_url]
    return episodes

with open("video_urls.txt", "w") as file:
    pass
with open('series_urls.txt', 'r') as file:
    source_urls = file.read().splitlines()

for source_url in source_urls:
    source_url = source_url.strip()
    try:
        series = get_series(source_url)
        with open("video_urls.txt", "a") as file:
            for url in series:
                file.write(url + "\n")

        print("Series done: ", source_url, " Obtained: ", len(series), " URLs")
        print("-----")
    except Exception as e:
        print("Failed to get series: " + source_url + ". Reason: " + str(e))
And that was all about writing the script "series.py".

D) Wrapping up the mass downloader

You solved 2 tasks:
- individual video automation
- batch video automation

The question remains, how do you combine them? Well if you ran the first script using "python video.py" and the second script using "python series.py" then you combine them both using:
Code:
python series.py & python video.py
That's all you need. The first command prepares the input for the second command. Let's put it into practice by extending it even further. If you want to just grab randomly all of the shows from the homepage of the site:
https://www.tv5mondeplus.com

you could use a browser addon that just grabs all of the links from that page. I managed to get around ~260 URLs. A good part of them will fail since they aren't pointing to shows/podcasts/movies, but the script won't stop running. To check the final downloaded content, I'll just take a random TV show and download it fully. Don't forget to remove the subprocess.run comment to also launch the download command.

https://www.tv5mondeplus.com/en/langue-francaise/litterature/lettres-de-quebec-l-autre-visite

[Attachment 80878 - Click to enlarge]

Everything seems fine and works as intended.

E) Conclusion

Some questions that you may have after reading this guide:
- What if the pssh wasn't found in the manifest?
Then you get the pssh from the init fragment. If you want a more general solution that can be adapted for any case, you could check @angela's hell noob pack code. You could try and adapt parts of that code to fit into your script (@angela if you don't agree with this recommendation, just leave a message and I'll delete this paragraph).

- What if the seasons/episodes were paginated?
Then you do something like this pseudocode:
Code:
set limit = N
set page = 1
infinite loop:
    content = get(page, limit)
    if content is empty then break the loop
    otherwise increment page and process the content
It is worth noting that if you browse sites that have a lot of episodes/seasons, you should focus on obtaining the HAR file for a case like that. It's important since the requests that are in that HAR also contain necessary information about how to pass parameters like "page" and "limit". In our case, they didn't even bother paginating the response.

- What if you need an account?
Well, an account is only useful for giving you the necessary token to start the chain of requests. In our case, that bearer token was obtained by using a public endpoint "anonymous". If an account is needed to obtain that starting token, you can try 2 methods:
a) Use the browser cookies and see what you can find there. Check if something resembles the token you need. A great Python package that does the job is:
https://github.com/borisbabic/browser_cookie3

b) If the cookies are useless, you need to replicate the login flow of the site. Do not attempt Gmail/Facebook/Apple/etc login flows since those are very complex. Instead, try the classic email/password that any site should offer. Their login flow is a lot easier when compared to the other popular ones.

How to replicate the login flow? The exact same way you replicated the chain of requests. Begin with the end and reach the start. Export the HAR file for the login. Important: if the site redirects you through many pages when logging in then enable "Persist Logs".

Some other things I need to mention:
- This guide doesn't teach you how to code Devine services (for that read their documentation). However, if you know how to make one without it being tied to any existing core downloader, you can adapt it for anything.
- If, for some reason, I missed something regarding this site and the script fails for specific content, then leave a message and I'll take a look. The site has content tied to your region and there may be cases I missed. However, if months passed and the script stopped working, I’m not gonna change anything since the guide's purpose wasn't to give a downloader for this site, but rather teach you how to make one for anything.

- By following the steps I recommended (starting from the end and building your code), you will never be confused about how to make a script. During this guide, I never searched for the right request like one searches for a needle in a haystack. And there were hundreds of requests available in the network tab. Obviously, if a site uses base64 for everything then you're back to looking just in the network tab, request by request. But if someone manages to solve the HAR base64 script problem, then that'd be a game changer.

- All those steps might be obvious to some and while the usual answers of "learn Python" and "open the network requests" are good (and necessary), I just felt they weren't enough. Also, the site may seem easy as well. However, it was a good starting point. From my experience, the difference between an easy site and a complex one is the length of the chain and also how many chains are there (like branches of a tree). The basic idea remains the same.
- The code may seem terribly written but it was left like this on purpose. I wanted to showcase what one could do just with generated code by using the right tools and strategy.
- If you encounter a case where you don't know how to advance: paginated responses, login flow, etc. then I may extend the guide for that specific case.

That being said, I hope this guide is helpful and helps at least a few of you who tried writing downloaders with no success before.

--[----->+<]>.++++++++++++.---.--------.
[*drm mass downloader: widefrog*]~~~~~~~~~~~[*how to make your own mass downloader: guide*]

Quote

22nd Jul 2024 14:15 #3

Karoolus

Search, Learn, Download!

Very nice !!

Just a remark:

Your code to extract PSSH using regex will also extract PlayReady PSSHs and you only get the right one because of comparing the length. While technically this works, it's not ideal in terms of clean programming.
Code:
import requests
from xml.etree import ElementTree as ET

def get_pssh_from_mpd(url):
    response = requests.get(url)
    mpd = response.content
    doc = ET.fromstring(mpd)
    ns = {'mpd': 'urn:mpeg:dash:schema:mpd:2011',
          'cenc': 'urn:mpeg:cenc:2013'}
    content_protections = doc.findall(".//mpd:ContentProtection", namespaces=ns)
    for cp in content_protections:
        scheme_id_uri = cp.get('schemeIdUri')
        if scheme_id_uri and 'edef8ba9-79d6-4ace-a3c8-27dcd51d21ed' in scheme_id_uri.lower():
            pssh_element = cp.find('cenc:pssh', namespaces=ns)
            if pssh_element is not None:
                return pssh_element.text
    return ''
This will parse the MPD as XML and look for PSSHs that have the actual widevine scheme id, meaning there will be no false positives.

Other than that (even though it seems like a very easy website indeed) it looks good!

Those who know, need not ask for help

Quote

22nd Jul 2024 14:34 #4

pssh

Member

Thank you for such a extensive guide

If I was in politics I make sure you drink plenty of beer
and watch plenty of TV to keep you busy. | Data is the new oil.

Quote

22nd Jul 2024 16:42 #5

stevepen1974

Member

Appreciate this, makes understanding much easier.

Quote

22nd Jul 2024 18:40 #6

imr_saleh

Member

precious guide, nice work

Quote

23rd Jul 2024 04:12 #7

2nHxWW6GkN1l916N3ayz8HQoi

Feels Good Man

Originally Posted by Karoolus

Your code to extract PSSH using regex will also extract PlayReady PSSHs

Yep. That's more of a little "hack" that I enjoy using since it's the fastest way to grab the existing pssh and didn't fail me ever. Your XML solution is the objectively correct one and thanks for the snippet code

Originally Posted by Karoolus

Other than that (even though it seems like a very easy website indeed) it looks good!

Thanks. I had to make some compromises for the tutorial length. If I'd have gone for something like vrt.be then the tutorial would have been 3 or 4 times longer (and it gets repetitive/boring after a point). And the only thing that differs is how many requests are connected since vrt doesn't hide anything in base64, so the strategy is identical.

By the way, I'm curious how would you approach that base64 problem. If you had a HAR with a lot of requests, how would you efficiently find all base64 values and decode them? I tried with regex but you get a lot of selections that shouldn't be base64. Maybe I'm missing an easy solution.

Originally Posted by pssh

Thank you for such a extensive guide

No problem. It was interesting to see how much of a script could be just generated code.

Originally Posted by stevepen1974

Appreciate this, makes understanding much easier.

Glad to hear that!

Originally Posted by imr_saleh

precious guide, nice work

Thanks @saleh!

--[----->+<]>.++++++++++++.---.--------.
[*drm mass downloader: widefrog*]~~~~~~~~~~~[*how to make your own mass downloader: guide*]

Quote

23rd Jul 2024 05:52 #8

A_n_g_e_l_a

Member

Originally Posted by Karoolus

Your code to extract PSSH using regex will also extract PlayReady PSSHs and you only get the right one because of comparing the length. While technically this works, it's not ideal in terms of clean programming.

This will parse the MPD as XML and look for PSSHs that have the actual widevine scheme id, meaning there will be no false positives.
!

allhell3.py in my Noobs Starter Kit uses that almost identical method for extracting pssh but I found it wasn't enough and needed to add a fallback. You could use regex to capture a string starting <cenc: pssh>AAAA - that is only ever the widevine ppsh. I chose to regex the Default-KID and calculate pssh as my fallback - that helped with sites that didn't publish a pssh in their mpd.

Well done OP!

Noob Starter Pack. Just download every Widevine mpd! Not kidding!.
https://files.videohelp.com/u/301890/hellyes6.zip

Quote

23rd Jul 2024 14:15 #9

Karoolus

Search, Learn, Download!

Originally Posted by 2nHxWW6GkN1l916N3ayz8HQoi

Yep. That's more of a little "hack" that I enjoy using since it's the fastest way to grab the existing pssh and didn't fail me ever. Your XML solution is the objectively correct one and thanks for the snippet code

Hey, if it works, it works. I was just pointing towards a different solution.

Originally Posted by 2nHxWW6GkN1l916N3ayz8HQoi

Thanks. I had to make some compromises for the tutorial length. If I'd have gone for something like vrt.be then the tutorial would have been 3 or 4 times longer (and it gets repetitive/boring after a point). And the only thing that differs is how many requests are connected since vrt doesn't hide anything in base64, so the strategy is identical.

By the way, I'm curious how would you approach that base64 problem. If you had a HAR with a lot of requests, how would you efficiently find all base64 values and decode them? I tried with regex but you get a lot of selections that shouldn't be base64. Maybe I'm missing an easy solution.

Can you give an example of a website that returns b64? I'd need to have a look because atm I'm not exactly sure what you mean.

Originally Posted by A_n_g_e_l_a

allhell3.py in my Noobs Starter Kit uses that almost identical method for extracting pssh but I found it wasn't enough and needed to add a fallback. You could use regex to capture a string starting <cenc: pssh>AAAA - that is only ever the widevine ppsh. I chose to regex the Default-KID and calculate pssh as my fallback - that helped with sites that didn't publish a pssh in their mpd.

Ah yes, in my own downloader I first check for a PSSH in the MPD, if it's not found I go straight to init fragment and get PSSH from there. Never had an issue that way

Those who know, need not ask for help

Quote

24th Jul 2024 01:48 #10

2nHxWW6GkN1l916N3ayz8HQoi

Feels Good Man

Originally Posted by A_n_g_e_l_a

Well done OP!

Thanks @angela. Your guides inspired me to make one since in the long run, it's most helpful when compared to scripts that break after time without constant maintenance.

Originally Posted by Karoolus

Hey, if it works, it works. I was just pointing towards a different solution.

Yep, and I appreciate it.

Originally Posted by Karoolus

Can you give an example of a website that returns b64? I'd need to have a look because atm I'm not exactly sure what you mean.

Well, this is one of my most favorites to crack
https://watch.globaltv.com

You can just pick a random DRM video, for example
https://watch.globaltv.com/series/411617347723/episode/GLOB0056162170000000/?action=play

You need a Canadian IP. You can use a free browser proxy since you only have to inspect the requests. The challenge was to find the request that returned the manifest URL. It's not that hard when you go request by request starting backward from the request that returns the content of the MPD (the one that uses the MPD URL). However, at first, it made me confused a bit since everything was hidden.

If the site was "normal", you could just export a HAR > ctrl+f search for the MPD URL, and get directly the first request that returns the MPD URL (without wasting time manually searching). But the strategy doesn't work here since everything is base64 encoded. I was trying to see if there was a way in which by having a HAR file, you could automatically decode all base64 strings, so a simple search afterward will bring you what you wanted. I hope I made myself clear.

--[----->+<]>.++++++++++++.---.--------.
[*drm mass downloader: widefrog*]~~~~~~~~~~~[*how to make your own mass downloader: guide*]

Quote

24th Jul 2024 04:51 #11

A_n_g_e_l_a

Member

Originally Posted by Karoolus

Ah yes, in my own downloader I first check for a PSSH in the MPD, if it's not found I go straight to init fragment and get PSSH from there. Never had an issue that way

That's a fair strategy and, of course it works. But if you put an intermediate Default-KID lookup before having to resort to an init.mp4 fetch it is just faster, that's all. It is only a matter of choice following a developer's priorities. But it is rare that the developer would produce an all-solutions script since the pssh delivery method would already be known. AllHell has been my only generic that needed it.

Noob Starter Pack. Just download every Widevine mpd! Not kidding!.
https://files.videohelp.com/u/301890/hellyes6.zip

Quote

24th Jul 2024 07:45 #12

A_n_g_e_l_a

Member

HAR is json and jsonformattter.org does a superb job creating a tree of the json. I use jsonformatter frequently but with HAR it is slow; it is buggy - doesn't seem to like paste in the search-box - but it will give you the nested path to data. so you can use code to extract. And this is so useful!

So with the whole HAR in jsonformatter a search for the session key gave:-

[Attachment 80916 - Click to enlarge]
That as highlighted isn't json that will parse as it has \ escape characters in front of quotes. Paste to a text editor and remove all the \ characters and paste the result to a new jsonformatter window.

[Attachment 80918 - Click to enlarge]

I'd previously clicked on licenseServerUrl which reveals - highlighted blue - the json path to the data.

so to retrieve something like this using python having already downloaded the json_data we can use the highlighted path

object►formats►0►drm►com.widevine.alpha►licenseSer verUrl (json_data is the object)

PHP Code:

licenseServerUrl = json_data['formats'][0]['drm']['com.widevine.alpha']['licenseServerUrl']

In real life use all of that would need to be automated first the text-not-quite-json could be captured
object►log►entries►25►response►content►text
Then removing all the \ characters with replace('\','') will give json to parse as above.

Noob Starter Pack. Just download every Widevine mpd! Not kidding!.
https://files.videohelp.com/u/301890/hellyes6.zip

Quote

25th Jul 2024 03:23 #13

2nHxWW6GkN1l916N3ayz8HQoi

Feels Good Man

Nice find. I had no idea json formatter had a feature like that. As you said, it is slow with a HAR. I thought it was slow just when it came to importing and parsing the tree. But even after it's loaded it's still slow. I think slower than Notepad. Always hate it when they implement a half-assed useful feature instead of optimizing it.

The path of the object is neat though.

--[----->+<]>.++++++++++++.---.--------.
[*drm mass downloader: widefrog*]~~~~~~~~~~~[*how to make your own mass downloader: guide*]

Quote

25th Jul 2024 07:24 #14

mustardran

Member

wow this is huge!!!
this forum is a gold mine, you guys are awesome!

Quote

3rd Nov 2024 05:52 #15

VanZan

Member

Thank you for this. I've never done anything like this before but over the course of a day I (somehow) got a script for Virgin Ireland working. Wouldn't have had a clue if not for this guide. I didn't follow the whole guide (only part one).

May I ask would it be very difficult to get my script working with either wirefrog or devine?

Quote

3rd Nov 2024 06:41 #16

2nHxWW6GkN1l916N3ayz8HQoi

Feels Good Man

Glad to hear you managed to write a script. The second part is dealing with multiple videos at once, like series/seasons etc.

The tutorial is meant to help you write a script as basic as possible, that means it's not tied to anything. So naturally you can adapt it and integrate what you wrote for any downloader you want. Devine has documentation on github and you can take a look at services written by @stabbedbybrick to see how they're written. Some downloaders may need more information to be extracted. You can also find a post in sticky threads for devine.

As for widefrog, I wrote no documentation for it when it comes to writing services as I'm the only one responsible for adding them. So I advise you to choose devine if you want to integrate your script in something bigger.

--[----->+<]>.++++++++++++.---.--------.
[*drm mass downloader: widefrog*]~~~~~~~~~~~[*how to make your own mass downloader: guide*]

Quote

11th Dec 2024 14:14 #17

YoBruce45

Member

From my side as well a big THANK YOU to 2nHxWW6GkN1l916N3ayz8HQo for his awsome guide, which became a kind of bible to me over the past weeks..
As posted in another thread, I decided to try it on my own with TF1.fr, and walked the guide step by step. Thx to the precious hints from the author, i finally managed to make it - I am so happy Thanks for the time you take helping other people dude.

My experience here (I am just repeating the guide ), take your time and go step by step.
For each request, trim all the headers/param/data to their maximum, so that the response deliver the data needed for the next request. Print the outputs and don't hesitate to put some try/except, to see where you code crashed.
At the end, my output (with the prints all along the chain) could look like this:
Code:
LOGIN_TOKEN: st2.s.AtLtWr1htg.glefvU42[....]YgesV2w.sc3
TOKEN: eyJhbGciOiJFUzI1N[....]WY5MDYtNDNiZC1iZjI4LWY0MWEwNTJhZWFjNyIsInN1YiI6IjE5NGUxMjI1NzU5NTRhZmZiMzI5OGM2YmRhYWNlNjk2In0.vvlIZtk1UE5qZhWBRDlaUHeH35NKX50eiV93tWbqC4JYH64IlIWg9KxidieyJahbeCvtm9C6K0xjr59n-MidOQ
TITLE: les-schtroumpfs-le-treizieme-dessert
https://vod-das.cdn-0.diff.tf1.fr/eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJjaXAiOiI5Mi4xNTIuMjIuNDEiLCJjbWNkIjoiIiwiZXhwIjoxNzMzOTU4MzQzLCJnaWQiOiIxOTRlMTIyNTc1OTU0YWZmYjMyOThjNmJkYWFjZTY5NiIsImlhdCI6MTczMzk0Mzk0MywiaXNzIjoiZGVsaXZlcnkiLCJtYXhiIjoyODAwMDAwLCJzdGVtIjoiLzIvVVNQLTB4MC84Mi80NC8xNDIxODI0NC9zc20vODhjNTdmNWVjZDE4M2EwYWFkNTVlMzM5ZWYxMjcwYmNiYzkwMWNkODcxYTY4NmE4OWNlZGRhYzljYWFlYzk0OS5pc20vMTQyMTgyNDQubXBkIiwic3ViIjoiMTk0ZTEyMjU3NTk1NGFmZmIzMjk4YzZiZGFhY2U2OTYifQ.YRpdLwk9gsyfzOa5thyjryiJLojWZ_EA7EPEjphnNDY/2/USP-0x0/82/44/14218244/ssm/88c57f5ecd183a0aad55e339ef1270bcbc901cd871a686a89ceddac9caaec949.ism/14218244.mpd
pssh: AAAAMHBzc2gAAAAA7e+LqXnWSs6jyCfc1R0h7QAAABAiCDE0MjE4MjQ0SOPclZsG
keyString: --key d6d09bfaac2a5e94935233d7fb969af9:d987acb2e8f5ca6bac21062dadad2d53--key 9c5c0df0c34855488a1c128fcb961579:76ea1d07407dcc95a3990abddb034509--key 5b5c419344df5c2683a359620eac31e5:c54010868297f9d908840d52845b97b0
One last thing was very weird though.
As i was almost touching the end since i received the .mpd-url, i was disappointed to see that the response_code to get the mpd_url was a 404
Means: i got an apparently valid mpd_url, but when i try to call it, didn't receive anything..
Very strange - this was solved after i realized I had to send the key 'pver' in the param of the request.
Just wanted to share...

Peace to this forum !

Last edited by YoBruce45; 11th Dec 2024 at 14:57.

Quote

11th Dec 2024 14:50 #18

2nHxWW6GkN1l916N3ayz8HQoi

Feels Good Man

Congrats!

Word of warning, you might wanna trim/edit and not make public any tokens that are related to your account. In the wrong hands, someone could harm your account, like edit password etc. Also whenever you have something that looks like its base64, decode first and see if it contains private sensitive data, like ip, country etc. before sharing

--[----->+<]>.++++++++++++.---.--------.
[*drm mass downloader: widefrog*]~~~~~~~~~~~[*how to make your own mass downloader: guide*]

Quote

11th Dec 2024 14:56 #19

YoBruce45

Member

just done, thx^^

Quote

30th Dec 2024 11:07 #20

YoBruce45

Member

for information, maybe some readers might be interested to know that there is a nice alternative for retrieving all the URLs of series episodes, using the module Selenium in Python.
I used to do it before i read this guide because very often, one has to click on some buttons to display hidden episodes, and was satisfied with this solution.
For example, i use this code to retrieve URLs from TF1 and Auvio:

Code:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

def retrieveTF1(url):
    driver = webdriver.Chrome()
    driver.get(url)
    driver.maximize_window()

    time.sleep(15)

    listVideoBlockElements = driver.find_elements(By.CSS_SELECTOR,"a[class^='flex flex-col-reverse gap-2 after:absolute after:inset-0 after:z-10']")

    for videoBlockElement in listVideoBlockElements:
        print(videoBlockElement.get_attribute('href'))

    print(">> %s links retrieved" % (len(listVideoBlockElements)))

def retrieveAuvio(url):
    driver = webdriver.Chrome()
    driver.get(url)
    driver.maximize_window()

    time.sleep(15)

    listVideoBlockElements = driver.find_elements(By.CSS_SELECTOR,"a[class^='DSBase_pointer__GUr_O TileEpisode_detailsLink__7N4u7 noFocusRing']")

    for videoBlockElement in listVideoBlockElements:
        print(videoBlockElement.get_attribute('href'))

    print(">> %s links retrieved" % (len(listVideoBlockElements)))

url = input("Enter URL and scroll down to display all elements: ")

if "tf1" in url[0:20]:
    retrieveTF1(url)

if "auvio" in url[0:20]:
    retrieveAuvio(url)

Quote

16th Feb 2025 07:45 #21

hizbf

Member

Originally Posted by 2nHxWW6GkN1l916N3ayz8HQoi

Congrats!

Word of warning, you might wanna trim/edit and not make public any tokens that are related to your account. In the wrong hands, someone could harm your account, like edit password etc. Also whenever you have something that looks like its base64, decode first and see if it contains private sensitive data, like ip, country etc. before sharing

Hello, may I ask how to obtain the MPD link and key of mytvsuper TV through the following code:

HTML Code:
params = {
'platform': 'android_tv',
'video_id': video_id,
}
response = requests.get(url='https://user-api.mytvsuper.com/v1/video/checkout',
params=params, headers=headers)

Because the web version of this website only provides 720P image quality, and only the TV version provides 1080P image quality, but I don't know how to write a script

This code comes from this post:https://forum.videohelp.com/threads/414196-mytv-super-1080p-stream#post2731673

Someone in this post said that obtaining MPD links also requires this:https://forum.videohelp.com/threads/414196-mytv-super-1080p-stream#post2731843
But I don't know how to write this script

This website:https://www.mytvsuper.com/en/home/
It has free programs:https://www.mytvsuper.com/en/content/10048/Free-Programme/
Need Hong Kong IP

Last edited by hizbf; 16th Feb 2025 at 08:02.

Quote

16th Feb 2025 10:13 #22

2nHxWW6GkN1l916N3ayz8HQoi

Feels Good Man

Hello. I can't access the site through any free means, so unfortunately I can't help you. Someone else can take a look

--[----->+<]>.++++++++++++.---.--------.
[*drm mass downloader: widefrog*]~~~~~~~~~~~[*how to make your own mass downloader: guide*]

Quote

16th Feb 2025 20:13 #23

hizbf

Member

Originally Posted by 2nHxWW6GkN1l916N3ayz8HQoi

Hello. I can't access the site through any free means, so unfortunately I can't help you. Someone else can take a look

May I ask if you are unable to open the website, play videos, or have no account?
This is a free program:https://www.mytvsuper.com/en/content/rw6358ea861f255d656499cdf7/Free-Programmes/

Last edited by hizbf; 16th Feb 2025 at 20:21.

Quote

17th Feb 2025 01:33 #24

2nHxWW6GkN1l916N3ayz8HQoi

Feels Good Man

No programme found. And when I try to login I get

We currently offer a limited range of services in your region
(myTV Gold / Free Zone and other services are only available in Hong Kong and Macau, thank you for your understanding.)

with all available free vpns

--[----->+<]>.++++++++++++.---.--------.
[*drm mass downloader: widefrog*]~~~~~~~~~~~[*how to make your own mass downloader: guide*]

Quote

17th Feb 2025 05:27 #25

hizbf

Member

Originally Posted by 2nHxWW6GkN1l916N3ayz8HQoi

No programme found. And when I try to login I get

We currently offer a limited range of services in your region
(myTV Gold / Free Zone and other services are only available in Hong Kong and Macau, thank you for your understanding.)

with all available free vpns

This website is from Hong Kong and requires a Hong Kong IP or VPN, otherwise the situation you mentioned may occur

Quote

18th Feb 2025 05:41 #26

Frieren

Elf

Originally Posted by hizbf

Originally Posted by 2nHxWW6GkN1l916N3ayz8HQoi

Congrats!

Word of warning, you might wanna trim/edit and not make public any tokens that are related to your account. In the wrong hands, someone could harm your account, like edit password etc. Also whenever you have something that looks like its base64, decode first and see if it contains private sensitive data, like ip, country etc. before sharing

Hello, may I ask how to obtain the MPD link and key of mytvsuper TV through the following code:

HTML Code:
params = {
'platform': 'android_tv',
'video_id': video_id,
}
response = requests.get(url='https://user-api.mytvsuper.com/v1/video/checkout',
params=params, headers=headers)

Because the web version of this website only provides 720P image quality, and only the TV version provides 1080P image quality, but I don't know how to write a script

This code comes from this post:https://forum.videohelp.com/threads/414196-mytv-super-1080p-stream#post2731673

Someone in this post said that obtaining MPD links also requires this:https://forum.videohelp.com/threads/414196-mytv-super-1080p-stream#post2731843
But I don't know how to write this script

This website:https://www.mytvsuper.com/en/home/
It has free programs:https://www.mytvsuper.com/en/content/10048/Free-Programme/
Need Hong Kong IP
Code:
def get_mytvsuper(channel):
    if channel not in CHANNEL_LIST:
        return '频道代号错误'

    api_token = os.getenv('MYTVSUPER_API_TOKEN')
    if not api_token:
        return 'API token 未设置'

    headers = {
        'Accept': 'application/json',
        'Authorization': 'Bearer ' + api_token,
        'Accept-Language': 'zh-CN,zh-Hans;q=0.9',
        'Host': 'user-api.mytvsuper.com',
        'Origin': 'https://www.mytvsuper.com',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5.2 Safari/605.1.15',
        'Referer': 'https://www.mytvsuper.com/',
        'X-Forwarded-For': '210.6.4.148'  # 香港原生IP  210.6.4.148
    }

    params = {
        'platform': 'android_tv',
        'network_code': channel
    }

    url = 'https://user-api.mytvsuper.com/v1/channel/checkout'
    try:
        response = requests.get(url, headers=headers, params=params)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        return f'请求失败: {e}'

    response_json = response.json()
    profiles = response_json.get('profiles', [])

    play_url = ''
    for profile in profiles:
        if profile.get('quality') == 'high':
            play_url = profile.get('streaming_path', '')
            break

    if not play_url:
        return '未找到播放地址'

    play_url = play_url.split('&p=')[0]

    license_key = CHANNEL_LIST[channel]['license']
    license_data = encode_keys(license_key)  
    print(f"hexTOBase64：{license_data}")
    channel_name = CHANNEL_LIST[channel]['name']
    channel_logo = CHANNEL_LIST[channel]['logo']
    m3u_content = f"#EXTINF:-1 tvg-id=\"{channel}\" tvg-name=\"{channel_name}\" tvg-logo=\"{channel_logo}\",{channel_name}\n"
    m3u_content += "#KODIPROP:inputstream.adaptive.manifest_type=mpd\n"
    m3u_content += "#KODIPROP:inputstream.adaptive.license_type=clearkey\n"
    m3u_content += f"#KODIPROP:inputstream.adaptive.license_key={license_data}\n"
    m3u_content += f"{play_url}\n"

    return m3u_content
full script
https://raw.githubusercontent.com/xiaotan8/xiaotan8.github.io/e9f7670ea6030555b08cc32b...9/mytvsuper.py

m3u8
https://raw.githubusercontent.com/xiaotan8/xiaotan8.github.io/e9f7670ea6030555b08cc32b.../mytvsuper.m3u

Quote

19th Feb 2025 13:22 #27

2nHxWW6GkN1l916N3ayz8HQoi

Feels Good Man

Thanks for sharing @frieren

--[----->+<]>.++++++++++++.---.--------.
[*drm mass downloader: widefrog*]~~~~~~~~~~~[*how to make your own mass downloader: guide*]

Quote

24th Feb 2025 20:02 #28

hizbf

Member

Originally Posted by Frieren
Originally Posted by hizbf

Originally Posted by 2nHxWW6GkN1l916N3ayz8HQoi

Congrats!

Word of warning, you might wanna trim/edit and not make public any tokens that are related to your account. In the wrong hands, someone could harm your account, like edit password etc. Also whenever you have something that looks like its base64, decode first and see if it contains private sensitive data, like ip, country etc. before sharing

Hello, may I ask how to obtain the MPD link and key of mytvsuper TV through the following code:

HTML Code:
params = {
'platform': 'android_tv',
'video_id': video_id,
}
response = requests.get(url='https://user-api.mytvsuper.com/v1/video/checkout',
params=params, headers=headers)

Because the web version of this website only provides 720P image quality, and only the TV version provides 1080P image quality, but I don't know how to write a script

This code comes from this post:https://forum.videohelp.com/threads/414196-mytv-super-1080p-stream#post2731673

Someone in this post said that obtaining MPD links also requires this:https://forum.videohelp.com/threads/414196-mytv-super-1080p-stream#post2731843
But I don't know how to write this script

This website:https://www.mytvsuper.com/en/home/
It has free programs:https://www.mytvsuper.com/en/content/10048/Free-Programme/
Need Hong Kong IP
Code:
def get_mytvsuper(channel):
    if channel not in CHANNEL_LIST:
        return '频道代号错误'

    api_token = os.getenv('MYTVSUPER_API_TOKEN')
    if not api_token:
        return 'API token 未设置'

    headers = {
        'Accept': 'application/json',
        'Authorization': 'Bearer ' + api_token,
        'Accept-Language': 'zh-CN,zh-Hans;q=0.9',
        'Host': 'user-api.mytvsuper.com',
        'Origin': 'https://www.mytvsuper.com',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5.2 Safari/605.1.15',
        'Referer': 'https://www.mytvsuper.com/',
        'X-Forwarded-For': '210.6.4.148'  # 香港原生IP  210.6.4.148
    }

    params = {
        'platform': 'android_tv',
        'network_code': channel
    }

    url = 'https://user-api.mytvsuper.com/v1/channel/checkout'
    try:
        response = requests.get(url, headers=headers, params=params)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        return f'请求失败: {e}'

    response_json = response.json()
    profiles = response_json.get('profiles', [])

    play_url = ''
    for profile in profiles:
        if profile.get('quality') == 'high':
            play_url = profile.get('streaming_path', '')
            break

    if not play_url:
        return '未找到播放地址'

    play_url = play_url.split('&p=')[0]

    license_key = CHANNEL_LIST[channel]['license']
    license_data = encode_keys(license_key)  
    print(f"hexTOBase64：{license_data}")
    channel_name = CHANNEL_LIST[channel]['name']
    channel_logo = CHANNEL_LIST[channel]['logo']
    m3u_content = f"#EXTINF:-1 tvg-id=\"{channel}\" tvg-name=\"{channel_name}\" tvg-logo=\"{channel_logo}\",{channel_name}\n"
    m3u_content += "#KODIPROP:inputstream.adaptive.manifest_type=mpd\n"
    m3u_content += "#KODIPROP:inputstream.adaptive.license_type=clearkey\n"
    m3u_content += f"#KODIPROP:inputstream.adaptive.license_key={license_data}\n"
    m3u_content += f"{play_url}\n"

    return m3u_content
full script
https://raw.githubusercontent.com/xiaotan8/xiaotan8.github.io/e9f7670ea6030555b08cc32b...9/mytvsuper.py

m3u8
https://raw.githubusercontent.com/xiaotan8/xiaotan8.github.io/e9f7670ea6030555b08cc32b.../mytvsuper.m3u
Do you have a script for obtaining 1080P video on demand? Because the script you sent is for obtaining 1080P video live streaming

Quote

24th Feb 2025 20:27 #29

hizbf

Member

Originally Posted by 2nHxWW6GkN1l916N3ayz8HQoi

Thanks for sharing @frieren

May I ask you a question? I just want to obtain the mainfest of the video through the TV API, without needing to obtain the video key. What should I do about this?

Quote

25th Feb 2025 00:42 #30

2nHxWW6GkN1l916N3ayz8HQoi

Feels Good Man

Since I can't access the site, nor do I own a fancy tv, I can only give you a general answer. Inspect the networks requests with an app like fiddler. Identify the request that returns the manifest and write a script.

--[----->+<]>.++++++++++++.---.--------.
[*drm mass downloader: widefrog*]~~~~~~~~~~~[*how to make your own mass downloader: guide*]

Quote

the slacker's guide to mass downloading (most of) the internet, automation

Thread Tools

Similar Threads

Downloading videos from Jiocinema : Step by Step Guide

Beginners guide to downloading from UK broadcasters streaming sites

Mass downloading coursera courses?

FFMPEG Automation

I am looking for a free Video Automation Playout software which one do you