Autogenerate Show Notes with yt-dlp, Whisper.cpp, and Node.js
End-to-end scripting workflow utilizing Whisper.cpp, yt-dlp, and Commander.js to automatically generate show notes with LLMs from audio and video transcripts.
Outline
- Introduction
- Setup Project and Install Dependencies
- Download and Extract Audio with yt-dlp
- Create and Prepare Transcription for Analysis
- ChatGPT Show Notes Creation Prompt
- Create Autogen Bash Script
- Create Node.js CLI
- Example Show Notes and Next Steps
All of this project’s code can be found on my GitHub at `ajcwebdev/autoshow`.
Introduction
Creating podcast show notes is an arduous process. Many podcasters do not have the support of a team or the personal bandwidth required to produce high-quality show notes. A few of the necessary ingredients include:
- Accurate transcript with timestamps
- Chapter headings and descriptions
- Succinct episode summaries of varying length (sentence, paragraph, a few paragraphs)
Thankfully, through the magic of AI, many of these can now be generated automatically with a combination of open source tooling and affordable large language models (LLMs). In this project, we’ll be leveraging OpenAI’s open source transcription model, Whisper, and their closed source LLM, ChatGPT.
Setup Project and Install Dependencies
Create a new project directory and perform the following steps:
- Initialize a `package.json` and set `type` to `module` for ESM syntax.
- Create a `content` directory for audio and transcription files that we’ll generate along the way.
- Create a `.gitignore` file for `node_modules` and the `whisper.cpp` GitHub repo.
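The steps above can be sketched from the terminal like this (the directory name `autoshow` is an assumption to match the repo):

```shell
# Create the project directory and enter it
mkdir -p autoshow && cd autoshow

# Minimal package.json with "type" set to "module" for ESM syntax
cat > package.json << 'EOF'
{
  "name": "autoshow",
  "type": "module"
}
EOF

# Directory for the audio and transcript files we'll generate
mkdir -p content

# Ignore node_modules and the cloned whisper.cpp repo
printf 'node_modules\nwhisper.cpp\n' > .gitignore
```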
`yt-dlp` is a command-line program to download videos from YouTube and other video platforms. It is a fork of `yt-dlc`, which itself is a fork of `youtube-dl`, with additional features and patches integrated from both.
`whisper.cpp` is a C++ implementation of OpenAI’s `whisper` Python project. This provides the useful feature of making it possible to transcribe episodes in minutes instead of days. Run the following commands to clone the repo and build the base model:
Note: This will build the smallest and least capable transcription model. For a more accurate but heavyweight model, replace `base` (150MB) with `medium` (1.5GB) or `large-v2` (3GB).
If you’re a simple JS developer like me, you may find the `whisper.cpp` repo a bit intimidating to navigate. Here’s a breakdown of some of the most important pieces of the project to help you get oriented:
`models/ggml-base.bin`

- Custom binary format (`ggml`) used by the `whisper.cpp` library.
- Represents a quantized or optimized version of OpenAI’s Whisper model tailored for high-performance inference on various platforms.
- The `ggml` format is designed to be lightweight and efficient, allowing the model to be easily integrated into different applications.
`main`

- Executable compiled from the `whisper.cpp` repository.
- Transcribes or translates audio files using the Whisper model.
- Running this executable with an audio file as input transcribes the audio to text.
`samples`

- The directory for sample audio files.
- Includes a sample file called `jfk.wav` provided for testing and demonstration purposes.
- The `main` executable can use it for showcasing the model’s transcription capabilities.
`whisper.cpp` and `whisper.h`

- These are the core C++ source and header files of the `whisper.cpp` project.
- They implement the high-level API for interacting with the Whisper automatic speech recognition (ASR) model.
- This includes loading the model, preprocessing audio inputs, and performing inference.
Download and Extract Audio with yt-dlp
For transcriptions of videos, `yt-dlp` can download and extract audio from YouTube URLs. For podcasts, you can input the URL from the podcast’s RSS feed that hosts the raw file containing the episode’s audio. Create a command that completes the following actions:

- Download a specified YouTube video.
- Extract the video’s audio.
- Convert the audio to WAV format.
- Save the file in Whisper’s `content` directory.
- Set the filename to `output.wav`.
Note: Include the `--verbose` flag if you’re getting weird bugs and don’t know why.
This command uses `yt-dlp` to perform the following actions:

- `--extract-audio` (`-x`) downloads the video from a given URL and extracts its audio.
- `--audio-format` specifies the format the audio should be converted to; for Whisper we’ll use `wav` for WAV files.
- `--postprocessor-args` passes `16000` to `-ar` so the audio sampling rate is set to 16,000 Hz (16 kHz) for Whisper.
- `-o` specifies the output template for the downloaded file, in this case `content/output.wav`, which also specifies the directory to place the output file.
- The URL, `https://www.youtube.com/watch?v=jKB0EltG9Jo`, is the YouTube video we’ll extract the audio from. Each YouTube video has a unique identifier contained in its URL (`jKB0EltG9Jo` in this example).
Create and Prepare Transcription for Analysis
It’s possible to run the Whisper model and have the transcript output just to the terminal by running:
Note:

- `-m` and `-f` are shortened aliases used in place of `--model` and `--file`.
- For other models, replace `ggml-base.bin` with `ggml-medium.bin` or `ggml-large-v2.bin`.
This is nice for quick demos or short files. However, what you really want is the transcript saved to a new file.
Run Whisper Transcription Model
Whisper.cpp provides many different output options including `txt`, `vtt`, `srt`, `lrc`, `csv`, and `json`. These cover a wide range of uses and vary from highly structured to mostly unstructured data.
- Any combination of output files can be specified with `--output-filetype`, using any of the previous options in place of `filetype`.
- For example, to output two files, an LRC file and a basic text file, include `--output-lrc` and `--output-txt`.
For this example, we’ll only output one file in the `lrc` format:
`-of` is an alias for `--output-file`. This option modifies the final file name along with the selected file extensions. Since our command includes `content/transcript`, there will be a file called `transcript.lrc` inside the `content` directory.
Create files in all output formats
Modify Transcript Output for LLM
Despite the various available options for file formats, `whisper.cpp` outputs all of them as text files that can later be parsed and transformed. As with many things in programming, numerous approaches could be used to yield similar results.

Based on your personal workflows and experience, you may find it easier to parse and transform a different common data format like `csv` or `json`. For my purposes, I’m going to use the `lrc` output, which looks like this:
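The actual transcript isn’t reproduced here, but an LRC file pairs a `[by:whisper.cpp]` signature line with `[mm:ss.xx]` timestamped lines, roughly like this illustrative (not actual) excerpt:

```
[by:whisper.cpp]
[00:00.00] Welcome to the show.
[00:04.50] Today we talk about transcription.
```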
Using a combination of `grep` and `awk`, I’ll write a short bash command to take the LRC transcript and modify it to look like this instead:
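The target format drops the signature line and the millisecond portion of each timestamp; an illustrative (not actual) excerpt:

```
[00:00] Welcome to the show.
[00:04] Today we talk about transcription.
```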
To achieve the desired transformations with the given directory structure, we’ll need to:

- Read the `transcript.lrc` file from the `content` directory.
- Remove the `[by:whisper.cpp]` signature.
- Format the timestamps to remove milliseconds.
- Write the transformed content to a new file called `transcript.txt` in the same directory as the original file.
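A sketch of that transformation with `grep` and `awk`. The sample input here is fabricated so the snippet runs standalone; in the real workflow `content/transcript.lrc` comes from whisper.cpp:

```shell
mkdir -p content

# Stand-in LRC file; in the real workflow this is produced by whisper.cpp
cat > content/transcript.lrc << 'EOF'
[by:whisper.cpp]
[00:00.00] Welcome to the show.
[00:04.50] Today we talk about transcription.
EOF

# Drop the signature line, then strip milliseconds from each timestamp
grep -v '^\[by:whisper.cpp\]' content/transcript.lrc | \
awk -F']' '{
  ts = substr($1, 2)   # timestamp without the leading bracket, e.g. "00:00.00"
  split(ts, t, ".")    # t[1] holds "00:00"
  printf "[%s]%s\n", t[1], $2
}' > content/transcript.txt

cat content/transcript.txt
```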
In the next section we’ll create the prompt to tell ChatGPT or Claude how to write the show notes. This prompt along with all the previous logic to download, transcribe, and transform the output will be combined into a single Bash script.
ChatGPT Show Notes Creation Prompt
Now that we have a cleaned up transcript, we can use ChatGPT directly to create the show notes. The output will contain five distinct sections which correspond to the full instructions of the prompt. Any of these sections can be removed, changed, or expanded:
- Potential Episode Titles
- One Sentence Summary
- One Paragraph Summary
- Chapters
- Key Takeaways
Create a file called `prompt.md`:
Include the following prompt with the transcript after the final line:
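The original prompt isn’t reproduced here; a minimal, hypothetical sketch covering the sections listed above might look like:

```md
This is a transcript with timestamps. Write up the following show notes based on it:

- Write 3-5 potential titles for this episode.
- Write a one sentence summary of the episode.
- Write a one paragraph summary of the episode.
- Create chapters based on the topics discussed, with a timestamp and a one
  or two sentence description for each chapter.
- List the key takeaways from the episode.

TRANSCRIPT ATTACHED
```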
The final step is to take the content of `prompt.md`, append the transcript in `transcript.txt`, and write the combined content to a new file called `final.md` in the `content` directory.
To achieve this directly from the terminal, use the `cat` command to concatenate the content of `prompt.md` with `content/transcript.txt` and redirect the output to create `final.md` in the `content` directory:
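That one-liner might look like the following (placeholder files are created first only so the snippet can run standalone):

```shell
mkdir -p content
[ -f prompt.md ] || echo "(prompt placeholder)" > prompt.md
[ -f content/transcript.txt ] || echo "(transcript placeholder)" > content/transcript.txt

# Concatenate the prompt and the transcript into content/final.md
cat prompt.md content/transcript.txt > content/final.md
```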
Create Autogen Bash Script
Let’s combine all the previous commands into one single script. Create a file called `autogen.sh` and give the script executable permissions with `chmod`:
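For example:

```shell
# Create the script and make it executable
touch autogen.sh
chmod +x autogen.sh
```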
The entirety of the example so far will be implemented as a `process_video` function executed by a `main()` function.
Process Video with Autogen
The `--print` option from `yt-dlp` can be used to extract metadata from the video. We’ll use the following in our script:

- `video_id` and `upload_date` provide a unique name for each video.
- `webpage_url` for the full video URL.
- `uploader` for the channel name.
- `uploader_url` for the channel URL.
- `title` for the video title.
- `thumbnail` for the video thumbnail.
Include the following code in `autogen.sh`:
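The original script isn’t shown here. A minimal sketch of what `process_video` and `main` might look like, reconstructed from the steps above (the file naming and metadata handling are assumptions; the remaining metadata fields would feed the show-notes front matter):

```shell
#!/bin/bash
# Sketch of autogen.sh, reconstructed from the steps above -- not the exact script.

process_video() {
  url="$1"

  # Extract metadata with yt-dlp's --print option
  video_id=$(yt-dlp --print id "$url")
  upload_date=$(yt-dlp --print upload_date "$url")
  base="content/${upload_date}-${video_id}"

  # Download and extract 16 kHz WAV audio
  yt-dlp --extract-audio --audio-format wav \
    --postprocessor-args "-ar 16000" \
    -o "${base}.wav" "$url"

  # Transcribe with whisper.cpp, outputting an LRC file
  ./whisper.cpp/main \
    -m whisper.cpp/models/ggml-base.bin \
    -f "${base}.wav" -of "$base" --output-lrc

  # Strip the signature and milliseconds, then prepend the prompt
  grep -v '^\[by:whisper.cpp\]' "${base}.lrc" | \
    awk -F']' '{ split(substr($1, 2), t, "."); printf "[%s]%s\n", t[1], $2 }' \
    > "${base}.txt"
  cat prompt.md "${base}.txt" > "${base}-final.md"
}

main() {
  if [ "$1" = "--video" ]; then
    process_video "$2"
  else
    echo "Usage: ./autogen.sh --video <url>"
  fi
}

main "$@"
```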
Run `./autogen.sh --video` followed by the video URL you would like to transcribe:
Next, we’ll write two more functions that each run the `process_video` function on multiple videos. These videos will be either contained in a playlist (`process_playlist`) or written in a `urls.md` file (`process_urls_file`).
Process Playlist with Autogen
At this point, the `autogen.sh` script is designed to run on individual video URLs. However, if you already have a backlog of content to transcribe, you’ll want to run this script on a series of video URLs.
Let’s create another option to accept a playlist URL instead of a video URL. The `--print "url"` and `--flat-playlist` options from `yt-dlp` can be used to write a list of video URLs to a new file which we’ll call `urls.md`.
Include the following code in `autogen.sh`:
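A sketch of what `process_playlist` might look like (an assumption based on the description above, not the exact code):

```shell
# Write each video URL in the playlist to urls.md, then process each one.
# Assumes process_video is defined earlier in autogen.sh.
process_playlist() {
  playlist_url="$1"
  yt-dlp --flat-playlist --print "url" "$playlist_url" > urls.md
  while read -r url; do
    process_video "$url"
  done < urls.md
}
```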
Run `./autogen.sh` with the playlist URL passed to `--playlist`.
Process URLs with Autogen
To process a list of arbitrary URLs, we’ll want to bypass the `yt-dlp` command that reads a list of videos from a playlist and pass `urls.md` directly to Whisper.
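A sketch of what `process_urls_file` might look like (again an assumption): it skips the playlist step and reads an existing `urls.md` line by line:

```shell
# Read URLs line by line from an existing file and process each one.
# Assumes process_video is defined earlier in autogen.sh.
process_urls_file() {
  file="$1"
  while read -r url; do
    if [ -n "$url" ]; then
      process_video "$url"
    fi
  done < "$file"
}
```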
Run `./autogen.sh --urls` followed by the path to your `urls.md` file.
Create Node.js CLI
For the last part of this tutorial, we’ll port all of this logic into a Node.js CLI using Commander.js.
Include the following code in `autogen.js`:
Run on a single YouTube video:
Run on multiple YouTube videos in a playlist:
Run on an arbitrary list of URLs in `urls.md`:
Example Show Notes and Next Steps
Here’s what ChatGPT generated for Episode 0 of the Fullstack Jamstack podcast:
This workflow is fine for me because I only create a podcast every week or two, so I can just copy and paste the transcript into ChatGPT and copy out the output. However, it’s very possible that you could have dozens or even hundreds of episodes that you want to run this process on.
To achieve this in a short amount of time, you’ll need to use the OpenAI API and drop a bit of coin to do so. In my next blog post, I’ll be showing how to achieve this with OpenAI’s Node.js wrapper library. Once that blog post is complete I’ll update this post and link it at the end.