
Autogenerate Show Notes with yt-dlp, Whisper.cpp, and Node.js

End-to-end scripting workflow utilizing Whisper.cpp, yt-dlp, and Commander.js to automatically generate show notes with LLMs from audio and video transcripts.

All of this project’s code can be found on my GitHub at ajcwebdev/autoshow.

Introduction

Creating podcast show notes is an arduous process. Many podcasters do not have the support of a team or the personal bandwidth required to produce high quality show notes. A few of the necessary ingredients include:

  • Accurate transcript with timestamps
  • Chapter headings and descriptions
  • Succinct episode summaries of varying length (sentence, paragraph, a few paragraphs)

Thankfully, through the magic of AI, many of these can now be generated automatically with a combination of open source tooling and affordable large language models (LLMs). In this project, we’ll be leveraging OpenAI’s open source transcription model, Whisper, and their closed source LLM, ChatGPT.

Setup Project and Install Dependencies

Create a new project directory and perform the following steps:

  • Initialize a package.json and set type to module for ESM syntax.
  • Create a content directory for audio and transcription files that we’ll generate along the way.
  • Create a .gitignore file for node_modules and the whisper.cpp GitHub repo.
Terminal window
mkdir autoshow && \
cd autoshow && \
npm init -y && \
npm pkg set type="module" && \
mkdir content && \
printf "node_modules\n.DS_Store\nwhisper.cpp" > .gitignore

yt-dlp is a command-line program to download videos from YouTube and other video platforms. It is a fork of yt-dlc, which itself is a fork of youtube-dl, with additional features and patches integrated from both.

Terminal window
brew install yt-dlp ffmpeg
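
The brew command assumes macOS. On other platforms both tools are available through the usual package managers; for example, on Debian/Ubuntu something like the following should work (an assumption on my part, adjust for your distro):

Terminal window
sudo apt-get install ffmpeg && \
python3 -m pip install yt-dlp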

whisper.cpp is a C++ implementation of OpenAI’s Whisper Python project. It makes it possible to transcribe episodes locally in minutes rather than hours. Run the following commands to clone the repo and build the base model:

Terminal window
git clone https://github.com/ggerganov/whisper.cpp && \
bash ./whisper.cpp/models/download-ggml-model.sh base && \
make -C whisper.cpp

Note: This will build the smallest and least capable transcription model. For a more accurate but heavyweight model, replace base (150MB) with medium (1.5GB) or large-v2 (3GB).
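
For example, to fetch the medium model, run the same download script with a different argument (the commands later in this post assume base, so swap the model filename accordingly):

Terminal window
bash ./whisper.cpp/models/download-ggml-model.sh medium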

If you’re a simple JS developer like me, you may find the whisper.cpp repo a bit intimidating to navigate. Here’s a breakdown of some of the most important pieces of the project to help you get oriented:

models/ggml-base.bin
  • Custom binary format (ggml) used by the whisper.cpp library.
    • Represents a quantized or optimized version of OpenAI’s Whisper model tailored for high-performance inference on various platforms.
    • The ggml format is designed to be lightweight and efficient, allowing the model to be easily integrated into different applications.
main
  • Executable compiled from the whisper.cpp repository.
    • Transcribes or translates audio files using the Whisper model.
    • Running this executable with an audio file as input transcribes the audio to text.
samples
  • The directory for sample audio files.
    • Includes a sample file called jfk.wav provided for testing and demonstration purposes.
    • The main executable can use it for showcasing the model’s transcription capabilities.
whisper.cpp and whisper.h
  • These are the core C++ source and header files of the whisper.cpp project.
    • They implement the high-level API for interacting with the Whisper automatic speech recognition (ASR) model.
    • This includes loading the model, preprocessing audio inputs, and performing inference.
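
Before moving on, it’s worth sanity checking the build by transcribing the bundled jfk.wav sample mentioned above:

Terminal window
./whisper.cpp/main \
-m whisper.cpp/models/ggml-base.bin \
-f whisper.cpp/samples/jfk.wav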

Download and Extract Audio with yt-dlp

For transcriptions of videos, yt-dlp can download and extract audio from YouTube URLs. For podcasts, you can input the URL from the podcast’s RSS feed that points to the raw file containing the episode’s audio. Create a command that completes the following actions:

  1. Download a specified YouTube video.
  2. Extract the video’s audio.
  3. Convert the audio to WAV format.
  4. Save the file in the project’s content directory.
  5. Set filename to output.wav.

Note: Include the --verbose flag if you’re getting weird bugs and don’t know why.

Terminal window
yt-dlp -x \
--audio-format wav \
--postprocessor-args "ffmpeg: -ar 16000" \
-o "content/output.wav" \
"https://www.youtube.com/watch?v=jKB0EltG9Jo"

This command uses yt-dlp to perform the following actions:

  • --extract-audio (-x) downloads the video from a given URL and extracts its audio.
  • --audio-format specifies the format the audio should be converted to; for Whisper we’ll use wav for WAV files.
  • --postprocessor-args passes -ar 16000 to ffmpeg so the audio sampling rate is set to 16,000 Hz (16 kHz), which Whisper expects.
  • -o specifies the output template for the downloaded files, in this case content/output.wav which also specifies the directory to place the output file.
  • The URL, https://www.youtube.com/watch?v=jKB0EltG9Jo, is the YouTube video we’ll extract the audio from. Each YouTube video has a unique identifier contained in its URL (jKB0EltG9Jo in this example).
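
To double check that the extracted audio really was resampled to 16 kHz, you can inspect the file with ffprobe, which ships with ffmpeg (a quick sanity check, not a required step):

Terminal window
ffprobe -hide_banner content/output.wav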

Create and Prepare Transcription for Analysis

It’s possible to run the Whisper model and print the transcript directly to the terminal by running:

Terminal window
./whisper.cpp/main \
-m whisper.cpp/models/ggml-base.bin \
-f content/output.wav

Note:

  • -m and -f are shortened aliases used in place of --model and --file.
  • For other models, replace ggml-base.bin with ggml-medium.bin or ggml-large-v2.bin.

[Screenshot: Whisper transcript output printed in the terminal]

This is nice for quick demos or short files. However, what you really want is the transcript saved to a new file.

Run Whisper Transcription Model

Whisper.cpp provides many different output options including txt, vtt, srt, lrc, csv, and json. These cover a wide range of uses and vary from highly structured to mostly unstructured data.

  • Any combination of output files can be specified with --output-filetype using any of the previous options in place of filetype.
  • For example, to output two files, an LRC file and basic text file, include --output-lrc and --output-txt.

For this example, we’ll only output one file in the lrc format:

Terminal window
./whisper.cpp/main \
-m whisper.cpp/models/ggml-base.bin \
-f content/output.wav \
-of content/transcript \
--output-lrc

-of is an alias for --output-file. It sets the base name of the output file(s), with the extension added according to the selected output format(s). Since our command includes content/transcript, there will be a file called transcript.lrc inside the content directory.

Create files in all output formats
Terminal window
./whisper.cpp/main \
-m whisper.cpp/models/ggml-base.bin \
-f content/output.wav \
-of content/transcript \
--output-txt --output-vtt \
--output-srt --output-lrc \
--output-csv --output-json

Modify Transcript Output for LLM

Despite the various available options for file formats, whisper.cpp outputs all of them as plain text files that can later be parsed and transformed. As with many things in programming, numerous approaches could be used to yield similar results.

Based on your personal workflow and experience, you may find it easier to parse and transform a different common data format like csv or json. For my purposes, I’m going to use the lrc output, which looks like this:

[by:whisper.cpp]
[00:00.00] Okay, well, you know, it can be a great question for this episode.
[00:02.24] What is Fullstack Jamstack?
[00:04.04] What?
[00:05.04] Yeah, exactly.
[00:06.04] Yeah.
[00:07.04] And who are we?

Using a combination of grep and awk, I’ll write a short bash command to take the LRC transcript and modify it to look like this instead:

[00:00] Okay, well, you know, it can be a great question for this episode.
[00:02] What is Fullstack Jamstack?
[00:04] What?
[00:05] Yeah, exactly.
[00:06] Yeah.
[00:07] And who are we?

To achieve the desired transformations with the given directory structure, we’ll need to:

  1. Read the transcript.lrc file from the content directory.
  2. Remove the [by:whisper.cpp] signature.
  3. Format the timestamps to remove milliseconds.
  4. Write the transformed content to a new file called transcript.txt in the same directory as the original file.
Terminal window
# Drop the [by:whisper.cpp] signature, then strip fractional seconds
# from each timestamp (e.g. [00:02.24] becomes [00:02])
grep -v '^\[by:whisper\.cpp\]$' "content/transcript.lrc" | \
awk '{ gsub(/\.[0-9]+/, "", $1); print }' > "content/transcript.txt"

In the next section we’ll create the prompt to tell ChatGPT or Claude how to write the show notes. This prompt along with all the previous logic to download, transcribe, and transform the output will be combined into a single Bash script.

ChatGPT Show Notes Creation Prompt

Now that we have a cleaned up transcript, we can use ChatGPT directly to create the show notes. The output will contain five distinct sections corresponding to the instructions in the prompt. Any of these sections can be removed, changed, or expanded:

  • Potential Episode Titles
  • One Sentence Summary
  • One Paragraph Summary
  • Chapters
  • Key Takeaways

Create a file called prompt.md:

Terminal window
echo > prompt.md

Include the following prompt; the transcript will be appended after the final line:

prompt.md
This is a transcript with timestamps. Write 3 potential titles for the video.
Write a one sentence summary of the transcript, a one paragraph summary, and a two paragraph summary.
- The one sentence summary shouldn't exceed 180 characters (roughly 30 words).
- The one paragraph summary should be approximately 600-1200 characters (roughly 100-200 words).
Create chapters based on the topics discussed throughout.
- Include timestamps for when these chapters begin.
- Chapters shouldn't be shorter than 1-2 minutes or longer than 5-6 minutes.
- Write a one paragraph description for each chapter.
- Note the very last timestamp and make sure the chapters extend to the end of the episode.
Lastly, include three key takeaways the listener should get from the episode.
Format the output like so:
```md
## Potential Titles
1. Title I - Title Hard
2. Title II - Title Harder
3. Title III - Title Hard with a Vengeance
## Episode Summary
One sentence summary which doesn't exceed 180 characters (or roughly 30 words).
tl;dr: One paragraph summary which doesn't exceed approximately 600-1200 characters (or roughly 100-200 words)
## Chapters
00:00 - Introduction and Beginning of Episode
The episode starts with a discussion on the importance of creating and sharing projects.
02:56 - Guest Introduction and Background
Introduction of guests followed by host discussing the guests' background and journey.
## Key Takeaways
1. Key takeaway goes here
2. Another key takeaway goes here
3. The final key takeaway goes here
## Transcript
```
TRANSCRIPT ATTACHED

The final step is to take the content of prompt.md, append the transcript in transcript.txt, and write the combined content to a new file called final.md in the content directory.

To achieve this directly from the terminal, use the cat command to concatenate the content of prompt.md with content/transcript.txt and redirect the output to create final.md in the content directory:

Terminal window
cat prompt.md content/transcript.txt > content/final.md

Create Autogen Bash Script

Let’s combine all the previous commands into a single script. Create a file called autogen.sh and give the script executable permissions with chmod:

Terminal window
echo > autogen.sh
chmod +x autogen.sh

Everything in the example so far will be implemented as a process_video function invoked by a main function.

#!/bin/bash

process_video() {
  : # Placeholder so the script parses; an empty function body is a bash syntax error
}

main() {
  mode="$1"
  input="$2"
  if [[ "$#" -ne 2 ]]; then
    echo "Usage: $0 --video <video_url>"
    return 1
  fi
  case "$mode" in
    --video)
      process_video "$input" # Process a single video
      ;;
    *)
      echo "Invalid option. Use --video"
      return 1
      ;;
  esac
}

if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
  main "$@"
fi

Process Video with Autogen

The --print option from yt-dlp can be used to extract metadata from the video. We’ll use the following in our script:

  • video_id and upload_date provide a unique name for each video.
  • webpage_url for the full video URL.
  • uploader for the channel name.
  • uploader_url for the channel URL.
  • title for the video title.
  • thumbnail for the video thumbnail.
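
If you want to see what these fields return before wiring them into the script, --print can be passed multiple times in a single invocation (a quick exploratory check, not part of the final script):

Terminal window
yt-dlp --print id \
--print title \
--print webpage_url \
"https://www.youtube.com/watch?v=jKB0EltG9Jo"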

Include the following code in autogen.sh:

#!/bin/bash

process_video() {
  url="$1"
  video_id=$(yt-dlp --print id "$url")
  id="content/${video_id}"
  upload_date=$(yt-dlp --print filename -o "%(upload_date>%Y-%m-%d)s" "$url")
  final="content/${upload_date}-${video_id}"
  base="whisper.cpp/models/ggml-base.bin"
  # medium="whisper.cpp/models/ggml-medium.bin"
  # large="whisper.cpp/models/ggml-large-v2.bin"

  # Write frontmatter with the video metadata
  echo "---" > "${id}.md"
  echo "showLink: \"$(yt-dlp --print webpage_url "$url")\"" >> "${id}.md"
  echo "channel: \"$(yt-dlp --print uploader "$url")\"" >> "${id}.md"
  echo "channelURL: \"$(yt-dlp --print uploader_url "$url")\"" >> "${id}.md"
  echo "title: \"$(yt-dlp --print title "$url")\"" >> "${id}.md"
  echo "publishDate: \"${upload_date}\"" >> "${id}.md"
  echo "coverImage: \"$(yt-dlp --print thumbnail "$url")\"" >> "${id}.md"
  echo -e "---\n" >> "${id}.md" # -e so \n prints as a blank line after the frontmatter

  # Download the audio, transcribe it, and clean up the timestamps
  yt-dlp -x --audio-format wav --postprocessor-args "ffmpeg: -ar 16000" -o "${id}.wav" "$url"
  ./whisper.cpp/main -m "${base}" -f "${id}.wav" -of "${id}" --output-lrc
  originalPath="${id}.lrc"
  finalPath="${id}.txt"
  grep -v '^\[by:whisper\.cpp\]$' "$originalPath" | awk '{ gsub(/\.[0-9]+/, "", $1); print }' > "$finalPath"

  # Combine frontmatter, prompt, and transcript, then remove intermediate files
  cat "${id}.md" prompt.md "${id}.txt" > "${final}.md"
  rm "${id}.wav" "${id}.lrc" "${id}.txt" "${id}.md"
  echo "Process completed successfully for URL: $url"
}

main() {
  mode="$1"
  input="$2"
  if [[ "$#" -ne 2 ]]; then
    echo "Usage: $0 --video <video_url>"
    return 1
  fi
  case "$mode" in
    --video)
      process_video "$input" # Process a single video
      ;;
    *)
      echo "Invalid option. Use --video"
      return 1
      ;;
  esac
}

if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
  main "$@"
fi

Run ./autogen.sh --video followed by the video URL you would like to transcribe:

Terminal window
./autogen.sh --video "https://www.youtube.com/watch?v=jKB0EltG9Jo"
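
When the script finishes, the intermediate .wav, .lrc, and .txt files along with the temporary metadata .md file have been deleted, leaving a single markdown file in content named after the upload date and video ID:

Terminal window
ls content
# <upload_date>-<video_id>.md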

Next, we’ll write two more functions that each run the process_video function on multiple videos. These videos will be either contained in a playlist (process_playlist) or written in a urls.md file (process_urls_file).

Process Playlist with Autogen

At this point, the autogen.sh script is designed to run on individual video URLs. However, if you already have a backlog of content to transcribe, you’ll want to run this script on a series of video URLs.

Let’s create another option to accept a playlist URL instead of a video URL. The --print "url" and --flat-playlist options from yt-dlp can be used to write a list of video URLs to a new file which we’ll call urls.md.

Include the following code in autogen.sh:

#!/bin/bash
# process_video() {}

process_playlist() {
  playlist_url="$1"
  yt-dlp --flat-playlist -s --print "url" "$playlist_url" > urls.md
  while IFS= read -r url; do
    if [[ -n "$url" ]]; then
      process_video "$url"
    fi
  done < urls.md
}

main() {
  mode="$1"
  input="$2"
  if [[ "$#" -ne 2 ]]; then
    echo "Usage: $0 --video <video_url> | --playlist <playlist_url>"
    return 1
  fi
  case "$mode" in
    --video)
      process_video "$input"
      ;;
    --playlist)
      process_playlist "$input"
      ;;
    *)
      echo "Invalid option. Use --video or --playlist"
      return 1
      ;;
  esac
}

if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
  main "$@"
fi

Run ./autogen.sh with the playlist URL passed to --playlist.

Terminal window
./autogen.sh --playlist "https://www.youtube.com/playlist?list=PLCVnrVv4KhXMh4DQBigyvHSRTf2CSj129"

Process URLs with Autogen

To process a list of arbitrary URLs, we’ll bypass the yt-dlp command that generates urls.md from a playlist and instead have the script read the URLs directly from a file passed on the command line.

#!/bin/bash
# process_video() {}
# process_playlist() {}

process_urls_file() {
  file_path="$1"
  if [[ ! -f "$file_path" ]]; then
    echo "File not found: $file_path"
    return 1
  fi
  while IFS= read -r url; do
    if [[ -n "$url" ]]; then
      process_video "$url"
    fi
  done < "$file_path"
}

main() {
  mode="$1"
  input="$2"
  if [[ "$#" -ne 2 ]]; then
    echo "Usage: $0 --video <video_url> | --playlist <playlist_url> | --urls <file_path>"
    return 1
  fi
  case "$mode" in
    --video)
      process_video "$input" # Process a single video
      ;;
    --playlist)
      process_playlist "$input" # Process a playlist
      ;;
    --urls)
      process_urls_file "$input" # Process URLs from a file
      ;;
    *)
      echo "Invalid option. Use --video, --playlist, or --urls"
      return 1
      ;;
  esac
}

if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
  main "$@"
fi

Run ./autogen.sh --urls followed by the path to your urls.md file.

Terminal window
./autogen.sh --urls urls.md
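
The file should contain one URL per line; blank lines are skipped. For example, using the two videos referenced in this post:

urls.md
https://www.youtube.com/watch?v=jKB0EltG9Jo
https://www.youtube.com/watch?v=QhXc9rVLVUo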

Create Node.js CLI

For the last part of this tutorial, we’ll port all of this logic into a Node.js CLI using Commander.js.

Terminal window
npm i commander
echo > autogen.js

Include the following code in autogen.js.

autogen.js
import { Command } from 'commander'
import { execSync } from 'child_process'
import fs from 'fs'

const program = new Command()

program
  .name("autogen")
  .description("Automated processing of YouTube videos and playlists")
  .version("1.0.0")

program.command('video <url>')
  .description('Process a single YouTube video')
  .action((url) => {
    processVideo(url)
  })

program.command('playlist <playlistUrl>')
  .description('Process all videos in a YouTube playlist')
  .action((playlistUrl) => {
    processPlaylist(playlistUrl)
  })

program.command('urls <filePath>')
  .description('Process YouTube videos from a list of URLs in a file')
  .action((filePath) => {
    processUrlsFile(filePath)
  })

async function processVideo(url) {
  try {
    const videoId = execSync(`yt-dlp --print id "${url}"`).toString().trim()
    const uploadDate = execSync(`yt-dlp --print filename -o "%(upload_date>%Y-%m-%d)s" "${url}"`).toString().trim()
    const id = `content/${videoId}`
    const final = `content/${uploadDate}-${videoId}`
    const baseModel = "whisper.cpp/models/ggml-base.bin"
    const mdContent = [
      "---",
      `showLink: "${execSync(`yt-dlp --print webpage_url "${url}"`).toString().trim()}"`,
      `channel: "${execSync(`yt-dlp --print uploader "${url}"`).toString().trim()}"`,
      `channelURL: "${execSync(`yt-dlp --print uploader_url "${url}"`).toString().trim()}"`,
      `title: "${execSync(`yt-dlp --print title "${url}"`).toString().trim()}"`,
      `publishDate: "${uploadDate}"`,
      `coverImage: "${execSync(`yt-dlp --print thumbnail "${url}"`).toString().trim()}"`,
      "---\n"
    ].join('\n')
    fs.writeFileSync(`${id}.md`, mdContent)
    execSync(`yt-dlp -x --audio-format wav --postprocessor-args "ffmpeg: -ar 16000" -o "${id}.wav" "${url}"`)
    execSync(`./whisper.cpp/main -m "${baseModel}" -f "${id}.wav" -of "${id}" --output-lrc`)
    const lrcPath = `${id}.lrc`
    const txtPath = `${id}.txt`
    const lrcContent = fs.readFileSync(lrcPath, 'utf8')
    const txtContent = lrcContent.split('\n')
      .filter(line => !line.startsWith('[by:whisper.cpp]'))
      .map(line => line.replace(/\[\d{2}:\d{2}\.\d{2}\]/g, match => match.slice(0, -4) + ']'))
      .join('\n')
    fs.writeFileSync(txtPath, txtContent)
    const finalContent = [
      fs.readFileSync(`${id}.md`, 'utf8'),
      fs.readFileSync('prompt.md', 'utf8'),
      txtContent
    ].join('\n')
    fs.writeFileSync(`${final}.md`, finalContent)
    // Clean up intermediate files
    fs.unlinkSync(`${id}.wav`)
    fs.unlinkSync(lrcPath)
    fs.unlinkSync(txtPath)
    fs.unlinkSync(`${id}.md`)
    console.log(`Process completed successfully for URL: ${url}`)
  } catch (error) {
    console.error(`Error processing video: ${url}`, error)
  }
}

function processPlaylist(playlistUrl) {
  const urls = execSync(`yt-dlp --flat-playlist -s --print "url" "${playlistUrl}"`).toString().split('\n').filter(Boolean)
  urls.forEach(url => {
    processVideo(url)
  })
}

function processUrlsFile(filePath) {
  if (!fs.existsSync(filePath)) {
    console.error(`File not found: ${filePath}`)
    return
  }
  const urls = fs.readFileSync(filePath, 'utf8').split('\n').filter(Boolean)
  urls.forEach(url => {
    processVideo(url)
  })
}

program.parse(process.argv)

Run on a single YouTube video:

Terminal window
node autogen.js video "https://www.youtube.com/watch?v=jKB0EltG9Jo"

Run on multiple YouTube videos in a playlist:

Terminal window
node autogen.js playlist "https://www.youtube.com/playlist?list=PLCVnrVv4KhXMh4DQBigyvHSRTf2CSj129"

Run on an arbitrary list of URLs in urls.md:

Terminal window
node autogen.js urls urls.md
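
Optionally, you can wire the CLI up as an npm script so you don’t have to type node autogen.js each time (my own convention rather than anything the tooling requires):

Terminal window
npm pkg set scripts.autogen="node autogen.js"
npm run autogen -- video "https://www.youtube.com/watch?v=jKB0EltG9Jo"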

Example Show Notes and Next Steps

Here’s what ChatGPT generated for Episode 0 of the Fullstack Jamstack podcast:

---
showLink: "https://www.youtube.com/watch?v=QhXc9rVLVUo"
channel: "FSJam"
channelURL: "https://www.youtube.com/@fsjamorg"
title: "Episode 0 - The Fullstack Jamstack Podcast with Anthony Campolo and Christopher Burns"
publishDate: "2020-12-09"
coverImage: "https://i.ytimg.com/vi_webp/QhXc9rVLVUo/maxresdefault.webp"
---
## Potential Titles
1. "Unpacking FSJam: A New Era of Web Development"
2. "From Jam to Fullstack: Revolutionizing Web Architecture"
3. "Navigate the FSJam Landscape: Tools, Frameworks, Community"
## Episode Summary
The podcast explores Fullstack Jamstack's principles, from basic Jamstack components to advanced tools like Prisma and meta frameworks, emphasizing community dialogue and development practices.
This episode of the Fullstack Jamstack podcast, hosted by Anthony Campolo and Christopher Burns, delves into the essence and philosophy of Fullstack Jamstack, a modern web development architecture. Starting with a basic introduction to the Jamstack components (JavaScript, APIs, Markup), the hosts expand into discussing the evolution from monolithic architectures to more decoupled, service-oriented approaches that define Fullstack Jamstack. They explore the significance of tools like Prisma for database management, the role of Content Management Systems (CMS), and the transition towards serverless functions. Furthermore, the discussion includes the introduction of meta frameworks like Redwood and Blitz, which aim to streamline the development process by integrating front-end, back-end, and database layers cohesively. The episode emphasizes community building, the exchange of ideas across different frameworks, and invites listeners to participate in the conversation through social media and Discord.
## Chapters
00:00 - Introduction to Fullstack Jamstack and Podcast Goals
Introduction and foundational questions about FSJam, its significance, and the podcast's aim to educate and foster community dialogue.
03:00 - Defining Jamstack: Components and Evolution
Clarification of Jamstack's components, JavaScript, APIs, Markup and its evolution from static sites to dynamic, service-oriented architectures.
08:00 - From Monolithic to Decoupled Architectures
Discussion on the transition from monolithic to decoupled architectures, highlighting the role of CMS and serverless functions in modern web development.
14:00 - Introduction to Prisma and Database Management
Exploration of Prisma's role in FSJam for efficient DB management and the differences between Prisma 1 and 2.
20:00 - Meta Frameworks and the Future of FSJam
Introduction to meta frameworks like Redwood and Blitz, their contribution to simplifying FSJam development, and speculation on future trends.
28:00 - Philosophies of FSJam & Community Engagement
Discussion on the core philosophies of FSJam, the importance of selecting the right tools and frameworks, and encouraging listener engagement through social media and Discord.
## Key Takeaways
1. Fullstack Jamstack represents a modern approach to web development, emphasizing decoupled architectures that separate the front-end from the back-end, enabling more flexible and scalable applications.
2. Tools like Prisma for database management and the adoption of meta frameworks (Redwood, Blitz) are pivotal to simplify and enhancing the development process for FSJam apps.
3. Community engagement and exchange of ideas across different frameworks are essential for the growth and evolution of FSJam, encouraging developers to contribute, learn, and collaborate.
## Transcript

This workflow is fine for me because I only create a podcast episode every week or two, so I can just copy and paste the transcript into ChatGPT and copy out the result. However, it’s very possible that you could have dozens or even hundreds of episodes that you want to run this process on.

To achieve this in a short amount of time, you’ll need to use the OpenAI API and drop a bit of coin to do so. In my next blog post, I’ll be showing how to achieve this with OpenAI’s Node.js wrapper library. Once that blog post is complete I’ll update this post and link it at the end.