PodGist

Podcast Transcripts

Year Filtering

2024-05-30 11:56 by Staff

We'll be rolling out new year filters on the podcast episode list pages. This will make it much easier to page through podcasts that have many years of transcripts.

Accuracy

2024-03-21 21:14 by Staff

We recently made some improvements to the engines we use to generate transcripts, so you should notice a substantial increase in accuracy.

Better Search

2023-02-20 11:07 by Staff

It's technically been live now for a few months, but we're happy to officially announce the successful roll-out of a new search mechanism that allows full-text search for all transcripts site-wide. At the top of the homepage and each podcast's list page, you should now see a search bar that allows quick retrieval of text matches on each page's title and body content.

Providing an affordable full-text search system to index hundreds of thousands of documents, without the use of an expensive dedicated server, is no simple feat. To minimize costs while keeping search performance high, we tried a variety of client-side search tools, like Lunr.js, but these either required rebuilding and re-uploading massive indexes every day, or ran too slowly in the browser given the size of our dataset. Most of these tools were designed to index a simple blog with a few dozen pages, not hundreds of thousands.

We came close to using Sql.js, which worked fairly well in the browser, even if it is still a bit experimental, but its lack of sharding required rebuilding large indexes every day, which wasn't practical.

Eventually, we settled on Pagefindfs, which runs quickly and supports sharding, allowing us to build segmented indexes, so we only have to upload new indexes for the most recent document changes, saving a lot of money and bandwidth.
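To illustrate why segmented indexes save bandwidth, here is a minimal sketch of the general idea, not of how Pagefindfs works internally. It assumes documents are assigned to shards by a stable hash of their ID, and that a per-shard digest from the previous build is kept around for comparison. The names `shard_key`, `shard_digests`, `shards_to_upload` and the shard count of 16 are hypothetical.

```python
import hashlib
from collections import defaultdict


def shard_key(doc_id: str, shard_count: int = 16) -> int:
    """Assign a document to a shard using a stable hash of its ID (assumed scheme)."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % shard_count


def shard_digests(docs: dict[str, str], shard_count: int = 16) -> dict[int, str]:
    """Return one content digest per non-empty shard."""
    buckets: dict[int, list[tuple[str, str]]] = defaultdict(list)
    for doc_id, text in docs.items():
        buckets[shard_key(doc_id, shard_count)].append((doc_id, text))

    digests: dict[int, str] = {}
    for shard, items in buckets.items():
        h = hashlib.sha256()
        # Sort so the digest does not depend on insertion order.
        for doc_id, text in sorted(items):
            h.update(doc_id.encode("utf-8") + b"\0")
            h.update(text.encode("utf-8") + b"\0")
        digests[shard] = h.hexdigest()
    return digests


def shards_to_upload(old: dict[int, str], new: dict[int, str]) -> list[int]:
    """List the shards that are new or whose digest changed since the last build."""
    return sorted(s for s, d in new.items() if old.get(s) != d)
```

With this layout, adding one new transcript changes exactly one shard, so only that shard needs to be re-uploaded rather than the whole index.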

The only limitation it has is that some of the more advanced search features, like filtering by publication date or author, require each transcript page to be explicitly tagged with metadata in a specific Pagefindfs syntax. And since we've been building pages for years, long before we ever knew about Pagefindfs, most pages don't contain those tags. As we move forward, new pages will contain Pagefindfs tags, and given time, we might retroactively parse our old documents and insert the appropriate tags.

RSS

2021-12-26 15:01 by Staff

To help subscribers stay on top of the most recent published transcripts for their favorite podcast, we've added a dedicated RSS feed link to every podcast list page, as well as this news section. You can find the RSS link at the bottom of the podcast description.

User Corrections

2020-11-15 18:00 by Staff

We're very grateful to the users who financially support us by visiting our sponsors or purchasing a subscription. This money helps cover our ever-growing hosting costs as well as the services we use to transcribe content.

Previously, our transcript generation was entirely automated using a combination of cheap speech-to-text engines and inexpensive proprietary services. This allowed us to quickly transcribe hundreds of podcasts a month at a sustainable cost. The downside is that the accuracy of the transcripts wasn't perfect.

Tools and services that provide near 100% accurate transcriptions are already available, but they're very expensive. To keep the site sustainable, we transcribe most audio using multiple inexpensive tools, and when those tools can't agree on a transcription, we use more expensive tools to improve the result. Yet due to cost restrictions, we're still not able to run the more accurate transcribers on all audio. That's a problem we're trying to fix.
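The "escalate only on disagreement" approach can be sketched roughly as follows. This is an illustration under assumptions, not our production pipeline: each engine is modeled as a plain function from audio bytes to text, agreement is checked after a simple case and whitespace normalization, and the names `transcribe_segment` and `normalize` are hypothetical.

```python
from typing import Callable

# Assumption: every engine can be wrapped as a function from audio bytes to text.
Transcriber = Callable[[bytes], str]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't count as disagreement."""
    return " ".join(text.lower().split())


def transcribe_segment(
    audio: bytes,
    cheap: list[Transcriber],
    expensive: Transcriber,
) -> tuple[str, bool]:
    """Return (transcript, escalated).

    Runs every inexpensive engine first. If they all agree after
    normalization, the expensive engine is never called.
    """
    candidates = [engine(audio) for engine in cheap]
    if len({normalize(c) for c in candidates}) == 1:
        return candidates[0], False
    # The cheap engines disagree (or none were supplied): pay for the better result.
    return expensive(audio), True
```

A real system would likely compare engines at the word or phrase level rather than per segment, so that a single disputed word doesn't force a whole segment through the expensive path.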

For a while, at the bottom of every page, we've had a link to a survey to solicit feedback and suggestions. Since the beginning, the top suggestion has been something along the lines of "Improve your darn accuracy!"

In this same survey, several users have also responded that they're unwilling or unable to financially contribute, but they wouldn't mind helping to correct errors in our transcripts. So we've decided to take them up on that offer. Over the next several weeks, we'll be gradually rolling out a change to all transcript pages that will allow logged-in users to submit corrections.

Now, every transcript will have an "Edit" button near the top of the page. Clicking this button will enable editing mode, as well as color-code sections of the transcript that our system suspects need the most correction. The color code is a gradient going from bright red, meaning the transcription is almost certainly wrong, to bright green, meaning it's likely correct or has already been corrected by a human.
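As a rough sketch of the color coding, assume each transcript section carries a confidence score between 0.0 (almost certainly wrong) and 1.0 (likely correct), and that human-corrected sections are treated as fully confident. The function name `confidence_to_color` and the linear red-to-green blend are illustrative assumptions, not the exact colors the site uses.

```python
def confidence_to_color(confidence: float, human_corrected: bool = False) -> str:
    """Map a confidence score to a hex color between bright red and bright green."""
    if human_corrected:
        confidence = 1.0
    # Clamp out-of-range scores instead of failing.
    c = min(1.0, max(0.0, confidence))
    red = round(255 * (1.0 - c))
    green = round(255 * c)
    return f"#{red:02x}{green:02x}00"


print(confidence_to_color(0.0))                        # #ff0000
print(confidence_to_color(1.0))                        # #00ff00
print(confidence_to_color(0.2, human_corrected=True))  # #00ff00
```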

In editing mode, clicking a section of the transcript will display a modal dialog allowing you to submit corrected text. Once the correction is reviewed by our staff, it'll be incorporated into the complete transcript.

We don't expect anyone to submit corrections for free. In exchange for making a minimum of 30 corrections, a user will receive a 1-month subscription to all ad-free content. After that, every additional correction will extend the subscription by 1 day. We'll likely adjust these levels as we get feedback.
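The reward levels work out as in the sketch below. It assumes a 1-month subscription is counted as 30 days, which the announcement doesn't specify. The function and parameter names are hypothetical, and the defaults simply mirror the numbers above, which may change.

```python
def subscription_days(
    corrections: int,
    threshold: int = 30,                # corrections needed to unlock the subscription
    base_days: int = 30,                # assumption: "1 month" counted as 30 days
    corrections_per_extra_day: int = 1,
) -> int:
    """Return the total ad-free days earned for a given number of corrections."""
    if corrections < threshold:
        return 0
    extra_days = (corrections - threshold) // corrections_per_extra_day
    return base_days + extra_days


print(subscription_days(29))  # 0
print(subscription_days(30))  # 30
print(subscription_days(45))  # 45
```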

We hope that this feature opens a new path for improving accuracy while also allowing users to become more engaged with the site.