Name: Speech & Audio Datasets
Creator: Sapien
License: https://www.sapien.io/terms
Keywords: audio datasets, speech datasets, voice recognition, speech recognition, multilingual datasets

At Sapien, we specialize in providing curated Speech & Audio datasets that are diverse, accurate, and ready to use. Whether you're developing voice assistants, transcription tools, or advanced language processing systems, our offerings include top-tier speech recognition datasets, audio classification datasets, and speech-to-text datasets tailored to your project's unique needs. Every dataset is crafted to maintain privacy, accuracy, and usability.

Medical Dialogues

Our medical speech datasets range from physician-patient conversations to other healthcare-specific audio, and are built for precision and compliance. They are well suited to applications in telemedicine, medical transcription, and healthcare AI.

25,000+ Hours of Audio Files: Includes physician-patient conversations across 31 languages.
Formats Available: Digital recordings (MP4), transcripts (TXT/PDF), and rich metadata.
Compliance: HIPAA-compliant datasets adhering to Safe Harbor Guidelines.

Multilingual Speech

Expand your AI’s reach with audio datasets for speech recognition covering diverse languages, dialects, and accents. Ideal for training translation models, voice assistants, and language learning tools.

30+ Global Languages: Including underrepresented dialects.
Flexible Formats: Audio recordings paired with transcripts and annotations.
Applications: Multilingual customer service bots, language tools, and transcription services.

Music Tracks

Curated music datasets for applications in music recommendation systems, composition AI, and entertainment platforms. Each music genre classification dataset includes detailed metadata for genre, tempo, and instrumentation.

Genre Diversity: Rock, jazz, classical, electronic, and more.
Detailed Metadata: Including tempo, key, and instrument annotations.
Applications: Music analysis, streaming platform personalization, and AI-generated compositions.

Transcribed Legal Depositions

Accurate speech-to-text datasets from legal settings, enabling advancements in legal transcription tools, case review automation, and compliance technologies.

Verified Transcripts: Covering legal discussions, depositions, and proceedings.
Comprehensive Formats: Audio files (MP4) paired with transcripts and metadata.
Use Cases: Legal transcription, case management AI, and compliance systems.

Podcasts and Audiobooks

Tap into a rich variety of audio classification datasets from podcasts and audiobooks. Ideal for sentiment analysis, content categorization, and recommendation engines.

Wide Selection: Content spanning education, entertainment, and storytelling genres.
Detailed Annotations: Speaker identification, timestamps, and sentiment markers.
Applications: Content recommendation engines, sentiment analysis, and transcription tools.

Let's Talk

Have a specific dataset need or a question? Contact us today, and we’ll help you find the perfect solution.
