Research Brief · April 2026

EAII Data Sources,Acquisition & Legal Risk Analysis

A complete catalog of identified data sources for the Phase 2 proprietary emotional reasoning engine at Human Discovery, Inc. Every source is classified by tier, license, cost, volume, and legal risk, synthesized by the cluster swarm of research agents on the cloud.

Human Discovery, Inc. Emotional AI Infrastructure HIPAA · FDA SaMD aware
Research Agents Deployed
293
Individual deep research agents deployed off the VPS server, each with their own unique search criteria.
Total Sources
4,550
Total number of sources captured, scanned, and added to research data tank.
Data Sources Cataloged
94
Datasets, providers, and generation pathways classified by tier in this brief.
Cheapest Commercial Path
$8,000
MSP-Podcast flat license (UT Dallas)
Tier Key GREEN, Permissive / Free YELLOW, Negotiate / Vet RED, Non-Commercial / TTO Required PAID, Commercial DaaS SYNTHETIC, Generation Pathways INSTITUTIONAL, Framework Licensing

00 Overview

This document synthesizes all research conducted on data acquisition for the EAII Phase 2 proprietary emotional reasoning engine. The core legal challenge: most "gold standard" academic emotional datasets carry CC BY-NC (Non-Commercial) restrictions that prohibit direct commercial use. This summary catalogs every identified data source, its legal status, cost, volume, and risk level , organized by tier.

01 Quick Reference: Risk Tiers

Four tiers govern acquisition strategy. Match every source to its tier before use.

Tier Description Strategy
🟢 GREEN Fully permissive (Apache 2.0, MIT, CC0) Use freely
🟡 YELLOW Requires negotiation or paid license Budget and negotiate
🔴 RED Non-commercial only under standard terms Do not use without TTO agreement
🔵 PAID Commercial Data-as-a-Service providers Purchase with indemnification

02 Open-Source / Permissive Datasets 🟢 GREEN Tier

Explicitly licensed for commercial use. Lowest IP risk. Some entries carry caveats, read the Notes column carefully.

Dataset Modality License Commercial? Academic? Volume Est. Cost Legal Risk Notes
CPED
(Chinese Personalized & Emotional Dialogue)
Text Apache 2.0 ✅ Yes Yes 12K dialogues / 133K utterances / 390K+ tokens Free Low 13 emotions, Big Five personality traits, Chinese-centric but cross-lingual usable
GoEmotions (Google) Text Apache 2.0 ✅ Yes Yes 58,000 utterances Free Low 27 fine-grained emotion categories; derived from Reddit comments
CMU-MOSEI ⚠️
(Kaggle / SDK distribution)
Text + Audio + Video Apache 2.0 (SDK/Kaggle only) ⚠️ Yellow Yes 23,500+ clips; 65+ hours Free Med-High ⚠️ Official CMU repo is research-only. Apache 2.0 applies to SDK/Kaggle distribution only. Underlying YouTube content has unresolved platform ToS + creator copyright. Requires IP counsel before commercial use. Treat as YELLOW.
PersonaChat / ConvAI2 (via ParlAI) Text MIT ✅ Yes Yes 10,000+ dialogues Free Low Personality-conditioned chit-chat; useful for E-DNA persona modeling
DailyDialog ⚠️ Text CC0 / Public Domain (Kaggle) ✅ Yes (see caveats) Yes 13,000+ dialogues Free Medium ⚠️ Contradictory licensing: CC0 on Kaggle, CC BY-NC-SA in original 2017 ACL publication. Must vet the specific distribution source before ingestion.
CounselChat Text MIT ✅ Yes Quasi-academic ~1,400 interactions Free Low Real licensed counselor-seeker interactions; high-fidelity "helping skills"
Generated-Recovery-Support-Dialogues Text MIT ✅ Yes No ~1,100 dialogues Free Very Low Synthetic; addiction recovery and motivational interviewing
CREMA-D
(Crowd-sourced Emotional Multimodal Actors)
Audio + Video ODbL / CC BY 4.0 ✅ Yes Yes 7,442 clips / 91 actors Free Low Must reference the database; explicit commercial allowance in docs
Dolly-v2 ⚠️ (Databricks) Text CC BY-SA 4.0 ✅ Yes (share-alike risk) No 15,000 pairs Free Medium ⚠️ CC BY-SA = Share-Alike contagion risk if weights are derivative works. Use in isolated LoRA adapter only, do not merge into base model.
OIG / Open Instruction Generalist Text Apache 2.0 ✅ Yes No 44M+ rows Free Low Massive instruction-tuning set; useful for emotional reasoning scaffolding
DeepDialogue Text + Synthetic Audio Permissive (MIT-adjacent) ✅ Yes No 40,150 multi-turn dialogues Free Low Multi-domain with explicit emotional progressions and synthesized voices
MELD (Dual License via Yale OCR) Text + Audio + Video Dual: Open Source / Commercial ✅ Yes (commercial license required) Yes 13,000+ utterances Negotiated Low-Med Derived from Friends TV series. Krishnaswamy Lab at Yale provides commercial license for software developers. Contact Yale Office of Cooperative Research.
Kaggle Multimodal ER Dataset Physiological + Voice CC0 (Public Domain) ✅ Yes No 250 participants; EDA + HRV + voice tone Free Very Low Rare CC0 physiological dataset; synchronized EDA, HRV, and voice
WESAD ⚠️
(Wearable Stress & Affect Detection)
Physiological (ECG, EDA, EMG, Temp, Motion) CC BY-SA / CC BY-NC 4.0 ⚠️ Conditional Yes Multi-subject wearable sensor data Free Med-High ⚠️ Some distributions are CC BY-SA (share-alike risk); others CC BY-NC. Verify source. Standard distribution is research-only, do NOT use in commercial builds without explicit legal vetting.
PhysioNet (multiple datasets) Physiological (ECG, PPG, EDA) PhysioNet License / Publicly available ✅ Yes (open research) Yes Multiple multi-channel datasets Free Low "Facilitates open research in both academic and commercial settings." Confirm individual dataset terms before ingestion.
K-EmoPhone ⚠️ Physiological + Mobile + Context CC BY 4.0 ⚠️ Conditional Yes Mobility, EDA, BVP, context data Free (if approved) Medium Commercial entities must undergo "rigorous review" by KAIST before access is granted. Treat as YELLOW until written approval obtained.
OASST1 / OASST2 (Open-Assistant) Text Apache 2.0 ✅ Yes Yes 161,443+ messages / 35 languages / multi-turn trees Free Low Human-crowdsourced (13,500+ volunteers); includes emotional/helpfulness/toxicity labels; born-open, bypasses derivative copyright risk
CACTUS (CBT Dataset) Text Apache 2.0 ✅ Yes Yes 31,564 CBT dialogues / ~1M utterances Free Low Evaluated with Cognitive Therapy Rating Scale (CTRS); gold-standard clinical validation; strong multi-turn depth
ChatThero Text CC BY 4.0 ✅ Yes Yes Multi-session substance-use recovery episodes with persistent memory Free Low Models multi-session therapy continuity + stressor-aware adaptations; multi-agent simulation
PsychEval / PsychAgent Text CC BY 4.0 ✅ Yes Yes 2,000+ synthetic client profiles × 6-10 continuous sessions Free Low Models "arc of therapy" across sessions; long-horizon dialogue continuity
HelpSteer2 (NVIDIA) Text CC BY 4.0 ✅ Yes Yes 10,000+ prompts with fine-grained ratings Free Low Professional curation for helpfulness, correctness, emotional tone; RLHF-ready for supportive contexts
ExTES (Exemplary Emotional Support) Text CC BY 4.0 ✅ Yes Yes 11,177 dialogues with support-strategy labels Free Very Low 100% synthetic; recursive LLM generation with strategy labels (clarification, affirmation); zero PII risk
EChat-200K (ASLP@NPU) Audio (speech-to-speech) + Text Apache 2.0 ✅ Yes Yes 200,000 empathetic dialogues; single/multi-label subsets Free Low Paralinguistic cues (jitter, shimmer) intact; real + synthetic audio mix
Phi-4-Empathetic Text MIT ✅ Yes No Tied to Microsoft Phi-4 ecosystem Free Low Chain of Thought + DPO + SFT for empathetic alignment
NeuroFeel Text Apache 2.0 ✅ Yes Yes ~10,000 samples / 13 nuanced emotions Free Low Balances underrepresented emotions via synthetic augmentation; real social + ChatGPT-augmented
SyntAct Audio + Video MIT ✅ Yes Yes Synthesized basic emotional expressions Free Very Low 100% synthetic; speech + facial expression validation
Psych8k / ChatPsychiatrist Text Apache 2.0 ✅ Yes Yes ~8,000 clinical Q&A pairs (transcribed + scrubbed) Free Low Therapeutic dimensions labeled (Direct Guidance, Approval, Restatement/Reflection)
Student-Mental-Health (counseling-vn) Text Apache 2.0 ✅ Yes No Student-focused mental health interactions Free Low Domain-specific: student mental health
SSConv / SocialSim (Stanford SALT Lab) Text Apache 2.0 (expected) ✅ Yes Yes Large-scale synthetic emotional support Free Low Simulates Social Disclosure & Social Awareness; human evaluators rated higher than crowdsourced for "logical supportiveness"
UltraChat (filtered for affect) Text Apache 2.0 ✅ Yes (requires subsetting) No Large-scale synthetic conversations (GPT-3.5) Free Low Requires affective filtering; large-scale synthetic instruction-following
SQPsychConv Text Apache 2.0 (expected) ✅ Yes Yes CBT-framework client-therapist dialogues (GPT-4o) Free Low Structured CBT framework; bypasses privacy barriers of real clinical data
Medical-o1-reasoning-SFT Text Apache 2.0 ✅ Yes Yes 90,120 open-ended questions with CoT reasoning (GPT-4o) Free Low Medical reasoning chains; SFT-optimized
WildJailbreak Text ODC-BY ✅ Yes Yes Safety / behavioral interaction dataset Free Low Safety filtering insights; behavioral patterns
Self-Instruct Text Permissive ✅ Yes Yes 82,646 LLM-generated prompts from human seed Free Low Instruction-following diversity; LLM self-generation methodology
LongForm Text MIT ✅ Yes Yes 27,000 English instruction-following examples Free Low QA + story generation diversity; reverse-engineered instructions
MindCorpus (Jan 2026) Text Permissive (academic release) ✅ Yes Yes 5,700 realistic therapeutic sessions Free Very Low 100% synthetic; dual-loop Seeker/Supporter agents; differential privacy verified
MDD-5k Text Permissive (academic release) ✅ Yes Yes 5,000 long conversations / 25 mental illnesses / 26.8 turns avg Free Very Low Largest synthesized diagnostic dataset; neuro-symbolic LLM agents
ConvoSense Text MIT ✅ Yes Yes 500,000+ commonsense inferences / 12,000 dialogues Free Low GPT-generated commonsense reasoning for empathy; improves coherence
Synth-Empathy (Jul 2024) Text Permissive ✅ Yes Yes LLM pipeline with diversity selection Free Low Discards low-quality outputs; diversity module; designed to improve empathy benchmarks
Multi-Speaker Emotional Speech (Magic Data) Audio Permissive (commercial intent) ✅ Yes No Multi-speaker emotionally expressive speech Free Low Engineered for commercial LLM fine-tuning TTS
Amod/mental_health_counseling_conversations Text RAIL-D ✅ Yes ($100 donation) No Mental health counseling conversations ~$100 donation Very Low Unique donation-based commercial pathway to mental health foundation
Synthetic Therapy Conversations (Kaggle, Jerry Yao) Text CC0 ✅ Yes No Patient-therapist role interactions Free Very Low 100% synthetic; no PII risk
SMILE / MeChat Text CC BY 4.0 ✅ Yes Yes 55,000 synthetic counseling conversations Free Low ChatGPT-expanded from single-turn Q&A to multi-turn counseling
SMILE-College Text Permissive ✅ Yes Yes College mental health sentiment data Free Low Human-machine collaborative; LLM empathetic performance evaluation
IDRE (Italian Rephrasing w/ Empathy) Text CC BY 4.0 ✅ Yes Yes Italian healthcare chatbot responses Free Low Italian language; Llama 2-optimized empathetic healthcare responses
LMSYS-Chat-1M ⚠️ (prompts only) Text CC BY 4.0 (prompts) / CC BY-NC 4.0 (outputs) ⚠️ Prompts only Yes 1,000,000 conversations across 25 LLMs Free Medium ⚠️ Bifurcated license. Harvest prompts only; regenerate outputs with open-weight models. Do NOT train on model outputs.
EDOS ⚠️ (Empathetic Dialogue at Scale) Text CC BY 4.0 ⚠️ Conditional Yes Large-scale dataset derived from movie subtitles Free Medium ⚠️ Derived from copyrighted film subtitles, upstream copyright risk despite CC BY 4.0 surface license. 32 emotion labels + 8 empathy intents.
MentalChat16K ⚠️ (synthetic subset) Text Apache 2.0 / MIT (synthetic) · TTO (real) ⚠️ Synthetic subset OK Yes 16,113 QA pairs (anonymized real + synthetic GPT-3.5) Free Med-High ⚠️ Bifurcated: synthetic subset safe; real transcript portion may require Stanford OTL clearance. Generated via Airoboros framework.

03 Restricted Academic Datasets 🔴 RED Tier

Require a negotiated commercial license through a University Technology Transfer Office. Using under standard research EULA in a commercial product is direct copyright infringement.

Dataset Institution Modality Standard License Commercial Path Est. Commercial Cost Legal Risk Notes
IEMOCAP
(Interactive Emotional Dyadic Motion Capture)
USC / ICT / SAIL Audio + Video + Motion Capture Research-only EULA USC Stevens Center for Innovation $10,000-$50,000+ (upfront + annual + royalties) HIGH Industry gold standard; 12 hours of dyadic motion capture + audio + video. Most sought-after emotional multimodal dataset.
DAIC-WOZ USC ICT Audio + Video + Text Non-commercial only USC Stevens Center $10,000-$50,000+ (negotiated) HIGH Clinical interview corpus for PTSD/depression detection.
CMU-MOSEI ⚠️
(Official CMU distribution)
CMU / MultiComp Lab Text + Audio + Video CC BY-NC 4.0 CMU CTTEC (Flintbox) Case-by-case; CMU startup terms: 3% equity + 2% royalties HIGH Distinct from the Apache 2.0 Kaggle version. Underlying data is scraped YouTube, platform ToS risk remains even post-license.
CMU-MOSI CMU Text + Audio + Video CC BY-NC 4.0 CMU CTTEC Case-by-case HIGH 2,199 clips; sentiment intensity focus
Empathetic Dialogues 🚫
(Meta/Facebook AI)
Facebook AI Research Text CC BY-NC Not publicly available Not offered commercially HIGH 25,000 conversations; dominant benchmark. Do NOT use. Replace with GoEmotions + CounselChat + CPED.
ESConv / AugESC 🚫 Tsinghua University Text Non-commercial Not publicly offered Not available HIGH Emotional support conversation gold standard; strictly non-commercial
DEAP 🚫
(EEG + Peripheral)
Queen Mary / various EEG + Peripheral Non-commercial Contact authors Not public HIGH Brain-wave + physiological emotional response data; strictly non-commercial
SEED / SEED-V 🚫
(SJTU EEG)
SJTU EEG Non-commercial EULA Contact SJTU BCMI Lab Not public HIGH Large-scale EEG emotional dataset; EULA explicitly non-commercial
MSP-Podcast ✅
(best value academic)
UT Dallas Multimodal Speech Processing Lab Audio Research EULA Direct purchase from lab (lab-msp.com) $8,000 flat MEDIUM ✓ One of the few academic sets with a fixed public commercial price. Best value for commercial voice emotion data. Confirm pricing directly before budgeting.
RAVDESS Ryerson University Audio + Video Research default Ryerson Affective Data Science Lab (license fee page) Commercial license available (fee not publicly disclosed) MEDIUM ✓ 7,356 files; 24 professional actors; 8 emotions: calm, happy, sad, angry, fearful, surprise, disgust
AM-FED
(Affectiva-MIT Facial Expression)
MIT / Affectiva (now Smart Eye) Video (facial) Academic legacy Direct enterprise license from Smart Eye Not public HIGH Acquired by Smart Eye; must go through enterprise sales channel
Social-IQ CMU Video + Text Research CMU CTTEC / USM Modular pricing HIGH Social intelligence; emotion + intent
OMG-Empathy MIT Media Lab (Picard group) Video (facial) + Multimodal CC BY-NC MIT TLO Negotiated MEDIUM Affective Computing Group dataset; facial + audio + physiological empathy recognition
CAST-Phys MIT Media Lab Video + Physiological (PPG, EDA, Resp, Thermal) CC BY-NC MIT TLO Negotiated MEDIUM 140 participants; 3D+2D facial + thermal + synchronized physiology; contactless remote emotion estimation
AffectiveROAD MIT Media Lab Video + Physiological (Empatica E4, Zephyr Bioharness) CC BY-NC MIT TLO Negotiated MEDIUM Real-world driver stress with synchronized road scene + physiology
MER2025 MIT TLO pathway Video + Audio + Text CC BY-NC 4.0 MIT TLO Negotiated MEDIUM Continuous + discrete emotion tracking; "Affective Computing Meets LLMs" integrated design
RECOLA / DAMI-P2C MIT Media Lab (Picard group) Audio + Video + Physiological Research-only MIT TLO Negotiated MEDIUM Dyadic spontaneous interactions
FeedbackESConv Stanford HAI Text Negotiable via Stanford OTL Stanford OTL / AIMI $70,000/yr (AIMI pricing pattern) MEDIUM 400 conversations; multi-level feedback labels from professional psychotherapy supervisors

04 University Institutional Licensing 🟠 YELLOW / Institutional Tier

These are licensing frameworks, not individual datasets, that unlock broad access to multiple datasets at once.

Institution Office Framework Annual Cost What You Get Best For Risk Level
MIT Media Lab Technology Licensing Office (TLO) Consortium Lab Member (CLM) $50,000-$250,000/yr
(3-year commitment)
Non-exclusive royalty-free rights to ALL IP and datasets created during membership Affective Computing Group data, Driver Stress, multi-modal sets Low once enrolled
MIT Media Lab TLO Bespoke Commercial License (single dataset) ~$20,000 issue fee
+ annual maintenance + royalties
Field-of-use restricted license for specific dataset Single-dataset acquisition Low once executed
USC Stevens Center for Innovation Stevens Center Negotiated Technology Transfer $10,000-$150,000+
(case-by-case)
Commercial rights to IEMOCAP, DAIC-WOZ, CreativeIT Dyadic motion capture + clinical data Low once signed
USC (new 2026 policy) Stevens Center Startup Launch Agreement Equity stake
(amount negotiated)
IP / data in exchange for equity; USC may cover legal formation costs Early-stage startups seeking IEMOCAP-class data without cash outlay Low / strategic
CMU CTTEC Center for Technology Transfer & Enterprise Creation Startup Terms 6% equity (exclusive)
or 3% equity (non-excl.) + 2% royalties
Commercial rights to MOSEI, MOSI, Social-IQ Startups with strong CMU relationship Low once signed
CMU CTTEC CTTEC Express License (Flintbox) Variable Faster clearing of specific datasets Individual dataset licensing Low
Stanford AIMI AIMI Center Annual Commercial License $70,000/yr per dataset (FY25) Commercial rights to AIMI-managed clinical/affective datasets; committee approval + mission alignment check; renewable annually Medical / clinical emotional data Low once enrolled
Stanford OTL Office of Technology Licensing Option Agreement (pre-funded startups) Deferred (option fee) Reserves commercial rights while seeking funding; defers full license until funded Early-stage startups not yet ready for full commercial license Low once signed
Stanford Center for Precision Mental Health Stanford Corporate Members Affiliate Per project Longitudinal de-identified EHR + counseling datasets Clinical / longitudinal emotional data Medium
MIT Media Lab TLO Sponsored Research Agreement (SRA) Negotiated corporate sponsorship Alternative to direct licensing; sponsorship unlocks prototypes + commercial access. Historical spin-outs: Affectiva, Empatica Deeper strategic partnership Low once signed

05 Commercial Data-as-a-Service Providers 🔵 PAID Tier

Highest legal safety, data collected with explicit commercial consent, full indemnification. Correct path for voice, physiological, and robotics data.

Provider Modality License / Indemnification Volume Available Pricing Legal Risk Strengths Weaknesses / Caveats
Defined.ai Text + Audio + Video + Emotion Proprietary / Full commercial indemnification 99,500+ hours meeting recordings; 315-3,125 hrs emotional speech (tiered) $71,500 (Standard, 315 hrs)
to
$1,111,000 (Elite, 3,125 hrs)
Very Low "Ethically sourced"; rights-cleared; marketplace for spontaneous dialogue + emotionally expressive speech; Neural Voice Conversion for anonymization ⚠️ BIPA RISK: Anonymization (neural voice conversion) may NOT fully remove biometric identifiers. Residual jitter/shimmer may qualify as biometric information ($1K-$5K per violation). Request explicit BIPA compliance certification before use.
Appen Text + Audio + Physiological + Robotics Proprietary / Full commercial indemnification 500+ locales; robotics demonstration trajectories; embodied interaction logs $93,000-$150,000+ (enterprise annual)
$10,000+ for pilots
Very Low "Physical AI" capabilities for robotics; LiDAR, embodied interaction, RLHF expert validation; highest emotion fidelity via human actor re-recording; broadest modality coverage Expensive for bespoke custom collection
Scale AI Text + Multimodal + Synthetic Proprietary / Full commercial indemnification Custom per contract Custom contracts Very Low Automated pre-labeling + human review; good for high volume Variable quality for high-nuance emotional tasks; synthetic voices lack micro-prosody
Twine AI Audio (Voice) Proprietary / CCPA+GDPR compliant Custom demographic-specific Custom Very Low Targets specific demographics; custom emotional tone recording; custom consent forms Smaller scale than Appen / Scale
Rwazi Audio (Voice) Proprietary / Commercial rights Custom Custom Very Low "Real world" emotional recording; explicit commercial AI training consent Limited public documentation
Empatica Physiological (PPG, EDA, Temperature) Commercial Research Agreement Custom via EmbracePlus wearable program Per research partnership Low Medical-grade hardware; FDA-cleared EmbracePlus; best-in-class for clinical physiological emotional ground truth Requires in-lab or partnered collection setup
BIOPAC / BioNomadix Physiological (EDA, PPG, ECG) Proprietary / Commercial SDK Custom via Research Ring / Logger Hardware + SDK pricing Low Industry standard for synchronized EDA+PPG+ECG; used in major academic studies commercially re-implemented Hardware procurement overhead
iMotions Physiological (EEG, GSR, Eye Tracking) Commercial SDK Custom integration Enterprise SDK pricing Low Real-time biometric pipeline integration; EEG + GSR + Eye Tracking in one SDK Expensive licensing; requires hardware integration

06 Synthetic Data Generation Pathways 🟣 SYNTHETIC

Not datasets, legal strategies for generating training data without acquiring external sets.

Method Tooling Commercial Use? Legal Risk Volume Potential Cost Key Risk / Notes
Azure OpenAI (GPT-4o) Microsoft Azure API ✅ Yes (Safe Harbor via Enterprise ToS) Low Unlimited API costs + Azure subscription Microsoft Enterprise Agreement provides data ownership; reduces "competing model" exposure vs. direct OpenAI API. Recommended path.
OpenAI API (direct) OpenAI API ⚠️ Gray area Medium Unlimited API costs ToS prohibits training "competing models"; application-layer emotional engines may be permissible. Risk of account revocation.
Anthropic / Claude API (direct) Anthropic API ⚠️ Gray area Medium Unlimited API costs Same "competing model" concern as OpenAI. Anthropic explicitly permits sentiment analysis / content categorization tools.
Open-Weight LLMs
(Llama 3/4, Mistral, DeepSeek-R1)
Self-hosted ✅ Yes Very Low Unlimited Compute costs only DeepSeek-R1 (MIT), Llama 3/4 (Meta Community License, verify commercial use terms). Best path for zero ToS risk.
NVIDIA Isaac Simulator NVIDIA Isaac Sim ✅ Yes Very Low Unlimited synthetic Software licensing Multimodal synthetic interaction generation for embodied robotics scenarios
LLM-as-Annotator
(label, not generate)
Any LLM Lower risk Low-Med High throughput API costs Using LLMs to label human-collected data (vs. generate it) is lower risk and generally permissible under most ToS. Preferred annotation strategy.
Nous Research Fine-tune (Llama 3.1 405B) Self-hosted ✅ Yes Very Low Unlimited Compute costs Validated Sep 2025 for synthetic CBT transcripts; open-weight avoids OpenAI ToS "competing model" exposure. Recommended ToS-safe generator.
Airoboros Framework Open-source ⚠️ Gray area Medium High API costs Self-generation via GPT-3.5 Turbo (used in MentalChat16K); outputs may violate OpenAI ToS if used for competing LLM training
CosyVoice2 + GPT-4 (2026 preprint) Hybrid ✅ Yes (w/ open-weight substitution) Low-Med Unlimited API + compute Zero-shot TTS for synthesizing spoken empathetic dialogues; expected MIT license release
Multi-Agent Simulation (MindCorpus pattern) Any LLM ✅ Yes Low Unlimited API costs Dual-loop Seeker/Supporter agents refining therapeutic responses; differential privacy verification built in

07b Compliance Tooling Reference

Tools required to operationalize the regulatory framework above, especially for clinical / FDA SaMD pathways.

Tool License Purpose Why It Matters
Microsoft Presidio (+ MedicalNERRecognizer) MIT PII / PHI de-identification Required for secondary cleaning. PIIBench: GPT-4o-mini 0.95 recall vs. Presidio <0.14 F1 alone,pair them.
ConsentOS Commercial Record-level consent tracking & revocation FDA SaMD "Golden Thread" compliance; handles real-time consent revocations
Knish.IO Commercial Quantum-secure cryptographic signatures Chain of custody for regulatory audits; FDA 21 CFR Part 11
Mostly AI Commercial Synthetic data generation platform PII-safe synthetic generation
PIIBench Open benchmark PII scrubbing efficacy evaluation Benchmarks scrubber recall across tools
NEXUS Compliance Study (2025): Audited 17,429 data entities. Only 21% of datasets with permissive individual licenses were actually commercially viable once upstream dependencies were traced. Strict provenance vetting is required even for Apache 2.0 / MIT datasets on Hugging Face. An Apache 2.0 label on a Hugging Face card is not a clearance, check the upstream source.

08 CC License Contagion Risk, Share-Alike Warning

Understanding how CC licenses interact with model weights is critical for IP protection.

License Type Can Train On? Risk if Weights Are Derivative Works Recommended Mitigation
CC0 / Public Domain ✅ Yes, freely None No mitigation needed
MIT ✅ Yes, freely None No mitigation needed
Apache 2.0 ✅ Yes, freely None No mitigation needed
CC BY 4.0 ✅ Yes (with attribution) Low Maintain attribution records
CC BY-SA 4.0 ⚠️ ⚠️ Risky Share-Alike contagion, may force weights to be open-sourced Isolate training to a separate LoRA adapter only; do not merge into base model weights
CC BY-NC 4.0 🚫 ❌ No Copyright infringement Do not use for commercial model training
CC BY-NC-SA 4.0 🚫 ❌ No Copyright infringement + share-alike Do not use under any circumstances
Research-only EULA 🚫 ❌ No Copyright infringement Requires TTO commercial license

09 Recommended Acquisition Roadmap

Four-phase strategy from bootstrap to proprietary data moat.

Phase 1, Bootstrap

Months 0-6 · ~$0-$10K
  • CPED (Apache 2.0)
  • GoEmotions (Apache 2.0)
  • CounselChat (MIT)
  • Generated-Recovery-Support-Dialogues (MIT)
  • PersonaChat (MIT)
  • DailyDialog (vet CC0 Kaggle version)
  • Synthetic augmentation: DeepSeek-R1 / Llama 3 (open-weight)
  • LLM-as-annotator on proprietary seed data

Phase 2, Commercial Data Acquisition

Months 3-9 · ~$8K-$150K
  • Purchase MSP-Podcast commercial license, $8,000 flat (UT Dallas)
  • Negotiate MELD commercial license via Yale OCR
  • Purchase Defined.ai Standard tier (~$71,500 / 315 hrs emotional speech)

Phase 3, Institutional & Multimodal

Months 6-18 · ~$50K-$250K
  • MIT Media Lab Consortium membership ($50K-$250K/yr), unlocks all Affective Computing Group IP
  • OR USC Startup Launch Agreement (equity deal), unlocks IEMOCAP + DAIC-WOZ without cash outlay
  • Appen custom collection for embodied / physiological scenarios
  • Empatica EmbracePlus partnership for medical-grade physiological ground truth

Phase 4, Proprietary Flywheel

Months 12-24+ · Operational
  • In-house collection pipeline: deploy product, collect consented real-world emotional data, own the IP
  • DaaS product potential: license proprietary clean multimodal data back to enterprise market
  • Goal: Build an insurmountable proprietary data moat

10 Priority Data Gaps

Known gaps in the data stack and their recommended solutions.

Gap Severity Solution
High-quality English emotional dialogue (commercial) Critical CounselChat + CPED (translated) + Defined.ai custom
Voice prosody with intact micro-prosody (jitter/shimmer) Critical Appen re-enactment method (actors re-record) + MSP-Podcast license
Physiological (EDA, HRV) with commercial rights High PhysioNet + Kaggle CC0 ER dataset + Empatica partnership
EEG emotion data (commercial) Medium No good permissive option; collect proprietary via BIOPAC / OpenBCI
Non-Western / non-English emotional data Medium CPED (Chinese); Defined.ai global locales; Appen 500+ locales
Embodied / human-robot interaction data Medium MultiPhysio-HRC (MDPI 2025); NVIDIA Isaac simulation; Appen Physical AI

10b New Research Labs / Sources to Watch

Labs producing ongoing output relevant to EAII's data stack.

Lab / Org Output Relevant to EAII Access Path
Stanford SALT Lab SocialSim / SSConv synthetic emotional support Published openly
Stanford Center for Precision Mental Health Longitudinal de-identified EHR + counseling precisionmentalhealth@stanford.edu, Corporate Members affiliate
Sonde Health (PureTech spinout) Vocal biomarkers (pitch, hoarseness) for depression screening; licensed from MIT Lincoln Lab Enterprise sales
ASLP Lab / ASLP@NPU EChat-200K audio+text empathetic dialogue Open release
Nous Research Llama 3.1 405B fine-tune for CBT synthesis (Sep 2025) Open-weight
Magic Data Multi-Speaker Emotional Speech Dataset Open release

11 Key Takeaways

  1. The core dataset risk is copyright infringement on CC BY-NC data, not trademark or trade secret. The most-used research benchmarks (Empathetic Dialogues, IEMOCAP, ESConv) are strictly non-commercial. MELD is an exception, it has a commercial dual-license available via Yale OCR.
  2. The safest text data stack: CPED + GoEmotions + CounselChat + DailyDialog (vet source). All Apache 2.0 or MIT, zero cost, adequate volume for Phase 2.
  3. Best single institutional purchase: MSP-Podcast at $8,000, a fixed-price commercial license with no negotiation overhead.
  4. DaaS is the gold standard for legal safety: Defined.ai and Appen provide full indemnification and are the correct path for voice + physiological + robotics data.
  5. Synthetic generation is viable if: (a) using open-weight models (DeepSeek-R1, Llama), OR (b) using Azure OpenAI under Enterprise Agreement, OR (c) using LLMs only as annotators (not generators) of human-collected base data.
  6. The LoRA isolation strategy is the correct IP mitigation for any CC BY-SA contaminated data: keep fine-tuned adapters architecturally separate from base model weights.
  7. BIPA is the highest operational legal risk for voice data. Standard commercial "masking" is not a legal shield if micro-prosodic features (jitter/shimmer) remain extractable, they qualify as biometric identifiers under BIPA.
  8. Synthetic dialogue has exploded since the last inventory. MindCorpus, MDD-5k, CACTUS, ExTES, ChatThero, PsychEval, SMILE, EChat-200K, Psych8k, ConvoSense, Synth-Empathy, and others now form a dense layer of Apache 2.0 / CC BY / MIT counseling-dialogue corpora. Combined with Nous Research's Llama 3.1 405B fine-tune (Sep 2025), the entire synthetic therapeutic dialogue stack can now be built ToS-safe using only open-weight generators and permissively licensed training dialogues, no OpenAI / Anthropic gray areas required.
  9. The NEXUS 21% rule: Do not trust Hugging Face license labels at face value, only ~21% of "permissively licensed" datasets are actually commercially viable once upstream provenance is traced. Vet each dataset's full dependency chain before ingestion.
  10. Clinical deployment raises the bar. HIPAA + FDA SaMD "Golden Thread" (consent logs via ConsentOS, cryptographic lineage via Knish.IO, PHI scrubbing via Presidio + MedicalNERRecognizer) become mandatory if EAII pursues any medical-device classification. Budget compliance tooling separately from dataset acquisition.