00 Overview
01 Quick Reference: Risk Tiers
Four tiers govern acquisition strategy. Match every source to its tier before use.
| Tier | Description | Strategy |
|---|---|---|
| 🟢 GREEN | Fully permissive (Apache 2.0, MIT, CC0) | Use freely |
| 🟡 YELLOW | Requires negotiation or paid license | Budget and negotiate |
| 🔴 RED | Non-commercial only under standard terms | Do not use without TTO agreement |
| 🔵 PAID | Commercial Data-as-a-Service providers | Purchase with indemnification |
02 Open-Source / Permissive Datasets 🟢 GREEN Tier
Explicitly licensed for commercial use. Lowest IP risk. Some entries carry caveats, read the Notes column carefully.
| Dataset | Modality | License | Commercial? | Academic? | Volume | Est. Cost | Legal Risk | Notes |
|---|---|---|---|---|---|---|---|---|
| CPED (Chinese Personalized & Emotional Dialogue) |
Text | Apache 2.0 | ✅ Yes | Yes | 12K dialogues / 133K utterances / 390K+ tokens | Free | Low | 13 emotions, Big Five personality traits, Chinese-centric but cross-lingual usable |
| GoEmotions (Google) | Text | Apache 2.0 | ✅ Yes | Yes | 58,000 utterances | Free | Low | 27 fine-grained emotion categories; derived from Reddit comments |
| CMU-MOSEI ⚠️ (Kaggle / SDK distribution) |
Text + Audio + Video | Apache 2.0 (SDK/Kaggle only) | ⚠️ Yellow | Yes | 23,500+ clips; 65+ hours | Free | Med-High | ⚠️ Official CMU repo is research-only. Apache 2.0 applies to SDK/Kaggle distribution only. Underlying YouTube content has unresolved platform ToS + creator copyright. Requires IP counsel before commercial use. Treat as YELLOW. |
| PersonaChat / ConvAI2 (via ParlAI) | Text | MIT | ✅ Yes | Yes | 10,000+ dialogues | Free | Low | Personality-conditioned chit-chat; useful for E-DNA persona modeling |
| DailyDialog ⚠️ | Text | CC0 / Public Domain (Kaggle) | ✅ Yes (see caveats) | Yes | 13,000+ dialogues | Free | Medium | ⚠️ Contradictory licensing: CC0 on Kaggle, CC BY-NC-SA in original 2017 ACL publication. Must vet the specific distribution source before ingestion. |
| CounselChat | Text | MIT | ✅ Yes | Quasi-academic | ~1,400 interactions | Free | Low | Real licensed counselor-seeker interactions; high-fidelity "helping skills" |
| Generated-Recovery-Support-Dialogues | Text | MIT | ✅ Yes | No | ~1,100 dialogues | Free | Very Low | Synthetic; addiction recovery and motivational interviewing |
| CREMA-D (Crowd-sourced Emotional Multimodal Actors) |
Audio + Video | ODbL / CC BY 4.0 | ✅ Yes | Yes | 7,442 clips / 91 actors | Free | Low | Must reference the database; explicit commercial allowance in docs |
| Dolly-v2 ⚠️ (Databricks) | Text | CC BY-SA 4.0 | ✅ Yes (share-alike risk) | No | 15,000 pairs | Free | Medium | ⚠️ CC BY-SA = Share-Alike contagion risk if weights are derivative works. Use in isolated LoRA adapter only, do not merge into base model. |
| OIG / Open Instruction Generalist | Text | Apache 2.0 | ✅ Yes | No | 44M+ rows | Free | Low | Massive instruction-tuning set; useful for emotional reasoning scaffolding |
| DeepDialogue | Text + Synthetic Audio | Permissive (MIT-adjacent) | ✅ Yes | No | 40,150 multi-turn dialogues | Free | Low | Multi-domain with explicit emotional progressions and synthesized voices |
| MELD (Dual License via Yale OCR) | Text + Audio + Video | Dual: Open Source / Commercial | ✅ Yes (commercial license required) | Yes | 13,000+ utterances | Negotiated | Low-Med | Derived from Friends TV series. Krishnaswamy Lab at Yale provides commercial license for software developers. Contact Yale Office of Cooperative Research. |
| Kaggle Multimodal ER Dataset | Physiological + Voice | CC0 (Public Domain) | ✅ Yes | No | 250 participants; EDA + HRV + voice tone | Free | Very Low | Rare CC0 physiological dataset; synchronized EDA, HRV, and voice |
| WESAD ⚠️ (Wearable Stress & Affect Detection) |
Physiological (ECG, EDA, EMG, Temp, Motion) | CC BY-SA / CC BY-NC 4.0 | ⚠️ Conditional | Yes | Multi-subject wearable sensor data | Free | Med-High | ⚠️ Some distributions are CC BY-SA (share-alike risk); others CC BY-NC. Verify source. Standard distribution is research-only, do NOT use in commercial builds without explicit legal vetting. |
| PhysioNet (multiple datasets) | Physiological (ECG, PPG, EDA) | PhysioNet License / Publicly available | ✅ Yes (open research) | Yes | Multiple multi-channel datasets | Free | Low | "Facilitates open research in both academic and commercial settings." Confirm individual dataset terms before ingestion. |
| K-EmoPhone ⚠️ | Physiological + Mobile + Context | CC BY 4.0 | ⚠️ Conditional | Yes | Mobility, EDA, BVP, context data | Free (if approved) | Medium | Commercial entities must undergo "rigorous review" by KAIST before access is granted. Treat as YELLOW until written approval obtained. |
| OASST1 / OASST2 (Open-Assistant) | Text | Apache 2.0 | ✅ Yes | Yes | 161,443+ messages / 35 languages / multi-turn trees | Free | Low | Human-crowdsourced (13,500+ volunteers); includes emotional/helpfulness/toxicity labels; born-open, bypasses derivative copyright risk |
| CACTUS (CBT Dataset) | Text | Apache 2.0 | ✅ Yes | Yes | 31,564 CBT dialogues / ~1M utterances | Free | Low | Evaluated with Cognitive Therapy Rating Scale (CTRS); gold-standard clinical validation; strong multi-turn depth |
| ChatThero | Text | CC BY 4.0 | ✅ Yes | Yes | Multi-session substance-use recovery episodes with persistent memory | Free | Low | Models multi-session therapy continuity + stressor-aware adaptations; multi-agent simulation |
| PsychEval / PsychAgent | Text | CC BY 4.0 | ✅ Yes | Yes | 2,000+ synthetic client profiles × 6-10 continuous sessions | Free | Low | Models "arc of therapy" across sessions; long-horizon dialogue continuity |
| HelpSteer2 (NVIDIA) | Text | CC BY 4.0 | ✅ Yes | Yes | 10,000+ prompts with fine-grained ratings | Free | Low | Professional curation for helpfulness, correctness, emotional tone; RLHF-ready for supportive contexts |
| ExTES (Exemplary Emotional Support) | Text | CC BY 4.0 | ✅ Yes | Yes | 11,177 dialogues with support-strategy labels | Free | Very Low | 100% synthetic; recursive LLM generation with strategy labels (clarification, affirmation); zero PII risk |
| EChat-200K (ASLP@NPU) | Audio (speech-to-speech) + Text | Apache 2.0 | ✅ Yes | Yes | 200,000 empathetic dialogues; single/multi-label subsets | Free | Low | Paralinguistic cues (jitter, shimmer) intact; real + synthetic audio mix |
| Phi-4-Empathetic | Text | MIT | ✅ Yes | No | Tied to Microsoft Phi-4 ecosystem | Free | Low | Chain of Thought + DPO + SFT for empathetic alignment |
| NeuroFeel | Text | Apache 2.0 | ✅ Yes | Yes | ~10,000 samples / 13 nuanced emotions | Free | Low | Balances underrepresented emotions via synthetic augmentation; real social + ChatGPT-augmented |
| SyntAct | Audio + Video | MIT | ✅ Yes | Yes | Synthesized basic emotional expressions | Free | Very Low | 100% synthetic; speech + facial expression validation |
| Psych8k / ChatPsychiatrist | Text | Apache 2.0 | ✅ Yes | Yes | ~8,000 clinical Q&A pairs (transcribed + scrubbed) | Free | Low | Therapeutic dimensions labeled (Direct Guidance, Approval, Restatement/Reflection) |
| Student-Mental-Health (counseling-vn) | Text | Apache 2.0 | ✅ Yes | No | Student-focused mental health interactions | Free | Low | Domain-specific: student mental health |
| SSConv / SocialSim (Stanford SALT Lab) | Text | Apache 2.0 (expected) | ✅ Yes | Yes | Large-scale synthetic emotional support | Free | Low | Simulates Social Disclosure & Social Awareness; human evaluators rated higher than crowdsourced for "logical supportiveness" |
| UltraChat (filtered for affect) | Text | Apache 2.0 | ✅ Yes (requires subsetting) | No | Large-scale synthetic conversations (GPT-3.5) | Free | Low | Requires affective filtering; large-scale synthetic instruction-following |
| SQPsychConv | Text | Apache 2.0 (expected) | ✅ Yes | Yes | CBT-framework client-therapist dialogues (GPT-4o) | Free | Low | Structured CBT framework; bypasses privacy barriers of real clinical data |
| Medical-o1-reasoning-SFT | Text | Apache 2.0 | ✅ Yes | Yes | 90,120 open-ended questions with CoT reasoning (GPT-4o) | Free | Low | Medical reasoning chains; SFT-optimized |
| WildJailbreak | Text | ODC-BY | ✅ Yes | Yes | Safety / behavioral interaction dataset | Free | Low | Safety filtering insights; behavioral patterns |
| Self-Instruct | Text | Permissive | ✅ Yes | Yes | 82,646 LLM-generated prompts from human seed | Free | Low | Instruction-following diversity; LLM self-generation methodology |
| LongForm | Text | MIT | ✅ Yes | Yes | 27,000 English instruction-following examples | Free | Low | QA + story generation diversity; reverse-engineered instructions |
| MindCorpus (Jan 2026) | Text | Permissive (academic release) | ✅ Yes | Yes | 5,700 realistic therapeutic sessions | Free | Very Low | 100% synthetic; dual-loop Seeker/Supporter agents; differential privacy verified |
| MDD-5k | Text | Permissive (academic release) | ✅ Yes | Yes | 5,000 long conversations / 25 mental illnesses / 26.8 turns avg | Free | Very Low | Largest synthesized diagnostic dataset; neuro-symbolic LLM agents |
| ConvoSense | Text | MIT | ✅ Yes | Yes | 500,000+ commonsense inferences / 12,000 dialogues | Free | Low | GPT-generated commonsense reasoning for empathy; improves coherence |
| Synth-Empathy (Jul 2024) | Text | Permissive | ✅ Yes | Yes | LLM pipeline with diversity selection | Free | Low | Discards low-quality outputs; diversity module; designed to improve empathy benchmarks |
| Multi-Speaker Emotional Speech (Magic Data) | Audio | Permissive (commercial intent) | ✅ Yes | No | Multi-speaker emotionally expressive speech | Free | Low | Engineered for commercial LLM fine-tuning TTS |
| Amod/mental_health_counseling_conversations | Text | RAIL-D | ✅ Yes ($100 donation) | No | Mental health counseling conversations | ~$100 donation | Very Low | Unique donation-based commercial pathway to mental health foundation |
| Synthetic Therapy Conversations (Kaggle, Jerry Yao) | Text | CC0 | ✅ Yes | No | Patient-therapist role interactions | Free | Very Low | 100% synthetic; no PII risk |
| SMILE / MeChat | Text | CC BY 4.0 | ✅ Yes | Yes | 55,000 synthetic counseling conversations | Free | Low | ChatGPT-expanded from single-turn Q&A to multi-turn counseling |
| SMILE-College | Text | Permissive | ✅ Yes | Yes | College mental health sentiment data | Free | Low | Human-machine collaborative; LLM empathetic performance evaluation |
| IDRE (Italian Rephrasing w/ Empathy) | Text | CC BY 4.0 | ✅ Yes | Yes | Italian healthcare chatbot responses | Free | Low | Italian language; Llama 2-optimized empathetic healthcare responses |
| LMSYS-Chat-1M ⚠️ (prompts only) | Text | CC BY 4.0 (prompts) / CC BY-NC 4.0 (outputs) | ⚠️ Prompts only | Yes | 1,000,000 conversations across 25 LLMs | Free | Medium | ⚠️ Bifurcated license. Harvest prompts only; regenerate outputs with open-weight models. Do NOT train on model outputs. |
| EDOS ⚠️ (Empathetic Dialogue at Scale) | Text | CC BY 4.0 | ⚠️ Conditional | Yes | Large-scale dataset derived from movie subtitles | Free | Medium | ⚠️ Derived from copyrighted film subtitles, upstream copyright risk despite CC BY 4.0 surface license. 32 emotion labels + 8 empathy intents. |
| MentalChat16K ⚠️ (synthetic subset) | Text | Apache 2.0 / MIT (synthetic) · TTO (real) | ⚠️ Synthetic subset OK | Yes | 16,113 QA pairs (anonymized real + synthetic GPT-3.5) | Free | Med-High | ⚠️ Bifurcated: synthetic subset safe; real transcript portion may require Stanford OTL clearance. Generated via Airoboros framework. |
03 Restricted Academic Datasets 🔴 RED Tier
Require a negotiated commercial license through a University Technology Transfer Office. Using under standard research EULA in a commercial product is direct copyright infringement.
| Dataset | Institution | Modality | Standard License | Commercial Path | Est. Commercial Cost | Legal Risk | Notes |
|---|---|---|---|---|---|---|---|
| IEMOCAP (Interactive Emotional Dyadic Motion Capture) |
USC / ICT / SAIL | Audio + Video + Motion Capture | Research-only EULA | USC Stevens Center for Innovation | $10,000-$50,000+ (upfront + annual + royalties) | HIGH | Industry gold standard; 12 hours of dyadic motion capture + audio + video. Most sought-after emotional multimodal dataset. |
| DAIC-WOZ | USC ICT | Audio + Video + Text | Non-commercial only | USC Stevens Center | $10,000-$50,000+ (negotiated) | HIGH | Clinical interview corpus for PTSD/depression detection. |
| CMU-MOSEI ⚠️ (Official CMU distribution) |
CMU / MultiComp Lab | Text + Audio + Video | CC BY-NC 4.0 | CMU CTTEC (Flintbox) | Case-by-case; CMU startup terms: 3% equity + 2% royalties | HIGH | Distinct from the Apache 2.0 Kaggle version. Underlying data is scraped YouTube, platform ToS risk remains even post-license. |
| CMU-MOSI | CMU | Text + Audio + Video | CC BY-NC 4.0 | CMU CTTEC | Case-by-case | HIGH | 2,199 clips; sentiment intensity focus |
| Empathetic Dialogues 🚫 (Meta/Facebook AI) |
Facebook AI Research | Text | CC BY-NC | Not publicly available | Not offered commercially | HIGH | 25,000 conversations; dominant benchmark. Do NOT use. Replace with GoEmotions + CounselChat + CPED. |
| ESConv / AugESC 🚫 | Tsinghua University | Text | Non-commercial | Not publicly offered | Not available | HIGH | Emotional support conversation gold standard; strictly non-commercial |
| DEAP 🚫 (EEG + Peripheral) |
Queen Mary / various | EEG + Peripheral | Non-commercial | Contact authors | Not public | HIGH | Brain-wave + physiological emotional response data; strictly non-commercial |
| SEED / SEED-V 🚫 (SJTU EEG) |
SJTU | EEG | Non-commercial EULA | Contact SJTU BCMI Lab | Not public | HIGH | Large-scale EEG emotional dataset; EULA explicitly non-commercial |
| MSP-Podcast ✅ (best value academic) |
UT Dallas Multimodal Speech Processing Lab | Audio | Research EULA | Direct purchase from lab (lab-msp.com) | $8,000 flat | MEDIUM ✓ | One of the few academic sets with a fixed public commercial price. Best value for commercial voice emotion data. Confirm pricing directly before budgeting. |
| RAVDESS | Ryerson University | Audio + Video | Research default | Ryerson Affective Data Science Lab (license fee page) | Commercial license available (fee not publicly disclosed) | MEDIUM ✓ | 7,356 files; 24 professional actors; 8 emotions: calm, happy, sad, angry, fearful, surprise, disgust |
| AM-FED (Affectiva-MIT Facial Expression) |
MIT / Affectiva (now Smart Eye) | Video (facial) | Academic legacy | Direct enterprise license from Smart Eye | Not public | HIGH | Acquired by Smart Eye; must go through enterprise sales channel |
| Social-IQ | CMU | Video + Text | Research | CMU CTTEC / USM | Modular pricing | HIGH | Social intelligence; emotion + intent |
| OMG-Empathy | MIT Media Lab (Picard group) | Video (facial) + Multimodal | CC BY-NC | MIT TLO | Negotiated | MEDIUM | Affective Computing Group dataset; facial + audio + physiological empathy recognition |
| CAST-Phys | MIT Media Lab | Video + Physiological (PPG, EDA, Resp, Thermal) | CC BY-NC | MIT TLO | Negotiated | MEDIUM | 140 participants; 3D+2D facial + thermal + synchronized physiology; contactless remote emotion estimation |
| AffectiveROAD | MIT Media Lab | Video + Physiological (Empatica E4, Zephyr Bioharness) | CC BY-NC | MIT TLO | Negotiated | MEDIUM | Real-world driver stress with synchronized road scene + physiology |
| MER2025 | MIT TLO pathway | Video + Audio + Text | CC BY-NC 4.0 | MIT TLO | Negotiated | MEDIUM | Continuous + discrete emotion tracking; "Affective Computing Meets LLMs" integrated design |
| RECOLA / DAMI-P2C | MIT Media Lab (Picard group) | Audio + Video + Physiological | Research-only | MIT TLO | Negotiated | MEDIUM | Dyadic spontaneous interactions |
| FeedbackESConv | Stanford HAI | Text | Negotiable via Stanford OTL | Stanford OTL / AIMI | $70,000/yr (AIMI pricing pattern) | MEDIUM | 400 conversations; multi-level feedback labels from professional psychotherapy supervisors |
04 University Institutional Licensing 🟠 YELLOW / Institutional Tier
These are licensing frameworks, not individual datasets, that unlock broad access to multiple datasets at once.
| Institution | Office | Framework | Annual Cost | What You Get | Best For | Risk Level |
|---|---|---|---|---|---|---|
| MIT Media Lab | Technology Licensing Office (TLO) | Consortium Lab Member (CLM) | $50,000-$250,000/yr (3-year commitment) |
Non-exclusive royalty-free rights to ALL IP and datasets created during membership | Affective Computing Group data, Driver Stress, multi-modal sets | Low once enrolled |
| MIT Media Lab | TLO | Bespoke Commercial License (single dataset) | ~$20,000 issue fee + annual maintenance + royalties |
Field-of-use restricted license for specific dataset | Single-dataset acquisition | Low once executed |
| USC Stevens Center for Innovation | Stevens Center | Negotiated Technology Transfer | $10,000-$150,000+ (case-by-case) |
Commercial rights to IEMOCAP, DAIC-WOZ, CreativeIT | Dyadic motion capture + clinical data | Low once signed |
| USC (new 2026 policy) | Stevens Center | Startup Launch Agreement | Equity stake (amount negotiated) |
IP / data in exchange for equity; USC may cover legal formation costs | Early-stage startups seeking IEMOCAP-class data without cash outlay | Low / strategic |
| CMU CTTEC | Center for Technology Transfer & Enterprise Creation | Startup Terms | 6% equity (exclusive) or 3% equity (non-excl.) + 2% royalties |
Commercial rights to MOSEI, MOSI, Social-IQ | Startups with strong CMU relationship | Low once signed |
| CMU CTTEC | CTTEC | Express License (Flintbox) | Variable | Faster clearing of specific datasets | Individual dataset licensing | Low |
| Stanford AIMI | AIMI Center | Annual Commercial License | $70,000/yr per dataset (FY25) | Commercial rights to AIMI-managed clinical/affective datasets; committee approval + mission alignment check; renewable annually | Medical / clinical emotional data | Low once enrolled |
| Stanford OTL | Office of Technology Licensing | Option Agreement (pre-funded startups) | Deferred (option fee) | Reserves commercial rights while seeking funding; defers full license until funded | Early-stage startups not yet ready for full commercial license | Low once signed |
| Stanford Center for Precision Mental Health | Stanford | Corporate Members Affiliate | Per project | Longitudinal de-identified EHR + counseling datasets | Clinical / longitudinal emotional data | Medium |
| MIT Media Lab | TLO | Sponsored Research Agreement (SRA) | Negotiated corporate sponsorship | Alternative to direct licensing; sponsorship unlocks prototypes + commercial access. Historical spin-outs: Affectiva, Empatica | Deeper strategic partnership | Low once signed |
05 Commercial Data-as-a-Service Providers 🔵 PAID Tier
Highest legal safety, data collected with explicit commercial consent, full indemnification. Correct path for voice, physiological, and robotics data.
| Provider | Modality | License / Indemnification | Volume Available | Pricing | Legal Risk | Strengths | Weaknesses / Caveats |
|---|---|---|---|---|---|---|---|
| Defined.ai | Text + Audio + Video + Emotion | Proprietary / Full commercial indemnification | 99,500+ hours meeting recordings; 315-3,125 hrs emotional speech (tiered) | $71,500 (Standard, 315 hrs) to $1,111,000 (Elite, 3,125 hrs) |
Very Low | "Ethically sourced"; rights-cleared; marketplace for spontaneous dialogue + emotionally expressive speech; Neural Voice Conversion for anonymization | ⚠️ BIPA RISK: Anonymization (neural voice conversion) may NOT fully remove biometric identifiers. Residual jitter/shimmer may qualify as biometric information ($1K-$5K per violation). Request explicit BIPA compliance certification before use. |
| Appen | Text + Audio + Physiological + Robotics | Proprietary / Full commercial indemnification | 500+ locales; robotics demonstration trajectories; embodied interaction logs | $93,000-$150,000+ (enterprise annual) $10,000+ for pilots |
Very Low | "Physical AI" capabilities for robotics; LiDAR, embodied interaction, RLHF expert validation; highest emotion fidelity via human actor re-recording; broadest modality coverage | Expensive for bespoke custom collection |
| Scale AI | Text + Multimodal + Synthetic | Proprietary / Full commercial indemnification | Custom per contract | Custom contracts | Very Low | Automated pre-labeling + human review; good for high volume | Variable quality for high-nuance emotional tasks; synthetic voices lack micro-prosody |
| Twine AI | Audio (Voice) | Proprietary / CCPA+GDPR compliant | Custom demographic-specific | Custom | Very Low | Targets specific demographics; custom emotional tone recording; custom consent forms | Smaller scale than Appen / Scale |
| Rwazi | Audio (Voice) | Proprietary / Commercial rights | Custom | Custom | Very Low | "Real world" emotional recording; explicit commercial AI training consent | Limited public documentation |
| Empatica | Physiological (PPG, EDA, Temperature) | Commercial Research Agreement | Custom via EmbracePlus wearable program | Per research partnership | Low | Medical-grade hardware; FDA-cleared EmbracePlus; best-in-class for clinical physiological emotional ground truth | Requires in-lab or partnered collection setup |
| BIOPAC / BioNomadix | Physiological (EDA, PPG, ECG) | Proprietary / Commercial SDK | Custom via Research Ring / Logger | Hardware + SDK pricing | Low | Industry standard for synchronized EDA+PPG+ECG; used in major academic studies commercially re-implemented | Hardware procurement overhead |
| iMotions | Physiological (EEG, GSR, Eye Tracking) | Commercial SDK | Custom integration | Enterprise SDK pricing | Low | Real-time biometric pipeline integration; EEG + GSR + Eye Tracking in one SDK | Expensive licensing; requires hardware integration |
06 Synthetic Data Generation Pathways 🟣 SYNTHETIC
Not datasets, legal strategies for generating training data without acquiring external sets.
| Method | Tooling | Commercial Use? | Legal Risk | Volume Potential | Cost | Key Risk / Notes |
|---|---|---|---|---|---|---|
| Azure OpenAI (GPT-4o) | Microsoft Azure API | ✅ Yes (Safe Harbor via Enterprise ToS) | Low | Unlimited | API costs + Azure subscription | Microsoft Enterprise Agreement provides data ownership; reduces "competing model" exposure vs. direct OpenAI API. Recommended path. |
| OpenAI API (direct) | OpenAI API | ⚠️ Gray area | Medium | Unlimited | API costs | ToS prohibits training "competing models"; application-layer emotional engines may be permissible. Risk of account revocation. |
| Anthropic / Claude API (direct) | Anthropic API | ⚠️ Gray area | Medium | Unlimited | API costs | Same "competing model" concern as OpenAI. Anthropic explicitly permits sentiment analysis / content categorization tools. |
| Open-Weight LLMs (Llama 3/4, Mistral, DeepSeek-R1) |
Self-hosted | ✅ Yes | Very Low | Unlimited | Compute costs only | DeepSeek-R1 (MIT), Llama 3/4 (Meta Community License, verify commercial use terms). Best path for zero ToS risk. |
| NVIDIA Isaac Simulator | NVIDIA Isaac Sim | ✅ Yes | Very Low | Unlimited synthetic | Software licensing | Multimodal synthetic interaction generation for embodied robotics scenarios |
| LLM-as-Annotator (label, not generate) |
Any LLM | Lower risk | Low-Med | High throughput | API costs | Using LLMs to label human-collected data (vs. generate it) is lower risk and generally permissible under most ToS. Preferred annotation strategy. |
| Nous Research Fine-tune (Llama 3.1 405B) | Self-hosted | ✅ Yes | Very Low | Unlimited | Compute costs | Validated Sep 2025 for synthetic CBT transcripts; open-weight avoids OpenAI ToS "competing model" exposure. Recommended ToS-safe generator. |
| Airoboros Framework | Open-source | ⚠️ Gray area | Medium | High | API costs | Self-generation via GPT-3.5 Turbo (used in MentalChat16K); outputs may violate OpenAI ToS if used for competing LLM training |
| CosyVoice2 + GPT-4 (2026 preprint) | Hybrid | ✅ Yes (w/ open-weight substitution) | Low-Med | Unlimited | API + compute | Zero-shot TTS for synthesizing spoken empathetic dialogues; expected MIT license release |
| Multi-Agent Simulation (MindCorpus pattern) | Any LLM | ✅ Yes | Low | Unlimited | API costs | Dual-loop Seeker/Supporter agents refining therapeutic responses; differential privacy verification built in |
07 Legal Risk Reference: Key Regulations
Regulations that directly govern EAII's data collection and model training activities.
| Regulation | Jurisdiction | What It Covers for EAII | Max Penalty | Priority |
|---|---|---|---|---|
| BIPA (Biometric Information Privacy Act) |
Illinois, USA | Voice biometrics, facial geometry, physiological identifiers | $1,000-$5,000 per violation (class actions possible) | 🔴 CRITICAL, voice/physio data |
| GDPR Article 9 | EU | "Special category" data including physiological and mental health data | Up to 4% global annual revenue | 🔴 CRITICAL, EU users |
| EU AI Act | EU | Emotion recognition in workplace and educational settings explicitly prohibited | Up to €30M or 6% global revenue | HIGH, scope limitations required |
| CCPA | California, USA | Biometric and sensitive personal data rights | $2,500-$7,500 per violation | HIGH, California users |
| COPPA | USA | Data involving minors | Up to $51,744 per violation | HIGH, block <13 users from emotional data collection |
| Copyright Infringement | USA | Training on unlicensed CC BY-NC data | Statutory damages up to $150,000 per work | HIGH, core dataset risk |
| LLM ToS Violation | Contractual | "Competing model" clause violations | Account termination + possible C&D or litigation | MEDIUM, reputational + operational |
| HIPAA (PHI) | USA | Protected Health Information in clinical dialogues; general-purpose LLMs miss >50% of PHI, pair Microsoft Presidio + MedicalNERRecognizer | Civil + criminal penalties | HIGH, clinical data |
| FDA SaMD (21 CFR Part 11) | USA | Clinical deployment requires "Golden Thread": record-level consent logs + cryptographic lineage. Beware "synthetic data creep" / model collapse. | Device denial; post-market action | HIGH, if clinical features deployed |
07b Compliance Tooling Reference
Tools required to operationalize the regulatory framework above, especially for clinical / FDA SaMD pathways.
| Tool | License | Purpose | Why It Matters |
|---|---|---|---|
| Microsoft Presidio (+ MedicalNERRecognizer) | MIT | PII / PHI de-identification | Required for secondary cleaning. PIIBench: GPT-4o-mini 0.95 recall vs. Presidio <0.14 F1 alone,pair them. |
| ConsentOS | Commercial | Record-level consent tracking & revocation | FDA SaMD "Golden Thread" compliance; handles real-time consent revocations |
| Knish.IO | Commercial | Quantum-secure cryptographic signatures | Chain of custody for regulatory audits; FDA 21 CFR Part 11 |
| Mostly AI | Commercial | Synthetic data generation platform | PII-safe synthetic generation |
| PIIBench | Open benchmark | PII scrubbing efficacy evaluation | Benchmarks scrubber recall across tools |
08 CC License Contagion Risk, Share-Alike Warning
Understanding how CC licenses interact with model weights is critical for IP protection.
| License Type | Can Train On? | Risk if Weights Are Derivative Works | Recommended Mitigation |
|---|---|---|---|
| CC0 / Public Domain | ✅ Yes, freely | None | No mitigation needed |
| MIT | ✅ Yes, freely | None | No mitigation needed |
| Apache 2.0 | ✅ Yes, freely | None | No mitigation needed |
| CC BY 4.0 | ✅ Yes (with attribution) | Low | Maintain attribution records |
| CC BY-SA 4.0 ⚠️ | ⚠️ Risky | Share-Alike contagion, may force weights to be open-sourced | Isolate training to a separate LoRA adapter only; do not merge into base model weights |
| CC BY-NC 4.0 🚫 | ❌ No | Copyright infringement | Do not use for commercial model training |
| CC BY-NC-SA 4.0 🚫 | ❌ No | Copyright infringement + share-alike | Do not use under any circumstances |
| Research-only EULA 🚫 | ❌ No | Copyright infringement | Requires TTO commercial license |
09 Recommended Acquisition Roadmap
Four-phase strategy from bootstrap to proprietary data moat.
Phase 1, Bootstrap
Months 0-6 · ~$0-$10K- CPED (Apache 2.0)
- GoEmotions (Apache 2.0)
- CounselChat (MIT)
- Generated-Recovery-Support-Dialogues (MIT)
- PersonaChat (MIT)
- DailyDialog (vet CC0 Kaggle version)
- Synthetic augmentation: DeepSeek-R1 / Llama 3 (open-weight)
- LLM-as-annotator on proprietary seed data
Phase 2, Commercial Data Acquisition
Months 3-9 · ~$8K-$150K- Purchase MSP-Podcast commercial license, $8,000 flat (UT Dallas)
- Negotiate MELD commercial license via Yale OCR
- Purchase Defined.ai Standard tier (~$71,500 / 315 hrs emotional speech)
Phase 3, Institutional & Multimodal
Months 6-18 · ~$50K-$250K- MIT Media Lab Consortium membership ($50K-$250K/yr), unlocks all Affective Computing Group IP
- OR USC Startup Launch Agreement (equity deal), unlocks IEMOCAP + DAIC-WOZ without cash outlay
- Appen custom collection for embodied / physiological scenarios
- Empatica EmbracePlus partnership for medical-grade physiological ground truth
Phase 4, Proprietary Flywheel
Months 12-24+ · Operational- In-house collection pipeline: deploy product, collect consented real-world emotional data, own the IP
- DaaS product potential: license proprietary clean multimodal data back to enterprise market
- Goal: Build an insurmountable proprietary data moat
10 Priority Data Gaps
Known gaps in the data stack and their recommended solutions.
| Gap | Severity | Solution |
|---|---|---|
| High-quality English emotional dialogue (commercial) | Critical | CounselChat + CPED (translated) + Defined.ai custom |
| Voice prosody with intact micro-prosody (jitter/shimmer) | Critical | Appen re-enactment method (actors re-record) + MSP-Podcast license |
| Physiological (EDA, HRV) with commercial rights | High | PhysioNet + Kaggle CC0 ER dataset + Empatica partnership |
| EEG emotion data (commercial) | Medium | No good permissive option; collect proprietary via BIOPAC / OpenBCI |
| Non-Western / non-English emotional data | Medium | CPED (Chinese); Defined.ai global locales; Appen 500+ locales |
| Embodied / human-robot interaction data | Medium | MultiPhysio-HRC (MDPI 2025); NVIDIA Isaac simulation; Appen Physical AI |
10b New Research Labs / Sources to Watch
Labs producing ongoing output relevant to EAII's data stack.
| Lab / Org | Output Relevant to EAII | Access Path |
|---|---|---|
| Stanford SALT Lab | SocialSim / SSConv synthetic emotional support | Published openly |
| Stanford Center for Precision Mental Health | Longitudinal de-identified EHR + counseling | precisionmentalhealth@stanford.edu, Corporate Members affiliate |
| Sonde Health (PureTech spinout) | Vocal biomarkers (pitch, hoarseness) for depression screening; licensed from MIT Lincoln Lab | Enterprise sales |
| ASLP Lab / ASLP@NPU | EChat-200K audio+text empathetic dialogue | Open release |
| Nous Research | Llama 3.1 405B fine-tune for CBT synthesis (Sep 2025) | Open-weight |
| Magic Data | Multi-Speaker Emotional Speech Dataset | Open release |
11 Key Takeaways
- The core dataset risk is copyright infringement on CC BY-NC data, not trademark or trade secret. The most-used research benchmarks (Empathetic Dialogues, IEMOCAP, ESConv) are strictly non-commercial. MELD is an exception, it has a commercial dual-license available via Yale OCR.
- The safest text data stack: CPED + GoEmotions + CounselChat + DailyDialog (vet source). All Apache 2.0 or MIT, zero cost, adequate volume for Phase 2.
- Best single institutional purchase: MSP-Podcast at $8,000, a fixed-price commercial license with no negotiation overhead.
- DaaS is the gold standard for legal safety: Defined.ai and Appen provide full indemnification and are the correct path for voice + physiological + robotics data.
- Synthetic generation is viable if: (a) using open-weight models (DeepSeek-R1, Llama), OR (b) using Azure OpenAI under Enterprise Agreement, OR (c) using LLMs only as annotators (not generators) of human-collected base data.
- The LoRA isolation strategy is the correct IP mitigation for any CC BY-SA contaminated data: keep fine-tuned adapters architecturally separate from base model weights.
- BIPA is the highest operational legal risk for voice data. Standard commercial "masking" is not a legal shield if micro-prosodic features (jitter/shimmer) remain extractable, they qualify as biometric identifiers under BIPA.
- Synthetic dialogue has exploded since the last inventory. MindCorpus, MDD-5k, CACTUS, ExTES, ChatThero, PsychEval, SMILE, EChat-200K, Psych8k, ConvoSense, Synth-Empathy, and others now form a dense layer of Apache 2.0 / CC BY / MIT counseling-dialogue corpora. Combined with Nous Research's Llama 3.1 405B fine-tune (Sep 2025), the entire synthetic therapeutic dialogue stack can now be built ToS-safe using only open-weight generators and permissively licensed training dialogues, no OpenAI / Anthropic gray areas required.
- The NEXUS 21% rule: Do not trust Hugging Face license labels at face value, only ~21% of "permissively licensed" datasets are actually commercially viable once upstream provenance is traced. Vet each dataset's full dependency chain before ingestion.
- Clinical deployment raises the bar. HIPAA + FDA SaMD "Golden Thread" (consent logs via ConsentOS, cryptographic lineage via Knish.IO, PHI scrubbing via Presidio + MedicalNERRecognizer) become mandatory if EAII pursues any medical-device classification. Budget compliance tooling separately from dataset acquisition.