EAII Data Sources, Human Discovery, Inc.

00 Overview

This document synthesizes all research conducted on data acquisition for the EAII Phase 2 proprietary emotional reasoning engine. The core legal challenge: most "gold standard" academic emotional datasets carry CC BY-NC (Non-Commercial) restrictions that prohibit direct commercial use. This summary catalogs every identified data source, its legal status, cost, volume, and risk level , organized by tier.

01 Quick Reference: Risk Tiers

Four tiers govern acquisition strategy. Match every source to its tier before use.

Tier	Description	Strategy
🟢 GREEN	Fully permissive (Apache 2.0, MIT, CC0)	Use freely
🟡 YELLOW	Requires negotiation or paid license	Budget and negotiate
🔴 RED	Non-commercial only under standard terms	Do not use without TTO agreement
🔵 PAID	Commercial Data-as-a-Service providers	Purchase with indemnification

02 Open-Source / Permissive Datasets 🟢 GREEN Tier

Explicitly licensed for commercial use. Lowest IP risk. Some entries carry caveats, read the Notes column carefully.

Dataset	Modality	License	Commercial?	Academic?	Volume	Est. Cost	Legal Risk	Notes
CPED (Chinese Personalized & Emotional Dialogue)	Text	Apache 2.0	✅ Yes	Yes	12K dialogues / 133K utterances / 390K+ tokens	Free	Low	13 emotions, Big Five personality traits, Chinese-centric but cross-lingual usable
GoEmotions (Google)	Text	Apache 2.0	✅ Yes	Yes	58,000 utterances	Free	Low	27 fine-grained emotion categories; derived from Reddit comments
CMU-MOSEI ⚠️ (Kaggle / SDK distribution)	Text + Audio + Video	Apache 2.0 (SDK/Kaggle only)	⚠️ Yellow	Yes	23,500+ clips; 65+ hours	Free	Med-High	⚠️ Official CMU repo is research-only. Apache 2.0 applies to SDK/Kaggle distribution only. Underlying YouTube content has unresolved platform ToS + creator copyright. Requires IP counsel before commercial use. Treat as YELLOW.
PersonaChat / ConvAI2 (via ParlAI)	Text	MIT	✅ Yes	Yes	10,000+ dialogues	Free	Low	Personality-conditioned chit-chat; useful for E-DNA persona modeling
DailyDialog ⚠️	Text	CC0 / Public Domain (Kaggle)	✅ Yes (see caveats)	Yes	13,000+ dialogues	Free	Medium	⚠️ Contradictory licensing: CC0 on Kaggle, CC BY-NC-SA in original 2017 ACL publication. Must vet the specific distribution source before ingestion.
CounselChat	Text	MIT	✅ Yes	Quasi-academic	~1,400 interactions	Free	Low	Real licensed counselor-seeker interactions; high-fidelity "helping skills"
Generated-Recovery-Support-Dialogues	Text	MIT	✅ Yes	No	~1,100 dialogues	Free	Very Low	Synthetic; addiction recovery and motivational interviewing
CREMA-D (Crowd-sourced Emotional Multimodal Actors)	Audio + Video	ODbL / CC BY 4.0	✅ Yes	Yes	7,442 clips / 91 actors	Free	Low	Must reference the database; explicit commercial allowance in docs
Dolly-v2 ⚠️ (Databricks)	Text	CC BY-SA 4.0	✅ Yes (share-alike risk)	No	15,000 pairs	Free	Medium	⚠️ CC BY-SA = Share-Alike contagion risk if weights are derivative works. Use in isolated LoRA adapter only, do not merge into base model.
OIG / Open Instruction Generalist	Text	Apache 2.0	✅ Yes	No	44M+ rows	Free	Low	Massive instruction-tuning set; useful for emotional reasoning scaffolding
DeepDialogue	Text + Synthetic Audio	Permissive (MIT-adjacent)	✅ Yes	No	40,150 multi-turn dialogues	Free	Low	Multi-domain with explicit emotional progressions and synthesized voices
MELD (Dual License via Yale OCR)	Text + Audio + Video	Dual: Open Source / Commercial	✅ Yes (commercial license required)	Yes	13,000+ utterances	Negotiated	Low-Med	Derived from Friends TV series. Krishnaswamy Lab at Yale provides commercial license for software developers. Contact Yale Office of Cooperative Research.
Kaggle Multimodal ER Dataset	Physiological + Voice	CC0 (Public Domain)	✅ Yes	No	250 participants; EDA + HRV + voice tone	Free	Very Low	Rare CC0 physiological dataset; synchronized EDA, HRV, and voice
WESAD ⚠️ (Wearable Stress & Affect Detection)	Physiological (ECG, EDA, EMG, Temp, Motion)	CC BY-SA / CC BY-NC 4.0	⚠️ Conditional	Yes	Multi-subject wearable sensor data	Free	Med-High	⚠️ Some distributions are CC BY-SA (share-alike risk); others CC BY-NC. Verify source. Standard distribution is research-only, do NOT use in commercial builds without explicit legal vetting.
PhysioNet (multiple datasets)	Physiological (ECG, PPG, EDA)	PhysioNet License / Publicly available	✅ Yes (open research)	Yes	Multiple multi-channel datasets	Free	Low	"Facilitates open research in both academic and commercial settings." Confirm individual dataset terms before ingestion.
K-EmoPhone ⚠️	Physiological + Mobile + Context	CC BY 4.0	⚠️ Conditional	Yes	Mobility, EDA, BVP, context data	Free (if approved)	Medium	Commercial entities must undergo "rigorous review" by KAIST before access is granted. Treat as YELLOW until written approval obtained.
OASST1 / OASST2 (Open-Assistant)	Text	Apache 2.0	✅ Yes	Yes	161,443+ messages / 35 languages / multi-turn trees	Free	Low	Human-crowdsourced (13,500+ volunteers); includes emotional/helpfulness/toxicity labels; born-open, bypasses derivative copyright risk
CACTUS (CBT Dataset)	Text	Apache 2.0	✅ Yes	Yes	31,564 CBT dialogues / ~1M utterances	Free	Low	Evaluated with Cognitive Therapy Rating Scale (CTRS); gold-standard clinical validation; strong multi-turn depth
ChatThero	Text	CC BY 4.0	✅ Yes	Yes	Multi-session substance-use recovery episodes with persistent memory	Free	Low	Models multi-session therapy continuity + stressor-aware adaptations; multi-agent simulation
PsychEval / PsychAgent	Text	CC BY 4.0	✅ Yes	Yes	2,000+ synthetic client profiles × 6-10 continuous sessions	Free	Low	Models "arc of therapy" across sessions; long-horizon dialogue continuity
HelpSteer2 (NVIDIA)	Text	CC BY 4.0	✅ Yes	Yes	10,000+ prompts with fine-grained ratings	Free	Low	Professional curation for helpfulness, correctness, emotional tone; RLHF-ready for supportive contexts
ExTES (Exemplary Emotional Support)	Text	CC BY 4.0	✅ Yes	Yes	11,177 dialogues with support-strategy labels	Free	Very Low	100% synthetic; recursive LLM generation with strategy labels (clarification, affirmation); zero PII risk
EChat-200K (ASLP@NPU)	Audio (speech-to-speech) + Text	Apache 2.0	✅ Yes	Yes	200,000 empathetic dialogues; single/multi-label subsets	Free	Low	Paralinguistic cues (jitter, shimmer) intact; real + synthetic audio mix
Phi-4-Empathetic	Text	MIT	✅ Yes	No	Tied to Microsoft Phi-4 ecosystem	Free	Low	Chain of Thought + DPO + SFT for empathetic alignment
NeuroFeel	Text	Apache 2.0	✅ Yes	Yes	~10,000 samples / 13 nuanced emotions	Free	Low	Balances underrepresented emotions via synthetic augmentation; real social + ChatGPT-augmented
SyntAct	Audio + Video	MIT	✅ Yes	Yes	Synthesized basic emotional expressions	Free	Very Low	100% synthetic; speech + facial expression validation
Psych8k / ChatPsychiatrist	Text	Apache 2.0	✅ Yes	Yes	~8,000 clinical Q&A pairs (transcribed + scrubbed)	Free	Low	Therapeutic dimensions labeled (Direct Guidance, Approval, Restatement/Reflection)
Student-Mental-Health (counseling-vn)	Text	Apache 2.0	✅ Yes	No	Student-focused mental health interactions	Free	Low	Domain-specific: student mental health
SSConv / SocialSim (Stanford SALT Lab)	Text	Apache 2.0 (expected)	✅ Yes	Yes	Large-scale synthetic emotional support	Free	Low	Simulates Social Disclosure & Social Awareness; human evaluators rated higher than crowdsourced for "logical supportiveness"
UltraChat (filtered for affect)	Text	Apache 2.0	✅ Yes (requires subsetting)	No	Large-scale synthetic conversations (GPT-3.5)	Free	Low	Requires affective filtering; large-scale synthetic instruction-following
SQPsychConv	Text	Apache 2.0 (expected)	✅ Yes	Yes	CBT-framework client-therapist dialogues (GPT-4o)	Free	Low	Structured CBT framework; bypasses privacy barriers of real clinical data
Medical-o1-reasoning-SFT	Text	Apache 2.0	✅ Yes	Yes	90,120 open-ended questions with CoT reasoning (GPT-4o)	Free	Low	Medical reasoning chains; SFT-optimized
WildJailbreak	Text	ODC-BY	✅ Yes	Yes	Safety / behavioral interaction dataset	Free	Low	Safety filtering insights; behavioral patterns
Self-Instruct	Text	Permissive	✅ Yes	Yes	82,646 LLM-generated prompts from human seed	Free	Low	Instruction-following diversity; LLM self-generation methodology
LongForm	Text	MIT	✅ Yes	Yes	27,000 English instruction-following examples	Free	Low	QA + story generation diversity; reverse-engineered instructions
MindCorpus (Jan 2026)	Text	Permissive (academic release)	✅ Yes	Yes	5,700 realistic therapeutic sessions	Free	Very Low	100% synthetic; dual-loop Seeker/Supporter agents; differential privacy verified
MDD-5k	Text	Permissive (academic release)	✅ Yes	Yes	5,000 long conversations / 25 mental illnesses / 26.8 turns avg	Free	Very Low	Largest synthesized diagnostic dataset; neuro-symbolic LLM agents
ConvoSense	Text	MIT	✅ Yes	Yes	500,000+ commonsense inferences / 12,000 dialogues	Free	Low	GPT-generated commonsense reasoning for empathy; improves coherence
Synth-Empathy (Jul 2024)	Text	Permissive	✅ Yes	Yes	LLM pipeline with diversity selection	Free	Low	Discards low-quality outputs; diversity module; designed to improve empathy benchmarks
Multi-Speaker Emotional Speech (Magic Data)	Audio	Permissive (commercial intent)	✅ Yes	No	Multi-speaker emotionally expressive speech	Free	Low	Engineered for commercial LLM fine-tuning TTS
Amod/mental_health_counseling_conversations	Text	RAIL-D	✅ Yes ($100 donation)	No	Mental health counseling conversations	~$100 donation	Very Low	Unique donation-based commercial pathway to mental health foundation
Synthetic Therapy Conversations (Kaggle, Jerry Yao)	Text	CC0	✅ Yes	No	Patient-therapist role interactions	Free	Very Low	100% synthetic; no PII risk
SMILE / MeChat	Text	CC BY 4.0	✅ Yes	Yes	55,000 synthetic counseling conversations	Free	Low	ChatGPT-expanded from single-turn Q&A to multi-turn counseling
SMILE-College	Text	Permissive	✅ Yes	Yes	College mental health sentiment data	Free	Low	Human-machine collaborative; LLM empathetic performance evaluation
IDRE (Italian Rephrasing w/ Empathy)	Text	CC BY 4.0	✅ Yes	Yes	Italian healthcare chatbot responses	Free	Low	Italian language; Llama 2-optimized empathetic healthcare responses
LMSYS-Chat-1M ⚠️ (prompts only)	Text	CC BY 4.0 (prompts) / CC BY-NC 4.0 (outputs)	⚠️ Prompts only	Yes	1,000,000 conversations across 25 LLMs	Free	Medium	⚠️ Bifurcated license. Harvest prompts only; regenerate outputs with open-weight models. Do NOT train on model outputs.
EDOS ⚠️ (Empathetic Dialogue at Scale)	Text	CC BY 4.0	⚠️ Conditional	Yes	Large-scale dataset derived from movie subtitles	Free	Medium	⚠️ Derived from copyrighted film subtitles, upstream copyright risk despite CC BY 4.0 surface license. 32 emotion labels + 8 empathy intents.
MentalChat16K ⚠️ (synthetic subset)	Text	Apache 2.0 / MIT (synthetic) · TTO (real)	⚠️ Synthetic subset OK	Yes	16,113 QA pairs (anonymized real + synthetic GPT-3.5)	Free	Med-High	⚠️ Bifurcated: synthetic subset safe; real transcript portion may require Stanford OTL clearance. Generated via Airoboros framework.

03 Restricted Academic Datasets 🔴 RED Tier

Require a negotiated commercial license through a University Technology Transfer Office. Using under standard research EULA in a commercial product is direct copyright infringement.

Dataset	Institution	Modality	Standard License	Commercial Path	Est. Commercial Cost	Legal Risk	Notes
IEMOCAP (Interactive Emotional Dyadic Motion Capture)	USC / ICT / SAIL	Audio + Video + Motion Capture	Research-only EULA	USC Stevens Center for Innovation	$10,000-$50,000+ (upfront + annual + royalties)	HIGH	Industry gold standard; 12 hours of dyadic motion capture + audio + video. Most sought-after emotional multimodal dataset.
DAIC-WOZ	USC ICT	Audio + Video + Text	Non-commercial only	USC Stevens Center	$10,000-$50,000+ (negotiated)	HIGH	Clinical interview corpus for PTSD/depression detection.
CMU-MOSEI ⚠️ (Official CMU distribution)	CMU / MultiComp Lab	Text + Audio + Video	CC BY-NC 4.0	CMU CTTEC (Flintbox)	Case-by-case; CMU startup terms: 3% equity + 2% royalties	HIGH	Distinct from the Apache 2.0 Kaggle version. Underlying data is scraped YouTube, platform ToS risk remains even post-license.
CMU-MOSI	CMU	Text + Audio + Video	CC BY-NC 4.0	CMU CTTEC	Case-by-case	HIGH	2,199 clips; sentiment intensity focus
Empathetic Dialogues 🚫 (Meta/Facebook AI)	Facebook AI Research	Text	CC BY-NC	Not publicly available	Not offered commercially	HIGH	25,000 conversations; dominant benchmark. Do NOT use. Replace with GoEmotions + CounselChat + CPED.
ESConv / AugESC 🚫	Tsinghua University	Text	Non-commercial	Not publicly offered	Not available	HIGH	Emotional support conversation gold standard; strictly non-commercial
DEAP 🚫 (EEG + Peripheral)	Queen Mary / various	EEG + Peripheral	Non-commercial	Contact authors	Not public	HIGH	Brain-wave + physiological emotional response data; strictly non-commercial
SEED / SEED-V 🚫 (SJTU EEG)	SJTU	EEG	Non-commercial EULA	Contact SJTU BCMI Lab	Not public	HIGH	Large-scale EEG emotional dataset; EULA explicitly non-commercial
MSP-Podcast ✅ (best value academic)	UT Dallas Multimodal Speech Processing Lab	Audio	Research EULA	Direct purchase from lab (lab-msp.com)	$8,000 flat	MEDIUM ✓	One of the few academic sets with a fixed public commercial price. Best value for commercial voice emotion data. Confirm pricing directly before budgeting.
RAVDESS	Ryerson University	Audio + Video	Research default	Ryerson Affective Data Science Lab (license fee page)	Commercial license available (fee not publicly disclosed)	MEDIUM ✓	7,356 files; 24 professional actors; 8 emotions: calm, happy, sad, angry, fearful, surprise, disgust
AM-FED (Affectiva-MIT Facial Expression)	MIT / Affectiva (now Smart Eye)	Video (facial)	Academic legacy	Direct enterprise license from Smart Eye	Not public	HIGH	Acquired by Smart Eye; must go through enterprise sales channel
Social-IQ	CMU	Video + Text	Research	CMU CTTEC / USM	Modular pricing	HIGH	Social intelligence; emotion + intent
OMG-Empathy	MIT Media Lab (Picard group)	Video (facial) + Multimodal	CC BY-NC	MIT TLO	Negotiated	MEDIUM	Affective Computing Group dataset; facial + audio + physiological empathy recognition
CAST-Phys	MIT Media Lab	Video + Physiological (PPG, EDA, Resp, Thermal)	CC BY-NC	MIT TLO	Negotiated	MEDIUM	140 participants; 3D+2D facial + thermal + synchronized physiology; contactless remote emotion estimation
AffectiveROAD	MIT Media Lab	Video + Physiological (Empatica E4, Zephyr Bioharness)	CC BY-NC	MIT TLO	Negotiated	MEDIUM	Real-world driver stress with synchronized road scene + physiology
MER2025	MIT TLO pathway	Video + Audio + Text	CC BY-NC 4.0	MIT TLO	Negotiated	MEDIUM	Continuous + discrete emotion tracking; "Affective Computing Meets LLMs" integrated design
RECOLA / DAMI-P2C	MIT Media Lab (Picard group)	Audio + Video + Physiological	Research-only	MIT TLO	Negotiated	MEDIUM	Dyadic spontaneous interactions
FeedbackESConv	Stanford HAI	Text	Negotiable via Stanford OTL	Stanford OTL / AIMI	$70,000/yr (AIMI pricing pattern)	MEDIUM	400 conversations; multi-level feedback labels from professional psychotherapy supervisors

04 University Institutional Licensing 🟠 YELLOW / Institutional Tier

These are licensing frameworks, not individual datasets, that unlock broad access to multiple datasets at once.

Institution	Office	Framework	Annual Cost	What You Get	Best For	Risk Level
MIT Media Lab	Technology Licensing Office (TLO)	Consortium Lab Member (CLM)	$50,000-$250,000/yr (3-year commitment)	Non-exclusive royalty-free rights to ALL IP and datasets created during membership	Affective Computing Group data, Driver Stress, multi-modal sets	Low once enrolled
MIT Media Lab	TLO	Bespoke Commercial License (single dataset)	~$20,000 issue fee + annual maintenance + royalties	Field-of-use restricted license for specific dataset	Single-dataset acquisition	Low once executed
USC Stevens Center for Innovation	Stevens Center	Negotiated Technology Transfer	$10,000-$150,000+ (case-by-case)	Commercial rights to IEMOCAP, DAIC-WOZ, CreativeIT	Dyadic motion capture + clinical data	Low once signed
USC (new 2026 policy)	Stevens Center	Startup Launch Agreement	Equity stake (amount negotiated)	IP / data in exchange for equity; USC may cover legal formation costs	Early-stage startups seeking IEMOCAP-class data without cash outlay	Low / strategic
CMU CTTEC	Center for Technology Transfer & Enterprise Creation	Startup Terms	6% equity (exclusive) or 3% equity (non-excl.) + 2% royalties	Commercial rights to MOSEI, MOSI, Social-IQ	Startups with strong CMU relationship	Low once signed
CMU CTTEC	CTTEC	Express License (Flintbox)	Variable	Faster clearing of specific datasets	Individual dataset licensing	Low
Stanford AIMI	AIMI Center	Annual Commercial License	$70,000/yr per dataset (FY25)	Commercial rights to AIMI-managed clinical/affective datasets; committee approval + mission alignment check; renewable annually	Medical / clinical emotional data	Low once enrolled
Stanford OTL	Office of Technology Licensing	Option Agreement (pre-funded startups)	Deferred (option fee)	Reserves commercial rights while seeking funding; defers full license until funded	Early-stage startups not yet ready for full commercial license	Low once signed
Stanford Center for Precision Mental Health	Stanford	Corporate Members Affiliate	Per project	Longitudinal de-identified EHR + counseling datasets	Clinical / longitudinal emotional data	Medium
MIT Media Lab	TLO	Sponsored Research Agreement (SRA)	Negotiated corporate sponsorship	Alternative to direct licensing; sponsorship unlocks prototypes + commercial access. Historical spin-outs: Affectiva, Empatica	Deeper strategic partnership	Low once signed

05 Commercial Data-as-a-Service Providers 🔵 PAID Tier

Highest legal safety, data collected with explicit commercial consent, full indemnification. Correct path for voice, physiological, and robotics data.

Provider	Modality	License / Indemnification	Volume Available	Pricing	Legal Risk	Strengths	Weaknesses / Caveats
Defined.ai	Text + Audio + Video + Emotion	Proprietary / Full commercial indemnification	99,500+ hours meeting recordings; 315-3,125 hrs emotional speech (tiered)	$71,500 (Standard, 315 hrs) to $1,111,000 (Elite, 3,125 hrs)	Very Low	"Ethically sourced"; rights-cleared; marketplace for spontaneous dialogue + emotionally expressive speech; Neural Voice Conversion for anonymization	⚠️ BIPA RISK: Anonymization (neural voice conversion) may NOT fully remove biometric identifiers. Residual jitter/shimmer may qualify as biometric information ($1K-$5K per violation). Request explicit BIPA compliance certification before use.
Appen	Text + Audio + Physiological + Robotics	Proprietary / Full commercial indemnification	500+ locales; robotics demonstration trajectories; embodied interaction logs	$93,000-$150,000+ (enterprise annual) $10,000+ for pilots	Very Low	"Physical AI" capabilities for robotics; LiDAR, embodied interaction, RLHF expert validation; highest emotion fidelity via human actor re-recording; broadest modality coverage	Expensive for bespoke custom collection
Scale AI	Text + Multimodal + Synthetic	Proprietary / Full commercial indemnification	Custom per contract	Custom contracts	Very Low	Automated pre-labeling + human review; good for high volume	Variable quality for high-nuance emotional tasks; synthetic voices lack micro-prosody
Twine AI	Audio (Voice)	Proprietary / CCPA+GDPR compliant	Custom demographic-specific	Custom	Very Low	Targets specific demographics; custom emotional tone recording; custom consent forms	Smaller scale than Appen / Scale
Rwazi	Audio (Voice)	Proprietary / Commercial rights	Custom	Custom	Very Low	"Real world" emotional recording; explicit commercial AI training consent	Limited public documentation
Empatica	Physiological (PPG, EDA, Temperature)	Commercial Research Agreement	Custom via EmbracePlus wearable program	Per research partnership	Low	Medical-grade hardware; FDA-cleared EmbracePlus; best-in-class for clinical physiological emotional ground truth	Requires in-lab or partnered collection setup
BIOPAC / BioNomadix	Physiological (EDA, PPG, ECG)	Proprietary / Commercial SDK	Custom via Research Ring / Logger	Hardware + SDK pricing	Low	Industry standard for synchronized EDA+PPG+ECG; used in major academic studies commercially re-implemented	Hardware procurement overhead
iMotions	Physiological (EEG, GSR, Eye Tracking)	Commercial SDK	Custom integration	Enterprise SDK pricing	Low	Real-time biometric pipeline integration; EEG + GSR + Eye Tracking in one SDK	Expensive licensing; requires hardware integration

06 Synthetic Data Generation Pathways 🟣 SYNTHETIC

Not datasets, legal strategies for generating training data without acquiring external sets.

Method	Tooling	Commercial Use?	Legal Risk	Volume Potential	Cost	Key Risk / Notes
Azure OpenAI (GPT-4o)	Microsoft Azure API	✅ Yes (Safe Harbor via Enterprise ToS)	Low	Unlimited	API costs + Azure subscription	Microsoft Enterprise Agreement provides data ownership; reduces "competing model" exposure vs. direct OpenAI API. Recommended path.
OpenAI API (direct)	OpenAI API	⚠️ Gray area	Medium	Unlimited	API costs	ToS prohibits training "competing models"; application-layer emotional engines may be permissible. Risk of account revocation.
Anthropic / Claude API (direct)	Anthropic API	⚠️ Gray area	Medium	Unlimited	API costs	Same "competing model" concern as OpenAI. Anthropic explicitly permits sentiment analysis / content categorization tools.
Open-Weight LLMs (Llama 3/4, Mistral, DeepSeek-R1)	Self-hosted	✅ Yes	Very Low	Unlimited	Compute costs only	DeepSeek-R1 (MIT), Llama 3/4 (Meta Community License, verify commercial use terms). Best path for zero ToS risk.
NVIDIA Isaac Simulator	NVIDIA Isaac Sim	✅ Yes	Very Low	Unlimited synthetic	Software licensing	Multimodal synthetic interaction generation for embodied robotics scenarios
LLM-as-Annotator (label, not generate)	Any LLM	Lower risk	Low-Med	High throughput	API costs	Using LLMs to label human-collected data (vs. generate it) is lower risk and generally permissible under most ToS. Preferred annotation strategy.
Nous Research Fine-tune (Llama 3.1 405B)	Self-hosted	✅ Yes	Very Low	Unlimited	Compute costs	Validated Sep 2025 for synthetic CBT transcripts; open-weight avoids OpenAI ToS "competing model" exposure. Recommended ToS-safe generator.
Airoboros Framework	Open-source	⚠️ Gray area	Medium	High	API costs	Self-generation via GPT-3.5 Turbo (used in MentalChat16K); outputs may violate OpenAI ToS if used for competing LLM training
CosyVoice2 + GPT-4 (2026 preprint)	Hybrid	✅ Yes (w/ open-weight substitution)	Low-Med	Unlimited	API + compute	Zero-shot TTS for synthesizing spoken empathetic dialogues; expected MIT license release
Multi-Agent Simulation (MindCorpus pattern)	Any LLM	✅ Yes	Low	Unlimited	API costs	Dual-loop Seeker/Supporter agents refining therapeutic responses; differential privacy verification built in

07 Legal Risk Reference: Key Regulations

Regulations that directly govern EAII's data collection and model training activities.

Regulation	Jurisdiction	What It Covers for EAII	Max Penalty	Priority
BIPA (Biometric Information Privacy Act)	Illinois, USA	Voice biometrics, facial geometry, physiological identifiers	$1,000-$5,000 per violation (class actions possible)	🔴 CRITICAL, voice/physio data
GDPR Article 9	EU	"Special category" data including physiological and mental health data	Up to 4% global annual revenue	🔴 CRITICAL, EU users
EU AI Act	EU	Emotion recognition in workplace and educational settings explicitly prohibited	Up to €30M or 6% global revenue	HIGH, scope limitations required
CCPA	California, USA	Biometric and sensitive personal data rights	$2,500-$7,500 per violation	HIGH, California users
COPPA	USA	Data involving minors	Up to $51,744 per violation	HIGH, block <13 users from emotional data collection
Copyright Infringement	USA	Training on unlicensed CC BY-NC data	Statutory damages up to $150,000 per work	HIGH, core dataset risk
LLM ToS Violation	Contractual	"Competing model" clause violations	Account termination + possible C&D or litigation	MEDIUM, reputational + operational
HIPAA (PHI)	USA	Protected Health Information in clinical dialogues; general-purpose LLMs miss >50% of PHI, pair Microsoft Presidio + MedicalNERRecognizer	Civil + criminal penalties	HIGH, clinical data
FDA SaMD (21 CFR Part 11)	USA	Clinical deployment requires "Golden Thread": record-level consent logs + cryptographic lineage. Beware "synthetic data creep" / model collapse.	Device denial; post-market action	HIGH, if clinical features deployed

07b Compliance Tooling Reference

Tools required to operationalize the regulatory framework above, especially for clinical / FDA SaMD pathways.

Tool	License	Purpose	Why It Matters
Microsoft Presidio (+ MedicalNERRecognizer)	MIT	PII / PHI de-identification	Required for secondary cleaning. PIIBench: GPT-4o-mini 0.95 recall vs. Presidio <0.14 F1 alone,pair them.
ConsentOS	Commercial	Record-level consent tracking & revocation	FDA SaMD "Golden Thread" compliance; handles real-time consent revocations
Knish.IO	Commercial	Quantum-secure cryptographic signatures	Chain of custody for regulatory audits; FDA 21 CFR Part 11
Mostly AI	Commercial	Synthetic data generation platform	PII-safe synthetic generation
PIIBench	Open benchmark	PII scrubbing efficacy evaluation	Benchmarks scrubber recall across tools

NEXUS Compliance Study (2025): Audited 17,429 data entities. Only 21% of datasets with permissive individual licenses were actually commercially viable once upstream dependencies were traced. Strict provenance vetting is required even for Apache 2.0 / MIT datasets on Hugging Face. An Apache 2.0 label on a Hugging Face card is not a clearance, check the upstream source.

08 CC License Contagion Risk, Share-Alike Warning

Understanding how CC licenses interact with model weights is critical for IP protection.

License Type	Can Train On?	Risk if Weights Are Derivative Works	Recommended Mitigation
CC0 / Public Domain	✅ Yes, freely	None	No mitigation needed
MIT	✅ Yes, freely	None	No mitigation needed
Apache 2.0	✅ Yes, freely	None	No mitigation needed
CC BY 4.0	✅ Yes (with attribution)	Low	Maintain attribution records
CC BY-SA 4.0 ⚠️	⚠️ Risky	Share-Alike contagion, may force weights to be open-sourced	Isolate training to a separate LoRA adapter only; do not merge into base model weights
CC BY-NC 4.0 🚫	❌ No	Copyright infringement	Do not use for commercial model training
CC BY-NC-SA 4.0 🚫	❌ No	Copyright infringement + share-alike	Do not use under any circumstances
Research-only EULA 🚫	❌ No	Copyright infringement	Requires TTO commercial license

09 Recommended Acquisition Roadmap

Four-phase strategy from bootstrap to proprietary data moat.

Phase 1, Bootstrap

Months 0-6 · ~$0-$10K

CPED (Apache 2.0)
GoEmotions (Apache 2.0)
CounselChat (MIT)
Generated-Recovery-Support-Dialogues (MIT)
PersonaChat (MIT)
DailyDialog (vet CC0 Kaggle version)
Synthetic augmentation: DeepSeek-R1 / Llama 3 (open-weight)
LLM-as-annotator on proprietary seed data

Phase 2, Commercial Data Acquisition

Months 3-9 · ~$8K-$150K

Purchase MSP-Podcast commercial license, $8,000 flat (UT Dallas)
Negotiate MELD commercial license via Yale OCR
Purchase Defined.ai Standard tier (~$71,500 / 315 hrs emotional speech)

Phase 3, Institutional & Multimodal

Months 6-18 · ~$50K-$250K

MIT Media Lab Consortium membership ($50K-$250K/yr), unlocks all Affective Computing Group IP
OR USC Startup Launch Agreement (equity deal), unlocks IEMOCAP + DAIC-WOZ without cash outlay
Appen custom collection for embodied / physiological scenarios
Empatica EmbracePlus partnership for medical-grade physiological ground truth

Phase 4, Proprietary Flywheel

Months 12-24+ · Operational

In-house collection pipeline: deploy product, collect consented real-world emotional data, own the IP
DaaS product potential: license proprietary clean multimodal data back to enterprise market
Goal: Build an insurmountable proprietary data moat

10 Priority Data Gaps

Known gaps in the data stack and their recommended solutions.

Gap	Severity	Solution
High-quality English emotional dialogue (commercial)	Critical	CounselChat + CPED (translated) + Defined.ai custom
Voice prosody with intact micro-prosody (jitter/shimmer)	Critical	Appen re-enactment method (actors re-record) + MSP-Podcast license
Physiological (EDA, HRV) with commercial rights	High	PhysioNet + Kaggle CC0 ER dataset + Empatica partnership
EEG emotion data (commercial)	Medium	No good permissive option; collect proprietary via BIOPAC / OpenBCI
Non-Western / non-English emotional data	Medium	CPED (Chinese); Defined.ai global locales; Appen 500+ locales
Embodied / human-robot interaction data	Medium	MultiPhysio-HRC (MDPI 2025); NVIDIA Isaac simulation; Appen Physical AI

10b New Research Labs / Sources to Watch

Labs producing ongoing output relevant to EAII's data stack.

Lab / Org	Output Relevant to EAII	Access Path
Stanford SALT Lab	SocialSim / SSConv synthetic emotional support	Published openly
Stanford Center for Precision Mental Health	Longitudinal de-identified EHR + counseling	`precisionmentalhealth@stanford.edu`, Corporate Members affiliate
Sonde Health (PureTech spinout)	Vocal biomarkers (pitch, hoarseness) for depression screening; licensed from MIT Lincoln Lab	Enterprise sales
ASLP Lab / ASLP@NPU	EChat-200K audio+text empathetic dialogue	Open release
Nous Research	Llama 3.1 405B fine-tune for CBT synthesis (Sep 2025)	Open-weight
Magic Data	Multi-Speaker Emotional Speech Dataset	Open release

11 Key Takeaways

The core dataset risk is copyright infringement on CC BY-NC data, not trademark or trade secret. The most-used research benchmarks (Empathetic Dialogues, IEMOCAP, ESConv) are strictly non-commercial. MELD is an exception, it has a commercial dual-license available via Yale OCR.
The safest text data stack: CPED + GoEmotions + CounselChat + DailyDialog (vet source). All Apache 2.0 or MIT, zero cost, adequate volume for Phase 2.
Best single institutional purchase: MSP-Podcast at $8,000, a fixed-price commercial license with no negotiation overhead.
DaaS is the gold standard for legal safety: Defined.ai and Appen provide full indemnification and are the correct path for voice + physiological + robotics data.
Synthetic generation is viable if: (a) using open-weight models (DeepSeek-R1, Llama), OR (b) using Azure OpenAI under Enterprise Agreement, OR (c) using LLMs only as annotators (not generators) of human-collected base data.
The LoRA isolation strategy is the correct IP mitigation for any CC BY-SA contaminated data: keep fine-tuned adapters architecturally separate from base model weights.
BIPA is the highest operational legal risk for voice data. Standard commercial "masking" is not a legal shield if micro-prosodic features (jitter/shimmer) remain extractable, they qualify as biometric identifiers under BIPA.
Synthetic dialogue has exploded since the last inventory. MindCorpus, MDD-5k, CACTUS, ExTES, ChatThero, PsychEval, SMILE, EChat-200K, Psych8k, ConvoSense, Synth-Empathy, and others now form a dense layer of Apache 2.0 / CC BY / MIT counseling-dialogue corpora. Combined with Nous Research's Llama 3.1 405B fine-tune (Sep 2025), the entire synthetic therapeutic dialogue stack can now be built ToS-safe using only open-weight generators and permissively licensed training dialogues, no OpenAI / Anthropic gray areas required.
The NEXUS 21% rule: Do not trust Hugging Face license labels at face value, only ~21% of "permissively licensed" datasets are actually commercially viable once upstream provenance is traced. Vet each dataset's full dependency chain before ingestion.
Clinical deployment raises the bar. HIPAA + FDA SaMD "Golden Thread" (consent logs via ConsentOS, cryptographic lineage via Knish.IO, PHI scrubbing via Presidio + MedicalNERRecognizer) become mandatory if EAII pursues any medical-device classification. Budget compliance tooling separately from dataset acquisition.