Completed

YouTube AI Content Strategy Optimization

Personal Project 2025.10
Python
SQL
YouTube Data API
KMeans
OLS
Interaction Terms
Cohort/Window Analysis
Creator Strategy
A/B-ready Insights

Content Analytics · Clustering · Causal-style Estimation · Posting Strategy

Project Overview

Goal: help AI content creators and platform recommenders answer “when to post” and “what to post” under the high volatility of early video performance.

Data & scope:
  • Collected 3,143 videos from 80+ AI/tech channels (Jan–Sep 2025) via Python + SQL.
  • Built an early-performance panel on T+48h views and engagement signals to reduce long-tail drift.
  • Grouped channels into scale tiers (e.g., head vs. long-tail) using 90-day rolling views.
  • Clustered videos into time-slot segments (e.g., morning / afternoon / evening) with KMeans on calendar features.

Method:
  • Estimated posting-time effects with OLS, controlling for channel-level characteristics and adding interaction terms.
  • Tested heterogeneity with a “time-slot × channel-scale” interaction to validate asymmetric lift across creator tiers.

Key findings (actionable):
  • Short-form (5–10 min) content performs best when pushed earlier in the day.
  • Mid-size channels benefit most from shifting afternoon uploads to the morning (≈ +16.5% play volume).
  • Long-tail channels see additional gains in late-night posting windows (≈ +2%).
  • Reallocating ~30% of afternoon uploads to morning/night slots suggests a ~10% overall view uplift.

Output: a deployable strategy playbook for creators, plus scheduling recommendation rules for platforms.
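The “time-slot × channel-scale” OLS can be sketched on synthetic data. This is a minimal illustration, not the project’s actual pipeline: the variable names (`slot`, `tier`, `log_views_48h`), the three-slot/three-tier coding, and the planted effect sizes are all assumptions, and the fit uses a hand-built least-squares design matrix with dummy and interaction columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 6000
slots = rng.choice(["morning", "afternoon", "evening"], size=n)
tiers = rng.choice(["head", "mid", "longtail"], size=n)

# Synthetic T+48h log-views with a planted morning lift (+0.15)
# and an extra interaction lift for mid-tier channels posting in the morning (+0.16)
y = (8.0
     + 0.15 * (slots == "morning")
     + 0.16 * ((slots == "morning") & (tiers == "mid"))
     + rng.normal(0, 0.5, n))

df = pd.DataFrame({"slot": slots, "tier": tiers, "log_views_48h": y})

# Design matrix: slot dummies, tier dummies (baselines: afternoon / head),
# plus every slot × tier interaction column
X = pd.get_dummies(df[["slot", "tier"]], drop_first=True).astype(float)
for s in [c for c in X.columns if c.startswith("slot_")]:
    for t in [c for c in X.columns if c.startswith("tier_")]:
        X[f"{s}:{t}"] = X[s] * X[t]
X.insert(0, "intercept", 1.0)

beta, *_ = np.linalg.lstsq(X.values, df["log_views_48h"].values, rcond=None)
coef = pd.Series(beta, index=X.columns)
# coef["slot_morning"] ≈ morning lift for baseline (head) channels;
# coef["slot_morning:tier_mid"] ≈ the extra, heterogeneous lift for mid-tier channels
```

The interaction coefficient is the quantity of interest here: it recovers the asymmetric lift across tiers that a slot-only regression would average away.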

What I Did

  • Designed a reproducible pipeline to crawl channel/video metadata and early engagement (T+48h) using Python + SQL (rate-limit safe, incremental updates).
  • Constructed a panel dataset with time features (weekday/weekend, hour bins), content duration buckets, and engagement ratios (like/view, comment/view).
  • Defined channel scale tiers using 90-day rolling views, then validated stability to avoid “one-hit” outliers.
  • Clustered posting time segments with KMeans and compared against rule-based bins to ensure interpretability.
  • Estimated time-slot effects via OLS, including “time-slot × channel-scale” interactions to capture heterogeneous lift.
  • Performed robustness checks with alternative dependent variables (views vs engagement) and sensitivity tests on window length (24h/48h/72h).
  • Converted model outputs into a strategy playbook: recommended posting windows by channel tier + content length, with expected uplift ranges.

Reflection

The biggest challenge was separating the effect of posting time from creator quality and topic selection. I handled this by (1) focusing on early-window metrics (T+48h) to reduce long-run algorithm drift, (2) adding channel-scale interaction terms to model heterogeneity, and (3) sanity-checking clusters against interpretable time bins. If iterating further, I’d introduce:
  • causal identification improvements (e.g., diff-in-diff around schedule changes, candidate instruments),
  • topic embeddings to control for content type,
  • a lightweight dashboard that recommends optimal upload slots in real time.

This project strengthened my ability to turn noisy platform data into practical decision rules: the output is not just a report, but an executable growth strategy.
