Source-to-Schema Process

1. Extraction

  • Pull NSE equity master and historical symbol data
  • Collect corporate action announcements from NSE sources
  • Download ticker mapping and alias history from official sources
  • Store raw unprocessed files in data/raw/

2. Normalization

  • Standardize field names and formats across all sources
  • Map source identifiers to canonical security IDs
  • Validate dates, amounts, and ratios for consistency
  • Store cleaned data in data/staging/
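
The normalization steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the source column names, the `FIELD_ALIASES` and `SECURITY_IDS` lookups, the `DD-MM-YYYY` date format, and the sample values are all assumptions made for the example.

```python
from datetime import date, datetime

# Hypothetical mapping from source column names to canonical field names.
FIELD_ALIASES = {
    "SYMBOL": "symbol",
    "ISIN NUMBER": "isin",
    "DATE OF LISTING": "listing_date",
}

# Hypothetical ISIN -> canonical security ID lookup (made-up values).
SECURITY_IDS = {"INE000X01010": 1001}


def normalize_record(raw: dict) -> dict:
    """Rename fields, map the ISIN to a security ID, and validate the date.

    Problems are recorded in an `issues` list instead of raising, so that
    uncertain rows are flagged rather than silently dropped.
    """
    record = {}
    for key, value in raw.items():
        canonical = FIELD_ALIASES.get(key.strip().upper())
        if canonical is not None:
            record[canonical] = value.strip()

    issues = []
    record["security_id"] = SECURITY_IDS.get(record.get("isin"))
    if record["security_id"] is None:
        issues.append("unmapped_isin")

    try:
        # Assumed source format: DD-MM-YYYY.
        parsed = datetime.strptime(record.get("listing_date", ""), "%d-%m-%Y")
        record["listing_date"] = parsed.date()
    except ValueError:
        record["listing_date"] = None
        issues.append("bad_listing_date")

    record["issues"] = issues
    return record


clean = normalize_record(
    {" symbol ": "EXAMPLECO ", "ISIN NUMBER": "INE000X01010", "Date of Listing": "29-11-1995"}
)
```

Collecting issues on the record, rather than rejecting the row, keeps the decision about what to exclude in the later enrichment and publishing steps.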

3. Enrichment

  • Build symbol lineage graphs (rename chains, mergers, splits)
  • Compute adjustment factors for price series normalization
  • Tag data quality issues and provenance
  • Generate curated fact and dimension tables in data/curated/
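
As one illustration of the enrichment step, the sketch below computes cumulative backward adjustment factors from a list of share-changing events. It assumes a common convention that the source does not spell out: each event carries a ratio of shares after to shares before (2 for a 2-for-1 split, 3/2 for a 1:2 bonus), and prices dated before the ex-date are multiplied by the inverse of every later ratio. The function names and sample events are hypothetical.

```python
from datetime import date
from fractions import Fraction


def cumulative_factors(events):
    """Return [(ex_date, factor)] sorted by ex_date ascending.

    `events` is a list of (ex_date, ratio) pairs, where ratio is
    shares-after / shares-before. Each factor applies to prices dated
    strictly before its ex_date and already includes all later events.
    """
    factors = []
    running = Fraction(1)
    # Walk from the most recent event backwards, accumulating 1/ratio.
    for ex_date, ratio in sorted(events, reverse=True):
        running /= Fraction(ratio)
        factors.append((ex_date, running))
    factors.reverse()
    return factors


def factor_for(trade_date, factors):
    """Look up the multiplier for a price observed on trade_date."""
    for ex_date, factor in factors:
        if trade_date < ex_date:
            return factor
    return Fraction(1)  # on or after the latest ex-date: no adjustment


# Made-up events: a 2-for-1 split, then a 1:2 bonus (ratio 3/2).
events = [(date(2020, 1, 10), 2), (date(2022, 6, 1), Fraction(3, 2))]
factors = cumulative_factors(events)
```

Exact fractions are used here only to keep the arithmetic easy to verify; a production table would more likely store floats or decimals.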

4. Publishing

  • Export public sample subset to CSV/Parquet
  • Generate paid full release bundles
  • Update versioned Dolt repository
  • Upload artifacts to storage

Normalization Philosophy

  • Trust over completeness: Mark uncertain data with confidence flags rather than excluding it.
  • Provenance first: Log source and transformation for every record.
  • Audit trail: Keep all versions in Dolt for regulatory and analytical review.
  • Fail gracefully: Missing data is better than wrong data; gaps are logged explicitly.

Data Model

Dimension Tables

Table                        Purpose
dim_security_master          Central security identifier hub; NSE symbol, ISIN, active status
dim_issuer                   Company/issuer identity; sector and market cap category
dim_exchange                 Exchange reference (NSE, BSE); enables multi-exchange support
dim_symbol_alias             All historical symbols for a security with effective date ranges
dim_corporate_action_type    Immutable lookup: SPLIT, DIVIDEND, BONUS, MERGER, DELISTING, etc.
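
To show how effective date ranges in dim_symbol_alias might be used, here is a small point-in-time lookup. The column names (`security_id`, `symbol`, `effective_from`, `effective_to`), the inclusive end date, and the sample rows are assumptions for illustration; the actual schema may differ.

```python
from datetime import date
from typing import NamedTuple, Optional


class SymbolAlias(NamedTuple):
    security_id: int
    symbol: str
    effective_from: date
    effective_to: Optional[date]  # None means the alias is still in effect


def resolve_symbol(symbol: str, as_of: date, aliases) -> Optional[int]:
    """Return the security_id that `symbol` referred to on `as_of`, if any."""
    for alias in aliases:
        if alias.symbol != symbol:
            continue
        started = alias.effective_from <= as_of
        not_ended = alias.effective_to is None or as_of <= alias.effective_to
        if started and not_ended:
            return alias.security_id
    return None  # symbol was not in use on that date


# Made-up rename: OLDCO became NEWCO on 2018-07-01, same security.
aliases = [
    SymbolAlias(1, "OLDCO", date(2010, 1, 1), date(2018, 6, 30)),
    SymbolAlias(1, "NEWCO", date(2018, 7, 1), None),
]
security_id = resolve_symbol("OLDCO", date(2015, 1, 1), aliases)
```

Resolving symbols as of a date, rather than by current ticker, is what lets historical price rows stay attached to the right security across renames.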

Fact Tables

Table                        Purpose
fact_equity_eod              End-of-day OHLCV price snapshots
fact_corporate_action_event  Normalized corporate action records with confidence scores
fact_adjustment_factor       Pre-computed cumulative adjustment multipliers for backtesting
fact_symbol_lineage_event    Ticker and name change history (renames, mergers, delistings)
fact_listing_status_history  Active, suspended, delisted, relisted status over time
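
A backtest would typically combine fact_equity_eod with fact_adjustment_factor. The sketch below shows one plausible way to apply a cumulative multiplier to a single OHLCV row: prices are multiplied by the factor and volume is divided by it, so traded value is preserved. The field names and the treatment of volume are assumptions, not the documented schema.

```python
PRICE_FIELDS = ("open", "high", "low", "close")  # assumed column names


def adjust_bar(bar: dict, factor: float) -> dict:
    """Return a copy of an EOD bar with prices and volume adjusted."""
    if factor <= 0:
        raise ValueError("adjustment factor must be positive")
    adjusted = dict(bar)
    for field in PRICE_FIELDS:
        adjusted[field] = bar[field] * factor
    # Scale volume inversely so price * volume is unchanged.
    adjusted["volume"] = bar["volume"] / factor
    return adjusted


# Made-up bar from before a 2-for-1 split (factor 0.5).
raw_bar = {"open": 100.0, "high": 110.0, "low": 90.0, "close": 105.0, "volume": 1000}
adjusted_bar = adjust_bar(raw_bar, 0.5)
```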

Key Design Principles


Data Sources & Confidence Tiers

Data is sourced primarily from official NSE public sources (equity master, corporate actions board, daily bhavcopy, and circulars), supplemented by the NSDL and BSE references listed in the tiers below.

Confidence  Score           Sources
High        ≥ 0.95          NSE EOD OHLCV, NSE Corporate Action Board, NSDL ISIN Registry
Medium      0.7 to < 0.95   NSE historical archives, parsed web content, BSE cross-references
Low         < 0.7           Estimated adjustment factors, reconstructed lineage from sparse data

Every record includes confidence_score and _source_file fields for full traceability.
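
The tier thresholds can be expressed as a small helper. This sketch assumes scores lie in the range 0 to 1 and that a score of exactly 0.95 counts as High; `confidence_tier`, `tag_record`, the derived `confidence_tier` field, and the sample file path are hypothetical names introduced for the example.

```python
def confidence_tier(score: float) -> str:
    """Map a confidence_score to the High / Medium / Low tiers."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"confidence_score out of range: {score}")
    if score >= 0.95:
        return "high"
    if score >= 0.7:
        return "medium"
    return "low"


def tag_record(record: dict, score: float, source_file: str) -> dict:
    """Attach the traceability fields to a copy of the record."""
    return {
        **record,
        "confidence_score": score,
        "_source_file": source_file,
        "confidence_tier": confidence_tier(score),  # hypothetical derived field
    }


# Made-up record and path under data/raw/.
tagged = tag_record({"symbol": "EXAMPLECO"}, 0.8, "data/raw/example.csv")
```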