II. Page JSON
Structured · livepage:library-data-engineering-analytics
Data Engineering, Analytics, and BI Specialization (Library)
Inspect the normalized record payload exactly as the atlas UI reads it.
{
"id": "page:library-data-engineering-analytics",
"_kind": "Page",
"_file": "wiki/library/data-engineering-analytics.md",
"_cluster": "wiki",
"attributes": {
"nodeKind": "Page",
"title": "Data Engineering, Analytics, and BI Specialization (Library)",
"displayName": "Data Engineering, Analytics, and BI Specialization (Library)",
"slug": "library/data-engineering-analytics",
"articlePath": "wiki/library/data-engineering-analytics.md",
"article": "\n# Data Engineering, Analytics, and BI Specialization\n\n## Overview\n\nThe Data Engineering, Analytics, and BI specialization encompasses the end-to-end lifecycle of transforming raw data into actionable insights. This specialization bridges the gap between data collection and business decision-making, combining technical infrastructure (data engineering), analytical modeling (analytics engineering), and visualization (business intelligence).\n\n## Core Roles\n\n### Data Engineer\n\n**Primary Responsibilities**:\n- Design and build scalable data pipelines and infrastructure\n- Implement ETL/ELT processes for data ingestion and transformation\n- Manage data warehouse and data lake architectures\n- Ensure data quality, reliability, and performance\n- Optimize data storage and compute resources\n- Implement data governance and security controls\n\n**Key Skills**:\n- Programming: Python, SQL, Scala, Java\n- Big Data: Spark, Hadoop, Kafka, Flink\n- Cloud Platforms: AWS, GCP, Azure\n- Orchestration: Airflow, Prefect, Dagster\n- Data Warehouses: Snowflake, BigQuery, Redshift\n- Version Control: Git, CI/CD pipelines\n- Data Modeling: Dimensional modeling, normalization\n\n**Typical Tasks**:\n- Build batch and streaming data pipelines\n- Optimize query performance and data models\n- Implement data quality checks and monitoring\n- Migrate legacy systems to modern data platforms\n- Automate data workflows and deployments\n- Collaborate with data scientists and analysts on data needs\n\n**Career Path**:\nJunior Data Engineer → Data Engineer → Senior Data Engineer → Staff/Principal Data Engineer → Data Engineering Manager/Director\n\n### Analytics Engineer\n\n**Primary Responsibilities**:\n- Transform raw data into clean, modeled datasets for analysis\n- Build and maintain data models (dimensional, denormalized)\n- Implement data quality tests and documentation\n- Create reusable metrics and business logic\n- Bridge the gap between data engineering and analytics\n- Enable self-service analytics through well-modeled data\n\n**Key Skills**:\n- SQL (advanced): CTEs, window functions, optimization\n- dbt: Modeling, testing, documentation, packages\n- Data Modeling: Star schema, snowflake schema, OBT\n- Version Control: Git workflows, pull requests\n- Data Warehouses: Snowflake, BigQuery, Redshift\n- Data Quality: Great Expectations, dbt tests\n- Business Acumen: Understanding of business metrics and KPIs\n\n**Typical Tasks**:\n- Build dbt models (staging, intermediate, fact, dimension)\n- Implement data quality tests and monitors\n- Document data models and lineage\n- Define and standardize business metrics\n- Optimize model performance and costs\n- Collaborate with analysts on data requirements\n\n**Career Path**:\nAnalytics Engineer → Senior Analytics Engineer → Staff Analytics Engineer → Analytics Engineering Manager\n\n**dbt Workflow**:\n```sql\n-- Example: Staging model (stg_orders.sql)\nwith source as (\n select * from {{ source('ecommerce', 'raw_orders') }}\n),\n\ncleaned as (\n select\n order_id,\n customer_id,\n cast(order_date as date) as order_date,\n order_total,\n order_status\n from source\n where order_id is not null\n)\n\nselect * from cleaned\n```\n\n### BI Developer / Analyst\n\n**Primary Responsibilities**:\n- Design and build dashboards and reports\n- Create data visualizations for business stakeholders\n- Perform ad-hoc analysis and exploratory data analysis\n- Translate business requirements into technical specifications\n- Implement row-level security and access 
controls\n- Train business users on self-service analytics tools\n\n**Key Skills**:\n- BI Tools: Tableau, Power BI, Looker, Metabase\n- SQL: Querying and analysis\n- Data Visualization: Chart selection, design principles\n- Business Intelligence: KPIs, metrics, reporting\n- Communication: Presenting insights to stakeholders\n- Statistics: Descriptive statistics, basic inference\n\n**Typical Tasks**:\n- Build executive dashboards and operational reports\n- Perform root cause analysis on business metrics\n- Create ad-hoc reports for business questions\n- Optimize dashboard performance\n- Implement data governance in BI tools\n- Conduct user training and support\n\n**Career Path**:\nBI Analyst → Senior BI Analyst → BI Developer → Senior BI Developer → BI Manager/Lead\n\n## Data Pipelines\n\n### Pipeline Architecture Patterns\n\n#### ETL (Extract, Transform, Load)\n- **Flow**: Source → Transform (staging) → Load (warehouse)\n- **Use Cases**: Data cleaning, complex transformations, on-premise systems\n- **Tools**: Informatica, Talend, SSIS, custom scripts\n- **Pros**: Data validated before loading, reduced warehouse load\n- **Cons**: Longer latency, transformation bottleneck\n\n#### ELT (Extract, Load, Transform)\n- **Flow**: Source → Load (warehouse) → Transform (in warehouse)\n- **Use Cases**: Cloud data warehouses with powerful compute\n- **Tools**: Fivetran + dbt, Stitch + dbt, Airbyte + dbt\n- **Pros**: Faster ingestion, leverage warehouse compute, flexibility\n- **Cons**: Raw data in warehouse, transformation costs\n\n#### Lambda Architecture\n- **Components**: Batch layer + Speed layer + Serving layer\n- **Use Cases**: Real-time and historical analytics\n- **Challenge**: Maintaining two codebases (batch and streaming)\n\n#### Kappa Architecture\n- **Components**: Stream processing only\n- **Use Cases**: Everything as a stream, unified pipeline\n- **Tools**: Kafka + Flink/Spark Streaming\n\n### Pipeline Orchestration\n\n#### Workflow Management\n- **Airflow DAGs**: Task dependencies, scheduling, monitoring\n- **Prefect Flows**: Dynamic workflows, parameterization\n- **Dagster**: Asset-based orchestration, type system\n- **Step Functions**: Serverless workflow on AWS\n- **Temporal**: Durable execution, long-running workflows\n\n#### Best Practices\n- **Idempotency**: Pipelines produce same output given same input\n- **Incremental Processing**: Process only new/changed data\n- **Backfilling**: Ability to reprocess historical data\n- **Monitoring**: Track pipeline health, data quality, SLAs\n- **Alerting**: Notify on failures, delays, quality issues\n- **Testing**: Unit tests, integration tests, data validation\n- **Documentation**: Pipeline purpose, dependencies, SLAs\n\n### Data Pipeline Layers\n\n#### Bronze Layer (Raw)\n- Exact copy of source data\n- Minimal transformation (type casting, column renaming)\n- Full history preserved\n- Schema-on-read\n\n#### Silver Layer (Cleaned)\n- Cleaned and conformed data\n- Deduplicated, validated\n- Standardized formats\n- Type 2 SCD for history\n\n#### Gold Layer (Business)\n- Business-level aggregations\n- Fact and dimension tables\n- Metrics and KPIs\n- Optimized for reporting\n\n## Data Modeling\n\n### Dimensional Modeling (Kimball Methodology)\n\n#### Star Schema\n```\nFact Table (Center):\n- fact_sales\n - sale_id (PK)\n - date_id (FK)\n - customer_id (FK)\n - product_id (FK)\n - store_id (FK)\n - quantity\n - amount\n - cost\n\nDimension Tables (Points):\n- dim_date\n - date_id (PK)\n - date\n - day_of_week\n - month\n - quarter\n - year\n\n- 
dim_customer\n - customer_id (PK)\n - customer_name\n - segment\n - region\n\n- dim_product\n - product_id (PK)\n - product_name\n - category\n - brand\n\n- dim_store\n - store_id (PK)\n - store_name\n - city\n - state\n```\n\n#### Fact Table Types\n- **Transaction Facts**: One row per transaction (sales, orders)\n- **Periodic Snapshot Facts**: Regular intervals (daily inventory)\n- **Accumulating Snapshot Facts**: Lifecycle events (order fulfillment)\n- **Factless Facts**: Event tracking without measures (attendance)\n\n#### Slowly Changing Dimensions (SCD)\n\n**Type 0**: No changes allowed (immutable)\n\n**Type 1**: Overwrite (no history)\n```sql\nUPDATE dim_customer\nSET city = 'New York'\nWHERE customer_id = 123;\n```\n\n**Type 2**: Add new row (full history)\n```sql\n-- Current record\ncustomer_id | name | city | valid_from | valid_to | is_current\n123 | John | Boston | 2020-01-01 | 2023-05-15 | FALSE\n123 | John | New York | 2023-05-16 | 9999-12-31 | TRUE\n```\n\n**Type 3**: Add new column (limited history)\n```sql\ncustomer_id | name | current_city | previous_city\n123 | John | New York | Boston\n```\n\n**Type 6**: Hybrid (1+2+3)\n\n### Data Vault Modeling\n\n**Purpose**: Scalable, auditable, flexible modeling for enterprise data warehouses\n\n**Core Components**:\n\n1. **Hubs**: Business keys\n```sql\nhub_customer:\n- hub_customer_id (PK, surrogate key)\n- customer_id (business key)\n- load_timestamp\n- record_source\n```\n\n2. **Links**: Many-to-many relationships\n```sql\nlink_customer_order:\n- link_customer_order_id (PK)\n- hub_customer_id (FK)\n- hub_order_id (FK)\n- load_timestamp\n- record_source\n```\n\n3. **Satellites**: Descriptive attributes with history\n```sql\nsat_customer_details:\n- hub_customer_id (PK, FK)\n- load_timestamp (PK)\n- customer_name\n- email\n- city\n- hash_diff (for change detection)\n- record_source\n```\n\n**Benefits**:\n- Audit trail and full history\n- Parallel loading from multiple sources\n- Flexible for changing requirements\n- Separates business keys from attributes\n\n**Challenges**:\n- Complex queries (many joins)\n- Requires intermediate layer for BI\n- Steeper learning curve\n\n### One Big Table (OBT)\n\n**Concept**: Denormalized, wide table with all relevant attributes and metrics\n\n**Use Cases**:\n- Cloud data warehouses (cheap storage, fast compute)\n- Self-service analytics (simpler queries)\n- BI tool performance (fewer joins)\n\n**Example**:\n```sql\ncreate table obt_sales as\nselect\n s.sale_id,\n s.sale_date,\n s.quantity,\n s.amount,\n c.customer_name,\n c.customer_segment,\n c.customer_region,\n p.product_name,\n p.product_category,\n p.product_brand,\n st.store_name,\n st.store_city,\n st.store_state\nfrom fact_sales s\njoin dim_customer c on s.customer_id = c.customer_id\njoin dim_product p on s.product_id = p.product_id\njoin dim_store st on s.store_id = st.store_id;\n```\n\n**Trade-offs**:\n- Pros: Fast queries, simple for analysts, fewer joins\n- Cons: Data duplication, higher storage costs, update complexity\n\n## Metrics and KPIs\n\n### Metric Types\n\n#### Business Metrics\n- **Revenue Metrics**: MRR, ARR, revenue growth\n- **Customer Metrics**: CAC, LTV, churn rate, retention rate\n- **Product Metrics**: DAU, MAU, engagement rate, feature adoption\n- **Operational Metrics**: Conversion rate, cart abandonment, fulfillment time\n\n#### Data Quality Metrics\n- **Freshness**: Time since last update\n- **Completeness**: Percentage of null values\n- **Accuracy**: Data matches source of truth\n- **Consistency**: Same value 
across systems\n\n### Metrics Layer\n\n**Purpose**: Centralized definition of business metrics\n\n**Benefits**:\n- Single source of truth\n- Consistent calculations across tools\n- Reusability across dashboards and reports\n- Governance and documentation\n\n**Implementation**:\n- **dbt Metrics**: Define metrics in YAML\n- **Looker LookML**: Measures and dimensions\n- **Cube.js**: Headless BI semantic layer\n- **MetricFlow**: dbt semantic layer (acquired by dbt Labs)\n- **Supergrain**: Metrics catalog and lineage\n\n**Example (dbt Metrics)**:\n```yaml\n# models/metrics.yml\nmetrics:\n - name: total_revenue\n label: Total Revenue\n model: ref('fct_sales')\n calculation_method: sum\n expression: amount\n timestamp: order_date\n time_grains: [day, week, month, quarter, year]\n dimensions:\n - customer_segment\n - product_category\n```\n\n### Metric Governance\n\n- **Naming Conventions**: Clear, descriptive names\n- **Documentation**: Business definition, calculation logic\n- **Ownership**: Assign metric owners\n- **Versioning**: Track changes to metric definitions\n- **Validation**: Compare against known values\n- **Lineage**: Track metric dependencies\n\n## Experimentation\n\n### A/B Testing Framework\n\n#### Experiment Design\n1. **Hypothesis**: What are we testing and why?\n2. **Success Metrics**: Primary and secondary metrics\n3. **Guardrail Metrics**: Metrics that shouldn't degrade\n4. **Randomization Unit**: User, session, or request\n5. **Traffic Allocation**: Percentage per variant\n6. **Duration**: How long to run the experiment\n\n#### Statistical Concepts\n\n**Sample Size Calculation**:\n```\nn = (Z_α/2 + Z_β)² × (σ₁² + σ₂²) / (μ₁ - μ₂)²\n\nWhere:\n- Z_α/2: Critical value for significance level (e.g., 1.96 for 95%)\n- Z_β: Critical value for power (e.g., 0.84 for 80%)\n- σ: Standard deviation\n- μ: Mean values\n- MDE: Minimum detectable effect (μ₁ - μ₂)\n```\n\n**Statistical Power**: Probability of detecting an effect when it exists (typically 80%)\n\n**Significance Level (α)**: Probability of false positive (typically 5%)\n\n**P-value**: Probability of observing results if null hypothesis is true\n\n#### Experimentation Platform Components\n\n1. **Feature Flagging**: Control which users see which variant\n2. **Assignment Service**: Consistent user bucketing\n3. **Metrics Pipeline**: Collect and aggregate metrics\n4. **Analysis Engine**: Statistical testing and visualization\n5. 
**Reporting Dashboard**: Results and insights\n\n#### Common Pitfalls\n\n- **Peeking**: Checking results before experiment completes\n- **Sample Ratio Mismatch (SRM)**: Uneven traffic split indicates bug\n- **Novelty Effect**: Initial excitement skews results\n- **Multiple Testing**: Testing many metrics increases false positives\n- **Selection Bias**: Non-random assignment\n- **Carryover Effects**: Previous experiments affect current one\n\n### Multi-Armed Bandit\n\n**Concept**: Adaptive allocation that shifts traffic to better-performing variants\n\n**Thompson Sampling**:\n- Bayesian approach\n- Balances exploration and exploitation\n- Faster convergence than pure A/B testing\n\n**Use Cases**:\n- Content recommendation\n- Ad optimization\n- Email subject line testing\n- Lower risk than fixed A/B test\n\n## Integration with Data Science and ML\n\n### Data Engineering for ML\n\n#### Feature Store\n**Purpose**: Centralized repository for ML features\n\n**Benefits**:\n- Feature reusability across models\n- Consistent feature engineering\n- Online and offline feature serving\n- Feature versioning and lineage\n\n**Tools**:\n- **Feast**: Open-source feature store\n- **Tecton**: Managed feature platform\n- **AWS SageMaker Feature Store**\n- **Google Vertex AI Feature Store**\n- **Databricks Feature Store**\n\n**Example**:\n```python\n# Define feature view\n@feature_view(\n entities=[customer],\n ttl=timedelta(days=1),\n online=True,\n batch_source=BigQuerySource(\n table=\"project.dataset.customer_features\"\n )\n)\ndef customer_features_fv():\n return Field(\"total_purchases\", Int64)\n```\n\n#### ML Pipeline Integration\n\n**Training Pipeline**:\n1. **Data Extraction**: Pull data from warehouse\n2. **Feature Engineering**: Transform raw data into features\n3. **Training**: Train model on features\n4. **Validation**: Evaluate model performance\n5. **Model Registry**: Store trained model\n\n**Inference Pipeline**:\n1. **Feature Retrieval**: Get features from feature store\n2. **Model Serving**: Load model and make predictions\n3. **Prediction Storage**: Store predictions in warehouse\n4. 
**Monitoring**: Track model performance and drift\n\n### Analytics for Model Evaluation\n\n**Model Metrics**:\n- **Classification**: Accuracy, precision, recall, F1, AUC-ROC\n- **Regression**: RMSE, MAE, R², MAPE\n- **Ranking**: NDCG, MAP, MRR\n\n**Model Monitoring**:\n- **Data Drift**: Input feature distribution changes\n- **Concept Drift**: Relationship between features and target changes\n- **Prediction Drift**: Model output distribution changes\n- **Performance Degradation**: Metrics decline over time\n\n**Tools**:\n- **Evidently AI**: ML monitoring and testing\n- **Whylabs**: Data and ML monitoring\n- **Arize**: ML observability\n- **Fiddler**: ML monitoring and explainability\n\n### BI for ML Insights\n\n**ML Dashboards**:\n- Model performance over time\n- Feature importance and distributions\n- Prediction distribution and statistics\n- Error analysis and edge cases\n- A/B test results for new models\n\n**Business Impact**:\n- Revenue impact of ML models\n- Cost savings from automation\n- User engagement improvements\n- Operational efficiency gains\n\n## Data Stack Examples\n\n### Modern Data Stack (Cloud)\n\n**Ingestion**: Fivetran / Airbyte\n↓\n**Storage**: Snowflake / BigQuery / Redshift\n↓\n**Transformation**: dbt\n↓\n**BI**: Tableau / Looker / Power BI\n↓\n**Orchestration**: Airflow / Prefect\n↓\n**Monitoring**: Monte Carlo / Datafold\n\n### Real-Time Analytics Stack\n\n**Streaming**: Kafka / Kinesis\n↓\n**Processing**: Flink / Spark Streaming\n↓\n**Storage**: ClickHouse / Druid\n↓\n**Visualization**: Grafana / Superset\n\n### ML Platform Stack\n\n**Data**: Data Warehouse (Snowflake)\n↓\n**Features**: Feature Store (Feast/Tecton)\n↓\n**Training**: Databricks / SageMaker\n↓\n**Serving**: SageMaker / Vertex AI\n↓\n**Monitoring**: Evidently AI / Arize\n\n## Best Practices\n\n### Data Engineering\n- Design idempotent and rerunnable pipelines\n- Implement comprehensive data quality checks\n- Monitor pipeline health and data freshness\n- Use version control for all pipeline code\n- Document data sources, transformations, and SLAs\n- Optimize for cost and performance\n- Implement proper error handling and retries\n\n### Analytics Engineering\n- Follow dbt best practices (staging, intermediate, marts)\n- Test all critical business logic\n- Document models and columns thoroughly\n- Use consistent naming conventions\n- Leverage incremental models for large datasets\n- Implement code reviews for model changes\n- Monitor model performance and costs\n\n### Business Intelligence\n- Design dashboards for the audience\n- Optimize for performance (aggregations, extracts)\n- Implement proper access controls\n- Provide context and documentation in dashboards\n- Use appropriate visualizations for data types\n- Monitor dashboard usage and performance\n- Gather feedback from business users\n\n### Cross-Functional Collaboration\n- Establish data contracts between teams\n- Define SLAs for data availability and quality\n- Regular sync meetings between data teams\n- Shared documentation and knowledge base\n- Consistent tooling and standards\n- Clear ownership and accountability\n\n## Career Development\n\n### Learning Path\n\n**Beginner**:\n- SQL fundamentals and query optimization\n- Basic Python/scripting\n- Data warehouse concepts\n- BI tool basics (Tableau/Power BI)\n- Version control (Git)\n\n**Intermediate**:\n- Advanced SQL (window functions, CTEs, optimization)\n- dbt for transformation\n- Data modeling (dimensional, normalization)\n- Airflow or similar orchestration\n- Cloud data platforms 
(Snowflake/BigQuery)\n- Data quality frameworks\n\n**Advanced**:\n- Streaming data (Kafka, Flink)\n- Distributed systems concepts\n- Data architecture and system design\n- ML integration and feature stores\n- Infrastructure as code (Terraform)\n- Team leadership and mentorship\n\n### Key Competencies\n\n**Technical**:\n- SQL mastery\n- Programming (Python, Scala, Java)\n- Cloud platforms\n- Data modeling\n- ETL/ELT development\n- Performance optimization\n\n**Business**:\n- Domain knowledge\n- Stakeholder management\n- Requirement gathering\n- Communication and presentation\n- Project management\n- Strategic thinking\n\n**Soft Skills**:\n- Problem-solving\n- Collaboration\n- Documentation\n- Teaching and mentoring\n- Adaptability\n- Attention to detail\n\n## See Also\n\n- [references.md](./references.md) - Comprehensive tool and framework references\n- [../methodologies/](../methodologies/) - Process methodologies for data projects\n",
"documents": [
"specialization:data-engineering-analytics"
]
},
"outgoingEdges": [
{
"from": "page:library-data-engineering-analytics",
"to": "specialization:data-engineering-analytics",
"kind": "documents"
}
],
"incomingEdges": [
{
"from": "page:index",
"to": "page:library-data-engineering-analytics",
"kind": "contains_page"
}
]
}
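For readers working outside the atlas UI, the same record can be inspected programmatically. Below is a minimal Python sketch, assuming the payload above has been exported to a local JSON file (the file name is hypothetical); it loads the record and prints its identity, the size of the embedded article, and its graph edges.

import json

# Minimal sketch: load the record shown above from a local export and inspect it.
# "library-data-engineering-analytics.json" is a hypothetical file name; substitute
# whatever path the payload was saved to.
with open("library-data-engineering-analytics.json", encoding="utf-8") as f:
    record = json.load(f)

# Top-level identity and provenance fields.
print(record["id"], record["_kind"], record["_file"])

# The page body lives under attributes.article as a single Markdown string.
article = record["attributes"]["article"]
print(f"article length: {len(article)} characters")

# Graph context: edges leaving this page and edges pointing at it.
for edge in record["outgoingEdges"] + record["incomingEdges"]:
    print(f'{edge["from"]} -[{edge["kind"]}]-> {edge["to"]}')

The only fields relied on are id, _kind, _file, attributes.article, outgoingEdges, and incomingEdges, all of which appear in the payload above.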