Here’s a complete guide to building and managing data workflows in Azure Data Factory (ADF) — covering pipelines, triggers, linked services, integration runtimes, and best practices for real-world deployment.


🏗️ 1. What Is Azure Data Factory (ADF)?

ADF is a cloud-based ETL/ELT and orchestration service that lets you:

  • Connect to over 100 data sources
  • Transform, schedule, and orchestrate data pipelines
  • Integrate with Azure Databricks, Synapse, ADLS, SQL, etc.

🔄 2. Core Components of ADF

  • Pipelines: Group of data activities (ETL steps)
  • Activities: Each task (copy, Databricks notebook, stored procedure)
  • Linked Services: Connection configs to external systems
  • Datasets: Metadata pointers to source/destination tables/files
  • Integration Runtime (IR): Compute engine used for data movement and transformation
  • Triggers: Schedules or events that start pipeline execution
  • Parameters/Variables: Dynamic values for reusability and flexibility

⚙️ 3. Setup and Build: Step-by-Step


✅ Step 1: Create Linked Services

Linked services are connection definitions (e.g., to ADLS, Azure SQL, Databricks).

🔹 Example: Azure Data Lake Gen2 Linked Service

{
  "name": "LS_ADLS2",
  "properties": {
    "type": "AzureBlobFS",
    "typeProperties": {
      "url": "https://<storage-account>.dfs.core.windows.net",
      "servicePrincipalId": "<client-id>",
      "servicePrincipalKey": {
        "type": "AzureKeyVaultSecret",
        "store": { "referenceName": "AzureKeyVault1", "type": "LinkedServiceReference" },
        "secretName": "adls-key"
      },
      "tenant": "<tenant-id>"
    }
  }
}
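
The Databricks notebook activity in Step 3 references a linked service named LS_Databricks. A minimal sketch of such a linked service is shown below; the workspace URL, cluster ID, and Key Vault secret name are placeholders to replace with your own values.

{
  "name": "LS_Databricks",
  "properties": {
    "type": "AzureDatabricks",
    "typeProperties": {
      "domain": "https://<workspace-url>.azuredatabricks.net",
      "existingClusterId": "<cluster-id>",
      "accessToken": {
        "type": "AzureKeyVaultSecret",
        "store": { "referenceName": "AzureKeyVault1", "type": "LinkedServiceReference" },
        "secretName": "databricks-token"
      }
    }
  }
}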

✅ Step 2: Create Datasets

Datasets define the schema or file path structure of your source and target.

🔹 Example: ADLS CSV Dataset

{
  "name": "ds_sales_csv",
  "properties": {
    "linkedServiceName": {
      "referenceName": "LS_ADLS2",
      "type": "LinkedServiceReference"
    },
    "type": "DelimitedText",
    "typeProperties": {
      "location": {
        "type": "FileSystemLocation",
        "fileSystem": "raw",
        "folderPath": "sales/",
        "fileName": "*.csv"
      },
      "columnDelimiter": ","
    }
  }
}
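
The pipeline example later in this guide also needs a sink dataset for the bronze zone. A hypothetical Parquet dataset on the same linked service is sketched below (Delta output is usually written by the Databricks notebook itself or via a mapping data flow inline dataset, so Parquet is used here purely for illustration).

{
  "name": "ds_sales_bronze_parquet",
  "properties": {
    "linkedServiceName": {
      "referenceName": "LS_ADLS2",
      "type": "LinkedServiceReference"
    },
    "type": "Parquet",
    "typeProperties": {
      "location": {
        "type": "AzureBlobFSLocation",
        "fileSystem": "bronze",
        "folderPath": "sales/"
      },
      "compressionCodec": "snappy"
    }
  }
}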

✅ Step 3: Create Pipeline and Add Activities

A pipeline can include:

  • Copy Activity
  • Databricks Notebook
  • Lookup, ForEach, Web Activity
  • If Condition, Wait, Set Variable

🔹 Example: Run Databricks Notebook in Pipeline

{
  "name": "RunDatabricksNotebook",
  "type": "DatabricksNotebook",
  "typeProperties": {
    "notebookPath": "/Repos/bronze_ingestion_sales",
    "baseParameters": {
      "input_path": "raw/sales",
      "output_path": "bronze/sales"
    }
  },
  "linkedServiceName": {
    "referenceName": "LS_Databricks",
    "type": "LinkedServiceReference"
  }
}
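
For comparison, here is a sketch of a Copy Activity that moves the raw CSVs into the bronze container. Note that wildcard filtering is normally configured on the copy source (wildcardFileName) rather than in the dataset's fileName, and the sink dataset (ds_sales_bronze_parquet) is the hypothetical one sketched in Step 2.

{
  "name": "CopyRawSalesToBronze",
  "type": "Copy",
  "inputs": [ { "referenceName": "ds_sales_csv", "type": "DatasetReference" } ],
  "outputs": [ { "referenceName": "ds_sales_bronze_parquet", "type": "DatasetReference" } ],
  "typeProperties": {
    "source": {
      "type": "DelimitedTextSource",
      "storeSettings": {
        "type": "AzureBlobFSReadSettings",
        "recursive": true,
        "wildcardFolderPath": "sales",
        "wildcardFileName": "*.csv"
      }
    },
    "sink": {
      "type": "ParquetSink",
      "storeSettings": { "type": "AzureBlobFSWriteSettings" }
    }
  }
}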

✅ Step 4: Integration Runtime (IR)

The integration runtime (IR) is the compute infrastructure ADF uses to run activities and move data.

  • AutoResolve IR (default): Serverless compute in the cloud
  • Self-hosted IR: On-prem data movement
  • Azure-SSIS IR: Run SSIS packages

Usually, no setup is needed unless you are:

  • Working with on-premises data sources
  • Using custom compute
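
For example, a linked service to an on-premises SQL Server would point at a self-hosted IR through a connectVia reference. The IR name and connection string below are placeholders.

{
  "name": "LS_OnPremSQL",
  "properties": {
    "type": "SqlServer",
    "connectVia": {
      "referenceName": "SelfHostedIR_OnPrem",
      "type": "IntegrationRuntimeReference"
    },
    "typeProperties": {
      "connectionString": "Server=<on-prem-server>;Database=<db>;Integrated Security=True;"
    }
  }
}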

✅ Step 5: Add Trigger

  • Schedule Trigger: Run daily/hourly/cron schedules
  • Event Trigger: Fire on file arrival in Blob/ADLS
  • Manual Trigger: For testing

🔹 Example: Schedule Trigger (Daily at 2 AM)

{
  "name": "trigger_daily_2am",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2025-07-09T02:00:00",
        "timeZone": "India Standard Time"
      }
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "SalesPipeline",
          "type": "PipelineReference"
        }
      }
    ]
  }
}
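
An event trigger that fires when a new sales file lands in the raw container could be sketched as follows; the subscription and resource group in the scope, as well as the blob paths, are placeholders.

{
  "name": "trigger_on_sales_file",
  "properties": {
    "type": "BlobEventsTrigger",
    "typeProperties": {
      "scope": "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<storage-account>",
      "events": [ "Microsoft.Storage.BlobCreated" ],
      "blobPathBeginsWith": "/raw/blobs/sales/",
      "blobPathEndsWith": ".csv",
      "ignoreEmptyBlobs": true
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "SalesPipeline",
          "type": "PipelineReference"
        }
      }
    ]
  }
}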

📦 Example: ETL Pipeline to Run Databricks Bronze Layer

  1. Linked Services
    • ADLS Gen2
    • Azure Databricks
  2. Datasets
    • Raw sales CSV
    • Bronze Delta folder
  3. Pipeline Activities
    • Lookup for files
    • Run Databricks notebook
    • Email failure alert
  4. Trigger
    • Schedule daily at midnight
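
A skeletal pipeline definition wiring these activities together might look like the following; the failure alert is sketched as a Web activity posting to a hypothetical Logic App/webhook URL, and the dependsOn conditions control the success/failure paths.

{
  "name": "SalesPipeline",
  "properties": {
    "activities": [
      {
        "name": "LookupSalesFiles",
        "type": "Lookup",
        "typeProperties": {
          "source": { "type": "DelimitedTextSource" },
          "dataset": { "referenceName": "ds_sales_csv", "type": "DatasetReference" },
          "firstRowOnly": false
        }
      },
      {
        "name": "RunBronzeNotebook",
        "type": "DatabricksNotebook",
        "dependsOn": [ { "activity": "LookupSalesFiles", "dependencyConditions": [ "Succeeded" ] } ],
        "typeProperties": { "notebookPath": "/Repos/bronze_ingestion_sales" },
        "linkedServiceName": { "referenceName": "LS_Databricks", "type": "LinkedServiceReference" }
      },
      {
        "name": "NotifyOnFailure",
        "type": "WebActivity",
        "dependsOn": [ { "activity": "RunBronzeNotebook", "dependencyConditions": [ "Failed" ] } ],
        "typeProperties": {
          "url": "<logic-app-or-webhook-url>",
          "method": "POST",
          "body": { "message": "SalesPipeline bronze load failed" }
        }
      }
    ]
  }
}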

🧾 Monitoring & Logs

  • Go to Monitor > Pipeline Runs
  • View:
    • Status
    • Duration
    • Activity logs
    • Input/output parameters
  • Set Alerts in Azure Monitor for failed runs

🔐 Security + Governance

  • Key Vault Integration: Use secrets in linked services
  • Managed Identity: For secure storage access
  • Role-Based Access Control (RBAC): Granular access to pipelines
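
As an example of managed identity in practice, the ADLS linked service from Step 1 can drop the service principal secret entirely: when no credential is supplied, ADF can authenticate with the factory's managed identity, assuming that identity has been granted a role such as Storage Blob Data Contributor on the account. This is a sketch, not a drop-in replacement for every scenario.

{
  "name": "LS_ADLS2_MI",
  "properties": {
    "type": "AzureBlobFS",
    "typeProperties": {
      "url": "https://<storage-account>.dfs.core.windows.net"
    }
  }
}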

🧠 Best Practices

  • Reusability: Use parameters and global variables in pipelines
  • Dev/Test/Prod: Separate environments with config-based switches
  • Error Handling: Use If Condition + Fail Activity blocks
  • Modularity: Create small pipelines and call them via Execute Pipeline
  • Logging: Use log tables or storage containers with activity metadata
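
As a concrete illustration of the reusability tip, a pipeline parameter can be passed straight into the Databricks activity through an expression; the parameter name processing_date below is illustrative.

{
  "parameters": {
    "processing_date": { "type": "String", "defaultValue": "2025-07-09" }
  },
  "activities": [
    {
      "name": "RunDatabricksNotebook",
      "type": "DatabricksNotebook",
      "typeProperties": {
        "notebookPath": "/Repos/bronze_ingestion_sales",
        "baseParameters": {
          "run_date": "@pipeline().parameters.processing_date"
        }
      },
      "linkedServiceName": { "referenceName": "LS_Databricks", "type": "LinkedServiceReference" }
    }
  ]
}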

📎 Bonus: CI/CD with ADF (Git Integration)

  • Connect ADF to Azure DevOps Git
  • All pipelines, datasets, and linked services are version-controlled as JSON
  • Create release pipelines to publish to different environments
