Here’s a complete guide to building and managing data workflows in Azure Data Factory (ADF) — covering pipelines, triggers, linked services, integration runtimes, and best practices for real-world deployment.
🏗️ 1. What Is Azure Data Factory (ADF)?
ADF is a cloud-based ETL/ELT and orchestration service that lets you:
- Connect to over 100 data sources
- Transform, schedule, and orchestrate data pipelines
- Integrate with Azure Databricks, Synapse, ADLS, SQL, etc.
🔄 2. Core Components of ADF
Component | Description |
---|---|
Pipelines | A group of activities that together form an ETL/ELT workflow |
Activities | Individual tasks within a pipeline (Copy, Databricks notebook, stored procedure) |
Linked Services | Connection configs to external systems |
Datasets | Metadata pointers to source/destination tables/files |
Integration Runtime (IR) | Compute engine used for data movement and transformation |
Triggers | Schedules or events to start pipeline execution |
Parameters/Variables | Dynamic values for reusability and flexibility |
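To make the last row concrete, here is a minimal sketch of a pipeline that declares a parameter and copies it into a variable through an expression. The pipeline, parameter, and variable names are illustrative placeholders, not objects defined elsewhere in this guide.
{
  "name": "pl_ingest_sales",
  "properties": {
    "parameters": {
      "run_date": { "type": "string", "defaultValue": "2025-07-09" }
    },
    "variables": {
      "source_folder": { "type": "String" }
    },
    "activities": [
      {
        "name": "SetSourceFolder",
        "type": "SetVariable",
        "typeProperties": {
          "variableName": "source_folder",
          "value": {
            "value": "@concat('sales/', pipeline().parameters.run_date)",
            "type": "Expression"
          }
        }
      }
    ]
  }
}
A trigger can pass run_date at execution time, so the same pipeline serves both ad-hoc reruns and scheduled loads.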
⚙️ 3. Setup and Build: Step-by-Step
✅ Step 1: Create Linked Services
Linked services are connection definitions (e.g., to ADLS, Azure SQL, Databricks).
🔹 Example: Azure Data Lake Gen2 Linked Service
{
  "name": "LS_ADLS2",
  "properties": {
    "type": "AzureBlobFS",
    "typeProperties": {
      "url": "https://<storage-account>.dfs.core.windows.net",
      "servicePrincipalId": "<client-id>",
      "servicePrincipalKey": {
        "type": "AzureKeyVaultSecret",
        "store": { "referenceName": "AzureKeyVault1", "type": "LinkedServiceReference" },
        "secretName": "adls-key"
      },
      "tenant": "<tenant-id>"
    }
  }
}
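The Databricks linked service (LS_Databricks) referenced later in this guide can be defined the same way. A sketch, assuming token-based authentication with the token stored in Key Vault and an existing interactive cluster; the workspace URL, secret name, and cluster ID are placeholders:
{
  "name": "LS_Databricks",
  "properties": {
    "type": "AzureDatabricks",
    "typeProperties": {
      "domain": "https://<databricks-workspace-url>",
      "accessToken": {
        "type": "AzureKeyVaultSecret",
        "store": { "referenceName": "AzureKeyVault1", "type": "LinkedServiceReference" },
        "secretName": "databricks-token"
      },
      "existingClusterId": "<existing-cluster-id>"
    }
  }
}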
✅ Step 2: Create Datasets
Datasets define the schema or file path structure of your source and target.
🔹 Example: ADLS CSV Dataset
{
  "name": "ds_sales_csv",
  "properties": {
    "linkedServiceName": {
      "referenceName": "LS_ADLS2",
      "type": "LinkedServiceReference"
    },
    "type": "DelimitedText",
    "typeProperties": {
      "location": {
        "type": "AzureBlobFSLocation",
        "fileSystem": "raw",
        "folderPath": "sales/",
        "fileName": "*.csv"
      },
      "columnDelimiter": ","
    }
  }
}
✅ Step 3: Create Pipeline and Add Activities
A pipeline can include:
- Copy Activity
- Databricks Notebook
- Lookup, ForEach, Web Activity
- If Condition, Wait, Set Variable
🔹 Example: Run Databricks Notebook in Pipeline
{
  "name": "RunDatabricksNotebook",
  "type": "DatabricksNotebook",
  "typeProperties": {
    "notebookPath": "/Repos/bronze_ingestion_sales",
    "baseParameters": {
      "input_path": "raw/sales",
      "output_path": "bronze/sales"
    }
  },
  "linkedServiceName": {
    "referenceName": "LS_Databricks",
    "type": "LinkedServiceReference"
  }
}
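The Copy Activity listed above follows the same pattern. A minimal sketch that copies the CSV dataset into a hypothetical Parquet sink dataset (ds_sales_parquet is not defined in this guide); the wildcard filter is applied in the source store settings:
{
  "name": "CopyRawSales",
  "type": "Copy",
  "typeProperties": {
    "source": {
      "type": "DelimitedTextSource",
      "storeSettings": {
        "type": "AzureBlobFSReadSettings",
        "recursive": true,
        "wildcardFileName": "*.csv"
      }
    },
    "sink": {
      "type": "ParquetSink",
      "storeSettings": { "type": "AzureBlobFSWriteSettings" }
    }
  },
  "inputs": [ { "referenceName": "ds_sales_csv", "type": "DatasetReference" } ],
  "outputs": [ { "referenceName": "ds_sales_parquet", "type": "DatasetReference" } ]
}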
✅ Step 4: Integration Runtime (IR)
The integration runtime is the compute infrastructure ADF uses to run activities and move data.
IR Type | Use Case |
---|---|
AutoResolve IR (default) | Serverless compute in cloud |
Self-hosted IR | On-prem data movement |
Azure SSIS IR | Run SSIS packages |
Usually, no extra IR setup is needed unless you are:
- Working with on-premises data sources (Self-hosted IR)
- Using custom or dedicated compute (e.g., SSIS IR)
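If you do need a self-hosted IR, its JSON definition is short; the name and description below are placeholders, and the on-premises node is registered afterwards by installing the IR agent and entering the generated authentication key:
{
  "name": "SelfHostedIR",
  "properties": {
    "type": "SelfHosted",
    "description": "Runtime for on-premises data sources"
  }
}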
✅ Step 5: Add Trigger
Trigger Type | Use Case |
---|---|
Schedule Trigger | Run daily/hourly/cron |
Event Trigger | File arrival in Blob/ADLS |
Manual Trigger | For testing |
🔹 Example: Schedule Trigger (Daily at 2 AM)
{
  "name": "trigger_daily_2am",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2025-07-09T02:00:00",
        "timeZone": "India Standard Time"
      }
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "SalesPipeline",
          "type": "PipelineReference"
        }
      }
    ]
  }
}
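For the event-based case, a sketch of a storage event trigger that fires when a new CSV lands under raw/sales; the subscription, resource group, and storage account in the scope are placeholders:
{
  "name": "trigger_on_sales_file",
  "properties": {
    "type": "BlobEventsTrigger",
    "typeProperties": {
      "blobPathBeginsWith": "/raw/blobs/sales/",
      "blobPathEndsWith": ".csv",
      "ignoreEmptyBlobs": true,
      "events": [ "Microsoft.Storage.BlobCreated" ],
      "scope": "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>"
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "SalesPipeline",
          "type": "PipelineReference"
        }
      }
    ]
  }
}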
📦 Example: ETL Pipeline to Run Databricks Bronze Layer
- Linked Services
  - ADLS Gen2
  - Azure Databricks
- Datasets
  - Raw sales CSV
  - Bronze Delta folder
- Pipeline Activities
  - Lookup for files
  - Run Databricks notebook
  - Email failure alert (see the sketch after this list)
- Trigger
  - Schedule daily at midnight
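A sketch of how the activities array for this pipeline could be wired, in particular the failure alert: the Web Activity depends on the notebook run with a Failed condition, so it only fires when ingestion breaks. The activity names and the Logic App endpoint URL are placeholders.
"activities": [
  {
    "name": "LookupSalesFiles",
    "type": "Lookup",
    "typeProperties": {
      "source": { "type": "DelimitedTextSource" },
      "dataset": { "referenceName": "ds_sales_csv", "type": "DatasetReference" },
      "firstRowOnly": true
    }
  },
  {
    "name": "RunBronzeNotebook",
    "type": "DatabricksNotebook",
    "dependsOn": [
      { "activity": "LookupSalesFiles", "dependencyConditions": [ "Succeeded" ] }
    ],
    "typeProperties": { "notebookPath": "/Repos/bronze_ingestion_sales" },
    "linkedServiceName": { "referenceName": "LS_Databricks", "type": "LinkedServiceReference" }
  },
  {
    "name": "SendFailureAlert",
    "type": "WebActivity",
    "dependsOn": [
      { "activity": "RunBronzeNotebook", "dependencyConditions": [ "Failed" ] }
    ],
    "typeProperties": {
      "url": "https://<logic-app-endpoint>",
      "method": "POST",
      "body": { "message": "Bronze ingestion for sales failed" }
    }
  }
]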
🧾 Monitoring & Logs
- Go to Monitor > Pipeline Runs
- View:
  - Status
  - Duration
  - Activity logs
  - Input/output parameters
- Set Alerts in Azure Monitor for failed runs
🔐 Security + Governance
Feature | Notes |
---|---|
Key Vault Integration | Use secrets in linked services |
Managed Identity | For secure storage access |
Role-Based Access (RBAC) | Granular access to pipelines |
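With managed identity, for example, the ADLS Gen2 linked service from Step 1 needs no stored credential at all: grant the factory's system-assigned identity a Storage Blob Data role on the account, and the typeProperties reduce to the URL. A minimal sketch:
{
  "name": "LS_ADLS2_MI",
  "properties": {
    "type": "AzureBlobFS",
    "typeProperties": {
      "url": "https://<storage-account>.dfs.core.windows.net"
    }
  }
}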
🧠 Best Practices
Area | Tip |
---|---|
Reusability | Use parameters and global variables in pipelines |
Dev/Test/Prod | Separate environments with config-based switches |
Error Handling | Use If Condition + Fail Activity blocks |
Modularity | Create small pipelines and call via Execute Pipeline |
Logging | Use Log tables or storage containers with activity metadata |
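Two of these tips map directly to activity JSON. A sketch of an Execute Pipeline call (modularity) and a Fail activity (error handling), as they would appear inside a pipeline's activities array; the names, message, and error code are placeholders:
{
  "name": "CallBronzePipeline",
  "type": "ExecutePipeline",
  "typeProperties": {
    "pipeline": { "referenceName": "SalesPipeline", "type": "PipelineReference" },
    "waitOnCompletion": true
  }
},
{
  "name": "FailOnEmptySource",
  "type": "Fail",
  "typeProperties": {
    "message": "No files found in raw/sales",
    "errorCode": "EMPTY_SOURCE"
  }
}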
📎 Bonus: CI/CD with ADF (Git Integration)
- Connect ADF to Azure DevOps Git
- All pipelines, datasets, and linked services are version-controlled as JSON
- Create release pipelines to publish to different environments
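The Git link itself lives on the factory resource as a repoConfiguration block; a sketch for Azure DevOps Git, with the organization, project, and repository names as placeholders:
"repoConfiguration": {
  "type": "FactoryVSTSConfiguration",
  "accountName": "<devops-organization>",
  "projectName": "<project>",
  "repositoryName": "<repository>",
  "collaborationBranch": "main",
  "rootFolder": "/"
}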