Here’s a complete guide to building and managing data workflows in Azure Data Factory (ADF) — covering pipelines, triggers, linked services, integration runtimes, and best practices for real-world deployment.
🏗️ 1. What Is Azure Data Factory (ADF)?
ADF is a cloud-based ETL/ELT and orchestration service that lets you:
- Connect to over 100 data sources
- Transform, schedule, and orchestrate data pipelines
- Integrate with Azure Databricks, Synapse, ADLS, SQL, etc.
🔄 2. Core Components of ADF
Component | Description |
---|---|
Pipelines | A group of activities that together form an ETL/ELT workflow |
Activities | Individual tasks within a pipeline (Copy, Databricks notebook, stored procedure) |
Linked Services | Connection configs to external systems |
Datasets | Metadata pointers to source/destination tables/files |
Integration Runtime (IR) | Compute engine used for data movement and transformation |
Triggers | Schedules or events to start pipeline execution |
Parameters/Variables | Dynamic values for reusability and flexibility |
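To make the last row concrete, here is a minimal sketch of a pipeline that declares a parameter and copies it into a variable through an expression. The pipeline, parameter, and variable names are illustrative placeholders, not objects defined elsewhere in this guide.
{
  "name": "pl_ingest_sales",
  "properties": {
    "parameters": {
      "run_date": { "type": "string", "defaultValue": "2025-07-09" }
    },
    "variables": {
      "source_folder": { "type": "String" }
    },
    "activities": [
      {
        "name": "SetSourceFolder",
        "type": "SetVariable",
        "typeProperties": {
          "variableName": "source_folder",
          "value": {
            "value": "@concat('sales/', pipeline().parameters.run_date)",
            "type": "Expression"
          }
        }
      }
    ]
  }
}
A trigger can pass run_date at execution time, so the same pipeline serves both ad-hoc reruns and scheduled loads.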
⚙️ 3. Setup and Build: Step-by-Step
✅ Step 1: Create Linked Services
Linked services are connection definitions (e.g., to ADLS, Azure SQL, Databricks).
🔹 Example: Azure Data Lake Gen2 Linked Service
{
  "name": "LS_ADLS2",
  "properties": {
    "type": "AzureBlobFS",
    "typeProperties": {
      "url": "https://<storage-account>.dfs.core.windows.net",
      "servicePrincipalId": "<client-id>",
      "servicePrincipalKey": {
        "type": "AzureKeyVaultSecret",
        "store": { "referenceName": "AzureKeyVault1", "type": "LinkedServiceReference" },
        "secretName": "adls-key"
      },
      "tenant": "<tenant-id>"
    }
  }
}
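The Databricks linked service (LS_Databricks) referenced later in this guide can be defined the same way. A sketch, assuming token-based authentication with the token stored in Key Vault and an existing interactive cluster; the workspace URL, secret name, and cluster ID are placeholders:
{
  "name": "LS_Databricks",
  "properties": {
    "type": "AzureDatabricks",
    "typeProperties": {
      "domain": "https://<databricks-workspace-url>",
      "accessToken": {
        "type": "AzureKeyVaultSecret",
        "store": { "referenceName": "AzureKeyVault1", "type": "LinkedServiceReference" },
        "secretName": "databricks-token"
      },
      "existingClusterId": "<existing-cluster-id>"
    }
  }
}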
✅ Step 2: Create Datasets
Datasets define the schema or file path structure of your source and target.
🔹 Example: ADLS CSV Dataset
{
  "name": "ds_sales_csv",
  "properties": {
    "linkedServiceName": {
      "referenceName": "LS_ADLS2",
      "type": "LinkedServiceReference"
    },
    "type": "DelimitedText",
    "typeProperties": {
      "location": {
        "type": "AzureBlobFSLocation",
        "fileSystem": "raw",
        "folderPath": "sales/",
        "fileName": "*.csv"
      },
      "columnDelimiter": ","
    }
  }
}
✅ Step 3: Create Pipeline and Add Activities
A pipeline can include:
- Copy Activity
- Databricks Notebook
- Lookup, ForEach, Web Activity
- If Condition, Wait, Set Variable
🔹 Example: Run Databricks Notebook in Pipeline
{
  "name": "RunDatabricksNotebook",
  "type": "DatabricksNotebook",
  "typeProperties": {
    "notebookPath": "/Repos/bronze_ingestion_sales",
    "baseParameters": {
      "input_path": "raw/sales",
      "output_path": "bronze/sales"
    }
  },
  "linkedServiceName": {
    "referenceName": "LS_Databricks",
    "type": "LinkedServiceReference"
  }
}
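The Copy Activity listed above follows the same pattern. A minimal sketch that copies the CSV dataset into a hypothetical Parquet sink dataset (ds_sales_parquet is not defined in this guide); the wildcard filter is applied in the source store settings:
{
  "name": "CopyRawSales",
  "type": "Copy",
  "typeProperties": {
    "source": {
      "type": "DelimitedTextSource",
      "storeSettings": {
        "type": "AzureBlobFSReadSettings",
        "recursive": true,
        "wildcardFileName": "*.csv"
      }
    },
    "sink": {
      "type": "ParquetSink",
      "storeSettings": { "type": "AzureBlobFSWriteSettings" }
    }
  },
  "inputs": [ { "referenceName": "ds_sales_csv", "type": "DatasetReference" } ],
  "outputs": [ { "referenceName": "ds_sales_parquet", "type": "DatasetReference" } ]
}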
✅ Step 4: Integration Runtime (IR)
The integration runtime is the compute infrastructure ADF uses to run activities and move data.
IR Type | Use Case |
---|---|
AutoResolve IR (default) | Serverless compute in cloud |
Self-hosted IR | On-prem data movement |
Azure SSIS IR | Run SSIS packages |
Usually, no extra IR setup is needed unless you are:
- Working with on-premises data sources (Self-hosted IR)
- Using custom or dedicated compute (e.g., SSIS IR)
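If you do need a self-hosted IR, its JSON definition is short; the name and description below are placeholders, and the on-premises node is registered afterwards by installing the IR agent and entering the generated authentication key:
{
  "name": "SelfHostedIR",
  "properties": {
    "type": "SelfHosted",
    "description": "Runtime for on-premises data sources"
  }
}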
✅ Step 5: Add Trigger
Trigger Type | Use Case |
---|---|
Schedule Trigger | Run daily/hourly/cron |
Event Trigger | File arrival in Blob/ADLS |
Manual Trigger | For testing |
🔹 Example: Schedule Trigger (Daily at 2 AM)
{
  "name": "trigger_daily_2am",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2025-07-09T02:00:00",
        "timeZone": "India Standard Time"
      }
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "SalesPipeline",
          "type": "PipelineReference"
        }
      }
    ]
  }
}
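For the event-based case, a sketch of a storage event trigger that fires when a new CSV lands under raw/sales; the subscription, resource group, and storage account in the scope are placeholders:
{
  "name": "trigger_on_sales_file",
  "properties": {
    "type": "BlobEventsTrigger",
    "typeProperties": {
      "blobPathBeginsWith": "/raw/blobs/sales/",
      "blobPathEndsWith": ".csv",
      "ignoreEmptyBlobs": true,
      "events": [ "Microsoft.Storage.BlobCreated" ],
      "scope": "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>"
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "SalesPipeline",
          "type": "PipelineReference"
        }
      }
    ]
  }
}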
📦 Example: ETL Pipeline to Run Databricks Bronze Layer
- Linked Services
  - ADLS Gen2
  - Azure Databricks
- Datasets
  - Raw sales CSV
  - Bronze Delta folder
- Pipeline Activities
  - Lookup for files
  - Run Databricks notebook
  - Email failure alert (see the sketch after this list)
- Trigger
  - Schedule daily at midnight
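A sketch of how the activities array for this pipeline could be wired, in particular the failure alert: the Web Activity depends on the notebook run with a Failed condition, so it only fires when ingestion breaks. The activity names and the Logic App endpoint URL are placeholders.
"activities": [
  {
    "name": "LookupSalesFiles",
    "type": "Lookup",
    "typeProperties": {
      "source": { "type": "DelimitedTextSource" },
      "dataset": { "referenceName": "ds_sales_csv", "type": "DatasetReference" },
      "firstRowOnly": true
    }
  },
  {
    "name": "RunBronzeNotebook",
    "type": "DatabricksNotebook",
    "dependsOn": [
      { "activity": "LookupSalesFiles", "dependencyConditions": [ "Succeeded" ] }
    ],
    "typeProperties": { "notebookPath": "/Repos/bronze_ingestion_sales" },
    "linkedServiceName": { "referenceName": "LS_Databricks", "type": "LinkedServiceReference" }
  },
  {
    "name": "SendFailureAlert",
    "type": "WebActivity",
    "dependsOn": [
      { "activity": "RunBronzeNotebook", "dependencyConditions": [ "Failed" ] }
    ],
    "typeProperties": {
      "url": "https://<logic-app-endpoint>",
      "method": "POST",
      "body": { "message": "Bronze ingestion for sales failed" }
    }
  }
]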
🧾 Monitoring & Logs
- Go to Monitor > Pipeline Runs
- View:
  - Status
  - Duration
  - Activity logs
  - Input/output parameters
- Set Alerts in Azure Monitor for failed runs
🔐 Security + Governance
Feature | Notes |
---|---|
Key Vault Integration | Use secrets in linked services |
Managed Identity | For secure storage access |
Role-Based Access (RBAC) | Granular access to pipelines |
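With managed identity, for example, the ADLS Gen2 linked service from Step 1 needs no stored credential at all: grant the factory's system-assigned identity a Storage Blob Data role on the account, and the typeProperties reduce to the URL. A minimal sketch:
{
  "name": "LS_ADLS2_MI",
  "properties": {
    "type": "AzureBlobFS",
    "typeProperties": {
      "url": "https://<storage-account>.dfs.core.windows.net"
    }
  }
}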
🧠 Best Practices
Area | Tip |
---|---|
Reusability | Use parameters and global variables in pipelines |
Dev/Test/Prod | Separate environments with config-based switches |
Error Handling | Use If Condition + Fail Activity blocks |
Modularity | Create small pipelines and call via Execute Pipeline |
Logging | Use Log tables or storage containers with activity metadata |
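Two of these tips map directly to activity JSON. A sketch of an Execute Pipeline call (modularity) and a Fail activity (error handling), as they would appear inside a pipeline's activities array; the names, message, and error code are placeholders:
{
  "name": "CallBronzePipeline",
  "type": "ExecutePipeline",
  "typeProperties": {
    "pipeline": { "referenceName": "SalesPipeline", "type": "PipelineReference" },
    "waitOnCompletion": true
  }
},
{
  "name": "FailOnEmptySource",
  "type": "Fail",
  "typeProperties": {
    "message": "No files found in raw/sales",
    "errorCode": "EMPTY_SOURCE"
  }
}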
📎 Bonus: CI/CD with ADF (Git Integration)
- Connect ADF to Azure DevOps Git
- All pipelines, datasets, and linked services are version-controlled as JSON
- Create release pipelines to publish to different environments
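The Git link itself lives on the factory resource as a repoConfiguration block; a sketch for Azure DevOps Git, with the organization, project, and repository names as placeholders:
"repoConfiguration": {
  "type": "FactoryVSTSConfiguration",
  "accountName": "<devops-organization>",
  "projectName": "<project>",
  "repositoryName": "<repository>",
  "collaborationBranch": "main",
  "rootFolder": "/"
}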