To standardize the process of updating the metadata table and include version control, we can implement the following:
1. Metadata Management Approach
- Version Control in Metadata Table:
- Add a
version
column to track changes to queries or logic. - Use a
status
column to differentiate betweenactive
andinactive
records. - Introduce
created_at
andupdated_at
timestamps for auditing purposes.
- Add a
- Standardized Update Process:
- Use a template or form for metadata updates.
- Maintain a change log to document changes in logic and their rationale.
- Review & Approval Workflow:
- Require changes to metadata to go through a review process before activation.
- Store updates in a staging table before moving them to the active metadata table.
- Version History:
- Maintain all previous versions of the metadata in the same table (using
version
andstatus
). - Allow rollback to previous versions if needed.
- Maintain all previous versions of the metadata in the same table (using
2. Updated Metadata Table Schema
CREATE TABLE IF NOT EXISTS metadatatable (
program_name STRING,
product_name STRING,
stage_name STRING,
step_name STRING,
version INT DEFAULT 1, -- Version number
status STRING DEFAULT 'active', -- active or inactive
operation_type STRING, -- SQL, DataFrame
query TEXT,
custom_logic TEXT,
temp_view_name STRING,
table_name STRING,
write_mode STRING, -- overwrite, append, archive, snapshot
snapshot_mode STRING,
stage_priority INT,
steps_priority INT,
month_year STRING,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
3. Version Control Logic
a) Adding a New Query
When adding a new step or query:
- Increment Version: Ensure the
version
for the new step is1
. - Set Status to Active: Mark the new step as
active
.
Example:
INSERT INTO metadatatable (
program_name, product_name, stage_name, step_name, version, status, operation_type, query,
custom_logic, temp_view_name, table_name, write_mode, snapshot_mode, stage_priority,
steps_priority, month_year, created_at, updated_at
)
VALUES (
'Risk Program', 'Personal Loan', 'Stage 1', 'New Logic Step', 1, 'active', 'SQL',
'SELECT * FROM transactions WHERE risk_score > 700',
NULL, 'filtered_view', 'high_risk_table', 'overwrite', NULL, 1, 3, '202412', CURRENT_TIMESTAMP, CURRENT_TIMESTAMP
);
b) Modifying an Existing Query
When updating an existing step:
- Set the Old Step to Inactive: Update the
status
of the current step toinactive
. - Insert a New Version: Insert the new query with an incremented
version
.
Example:
-- Mark the existing query as inactive
UPDATE metadatatable
SET status = 'inactive', updated_at = CURRENT_TIMESTAMP
WHERE program_name = 'Risk Program' AND product_name = 'Personal Loan'
AND stage_name = 'Stage 1' AND step_name = 'New Logic Step' AND status = 'active';
-- Insert the updated query with a new version
INSERT INTO metadatatable (
program_name, product_name, stage_name, step_name, version, status, operation_type, query,
custom_logic, temp_view_name, table_name, write_mode, snapshot_mode, stage_priority,
steps_priority, month_year, created_at, updated_at
)
VALUES (
'Risk Program', 'Personal Loan', 'Stage 1', 'New Logic Step', 2, 'active', 'SQL',
'SELECT * FROM transactions WHERE risk_score > 750',
NULL, 'filtered_view', 'high_risk_table', 'overwrite', NULL, 1, 3, '202412', CURRENT_TIMESTAMP, CURRENT_TIMESTAMP
);
c) Rolling Back to a Previous Version
To roll back to a previous version:
- Set Current Version to Inactive: Mark the current version as inactive.
- Reactivate the Previous Version: Update the status of the required version to
active
.
Example:
-- Mark the current version as inactive
UPDATE metadatatable
SET status = 'inactive', updated_at = CURRENT_TIMESTAMP
WHERE program_name = 'Risk Program' AND product_name = 'Personal Loan'
AND stage_name = 'Stage 1' AND step_name = 'New Logic Step' AND status = 'active';
-- Reactivate the previous version
UPDATE metadatatable
SET status = 'active', updated_at = CURRENT_TIMESTAMP
WHERE program_name = 'Risk Program' AND product_name = 'Personal Loan'
AND stage_name = 'Stage 1' AND step_name = 'New Logic Step' AND version = 1;
4. Automation of Metadata Updates
a) Python Script for Metadata Updates
Use a script to automate updates to the metadata table.
from pyspark.sql import SparkSession
from datetime import datetime
# Initialize SparkSession
spark = SparkSession.builder
.appName("Metadata Update Process")
.enableHiveSupport()
.getOrCreate()
def update_metadata(metadata_table, changes):
for change in changes:
step_name = change["step_name"]
program_name = change["program_name"]
product_name = change["product_name"]
stage_name = change["stage_name"]
new_query = change["query"]
new_custom_logic = change.get("custom_logic", None)
version = change.get("version", None)
# Mark the current version inactive
spark.sql(f"""
UPDATE {metadata_table}
SET status = 'inactive', updated_at = CURRENT_TIMESTAMP
WHERE program_name = '{program_name}' AND product_name = '{product_name}'
AND stage_name = '{stage_name}' AND step_name = '{step_name}' AND status = 'active'
""")
# Insert the new version
spark.sql(f"""
INSERT INTO {metadata_table} (
program_name, product_name, stage_name, step_name, version, status,
operation_type, query, custom_logic, temp_view_name, table_name, write_mode,
snapshot_mode, stage_priority, steps_priority, month_year, created_at, updated_at
)
SELECT
'{program_name}', '{product_name}', '{stage_name}', '{step_name}',
COALESCE(MAX(version), 0) + 1 AS version, 'active',
operation_type, '{new_query}', '{new_custom_logic}', temp_view_name,
table_name, write_mode, snapshot_mode, stage_priority, steps_priority,
month_year, CURRENT_TIMESTAMP, CURRENT_TIMESTAMP
FROM {metadata_table}
WHERE program_name = '{program_name}' AND product_name = '{product_name}'
AND stage_name = '{stage_name}' AND step_name = '{step_name}'
""")
# Example Changes
changes = [
{
"program_name": "Risk Program",
"product_name": "Personal Loan",
"stage_name": "Stage 1",
"step_name": "New Logic Step",
"query": "SELECT * FROM transactions WHERE risk_score > 800",
"custom_logic": None
}
]
# Update Metadata Table
update_metadata("metadatatable", changes)
5. Best Practices
- Audit Trail:
- Always retain previous versions for audit and rollback purposes.
- Validation:
- Validate all new queries or logic before updating the metadata.
- Testing:
- Maintain a test environment to validate metadata changes before promoting them to production.
- Automation:
- Use scripts to automate metadata updates, ensuring consistency and reducing manual errors.
- Change Log:
- Maintain a separate table or document to log metadata changes, including who made the changes and why.
This framework ensures transparency, traceability, and rollback capabilities for metadata management in your ETL process.