Got it 👍 You want a realistic end-to-end project that covers:
- Fetch messy dataset from a free public API (CSV/Excel).
- Clean data → focus on string cleaning & manipulation (interview-useful).
- Store / serve cleaned data.
- Build a dashboard (Flask / FastAPI recommended) → interactive tables/graphs.
- Production-grade → modular, structured, easy to run locally.
🔹 Suggested Project: “World Population & Country Data Dashboard”
- Data Source (free API): REST Countries API → returns messy JSON (with nested data, inconsistent casing, missing fields).
- Alternative CSV source: World Bank population dataset → CSV.
🔹 Project Workflow
Step 1: Fetch Data
- Use requests to call the API.
- Save JSON → convert into a pandas DataFrame.
- Alternatively, download CSV from web.
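To make the JSON → DataFrame step concrete, here is a minimal sketch. The payload is hand-written to resemble nested API records (it is not real API output, and the numbers are placeholders), and `to_frame` is just an illustrative helper name.

```python
import pandas as pd

# Hand-written sample shaped like nested API records (assumption, not real output).
sample_payload = [
    {"name": {"common": " france "}, "region": "Europe",
     "capital": ["Paris"], "population": 100},
    {"name": {"common": "ANTARCTICA"}, "population": 5},  # region/capital missing
]


def to_frame(records):
    """Flatten nested JSON records into a flat DataFrame.

    Nested keys become dotted column names such as 'name.common';
    keys missing from a record become NaN in that row.
    """
    return pd.json_normalize(records)


df = to_frame(sample_payload)
print(df.columns.tolist())
print(df)
```

Note that list values (like capital) stay as Python lists after flattening, which is why the cleaning step has to deal with them explicitly.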
Step 2: Data Cleaning (String Manipulation)
- Standardize country names (strip spaces, title case).
- Handle missing values (fillna, dropna).
- Extract numeric parts from messy fields (e.g., population, area).
- Split / join fields (e.g., capital cities).
- Create derived columns (continent short codes, name lengths).
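Here is a rough sketch of those cleaning steps on a tiny made-up DataFrame. The rows, the column names, and the `clean` helper are all assumptions for illustration; real API data will need its own rules.

```python
import pandas as pd

# Made-up messy rows for illustration only; not real API output.
raw = pd.DataFrame({
    "name": ["  france ", "GERMANY", "south africa  ", None],
    "population": ["67,000,000", "approx. 83000000", "n/a", "1000"],
    "capital": [["Paris"], ["Berlin"],
                ["Pretoria", "Cape Town", "Bloemfontein"], None],
    "region": ["Europe", "europe ", None, "Antarctic"],
})


def clean(df):
    out = df.copy()
    # Drop rows that have no name at all, then standardize the rest.
    out = out.dropna(subset=["name"])
    out["name"] = out["name"].str.strip().str.title()
    # Fill missing regions before normalizing whitespace and casing.
    out["region"] = out["region"].fillna("Unknown").str.strip().str.title()
    # Remove thousands separators, then pull out the first run of digits.
    digits = (out["population"]
              .str.replace(",", "", regex=False)
              .str.extract(r"(\d+)", expand=False))
    out["population"] = pd.to_numeric(digits, errors="coerce")
    # Join list-valued capitals into a single readable string.
    out["capital"] = out["capital"].apply(
        lambda v: ", ".join(v) if isinstance(v, list) else "")
    # Derived columns: a three-letter region code and the name length.
    out["region_code"] = out["region"].str[:3].str.upper()
    out["name_length"] = out["name"].str.len()
    return out.reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
```

Values that contain no digits (like "n/a") end up as NaN rather than raising, so you can decide later whether to fill or drop them.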
Step 3: Store Clean Data
- Save cleaned data to SQLite (portable for your laptop).
- Or keep as cleaned CSV/Parquet.
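A minimal sketch of the SQLite option, using the standard-library sqlite3 driver with pandas. The table name, column names, sample rows, and both helper functions are hypothetical; an in-memory database is used so the snippet runs without creating files.

```python
import sqlite3

import pandas as pd


def save_countries(df, conn, table="countries"):
    """Write the cleaned DataFrame to SQLite, replacing any previous copy."""
    df.to_sql(table, conn, if_exists="replace", index=False)


def load_by_region(conn, region, table="countries"):
    """Read back one region, using a bound parameter for the user-supplied value."""
    query = f"SELECT * FROM {table} WHERE region = ? ORDER BY name"
    return pd.read_sql_query(query, conn, params=(region,))


# Made-up cleaned rows for illustration.
cleaned = pd.DataFrame({
    "name": ["Germany", "France", "Japan"],
    "region": ["EUROPE", "EUROPE", "ASIA"],
    "population": [20, 10, 50],
})

conn = sqlite3.connect(":memory:")  # use a file path such as "countries.db" locally
save_countries(cleaned, conn)
europe = load_by_region(conn, "EUROPE")
print(europe)
```

With this in place the dashboard could read from the database instead of calling the API on every page load.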
Step 4: Build Dashboard
- Use Flask (simple) or FastAPI + Jinja2 + Bootstrap (recommended).
- Pages:
- Home → Summary stats (population, area).
- Search → Query countries.
- Charts → Plot population by continent (using Plotly/Matplotlib).
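For the Charts page, the aggregation can be kept separate from the plotting. A small sketch, assuming a cleaned DataFrame with region and population columns (the sample numbers are made up, and `population_by_region` is an illustrative helper name):

```python
import pandas as pd


def population_by_region(df):
    """Sum population per region, largest first, ready to hand to a chart."""
    return (df.groupby("region", as_index=False)["population"].sum()
              .sort_values("population", ascending=False)
              .reset_index(drop=True))


# Made-up numbers for illustration.
sample = pd.DataFrame({
    "region": ["EUROPE", "EUROPE", "ASIA"],
    "population": [10, 20, 50],
})

totals = population_by_region(sample)
print(totals)
```

The resulting frame has one row per region, so it could be passed to something like plotly.express.bar(totals, x="region", y="population") or a Matplotlib bar chart.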
Step 5: Productionize
Modular structure:
string_project/
├── app.py             # Flask app
├── data_fetch.py      # API/CSV fetcher
├── data_clean.py      # String cleaning functions
├── models.py          # SQLite DB helper
├── static/            # CSS/JS
├── templates/         # Jinja2 HTML templates
└── requirements.txt
requirements.txt
flask
pandas
requests
plotly
sqlalchemy
🔹 Sample End-to-End Script (minimal but extendable)
# app.py
from flask import Flask, render_template, request
import pandas as pd
import requests

app = Flask(__name__)

DATA_URL = "https://restcountries.com/v3.1/all"
# Request only the fields used below (the API supports a `fields` filter).
FIELDS = "name,region,capital,population,area"


def fetch_and_clean():
    # Step 1: Fetch (with a timeout, and fail loudly on HTTP errors)
    res = requests.get(DATA_URL, params={"fields": FIELDS}, timeout=10)
    res.raise_for_status()
    countries = res.json()

    # Step 2: Normalize nested JSON into a flat DataFrame
    df = pd.json_normalize(countries)

    # Step 3: String cleaning
    df['name.common'] = df['name.common'].str.strip().str.title()
    df['region'] = df['region'].fillna("Unknown").str.upper()
    # 'capital' is a list; join it so names containing apostrophes survive
    # and missing values become "" instead of the string "nan"
    df['capital'] = df['capital'].apply(
        lambda v: ", ".join(v) if isinstance(v, list) else "")

    # Derived column
    df['name_length'] = df['name.common'].str.len()

    return df[['name.common', 'region', 'capital', 'population', 'area', 'name_length']]


@app.route("/")
def home():
    df = fetch_and_clean()
    summary = {
        "total_countries": df.shape[0],
        "total_population": int(df["population"].sum()),
        "largest_country": df.loc[df["area"].idxmax(), "name.common"],
    }
    return render_template("home.html", summary=summary)


@app.route("/countries")
def countries():
    query = request.args.get("q", "")
    df = fetch_and_clean()
    if query:
        # regex=False treats the search text literally, so input like "(" cannot raise
        df = df[df['name.common'].str.contains(query, case=False, regex=False, na=False)]
    return render_template("countries.html", tables=df.to_html(classes="table table-striped"), query=query)


if __name__ == "__main__":
    app.run(debug=True)
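One thing to note about the script above: both routes call fetch_and_clean() on every request, so each page load hits the API again. A simple time-based cache is one way to soften that. This is only a sketch; `ttl_cache`, `load_data`, and the 600-second lifetime are made-up names and values, and `load_data` stands in for the real fetch.

```python
import functools
import time


def ttl_cache(ttl_seconds):
    """Cache a zero-argument function's result for ttl_seconds."""
    def decorator(func):
        state = {"value": None, "expires": None}

        @functools.wraps(func)
        def wrapper():
            now = time.monotonic()
            # Recompute on the first call or once the cached value has expired.
            if state["expires"] is None or now >= state["expires"]:
                state["value"] = func()
                state["expires"] = now + ttl_seconds
            return state["value"]
        return wrapper
    return decorator


calls = []  # records how many times the expensive function really ran


@ttl_cache(ttl_seconds=600)
def load_data():
    calls.append(1)       # stand-in for the real API call
    return {"rows": 3}


first = load_data()
second = load_data()
print(len(calls))
```

In the app, the same decorator could wrap fetch_and_clean so the API is only contacted once per cache window; reading from the SQLite store is another option.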
🔹 Example Templates (Jinja2)
templates/home.html
<!DOCTYPE html>
<html>
<head>
<title>Country Dashboard</title>
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/css/bootstrap.min.css">
</head>
<body class="container mt-4">
<h1>🌍 Country Dashboard</h1>
<p>Total Countries: {{ summary.total_countries }}</p>
<p>Total Population: {{ summary.total_population }}</p>
<p>Largest Country (Area): {{ summary.largest_country }}</p>
<a href="/countries">Browse Countries</a>
</body>
</html>
templates/countries.html
<!DOCTYPE html>
<html>
<head>
<title>Countries</title>
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/css/bootstrap.min.css">
</head>
<body class="container mt-4">
<h2>Countries</h2>
<form method="get">
<input type="text" name="q" value="{{ query }}" placeholder="Search by name">
<button type="submit" class="btn btn-primary btn-sm">Search</button>
</form>
<div class="mt-3">
{{ tables | safe }}
</div>
</body>
</html>