Tutorials

  • Spark SQL Window Functions and Best Use Cases

    For a better understanding of Spark SQL window functions and their best use cases, refer to our post Window functions in Oracle Pl/Sql and Hive explained and compared with examples. Window functions in Spark SQL are powerful tools that allow you to perform calculations across a set of table rows that are related to the current row.…
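
    As a quick illustration of that idea, here is a minimal PySpark sketch (the DataFrame, column names, and values are invented for the example, not taken from the post) that ranks rows within each group using a window specification:

    ```python
    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("window-demo").getOrCreate()

    # Hypothetical data: one row per (department, employee, salary)
    df = spark.createDataFrame(
        [("HR", "Asha", 50000), ("HR", "Ravi", 60000),
         ("IT", "Meera", 75000), ("IT", "John", 70000)],
        ["department", "employee", "salary"],
    )

    # Window partitioned by department, ordered by salary descending
    w = Window.partitionBy("department").orderBy(F.col("salary").desc())

    # rank() is computed for each row over the related rows in its partition
    df.withColumn("salary_rank", F.rank().over(w)).show()
    ```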

  • PySpark Architecture Cheat Sheet- How to Know Which Parts of Your PySpark ETL Script Are Executed on the Driver, Master (YARN), or Executors

    PySpark Architecture Cheat Sheet. 1. Core Components of PySpark: Spark Core, the foundational Spark component for scheduling, memory management, and fault tolerance (task scheduling, data partitioning, RDD APIs); Spark SQL, which enables interaction with structured data via SQL, DataFrames, and Datasets (SQL queries, schema inference, integration with Hive); Spark Streaming, which allows…
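
    Since the cheat sheet is about where code runs, here is a rough sketch of the rule of thumb (names and values are illustrative): plain Python statements execute in the driver process, while the column expressions inside transformations are shipped to and evaluated on the executors.

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("driver-vs-executor").getOrCreate()

    # Runs on the driver: config reads, small Python logic, plan building
    threshold = 100  # plain Python variable living in the driver process

    df = spark.range(1_000)  # lazy: only a logical plan is built on the driver

    # The column expression below is serialized and evaluated on executors
    filtered = df.filter(F.col("id") > threshold)

    # count() is an action: executors do the work, the number returns to the driver
    print(filtered.count())
    ```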

  • Quick Spark SQL reference- Spark SQL cheatsheet for Revising in One Go

    Here’s an enhanced Spark SQL cheatsheet with additional details, covering join types, union types, and set operations like EXCEPT and INTERSECT, along with options for table management (DDL and DML operations like CREATE, INSERT, UPDATE, DELETE, etc.). This comprehensive sheet is designed as a quick Spark SQL reference, organized by category, concept, syntax/example, and description. Basic Statements SELECT…
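
    To make the join and set-operation entries concrete, here is a small sketch using spark.sql over throwaway temp views (view and column names are invented for the example):

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql-cheatsheet-demo").getOrCreate()

    spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "val"]).createOrReplaceTempView("t1")
    spark.createDataFrame([(2, "b"), (3, "x"), (4, "d")], ["id", "val"]).createOrReplaceTempView("t2")

    # Join types: keep every row of t1, match t2 where possible
    spark.sql("SELECT t1.id, t1.val, t2.val AS val2 FROM t1 LEFT JOIN t2 ON t1.id = t2.id").show()

    # Set operations
    spark.sql("SELECT * FROM t1 EXCEPT SELECT * FROM t2").show()     # rows in t1 but not in t2
    spark.sql("SELECT * FROM t1 INTERSECT SELECT * FROM t2").show()  # rows common to both
    ```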

  • Functions in Spark SQL- Cheatsheets, Complex Examples

    Here’s a categorized Spark SQL function reference that organizes common Spark SQL functions by functionality, to help you select the right function for the operation you want to perform. 1. Aggregate Functions: avg() calculates the average value (SELECT avg(age) FROM table;); count() counts the number of rows (SELECT count(*) FROM…
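
    For instance, the aggregate entries above translate into queries like the following minimal sketch (the people view and its data are made up):

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("agg-functions-demo").getOrCreate()

    spark.createDataFrame(
        [("Asha", 31), ("Ravi", 45), ("Meera", 28)], ["name", "age"]
    ).createOrReplaceTempView("people")

    # Aggregate functions from the reference: avg() and count()
    spark.sql("SELECT avg(age) AS avg_age, count(*) AS n_rows FROM people").show()

    # The same aggregates grouped by a derived expression
    spark.sql("SELECT age >= 30 AS is_30_plus, count(*) AS n FROM people GROUP BY age >= 30").show()
    ```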

  • CRUD in SQL – Create Database, Create Table, Insert, Select, Update, Alter table, Delete

    CRUD stands for Create, Read, Update, and Delete: the four basic operations that any persistent storage application needs in order to manage data in a database or other persistent storage system. Persistent storage refers to data storage that retains information even after the…
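
    As a rough PySpark-side sketch of those operations (database and table names are invented; note that UPDATE and DELETE in Spark SQL need a table format that supports them, such as Delta Lake or Iceberg):

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("crud-demo").getOrCreate()

    # Create: database and table (DDL)
    spark.sql("CREATE DATABASE IF NOT EXISTS demo_db")
    spark.sql("CREATE TABLE IF NOT EXISTS demo_db.users (id INT, name STRING) USING parquet")

    # Create rows (INSERT) and Read (SELECT)
    spark.sql("INSERT INTO demo_db.users VALUES (1, 'Asha'), (2, 'Ravi')")
    spark.sql("SELECT * FROM demo_db.users").show()

    # Update / Delete require an ACID table format (e.g. Delta, Iceberg); on such a table:
    # spark.sql("UPDATE demo_db.users SET name = 'Ravi K' WHERE id = 2")
    # spark.sql("DELETE FROM demo_db.users WHERE id = 1")
    ```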

  • PySpark, Spark SQL and Python Pandas- A Collection of Useful Cheatsheets and Cheat Codes for Revising

    Here’s a comparative overview of partitions, bucketing, segmentation, and broadcasting in PySpark, Spark SQL, and Hive QL, in tabular form and with examples. Partitions: in PySpark, df.repartition(numPartitions, "column") creates partitions…
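
    Here is a brief PySpark sketch of the repartition and broadcast entries from that comparison (the DataFrames are invented; bucketing is shown only as a comment because it needs a table sink):

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("partition-broadcast-demo").getOrCreate()

    big = spark.range(1_000_000).withColumn("country", F.lit("IN"))
    small = spark.createDataFrame([("IN", "India")], ["country", "country_name"])

    # Partitions: redistribute the data by a column into a chosen number of partitions
    big = big.repartition(8, "country")

    # Broadcasting: hint that the small side should be copied to every executor
    joined = big.join(F.broadcast(small), "country")
    print(joined.count())

    # Bucketing: fixed at write time and requires a table, e.g.
    # big.write.bucketBy(4, "country").sortBy("id").saveAsTable("big_bucketed")
    ```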

  • Types of SQL/Spark SQL Commands- DDL, DML, DCL, TCL, DQL

    Here’s a breakdown of the main SQL command categories and their purposes, including examples of commands commonly used within each: 1. Data Definition Language (DDL) DDL commands define and modify the structure of database objects like tables, schemas, and indexes. They generally affect the schema or database structure rather than the data itself. Command Purpose…

  • Python Pandas Series Tutorial- Use Cases, Cheatcode Sheet to Revise

    The pandas Series is a one-dimensional array-like data structure that can store data of any type, including integers, floats, strings, or even Python objects. Each element in a Series is associated with a unique index label, making it easy to perform data retrieval and operations based on labels. Here’s a detailed guide on using Series…
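
    A tiny sketch of that label-based behaviour (the values and index labels are arbitrary):

    ```python
    import pandas as pd

    # A Series with an explicit string index
    s = pd.Series([10, 20, 30], index=["a", "b", "c"], name="scores")

    print(s["b"])              # label-based retrieval -> 20
    print(s[s > 15])           # boolean filtering keeps labels "b" and "c"
    print((s * 2).to_dict())   # vectorized operation preserves the index labels
    ```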

  • Pandas operations, functions, and use cases ranging from basic operations like filtering, merging, and sorting, to more advanced topics like handling missing data and error handling

    This tutorial covers a wide range of pandas operations and advanced concepts with examples that are practical and useful in real-world scenarios. In pandas, there are several core data structures designed for handling different types of data, enabling efficient and flexible data manipulation. Each of these structures…
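
    As a small example of the filtering, merging, and missing-data handling mentioned above (the toy frames are invented for illustration):

    ```python
    import pandas as pd
    import numpy as np

    orders = pd.DataFrame({"order_id": [1, 2, 3], "cust_id": [10, 20, 10], "amount": [250.0, np.nan, 90.0]})
    customers = pd.DataFrame({"cust_id": [10, 20], "name": ["Asha", "Ravi"]})

    # Handling missing data: fill NaN amounts with 0 before analysis
    orders["amount"] = orders["amount"].fillna(0)

    # Merging (like a SQL join), then filtering and sorting
    report = (
        orders.merge(customers, on="cust_id", how="left")
              .query("amount > 0")
              .sort_values("amount", ascending=False)
    )
    print(report)
    ```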

  • PySpark Projects: Scenario-Based Complex ETL Projects, Part 3

    I have divided a big PySpark script into many steps, using steps1 = ''' some code ''' through steps7. I want to execute all these steps one after another, and, if needed, some steps can be skipped. If any step fails, the next step only gets executed if it is marked to run even if…
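
    One way to sketch that pattern (purely illustrative: the step strings and the run-even-if-failed flag are hypothetical, not the post's actual code) is to keep each step's code in a string, exec() the strings in order, and consult a flag before running anything after a failure:

    ```python
    # Hypothetical sketch: each step holds a block of PySpark code as a string.
    step1 = "print('extract data')"
    step2 = "raise ValueError('simulated failure')"
    step3 = "print('load data')"

    # (code_string, skip?, run_even_if_a_previous_step_failed?)
    steps = [
        (step1, False, False),
        (step2, False, False),
        (step3, False, True),   # marked to run even if an earlier step failed
    ]

    previous_failed = False
    for i, (code, skip, run_even_if_failed) in enumerate(steps, start=1):
        if skip or (previous_failed and not run_even_if_failed):
            print(f"step{i}: skipped")
            continue
        try:
            exec(code)  # run this step's code block
            print(f"step{i}: succeeded")
        except Exception as exc:
            previous_failed = True
            print(f"step{i}: failed -> {exc}")
    ```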

  • PySpark Projects: Scenario-Based Complex ETL Projects, Part 2

    How do you code a complete ETL job in PySpark using only the PySpark SQL API, not the DataFrame-specific API? Here’s an example of a complete ETL (Extract, Transform, Load) job using the PySpark SQL API, with an explanation, tips, and variations, followed by a PySpark ETL script that incorporates control table management, job status tracking, data pre-checks, retries, dynamic broadcasting, caching,…
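
    A minimal skeleton of that idea, using only temp views and spark.sql (the path, view, and table names are placeholders, not taken from the original script):

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql-only-etl").getOrCreate()

    # Extract: register the raw source as a view (placeholder path)
    spark.read.option("header", True).csv("/tmp/raw_orders.csv").createOrReplaceTempView("raw_orders")

    # Transform: all logic expressed in SQL, no DataFrame-specific API
    spark.sql("""
        CREATE OR REPLACE TEMP VIEW clean_orders AS
        SELECT CAST(order_id AS INT)    AS order_id,
               CAST(amount   AS DOUBLE) AS amount,
               upper(country)           AS country
        FROM raw_orders
        WHERE amount IS NOT NULL
    """)

    # Load: write the result out as a table
    spark.sql("CREATE TABLE IF NOT EXISTS curated_orders USING parquet AS SELECT * FROM clean_orders")
    ```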

  • PySpark Control Statements- Conditional Statements, Loops, Exception Handling

    PySpark supports various control statements to manage the flow of your Spark applications, including Python’s if-elif-else statements, though with limitations (some usage is supported, some is not). Conditional statements in PySpark: 1. Python’s if-elif-else; 2. when and otherwise for simple conditions. In PySpark, the when and otherwise functions replace traditional if-else logic for column-level conditions. They are…
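
    A short sketch contrasting the two on a toy DataFrame (the data is invented): a driver-side Python if decides which code path to build, while when/otherwise expresses the row-level condition inside a column expression.

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("control-flow-demo").getOrCreate()
    df = spark.createDataFrame([("Asha", 17), ("Ravi", 34)], ["name", "age"])

    # Supported: plain Python if/else on driver-side values such as config flags
    add_flag = True
    if add_flag:
        # when/otherwise replaces row-level if/elif/else inside a column expression
        df = df.withColumn(
            "category",
            F.when(F.col("age") < 18, "minor").otherwise("adult"),
        )

    # Unsupported: `if df.age < 18:` raises an error, because a Column
    # has no boolean value on the driver.
    df.show()
    ```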

  • Troubleshooting PySpark Issues- Error Handling, Debugging, and Custom Log Table/Status Table Generation in PySpark

    When working with PySpark, there are several common issues that developers face. These can arise from areas such as memory management, performance bottlenecks, data skew, configuration, and resource contention. Here’s a guide to troubleshooting some of the most common PySpark issues and how to resolve them. 1. Out of Memory Errors (OOM) Memory-related…

  • PySpark Memory Management, Partition & Join Strategy – Scenario-Based Questions

    Q1. We are working with large datasets in PySpark, such as joining a 30 GB table with a 1 TB table, or running various transformations on 30 GB of data, and we have a limit of 100 cores per user. What is the best configuration and optimization strategy to use in PySpark? Will 100 cores be enough, or should…

  • CPU Cores, Executors, and Executor Memory in PySpark- Memory Management in PySpark Explained

    To determine the optimal number of CPU cores, executors, and executor memory for a PySpark job, several factors need to be considered, including the size and complexity of the job, the resources available in the cluster, and the nature of the data being processed. Here’s a general guide: 1. Number of CPU Cores per Executor…
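
    Purely for illustration (the numbers below are assumptions, not a recommendation from the post): with, say, 100 available cores and roughly 5 cores per executor, that sizing logic might translate into a configuration like this when building the session.

    ```python
    from pyspark.sql import SparkSession

    # Hypothetical sizing: 100 cores total, 5 cores per executor -> ~20 executors,
    # each with heap memory plus off-heap overhead for its tasks.
    spark = (
        SparkSession.builder
        .appName("executor-sizing-sketch")
        .config("spark.executor.cores", "5")
        .config("spark.executor.instances", "20")
        .config("spark.executor.memory", "16g")
        .config("spark.executor.memoryOverhead", "2g")
        .config("spark.sql.shuffle.partitions", "200")
        .getOrCreate()
    )

    print(spark.sparkContext.getConf().get("spark.executor.cores"))
    ```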

  • Partitioning a Table in SQL, Hive QL, Spark SQL

    Partitioning in SQL, HiveQL, and Spark SQL is a technique used to divide large tables into smaller, more manageable pieces or partitions. These partitions are based on a column (or multiple columns) and help improve query performance, especially when dealing with large datasets. The main purpose of partitioning is to speed up query execution by…
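
    As a compact Spark SQL sketch of the idea (table and column names are invented): create a table partitioned by a column, then filter on that column so the engine can prune the other partitions.

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partitioning-demo").getOrCreate()

    # Each distinct sale_year value becomes its own partition of files
    spark.sql("""
        CREATE TABLE IF NOT EXISTS sales (
            sale_id   INT,
            amount    DOUBLE,
            sale_year INT
        )
        USING parquet
        PARTITIONED BY (sale_year)
    """)

    spark.sql("INSERT INTO sales PARTITION (sale_year = 2024) VALUES (1, 99.5)")

    # Filtering on the partition column lets Spark skip the other partitions
    spark.sql("SELECT * FROM sales WHERE sale_year = 2024").show()
    ```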

  • Pivot & Unpivot in Spark SQL – How to Translate SAS PROC TRANSPOSE to Spark SQL

    PIVOT Clause in Spark SQL, MySQL, Oracle PL/SQL, and Hive QL. The PIVOT clause is a powerful SQL tool that lets you rotate rows into columns, making it easier to analyze and report data. Here’s how to use the PIVOT clause in Spark SQL, MySQL, Oracle PL/SQL, and Hive QL:…
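
    Since the post compares several engines, here is just the Spark SQL variant as a small sketch (the sales data is invented), rotating quarter values into columns:

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pivot-demo").getOrCreate()

    spark.createDataFrame(
        [("2024", "Q1", 100), ("2024", "Q2", 150), ("2025", "Q1", 120)],
        ["year", "quarter", "revenue"],
    ).createOrReplaceTempView("sales")

    # PIVOT rotates the listed quarter values into columns, aggregating revenue;
    # the remaining column (year) becomes the grouping key.
    spark.sql("""
        SELECT * FROM sales
        PIVOT (
            sum(revenue) FOR quarter IN ('Q1', 'Q2')
        )
    """).show()
    ```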

  • Oracle Query Execution Phases- How Does a Query Flow?

    A SQL query flows through the Oracle engine in the following steps: Step 1: Parsing; Step 2: Optimization; Step 3: Row Source Generation; Step 4: Execution; Step 5: Fetch; plus additional steps. This high-level overview shows how a query flows through the Oracle engine. Depending on the query complexity and database configuration, additional steps or variations may…

  • PySpark- Introduction, Components, Comparison with Hadoop, PySpark Architecture (Driver-Executor)

    PySpark is a powerful Python API for Apache Spark, a distributed computing framework that enables large-scale data processing. Spark history: Spark was started by Matei Zaharia at UC Berkeley’s AMPLab in 2009 and open-sourced in 2010 under a BSD license. In 2013, the project was donated to the Apache Software Foundation and switched…

  • Deploying a PySpark Job- Various Methods and Processes Involved

    Deploying a PySpark job can be done in various ways depending on your infrastructure, use case, and scheduling needs. Below are the different deployment methods available, including details on how to use each: 1. Running PySpark jobs via the PySpark shell (how it works, steps to deploy, use cases); 2. Submitting jobs via spark-submit (how it…
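
    As a minimal sketch of the spark-submit route (the script name, master, and deploy mode are placeholders): package the job as a regular Python entry point and hand it to spark-submit.

    ```python
    # etl_job.py -- a self-contained entry point suitable for spark-submit.
    # Submit it with something like (cluster details are placeholders):
    #   spark-submit --master yarn --deploy-mode cluster etl_job.py
    from pyspark.sql import SparkSession


    def main() -> None:
        spark = SparkSession.builder.appName("deployed-etl-job").getOrCreate()
        # Trivial workload so the example stays self-contained
        spark.range(10).selectExpr("id", "id * 2 AS doubled").show()
        spark.stop()


    if __name__ == "__main__":
        main()
    ```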