-
Python Regex complete tutorial with use cases: email search inside a whole DBMS or code search inside a code repository
Regular expressions (regex) are a powerful tool for matching patterns in text. Python’s re module provides functions and tools for working with regular expressions. Here’s a complete tutorial on using regex in Python. 1. Importing the re Module To use regular expressions in Python, you need to import the re module: import re 2. Basic…
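The email-search use case from the title can be sketched with the `re` module. This is a minimal sketch: the sample text and the pattern are illustrative assumptions, and the pattern is a pragmatic matcher rather than a full RFC 5322 email parser.

```python
import re

# Hypothetical sample standing in for a DBMS export or a source file.
text = """
contact: alice.smith@example.com, support: help@db-server.org
-- commented out: old_admin@legacy.example.net
"""

# Pragmatic email pattern: local part, '@', domain, dot, TLD of 2+ letters.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

emails = EMAIL_PATTERN.findall(text)
print(emails)
# ['alice.smith@example.com', 'help@db-server.org', 'old_admin@legacy.example.net']
```

For a real repository or database dump, the same pattern would be applied file by file (or column by column) with `re.finditer` to also capture positions.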
-
PySpark Projects: Scenario-Based Complex ETL Projects, Part 1
1. Exploratory Data Analysis (EDA) with Pandas in Banking – converted to PySpark. While searching for a free Pandas project on Google, I found this link – Exploratory Data Analysis (EDA) with Pandas in Banking. I have tried to convert this Python script into a PySpark one. First, let’s handle the initial steps of downloading and extracting the data:…
-
String Manipulation on PySpark DataFrames
String manipulation is a common task in data processing. PySpark provides a variety of built-in functions for manipulating string columns in DataFrames. Below, we explore some of the most useful string manipulation functions and demonstrate how to use them with examples. Common String Manipulation Functions Example Usage 1. Concatenation Syntax: 2. Substring Extraction Syntax: 3.…
-
Date and Time Functions – PySpark DataFrames & PySpark SQL Queries
Here’s a comprehensive list of common PySpark date functions along with detailed explanations and examples on DataFrames (we will discuss these again as PySpark SQL queries): 1. current_date() Returns the current date. 2. current_timestamp() Returns the current timestamp. 3. date_format() Formats a date using the specified format. 4. year(), month(), dayofmonth() Extracts the year, month,…
-
Window functions in PySpark on DataFrames
Window functions in PySpark allow you to perform operations on a subset of your data using a “window” that defines a range of rows. These functions are similar to SQL window functions and are useful for tasks like ranking, cumulative sums, and moving averages. Let’s go through various PySpark DataFrame window functions, compare them with…
-
PySpark DataFrame programming – operations, functions, statements, and syntax with examples
PySpark provides a powerful API for data manipulation, similar to pandas, but optimized for big data processing. Below is a comprehensive overview of DataFrame operations, functions, and syntax in PySpark with examples. Creating DataFrames Creating DataFrames from various sources is a common task in PySpark. Below are examples for creating DataFrames from CSV files, Excel…
-
Python Project Alert: Dynamic List of Variables Creation
Let us go through the project requirement: 1. Let us create one or multiple dynamic lists of variables and save them in a dictionary, array, or other data structure for repeated use in Python. Variable names take a dynamic form, for example Month_202401 to Month_202312 for 24 months (take these 24 months backdated or…
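The requirement above can be sketched with a dictionary keyed by generated `Month_YYYYMM` names. The anchor date and the empty-list values are assumptions for illustration; in practice the anchor would likely be `date.today()` and the values whatever data each month holds.

```python
from datetime import date

def backdated_month_keys(anchor: date, n_months: int = 24) -> dict:
    """Build a dict whose keys are dynamic names like Month_202401,
    walking backwards month by month from the anchor date."""
    result = {}
    year, month = anchor.year, anchor.month
    for _ in range(n_months):
        result[f"Month_{year}{month:02d}"] = []  # placeholder value per month
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return result

# Fixed anchor for a reproducible illustration; use date.today() in practice.
keys = backdated_month_keys(date(2024, 1, 15))
print(list(keys)[0], list(keys)[-1])  # Month_202401 Month_202202
```

Storing the generated names as dictionary keys (rather than creating real Python variables via `globals()`) keeps the "dynamic variables" easy to iterate over and reuse.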
-
I wrote Python code or created a Python script, and it executed successfully – so what does it mean?
I wrote Python code or created a Python script, and it executed successfully. So what does it mean? This is the most basic question an early Python learner can ask! Consider this scenario: I executed a script in Python which saves many CSV files to local disk and also…
-
Spark SQL Join Types – Syntax Examples and Comparison
Spark SQL supports several types of joins, each suited to different use cases. Below is a detailed explanation of each join type, including syntax examples and comparisons. Types of Joins in Spark SQL 1. Inner Join An inner join returns only the rows that have matching values in both tables. Syntax: SELECT a.*, b.* FROM tableA…
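The inner vs. left join contrast can be run self-contained: the ANSI join syntax below is the same shape Spark SQL uses, and the stdlib `sqlite3` module lets it execute without a Spark cluster. The table names and sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableA (id INTEGER, name TEXT);
    CREATE TABLE tableB (id INTEGER, dept TEXT);
    INSERT INTO tableA VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Cara');
    INSERT INTO tableB VALUES (1, 'HR'), (2, 'Eng');
""")

# Inner join: only rows with matching ids in both tables.
inner = conn.execute("""
    SELECT a.id, a.name, b.dept
    FROM tableA a INNER JOIN tableB b ON a.id = b.id
    ORDER BY a.id
""").fetchall()

# Left join: every row from tableA; NULL dept where tableB has no match.
left = conn.execute("""
    SELECT a.id, a.name, b.dept
    FROM tableA a LEFT JOIN tableB b ON a.id = b.id
    ORDER BY a.id
""").fetchall()

print(inner)  # [(1, 'Alice', 'HR'), (2, 'Bob', 'Eng')]
print(left)   # [(1, 'Alice', 'HR'), (2, 'Bob', 'Eng'), (3, 'Cara', None)]
```

In Spark SQL the same statements would run via `spark.sql(...)` against registered temporary views.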
-
Temporary Functions in PL/SQL vs Spark SQL
Temporary functions allow users to define functions that are session-specific and used to encapsulate reusable logic within a database session. While both PL/SQL and Spark SQL support the concept of user-defined functions, their implementation and usage differ significantly. Temporary Functions in PL/SQL PL/SQL, primarily used with Oracle databases, allows you to create temporary or anonymous…
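The "session-scoped function" idea can be sketched without a database server. In Spark SQL one would register a temporary function with `spark.udf.register("add_tax", ...)`; here the stdlib `sqlite3` per-connection UDF plays the same role, since it exists only for the lifetime of that connection. The function name and tax rate are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Register a Python function as a UDF visible only on this connection --
# analogous to a session-scoped temporary function.
conn.create_function("add_tax", 1, lambda amount: round(amount * 1.18, 2))

row = conn.execute("SELECT add_tax(100.0)").fetchone()
print(row[0])  # 118.0
```

Once the connection closes, `add_tax` is gone, mirroring how a temporary function does not persist beyond its session.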
-
How PySpark automatically optimizes job execution by breaking it down into stages and tasks based on data dependencies – explained with an example
Apache Spark, including PySpark, automatically optimizes job execution by breaking it down into stages and tasks based on data dependencies. This process is facilitated by Spark’s Directed Acyclic Graph (DAG) Scheduler, which helps in optimizing the execution plan for efficiency. Let’s break this down with a detailed example and accompanying numbers to illustrate the process.…
-
Understanding Pyspark execution with the help of Logs in Detail
A typical PySpark execution log provides detailed information about the various stages and tasks of a Spark job. These logs are essential for debugging and optimizing Spark applications. Here’s a step-by-step explanation of what you might see in a typical PySpark execution log: Step 1: Spark Context Initialization When…
-
PySpark RDDs, a wonder – transformations, actions, and execution operations explained and listed
RDD (Resilient Distributed Dataset) is the fundamental data structure in Apache Spark. It is an immutable, distributed collection of objects that can be processed in parallel across a cluster of machines. Purpose of RDD How RDD is Beneficial RDDs are the backbone of Apache Spark’s distributed computing capabilities. They enable scalable, fault-tolerant, and efficient processing…
-
Are DataFrames in PySpark lazily evaluated?
Yes, DataFrames in PySpark are lazily evaluated, similar to RDDs. Lazy evaluation is a key feature of Spark’s processing model, which helps optimize the execution of transformations and actions on large datasets. What is Lazy Evaluation? Lazy evaluation means that Spark does not immediately execute the transformations you apply to a DataFrame. Instead, it builds…
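Plain Python generators give a small, runnable analogy for this behaviour (an analogy only: Spark actually builds a logical plan optimized by Catalyst, which generators do not). Chaining a transformation does no work; only consuming the result, the equivalent of an action like `collect()`, triggers execution.

```python
# Record every time the "transformation" function actually runs.
calls = []

def trace(x):
    calls.append(x)
    return x * 2

data = range(5)
transformed = (trace(x) for x in data)  # like df.select(...): nothing runs yet
assert calls == []                      # no work has been done so far

result = list(transformed)              # like df.collect(): triggers execution
print(calls, result)  # [0, 1, 2, 3, 4] [0, 2, 4, 6, 8]
```

The key point carries over to PySpark: until an action runs, `transformed` is just a description of work to do, which is what lets Spark optimize the whole chain at once.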
-
BDL Ecosystem – HDFS and Hive Tables
Big Data Lake: Data Storage HDFS is a scalable storage solution designed to handle massive datasets across clusters of machines. Hive tables provide a structured approach for querying and analyzing data stored in HDFS. Understanding how these components work together is essential for effectively managing data in your BDL ecosystem. HDFS – Hadoop Distributed File…
-
Big Data, Data Warehouse, Data Lakes, Big Data Lake – explained in simple words
Big data and big data lakes are complementary concepts. Big data refers to the characteristics of the data itself, while a big data lake provides a storage solution for that data. Organizations often leverage big data lakes to store and manage their big data, enabling further analysis and exploration. Here’s an analogy: Think of big…
-
Window functions in Oracle PL/SQL and Hive explained and compared with examples
Window functions, also known as analytic functions, perform calculations across a set of table rows that are somehow related to the current row. This is different from regular aggregate functions, which aggregate results for the entire set of rows. Both Oracle PL/SQL and Apache Hive support window functions, but there are some differences in their…
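The `OVER (PARTITION BY ... ORDER BY ...)` syntax below is the same shape used by Oracle and Hive analytic functions; stdlib `sqlite3` (SQLite 3.25+) runs it self-contained here. The table and sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (dept TEXT, emp TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('Eng', 'Alice', 300), ('Eng', 'Bob', 500),
        ('HR',  'Cara', 200),  ('HR',  'Dan', 400);
""")

# RANK() orders rows within each dept; SUM() OVER a bare partition gives the
# per-dept total on every row -- unlike GROUP BY, no rows are collapsed.
rows = conn.execute("""
    SELECT dept, emp, amount,
           RANK() OVER (PARTITION BY dept ORDER BY amount DESC) AS rnk,
           SUM(amount)  OVER (PARTITION BY dept)                AS dept_total
    FROM sales
    ORDER BY dept, rnk
""").fetchall()

for r in rows:
    print(r)
```

That contrast, every input row survives with extra computed columns, is exactly what distinguishes window functions from regular aggregates in both Oracle and Hive.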
-
Common Table Expressions (CTEs) in Oracle PL/SQL, Hive, and Spark SQL explained and compared
Common Table Expressions (CTEs) are a useful feature in SQL for simplifying complex queries and improving readability. Both Oracle PL/SQL and Apache Hive support CTEs, although there may be slight differences in their syntax and usage. Common Table Expressions in Oracle PL/SQL In Oracle, CTEs are defined using the WITH clause. They are used to…
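The `WITH` clause is standard SQL shared by Oracle, Hive, and Spark SQL, so the example below runs unchanged in spirit across all three; stdlib `sqlite3` is used only to make it self-contained. The table and the "big spenders" threshold are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, customer TEXT, total INTEGER);
    INSERT INTO orders VALUES (1, 'A', 50), (2, 'A', 150), (3, 'B', 300);
""")

# The CTE names an intermediate result (per-customer totals) that the main
# query then filters -- the usual readability win over a nested subquery.
big_spenders = conn.execute("""
    WITH customer_totals AS (
        SELECT customer, SUM(total) AS spent
        FROM orders
        GROUP BY customer
    )
    SELECT customer, spent
    FROM customer_totals
    WHERE spent > 100
    ORDER BY customer
""").fetchall()

print(big_spenders)  # [('A', 200), ('B', 300)]
```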
-
String/Character Manipulation functions in Oracle PL/SQL, Apache Hive
Function Name Description Example Usage Result CONCAT Concatenates two strings. SELECT CONCAT(‘Oracle’, ‘PL/SQL’) FROM dual; OraclePL/SQL || (Concatenation) Concatenates two strings. LENGTH Returns the length of a string. SELECT LENGTH(‘Oracle’); 6 LOWER Converts all characters in a string to lowercase. SELECT LOWER(‘ORACLE’); oracle UPPER Converts all characters in a string to uppercase. SELECT UPPER(‘oracle’);…
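Several of the functions tabulated above (LENGTH, LOWER, UPPER, and the `||` concatenation operator) also exist in SQLite with the same semantics, so the table's example results can be checked self-contained with the stdlib `sqlite3` module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
row = conn.execute("""
    SELECT 'Oracle' || 'PL/SQL',   -- concatenation, as in Oracle
           LENGTH('Oracle'),       -- 6
           LOWER('ORACLE'),        -- 'oracle'
           UPPER('oracle')         -- 'ORACLE'
""").fetchone()
print(row)  # ('OraclePL/SQL', 6, 'oracle', 'ORACLE')
```

Note one difference: SQLite has no `CONCAT` function (only `||`), whereas Oracle offers both and Hive prefers `concat(...)`.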
-
Date and Time manipulation in Oracle SQL, Apache Hive QL, MySQL
Date and Time manipulation in Oracle SQL In Oracle SQL, date and time manipulation is essential for many database operations, ranging from basic date arithmetic to complex formatting and extraction. Here’s a guide covering various common operations you might need. 1. Basic Date Arithmetic Adding/Subtracting Days: Adding/Subtracting Months: Adding/Subtracting Years: 2. Extracting Components from Date…
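Oracle expresses the arithmetic above as `hire_date + 7`, `ADD_MONTHS(hire_date, 3)`, and `EXTRACT(YEAR FROM hire_date)`. As a runnable stand-in, the stdlib `sqlite3` date modifiers below perform the same operations self-contained; the sample date is an assumption for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
row = conn.execute("""
    SELECT date('2024-01-15', '+7 day'),    -- add days     (Oracle: d + 7)
           date('2024-01-15', '+3 month'),  -- add months   (Oracle: ADD_MONTHS)
           date('2024-01-15', '-1 year'),   -- subtract a year
           strftime('%Y', '2024-01-15')     -- extract year (Oracle: EXTRACT)
""").fetchone()
print(row)  # ('2024-01-22', '2024-04-15', '2023-01-15', '2024')
```

Hive QL sits in between, with `date_add`, `add_months`, and `year()` providing the same operations under different names.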