Use Apache Spark in Azure Databricks - Lab 2


One of the benefits of Spark is support for a wide range of programming languages, including Java, Scala, Python, and SQL, making Spark a very flexible solution for data processing workloads including data cleansing and manipulation, statistical analysis and machine learning, and data analytics and visualization. Azure Databricks is built on Apache Spark and offers a highly scalable solution for data engineering and analysis tasks that involve working with data in files.


Step 1. Set up an Azure Databricks workspace and open a notebook

Detailed steps are available at: https://saurabhsinhainblogs.blogspot.com/2024/01/azure-databricks-lab-how-to-start-with.html

  • Connect to the Azure portal
  • Set up Azure Databricks
  • Set up a cluster for Azure Databricks
  • Open a notebook


Step 2. Prepare data to consume

Shell commands can also be used to download data files from GitHub into the Databricks File System (DBFS).

Code:

%sh
rm -r /dbfs/spark_lab
mkdir /dbfs/spark_lab
wget -O /dbfs/spark_lab/2019.csv https://raw.githubusercontent.com/MicrosoftLearning/mslearn-databricks/main/data/2019.csv
wget -O /dbfs/spark_lab/2020.csv https://raw.githubusercontent.com/MicrosoftLearning/mslearn-databricks/main/data/2020.csv
wget -O /dbfs/spark_lab/2021.csv https://raw.githubusercontent.com/MicrosoftLearning/mslearn-databricks/main/data/2021.csv
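
As a quick optional check (not part of the original lab), you can list the downloaded files from a Python cell using dbutils to confirm they landed in DBFS:

Code:

# Optional: confirm the files were downloaded to DBFS
for f in dbutils.fs.ls("/spark_lab"):
    print(f.name, f.size)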




Step 3. Load the data frame and view the data


The data is loaded and displayed, but notice that:
  • The data doesn't include the column headers
  • Information about the data types is missing

Code:
df = spark.read.load('/spark_lab/*.csv', format='csv')
display(df.limit(100))
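
To see the problem concretely, you can print the schema Spark assigned by default; with no header row and no explicit schema, every column is auto-named and typed as a string:

Code:

# With no header or schema, columns are named _c0, _c1, ... and all typed as string
df.printSchema()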


Step 4. Define a schema for the data frame

Observe that this time, both of the earlier problems are solved:

  • The data frame includes column headers
  • Information about the data types is also available


Code:

from pyspark.sql.types import *
from pyspark.sql.functions import *

# Define the schema explicitly: column names and data types
orderSchema = StructType([
    StructField("SalesOrderNumber", StringType()),
    StructField("SalesOrderLineNumber", IntegerType()),
    StructField("OrderDate", DateType()),
    StructField("CustomerName", StringType()),
    StructField("Email", StringType()),
    StructField("Item", StringType()),
    StructField("Quantity", IntegerType()),
    StructField("UnitPrice", FloatType()),
    StructField("Tax", FloatType())
])
df = spark.read.load('/spark_lab/*.csv', format='csv', schema=orderSchema)
display(df.limit(100))

df.printSchema()
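
As a side note (an optional sketch, not part of the original lab), Spark can also infer column data types by sampling the files with the inferSchema option. Because these CSV files have no header row, the inferred columns would still be named _c0, _c1, and so on, so the explicit schema above remains the better choice when you also want meaningful column names:

Code:

# Alternative sketch: infer data types by sampling the files.
# Column names still default to _c0, _c1, ... because the CSVs have no header row.
df_inferred = spark.read.load('/spark_lab/*.csv', format='csv', inferSchema=True)
df_inferred.printSchema()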


Step 5. Filter a data frame

  • Filter the columns of the sales orders data frame to include only the customer name and email address.
    • When you perform an operation on a data frame, the result is a new data frame (in this case, a new customers data frame is created by selecting a specific subset of columns from the df data frame).
    • The dataframe['Field1', 'Field2', ...] syntax is a shorthand way of defining a subset of columns. You can also use the select method, so the first line of the code below could be written as customers = df.select("CustomerName", "Email").
  • Count the total number of order records.
  • Count the number of distinct customers.
  • Display the distinct customers.
    • Data frames provide functions such as count and distinct that can be used to summarize and filter the data they contain.


Code:

customers = df['CustomerName', 'Email']
print(customers.count())
print(customers.distinct().count())
display(customers.distinct())

Now let's apply a filter to include only the customers who have placed an order for a specific product by running the following code:

Code:

customers = df.select("CustomerName", "Email").where(df['Item']=='Road-250 Red, 52')
print(customers.count())
print(customers.distinct().count())
display(customers.distinct())
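
As a small illustrative variation (not in the original lab), where is simply an alias for filter, so the same condition can also be written with the col function that was imported earlier from pyspark.sql.functions:

Code:

# Equivalent filter written with col(); where() and filter() are interchangeable
customers = df.select("CustomerName", "Email").filter(col("Item") == 'Road-250 Red, 52')
display(customers.distinct())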





Step 6. Aggregate and group data in a data frame


Show the sum of order quantities grouped by product. The groupBy method groups the rows by Item, and the subsequent sum aggregate function is applied to all of the remaining numeric columns.

Code:

productSales = df.select("Item", "Quantity").groupBy("Item").sum()
display(productSales)
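
If you'd prefer a friendlier column name than the default sum(Quantity), the same aggregation can be written with agg and an alias (a minor variation, not required for the lab):

Code:

# Same aggregation, but with an explicit alias for the summed column
productSales = df.groupBy("Item").agg(sum("Quantity").alias("TotalQuantity"))
display(productSales)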


Show the number of sales orders per year. Note that the select method includes a SQL year function to extract the year component of the OrderDate field, and then an alias method is used to assign a column name to the extracted year value. The data is then grouped by the derived Year column, the count of rows in each group is calculated, and finally the orderBy method is used to sort the resulting data frame.

Code:

yearlySales = df.select(year("OrderDate").alias("Year")).groupBy("Year").count().orderBy("Year")
display(yearlySales)
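
As an optional extension (a sketch, not part of the original lab), the same DataFrame API can combine the order count with the gross revenue per year, using the same revenue formula that appears in the SQL queries later, (UnitPrice * Quantity) + Tax:

Code:

# Sketch: order count and gross revenue per year using the DataFrame API
yearlySummary = (df.withColumn("Year", year("OrderDate"))
                   .withColumn("Revenue", (col("UnitPrice") * col("Quantity")) + col("Tax"))
                   .groupBy("Year")
                   .agg(count("*").alias("Orders"), sum("Revenue").alias("GrossRevenue"))
                   .orderBy("Year"))
display(yearlySummary)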








Step 7. Query data using Spark SQL

  • The native methods of the data frame object you used previously enable you to query and analyze data quite effectively. However, many data analysts are more comfortable working with SQL syntax.
  • Spark SQL is a SQL language API in Spark that you can use to run SQL statements, or even persist data in relational tables.
  • The following code creates a relational view of the data in a data frame, and then uses spark.sql to embed Spark SQL syntax within your Python code, query the view, and return the results as a data frame.

Code:

df.createOrReplaceTempView("salesorders")
spark_df = spark.sql("SELECT * FROM salesorders")
display(spark_df)
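
Because spark.sql returns an ordinary data frame, you can keep applying DataFrame methods to the result (a small illustration, not part of the original lab):

Code:

# The result of spark.sql is a regular DataFrame, so normal transformations still apply
spark_df = spark.sql("SELECT * FROM salesorders")
display(spark_df.where(spark_df['Quantity'] > 1).select("SalesOrderNumber", "Item", "Quantity"))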


Execute SQL code in Databricks
  • The %sql line at the beginning of the cell (called a magic) indicates that the Spark SQL language runtime should be used to run the code in this cell instead of PySpark.
  • The SQL code references the salesorders view that you created previously.
  • The output from the SQL query is automatically displayed as the result under the cell.

Code:

%sql
SELECT YEAR(OrderDate) AS OrderYear,
       SUM((UnitPrice * Quantity) + Tax) AS GrossRevenue
FROM salesorders
GROUP BY YEAR(OrderDate)
ORDER BY OrderYear;




Step 8. Visualize data with Spark


  • Azure Databricks includes built-in support for visualizing data from a data frame or Spark SQL query.
  • It is not designed for comprehensive charting.
  • You can also use Python graphics libraries like Matplotlib and Seaborn to create charts from data in data frames.

Code:

%sql
SELECT * FROM salesorders

In the table of results, select + and then select Visualization to open the visualization editor, and then apply the following options:
  • Visualization type: Bar
  • X Column: Item
  • Y Column: Add a new column and select Quantity. Apply the Sum aggregation.
  • Save the visualization and then re-run the code cell to view the resulting chart in the notebook.





Step 9. Use Matplotlib in Databricks


Let's use the Python visualization library Matplotlib to achieve the same result.


Code:

sqlQuery = "SELECT CAST(YEAR(OrderDate) AS CHAR(4)) AS OrderYear, \
                SUM((UnitPrice * Quantity) + Tax) AS GrossRevenue \
            FROM salesorders \
            GROUP BY CAST(YEAR(OrderDate) AS CHAR(4)) \
            ORDER BY OrderYear"
df_spark = spark.sql(sqlQuery)
df_spark.show()







Running the code below produces a column chart showing the total gross revenue for each year.

Note the following features of the code used to produce this chart:
  • The matplotlib library requires a Pandas dataframe, so you need to convert the Spark dataframe returned by the Spark SQL query to this format.
  • At the core of the matplotlib library is the pyplot object. This is the foundation for most plotting functionality.



Code:

from matplotlib import pyplot as plt
# matplotlib requires a Pandas dataframe, not a Spark one
df_sales = df_spark.toPandas()
# Create a bar plot of revenue by year
plt.bar(x=df_sales['OrderYear'], height=df_sales['GrossRevenue'])
# Display the plot
plt.show()



Let's customize the plot.


Code:

# Clear the plot area
plt.clf()
# Create a bar plot of revenue by year
plt.bar(x=df_sales['OrderYear'], height=df_sales['GrossRevenue'], color='orange')
# Customize the chart
plt.title('Revenue by Year')
plt.xlabel('Year')
plt.ylabel('Revenue')
plt.grid(color='#95a5a6', linestyle='--', linewidth=2, axis='y', alpha=0.7)
plt.xticks(rotation=45)
# Show the figure
plt.show()



             
The figure was created implicitly for you, but you can also create it explicitly and control its size.

Code:

# Clear the plot area
plt.clf()
# Create a Figure
fig = plt.figure(figsize=(8,3))
# Create a bar plot of revenue by year
plt.bar(x=df_sales['OrderYear'], height=df_sales['GrossRevenue'], color='orange')
# Customize the chart
plt.title('Revenue by Year')
plt.xlabel('Year')
plt.ylabel('Revenue')
plt.grid(color='#95a5a6', linestyle='--', linewidth=2, axis='y', alpha=0.7)
plt.xticks(rotation=45)
# Show the figure
plt.show()




A figure can contain multiple subplots, each on its own axis.

Code:

# Clear the plot area
plt.clf()
# Create a figure for 2 subplots (1 row, 2 columns)
fig, ax = plt.subplots(1, 2, figsize = (10,4))
# Create a bar plot of revenue by year on the first axis
ax[0].bar(x=df_sales['OrderYear'], height=df_sales['GrossRevenue'], color='orange')
ax[0].set_title('Revenue by Year')
# Create a pie chart of yearly order counts on the second axis
yearly_counts = df_sales['OrderYear'].value_counts()
ax[1].pie(yearly_counts)
ax[1].set_title('Orders per Year')
ax[1].legend(yearly_counts.keys().tolist())
# Add a title to the Figure
fig.suptitle('Sales Data')
# Show the figure
plt.show()
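
If you also want to keep a copy of the chart as a file, a matplotlib figure can be saved to the DBFS folder created earlier (the file name below is just an example); add the call before plt.show() in the cell above so the figure still has content when it is written:

Code:

# Hypothetical example: persist the figure to DBFS (place before plt.show())
fig.savefig('/dbfs/spark_lab/sales_data.png')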

















Step 10. Use the Seaborn library

Use the seaborn library (which is built on matplotlib and abstracts some of its complexity) to create a chart:

Code:

import seaborn as sns
plt.clf() # Clear the plot area
# Create a bar chart
ax = sns.barplot(x="OrderYear", y="GrossRevenue", data=df_sales)
plt.show()

The seaborn library makes it simpler to create complex plots of statistical data, and enables you to control the visual theme for consistent data visualizations.




















Seaborn also lets you set a visual theme so that your charts have a consistent look:

Code:

plt.clf() # Clear the plot area
# Set the visual theme for seaborn
sns.set_theme(style="whitegrid")
# Create a bar chart
ax = sns.barplot(x="OrderYear", y="GrossRevenue", data=df_sales)
plt.show()





















Like matplotlib, seaborn supports multiple chart types:

Code:

plt.clf()  # Clear the plot area
# Create a line chart
ax = sns.lineplot(x="OrderYear", y="GrossRevenue", data=df_sales)
plt.show()


















Step Last. Clean up resources

  • In the Azure Databricks portal, on the Compute page, select your cluster and select ■ Terminate to shut it down.
  • If you've finished exploring Azure Databricks, you can delete the resources you've created to avoid unnecessary Azure costs and free up capacity in your subscription.

