Explore Azure Databricks - Azure Databricks Lab 1






Step 1. Set up the Azure Databricks workspace and open a notebook


  • Connect to the Azure portal
  • Set up Azure Databricks
  • Set up a cluster for Azure Databricks
  • Open a notebook

Step 2. Prepare data to consume


  • Go to the URL: https://raw.githubusercontent.com/MicrosoftLearning/mslearn-databricks/main/data/products.csv
  • Download the data as a CSV file; I named the file products.csv.
  • On the File menu, select Upload data to DBFS.
  • In the Upload Data dialog box, note the DBFS Target Directory to which the file will be uploaded.
  • Then select the Files area, and upload the products.csv file you downloaded to your computer.
  • When the file has been uploaded, select Next.
  • In the Access files from notebooks pane, select the sample PySpark code and copy it to the clipboard. You will use it to load the data from the file into a data frame. Then select Done.
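
Before uploading, a quick local check can confirm that the download really is a CSV with a header row. The sketch below uses only Python's standard library and runs on your own computer, not in Databricks. The function names download_csv and describe_csv and the local file name are illustrative assumptions, not part of the lab.

```python
import csv
import shutil
import urllib.request

# URL taken from Step 2 of this lab.
PRODUCTS_URL = (
    "https://raw.githubusercontent.com/MicrosoftLearning/"
    "mslearn-databricks/main/data/products.csv"
)


def download_csv(url, dest_path):
    """Download the file at `url` to `dest_path` and return `dest_path`."""
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest_path


def describe_csv(path):
    """Return (header, data_row_count) for a CSV file that has a header row."""
    # utf-8-sig also strips a byte-order mark if the file starts with one.
    with open(path, newline="", encoding="utf-8-sig") as f:
        reader = csv.reader(f)
        header = next(reader)
        row_count = sum(1 for _ in reader)
    return header, row_count
```

For example, describe_csv(download_csv(PRODUCTS_URL, "products.csv")) should return a header that includes the ProductID, ProductName, Category, and ListPrice columns used later in this lab.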



Step 3. Execute code in the notebook


  • In the notebook, in the empty code cell, paste the code you copied. It should look similar to the code shown after these steps.
  • Use the ▸ Run Cell menu option at the top-right of the cell to run it, starting and attaching the cluster if prompted.
  • Wait for the Spark job run by the code to complete. The code creates a data frame object named df1 from the data in the file you uploaded.

Code:

df1 = spark.read.format("csv").option("header", "true").load("dbfs:/FileStore/shared_uploads/a@b.com/products.csv")
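
With only the header option set, Spark's CSV reader loads every column as a string. As a possible variation (not part of the original lab), the sketch below also turns on the inferSchema option so that numeric columns get numeric types, at the cost of an extra pass over the file. The helper name load_products is hypothetical; spark is the session object that a Databricks notebook provides.

```python
def load_products(spark, path, infer_schema=True):
    """Read a CSV file with a header row into a Spark data frame.

    `spark` is the SparkSession available in a Databricks notebook;
    `path` is the DBFS path noted in the Upload Data dialog box.
    """
    reader = (
        spark.read.format("csv")
        .option("header", "true")
        # Assumption: ask Spark to infer column types instead of
        # leaving every column as a string.
        .option("inferSchema", str(infer_schema).lower())
    )
    return reader.load(path)
```

In a notebook cell this would be called as df1 = load_products(spark, "dbfs:/FileStore/shared_uploads/a@b.com/products.csv"), with the path replaced by your own upload location.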



Step 4. Display the contents of the data frame


  • Under the existing code cell, use the + icon to add a new code cell. Then, in the new cell, enter the code shown after these steps.
  • Use the ▸ Run Cell menu option at the top-right of the new cell to run it. print(df1) shows only the column names and types, df1.collect() returns every row as a list of Row objects, and display(df1) renders the contents of the data frame as a table.

Code:

print(df1)

df1.collect()

display(df1)
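
Because collect() brings every row back to the driver, it can be slow or memory-hungry on large data frames. As a lighter alternative, the sketch below prints the schema and fetches only the first few rows. The helper name preview and the default of 5 rows are assumptions made for illustration.

```python
def preview(df, n=5):
    """Print the schema of `df`, then return at most `n` rows as a list."""
    # printSchema() writes the column names and types as a tree.
    df.printSchema()
    # take(n) fetches only the first n rows, unlike collect().
    return df.take(n)
```

In the notebook, preview(df1) or preview(df1, 10) would be run in its own cell.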



Step 5. Create a data visualization and data profile

  • Above the table of results, select + and then select Visualization to view the visualization editor, and then apply the following options:
  • Visualization type: Bar
  • X Column: Category
  • Y Column: Add a new column and select ProductID. Apply the Count aggregation.

Visualization type: Bar + Data Profile
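
The bar chart configured above counts ProductID values for each Category. A minimal sketch of the same aggregation in PySpark follows, in case you want the numbers as a data frame rather than a chart. The helper name count_products_by_category is hypothetical, and the sort order is an assumption added for readability.

```python
def count_products_by_category(df):
    """Return one row per Category with the count of ProductID values."""
    return (
        df.groupBy("Category")
        # Dictionary form of agg: column name mapped to aggregate function.
        .agg({"ProductID": "count"})
        .orderBy("Category")
    )
```

In the notebook, display(count_products_by_category(df1)) would show the per-category counts as a table.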


Step 6. Create and query a table

  • Save the data frame as a table object.
  • In a new cell, run SQL code to return the name and price of products in the Touring Bikes category. The %sql magic goes at the beginning of its own cell.

Code: 

df1.write.saveAsTable("products")

%sql

SELECT ProductName, ListPrice FROM products WHERE Category = 'Touring Bikes';
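
Running saveAsTable a second time with the default settings raises an error because the table already exists. The sketch below is one way to make the step re-runnable and to run the same query from Python instead of a %sql cell. The helper name save_and_query is hypothetical, and overwriting the table is an assumption that suits a lab but may not suit real data.

```python
def save_and_query(spark, df, table_name="products"):
    """Save `df` as a table, then return the Touring Bikes names and prices."""
    # mode("overwrite") replaces an existing table so the cell can be re-run.
    df.write.mode("overwrite").saveAsTable(table_name)
    # spark.sql runs the statement and returns the result as a data frame.
    return spark.sql(
        f"SELECT ProductName, ListPrice FROM {table_name} "
        "WHERE Category = 'Touring Bikes'"
    )
```

In the notebook, display(save_and_query(spark, df1)) would show the query result as a table.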

 



Now, if you want, you can try a couple more options.


Final step. Clean up resources


  • In the Azure Databricks portal, on the Compute page, select your cluster and select ■ Terminate to shut it down.
  • If you’ve finished exploring Azure Databricks, you can delete the resources you’ve created to avoid unnecessary Azure costs and free up capacity in your subscription.

