sparkbyexamples.com
2a06:98c1:3121::3
Public Scan
Submitted URL: http://sparkbyexamples.com/
Effective URL: https://sparkbyexamples.com/
Submission: On December 18 via api from CH — Scanned from NL
Form analysis
3 forms found in the DOM

https://sparkbyexamples.com/
<form aria-label="Search this website" role="search" class="searchform" action="https://sparkbyexamples.com/" data-hs-cf-bound="true"><input aria-label="Insert search query" type="search" id="ocean-search-form-1" class="field" autocomplete="off"
placeholder="Search" name="s">
<input type="hidden" name="post_type" value="post">
</form>
https://sparkbyexamples.com/
<form aria-label="Search this website" action="https://sparkbyexamples.com/" class="mobile-searchform" data-hs-cf-bound="true"><input aria-label="Insert search query" class="field" id="ocean-mobile-search-2" type="search" name="s" autocomplete="off"
placeholder="Search">
<button aria-label="Submit search" class="searchform-submit">
<i class="icon-magnifier" aria-hidden="true" role="img"></i></button>
<input type="hidden" name="post_type" value="post">
</form>
POST https://sparkbyexamples.com/wp-comments-post.php
<form action="https://sparkbyexamples.com/wp-comments-post.php" method="post" id="commentform" class="comment-form" novalidate="" data-hs-cf-bound="true">
<div class="comment-textarea"><label for="comment" class="screen-reader-text">Comment</label><textarea name="comment" id="comment" cols="39" rows="4" tabindex="0" class="textarea-comment" placeholder="Your comment here..."></textarea></div>
<div class="comment-form-author"><label for="author" class="screen-reader-text">Enter your name or username to comment</label><input name="author" id="author" placeholder="Name (required)" size="22" tabindex="0" aria-required="true"
class="input-name"></div>
<div class="comment-form-email"><label for="email" class="screen-reader-text">Enter your email address to comment</label><input name="email" id="email" placeholder="Email (required)" size="22" tabindex="0" aria-required="true" class="input-email">
</div>
<div class="comment-form-url"><label for="url" class="screen-reader-text">Enter your website URL (optional)</label><input name="url" id="url" placeholder="Website" size="22" tabindex="0" class="input-website"></div>
<p class="comment-form-cookies-consent"><input id="wp-comment-cookies-consent" name="wp-comment-cookies-consent" type="checkbox" value="yes"> <label for="wp-comment-cookies-consent">Save my name, email, and website in this browser for the next time
I comment.</label></p>
<p class="form-submit"><input name="submit" type="submit" id="comment-submit" class="submit" value="Post Comment"> <input type="hidden" name="comment_post_ID" value="63" id="comment_post_ID">
<input type="hidden" name="comment_parent" id="comment_parent" value="0">
</p>
<p style="display:none"><input type="hidden" id="akismet_comment_nonce" name="akismet_comment_nonce" value="c3cdc615ab"></p>
<p style="display:none!important"><label>Δ<textarea name="ak_hp_textarea" cols="45" rows="8" maxlength="100"></textarea></label><input type="hidden" id="ak_js_1" name="ak_js" value="1702916439973">
<script>
document.getElementById("ak_js_1").setAttribute("value", (new Date()).getTime());
</script>
</p>
</form>
Text Content
LEARN APACHE SPARK TUTORIAL 3.5 WITH EXAMPLES

In this Apache Spark Tutorial for Beginners, you will learn Spark version 3.5 with Scala code examples. All Spark examples provided in this tutorial are basic, simple, and easy to practice for beginners who are enthusiastic about learning Spark, and these sample examples were tested in our development environment.

Note: In case you can't find the Spark sample code example you are looking for on this tutorial page, I would recommend using the Search option from the menu bar to find your tutorial.

Table of Contents

* What is Apache Spark
* Features & Advantages
* Architecture
* Installation
* RDD
* DataFrame
* SQL
* Data Sources
* Streaming
* GraphFrame

Note that every sample example explained here is available at the Spark Examples GitHub project for reference.

WHAT IS APACHE SPARK?

Apache Spark is an open-source analytical processing engine for large-scale, powerful distributed data processing and machine learning applications. Spark was originally developed at the University of California, Berkeley, and later donated to the Apache Software Foundation. In February 2014, Spark became a Top-Level Apache Project, and it has since been contributed to by thousands of engineers, making Spark one of the most active open-source projects in Apache.

Apache Spark 3.5 is a framework that is supported in Scala, Python, R, and Java. Below are the different implementations of Spark.

* Spark – Default interface for Scala and Java
* PySpark – Python interface for Spark
* SparklyR – R interface for Spark

The examples explained in this Spark tutorial use Scala, and the same concepts are also explained in the PySpark Tutorial (Spark with Python) examples. Python also supports Pandas, which also provides a DataFrame, but it is not distributed.
APACHE SPARK FEATURES

* In-memory computation
* Distributed processing using parallelize
* Can be used with many cluster managers (Spark Standalone, YARN, Mesos, etc.)
* Fault-tolerant
* Immutable
* Lazy evaluation
* Cache & persistence
* Built-in optimization when using DataFrames
* Supports ANSI SQL

APACHE SPARK ADVANTAGES

* Spark is a general-purpose, in-memory, fault-tolerant, distributed processing engine that allows you to process data efficiently in a distributed fashion.
* Applications running on Spark can be up to 100x faster than traditional systems.
* You will get great benefits from using Spark for data ingestion pipelines.
* Using Spark, you can process data from Hadoop HDFS, AWS S3, Databricks DBFS, Azure Blob Storage, and many other file systems.
* Spark is also used to process real-time data using Streaming and Kafka.
* Using Spark Streaming, you can also stream files from the file system as well as from a socket.
* Spark natively has machine learning and graph libraries.
* Provides connectors to store data in NoSQL databases like MongoDB.

WHAT VERSIONS OF JAVA & SCALA DOES SPARK 3.5 SUPPORT?

Apache Spark 3.5 is compatible with Java versions 8, 11, and 17, Scala versions 2.12 and 2.13, Python 3.8 and newer, and R 3.5 and newer. However, note that support for Java 8 versions prior to 8u371 has been deprecated starting from Spark 3.5.0.

Language | Supported Version
Python   | 3.8 and newer
Java     | 8, 11, and 17 (Java 8 versions prior to 8u371 are deprecated)
Scala    | 2.12 and 2.13
R        | 3.5 and newer

(Apache Spark Tutorial – Versions Supported)

APACHE SPARK ARCHITECTURE

Spark works in a master-slave architecture where the master is called the "Driver" and the slaves are called "Workers". When you run a Spark application, the Spark Driver creates a context that is the entry point to your application, all operations (transformations and actions) are executed on worker nodes, and the resources are managed by the Cluster Manager.

[Architecture diagram – source: https://spark.apache.org/]

For additional learning on this topic, I would recommend reading the following.

* What is a Spark Job
* What is a Spark Stage? Explained
* What is a Spark Executor
* What is the Apache Spark Driver?
* What is a DAG in Spark or PySpark
* What is a Lineage Graph in Spark?
* How to Submit a Spark Job via REST API?

CLUSTER MANAGER TYPES

As of writing this Apache Spark tutorial, Spark supports the cluster managers below:

* Standalone – a simple cluster manager included with Spark that makes it easy to set up a cluster.
* Apache Mesos – a cluster manager that can also run Hadoop MapReduce and Spark applications.
* Hadoop YARN – the resource manager in Hadoop 2; this is the most commonly used cluster manager.
* Kubernetes – an open-source system for automating deployment, scaling, and management of containerized applications.
* local – not really a cluster manager, but worth mentioning: we pass "local" to master() in order to run Spark on a laptop/computer.

SPARK INSTALLATION

In order to run the Apache Spark examples mentioned in this tutorial, you need to have Spark and its required tools installed on your computer. Since most developers use Windows for development, I will explain how to install Spark on Windows in this tutorial. You can also install Spark on a Linux server if needed.

Related: Spark Installation on Mac (macOS)

Download Apache Spark from the Spark Download page by selecting the link under "Download Spark (point 3)".
If you want to use a different version of Spark & Hadoop, select the one you want from the dropdowns; the link at point 3 changes to the selected version and gives you an updated download link. After downloading, untar the binary using 7zip and copy the extracted folder spark-3.5.0-bin-hadoop3 to c:\apps.

Now set the following environment variables.

SPARK_HOME = C:\apps\spark-3.5.0-bin-hadoop3
HADOOP_HOME = C:\apps\spark-3.5.0-bin-hadoop3
PATH = %PATH%;C:\apps\spark-3.5.0-bin-hadoop3\bin

SETUP WINUTILS.EXE

Download the winutils.exe file and copy it to the %SPARK_HOME%\bin folder. Winutils is different for each Hadoop version, hence download the right version from https://github.com/steveloughran/winutils

SPARK-SHELL

The Spark binary distribution comes with an interactive spark-shell. In order to start a shell, go to your SPARK_HOME/bin directory and type "spark-shell". This command loads Spark and displays which version of Spark you are using.

spark-shell

By default, spark-shell provides the spark (SparkSession) and sc (SparkContext) objects to use. Let's see some examples.

[Screenshot: spark-shell creating an RDD]

spark-shell also creates a Spark context Web UI, which by default can be accessed from http://localhost:4040 (or the next available port, such as 4041, if 4040 is already in use).

SPARK-SUBMIT

The spark-submit command is a utility used to run or submit a Spark or PySpark application program (or job) to the cluster by specifying options and configurations. The application you are submitting can be written in Scala, Java, or Python (PySpark). You can use this utility to do the following.

1. Submit Spark applications to different cluster managers like YARN, Kubernetes, Mesos, and Standalone.
2. Submit Spark applications in client or cluster deployment mode.

./bin/spark-submit \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  --driver-memory <value>g \
  --executor-memory <value>g \
  --executor-cores <number of cores> \
  --jars <comma separated dependencies> \
  --class <main-class> \
  <application-jar> \
  [application-arguments]

SPARK WEB UI

Apache Spark provides a suite of web UIs (Jobs, Stages, Tasks, Storage, Environment, Executors, and SQL) to monitor the status of your Spark application, the resource consumption of the Spark cluster, and the Spark configurations. On the Spark Web UI, you can see how the operations are executed.

[Screenshot: Spark Web UI]

SPARK HISTORY SERVER

The Spark History Server keeps a log of all completed Spark applications you submit via spark-submit or spark-shell. Before you start it, you first need to set the below config in spark-defaults.conf.

spark.eventLog.enabled true
spark.history.fs.logDirectory file:///c:/logs/path

Now, start the Spark History Server on Linux or Mac by running:

$SPARK_HOME/sbin/start-history-server.sh

If you are running Spark on Windows, you can start the History Server with the below command.

$SPARK_HOME/bin/spark-class.cmd org.apache.spark.deploy.history.HistoryServer

By default, the History Server listens on port 18080, and you can access it from the browser at http://localhost:18080/

[Screenshot: Spark History Server]

By clicking on each App ID, you will get the details of that application's run in the Spark Web UI. The History Server is very helpful when you are doing Spark performance tuning, as you can cross-check a previous application run against the current run.
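To have your own applications show up in the History Server, the same event-log settings can also be set programmatically when building the SparkSession. The snippet below is a minimal sketch, assuming the local log directory configured above; the application name and path are illustrative assumptions, so adjust them for your environment.

import org.apache.spark.sql.SparkSession

// Minimal sketch: enable event logging so this run appears in the History Server.
// The log directory matches the spark-defaults.conf example above (an assumption).
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("HistoryServerExample")
  .config("spark.eventLog.enabled", "true")
  .config("spark.eventLog.dir", "file:///c:/logs/path")
  .getOrCreate()

// ... run your job here ...
spark.stop()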
SPARK MODULES

* Spark Core
* Spark SQL
* Spark Streaming
* Spark MLlib
* Spark GraphX

[Diagram: Spark Modules]

SPARK CORE

In this section of the Apache Spark Tutorial, you will learn different concepts of the Spark Core library with examples in Scala. Spark Core is the main base library of Spark; it provides the abstractions for distributed task dispatching, scheduling, basic I/O functionality, and more.

Before getting your hands dirty with Spark programming, set up your development environment to run the Spark examples using IntelliJ IDEA.

SPARKSESSION

SparkSession, introduced in version 2.0, is the entry point to underlying Spark functionality for programmatically working with Spark RDDs, DataFrames, and Datasets. Its object, spark, is available by default in spark-shell.

Creating a SparkSession instance is the first statement you would write in a program that uses RDDs, DataFrames, or Datasets. A SparkSession is created using the SparkSession.builder() builder pattern.

import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder()
  .master("local[1]")
  .appName("SparkByExamples.com")
  .getOrCreate()

SPARK CONTEXT

SparkContext has been available since Spark 1.x (JavaSparkContext for Java), and it used to be the entry point to Spark and PySpark before SparkSession was introduced in 2.0. Creating a SparkContext was the first step in a program working with RDDs and connecting to a Spark cluster. Its object, sc, is available by default in spark-shell.

Since Spark 2.x, when you create a SparkSession, a SparkContext object is created by default and can be accessed using spark.sparkContext. Note that you can create just one SparkContext per JVM, but you can create many SparkSession objects.

RDD SPARK TUTORIAL

RDD (Resilient Distributed Dataset) is a fundamental data structure of Spark and the primary data abstraction in Apache Spark and Spark Core. RDDs are fault-tolerant, immutable distributed collections of objects, which means once you create an RDD you cannot change it. Each dataset in an RDD is divided into logical partitions, which can be computed on different nodes of the cluster.

This Apache Spark RDD tutorial will help you start understanding and using Apache Spark RDDs with Scala code examples. All RDD examples provided in this tutorial were also tested in our development environment and are available at the GitHub spark-scala-examples project for quick reference.

In this section of the Apache Spark tutorial, I will introduce the RDD and explain how to create RDDs and use their transformation and action operations. Here is the full article on Spark RDD in case you want to learn more and get your fundamentals strong.

RDD CREATION

RDDs are created primarily in two different ways: first, by parallelizing an existing collection, and second, by referencing a dataset in an external storage system (HDFS, S3, and many more).

SPARKCONTEXT.PARALLELIZE()

sparkContext.parallelize is used to parallelize an existing collection in your driver program. This is a basic method to create an RDD.

// Create RDD from parallelize
val dataSeq = Seq(("Java", 20000), ("Python", 100000), ("Scala", 3000))
val rdd = spark.sparkContext.parallelize(dataSeq)

SPARKCONTEXT.TEXTFILE()

Using the textFile() method, we can read a text (.txt) file from many sources like HDFS, S3, Azure, local storage, etc., into an RDD.

// Create RDD from an external data source
val rdd2 = spark.sparkContext.textFile("/path/textFile.txt")

RDD OPERATIONS

On a Spark RDD, you can perform two kinds of operations: transformations and actions. Both are illustrated in the short sketch below and described in the next two sections.
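As a quick, hedged illustration (the sample data is an assumption, and the spark SparkSession from the earlier examples is assumed to exist), the sketch below builds an RDD, applies the map() and reduceByKey() transformations, and triggers execution with the collect() action.

// Minimal sketch: transformations are lazy; collect() is the action that triggers execution.
val wordsRdd = spark.sparkContext.parallelize(Seq("spark", "scala", "spark", "rdd"))

val countsRdd = wordsRdd
  .map(word => (word, 1)) // transformation: returns a new RDD of (word, 1) pairs
  .reduceByKey(_ + _)     // transformation: sums the counts per word

countsRdd.collect().foreach(println) // action: brings the results to the driver and prints them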
RDD TRANSFORMATIONS

Spark RDD transformations are lazy operations, meaning they don't execute until you call an action on the RDD. Since RDDs are immutable, when you run a transformation (for example map()), instead of updating the current RDD it returns a new RDD. Some transformations on RDDs are flatMap(), map(), reduceByKey(), filter(), and sortByKey(); all of these return a new RDD instead of updating the current one.

RDD ACTIONS

An RDD action operation returns values from an RDD to the driver node. In other words, any RDD function that returns something other than RDD[T] is considered an action. Actions trigger the computation and return the result (for example, a local collection) to the driver program. Some actions on RDDs are count(), collect(), first(), max(), reduce(), and more.

RDD EXAMPLES

* Read CSV file into RDD
* RDD Pair Functions
* Generate DataFrame from RDD

DATAFRAME SPARK TUTORIAL WITH BASIC EXAMPLES

The DataFrame definition is very well explained by Databricks, hence I do not want to define it again and confuse you. Below is the definition I took from Databricks.

> DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs.
>
> – Databricks

DATAFRAME CREATION

The simplest way to create a Spark DataFrame is from a Seq collection. A Spark DataFrame can also be created from an RDD and by reading files from several sources.

Related: Spark Word Count Explained with Example

USING CREATEDATAFRAME()

By using the createDataFrame() function of the SparkSession, you can create a DataFrame.

val data = Seq(("James", "", "Smith", "1991-04-01", "M", 3000),
  ("Michael", "Rose", "", "2000-05-19", "M", 4000),
  ("Robert", "", "Williams", "1978-09-05", "M", 4000),
  ("Maria", "Anne", "Jones", "1967-12-01", "F", 4000),
  ("Jen", "Mary", "Brown", "1980-02-17", "F", -1)
)
val columns = Seq("firstname", "middlename", "lastname", "dob", "gender", "salary")
val df = spark.createDataFrame(data).toDF(columns: _*)

Since DataFrames have a structured format with named columns, we can get the schema of the DataFrame using df.printSchema().

df.show() displays the first 20 rows of the DataFrame.

+---------+----------+--------+----------+------+------+
|firstname|middlename|lastname|dob       |gender|salary|
+---------+----------+--------+----------+------+------+
|James    |          |Smith   |1991-04-01|M     |3000  |
|Michael  |Rose      |        |2000-05-19|M     |4000  |
|Robert   |          |Williams|1978-09-05|M     |4000  |
|Maria    |Anne      |Jones   |1967-12-01|F     |4000  |
|Jen      |Mary      |Brown   |1980-02-17|F     |-1    |
+---------+----------+--------+----------+------+------+

In this Apache Spark SQL DataFrame tutorial, I have explained several commonly used operations and functions on DataFrames and Datasets with working Scala examples. A quick sketch of a few such operations follows; the detailed per-topic articles are listed after it.
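The sketch below is a small, hedged example of everyday DataFrame operations on the df created above (column names follow that example; the specific operations chosen here are illustrative assumptions, not taken from the articles below):

import org.apache.spark.sql.functions._

// Select a few columns and filter rows by a condition
df.select("firstname", "lastname", "salary")
  .filter(col("salary") > 3000)
  .show()

// Add a derived column, then group and aggregate
df.withColumn("bonus", col("salary") * 0.1)
  .groupBy("gender")
  .agg(avg("salary").alias("avg_salary"))
  .show()

The per-topic articles covering these and many more DataFrame operations are listed below.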
* Spark DataFrame – Rename nested column
* How to add or update a column on DataFrame
* How to drop a column on DataFrame
* Spark when otherwise usage
* How to add a literal constant to DataFrame
* Spark Data Types explained
* How to change column data type
* How to Pivot and Unpivot a DataFrame
* Create a DataFrame using StructType & StructField schema
* How to select the first row of each group
* How to sort DataFrame
* How to union DataFrame
* How to drop Rows with null values from DataFrame
* How to split single to multiple columns
* How to concatenate multiple columns
* How to replace null values in DataFrame
* How to remove duplicate rows on DataFrame
* How to remove distinct on multiple selected columns
* Spark map() vs mapPartitions()

SPARK DATAFRAME ADVANCED CONCEPTS

* Spark Partitioning, Repartitioning and Coalesce
* How does Spark shuffle work?
* Spark Cache and Persistence
* Spark Persistence Storage Levels
* Spark Broadcast shared variable
* Spark Accumulator shared variable
* Spark UDF

SPARK ARRAY AND MAP OPERATIONS

* How to create an Array (ArrayType) column on DataFrame
* How to create a Map (MapType) column on DataFrame
* How to convert an Array to columns
* How to create an Array of struct column
* How to explode an Array and map columns
* How to explode an Array of structs
* How to explode an Array of map columns to rows
* How to create a DataFrame with nested Array
* How to explode nested Arrays to rows
* How to flatten nested Array to single Array
* Spark – Convert array of String to a String column

SPARK AGGREGATE

* How to group rows in DataFrame
* How to get Count distinct on DataFrame
* How to add row number to DataFrame
* How to select the first row of each group

SPARK SQL JOINS

* Spark SQL Join

SPARK PERFORMANCE

* Spark Performance Improvement

OTHER HELPFUL TOPICS ON DATAFRAME

* How to stop DEBUG & INFO log messages
* Print DataFrame full column contents
* Unstructured vs semi-structured vs structured files

SPARK SQL SCHEMA & STRUCTTYPE

* How to convert case class to a schema
* Spark Schema explained with examples
* How to create array of struct column
* Spark StructType & StructField
* How to flatten nested column

SPARK SQL FUNCTIONS

Spark SQL provides several built-in functions. When possible, try to leverage this standard library, as the built-in functions are a little more compile-time safe, handle nulls, and perform better compared to UDFs. If your application is performance-critical, try to avoid custom UDFs at all costs, as their performance is not guaranteed. In this section, we will see several Spark SQL function tutorials with Scala examples.

* Spark Date and Time Functions
* Spark String Functions
* Spark Array Functions
* Spark Map Functions
* Spark Aggregate Functions
* Spark Window Functions
* Spark Sort Functions

SPARK DATA SOURCES WITH EXAMPLES

Spark SQL supports operating on a variety of data sources through the DataFrame interface. This section of the tutorial describes reading and writing data using Spark Data Sources with Scala examples. Using the Data Source API, we can load data from, or save data to, RDBMS databases, Avro, Parquet, XML, etc.
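As a flavor of the Data Source API before the per-format articles below, here is a minimal, hedged sketch that reads a CSV file into a DataFrame and writes it back out as Parquet (the file paths and options are illustrative assumptions):

// Read a CSV file with a header into a DataFrame (path is hypothetical)
val csvDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/path/people.csv")

// Write the same data out in Parquet format (path is hypothetical)
csvDf.write
  .mode("overwrite")
  .parquet("/path/people_parquet")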
TEXT

* Spark process Text file
* How to process JSON from a Text file

CSV

* How to process CSV file
* How to convert Parquet file to CSV file
* How to process JSON from a CSV file
* How to convert Avro file to CSV file
* How to convert CSV file to Avro, Parquet & JSON

JSON

* JSON Example (Read & Write)
* How to read JSON from multi-line
* How to read JSON file with custom schema
* How to process JSON from a CSV file
* How to process JSON from a Text file
* How to convert Parquet file to JSON file
* How to convert Avro file to JSON file
* How to convert JSON to Avro, Parquet, CSV file

PARQUET

* Parquet Example (Read and Write)
* How to convert Parquet file to CSV file
* How to convert Parquet file to Avro file
* How to convert Avro file to Parquet file

AVRO

* Avro Example (Read and Write)
* Spark 2.3 – Apache Avro Example
* How to convert Avro file to CSV file
* How to convert Parquet file to Avro file
* How to convert Avro file to JSON file
* How to convert Avro file to Parquet file

ORC

* Spark Read & Write ORC

XML

* Processing Nested XML structured files
* How to validate XML with XSD

HIVE & TABLES

* Spark Save DataFrame to Hive Table
* Spark JDBC Parallel Read
* Read JDBC Table to Spark DataFrame
* Spark saveAsTable() with Examples
* Spark Query Table using JDBC
* Spark Read and Write MySQL Database Table
* Spark with SQL Server – Read and Write Table
* Spark spark.table() vs spark.read.table()

SQL SPARK TUTORIAL

Spark SQL is one of the most-used Spark modules; it is used for processing structured, columnar data. Once you have a DataFrame created, you can interact with the data by using SQL syntax. In other words, Spark SQL brings native raw SQL queries to Spark, meaning you can run traditional ANSI SQL on Spark DataFrames. In the later sections of this Apache Spark tutorial, you will learn in detail how to use SQL select, where, group by, join, union, etc.

In order to use SQL, first we need to create a temporary view on the DataFrame using the createOrReplaceTempView() function. Once created, this view can be accessed throughout the SparkSession, and it is dropped when the SparkSession terminates. SQL queries against the view are executed using the sql() method of the SparkSession, and this method returns a new DataFrame.

df.createOrReplaceTempView("PERSON_DATA")
val df2 = spark.sql("SELECT * from PERSON_DATA")
df2.printSchema()
df2.show()

Let's see another example using group by.

val groupDF = spark.sql("SELECT gender, count(*) from PERSON_DATA group by gender")
groupDF.show()

This yields the below output.

+------+--------+
|gender|count(1)|
+------+--------+
|     F|       2|
|     M|       3|
+------+--------+

Similarly, you can run any traditional SQL queries on DataFrames using Spark SQL.

SPARK HDFS & S3 TUTORIAL

* Processing files from Hadoop HDFS (TEXT, CSV, Parquet, Avro, JSON)
* Processing TEXT files from Amazon S3 bucket
* Processing JSON files from Amazon S3 bucket
* Processing CSV files from Amazon S3 bucket
* Processing Parquet files from Amazon S3 bucket
* Processing Avro files from Amazon S3 bucket

SPARK STREAMING TUTORIAL & EXAMPLES

Spark Streaming is a scalable, high-throughput, fault-tolerant stream processing system that supports both batch and streaming workloads. It is used to process real-time data from sources like file system folders, TCP sockets, S3, Kafka, Flume, Twitter, and Amazon Kinesis, to name a few. The processed data can be pushed to databases, Kafka, live dashboards, etc.

[Streaming architecture diagram – source: https://spark.apache.org/]
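To give a flavor of the API before the article links that follow, here is a minimal, hedged Structured Streaming sketch that counts words arriving on a TCP socket and prints the running counts to the console. The host and port are assumptions, and the spark SparkSession from earlier is assumed to exist; you could feed it with a tool such as netcat.

import spark.implicits._

// Read a stream of lines from a TCP socket (hypothetical host/port)
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Split the lines into words and keep a running count per word
val wordCounts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

// Write the complete set of counts to the console after each trigger
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()

The articles below cover these streaming sources and sinks in more detail.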
* Spark Streaming – OutputModes Append vs Complete vs Update
* Spark Streaming – Read JSON Files From Directory with Scala Example
* Spark Streaming – Read data From TCP Socket with Scala Example
* Spark Streaming – Consuming & Producing Kafka messages in JSON format
* Spark Streaming – Consuming & Producing Kafka messages in Avro format
* Using from_avro and to_avro functions
* Reading Avro data from Kafka topic using from_avro() and to_avro()
* Spark Batch Processing using Kafka Data Source

SPARK WITH KAFKA TUTORIALS

* Spark Streaming – Consuming & Producing Kafka messages in JSON format
* Spark Streaming – Consuming & Producing Kafka messages in Avro format
* Using from_avro and to_avro functions
* Reading Avro data from Kafka topic using from_avro() and to_avro()
* Spark Batch Processing using Kafka Data Source

SPARK – HBASE TUTORIALS & EXAMPLES

In this section of the Spark tutorial, you will learn about several Apache HBase Spark connectors, how to read an HBase table into a Spark DataFrame, and how to write a DataFrame to an HBase table.

* Spark HBase Connectors explained
* Writing Spark DataFrame to HBase table using the shc-core Hortonworks library
* Creating Spark DataFrame from an HBase table using the shc-core Hortonworks library

SPARK – HIVE TUTORIALS

In this section, you will learn what Apache Hive is and see several examples of connecting to Hive, creating Hive tables, and reading them into DataFrames.

* Start HiveServer2 and connect to hive beeline

SPARK GRAPHX AND GRAPHFRAMES

Spark GraphFrames were introduced in the Spark 3.0 version to support graphs on DataFrames. Prior to 3.0, Spark had the GraphX library, which runs on RDDs and loses all DataFrame capabilities.

WHAT ARE THE KEY FEATURES AND IMPROVEMENTS RELEASED IN SPARK 3.5.0?

Following are some of the key features and improvements in Spark 3.5:

* Spark Connect: This release extends the general availability of Spark Connect with support for Scala and Go clients, distributed training and inference support, and enhanced compatibility for Structured Streaming.
* PySpark and SQL functionality: New functionality has been introduced in PySpark and SQL, including the SQL IDENTIFIER clause, named argument support for SQL function calls, SQL function support for HyperLogLog approximate aggregations, and Python user-defined table functions.
* Distributed training with DeepSpeed: The release simplifies distributed training with DeepSpeed, making it more accessible.
* Structured Streaming: It introduces watermark propagation among operators and dropDuplicatesWithinWatermark operations in Structured Streaming, enhancing its capabilities.
* English SDK: The English SDK for Apache Spark integrates the extensive expertise of generative AI into Apache Spark.

REFERENCES:

* https://spark.apache.org/
* https://databricks.com/spark/about
* https://github.com/apache/spark

Author: Naveen

Naveen (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them.
Follow Naveen on LinkedIn.

APACHE SPARK TUTORIAL

SPARK INTRODUCTION

* Spark – Installation on Mac
* Spark – Installation on Windows
* Spark – Installation on Linux | Ubuntu
* Spark – Cluster Setup with Hadoop Yarn
* Spark – Web/Application UI
* Spark – Setup with Scala and IntelliJ
* Spark – How to Run Examples From this Site on IntelliJ IDEA
* Spark – SparkSession
* Spark – SparkContext

SPARK RDD

* Spark RDD – Parallelize
* Spark RDD – Read text file
* Spark RDD – Read CSV
* Spark RDD – Create RDD
* Spark RDD – Actions
* Spark RDD – Pair Functions
* Spark RDD – Repartition and Coalesce
* Spark RDD – Shuffle Partitions
* Spark RDD – Cache vs Persist
* Spark RDD – Persistence Storage Levels
* Spark RDD – Broadcast Variables
* Spark RDD – Accumulator Variables
* Spark RDD – Convert RDD to DataFrame

SPARK SQL TUTORIAL

* DataFrame – createDataFrame()
* DataFrame – where() & filter()
* DataFrame – withColumn()
* DataFrame – withColumnRenamed()
* DataFrame – drop()
* DataFrame – distinct()
* DataFrame – groupBy()
* DataFrame – join()
* DataFrame – map() vs mapPartitions()
* DataFrame – foreach() vs foreachPartition()
* DataFrame – pivot()
* DataFrame – union()
* DataFrame – collect()
* DataFrame – cache() & persist()
* DataFrame – udf()
* Spark SQL StructType & StructField

SPARK SQL FUNCTIONS

* Spark SQL String Functions
* Spark SQL Date and Timestamp Functions
* Spark SQL Array Functions
* Spark SQL Map Functions
* Spark SQL Sort Functions
* Spark SQL Aggregate Functions
* Spark SQL Window Functions
* Spark SQL JSON Functions

SPARK DATA SOURCE API

* Spark – Read & Write CSV file
* Spark – Read and Write JSON file
* Spark – Read & Write Parquet file
* Spark – Read & Write XML file
* Spark – Read & Write Avro files
* Spark – Read & Write Avro files (Spark version 2.3.x or earlier)
* Spark – Read & Write HBase using "hbase-spark" Connector
* Spark – Read & Write from HBase using Hortonworks
* Spark – Read & Write ORC file
* Spark – Read Binary File

SPARK STREAMING & KAFKA

* Spark Streaming – OutputModes
* Spark Streaming – Reading Files From Directory
* Spark Streaming – Reading Data From TCP Socket
* Spark Streaming – Processing Kafka Messages in JSON Format
* Spark Streaming – Processing Kafka messages in AVRO Format
* Spark SQL Batch – Consume & Produce Kafka Message