Installing Apache Spark on Windows

Introduction

Apache Spark is an open-source framework that processes large volumes of stream data from multiple sources. Spark is used in distributed computing with machine learning applications, data analytics, and graph-parallel processing.

This guide will show you how to install Apache Spark on Windows 10 and test the installation.


Prerequisites

  • A system running Windows 10
  • A user account with administrator privileges (required to install software, modify file permissions, and modify system PATH)
  • Command Prompt or PowerShell
  • A tool to extract .tar files, such as 7-Zip

Installing Apache Spark on Windows 10 may seem complicated to novice users, but this simple tutorial will have you up and running. If you already have Java 8 and Python 3 installed, you can skip the first two steps.

Step 1: Install Java 8

Apache Spark requires Java 8. You can check to see if Java is installed using the command prompt.

Open the command line by clicking Start > type cmd > click Command Prompt.

Type the following command in the command prompt:

java -version

If Java is installed, it will respond with the following output:

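It should print something similar to the following, assuming Java 8 (the update and build numbers on your system will likely differ):

java version "1.8.0_251"
Java(TM) SE Runtime Environment (build 1.8.0_251-b08)
Java HotSpot(TM) 64-Bit Server VM (build 25.251-b08, mixed mode)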

Your version may be different. The second digit is the Java version – in this case, Java 8.

If you don’t have Java installed:

1. Open a browser window, and navigate to https://java.com/en/download/.

Java download page in a browser

2. Click the Java Download button and save the file to a location of your choice.

3. Once the download finishes, double-click the file to install Java.

Note: At the time this article was written, the latest Java version was 1.8.0_251. Installing a later version will still work. This process only needs the Java Runtime Environment (JRE); the full Development Kit (JDK) is not required. The JDK download link is https://www.oracle.com/java/technologies/javase-downloads.html.

Step 2: Install Python

1. To install the Python package manager, navigate to https://www.python.org/ in your web browser.

2. Hover over the Download menu option and click Python 3.8.3, the latest version at the time of writing.

3. Once the download finishes, run the file.

Python download page for version 3.8.3

4. Near the bottom of the first setup dialog box, check Add Python 3.8 to PATH. Leave the other box checked.

5. Next, click Customize installation.

Python wizard 3.8.3, step to add Python to PATH

6. You can leave all boxes checked at this step, or you can uncheck the options you do not want.

7. Click Next.

8. Select the box Install for all users and leave other boxes as they are.

9. Under Customize install location, click Browse and navigate to the C drive. Add a new folder and name it Python.

10. Select that folder and click OK.

Python installation, advanced options step

11. Click Install, and let the installation complete.

12. When the installation completes, click the Disable path length limit option at the bottom and then click Close.

13. If you have a command prompt open, restart it. Verify the installation by checking the version of Python:

python --version

The output should print Python 3.8.3.

Note: For detailed instructions on how to install Python 3 on Windows or how to troubleshoot potential issues, refer to our Install Python 3 on Windows guide.

Step 3: Download Apache Spark

1. Open a browser and navigate to https://spark.apache.org/downloads.html.

2. Under the Download Apache Spark heading, there are two drop-down menus. Use the current non-preview version.

  • In our case, in the Choose a Spark release drop-down menu, select 2.4.5 (Feb 05 2020).
  • In the second drop-down Choose a package type, leave the selection Pre-built for Apache Hadoop 2.7.

3. Click the spark-2.4.5-bin-hadoop2.7.tgz link.

Apache Spark download page.

4. A page with a list of mirrors loads where you can see different servers to download from. Pick any from the list and save the file to your Downloads folder.

Step 4: Verify Spark Software File

1. Verify the integrity of your download by checking the checksum of the file. This ensures you are working with unaltered, uncorrupted software.

2. Navigate back to the Spark Download page and open the Checksum link, preferably in a new tab.

3. Next, open a command line and enter the following command:

certutil -hashfile c:\users\username\Downloads\spark-2.4.5-bin-hadoop2.7.tgz SHA512

4. Replace username with your user name. The system displays a long alphanumeric code, along with the message Certutil: -hashfile completed successfully.

Checksum output for the Spark installation file.

5. Compare the code to the one you opened in a new browser tab. If they match, your download file is uncorrupted.
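If you prefer PowerShell, the built-in Get-FileHash cmdlet computes the same digest; a sketch, assuming the file is in your Downloads folder:

Get-FileHash C:\Users\username\Downloads\spark-2.4.5-bin-hadoop2.7.tgz -Algorithm SHA512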

Step 5: Install Apache Spark

Installing Apache Spark involves extracting the downloaded file to the desired location.

1. Create a new folder named Spark in the root of your C: drive. From a command line, enter the following:

cd \

mkdir Spark

2. In Explorer, locate the Spark file you downloaded.

3. Right-click the file and extract it to C:\Spark using the tool you have on your system (e.g., 7-Zip).

4. Now, your C:\Spark folder has a new folder spark-2.4.5-bin-hadoop2.7 with the necessary files inside.
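As an alternative to 7-Zip, recent Windows 10 builds include a command-line tar utility, so steps 2 and 3 can also be done from the Command Prompt (a sketch; replace username with your user name):

tar -xvzf C:\Users\username\Downloads\spark-2.4.5-bin-hadoop2.7.tgz -C C:\Spark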

Step 6: Add winutils.exe File

Download the winutils.exe file matching the underlying Hadoop version of the Spark build you downloaded.

1. Navigate to https://github.com/cdarlint/winutils, open the bin folder of the Hadoop version matching your Spark download, locate winutils.exe, and click it.

Winutils download page

2. Find the Download button on the right side to download the file.

3. Now, create a new folder hadoop with a bin subfolder in the root of C:, using Windows Explorer or the Command Prompt.

4. Copy the winutils.exe file from the Downloads folder to C:\hadoop\bin.
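From the Command Prompt, steps 3 and 4 look like this (replace username with your user name):

mkdir C:\hadoop\bin
copy C:\Users\username\Downloads\winutils.exe C:\hadoop\bin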

Step 7: Configure Environment Variables

Configuring environment variables in Windows adds the Spark and Hadoop locations to your system PATH. It allows you to run the Spark shell directly from a command prompt window.

1. Click Start and type environment.

2. Select the result labeled Edit the system environment variables.

3. A System Properties dialog box appears. In the lower-right corner, click Environment Variables and then click New in the next window.

Add new environment variable in Windows.

4. For Variable Name type SPARK_HOME.

5. For Variable Value type C:\Spark\spark-2.4.5-bin-hadoop2.7 and click OK. If you changed the folder path, use that one instead.

Adding Spark home variable path in Windows.

6. In the top box, click the Path entry, then click Edit. Be careful with editing the system path. Avoid deleting any entries already on the list.

Edit the path variable to add Spark home.

7. You should see a box with entries on the left. On the right, click New.

8. The system highlights a new line. Enter the path to the Spark folder C:\Spark\spark-2.4.5-bin-hadoop2.7\bin. We recommend using %SPARK_HOME%\bin to avoid possible issues with the path.

Adding the Spark home to the path Windows variable.

9. Repeat this process for Hadoop and Java.

  • For Hadoop, the variable name is HADOOP_HOME and for the value use the path of the folder you created earlier: C:\hadoop. Add C:\hadoop\bin to the Path variable field, but we recommend using %HADOOP_HOME%\bin.
  • For Java, the variable name is JAVA_HOME and for the value use the path to your Java JDK directory (in our case it’s C:\Program Files\Java\jdk1.8.0_251).

10. Click OK to close all open windows.

Note: Start by restarting the Command Prompt to apply the changes. If that doesn’t work, reboot the system.
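If you prefer the command line, the same three variables can be set with the built-in setx utility; a sketch (setx writes to the user environment, and the values only appear in consoles opened afterwards; the Path entries are still safest to edit through the dialog, since setx truncates long values):

setx SPARK_HOME "C:\Spark\spark-2.4.5-bin-hadoop2.7"
setx HADOOP_HOME "C:\hadoop"
setx JAVA_HOME "C:\Program Files\Java\jdk1.8.0_251"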

Step 8: Launch Spark

1. Open a new command-prompt window by right-clicking and choosing Run as administrator.

2. To start Spark, enter:

C:\Spark\spark-2.4.5-bin-hadoop2.7\bin\spark-shell

If you set the environment path correctly, you can type spark-shell to launch Spark.

3. The system should display several lines indicating the status of the application. You may get a Java pop-up. Select Allow access to continue.

Finally, the Spark logo appears, and the prompt displays the Scala shell.

Scala shell after launching Apache Spark in Windows

4. Open a web browser and navigate to http://localhost:4040/.

5. You can replace localhost with the name of your system.

6. You should see an Apache Spark shell Web UI. The example below shows the Executors page.

Spark Windows Executors page Web UI

7. To exit Spark and close the Scala shell, press ctrl-d in the command-prompt window.

Note: If you installed Python, you can run Spark using Python with this command:

pyspark

Exit using quit().

Test Spark

In this example, we will launch the Spark shell and use Scala to read the contents of a file. You can use an existing file, such as the README file in the Spark directory, or you can create your own. We created pnaptest with some text.

1. Open a command-prompt window and navigate to the folder with the file you want to use and launch the Spark shell.

2. First, declare a variable to use in the Spark context, holding the name of the file. Remember to add the file extension if there is one.

val x = sc.textFile("pnaptest")

3. The output shows an RDD is created. Then, we can view the file contents by using this command to call an action:

x.take(11).foreach(println)
Spark Scala test action with reading a file.

This command instructs Spark to print 11 lines from the file you specified. To transform the data in this file (value x), add another value y and apply a map transformation.

4. For example, you can print the characters in reverse with this command:

val y = x.map(_.reverse)

5. The system creates a child RDD in relation to the first one. Then, specify how many lines you want to print from the value y:

y.take(11).foreach(println)
Spark scala action to put the characters of a file in reverse.

The output prints 11 lines of the pnaptest file with the characters of each line reversed.
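As a slightly larger exercise, the same RDD supports the classic word count; a minimal sketch you can paste into the same spark-shell session:

val counts = x.flatMap(_.split("\\s+")).map(word => (word, 1)).reduceByKey(_ + _)
counts.take(10).foreach(println)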

When done, exit the shell using ctrl-d.

Conclusion

You should now have a working installation of Apache Spark on Windows 10 with all dependencies installed. Get started running an instance of Spark in your Windows environment.

Our suggestion is to also learn more about what Spark DataFrame is, the features, and how to use Spark DataFrame when collecting data.

In this article, I will explain step by step how to install Apache Spark on Windows 7, 10, and later versions, and also how to start a history server and monitor your jobs using the Web UI.

Related:

  • Spark Install Latest Version on Mac
  • PySpark Install on Windows

Install Java 8 or Later

To install Apache Spark on Windows, you need Java 8 or a later version, so download Java from Oracle and install it on your system. If you prefer OpenJDK, you can download it from here.

After the download, double-click the downloaded .exe file (jdk-8u201-windows-x64.exe) to install it on your Windows system. Choose any custom directory or keep the default location.

Note: This article explains installing Apache Spark with Java 8; the same steps also work for Java 11 and 13.

Apache Spark comes as a compressed tar/zip file, so installation on Windows is straightforward: you just download and untar it. Download Apache Spark by accessing the Spark Download page and select the link from “Download Spark (point 3 from below screenshot)”.

If you want a different version of Spark & Hadoop, select it from the drop-downs; the link at point 3 changes to the selected version and provides you with an updated download link.

Apache Spark Installation windows

After the download, extract the archive using 7-Zip or any similar utility and copy the extracted directory spark-3.0.0-bin-hadoop2.7 to c:\apps\opt\spark-3.0.0-bin-hadoop2.7.

Spark Environment Variables

After installing Java and Apache Spark on Windows, set the JAVA_HOME, SPARK_HOME, HADOOP_HOME, and PATH environment variables. If you already know how to set environment variables on Windows, add the following.


JAVA_HOME = C:\Program Files\Java\jdk1.8.0_201
PATH = %PATH%;%JAVA_HOME%

SPARK_HOME  = C:\apps\opt\spark-3.0.0-bin-hadoop2.7
HADOOP_HOME = C:\apps\opt\spark-3.0.0-bin-hadoop2.7
PATH=%PATH%;%SPARK_HOME%
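After opening a new Command Prompt, you can confirm the variables took effect:

echo %SPARK_HOME%
echo %HADOOP_HOME%
echo %JAVA_HOME%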

Follow the steps below if you are not sure how to add or edit environment variables on Windows.

  1. Open the System Environment Variables window and select Environment Variables.
Apache Spark Installation windows

2. On the following Environment variables screen, add SPARK_HOME, HADOOP_HOME, and JAVA_HOME by selecting the New option.

spark installation environment variable

3. This opens up the New User Variables window where you can enter the variable name and value.

4. Now edit the PATH variable.

apache spark install windows

5. Add the Spark, Java, and Hadoop bin locations by selecting the New option.

spark install windows

Spark with winutils.exe on Windows

Many beginners think Apache Spark needs a Hadoop cluster installed to run, but that’s not true; Spark can run without Hadoop and HDFS, for example on AWS using S3 or on Azure using Blob Storage.

To run Apache Spark on Windows, you need winutils.exe, which implements the POSIX-like file access operations Spark expects on top of the Windows API.

winutils.exe enables Spark to use Windows-specific services, including running shell commands in a Windows environment.

Download winutils.exe for Hadoop 2.7 and copy it to the %SPARK_HOME%\bin folder. winutils builds differ per Hadoop version, so download the right one for your Spark/Hadoop distribution from https://github.com/steveloughran/winutils

Apache Spark shell

spark-shell is a CLI utility that comes with the Apache Spark distribution. Open a command prompt, run cd %SPARK_HOME%\bin, and type spark-shell to start the Apache Spark shell. You should see something like the screen below (ignore the error you see at the end). Sometimes it takes a minute or two for your Spark instance to initialize.

Spark Shell Command Line

spark-shell also creates a Spark context Web UI, which by default is accessible at http://localhost:4040 (if that port is taken, Spark falls back to 4041).

On the spark-shell command line, you can run any Spark statements, like creating an RDD or getting the Spark version.


scala> spark.version
res2: String = 3.0.0

scala> val rdd = sc.parallelize(Array(1,2,3,4,5,6,7,8,9,10))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala>

This completes the installation of Apache Spark on Windows 7, 10, and later versions.

Where to go Next?

You can continue with the document below to see how to debug logs using the Spark Web UI and enable the Spark history server, or follow these links as next steps:

  • Spark RDD Tutorial
  • Spark Hello World Example in IntelliJ IDEA

Web UI on Windows

Apache Spark provides a suite of Web UIs (Jobs, Stages, Tasks, Storage, Environment, Executors, and SQL) to monitor the status of your Spark application, resource consumption of Spark cluster, and Spark configurations. On Spark Web UI, you can see how the operations are executed.


Spark Web UI

History Server

The history server keeps a log of all Spark applications you submit via spark-submit or spark-shell. Enable log collection by adding the configs below to the spark-defaults.conf file, located in the %SPARK_HOME%\conf directory.


spark.eventLog.enabled true
spark.history.fs.logDirectory file:///c:/logs/path
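Note that Spark does not create the event-log directory for you; if it is missing, applications typically fail at startup. Create it first, matching the path in spark.history.fs.logDirectory:

mkdir c:\logs\path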

After setting the above properties, start the history server by running the command below.


%SPARK_HOME%\bin\spark-class.cmd org.apache.spark.deploy.history.HistoryServer

By default, the history server listens on port 18080; you can access it from a browser at http://localhost:18080/.


Spark History Server

By clicking on each App ID, you will get the details of the application in Spark web UI.

Conclusion

In summary, you have learned how to install Apache Spark on Windows, run sample statements in spark-shell, and start the Spark Web UI and history server.

If you have any issues setting up, please message me in the comments section and I will try to respond with a solution.

Happy Learning !!

Related Articles

  • Apache Spark Installation on Linux
  • What is Apache Spark Driver?
  • Apache Spark 3.x Install on Mac
  • Install Apache Spark Latest Version on Mac
  • How to Check Spark Version
  • What does setMaster(local[*]) mean in Spark
  • Spark Start History Server
  • Spark with Scala setup on IntelliJ

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python, and R and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

In this document, we will cover the installation procedure of Apache Spark on the Windows 10 operating system.

Prerequisites

This guide assumes that you are using Windows 10 and the user has admin permissions.

System requirements:

  • Windows 10 OS
  • At least 4 GB RAM
  • Free space of at least 20 GB

Installation Procedure

Step 1: Go to Apache Spark’s official download page and choose the latest release. For the package type, choose ‘Pre-built for Apache Hadoop’.

The page will look like the one below.

Apache Spark installation Process

Step 2: Once the download is complete, extract the file using WinZip, WinRAR, or 7-Zip.

Step 3: Create a folder called Spark under your user directory, as below, and copy the contents of the extracted file into it.

C:\Users\<USER>\Spark

It looks like the below after copy-pasting into the Spark directory.

Apache Spark installation Process

Step 4: Go to the conf folder and open the log configuration file called log4j.properties.template. Change INFO to WARN (it can be ERROR to reduce logging further). This and the next steps are optional.
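In Spark 2.x, the line to change is the root logger level near the top of the file; after the edit it should look similar to this:

log4j.rootCategory=WARN, console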

Remove the .template extension so that Spark can read the file.

Before removing the .template extension, the files look like below.

Apache Spark installation Process

After removing the .template extension, the files will look like below.

Apache Spark installation Process

Step 5: Now, we need to configure the path.

Go to Control Panel -> System and Security -> System -> Advanced Settings -> Environment Variables

Add the new user variable (or system variable) SPARK_HOME, pointing to the Spark folder you created in Step 3. (To add a new user variable, click the New button under User variables for <USER>.)

Apache Spark installation Process

Click OK.

Add %SPARK_HOME%\bin to the path variable.

Apache Spark installation Process

Click OK.

Step 6: Spark needs a piece of Hadoop to run. For Hadoop 2.7, you need winutils.exe.

You can find winutils.exe on this page; download it from there.

Step 7: Create a folder called winutils in C drive and create a folder called bin inside. Then, move the downloaded winutils file to the bin folder.

C:\winutils\bin

Apache Spark installation Process

Add a user (or system) variable HADOOP_HOME, pointing to C:\winutils, just as you did for SPARK_HOME.

Apache Spark installation Process

Apache Spark installation Process

Click OK.

Step 8: To run Apache Spark, Java must be installed on your computer. If you don’t have Java installed on your system, follow the process below.

Java Installation Steps

1. Go to the official Java download site.

Accept the License Agreement for Java SE Development Kit 8u201.

2. Download the jdk-8u201-windows-x64.exe file.

3. Double-click the downloaded .exe file, and you will see the window shown below.

Java Installation Steps

4. Click Next.

5. Then the window below will be displayed.

Java Installation Steps

6. Click Next.

7. The window below will be displayed after the installation runs.

Java Installation Steps

8. Click Close.

Test Java Installation

Open the command line and type java -version; it should display the installed version of Java.

Java Installation Steps

You should also check that JAVA_HOME is set and that %JAVA_HOME%\bin is included in the user variables (or system variables).

1. In the end, the environment variables include three new entries: JAVA_HOME (if you had to add it), SPARK_HOME, and HADOOP_HOME.

Java Installation Steps

2. Create the c:\tmp\hive directory. This step is not necessary for later versions of Spark, which create the folder on first start, but it is good practice to create it yourself.

C:\tmp\hive
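If Spark later complains about permissions on this directory, a commonly cited fix is to grant them with winutils; a sketch, assuming winutils.exe sits in C:\winutils\bin as set up in Step 7:

C:\winutils\bin\winutils.exe chmod 777 C:\tmp\hive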

Test Installation

Open the command line and type spark-shell, and you will get the result below.

Test Installation in Apache Spark

We have completed the Spark installation on the Windows system. Let’s create an RDD and a DataFrame.

We will create one RDD and one DataFrame; then we will wrap up.

1. We can create an RDD in three ways; we will use one of them here.

Define any list, then parallelize it to create an RDD. Copy and paste the code below, one line at a time, into the command line.

val list = Array(1,2,3,4,5)
val rdd = sc.parallelize(list)

The above creates an RDD.

2. Now, we will create a DataFrame from the RDD. Follow the steps below.

import spark.implicits._
val df = rdd.toDF("id")

The above code creates a DataFrame with id as a column.

To display the data in the DataFrame, use the command below.

df.show()

It will display the below output.
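A sketch of the expected result (the exact formatting may vary slightly by Spark version):

+---+
| id|
+---+
|  1|
|  2|
|  3|
|  4|
|  5|
+---+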


How to uninstall Spark from Windows 10 System

Please follow the below steps to uninstall spark on Windows 10.

Remove the below System/User variables from the system:

  • SPARK_HOME
  • HADOOP_HOME

To remove System/User variables, please follow the below steps:

Go to Control Panel -> System and Security -> System -> Advanced Settings -> Environment Variables, find SPARK_HOME and HADOOP_HOME, select them, and press the DELETE button.

Find the Path variable -> Edit -> select %SPARK_HOME%\bin -> press the DELETE button.

Select %HADOOP_HOME%\bin -> press the DELETE button -> OK button.

Open a Command Prompt, type spark-shell, and press Enter; you now get an error, confirming that Spark has been successfully uninstalled from the system.


Conclusion 

Java 8 or a more recent version is required to install Apache Spark on Windows, so obtain and install it by visiting Oracle. You may download OpenJDK from this page if you’d like. 

Double-click the downloaded .exe file (jdk-8u201-windows-x64.exe) to install it on your Windows machine once it has finished downloading. You may choose a custom directory or stick with the default. The article provides all the details on setting up Apache Spark from scratch.

Once you have covered all the steps mentioned in this article, Apache Spark should operate perfectly on Windows 10. Start off by launching a Spark instance in your Windows environment. If you are facing any problems, let us know in the comments.  

Apache Spark is a lightning-fast unified analytics engine used for cluster computing on large data sets, such as big data workloads on Hadoop, with the aim of running programs in parallel across multiple nodes. It combines multiple stack libraries such as SQL and DataFrames, GraphX, MLlib, and Spark Streaming.

Spark operates in 4 different modes:

  1. Standalone Mode: Here all processes run within the same JVM process.
  2. Standalone Cluster Mode: In this mode, it uses the Job-Scheduling framework in-built in Spark.
  3. Apache Mesos: In this mode, the worker nodes run on various machines, but the driver runs only on the master node.
  4. Hadoop YARN: In this mode, the driver runs inside the application master and is managed by YARN on the cluster.

In this article, we will explore Apache Spark installation in standalone mode. Apache Spark is developed in the Scala programming language and runs on the JVM. Java installation is mandatory for Spark, so let’s start with Java installation.

Installing Java:

Step 1: Download the Java JDK.

Step 2: Open the downloaded Java SE Development Kit and follow along with the instructions for installation.

Step 3: Open the environment variables dialog by typing “environment variables” in the Windows search bar.

Set JAVA_HOME Variables:

To set the JAVA_HOME variable follow the below steps:

  • Under User variables, add JAVA_HOME with the value C:\Program Files\Java\jdk1.8.0_261.
  • Under System variables, add C:\Program Files\Java\jdk1.8.0_261\bin to the PATH variable.
  • Open a command prompt and type java -version; the output shown verifies the Java installation.

Installing Scala: 

For installing Scala on your local machine follow the below steps:

Step 1: Download Scala. 

Step 2: Run the .exe file and follow the instructions to customize the setup according to your needs.

Step 3: Accept the agreement and click the next button. 

Set environmental variables:

  • Under User variables, add SCALA_HOME with the value C:\Program Files (x86)\scala.
  • Under System variables, add C:\Program Files (x86)\scala\bin to the PATH variable.

Verify Scala installation:

In the Command prompt use the below command to verify Scala installation:

scala

Installing Spark:

Download a pre-built version of Spark and extract it into the C drive, such as C:\Spark. There is no installer to run; the extracted files are the installation.

Set environmental variables:

  • Under User variables, add SPARK_HOME with the value C:\spark\spark-2.4.6-bin-hadoop2.7.
  • Under System variables, add %SPARK_HOME%\bin to the PATH variable.

Download Windows Utilities:

If you wish to operate on Hadoop data, follow the steps below to download the winutils utility for Hadoop:

Step 1: Download the winutils.exe file.

Step 2: Copy the file to your Spark bin directory, e.g., C:\spark\spark-2.4.6-bin-hadoop2.7\bin.

Step 3: Now execute spark-shell in cmd to verify the Spark installation as shown below:


How to install Spark on Windows 11

Hello everyone, today we are going to install Spark on Windows.

Prerequisites

  1. Java 8 runtime environment (JRE)
  2. Apache Spark 3.3.0

Step 1 — Install Java 8 or Later

To install Apache Spark on Windows, you need Java 8 or a later version, so download Java from Oracle and install it on your system. If you prefer OpenJDK, you can download it from here.

You can install Java 8 from the link here.

After the file download finishes, open a new command prompt to unpack the package.

Because I am installing Java in the Java folder of my C drive (C:\Java), we first create the directory:

mkdir C:\Java

Then run the following command to unzip:

tar -xvzf  jre-8u361-windows-x64.tar.gz -C C:\Java\

Note: This article explains installing Apache Spark with Java 8; the same steps also work for Java 11 and 13.

Step 2 — Download packages

Apache Spark comes as a compressed tar/zip file, so installation on Windows is straightforward: you just download and untar it.

For this tutorial, we are going to install Apache Spark 3.3.2, pre-built for Apache Hadoop 3.3.

Download Apache Spark by accessing the Spark Download page and selecting the “Download Spark” link.


We download the following file:

https://www.apache.org/dyn/closer.lua/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz

After the download, extract the archive using 7-Zip or the tar command below; the extracted directory spark-3.3.2-bin-hadoop3 will end up in C:\Spark\spark-3.3.2-bin-hadoop3.

Let us open a terminal and create the directory:

mkdir C:\Spark

Then go to the directory where the file was downloaded:

cd Downloads

and run the following command to unzip:

tar -xvzf  spark-3.3.2-bin-hadoop3.tgz -C C:\Spark\

The extracted files are in the directory C:\Spark\spark-3.3.2-bin-hadoop3

cd C:\Spark\spark-3.3.2-bin-hadoop3
dir


Step 3 — Edit Spark Environment Variables

Now that we’ve downloaded and unpacked all the artefacts, we need to configure the important environment variables.

First, click the Windows button and type environment.


a) Configure Environment variables

We configure the JAVA_HOME environment variable by adding a new environment variable:

Variable name : JAVA_HOME
Variable value: C:\Java\jre1.8.0_361

Follow the below steps if you are not aware of how to add or edit environment variables on windows.

  1. Open the System Environment Variables window and select Environment Variables.
  2. On the following Environment variables screen, add SPARK_HOME, HADOOP_HOME, and JAVA_HOME by selecting the New option.
  3. This opens up the New User Variables window where you can enter the variable name and value.


Do the same for the SPARK_HOME environment variable:

Variable name : SPARK_HOME
Variable value: C:\Spark\spark-3.3.2-bin-hadoop3

and finally for the HADOOP_HOME environment variable:

Variable name : HADOOP_HOME
Variable value: C:\Hadoop\hadoop-3.3.0

b) Configure PATH environment variable

Once we finish setting up the environment variables above, we need to add the bin folders to the PATH environment variable. Click Edit on the PATH variable.


If the PATH variable already exists on your system, you can also manually add the following three paths to it:

%JAVA_HOME%/bin
%SPARK_HOME%/bin
%HADOOP_HOME%/bin

Add the Spark, Java, and Hadoop bin locations by selecting the New option.


Spark with winutils.exe on Windows

To run Apache Spark on Windows, you need winutils.exe, which implements the POSIX-like file access operations Spark expects on top of the Windows API.

winutils.exe enables Spark to use Windows-specific services, including running shell commands in a Windows environment.

We create the folder

mkdir C:\Hadoop\hadoop-3.3.0\bin

Download winutils.exe for Hadoop 3.3 and copy it to %HADOOP_HOME%\bin folder.

cd Downloads
copy winutils.exe C:\Hadoop\hadoop-3.3.0\bin

Verification of Installation

Once you complete the installation, close your terminal window, open a new one, and run the following command to verify:

java -version

You will see output like:

java version "1.8.0_361"
Java(TM) SE Runtime Environment (build 1.8.0_361-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.361-b09, mixed mode)

You should also be able to run the following command:

Microsoft Windows [Version 10.0.22621.1555]
(c) Microsoft Corporation. All rights reserved.

C:\Users\ruslanmv>spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://192.168.1.207:4040
Spark context available as 'sc' (master = local[*], app id = local-1683488208402).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.3.2
      /_/

Using Scala version 2.12.15 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_361)
Type in expressions to have them evaluated.
Type :help for more information.

scala> 23/05/07 21:37:04 WARN ProcfsMetricsGetter: Exception when trying to compute pagesize, as a result reporting of ProcessTree metrics is stopped

spark-shell is a CLI utility that comes with the Apache Spark distribution.

spark-shell also creates a Spark context Web UI, which by default is accessible at http://localhost:4040 (if that port is taken, Spark falls back to 4041).

Testing Spark

Open a new terminal and type spark-shell.

On the spark-shell command line, you can run any Spark statements, like creating an RDD or getting the Spark version.

For example, let us type:

val rdd = sc.parallelize(Array(1,2,3,4,5,6,7,8,9,10))

You get:

scala> val rdd = sc.parallelize(Array(1,2,3,4,5,6,7,8,9,10))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:23
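To confirm that Spark actually executes work, trigger an action on the RDD; a minimal check in the same session (the sum of 1 through 10, doubled, is 110):

scala> rdd.map(_ * 2).sum()
res0: Double = 110.0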

Congratulations! You have installed Apache Spark on Windows 11.
