How to Install Apache Spark on Windows 10

Introduction

Apache Spark is an open-source framework that processes large volumes of stream data from multiple sources. Spark is used in distributed computing with machine learning applications, data analytics, and graph-parallel processing.

This guide will show you how to install Apache Spark on Windows 10 and test the installation.


Prerequisites

  • A system running Windows 10
  • A user account with administrator privileges (required to install software, modify file permissions, and modify system PATH)
  • Command Prompt or PowerShell
  • A tool to extract .tar files, such as 7-Zip

Installing Apache Spark on Windows 10 may seem complicated to novice users, but this simple tutorial will have you up and running. If you already have Java 8 and Python 3 installed, you can skip the first two steps.

Step 1: Install Java 8

Apache Spark requires Java 8. You can check to see if Java is installed using the command prompt.

Open the command line by clicking Start > type cmd > click Command Prompt.

Type the following command in the command prompt:

java -version

If Java is installed, it will respond with the following output:

Windows CLI output for Java version.

Your version may be different. The second digit is the Java version – in this case, Java 8.
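
In text form, a typical response looks roughly like the lines below (the exact version and build numbers will differ on your machine):

java version "1.8.0_251"
Java(TM) SE Runtime Environment (build 1.8.0_251-b08)
Java HotSpot(TM) 64-Bit Server VM (build 25.251-b08, mixed mode)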

If you don’t have Java installed:

1. Open a browser window, and navigate to https://java.com/en/download/.

Java download page in a browser

2. Click the Java Download button and save the file to a location of your choice.

3. Once the download finishes, double-click the file to install Java.

Note: At the time this article was written, the latest Java version was 1.8.0_251. Installing a later version will still work. This process only needs the Java Runtime Environment (JRE) – the full Development Kit (JDK) is not required. The download link for the JDK is https://www.oracle.com/java/technologies/javase-downloads.html.

Step 2: Install Python

1. To install the Python package manager, navigate to https://www.python.org/ in your web browser.

2. Hover over the Download menu option and click Python 3.8.3, the latest version at the time of writing.

3. Once the download finishes, run the file.

Python download page for version 3.8.3

4. Near the bottom of the first setup dialog box, check the Add Python 3.8 to PATH box. Leave the other box checked.

5. Next, click Customize installation.

Python wizard 3.8.3, step to add Python to PATH

6. You can leave all boxes checked at this step, or you can uncheck the options you do not want.

7. Click Next.

8. Select the box Install for all users and leave other boxes as they are.

9. Under Customize install location, click Browse and navigate to the C drive. Add a new folder and name it Python.

10. Select that folder and click OK.

Python installation, advanced options step

11. Click Install, and let the installation complete.

12. When the installation completes, click the Disable path length limit option at the bottom and then click Close.

13. If you have a command prompt open, restart it. Verify the installation by checking the version of Python:

python --version

The output should print Python 3.8.3.

Note: For detailed instructions on how to install Python 3 on Windows or how to troubleshoot potential issues, refer to our Install Python 3 on Windows guide.

Step 3: Download Apache Spark

1. Open a browser and navigate to https://spark.apache.org/downloads.html.

2. Under the Download Apache Spark heading, there are two drop-down menus. Use the current non-preview version.

  • In our case, in Choose a Spark release drop-down menu select 2.4.5 (Feb 05 2020).
  • In the second drop-down Choose a package type, leave the selection Pre-built for Apache Hadoop 2.7.

3. Click the spark-2.4.5-bin-hadoop2.7.tgz link.

Apache Spark download page.

4. A page with a list of mirrors loads where you can see different servers to download from. Pick any from the list and save the file to your Downloads folder.

Step 4: Verify Spark Software File

1. Verify the integrity of your download by checking the checksum of the file. This ensures you are working with unaltered, uncorrupted software.

2. Navigate back to the Spark Download page and open the Checksum link, preferably in a new tab.

3. Next, open a command line and enter the following command:

certutil -hashfile c:\users\username\Downloads\spark-2.4.5-bin-hadoop2.7.tgz SHA512

4. Change the username to your username. The system displays a long alphanumeric code, along with the message Certutil: -hashfile completed successfully.

Checksum output for the Spark installation file.

5. Compare the code to the one you opened in a new browser tab. If they match, your download file is uncorrupted.
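
For reference, the certutil output from step 4 has roughly this shape (the hash line below is a placeholder, not the real value):

SHA512 hash of c:\users\username\Downloads\spark-2.4.5-bin-hadoop2.7.tgz:
<128-character hexadecimal string>
CertUtil: -hashfile command completed successfully.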

Step 5: Install Apache Spark

Installing Apache Spark involves extracting the downloaded file to the desired location.

1. Create a new folder named Spark in the root of your C: drive. From a command line, enter the following:

cd \

mkdir Spark

2. In Explorer, locate the Spark file you downloaded.

3. Right-click the file and extract it to C:\Spark using the tool you have on your system (e.g., 7-Zip).

4. Now, your C:\Spark folder has a new folder spark-2.4.5-bin-hadoop2.7 with the necessary files inside.

Step 6: Add winutils.exe File

Download the winutils.exe file for the underlying Hadoop version for the Spark installation you downloaded.

1. Navigate to https://github.com/cdarlint/winutils, open the folder matching your Hadoop version (hadoop-2.7.x for this guide), then open its bin folder, locate winutils.exe, and click it.

Winutils download page

2. Find the Download button on the right side to download the file.

3. Now, create a hadoop folder on C: with a bin folder inside it (C:\hadoop\bin), using Windows Explorer or the Command Prompt.

4. Copy the winutils.exe file from the Downloads folder to C:\hadoop\bin.
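
If you prefer to do steps 3 and 4 from the Command Prompt, the following two commands are one way to do it, assuming winutils.exe was saved to your Downloads folder:

mkdir C:\hadoop\bin
copy %USERPROFILE%\Downloads\winutils.exe C:\hadoop\bin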

Step 7: Configure Environment Variables

Configuring environment variables in Windows adds the Spark and Hadoop locations to your system PATH. It allows you to run the Spark shell directly from a command prompt window.

1. Click Start and type environment.

2. Select the result labeled Edit the system environment variables.

3. A System Properties dialog box appears. In the lower-right corner, click Environment Variables and then click New in the next window.

Add new environment variable in Windows.

4. For Variable Name type SPARK_HOME.

5. For Variable Value type C:\Spark\spark-2.4.5-bin-hadoop2.7 and click OK. If you changed the folder path, use that one instead.

Adding Spark home variable path in Windows.

6. In the top box, click the Path entry, then click Edit. Be careful with editing the system path. Avoid deleting any entries already on the list.

Edit the path variable to add Spark home.

7. You should see a box with entries on the left. On the right, click New.

8. The system highlights a new line. Enter the path to the Spark folder C:\Spark\spark-2.4.5-bin-hadoop2.7\bin. We recommend using %SPARK_HOME%\bin to avoid possible issues with the path.

Adding the Spark home to the path Windows variable.

9. Repeat this process for Hadoop and Java.

  • For Hadoop, the variable name is HADOOP_HOME and for the value use the path of the folder you created earlier: C:\hadoop. Add C:\hadoop\bin to the Path variable field, but we recommend using %HADOOP_HOME%\bin.
  • For Java, the variable name is JAVA_HOME and for the value use the path to your Java JDK directory (in our case it’s C:\Program Files\Java\jdk1.8.0_251).

10. Click OK to close all open windows.
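
As an alternative to the dialog boxes, the SPARK_HOME, HADOOP_HOME, and JAVA_HOME variables can also be set from a Command Prompt with setx. This is only a sketch using the paths from this guide; the Path entries (%SPARK_HOME%\bin, %HADOOP_HOME%\bin) are still best added through the Edit dialog described above, because setx rewrites the whole Path value and can truncate a long one:

setx SPARK_HOME "C:\Spark\spark-2.4.5-bin-hadoop2.7"
setx HADOOP_HOME "C:\hadoop"
setx JAVA_HOME "C:\Program Files\Java\jdk1.8.0_251"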

Note: Start by restarting the Command Prompt to apply the changes. If that doesn’t work, you will need to reboot the system.

Step 8: Launch Spark

1. Open a new command prompt window by right-clicking the Command Prompt entry and choosing Run as administrator.

2. To start Spark, enter:

C:\Spark\spark-2.4.5-bin-hadoop2.7\bin\spark-shell

If you set the environment path correctly, you can type spark-shell to launch Spark.

3. The system should display several lines indicating the status of the application. You may get a Java pop-up. Select Allow access to continue.

Finally, the Spark logo appears, and the prompt displays the Scala shell.

Scala shell after launching apacheSpark in windows

4. Open a web browser and navigate to http://localhost:4040/.

5. You can replace localhost with the name of your system.

6. You should see an Apache Spark shell Web UI. The example below shows the Executors page.

Spark Windows Executors page Web UI

7. To exit Spark and close the Scala shell, press ctrl-d in the command-prompt window.

Note: If you installed Python, you can run Spark using Python with this command:

pyspark

Exit using quit().

Test Spark

In this example, we will launch the Spark shell and use Scala to read the contents of a file. You can use an existing file, such as the README file in the Spark directory, or you can create your own. We created pnaptest with some text.

1. Open a command prompt window, navigate to the folder containing the file you want to use, and launch the Spark shell.

2. First, declare a variable to use in the Spark context, passing the name of the file. Remember to add the file extension if there is one.

val x = sc.textFile("pnaptest")

3. The output shows an RDD is created. Then, we can view the file contents by using this command to call an action:

x.take(11).foreach(println)
Spark Scala test action with reading a file.

This command instructs Spark to print 11 lines from the file you specified. To perform an action on this file (value x), add another value y, and do a map transformation.

4. For example, you can print the characters in reverse with this command:

val y = x.map(_.reverse)

5. The system creates a child RDD in relation to the first one. Then, specify how many lines you want to print from the value y:

y.take(11).foreach(println)
Spark scala action to put the characters of a file in reverse.

The output prints 11 lines of the pnaptest file in the reverse order.

When done, exit the shell using ctrl-d.

Conclusion

You should now have a working installation of Apache Spark on Windows 10 with all dependencies installed. Get started running an instance of Spark in your Windows environment.

Our suggestion is to also learn more about what Spark DataFrame is, the features, and how to use Spark DataFrame when collecting data.

In this article, I will explain step by step how to install Apache Spark on Windows 7, 10, and later versions, and also how to start a history server and monitor your jobs using the Web UI.

Related:

  • Spark Install Latest Version on Mac
  • PySpark Install on Windows

Install Java 8 or Later

To install Apache Spark on Windows, you need Java 8 or a later version, so download Java from Oracle and install it on your system. If you prefer OpenJDK, you can download that instead.

After the download, double-click the downloaded .exe file (jdk-8u201-windows-x64.exe) to install it on your Windows system. Choose a custom directory or keep the default location.

Note: This article explains installing Apache Spark with Java 8; the same steps also work for Java 11 and 13.

Apache Spark comes as a compressed tar/zip file, so installation on Windows is straightforward: you just need to download and extract the file. Download Apache Spark from the Spark Download page and select the link under “Download Spark” (point 3 in the screenshot below).

If you want to use a different version of Spark & Hadoop, select it from the drop-down; the link at point 3 changes to the selected version and provides an updated download link.

Apache Spark Installation windows

After the download, extract the archive using 7-Zip or any other archive utility and copy the extracted directory spark-3.0.0-bin-hadoop2.7 to c:\apps\opt\spark-3.0.0-bin-hadoop2.7.

Spark Environment Variables

After installing Java and Apache Spark on Windows, set the JAVA_HOME, SPARK_HOME, HADOOP_HOME, and PATH environment variables. If you already know how to set environment variables on Windows, add the following.


JAVA_HOME = C:\Program Files\Java\jdk1.8.0_201
PATH = %PATH%;%JAVA_HOME%\bin

SPARK_HOME  = C:\apps\opt\spark-3.0.0-bin-hadoop2.7
HADOOP_HOME = C:\apps\opt\spark-3.0.0-bin-hadoop2.7
PATH = %PATH%;%SPARK_HOME%\bin

Follow the steps below if you are not sure how to add or edit environment variables on Windows.

  1. Open the System Environment Variables window and select Environment Variables.
Apache Spark Installation windows

2. On the following Environment Variables screen, add SPARK_HOME, HADOOP_HOME, and JAVA_HOME by selecting the New option.

spark installation environment variable

3. This opens up the New User Variables window where you can enter the variable name and value.

4. Now edit the PATH variable.

apache spark install windows

5. Add the Spark, Java, and Hadoop bin locations by selecting the New option.

spark install windows

Spark with winutils.exe on Windows

Many beginners think Apache Spark needs a Hadoop cluster installed to run, but that’s not true; Spark can run without Hadoop and HDFS, for example on AWS using S3 or on Azure using Blob Storage.

To run Apache Spark on Windows, you need winutils.exe, because Spark relies on POSIX-like file access operations, which winutils provides on Windows through the Windows API.

winutils.exe enables Spark to use Windows-specific services, including running shell commands in a Windows environment.

Download winutils.exe for Hadoop 2.7 and copy it to the %SPARK_HOME%\bin folder. Winutils differs for each Hadoop version, so download the right version for your Spark/Hadoop distribution from https://github.com/steveloughran/winutils.
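
Assuming the file landed in your Downloads folder and SPARK_HOME is already set as described above, the copy can be done from a Command Prompt like this:

copy %USERPROFILE%\Downloads\winutils.exe %SPARK_HOME%\bin\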

Apache Spark shell

spark-shell is a CLI utility that comes with the Apache Spark distribution. Open a command prompt, run cd %SPARK_HOME%\bin, and type spark-shell to start the Apache Spark shell. You should see something like the screen below (ignore any error you see at the end). It may take a minute or two for your Spark instance to initialize.

Apache Spark Install Windows
Spark Shell Command Line

Spark-shell also creates a Spark context Web UI, which by default can be accessed at http://localhost:4040 (or the next free port, such as 4041, if 4040 is already in use).

On the spark-shell command line, you can run any Spark statement, such as creating an RDD or getting the Spark version.


scala> spark.version
res2: String = 3.0.0

scala> val rdd = sc.parallelize(Array(1,2,3,4,5,6,7,8,9,10))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala>

This completes the installation of Apache Spark on Windows 7, 10, and later versions.

Where to go Next?

You can continue with the sections below to see how to inspect the logs using the Spark Web UI and enable the Spark history server, or follow these links as next steps:

  • Spark RDD Tutorial
  • Spark Hello World Example in IntelliJ IDEA

Web UI on Windows

Apache Spark provides a suite of Web UIs (Jobs, Stages, Tasks, Storage, Environment, Executors, and SQL) to monitor the status of your Spark application, the resource consumption of the Spark cluster, and the Spark configuration. On the Spark Web UI, you can see how the operations are executed.

Spark Web UI

History Server

The history server keeps a log of all Spark applications you submit via spark-submit or spark-shell. You can enable Spark to collect the logs by adding the configs below to the spark-defaults.conf file, located in the %SPARK_HOME%\conf directory.


spark.eventLog.enabled true
spark.history.fs.logDirectory file:///c:/logs/path
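
The distribution ships this file as a template only. One way to create the actual conf file and the log directory referenced above, assuming the paths shown, is:

copy %SPARK_HOME%\conf\spark-defaults.conf.template %SPARK_HOME%\conf\spark-defaults.conf
mkdir c:\logs\path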

After setting the above properties, start the history server by running the command below.


%SPARK_HOME%\bin\spark-class.cmd org.apache.spark.deploy.history.HistoryServer

By default, the history server listens on port 18080, and you can access it from a browser at http://localhost:18080/.

Spark History Server

Clicking an App ID shows the details of that application in the Spark Web UI.

Conclusion

In summary, you have learned how to install Apache Spark on Windows, run sample statements in spark-shell, and start the Spark Web UI and history server.

If you have any issues setting up, please message me in the comments section and I will try to respond with a solution.

Happy Learning !!

Related Articles

  • Apache Spark Installation on Linux
  • What is Apache Spark Driver?
  • Apache Spark 3.x Install on Mac
  • Install Apache Spark Latest Version on Mac
  • How to Check Spark Version
  • What does setMaster(local[*]) mean in Spark
  • Spark Start History Server
  • Spark with Scala setup on IntelliJ

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python, and R and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

In this document, we will cover the installation procedure of Apache Spark on the Windows 10 operating system.

Prerequisites

This guide assumes that you are using Windows 10 and the user has admin permissions.

System requirements:

  • Windows 10 OS
  • At least 4 GB RAM
  • Free space of at least 20 GB

Installation Procedure

Step 1: Go to Apache Spark’s official download page and choose the latest release. For the package type, choose ‘Pre-built for Apache Hadoop’.

The page will look like the one below.

Apache Spark installation Process

Step 2: Once the download is complete, unzip the file using WinZip, WinRAR, or 7-Zip.

Step 3: Create a folder called Spark under your user directory, as shown below, and copy the contents of the unzipped file into it.

C:\Users\<USER>\Spark

It looks like the below after copy-pasting into the Spark directory.

Apache Spark installation Process

Step 4: Go to the conf folder and open the log file called log4j.properties.template. Change INFO to WARN (you can use ERROR to reduce logging even further). This and the next steps are optional.

Remove the .template extension so that Spark can read the file.

Before removing the .template extension, all files look like below.

Apache Spark installation Process

After removing the .template extension, the files will look like below.

Apache Spark installation Process
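
For reference, after this edit the relevant line in log4j.properties should read as below (assuming the stock Spark 2.x template with no other changes):

log4j.rootCategory=WARN, console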

Step 5: Now, we need to configure the path.

Go to Control Panel -> System and Security -> System -> Advanced Settings -> Environment Variables

Add the new user variable (or system variable) below. (To add a new user variable, click the New button under User variables for <USER>.)

Apache Spark installation Process

Click OK.

Add %SPARK_HOME%\bin to the path variable.

Apache Spark installation Process

Click OK.

Step 6: Spark needs a small piece of Hadoop to run. For Hadoop 2.7, you need winutils.exe.

You can find winutils.exe on this page and download it from there.

Step 7: Create a folder called winutils on the C drive and create a folder called bin inside it. Then move the downloaded winutils file into the bin folder.

C:\winutils\bin

Apache Spark installation Process
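
The same folder creation and move can be done from a Command Prompt, assuming the file was saved to your Downloads folder:

mkdir C:\winutils\bin
move %USERPROFILE%\Downloads\winutils.exe C:\winutils\bin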

Add a user (or system) variable HADOOP_HOME the same way you added SPARK_HOME, pointing it to the winutils folder (C:\winutils).

Apache Spark installation Process

Apache Spark installation Process

Click OK.

Step 8: To install Apache Spark, Java must be installed on your computer. If you don’t have Java installed on your system, please follow the process below.

Java Installation Steps

1. Go to the official Java download site.

Accept the License Agreement for the Java SE Development Kit 8u201.

2. Download the jdk-8u201-windows-x64.exe file.

3. Double-click the downloaded .exe file, and you will see the window shown below.

Java Installation Steps

4. Click Next.

5. The window below will then be displayed.

Java Installation Steps

6. Click Next.

7. After the installation runs for a while, the window below will be displayed.

Java Installation Steps

8. Click Close.

Test Java Installation

Open the command line and type java -version; it should display the installed version of Java.

Java Installation Steps

You should also check that JAVA_HOME is set and that %JAVA_HOME%\bin is included in the Path user variable (or system variable).

1. In the end, the environment variables include three new entries (the Java path, if you needed to add it, plus SPARK_HOME and HADOOP_HOME).

Java Installation Steps

2. Create the c:\tmp\hive directory. This step is not necessary for later versions of Spark, which create the folder automatically on first start, but it is good practice to create it anyway.

C:\tmp\hive
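
If spark-shell later reports permission errors on this directory, you can grant full permissions with winutils (this assumes winutils.exe is in C:\winutils\bin, as set up in Step 7):

C:\winutils\bin\winutils.exe chmod -R 777 C:\tmp\hive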

Test Installation

Open the command line and type spark-shell, and you will get the result below.

Test Installation in Apache Spark

We have completed the Spark installation on the Windows system. Let’s create an RDD and a DataFrame.

We will create one RDD and one DataFrame, and then wrap up.

1. We can create an RDD in three ways; we will use one of them here.

Define any list, then parallelize it; this creates an RDD. Below is the code; copy and paste it line by line into the command line.

val list = Array(1,2,3,4,5)
val rdd = sc.parallelize(list)

The above will create an RDD.

2. Now we will create a DataFrame from the RDD. Follow the steps below to create the DataFrame.

import spark.implicits._
val df = rdd.toDF("id")

The above code will create a DataFrame with id as a column.

To display the data in the DataFrame, use the command below.

df.show()

It will display the below output.

Test Installation in Apache Spark
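
In text form, the output should look roughly like this:

+---+
| id|
+---+
|  1|
|  2|
|  3|
|  4|
|  5|
+---+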

How to uninstall Spark from Windows 10 System

Please follow the steps below to uninstall Spark on Windows 10.

  1. Remove the following System/User variables from the system:
  • SPARK_HOME
  • HADOOP_HOME

To remove System/User variables, please follow the below steps:

Go to Control Panel -> System and Security -> System -> Advanced Settings -> Environment Variables, then find SPARK_HOME and HADOOP_HOME, select them, and press the DELETE button.

Find the Path variable -> Edit -> select %SPARK_HOME%\bin -> press the DELETE button.

Select %HADOOP_HOME%\bin -> press the DELETE button -> OK button.

Open Command Prompt, type spark-shell, and press Enter; you should now get an error, which confirms that Spark has been successfully uninstalled from the system.


Conclusion 

Java 8 or a more recent version is required to install Apache Spark on Windows, so obtain and install it by visiting Oracle. You may download OpenJDK from this page if you’d like. 

Double-click the downloaded .exe file (jdk-8u201-windows-x64.exe) to install it on your Windows machine when it has finished downloading. Alternatively, you may stick with the default directory. The article provides all the details on setting up Apache Spark from scratch.

Once you have covered all the steps mentioned in this article, Apache Spark should operate perfectly on Windows 10. Start off by launching a Spark instance in your Windows environment. If you are facing any problems, let us know in the comments.  

We already know that we need Hadoop to store both structured and unstructured data that is so large that ordinary on-disk storage is not a good choice, but what do we do with it afterwards? This is where Apache Spark comes in. According to its official site, Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters. With it, you can make sense of the large volumes of data you have collected.

Installing and Configuring Apache Spark

First of all, go to https://spark.apache.org/downloads.html and download the Spark version that suits you, along with the corresponding Hadoop package.

Take note of the Hadoop package version you are downloading; it will come in handy later. The link takes you to the official Apache mirror page, where a freshly generated download link is provided.

Extract the downloaded file to the C drive so that the path looks like this: C:\<Apache Spark extracted folder>.

Now we need to download winutils from this link, matching the version of your Hadoop package. So, if you downloaded the 3.2 package, download winutils.exe version 3.2.0 by clicking the download button, as shown in the figure.

Now create a new folder named Hadoop, and inside it create another folder named bin and place the winutils.exe file there, so the directory looks like this: C:\Users\<username>\Desktop\Hadoop\bin\winutils.exe.

Once that is done, we need to configure environment variables for the files we have placed (Apache Spark and winutils). Go to the search bar, type "environment", and click the option shown below.

Click "Environment Variables", then click "New", and create a new variable as shown below, containing the variable name and the directory it points to.

Now that we are done with that, we are ready to use Spark on Windows in local mode. Go to the command prompt, navigate to the Spark directory -> bin folder, and run the command "spark-shell". If everything went correctly, you should see the following screen.

Testing Spark

After completing the installation, we can test our Spark installation in a Google Colab environment.

import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
# Test the spark 
df = spark.createDataFrame([{"hello": "world"} for x in range(1000)])

df.show(3, False)

/content/spark-3.1.2-bin-hadoop3.2/python/pyspark/sql/session.py:381: UserWarning: inferring schema from dict is deprecated, please use pyspark.sql.Row instead
  warnings.warn("inferring schema from dict is deprecated,"


+-----+
|hello|
+-----+
|world|
|world|
|world|
+-----+
only showing top 3 rows
# make sure the version of pyspark
import pyspark
print(pyspark.__version__)
3.1.2

Classification Algorithms with Spark

Now that we are done checking the Spark version, we can try an example of how machine learning is done with it, using a supervised classification algorithm on the iris dataset. Run the following command to download it from GitHub.

# Download the iris data
!wget -q "https://raw.githubusercontent.com/amjadraza/blogs-data/master/spark_ml/iris.csv"

import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/amjadraza/blogs-data/master/spark_ml/iris.csv", header=None)
df.head()

After running df.head(), you should see the following output.

Now we will create a DataFrame according to the columns we have.

columns = ["c_0", "c_1", "c_2", "c_3", "c_4"]
spark.createDataFrame(df, columns)
DataFrame[c_0: double, c_1: double, c_2: double, c_3: double, c_4: string]

Preprocessing

Now we apply preprocessing to understand the iris dataset and get the best result when applying the ML algorithm.

from pyspark.sql.functions import *
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StringIndexer

# Read the iris data
df_iris = pd.read_csv("https://raw.githubusercontent.com/amjadraza/blogs-data/master/spark_ml/iris.csv", header=None)
iris_df = spark.createDataFrame(df_iris)
iris_df.show(5, False)

+------------+-----------+------------+-----------+-----------+
|sepal_length|sepal_width|petal_length|petal_width|species    |
+------------+-----------+------------+-----------+-----------+
|5.1         |3.5        |1.4         |0.2        |Iris-setosa|
|4.9         |3.0        |1.4         |0.2        |Iris-setosa|
|4.7         |3.2        |1.3         |0.2        |Iris-setosa|
|4.6         |3.1        |1.5         |0.2        |Iris-setosa|
|5.0         |3.6        |1.4         |0.2        |Iris-setosa|
+------------+-----------+------------+-----------+-----------+
only showing top 5 rows

# Rename the columns
iris_df = iris_df.select(col("0").alias("sepal_length"),
                         col("1").alias("sepal_width"),
                         col("2").alias("petal_length"),
                         col("3").alias("petal_width"),
                         col("4").alias("species"),
                        )

# Converting the columns into features
vectorAssembler = VectorAssembler(inputCols = ["sepal_length", "sepal_width", "petal_length", "petal_width"],
                                  outputCol = "features")
viris_df = vectorAssembler.transform(iris_df)
viris_df.show(5, False)

+------------+-----------+------------+-----------+-----------+-----------------+
|sepal_length|sepal_width|petal_length|petal_width|species    |features         |
+------------+-----------+------------+-----------+-----------+-----------------+
|5.1         |3.5        |1.4         |0.2        |Iris-setosa|[5.1,3.5,1.4,0.2]|
|4.9         |3.0        |1.4         |0.2        |Iris-setosa|[4.9,3.0,1.4,0.2]|
|4.7         |3.2        |1.3         |0.2        |Iris-setosa|[4.7,3.2,1.3,0.2]|
|4.6         |3.1        |1.5         |0.2        |Iris-setosa|[4.6,3.1,1.5,0.2]|
|5.0         |3.6        |1.4         |0.2        |Iris-setosa|[5.0,3.6,1.4,0.2]|
+------------+-----------+------------+-----------+-----------+-----------------+
only showing top 5 rows

indexer = StringIndexer(inputCol="species", outputCol = "label")
iviris_df = indexer.fit(viris_df).transform(viris_df)
iviris_df.show(2, False)

+------------+-----------+------------+-----------+-----------+-----------------+-----+
|sepal_length|sepal_width|petal_length|petal_width|species    |features         |label|
+------------+-----------+------------+-----------+-----------+-----------------+-----+
|5.1         |3.5        |1.4         |0.2        |Iris-setosa|[5.1,3.5,1.4,0.2]|0.0  |
|4.9         |3.0        |1.4         |0.2        |Iris-setosa|[4.9,3.0,1.4,0.2]|0.0  |
+------------+-----------+------------+-----------+-----------+-----------------+-----+
only showing top 2 rows

Once our data is prepared, we can apply our classification algorithm, namely a naive Bayes classifier.

from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Create the training and test splits
splits = iviris_df.randomSplit([0.6,0.4], 1)
train_df = splits[0]
test_df = splits[1]

# Apply the Naive Bayes classifier
nb = NaiveBayes(modelType="multinomial")
nbmodel = nb.fit(train_df)
predictions_df = nbmodel.transform(test_df)
predictions_df.show(1, False)
+------------+-----------+------------+-----------+-----------+-----------------+-----+------------------------------------------------------------+------------------------------------------------------------+----------+
|sepal_length|sepal_width|petal_length|petal_width|species    |features         |label|rawPrediction                                               |probability                                                 |prediction|
+------------+-----------+------------+-----------+-----------+-----------------+-----+------------------------------------------------------------+------------------------------------------------------------+----------+
|4.3         |3.0        |1.1         |0.1        |Iris-setosa|[4.3,3.0,1.1,0.1]|0.0  |[-9.966434726497221,-11.294595492758821,-11.956012812323921]|[0.7134106367667451,0.18902823898426235,0.09756112424899269]|0.0       |
+------------+-----------+------------+-----------+-----------+-----------------+-----+------------------------------------------------------------+------------------------------------------------------------+----------+
only showing top 1 row

Another common classification example is the decision tree classifier.

from pyspark.ml.classification import DecisionTreeClassifier
#Define the DT Classifier 
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features")
dt_model = dt.fit(train_df)
dt_predictions = dt_model.transform(test_df)
#Evaluate the DT Classifier
dt_evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
dt_accuracy = dt_evaluator.evaluate(dt_predictions)
dt_accuracy
0.9827586206896551

Using Regression with Spark

Now we will explore a machine learning regression model, testing it on the Combined Cycle Power Plant dataset.

Download the dataset from the GitHub link by running the code below.

from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler

# Read the dataset
df_ccpp = pd.read_csv("https://raw.githubusercontent.com/amjadraza/blogs-data/master/spark_ml/ccpp.csv")
pp_df = spark.createDataFrame(df_ccpp)
pp_df.show(2, False)
+-----+-----+-------+-----+------+
|AT   |V    |AP     |RH   |PE    |
+-----+-----+-------+-----+------+
|14.96|41.76|1024.07|73.17|463.26|
|25.18|62.96|1020.04|59.08|444.37|
+-----+-----+-------+-----+------+
only showing top 2 rows
#Create the feature column using VectorAssembler class
vectorAssembler = VectorAssembler(inputCols =["AT", "V", "AP", "RH"], outputCol = "features")
vpp_df = vectorAssembler.transform(pp_df)
vpp_df.show(2, False)
+-----+-----+-------+-----+------+---------------------------+
|AT   |V    |AP     |RH   |PE    |features                   |
+-----+-----+-------+-----+------+---------------------------+
|14.96|41.76|1024.07|73.17|463.26|[14.96,41.76,1024.07,73.17]|
|25.18|62.96|1020.04|59.08|444.37|[25.18,62.96,1020.04,59.08]|
+-----+-----+-------+-----+------+---------------------------+
only showing top 2 rows

Now we can try this dataset out with a simple linear regression model.

#Define and fit Linear Regression
lr = LinearRegression(featuresCol="features", labelCol="PE")
lr_model = lr.fit(vpp_df)

#Print and save the Model output
lr_model.coefficients
lr_model.intercept
lr_model.summary.rootMeanSquaredError
4.557126016749486
#lr_model.save()

Conclusion

We have seen how to install Spark on Windows and use it with several machine learning algorithms. Spark outperforms MapReduce because it processes data in parallel with much more efficient handling, which is why it is widely used for big data analysis.


It is possible to install Spark on a standalone machine. Whilst you won’t get the benefits of parallel processing associated with running Spark on a cluster, installing it on a standalone machine does provide a nice testing environment to test new code.

This blog explains how to install Spark on a standalone Windows 10 machine. The blog uses Jupyter Notebooks installed through Anaconda, to provide an IDE for Python development. Other IDEs for Python development are available. Installing Spark consists of the following stages:

  1. Install Java jdk
  2. Install Scala
  3. Download Winutils
  4. Install Anaconda – as an IDE for Python development
  5. Install Spark

Installing Spark requires adding a number of environment variables, so there is a small section at the beginning explaining how to add an environment variable.

Create Environment Variables

To install Spark you need to add a number of environment variables. The following shows how to create an environment variable in Windows 10:

Right click on the Start button and choose Control Panel:

In Control Panel, click on System and Security:

In the next pane, click on System:

In the system pane, click on Advanced system settings:

In system Properties, click on Environment Variables…

Install Java jdk

Make sure you have the latest version of Java installed; if you do, you should already have the latest version of the Java JDK. If not, download the JDK from the following link:

http://www.oracle.com/technetwork/java/javase/downloads/index.html

Once downloaded, copy the jdk folder to C:\Program Files\Java:

Create an environment variable called JAVA_HOME and set the value to the folder with the jdk in it (e.g. C:\Program Files\Java\jdk1.8.0_121):

Install Scala

Install Scala from this location:

http://www.scala-lang.org/download/2.10.5.html

Download the msi and run it on the machine.

Create an environment variable called SCALA_HOME and point it to the directory where Scala has been installed:

Download Winutils

Download the winutils.exe binary from this location:

https://github.com/steveloughran/winutils/blob/master/hadoop-2.6.0/bin/winutils.exe

Create the following folder and save winutils.exe to it:

c:\hadoop\bin

Create an environment variable called HADOOP_HOME and give it the value C:\hadoop:

Edit the PATH environment variable to include %HADOOP_HOME%\bin:

Create tmp\hive directory

Create a folder called hive in the following location:

C:\tmp\hive

Run the following line in Command Prompt to put permissions onto the hive directory:

winutils.exe chmod -R 777 C:\tmp\hive

You can check permissions with the following command:

winutils.exe ls C:\tmp\hive

Install Anaconda

At the moment (February 2017), Spark does not work with Python 3.6. Download a version of Anaconda that uses Python 3.5 or less from the following link (I downloaded Anaconda3-4.2.0):

https://repo.continuum.io/archive/index.html

If possible, make sure Anaconda is saved to the following folder:

C:\Program Files\Anaconda3

Add an environment variable called PYTHONPATH and give it the value of where Anaconda is saved to:

Download Spark

Download the latest version of Spark from the following link. Make sure the package type says pre-built for Hadoop 2.7 and later.

http://spark.apache.org/downloads.html

As of writing this post (February 2017) the machine learning add-ins didn’t work in Spark 2.0. If you want to use the machine learning utilities, you need to download Spark 1.6.

Once downloaded, change the extension of the file from tar to zip, so that the file name reads spark-2.1.0-bin-hadoop2.7.zip. Extract the file using WinRAR or WinZip.

Within the extracted folder is a file without an extension. Add a zip extension to the file and extract again.

Create a folder at C:\Spark and copy the contents of the second extracted folder to it.

Change log4j.properties file

Go to the folder C:\Spark\conf and open the file log4j.properties.template using WordPad.

Find the log4j.rootCategory setting and change it from INFO to WARN.

Save the changes to the file.

Remove the template extension from the file name so that the file name becomes log4j.properties

Add Spark Environment Variables

Add an environment variable called SPARK_HOME and give it the value of the folder where Spark was downloaded to (C:\Spark):

Add to the PATH Environment variable the following value:

%SPARK_HOME%\bin

Add Environment Variables to open Spark with Jupyter notebook

Add the following variables if you want the default program that opens with Spark to be Jupyter Notebooks.

Add the environment variable PYSPARK_DRIVER_PYTHON and give it the value jupyter.

Add the environment variable PYSPARK_DRIVER_PYTHON_OPTS and give it the value notebook.
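
If you would rather set these two from a Command Prompt than through the dialog, a setx sketch looks like this (newly opened command prompts will pick up the values):

setx PYSPARK_DRIVER_PYTHON jupyter
setx PYSPARK_DRIVER_PYTHON_OPTS notebook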

Open Spark

To use Spark, open command prompt, then type in the following two commands:

CD C:\Spark\bin
pyspark

Jupyter notebooks should open up.

Check Spark is running

Once Jupyter Notebooks has opened, type the following code (the notice file should be part of the Spark download):

file = sc.textFile("C:/Spark/Notice")
words = file.flatMap(lambda line: line.split(" "))
words.count()

Hopefully, you should now have a version of Spark running locally on your machine.
