Course

Big Data Fundamentals with PySpark

Skill level: Expert

Updated 02/2025

This hands-on course shows you how to work with Big Data in PySpark.

Spark | Data Engineering | 4 hours | 16 videos | 55 exercises | 4,600 XP | 64,529 | Statement of Accomplishment

Course Description

Big Data has been talked about a lot in recent years, and many companies have finally caught up with it. But what does Big Data actually mean? This course teaches the fundamentals of Big Data with PySpark. Spark is an extremely fast cluster computing framework for Big Data. It provides a general-purpose data processing platform and lets you run programs up to 100x faster in memory, or 10x faster on disk, than Hadoop. You will use PySpark, a Python package for Spark programming, and its powerful higher-level libraries such as Spark SQL and MLlib (for machine learning). In the exercises, you will explore the works of William Shakespeare, analyze data from the 2018 FIFA World Cup, and perform clustering on genomic datasets. By the end of this course, you will have a solid understanding of PySpark and how to use it for general Big Data analysis.

Prerequisites

Introduction to Python

1

Introduction to Big Data analysis with Spark

This chapter introduces the exciting world of Big Data, as well as the various concepts and frameworks for processing it. You will understand why Apache Spark is considered the best framework for Big Data.

What is Big Data?

The 3 V's of Big Data

PySpark: Spark with Python

Understanding SparkContext

Interactive Use of PySpark

Loading data in PySpark shell

Review of functional programming in Python

Use of lambda() with map()

Use of lambda() with filter()
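
The functional-programming lessons listed above review `lambda` with `map()` and `filter()`, the style that PySpark's RDD API builds on. A minimal sketch in plain Python follows; the sample values are made up for illustration and are not taken from the course.

```python
# Plain Python, no Spark required. The sample values are hypothetical.
numbers = [1, 2, 3, 4, 5, 6]

# map() applies the lambda to every element of the list.
squares = list(map(lambda x: x * x, numbers))

# filter() keeps only the elements for which the lambda returns True.
evens = list(filter(lambda x: x % 2 == 0, numbers))

print(squares)  # [1, 4, 9, 16, 25, 36]
print(evens)    # [2, 4, 6]
```

Note that `lambda` is a keyword for writing small anonymous functions, not a function itself; it is `map()` and `filter()` that are called.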

2

Programming in PySpark RDDs

The main abstraction Spark provides is the resilient distributed dataset (RDD), the fundamental data type at the core of the engine. This chapter introduces RDDs and shows how they can be created and processed using RDD transformations and actions.
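
As a rough sketch of the split between transformations and actions, the function below builds a word count over an RDD. It assumes an existing `SparkContext` is passed in as `sc` (the PySpark shell provides one under that name). The names `tokenize` and `word_counts` are hypothetical, chosen for this illustration only.

```python
import re


def tokenize(line):
    """Split a line into lowercase words (plain Python helper)."""
    return re.findall(r"[a-z']+", line.lower())


def word_counts(sc, lines):
    """Count words in `lines` with RDD transformations and one action.

    `sc` is assumed to be an existing pyspark.SparkContext.
    """
    # Create an RDD from a parallelized Python collection.
    rdd = sc.parallelize(lines)
    # Transformations are lazy: nothing is computed yet.
    pairs = rdd.flatMap(tokenize).map(lambda word: (word, 1))
    counts = pairs.reduceByKey(lambda a, b: a + b)
    # collect() is an action: it triggers the computation and
    # returns the results to the driver as a Python list.
    return counts.sortByKey().collect()
```

In the PySpark shell, `word_counts(sc, ["to be or not to be"])` would return a list of `(word, count)` pairs sorted by word.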

Abstracting Data with RDDs

RDDs from Parallelized collections

RDDs from External Datasets

Partitions in your data

Basic RDD Transformations and Actions

Map and Collect

Filter and Count

Pair RDDs in PySpark

ReduceBykey and Collect

SortByKey and Collect

Advanced RDD Actions

CountingBykeys

Create a base RDD and transform it

Remove stop words and reduce the dataset

Print word frequencies

3

PySpark SQL & DataFrames

In this chapter, you'll learn about Spark SQL, a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine. This chapter shows how Spark SQL lets you use DataFrames in Python.
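
To illustrate the two roles mentioned here, the sketch below runs the same filter once through the DataFrame API and once as a SQL query. It assumes an existing `SparkSession` is passed in as `spark`; the function names, the `people` view, and the sample rows are hypothetical and not taken from the course.

```python
# Hypothetical sample data: (name, age) tuples.
SAMPLE_ROWS = [("Ana", 34), ("Ben", 12), ("Cleo", 51)]


def adults_via_dataframe(spark, rows):
    """Filter with the DataFrame API. `spark` is assumed to be a SparkSession."""
    df = spark.createDataFrame(rows, schema=["name", "age"])
    return df.filter(df.age >= 18).select("name")


def adults_via_sql(spark, rows):
    """Run the same filter as a SQL query against a temporary view."""
    df = spark.createDataFrame(rows, schema=["name", "age"])
    # Register the DataFrame under a name that SQL queries can refer to.
    df.createOrReplaceTempView("people")
    return spark.sql("SELECT name FROM people WHERE age >= 18")
```

Both functions return a DataFrame; calling `.show()` or `.collect()` on the result triggers the actual computation.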

Abstracting Data with DataFrames

RDD to DataFrame

Loading CSV into DataFrame

Operating on DataFrames in PySpark

Inspecting data in PySpark DataFrame

PySpark DataFrame subsetting and cleaning

Filtering your DataFrame

Interacting with DataFrames using PySpark SQL

Running SQL Queries Programmatically

SQL queries for filtering Table

Data Visualization in PySpark using DataFrames

PySpark DataFrame visualization

Part 1: Create a DataFrame from CSV file

Part 2: SQL Queries on DataFrame

Part 3: Data visualization

4

Machine Learning with PySpark MLlib

PySpark MLlib is Apache Spark's scalable machine learning library for Python, consisting of common learning algorithms and utilities. Throughout this last chapter, you'll learn important machine learning algorithms. You will build a movie recommendation engine and a spam filter, and use k-means clustering.
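
As one hedged example of the workflow, the sketch below parses comma-separated text lines into numeric points and trains a k-means model with the RDD-based `pyspark.mllib` API. It assumes an existing `SparkContext` named `sc` and a comma-separated input format; `parse_point`, `train_kmeans`, and the default parameter values are hypothetical choices for illustration.

```python
def parse_point(line):
    """Turn a line such as "1.5,2.0" into a list of floats (plain Python)."""
    return [float(value) for value in line.split(",")]


def train_kmeans(sc, lines, k=2, max_iterations=10):
    """Train k-means on text lines. `sc` is assumed to be a SparkContext."""
    # Imported lazily so the parsing helper works without Spark installed.
    from pyspark.mllib.clustering import KMeans

    # Parse every line into a numeric vector, distributed as an RDD.
    points = sc.parallelize(lines).map(parse_point)
    model = KMeans.train(points, k, maxIterations=max_iterations)
    # The trained model exposes one center per cluster.
    return model.clusterCenters
```

The number of clusters `k` has to be chosen by the caller; this sketch does not attempt to select it automatically.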

Overview of PySpark MLlib

PySpark ML libraries

PySpark MLlib algorithms

Collaborative filtering

Loading Movie Lens dataset into RDDs

Model training and predictions

Model evaluation using MSE

Classification

Loading spam and non-spam data

Feature hashing and LabeledPoint

Logistic Regression model training

Loading and parsing the 5000 points data

K-means training

Visualizing clusters

Congratulations!
