Course
Cleaning Data in Python
Course Description
Discover How to Clean Data in Python
It's commonly said that data scientists spend 80% of their time cleaning and manipulating data and only 20% of their time analyzing it. Data cleaning is an essential step for every data scientist, as analyzing dirty data can lead to inaccurate conclusions.
In this course, you will learn how to identify, diagnose, and treat various data cleaning problems in Python, ranging from simple to advanced. You will deal with improper data types, check that your data is in the correct range, handle missing data, perform record linkage, and more!
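As a rough illustration of the problems named above, the following minimal sketch fixes an improper data type, enforces a range, and fills a missing value with pandas. The column names, the sample values, and the choice to cap out-of-range ratings and impute with the median are assumptions made for this example, not necessarily the approach taken in the course.

```python
import pandas as pd

# Hypothetical raw data: prices stored as strings, one rating out of range,
# and one rating missing.
raw = pd.DataFrame({
    "price": ["$10.50", "$8.00", "$12.25", "$9.75"],
    "rating": [4.0, 7.0, None, 3.0],  # assume valid ratings run from 1 to 5
})

clean = raw.copy()

# Improper data type: strip the currency symbol, then convert to float.
clean["price"] = clean["price"].str.strip("$").astype(float)

# Range check: cap ratings above the assumed maximum of 5.
clean.loc[clean["rating"] > 5, "rating"] = 5.0

# Missing data: fill the gap with the median of the observed ratings.
clean["rating"] = clean["rating"].fillna(clean["rating"].median())

print(clean)
```

Capping and median imputation are only two of several possible treatments; dropping the offending rows is an equally valid choice when the data allows it.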
Learn How to Clean Different Data Types
The first chapter of the course explores common data problems and how you can fix them. You will first learn the basic data types and how to deal with each of them. Afterward, you'll apply range constraints and remove duplicated data points.
The last chapter explores record linkage, a powerful tool for merging multiple datasets. You'll learn how to link records by calculating the similarity between strings. Finally, you'll use your new skills to join two restaurant review datasets into one clean master dataset.
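To give a feel for linking records by string similarity, here is a minimal sketch that uses `difflib.SequenceMatcher` from the Python standard library as a stand-in; the course may well use other libraries and metrics. The helper names `similarity` and `link_records`, the restaurant names, and the 0.8 threshold are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0 and 1, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_records(left, right, threshold=0.8):
    """Pair each name in `left` with its best match in `right`,
    keeping only pairs whose similarity reaches the threshold."""
    links = {}
    for name in left:
        best = max(right, key=lambda candidate: similarity(name, candidate))
        if similarity(name, best) >= threshold:
            links[name] = best
    return links


# Hypothetical restaurant names from two review sources.
source_a = ["Joe's Pizza", "Golden Dragon", "Cafe Luna"]
source_b = ["Joes Pizza", "Golden Dragon Inn", "Blue Door Diner"]

links = link_records(source_a, source_b)
print(links)
```

With this sample data, "Cafe Luna" has no counterpart above the threshold and is left unlinked. The threshold trades false matches against missed ones and usually needs tuning per dataset.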
Gain Confidence in Cleaning Data
By the end of the course, you will have the confidence to clean data of various types and to use record linkage to merge multiple datasets. Cleaning data is an essential skill for data scientists. If you want to learn more about cleaning data in Python and its applications, check out the following tracks: Data Scientist with Python and Importing & Cleaning Data with Python.
What you'll learn
- Assess data uniformity and integrity by applying unit conversions, cross-field validation, and assert statements
- Differentiate strategies for handling missing data, such as deletion, statistical imputation, and encoding, based on the underlying pattern of missingness
- Distinguish between text, categorical, numerical, and date data problems and select appropriate pandas and NumPy cleaning functions for each
- Evaluate string-matching metrics and record-linkage workflows to consolidate records with fuzzy duplicates
- Identify common data quality issues including incorrect data types, range violations, duplicates, inconsistent categories, and missing values
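The first objective in the list above (unit conversions, cross-field validation, and assert statements) could look roughly like the sketch below. The DataFrame, its column names, and its values are invented for illustration and are not taken from the course.

```python
import pandas as pd

# Hypothetical readings: most temperatures are in Celsius, but one row was
# recorded in Fahrenheit, and one row's total disagrees with its parts.
df = pd.DataFrame({
    "temp": [21.0, 70.7, 19.5],
    "unit": ["C", "F", "C"],
    "adults": [2, 1, 3],
    "children": [1, 0, 2],
    "total": [3, 1, 4],
})

# Uniformity: convert the Fahrenheit rows to Celsius.
is_f = df["unit"] == "F"
df.loc[is_f, "temp"] = (df.loc[is_f, "temp"] - 32) * 5 / 9
df.loc[is_f, "unit"] = "C"

# Cross-field validation: total must equal adults + children.
consistent = df["adults"] + df["children"] == df["total"]
valid = df[consistent]
flagged = df[~consistent]

# Assert statements document the guarantees the cleaned data should meet.
assert (df["unit"] == "C").all()
assert (valid["adults"] + valid["children"] == valid["total"]).all()

print(flagged)
```

Separating inconsistent rows into `flagged` rather than deleting them keeps them available for later inspection or correction.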
Prerequisites
- Python Toolbox
- Joining Data with pandas
Common data problems
Text and categorical data problems
Advanced data problems
Record linkage
FAQs
Why is data cleaning necessary?
Data cleaning is an essential step for data scientists because it makes the data used in an analysis as reliable as possible. It involves various steps, including removing duplicates and incomplete records and modifying data to rectify incomplete records.
Dirty data emerges in a number of ways: it could be due to human error, a faulty sensor, or data corruption. When dirty data is used instead of clean data, the conclusions can be inaccurate and unreliable.
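As a small illustration of removing duplicates and incomplete records, the sketch below uses the pandas methods `drop_duplicates` and `dropna`. The records are made up for this example, and treating a missing email as "incomplete" is an assumption.

```python
import pandas as pd

# Hypothetical customer records with one exact duplicate and one incomplete row.
records = pd.DataFrame({
    "name": ["Ana", "Ben", "Ben", "Cy"],
    "email": ["ana@example.com", "ben@example.com", "ben@example.com", None],
})

# Remove rows that repeat an earlier row exactly.
deduplicated = records.drop_duplicates()

# Remove rows that lack an email address.
complete = deduplicated.dropna(subset=["email"])

print(complete)
```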
Is data cleaning easy to learn?
Learning data cleaning is, in itself, relatively straightforward. However, data scientists are commonly said to spend 80% of their time cleaning and manipulating data, and there are many different ways that data may be compromised. That means the actual process of cleaning data is often tricky and time-consuming.
Will I receive a certificate at the end of the course?
Yes! Upon completion, you will receive a certificate you can share with employers and others within your network.
Who will benefit from this course?
This course is beneficial for professionals who are interested in expanding their knowledge of data cleaning and manipulation with Python, including data analysts, data scientists, and data engineers.
What topics does this course cover?
This course will teach you how to identify, diagnose, and treat a variety of data cleaning problems in Python, ranging from simple to advanced. You will deal with improper data types, check that your data is in the correct range, handle missing data, perform record linkage, and more.