Advanced Path for Python Data Analysts: A Complete Practical Guide from Data Cleaning to Visualization
Release time: 2024-12-03 13:57:24
Copyright Statement: This article is an original work of the website and follows the CC 4.0 BY-SA copyright agreement. Please include the original source link and this statement when reprinting.

Article link: https://melooy.com/en/content/aid/2166?s=en%2Fcontent%2Faid%2F2166

Introduction

Have you ever run into these frustrations: receiving a messy dataset and not knowing where to start? Or spending hours cleaning data only to discover that your approach was inefficient? As a data analyst, I know these experiences all too well. Today, let's explore the core skills of Python data analysis: data cleaning and processing.

Foundation

Before we start, we need to import the necessary libraries. The most commonly used ones for Python data analysis are pandas, numpy, matplotlib, and seaborn:

# Core libraries for data handling and plotting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Allow matplotlib to render Chinese labels and the minus sign correctly with the SimHei font
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False


Data Acquisition

I remember when I first started learning data analysis, I was constantly confused by the variety of data formats. In fact, pandas provides very powerful data-reading capabilities: whether the source is CSV, Excel, or a database, it can be imported just as easily:

# Read a CSV file
df_csv = pd.read_csv('data.csv')

# Read an Excel workbook
df_excel = pd.read_excel('data.xlsx')

# Read from a database through SQLAlchemy
from sqlalchemy import create_engine
engine = create_engine('sqlite:///database.db')
df_sql = pd.read_sql('SELECT * FROM table_name', engine)

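Beyond these basic calls, pd.read_csv takes options that help with messier real-world files. The file name and the 'date' column below are placeholders used only for illustration:

df_csv = pd.read_csv(
    'data.csv',                 # placeholder path
    parse_dates=['date'],       # parse a date column while reading (assumes a 'date' column exists)
    encoding='utf-8',           # set the file encoding explicitly
    na_values=['N/A', '-']      # treat these strings as missing values
)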

Data Cleaning

Speaking of data cleaning, you've almost certainly hit these issues: missing values, duplicates, and outliers. They are the most common "roadblocks" in data analysis. I've summarized a practical cleaning workflow; let's walk through it step by step:

# Count missing values in each column
missing_values = df.isnull().sum()

# Fill missing values (choose the strategy that fits your data)
df_cleaned = df.ffill()                             # forward fill from the previous row
df_cleaned = df.fillna(df.mean(numeric_only=True))  # fill numeric columns with the column mean

# Remove duplicate rows
df_cleaned = df.drop_duplicates()


# Drop rows whose value in `column` lies outside 1.5 * IQR of the quartiles
def remove_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]

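As a quick usage sketch, assuming the DataFrame has a numeric 'sales' column (a hypothetical name used only for illustration), the function can be applied after the earlier steps:

df_cleaned = remove_outliers(df_cleaned, 'sales')  # keep only rows within the 1.5 * IQR bounds for 'sales'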

Data Transformation

After data cleaning, we often need to transform and restructure the data. This involves some of pandas' most powerful features: grouping, aggregation, and pivot tables.

# Group by category and compute several aggregates at once
grouped = df.groupby('category').agg({
    'sales': ['sum', 'mean'],
    'profit': ['sum', 'mean']
})

# Pivot table: total sales with categories as rows and regions as columns
pivot_table = pd.pivot_table(df,
                            values='sales',
                            index='category',
                            columns='region',
                            aggfunc='sum')

# Melt: reshape from wide to long format, keeping 'date' as the identifier
melted = pd.melt(df,
                 id_vars=['date'],
                 value_vars=['sales', 'profit'],
                 var_name='metric',
                 value_name='value')

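One small follow-up worth knowing, not shown in the snippet above: the multi-function agg produces a MultiIndex on the columns, which is often flattened before further use.

# Flatten ('sales', 'sum') style column tuples into plain names like 'sales_sum'
grouped.columns = ['_'.join(col) for col in grouped.columns]
grouped = grouped.reset_index()   # turn the 'category' index back into a regular column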

Data Visualization

The final step in data analysis is visualization. A good chart is worth a thousand words. I frequently use matplotlib and seaborn to create various charts:

# Line chart: sales and profit over time
plt.figure(figsize=(12, 6))
plt.plot(df['date'], df['sales'], label='Sales')
plt.plot(df['date'], df['profit'], label='Profit')
plt.title('Sales and Profit Trends')
plt.legend()
plt.show()

# Scatter plot of two features, colored by category
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='x', y='y', hue='category')
plt.title('Feature Distribution Scatter Plot')
plt.show()

# Heat map of correlations between the numeric columns
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Feature Correlation Heat Map')
plt.show()


Practical Case

Let's connect all the knowledge we've learned with a real case. Suppose we have received a sales dataset and need to perform a complete analysis:

def analyze_sales_data(file_path):
    # 1. Read data
    df = pd.read_csv(file_path)

    # 2. Data cleaning
    df['date'] = pd.to_datetime(df['date'])
    df = df.dropna()
    df = df.drop_duplicates()

    # 3. Feature engineering
    df['month'] = df['date'].dt.month
    df['weekday'] = df['date'].dt.weekday
    df['is_weekend'] = df['weekday'].isin([5, 6]).astype(int)

    # 4. Data aggregation
    monthly_sales = df.groupby('month').agg({
        'sales': ['sum', 'mean', 'count'],
        'profit': ['sum', 'mean']
    })

    # 5. Visualization
    plt.figure(figsize=(15, 10))

    # 5.1 Monthly sales trend
    plt.subplot(2, 2, 1)
    plt.plot(monthly_sales.index, monthly_sales[('sales', 'sum')])
    plt.title('Monthly Sales Trend')

    # 5.2 Sales distribution
    plt.subplot(2, 2, 2)
    sns.histplot(df['sales'], bins=50)
    plt.title('Sales Distribution')

    # 5.3 Weekend vs Weekday sales comparison
    plt.subplot(2, 2, 3)
    sns.boxplot(x='is_weekend', y='sales', data=df)
    plt.title('Weekend vs Weekday Sales Comparison')

    # 5.4 Correlation heat map
    plt.subplot(2, 2, 4)
    sns.heatmap(df[['sales', 'profit', 'month', 'weekday']].corr(),
                annot=True,
                cmap='coolwarm')
    plt.title('Feature Correlation Analysis')

    plt.tight_layout()
    plt.show()

    return df, monthly_sales

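Calling the function is then a one-liner; 'sales_data.csv' is a placeholder file name, assumed to contain the 'date', 'sales', and 'profit' columns the function expects:

df_result, monthly_sales = analyze_sales_data('sales_data.csv')  # hypothetical file name
print(monthly_sales)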

Experience Summary

Through years of data analysis practice, I've summarized several points of experience to share with you:

  1. Data quality determines analysis quality. Expect to spend most of your time (the often-quoted figure is 80%) on data cleaning, as it is the foundation of everything that follows.

  2. Make good use of pandas' vectorized operations. Avoid explicit Python loops and prefer pandas' built-in functions, which can dramatically improve performance: instead of processing each row with a for loop, use vectorized column operations or apply (see the short sketch after this list).

  3. Visualization should tell a story. Creating charts is not the goal; rather, it's about discovering patterns and anomalies in the data through charts and clearly communicating these findings to others.

  4. Maintain code reusability. We should encapsulate commonly used data processing functions, which can improve work efficiency and make it easier for others to use.
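
To make point 2 concrete, here is a rough sketch (the 'sales' and 'profit' column names are hypothetical) comparing a row-by-row loop with the equivalent vectorized expression:

# Slow: an explicit Python loop over rows
margins = []
for _, row in df.iterrows():
    margins.append(row['profit'] / row['sales'])
df['margin'] = margins

# Fast: vectorized column arithmetic does the same work in a single call
df['margin'] = df['profit'] / df['sales']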

Future Outlook

As data volumes continue to grow, the traditional pandas workflow can run into performance bottlenecks. At that point, it's worth considering some more advanced tools (a minimal Dask sketch follows the list):

  1. Dask: For processing ultra-large-scale datasets
  2. Vaex: High-performance Python library for processing large tabular data
  3. PySpark: Distributed data processing framework
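
As a minimal sketch of what this can look like with Dask (assuming it is installed; 'big_data.csv' is a placeholder for a file too large for memory), note how closely its API mirrors pandas:

import dask.dataframe as dd

ddf = dd.read_csv('big_data.csv')                           # reads lazily, in partitions
total_by_category = ddf.groupby('category')['sales'].sum()  # builds a task graph, nothing runs yet
print(total_by_category.compute())                          # .compute() triggers the actual work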

Have you encountered performance issues in data processing in your actual work? Feel free to share your experiences and thoughts in the comments section.

Data analysis is a continuous process of learning and practice. I hope this article helps you better understand the workflow of Python data analysis. If you want to learn more about a specific topic in depth, please let me know in the comments section. Let's continue moving forward together on the path of data analysis.
