Monday, June 22, 2026

PYTHON - Data Cleaning and Pre-processing

1. Basic Quality Inspection & Summary

Before altering any data, ask ChatGPT to write code that analyzes the overall health of your dataset.

Prompt:

“Act as a Senior Data Scientist. I have a DataFrame called df containing global sales data with columns: order_id, product_name, price, quantity, order_date, and customer_email. Write Python code using Pandas to check for missing values, identify total duplicate records, and output the exact data types of each column. Keep the code clean and well-commented.”
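
For reference, the kind of code this prompt tends to produce looks roughly like the sketch below. The sample rows are made up for illustration; only the column names come from the prompt.

```python
import pandas as pd

# Hypothetical sample standing in for the sales DataFrame described in the prompt
df = pd.DataFrame({
    "order_id": [1001, 1002, 1002, 1003],
    "product_name": ["Laptop", "Mouse", "Mouse", None],
    "price": [999.99, 19.99, 19.99, None],
    "quantity": [1, 2, 2, 5],
    "order_date": ["2026-01-05", "2026-01-06", "2026-01-06", "2026-01-07"],
    "customer_email": ["a@example.com", "b@example.com", "b@example.com", None],
})

# Count missing values in each column
missing_per_column = df.isna().sum()

# Count rows that are exact copies of an earlier row
duplicate_count = int(df.duplicated().sum())

# Data type of each column
column_dtypes = df.dtypes

print("Missing values per column:", missing_per_column, sep="\n")
print("Total duplicate rows:", duplicate_count)
print("Column data types:", column_dtypes, sep="\n")
```

Note that `df.duplicated()` only flags rows where every column matches. If duplicates should be judged on `order_id` alone, pass `subset=["order_id"]`.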

2. Handling Missing Data and Type Standardization

Standardizing irregular data formats and resolving blank fields keeps downstream models from training on inconsistent inputs.

Prompt:

“You are a data cleaning assistant. I have a dataset with hidden anomalies. Write a Python script to do the following:
  1. Convert strings representing missing values like 'N/A' or 'null' into true NumPy NaN values.
  2. Impute missing numerical values in the price column using the column median.
  3. Standardize the order_date column into a proper datetime64[ns] format.
Output only optimized Python code without verbose explanations.”
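
A minimal sketch of what such a script could look like is shown here. The sample values are invented for illustration. The placeholder strings are converted before the median is computed, so that every gap is visible to the imputation step.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the placeholder strings and dates are assumptions
df = pd.DataFrame({
    "price": ["10.0", "N/A", "30.0", "null", "50.0"],
    "order_date": ["2026-01-05", "2026-02-10", "not a date", "2026-03-15", "2026-04-01"],
})

# Turn placeholder strings into real NaN values
df = df.replace(["N/A", "null"], np.nan)

# Make price numeric, then fill the gaps with the column median
df["price"] = pd.to_numeric(df["price"], errors="coerce")
df["price"] = df["price"].fillna(df["price"].median())

# Parse order_date into a datetime64 column; unparseable entries become NaT
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

print(df)
print(df.dtypes)
```

Using `errors="coerce"` silently turns bad values into NaN or NaT, so it is worth counting them afterwards with `df.isna().sum()` rather than assuming everything parsed.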

3. Handling Outliers Using Statistical Methods

Remove extreme values that would otherwise skew downstream machine learning models.

Prompt:

“Write a Python function using Pandas and NumPy that takes a DataFrame df and removes rows containing outliers in the numeric column quantity. Use the Interquartile Range (IQR) method, where outliers are defined as values more than 1.5 × IQR below the first quartile or above the third quartile. Return the cleaned DataFrame.”
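
One way the requested function might turn out is sketched below. The function name `remove_iqr_outliers`, its parameters, and the sample data are illustrative choices, and this version gets by with Pandas alone.

```python
import pandas as pd


def remove_iqr_outliers(df: pd.DataFrame, column: str = "quantity", k: float = 1.5) -> pd.DataFrame:
    """Return a copy of df without rows whose `column` value lies
    outside the range [Q1 - k * IQR, Q3 + k * IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # between() is inclusive, so values exactly on a fence are kept;
    # rows where the column is NaN are dropped as well
    return df[df[column].between(lower, upper)].copy()


# Hypothetical sample with one obvious outlier
sample = pd.DataFrame({"quantity": [1, 2, 2, 3, 3, 4, 100]})
cleaned = remove_iqr_outliers(sample)
print(cleaned["quantity"].tolist())  # prints [1, 2, 2, 3, 3, 4]
```

Here Q1 is 2.0 and Q3 is 3.5, so the fences sit at -0.25 and 5.75 and only the value 100 is removed.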

4. Categorical Feature Encoding

Most scikit-learn estimators expect numeric input, so categorical labels have to be converted into numeric columns first.

Prompt:

“I am preparing data for a machine learning model. Write Python code to encode a categorical column named shipping_region which has values like 'North', 'South', and 'West'. Use One-Hot Encoding via pd.get_dummies() and ensure it avoids the dummy variable trap by dropping the first category.”
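
As a rough illustration, the encoding step could look like this. The four sample rows are made up; the column name and category values come from the prompt.

```python
import pandas as pd

# Hypothetical sample rows for the shipping_region column
df = pd.DataFrame({"shipping_region": ["North", "South", "West", "North"]})

# drop_first=True drops the first category in sorted order ('North'),
# which then becomes the baseline represented by all-zero rows
encoded = pd.get_dummies(df, columns=["shipping_region"], drop_first=True, dtype=int)

print(encoded.columns.tolist())  # prints ['shipping_region_South', 'shipping_region_West']
print(encoded)
```

Dropping a category matters mainly for linear models, where the full set of dummy columns is perfectly collinear with the intercept. Tree-based models are usually unaffected either way.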


Instead of asking for code, you can also send messy values straight to the model through the OpenAI API:

```python
import os

import pandas as pd
from openai import OpenAI

# Initialize the client; the API key is read from the OPENAI_API_KEY environment variable
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])


def clean_text_with_gpt(dirty_text_list):
    """Ask the model to standardize a list of company names and return its raw text reply."""
    # Build a single prompt that contains the whole list
    prompt_content = f"Standardize and clean this list of company names. Remove trailing characters, fix typos, and return a clean, comma-separated list:\n{dirty_text_list}"

    response = client.chat.completions.create(
        model="gpt-4o",  # Use a stable, current model
        messages=[
            {"role": "system", "content": "You are a precise data cleaning utility. Output only the requested list without chat."},
            {"role": "user", "content": prompt_content},
        ],
        temperature=0.1,  # Low temperature makes output more consistent, though not strictly deterministic
    )
    return response.choices[0].message.content


# Example integration into a Pandas pipeline
df = pd.DataFrame({"company": ["Apple Inc.", "apple", "Aple!!", "Google LLC", "Gooogle"]})
cleaned_output = clean_text_with_gpt(df["company"].tolist())
print(cleaned_output)
```
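
The function above returns one string, so the result still has to be mapped back onto the DataFrame. The helper below is a sketch of one way to do that. The name `attach_cleaned_names` and the hard-coded reply are hypothetical, and it assumes the model returns exactly one name per input row with no commas inside a name.

```python
import pandas as pd


def attach_cleaned_names(df: pd.DataFrame, raw_output: str, column: str = "company") -> pd.DataFrame:
    """Split a comma-separated model reply and add it as `<column>_clean`."""
    names = [name.strip() for name in raw_output.split(",")]
    # Guard against the model dropping, merging, or adding entries
    if len(names) != len(df):
        raise ValueError(f"Expected {len(df)} names, got {len(names)}")
    result = df.copy()
    result[f"{column}_clean"] = names
    return result


# Hypothetical reply, hard-coded so the sketch runs without an API call
df = pd.DataFrame({"company": ["Apple Inc.", "apple", "Aple!!", "Google LLC", "Gooogle"]})
fake_reply = "Apple Inc., Apple Inc., Apple Inc., Google LLC, Google LLC"
print(attach_cleaned_names(df, fake_reply))
```

The length check is the important part: a language model can silently merge or skip entries, and assigning a misaligned list would attach the wrong name to each row.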

If you want to fine-tune this workflow for your projects, tell me:

* What does a **sample row** of your dataset look like?

* What **specific data issues** are you trying to fix (e.g., typos, broken zip codes, messy text)?

* Do you want the AI to **write the code for you** or **process the data directly** via the API?

 
