GOALS Connecting People to Connecting Global: PYTHON - Data Cleaning and Pre-processing Data

1. Basic Quality Inspection & Summary

Before altering any data, ask ChatGPT to write code that checks the overall health of your dataset: missing values, duplicate records, and column data types.

Prompt:

“Act as a Senior Data Scientist. I have a DataFrame called df containing global sales data with columns: order_id, product_name, price, quantity, order_date, and customer_email. Write Python code using Pandas to check for missing values, identify total duplicate records, and output the exact data types of each column. Keep the code clean and well-commented.”
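
A minimal sketch of the kind of code this prompt might return. The sample rows are hypothetical stand-ins for the sales data described in the prompt, not real records.

```python
import pandas as pd

# Hypothetical sample data with the columns named in the prompt
df = pd.DataFrame({
    "order_id": [1001, 1002, 1002, 1003],
    "product_name": ["Laptop", "Mouse", "Mouse", None],
    "price": [899.99, 19.50, 19.50, None],
    "quantity": [1, 2, 2, 5],
    "order_date": ["2024-01-05", "2024-01-06", "2024-01-06", "2024-01-09"],
    "customer_email": ["a@example.com", "b@example.com", "b@example.com", None],
})

# Count missing values in each column
missing_per_column = df.isna().sum()

# Count rows that are exact copies of an earlier row
duplicate_rows = int(df.duplicated().sum())

# Data type of each column as pandas stores it
column_types = df.dtypes

print("Missing values per column:")
print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
print("Column data types:")
print(column_types)
```

Note that order_date shows up as text here rather than as a date type, which is the kind of issue the next step addresses.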

2. Handling Missing Data and Type Standardization

Standardizing irregular data formats and resolving blank fields keeps downstream analysis and models from failing on inconsistent values.

Prompt:

“You are a data cleaning assistant. I have a dataset with hidden anomalies. Write a Python script to do the following:
* Convert strings representing missing values like 'N/A' or 'null' into true NumPy NaN values.
* Impute missing numerical values in the price column using the column median.
* Standardize the order_date column by converting it to the datetime64[ns] dtype.
Output only optimized Python code without verbose explanations.”
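
A minimal sketch of what a response to this prompt could look like. The sample values are made up for illustration. The placeholder strings are converted to NaN before the median is computed, so those rows are included in the imputation.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data containing placeholder strings and one unparseable date
df = pd.DataFrame({
    "price": ["10.0", "N/A", "30.0", "null", "50.0"],
    "order_date": ["2024-01-05", "2024-02-10", "not a date", "2024-03-15", "2024-04-01"],
})

# Step 1: turn placeholder strings into real missing values
PLACEHOLDERS = ["N/A", "null"]  # assumed list; extend it to match your data
df = df.replace(PLACEHOLDERS, np.nan)

# Step 2: make price numeric, then fill the gaps with the column median
df["price"] = pd.to_numeric(df["price"], errors="coerce")
df["price"] = df["price"].fillna(df["price"].median())

# Step 3: parse order_date; values that cannot be parsed become NaT
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

print(df)
print(df.dtypes)
```

With errors="coerce", bad values are silently turned into missing values, so it is worth counting them afterwards (for example with df["order_date"].isna().sum()) instead of assuming every row parsed.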

3. Handling Outliers Using Statistical Methods

Remove extreme values that would otherwise skew downstream machine learning models.

Prompt:

“Write a Python function using Pandas and NumPy that takes a DataFrame df and removes rows containing outliers in the numeric column quantity. Use the Interquartile Range (IQR) method, where outliers are defined as values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR. Return the cleaned DataFrame.”
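
A minimal sketch of such a function. The name remove_iqr_outliers, the k parameter, and the sample data are illustrative choices, not part of the original prompt. The sketch uses only Pandas, since NumPy is not needed for this calculation.

```python
import pandas as pd

def remove_iqr_outliers(df, column="quantity", k=1.5):
    """Return a copy of df without rows whose value in `column` is an IQR outlier."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # between() is inclusive on both ends; rows with NaN in `column` are also dropped
    mask = df[column].between(lower, upper)
    return df.loc[mask].copy()

# Hypothetical sample data with one obviously extreme quantity
df = pd.DataFrame({
    "order_id": range(1, 9),
    "quantity": [2, 3, 3, 4, 4, 5, 5, 500],
})

cleaned = remove_iqr_outliers(df)
print(cleaned)
```

Returning a copy leaves the original DataFrame untouched, which makes it easy to compare row counts before and after.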

4. Categorical Feature Encoding

Most Scikit-learn estimators expect numeric input, so non-numeric labels must be converted into numeric columns before training.

Prompt:

“I am preparing data for a machine learning model. Write Python code to encode a categorical column named shipping_region which has values like 'North', 'South', and 'West'. Use One-Hot Encoding via pd.get_dummies() and ensure it avoids the dummy variable trap by dropping the first category.”
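
A minimal sketch of the encoding this prompt asks for. The sample rows and the "region" prefix are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "shipping_region": ["North", "South", "West", "South"],
})

# drop_first=True drops the first category in sorted order ("North" here),
# so a row with zeros in every region column means "North"
encoded = pd.get_dummies(
    df,
    columns=["shipping_region"],
    prefix="region",
    drop_first=True,
    dtype=int,  # 0/1 integers instead of booleans
)

print(encoded)
```

Dropping one category matters mainly for linear models, where fully redundant columns cause multicollinearity. Tree-based models are usually not affected by keeping all categories.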

```python
import pandas as pd
from openai import OpenAI

# Initialize the client (replace the placeholder with your own API key)
client = OpenAI(api_key="your_openai_api_key")

def clean_text_with_gpt(dirty_text_list):
    """Send a list of messy company names to the model and return its reply as a string."""
    # Build a single prompt containing the whole list
    prompt_content = (
        "Standardize and clean this list of company names. Remove trailing characters, "
        f"fix typos, and return a clean, comma-separated list:\n{dirty_text_list}"
    )

    response = client.chat.completions.create(
        model="gpt-4o",  # Use a stable, current model
        messages=[
            {"role": "system", "content": "You are a precise data cleaning utility. Output only the requested list without chat."},
            {"role": "user", "content": prompt_content},
        ],
        temperature=0.1,  # Low temperature makes output more consistent, though not strictly deterministic
    )
    return response.choices[0].message.content

# Example integration into a Pandas pipeline
df = pd.DataFrame({"company": ["Apple Inc.", "apple", "Aple!!", "Google LLC", "Gooogle"]})
cleaned_output = clean_text_with_gpt(df["company"].tolist())
print(cleaned_output)
```
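
The function above returns one string, so the reply still has to be split and checked before it goes back into the DataFrame. The sketch below shows one way to do that. The helper name parse_cleaned_list is hypothetical, and the reply is hard-coded so the example runs without an API call. A real reply may differ.

```python
import pandas as pd

def parse_cleaned_list(raw_output, expected_count):
    """Split a comma-separated reply and verify it has one item per input row."""
    items = [item.strip() for item in raw_output.split(",")]
    items = [item for item in items if item]  # drop empty fragments
    if len(items) != expected_count:
        raise ValueError(
            f"Expected {expected_count} items but got {len(items)}; "
            "the reply cannot be aligned with the DataFrame rows."
        )
    return items

df = pd.DataFrame({"company": ["Apple Inc.", "apple", "Aple!!", "Google LLC", "Gooogle"]})

# Hypothetical model reply, hard-coded for illustration
sample_reply = "Apple Inc., Apple Inc., Apple Inc., Google LLC, Google LLC"

df["company_clean"] = parse_cleaned_list(sample_reply, len(df))
print(df)
```

Splitting on commas breaks for names that contain a comma themselves (for example "Acme, Inc."). In that case, asking the model for a JSON array and parsing it with json.loads is likely a safer format.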

Monday, June 22, 2026


R.GNANAKUMARAN
