Before altering data, ask ChatGPT to write code that analyzes the overall health of your dataset.
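It helps to know roughly what a reasonable answer looks like before you prompt for it. The following is a minimal sketch of such a health check, not a definitive implementation. The sample rows and the helper name `health_report` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample rows, for illustration only.
df = pd.DataFrame({
    "order_id": [1001, 1002, 1002, 1003],
    "product_name": ["Widget", "Gadget", "Gadget", None],
    "price": [9.99, 24.50, 24.50, None],
    "quantity": [2, 1, 1, 5],
    "order_date": ["2024-01-05", "2024-01-06", "2024-01-06", "2024-01-07"],
    "customer_email": ["a@example.com", "b@example.com", "b@example.com", "c@example.com"],
})


def health_report(frame: pd.DataFrame) -> dict:
    """Summarize missing values, fully duplicated rows, and column dtypes."""
    return {
        # Count of missing entries in each column
        "missing_per_column": frame.isna().sum().to_dict(),
        # Rows that exactly repeat an earlier row
        "duplicate_rows": int(frame.duplicated().sum()),
        # Data type of each column, as readable strings
        "dtypes": frame.dtypes.astype(str).to_dict(),
    }


report = health_report(df)
for key, value in report.items():
    print(key, value)
```

Nothing here modifies `df`, so it is safe to run on the raw data before any cleaning step.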
Prompt:
“I have a DataFrame `df` containing global sales data with columns: order_id, product_name, price, quantity, order_date, and customer_email. Write Python code using Pandas to check for missing values, identify total duplicate records, and output the exact data types of each column. Keep the code clean and well-commented.”
Prompt:
“You are a data cleaning assistant. I have a dataset with hidden anomalies. Write a Python script to do the following:
- Impute missing numerical values in the `price` column using the column median.
- Convert strings representing missing values like 'N/A' or 'null' into true NumPy NaN values.
- Standardize the `order_date` column into a proper `datetime64[ns]` format.
Output only optimized Python code without verbose explanations.”
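The three cleaning steps in this prompt can be sketched as follows. This is one plausible shape for the returned script, assuming the column names from the prompt; the function name `clean_sales` and the sample rows are hypothetical. The sketch converts the placeholder strings first, because the median cannot be computed while the column still contains text.

```python
import numpy as np
import pandas as pd


def clean_sales(frame: pd.DataFrame) -> pd.DataFrame:
    """Apply the three cleaning steps to a copy of the input."""
    out = frame.copy()
    # 1. Turn placeholder strings into real NaN values.
    out = out.replace(["N/A", "null"], np.nan)
    # 2. Coerce price to numeric, then fill the gaps with the column median.
    out["price"] = pd.to_numeric(out["price"], errors="coerce")
    out["price"] = out["price"].fillna(out["price"].median())
    # 3. Parse order_date into a datetime dtype; unparseable entries become NaT.
    out["order_date"] = pd.to_datetime(out["order_date"], errors="coerce")
    return out


# Hypothetical messy input, for illustration only.
raw = pd.DataFrame({
    "price": [10.0, "N/A", 30.0, "null"],
    "order_date": ["2024-01-05", "2024-02-10", "not a date", "2024-03-01"],
})
cleaned = clean_sales(raw)
print(cleaned)
print(cleaned.dtypes)
```

Using `errors="coerce"` means bad values are silently turned into NaN or NaT, so it is worth counting them afterwards rather than assuming the column was clean.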
Prompt:
“Write a Python function using Pandas and NumPy that takes a DataFrame `df` and removes rows containing outliers in the numeric column `quantity`. Use the Interquartile Range (IQR) method where outliers are defined as values 1.5 × IQR outside the first and third quartiles. Return the cleaned DataFrame.”
Prompt:
“I am preparing data for a machine learning model. Write Python code to encode a categorical column named `shipping_region` which has values like 'North', 'South', and 'West'. Use One-Hot Encoding via `pd.get_dummies()` and ensure it avoids the dummy variable trap by dropping the first category.”
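The last two prompts ask for an IQR outlier filter and a one-hot encoding step. A rough sketch of what the returned code might look like is given here, assuming the column names from the prompts. The function names, the `k` parameter, and the sample rows are hypothetical.

```python
import pandas as pd


def remove_iqr_outliers(df: pd.DataFrame, column: str = "quantity", k: float = 1.5) -> pd.DataFrame:
    """Keep only rows whose value in `column` lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # between() is inclusive on both ends; rows with NaN in `column` are dropped too.
    return df[df[column].between(lower, upper)].copy()


def encode_region(df: pd.DataFrame, column: str = "shipping_region") -> pd.DataFrame:
    """One-hot encode `column`, dropping the first category as the baseline."""
    return pd.get_dummies(df, columns=[column], drop_first=True)


# Hypothetical sample rows, for illustration only.
orders = pd.DataFrame({
    "quantity": [1, 2, 2, 3, 3, 4, 100],
    "shipping_region": ["North", "South", "West", "North", "South", "West", "North"],
})
trimmed = remove_iqr_outliers(orders)
encoded = encode_region(trimmed)
print(encoded.columns.tolist())
```

With `drop_first=True`, the alphabetically first category ('North' in this sample) becomes the implicit baseline: a row with both remaining indicator columns false belongs to it.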
```python
import pandas as pd
from openai import OpenAI

# Initialize the client (replace the placeholder with your own API key)
client = OpenAI(api_key="your_openai_api_key")


def clean_text_with_gpt(dirty_text_list):
    """Ask the model to standardize a list of company names."""
    # Embed the raw list in a single instruction prompt
    prompt_content = f"Standardize and clean this list of company names. Remove trailing characters, fix typos, and return a clean, comma-separated list:\n{dirty_text_list}"
    response = client.chat.completions.create(
        model="gpt-4o",  # Use a stable, current model
        messages=[
            {"role": "system", "content": "You are a precise data cleaning utility. Output only the requested list without chat."},
            {"role": "user", "content": prompt_content},
        ],
        temperature=0.1,  # Low temperature makes output more consistent, though not strictly deterministic
    )
    return response.choices[0].message.content


# Example integration into a Pandas pipeline
df = pd.DataFrame({"company": ["Apple Inc.", "apple", "Aple!!", "Google LLC", "Gooogle"]})
cleaned_output = clean_text_with_gpt(df["company"].tolist())
print(cleaned_output)
```
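The function above returns one comma-separated string, so it still has to be mapped back onto the DataFrame rows. A minimal sketch of that step follows, assuming the model returns exactly one name per input row in the original order. The helper `reply_to_series`, the column name `company_clean`, and the sample reply text are hypothetical; the reply is hard-coded here in place of a live API call.

```python
import pandas as pd


def reply_to_series(reply: str, index: pd.Index) -> pd.Series:
    """Split a comma-separated reply and align it with the original rows."""
    names = [part.strip() for part in reply.split(",")]
    # Refuse to assign if the model dropped, merged, or added entries.
    if len(names) != len(index):
        raise ValueError(f"expected {len(index)} names, got {len(names)}")
    return pd.Series(names, index=index, name="company_clean")


# Hypothetical reply text standing in for the API response.
sample_reply = "Apple Inc., Apple Inc., Apple Inc., Google LLC, Google LLC"

df = pd.DataFrame({"company": ["Apple Inc.", "apple", "Aple!!", "Google LLC", "Gooogle"]})
df["company_clean"] = reply_to_series(sample_reply, df.index)
print(df)
```

Splitting on commas breaks for names that contain a comma themselves (for example "Acme, Inc."), so for real data it may be safer to ask the model for a JSON array and parse that instead.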
If you want to fine-tune this workflow for your projects, tell me:
* What does a **sample row** of your dataset look like?
* What **specific data issues** are you trying to fix (e.g., typos, broken zip codes, messy text)?
* Do you want the AI to **write the code for you** or **process the data directly** via the API?