Amazon Sales Dataset

The Amazon Sales Dataset, created by Karkavelraja J, is a comprehensive collection of product sales data from Amazon's e-commerce platform containing detailed information about products, pricing, ratings, reviews, and sales performance metrics. This dataset captures real-world retail dynamics including product categories, discount strategies, customer feedback, and sales patterns across various product lines.

Available on Kaggle, this dataset is excellent for building sales forecasting models, performing pricing analytics, understanding customer preferences, and developing data-driven strategies to optimize product listings, pricing decisions, and inventory management in the competitive e-commerce landscape.

Key Features

  • Records: Thousands of product entries across multiple categories on Amazon.
  • Variables:
    • Product ID/ASIN: Unique Amazon product identifier
    • Product Name/Title: Name and description of the product
    • Category: Product category or department
    • Discounted Price: Current selling price after discounts
    • Actual Price: Original price before discounts
    • Discount Percentage: Percentage discount offered
    • Rating: Average customer rating (Amazon star ratings run from 1 to 5)
    • Rating Count: Number of customers who rated the product
    • Review Count: Number of customer reviews
    • About Product: Product description and features
    • User ID/Name: Customer information (if available)
    • Image Links: Product image URLs (if available)
    • Product Link: Direct link to product page
  • Data Type: Mixed (numerical prices and ratings, categorical product information, text descriptions).
  • Format: CSV file.
  • Scope: Multiple product categories with diverse price ranges and customer engagement levels.

Why This Dataset

This dataset provides insights into e-commerce sales dynamics, pricing strategies, and customer behavior patterns that drive online retail success. It allows analysts and data scientists to understand what factors influence product performance and sales outcomes. It's ideal for projects that aim to:

  1. Predict product sales performance based on pricing and ratings.
  2. Analyze the relationship between discounts and sales volume.
  3. Identify pricing strategies that maximize revenue or market share.
  4. Understand how customer ratings and reviews impact purchasing decisions.
  5. Segment products based on performance metrics and characteristics.
  6. Forecast demand for inventory planning and stock optimization.
  7. Perform competitive analysis within product categories.
  8. Build recommendation systems based on product attributes and customer preferences.

How to Use the Dataset

  1. Download the CSV file from Kaggle.
  2. Load it into Python with pandas: import pandas as pd, then df = pd.read_csv('amazon_sales.csv') (adjust the file name to match the CSV you downloaded).
  3. Explore the structure using df.info(), df.head(), and df.describe() to understand data types and distributions.
  4. Check for missing values using df.isnull().sum() and decide how to handle them.
  5. Clean price data by removing currency symbols and thousands separators, then converting to a numeric type.
  6. Calculate discount amounts: df['discount_amount'] = df['actual_price'] - df['discounted_price'].
  7. Verify discount percentages and recalculate if necessary for consistency.
  8. Analyze rating distributions using histograms and box plots across categories.
  9. Handle text data in product descriptions using NLP techniques if needed for analysis.
  10. Engineer features such as:
    • Price-to-rating ratio
    • Review engagement rate (reviews per rating)
    • Categorical encoding for product categories
    • Price bins (budget, mid-range, premium)
    • Discount tier classifications
  11. Visualize relationships between price, discounts, ratings, and review counts using scatter plots and correlation matrices.
  12. Identify outliers in pricing, ratings, or review counts that may indicate data quality issues or exceptional products.
  13. Segment products using clustering techniques (K-means, hierarchical) based on multiple attributes.
  14. Build predictive models using regression for sales/ratings prediction or classification for success categories.
  15. Evaluate models using appropriate metrics like RMSE, MAE, R², or classification metrics depending on problem type.
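
Steps 5 through 7 can be sketched as follows. This is a minimal sketch under stated assumptions, not the dataset author's method: the column names (actual_price, discounted_price, discount_percentage), the made-up sample rows with a currency symbol, and the helper to_number are all hypothetical and should be checked against the real CSV header.

```python
import pandas as pd

# Tiny inline sample standing in for the downloaded CSV (values are made up).
df = pd.DataFrame({
    "actual_price": ["₹1,099", "₹349", "₹1,899"],
    "discounted_price": ["₹399", "₹199", "₹1,519"],
    "discount_percentage": ["64%", "43%", "25%"],
})

def to_number(series: pd.Series) -> pd.Series:
    """Drop everything except digits and the decimal point, then convert to float."""
    cleaned = series.astype(str).str.replace(r"[^0-9.]", "", regex=True)
    # Values that cannot be parsed become NaN instead of raising an error.
    return pd.to_numeric(cleaned, errors="coerce")

# Step 5: convert the text columns to numbers.
for col in ["actual_price", "discounted_price", "discount_percentage"]:
    df[col] = to_number(df[col])

# Step 6: absolute discount per product.
df["discount_amount"] = df["actual_price"] - df["discounted_price"]

# Step 7: recompute the percentage and flag rows that differ from the
# listed value by more than one percentage point (tolerance is an assumption).
df["discount_pct_calc"] = (df["discount_amount"] / df["actual_price"] * 100).round(0)
df["discount_mismatch"] = (df["discount_pct_calc"] - df["discount_percentage"]).abs() > 1

print(df[["discount_amount", "discount_pct_calc", "discount_mismatch"]])
```

In this sample the third row is flagged because its listed discount does not match the two prices; on real data, flagged rows are worth inspecting before deciding which value to trust.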

Possible Project Ideas

  • Sales performance predictor estimating product success based on pricing and category features.
  • Optimal pricing strategy analyzer determining price points that maximize revenue or conversion.
  • Discount effectiveness study measuring how different discount levels impact sales and ratings.
  • Product recommendation system suggesting products based on features, categories, and customer preferences.
  • Rating prediction model forecasting customer ratings from product attributes and pricing.
  • Category performance comparison analyzing which product categories perform best on various metrics.
  • Price optimization tool recommending competitive pricing based on market analysis.
  • Customer review volume predictor estimating engagement levels for new products.
  • Market basket analysis identifying products frequently viewed or purchased together.
  • Inventory forecasting system predicting demand patterns for stock management.
  • Competitive positioning dashboard visualizing product performance relative to category competitors.
  • Product success classifier identifying factors that distinguish bestsellers from underperformers.
  • Dynamic pricing simulator testing different pricing scenarios and their projected outcomes.
  • Sentiment-price correlation analysis combining review sentiment with pricing strategy effectiveness.
  • New product launch advisor recommending optimal launch strategies based on category analysis.

Dataset Challenges and Considerations

  • Missing Data: Some products may have incomplete information, particularly newer listings with few reviews.
  • Price Variability: Prices fluctuate over time; the dataset represents a snapshot rather than historical trends.
  • Rating Bias: Products with few ratings may have unreliable average scores; consider using rating count as a confidence measure.
  • Category Imbalance: Some categories may be overrepresented while others have limited samples.
  • Discount Authenticity: Some "discounts" may represent inflated original prices; verify discount legitimacy.
  • Selection Bias: The dataset may not include all products; the availability of data could correlate with product popularity.
  • Temporal Context: Sales patterns are influenced by seasonality, trends, and events not captured in static data.
  • Currency and Locale: Ensure a consistent currency and regional context if the dataset spans multiple markets.
  • Review Count vs Rating Quality: High review counts don't always correlate with rating accuracy or product quality.
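
One common way to use rating count as a confidence measure is a Bayesian (shrunken) average, which pulls each product's rating toward the overall mean when it has few ratings. The sketch below is illustrative only: the column names, the sample values, and the prior weight m = 50 are assumptions, and m in particular should be tuned to the data.

```python
import pandas as pd

# Made-up sample: product A has a perfect score from only two ratings.
df = pd.DataFrame({
    "product_id": ["A", "B", "C"],
    "rating": [5.0, 4.0, 3.0],
    "rating_count": [2, 1000, 98],
})

m = 50  # assumed prior weight: number of "virtual" ratings placed at the global mean

# Global mean rating, weighted by how many ratings each product received.
global_mean = (df["rating"] * df["rating_count"]).sum() / df["rating_count"].sum()

# Shrink each average toward the global mean; fewer ratings means a stronger pull.
df["weighted_rating"] = (
    df["rating"] * df["rating_count"] + global_mean * m
) / (df["rating_count"] + m)

print(df.sort_values("weighted_rating", ascending=False))
```

After shrinkage, product A no longer outranks product B, even though its raw average is higher.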

Key Analysis Approaches

Pricing Analysis:

  • Compare actual vs discounted prices across categories
  • Identify optimal discount percentages for different product types
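  • Explore how price level and discount depth relate to ratings and rating counts (estimating true price elasticity requires sales quantities at different price points)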
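
A per-category comparison of actual and discounted prices might look like the sketch below. The column names and sample rows are assumptions for illustration; prices are assumed to be numeric already.

```python
import pandas as pd

# Made-up sample with numeric prices.
df = pd.DataFrame({
    "category": ["Electronics", "Electronics", "Home", "Home"],
    "actual_price": [1000.0, 2000.0, 500.0, 300.0],
    "discounted_price": [800.0, 1000.0, 450.0, 270.0],
})

# Discount depth as a percentage of the actual price.
df["discount_pct"] = (1 - df["discounted_price"] / df["actual_price"]) * 100

# Named aggregation: one summary row per category.
summary = df.groupby("category").agg(
    products=("actual_price", "size"),
    median_actual=("actual_price", "median"),
    median_discounted=("discounted_price", "median"),
    mean_discount_pct=("discount_pct", "mean"),
)

print(summary)
```

Medians are used for prices because a few very expensive products can distort a category mean.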

Rating Analysis:

  • Correlate ratings with pricing, discounts, and review volume
  • Identify threshold review counts where ratings stabilize
  • Segment products by rating tiers for targeted strategies
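
The correlation and tiering steps above can be sketched as follows. Column names, sample values, and the tier boundaries are assumptions. Rank (Spearman) correlation is used here because prices and rating counts are usually heavily skewed; it is computed as a Pearson correlation on ranks.

```python
import pandas as pd

# Made-up sample with already-numeric columns.
df = pd.DataFrame({
    "discounted_price": [199.0, 399.0, 1519.0, 2499.0, 149.0],
    "discount_pct": [43.0, 64.0, 20.0, 10.0, 70.0],
    "rating": [4.0, 4.2, 4.4, 4.3, 3.6],
    "rating_count": [9500, 24000, 1200, 300, 15000],
})

# Spearman correlation matrix: rank every column, then take Pearson correlation.
corr = df.rank().corr()
print(corr.round(2))

# Rating tiers with assumed boundaries; each interval includes its right edge.
df["rating_tier"] = pd.cut(
    df["rating"],
    bins=[0, 3.5, 4.0, 4.5, 5.0],
    labels=["low", "mid", "high", "top"],
)
print(df[["rating", "rating_tier"]])
```

With only five made-up rows the coefficients mean nothing in themselves; the point is the shape of the workflow.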

Category Intelligence:

  • Benchmark performance within and across categories
  • Identify category-specific pricing and discount patterns
  • Understand competitive dynamics in different product segments

Feature Engineering:

  • Create composite scores combining price, rating, and review metrics
  • Develop profitability proxies using price and discount data
  • Build engagement indices from rating and review counts
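
A minimal sketch of these engineered features is shown below. Everything in it is an assumption made for illustration: the column names, the sample values, the log scaling, and especially the weights in the composite score, which are arbitrary. The dataset has no cost data, so price_retention is only a crude stand-in for a profitability proxy.

```python
import numpy as np
import pandas as pd

# Made-up sample with numeric columns.
df = pd.DataFrame({
    "actual_price": [349.0, 1099.0, 1899.0],
    "discounted_price": [199.0, 399.0, 1519.0],
    "rating": [4.0, 4.2, 4.4],
    "rating_count": [9500, 24000, 1200],
})

# Engagement index: log-scaled rating count, rescaled to the 0..1 range.
# (If every product had the same count, the denominator would be zero.)
log_count = np.log1p(df["rating_count"])
df["engagement"] = (log_count - log_count.min()) / (log_count.max() - log_count.min())

# Crude profitability proxy: share of the list price kept after discounting.
df["price_retention"] = df["discounted_price"] / df["actual_price"]

# Composite score: weighted sum of normalized rating, engagement, and
# discount depth. The weights 0.5 / 0.3 / 0.2 are arbitrary placeholders.
df["composite"] = (
    0.5 * (df["rating"] / 5.0)
    + 0.3 * df["engagement"]
    + 0.2 * (1 - df["price_retention"])
)

print(df[["engagement", "price_retention", "composite"]].round(3))
```

Because every component lies between 0 and 1 and the weights sum to 1, the composite score also stays between 0 and 1, which makes products easy to compare.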

File Details

  • File Size: 1.95 MB
  • File Count: 1
  • Create Date: December 23, 2025
  • Last Updated: December 23, 2025