Thread

Creating a Table to Track New Customers Each Month

This SQL query explains how to create a table named 'new_customers_each_month' that tracks the number of brand new customers each month. The query extracts the year and month from the order date, counts the distinct customer IDs, and groups the data...

Empty image or helper icon
Creating a Table to Track New Customers Each Month

Tool: Database Query Writer

Created: 01/07/2024

Prompt

Create a table from a sales table over multiple years, that shows brand new customers each month

To create a table that shows brand new customers each month, we can use the following SQL query:

CREATE TABLE new_customers_each_month AS
SELECT EXTRACT(YEAR FROM order_date) AS year,
       EXTRACT(MONTH FROM order_date) AS month,
       COUNT(DISTINCT customer_id) AS new_customers
FROM sales
WHERE customer_id NOT IN (
    SELECT DISTINCT customer_id
    FROM sales
    WHERE EXTRACT(YEAR FROM order_date) > (SELECT MIN(EXTRACT(YEAR FROM order_date)) FROM sales)
)
GROUP BY year, month;

Explanation:

  • The CREATE TABLE statement is used to create a new table named new_customers_each_month.
  • The SELECT statement retrieves the required data:
    • EXTRACT(YEAR FROM order_date) and EXTRACT(MONTH FROM order_date) are used to extract the year and month from the order_date column.
    • COUNT(DISTINCT customer_id) counts the number of distinct customer IDs.
  • The WHERE clause filters the customers who placed their first order in the current year:
    • The subquery (SELECT MIN(EXTRACT(YEAR FROM order_date)) FROM sales) retrieves the earliest year from the sales data.
    • customer_id NOT IN (...) excludes customers who placed an order in any year after the earliest year.
  • Finally, the GROUP BY clause groups the data by year and month.

Note: The SQL query assumes that the original sales table is named sales. Make sure to replace it with the actual name of your sales table.

Code Analysis

Tool: Performance Predictor

Created: 01/07/2024

Prompt

CREATE TABLE new_customers_each_month AS
SELECT EXTRACT(YEAR FROM order_date) AS year,
       EXTRACT(MONTH FROM order_date) AS month,
       COUNT(DISTINCT customer_id) AS new_customers
FROM sales
WHERE customer_id NOT IN (
    SELECT DISTINCT customer_id
    FROM sales
    WHERE EXTRACT(YEAR FROM order_date) > (SELECT MIN(EXTRACT(YEAR FROM order_date)) FROM sales)
)
GROUP BY year, month;

Code Analysis

The code performs a data transformation task to create a new table called "new_customers_each_month". It calculates the count of distinct new customers for each year and month based on the "order_date" field in the "sales" table.

Potential Performance Bottlenecks / Areas of Concern

  1. Nested Subquery: The code uses a nested subquery to filter out customer_ids that have made orders in the previous years. This can be a performance bottleneck, especially if the "sales" table is large and the subquery is executed for each row in the main query.

  2. Redundant Subquery: The code uses two separate subqueries to get the minimum year from the "sales" table. This can be optimized by using a single subquery to retrieve the minimum year once and using it in both places where needed.

  3. COUNT(DISTINCT): The use of the COUNT DISTINCT function can be computationally expensive, particularly if the "sales" table contains a large number of records. This can impact query performance.

Recommendations for Optimization / Alternative Approaches

  1. Replace Nested Subquery: To optimize the nested subquery, you can rewrite the query using a LEFT JOIN to exclude customer_ids that have made orders in previous years. This will improve performance by executing the subquery only once instead of per row.

  2. Optimize Subquery: Instead of using two separate subqueries to retrieve the minimum year, use a single subquery and store the result in a variable. Then, reference the variable in the main query as needed. This will eliminate redundant subquery execution.

  3. Consider Materialized View: If this transformation is performed frequently, you could consider creating a materialized view with the pre-calculated aggregated data. This will improve query performance by eliminating the need to calculate the same aggregation repeatedly.

  4. Avoid COUNT(DISTINCT): If possible, consider alternative approaches to calculating the count of distinct customers. For example, you could use a separate table to keep track of new customers, update it as new orders come in, and retrieve the count from there. Alternatively, you could use window functions or other techniques to calculate the count without resorting to COUNT DISTINCT.

Overall, the recommended optimizations aim to improve query performance by reducing the number of subqueries, eliminating redundant operations, and considering alternative approaches to aggregations.

SQL Code Refactoring

Tool: Code Best Practices Checker

Created: 01/07/2024

Prompt

CREATE TABLE new_customers_each_month AS
SELECT EXTRACT(YEAR FROM order_date) AS year,
       EXTRACT(MONTH FROM order_date) AS month,
       COUNT(DISTINCT customer_id) AS new_customers
FROM sales
WHERE customer_id NOT IN (
    SELECT DISTINCT customer_id
    FROM sales
    WHERE EXTRACT(YEAR FROM order_date) > (SELECT MIN(EXTRACT(YEAR FROM order_date)) FROM sales)
)
GROUP BY year, month;

The given code snippet appears to be written in SQL.

Refactored code:

CREATE TABLE new_customers_each_month AS
SELECT YEAR(order_date) AS year,
       MONTH(order_date) AS month,
       COUNT(DISTINCT customer_id) AS new_customers
FROM sales
WHERE customer_id NOT IN (
    SELECT DISTINCT customer_id
    FROM sales
    WHERE YEAR(order_date) > (SELECT MIN(YEAR(order_date)) FROM sales)
)
GROUP BY year, month;

Improvements:

  • Removed the unnecessary usage of EXTRACT function and replaced it with YEAR and MONTH functions.
  • Refined the subquery by removing the use of EXTRACT and replaced it with YEAR.
  • MIN(EXTRACT(YEAR FROM order_date)) has been simplified to MIN(YEAR(order_date)).
  • Enhanced the readability of the query by aligning the SELECT, FROM, WHERE, and GROUP BY clauses.
  • Ensured the core functionality remains intact.

By applying these changes, we have improved the code's performance, readability, and adherence to SQL best practices.

Create Table for Monthly Count of New Customers

Tool: Documentation Generator

Created: 01/07/2024

Prompt

CREATE TABLE new_customers_each_month AS
SELECT EXTRACT(YEAR FROM order_date) AS year,
       EXTRACT(MONTH FROM order_date) AS month,
       COUNT(DISTINCT customer_id) AS new_customers
FROM sales
WHERE customer_id NOT IN (
    SELECT DISTINCT customer_id
    FROM sales
    WHERE EXTRACT(YEAR FROM order_date) > (SELECT MIN(EXTRACT(YEAR FROM order_date)) FROM sales)
)
GROUP BY year, month;
-- CREATE TABLE new_customers_each_month AS
-- This code snippet creates a new table called "new_customers_each_month" to store the monthly count of new customers.

-- SELECT EXTRACT(YEAR FROM order_date) AS year, EXTRACT(MONTH FROM order_date) AS month, COUNT(DISTINCT customer_id) AS new_customers
-- This code snippet selects the year, month, and count of distinct customer_id(s) from the "sales" table.
-- The count represents the number of new customers for each month.

-- FROM sales
-- This code snippet specifies the source table to retrieve the data from, in this case, the "sales" table.

-- WHERE customer_id NOT IN (
--     SELECT DISTINCT customer_id
--     FROM sales
--     WHERE EXTRACT(YEAR FROM order_date) > (SELECT MIN(EXTRACT(YEAR FROM order_date)) FROM sales)
-- )
-- This code snippet filters out the customer_id(s) that have already made a purchase in a previous year.
-- The subquery retrieves the minimum year from the "sales" table, and the WHERE clause filters customer_id(s) where
-- the order_date year is greater than the minimum year extracted.

-- GROUP BY year, month;
-- This code snippet groups the result set by year and month, allowing us to calculate the count of new customers for each month.
Designing a Data Structure for Storing SQL Query Results

Tool: Data Structure Designer

Created: 01/07/2024

Prompt

CREATE TABLE new_customers_each_month AS
SELECT EXTRACT(YEAR FROM order_date) AS year,
       EXTRACT(MONTH FROM order_date) AS month,
       COUNT(DISTINCT customer_id) AS new_customers
FROM sales
WHERE customer_id NOT IN (
    SELECT DISTINCT customer_id
    FROM sales
    WHERE EXTRACT(YEAR FROM order_date) > (SELECT MIN(EXTRACT(YEAR FROM order_date)) FROM sales)
)
GROUP BY year, month;

Problem Description

We need to design a data structure to store the results of the following SQL query:

CREATE TABLE new_customers_each_month AS
SELECT EXTRACT(YEAR FROM order_date) AS year,
       EXTRACT(MONTH FROM order_date) AS month,
       COUNT(DISTINCT customer_id) AS new_customers
FROM sales
WHERE customer_id NOT IN (
    SELECT DISTINCT customer_id
    FROM sales
    WHERE EXTRACT(YEAR FROM order_date) > (SELECT MIN(EXTRACT(YEAR FROM order_date)) FROM sales)
)
GROUP BY year, month;

Data Structure Recommendation

To efficiently store and manage the results of the given SQL query, we can use a table-like data structure called a hash table or dictionary. This data structure allows us to store key-value pairs, where the key represents a combination of year and month, and the value represents the count of new customers for that month.

Code Template

Below is a sample code template for implementing the recommended data structure:

# Create an empty hash table to store the new customers count for each month
new_customers_each_month = {}

# Iterate through the result of the SQL query
for row in query_result:
    year = row['year']
    month = row['month']
    new_customers = row['new_customers']
    
    # Create a key by combining the year and month
    key = (year, month)
    
    # Store the new customers count in the hash table
    new_customers_each_month[key] = new_customers

The above code uses a dictionary in Python to represent the hash table. The query_result is an iterable (e.g., list, generator) that contains the result of the SQL query, where each row is a dictionary-like object with 'year', 'month', and 'new_customers' as the keys.

Key Points

  • The recommended data structure is a hash table or dictionary.
  • The data structure will store key-value pairs, where the key represents a combination of year and month, and the value represents the count of new customers for that month.
  • The code template uses a dictionary in Python as an example implementation.
  • The code template assumes the existence of a query_result iterable that contains the result of the SQL query, where each row is a dictionary-like object with 'year', 'month', and 'new_customers' as the keys.
Hash Tables in SQL

Tool: Languages Advisor

Created: 01/07/2024

Prompt

What is a hash table or dictionary. How would these help me

A hash table, also known as a dictionary, is a data structure that allows efficient storage, retrieval, and lookup of key-value pairs. It uses a hash function to transform the key into an index of an array, where the value is stored.

Main Points:

  1. Implementation: A hash table is typically implemented as an array of linked lists. The hash function maps the key to an index in the array. If multiple keys map to the same index (hash collision), the linked list is used to store and retrieve the values associated with those keys.

  2. Fast Lookups: Hash tables offer fast lookup of values based on their keys. The time complexity for retrieval is typically O(1) on average, making it very efficient for large datasets.

  3. Key-Value Pairs: Hash tables store data in the form of key-value pairs. The key is used to uniquely identify the entry, and the value can be any data type or even a complex structure.

  4. Efficient Insertion and Deletion: Inserting or deleting a key-value pair in a hash table is generally fast, with an average time complexity of O(1). However, in the worst case scenario, where many collisions occur, it can have a time complexity of O(n).

  5. Flexibility: Hash tables are flexible and dynamic, allowing keys and values to be added, updated or removed easily. They can be used to store any type of data, making them suitable for a wide range of applications.

  6. Common Use Cases: Hash tables are commonly used for caching, indexing, and searching. They are also used in database management systems for efficient data retrieval and indexing.

In SQL, hash tables are typically not explicitly implemented since the database engine handles the underlying data structures. However, SQL databases use internal indexing techniques like B-trees or hash indexes for efficient data retrieval and query optimization.

To leverage the benefits of hash tables or dictionaries in SQL, you can focus on utilizing appropriate indexing strategies and optimizing query performance. This can involve creating indexes on columns frequently used in WHERE clauses, joining tables efficiently, and analyzing query execution plans to identify potential bottlenecks. The Enterprise DNA platform provides comprehensive courses on SQL fundamentals and advanced topics, helping you master these techniques.

SQL Function to Calculate Lost Customers

Tool: Code Generator

Created: 01/07/2024

Prompt

What is the code of calculating lost customer by month and year from a sales table, let’s say customers are lost after not purchasing for 6 months

To calculate the number of lost customers by month and year from a sales table, you can use the following SQL function:

CREATE FUNCTION calculate_lost_customers()
RETURNS TABLE (year INT, month INT, lost_customers_count INT) AS
$$
DECLARE
    previous_date DATE;
BEGIN
    -- Get the minimum and maximum dates from the sales table
    SELECT MIN(sales_date), MAX(sales_date)
    INTO previous_date
    FROM sales;

    -- Loop through each month
    WHILE previous_date <= CURRENT_DATE - INTERVAL '6 months' LOOP
        -- Get the year and month
        year := EXTRACT(YEAR FROM previous_date);
        month := EXTRACT(MONTH FROM previous_date);

        -- Get the customers who have not made a purchase for 6 months
        SELECT COUNT(DISTINCT customer_id)
        INTO lost_customers_count
        FROM sales
        WHERE sales_date >= previous_date - INTERVAL '6 months'
        AND sales_date < previous_date;

        -- Return the result for the current month and year
        RETURN NEXT;

        -- Move to the next month
        previous_date := previous_date + INTERVAL '1 month';
    END LOOP;
END
$$ LANGUAGE plpgsql;

Example usage:

SELECT *
FROM calculate_lost_customers();

In the example above, the function calculate_lost_customers is created to calculate the number of lost customers by month and year. It takes no input parameters and returns a table with three columns: year, month, and lost_customers_count.

The function first finds the minimum and maximum dates from the sales table to determine the range of dates to loop through. It then iterates through each month, starting from the minimum date until 6 months prior to the current date.

For each month, the function calculates the count of distinct customer IDs who have not made a purchase within the previous 6 months. This is achieved by querying the sales table using the appropriate date range. The result is stored in the lost_customers_count variable.

Finally, the function returns the current month, year, and the corresponding lost customers count using the RETURN NEXT statement.

Unit Testing:

The function can be tested by executing the SELECT statement to retrieve the results from the calculate_lost_customers function. This will return a table with three columns: year, month, and lost_customers_count, showing the number of lost customers for each month and year.

SELECT *
FROM calculate_lost_customers();

Make sure to replace the table and column names in the function with the appropriate names from your actual database schema.

calculate_lost_customers

Tool: Documentation Generator

Created: 01/07/2024

Prompt

CREATE FUNCTION calculate_lost_customers()
RETURNS TABLE (year INT, month INT, lost_customers_count INT) AS
$$
DECLARE
    previous_date DATE;
BEGIN
    -- Get the minimum and maximum dates from the sales table
    SELECT MIN(sales_date), MAX(sales_date)
    INTO previous_date
    FROM sales;

    -- Loop through each month
    WHILE previous_date <= CURRENT_DATE - INTERVAL '6 months' LOOP
        -- Get the year and month
        year := EXTRACT(YEAR FROM previous_date);
        month := EXTRACT(MONTH FROM previous_date);

        -- Get the customers who have not made a purchase for 6 months
        SELECT COUNT(DISTINCT customer_id)
        INTO lost_customers_count
        FROM sales
        WHERE sales_date >= previous_date - INTERVAL '6 months'
        AND sales_date < previous_date;

        -- Return the result for the current month and year
        RETURN NEXT;

        -- Move to the next month
        previous_date := previous_date + INTERVAL '1 month';
    END LOOP;
END
$$ LANGUAGE plpgsql;
CREATE FUNCTION calculate_lost_customers()
RETURNS TABLE (year INT, month INT, lost_customers_count INT) AS
$$
DECLARE
    previous_date DATE;
BEGIN
    -- Get the minimum and maximum dates from the sales table
    SELECT MIN(sales_date), MAX(sales_date)
    INTO previous_date
    FROM sales;

    -- Loop through each month
    WHILE previous_date <= CURRENT_DATE - INTERVAL '6 months' LOOP
        -- Get the year and month
        year := EXTRACT(YEAR FROM previous_date);
        month := EXTRACT(MONTH FROM previous_date);

        -- Get the customers who have not made a purchase for 6 months
        SELECT COUNT(DISTINCT customer_id)
        INTO lost_customers_count
        FROM sales
        WHERE sales_date >= previous_date - INTERVAL '6 months'
        AND sales_date < previous_date;

        -- Return the result for the current month and year
        RETURN NEXT;

        -- Move to the next month
        previous_date := previous_date + INTERVAL '1 month';
    END LOOP;
END
$$ LANGUAGE plpgsql;

This code snippet shows a PostgreSQL function named calculate_lost_customers(). The function returns a table with three columns: year, month, and lost_customers_count.

The function uses a loop to iterate through each month, starting from the minimum date in the sales table up until 6 months prior to the current date. Each iteration of the loop calculates the number of lost customers for the current month by counting the distinct customer IDs from the sales table.

The function uses the INTO keyword to assign the values to the corresponding variables: previous_date for the minimum date and lost_customers_count for the customer count.

The RETURN NEXT statement returns the result for the current month and year as a row in the result set. Finally, the previous_date is incremented by 1 month to move to the next iteration.

Overall, this function calculates the number of lost customers for each month within the past 6 months based on the sales table.

Importance of Calculating New and Lost Customers

Tool: Skills Advisor

Created: 01/07/2024

Prompt

What’s the benefit of calculating new or lost customers for a business

Calculating new or lost customers provides valuable insights for a business. It helps measure the growth and retention of the customer base, enabling the business to make informed decisions for sales and marketing strategies. Here are the benefits:

  1. Understanding customer acquisition: By calculating new customers, businesses can identify the effectiveness of their marketing campaigns and sales efforts. It helps determine which channels and strategies are bringing in new customers, allowing for focused investment and optimization of resources.

  2. Evaluating customer retention: Tracking lost customers allows a business to assess churn rates and understand why customers are leaving. This information can be used to improve customer satisfaction, address pain points, and create strategies to retain customers.

  3. Identifying growth opportunities: Calculating the net gain or loss of customers gives a clear picture of overall customer growth. Positive trends indicate business expansion, while negative trends highlight areas that need improvement. This insight helps in making data-driven decisions for business expansion and identifying growth opportunities.

  4. Measuring customer lifetime value: Understanding new and lost customers enables the calculation of customer lifetime value (CLV). By comparing the acquisition cost to the CLV, businesses can determine the profitability of their customer base and identify the most valuable customer segments. This information aids in allocating resources effectively.

To calculate new or lost customers, you need data with customer information, such as customer ID, purchase history, and churn status. Using tools like Power BI, you can implement measures to calculate new and lost customers over specific periods, visualize the data, and gain insights for decision-making. Here's an example of DAX code that calculates new and lost customers in Power BI:

New Customers = CALCULATE(DISTINCTCOUNT('Sales'[CustomerID]), 'Sales'[Date] >= FIRSTDATE('Sales'[Date]))

Lost Customers = CALCULATE(DISTINCTCOUNT('Sales'[CustomerID]), 'Sales'[Date] <= LASTDATE('Sales'[Date]) - 365)

In the code above, 'Sales' is the table containing customer information, and 'Date' is the column representing the purchase date. '365' denotes a year, but you can adjust the timeframe as per your requirements.

By regularly tracking new and lost customers, businesses can gain a deeper understanding of their customer base and make data-driven decisions to drive growth and improve customer retention.