Chris Parks

I'm a technical problem-solver who specializes in planning, organizing, and facilitating technical projects between systems. I have led successful integrations projects between platforms such as Shopify, Amazon, NetSuite, and more, on top of providing technical consultation and guidance for clients looking to build custom integrations via API.

Recreating Python/Pandas Data Manipulation with Complex SQL Query

I’ve previously posted about looking into NBA statistical data to investigate three-point shooting streaks: https://richchrisparks.com/analyzing-nba-3-point-shot-streaks-with-python-pandas-and-tableau/ .

In the post, I explain in general terms how I take play-by-play data sorted either by team or by player from a SQL query and transform it into a table of “streaks” (consecutively made or missed three-point shots) using Python and Pandas. You can see the original logic in GitHub here: https://github.com/richparks92/missed-threes/blob/main/get_streaks.py.

Let’s recreate this logic using SQL, step-by-step.

Goal

I want to take data from the play-by-play data table in my SQLite database and format it into a table like below to discover team shot streaks:

team_idis_makegame_idcntstreak_id
1610612737.00002960000541
1610612737.01002960000522
1610612737.00002960000543
1610612737.01002960000514

Creating the SQL Query

0. Original Data

This is what the play-by-play data looks like:

GAME_IDEVENTNUMHOMEDESCRIPTIONVISITORDESCRIPTIONPLAYER1_IDPLAYER1_TEAM_ID
00296000021Jump Ball Bradley vs. West: Tip to Williams7621610612751.0
00296000022Mills Violation:Defensive Goaltending1681610612739.0
00296000025Sura S.FOUL (P1.T1)6821610612739.0
00296000027Gill Free Throw 1 of 2 (1 PTS)3831610612751.0
00296000028Gill Free Throw 2 of 2 (2 PTS)3831610612751.0

1. Filtering to 3 Point Shots

Using the following SQL query, we join the play-by-play data with the game log data so we can know the date of each game. We also filter to three point shots only.

SELECT 
  pbp2.GAME_ID AS game_id, 
  lgl.GAME_DATE AS game_date, 
  EVENTNUM as event_num,  
  PLAYER1_ID AS player_id,  
  PLAYER1_TEAM_ID  AS team_id,
  CASE
    WHEN COALESCE(HOMEDESCRIPTION, '') <> '' THEN HOMEDESCRIPTION
    ELSE VISITORDESCRIPTION
  END AS description
FROM  PLAY_BY_PLAYS_STATS_2 pbp2
LEFT JOIN LEAGUE_GAME_LOGS lgl ON pbp2.GAME_ID = lgl.GAME_ID
WHERE lgl.WL = "W" 
  AND DESCRIPTION LIKE "%3PT%"
ORDER BY pbp2.PLAYER1_TEAM_ID, GAME_DATE, pbp2.GAME_ID, EVENTNUM

2. Marking makes and misses

Next, we use LIKE to determine whether each shot is a make (1) or a miss (0).

SELECT
  game_id,
  game_date,
  event_num,
  player_id,
  team_id,
  CASE WHEN description LIKE "MISS%" THEN False 
    ELSE True
  END AS is_make
FROM
  team_shots_raw

3. Finding the previous shot result using LAG

Here we use the LAG function to find the shot result of the previous row so we can compare the current and previous shot later.

SELECT 
  *, 
  LAG(is_make) 
    OVER(
      ORDER BY team_id, game_date, event_num) 
  AS prev 
FROM team_shots

4. Assigning a streak_id to each streak

Now, we can use the SUM function to help define a streak_id for each string of consecutive made or missed shots. For each row, we now have a running total that adds 1 to the streak_id if the shot result (is_make) is a different result than the last, thus meaning it’s a new streak. If the shot result is the same, we add 0 to keep the streak_id the same.

SELECT 
  team_id, 
  game_id,
  is_make, 
  SUM(
    CASE 
      WHEN is_make = prev THEN 0 
      ELSE 1 
    END) 
    OVER(ORDER BY team_id, game_date, event_num) 
  AS streak_id
FROM 
  lags

5. Grouping by Streak ID and Game ID

Now we have a table where each shot is assigned a streak_id. The final step is to group this table so each combination of streak_id and game_id is only represented once, and the count of shots for each grouped row is calculated.

SELECT 
  team_id, 
  is_make,
  game_id,  
  COUNT(*) AS cnt,
  streak_id
FROM streaks
GROUP BY streak_id, game_id
ORDER BY streak_id;

Putting it All Together

The final query is complex and computationally expensive. It returned 698619 rows (from 14,486,793 total rows of play-by-play data) and took about 71 seconds to complete. Here’s the whole query combined:

WITH  team_shots_raw AS (
  SELECT 
    pbp2.GAME_ID AS game_id, 
    lgl.GAME_DATE AS game_date, 
    EVENTNUM as event_num,  
    PLAYER1_ID AS player_id,  
    PLAYER1_TEAM_ID  AS team_id,
    CASE
      WHEN COALESCE(HOMEDESCRIPTION, '') <> '' THEN HOMEDESCRIPTION
      ELSE VISITORDESCRIPTION
    END AS description
  FROM  PLAY_BY_PLAYS_STATS_2 pbp2
  LEFT JOIN LEAGUE_GAME_LOGS lgl ON pbp2.GAME_ID = lgl.GAME_ID
  WHERE lgl.WL = "W" 
    AND DESCRIPTION LIKE "%3PT%"
  ORDER BY pbp2.PLAYER1_TEAM_ID, GAME_DATE, pbp2.GAME_ID, EVENTNUM
		)
,team_shots AS (
  SELECT 
    game_id,
    game_date,
    event_num,
    player_id,
    team_id,
    CASE 
      WHEN description LIKE "MISS%" THEN False 
      ELSE True 
    END AS is_make
  FROM
    team_shots_raw
)

,lags AS (
  SELECT 
    *, 
    LAG(is_make) 
      OVER(
        ORDER BY team_id, game_date, event_num) 
    AS prev 
  FROM team_shots
)

, streaks AS (
SELECT 
  team_id, 
  game_id,
  is_make, 
  SUM(
    CASE 
      WHEN is_make = prev THEN 0 
      ELSE 1 
    END) 
    OVER(ORDER BY team_id, game_date, event_num) 
  AS streak_id
FROM 
  lags

SELECT 
  team_id,
  is_make,
  game_id,  
  COUNT(*) AS cnt,
  streak_id
FROM streaks
GROUP BY streak_id, game_id
ORDER BY streak_id;