I’ve previously posted about looking into NBA statistical data to investigate three-point shooting streaks: https://richchrisparks.com/analyzing-nba-3-point-shot-streaks-with-python-pandas-and-tableau/ .
In the post, I explain in general terms how I take play-by-play data sorted either by team or by player from a SQL query and transform it into a table of “streaks” (consecutively made or missed three-point shots) using Python and Pandas. You can see the original logic in GitHub here: https://github.com/richparks92/missed-threes/blob/main/get_streaks.py.
Let’s recreate this logic using SQL, step-by-step.
Goal
I want to take data from the play-by-play data table in my SQLite database and format it into a table like below to discover team shot streaks:
team_id | is_make | game_id | cnt | streak_id |
---|---|---|---|---|
1610612737.0 | 0 | 0029600005 | 4 | 1 |
1610612737.0 | 1 | 0029600005 | 2 | 2 |
1610612737.0 | 0 | 0029600005 | 4 | 3 |
1610612737.0 | 1 | 0029600005 | 1 | 4 |
Creating the SQL Query
0. Original Data
This is what the play-by-play data looks like:
GAME_ID | EVENTNUM | HOMEDESCRIPTION | VISITORDESCRIPTION | PLAYER1_ID | PLAYER1_TEAM_ID |
---|---|---|---|---|---|
0029600002 | 1 | Jump Ball Bradley vs. West: Tip to Williams | 762 | 1610612751.0 | |
0029600002 | 2 | Mills Violation:Defensive Goaltending | 168 | 1610612739.0 | |
0029600002 | 5 | Sura S.FOUL (P1.T1) | 682 | 1610612739.0 | |
0029600002 | 7 | Gill Free Throw 1 of 2 (1 PTS) | 383 | 1610612751.0 | |
0029600002 | 8 | Gill Free Throw 2 of 2 (2 PTS) | 383 | 1610612751.0 |
1. Filtering to 3 Point Shots
Using the following SQL query, we join the play-by-play data with the game log data so we can know the date of each game. We also filter to three point shots only.
SELECT
pbp2.GAME_ID AS game_id,
lgl.GAME_DATE AS game_date,
EVENTNUM as event_num,
PLAYER1_ID AS player_id,
PLAYER1_TEAM_ID AS team_id,
CASE
WHEN COALESCE(HOMEDESCRIPTION, '') <> '' THEN HOMEDESCRIPTION
ELSE VISITORDESCRIPTION
END AS description
FROM PLAY_BY_PLAYS_STATS_2 pbp2
LEFT JOIN LEAGUE_GAME_LOGS lgl ON pbp2.GAME_ID = lgl.GAME_ID
WHERE lgl.WL = "W"
AND DESCRIPTION LIKE "%3PT%"
ORDER BY pbp2.PLAYER1_TEAM_ID, GAME_DATE, pbp2.GAME_ID, EVENTNUM
2. Marking makes and misses
Next, we use LIKE
to determine whether each shot is a make (1) or a miss (0).
SELECT
game_id,
game_date,
event_num,
player_id,
team_id,
CASE WHEN description LIKE "MISS%" THEN False
ELSE True
END AS is_make
FROM
team_shots_raw
3. Finding the previous shot result using LAG
Here we use the LAG function to find the shot result of the previous row so we can compare the current and previous shot later.
SELECT
*,
LAG(is_make)
OVER(
ORDER BY team_id, game_date, event_num)
AS prev
FROM team_shots
4. Assigning a streak_id to each streak
Now, we can use the SUM
function to help define a streak_id for each string of consecutive made or missed shots. For each row, we now have a running total that adds 1 to the streak_id if the shot result (is_make) is a different result than the last, thus meaning it’s a new streak. If the shot result is the same, we add 0 to keep the streak_id the same.
SELECT
team_id,
game_id,
is_make,
SUM(
CASE
WHEN is_make = prev THEN 0
ELSE 1
END)
OVER(ORDER BY team_id, game_date, event_num)
AS streak_id
FROM
lags
5. Grouping by Streak ID and Game ID
Now we have a table where each shot is assigned a streak_id. The final step is to group this table so each combination of streak_id and game_id is only represented once, and the count of shots for each grouped row is calculated.
SELECT
team_id,
is_make,
game_id,
COUNT(*) AS cnt,
streak_id
FROM streaks
GROUP BY streak_id, game_id
ORDER BY streak_id;
Putting it All Together
The final query is complex and computationally expensive. It returned 698619 rows (from 14,486,793 total rows of play-by-play data) and took about 71 seconds to complete. Here’s the whole query combined:
WITH team_shots_raw AS (
SELECT
pbp2.GAME_ID AS game_id,
lgl.GAME_DATE AS game_date,
EVENTNUM as event_num,
PLAYER1_ID AS player_id,
PLAYER1_TEAM_ID AS team_id,
CASE
WHEN COALESCE(HOMEDESCRIPTION, '') <> '' THEN HOMEDESCRIPTION
ELSE VISITORDESCRIPTION
END AS description
FROM PLAY_BY_PLAYS_STATS_2 pbp2
LEFT JOIN LEAGUE_GAME_LOGS lgl ON pbp2.GAME_ID = lgl.GAME_ID
WHERE lgl.WL = "W"
AND DESCRIPTION LIKE "%3PT%"
ORDER BY pbp2.PLAYER1_TEAM_ID, GAME_DATE, pbp2.GAME_ID, EVENTNUM
)
,team_shots AS (
SELECT
game_id,
game_date,
event_num,
player_id,
team_id,
CASE
WHEN description LIKE "MISS%" THEN False
ELSE True
END AS is_make
FROM
team_shots_raw
)
,lags AS (
SELECT
*,
LAG(is_make)
OVER(
ORDER BY team_id, game_date, event_num)
AS prev
FROM team_shots
)
, streaks AS (
SELECT
team_id,
game_id,
is_make,
SUM(
CASE
WHEN is_make = prev THEN 0
ELSE 1
END)
OVER(ORDER BY team_id, game_date, event_num)
AS streak_id
FROM
lags
SELECT
team_id,
is_make,
game_id,
COUNT(*) AS cnt,
streak_id
FROM streaks
GROUP BY streak_id, game_id
ORDER BY streak_id;