Daily Traffic Forecasting for MnDOT Sensors


Introduction

Traffic volume forecasts are a core input to day-to-day transportation operations: they help agencies anticipate demand surges, detect abnormal conditions, allocate staffing, and communicate expected congestion patterns to the public. But “traffic volume” is not a single uniform signal—networks are heterogeneous. Some stations represent high-volume corridors with stable commuter seasonality; others are lower-volume segments where week-to-week patterns are noisier, disruptions are more visible, and naïve heuristics can fail in ways that matter operationally.

This project builds and evaluates a station-level daily traffic forecasting model designed for that reality: one model that can generalize across a portfolio of stations with very different scales, while still delivering reliable gains during volatile windows when decision-makers most need forecasts to hold up. The practical benchmark throughout is intentionally strict: the seasonal naïve lag-7 rule (“same day last week”) is a strong baseline for traffic series because it directly encodes weekly commuting structure. A model is only worth deploying if it consistently improves on that baseline—not just on average, but across stations, across time, and across different traffic regimes.

Data source and why it fits the forecasting problem

The dataset comes from the Minnesota Department of Transportation (MnDOT) Traffic Data & Analysis hourly volume reports for continuous count sites (ATR/WIM). MnDOT publishes these hourly volume datasets by year, reflecting counts collected by permanent sensors. The underlying monitoring program is explicitly designed around continuous counts—permanent road sensors that report hourly counts throughout the year—which makes it a natural foundation for time-series forecasting and backtesting workflows.

Operationally, continuous-count data has two strengths that shape the modeling design in this project. First, it supports realistic evaluation: you can simulate deployment by training on history up to a cutoff date and forecasting the next horizon (rolling-origin backtests), because the data arrives as a consistent time stream. Second, it preserves the structure traffic forecasters actually exploit—strong weekly seasonality, gradual drift, and episodic disruptions—while still requiring robustness to real-world issues like missing days, sensor irregularities, and coverage changes.

Core challenge: regime shifts and network heterogeneity

A major complication for traffic forecasting is that the data-generating process is not stationary over long horizons. The COVID period introduced a clear regime change in both level and variability, and many stations did not immediately return to their pre-2020 dynamics. Rather than treating 2020–early 2021 as “just another season,” this project explicitly tests for stabilization and anchors the modeling window in the more regular post-recovery regime. The early visuals (normalization to a 2019 baseline, systemwide trend context, and weekday/weekend structure) are not just EDA—they justify why the downstream modeling choices (features, baselines, evaluation windows) are framed the way they are.

At the same time, any systemwide summary can hide important station-level behavior. Traffic volumes are heavy-tailed: a few high-volume corridors can dominate aggregate losses, while smaller stations may be operationally important but statistically “easy to ignore.” This project therefore treats “generalization across station scale” as a first-class requirement. That shows up in (i) modeling decisions like variance-stabilizing transforms and shared global learning, and (ii) evaluation decisions like station-weighted metrics and station-level error slicing so performance is not reduced to a single network-average number.

What the project does

This pipeline converts the raw hourly sensor streams into a daily, station-level forecasting task and evaluates multiple models under a controlled, fair comparison:

  1. Aggregate MnDOT hourly lane counts into a clean station-day panel with validity and coverage checks.
  2. Identify the post-COVID stabilization window (June 2021 onward) and restrict modeling to that regime.
  3. Engineer a compact, leakage-free feature set: seasonality-aware lags, rolling statistics, calendar effects, and station identity.
  4. Benchmark linear, robust, neural, and tree-ensemble models against simple baselines (including the seasonal naïve lag-7 rule) on a shared chronological split.
  5. Report station-weighted metrics, validate the selected model with a rolling-origin backtest, and check that station-level forecasts aggregate into a coherent system-wide signal.

Contribution and takeaway

The outcome is not just “a model with a good score,” but an end-to-end forecasting story that matches how agencies evaluate usefulness: Does it beat a strong weekly baseline? Is the lift broad across the network? Does it persist across seasons and volatile windows? And does it produce plausible system-level signals? In the results that follow, ExtraTreesRegressor (ET) is selected because it delivers the strongest consistent improvement under these constraints, and the visual analyses are used to demonstrate that the gains are (1) widespread across stations, (2) present across station scales, and (3) repeatable across time in a deployment-like backtest—exactly the properties you want before trusting a forecasting model operationally.


Methods

This project builds a deployment-style forecasting pipeline for daily traffic volumes using only information available up to each prediction date. The core objective is practical: improve on a strong seasonal naïve baseline (“same day last week”) while remaining stable across heterogeneous stations (from low-volume sensors to major corridors) and across time.


Data and panel construction

The modeling dataset is a station-day panel covering June 2021 through December 2024, where each row represents a single station’s observed traffic volume for a specific date. The panel is derived from MnDOT ATR/WIM hourly volume files, which report traffic counts at the lane × direction × day level with 24 hourly columns ("1"–"24"). Each raw row is first converted into a daily lane total by summing the hourly columns (coercing non-numeric values to missing), and lane totals are then aggregated upward to a direction-day volume (vol_dir) by summing across lanes for the same station_id, dir_of_travel, and date. Finally, direction-day volumes are summed across directions to produce the station-day target, station_volume.
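As a concrete illustration, that aggregation can be expressed in a few pandas steps. This is a minimal sketch, assuming the hourly files are already loaded into one DataFrame with the columns named in the text; the file name and the intermediate `vol_lane` column are illustrative, not the project's exact code.

```python
import pandas as pd

# Hypothetical input: one row per station x direction x lane x day,
# with hourly count columns "1".."24" as described above.
hour_cols = [str(h) for h in range(1, 25)]

hourly = pd.read_csv("mndot_atr_hourly.csv", dtype={"station_id": str})
hourly["date"] = pd.to_datetime(hourly["date"])

# Daily lane total: coerce non-numeric entries to missing, then sum across hours.
hourly[hour_cols] = hourly[hour_cols].apply(pd.to_numeric, errors="coerce")
hourly["vol_lane"] = hourly[hour_cols].sum(axis=1, min_count=1)

# Direction-day volume: sum lane totals within station, direction, and date.
vol_dir = (
    hourly.groupby(["station_id", "dir_of_travel", "date"], as_index=False)["vol_lane"]
          .sum()
          .rename(columns={"vol_lane": "vol_dir"})
)

# Station-day target: sum directions within each station and date.
panel = (
    vol_dir.groupby(["station_id", "date"], as_index=False)["vol_dir"]
           .sum()
           .rename(columns={"vol_dir": "station_volume"})
)
```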

Because the forecasting task assumes a consistent physical interpretation of “daily volume” over time, the pipeline also constructs directional split diagnostics to identify station-days that are structurally comparable. For each station, the two most consistently observed directions (by number of reporting days) are selected as the canonical pair (dir_a, dir_b). The processing step then attaches vol_a, vol_b, and directional shares (share_a, share_b) for each station-day, and defines a valid_day flag requiring: (i) positive total volume, (ii) exactly two observed directions for the station, and (iii) non-missing volumes for both canonical directions. This flag is used to separate clean station-days from partial or structurally ambiguous reporting (e.g., missing direction data, inconsistent direction coding, or days where only one direction reports).

Coverage and station heterogeneity are addressed explicitly during construction. After station-day aggregation, the pipeline computes per-station coverage statistics over a fixed window, including the number of years present, minimum and average coverage fraction (valid days divided by expected days in that year), and average daily volume. These summaries are used to identify candidate stations with stable reporting behavior and to avoid training on stations dominated by missingness or intermittent operation.
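A rough sketch of that coverage screen is below, assuming the station-day panel has already been augmented with the `valid_day` flag described above; the full-year expected-day count and the 0.8 minimum-coverage cutoff are illustrative assumptions.

```python
import pandas as pd

# Per-station, per-year coverage summary (full-year approximation; the project
# may prorate partial years at the edges of the window).
valid = panel[panel["valid_day"]].assign(year=lambda d: d["date"].dt.year)

per_year = valid.groupby(["station_id", "year"]).agg(
    valid_days=("date", "nunique"),
    avg_volume=("station_volume", "mean"),
)
per_year["expected_days"] = [
    366 if pd.Timestamp(year=y, month=1, day=1).is_leap_year else 365
    for y in per_year.index.get_level_values("year")
]
per_year["coverage"] = per_year["valid_days"] / per_year["expected_days"]

station_summary = per_year.groupby("station_id").agg(
    years_present=("coverage", "size"),
    min_coverage=("coverage", "min"),
    avg_coverage=("coverage", "mean"),
    avg_daily_volume=("avg_volume", "mean"),
)

# Keep stations with stable reporting behavior (threshold illustrative).
candidate_stations = station_summary[station_summary["min_coverage"] >= 0.8].index
```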

The final modeling panel is created by filtering to the project’s analysis window (June 2021–December 2024) and retaining station-days that meet the validity criteria above, yielding a dataset suitable for backtesting models that rely only on information available up to each prediction date.


Regime selection: normalizing to a 2019 baseline and identifying post-COVID stabilization

To compare traffic dynamics across stations with very different absolute volumes, daily traffic is expressed as a ratio to each station’s 2019 median. This normalization makes the unit of analysis interpretable (“percent of typical 2019 activity”) and prevents high-volume stations from dominating systemwide summaries. We aggregate across stations using the daily median (robust to outliers) and visualize cross-station heterogeneity using the interquartile range (IQR, 25th–75th percentile).
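A minimal sketch of this normalization and of the volatility check discussed below, assuming a station-day frame `panel` that covers the full history back through 2019 (column names follow the construction above):

```python
import pandas as pd

# Per-station 2019 median, then express each station-day as a ratio to it.
median_2019 = (
    panel.loc[panel["date"].dt.year == 2019]
         .groupby("station_id")["station_volume"]
         .median()
         .rename("median_2019")
)
norm = panel.join(median_2019, on="station_id")
norm["ratio_2019"] = norm["station_volume"] / norm["median_2019"]

# Systemwide daily summary: robust median plus the 25th-75th percentile band.
daily = norm.groupby("date")["ratio_2019"].agg(
    median="median",
    q25=lambda s: s.quantile(0.25),
    q75=lambda s: s.quantile(0.75),
)

# Stabilization check: 13-week rolling standard deviation of the weekly median,
# compared before vs after June 2021.
weekly_median = daily["median"].resample("W").median()
rolling_vol_13w = weekly_median.rolling(13).std()
```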

Figure 1 shows a clear COVID-era regime shift: systemwide traffic drops sharply in early 2020 and recovers noisily through 2020–mid-2021. After June 2021 (vertical dashed line), the series exhibits a more regular annual pattern—seasonal peaks above baseline and winter troughs below baseline—while remaining broadly centered near the 2019 reference level (horizontal dashed line at 1.0). Consistent with this stabilization, the median’s short-horizon volatility declines (the 13-week rolling standard deviation is ~7% lower after June 2021). This supports treating 2020 through early 2021 as a distinct regime and using post-June 2021 data as the primary window for learning “normal” seasonal structure in forecasting models.

Figure 1. Daily traffic is shown as a ratio to each station’s 2019 median (1.0 = typical 2019), with the systemwide median (line) and cross-station IQR (band). The dashed June 2021 marker highlights the post-COVID period where seasonality becomes more regular and the modeling window begins.

Systemwide context and station coverage

To ground the station-level forecasting task in a systemwide view, we aggregate daily station volumes into a weekly system total and track the number of active stations reporting data each week (coverage), since both can affect observed trends. Figure 2 separates these concepts into aligned panels: the top panel shows overall demand (weekly total volume), while the bottom panel shows data availability (active stations). We plot the raw weekly totals as a light line to preserve short-term variation and overlay an 8-week rolling mean to emphasize the underlying trend.
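A short sketch of the weekly aggregation behind Figure 2, assuming the station-day frame `panel` from the construction above:

```python
import pandas as pd

# Weekly system totals plus the number of stations contributing each week.
weekly = (
    panel.set_index("date")
         .groupby(pd.Grouper(freq="W"))
         .agg(total_volume=("station_volume", "sum"),
              active_stations=("station_id", "nunique"))
)

# 8-week rolling mean overlaid on the raw weekly totals.
weekly["total_8wk_mean"] = weekly["total_volume"].rolling(8, min_periods=1).mean()
```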

Figure 2. Weekly system total volume across active stations (top) is shown with an 8-week rolling mean to emphasize the underlying trend and seasonality. Active-station coverage (bottom) provides context to distinguish true demand shifts from reporting changes.

Figure 2 motivates two modeling choices. First, post-June 2021 traffic exhibits stable repeating seasonality at the system level, supporting calendar encodings and seasonality-aware lags. Second, the active-station count stays within a relatively narrow band, indicating major swings in weekly total volume are primarily real demand dynamics, not stations dropping in/out. Where troughs coincide with modest coverage dips, this reinforces two practical safeguards: (i) using rolling features that reduce sensitivity to transient anomalies, and (ii) incorporating data-quality checks (e.g., “near-complete week” filtering or missingness indicators) so the model is not asked to learn from deflated totals caused by incomplete reporting.


Feature design and baseline choice

The forecasting goal is to predict daily volume at each station using only information that would be available at prediction time. The feature set is intentionally compact and operationally interpretable:

  1. Seasonality-aware lags of the station’s own volume, including the 7-day lag that encodes “same day last week.”
  2. Rolling statistics of recent volume (level and volatility) that adapt to drift without leaking future information.
  3. Calendar effects such as day-of-week and seasonal indicators.
  4. Station identity, so a single pooled model can learn station-specific baselines.
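The sketch below illustrates this per-station feature construction; the exact lag set, rolling windows, and calendar encodings shown are illustrative assumptions rather than the project's definitive configuration.

```python
import numpy as np
import pandas as pd

def add_features(panel: pd.DataFrame) -> pd.DataFrame:
    """Leakage-free, per-station features; lag choices and windows are illustrative."""
    df = panel.sort_values(["station_id", "date"]).copy()
    grp = df.groupby("station_id")["station_volume"]

    # Seasonality-aware lags (lag_7 doubles as the seasonal naive baseline).
    for lag in (1, 7, 14):
        df[f"lag_{lag}"] = grp.shift(lag)

    # Rolling statistics on values shifted by one day, so only past data is used.
    df["roll_mean_7"] = grp.transform(lambda s: s.shift(1).rolling(7).mean())
    df["roll_std_28"] = grp.transform(lambda s: s.shift(1).rolling(28).std())

    # Calendar effects.
    df["dow"] = df["date"].dt.dayofweek
    df["month"] = df["date"].dt.month
    df["is_weekend"] = (df["dow"] >= 5).astype(int)

    # Log-scale target (see the modeling-target section below).
    df["y"] = np.log1p(df["station_volume"])
    return df
```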

Figure 3. Across stations, normalized volumes rise through weekdays and drop sharply on weekends (left), and the weekday vs weekend distributions show a clear level shift (right). This motivates explicit calendar features and lag-7 as a strong baseline/feature.

Figure 3 explains why these engineered features—and the lag-7 baseline—are appropriate. After normalizing stations by their own median volume, the median pattern rises from Monday through Friday and drops sharply on weekends, indicating broad, consistent weekday/weekend structure across stations. This motivates calendar features (day-of-week effects) and also motivates the seasonal naïve baseline: traffic on a given Monday is typically much closer to the previous Monday than to the previous Sunday. The spread around the median curves shows stations differ in seasonal strength and noise, motivating station identity and rolling statistics to stabilize predictions and adapt to drift.


Modeling target and inverse transform

Station volumes are heavy-tailed, so models are trained to predict:

$$y = \log\left(1 + \text{station\_volume}\right)$$

This stabilizes variance and improves cross-station generalization. For operational interpretability, metrics are emphasized after inverse-transforming predictions back to vehicles/day:

$$\widehat{\text{station\_volume}} = \operatorname{expm1}(\hat{y}) = e^{\hat{y}} - 1$$

Train/validation/test setup (time-aware and fair)

All candidate models are benchmarked using the same engineered feature set and the same chronological train/validation/test split. The split is time-based (not random), aligning evaluation with deployment: train on past observations and predict future days without leakage.
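A compact sketch of that chronological split, reusing the `add_features` helper from the feature sketch above; the cutoff dates and retained columns are illustrative assumptions, not the project's exact boundaries.

```python
# Build features, drop warm-up rows that lack full lag/rolling history.
feats = add_features(panel).dropna(subset=["lag_14", "roll_std_28"])

# Time-based split: train strictly precedes validation, which precedes test.
train = feats[feats["date"] < "2024-01-01"]
valid = feats[(feats["date"] >= "2024-01-01") & (feats["date"] < "2024-07-01")]
test = feats[feats["date"] >= "2024-07-01"]

feature_cols = ["station_id", "lag_1", "lag_7", "lag_14",
                "roll_mean_7", "roll_std_28", "dow", "month", "is_weekend"]
X_train, y_train = train[feature_cols], train["y"]   # y = log1p(volume)
X_test = test[feature_cols]

# After fitting any model on the log scale, report in vehicles/day with
#   numpy.expm1(model.predict(X_test))
```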


Baselines and model candidates

To ensure improvements reflect signal beyond weekly repetition, the benchmark includes simple baselines:

  1. Seasonal naïve (t−7): predict the same day last week (the primary reference).
  2. Naïve (t−1): predict yesterday’s volume.
  3. Rolling mean (7d): predict a trailing 7-day average.

Learned models span linear methods, robust regression, neural/distance-based models, and tree ensembles. All learned models are trained on the same X and y to keep comparisons controlled.


Metrics and station-weighted aggregation (STW)

Performance is reported using complementary metrics:

  1. RMSE (vehicles/day): typical miss in real units, penalizing large errors more heavily.
  2. MAE (vehicles/day): average absolute miss, less sensitive to rare extreme days.
  3. sMAPE (%): percentage error that compares fairly across stations with very different volumes (MAPE is also reported for completeness).

To avoid a few high-volume corridors dominating the headline results, the primary summaries use station-weighted (STW) aggregation: compute metrics within each station over its evaluation days, then average those station-level metrics across stations (roughly equal weight per station). This allows us to answer the important question: “how well does this work across the network?”
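A minimal sketch of the station-weighted aggregation, assuming a results frame with one row per station-day and columns `station_id`, `y_true`, and `y_pred` in vehicles/day (the sMAPE denominator convention is assumed):

```python
import numpy as np
import pandas as pd

def smape_pct(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Symmetric MAPE in percent."""
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    return float(np.mean(np.abs(y_pred - y_true) / denom) * 100.0)

def station_weighted_metrics(results: pd.DataFrame) -> pd.Series:
    """Score each station over its evaluation days, then average across stations."""
    per_station = results.groupby("station_id").apply(
        lambda g: pd.Series({
            "rmse": float(np.sqrt(np.mean((g["y_pred"] - g["y_true"]) ** 2))),
            "mae": float(np.mean(np.abs(g["y_pred"] - g["y_true"]))),
            "smape": smape_pct(g["y_true"].to_numpy(), g["y_pred"].to_numpy()),
        })
    )
    return per_station.mean()  # roughly equal weight per station
```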


Rolling-origin backtest (deployment-like validation)

To test stability across time, the final model is evaluated using a rolling-origin backtest with an expanding training window:

  1. Train on all history available up to a cutoff
  2. Forecast the next 28 days
  3. Move the origin forward and repeat across 16 folds

This mirrors production behavior (periodic retraining, forward forecasting) and tests whether gains persist across seasons and volatility regimes rather than depending on one favorable split.
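The backtest loop itself is simple; a minimal sketch under the stated schedule is below, where `fit_and_score` is a placeholder for the full feature/model/metric pipeline and the origin dates are passed in rather than hard-coded.

```python
import pandas as pd

HORIZON = pd.Timedelta(days=28)   # forecast window per fold
STEP = pd.Timedelta(days=56)      # spacing between fold origins

def rolling_backtest(feats: pd.DataFrame, first_origin, last_origin, fit_and_score):
    """Expanding-window, rolling-origin evaluation."""
    origin = pd.Timestamp(first_origin)
    fold_results = []
    while origin <= pd.Timestamp(last_origin):
        train = feats[feats["date"] < origin]                    # all history to date
        test = feats[(feats["date"] >= origin) &
                     (feats["date"] < origin + HORIZON)]         # next 28 days
        fold_results.append(fit_and_score(train, test, origin))  # per-fold metrics
        origin += STEP                                           # move origin forward
    return pd.DataFrame(fold_results)
```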


Results

This Results section answers two practical questions: (1) which model should a traffic agency actually trust for day-ahead to month-ahead forecasting, and (2) does it stay better than a strong “same day last week” baseline across different station types and time periods? We start with a controlled benchmark where every model sees the same features and the same time-based split, and we report performance using station-weighted (STW) metrics so the biggest corridors don’t drown out everyone else. From there, we validate the model choice in three ways that map to real operations: coverage across stations (does it help most stations or just a few?), coverage across station scale (does it still help on both low- and high-volume sites?), and coverage across time (does it keep winning in rolling, forward-looking backtests that mimic deployment?).


Training setup and “fair” model comparison

To select a final forecasting model without giving any approach an unfair advantage, we benchmarked multiple regression models using the same engineered feature set and the same chronological train/validation/test split. The split is time-based (not random) so evaluation matches the real forecasting task: train on past observations and predict future days without leaking information across time.

The key update in this benchmark is that station_id is one-hot encoded and included as part of the shared feature set. This turns the problem into a single “global” model that can still learn station-specific baselines (fixed effects) while using the same lag/rolling/calendar signals everywhere. Practically, this is a fairer comparison for pooled forecasting because it prevents models from being forced into a one-size-fits-all intercept across stations with very different typical volumes.
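A sketch of that pooled setup using scikit-learn, with ET settings taken from Table 1 (600 trees, min_samples_leaf=1); the numeric feature list, the `X_train`/`y_train` names, and the random seed are assumptions carried over from the earlier sketches.

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

numeric_cols = ["lag_1", "lag_7", "lag_14", "roll_mean_7", "roll_std_28",
                "dow", "month", "is_weekend"]

# One global model: one-hot station fixed effects + shared temporal features.
global_et = Pipeline([
    ("prep", ColumnTransformer(
        transformers=[("station", OneHotEncoder(handle_unknown="ignore"), ["station_id"])],
        remainder="passthrough",   # numeric features pass through unchanged
    )),
    ("model", ExtraTreesRegressor(n_estimators=600, min_samples_leaf=1,
                                  n_jobs=-1, random_state=42)),
])

global_et.fit(X_train[["station_id"] + numeric_cols], y_train)   # y_train = log1p(volume)
```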

Because station volumes are heavy-tailed—a small number of very high-volume stations can dominate loss functions—models were trained to predict log1p(station_volume). This stabilizes variance and helps the pooled model generalize across stations with very different scales. For operational interpretation, the benchmark emphasizes metrics computed after inverse-transforming predictions back to the original volume units. To avoid letting the largest corridors dominate summary error, the headline comparisons use station-weighted (STW) aggregation, which treats stations more evenly when reporting overall performance.

Performance is summarized using a few metrics that each answer a slightly different “how wrong was the forecast?” question. RMSE (vehicles/day) reflects the typical miss in real units and penalizes big mistakes more heavily. MAE (vehicles/day) is the average absolute miss and is usually easier to interpret day-to-day because it’s less influenced by rare extreme spikes. To compare performance fairly across stations with very different traffic levels, sMAPE (%) reports error as a percentage rather than raw units, so low-volume stations don’t get ignored and high-volume stations don’t dominate. (MAPE is also reported for completeness, but sMAPE is emphasized because it behaves better when volumes are small.) For all metrics, lower is better. We also include simple baselines—especially the seasonal naïve lag-7 rule (“same day last week”)—so improvements reflect signal beyond weekly seasonality, and we report fit time as model training time only, measured on the same machine under consistent settings.
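For reference, the symmetric percentage error used in these summaries follows the standard form below (the exact denominator convention in the project's implementation is assumed):

$$\text{sMAPE} = \frac{100}{n}\sum_{t=1}^{n}\frac{\left|\hat{y}_t - y_t\right|}{\left(\left|y_t\right| + \left|\hat{y}_t\right|\right)/2}$$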

What the benchmark shows

Table 1 compares models trained on the same feature set (lags/rolls/calendar effects plus one-hot station identity) under the same time-based split, reporting station-weighted, volume-scale test metrics for interpretability. The results show that the feature set contains real predictive signal beyond weekly seasonality: most learned models outperform the lag-7 baseline on the held-out test set (baseline STW test RMSE = 4,077.3; baseline STW test sMAPE = 9.21%, across 74 stations).

| Model | Fit (s) | Test STW RMSE (vehicles/day) | Test STW MAE (vehicles/day) | Test STW sMAPE (%) | Δ RMSE vs lag-7 (%) | Δ sMAPE vs lag-7 (pp) |
|---|---|---|---|---|---|---|
| Baseline: seasonal naïve (t−7) | 0.00 | 4,077.3 | 2,337.5 | 9.21 | 0.0 | 0.00 |
| ExtraTreesRegressor (600, leaf=1) | 5.95 | 2,815.6 | 1,828.5 | 6.91 | 30.9 | 2.30 |
| RandomForestRegressor (400, leaf=2) | 13.67 | 2,944.2 | 1,870.0 | 7.13 | 27.8 | 2.08 |
| HuberRegressor | 6.90 | 3,157.3 | 1,979.1 | 7.58 | 22.6 | 1.63 |
| ElasticNet (a=0.0003, l1=0.5) | 14.19 | 3,170.0 | 2,099.5 | 8.02 | 22.3 | 1.18 |
| Ridge (α=1) | 0.06 | 3,192.3 | 2,132.9 | 7.98 | 21.7 | 1.23 |
| LinearRegression | 0.12 | 3,208.6 | 2,158.6 | 8.00 | 21.3 | 1.20 |
| XGBoost (1200, lr=0.03, d=8) | 1.80 | 3,279.4 | 1,993.0 | 7.65 | 19.6 | 1.55 |
| HistGradientBoosting (it=400, d=6) | 1.17 | 3,533.0 | 2,152.8 | 7.99 | 13.3 | 1.22 |
| MLPRegressor (128×64) | 9.20 | 4,014.9 | 2,981.3 | 9.29 | 1.5 | -0.09 |
| Baseline: rolling mean (7d) | 0.00 | 5,010.7 | 4,137.0 | 12.80 | -22.9 | -3.59 |
| Baseline: naïve (t−1) | 0.00 | 5,603.4 | 4,153.8 | 13.98 | -37.4 | -4.77 |

Table 1. Model comparison on the station-weighted test set (STW). Lower RMSE/MAE/sMAPE indicate better accuracy; Δ columns report improvement relative to the seasonal naïve lag-7 baseline.

Across the model family, tree ensembles are the clear top performers. ExtraTrees provides the best overall out-of-sample accuracy on the operational, station-weighted metrics (test STW RMSE = 2,815.6; test STW sMAPE = 6.91%), corresponding to a 30.9% RMSE reduction and a 2.30 sMAPE-point improvement relative to lag-7. RandomForest is close (2,944 RMSE; 7.13% sMAPE) but trains substantially slower. Linear/robust models (Ridge/LinearRegression/Huber/ElasticNet) remain competitive—an encouraging sign that the engineered lags/rolls/calendar features are strong—but they do not match the best ensemble accuracy. Boosting methods improve on lag-7, but under this feature set and split they fall short of ExtraTrees on the held-out test window.

Final model selection: ExtraTreesRegressor (ET)

Based on the controlled benchmark, we select ExtraTreesRegressor as the final forecasting model. With station identity included as one-hot features, ET captures station-specific baselines while still exploiting shared lag/rolling/calendar signals, yielding the lowest station-weighted test error among evaluated approaches while remaining fast enough for practical retraining and iteration. Relative to a credible seasonal naïve baseline, ET reduces both percent error (STW sMAPE 9.21% → 6.91%) and real-unit error (STW RMSE 4,077 → 2,816 vehicles/day), indicating that it is learning signal beyond simple weekly seasonality in a way that generalizes across the station portfolio.

In practical terms, ET is a strong workflow-ready choice because it combines:

  1. The lowest station-weighted test error among the evaluated models (RMSE, MAE, and sMAPE alike).
  2. Fast training (≈6 seconds in this benchmark), which keeps frequent retraining and iteration practical.
  3. A single global model that still adapts per station: one-hot station identity supplies station-specific baselines while the lag, rolling, and calendar features are shared across the network.


Station level improvement

Those are the reasons ExtraTreesRegressor (ET) is a strong, workflow-ready choice, but a question remains: does this improvement hold broadly across the network, or is it driven by a small subset of stations while others see little benefit (or even regress)? To answer that, Figure 4 shifts from “overall benchmark averages” to a station-level distribution, showing how often ET beats lag-7 and how large those gains are across the full set of sensors.

Figure 4. Histogram of per-station sMAPE improvement relative to the lag-7 baseline, where values to the right of 0 indicate ET is more accurate. ET improves 98.6% of stations, showing the gain is broad rather than driven by a few outliers.

Each bar counts how many stations fall into a given range of Δ sMAPE (percentage points), where Δ = baseline (lag-7) − ET. A dashed vertical reference line at 0 marks the break-even point: values to the right of 0 mean ET has lower percent error than the “same day last week” baseline, while values to the left of 0 mean lag-7 still wins for that station. The headline result—Improved: 98.6%—means ET reduced test error for nearly every station (roughly 73 of 74), indicating the gains are network-wide rather than confined to a handful of outliers.

The distribution is concentrated on the positive side, with most stations clustering around roughly ~1 to ~3 sMAPE points of improvement, and the highest density in the ~1.7–2.2 range. Operationally, that’s exactly what you want: not a fragile “home run” driven by a few stations, but many independent time series each getting a consistent reduction in relative error.

The tails are informative, too. There is a very small left tail (essentially one station with negative Δ, near −1), which is expected in a real traffic sensing environment. A station can underperform a general-purpose model for reasons unrelated to the algorithm’s average strength—intermittent sensor issues, atypical coverage gaps, localized operational changes, or locations where the lag-7 heuristic is already exceptionally strong and hard to beat. On the other end, there is a right tail reaching to ~6+ sMAPE points, indicating a small subset of stations where ET delivers large wins. These are typically the locations where week-to-week patterns aren’t stable—corridors affected by construction detours, event-driven surges, seasonal commuting shifts, or other disruptions that cause the “same day last week” baseline to break down. Taken together, Figure 4 supports the claim that ET’s improvement is broad and scalable across the station portfolio, while still providing especially strong gains where the baseline is least reliable.


Added value of ET

The station-level distribution shows that ExtraTreesRegressor (ET) delivers broad, consistent gains, not improvements driven by only a small handful of stations. The next step is to understand how those gains vary across the network’s natural scale differences—from low-volume local segments to high-volume corridors—because the lag-7 baseline behaves differently in each regime. Figure 5 slices the same improvement metric by typical station volume, showing where ET provides the biggest lift and whether it generalizes reliably across both small and large stations.

Figure 5. Left: per-station improvement (Δ sMAPE) versus mean station volume (log scale) with a binned median trend and IQR band, showing ET’s lift remains positive across station scales. Right: the same improvements summarized by low/mid/high volume tiers, highlighting larger gains at low-volume stations and consistent positive lift on high-volume corridors.

Figure 5 drills into where ExtraTreesRegressor (ET) adds the most value relative to a seasonal naïve baseline (predicting “same day last week”). The key quantity on the y-axis is Δ sMAPE (percentage points) = baseline − ET, so values above zero mean ET is more accurate and the dashed horizontal line at 0 marks “no difference.”

Panel A (left) plots this improvement against each station’s mean daily volume (x-axis on a log scale, spanning roughly the sub-1k range up to ~100k+). Each dot is one station. The green line is the binned median improvement across stations of similar scale, and the shaded band shows the interquartile range (25th–75th percentile) within each bin. This view matters because traffic networks are inherently multi-scale: a model that only works on high-volume corridors (or only on low-volume sensors) isn’t operationally dependable. Here, the median trend stays consistently positive across the full range, meaning ET improves on lag-7 for small stations and large corridors.

The shape of the trend is informative. Improvements are largest at the low-volume end (median roughly ~3+ sMAPE points for the smallest stations), then decline as volume increases (median around ~2 pp through the mid-volume range), and settle into steady ~1.4–1.8 pp gains on the highest-volume corridors, with a mild rebound at the far right. The shaded IQR band is widest for low-volume stations (more variability) and generally tightens as volume grows, which fits expectations: low-volume sites tend to be noisier and more idiosyncratic, while high-volume corridors are more stable.

Interpretation: ET adds the most value where the lag-7 baseline is least trustworthy—often smaller, noisier stations—while still delivering steady, positive gains on major corridors. The low-volume regime is where week-to-week “same day last week” assumptions can break: localized detours, construction impacts, weather sensitivity, school/event schedules, or sensor intermittency can introduce changes that a simple seasonal rule can’t anticipate.

Notably, there is only a minimal below-zero presence (roughly one station falling under Δ = 0), which is better treated as a diagnostic than a red flag. It typically indicates a location where lag-7 is already extremely strong (highly repeatable weekly patterns) or where station-specific data quirks (coverage gaps, anomalies) reduce the benefit of a global model. Operationally, it’s a candidate for QA review or a simple fallback rule if needed.

Panel B (right) summarizes the same story by grouping stations into volume tiers (“low / mid / high”) and showing boxplots of Δ sMAPE for each tier (with individual station points overlaid). The low-volume tier shows the largest typical improvement and the widest spread, meaning ET often helps a lot but results vary depending on volatility. The mid tier remains clearly positive with moderate variability. The high tier stays positive and comparatively tight, with a single negative outlier, consistent with the idea that a small minority of large corridors are already very well-served by lag-7 (or exhibit unique behavior a pooled model doesn’t perfectly fit).

Overall, Figure 5 demonstrates two practical strengths of the ET approach: broad coverage and targeted benefit. ET generalizes across heterogeneous station scales without per-station tuning, and it delivers its largest lift where agencies often struggle most—irregular demand and disrupted conditions where “last week” is a weak proxy—while still producing reliable error reductions on the corridors that matter most.


Station level forecasts

Figure 5 establishes the model’s value in aggregate—showing that ExtraTreesRegressor (ET) improves accuracy across station scales and delivers the largest lift where the baseline is weakest. To make that improvement tangible, the next visualization shifts from summary distributions to concrete time-series examples, illustrating what “better forecasts” actually look like day-to-day. Figure 6 presents a small set of representative stations and overlays the actual volumes, the lag-7 baseline, and ET predictions, so the reader can visually confirm how ET behaves in practice—especially when the level drifts or when last week’s pattern is a poor proxy for this week.

Figure 6. Six representative stations across very different volume levels, overlaying actual volumes, the lag-7 baseline, and ET predictions, with per-station sMAPE and “ET wins % days” annotations. Most examples show ET tracking level shifts and irregular weeks better than repeating last week; two high-volume examples illustrate the regime where lag-7 can remain very hard to beat on short windows.

Figure 6 is a set of “small multiples” showing the last 90 days of the held-out test period (roughly Oct 6–Dec 29, 2024) for six stations. Each panel overlays three lines: Actual traffic volume (blue), the seasonal naïve baseline (orange dashed; “same day last week”), and ExtraTreesRegressor predictions (green). Because this figure focuses on a 90-day slice, the per-panel sMAPE values should be read as local performance over that window (not necessarily identical to whole-test-window station metrics).

Each panel includes station-specific performance annotations:

  1. sMAPE over the 90-day window for ET and for the lag-7 baseline, along with the difference (Δ) between them.
  2. “ET wins % days”: the share of days in the window on which ET’s forecast is closer to the actual volume than the baseline’s.

That combination matters operationally: sMAPE summarizes overall accuracy, while “wins % days” tells you whether ET is consistently better or whether its gains come from a smaller subset of days.

What the six examples show

Station 214 (low volume, trending downward): Annotation: ET 12.99% vs Base 21.27% (Δ +8.28 pp) | ET wins 69.2% days. This is a textbook case where lag-7 breaks down: the series drifts downward and the baseline repeatedly echoes last week’s higher levels, overshooting as the level falls. ET tracks the declining level more smoothly and stays closer to the actual line for most of the window, consistent with the large Δ.

Station 4126 (low volume with sharp anomalies): Annotation: ET 13.94% vs Base 18.61% (Δ +4.67 pp) | ET wins 60.5% days. Here the baseline’s failure mode is “copy-forward anomalies”: a large spike can create a misplaced spike one week later under lag-7. ET is less likely to replay those one-off anomalies at exactly +7 days, improving sMAPE even if neither method perfectly hits extreme spike days.

Station 149 (mid volume, strong weekly structure): Annotation: ET 9.44% vs Base 11.56% (Δ +2.11 pp) | ET wins 65.3% days. Weekly seasonality is visible and lag-7 is already decent, but ET is consistently closer during weeks where the amplitude or level shifts slightly, producing a steady (if smaller) lift.

Station 6734 (mid volume, irregular swings): Annotation: ET 11.01% vs Base 14.73% (Δ +3.71 pp) | ET wins 65.6% days. The baseline tends to be too jagged and often misaligns peaks/troughs by repeating last week’s noise. ET dampens that echoing behavior and better matches the day-to-day level, yielding a meaningful improvement.

Station 11273 (very high volume, lag-7 is hard to beat): Annotation: ET 6.87% vs Base 5.95% (Δ −0.92 pp) | ET wins 38.5% days. This is an example of the “ceiling” regime: when weekly seasonality is extremely strong and stable, lag-7 can be near-optimal—especially over a short window like 90 days. ET still tracks the overall weekly shape, but on this slice the baseline is simply better.

Station 10800 (very high volume, competitive regime): Annotation: ET 6.68% vs Base 6.26% (Δ −0.42 pp) | ET wins 40.8% days. Same story: both methods are close and the baseline edges out ET on this window. Including cases like this is useful because it shows ET is not “magic everywhere”—its advantage is largest when week-to-week repetition is less reliable.

Takeaway from Figure 6

Across these examples, ET’s advantage is most visible in operationally important situations: trend shifts, irregular weeks, and stations where “same day last week” is a weak proxy (214/4126/149/6734). The two high-volume examples illustrate the opposite regime: very stable weekly seasonality, where lag-7 can remain extremely competitive and may win on a short time slice.

Those station-level examples provide an intuitive, “on-the-ground” view of how ET behaves, but a forecasting model still needs one more validation step before you would trust it operationally: showing that performance gains persist repeatedly across time, not just in a single held-out window or a handful of illustrative stations. The next section therefore evaluates ET under a rolling-origin backtest that mimics deployment—retraining on all history available at each point and forecasting the next horizon—so the reported improvements reflect stability across seasons and changing traffic conditions.


Rolling backtest results

To make sure ET’s performance wasn’t a “lucky” result from a single train/test split, we evaluated it with a rolling-origin (walk-forward) backtest using an expanding training window. This mirrors how the model would run in production: at each fold, train on all history available up to that date, then forecast the next 28 days. The backtest uses the same feature design as the main benchmark—including one-hot station identity—so each fold tests the same modeling selection (a pooled/global model with station fixed effects).

The rolling schedule produces 16 folds, with test windows starting 2022-06-29 through 2024-10-16 (each fold spaced 56 days apart), and the final fold ending 2024-11-12 due to the fixed 28-day horizon (i.e., the backtest uses the full history available through late 2024 but evaluates only folds that have a complete forward window). Each fold evaluates a consistent system slice—typically ~1.5k–1.9k station-days (min 1,517, max 1,902, mean 1,780) across ~70–74 stations (min 70, max 74, mean 73)—so changes in accuracy largely reflect model behavior rather than shifting station coverage.

Because fold-by-fold results can swing with seasonality and short-term volatility, the most useful summary reports both a mean (overall average across folds) and a median (a “typical fold” value that is less influenced by extreme windows). Table 2 compares ET vs. lag-7 on two complementary metrics: sMAPE (relative/percentage error) and RMSE (absolute error in vehicles/day), both computed in volume space for operational interpretability.

| Metric | Mean | Median |
|---|---|---|
| Volume sMAPE (baseline lag-7) | 10.734% | 8.965% |
| Volume sMAPE (ET) | 7.449% | 6.296% |
| Volume sMAPE improvement (pp) | 3.285 | 2.811 |
| Volume RMSE (baseline lag-7) | 6,448 | 5,751 |
| Volume RMSE (ET) | 4,038 | 3,401 |
| Volume RMSE improvement (%) | 34.495% | 35.445% |

Table 2. Rolling backtest summary across 16 folds (28-day horizons; expanding-window training). Values are shown as the mean and median across folds.

Interpreting Table 2, ET improves both relative and absolute accuracy consistently across time. On a percentage basis, mean sMAPE drops from 10.734% → 7.449% (a +3.285 pp improvement), while the typical fold (median) drops from 8.965% → 6.296% (+2.811 pp). In practical units, mean RMSE falls from 6,448 → 4,038 vehicles/day, and the typical fold falls from 5,751 → 3,401 vehicles/day, corresponding to a ~34.5% mean RMSE reduction and a ~35.4% median RMSE reduction.

Two operationally important details sit behind these averages. First, the baseline shows occasional high-volatility windows (e.g., periods where sMAPE spikes well above typical levels), which inflate the mean—exactly the kind of conditions where agencies most need forecasts to remain stable. Second, the improvements are not confined to a subset of folds: in this backtest, ET beats lag-7 in every fold on both sMAPE and RMSE (the minimum fold improvement remains positive for both metrics). That’s the strongest signal of “workflow-ready” performance—the lift persists repeatedly across seasons and years, not just in one favorable split.

Figure 7 summarizes the rolling-origin backtest as per-fold improvement in volume sMAPE (Δ sMAPE = baseline − ET). Each bar corresponds to one 28-day forward test window (folds stepped 56 days apart), and every bar is above zero, meaning ExtraTreesRegressor beats the lag-7 baseline in 16/16 windows. The chart includes two reference lines to anchor interpretation: the mean improvement (dotted) at 3.29 pp and the median improvement (solid) at 2.81 pp. The annotation box in the top-right summarizes the full distribution: improvements range from 1.00 → 7.48 pp, with Median: 2.81 pp and Mean: 3.29 pp.

Figure 7. Per-fold sMAPE improvement over a 28-day horizon in a walk-forward backtest (step = 56 days), with mean and median reference lines. All bars are above zero, indicating ET beats lag-7 in 16/16 windows and provides the largest lift during volatile periods.

Two takeaways matter operationally:

  1. The uplift is repeatable across time, not split-dependent. The fact that all 16 bars are positive means the model isn’t “winning” only in one favorable season or year—ET is consistently extracting signal beyond simple weekly seasonality under a deployment-like workflow (train on all history up to a date, forecast the next 28 days).

  2. ET helps the most when the baseline is under stress. Bar heights vary because some 28-day windows are calm while others contain larger regime shifts and volatility. The largest improvement occurs in the window explicitly labeled “Most volatile baseline window” (winter 2022–2023): the baseline sMAPE spikes to 27.19%, while ET reduces it to 19.71%, a +7.48 pp gain. A second standout window appears in early 2024, where the baseline is 16.03% and ET drops it to 8.81% (+7.22 pp). By contrast, in the calmest periods (where weekly seasonality is already highly predictive), the improvement compresses toward the low end of the range (as small as ~1.00 pp). That pattern is exactly what you want: smaller gains when the baseline is already strong, and large gains when week-to-week “copy last week” rules are most likely to fail.

These results show the forecasting pipeline is learning consistent, actionable signal that generalizes across time. In a fair benchmark, ET improves on a credible lag-7 baseline while remaining straightforward to retrain, and the station-level analyses show those gains apply broadly (improving ~98.6% of stations) rather than being driven by a small subset. Most importantly for an operational setting, the rolling backtest demonstrates that the uplift is repeatable under a deployment-like workflow: ET wins in 16/16 forward-looking windows, with the largest benefits concentrated in volatile periods when agencies most need reliable forecasts.


System wide forecast

After Figure 7 establishes that ET’s uplift is repeatable across time (winning in 16/16 rolling windows), the next question is a different kind of “production realism” check: do those station-level gains translate into a coherent system-wide signal? Traffic agencies often care less about a single sensor’s daily error and more about network totals and trend direction—the quantities that drive staffing, messaging, and corridor-level monitoring. Figure 8 therefore aggregates forecasts across stations to compare system-total volume from ET vs. actual at a weekly cadence, while also showing how many stations are contributing to that total each week.

Figure 8. Top: weekly system total volume comparing actual vs ET (and a smoothed 8-week mean), with a lag-7 smoothed baseline for context; ET closely tracks the system trend and turning points. Bottom: weekly active-station counts to show coverage is relatively stable, supporting that the system-total comparison reflects demand dynamics rather than major reporting swings.

Top panel (system total volume, weekly): This plot overlays three perspectives on the same system-wide signal:

  1. Actual weekly system totals, shown as a light line with an 8-week rolling mean to emphasize the underlying trend.
  2. ET forecasts aggregated to the same weekly totals, also smoothed with an 8-week rolling mean.
  3. A smoothed lag-7 baseline reference, included for context on how much of the trend weekly repetition alone captures.

The main takeaway is that the ET 8-week mean tracks the Actual 8-week mean closely throughout the period, capturing the broad system pattern: an early rise from the mid-15–16M range into the high-17M/low-18M range, a relatively stable plateau, and then a gradual decline toward the end of the window. Weekly totals are naturally jagged—there are occasional sharp spikes and dips—but ET does not “chase” every one-week anomaly. Instead, it stays aligned to the medium-term level, which is typically what matters for operational planning and capacity expectations.

The lag-7 smoothed reference is also informative here: it stays fairly close to the system trend for much of the horizon, which is expected when weekly seasonality is strong. The fact that ET’s smoothed line remains tightly aligned with actual—and is often at least as close during periods of drift—supports the interpretation that station-level improvements translate into reasonable aggregate calibration, not just isolated station wins.

Bottom panel (coverage: active stations each week): Because system totals depend on which stations report, the lower subplot shows the number of active stations contributing each week. Coverage varies modestly (roughly ~60–71 active stations) without large step-changes, making it unlikely that the major movements in the top panel are driven purely by stations dropping in or out. This panel doesn’t eliminate coverage effects entirely, but it provides important context: the system-total comparison above is primarily reflecting system demand dynamics rather than dramatic reporting swings.

Taken together, the results show a consistent and operationally meaningful pattern: ExtraTreesRegressor (ET) beats a credible lag-7 baseline on station-weighted metrics, and it does so broadly across the network and repeatedly across time. In the fair benchmark, ET is the top-performing model on held-out test performance under the shared feature set, and the station-level breakdown confirms the lift is not driven by a small subset of sensors—ET reduces test sMAPE for ~98.6% of stations. The rolling-origin backtest strengthens the production realism story by showing the improvement persists under an expanding-window workflow: ET wins in 16/16 folds, with per-fold volume-sMAPE gains ranging from 1.00 to 7.48 percentage points (median: 2.81 pp; mean: 3.29 pp), and the largest advantages occurring in volatile windows where “copy last week” is most likely to fail. Finally, the system-total view shows forecasts remain coherent at the aggregate level, with ET tracking the same medium-term trend direction agencies use for planning and monitoring while coverage remains relatively stable.


Discussion

This project set out to answer a practical operations question: can a single global model reliably outperform a strong “same day last week” rule for station-level traffic forecasting, without becoming fragile or overly complex? The results support a clear “yes.” Under a fair, time-based benchmark with the same feature set for every model, ExtraTreesRegressor (ET) delivered the strongest station-weighted performance on held-out data, improving both percent error (STW sMAPE) and real-unit error (STW RMSE) relative to lag-7. A key enabler of this global approach is that the model includes station identity via one-hot encoding, allowing ET to learn station-specific baselines while still sharing temporal structure (lags, rolling statistics, calendar effects) across the network. More importantly for a traffic authority use case, the lift is not limited to one favorable window: the rolling-origin backtest shows ET’s improvements persist across seasons and years, and the station-scale slicing shows ET generalizes from low-volume sensors to high-volume corridors.

Why ET works well for traffic volumes

Two properties of traffic data show up repeatedly in the diagnostics:

  1. Strong weekly seasonality plus local drift. Lag-7 is a credible baseline because many corridors repeat weekly patterns. But the baseline fails when the level shifts (construction, weather, schedule changes, events) or when week-to-week patterns become irregular. ET’s advantage is that it can still “use” weekly structure through lagged/rolling features while also learning interactions that signal drift—e.g., recent rolling means changing, recent volatility increasing, calendar effects, and station-specific intercepts via one-hot station identity.

  2. A multi-scale network. The station fleet is naturally heavy-tailed: a few corridors carry enormous volume, while many sensors are smaller and noisier. Using station-weighted (STW) reporting matters here because it prevents the evaluation from being “won” by just getting the top corridors right. The fact that ET still wins on STW metrics indicates it is improving performance broadly, not simply optimizing for the biggest stations.

Interpreting the station-level and scale-level patterns

The station-level distribution (Figure 4) and the volume-tier slicing (Figure 5) sharpen what “better” means operationally:

  1. The lift is broad: ET reduces test sMAPE for roughly 98.6% of stations, so the gain is a network property rather than a handful of outliers.
  2. The lift is largest where lag-7 is least reliable: low-volume, noisier stations see the biggest improvements, while high-volume corridors still see steady positive gains.
  3. The rare stations where lag-7 still wins are best treated as diagnostics, candidates for QA review or a simple fallback rule rather than evidence against the global model.

Robustness across time

The rolling-origin backtest (Figure 7) is the most deployment-relevant validation in the project. A single train/test split can be misleading in time series—especially when demand regimes shift—so the expanding-window workflow is a better proxy for how the model would behave if retrained regularly. Across the 16 folds, per-fold volume-sMAPE improvements range from 1.00 to 7.48 percentage points (median: 2.81 pp; mean: 3.29 pp), with the largest gains occurring in the most volatile baseline windows. ET winning in every fold suggests two things that matter for agencies:

  1. The uplift is repeatable across time rather than dependent on one favorable split: the model keeps extracting signal beyond weekly seasonality as the training window expands.
  2. The largest gains arrive precisely when the baseline is under stress, in volatile windows where “copy last week” breaks down, which is when reliable forecasts matter most operationally.

System-total behavior and what it implies for planning use cases

Traffic authorities rarely consume forecasts only as per-sensor daily predictions. They also care about network totals and trend direction for staffing, messaging, and corridor-level monitoring. The system-total plot (Figure 8) is a useful realism check because it confirms ET’s station-level improvements don’t come at the cost of producing incoherent aggregate behavior. ET tracks the medium-term trend (rise, plateau, gradual decline) at the system-total level, and remains at least as well-aligned as the lag-7 reference while avoiding overreacting to noisy week-to-week swings. That’s the right bias for many planning applications: stable totals that move with true demand rather than echoing week-to-week noise.

Limitations and caveats

Several constraints are worth stating explicitly:

  1. The model relies only on each station’s own history (lags, rolling statistics, calendar effects, station identity); it has no visibility into weather, planned construction, or events, which appear here only as candidate future inputs.
  2. The modeling window is deliberately restricted to the post-June 2021 regime, so behavior under another abrupt, COVID-scale regime shift is untested.
  3. Results depend on sensor data quality: validity filtering and coverage screens exclude partial or structurally ambiguous station-days, so stations with intermittent reporting are under-represented.
  4. A small number of stations (and some short windows on very high-volume corridors) remain better served by lag-7, so per-station monitoring or fallbacks are still advisable.

Next steps toward production

If this were moved closer to production, the next steps are straightforward and aligned with agency workflows:

  1. Define the decision target: daily station forecasts, weekly system totals, or both. That choice should drive the headline metric (STW vs volume-weighted vs a paired reporting set).
  2. Add lightweight exogenous signals (optional): weather, planned construction/event flags, school calendar indicators, and holidays tend to improve “irregular week” performance disproportionately.
  3. Station-level monitoring: track which stations consistently underperform ET vs lag-7 and route them into a QA / feature review loop.
  4. Retraining cadence: the backtest supports an expanding-window retrain schedule; operationally, a weekly or monthly refresh is usually reasonable depending on data latency and seasonality.
  5. Fallback policy: for stations where lag-7 remains dominant, consider a rule-based fallback or an ensemble that chooses per station (or per week) between ET and lag-7 based on recent performance.
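As one illustration of item 5, a per-station chooser can be a few lines. This is a minimal sketch under assumed column names (`smape_et`, `smape_lag7`) and an illustrative zero-gain threshold, not a prescribed design.

```python
import pandas as pd

def choose_per_station(recent: pd.DataFrame, min_gain_pp: float = 0.0) -> pd.Series:
    """recent: one row per station with sMAPE for ET and lag-7 over a recent window.
    Returns which forecaster to serve for each station."""
    gain = recent["smape_lag7"] - recent["smape_et"]
    return gain.gt(min_gain_pp).map({True: "extra_trees", False: "lag7_baseline"})
```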

Bottom line

The main takeaway is that ET wins in the way planning needs it to: broadly across stations, across station scales, and repeatedly across time, with the largest benefits showing up in volatile windows when simple weekly repetition is least reliable. That combination makes ET a strong candidate for operational forecasting in a traffic authority setting, especially when paired with clear station-level monitoring and a retraining workflow that mirrors the rolling backtest setup.


Conclusion

In transportation planning and operations, the value of a traffic forecast is not the metric itself—it is the decision risk it reduces. Day-ahead to month-ahead volume expectations shape how agencies interpret corridor conditions, communicate anticipated congestion, prioritize attention during abnormal weeks, and separate real demand shifts from noise. A forecasting approach is only useful if it improves on the strongest practical fallback (lag-7 “same day last week”) consistently, because in many settings the baseline is already “good enough” until it suddenly is not.

The results show that a single global model can provide that reliability. Under a controlled, time-based benchmark, ExtraTreesRegressor (ET) improves materially on lag-7 in both percent and real-unit error (STW sMAPE 9.21% → 6.91%; STW RMSE 4,077 → 2,816 vehicles/day) while improving ~98.6% of stations. That breadth matters for planning because networks are multi-scale: if a method only improves major corridors, it risks under-supporting the smaller and more volatile locations where construction, detours, and local disruptions often first show up. Station-weighted reporting aligns the evaluation with that planning need: network-wide usefulness rather than improvements dominated by a few high-volume sites.

The rolling-origin backtest is the most decision-relevant evidence. ET beats lag-7 in 16/16 forward windows, with improvements ranging from 1.00 to 7.48 sMAPE points (median 2.81 pp, mean 3.29 pp). The largest gains occur during volatile periods when the baseline “copy last week” rule degrades sharply. Practically, these are the windows where agencies are most likely to overreact to noise, miss a true level shift, or carry forward a stale expectation into corridor monitoring and planning conversations. By reducing baseline blow-ups in those stress periods, ET functions as a more stable forecasting reference that planners can use with greater confidence.

Finally, the system-total comparison shows that these station-level gains do not come at the expense of coherence at the scale where many planning decisions are made. Aggregated forecasts track medium-term trend direction without echoing week-to-week anomalies, supporting the use of forecasts as an input to regional and corridor-level planning signals rather than as a brittle per-sensor gadget. This is the core planning implication: a model that is consistently better than lag-7, especially when conditions are changing, is a stronger foundation for developing transportation forecasts at regional, corridor, and project levels and for synthesizing technical outputs into information that supports policy and planning decisions.
