Julia Margie — julialmargie@gmail.com — jmargie@uchicago.edu

Motivation

Over the summer, Roman Anthony debuted for the Red Sox and made an immediate impact, which generated a lot of buzz. Some of the responses to his play stuck with me for a while because I wasn't sure I believed the conclusions these articles drew from just a few statistics.

"Who he reminds me of, it's hard," Alex Cora said. "I don't want to say (Barry) Bonds, of course. Probably (Juan) Soto, without the flashiness, early on. It's a good at-bat. He's not going to chase. Even when he doesn't get hits, you're like, holy s---, that's a good at-bat."^{1. Rosenthal, Ken. 2025. "Red Sox Rookie Roman Anthony Has Passed Every Major-League Test so Far. Now He Takes on Yankee Stadium." The New York Times, August 21.}

"Anthony is averaging 4.25 pitches per plate appearance," Rosenthal wrote. "His walk rate is 14.6 percent. In those categories ... he would rank among the league leaders if he had enough playing time."^{2. Keane, Colin. 2025. "Alex Cora Compares Red Sox Star To Juan Soto, With A Disclaimer." NESN.Com, August 21.}

"Players to rank in 95th percentile in hard-hit percentage and chase% [the percentage of pitches outside the strike zone that a batter swings at] in 2025, there's two of them. It's Juan Soto and Roman Anthony," Mark DeRosa shared on "MLB Central."^{3. Crisafulli, Owen. 2025. "Roman Anthony Finds Himself In Exclusive Company Alongside Juan Soto." NESN.Com, August 12.}

It is easy to mislead people with statistics, especially cherry-picked ones. I don't think the people quoted above are trying to mislead anyone, but comparing anyone to Juan Soto is a big deal! That kind of player changes a team! (Despite the 2025 Mets, unfortunately.) Anthony is about my age, and, despite the data not supporting it, I am instinctively fascinated by high walk rates in younger players (according to Statcast, the relationship between age and BB%, i.e. number of walks / number of plate appearances, has an r² of 0.01 in 2025). Also, a recent ESPN^{4. David Schoenfield. 2025. “The Number That Will Decide 2026 for All 15 American League Teams.” ESPN.Com, December 29. URL.} article listed a lot of these statistics, highlighting the walk rate, as reasons the Red Sox should be optimistic about next year. While I am nothing if not a pessimist about the Yankees' chances in the AL East, I'd still like to be a realistic fan. If Anthony is actually a game-changing player like Soto, I'd like to regulate my expectations about the upcoming season.

So, let's take a look! Does Roman Anthony react to pitches like early Juan Soto?

Please note that I am hoping for this analysis to be accessible with minimal baseball knowledge, though I do assume a level of knowledge that includes things like walks, strikes, at-bats, and contact. Words underlined will show a definition when hovered over, and I'll try to do that for more obscure statistics. If you think something else should be defined or that I did something wrong elsewhere, feel free to email me! I always want to improve both my analysis and communication. I love baseball, and I truly think the numbers make it more fun. I always want people to be able to share in my joy of baseball, so I hope that this is approachable enough to convince you!

Planning

Before starting any analysis, I like to make sure I know exactly what question I am asking. Here, that is:

Does Roman Anthony's plate approach at the MLB level mirror that of Juan Soto?

In order to answer this, I had to choose whether or not I wanted to compare 2025 Soto to 2025 Anthony (which would make direct numeric comparisons easier) or rookie Soto to rookie Anthony, i.e. 2018 Soto to 2025 Anthony. I decided to look at their rookie seasons, because I want the end goal of this analysis to be how Anthony might grow as a player, not just how they currently line up. Also, Cora specifically compares him to "early" Juan Soto. So, the question became:

Does Roman Anthony's plate approach in 2025, at the MLB level, mirror that of Juan Soto in 2018?

To answer this question, I need to specify what a player's approach is. This is an exceptionally complicated question, so I narrowed it once again and considered two things: the decision to swing, given a certain pitch location and pitch type (fastball, breaking ball, or offspeed, classified by velocity, movement, and spin), and the quality of contact or lack thereof. Unfortunately, we can't look at squared-up rate (the percentage of swings where the exit velocity is at least 80% of what is possible, given the bat speed and the pitch speed) before 2023 because bat speed was not tracked, so I had to consider quality of contact differently. Recall, however, that we are interested in approach rather than results, specifically whether a given player has a "good at-bat" (Cora, quoted by Rosenthal). So, I only examined swing% (the percentage of pitches that a batter swings at) and contact% (the percentage of swings on which the batter makes contact, regardless of result). Since DeRosa cited HardHit% (the percentage of batted balls hit with an exit velocity of 95 mph or greater), I also looked at exit velocity (the speed of the baseball as it comes off the bat, in mph). I considered isolating chase% (the percentage of pitches outside the strike zone that a batter swings at), but decided that looking at swing% by zone would capture the same thing. We are now trying to answer:

Does 2025 MLB Roman Anthony swing and make contact at the same rate as 2018 MLB Juan Soto after controlling for pitch location and type, and normalizing across seasons?

I am specifically interested in swing% because DeRosa mentioned chase% in his tweet, and chase% is just swing% restricted to pitches outside the strike zone. It is somewhat misleading to say that Anthony has a low chase% because he has a good eye for the zone without noting that he also has a low swing% in all parts of the zone.
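To make the relationship between these rates concrete, here is a minimal sketch in R. The rows are made up for illustration; the `description` values and the `zone` numbering (1-9 inside the strike zone, 11-14 outside) follow the public Statcast pitch export as far as I know, so treat them as assumptions. Bunt outcomes are ignored to keep the example short.

```r
# Toy pitch-level data (invented rows, Statcast-style column names).
pitches <- data.frame(
  description = c("ball", "called_strike", "swinging_strike", "foul",
                  "hit_into_play", "ball", "swinging_strike", "foul"),
  zone        = c(13, 5, 14, 2, 5, 11, 8, 12)  # 1-9 in zone, 11-14 out of zone
)

# Outcomes that count as a swing, and the subset that count as contact.
swing_events   <- c("swinging_strike", "swinging_strike_blocked",
                    "foul", "foul_tip", "hit_into_play")
contact_events <- c("foul", "foul_tip", "hit_into_play")

pitches$swing   <- pitches$description %in% swing_events
pitches$contact <- pitches$description %in% contact_events
pitches$in_zone <- pitches$zone <= 9

swing_pct   <- mean(pitches$swing)                    # swings / all pitches
contact_pct <- mean(pitches$contact[pitches$swing])   # contact / swings
chase_pct   <- mean(pitches$swing[!pitches$in_zone])  # swings / out-of-zone pitches
```

Chase% uses the same `swing` flag as swing%, just computed over the out-of-zone rows, which is why looking at swing% by zone covers it.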

Methods

First, instead of running a power analysis (to make sure I had enough data, because Anthony only played a little over half the season), I relied on a stabilization analysis from FanGraphs, which found that both swing% and whiff%/contact% (the percentage of swings that miss or make contact, respectively) stabilize around 40 PA (plate appearances). This meant I had enough data to begin. I downloaded all of the pitches from 2018 and 2025 from Statcast via pybaseball (baseballr was acting up).

I decided to do my analysis in R rather than Python (which I have more experience with), both because I wanted more practice with R and because it has a very straightforward library for Generalized Additive Models (GAMs), which essentially piece together a bunch of possibly non-linear relationships over a given space. This works better than linear and polynomial models, even piecewise ones, because the fit is smooth and allows for different relationships over different parts of the surface, while ensuring the resulting model is still consistent. I used the bam() function from the mgcv library in R because it is a version of GAM fitting intended for larger datasets, and there are a lot of pitches thrown in MLB over a season.

If you want more information about the models, click here!

If you have taken linear algebra, you can think of the model fitting as choosing all of our partial models from the same basis, i.e. a minimal generating set. This means combining them is not as complicated. For example, all of the formulas are some form of cubic equation, but they do not have to be the same one over the whole space. Then, we piece them together for a smooth result. The link above provides an example of a polynomial spline, which looks at each section of the data independently, then (intuitively, or, as one of my math professors would say, morally) slides them up or down to 'match' the end points. The GAM is more complicated than this due to penalized regression (which I do not know how to do yet).
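As a rough illustration of the "same basis" idea, the sketch below fits a plain cubic regression spline with base R's `splines` package. This is a simplified, unpenalized stand-in rather than what mgcv does internally, and the data are simulated purely for the example.

```r
library(splines)  # ships with base R

set.seed(1)
x <- seq(0, 1, length.out = 200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)  # made-up wiggly data

# bs() builds a cubic B-spline basis: each column is a piecewise-cubic bump
# that is non-zero over only part of the x range.
basis <- bs(x, df = 8, degree = 3)
dim(basis)  # one row per data point, one column per basis function

# An ordinary linear regression on those columns produces a smooth curve,
# because the fit is just a weighted sum of the basis functions.
fit <- lm(y ~ basis)
smooth_fit <- fitted(fit)
```

A GAM adds a penalty on how wiggly that weighted sum is allowed to be, which is the penalized regression step mentioned above.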

I ran this model in a few ways:

(1) recognizing the pitcher and game as an effect on each pitch, which could overfit the model
(2) not considering each pitcher/game on the resulting decision, which could prioritize certain pitchers/games for the resulting predictions, based on sample size
(3) considering the pitch type (which, again, could overfit the model, and which slowed down the processing significantly).

The results were all approximately the same, but certain numbers may not align exactly across sections because of this. I recognize that this is not best practice, but since this is just a project for fun, I didn't worry too much. I differentiated these results with different colors in the graphs. I also decided against making a model specifically for whiff%, because a whiff given a swing is just the complement of contact given a swing, and the heatmaps were more visually coherent.

For example, one of the swing models looked like this:

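library(mgcv)

# Swing model with batter-specific surfaces plus pitcher and game effects.
# batter_id, pitcher_id, and game_id are assumed to be factors.
m_swing <- bam(
    swing ~ batter_id +                                        # per-batter baseline
        s(plate_x, plate_z, k = 20) +                          # shared location surface
        s(plate_x, plate_z, by = batter_id, k = 10, id = 1) +  # per-batter surfaces, shared smoothing
        s(pitcher_id, bs = "re") +                             # pitcher random effect
        s(game_id, bs = "re"),                                 # game random effect
    data = df,
    family = binomial(),  # swing is a 0/1 outcome
    method = "fREML",
    discrete = TRUE,      # discretize covariates for speed on large data
    select = TRUE         # let terms be penalized out of the model
)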

and another:

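# Simpler swing model: one shared location surface, with batter and pitch
# category as overall shifts and no pitcher or game effects.
m_swing <- bam(
    swing ~ batter_flag +
        pitch_category +
        s(plate_x, plate_z, k = 50),
    data = df,
    family = binomial(),
    discrete = TRUE
)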

The next step was creating prediction tables over a grid of pitch locations, normalizing Roman Anthony to 2025 hitters and Juan Soto to 2018 hitters, and comparing the difference between these results (often referred to as difference-in-differences analysis, or DiD).
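Written out, the comparison looks roughly like the following, where each p-hat is a model-predicted probability (of a swing, or of contact given a swing) for a given zone and pitch category. The symbols are shorthand introduced here for illustration, and the sign convention (Anthony minus Soto) is inferred from the appendix tables, where a positive DiD means Anthony is more likely to do the action.

```latex
\begin{aligned}
\Delta_{\text{Anthony}} &= \hat{p}_{\text{Anthony}} - \hat{p}_{\text{league},\,2025} \\
\Delta_{\text{Soto}}    &= \hat{p}_{\text{Soto}} - \hat{p}_{\text{league},\,2018} \\
\text{DiD}              &= \Delta_{\text{Anthony}} - \Delta_{\text{Soto}}
\end{aligned}
```

For example, for fastballs in the heart of the zone, the appendix table lists a Soto mean of 0.00453 and an Anthony mean of -0.13913, which gives a DiD of about -0.1437.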

Results

Heatmaps

Swinging

The heatmaps show that Soto swung more than Anthony, though this is not necessarily significant. The gap is most visible in the shadow zone (within one baseball length inside or outside the edge of the strike zone, where the strike/ball call depends more on the catcher than it does for other pitches) at the top of the strike zone. These graphs also act as a sanity check for my models, and they pass the eye test (though, as in baseball, the eye test does not necessarily align with the advanced numbers in statistics! In this case, though, it seems to).

One possible limitation of this is that the strike zone will be slightly different for each batter, and the rectangle marked is just the average strike zone (vertically). Both Soto and Anthony do have similar strike zone boundaries, as set by the official scorer and StatCast. Both Soto and Anthony seem to swing less than average in the bottom of the zone, but the 2025 and 2018 league average swing locations are visually nearly identical.

Note: these two graphs were developed by slightly different models

Contact

Soto also made contact at a higher rate than Anthony in the bottom of the zone. This is interesting to me because Anthony swung more in this part of the zone. Given that Soto swings less there but makes more contact, it seems likely that he is reading pitches in the bottom of the zone better than Anthony. I did not run significance testing for top/bottom of zone, but that would be an interesting next step. It seems likely that Soto's nearly 20% higher chance of making contact around the lower edge of the zone would be significant. If I had to give Anthony advice on where to focus (if he wants to be more like Soto), I would tell him to practice making contact with pitches around his knees, even just to foul them off! Another place for further research would be the difference in the likelihood of hitting a ball into play, given contact, and the xwOBA of these contact events (xwOBA is a predictive statistic that estimates how likely a batted ball is to result in the batter being safely on base, weighted by how likely the batter is to make it to a specific base; compare with SLG, wOBA, and OBP, which attempt to measure similar things). Maybe Anthony's contact is more useful!

I think this graph is affected by the height differences of MLB players. Soto and Anthony are both on the taller side for batters, and the lower edge of the strike zone obviously depends on the height of a player's knees. Visually, Soto and Anthony seem to make less contact (given a swing) than average around the bottom edge of the zone, meaning they whiff (swing and miss) more than average near their knees. A good next step for this analysis would be looking at pitches by their distance from the bottom of the declared strike zone, rather than from the ground, which is what Statcast provides. Note that these numbers are not significant, as per later graphs and tables.

Exit Velocity

I added exit velocity (the speed of the baseball as it comes off the bat, in mph) at the end as an exploratory analysis, so I do not have the same analysis for it as for everything else, and I am not sure of statistical significance. It is nonetheless interesting! Rookie Anthony is getting much more power in his contact than rookie Soto. Maybe this is because of age? Maybe Anthony is not swinging at pitches he can't get good contact on? Unfortunately, bat tracking was not implemented in 2018, which means we can't compare their rookie squared-up rates (the percentage of swings where the exit velocity is at least 80% of what is possible, given the bat speed and the pitch speed). In 2025, though, Roman Anthony had a 26.7% squared-up rate (good for 60th percentile, if he were qualified) and Juan Soto clocked in at 32.5% (92nd percentile), despite Anthony having higher bat speed, solid contact% (the percentage of contacted pitches hit just under the requirements for a barrel), and barrel% (the percentage of contacted pitches hit particularly well). Note that the squared-up rate here is measured per swing, not per contact, which disadvantages Anthony due to his (statistically significantly!) higher whiff rate.
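The per-swing versus per-contact distinction can be written as a simple identity. This is a sketch of the arithmetic only, with whiff% taken to mean misses per swing:

```latex
\frac{\text{squared-up}}{\text{swings}}
  = \frac{\text{squared-up}}{\text{contact}} \times \frac{\text{contact}}{\text{swings}}
  = \frac{\text{squared-up}}{\text{contact}} \times \left(1 - \text{whiff\%}\right)
```

So two hitters who square the ball up equally often when they do make contact will still differ in per-swing squared-up rate if one of them whiffs more.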

DiD Effects

These graphs begin to provide significance for the prior analysis. Note that if the line is entirely above or below the axis, the result is significant. This is because the 95% confidence interval (a range built so that, across repeated samples, it would contain the true value about 95% of the time) is entirely positive or entirely negative. If it is positive, Anthony is statistically significantly more likely to do the given action. If it is negative, Soto is. When the line crosses the axis, the interval includes zero, so at the 5% level (generally agreed to be the threshold for significance) we cannot say which of Anthony and Soto is more likely to swing or make contact.

Swinging

In the heart of the zone (more than one baseball length inside the edge of the zone), Anthony is significantly less likely to swing.

For breaking balls (pitches that move a lot because of their spin, usually sideways or downwards, more than expected given just gravity), Anthony is more likely to swing, regardless of zone. For fastballs (pitches that tend to move in a straight line, thrown as fast as a pitcher can throw), Soto is more likely to swing in the heart and chase (more than one baseball length outside of the zone) areas. There is an interesting pattern among pitch categories across zones: fastballs have the highest positive difference in Soto's probability of swinging, then offspeed pitches (thrown notably slower than a fastball, but without as much movement as a breaking ball), and breaking balls last. Future work could run an omnibus test to see whether these patterns are significant.

Contact

We have no significance regarding contact. We again see a pattern similar to the one above, though, where the average likelihood of contact for fastballs and offspeed pitches is lower than for breaking pitches. Note too that the 95% CI gets wider for chase and shadow zone pitches because there is less data for contact in these regions. This makes sense: if players swing less at pitches outside of the zone, there will be a smaller sample of swings there than inside the zone (where you get a strike regardless of whether you swing).
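The link between sample size and interval width can be seen with a simple binomial approximation. This is only a sketch: the intervals in the graphs come from the fitted models rather than from this formula, and the contact rate and swing counts below are invented.

```r
# Approximate width of a 95% confidence interval for a proportion p
# estimated from n swings (normal approximation).
ci_width <- function(p, n) {
  2 * qnorm(0.975) * sqrt(p * (1 - p) / n)
}

width_many_swings <- ci_width(0.8, 400)  # e.g. a zone with lots of swings
width_few_swings  <- ci_width(0.8, 100)  # e.g. a zone with few swings

# A quarter of the swings gives an interval twice as wide.
width_few_swings / width_many_swings
```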

Conclusions

Let's go back to my original question: is Roman Anthony the next Juan Soto? I clarified this to:

Does 2025 MLB Roman Anthony swing and make contact at the same rate as 2018 MLB Juan Soto after controlling for pitch location and type and normalizing across seasons?

Anthony does not swing at the same rate, particularly when seeing breaking balls and fastballs. He also hits the ball harder than Soto did in 2018 (Soto has increased his average exit velocity a little over his career). Contact given a swing shows no significant difference, which is consistent with it being similar for the two players, but I did not specifically look at top vs. bottom of the zone, which the heatmaps imply might be significant. In the next couple of weeks, if I have time, I will add that! To conclude more generally, Anthony is not the same kind of player as Soto, relative to average league behavior in their rookie years. However, if he wants to become more like Soto (which he may be well-suited for, given his patience regarding swinging), I would recommend he focus on breaking balls and pitches around his knees. More specifically, he could become more comfortable swinging at pitches in the shadow zone in order to foul them off, instead of just taking the strike looking. However, I do not know how this will be affected by the ABS system coming to MLB in the 2026 season. Perhaps all of my predictions will be for naught. Either way, I am excited to compare Soto and Anthony again after next season, with more data!

Appendix

Summary Tables

Just another way of presenting the same data.

Swinging

Contact

Bar Graph Comparisons

I find it useful to look at the same data in different ways, which is why I included these graphs too. You can see where the differences shown in the prior graphs come from (i.e. Soto is slightly more likely than league average in 2018 to swing at a fastball in the heart of the zone, but Anthony is much less likely). These graphs should be understood in the context of the prior graphs and later tables since they do not mark significance. Below each graph, I note which parts are statistically significant.

Swinging

significance present in "heart" and "shadow"

significance present in "fastball:heart", "fastball:shadow", "breaking:heart", "breaking:shadow", "breaking:chase"

Contact

There is no statistical significance in these graphs. This makes sense: look at the y-axis scale, in contrast to that of the swinging graphs. For contact, the percentage differences from league average per zone lie between -6% and 2%, whereas for swinging likelihood, they range from ~0% to beyond -20%.

More Summary Tables for Exact Numbers (Calculated from Different Models)

DiD mean: the average difference between Anthony's and Soto's differences from their league averages (Anthony's difference minus Soto's, so a positive value means Anthony is more likely to do the action)

DiD SE: standard error, the "standard deviation of the sampling distribution" (per Wikipedia). Imagine drawing many slightly different samples of pitches and recomputing the DiD mean for each one; the standard error measures how much those recomputed means would vary. This lets us calculate the confidence intervals.

pitch category	zone	DiD mean	DiD SE	Soto mean	Anthony mean	CI low	CI high	significance
fastball	Heart	-0.14367	0.03313	0.00453	-0.13913	-0.20863	-0.07872	TRUE
fastball	Shadow	-0.11350	0.04388	-0.02840	-0.14190	-0.19952	-0.02748	TRUE
fastball	Chase	-0.02902	0.03337	-0.04112	-0.07015	-0.09444	0.03638	FALSE
breaking	Heart	0.09020	0.04204	-0.15696	-0.06675	0.00779	0.17261	TRUE
breaking	Shadow	0.13439	0.05195	-0.20824	-0.07384	0.03255	0.23624	TRUE
breaking	Chase	0.08213	0.03862	-0.12769	-0.04555	0.00643	0.15784	TRUE
offspeed	Heart	-0.03354	0.04073	-0.01680	-0.05035	-0.11338	0.04628	FALSE
offspeed	Shadow	-0.00662	0.05995	-0.05528	-0.06190	-0.12412	0.11088	FALSE
offspeed	Chase	0.02582	0.05599	-0.07339	-0.04757	-0.08391	0.13556	FALSE
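The CI and significance columns above can be approximately reproduced from the DiD mean and DiD SE. The sketch below assumes a normal-approximation 95% interval (mean ± about 1.96 × SE), which is an assumption about how the table was built; it matches the fastball rows to about four decimal places, with the small gaps presumably due to rounding.

```r
# Values copied from the fastball rows of the table above.
did <- data.frame(
  zone     = c("Heart", "Shadow", "Chase"),
  did_mean = c(-0.14367, -0.11350, -0.02902),
  did_se   = c(0.03313, 0.04388, 0.03337)
)

z <- qnorm(0.975)  # about 1.96 for a two-sided 95% interval

did$ci_low  <- did$did_mean - z * did$did_se
did$ci_high <- did$did_mean + z * did$did_se

# Significant when the interval lies entirely on one side of zero.
did$significant <- did$ci_low > 0 | did$ci_high < 0

did
```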

This table looks at significance slightly differently. The filled-in circle indicates significance. The "% Significant Area" column indicates whether the region of the surface that makes the estimate significant is large, or whether a smaller number of points are just exceptionally significant. It's another way of thinking about this type of analysis, and I'm not sure I am as confident in these results.

Difference-in-Differences Results by Zone and Metric
Zone	Metric	DiD Estimate	95% CI	% Significant Area
Heart	Contact Probability	-0.019 ●	[-0.023, -0.016]	25.3%
Shadow	Contact Probability	-0.026 ●	[-0.034, -0.017]	53.2%
Waste	Contact Probability	0.033 ●	[0.029, 0.037]	23.7%
Heart	Exit Velocity	-0.709 ●	[-0.796, -0.622]	0.0%
Shadow	Exit Velocity	0.420 ●	[0.336, 0.504]	0.0%
Waste	Exit Velocity	4.124 ●	[4.029, 4.219]	0.0%
Heart	Swing Probability	-0.074 ●	[-0.075, -0.073]	93.8%
Shadow	Swing Probability	-0.045 ●	[-0.047, -0.043]	22.8%
Waste	Swing Probability	0.009 ●	[0.009, 0.010]	8.1%

jmargie

Is Roman Anthony the next Juan Soto?
