
Initial Report

Analyzing Tennis Matches Based on Audio

Erik Sangeorzan, Jason Hu, Max Baer, Pavel Mazza, Po-Hsun Wu

November 10, 2020


Table of Contents

I. Summary of Project Idea

II. Tasks Assigned

III. Task Progress (as of Nov. 10)

IV. Next Steps


 

Summary of Project Idea

 

Our team’s goal is to create software that takes in an audio clip of a tennis match and detects when each point begins and ends.  Because the crowd is silent while the ball is in play, we can identify the sounds that mark how the game is scored, such as a player hitting the ball across the court or an umpire calling a ball out of bounds, thereby ending the point.

 


Tasks Assigned

 

1. Collect audio from tennis matches (3 different samples were recommended) and load them into MATLAB/Python

2. Extract the audio from exactly when a point begins to when it ends and plot its FFT and spectrogram.  Also extract sound from random times during the match and plot its FFT and spectrogram.  Compare the two results.

3. Apply a basic processing technique to the audio clip that could be useful for extracting a certain piece of data.  Try to draw conclusions from this that can be applied to the final design.

4. Read “Inferring the Structure of a Tennis Game using Audio Information” by Qiang Huang and Stephen Cox.  Draw motivation from their methods if applicable.


 

Task Progress (as of Nov. 10)

 

1. We collected an audio clip, about 2 minutes long, from a match at Wimbledon.  This clip contains events representative of the variety of action and sounds that take place during a tennis match: rallies of multiple hits between the two players, service faults announced by the line umpires, score announcements from the chair umpire, interjections from the broadcast commentators, and cheers from the crowd, from large rounds of applause to solitary admirers cheering on their favorite player.

The clip was converted into .wav form and loaded into MATLAB.
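For the Python side of our MATLAB/Python toolchain, loading the clip could look like the sketch below.  The mono-mixing and normalization steps are our assumptions, and SciPy's WAV reader stands in for MATLAB's audioread:

```python
import numpy as np
from scipy.io import wavfile

def load_clip(path):
    """Read a WAV file and return (sample_rate, mono float signal in [-1, 1])."""
    rate, data = wavfile.read(path)
    data = data.astype(np.float64)
    if data.ndim == 2:           # stereo -> mono by averaging channels
        data = data.mean(axis=1)
    peak = np.max(np.abs(data))
    if peak > 0:
        data /= peak             # normalize to [-1, 1]
    return rate, data
```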

 


2. Below are two sets of plots: one for a tennis point (a rally), and one from a random time in the match, for which we isolated a stretch between points (subtle background noise from the audience).  We found that the sounds of the ball hitting the racket during a rally resemble a non-ideal delta signal: a relatively high magnitude over a very short timespan.  We compared that to cheers from the crowd, which immediately follow the point’s completion.  The cheers have a time-domain shape similar to a harmonic envelope, but the envelope decays until the noise eventually dies out.

 

Plotting the FFT of both events, we see that the rally clip’s FFT peaks at high frequencies similar to those of the applause clip’s FFT; where they differ is in magnitude.  Our initial thought was that, since the sound of the ball hitting the racket occupies such a short time span, it would sit at higher frequencies than the applause.  Instead, the two ended up peaking at similar frequencies, but the peaks of the rally FFT were almost three times greater than those of the applause FFT.  This discovery is something to keep in mind when we design our final filter for ignoring audio events unrelated to the match.
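The FFT comparison can be sketched as follows.  The "hit" and "applause" signals here are synthetic stand-ins we invented for illustration (a ~2 ms burst vs. sustained noise); only the real clips show the roughly 3x magnitude gap described above:

```python
import numpy as np

fs = 48_000                                  # sampling rate of our clip
rng = np.random.default_rng(0)

# Synthetic stand-ins (hypothetical, illustration only).
hit = np.zeros(fs)
hit[1000:1100] = rng.standard_normal(100)    # ~2 ms impulsive burst
applause = 0.1 * rng.standard_normal(fs)     # sustained low-level noise

def fft_magnitude(x, fs):
    """Return the one-sided FFT magnitude spectrum and its frequency axis."""
    mag = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1 / fs)
    return freqs, mag

f_hit, m_hit = fft_magnitude(hit, fs)
f_app, m_app = fft_magnitude(applause, fs)
# Comparing m_hit and m_app over f_hit/f_app is exactly the comparison we
# plotted for the real rally and applause clips.
```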

 

To get a different view of these clips in the frequency domain, we also plotted spectrograms of the rally and applause clips.  This time, the rally spectrogram showed what we had expected for the sound profile of the ball hitting the racket: a high-frequency ‘delta’ with components extending to higher frequencies.  These components were difficult to identify on their own, but compared with the applause spectrogram, whose peaks mostly sit at lower frequencies, the ‘deltas’ stood out clearly.  The applause clip did show a higher power/frequency ratio than the rally clip, but it did not reach the higher frequencies the rally clip did, which differentiates the two plots.
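A minimal sketch of the spectrogram computation, using scipy.signal.spectrogram on a synthetic stand-in clip (the background level, hit position, and window sizes are our assumptions).  Short windows give the time resolution that makes the hit appear as a vertical ‘delta’ stripe:

```python
import numpy as np
from scipy.signal import spectrogram

fs = 48_000
rng = np.random.default_rng(1)

# Synthetic stand-in: quiet background with one impulsive "ball hit".
clip = 0.05 * rng.standard_normal(fs)
clip[20_000:20_100] += rng.standard_normal(100)

# nperseg=512 at 48 kHz -> ~10.7 ms windows; each column of Sxx is the
# power spectrum of one window, so the hit shows up as one bright column.
freqs, times, Sxx = spectrogram(clip, fs=fs, nperseg=512, noverlap=256)

# Compare total power in the window containing the hit vs. a quiet window.
hit_col = np.argmin(np.abs(times - 20_050 / fs))
quiet_col = np.argmin(np.abs(times - 40_000 / fs))
```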

[Plots: time-domain waveforms, FFTs, and spectrograms of the rally and applause clips]

3. We applied a high-pass filter to the entire audio wave to try to keep only the sharp peaks in the waveform, which are the sounds of the ball being hit; these peaks tend to have higher frequency content than the sounds of the audience and umpire.  A cutoff frequency that was too low left much of the audience noise in place, while one that was too high removed even the ball noises, leaving only ambient noise.  We chose a relative cutoff frequency of 0.8 (19.2 kHz, i.e. 0.8 of the 24 kHz Nyquist frequency at the 48 kHz sampling rate), which preserves most of the ball noises.
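The high-pass step could be sketched as below.  The Butterworth family, filter order, and zero-phase filtfilt application are our assumptions; the report only fixes the relative cutoff of 0.8:

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 48_000          # sampling rate of our clip
cutoff_rel = 0.8     # relative to Nyquist (24 kHz), i.e. 19.2 kHz

# 4th-order Butterworth high-pass; filtfilt runs it forwards and
# backwards so the filtered audio has zero phase shift.
b, a = butter(4, cutoff_rel, btype="highpass")

def highpass(x):
    return filtfilt(b, a, x)

# A 1 kHz tone (well below the cutoff, like crowd murmur) is strongly
# attenuated, while broadband impulsive hits keep their high frequencies.
t = np.arange(fs) / fs
low_tone = np.sin(2 * np.pi * 1_000 * t)
residual = np.max(np.abs(highpass(low_tone)))
```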

 


4. The article uses a variety of techniques from probability and machine learning and breaks the problem into a few simpler subproblems. Specifically, it classifies “events” at four levels, from broadest to most specific: game points, match events, audio events, and acoustic observations. Game points are what we ultimately want to detect, that is, the boundary between one point and the next. Match events are major events that can occur during a point, such as a rally, a serve, or audience applause. Audio events are finer-grained events, such as the ball being hit or the umpire’s signal. Finally, acoustic observations are simply the raw sound data collected from the game.

 

The goal of the article is to determine, from the acoustic observations gathered, the points in time of the game points. This is done by maximizing the probability that a certain point in time is the game point, given the previous acoustic observations. This conditional probability can be broken down in terms of match events and audio events using the law of total probability and Bayes’ Law. The advantage of this approach is that, for example, estimating the probability that a match event is occurring, given the past audio events, is more tractable. Thus, the article builds up a hierarchical system, where starting with acoustic observations, the probability of audio events is estimated, which allows for estimation of probability of match events, which allows for determination of probability of game points.
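One layer of that hierarchy can be illustrated with a toy Bayes computation.  All numbers here are invented for illustration and are not taken from the paper:

```python
# Hypothetical priors over match events and the likelihood of hearing a
# "ball hit" audio event under each one (invented numbers).
match_priors = {"rally": 0.5, "serve": 0.3, "applause": 0.2}
p_hit_given = {"rally": 0.9, "serve": 0.7, "applause": 0.05}

# Law of total probability: P(hit) = sum over events e of P(hit|e) * P(e).
p_hit = sum(p_hit_given[e] * match_priors[e] for e in match_priors)

# Bayes' law: P(e | hit) = P(hit | e) * P(e) / P(hit).
posterior = {e: p_hit_given[e] * match_priors[e] / p_hit for e in match_priors}
# Observing a "ball hit" makes a rally the most probable match event,
# which is the kind of bottom-up inference the paper chains layer by layer.
```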

 

Each of these individual probabilities is determined in slightly different ways, but the general ideas are similar. Some games are taken and used as training data for the machine learning algorithms. This training data allows for the determination of the constants associated with the models. Once a model is established, other games are used as a test to determine the model accuracy. The result was that the various steps achieved accuracy rates of anywhere from 60% to 90%. 

 

Most of the specific techniques in the article will not be used in our project, as they are relatively complex and therefore difficult to adapt to our specific purpose. However, the idea of breaking down the problem into stages and connecting the stages with a flow chart seems a reasonable approach for our team.

 

Next Steps

Some steps that we are planning to take in the future include classifying different sounds into categories. This would include differentiating between the sound of a ball hit and the sounds of the audience and umpire. One method that could accomplish this would be to design and use an appropriate filter that highlights the key events. The sharp pop of a tennis ball as it is hit is one of the sounds we will focus on; the resulting signal closely resembles a Dirac delta function. Since the games can be tracked more easily by paying attention to the ball hits, the filters will work to isolate these strikes and other key sounds.
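After filtering, the delta-like hits could be located with simple peak detection; the sketch below uses scipy.signal.find_peaks on a synthetic clip (hit positions, threshold, and minimum spacing are all our assumptions):

```python
import numpy as np
from scipy.signal import find_peaks

fs = 48_000
rng = np.random.default_rng(2)

# Synthetic two-second clip: quiet background plus two simulated ball hits.
sig = 0.02 * rng.standard_normal(2 * fs)
for hit_sample in (12_000, 60_000):
    sig[hit_sample:hit_sample + 50] += rng.standard_normal(50)

# Treat |signal| as a rough envelope; require peaks well above the noise
# floor and at least 0.25 s apart so each hit yields a single detection.
envelope = np.abs(sig)
peaks, _ = find_peaks(envelope, height=0.5, distance=fs // 4)
hit_times = peaks / fs   # detected hit times in seconds
```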

 

We could also evaluate the effectiveness of our methods by applying them to a brand-new recording of a match and comparing the filter’s results with the actual point boundaries. We could also apply this to specific rallies of unique styles (e.g. rallies consisting of quieter “slice” hits, of predominant grunts and loud hits, or of “aces”).

 

An additional, more challenging step would be to create a machine learning algorithm that is able to identify different types of match events on its own, similar to the work presented in the IEEE paper. This step would be the most difficult to undertake, since it requires an algorithm that analyzes the data in a very intricate way, but any success on this problem would be valuable.
