The right call: embracing big data in the patent world

When it comes to leveraging statistical data in order to hone strategy, in-house IP departments can learn a lot from the world of sport

There are only a few seconds left in the basketball game and your team is losing by one point. You decide to give the ball to your star player and hope for the best – you have no idea if your decision is the right one. This is how the game of professional basketball was played just a few years ago. Teams reviewed player statistics and watched video footage, but still went with their gut to make decisions. Most IP departments behave the same way in how they make decisions concerning IP strategy and management – from the gut and without the benefit of critical data.

In the sporting world, this mindset began to shift in the early 2000s when Billy Beane applied a data-driven approach to assembling his team roster (popularised by the book and movie Moneyball). It changed the game of baseball forever. Since then, the use of data analysis has evolved and expanded to other sports.

In some ways, the SportVU project was the next generation of Moneyball as applied to the game of basketball. SportVU was started by two PhD students in Harvard’s statistics department. They installed six motion-capture cameras around the basketball court to record player performance during a game. After completing a successful trial run with several professional teams, SportVU was installed in every stadium in the National Basketball Association for the 2013-2014 season. Whether a player was playing at home or away, SportVU would track that player’s every move.

SportVU measures every possible performance statistic, with every game generating millions of data points. Teams can use this data to generate key performance metrics to drive their strategies and decisions. One such metric is expected possession value (EPV), which uses thousands of data points to estimate a player’s value by quantifying how many points the player adds compared to a hypothetical replacement player. Teams now make game-time decisions that optimise EPV for every possession. Big data has fundamentally changed the way that the game of basketball is played; and today, teams in every major sport leverage data analytics to select their players and design their playbooks. The use of big data in sports has become commonplace.

In the same way that data analytics has taken the sporting world by storm, it is poised to have a profound impact on corporate IP strategy and management. The data landscape for patents has improved significantly over the last several years. There is now a plethora of clean, accessible and useful data covering nearly all aspects of patents, including patent publications, prosecution histories, transactions and patent litigation. This – coupled with recent advances in analytical tools, computing and machine learning – enables any organisation to embrace big data for patents. Although the amount of patent data may not technically rise to the level of some definitions of ‘big data’, the concept of applying analytics to extract value certainly applies to patent data.

A variety of high-quality open source analytical tools and libraries facilitate cutting-edge modelling capabilities. They provide everything from linear regression to random forests, support vector machines and deep neural networks. Recent advances in cloud computing have also made it much simpler to store, access and process large datasets. The confluence of these two trends has led to an environment where even small organisations can apply sophisticated analytics to large amounts of data. Further, many cloud computing providers are integrating modelling functionality and scripting support into their standard offerings, making advanced analytics even more seamless.

IP departments can incorporate data analytics into their strategy and management decisions in many ways. This article addresses four different contexts in which practitioners can use data to achieve better results:

  • Portfolio development – using data to more strategically develop, evaluate and manage patent portfolios.
  • Patent prosecution – using data to more effectively and efficiently prosecute patent applications before the US Patent and Trademark Office (USPTO).
  • Litigation strategy – using data to make decisions that maximise probabilities of success.
  • Budget management – using data to more accurately manage spend and predict future costs.

Portfolio development

The requisite data for portfolio development includes patent publication data for applications and issued patents that can be obtained from a variety of free or paid subscription services (eg, USPTO Patent Database, Innography, Thomson Innovation, LexisNexis TotalPatent and Google Patents).

There is a wealth of easily accessible feature data on patents. This includes a patent’s title, abstract, claims, description, citations, class codes, inventors, assignment history, family and dates. Commonly used features include class codes and word or phrase frequencies from the title, abstract and claims. This data is available from a variety of vendors and can be obtained in bulk through user-friendly interfaces. In some more advanced approaches, it may be desirable to compress or encode the feature data into a low-dimensional representation. Such representations are especially useful for sparse or infrequent features (eg, citations) to generalise the data and reduce its size. Low-dimensional representations are also commonly used with text in what is often referred to as ‘latent semantic analysis’. There are many ways to use this data for portfolio development. Three useful techniques are clustering, classification and similarity searching.
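As a rough illustration of the word-frequency features mentioned above, the following Python sketch turns patent text into TF-IDF weighted term vectors using only the standard library. The sample texts, the crude tokeniser and the function names are illustrative assumptions rather than any vendor's method, and the sketch does not attempt the low-dimensional encoding step.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters (a deliberately crude tokeniser)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document.

    weight = (term count / document length) * log(number of docs / docs containing term)
    """
    counts = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())  # each document counts once per term
    vectors = []
    for c in counts:
        length = sum(c.values())
        vectors.append(
            {t: (k / length) * math.log(n_docs / doc_freq[t]) for t, k in c.items()}
        )
    return vectors


# Hypothetical titles, invented for illustration only.
patents = [
    "Touchscreen display with capacitive sensing layer",
    "Battery charging circuit for a mobile phone",
    "Capacitive touchscreen controller for a mobile phone",
]
vectors = tfidf_vectors(patents)
```

In this toy set, "battery" appears in only one title and so receives a higher weight than "mobile", which appears in two. A term that appeared in every document would receive a weight of zero.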


Clustering

Clustering is an automated way of organising patents into more manageable groups. Clustering algorithms take in data on objects and then try to make meaningful distinctions to segregate the objects. This is especially useful for breaking up large patent portfolios into components by shared subject matter in order to gain a more holistic view of the assets within a portfolio. The ability to automatically cluster lends itself particularly well to unfamiliar portfolios by dividing them up into smaller components and surfacing themes. Figure 1 shows a hierarchical grouping of patents based on significant phrases that appear in the title, abstract and claims of those patents. Many patent analytics vendors offer some type of clustering capability, but it is feasible for in-house departments to do this on their own. Clustering is typically unsupervised, so no pre-labelled training data is required. However, it requires work on the back end to interpret the generated clusters. Depending on the user’s objectives, it may be preferable to discard or merge selected clusters to improve results.

Figure 1. Text cluster

Source: Innography

Many different clustering algorithms are applicable to patents. These include:

  • partitioning approaches that split portfolios into distinct groups;
  • hierarchical approaches that create a connected tree of clusters (see Figure 2); and
  • graphical segmentation approaches that create clusters by segmenting patent citation graphs (see Figure 3).

When selecting an algorithm, it is important to consider the following questions in view of the project’s objectives. Should the clusters be mutually exclusive (hard versus soft clustering)? How many clusters will be generated (fixed or variable)? How many patents need to be sorted? Should the clusters be arranged hierarchically?
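To make these choices concrete, the sketch below shows one possible hierarchical approach: a minimal average-link agglomerative clustering over sets of significant phrases, using Jaccard distance. It produces hard (mutually exclusive) clusters, and the number of clusters is variable because merging stops at a distance threshold. The patent identifiers, phrases and threshold are hypothetical, and the brute-force search would be too slow for large portfolios.

```python
def jaccard_distance(a, b):
    """1 minus (shared phrases / all phrases) for two phrase sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def agglomerative_clusters(items, max_distance):
    """Repeatedly merge the two closest clusters (average link).

    items: {patent_id: set of phrases}
    Stops once the closest pair of clusters is farther apart than max_distance.
    """
    clusters = [[pid] for pid in items]

    def linkage(c1, c2):
        pairs = [(x, y) for x in c1 for y in c2]
        return sum(jaccard_distance(items[x], items[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > max_distance:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical phrase sets for five patents.
patent_phrases = {
    "P1": {"touch", "screen", "display"},
    "P2": {"touch", "screen", "sensor"},
    "P3": {"battery", "charging", "cell"},
    "P4": {"battery", "cell", "anode"},
    "P5": {"antenna", "radio", "signal"},
}
clusters = agglomerative_clusters(patent_phrases, max_distance=0.6)
```

With this toy data the two touchscreen patents and the two battery patents pair up, while the antenna patent stays on its own. Raising the threshold would merge more aggressively, and recording each merge would yield the tree used in a hierarchical view.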

Figure 2. Hierarchical patent clusters

Visualisation of a set of speech patents segmented into hierarchical patent clusters (size is based on patent count, colour is based on median priority date).

Figure 3. Graph segmentation

Citation graph for a set of patents. Different colours represent clusters created using graph segmentation.


Classification

Classification is a way of automatically labelling patents with a pre-determined set of relevant labels. Once a desired taxonomy or set of labels has been established for a department’s particular needs, classification can be used to automatically label large numbers of internal and external patents with minimal human effort. The ability to automatically classify patents lends itself well to the dynamic nature of portfolios, avoiding the recurring need to manually update the assigned labels with newly issued or acquired patents. Classification can be useful in evaluating patent acquisitions and analysing competitor portfolios.

Unlike clustering, classification is a supervised process that requires pre-defined labels. Classification algorithms also need training data to teach them how to segregate objects by the pre-defined labels. This is a key difference between clustering and classification. For example, if a portfolio relating to mobile phones is broken down using a clustering algorithm, the model might produce clusters of patents relating to screens, batteries and antennae. There is no way to know the clusters that a model might produce until it is run. If the goal is to identify patents relating to batteries, it would be better to train a classification model to identify batteries.

When creating training data, it is important to include a large number of representative positive examples and diverse negative examples, each with a high degree of accuracy. For example, in order to build a model to label patents relating to touchscreens, one should include the patents covering different aspects of touchscreens (software and hardware) in the set of positive examples. The negative examples should include close negatives, such as generic displays. An efficient way to generate the negative examples is to manually identify close negatives and then supplement with randomly selected patents.

As with clustering, there are many algorithmic options for building a model. When choosing an algorithm, it is important to consider the following questions in view of the project’s objectives. Should the labels be mutually exclusive? What is the requisite level of accuracy? Should the output be binary (true/false) or probabilistic?
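The sketch below illustrates the supervised workflow with a small multinomial naive Bayes classifier written in plain Python. It is only one of many possible algorithms, chosen here because it gives a probabilistic output for a single binary label. The training texts (two positives, one close negative and one random negative) are invented for illustration, and a real training set would need to be far larger.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def train_naive_bayes(examples):
    """examples: list of (text, label) pairs, where label is True or False.

    Assumes at least one example of each label.
    """
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[True]) | set(word_counts[False])
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}


def predict_probability(model, text):
    """Estimated probability that text carries the label (add-one smoothing)."""
    total_docs = sum(model["doc_counts"].values())
    log_scores = {}
    for label in (True, False):
        counts = model["word_counts"][label]
        denom = sum(counts.values()) + len(model["vocab"])
        score = math.log(model["doc_counts"][label] / total_docs)
        for tok in tokenize(text):
            if tok in model["vocab"]:  # words never seen in training are ignored
                score += math.log((counts[tok] + 1) / denom)
        log_scores[label] = score
    top = max(log_scores.values())
    exp_scores = {k: math.exp(v - top) for k, v in log_scores.items()}
    return exp_scores[True] / (exp_scores[True] + exp_scores[False])


# Hypothetical training data for a "touchscreen" label.
training = [
    ("capacitive touchscreen sensing layer", True),
    ("touchscreen gesture recognition software", True),
    ("liquid crystal display backlight", False),  # close negative
    ("battery charging circuit", False),          # random negative
]
model = train_naive_bayes(training)
p_touch = predict_probability(model, "touchscreen controller with capacitive sensing")
p_display = predict_probability(model, "display backlight driver")
```

A threshold on the returned probability converts it into a binary label. Where that threshold sits depends on the requisite level of accuracy and on whether false positives or false negatives are more costly for the project.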

Similarity searching

Similarity searching is a way of identifying patents that are similar to an initial patent. Assume that a company is interested in filing a patent lawsuit against a competitor. The IP department has identified a strong patent that is relevant to the competitor’s product, but it wants to find additional patents to assert against the product. Similarity searching can quickly unearth other patents that are similar to the initial patent. It can also be a good way to identify potential prior art by filtering results with relevant priority dates. Some vendors offer similarity searching, although it is typically focused on textual similarity.

Similarity searching requires a quantifiable similarity or distance metric. One approach for creating such a metric is to identify a set of similarity factors (eg, words/phrases, class codes and citations) and manually determine how much weight to assign to each factor. Each factor can be scored using either vector or set similarity/distance measures (eg, Euclidean distance, cosine similarity or Jaccard similarity). Another approach is to train and apply a model that determines the optimal weights for each factor.
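The manually weighted approach might look something like the following sketch, which scores text with cosine similarity and scores class codes and citations with Jaccard similarity before combining them. The weights, patent records, class codes and citation identifiers are all hypothetical placeholders.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count mappings."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Shared members divided by all members of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Manually chosen weights (an assumption; they should be tuned per project).
WEIGHTS = {"text": 0.5, "classes": 0.3, "citations": 0.2}


def similarity(p, q, weights=WEIGHTS):
    return (
        weights["text"] * cosine(Counter(p["words"]), Counter(q["words"]))
        + weights["classes"] * jaccard(p["classes"], q["classes"])
        + weights["citations"] * jaccard(p["citations"], q["citations"])
    )


def rank_similar(seed, candidates, top_n=5):
    """Return (score, id) pairs for the candidates most similar to the seed."""
    scored = [(similarity(seed, c), c["id"]) for c in candidates]
    return sorted(scored, reverse=True)[:top_n]


# Hypothetical records; class codes and citation ids are placeholders.
seed = {"id": "SEED", "words": ["touch", "screen", "gesture"],
        "classes": {"G06F3/041"}, "citations": {"REF-1", "REF-2"}}
candidates = [
    {"id": "A", "words": ["touch", "screen", "sensor"],
     "classes": {"G06F3/041"}, "citations": {"REF-2", "REF-3"}},
    {"id": "B", "words": ["battery", "charging"],
     "classes": {"H02J7/00"}, "citations": {"REF-9"}},
]
ranked = rank_similar(seed, candidates)
```

For a prior art search, the candidate list could first be filtered to patents with a priority date earlier than that of the seed patent.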

Patent prosecution

The requisite data for patent prosecution analysis includes all of the information contained in the prosecution history of each patent. The prosecution history entails the correspondence between the applicant and the USPTO from filing of a patent to issuance. This information can be obtained through the USPTO’s Patent Application Information Retrieval system (PAIR) or other paid subscription services (eg, Reed Tech Patent Advisor or Juristat).

Increasingly, vendors are crawling PAIR to extract and organise useful prosecution history data. This data includes millions of prosecution histories and their underlying documents, as well as aggregate statistics on examiners and art units. Vendors also offer the ability to obtain and organise a patent owner’s unpublished or private PAIR data. PAIR data can be useful, both in managing individual cases and in developing entire portfolios.

Using PAIR data for individual cases can improve prosecution strategy and minimise unnecessary costs. An organised presentation of the file history provides valuable context, but the value is really in the aggregated examiner statistics. These include allowance rates, appeal behaviour/success, response to interviews and average timing for office actions and requests for continued examination (RCEs). Each of these can be valuable for prosecution strategy by reducing uncertainty. For example, a particular examiner may have a significantly higher allowance rate and fewer office actions for cases where the applicant conducts an examiner interview. The data would suggest that the applicant should request interviews for all cases with this examiner.
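A minimal sketch of the interview comparison described above follows. It assumes that prosecution histories have already been reduced to one record per disposed case; the record layout and examiner identifier are hypothetical and do not reflect the format of PAIR or of any vendor's export.

```python
from collections import defaultdict


def allowance_rates_by_interview(cases):
    """Return {(examiner, interview_held): allowance rate} for disposed cases."""
    tallies = defaultdict(lambda: [0, 0])  # key -> [allowed, total]
    for c in cases:
        key = (c["examiner"], c["interview"])
        tallies[key][0] += 1 if c["allowed"] else 0
        tallies[key][1] += 1
    return {key: allowed / total for key, (allowed, total) in tallies.items()}


def case(examiner, interview, allowed):
    return {"examiner": examiner, "interview": interview, "allowed": allowed}


# Hypothetical history for one examiner: 8 disposed cases.
history = (
    [case("EX-A", True, True)] * 3
    + [case("EX-A", True, False)]
    + [case("EX-A", False, True)]
    + [case("EX-A", False, False)] * 3
)
rates = allowance_rates_by_interview(history)
```

Here the examiner allows 75% of cases with an interview and 25% without. With so few cases the gap could easily be noise, so in practice it is worth checking the number of cases behind each rate before acting on it.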

PAIR data can also be very useful for managing entire portfolios. It can facilitate the identification of languishing cases to prevent additional investments of time and money (eg, opting not to file an RCE on cases with two or more RCEs and a low-allowance examiner). It can also be used to select outside counsel and evaluate performance by comparing allowance rates and other prosecution efficiency metrics. Further, PAIR data can be leveraged to obtain a better understanding of a prosecution pipeline by looking at allowance likelihood and timing. For example, one area of a portfolio may appear to be strong based on a large number of filings. However, a conclusion of strength may be erroneous if most of those filings have a low likelihood of allowance and/or longer timelines to issuance.
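Both portfolio-level ideas can be sketched in a few lines. The first function flags languishing cases using the rule of thumb given above; the second estimates how many allowances each technology area can expect. The 40% cut-off, the record layout and the use of an examiner's overall allowance rate as a stand-in for each case's allowance likelihood are all assumptions made for illustration.

```python
from collections import defaultdict


def flag_languishing(cases, allowance_rate, min_rces=2, low_rate=0.4):
    """Return ids of cases with at least min_rces RCEs before a low-allowance examiner."""
    return [
        c["app_id"]
        for c in cases
        if c["rce_count"] >= min_rces and allowance_rate[c["examiner"]] < low_rate
    ]


def expected_allowances_by_area(cases, allowance_rate):
    """Sum each pending case's allowance likelihood within its technology area."""
    totals = defaultdict(float)
    for c in cases:
        totals[c["area"]] += allowance_rate[c["examiner"]]
    return dict(totals)


# Hypothetical examiner allowance rates and pending cases.
allowance_rate = {"EX-A": 0.8, "EX-B": 0.2}
pending = [
    {"app_id": "APP-1", "area": "speech", "examiner": "EX-B", "rce_count": 2},
    {"app_id": "APP-2", "area": "speech", "examiner": "EX-B", "rce_count": 0},
    {"app_id": "APP-3", "area": "speech", "examiner": "EX-B", "rce_count": 3},
    {"app_id": "APP-4", "area": "display", "examiner": "EX-A", "rce_count": 2},
    {"app_id": "APP-5", "area": "display", "examiner": "EX-A", "rce_count": 0},
]
languishing = flag_languishing(pending, allowance_rate)
pipeline = expected_allowances_by_area(pending, allowance_rate)
```

In this toy portfolio the speech area has more filings (three against two) but fewer expected allowances (about 0.6 against 1.6), which is the kind of misleading strength that the paragraph above warns about.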

Litigation strategy

The requisite data for patent litigation analysis includes information about lawsuits (pending and decided) that have been filed in court. This docket information can be obtained through the Public Access to Court Electronic Records system (PACER) or other paid subscription services (eg, Docket Navigator, Lex Machina, Darts IP and Innography).

Court docket information captures characteristics of every lawsuit filed, including status, venue, judge, parties, outside counsel, duration, pleadings, damages and outcomes. IP departments can analyse these features in order to make data-driven decisions that will maximise the probability of success for any litigation. These decisions span an entire litigation lifecycle and can have a significant impact on the outcome of the case.

Imagine that OpCo is about to file its first patent lawsuit against a competitor. Because of the nature of the competitor’s business, OpCo has the option of filing the lawsuit in one of three venues. By looking at the data for the courts in each available jurisdiction, OpCo can learn important features about each of the available courts. How much experience do the courts have with patent cases and what are their track records? What is the likelihood that the court will stay the case if the defendant challenges the patent before the USPTO (eg, re-examination or covered business method review)? This information will allow OpCo to compare the different venues in view of the circumstances and select the most plaintiff-friendly court.
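OpCo's venue comparison could be approached along the lines of the sketch below, which aggregates docket records into a few per-venue metrics. The venue names, fields and outcomes are invented, and a real analysis would draw on far more cases and factors than are shown here.

```python
from collections import defaultdict


def venue_summary(cases):
    """Aggregate docket records into per-venue experience, win rate and stay rate."""
    stats = defaultdict(lambda: {"cases": 0, "decided": 0, "wins": 0,
                                 "stay_motions": 0, "stays_granted": 0})
    for c in cases:
        s = stats[c["venue"]]
        s["cases"] += 1
        if c["plaintiff_won"] is not None:  # None marks a pending case
            s["decided"] += 1
            s["wins"] += int(c["plaintiff_won"])
        if c["stay_requested"]:
            s["stay_motions"] += 1
            s["stays_granted"] += int(c["stay_granted"])
    summary = {}
    for venue, s in stats.items():
        summary[venue] = {
            "patent_cases": s["cases"],
            "plaintiff_win_rate": s["wins"] / s["decided"] if s["decided"] else None,
            "stay_grant_rate": (s["stays_granted"] / s["stay_motions"]
                                if s["stay_motions"] else None),
        }
    return summary


def record(venue, plaintiff_won, stay_requested, stay_granted):
    return {"venue": venue, "plaintiff_won": plaintiff_won,
            "stay_requested": stay_requested, "stay_granted": stay_granted}


# Hypothetical docket records.
docket = [
    record("Venue X", True, True, False),
    record("Venue X", True, False, False),
    record("Venue X", False, True, True),
    record("Venue X", None, False, False),
    record("Venue Y", False, True, True),
    record("Venue Y", True, True, True),
]
summary = venue_summary(docket)
```

In this example Venue X has more patent experience, a higher plaintiff win rate and a lower tendency to grant stays than Venue Y, so it would look more plaintiff-friendly on these three measures alone.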

OpCo also needs to select outside counsel. It can look at detailed performance data for candidate law firms in order to answer important questions. How much experience does each law firm have before particular judges? What is each law firm’s win/loss record? What were the damage awards in those cases? OpCo can compare all of this data and any other performance factors to select the outside counsel with the highest probability of success.

Once OpCo has retained outside counsel and chosen the venue, it needs to formulate a litigation strategy. This will depend in part on opposing counsel’s behaviour and the court’s rulings on motion practice. Does opposing counsel frequently employ certain tactics in the course of litigation? If so, OpCo can prepare for those attacks by proactively addressing arguments and establishing a record to position a strong counter-attack. How often does the judge grant motions for summary judgment of invalidity and on what grounds? OpCo can prepare a strong position by reviewing the court’s historical orders and selecting the arguments that are most likely to succeed.

Budget management

The requisite data for budget and cost analysis includes a company’s internal billing and invoice information, along with patent prosecution and litigation information from the above-mentioned sources.

Patent prosecution costs are difficult to predict. Some applications are resolved quickly, with only one or two office actions. Others drag on for years through appeals and RCEs. The timing of costs is also unpredictable, as examiners largely dictate the cost-incurring events for each case. Outside counsel can provide valuable information by reporting cost data to clients on a real-time basis; but in practice, they often take a long time to submit invoices and provide inaccurate or inconsistent accruals. It is no different on the litigation front. Some cases reach a quick resolution, while others take many years to fully resolve. Courts dictate the timing of expenses and most litigation counsel miss their forecasts on a regular basis. These issues can make budget management extremely challenging for IP departments, as well as the finance departments that rely on their forecasts to meet quarterly numbers. The good news is that data-driven cost prediction and tracking can go a long way towards making it manageable.

Figure 4a. Budget model: monthly cost of original filing and continuation filing

Figure 4b. Budget model: monthly cost of new filing and pending application

Cost prediction

One way to predict costs is to build a model to forecast costs for individual patent applications. The model should forecast total costs as well as costs per month from filing. For greater precision, multiple models can be used for different types of application (eg, original filings, continuations and Track One). Figure 4(a) shows a cost model for original and continuation applications. Once built, the model can be used for long-range forecasting for all or part of a portfolio. The model’s accuracy will increase with the number of patent applications used to build it, and its forecasts will reflect only that data. If specific costs, applicant behaviour or examiner behaviour subsequently change, the accuracy of the model will be impaired.

One approach for building a model to forecast costs for individual applications relies on internal invoice data. The first step is to determine the average cost per month of a pending application using historical invoice data. The next step is to determine the likelihood that an application will still be pending at each month from filing, accounting for abandonment and issuance. The last step involves multiplying the average cost for each month from filing by its respective likelihood of pendency. The result is a model that will forecast the monthly cost for a new application from filing. The model can also be modified to forecast future costs for a pending application by adjusting the likelihood of pendency to account for the current status of the application (see Figure 4(b)).
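The three steps can be sketched in Python as follows. The monthly cost figures and disposal times are invented, the function names are hypothetical, and the pendency curve is estimated only from applications that have already been issued or abandoned, which is a simplifying assumption.

```python
def pendency_curve(months_to_disposal, horizon):
    """Share of applications still pending at each month from filing.

    months_to_disposal: months from filing to issuance or abandonment,
    one entry per historical (already disposed) application.
    """
    n = len(months_to_disposal)
    return [sum(1 for d in months_to_disposal if d > m) / n for m in range(horizon)]


def forecast_new_application(avg_cost, pendency):
    """Expected cost per month from filing = average cost x likelihood of pendency."""
    return [c * p for c, p in zip(avg_cost, pendency)]


def forecast_pending_application(avg_cost, pendency, current_month):
    """Expected remaining monthly costs, given the case is pending at current_month.

    Assumes the pendency likelihood at current_month is greater than zero.
    """
    p_now = pendency[current_month]
    return [avg_cost[m] * pendency[m] / p_now
            for m in range(current_month, len(avg_cost))]


# Step 1 (hypothetical): average invoice cost per pending application, by month.
avg_cost = [1000, 0, 500, 500, 800, 800]
# Step 2: likelihood of pendency, from four hypothetical disposed applications.
pendency = pendency_curve([2, 4, 4, 6], horizon=6)
# Step 3: multiply the two, month by month.
new_filing = forecast_new_application(avg_cost, pendency)
remaining = forecast_pending_application(avg_cost, pendency, current_month=2)
```

The forecast for a new filing sums to 2,150 in this example. For an application that is already known to be pending at month two, dividing by the likelihood of reaching that month raises the later monthly figures accordingly.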

Cost tracking

Real-time cost tracking is important to ensure budget compliance. One efficient way to track costs is to use PAIR data in conjunction with cost estimates. PAIR data can provide real-time updates of events that will incur costs, and cost estimates can be applied to those events to track expected costs. For example, an IP department can determine from PAIR data that an examiner recently sent out a rejection and then allocate the cost of an expected response in the coming months. The cost can also be retroactively allocated by tracking filing of the response, rather than the rejection. Either way, real-time cost tracking will improve cost allocation to facilitate budget compliance.
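A simple version of this event-based tracking is sketched below. The event names, cost estimates and three-month response lag are hypothetical, and the events are assumed to have been extracted from a PAIR data feed already.

```python
from datetime import date

# Hypothetical cost estimates per event type (eg, from a fixed fee schedule).
COST_ESTIMATES = {
    "non_final_rejection": 3000,
    "final_rejection": 3500,
    "notice_of_allowance": 1500,
}


def add_months(d, months):
    """Return the first day of the month that is `months` after d."""
    index = d.year * 12 + (d.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def accrue_expected_costs(events, cost_estimates, lag_months):
    """Allocate an expected cost to the month in which each response is likely due.

    events: list of (app_id, event_type, event_date) tuples.
    Returns {(year, month): total expected cost}.
    """
    budget = {}
    for app_id, event_type, event_date in events:
        cost = cost_estimates.get(event_type)
        if cost is None:
            continue  # event type has no cost estimate, so it is not tracked
        due = add_months(event_date, lag_months)
        key = (due.year, due.month)
        budget[key] = budget.get(key, 0) + cost
    return budget


# Hypothetical events.
events = [
    ("APP-1", "non_final_rejection", date(2024, 11, 15)),
    ("APP-2", "final_rejection", date(2024, 11, 3)),
    ("APP-3", "notice_of_allowance", date(2024, 12, 1)),
    ("APP-4", "status_inquiry", date(2024, 12, 9)),
]
budget = accrue_expected_costs(events, COST_ESTIMATES, lag_months=3)
```

For retroactive allocation, the same function could be driven by response filing events with a lag of zero months, so that the cost lands in the month the response was actually filed.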

A deeper analysis of PAIR data can help to determine when cost-incurring events are most likely to occur. It is certain that outside counsel will incur costs by drafting and filing a response to a first office action, but the time that it takes to issue a first office action may differ wildly from examiner to examiner. When each examiner’s expected behaviours are incorporated into a budget, it provides an organisation with much greater precision and predictability. IP departments can take a similar approach with litigation budgets. Court docket information from PACER data can provide the average time that it takes a particular court to reach critical milestones in a case (see Figure 5). With these court-specific timelines, IP departments can predict costs and build budgets with a higher degree of precision.
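As an illustration of court-specific timelines, the sketch below computes the median number of days from filing to each milestone for each court. The court names, milestones and durations are invented, and the median is used rather than the mean only as a design choice to limit the effect of unusually long cases. The same grouping could be applied to examiners and the time to a first office action.

```python
from collections import defaultdict
from statistics import median


def median_days_to_milestone(records):
    """records: (court, milestone, days_from_filing) tuples.

    Returns {(court, milestone): median days}.
    """
    grouped = defaultdict(list)
    for court, milestone, days in records:
        grouped[(court, milestone)].append(days)
    return {key: median(values) for key, values in grouped.items()}


# Hypothetical docket-derived durations.
records = [
    ("Court X", "claim_construction", 300),
    ("Court X", "claim_construction", 360),
    ("Court X", "claim_construction", 420),
    ("Court Y", "claim_construction", 500),
    ("Court Y", "claim_construction", 700),
    ("Court X", "trial", 800),
    ("Court X", "trial", 900),
]
timelines = median_days_to_milestone(records)
```

A budget can then place the expected cost of each case phase in the period suggested by the relevant court's timeline.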

Figure 5. Litigation timelines

Source: Docket Navigator

The keys to implementing this type of cost tracking are a regular PAIR or PACER data feed and accurate cost estimates. Fixed fee schedules or capped fee arrangements provide accurate cost estimates for each type of prosecution task or case phase. If fixed or capped fees are not available, prior invoice data can be used to determine average costs. Some prosecution events to consider tracking include office action rejections and responses, information disclosure statement filings, notices of missing parts, appeals and notices of allowance. Some litigation events to consider tracking include dispositive motions, claim construction briefs and hearings.

Game changers

There are only a few seconds left in the basketball game and your team is losing by one point. You use performance data to draw up an offensive play that optimises EPV for your final possession. Instead of hoping for the best, you rely on the data to make the right call. When will IP departments embrace data in the same manner?

The data landscape for patents is ripe, and the state of analytics and computing has made it more accessible than ever. IP departments do not need to build their own tools or hire experts to interpret the data. They simply need to change their mindsets and embrace data-driven decision making. It starts with asking the right questions:

  • What outcomes do we want?
  • What factors affect those outcomes?
  • What data would be useful for making decisions that optimise those relevant factors?

By taking advantage of relevant data sources and tools, any IP department can leverage data analytics in its IP strategy and management decisions to achieve better results. 

Jeremiah Chan is legal director and Aaron Abood is senior patent agent at Google, Mountain View, California, United States

The views expressed in this article are those of the authors alone
