Monetising data, machine learning’s most valuable asset
The data associated with machine learning can be extremely valuable but, writes Kimberley Bayliss of Haseltine Lake Kempner in this co-published piece, before it can be monetised there are some important issues to work through
One of the things I hear time and time again from inventors is that data is machine learning’s (ML) most valuable asset. After all, an ML model is only as good as the quality and quantity of the data that it is trained on.
In many fields, good-quality data that can be used for training ML models is proprietary and can be plagued (depending on your viewpoint) by privacy issues. Furthermore, as soon as you give a third party access to your data it can be copied, making it easy to lose control of it quickly and completely.
If data really is so valuable, the burning question is therefore whether it can be successfully protected and monetised.
The following are some points to consider when seeking to turn data into a potentially revenue-generating asset.
Be clear on what you have and each individual’s responsibilities
Conduct a data audit to determine what you have and how it is used. Work out what is in each database, who in your organisation has permission to access it and for which purposes.
Just as employees should be aware when they access a trade secret - and the responsibilities that flow from this - employees should also be made aware of their responsibilities when accessing and using company data. This minimises the prospects of your employees sharing valuable data accidentally.
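As a rough illustration of what the output of such an audit might look like, the sketch below records, for one dataset, what it contains and which roles may use it for which purposes. The class, the dataset name, the roles and the purposes are all hypothetical, chosen only for the example; a real register would follow your own organisation's structure.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """One entry in a hypothetical data-audit register."""
    name: str
    contents: str
    # Maps a role to the purposes for which that role may use the data.
    permitted_uses: dict[str, set[str]] = field(default_factory=dict)

    def may_access(self, role: str, purpose: str) -> bool:
        """Return True only if this role is cleared for this purpose."""
        return purpose in self.permitted_uses.get(role, set())


# Illustrative entry: names, roles and purposes are assumptions.
sensor_logs = DatasetRecord(
    name="sensor_logs",
    contents="Timestamped machine sensor readings",
    permitted_uses={
        "data_science": {"model_training"},
        "support": {"fault_diagnosis"},
    },
)

allowed = sensor_logs.may_access("data_science", "model_training")
denied = sensor_logs.may_access("support", "model_training")
print(allowed, denied)
```

Recording purposes alongside roles means the same register can answer both audit questions: who can reach a dataset, and what they are permitted to do with it.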
Keep records on how the data was compiled and processed
Both copyright and database rights can be used to protect databases in the UK and the EU.
Copyright protects original (eg, creative) selections or arrangements of material in a database. The contents of a database, by contrast, are protected by database rights if there has been a substantial investment in obtaining, verifying or presenting the data.
Thus, in both cases, it is good practice to document how your data was collected and processed in order to evidence that these rights exist.
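One possible way to keep such records is sketched below: each snapshot of a dataset is logged with a cryptographic fingerprint, its source and the processing steps applied. The function name, field names and sample data are assumptions made for the example, and whether a record of this kind is sufficient evidence is a legal question that the code does not answer.

```python
import hashlib
import json
from datetime import datetime, timezone


def provenance_entry(data: bytes, source: str, processing_steps: list[str]) -> dict:
    """Build a log entry describing how a dataset snapshot was obtained and processed."""
    return {
        # The SHA-256 digest ties the entry to one exact version of the data.
        "sha256": hashlib.sha256(data).hexdigest(),
        "source": source,
        "processing_steps": processing_steps,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


# Illustrative snapshot and description; both are hypothetical.
snapshot = b"id,reading\n1,0.42\n2,0.57\n"
entry = provenance_entry(
    snapshot,
    source="in-house field survey",
    processing_steps=["removed duplicates", "verified against calibration log"],
)
print(json.dumps(entry, indent=2))
```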
Consider patent protection
Patent protection isn’t necessarily the first thing that comes to mind when we think of protecting a database, but Article 64(2) of the European Patent Convention explicitly extends the protection conferred by a process patent to products directly obtained by that process.
Furthermore, the EPO Examination Guidelines at Part G, Chapter II, Section 3.3.1 state that: “Where a classification method serves a technical purpose, the steps of generating the training set and training the classifier may also contribute to the technical character of the invention if they support achieving that technical purpose.”
Thus, it seems theoretically possible to obtain a patent with claims to a method of processing data (for example, to optimise the data for use in training a machine learning model) whose protection also extends to a database produced by that process. If your data is modified in order to improve a technical process, it is therefore worth considering patent protection.
One size doesn’t necessarily fit all
Once you have audited your data collections, the next question is how valuable each dataset is to your business. In other words, which data gives you a competitive advantage?
While you may want to keep the data that contributes most to your business a trade secret, this may be overkill for other data assets that provide less benefit to your organisation. It is this data that is ripe for monetisation.
Derived products or the real thing?
Once the data has been selected for monetisation, a range of options is open to you. These include licensing and selling the data, for example via an IP or data broker.
Both the data itself and products derived from it, such as trained models or other predictive tools, may be sold to third parties. Derived products may be hidden behind application programming interfaces (APIs) and made available to third parties, for example through a subscription model.
While it may feel safer to sell, or give access to, derived products, this doesn’t necessarily protect the underlying data, as datasets can be reconstituted from ML models. Extraction attacks, in which an attacker makes large numbers of requests to a model, can be used to build up a database of input-output pairs, or to probe model boundaries to determine the underlying logic. Thus, both training data and model structure can potentially be reconstructed simply by querying a model.
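To make the boundary-probing idea concrete, here is a deliberately simplified sketch. The "model" is a toy classifier with a single secret threshold, standing in for a model exposed behind an API; the attacker sees only its outputs, yet recovers the threshold by repeated querying and collects input-output pairs along the way. All names and values are hypothetical, and real extraction attacks on real models are considerably more involved.

```python
def make_black_box(secret_threshold: float):
    """Return a toy 'model behind an API' that exposes predictions only."""
    def predict(x: float) -> int:
        return 1 if x >= secret_threshold else 0
    return predict


def probe_boundary(predict, low: float, high: float, queries: int = 30):
    """Estimate the decision boundary by bisection, using only model outputs.

    Assumes predict(low) == 0 and predict(high) == 1.
    Returns the estimate and the input-output pairs gathered while probing.
    """
    pairs = []
    for _ in range(queries):
        mid = (low + high) / 2
        label = predict(mid)
        pairs.append((mid, label))
        if label == 1:
            high = mid  # boundary lies at or below mid
        else:
            low = mid   # boundary lies above mid
    return (low + high) / 2, pairs


# The attacker never sees 0.37 directly, only the model's answers.
model = make_black_box(secret_threshold=0.37)
estimate, pairs = probe_boundary(model, 0.0, 1.0)
print(f"estimated threshold: {estimate:.6f} from {len(pairs)} queries")
```

In this toy case a few dozen queries are enough, which is one reason providers often consider measures such as rate limiting and query monitoring when exposing models through an API.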
No magic bullet
As is so often the case, there doesn’t seem to be a magic bullet that allows data to be protected and easily monetised. However, a deliberate approach with a clear paper trail is likely to offer the best opportunity to realise the value of your datasets while also maintaining your rights.