New tools and advancements in biological sequence searching

Biological molecules play a significant role in both humans and animals, so knowledge of them is essential for healthcare, pharmaceutical, diagnostic and biotechnology purposes. These molecules include DNA, RNA and proteins, each built from a continuous chain of nucleotide bases or amino acids, called a biological sequence.

For many years, bacteria and viruses have been of intense interest to researchers. Biological sequence research provides a greater understanding of their pathophysiology, which in turn guides the development of therapeutics and diagnostics for the diseases they cause.

There are two forms of biological sequences:

  • A nucleotide sequence is a series of letters that indicates the order of nucleotides within a DNA or RNA molecule. The nucleotides are adenine (A), cytosine (C), guanine (G) and thymine (T) in DNA, with uracil (U) taking the place of thymine in RNA. A sequence can run to thousands of base units. Moreover, a nucleotide sequence can form a primer, probe or biomarker with therapeutic or diagnostic applications.
  • Amino acid sequences can form proteins, antibodies, enzymes and receptors of varying length (from 100 to 1,000 amino acids).
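The two forms can be told apart, roughly, by the letters they use. The following is a minimal sketch in Python, assuming plain one-letter codes with no ambiguity symbols; the function name `classify_sequence` is hypothetical and used only for illustration. Note that a short string such as "ACG" is valid in all three alphabets, so the sketch simply reports the first alphabet that fits.

```python
# Alphabets assumed for this sketch: standard one-letter codes only,
# with no ambiguity symbols such as N or X.
DNA_LETTERS = set("ACGT")
RNA_LETTERS = set("ACGU")
PROTEIN_LETTERS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


def classify_sequence(seq: str) -> str:
    """Return 'DNA', 'RNA', 'protein' or 'unknown' for a raw sequence string."""
    letters = set(seq.upper())
    if not letters:
        return "unknown"
    # Check the smallest alphabets first; the first one that fits wins.
    if letters <= DNA_LETTERS:
        return "DNA"
    if letters <= RNA_LETTERS:
        return "RNA"
    if letters <= PROTEIN_LETTERS:
        return "protein"
    return "unknown"


for example in ("ATGGCGTACGTT", "AUGGCGUACGUU", "MKTAYIAKQRQISFVKSHFSRQ"):
    print(example, "->", classify_sequence(example))
```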

Research related to nucleotides (DNA, RNA) and proteins has grown exponentially in the pharmaceutical, agriculture and biotech industries, and patent filings covering nucleotide or protein sequences have surged as a result. There are currently more than 40,000 patents related to DNA molecules, as companies have sought to monetise the human genome. Nearly 20% of the 23,688 known human genes have been patented, with over half owned by private companies.

To protect a biological sequence through a patent, applicants must follow the specific guidelines of each regional patent office, which require nucleotide or peptide sequences to be listed separately and in a proper format when a patent application is submitted. The most widely used format for patent submission is FASTA, a text-based format for representing nucleotide or peptide sequences. Some of the rules laid down by patent offices for sequence listings are as follows:

  • Each sequence in the sequence listing shall be referred to by a sequence identifier, a unique integer that corresponds to the SEQ ID NO assigned to that sequence in the listing.
  • If provided on paper, it shall have independent page numbering; if furnished in electronic form, it shall be in an electronic document format and filed by a means of transmittal.
  • A nucleotide sequence shall be presented only by a single strand, in the 5’-end to 3’-end direction from left to right. The terms 3’ and 5’ shall not be represented in the sequence.
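As an illustration of the FASTA format and of numbering sequences with unique integers, here is a minimal Python sketch. It is not an official sequence listing format and is not the output of any patent office tool; the helper names `parse_fasta` and `number_sequences`, the dictionary keys and the example records are all hypothetical.

```python
def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Parse FASTA text into (header, sequence) pairs."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A '>' line starts a new record; store the previous one first.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def number_sequences(records):
    """Assign each record a unique integer identifier, starting at 1."""
    return [
        {"seq_id_no": i, "description": header, "sequence": seq}
        for i, (header, seq) in enumerate(records, start=1)
    ]


# Hypothetical input: sequences written left to right, without 5' or 3' labels.
example = """>probe_A hypothetical primer
ATGGCGTACG
TTAGC
>peptide_B hypothetical peptide
MKTAYIAKQR
"""

for entry in number_sequences(parse_fasta(example)):
    print(f"SEQ ID NO: {entry['seq_id_no']}  {entry['sequence']}")
```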

Given the high volume of research, IP experts need sequence search functionality to identify relevant patents and scientific articles. Many IP projects require sequence searching: for instance, patentability searches before a sequence is patented, freedom-to-operate searches before a product is launched, infringement or product clearance searches, and invalidation searches to assess the validity of patents that claim sequences. Searching for patents that contain sequences requires sequence alignment, which uses an algorithm (eg, the Basic Local Alignment Search Tool, or BLAST) to establish the similarity between two sequences through character-to-character matching. BLAST compares any biological sequence (eg, the amino acid sequence of a protein or the nucleotide sequence of a DNA or RNA molecule) against a list of other sequences.
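To make character-to-character matching concrete, the sketch below scores a query against a small list of sequences and ranks the results. It is not BLAST, which relies on faster heuristics; it uses the simpler Smith-Waterman local alignment method instead. The scoring values (match 2, mismatch -1, gap -2), the function names and the example database are assumptions chosen for illustration.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best Smith-Waterman local alignment score between a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Compare one character of a with one character of b.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def rank_hits(query: str, database: dict) -> list:
    """Score the query against every database entry, best score first."""
    scored = [(name, local_alignment_score(query, seq))
              for name, seq in database.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical database of sequences keyed by an arbitrary name.
database = {
    "hit_1": "TTGATTACATT",
    "hit_2": "GATCACA",
    "hit_3": "CCCCCCC",
}

for name, score in rank_hits("GATTACA", database):
    print(name, score)
```

A higher score means a longer or cleaner local match, so the entry that contains the query exactly is ranked first.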

Many patent offices offer publicly available software to standardise biological sequence submission formats. For example, BiSSAP was developed by the EPO in collaboration with national patent offices and the European Bioinformatics Institute, whereas PatentIn was developed by the USPTO.

The IP community encounters difficulties when an invention or product feature covers biological sequences, because conventional strategies (ie, keyword or class-based searches) cannot find patents or scientific articles by matching the sequences they contain. In recent years, various platforms (eg, NCBI) have introduced tools and services to facilitate sequence searching, which offer a starting point for obtaining quick results. Freely available databases (eg, NCBI-BLAST and PatentLens) also exist; PatentLens in particular allows over 80 million DNA and protein sequences disclosed in patents to be searched.

However, certain challenges are still associated with sequence searching, including non-editable sequences, a lack of uniformity in how sequences are submitted and limited access to the full text of scientific articles. Patented sequences are also difficult to access, especially in foreign jurisdictions, and there is a lack of fast and accurate sequence alignment tools for identifying sequences disclosed in patents.

In an effort to overcome these hurdles, the industry has made advances in sequence search capability that help the IP community not only to map a sequence but also to obtain the percentage of sequence bases that align. A limited number of paid tools (eg, STN and GenomeQuest) can search sequences by integrating multiple databases in a single platform, so that searchers can run sequence searches across patents from multiple jurisdictions and apply other parameters (eg, chemically modified radionucleotide molecules).
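A percentage of alignment can be computed in more than one way, and commercial tools may define it differently. The minimal Python sketch below assumes the two sequences have already been aligned to the same length, with "-" marking a gap, and divides the number of identical positions by the total number of alignment columns. The function name `percent_identity` is hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of alignment columns where both sequences share a letter.

    Assumes both inputs are already aligned to equal length, with '-'
    marking a gap. The denominator is the number of columns, which is
    one of several conventions in use.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        return 0.0
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return 100.0 * matches / len(aligned_a)


print(f"{percent_identity('GATTACA', 'GATCACA'):.1f}%")
print(f"{percent_identity('AC-T', 'ACGT'):.1f}%")
```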

Despite such improvements, a few challenges remain, including cost and a lack of expertise. Scientific articles containing sequences can be searched, but access to the full text is not always available. A wide gap therefore exists worldwide for advanced tools. This is pushing private institutions to build databases with sequence-search features in their dashboards, which can be an alternative for IP practitioners, science graduates and industry scientists.


As for the future prospects of sequence searching, new tools need to be developed that make searches more accurate and offer a variety of alignment algorithms to optimise workflow, accuracy and efficiency. Looking forward, there is the promise of sequence searching across all jurisdictions, as well as searches of scientific articles, through a cost-effective graphical user interface, with genome libraries (of various plants, organisms and animals) integrated into databases. However, it remains to be seen when and how these will become available.

This is an Insight article, written by a selected partner as part of IAM's co-published content.
