Tutorial for COBS Python Interface
Installation
Installation of COBS with Python interface is easy using pip. The package name
on PyPI is cobs_index, and you need cmake and a recent C++11 compiler to build
the C++ library source.
$ pip install --user cobs_index
Document Lists
COBS can read and create an index from the following document types:
FastA (.fasta, .fa, .fasta.gz, .fa.gz)
FastQ (.fastq, .fq, .fastq.gz, .fq.gz)
McCortex (.ctx, .cortex)
text files (.txt)
MultiFastA (.mfasta)
The document types are identified by extension, and compressed .gz files are
handled transparently. The set of k-mers extracted from each file type is
handled slightly differently: for FastA files each continuous subsequence is
broken into k-mers individually, while McCortex files explicitly list all
k-mers, and for text files the entire continuous file is broken into
k-mers. Each document creates one entry in the index, except for MultiFastA, where
each subsequence is considered an individual document.
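The difference between per-subsequence and whole-file extraction can be sketched in plain Python. This is only an illustration of the description above, not COBS's actual parser: the helper names `kmers`, `fasta_kmers`, and `text_kmers` are hypothetical, and FastA parsing is reduced to a list of already-split subsequences.

```python
def kmers(sequence, k):
    """Return all overlapping k-mers of one continuous sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def fasta_kmers(subsequences, k):
    """FastA-style: each subsequence is broken into k-mers individually,
    so no k-mer spans the boundary between two subsequences."""
    result = []
    for subsequence in subsequences:
        result.extend(kmers(subsequence, k))
    return result


def text_kmers(content, k):
    """Text-style: the entire continuous content is broken into k-mers."""
    return kmers(content, k)


# Two short subsequences versus the same characters as one continuous text
print(fasta_kmers(["ACGT", "TTGA"], 3))
print(text_kmers("ACGTTTGA", 3))
```

With the same characters, the text-style extraction yields additional k-mers that straddle the point where the two subsequences meet, which is why the file type affects the indexed k-mer set.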
COBS usually scans a directory and creates an index containing all documents it
finds. For more fine-grained control, document lists are represented using
DocumentList objects. DocumentLists can be created empty or by scanning
a directory, files can be added, and they contain DocumentEntry objects
which can be iterated over.
import cobs_index as cobs

# Build a document list by scanning a directory
doclist1 = cobs.DocumentList("/path/to/documents")
print("doclist1: ({} entries)".format(len(doclist1)))
for i, d in enumerate(doclist1):
    print("doc[{}] name {} size {}".format(i, d.name, d.size))

# Start from an empty list and add documents explicitly
doclist2 = cobs.DocumentList()
doclist2.add("/path/to/single/document.fa")
doclist2.add_recursive("/path/to/documents", cobs.FileType.Fasta)
print("doclist2: ({} entries)".format(len(doclist2)))
for i, d in enumerate(doclist2):
    print("doc[{}] name {} size {}".format(i, d.name, d.size))
Index Construction
Compact indices are constructed using the functions compact_construct() or
compact_construct_list(). The first scans a directory for documents and
constructs an index from them, while the latter takes an explicit
DocumentList. Note that the output index file must end with .cobs_compact.
cobs.compact_construct("/path/to/documents", "my_index.cobs_compact")
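For the DocumentList variant, a small wrapper might look like the sketch below. The wrapper name `build_index_from_list` is hypothetical, and the argument order passed to compact_construct_list() is an assumption that mirrors compact_construct(); check the class documentation before relying on it. The suffix check simply enforces the .cobs_compact rule stated above before any work is done.

```python
def build_index_from_list(doclist, out_file, params=None):
    """Build a compact index from an explicit DocumentList (sketch)."""
    # The output index file must end with .cobs_compact
    if not out_file.endswith(".cobs_compact"):
        raise ValueError("output index file must end with .cobs_compact")

    # Imported lazily so the check above works even without COBS installed
    import cobs_index as cobs

    if params is None:
        params = cobs.CompactIndexParameters()
    # Assumption: same argument order as compact_construct(input, output, index_params=...)
    cobs.compact_construct_list(doclist, out_file, index_params=params)


# Example usage (requires cobs_index and real document paths):
#   import cobs_index as cobs
#   doclist = cobs.DocumentList()
#   doclist.add("/path/to/single/document.fa")
#   build_index_from_list(doclist, "my_index.cobs_compact")
```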
Parameters for index construction may be passed using a
CompactIndexParameters object. See the class documentation for a
complete list of parameters. The default parameters are a reasonable choice for
most DNA k-mer applications.
import cobs_index as cobs
p = cobs.CompactIndexParameters()
p.term_size = 31 # k-mer size
p.clobber = True # overwrite output and temporary files
p.false_positive_rate = 0.4 # higher false positive rate -> smaller index
cobs.compact_construct("/path/to/documents", "my_index.cobs_compact", index_params=p)
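To get a feel for why a higher false positive rate gives a smaller index, the standard Bloom filter sizing formula can be evaluated directly. This is a rough sketch under stated assumptions: it assumes each document's k-mers are stored in a Bloom filter with a single hash function, and the function name `signature_bits` is hypothetical. The sizes COBS actually chooses may differ.

```python
import math


def signature_bits(num_kmers, false_positive_rate, num_hashes=1):
    """Bits needed by a standard Bloom filter holding num_kmers elements.

    Uses m = -k * n / ln(1 - p^(1/k)) for k hash functions,
    n elements, and target false positive rate p.
    """
    per_element = -num_hashes / math.log(
        1.0 - false_positive_rate ** (1.0 / num_hashes)
    )
    return math.ceil(num_kmers * per_element)


# Compare filter sizes for one document with a million distinct k-mers
for rate in (0.1, 0.3, 0.4):
    bits = signature_bits(1_000_000, rate)
    print("false positive rate {:.1f}: {} bits".format(rate, bits))
```

The required size shrinks as the permitted false positive rate grows, which is the trade-off behind the false_positive_rate parameter.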
Besides compact indices, COBS also constructs and supports “classic” indices. These are, however, rarely used in practice and thus are not discussed further here.
Querying an Index
To query an index, first load it using a Search object. Constructing the object
detects the type of index, reads the metadata, and opens the entire file using
mmap.
Querying is performed with the Search.search() method. This method returns
a list containing pairs: (#occurrences, document name).
import cobs_index as cobs
s = cobs.Search("out.cobs_compact")
r = s.search("AGTCAACGCTAAGGCATTTCCCCCCTGCCTCCTGCCTGCTGCCAAGCCCT")
print(r)
# output: [(20, 'sample1'), (16, 'sample2'), ...]
With the default search parameters, all document scores are returned. For
large corpora, creating this Python list is a substantial overhead, so the
result set should be limited using either a) the threshold parameter or b) the
num_results parameter. The threshold determines the fraction of k-mers in the
query that a document must reach to be included in the result, while num_results
simply limits the list size to a given number.
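The effect of the two parameters can be illustrated on a plain result list. This sketch only mimics the semantics described above in pure Python; it does not call COBS. The function name `limit_results` is hypothetical, and details such as whether the comparison is inclusive or how COBS rounds the threshold are assumptions.

```python
def limit_results(results, num_query_kmers, threshold=0.0, num_results=0):
    """Filter (#occurrences, document name) pairs like a limited search.

    threshold:   minimum fraction of query k-mers a document must match
    num_results: maximum number of pairs to keep (0 means no limit)
    """
    min_hits = threshold * num_query_kmers
    # Assumption: a document exactly at the threshold is kept
    kept = [(hits, name) for hits, name in results if hits >= min_hits]
    # Highest-scoring documents first
    kept.sort(key=lambda pair: pair[0], reverse=True)
    if num_results > 0:
        kept = kept[:num_results]
    return kept


# A query with 20 k-mers and three scored documents
scores = [(20, "sample1"), (16, "sample2"), (3, "sample3")]
print(limit_results(scores, 20, threshold=0.5))
print(limit_results(scores, 20, num_results=1))
```

In a real query the limits would be passed to Search.search() itself, so that the full list is never materialized in Python.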