Counts with Functions

Collect word co-occurrence data, using a function-oriented approach.

Function Approach: collect_counts

The function for collecting co-occurrence data is collect_counts().

Given a list of search terms, this function handles all the requests to collect the data.

# Import function to collect data, and helper functions to analyze co-occurrence data
from lisc.collect import collect_counts
from lisc.analysis.counts import compute_normalization, compute_association_index
# Set some terms to search for
terms_a = [['protein'], ['gene']]
terms_b = [['heart'], ['lung']]
# Collect co-occurrence data across a single list of terms
coocs, term_counts, meta_dat = collect_counts(terms_a, db='pubmed', verbose=True)
Running counts for:  protein
Running counts for:  gene
# Check how many articles were found for each combination
print(coocs)
[[     0 655974]
 [655974      0]]
# Print out how many articles were found for each term
for term, count in zip(terms_a, term_counts):
    print('{:12} : {}'.format(term[0], count))
protein      : 2955255
gene         : 2056207

When given a single set of terms, the function collects counts of each term against every other term in the list.
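
To make this output easier to read, the counts can be printed alongside the term labels. The following is a minimal sketch that uses only the objects created above, with no additional LISC functionality:

# Print each pair of terms with its co-occurrence count
for term_row, row in zip(terms_a, coocs):
    for term_col, n_articles in zip(terms_a, row):
        print('{:8} & {:8} : {}'.format(term_row[0], term_col[0], n_articles))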

You can also specify two different sets of terms to collect. In the example below, co-occurrences are collected between each term in list A and each term in list B.

# Collect co-occurrence data across two different lists of terms
coocs, term_counts, meta_dat = collect_counts(
    terms_a=terms_a, terms_b=terms_b, db='pubmed', verbose=True)
Running counts for:  protein
Running counts for:  gene
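
This collection returns a co-occurrence matrix with one row per term in list A and one column per term in list B. As a minimal sketch, using only the objects created above, the result can be inspected with labeled rows and columns:

# Print the A x B co-occurrence matrix, with row and column labels
print('{:10}'.format('') + ''.join('{:>10}'.format(term[0]) for term in terms_b))
for term, row in zip(terms_a, coocs):
    print('{:10}'.format(term[0]) + ''.join('{:>10}'.format(val) for val in row))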

Calculating Co-occurrence Scores

Once co-occurrence data has been collected, we often want to compute a normalization or transform of the data.

The compute_normalization(), compute_association_index() and compute_similarity() functions take in co-occurrence data as returned by the collect_counts() function.

More details on the measures implemented in LISC are given in the Counts tutorial. When using the function approach, all implemented scores and transforms are available in lisc.analysis.

# Calculate the normalized co-occurrence measure, normalizing the co-occurrences by the A term counts
normed_coocs = compute_normalization(coocs, term_counts[0], dim='A')

# Check the normalized co-occurrence scores
print(normed_coocs)
[[0.02482628 0.03193159]
 [0.02157176 0.0323839 ]]
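
The dim argument specifies which set of term counts the normalization is applied across. As a sketch, assuming dim='B' is also accepted (as the argument above suggests), the co-occurrences could likewise be normalized by the counts of the B terms:

# Normalize the co-occurrences by the B term counts (assumes dim='B' is supported)
normed_coocs_b = compute_normalization(coocs, term_counts[1], dim='B')
print(normed_coocs_b)
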
# Compute the association index score, calculating the Jaccard index from the co-occurrences
score = compute_association_index(coocs, term_counts[0], term_counts[1])

# Check the computed association index scores
print(score)
[[0.01896413 0.02611793]
 [0.01479154 0.02428621]]
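
The compute_similarity() function mentioned above can be applied in the same way. The following is a sketch, assuming it can be imported from lisc.analysis.counts alongside the other measures and called with just the co-occurrence data:

# Import and apply the similarity measure (sketch: optional arguments left at their defaults)
from lisc.analysis.counts import compute_similarity
similarity = compute_similarity(coocs)
print(similarity)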

From here, further analysis of the collected co-occurrence data depends on the goals of the project.

There are also plot functions available, as demonstrated in the Counts tutorial.
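
As a stand-in for LISC's plot functions, the score matrix can also be visualized directly with matplotlib. This is a minimal sketch, not LISC's own plotting code:

# Plot the association index matrix as a labeled heatmap, using plain matplotlib
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
im = ax.imshow(score)
ax.set_xticks(range(len(terms_b)))
ax.set_xticklabels([term[0] for term in terms_b])
ax.set_yticks(range(len(terms_a)))
ax.set_yticklabels([term[0] for term in terms_a])
fig.colorbar(im)
plt.show()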
