Tutorial 01: Words Collection

Collecting literature data, including text and metadata for specified search terms.

Words Analysis

The ‘Words’ approach collects text and metadata from articles found for requested search terms.

# Import the Words object, which is used for words collection
from lisc import Words

# Import the SCDB object, which organizes a database structure for saved data
from lisc.io import SCDB

# Import a utility function for saving out collected data
from lisc.io import save_object

Words Object

The Words object is used to collect and analyze text data and article metadata.

Search terms are specified, as previously introduced, to find articles of interest, from which text data and metadata are collected.

# Set some search terms
terms = [['brain'], ['body']]
# Initialize Words object and set the terms to search for
words = Words()
words.add_terms(terms)

To get started, we will first run a collection of words data, collecting up to 5 articles for each search term, as specified by the retmax parameter.

# Collect words data
words.run_collection(retmax=5)

LISC Data Objects

LISC uses custom objects to store collected words data.

The Articles object stores data for each collected article.

Collected data includes:

  • titles

  • journals

  • authors

  • publication years

  • abstract text

  • keywords

  • DOIs

# Check the collected words data
print(words.results)
[<lisc.data.articles.Articles object at 0x7fa0aa175430>, <lisc.data.articles.Articles object at 0x7fa0aa2b6310>]
# Check some specific fields of the collected data
print(words.results[0].n_articles)
print(words.results[0].titles)
5
['Bench-to-bedside investigations of H3 K27-altered diffuse midline glioma: drug targets and potential pharmacotherapies.', 'Epidemiological risk factors and phylogenetic affinities of Sarcocystis infecting village chickens and pigs in Peninsular Malaysia.', 'Association of aEEG and brain injury severity on MRI at term-equivalent age in preterm infants.', 'Gut microbiota-derived short chain fatty acids act as mediators of the gut-brain axis targeting age-related neurodegenerative disorders: a narrative review.', 'Polyomavirus Wakes Up and Chooses Neurovirulence.']
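
Beyond the number of articles and titles shown above, the other fields listed earlier can be checked in the same way. A minimal sketch, assuming the Articles attributes mirror the field names from the list above (for example, journals, years, and dois):

# Check some additional fields of the collected data
#   Note: these attribute names are assumed to match the field list above
print(words.results[0].journals)
print(words.results[0].years)
print(words.results[0].dois)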

Word Collections

Collected words data from articles can become quite large. We will often want to use some of the available EUtils settings to help control what is collected, and how the data collection proceeds.

In the next example, we’ll revisit the same search terms we used in the previous co-occurrence analysis, and explore some of these settings.

# Set search terms of interest
terms = [['frontal lobe'], ['temporal lobe'], ['parietal lobe'], ['occipital lobe']]
words.add_terms(terms)
Unloading terms.

EUtils Settings

The Pubmed EUtils has several settings that can help control searches, including:

  • field : which part of the record to search

  • retmax : the maximum number of records to return for a given search

  • usehistory : whether to temporarily store results remotely and use them for interim requests

For some general guidelines (see the sketch after this list):

  • the field setting defaults to TIAB for titles and abstracts

  • the retmax should be set to an upper bound for the number of articles you would like to collect, especially if your search terms are likely to return a large number of articles

  • the usehistory parameter should be set to True if you are running a large collection, as this is more efficient
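
As a sketch of how these settings can be used, they can be passed as keyword arguments when running a collection, as retmax and usehistory are later in this tutorial. The values here are illustrative, not recommendations:

# Collect words data, passing EUtils settings as keyword arguments
#   Note: these example values are illustrative only
words.run_collection(field='TIAB', retmax=10, usehistory=False)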

Word Collection Settings

For larger collections, the collection may take a while and return a large amount of data.

Because of this, the Words object allows for continuously saving collected data. If the save_and_clear parameter is set to True, collected data is saved out and then cleared from the object after each term, so that the full collection does not have to be held in RAM.

Now, let’s run our bigger collection, using some of these settings.

# Set up our database object, so we can save out data as we go
db = SCDB('lisc_db')

# Collect words data
words.run_collection(usehistory=True, retmax=15, save_and_clear=True, directory=db)

After this collection, the Words object does not actually include the collected data, as the data was saved and cleared throughout the collection.

The Words object does still have all the information about the search terms, which can be used to reload the data, so it is still worth saving as well.

We will analyze our words data in the next tutorial. For now, let's save out the Words object.

# Save out the words data
save_object(words, 'tutorial_words', directory=db)
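
When the data is needed again, for example in the next tutorial, the saved object can be reloaded. A minimal sketch, assuming load_object from lisc.io as the counterpart to save_object:

# Reload the saved Words object
#   Note: load_object is assumed here as the counterpart to save_object
from lisc.io import load_object
words = load_object('tutorial_words', directory=db)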
