Big data from smart thermometers used to track and predict flu activity

A study by researchers at the University of Iowa shows that anonymous data from a “smart thermometer” connected to a mobile phone app can track flu activity in real time at both population and individual levels. They also showed that this data can be used to improve flu forecasting. The study findings are published in the journal Clinical Infectious Diseases.

“We found the smart thermometer data are highly correlated with information obtained from traditional public health surveillance systems and can be used to improve forecasting of influenza-like illness activity, possibly giving warnings of changes in disease activity weeks in advance,” says lead study author Aaron Miller, PhD, a UI postdoctoral scholar in computer science. “Using simple forecasting models, we showed that thermometer data could be effectively used to predict influenza levels up to two to three weeks into the future. Given that traditional surveillance systems provide data with a lag time of one to two weeks, this means that estimates of future flu activity may actually be improved up to four or five weeks earlier.”

Miller and senior study author Philip Polgreen, MD, UI associate professor of internal medicine and epidemiology, analyzed de-identified data from the commercially available Kinsa Smart Ear Thermometers and accompanying app, which recorded users’ temperature measurements over a study period from Aug. 30, 2015, to Dec. 23, 2017. There were over 8 million temperature readings generated by almost 450,000 unique devices. The smart thermometers encrypt device identities to protect user privacy and also give users the option of providing anonymized information on age or sex. Readings were reported from all 50 states and were aggregated to provide region- and age-group-specific flu activity estimates.

The UI team compared the data from the smart thermometers to influenza-like illness (ILI) activity data gathered by the Centers for Disease Control and Prevention (CDC) from health care providers across the country. They found that the de-identified smart thermometer data were highly correlated with ILI activity at national and regional levels and for different age groups.

Current forecasts rely on this CDC data, but even at its fastest, the information is almost two weeks behind real-time flu activity. The UI study showed that adding thermometer data, which capture clinically relevant symptoms (fever) likely even before a person goes to the doctor, to simple forecasting models improved predictions of flu activity. This approach accurately predicted influenza activity at least three weeks in advance.
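
For readers curious what such a “simple forecasting model” might look like, below is a minimal sketch, not the authors’ actual model: an ordinary least squares regression that predicts ILI activity a few weeks ahead from lagged ILI values plus a thermometer-derived fever signal. All data and variable names are simulated placeholders.

```python
# Minimal sketch: predict ILI `horizon` weeks ahead from lagged ILI values
# and lagged thermometer-derived fever readings. Data are simulated.
import numpy as np

def make_design(ili, fever, n_lags=3, horizon=2):
    """Build a lagged-feature design matrix and targets."""
    X, y = [], []
    for t in range(n_lags, len(ili) - horizon):
        feats = [1.0]                        # intercept
        feats += list(ili[t - n_lags:t])     # recent ILI history
        feats += list(fever[t - n_lags:t])   # recent fever signal
        X.append(feats)
        y.append(ili[t + horizon])           # target: ILI `horizon` weeks out
    return np.array(X), np.array(y)

rng = np.random.default_rng(0)
weeks = 120
season = 2 + np.sin(np.arange(weeks) * 2 * np.pi / 52)   # yearly cycle
ili = season + 0.1 * rng.standard_normal(weeks)
# Toy assumption: this week's fever signal resembles next week's ILI,
# i.e., fever readings lead official surveillance.
fever = np.roll(ili, -1) + 0.1 * rng.standard_normal(weeks)

X, y = make_design(ili, fever)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # fit by least squares
pred = X @ coef
print("in-sample RMSE:", np.sqrt(np.mean((pred - y) ** 2)))
```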

Miller notes that the smart thermometers also provide a way to estimate which age groups are most affected during a flu season, using de-identified data. Monitoring the duration of fever from the smart thermometer readings also revealed that fevers occurring during flu season were more likely to last three to six days and much less likely to last only one day. Fevers lasting seven or more days were not at all seasonal. The data also identified instances where users had a fever that went away for a few days and then returned. The researchers believe this so-called “biphasic” fever pattern may reflect more serious illness. The second temperature spike can indicate a secondary bacterial infection, such as pneumonia, that sets in after the flu and can lead to more severe health problems, especially in older individuals.

Citation:  Miller, Aaron C., Inder Singh, Erin Koehler, and Philip M. Polgreen. “A Smartphone-Driven Thermometer Application for Real-Time Population- and Individual-Level Influenza Surveillance.” Clinical Infectious Diseases, 2018. doi:10.1093/cid/ciy073.

Adapted from press release by the University of Iowa.

ResistoMap developed to track worldwide microbial drug resistance

Scientists from the Federal Research and Clinical Center of Physical-Chemical Medicine, the Moscow Institute of Physics and Technology, and Data Laboratory have created ResistoMap, an interactive visualization of the gut resistome. The gut resistome is the human gut microbiota’s potential to resist antibiotics: the set of all antibiotic resistance genes in the genomes of human gut microbes. ResistoMap will help identify national trends in antibiotic use and help control antibiotic resistance on a global scale. This research is published in the journal Bioinformatics.

ResistoMap, an interactive world map of the human gut microbiota’s potential to resist antibiotics. Credit: Bioinformatics

Microbial drug resistance is caused by the extensive uncontrolled use of antibiotics in medicine and agriculture. It has been predicted that by 2050 around 10 million people will die annually due to reasons associated with drug resistance.

The ResistoMap has two main interactive work fields: a geographic map and a heat map. A user can choose the antibiotic group or country of interest to be displayed on the heat map and obtain a resistome cross section. The data can be filtered by the country of origin, gender, age, and diagnosis. The current version of the interactive map developed by the authors draws on a dataset that includes over 1600 individuals from 12 studies covering 15 countries. However, the dataset can be expanded by additional input from users reflecting the findings of new published studies in a unified format.
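
As a rough illustration of the kind of filtering and cross-section behind such a heat map, here is a minimal pandas sketch; the column names and numbers are invented, and ResistoMap itself is a web tool rather than this code.

```python
# Minimal sketch: filter resistome samples by metadata, then pivot into a
# country-by-drug-group table of the kind a resistome heat map displays.
import pandas as pd

# Each row: one gut metagenome sample with a resistome level per drug group.
samples = pd.DataFrame({
    "country":    ["Denmark", "Denmark", "France", "France", "China"],
    "age":        [34, 51, 42, 29, 60],
    "gender":     ["F", "M", "F", "M", "M"],
    "drug_group": ["fluoroquinolones"] * 5,
    "resistome":  [0.8, 1.1, 3.2, 2.9, 4.1],   # arbitrary abundance units
})

# Filter by metadata (here: adults), then aggregate per country and drug group.
adults = samples[samples["age"].between(18, 65)]
heatmap = adults.pivot_table(index="country", columns="drug_group",
                             values="resistome", aggfunc="median")
print(heatmap)
```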

Using ResistoMap, the researchers say, it is possible to estimate global variation in resistance to different groups of antibiotics and to explore associations between specific drugs and clinical factors or other metadata. For example, Danish gut metagenomes tend to show the lowest resistome levels among the European groups, whereas French samples have the highest, particularly for fluoroquinolones, a group of broad-spectrum antibacterial drugs. This agrees with the fact that France has the highest total antibiotic use in Western Europe, while the use of antimicrobial drugs in Denmark and Germany is moderate, both in health care and in agriculture. At the opposite end of the spectrum, Chinese and Russian populations appear to have elevated resistome levels, likely due to looser regulation policies, frequent prescription of broad-spectrum antibiotics, and their over-the-counter availability. The lowest resistome levels are observed in a native population of Venezuela with no documented contact with populations of developed countries. ResistoMap-informed analysis reveals certain novel trends that await further interpretation from the clinical standpoint.

Konstantin Yarygin, one of the creators of the visualization tool, says, “We anticipate that the exploratory analysis of global gut resistome enabled by the ResistoMap will provide new insights into how the use of antibiotics in medicine and agriculture could be optimized.”

Citation: Yarygin, Konstantin S., Boris A. Kovarsky, Tatyana S. Bibikova, Damir S. Melnikov, Alexander V. Tyakht, and Dmitry G. Alexeev. “ResistoMap—online visualization of human gut microbiota antibiotic resistome.” Bioinformatics, 2017.
doi:10.1093/bioinformatics/btx134.
Research funding: Russian Scientific Foundation.
Adapted from press release by the Moscow Institute of Physics and Technology.

Dutch universities collaborate on big data in health to understand disease process

Patients with the same illness often receive the same treatment, even if the cause of the illness differs from person to person.

Six Dutch universities are combining forces to chart the different disease processes for a range of common conditions. This represents a new step towards ultimately being able to offer every patient more personalized treatment. The results of this study have been published in two articles in the authoritative scientific journal Nature Genetics.

The researchers were able to make their discoveries thanks to new techniques that make it possible to simultaneously measure the regulation and activity of all the genes of thousands of people, and to link these data to millions of genetic differences in their DNA. The combined analysis of these ‘big data’ made it possible to determine which molecular processes in the body become dysregulated for a range of disparate diseases, from prostate cancer to ulcerative colitis, before the individuals concerned actually become ill.
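
The core statistical operation here, testing whether a DNA variant is associated with the activity of a gene across individuals (an expression quantitative trait locus, or eQTL, test), can be illustrated with a minimal simulated sketch; the actual studies ran millions of such tests with extensive covariate correction.

```python
# Minimal eQTL-style sketch: regress a gene's expression on genotype
# (allele dosage coded 0/1/2) across individuals. Data are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 2000                                          # number of individuals
genotype = rng.integers(0, 3, size=n)             # allele dosage per person
expression = 0.3 * genotype + rng.standard_normal(n)  # simulated effect = 0.3

# Simple linear regression of expression on genotype.
slope, intercept, r, p_value, stderr = stats.linregress(genotype, expression)
print(f"effect size = {slope:.2f}, p = {p_value:.2e}")
```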

“The emergence of ‘big data’, ever faster computers and new mathematical techniques means it’s now possible to conduct extremely large-scale studies and gain an understanding of many diseases at the same time,” explains Lude Franke (UMCG), head of the research team in Groningen. The researchers show how thousands of disease-related DNA differences disrupt the internal working of a cell and how their effect can be influenced by environmental factors. And all this was possible without the need for a single lab experiment.

The success of this research is the result of the decision taken six years ago by biobanks throughout the Netherlands to share data and biomaterials within the BBMRI consortium. This decision meant it became possible to gather, store and analyze data from blood samples of a very large number of volunteers. The present study illustrates the tremendous value of large-scale collaboration in the field of medical research in the Netherlands.

Bas Heijmans (LUMC), research leader in Leiden and initiator of the partnership, says: “The Netherlands is leading the field in sharing molecular data. This enables researchers to carry out the kind of large-scale studies that are needed to gain a better understanding of the causes of diseases. This result is only just the beginning: once they have undergone a screening, other researchers with a good scientific idea will be given access to this enormous bank of anonymized data. Our Dutch ‘polder mentality’ is also advancing science.”

Mapping the various molecular causes for a disease is the first step towards a form of medical treatment that better matches the disease process of individual patients. To reach that ideal, however, we still have a long way to go. The large-scale molecular data that have been collected for this research are the cornerstone of even bigger partnerships, such as the national Health-RI initiative. The third research leader, Peter-Bram ’t Hoen (LUMC), says: “Large quantities of data should eventually make it possible to give everyone personalized health advice, and to determine the best treatment for each individual patient.”

The research has been made possible thanks to the cooperation within the BBMRI biobank consortium of six long-running Dutch population studies carried out by the university medical centres in Groningen (LifeLines), Leiden (Leiden Longevity Study), Maastricht (CODAM Study), Rotterdam (Rotterdam Study), Utrecht (Netherlands Prospective ALS Study) and by the Vrije Universiteit (Netherlands Twin Register). The molecular data were generated in a standardized fashion at a central site (Human Genomics Facility HuGE-F, ErasmusMC) and subsequently securely stored and analyzed at a second central site (SURFSara). The study links in with the Personalised Medicine route of the National Research Agenda and the Health-RI and M3 proposals on the large-scale research infrastructure agenda of the Royal Netherlands Academy of Arts and Sciences (KNAW).

Citations:
1. Bonder, Marc Jan, René Luijk, Daria Zhernakova, Matthijs Moed, Patrick Deelen, Martijn Vermaat, Maarten van Iterson, et al. “Disease variants alter transcription factor levels and methylation of their binding sites.” Nature Genetics, 2016. doi:10.1038/ng.3721.

2. Zhernakova, Daria V., Patrick Deelen, Martijn Vermaat, Maarten van Iterson, Michiel van Galen, Wibowo Arindrarto, et al. “Identification of context-dependent expression quantitative trait loci in whole blood.” Nature Genetics, 2016. doi:10.1038/ng.3737.

Adapted from press release by Leiden University.

Researchers use multi-task deep neural networks to automatically extract data from cancer pathology reports

Despite steady progress in detection and treatment in recent decades, cancer remains the second leading cause of death in the United States, cutting short the lives of approximately 500,000 people each year. To better understand and combat this disease, medical researchers rely on cancer registry programs–a national network of organizations that systematically collect demographic and clinical information related to the diagnosis, treatment, and history of cancer incidence in the United States. The surveillance effort, coordinated by the National Cancer Institute (NCI) and the Centers for Disease Control and Prevention, enables researchers and clinicians to monitor cancer cases at the national, state, and local levels.

Much of this data is drawn from electronic, text-based clinical reports that must be manually curated, a time-intensive process, before it can be used in research.

A representation of a deep learning neural network designed to intelligently extract text-based information from cancer pathology reports. Credit: Oak Ridge National Laboratory

Since 2014, Georgia Tourassi of Oak Ridge National Laboratory (ORNL) has led a team focused on creating software that can quickly identify valuable information in cancer reports, an ability that would not only save time and worker hours but also potentially reveal overlooked avenues in cancer research. After experimenting with conventional natural-language-processing software, the team’s most recent progress has emerged via deep learning, a machine-learning technique that employs algorithms, big data, and the computing power of GPUs to emulate human learning and intelligence.

Using the Titan supercomputer at the Oak Ridge Leadership Computing Facility, a DOE Office of Science User Facility located at ORNL, Tourassi’s team applied deep learning to extract useful information from cancer pathology reports, a foundational element of cancer surveillance. Working with modest datasets, the team obtained preliminary findings that demonstrate deep learning’s potential for cancer surveillance.

The continued development and maturation of automated data tools, among the objectives outlined in the White House’s Cancer Moonshot initiative, would give medical researchers and policymakers an unprecedented view of the US cancer population at a level of detail typically obtained only for clinical trial patients, historically less than 5 percent of the overall cancer population.

Creating software that can understand not only the meaning of words but also the contextual relationships between them is no simple task. Humans develop these skills through years of back-and-forth interaction and training. For specific tasks, deep learning compresses this process into a matter of hours.

Typically, this context-building is achieved through the training of a neural network, a web of weighted calculations designed to produce informed guesses on how to correctly carry out tasks, such as identifying an image or processing a verbal command. Data fed to a neural network, called inputs, and select feedback give the software a foundation to make decisions based on new data. This algorithmic decision-making process is largely opaque to the programmer, a dynamic akin to a teacher with little direct knowledge of her students’ perception of a lesson.

GPUs, such as those in Titan, can accelerate this training process by quickly executing many deep-learning calculations simultaneously. In two recent studies, Tourassi’s team used accelerators to tune multiple algorithms, comparing results to more traditional methods. Using a dataset composed of 1,976 pathology reports provided by NCI’s Surveillance, Epidemiology, and End Results (SEER) Program, Tourassi’s team trained a deep-learning algorithm to carry out two different but closely related information-extraction tasks. In the first task the algorithm scanned each report to identify the primary location of the cancer. In the second task the algorithm identified the cancer site’s laterality–or on which side of the body the cancer was located.

By setting up a neural network designed to exploit the related information shared by the two tasks, an arrangement known as multitask learning, the team found the algorithm performed substantially better than competing methods.
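
A minimal sketch of that multitask arrangement, assuming fixed-length feature vectors derived from each report, might look like the following; the dimensions, data, and architecture details are illustrative, not the team’s actual model.

```python
# Minimal multitask sketch: a shared encoder feeds two output heads, one for
# primary site and one for laterality, trained on the sum of both losses.
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    def __init__(self, n_features=300, n_sites=12, n_lateralities=2):
        super().__init__()
        self.shared = nn.Sequential(              # layers shared by both tasks
            nn.Linear(n_features, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
        )
        self.site_head = nn.Linear(64, n_sites)               # task 1
        self.laterality_head = nn.Linear(64, n_lateralities)  # task 2

    def forward(self, x):
        h = self.shared(x)
        return self.site_head(h), self.laterality_head(h)

model = MultiTaskNet()
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One toy training step on random "report features" and random labels.
x = torch.randn(32, 300)
site_y = torch.randint(0, 12, (32,))
lat_y = torch.randint(0, 2, (32,))
site_logits, lat_logits = model(x)
loss = loss_fn(site_logits, site_y) + loss_fn(lat_logits, lat_y)
opt.zero_grad()
loss.backward()
opt.step()
print("combined loss:", float(loss))
```

Sharing the encoder is what lets gradients from the laterality task improve the site task and vice versa, the effect the team exploited.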

Another study carried out by Tourassi’s team used 946 SEER reports on breast and lung cancer to tackle an even more complex challenge: using deep learning to match the cancer’s origin to a corresponding topological code, a classification that’s even more specific than a cancer’s primary site or laterality, with 12 possible answers.

The team tackled this problem by building a convolutional neural network, a deep-learning approach traditionally used for image recognition, and feeding it language from a variety of sources. Text inputs ranged from general (e.g., Google search results) to domain-specific (e.g., medical literature) to highly specialized (e.g., cancer pathology reports). The algorithm then took these inputs and created a mathematical model that drew connections between words, including words shared between unrelated texts.
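
As a rough sketch of a convolutional network over text, the general approach described above: word indices are embedded, convolved with several window sizes, max-pooled, and classified into the 12 topological codes. All sizes here are assumptions.

```python
# Minimal text-CNN sketch: embeddings -> 1-D convolutions over several
# word-window sizes -> max pooling -> 12-way classifier.
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab=20000, emb=100, n_filters=64, n_classes=12):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        # Convolutions over 3-, 4-, and 5-word windows, as in common text CNNs.
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb, n_filters, k) for k in (3, 4, 5)])
        self.out = nn.Linear(3 * n_filters, n_classes)

    def forward(self, tokens):                    # tokens: (batch, seq_len)
        e = self.embed(tokens).transpose(1, 2)    # -> (batch, emb, seq_len)
        pooled = [c(e).relu().max(dim=2).values for c in self.convs]
        return self.out(torch.cat(pooled, dim=1))

logits = TextCNN()(torch.randint(0, 20000, (8, 400)))  # 8 toy "reports"
print(logits.shape)  # torch.Size([8, 12])
```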

Comparing this approach to more traditional classifiers, such as a vector space model, the team observed incremental improvement in performance as the network absorbed more cancer-specific text. These preliminary results will help guide Tourassi’s team as they scale up deep-learning algorithms to tackle larger datasets and move toward less supervision, meaning the algorithms will make informed decisions with less human intervention.

In 2016 Tourassi’s team learned its cancer surveillance project will be developed as part of DOE’s Exascale Computing Project, an initiative to develop a computing ecosystem that can support an exascale supercomputer–a machine that can execute a billion billion calculations per second. Though the team has made considerable progress in leveraging deep learning for cancer research, the biggest gains are still to come.

Citation: Yoon, Hong-Jun, Arvind Ramanathan, and Georgia Tourassi. “Multi-task Deep Neural Networks for Automated Extraction of Primary Site and Laterality Information from Cancer Pathology Reports.” In INNS Conference on Big Data, pp. 195-204. Springer International Publishing, 2016.
doi:10.1007/978-3-319-47898-2_21.
Adapted from press release by US Department of Energy, Oak Ridge National Laboratory.

Researchers use big data analytics to discover harmful drug interactions

Coupling data mining of adverse event reports and electronic health records with targeted laboratory experiments, researchers found a way to identify and confirm previously unknown drug interactions, according to a study published in the Journal of the American College of Cardiology.

Drug-drug interactions account for a significant proportion of side effects and hospitalizations, but they are often very difficult to predict. Researchers in the new study used the data mining approach to discover that together, two commonly used drugs – a popular over-the-counter medication for heartburn relief and an antibiotic used to prevent and treat infection – were associated with an increased risk of acquired long QT syndrome, which can lead to life-threatening arrhythmias or problems with the way the heart beats.

“Doctors must often rely on a wait-and-see approach to monitor safety when patients are taking multiple medicines. By using large datasets of clinical records available from the Food and Drug Administration and in electronic health records at our hospital, we were able to use data science to accurately identify a previously unexpected interaction from among millions of possibilities, which would not have been suspected using current surveillance methods,” said Nicholas Tatonetti, Ph.D., Herbert Irving Assistant Professor, Department of Biomedical Informatics at Columbia University and one of the study’s authors.

The researchers chose to investigate QT interval prolongation because of its importance in drug safety and drug development. The QT interval is the measure of the time between the start and the end of the cardiac electrical cycle as recorded by the electrocardiogram. With QT prolongation, it takes longer to transmit electrical signals through the heart muscle, which can lead to serious, even fatal, heart rhythm disturbances. “Investigational drugs that have the potential to prolong the QT interval will be withdrawn before they are ever given to a patient; however, no such checks exist for drug-drug interactions and they often go undiscovered for years,” said Tatonetti.

Using an algorithm called Latent Signal Detection, researchers scanned data from two independent databases to investigate possible QT interval-prolonging drug-drug interactions: 1.8 million adverse event reports from the U.S. Food and Drug Administration’s Adverse Event Reporting System and 1.6 million electrocardiograms from 382,221 patients treated at New York-Presbyterian/CUMC between 1996 and 2014. A computer can evaluate millions of data points all at once and flag the most likely drug-drug interactions.
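
The study’s latent signal detection method is considerably more sophisticated, but the underlying idea, flagging drug pairs whose reports mention QT prolongation more often than either drug alone would suggest, can be sketched in a few lines; the data below are invented.

```python
# Simplified sketch of mining adverse event reports for a drug-pair signal.
# NOT the study's latent signal detection algorithm; data are hypothetical.
import pandas as pd

# One row per adverse event report: which drugs were taken, and whether a
# QT-prolongation event was reported.
reports = pd.DataFrame({
    "drug_a":   [1, 1, 1, 0, 0, 1, 0, 1],   # e.g., the antibiotic present
    "drug_b":   [1, 0, 1, 1, 0, 1, 0, 0],   # e.g., the heartburn drug present
    "qt_event": [1, 0, 1, 0, 0, 1, 0, 0],
})

def qt_rate(mask):
    """QT-event rate among reports matching a boolean mask."""
    sub = reports[mask]
    return sub["qt_event"].mean() if len(sub) else float("nan")

rate_combo = qt_rate((reports.drug_a == 1) & (reports.drug_b == 1))
rate_a = qt_rate((reports.drug_a == 1) & (reports.drug_b == 0))
rate_b = qt_rate((reports.drug_b == 1) & (reports.drug_a == 0))
print(f"QT rate: combo={rate_combo:.2f}, A alone={rate_a:.2f}, "
      f"B alone={rate_b:.2f}")
# A combination rate well above either single-drug rate would be flagged for
# follow-up statistical analysis and laboratory confirmation.
```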

Even with large amounts of data, the association between the drug combination and the prolonged QT interval does not prove the drugs caused this problem. To get more information, researchers then applied more traditional analyses and laboratory experiments to validate the predictions.

In this study, people taking ceftriaxone (a cephalosporin antibiotic) and lansoprazole (a proton pump inhibitor) were 40 percent more likely to have a QT interval above 500 ms, the current FDA-stated threshold of clinical concern. Among men taking both of these drugs, QT intervals were 12 ms longer than in men who took either drug alone. This trend was then validated by cellular data from the electrophysiology experiment, which found that together these drugs block one of the cardiac ion channels responsible for controlling heart rhythm. White women and men appear to be more sensitive to this interaction.

Interestingly, the interaction identified in the data analysis was specific to lansoprazole and ceftriaxone, not to other cephalosporin antibiotics. Even though the antibiotics share a similar chemical structure and mode of action, Tatonetti said, “our algorithm was able to distinguish between one that would cause this interaction (ceftriaxone) and another that would not (cefuroxime). We tested both in the lab and the algorithm was correct in both cases.”

This research comes at a time when an increasing number of Americans are taking multiple prescriptions and over-the-counter medications. All told, nearly 70 percent take at least one prescription drug, and more than half take two, according to a 2013 study published in the Mayo Clinic Proceedings. Among the most commonly prescribed drugs are antibiotics. Twenty percent of patients are on five or more prescription medications. Finding ways to identify potentially harmful interactions is critical.

In an accompanying editorial, Dan M. Roden, M.D., and colleagues wrote that the findings of this study are not robust enough to advise clinicians to avoid this combination in all patients, but it shows that it is important to examine the effects of these drugs individually and in combination in patients.

The editorial said that with an aging population, it is becoming more common for patients to be on multiple medications, making it more important than ever to find a faster data-driven approach to identifying potential interactions among a vast number of possible drug pairs patients could be taking.

“Solving the methodological challenges of developing approaches to systematically leverage these data sources will be a next frontier in identifying and preventing adverse drug reactions,” the authors said.

While the new study was limited to common drug combinations prescribed in one hospital system, Tatonetti and his team believe that analyses made possible through wider data sharing can be used to look more extensively at potential drug interactions.

“The analyses are relatively rapid and inexpensive to perform and they focus on drug combinations that are actually used together in clinical practice,” Tatonetti said.

Publication: Coupling Data Mining and Laboratory Experiments to Discover Drug Interactions Causing QT Prolongation. Journal of the American College of Cardiology, 2016. doi:10.1016/j.jacc.2016.07.761.
Adapted from press release by the American College of Cardiology.

Data analysis using Integrated Microbial Next Generation Sequencing (IMNGS) enables worldwide bacterial analysis

Sequencing data from biological samples such as skin, intestinal tissue, or soil and water are usually archived in public databases, allowing researchers from all over the globe to access them. However, this has led to extremely large quantities of data, and new evaluation methods are needed to explore them all. Scientists at the Technical University of Munich (TUM) have developed a bioinformatics tool that allows users to search all the bacterial sequences in these databases in just a few mouse clicks, to find similarities, or to check whether a particular sequence exists.

Microbial communities are essential components of ecosystems around the world. They play a central role in key biological functions, ranging from the carbon and nitrogen cycles in the environment to the regulation of immune and metabolic processes in animals and humans. That is why many scientists are currently investigating microbial communities in great detail.

The Sanger sequencing method, developed in 1975, was the gold standard for deciphering the DNA code for 30 years. More recently, next-generation sequencing (NGS) technologies have led to a new revolution: with minimal personnel requirements, current devices can generate within 24 hours as much data as a hundred runs of the very first DNA sequencing method.

Today, sequencing analysis of bacterial 16S rRNA genes is the most frequently used method for identifying bacteria. The 16S rRNA genes are seen as ideal molecular markers for reconstructing the degree of relationship between organisms, as their sequence of nucleotides (the building blocks of DNA) has been relatively conserved throughout evolution and can be used to infer phylogenetic relationships between microorganisms. The acronym rRNA stands for ribosomal ribonucleic acid.

The Sequence Read Archive (SRA), a public database for the deposition of sequences, currently stores over 100,000 such 16S rRNA gene sequence datasets. This is because the new technical procedures for DNA sequencing have caused the volume and complexity of genome research data to grow exponentially over the past few years. The SRA is thus home to datasets that previously could not be evaluated as a whole.

“Over all these years, a tremendous amount of sequence data from human environments such as the intestine or skin, but also from soils or the ocean, has accumulated,” explains Dr. Thomas Clavel from the Institute for Food and Health (ZIEL) at TU Munich. “We have now created a tool which allows these databases to be searched in a relatively short amount of time in order to study the diversity and habitats of bacteria. With this tool, a scientist can conduct a query within a few hours to find out in which types of samples a bacterium of interest can be found, for example a pathogen from a hospital. This was not possible before.” The new platform is called Integrated Microbial Next Generation Sequencing (IMNGS) and can be accessed at www.imngs.org.
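
Conceptually, the kind of query IMNGS answers, “in which samples does a sequence close to mine occur?”, can be sketched with a toy k-mer comparison; real pipelines rely on dedicated alignment and clustering tools, and the sequences and cutoff below are invented.

```python
# Toy sketch of a sequence query: compare a query's k-mer profile against a
# small collection of amplicon sequences and report close matches.
def kmers(seq, k=8):
    """Set of all overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

# Hypothetical amplicon collection; in practice these would come from the
# thousands of preprocessed 16S rRNA gene datasets IMNGS indexes.
amplicons = {
    "gut_sample_17":  "ACGTACGGTTAGCCGGAACGTTAGCAGGCTTAACCTGACGT",
    "soil_sample_03": "TTGACGGAGCCTTAACGGTTAGATCCGGAAGGAAGGTCCCG",
}

query = "ACGTACGGTTAGCCGGAACGTTAGCAGGCTTAACCTG"  # hypothetical 16S fragment
qk = kmers(query)
for name, seq in amplicons.items():
    similarity = len(qk & kmers(seq)) / max(len(qk), 1)
    if similarity > 0.8:                          # arbitrary cutoff for a hit
        print(name, f"k-mer similarity {similarity:.2f}")
```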

A detailed description of how IMNGS works, using the intestinal bacterium Acetatifactor muris as an example, has been published in the current online issue of Scientific Reports. Registered users can carry out queries filtered by the origin of the bacterial data, or download entire sequences.

Such bioinformatics approaches may soon become indispensable in routine clinical diagnostics. However, one critical issue is that many members of complex microbial communities remain to be described. “Improving the quality of sequence datasets by collecting new reference sequences is a great challenge ahead,” says Clavel. “Moreover, the quality of datasets is not yet good enough: the description of individual samples in databases is incomplete, and hence the comparison possibilities using IMNGS are currently still limited.”

However, Clavel imagines that a collaboration with clinics could be a catalyst for progress, provided the database is filled more meticulously. “If we had very well-maintained databases, we could use innovative tools such as IMNGS to help diagnose chronic illnesses more rapidly,” says Clavel.

Publication: IMNGS: A comprehensive open resource of processed 16S rRNA microbial profiles for ecology and diversity studies. Scientific Reports. doi:10.1038/srep33721.
Adapted from press release by the Technical University of Munich.