HST.950J | Fall 2010 | Graduate

Biomedical Computing

Assignments

There are four homework assignments for this course, in addition to the final project and presentation. See the calendar for posting and due dates.

ASSN # | ASSIGNMENTS
1 | Clinical databases
2 | Extracting meaning from text
3 | Gene expression and mathematical foundations
4 |

Project proposals:

Please submit proposals for projects for the class. Students may complete the project alone or pair up with one or two other students. It is generally quite helpful for each team to include students with different backgrounds. Topics for projects can include anything that is relevant to the class material. Your proposal should include a list of the people who will be working on the project, a title, and a one-page description, with enough detail that the instructor can provide feedback to help direct your work.

Final Project Presentations

Each individual student will have a 15-minute slot, including time to set up and take questions; teams will have 25 minutes. Presentations themselves should be no longer than 10-12 minutes for individuals, and about 20 minutes for teams.

Part 1

This assignment contains two parts. The first asks you to consider some of the important issues underlying the motivations for healthcare IT systems, and some of the policy issues that influence their adoption and use. Much of the material can be answered from the readings. Please give concise, thoughtful answers.

  1. What is an EHR? Outline the principal advantages and disadvantages, compared to a paper record.
  2. From your own experience with health care and the information infrastructure that it uses, how do you think we compare in practice to the vision outlined in Shortliffe’s first chapter?
  3. Discuss the relative advantages and disadvantages of unstructured text entry into an EHR vs. fully coded information.
  4. Briefly describe the hypothetico-deductive method, and its relation to health care. What aspects of this do you think would be simplest and most difficult to automate using computer processing?
  5. Given that nearly half of healthcare in the US is paid for by CMS (Centers for Medicare and Medicaid Services) of the federal government, could CMS simply mandate that all clinical data be standardized according to one standard, stored and reported in electronic form, and thus seamlessly shared among healthcare institutions? What would be the technical, medical, and political consequences of such a move?
  6. Some argue that it is essential to distinguish between an EHR that is meant to be shared among health care providers and a “Personal Health Record” (PHR) that is meant to inform patients and to allow them to keep track of their own diseases, treatments, immunizations, medications, etc. Muster some brief arguments for why these should be different, and some counter-arguments for why they should be the same.

Part 2

The second part of the assignment asks you to explore the organization and content of an extract of 300 patients’ data from an operational database taken in the mid-1990s. Although these were real data, we have worked very hard to de-identify the data, so all the names, addresses, medical record numbers, etc., that you see in the database are synthesized replacements for the actual patient identities. We have also gone through all the text fields in the data and removed or replaced all such identifying information. (This process will be discussed later in the term, as part of the issue of how to use clinical data for research purposes. If you find what you believe to be true identifying data that we missed, please let us know so we can correct it; however, no such data have been found in this database in the past decade.)

Refer to the following resources to help you review or learn the relevant aspects of how contemporary relational databases work, and how they have been adapted for use in EHRs.

  1. A general primer on relational databases and relational algebra: Date, C. J. An Introduction to Database Systems, some edition earlier than the 6th ed. Addison Wesley.
  2. A paper describing normal forms in relational databases
  3. A paper on generic data models. This is in line with the currently popular Semantic Web notion of using RDF as a general data model for any kind of data.
  4. You can find any number of helpful documents on-line. For example, MySQL, a database that is free for non-commercial use and available for Windows, Mac OS, and Linux systems, provides extensive and handy documentation.

Note: The following questions refer to a “scrubbed” clinical database, called cws (for Clinical Work Station). Due to residual concerns about confidentiality, the database has not been included in this publication, but the questions have been retained below for reference. As an alternative, the MIMIC II Database (http://www.physionet.org/mimic2/) is available free of charge to qualified researchers, and contains comprehensive clinical data from thousands of Intensive Care Unit patients that has been thoroughly de-identified (all personal health information has been removed and all dates have been changed).

You will need to have available to you a database system of some sort that accepts standard (or at least typical) SQL commands and into which you can load some version of the above data. We have found it easiest to use MySQL, for which implementations exist for Mac OS X, Windows, and various flavors of Unix/Linux. If you don’t already have a MySQL server running on your system, it may be obtained and freely installed from the MySQL download site.
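
Once the server is installed, loading the data is typically just a matter of creating a database and executing the distributed dump into it. A minimal sketch, assuming the data arrive as a SQL dump file (the name cws_dump.sql is hypothetical; substitute whatever file you are given):

  $ mysql -u root -p
  mysql> CREATE DATABASE cws;
  mysql> USE cws;
  mysql> source cws_dump.sql
  mysql> SHOW TABLES;

If the load succeeded, SHOW TABLES should list tables such as pat_demograph and pat_fin_acct.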

The following questions require you to examine and explore the database:

  1. The table pat_demograph has dozens of columns. Comment on the design of this table from the viewpoint of Kent’s article on relational forms.

  2. The table pat_fin_acct (key to billing operations) has a column called pat_num but clearly does not have only one row per patient. What is the primary key of this table?

  3. Give the SQL query to retrieve the names of all doctors in the database.

  4. Suppose you are doing medical research on Diabetes-Insipidus and need related patient documents. Give the query for retrieving documents of patients with Diabetes-Insipidus.

  5. Retrieve a table of the number of distinct patients who have each of the many problems listed in the problems table.

  6. Give three different queries, each of which will estimate the total number of patients being tracked in the database. If they result in different numbers, discuss why. (Ignore the possibility that the same patient is entered several times but with different identifiers.)

  7. Sometimes, de-normalized database structures are designed deliberately and defensibly. Consider the pat_test table, which stores with each (numerical) data value the low and high bounds of the normal range of that value. One might argue that these bounds are properties of a test, not a specific test result, and change at most infrequently, e.g., when the test equipment is re-calibrated or the chemistry of the test is altered. Nevertheless, can you defend the decision to do this in the way CWS does? If you chose an alternative design, in which these bounds data were kept in a separate table, what columns would such a table need, and what SQL expression would you use to retrieve a specific test value and its appropriate normal range?

  8. (SQL challenge): Formulate a query that retrieves patients who have had a series of two or more tests in which adjacent tests yield a value that is abnormally high immediately followed by a value that is abnormally low. Produce a list of the patients, which lab value was involved, and when the two results occurred. (Note: the hard part of this is making sure the values are adjacent, with no intervening values.)

  9. Consider a patient-oriented journal in which doctors will enter, at each visit, the following data:

    1. visit date
    2. chief complaint (unstructured text)
    3. results of exams, if performed, which include the following, but to which others may be later added
      1. physical exam
        1. pulse (beats/minute)
        2. respiration rate (breaths/minute)
        3. blood pressure (systolic and diastolic, in mmHg)
      2. total blood count
        1. red blood cell count
        2. white blood cell count
    4. diagnosis (ICD9 code)
    5. plan (unstructured text)
    6. provider name and signature

    Build a data model for such a patient journal, making reasonable assumptions as necessary, and write the SQL table definitions to implement it. Also mark the key fields. Make sure your design satisfies third normal form.

  10. Johnson’s “Generic Data Modeling” paper suggests that you could use an alternative design for the relational database, in which the attributes of an entity such as a visit are represented not all as distinct columns in the data, but as different properties of the entity, stored in a table with fewer columns but many more rows. Give a description of how you might transform your design above to such a representation. (A generic sketch of this pattern appears below.)
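
To illustrate the pattern (not as an answer to question 9), here is a minimal entity-attribute-value sketch in the spirit of Johnson’s paper; all table and column names are hypothetical:

  -- One narrow table replaces many wide, entity-specific columns.
  CREATE TABLE visit_fact (
    visit_id  INT          NOT NULL,  -- entity: which visit
    attribute VARCHAR(64)  NOT NULL,  -- property name, e.g. 'pulse'
    value     VARCHAR(255) NOT NULL,  -- property value, stored as text
    PRIMARY KEY (visit_id, attribute)
  );

  -- A pulse of 72 beats/minute recorded at visit 1001 becomes one row:
  INSERT INTO visit_fact VALUES (1001, 'pulse', '72');

Note the trade-off: new attributes can be added without changing the schema, but every query must pivot rows back into columns, and the database can no longer type-check the values.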

This problem set asks you to run a relatively simple machine learning algorithm that is designed to work well with text extraction problems on some of the textual data that you encountered in homework assignment 1. The algorithm, called BoosTexter, is described in the following paper:

Schapire, Robert E., and Yoram Singer. “BoosTexter: A Boosting-based System for Text Categorization.” Machine Learning 39 (2000).

Installing Boostexter

The program that implements these algorithms should run on all common operating systems and is available for download from the following URL: http://www.research.att.com/~gsf/cgi-bin/download.cgi?action=list&name=boostexter.

This gives you the option of selecting implementations compiled for various flavors of Unix, Windows (32-bit) and Mac OS X. There is an odd method for agreeing to the license agreement, which involves clicking on “Cancel” when you get a dialog box asking for a username and password, so you can see the license agreement and retrieve the username and password you need. If you then refresh your browser window, it will again ask for the username/password that you just got. For me, this failed when using Safari on a Mac, but worked in Firefox. Your mileage may vary.

Installing on Windows

Running Boostexter on Windows poses an additional challenge because it really is a Unix program. According to its documentation, it should be possible to run the program under Cygwin, which is a common Unix-like shell environment that is often installed on Windows machines. Unfortunately, our experience in actually trying to make this work has been dismal. Jacinda Shelly, a former student in the class, helped greatly by figuring out that the program will run under AT&T’s uwin-base (aka ksh) environment. Here are her instructions for making this work on a Windows XP installation:

  1. Download uwin at this Web site: http://www.research.att.com/~gsf/cgi-bin/download.cgi?action=list&name=uwin-base
  2. Double-click the .exe file and install it. It might say installation failed (it did for mine, but the program still works); ignore this unless the following steps don’t work.
  3. Download the win32 version of Boostexter from the link given above.
  4. Open uwin (which appeared as ksh on my desktop after installation) and navigate to the directory where you downloaded the boostexter executable. Use the following command to unzip: gunzip -c boostexter.2001-04-30.win32.tar.gz | tar xvf -
  5. Go to the new Boostexter 2_1 directory.
  6. The command is boostexter.exe [Parameters] (example output below).

I hope this helps! I’m glad it’s working now. The only annoying part about uwin is that up and down arrow keys will only let you reuse a command, not edit it (at least on my machine).

$ boostexter.exe -n 10 -W 1 -N ngram -S sample -V

Copyright 2001 AT&T. All rights reserved.

Weak Learner parameters:

------------------------

Window = 1

Classes = all

Expert = NGRAM

goal-in-life:be

C0: -1.199 0.168 0.168

C1: 0.549 -0.549 -0.549

rnd 1: wh-err= 0.724633 th-err= 0.724633 test= 1.0000000 train= 0.3333333

If at all feasible, I would encourage you to run under some Unix-like OS, such as Linux or Mac OS X.

Running Boostexter

Once you have downloaded the program, look at the README file to see how to run it and how to interpret the outputs, and look at the “sample.*” files to see examples of the input formats the program needs. Note that the input texts to boostexter must not contain commas, periods or line breaks, because these are part of the formatting of the input files. In addition, I have discovered, by the sad experience of having the program go into infinite loops, that other symbols listed in *text-replacements* (in the Lisp code) also cause problems: colons and percent signs. These have all been substituted by upper-case symbols in the texts we distribute.
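
As an illustration only (the README is the authoritative reference), the sample files follow a C4.5-like convention: a .names file whose first line lists the classes and whose remaining lines declare the fields, and a .data file with one comma-separated, period-terminated example per line. A hypothetical pair for a binary task might look like:

  dmss.names:
    positive, negative.
    note: text.

  dmss.data:
    patient denies chest pain or shortness of breath, negative.

Here “note” is a hypothetical field name; the distributed files define the actual fields and classes.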

Binary Classification

Using the dmss data, try building models using BoosTexter to classify the cases based on text unigrams, bigrams, trigrams, etc., appearing in the data. Note that the parameters -W and -N control the types of features used in learning, and the -n option controls how many rounds of boosting are performed (roughly, how many features are selected). The -l and -z parameters select variations on the algorithm, as described in the paper. If you use the -V option, you can see the features being selected as the program runs. In any case, you can examine the file dmss.shyp after running boostexter to see the model that was generated. The distributed README file explains how to understand these. That document also shows how to run a trained model against the dmss.test data, to see how well it performs on data other than the set it was trained on.
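
For example, extending the sample invocation shown in the Windows section above (the README documents each flag), a run over the dmss files using up to word bigrams and 100 rounds of boosting might look like:

  $ boostexter -n 100 -W 2 -N ngram -S dmss -V

This is a sketch, not a recommended setting; part of the assignment is finding parameter combinations that actually work well.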

  1. Based on your experience, what combination of parameters gives you the best performance on this problem?
  2. Look at the specific features used in your model(s) and comment on whether these seem to “make sense” independent of their contribution to program performance.
  3. What do you think are the fundamental trade-offs in using fairly generic features such as unigrams vs. much more specific ones such as, say, trigrams?
  4. When doing a “train once, test once” experiment, there is always some risk that you might have chosen a model that just happens to work very well or very poorly, by chance. Often people perform cross-validation studies, where they split the total data set (e.g., dmss.data + dmss.test) into different (randomly selected) subsets on which to train and test with the same parameters. This method can be used to explore the robustness of the method selected or to optimize the choice of training parameters. For the optimization task, it is typical to further subdivide the training set into development and validation subsets (in several different ways), then to train on each development set, test on its corresponding validation set, and choose the parameter setting that optimizes performance across these trials. Then you can train on the entire training set with those parameters and finally test against the test set. Perform a (limited) set of such cross-validation experiments to get a better understanding of what kinds of models perform best on this (relatively easy) task. (A sketch of one way to create random folds appears after this list.)
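
One low-tech way to create random folds, assuming each example occupies exactly one line of the data file (as the BoosTexter format requires) and that standard Unix tools are available:

  # shuffle the combined examples, then deal them round-robin into 5 folds
  $ cat dmss.data dmss.test | shuf > all_shuffled.data
  $ awk '{ print > ("fold_" NR % 5 ".data") }' all_shuffled.data

Each fold can then serve once as the held-out set while the remaining folds are concatenated for training.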

Note: If the goal of cross-validation is not to optimize parameters but to explore the robustness of a single chosen method, it’s possible to use the entire training + test set in cross-validation. However, it’s very important, yet tricky, to ensure that one does not “corrupt” the final evaluation results when allowing the test data set to participate in any of the training tasks.

Multi-Class, Multi-Label Learning

As described above, the hw2.* files contain notes with (possibly) multiple labels, drawn from a list of 21 relatively common problems in the CWS. Even so, the most common of these occurs nearly 50 times, and the least common only 4 times, in the data set. Thus, the resulting classification problem is harder.

  1. Try the same experiments with this data set that you did in the binary classification task, and compare the results on these data. Try to draw parallels between this data set and the original.
  2. What were your expectations, and were they fulfilled?
  3. (Extra credit.) Boostexter is able to use additional types of fields, not just text, in building its classification models. You could try to create new data sets that include not just a single text field as the basis for classification, but also other data from the CWS, such as age, gender, specific lab values, etc. How does the addition of such structured data affect the performance of boostexter? (A sketch of how such fields might be declared appears below.)
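
For the extra-credit idea, the README describes field types beyond text. Assuming BoosTexter follows the usual C4.5-style names syntax (an assumption worth checking against the README), a names file mixing a free-text field with structured fields might look along these lines, with hypothetical class labels on the first line:

  diabetes, hypertension, asthma.
  note: text.
  age: continuous.
  gender: female, male.

The first line lists the classes; the remaining lines declare a text field, a continuous field, and a discrete field.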

Background

Much of what we know about the world is known with various degrees of certainty. Although many different numerical models have been proposed for representing this uncertainty, many researchers today believe that probability theory—the most extensively studied uncertainty calculus—is an appropriate basis for building computerized reasoning systems.

Mathematical Preliminaries

We take the world of interest to consist of a set of random variables, $X = \{X_1, X_2, \ldots, X_n\}$. Each of the $X_i$ can take on one of a discrete set of values, $\{x_{i1}, x_{i2}, \ldots, x_{ik_i}\}$. (Formulations using continuous variables are also possible, but we will not pursue them here. A continuous variable can, of course, be approximated by a large number of discrete values.) Each possible combination of assignments of values to each of the variables represents a possible state of this world. There are

$$\prod_{i=1}^{n} k_i$$

such states. This is clearly exponential in the number of variables; for example, if each variable is binary, all of the $k_i = 2$ and the number of states is $2^n$. Each state may be identified with the particular values that it assigns to each variable. We might have, for example, $S_{12} = \{X_1 = a, X_2 = e, \ldots, X_n = b\}$.

A probability function, $P$, assigns a probability to each of these possible states. The probability for each state, $P(S_i)$, is the joint probability of that particular assignment of values to the variables. All states are distinct, and they exhaustively enumerate the possibilities; therefore

$$\sum_i P(S_i) = 1.$$

We will often be interested in the probability not of individual states, but of certain combinations of particular variables taking on particular values. For instance, we may be interested in the probability that $\{X_3 = a, X_5 = b\}$. In such cases, we wish to treat the other variables as “don’t cares.” We call such a partial description of the world a circumstance $C$, and the variables that have values assigned the instantiation-set of the circumstance, $I(C)$. (“Circumstance” is not a commonly-used term for partial descriptions of the world. People use terms such as “partially-specified state” and other equally unsatisfying terms.) The probability of a circumstance $C$ can be computed by summing the probabilities of all states that assign the same values to the variables in the instantiation set $I(C)$ as $C$ does. If we re-order the variables $X$ so that the first $m$ are the instantiation set of $C$, then (in a shorthand notation):

$$P(C) = \sum_{x_{m+1}} \cdots \sum_{x_n} P(x_1, \ldots, x_m, x_{m+1}, \ldots, x_n).$$

For example, if the world consists of four binary variables, $W$, $X$, $Y$ and $Z$, then the probability of the circumstance $\{X = \mathrm{true}, Z = \mathrm{false}\}$ is given by

$$P(X = \mathrm{true}, Z = \mathrm{false}) = \sum_{w \in \{\mathrm{t},\mathrm{f}\}} \sum_{y \in \{\mathrm{t},\mathrm{f}\}} P(W = w, X = \mathrm{true}, Y = y, Z = \mathrm{false}).$$

The computational difficulty of calculating the probability of a circumstance comes precisely from the need to sum over a possibly vast number of states. The number of such states to sum over is exponential in the number of “don’t cares” in a circumstance.

Defining $P$ for each state of the world is a rather tedious and counter-intuitive way to describe probabilities in the world, although the ability to assign an exponentially large number of independent probabilities would allow any probability distribution to be described. It might happen that the world really is so complex that only such an exhaustive enumeration of the probability of each state is adequate, but fortunately many variables appear to be independent. For example, in medical diagnosis, the probability that you get a strep infection is essentially independent of the probability that you catch valley fever (a fungal infection of the lungs prevalent in California agricultural areas). Formally, this means that $P(\mathrm{strep}, \mathrm{vf}) = P(\mathrm{strep})P(\mathrm{vf})$. When two variables both depend on the same set of variables but not directly on each other, they are conditionally independent. For example, if an infection causes both a fever and diarrhea, but there is no other correlation between these symptoms, then $P(\mathrm{fever}, \mathrm{diarrhea} \mid \mathrm{infection}) = P(\mathrm{fever} \mid \mathrm{infection}) \, P(\mathrm{diarrhea} \mid \mathrm{infection})$.

The independencies among variables in a domain support a convenient graphical notation, called a Bayes network. In it, each variable is a node, and each probabilistic dependency is drawn as a directed arc to a dependent node from the node it depends on. This notion of probabilistic dependency does not necessarily correspond to what we think of as causal dependency, but it is often convenient to identify them. When there is no directed arc from one variable to another, then we say that the second does not depend on the first.

The probability of a state is a product over all variables of the probability that the variable takes on its particular value (in that state) given that its parents take on their particular values:

$$P(S) = \prod_{i=1}^{n} P(X_i \mid \mathrm{parents}(X_i)).$$

The right hand side is an abbreviation for

$$\prod_{i=1}^{n} P(X_i = v_i \mid X_{p_{i,1}} = v_{p_{i,1}}, \ldots, X_{p_{i,m_i}} = v_{p_{i,m_i}}),$$

where

$$\mathrm{parents}(X_i) = \{X_{p_{i,1}}, \ldots, X_{p_{i,m_i}}\}$$

and $v_j$ denotes the value that state $S$ assigns to $X_j$. As described above, to find the probability of a circumstance, we must still sum over all of the states that are consistent with the circumstance, i.e., a number of states exponential in the number of “don’t cares.”
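
To make the factorization concrete, consider the earlier three-variable example, with arcs from infection to fever and from infection to diarrhea. Every state probability factors as

$$P(\mathrm{infection}, \mathrm{fever}, \mathrm{diarrhea}) = P(\mathrm{infection}) \, P(\mathrm{fever} \mid \mathrm{infection}) \, P(\mathrm{diarrhea} \mid \mathrm{infection}),$$

so the network is specified by $1 + 2 + 2 = 5$ numbers (one prior and two two-row conditional tables) rather than the $2^3 - 1 = 7$ independent entries of the full joint distribution.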

Problems:

1. Your expert supplies you with the following assessment: “I think that disease D1 causes symptom S1 and, together with disease D2, causes symptom S2. 80% of patients affected by disease D1 will manifest symptom S1, which is present in 10% of the patients without disease D1. When both D1 and D2 are present, symptom S2 occurs in 60% of cases; when only D1 is present, S2 occurs in 70% of patients; when only D2 is present, S2 occurs in 80% of patients; and when neither D1 nor D2 is present, symptom S2 occurs in 10%. Disease D1 occurs in 20% of the population, while disease D2 occurs in 10%.”

  1. Draw the network capturing the description.
  2. What kind of graph is this and why?
  3. Assume that the variables can take two values, 1 for present, 0 for absent. Write the conditional probability tables associated with each dependency in the network.
  4. Write down the formula to calculate the joint probability distribution induced by the network.
  5. Using these tables, calculate the marginal probability P(S1 = 1).

2. Neurofibromatosis-1 is an inherited disorder characterized by formation of neurofibromas (tumors involving nerve tissue) in the skin, subcutaneous tissue, cranial nerves, and spinal root nerves. Neurofibromatosis-1 is caused by a mutation in the NF1 gene.

  1. The Affymetrix human genome microarray probe for this gene is as follows: AAGTGCCATGTTCCTCAGATTTATC. If one starts with a random sequence of length n >= 25, derive a formula for the probability that the probe will match it at least once, assuming 1) base types are uniformly distributed within each position and 2) each base position is independent of the others.
  2. Using the same assumptions as above, what is the expected number of sequences that will match within the entire human genome (3×10^9 base pairs), going in the 5’ to 3’ direction (i.e., only looking in one direction)?
  3. Using the same assumptions as above, what is the theoretical ratio of the number of serine to tryptophan amino acids encoded in the human genome?
  4. Are the numbered assumptions made in part 1 above always correct? Explain why or why not.
  5. Diagnosis of the disease in this question is typically done via clinical evaluation rather than by microarray analysis. Describe a use for having this gene as a probe in the HG-U133+ 2.0 Genome Array. How might it be useful in other settings in conjunction with other technologies, or in medicine in the future?

3. The following is a protein circuit perturbation-based profile. It lists which proteins are present in a particular system under different trial circumstances. In each trial (except t=0) one protein is either added or removed.

TRIAL | PROTEIN 1 | PROTEIN 2 | PROTEIN 3 | PROTEIN 4
t=0   | 0         | 1         | 1         | 1
t=1   | 1         | 0         | 1         | removed
t=2   | 0         | 0         | removed   | 1
t=3   | 1         | removed   | 1         | 1
t=4   | added     | 1         | 1         | 1

  1. Find a simple expression (using only the primitives AND, NOT, OR) for each protein that is consistent with this protein circuit perturbation-based profile. A protein’s formula need not be consistent with the trial in which it is experimentally adjusted, but must be consistent with the other four trials. In the formulas for the other proteins, you may treat a protein that was added as “1” and one that was removed as “0.”
  2. Your collaborators’ experiments suggest that all proteins except protein 3 are involved in the p53 pathway (involved in many cancers). Based on Biocarta pathways, what are possibilities for proteins 1, 2 and 4 given the protein circuit you derived in part 1?

4. A start-up company has just implemented a new computer decision support tool for screening the general public for a type of cancer. The computer support system has a 99% chance of detecting the cancer if it exists. If there is no cancer, the computer system will incorrectly declare cancer only 0.1% of the time. The prevalence of the cancer in the general population is 0.5%.

  1. If the computer predicts that a patient has cancer, what is the probability that the patient, in fact, does not have cancer?
  2. Explain why the result is so high/low.
  3. It turns out that the company will let you use the product for free. What other uses can you think of for this product (other than screening as described above)?

5. In 2007, the US Food and Drug Administration approved a gene expression-based test for predicting breast cancer recurrence. Technologies such as this one are often developed by looking at microarrays of gene expression (typically across all human genes) at several time points, comparing controls with disease patient samples. One way described in class involves using different dyes. Describe (in detail) an experiment using this process that could be used to find the gene expression pattern that predicts breast cancer recurrence.

6. A study finds that an innovative technique allows clinicians to find significantly smaller tumors than the previous gold standard, by detecting protein interactions with tumor proteins before the tumor has a chance to grow. The study goes on to find that survival times increased both in patients with metastases and in those with no metastases. Could this be true? If so, how do you explain this finding? If not, why not?
