|
Cornell Cognitive Studies Symposium
Statistical Learning across Cognition
|
|
Learning from Disparate Data Sources Reba Schuller
With the ever-decreasing cost of data storage, the world is accumulating an astronomical amount of data. Unfortunately, each organization has its own way of storing its data. This disparity between database schemas results in a major bottleneck in integration of data from multiple sources, which is costly to businesses when they obtain new databases from other businesses they have acquired, and to benevolent organizations wishing to cooperate by sharing data. As an example, consider a hospital hoping to improve its methods for determining which patients diagnosed with pneumonia are at high enough risk to warrant hospitalization. While standard machine learning tools allow them to use relevant data (patient histories for patients previously diagnosed with pneumonia) collected by the hospital to make these predictions, it is known that larger sets of data guarantee a greater likelihood of more accurate prediction. Similar data sets from many other hospitals are available, and it would be ideal to augment their data set with this other data. Typically, though, one would have to first determine the different schemas of the different databases and translate all the data into one unified database; however, we offer a new alternative: we describe a technique for using each of the raw databases (without modification) to guarantee improved prediction. Approaching this problem from a statistical learning theory perspective, we demonstrate ties to multitask learning and give some preliminary results supporting the thesis that, for the purpose of classification prediction, the problem of integrating database information can be solved satisfactorily without actually merging the databases. In particular, we analyze the information complexity of classification prediction from multiple data sources, each of which has had its data transformed via some unknown function (from some known class of functions), and demonstrate improvement over the information complexity of learning from a single data source.
|