Were the training and test sets split randomly?
Yes, using a stratified method so that the class ratios are the same in both sets.
• Do the data files distributed with the training set describe the test set instances in addition to the training set instances?
Yes. The abstracts, protein-protein interactions, localization values, function values and aliases represent knowledge about all of the genes in yeast. The test set to be provided will consist solely of a list of gene identifiers. All of the information required to instantiate features for the test set instances is in the data files that were included with the training instances.
• Are the MEDLINE abstracts meant to be used as input data?
Yes. In fact, it is probably necessary to use them to achieve competitive accuracy.
• Why do the abstracts often contain references to gene names followed by a “p”? For example, abstract 10022848 references “sec4p” and “sec15p”, but the file gene-abstracts.txt associates this abstract with the genes “sec4” and “sec15”.
The “p” suffix is ofte