The following data relate to the steps for clustering clone sets outlined in our paper, "An Information Retrieval Process to Aid in the Analysis of Code Clones." The Windows Research Kernel 1.0 source code is available only for non-commercial use in academia. For more information about how to obtain the Windows Research Kernel 1.0 source code, please see the Windows Research Kernel web site at http://www.microsoft.com/resources/sharedsource/Licensing/researchkernel.mspx. Because of this limitation, every effort has been made to remove any source code related information from the data below. If you have received approval to access the source code, we can provide you with more detailed data. For this, please contact Robert Tairas at tairasr@cis.uab.edu.
The results of the clone detection tool CCFinder are given below. The textual format is the one used in the subsequent steps:
The following contains the clone set totals for the directories in the NT Kernel.
N/A
The following are the term-document matrices for each directory that will be clustered. In the files representing the matrices, each line contains a row index, a column index, and the total number of occurrences of the term.
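As an illustration of how such a file might be read, the following minimal Python sketch parses lines of the form "row column count" into a dense matrix. The whitespace separator, the 1-based indices, the reading of rows as terms and columns as clone sets, and the function names are assumptions made for this example; the sample lines are invented and are not taken from the data.

```python
def parse_triplets(lines, one_based=True):
    """Parse 'row col count' lines into a {(row, col): count} dict with 0-based keys."""
    offset = 1 if one_based else 0  # assumption: indices in the files start at 1
    entries = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        row, col, count = (int(p) for p in parts)
        entries[(row - offset, col - offset)] = count
    return entries


def to_dense(entries):
    """Expand the sparse entries into a list-of-lists matrix filled with zeros."""
    n_rows = 1 + max(r for r, _ in entries)
    n_cols = 1 + max(c for _, c in entries)
    matrix = [[0] * n_cols for _ in range(n_rows)]
    for (r, c), count in entries.items():
        matrix[r][c] = count
    return matrix


# Invented sample: three terms (rows) and two clone sets (columns).
sample_lines = ["1 1 3", "1 2 1", "2 2 4", "3 1 2"]
td_entries = parse_triplets(sample_lines)
td_matrix = to_dense(td_entries)
```

To read a real file, the same function could be given the lines of an open file object instead of the sample list.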
The following are the lower dimensionality matrices for each directory that will be clustered. Each file contains the transpose of the original matrix (i.e., Xhihat), because Cluto treats the rows, rather than the columns, as vectors. The first row contains the matrix dimensions.
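The transposition and the dimension header can be sketched as follows. This is only an illustration, assuming a dense layout in which the first line holds the number of rows and columns and each following line holds one space-separated vector; the function names and the sample values are hypothetical.

```python
def transpose(matrix):
    """Swap rows and columns so that each clone set becomes a row vector."""
    return [list(col) for col in zip(*matrix)]


def format_dense(matrix):
    """Render a matrix as text: a 'rows cols' header, then one row per line."""
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    lines = [f"{n_rows} {n_cols}"]
    for row in matrix:
        lines.append(" ".join(repr(float(v)) for v in row))
    return "\n".join(lines) + "\n"


# Invented sample: 2 reduced dimensions (rows) by 3 clone sets (columns).
xhihat = [[0.5, 0.1, 0.0],
          [0.2, 0.9, 0.3]]
cluto_text = format_dense(transpose(xhihat))
```

After transposition the sample has three rows, one per clone set, which is the orientation a row-based clustering tool expects.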
The following are the clustering assignments from Cluto for each of the vectors (i.e., clone sets) in the matrices from the previous step. Each assignment file is a simple text file with one cluster number per line, where each line represents a clone set. Also included is the Cluto informational output report for each clustering.
The following is the mapping of the clustering assignments above to the clone set IDs for each directory.
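A mapping of this kind can be sketched in a few lines of Python. The sketch assumes that the assignment file has one cluster number per line and that a separate, hypothetical list gives the clone set ID for each line in the same order; the IDs and cluster numbers shown are invented.

```python
from collections import defaultdict


def map_clusters(assignment_lines, clone_set_ids):
    """Group clone set IDs by the cluster number found on the matching line."""
    labels = [int(line) for line in assignment_lines if line.strip()]
    if len(labels) != len(clone_set_ids):
        raise ValueError("expected one cluster number per clone set")
    clusters = defaultdict(list)
    for clone_set_id, label in zip(clone_set_ids, labels):
        clusters[label].append(clone_set_id)
    return dict(clusters)


# Invented sample: four clone sets assigned to three clusters.
assignment_lines = ["0", "2", "0", "1"]
clone_set_ids = [101, 205, 317, 402]  # hypothetical IDs, in file order
cluster_map = map_clusters(assignment_lines, clone_set_ids)
```

The length check is a deliberate design choice: a mismatch between the two inputs would otherwise silently pair clone sets with the wrong clusters.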
The following is a sample cluster report in HTML. References to file names and source code have been removed. The report provides various information about the cluster, such as the clone ranges and their membership in the clone sets. The clones are listed twice, ordered once by clone set and once by their location in the source code. Between these listings are the files that contain the clones and the clone sets listed in the "Similar Clone Sets" section, which takes into account the two types of clone sets described at the beginning of Section 4 of the paper (i.e., linked clone sets and sliding clones).
This project is supported by NSF grant CPA-0702764.