Data sorting for noobs

Big data can be intimidating, but sorting and extracting information becomes easy when a sort of mental ‘algorithm’ is applied. The pseudocode is in ~Matlab notation, but the basic concepts apply to any programming language.

Step 1: Observe layout of raw data

Raw data will have some kind of format connecting ‘inputs’ and ‘outputs’. For example, gene transcription data will include some identifier of the gene, and its set of transcription levels. Perhaps there are multiple replicates or experiments. Identify the physical relationship between the ‘independent variables’ and ‘dependent variables’.

Step 2: Import data in subsets

For the simple case of a single input (gene name) and a single type of output (3 measures of transcription level), import each into its own array or matrix. Since one gene is related to 3 transcription levels, the gene names would be imported into an array of size n, and the set of data into a matrix of size 3 x n. Make sure the subsets keep their physical relationship to each other! Don’t import the headers into the data matrix or into the array of gene names. Make sure that gene i’s data is in the ith column (or row) of the matrix.
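As a rough illustration of this step, here is a minimal Python sketch. It assumes a comma-separated layout where the first row holds the gene names and the next three rows hold the measurements; the file contents and the helper name `load_subsets` are made up for the example.

```python
import csv
import io

# Hypothetical file contents: row 1 holds the gene names,
# rows 2-4 hold three transcription measurements per gene.
RAW = """geneA,geneB,geneC,geneD
1.0,2.0,3.0,4.0
1.5,2.5,3.5,4.5
2.0,3.0,4.0,5.0
"""

def load_subsets(text):
    """Split the raw text into a names array (size n) and a 3 x n data matrix."""
    rows = list(csv.reader(io.StringIO(text)))
    names = rows[0]                                  # gene names only
    data = [[float(v) for v in row] for row in rows[1:]]  # numbers only
    # Check alignment: every data row needs exactly one value per gene.
    for row in data:
        if len(row) != len(names):
            raise ValueError("row length does not match number of gene names")
    return names, data

names, data = load_subsets(RAW)
# Gene i's data sits in column i of the matrix.
print(names[1], [row[1] for row in data])  # geneB [2.0, 2.5, 3.0]
```

The length check is there to catch the most common mistake early: a stray label or missing value that shifts the data out of line with the names.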

Step 3: Use a loop

There are two (very) basic goals in data sorting: specific or mass information gathering.

(i) Specific

To analyze only a specific subset of data out of the large data set, import an array of things you are looking for – perhaps a certain value of transcription level, or a certain gene. For example, you may want to make a list of genes of interest. Also create an empty array or matrix to store the results in.

Then, use a loop to iteratively check the appropriate matrix or array. When a match is found, utilize the symmetry of your data subsets to save the match. For example, when looking for a specific transcription value, iterate through every element of the transcription value matrix. When a value is found, you know that the name of the gene is on the same row (horizontal array) or column (vertical array) as the matching item.

Note: Depending on your system and method of creating arrays, you may be more likely to have a vertical array even if your data is aligned horizontally. Simply imagine transposing your array to visualize the physical relationship.
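A small Python sketch of both kinds of specific lookup might look like the following. The data and the function names `find_matches` and `genes_with_value` are invented for illustration, and the layout assumed is a names array of size n with a 3 x n data matrix.

```python
# Hypothetical example data: names[j] belongs to column j of data.
names = ["geneA", "geneB", "geneC", "geneD"]
data = [
    [1.0, 2.0, 3.0, 4.0],
    [1.5, 2.5, 3.5, 4.5],
    [2.0, 3.0, 4.0, 5.0],
]

def find_matches(names, data, wanted):
    """Look up genes by name; return {gene: [its 3 measurements]}."""
    results = {}
    for target in wanted:
        for j, name in enumerate(names):
            if name == target:
                # The match is at position j, so its data is column j.
                results[target] = [row[j] for row in data]
    return results

def genes_with_value(names, data, value):
    """Look up genes by value; return names whose column contains it."""
    found = []
    for j, name in enumerate(names):
        # Exact comparison is fine for this toy data; real measurements
        # usually call for a tolerance or a range instead.
        if any(row[j] == value for row in data):
            found.append(name)
    return found

hits = find_matches(names, data, ["geneC", "geneX", "geneA"])
print(hits)                                # geneX is absent, so it is skipped
print(genes_with_value(names, data, 3.0))  # ['geneB', 'geneC']
```

In both functions the shared index j does all the work: once a match is found in one subset, the same position in the other subset holds the related item.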

(ii) Mass

To analyze an entire data set in some way, simply iterate through each subset corresponding to the individual gene (in keeping with the example), and perform operations on the transcription values linked with that gene. Again, since the physical relationship is preserved, the column of the array of genes will correspond to the row of relevant transcription values. (See Note above).
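For instance, one possible mass operation is averaging the three measurements for every gene. The sketch below assumes the same names array and 3 x n matrix layout as the example; the data and the name `mean_per_gene` are hypothetical.

```python
from statistics import mean

# Hypothetical example data: names[j] belongs to column j of data.
names = ["geneA", "geneB", "geneC", "geneD"]
data = [
    [1.0, 2.0, 3.0, 4.0],
    [1.5, 2.5, 3.5, 4.5],
    [2.0, 3.0, 4.0, 5.0],
]

def mean_per_gene(names, data):
    """Visit every gene once and average the values in its column."""
    averages = {}
    for j, name in enumerate(names):
        column = [row[j] for row in data]  # the 3 values linked to this gene
        averages[name] = mean(column)
    return averages

averages = mean_per_gene(names, data)
print(averages["geneA"])  # 1.5
```

Any other per-gene operation (maximum, variance, a threshold test) would slot into the same loop in place of `mean`.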

Example pseudocode:

my_names = import("my file of data", first row, n columns)
my_data = import("my file of data", rows 2,3,4, n columns)
matches = import("what I'm looking for", first row, m columns)
results = zeros(m, 3) % m x 3 matrix of zeros, one row per item in matches
for i = 1:length(matches) % m items to look for
    for j = 1:length(my_names) % n genes in the data
        if strcmp(my_names(j), matches(i)) % string comparison, returns t/f
            for k = 1:3 % 3 rows of my_data
                results(i,k) = my_data(k,j); % gene j's data is in column j
            end
        end
    end
end
export(results)
