This function performs step 3 of chapter 1 of the "Using VCFtoGWAS package" markdown series.
It expands the fixed information and the genotype (gt) matrix so that each row represents a single SNP or INDEL (up to this point, each row represented a position). The output is the expanded fixed information and gt matrix, together with an array called `indication` that links the two and is needed in the next step.

Usage

expand_files(gt_GTonly_filt,
             fix_filt,
             dir_results = getwd(),
             results_name = name_by_time(),
             do_save = TRUE)

Arguments

gt_GTonly_filt

The filtered genotype matrix (step 2 output)

fix_filt

The filtered fixed information dataframe (step 2 output)

dir_results

The directory in which a folder will be created and the results saved. This directory must already exist.

results_name

The name of the folder in which the results will be saved within dir_results (default is a time stamp, see create_directory)

do_save

Whether to save the results (as RDS files). Default is TRUE.

Details

For very large files, this might take a while.

An explanation of the indication array that is returned:
If, for example, row 3 of fix_filt represents a position with two ALTs (e.g. [C, TG]) for one REF (e.g. [A]), fix_filt_expand will contain two rows for it (3.0 and 3.1), each with a single ALT.
The indication array holds "1" for row 3.0 and "2" for row 3.1. This array is the link between the data in fix_filt_expand and the data in gt_GTonly_filt_expand.
indication[i] is the ID number of the alternative sequence that row i represents at its position.
Its values range from 1 up to some maximum (6 in this case, but it can be higher).
In simpler terms: after all the pre-processing, each row in fix_filt_expand represents only one alteration from the reference (only one ALT). Since the same position might have had several alterations in fix_filt, the indication array records which alteration each row represents, relative to its position in the genome.
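
The relationship between fix_filt, fix_filt_expand and indication can be illustrated with a minimal base R sketch. It is not the package's implementation: the toy data frame, its column names (POS, REF, ALT) and the comma-separated ALT field are assumptions made for illustration only.

```r
# Toy stand-in for fix_filt: one row per position.
# Column names and the comma-separated ALT field are assumptions.
fix_filt <- data.frame(
  POS = c(101, 205, 310),
  REF = c("G", "T", "A"),
  ALT = c("C", "A", "C,TG"),   # row 3 has two ALTs, as in the example above
  stringsAsFactors = FALSE
)

# Split each ALT field into its individual alleles
alt_list <- strsplit(fix_filt$ALT, ",", fixed = TRUE)
n_alt <- lengths(alt_list)

# Repeat each row once per ALT allele, then keep a single ALT per row
row_idx <- rep(seq_len(nrow(fix_filt)), times = n_alt)
fix_filt_expand <- fix_filt[row_idx, ]
fix_filt_expand$ALT <- unlist(alt_list)
fix_filt_expand$ALT_options <- n_alt[row_idx]
rownames(fix_filt_expand) <- NULL

# indication: which ALT (1st, 2nd, ...) of its position each expanded row holds
indication <- sequence(n_alt)

print(fix_filt_expand)
print(indication)
```

In this sketch, position 310 becomes two rows (ALT "C" with indication 1, ALT "TG" with indication 2), while the single-ALT positions keep one row each with indication 1.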

Value

fix_filt_expand

A dataframe of the fixed information, in which each row represents a single variant (one SNP or INDEL). It has an extra column, `ALT_options`, giving the number of alterations that exist at that row's genomic position in the data.

gt_GTonly_filt_expand

The expanded genotype matrix. The number of times each row of the input matrix appears in it is determined by the `ALT_options` value for that position in fix_filt_expand.
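
As a rough illustration of how the row count of the matrix could follow the number of ALTs, here is a small base R sketch. It assumes the expansion simply repeats each genotype row once per ALT at that position; the package may do this differently. The toy matrix, its dimnames and the `n_alt` vector are hypothetical.

```r
# Toy genotype matrix: rows are positions, columns are samples.
# Values and dimnames are made up for illustration.
gt_GTonly_filt <- matrix(
  c("0", "1", "0",
    "1", "1", "0",
    "0", "1", "2"),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("pos1", "pos2", "pos3"), c("s1", "s2", "s3"))
)

# Assumed number of ALT alleles per position (pos3 has two)
n_alt <- c(1L, 1L, 2L)

# Repeat each row once per ALT so the matrix lines up row-by-row
# with an expanded fixed-information table
row_idx <- rep(seq_len(nrow(gt_GTonly_filt)), times = n_alt)
gt_GTonly_filt_expand <- gt_GTonly_filt[row_idx, , drop = FALSE]

print(gt_GTonly_filt_expand)
```

Here the row for pos3 appears twice, so the expanded matrix has four rows, one per variant.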

indication

This array is the link between the data in the fix_filt_expand and the data in gt_GTonly_filt_expand

Author

Tomer Antman

Note

This should be run on the results of the Filter_genotypes function from step 2.