Expand rows of files
Expand_files.Rd
This function performs step 3 of chapter 1 of the "Using VCFtoGWAS package" markdown series.
It expands the fixed info and the genotype (gt) matrix so that each row represents only one SNP or INDEL (until this step, each row represented a position).
The output is the expanded fixed info and gt matrix, plus an array called `indication` that links the two and is needed in the next step.
Usage
expand_files(gt_GTonly_filt,
fix_filt,
dir_results = getwd(),
results_name = name_by_time(),
do_save = TRUE)
Arguments
- gt_GTonly_filt
The filtered Genotype matrix (step2 output)
- fix_filt
The filtered fixed information dataframe (step2 output)
- dir_results
The directory in which a folder will be created and the results saved. This directory must already exist.
- results_name
The name of the folder, within dir_results, in which the results will be saved (default is a time stamp, see create_directory)
- do_save
Whether to save the results (they are saved as RDS files). Default is TRUE.
Details
For very large files, this step may take a while.
An explanation of the `indication` array that is returned:
If, for example, row number 3 in `fix_filt` represents a position that has two ALTs (e.g. [C, TG]) for one REF (e.g. [A]), it becomes two rows in `fix_filt_expand` (3.0 and 3.1), each with only one ALT.
The `indication` array gets a "1" for row 3.0 and a "2" for row 3.1. This array is the link between the data in `fix_filt_expand` and the data in `gt_GTonly_filt_expand`.
`indication[i]` gives the ID number of the alternative sequence that row i represents at its position.
It ranges from 1 up to some maximum (6 here, but it can be higher).
In simpler terms: after all the pre-processing, each row in `fix_filt_expand` represents only one alteration from the reference (only one ALT). Since the same position might have had several alterations in `fix_filt`, the `indication` array indicates which alteration each row represents, relative to its position in the genome.
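The expansion described above can be sketched in base R. This is only an illustration of the idea, not the package's actual implementation. The toy data, the column names (CHROM, POS, REF, ALT), and the assumption that multiple ALTs are stored as a comma-separated string (as in the VCF format) are all hypothetical. The sketch only repeats genotype rows; whether `expand_files` changes the genotype calls themselves is not covered here.

```r
# Toy inputs (hypothetical data, not taken from the package)
fix_filt <- data.frame(
  CHROM = c("chr1", "chr1", "chr1"),
  POS   = c(101L, 250L, 307L),
  REF   = c("G", "T", "A"),
  ALT   = c("A", "C", "C,TG"),   # row 3 has two ALTs
  stringsAsFactors = FALSE
)
gt_GTonly_filt <- matrix(
  c("0/0", "0/1",
    "1/1", "0/0",
    "0/1", "2/2"),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("sample1", "sample2"))
)

# Split the comma-separated ALT field: one character vector per position
alt_list <- strsplit(fix_filt$ALT, ",", fixed = TRUE)
n_alt    <- lengths(alt_list)              # number of ALTs per position

# Repeat each position's row index once per ALT allele
row_idx <- rep(seq_len(nrow(fix_filt)), times = n_alt)

# Expanded fixed info: one ALT per row, plus the ALT count of the position
fix_filt_expand <- fix_filt[row_idx, ]
fix_filt_expand$ALT <- unlist(alt_list)
fix_filt_expand$ALT_options <- n_alt[row_idx]
rownames(fix_filt_expand) <- NULL

# Expanded genotype matrix: rows repeated in the same order
gt_GTonly_filt_expand <- gt_GTonly_filt[row_idx, , drop = FALSE]

# indication: which ALT (1st, 2nd, ...) of its position each row represents
indication <- sequence(n_alt)

print(fix_filt_expand)
print(indication)
```

In this toy case the third position is split into two rows, one for ALT C and one for ALT TG, and `indication` marks them as alternative 1 and alternative 2 of that position.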
Value
- fix_filt_expand
A dataframe of the fixed information. Each row represents only one variant (one SNP or INDEL). It has an extra column, `ALT_options`, which gives the number of alterations that exist per genomic position in the data.
- gt_GTonly_filt_expand
A genotype matrix in which the number of times each row appears depends on the `ALT_options` value in `fix_filt_expand`.
- indication
This array is the link between the data in `fix_filt_expand` and the data in `gt_GTonly_filt_expand`.
Note
This should be run on the results of the Filter_genotypes function from step 2.