Turn vcf files to usable R files — Upload_vcf_to

This function implements step 1 of chapter 1 of the "Using VCFtoGWAS package" markdown series.
It loads the vcf file that you want to work with (whether or not it was processed by the "usegalaxy" server). The file can be loaded in its compressed form (.gz file extension). The vcf processing is based on the [vcfR package](https://cran.r-project.org/web/packages/vcfR/vignettes/intro_to_vcfR.html).

Usage

Upload_vcf_to_R(vcf_file,
                dir_results = getwd(),
                results_name = name_by_time(),
                do_save = TRUE,
                do_return = TRUE,
                get_chr_info = TRUE,
                fix_columns = c("CHROM","POS","REF","ALT","QUAL"))

Arguments

vcf_file: The path to the vcf file
dir_results: The directory in which a subfolder will be created and the results will be saved. This directory must already exist.
results_name: The name of the subfolder within dir_results in which the results will be saved (default is a time stamp, see create_directory)
do_save: Whether to save the results (they will be saved as RDS files) (Default is TRUE)
do_return: Whether to return the results to your current workspace environment (Default is TRUE)
get_chr_info: Whether to extract information regarding the lengths of the chromosomes (Default is TRUE)
fix_columns: The names of the columns to keep from the fixed section of the vcf (the default is probably all you need, so there is usually no reason to change it)

Details

VCF files are usually very large, so loading them may take a while.
The exported files are saved as .RDS files, which are compact and easy to read in R by calling readRDS(file = filepath).
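
As a minimal sketch of the RDS round trip described above: the small matrix below is a hypothetical stand-in for a saved result (the real files are written by Upload_vcf_to_R), and the temporary file path is only for illustration.

```r
# Hypothetical stand-in for a saved genotype matrix (not real package output)
example_matrix <- matrix(
  c("0/1", "1|0", "./.", "1/2"),
  nrow = 2,
  dimnames = list(c("variant1", "variant2"), c("sampleA", "sampleB"))
)

# Write to a temporary .RDS file and read it back
filepath <- tempfile(fileext = ".RDS")
saveRDS(example_matrix, file = filepath)
restored <- readRDS(file = filepath)

# The restored object matches the original, including its dimnames
identical(example_matrix, restored)
```

Because readRDS returns the object rather than assigning it to a fixed name, the result can be stored under any variable name.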

Extracting genotypes from the vcf data:
GT: genotype, encoded as allele values separated by either / or |.
The allele values are:

0 for the reference allele (what is in the REF field)
1 for the first allele listed in ALT
2 for the second allele listed in ALT
3 for the third allele listed in ALT, and so on

For diploid calls, examples are 0/1, 1|0, or 1/2. If a call cannot be made for a sample at a given locus, '.' is specified for each missing allele in the GT field (for example, './.' for a diploid genotype and '.' for a haploid genotype).
The meanings of the separators are as follows:

/ : genotype unphased
| : genotype phased
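
The GT encoding above can be illustrated with a short sketch. The helper parse_gt is hypothetical and is not part of the VCFtoGWAS or vcfR packages; it only shows how a single GT string splits into allele indices and a phasing flag.

```r
# Hypothetical helper: split one GT string into allele indices and a phasing flag
parse_gt <- function(gt) {
  # "|" marks a phased genotype, "/" an unphased one
  phased <- grepl("|", gt, fixed = TRUE)
  # Split on either separator
  alleles <- strsplit(gt, "[/|]")[[1]]
  # "." means the allele could not be called
  alleles[alleles == "."] <- NA
  list(alleles = as.integer(alleles), phased = phased)
}

parse_gt("0/1")  # reference allele and first ALT allele, unphased
parse_gt("1|0")  # first ALT allele and reference allele, phased
parse_gt("./.")  # both alleles missing
```

Missing alleles are returned as NA so that they are not confused with the reference allele (0).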

Value

If do_return = TRUE:

fix_and_gt: a list of two matrices: the filtered fixed information (without unnecessary columns) and the corresponding genotype section of the VCF

If do_return = FALSE:

results_directory: a string with the directory where the results were saved.

References

See the vcfR package for more information.

See also the "usegalaxy" VCFselectsamples tool for pre-filtering the data.

Author

Tomer Antman

Note

Make sure you enter a proper file path (vcf_file), such as:

1) "somefolder/1011Matrix.gvcf.gz"
2) "Galaxy4_VCFselectsamples.vcf"

Also make sure the results path (dir_results) exists, for example:

1) "somefolder"
2) "C:/Users/user/Documents"
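
Both paths can be checked before calling the function. The sketch below uses a hypothetical helper, check_paths, which is not part of the package and relies only on base R's file.exists and dir.exists.

```r
# Hypothetical helper: report whether the input file and results directory exist
check_paths <- function(vcf_file, dir_results) {
  c(
    vcf_file_exists = file.exists(vcf_file),
    dir_results_exists = dir.exists(dir_results)
  )
}

# Example call using the file name from this page and the current working directory
check_paths(vcf_file = "1011Matrix.gvcf.gz", dir_results = getwd())
```

If either value is FALSE, correct the path before running Upload_vcf_to_R.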

Examples

files_directory <- Upload_vcf_to_R(
                                  vcf_file = "1011Matrix.gvcf.gz",
                                  dir_results = "C:/Users/user/Documents",
                                  do_return = FALSE
                                  )
#> Warning: cannot create dir 'C:\Users\user\Documents\100222_16.54_Step1.1-Upload_VCF', reason 'No such file or directory'
#> 
#> Results directory created:
#> C:/Users/user/Documents/100222_16.54_Step1.1-Upload_VCF
#> Loading required package: vcfR
#> 
#>    *****       ***   vcfR   ***       *****
#>    This is vcfR 1.12.0 
#>      browseVignettes('vcfR') # Documentation
#>      citation('vcfR') # Citation
#>    *****       *****      *****       *****
#> Warning: the condition has length > 1 and only the first element will be used
#> [[1]]
#> $src
#> [1] "vcf <- read.vcfR(vcf_file, verbose = FALSE)"
#> 
#> attr(,"class")
#> [1] "source"
#> 
#> [[2]]
#> <simpleWarning in file(file, "r"): cannot open file '1011Matrix.gvcf.gz': No such file or directory>
#> 
#> [[3]]
#> <simpleError in file(file, "r"): cannot open the connection>
#> 
#> 2022-02-10 16:54:28 - getting chromosome info
#> Loading required package: stringr
#> Loading required package: readr
#> Error in startsWith(vcf@meta, "##contig=<ID=chromosome"): object 'vcf' not found
print(files_directory)
#> Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'print': object 'files_directory' not found