Turn VCF files into usable R files
Upload_vcf_to_R.Rd
This function performs step 1 of chapter 1 of the "Using VCFtoGWAS package" markdown series.
This script loads the vcf file that you want to work with (either processed by the "usegalaxy" server or not). The file can be uploaded in its zipped version (.gz file extension). The vcf processing is based on the [vcfR package](https://cran.r-project.org/web/packages/vcfR/vignettes/intro_to_vcfR.html).
Usage
Upload_vcf_to_R(vcf_file,
dir_results = getwd(),
results_name = name_by_time(),
do_save = TRUE,
do_return = TRUE,
get_chr_info = TRUE,
fix_columns = c("CHROM","POS","REF","ALT","QUAL"))
Arguments
- vcf_file
The location where the vcf file is saved
- dir_results
The directory in which a subfolder will be created and the results will be saved. This directory must already exist.
- results_name
The name of the folder in which the results will be saved within dir_results (default is a time stamp, see create_directory)
- do_save
Do you wish to save the results? (will be saved as RDS files) (Default is TRUE)
- do_return
Do you wish to return the results to your current workspace environment? (Default is TRUE)
- get_chr_info
Do you wish to also collect information regarding the lengths of the chromosomes? (Default is TRUE)
- fix_columns
The column names that you wish to extract from the fixed (fix) section of the vcf. The default is usually all you need, so there is rarely a reason to change it.
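The example output further down suggests that, when get_chr_info = TRUE, the function looks for "##contig" lines in the VCF meta section. The following is only a minimal sketch of how chromosome lengths could be read from such lines, not the package's actual implementation; the meta vector and the chr_info name are hypothetical illustration values.

```r
# Hypothetical meta lines, in the format the VCF header uses for contigs
meta <- c(
  "##fileformat=VCFv4.2",
  "##contig=<ID=chromosome1,length=230218>",
  "##contig=<ID=chromosome2,length=813184>"
)

# Keep only the contig lines
contig_lines <- meta[startsWith(meta, "##contig=<ID=")]

# Pull the ID and the length out of each contig line with regular expressions
chr_info <- data.frame(
  chr = sub("^##contig=<ID=([^,>]+).*$", "\\1", contig_lines),
  length = as.numeric(sub("^.*length=([0-9]+).*$", "\\1", contig_lines)),
  stringsAsFactors = FALSE
)

print(chr_info)
```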
Details
VCF files are usually very large, so loading them can take a while.
The files exported are saved as .RDS files. They are lighter and very easy to read in R by calling readRDS(file = filepath).
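As a small illustration of the RDS round trip, the sketch below saves a toy list to a temporary file and reads it back. The toy object only mimics the idea of a fixed-information matrix plus a genotype matrix; its element names (fix, gt) and contents are assumptions made for this example, not the file layout that Upload_vcf_to_R actually writes.

```r
# Toy stand-in for the function's results (structure is an assumption)
toy <- list(
  fix = matrix(
    c("chromosome1", "100", "A", "G", "50"),
    nrow = 1,
    dimnames = list(NULL, c("CHROM", "POS", "REF", "ALT", "QUAL"))
  ),
  gt = matrix("0/1", nrow = 1, dimnames = list(NULL, "sample1"))
)

# Write the object to a temporary .RDS file and read it back
rds_path <- tempfile(fileext = ".RDS")
saveRDS(toy, file = rds_path)
restored <- readRDS(file = rds_path)

# The restored object should match the original
same_object <- identical(toy, restored)
```

Because readRDS returns the object rather than restoring it under its old name, the result can be assigned to any variable you like.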
Extract genotypes from the vcf data:
GT: genotype, encoded as allele values separated by either of / or |.
The allele values are:
0 for the reference allele (what is in the REF field)
1 for the first allele listed in ALT
2 for the second allele listed in ALT.
3 for the third allele listed in ALT and so on.
For diploid calls examples could be 0/1, 1|0, or 1/2, etc. If a call cannot be made for a sample at a given locus, '.' is specified for each missing allele in the GT field (for example './.' for a diploid genotype and '.' for haploid genotype).
The meanings of the separators are as follows:
/ : genotype unphased
| : genotype phased
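To make the GT encoding above concrete, here is a minimal sketch that splits a single GT string into its allele indices and reports whether it is phased. The helper parse_gt is hypothetical and is not part of VCFtoGWAS or vcfR.

```r
# Hypothetical helper: decode one GT string such as "0/1", "1|0" or "./."
parse_gt <- function(gt) {
  # A "|" separator marks a phased genotype, "/" an unphased one
  phased <- grepl("|", gt, fixed = TRUE)
  # Split on either separator
  alleles <- strsplit(gt, "[/|]")[[1]]
  # "." stands for a missing allele
  alleles[alleles == "."] <- NA
  list(alleles = as.integer(alleles), phased = phased)
}

het_unphased <- parse_gt("0/1")
het_phased <- parse_gt("1|0")
missing_call <- parse_gt("./.")
```

Here an allele index of 0 refers to REF, 1 to the first ALT allele, and so on, matching the list above; missing alleles come back as NA.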
Value
If do_return = TRUE:
- fix_and_gt
is a list of two matrices: filtered fixed information (without unnecessary columns) and corresponding genotype section of the VCF
If do_return = FALSE:
- results_directory
a string with the directory where the results were saved.
References
See vcfR package for more information
And also see the "usegalaxy" VCFselectsamples tool to pre-filter the data
Note
Make sure you enter a proper file path (vcf_file) such as:
1) "somefolder/1011Matrix.gvcf.gz"
2) "Galaxy4_VCFselectsamples.vcf"
And also a proper results directory that exists (dir_results) such as:
1) "somefolder"
2) "C:/Users/user/Documents"
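One way to catch path problems early is to check both locations before calling the function. The sketch below uses only base R; the helper check_paths is hypothetical and not part of the package.

```r
# Hypothetical helper: fail early if the input file or results directory is missing
check_paths <- function(vcf_file, dir_results) {
  if (!file.exists(vcf_file)) {
    stop("VCF file not found: ", vcf_file)
  }
  if (!dir.exists(dir_results)) {
    stop("Results directory not found: ", dir_results)
  }
  invisible(TRUE)
}

# Usage with a temporary file standing in for a real VCF
demo_vcf <- tempfile(fileext = ".vcf")
file.create(demo_vcf)
check_paths(vcf_file = demo_vcf, dir_results = tempdir())
```

Calling check_paths with a path that does not exist stops with an error naming the offending path, which is easier to read than a failure deep inside the loading step.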
Examples
files_directory <- Upload_vcf_to_R(
vcf_file = "1011Matrix.gvcf.gz",
dir_results = "C:/Users/user/Documents",
do_return = FALSE
)
#> Warning: cannot create dir 'C:\Users\user\Documents\100222_16.54_Step1.1-Upload_VCF', reason 'No such file or directory'
#>
#> Results directory created:
#> C:/Users/user/Documents/100222_16.54_Step1.1-Upload_VCF
#> Loading required package: vcfR
#>
#> ***** *** vcfR *** *****
#> This is vcfR 1.12.0
#> browseVignettes('vcfR') # Documentation
#> citation('vcfR') # Citation
#> ***** ***** ***** *****
#> Warning: the condition has length > 1 and only the first element will be used
#> [[1]]
#> $src
#> [1] "vcf <- read.vcfR(vcf_file, verbose = FALSE)"
#>
#> attr(,"class")
#> [1] "source"
#>
#> [[2]]
#> <simpleWarning in file(file, "r"): cannot open file '1011Matrix.gvcf.gz': No such file or directory>
#>
#> [[3]]
#> <simpleError in file(file, "r"): cannot open the connection>
#>
#> 2022-02-10 16:54:28 - getting chromosome info
#> Loading required package: stringr
#> Loading required package: readr
#> Error in startsWith(vcf@meta, "##contig=<ID=chromosome"): object 'vcf' not found
print(files_directory)
#> Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'print': object 'files_directory' not found