R/import_data.R
import_data.Rd
This function transforms the column names of a data frame in another format into the column names used by the OpenSWATH output and required by the functions of this package. While the function runs, the corresponding OpenSWATH column must be selected for each column in the data. For columns that do not correspond to any OpenSWATH column, 'not applicable' must be selected; the names of these columns are left unchanged.
import_data(data)
A data frame containing the SWATH-MS data (one line per peptide precursor quantified) but with different column names.
Returns the data frame in the appropriate format.
List of column names of the OpenSWATH data:
ProteinName: Unique identifier for the protein or protein group that the peptide maps to. Proteotypic peptides should be indicated by 1/ in order to be recognized as such by the function filter_proteotypic_peptides.
FullPeptideName: Unique identifier for the peptide.
Charge: Charge of the peptide precursor ion quantified.
Sequence: Naked peptide sequence without modifications.
aggr_Fragment_Annotation: Aggregated annotation for the different fragments quantified for this peptide. In the OpenSWATH results the individual annotations are concatenated by a semicolon.
aggr_Peak_Area: Aggregated intensity values for the different fragments quantified for this peptide. In the OpenSWATH results the individual peak area intensities are concatenated by a semicolon.
transition_group_id: A unique identifier for each transition group used.
decoy: Indicates with 1 or 0 whether this transition group is a decoy.
m_score: Column containing the score that is used to estimate FDR or filter. M-score values of identified peak groups are equivalent to a q-value and thus typically are smaller than 0.01, depending on the confidence of identification (the lower the m-score, the higher the confidence).
RT: Column containing the retention time of the quantified peak.
filename: Column containing the filename or a unique identifier for each injection.
Intensity: Column containing the intensity value for each quantified peptide.

Columns needed for FDR estimation and filtering functions: ProteinName, FullPeptideName, transition_group_id, decoy, m_score

Columns needed for conversion to transition-level format (needed for MSstats and mapDIA input): aggr_Fragment_Annotation, aggr_Peak_Are
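The renaming that import_data performs interactively can also be pictured non-interactively in base R. The snippet below is only a minimal sketch under stated assumptions, not the package's implementation: the input column names (Protein, Peptide, Qvalue, Run, Comment) and the `column_map` vector are hypothetical. Columns without a mapping keep their names, mirroring the 'not applicable' choice, and the result is checked against the columns listed above as needed for FDR estimation and filtering.

```r
# Hypothetical input with non-OpenSWATH column names (illustration only).
raw <- data.frame(
  Protein = c("1/P1", "1/P2"),
  Peptide = c("TTLHQAILMGR", "SVLEELK"),
  Qvalue  = c(1.4e-03, 1.4e-05),
  Run     = c("run0", "run0"),
  Comment = c("a", "b"),
  stringsAsFactors = FALSE
)

# Hypothetical mapping: names are the old columns, values the OpenSWATH names.
column_map <- c(
  Protein = "ProteinName",
  Peptide = "FullPeptideName",
  Qvalue  = "m_score",
  Run     = "filename"
)

# Rename mapped columns; unmapped ones (here 'Comment') are left unchanged.
renamed <- raw
hits <- match(names(renamed), names(column_map))
names(renamed)[!is.na(hits)] <- unname(column_map[hits[!is.na(hits)]])

# Columns listed above as needed for FDR estimation and filtering.
fdr_columns <- c("ProteinName", "FullPeptideName", "transition_group_id",
                 "decoy", "m_score")
missing_fdr <- setdiff(fdr_columns, names(renamed))

print(names(renamed))
print(missing_fdr)
```

In this hypothetical input, transition_group_id and decoy are reported as missing, so the FDR functions could not be applied without adding them.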
data('Spyogenes', package = 'SWATH2stats')
head(data)
#> ProteinName FullPeptideName Charge
#> 1 Spyo_Exp3652_DDB_SeqID_520043 TTLHQAILMGR 3
#> 2 Spyo_Exp3652_DDB_SeqID_515468 SVLEELK 2
#> 3 Spyo_Exp3652_DDB_SeqID_325989 VIGVGGGGGNAINR 3
#> 4 Spyo_Exp3652_DDB_SeqID_515305 FYDPGHVMLK 3
#> 5 Spyo_Exp3652_DDB_SeqID_325124 ATDDAIKEIDR 3
#> 6 Spyo_Exp3652_DDB_SeqID_520062 YHSGDYVFVK 3
#> aggr_Fragment_Annotation
#> 1 58421_TTLHQAILMGR/3_y4;58422_TTLHQAILMGR/3_y3;58423_TTLHQAILMGR/3_b6;58424_TTLHQAILMGR/3_y9_2;58425_TTLHQAILMGR/3_y7;58426_TTLHQAILMGR/3_b7
#> 2 58499_SVLEELK/2_y5;58500_SVLEELK/2_y3;58501_SVLEELK/2_y4;58502_SVLEELK/2_y3;58503_SVLEELK/2_y4;58504_SVLEELK/2_y3
#> 3 58595_VIGVGGGGGNAINR/3_b13+18_2;58596_VIGVGGGGGNAINR/3_y4;58597_VIGVGGGGGNAINR/3_y10;58598_VIGVGGGGGNAINR/3_y12_2;58599_VIGVGGGGGNAINR/3_b12;58600_VIGVGGGGGNAINR/3_b7
#> 4 58955_FYDPGHVMLK/3_y7_2;58956_FYDPGHVMLK/3_y8_2;58957_FYDPGHVMLK/3_y7;58958_FYDPGHVMLK/3_b3;58959_FYDPGHVMLK/3_y9_2;58960_FYDPGHVMLK/3_b2
#> 5 59453_ATDDAIKEIDR/3_y9_2;59454_ATDDAIKEIDR/3_y10_2;59455_ATDDAIKEIDR/3_y5;59456_ATDDAIKEIDR/3_y4;59457_ATDDAIKEIDR/3_y6;59458_ATDDAIKEIDR/3_y8_2
#> 6 59657_YHSGDYVFVK/3_y3;59658_YHSGDYVFVK/3_b6;59659_YHSGDYVFVK/3_b2;59660_YHSGDYVFVK/3_b5;59661_YHSGDYVFVK/3_y4;59662_YHSGDYVFVK/3_a6
#> aggr_Peak_Area
#> 1 3939.000000;3895.000000;1580.000000;770.000000;1101.000000;730.000000
#> 2 11139.000000;1968.000000;1632.000000;1975.000000;1001.000000;1913.000000
#> 3 3275.000000;8378.000000;3175.000000;3392.000000;1804.000000;1933.000000
#> 4 32275.000000;4911.000000;3415.000000;3550.000000;4480.000000;3413.000000
#> 5 65727.000000;45606.000000;45401.000000;34751.000000;13926.000000;13518.000000
#> 6 1979.000000;1204.000000;1502.000000;810.000000;670.000000;160.000000
#> transition_group_id decoy m_score RT
#> 1 10094_TTLHQAILMGR/3_run0 FALSE 1.388132e-03 2764.405
#> 2 10107_SVLEELK/2_run0 FALSE 1.365587e-05 2785.501
#> 3 10123_VIGVGGGGGNAINR/3_run0 FALSE 4.066284e-04 2150.922
#> 4 10185_FYDPGHVMLK/3_run0 FALSE 5.755066e-05 3056.339
#> 5 10269_ATDDAIKEIDR/3_run0 FALSE 1.073568e-07 2160.130
#> 6 10303_YHSGDYVFVK/3_run0 FALSE 1.607579e-06 2628.328
#> align_origfilename
#> 1 /media/data/tmp/strep_align/Strep0_Repl1_R02/split_hroest_K120808_all_peakgroups.xls.gz
#> 2 /media/data/tmp/strep_align/Strep0_Repl1_R02/split_hroest_K120808_all_peakgroups.xls.gz
#> 3 /media/data/tmp/strep_align/Strep0_Repl1_R02/split_hroest_K120808_all_peakgroups.xls.gz
#> 4 /media/data/tmp/strep_align/Strep0_Repl1_R02/split_hroest_K120808_all_peakgroups.xls.gz
#> 5 /media/data/tmp/strep_align/Strep0_Repl1_R02/split_hroest_K120808_all_peakgroups.xls.gz
#> 6 /media/data/tmp/strep_align/Strep0_Repl1_R02/split_hroest_K120808_all_peakgroups.xls.gz
#> Intensity Sequence delta_rt
#> 1 12015 TTLHQAILMGR 21.6801501
#> 2 19628 SVLEELK 4.8352308
#> 3 21957 VIGVGGGGGNAINR 18.6908854
#> 4 52044 FYDPGHVMLK -3.7049072
#> 5 218929 ATDDAIKEIDR 0.3056915
#> 6 6325 YHSGDYVFVK 6.3220861
str(data)
#> 'data.frame': 38272 obs. of 13 variables:
#> $ ProteinName : chr "Spyo_Exp3652_DDB_SeqID_520043" "Spyo_Exp3652_DDB_SeqID_515468" "Spyo_Exp3652_DDB_SeqID_325989" "Spyo_Exp3652_DDB_SeqID_515305" ...
#> $ FullPeptideName : chr "TTLHQAILMGR" "SVLEELK" "VIGVGGGGGNAINR" "FYDPGHVMLK" ...
#> $ Charge : int 3 2 3 3 3 3 3 3 2 3 ...
#> $ aggr_Fragment_Annotation: chr "58421_TTLHQAILMGR/3_y4;58422_TTLHQAILMGR/3_y3;58423_TTLHQAILMGR/3_b6;58424_TTLHQAILMGR/3_y9_2;58425_TTLHQAILMGR"| __truncated__ "58499_SVLEELK/2_y5;58500_SVLEELK/2_y3;58501_SVLEELK/2_y4;58502_SVLEELK/2_y3;58503_SVLEELK/2_y4;58504_SVLEELK/2_y3" "58595_VIGVGGGGGNAINR/3_b13+18_2;58596_VIGVGGGGGNAINR/3_y4;58597_VIGVGGGGGNAINR/3_y10;58598_VIGVGGGGGNAINR/3_y12"| __truncated__ "58955_FYDPGHVMLK/3_y7_2;58956_FYDPGHVMLK/3_y8_2;58957_FYDPGHVMLK/3_y7;58958_FYDPGHVMLK/3_b3;58959_FYDPGHVMLK/3_"| __truncated__ ...
#> $ aggr_Peak_Area : chr "3939.000000;3895.000000;1580.000000;770.000000;1101.000000;730.000000" "11139.000000;1968.000000;1632.000000;1975.000000;1001.000000;1913.000000" "3275.000000;8378.000000;3175.000000;3392.000000;1804.000000;1933.000000" "32275.000000;4911.000000;3415.000000;3550.000000;4480.000000;3413.000000" ...
#> $ transition_group_id : chr "10094_TTLHQAILMGR/3_run0" "10107_SVLEELK/2_run0" "10123_VIGVGGGGGNAINR/3_run0" "10185_FYDPGHVMLK/3_run0" ...
#> $ decoy : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
#> $ m_score : num 1.39e-03 1.37e-05 4.07e-04 5.76e-05 1.07e-07 ...
#> $ RT : num 2764 2786 2151 3056 2160 ...
#> $ align_origfilename : chr "/media/data/tmp/strep_align/Strep0_Repl1_R02/split_hroest_K120808_all_peakgroups.xls.gz" "/media/data/tmp/strep_align/Strep0_Repl1_R02/split_hroest_K120808_all_peakgroups.xls.gz" "/media/data/tmp/strep_align/Strep0_Repl1_R02/split_hroest_K120808_all_peakgroups.xls.gz" "/media/data/tmp/strep_align/Strep0_Repl1_R02/split_hroest_K120808_all_peakgroups.xls.gz" ...
#> $ Intensity : int 12015 19628 21957 52044 218929 6325 84020 75170 16479 138165 ...
#> $ Sequence : chr "TTLHQAILMGR" "SVLEELK" "VIGVGGGGGNAINR" "FYDPGHVMLK" ...
#> $ delta_rt : num 21.68 4.835 18.691 -3.705 0.306 ...
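Because aggr_Fragment_Annotation and aggr_Peak_Area hold one entry per fragment, concatenated by semicolons, they can be split with base R's strsplit. The sketch below copies the values printed for row 1 of the head(data) output instead of using the loaded object, so it runs without the package; the variable names are illustrative only.

```r
# Values copied from row 1 of the head(data) output above.
annotation <- paste0(
  "58421_TTLHQAILMGR/3_y4;58422_TTLHQAILMGR/3_y3;58423_TTLHQAILMGR/3_b6;",
  "58424_TTLHQAILMGR/3_y9_2;58425_TTLHQAILMGR/3_y7;58426_TTLHQAILMGR/3_b7"
)
peak_area <- "3939.000000;3895.000000;1580.000000;770.000000;1101.000000;730.000000"

# Split on the semicolon separator to get one entry per fragment.
fragments <- strsplit(annotation, ";", fixed = TRUE)[[1]]
areas <- as.numeric(strsplit(peak_area, ";", fixed = TRUE)[[1]])

# One row per fragment, i.e. a transition-level view of this precursor.
transitions <- data.frame(Fragment = fragments, Area = areas,
                          stringsAsFactors = FALSE)

print(transitions)
print(sum(areas))
```

For this row the six fragment areas sum to 12015, which equals the Intensity value shown for row 1.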