vignettes/ftmsRanalysis.Rmd
ftmsRanalysis.Rmd
Fourier-transform mass spectrometry (FT-MS) is a type of mass spectrometry for determining the mass-to-charge ratio (m/z) of ions based on the cyclotron frequency of the ions in a fixed magnetic field. Chemical composition can be determined for a portion of the observed peaks/mass-to-charge ratios. FT-MS instrument data can be interpreted as peak intensities for each observed peak. FT-MS analysis has been used to examine a wide range of complex mixtures, including soils, plants, aquatic samples, petroleum and various beverages.
The ftmsRanalysis
package was designed to help with
various steps of processing FT-MS data, including:
An example dataset has been included with the
ftmsRanalysis
package. This dataset is a subset of an
experiment to assess differences in soil organic matter between multiple
locations and crop types. The data were collected from two locations (M
and W) for two crop flora (S and C). The data were analyzed with a 12T
FTICR (Fourier-transform ion cyclotron resonance) mass spectrometer.
Data required for the ftmsRanalysis
package is comprised
of three data tables:
The edata object is a data frame with one row per peak and one column per sample. It must have one column that is a unique ID (e.g. Mass).
##
## Attaching package: 'ftmsRanalysis'
## The following object is masked from 'package:stats':
##
## heatmap
## 'data.frame': 24442 obs. of 21 variables:
## $ Mass : num 98.5 98.8 98.8 101.7 103.3 ...
## $ EM0011_sample: num 0 0 5524739 0 0 ...
## $ EM0013_sample: num 0 13070372 0 0 0 ...
## $ EM0015_sample: num 0.0 0.0 2.4e+07 0.0 0.0 ...
## $ EM0017_sample: num 0 16120890 0 0 0 ...
## $ EM0019_sample: num 0 21228496 0 0 0 ...
## $ EM0061_sample: num 1197974 0 30656158 0 0 ...
## $ EM0063_sample: num 0 12305626 0 0 0 ...
## $ EM0065_sample: num 0.0 1.1e+07 0.0 0.0 0.0 ...
## $ EM0067_sample: num 0 0 12664590 0 0 ...
## $ EM0069_sample: num 2535836 38329628 0 0 0 ...
## $ EW0111_sample: num 0 0 21416774 0 0 ...
## $ EW0113_sample: num 0 8070914 0 0 0 ...
## $ EW0115_sample: num 3636046 0 38608164 0 0 ...
## $ EW0117_sample: num 0 3965230 0 0 0 ...
## $ EW0119_sample: num 0 0 2439325 0 1153547 ...
## $ EW0161_sample: num 0 0 0 0 0 0 0 0 0 0 ...
## $ EW0163_sample: num 0 0 0 0 0 0 0 0 0 0 ...
## $ EW0165_sample: num 0 0 0 16443347 0 ...
## $ EW0167_sample: num 0 1598118 0 0 0 ...
## $ EW0169_sample: num 0 0 0 0 0 0 0 0 0 0 ...
The fdata object is a data frame with one row per sample with information about experimental conditions. It must have a column that matches the sample column names in edata.
## 'data.frame': 20 obs. of 4 variables:
## $ SampleID : chr "EM0011_sample" "EM0013_sample" "EM0015_sample" "EM0017_sample" ...
## $ Location : chr "M" "M" "M" "M" ...
## $ Block : int 1 2 3 4 5 1 2 3 4 5 ...
## $ Crop.Flora: chr "S" "S" "S" "S" ...
The emeta object is a data frame with one row per peak and columns containing other meta data. Either a column giving the molecular formula or elemental count columns are required. It must have an ID column corresponding to the ID column in edata. If information about isotopic peaks is available and specified, these peaks are currently filtered from the data upon peakData object creation.
## 'data.frame': 24442 obs. of 10 variables:
## $ Mass : num 98.5 98.8 98.8 101.7 103.3 ...
## $ C : int 0 0 0 0 0 0 0 0 0 0 ...
## $ H : int 0 0 0 0 0 0 0 0 0 0 ...
## $ O : int 0 0 0 0 0 0 0 0 0 0 ...
## $ N : int 0 0 0 0 0 0 0 0 0 0 ...
## $ C13 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ S : int 0 0 0 0 0 0 0 0 0 0 ...
## $ P : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Error : num 0 0 0 0 0 0 0 0 0 0 ...
## $ NeutralMass: num 99.5 99.8 99.8 102.7 104.4 ...
peakObj <- as.peakData(ftms12T_edata, ftms12T_fdata, ftms12T_emeta,
edata_cname="Mass", fdata_cname="SampleID",
mass_cname="Mass", element_col_names = list("C"="C", "H"="H", "O"="O", "N"="N", "S"="S", "P"="P"),
isotopic_cname = "C13",
isotopic_notation = "1")
peakObj
## peakData object
## # Peaks: 23060
## # Samples: 20
## Meta data columns: [Mass, C, H, O, N, C13, S, P, Error, NeutralMass, MolForm]
The as.peakData
function also allows for the following
(optional) parameters:
The resulting peakData
object contains three elements,
named e_data, f_data, and e_meta:
names(peakObj)
## [1] "e_data" "f_data" "e_meta"
During object construction, the molecular formula is calculated from the elemental columns (and elemental columns would be created in the case that molecular formulae were provided):
tail(peakObj$e_meta)
## Mass C H O N C13 S P Error NeutralMass MolForm
## 24437 897.1796269 0 0 0 0 0 0 0 0.0000000 898.1869 <NA>
## 24438 897.2209292 0 0 0 0 0 0 0 0.0000000 898.2282 <NA>
## 24439 897.3973977 36 69 22 1 0 0 1 0.2345417 898.4047 C36H69O22NP
## 24440 898.812526 0 0 0 0 0 0 0 0.0000000 899.8198 <NA>
## 24441 899.0458907 0 0 0 0 0 0 0 0.0000000 900.0532 <NA>
## 24442 899.3370941 0 0 0 0 0 0 0 0.0000000 900.3444 <NA>
There is a summary method:
summary(peakObj)
## Samples: 20
## Molecules: 23060
## Percent Missing: 81.739%
… and a default plot method:
plot(peakObj)
When dealing with ’omics data quantitatively, we often log-transform
to stabilize variances and reduce skew for downstream data processing.
Alternatively, it’s common to treat FT-MS data as presence/absence data.
We can use the edata_transform
function to transform the
data scale for either of these options.
peakObj <- edata_transform(peakObj, data_scale="log2")
# for presence/absence transformation:
edata_transform(peakObj, data_scale="pres")
## Warning in edata_transform(peakObj, data_scale = "pres"): Data has already been
## transformed. Data is on the log2 scale. Changing to pres scale.
## peakData object
## # Peaks: 23060
## # Samples: 20
## Meta data columns: [Mass, C, H, O, N, C13, S, P, Error, NeutralMass, MolForm]
When we plot the transformed data, the difference in scale is apparent.
plot(peakObj)
It is frequently useful for biological analysis and interpretation to
calculate values related to chemical properties of each peak, such as
the nominal oxidation state of Carbon (NOSC), aromaticity, and elemental
ratios. This can be done via the compound_calcs
function.
By default, this function calculates all available meta-data fields,
specific fields can be chosen with the calc_fns
parameter.
peakObj <- compound_calcs(peakObj)
peakObj
## peakData object
## # Peaks: 23060
## # Samples: 20
## Meta data columns: [Mass, C, H, O, N, C13, S, P, Error, NeutralMass, MolForm, AI, AI_Mod, DBE_1, DBE_O, DBE_AI, GFE, kmass.CH2, kdefect.CH2, NOSC, OtoC_ratio, HtoC_ratio, NtoC_ratio, PtoC_ratio, NtoP_ratio]
Classification of compounds based on their elemental composition is
often desirable. The assign_elemental_composition
function
accomplishes this task.
peakObj <- assign_elemental_composition(peakObj)
table(peakObj$e_meta[,getElCompColName(peakObj)])
##
## CHO CHON CHONP CHONS CHONSP CHOP CHOS CHOSP
## 3528 3728 1416 742 317 95 884 249
Further, each compound formula can also be assigned to biochemical
compound classes (e.g. lipids, lignins, etc.) based on their chemical
properities (e.g. O:C, H:C ratios), and the assign_class
function performs this assignment.
peakObj <- assign_class(peakObj, boundary_set = "bs1")
table(peakObj$e_meta[, getBS1ColName(peakObj)])
##
## Amino Sugar Carbohydrate
## 352 698
## Carbohydrate;Amino Sugar Cond Hydrocarbon
## 64 1745
## Lignin Lignin;Amino Sugar
## 2901 103
## Lignin;Tannin Lipid
## 20 709
## Lipid;Protein Other
## 21 1840
## Protein Protein;Amino Sugar
## 1206 35
## Protein;Lignin Protein;Lignin;Amino Sugar
## 7 2
## Tannin Tannin;Cond Hydrocarbon
## 984 57
## Unsat Hydrocarbon Unsat Hydrocarbon;Cond Hydrocarbon
## 203 12
There are three sets of class boundary definitions that may be used
(for the boundary_set
parameter) corresponding to the
following publications:
There are multiple types of filtering algorithms provided in
ftmsRanalysis
:
For example, to filter peaks to include only masses between 200 and 900:
filter_obj <- mass_filter(peakObj)
plot(filter_obj, min_mass=200, max_mass=900)
summary(peakObj)
## Samples: 20
## Molecules: 23060
## Percent Missing: 81.739%
## Samples: 20
## Molecules: 19327
## Percent Missing: 79.299%
Other filtering options include number of molecule observations, formula presence or absence, or emeta columns.
peakObj <- applyFilt(molecule_filter(peakObj), peakObj, min_num=2)
peakObj <- applyFilt(formula_filter(peakObj), peakObj)
peakObj <- applyFilt(emeta_filter(peakObj, "NOSC"), peakObj, min_val = 0.5)
summary(peakObj)
## Samples: 20
## Molecules: 1521
## Percent Missing: 50.352%
To construct plots of a single sample, first we must subset the data
to contain only one sample, via the subset
method.
## Samples: 1
## Molecules: 1521
## Percent Missing: 57.791%
head(one_sample$e_data)
## Mass EM0011_sample
## 3746 200.9433045 NA
## 3748 200.9863723 NA
## 3774 202.9413892 NA
## 3844 209.0091827 20.32056
## 3892 212.0200788 20.82018
## 3909 213.004142 NA
A Van Krevelen plot shows H:C ratio vs O:C ratio for each peak
observed in a sample that has a molecular formula (thus H:C and O:C are
known). By default, the points are colored according to molecular
composition class determined by bs1
.
vanKrevelenPlot(one_sample, title="EM0011_sample")
By default, this function colors by Van Krevelen class. However, we
can also color the points according to other meta data columns in the
e_meta
object.
vanKrevelenPlot(one_sample, colorCName="PtoC_ratio",
title="Color by P:C Ratio", legendTitle = "P:C Ratio")
A Kendrick plot shows Kendrick Defect vs Kendrick mass for each observed peak.
Ions of the same family have the same Kendrick mass defect and are positioned along a horizontal line on the plot. A Kendrick plot is often used in conjunction with a Van Krevelen plot for evaluating elemental composition.
kendrickPlot(one_sample, title="Kendrick Plot for EM0011_sample")
We can also plot the distributions of any (numeric) column of
meta-data (i.e. column of e_meta
).
densityPlot(one_sample, variable = "NOSC", plot_curve=TRUE, plot_hist=TRUE,
title="NOSC Distribution for EM0011_sample")
It’s also possible to plot just the histogram or just the density
curve with this function with the plot_hist
and
plot_curve
parameters.
densityPlot(one_sample, variable = "kmass.CH2",
title="Kendrick Mass for EM0011_sample", plot_hist=TRUE,
plot_curve = FALSE, yaxis="count")
The goal of this experiment was to identify differences in soil organic matter between sample locations and crop types. In order to do that we need to compare experimental treatments (groups).
The group_designation
method defines treatment groups
based on the variable(s) specified as main effects. Here we define
groups based on the crop/flora type.
peakObj <- group_designation(peakObj, main_effects=c("Crop.Flora"))
getGroupDF(peakObj)
## SampleID Group
## 1 EM0011_sample S
## 2 EM0013_sample S
## 3 EM0015_sample S
## 4 EM0017_sample S
## 5 EM0019_sample S
## 6 EM0061_sample C
## 7 EM0063_sample C
## 8 EM0065_sample C
## 9 EM0067_sample C
## 10 EM0069_sample C
## 11 EW0111_sample C
## 12 EW0113_sample C
## 13 EW0115_sample C
## 14 EW0117_sample C
## 15 EW0119_sample C
## 16 EW0161_sample S
## 17 EW0163_sample S
## 18 EW0165_sample S
## 19 EW0167_sample S
## 20 EW0169_sample S
The summarizeGroups
function calculates group-level
summaries per peak, such as the number or proportion of samples in which
peak is observed. The resulting object’s e_data
element
contains one column per group, per summary function.
group_summary <- summarizeGroups(peakObj, summary_functions =
c("n_present", "prop_present"))
head(group_summary$e_data)
## Mass S_n_present S_prop_present C_n_present C_prop_present
## 1 200.9433045 3 0.3 4 0.4
## 2 200.9863723 2 0.2 0 0.0
## 3 202.9413892 1 0.1 1 0.1
## 4 209.0091827 10 1.0 7 0.7
## 5 212.0200788 9 0.9 9 0.9
## 6 213.004142 2 0.2 3 0.3
We can use the densityPlot
function to compare
distributions of a molecular property (e.g. NOSC) between groups.
densityPlot(peakObj, samples=FALSE, groups=c("S","C"), variable="NOSC",
title="Comparison of NOSC Between Crop Types")
We might also want to look at which peaks occur only in one group or another, versus those that appear in both groups. The number or proportion of samples for which a peak must be observed can be specified to determine if a peak was present for a group. Similarly, a threshold based on the number or proportion of samples can be specified to determine when a peak is absent from a group. Alternatively, a statistical test called the G-Test can be used. This is a likelihood ratio test which tests the hypothesis that the presence/absence of a peak across samples is independent of group membership.
The first step is to create peakData objects that each contain two groups to facilitate group comparisons
byGroup <- divideByGroupComparisons(peakObj,
comparisons = "all")[[1]]$value
crop_unique <- summarizeGroupComparisons(byGroup,
summary_functions="uniqueness_gtest",
summary_function_params=list(
uniqueness_gtest=list(pres_fn="nsamps",
pres_thresh=2, pvalue_thresh=0.05)))
head(crop_unique$e_data)
## Mass uniqueness_gtest
## 1 200.9433045 Observed in Both
## 2 200.9863723 <NA>
## 3 202.9413892 <NA>
## 4 209.0091827 Unique to S
## 5 212.0200788 Observed in Both
## 6 213.004142 Observed in Both
Then we can construct a Van Krevelen plot colored by group membership.
vanKrevelenPlot(crop_unique, colorCName = "uniqueness_gtest")
The same could be done with a Kendrick plot.