Scottish Marine and Freshwater Science Volume 5 Number 5: Identifying a Panel of Single Nucleotide Genetic Markers for River Level Assignments of Salmon in Scotland

This report examines the utility of assigning fish to river in Scotland using Single Nucleotide Polymorphism (SNP) markers.


Methods

In outline, samples from numerous rivers were first screened at a large number of SNP markers. A regional hierarchy of 'assignment regions' was then identified in which adjacent rivers shared genetic characteristics, and a set of SNPs was identified that allowed assignment to these regions. Further sets of SNPs were subsequently identified that allowed finer resolution of regional units until, in areas of significant interest where river coverage was sufficient, SNPs were identified that allowed assignment to the rivers represented in the baseline. Techniques were also developed that allowed individual fish to be excluded from the analysis if their assignment confidence was not high (i.e. there was no strong evidence that they originated from regions/rivers represented within the baseline). In this way, fish could first be assigned to region, and then within some of the regions individuals could be either assigned to river or excluded from the analysis.

SNP baseline coverage

Figure 1 SNP sites included in the analysis.

Figure 1 details the V2 SNP sites available for analysis. The baseline covers 66 sites from 24 rivers, with a total of 1896 fish. The fish were screened using the 'V2-SNP microarray' on an Illumina Beadstation 500G platform at the Centre for Integrative Genetics (CIGENE) in Norway. This resulted in 5568 SNP genotypes being generated for each fish in the dataset.

Hierarchical analysis

The approach used to identify a useful set of SNP markers for river level analysis from the 5568 available was to follow a hierarchical analysis of the available baseline data, rank the SNPs according to their usefulness in discriminating between reporting groups (using a measure of genetic distance) at each hierarchical level and combine the highest ranked SNPs from each level's analysis into a final panel. This approach was designed to allow the identification of those SNPs that performed best at each level, and to combine these into a single cost-effective set. The hierarchy is outlined in Figure 2.

Figure 2 Hierarchical approach to identification of SNPs. Points where SNPs were ranked are noted. For description of assignment units see Fig. 6 and 7 and results section.

At each of the 3 regional hierarchical split points a measure of genetic distance (DA, Nei's standard genetic distance; Nei, 1972) was calculated between each site. This measure was then used to visualise the pairwise genetic distance relationships between sites and regions using Multi-Dimensional Scaling (MDS) plots. Groups of similar sites (i.e. regional groupings) were then identified using K-Means Clustering. Finally the SNPs were ranked according to how powerful each one was in its ability to separate the assignment region groups. This was achieved by calculating hierarchical F-statistics (a measure of population differentiation) for each SNP and ranking the SNP loci according to their discriminatory power using the R-package HIERFSTAT (Goudet, 2005).
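The distance and MDS steps can be illustrated with a minimal sketch. It assumes biallelic SNPs summarised as one allele frequency per site and locus, computes Nei's (1972) standard distance without any sample-size correction, and uses classical (eigenvector-based) MDS, which may differ from the MDS variant used in the report. The function names and the input layout are illustrative assumptions, and the K-Means step is not shown.

```python
import numpy as np

def nei_standard_distance(freqs):
    """Pairwise Nei (1972) standard distance between sites.

    freqs: array of shape (n_sites, n_loci) holding the frequency of one
    allele at each biallelic SNP (assumed input layout).
    """
    p = np.asarray(freqs, dtype=float)
    q = 1.0 - p
    # J[x, y]: mean over loci of the summed products of allele frequencies.
    J = (p @ p.T + q @ q.T) / p.shape[1]
    norm = np.sqrt(np.diag(J))
    identity = np.clip(J / np.outer(norm, norm), 1e-12, 1.0)
    D = -np.log(identity)
    np.fill_diagonal(D, 0.0)
    return D

def classical_mds(D, n_components=2):
    """Classical MDS coordinates from a symmetric distance matrix."""
    n = D.shape[0]
    centring = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * centring @ (D ** 2) @ centring
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:n_components]
    # Negative eigenvalues (non-Euclidean distances) are clipped to zero.
    return vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))

# Hypothetical frequencies for three sites at three SNPs.
example_freqs = [[0.10, 0.20, 0.30],
                 [0.12, 0.22, 0.28],
                 [0.80, 0.70, 0.90]]
example_D = nei_standard_distance(example_freqs)
example_coords = classical_mds(example_D)
```

Sites that plot close together in the resulting coordinates have similar allele frequencies, which is the property the regional groupings rely on.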

Within the Kyle, East and South reporting regions identified by this analysis, further SNP rankings were performed but this time focusing on ranking the SNPs according to their power in differentiating between rivers within these regions.
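The idea of ranking SNPs by discriminatory power can be sketched as follows. This is a simplification and not the HIERFSTAT calculation: it uses a single-level, per-locus FST of the form (HT - HS) / HT with equal weight given to each group, and the function name and input layout are assumptions made for illustration.

```python
import numpy as np

def rank_snps_by_fst(group_freqs):
    """Rank biallelic SNPs by a simple per-locus FST.

    group_freqs: array of shape (n_groups, n_loci) with the frequency of
    one allele in each group (regions, or rivers within a region).
    Returns (order, fst), where order lists locus indices from most to
    least differentiated.
    """
    p = np.asarray(group_freqs, dtype=float)
    # Mean expected heterozygosity within groups.
    hs = (2.0 * p * (1.0 - p)).mean(axis=0)
    # Expected heterozygosity of the pooled (equally weighted) groups.
    p_bar = p.mean(axis=0)
    ht = 2.0 * p_bar * (1.0 - p_bar)
    # Loci that are fixed across all groups get FST = 0.
    fst = np.divide(ht - hs, ht, out=np.zeros_like(ht), where=ht > 0)
    order = np.argsort(-fst, kind="stable")
    return order, fst

# Hypothetical frequencies for two groups at three SNPs.
example_order, example_fst = rank_snps_by_fst([[0.1, 0.5, 0.4],
                                               [0.9, 0.5, 0.6]])
```

Applying the same function to regions at the upper levels and to rivers within a region at the lower levels gives one ranked list per level of the hierarchy.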

The analysis resulted in 6 sets of ranked SNPs. The final part of the procedure was to combine the top ranked SNPs of each of these ranked sets into a single set of SNPs for assignment of fish to region and/or river. The top 48 SNPs were taken from each ranked set and incorporated into a single panel, resulting in a panel of 288 SNPs (6 sets of 48). Where the same SNP was present in more than one ranked set, the next available SNP in that ranked set was used in its place.
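The panel assembly rule can be sketched as below. The function name, the SNP identifiers and the order in which the ranked sets are processed are assumptions; the report does not state which set takes priority when a SNP appears in more than one.

```python
def build_panel(ranked_sets, per_set=48):
    """Take the top `per_set` SNPs from each ranked list, skipping any SNP
    already contributed by an earlier list and moving down the ranking."""
    panel = []
    chosen = set()
    for ranking in ranked_sets:
        taken = 0
        for snp in ranking:
            if taken == per_set:
                break
            if snp in chosen:
                continue  # duplicate: use the next available SNP instead
            panel.append(snp)
            chosen.add(snp)
            taken += 1
    return panel

# Six hypothetical, overlapping rankings of integer SNP identifiers.
example_rankings = [list(range(k * 20, k * 20 + 200)) for k in range(6)]
example_panel = build_panel(example_rankings)
```

With six rankings and 48 SNPs per ranking this yields 288 distinct SNPs, provided each ranking is long enough to supply replacements for its duplicates.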

Data preparation and production of 'Training' and 'Hold-out' test panels

Outlier sites were first identified by examination of the MDS plots and then removed from the analysis. Pairwise measures of DA were calculated between each site and summarised in MDS plots. This allowed sites whose genetic character was very different from those in nearby systems to be identified. Such differences could arise for a number of reasons, including both sampling artefacts and real genetic differences in the sites' genetic composition. However, even if the genetic differences reflect the real genetic composition of such groups, the aim of the initial screenings was to identify broad regional genetic signals, and these 'outliers' were not representative of such signals. The SNP data were then examined for loci where the minor allele frequency was below 10%. These loci were removed because scoring errors during genotyping can produce such frequencies, which in turn can give rise to false estimates of allele frequencies (Anderson et al., 2008; Tabangin et al., 2009; Pongpanich et al., 2010) and so negatively influence SNP ranking procedures.
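The minor allele frequency filter can be sketched as follows, assuming genotypes are coded as the number of copies (0, 1 or 2) of one allele per fish and locus, with missing calls stored as NaN. The coding and the function name are assumptions; only the 10% threshold comes from the text.

```python
import numpy as np

def filter_by_maf(genotypes, threshold=0.10):
    """Flag loci whose minor allele frequency is at or above `threshold`.

    genotypes: array of shape (n_fish, n_loci) with allele counts 0/1/2
    and NaN for missing genotypes (assumed coding).
    Returns (keep, maf): a boolean mask over loci and the MAF per locus.
    """
    g = np.asarray(genotypes, dtype=float)
    allele_freq = np.nanmean(g, axis=0) / 2.0
    maf = np.minimum(allele_freq, 1.0 - allele_freq)
    keep = maf >= threshold  # loci below the threshold are removed
    return keep, maf

# Ten hypothetical fish at three loci: MAF 0.05, 0.50 and 0.15.
example_genotypes = np.zeros((10, 3))
example_genotypes[0, 0] = 1      # one heterozygote at locus 0
example_genotypes[:, 1] = 1      # all heterozygotes at locus 1
example_genotypes[:3, 2] = 1     # three heterozygotes at locus 2
example_keep, example_maf = filter_by_maf(example_genotypes)
```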

It is very important that an unbiased procedure is used when analysing assignment success rates. Bias can enter the assessment of success rates when the same set of samples is used both to choose a set of SNP loci and then to test the power of those loci. It has been clearly shown that this can lead to what is termed 'ascertainment bias' due to the circularity of the approach (Waples et al., 2008; Anderson, 2010). To prevent this, and before any analysis was performed, the dataset was randomly split into two equal parts. At each site in the baseline, an equal number of fish were put into a 'Training Set' (TS) and a 'Hold-out Set' (HS). Further, in some cases where there were multiple sites within a river, entire sites were randomly removed from the baseline and included only in the HS. The TS was then used to rank and choose the SNP markers. The success rates of the different marker sets were then determined by assigning fish from the HS back to the TS baseline. In this way, ascertainment bias should be greatly reduced. It must be noted, however, that this procedure requires the dataset to be split in two, so the baseline is significantly reduced in size and potentially in power, and the procedure could therefore underestimate assignment accuracy (Waples, 2010).
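A minimal sketch of the split is given below. The input format (pairs of fish identifier and site name), the function name, the fixed random seed and the handling of sites with an odd number of fish are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def split_training_holdout(fish_sites, holdout_sites=(), seed=0):
    """Split fish into a Training Set and a Hold-out Set, site by site.

    fish_sites: iterable of (fish_id, site) pairs.
    holdout_sites: sites placed entirely in the Hold-out Set.
    """
    rng = random.Random(seed)
    by_site = defaultdict(list)
    for fish_id, site in fish_sites:
        by_site[site].append(fish_id)

    training, holdout = [], []
    for site, fish in by_site.items():
        if site in holdout_sites:
            holdout.extend(fish)  # whole site withheld from the baseline
            continue
        shuffled = fish[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2  # an odd fish out goes to the hold-out
        training.extend(shuffled[:half])
        holdout.extend(shuffled[half:])
    return training, holdout

# Hypothetical data: two split sites of 10 fish and one withheld site of 4.
example_fish = ([(f"a{i}", "site_a") for i in range(10)]
                + [(f"b{i}", "site_b") for i in range(10)]
                + [(f"c{i}", "site_c") for i in range(4)])
example_ts, example_hs = split_training_holdout(
    example_fish, holdout_sites={"site_c"})
```

SNP ranking would then use only the training fish, and assignment success would be measured only on the hold-out fish.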

To try to estimate the power of the assignments without this confounding issue of only using half the fish in the baseline, new baseline and mixture files were assembled. The baseline was the full dataset (i.e. not half of each site's data). The test mixture file comprised an entire site from each river with two or more sites of data. These sites were also those removed at the start of the whole hierarchical analysis and so had never been part of any SNP ranking procedure (i.e. the baseline had the half of the fish that were removed for the original HS put back, and the TS was now simply the whole sites from the original TS). As such, they were by definition unaffected by issues of ascertainment bias. Further, as the data from the sites were the entire sites' data, they represented fish from a particular river from locations that were not represented in the baseline. In reality, only a limited number of sites will be present in any baseline, so the strictest test of an assignment protocol is how well fish from other sites assign to the assignment unit under investigation. The test was therefore very rigorous, and represented the hardest possible scenario of assignment (where a river of origin is represented in the baseline), rather than those often carried out (e.g. self-assignment, or even TS/HS, where in both cases the actual site of origin of the test fish is always in the baseline). Sites represented in this mixture were from the rivers: Carron, Conon, Dee, Nith, South Esk, Spey and Tweed (i.e. those from rivers which contained more than two sites).

Exclusion of fish from rivers not in the baseline

The baseline available for assignment of fish to Scottish rivers will never cover every river in Scotland. Many small rivers will not be represented, yet fish from these rivers may be found in any mixed stock group of individuals. Attempts should therefore be made to identify these fish and remove them from the assignment procedure because by definition if they are assigned they will be wrongly assigned.

Exclusion of fish from rivers not in the baseline was examined using the Exclusion method of assignment analysis according to Vasemägi et al. (2001) (see also Ikediashi et al., 2012). When performing assignments, 2 metrics are obtained: the assignment score and the assignment probability. The assignment score shows the proportion of times in the simulated assignment attempts (typically 10,000 attempts) an individual fish assigns to each site in the baseline (and this can be summed for each site within a river to give the value at the river level). The assignment probability will return the probability (typically using Bayesian methods and Monte-Carlo resampling) that an individual belongs to each reference population. This value is calculated for each reference population and so cannot be summed to return a river level result if sites are used in the baseline as reference populations. The exclusion approach as described in Vasemägi et al. (2001) uses the probability values to exclude fish from the assignment analysis if the maximum probability of assigning to any of the reporting groups is less than 0.05 (i.e. no strong evidence of belonging to any of the reporting groups).
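The exclusion rule itself can be sketched as below. It assumes the assignment probabilities have already been produced by assignment software and are supplied as one list per fish, ordered like the list of reporting groups. Assigning retained fish to the group with the highest probability is an assumption made here for illustration; the function name and the example values are hypothetical.

```python
def apply_exclusion(probabilities, groups, threshold=0.05):
    """Return the assigned group for each fish, or None if excluded.

    probabilities: one sequence per fish, giving the probability of
    belonging to each reporting group, in the same order as `groups`.
    A fish is excluded when its maximum probability is below `threshold`.
    """
    results = []
    for probs in probabilities:
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] < threshold:
            results.append(None)  # no strong evidence for any group
        else:
            results.append(groups[best])
    return results

# Hypothetical probabilities for three fish and three reporting groups.
example_groups = ["Dee", "Spey", "Tweed"]
example_probs = [[0.62, 0.10, 0.01],
                 [0.02, 0.01, 0.04],
                 [0.03, 0.05, 0.20]]
example_calls = apply_exclusion(example_probs, example_groups)
```

Raising or lowering the threshold trades the number of out-of-baseline fish that are correctly excluded against the number of baseline fish that are wrongly removed.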

A new analysis was performed to examine the efficiency of this technique. New baseline and mixture files were created using the SNP panel identified in the analysis described above. The same sites were removed from the baseline and placed into a mixture file, as had been used above to examine accuracy of assignments. In addition, all the fish from entire rivers were removed from the baseline and added to the mixture file. Assignments were made and both scores and probabilities calculated. Different levels of cut-off for both scores and probabilities were used and accuracy of assignments determined.

The mixture file contained fish from sites in rivers present in the baseline (Dee, Spey, Kyle, Nith and Tweed) and fish from rivers entirely absent from the baseline (Ayr, Conon, North Esk, South Esk, Helmsdale and Dionard).

The aim of the exclusion analysis was to identify, and so be able to remove from further analysis, fish from rivers not represented in the baseline, while retaining in the assignment analysis those fish from rivers that were in the baseline.

Regional assignment test

A further analysis was performed to examine the power of the regional level assignments in more detail and as a first limited test of the ability to assign fish to the highest level regional assignment units. Fish were examined from 15 randomly chosen new sites from around the Scottish coastline which had not been screened for the larger SNP panel and so had not been involved in any part of the initial analysis (Figure 3). Six fish were screened at each site using the top 89 SNPs from those identified at the SNP Ranking 1 and SNP Ranking 2 stages as detailed in Figure 2. Assignments were then performed using the methods outlined above and accuracy of assignments determined.

Figure 3 Map showing the sites of the samples used in the regional assignment. 1 River Cree, 2 River Garnock, 3 River Euchar, 4 River Morar, 5 River Ewe, 6 Rhiconich River, 7 River Hope, 8 River Thurso, 9 River Brora, 10 River Lossie, 11 River Ugie, 12 River North Esk, 13 River Tay, 14 River Forth, 15 River Tweed.
