The Big Y conversion to hg38 is complete for my kit and those of my matches. In my previous post Evaluating new Big Y changes I examined my new hg38 results and the new chromosome browser. I could not examine the Big Y matching system because I had no matches. I do now, and I couldn't be more excited!
In this post we will see the entire process, from examining STRs, to ordering a SNP pack, to examining the Big Y. We will see how SNPs are determined, how to use the matching system, and how to use the new chromosome browser. We will also see what comes next.
STRs vs SNPs
When we first do Y-DNA testing, we get an STR test. We have traditionally considered only STRs to be genealogically relevant and have done SNP testing for deep ancestry. For an explanation of the differences between STR and SNP testing see Y-DNA STRs, SNPs, and Haplogroups.
But are SNPs only for deep ancestry? Can SNP testing tell us anything about genealogical relationships that STRs can't?
Let's start with examining 111 STRs and then see what information the Big Y added.
STRs led us to the immigrant ancestor
STR testing provided the clues I needed to trace my ancestor Electious Thompson back three generations further to his immigrant ancestor in Colonial Maryland. To see how I did this, see Breaking through brick walls with Y-DNA. After finding the common ancestor of my group within the Thompson project, I wanted to see if the STRs could place men within the known Thompson tree.
When you know that people are related, but you don't know precisely how, STRs can help sort people into distinct groups. For example, in just the first 37 markers, there is a clear pattern emerging where descendants of Electious Thompson have a 20 at DYS458 and a 17 at DYS576. The other men have a 19 and 18 at these positions. However, we often can't tell the order of the mutations.
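The grouping described above can be sketched in a few lines of Python. The two marker values come from the text; the kit labels are hypothetical placeholders rather than real kits, and a real comparison would use many more markers.

```python
# Signature values taken from the text: descendants of Electious Thompson
# show DYS458=20 and DYS576=17; the other men show 19 and 18.
ELECTIOUS_SIGNATURE = {"DYS458": 20, "DYS576": 17}


def matches_signature(haplotype, signature):
    """Return True when every signature marker has the signature value."""
    return all(haplotype.get(marker) == value for marker, value in signature.items())


def group_kits(kits, signature):
    """Split kits into those that carry the signature and those that do not."""
    groups = {"signature": [], "other": []}
    for label, haplotype in kits.items():
        key = "signature" if matches_signature(haplotype, signature) else "other"
        groups[key].append(label)
    return groups


# Hypothetical kit labels, for illustration only.
example_kits = {
    "kit_A": {"DYS458": 20, "DYS576": 17},
    "kit_B": {"DYS458": 19, "DYS576": 18},
    "kit_C": {"DYS458": 20, "DYS576": 17},
}

groups = group_kits(example_kits, ELECTIOUS_SIGNATURE)
print(groups)
```

A grouping like this says who belongs together, but not which value came first, which is why the order of the mutations often stays unknown.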
The current STRs can provide a broad, but not specific, grouping of descendants of our immigrant ancestor, Robert Thompson. More STRs could provide more detail. However, the STRs could not lead us to the origins of Robert Thompson because we had no matches from any more distant European ancestors.
How can we find the origins of our immigrant ancestor?
If we've gone as far as we can with our genealogy, we need to find men who are more distantly related for Y-DNA testing. But waiting for relatively distant matches to show up in our surname projects is problematic. These people may not do Y-DNA testing or join a surname project, and these relationships could have enough STR differences to keep these men off our list of matches.
We may need to recruit people. We often need to look for people who do not share our surname because surnames were adopted in Europe at different times and for different reasons. Will we match people with different surnames who all came from the same small region?
To find European origins, we often have to turn to SNP testing and haplogroup projects, which reveal more distant relationships than STR testing and surname projects. One Thompson took a SNP panel test and found that he belonged to haplogroup FGC11134. He joined the R-FGC11134 and Subclades haplogroup project.
Within the FGC11134 haplogroup project, we could see that the Thompsons could be, at least distantly, related to a man named Cairns. Cairns shared some distinct STRs with the Thompsons, but he did not show up on the STR matching list of any Thompson.
Cairns and Thompson STRs
Family Tree DNA has the following Expected Relationships chart in its Learning Center:
We can use this chart to help determine whether Cairns and Thompson are related. We will compare Cairns to my brother (Kit 38962).
The results for the first 37 markers for Cairns and Thompson are as follows:
There are STR mismatches at six locations, and two of these locations have a two-allele difference. The total genetic distance of Cairns and Thompson at 37 markers is eight. To be considered a match by Family Tree DNA, the genetic distance must be no greater than four at 37 markers.
A genetic distance of eight is well beyond FTDNA’s match criteria of four, so these two are not a match. The Expected Relationships table confirms that they are “Not Related.”
Cairns and Thompson have the following results for markers 38-67:
Family Tree DNA considers a genetic distance of seven or less to be a match at 67 markers. Thompson and Cairns have only one mismatch in the second panel, bringing their total genetic distance to nine. Because the genetic distance is greater than seven, neither of these men is on the match list of the other man. They are still “Not Related.”
Here are the results for markers 68-111:
Cairns and Thompson have five more mismatches in the 68-111 panel bringing their total genetic distance to 14. This is beyond Family Tree DNA’s cutoff of 10. Again, the Expected Relationships table states that they are “Not Related.”
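The arithmetic across the three panels can be checked with a short Python sketch. The cutoffs and mismatch counts are the ones quoted above. The marker names are omitted, and the code assumes a simple step count in which every mismatch in the later panels is a one-step difference, which is consistent with the totals in the text. FTDNA's own calculation may treat some markers differently.

```python
# FTDNA match cutoffs quoted in the text (maximum genetic distance per level).
MATCH_CUTOFFS = {37: 4, 67: 7, 111: 10}

# Step differences per panel, as described in the text (marker names omitted):
# markers 1-37: six mismatches, two of them two-step;
# markers 38-67: one mismatch; markers 68-111: five mismatches.
PANEL_STEPS = {
    37: [1, 1, 1, 1, 2, 2],
    67: [1],
    111: [1, 1, 1, 1, 1],
}


def cumulative_distances(panel_steps):
    """Return the running genetic distance at each test level."""
    total = 0
    result = {}
    for level in sorted(panel_steps):
        total += sum(panel_steps[level])
        result[level] = total
    return result


distances = cumulative_distances(PANEL_STEPS)
is_match = {level: gd <= MATCH_CUTOFFS[level] for level, gd in distances.items()}

for level, gd in distances.items():
    verdict = "match" if is_match[level] else "not a match"
    print(f"{level} markers: genetic distance {gd}, cutoff {MATCH_CUTOFFS[level]}, {verdict}")
```

At every level the running total (8, 9, 14) exceeds the cutoff (4, 7, 10), which is why neither man ever appears on the other's STR match list.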
But Cairns and Thompson share some distinctive STRs. They could show up as SNP matches. Cairns had already taken the Big Y and had been assigned to Haplogroup R-A9871, a subclade of FGC11134. The Big Y is a SNP discovery test that finds previously unknown SNPs. SNP pack testing only looks for already known SNPs. We know Thompson and Cairns share SNP FGC11134, but do they share more? Time to take the Big Y!
Cairns and Thompson SNPs
My brother's original Big Y results showed something far different from "Not Related." Cairns and Thompson shared 23 novel SNPs. When Big Y results are initially reported, the haplogroup assignment will be for a known SNP that is already on the haplotree. Family Tree DNA reported our haplogroup as R-A9871. The SNPs were not named; they were shown with their positions on the Y chromosome. These results can be most easily seen in Alex Williamson's Big Tree. Cairns and Thompson are in the far right column.
Naming the SNPs is not an automatic process, so I requested that they be reviewed. After Family Tree DNA reviewed them, Cairns and Thompson formed a new haplogroup. The 23 new SNPs were grouped together under the new haplogroup R-BY20951.
The haplogroup appeared in our FGC11134 haplogroup project.
Even though we now know for sure that Thompson and Cairns are related, they can't be related too closely because the STR results are indicating that the relationship is distant. There is no way to know the exact date of our most recent common ancestor, but the time can be estimated by SNP dating.
SNP dating is not an exact science, but it gets more precise as more tests are completed. James Kane, administrator of the FGC11134 haplogroup project, did an evaluation of our results. His evaluations are important because he uses BAM files, has already been mapping results to Build 38, and has identified many SNPs that were not previously known. Kane estimated the date of our most recent common ancestor at about 1450 AD [TMRCA: 1450AD].
Cairns has traced his ancestry to a specific parish in Scotland. Can we trace his line further? Will we have other testers in the future who will bring this date even closer?
Recruit another Thompson for Big Y testing
To help determine the accuracy of an estimated date for our common ancestor at 1450 AD, the next step was to test another Thompson for whom I had a known relationship. Thompson Kit 34484 (shown below) agreed to order the Big Y.
It is important for SNP dating to know the exact relationship between the two Thompson Big Y testers. Here is the Thompson line for kit 38962:
Here is the Thompson line for Kit 34484:
Both men descend from George Thompson who was born about 1690 in Colonial Maryland. Any SNPs that are shared between the two Thompsons, and not by Cairns, occurred sometime between the birth of George Thompson and the common Cairns-Thompson ancestor. We can determine how closely related the Thompsons are to Cairns by seeing how many SNPs are shared between the two Thompsons. Any SNPs that are not shared can be assigned to the lines of James or Robert Thompson (sons of George).
The new Big Y results are in!!
We knew from STRs and genealogy that the Thompsons were related. As expected, when the results of Kit 34484 first arrived, all three men (Cairns and two Thompsons) were assigned to haplogroup R-BY20951. But I also expected that the Thompsons would share some SNPs that Cairns would not have. I requested a review by Family Tree DNA to see if the Thompsons would form a new haplogroup. After the review, the new haplogroup is now shown in the FGC11134 project.
But the review did much more than name a new haplogroup. The list of unnamed variants and the known SNPs changed on all three kits.
Here was my brother's list of unnamed variants before the FTDNA review. He had six unnamed variants:
Here is his list of unnamed variants now. There are only two:
What happened to the four that are missing? They are now named and assigned to the haplotree. Some of these SNPs had been named when I submitted my original Big Y test to Full Genomes Corporation. Three of these were found in the FTDNA review to be shared with Cairns. Our haplogroup R-BY20951 now shows 26 SNPs:
Position 11649109 is now SNP A18880; position 11321844 is SNP FGC65819, and position 19139783 is SNP FGC65831.
Only one unnamed variant was shared by the two Thompsons and not by Cairns. This was position 11514480 which is now called SNP FGC65820 and forms our new haplogroup.
When those four SNPs were named and placed on the haplotree, it left only two unnamed variants for my brother. These SNPs occurred somewhere in the line of James Thompson, son of George.
New Big Y matching system
We can examine other variants in the Big Y Matches section. When you go to your Big Y matches, you will see five haplogroup levels. Your list of matches at the lowest level will be shown along with all of your non-matching variants.
It is important to understand what constitutes a match. Family Tree DNA lists only those people who match within an average of 30 SNPs [this has also been stated as up to 40 total SNPs, 20 for you and 20 for your match]. We know that Cairns and Thompson share 26 SNPs that nobody else has, and they probably also have "private" SNPs (i.e., SNPs not yet seen in anybody else). So we will probably not see anybody other than Cairns and Thompson in this match list. Please note the five haplogroup levels in the image below (FGC11134, ZZ44_1, A9871, BY20951, and FGC65820). The two matches at level FGC11134 do not mean that only two people are in this haplogroup; there are hundreds. They mean that only two people in this haplogroup are within 30 SNPs of my brother.
This screen shows only one match (the other Thompson), and he and my brother have three non-matching variants. This tells me that the other Thompson has variant 8837670, and my brother does not have this. This variant occurred in the line of Robert Thompson, son of George. The other two variants 12144610 and 56831461 are my brother's variants; they occurred in the line of James, son of George.
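The reasoning used to place these variants on the tree is set arithmetic, and a minimal Python sketch makes it explicit. The positions and kit numbers are the ones quoted in the text. The dictionary layout is only an illustration, and Cairns's set is left empty on the assumption that he carries none of the variants below R-BY20951.

```python
# Variant positions quoted in the text, listed per tester. Only variants
# below haplogroup R-BY20951 are included; Cairns is assumed to have none.
variants = {
    "Thompson_38962": {11514480, 12144610, 56831461},
    "Thompson_34484": {11514480, 8837670},
    "Cairns": set(),
}

brother = variants["Thompson_38962"]
cousin = variants["Thompson_34484"]
cairns = variants["Cairns"]

# Shared by both Thompsons but absent in Cairns: arose between the
# Cairns-Thompson common ancestor and George Thompson.
thompson_branch = (brother & cousin) - cairns

# Found in only one Thompson: arose after George, in that tester's own line.
james_line = brother - cousin   # kit 38962 descends from James
robert_line = cousin - brother  # kit 34484 descends from Robert

print("Defines the Thompson branch:", sorted(thompson_branch))
print("Line of James:", sorted(james_line))
print("Line of Robert:", sorted(robert_line))
```

The intersection gives position 11514480, now SNP FGC65820, and the two differences give the three non-matching variants shown on the match screen.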
Where is Cairns? He is not within Haplogroup R-FGC65820, so he is shown at all of the higher levels. If I click on BY20951, I will see matches for both Thompson and Cairns.
We see that Cairns and my brother mismatch on five named SNPs and four unnamed variants. So we must determine why these are not shared.
How are SNPs identified?
Now we get to the difficult part. How do we know if a SNP is valid? How do we know whether we have more or fewer SNPs than FTDNA identified?
We must first understand how SNPs are determined. During the testing process, your DNA is not read in one continuous stretch. Instead, your DNA is broken into random fragments. The test reads these fragments from each end. Some fragments are read many more times than others. After being read, the fragments are aligned to the reference sequence, and differences from the reference sequence are identified. The difference from the reference value is your "derived" value.
Unfortunately, not all of the reads may give the same result. One read may show the reference allele (for example a C) and another read may show a derived value (for example a G). When fragments were aligned to the old Build 37 reference sequence there were more varying calls reported, often due to poor alignment. There will be more valid SNPs identified now that Family Tree DNA is using the more recent and more accurate Build 38.
To be considered a high quality SNP by Family Tree DNA the position must be read at least ten times. The number of differing calls is then taken into account. A position that was read a few times with different results will be considered to be a much less reliable SNP than one that was read many times with a consistent result.
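As a rough illustration of these two checks, here is a minimal Python sketch. The ten-read minimum comes from the text. The 75% agreement threshold, the function name, and the status labels are my own assumptions, since FTDNA has not published its exact rule.

```python
from collections import Counter

MIN_READS = 10         # read-depth requirement stated in the text
MIN_AGREEMENT = 0.75   # hypothetical consistency threshold, not FTDNA's


def classify_position(reference, calls):
    """Classify one position from its list of read calls.

    `calls` holds one base per read; None marks a read with no call.
    Returns (genotype, status), where genotype is "?" when unreliable.
    """
    depth = len(calls)
    if depth < MIN_READS:
        return "?", "too few reads"
    counts = Counter(c for c in calls if c is not None)
    if not counts:
        return "?", "no usable calls"
    allele, n = counts.most_common(1)[0]
    if n / depth < MIN_AGREEMENT:
        return "?", "inconsistent calls"
    status = "ancestral" if allele == reference else "derived"
    return allele, status


# Made-up read lists for a position whose reference value is C.
consistent = classify_position("C", ["G"] * 12)
shallow = classify_position("C", ["G"] * 4)
mixed = classify_position("C", ["G"] * 6 + ["C"] * 6)
print(consistent, shallow, mixed)
```

With these made-up reads, twelve consistent calls yield a derived G, four reads are rejected for depth, and an even split is rejected for inconsistency.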
How can I tell if my SNPs are reliable?
The best, and easiest, way to analyze your results is to send your file to third-party analysts like YFull or Full Genomes Corporation, or both! There are also many haplogroup administrators and other volunteer analysts who will do this for free. I recommend using as many analysts as possible because all will contribute to your understanding. FTDNA has a file called the BAM file which contains the information needed to do a full evaluation. The BAM files for the new conversion to hg38 are not yet available, so I will have a future blog post showing results from BAM analysis.
But Family Tree DNA now provides a tool that will help you to do some analysis by yourself. When you are trying to find out if you have a particular SNP, or why a known SNP is on your list of non-matching variants, you can use the new chromosome browser. You can see in the browser how many times a position was read and if all calls were consistent. Below we will see examples of three different types of reads: one position that was read fewer than ten times, another position that was read more than ten times but with different calls, and a third position that seems to be incorrectly called.
Evaluating variants with the Chromosome Browser
When you go to your Big Y results page, you will see three tabs for Named Variants, Unnamed Variants, and Matching.
Under the Named Variants tab, you will see the Y-Chromosome Browsing Tool (I will hereafter call it the Chromosome Browser). Below that you will see columns for SNP Name; Derived status (whether or not you have that SNP); whether the SNP appears on FTDNA's Y haplotree; the hg38 Reference value; your value, which is called Genotype and will differ from the reference value if you are derived for that SNP; and Confidence (how confident FTDNA is that this is a valid SNP).
A SNP with low confidence
In the SNP name search, I first entered my haplogroup FGC11134. I saw "Currently no results."
That didn't make any sense. I know my brother should have that SNP. I noticed that the default for the "Derived?" column is set to "Yes." So I reset it to "Show All" and searched again for FGC11134. This time FGC11134 showed up with a question mark in the Derived column. The Reference is T and the Genotype is "?".
I can click on the name of the SNP and view it in the Chromosome Browser.
Clicking on the SNP name takes you to that position in the Chromosome Browser. The Chromosome Browser shows that SNP FGC11134 is position 19221580. We can see the Reference value T highlighted in red below the black arrow. The calls below it (in pink) are all A. This means that my brother had "A", and not the Reference value which is T.
But because this position was read only four times, it does not qualify as a reliable SNP call. To be considered a valid SNP by FTDNA, the position must have been read at least ten times. So instead of the observed value of A, the Genotype is listed in my brother's report as "?".
Most third-party analysts would report this as a SNP. For example, YFull will report variants that have been read as few as two times. It will rate these variants based on their reliability, but will only name them as SNPs and place them on its tree when they are shared with others. YFull would have reported FGC11134 as a SNP on my brother's report. Although FTDNA did not report this SNP because it did not show up with high enough confidence, it does appear in the Chromosome Browser, and my brother is still placed in the FTDNA haplotree within FGC11134.
A SNP with different calls
We can examine the non-matching SNPs between Cairns and Thompson using the same process. For any named variants (BY26955, FGC22107, etc.) we use the Named Variants tab and the SNP Name search. For unnamed variants (7403088, 12144610, etc.) we can only view our own variants using the Unnamed Variants tab and then clicking the Position name.
I want to see why BY26955 shows up as a Non-Matching Variant between Cairns and Thompson. In the Named Variants tab, enter the SNP name into the SNP Search box. Notice that the Derived column contains "Yes", so Thompson has this SNP, but Cairns must not. Now click the name of the SNP (the blue link).
The Chromosome Browser shows that SNP BY26955 is position 10658444. Here the reference value is C, and the derived values below it are mixed.
This position appears to have been read 14 times. Some of the lines are in vivid colors, indicating High Quality. Some are faint, indicating Low Quality. Eleven calls are A, two are T, and one is a no call (where the cursor is pointing). But the Genotype is reported as A with High Confidence. Did Cairns have even more mixed calls at this location, causing the SNP to be considered low-quality in his results?
Why is FGC46559 not reported?
SNP FGC46559 appears on the Cairns and Thompson list of Non-Matching Variants. I searched for this SNP in my brother's account:
As you see above, Thompson is not derived for SNP FGC46559. The reference and genotype are both listed as A which means that Thompson has the ancestral value and does not have a SNP here. Since it is a non-matching variant, Cairns must have this SNP.
This one is puzzling. When we click the SNP name and go to the Chromosome Browser, we see the following.
There are many more calls for this SNP that are not shown. You have to scroll up and down to see them all. This position was read a total of 76 times. The Reference value is A and is highlighted in red directly below the black arrow. All of the Genotype calls beneath it (in pink) are G except for that one blank space on the seventh line from the top. Yet the Genotype and Reference are both stated to be A.
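The mismatch between the reads and the reported genotype can be stated as a simple check. The sketch below is only an illustration in Python. The read counts are the ones I saw in the browser, the single blank is treated as a no-call, and the variable names are my own.

```python
from collections import Counter

# Read tally for FGC46559 as described in the text: 76 reads in total,
# all G except one blank (treated here as a no-call, represented by None).
reference = "A"
reported_genotype = "A"
calls = ["G"] * 75 + [None]

counts = Counter(c for c in calls if c is not None)
majority_allele, majority_count = counts.most_common(1)[0]
agreement = majority_count / len(calls)

# Flag the position when the reported genotype disagrees with most reads.
needs_review = majority_allele != reported_genotype

print(f"reads: {len(calls)}, majority call: {majority_allele} ({agreement:.1%})")
print(f"reported genotype: {reported_genotype}, needs review: {needs_review}")
```

Nearly every read shows G while the report says A, so the check flags the position.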
This is a case where we would submit this SNP to FTDNA for further evaluation.
How many more SNPs need to be reviewed?
The best results will be obtained when we get the BAM file and submit it to multiple third-party analysts. The BAM files should be released by Family Tree DNA early next year, and then we wait through the long process of third-party evaluation. This could take several months depending on how many people are submitting their new files. I have had my brother's previous Build 37 results analyzed by several third parties. My personal favorite is YFull because the results are online and can be compared to others. They find many SNPs not reported by FTDNA, report a few hundred STRs that can be extracted from the Big Y test results, place you on their haplotree where you can see close as well as distant matches, and more. You can see what their evaluations look like in What are the benefits of YFull? and Big Changes to YFull.