Al, you’re right: the reference set will be the biggest limitation in any breed analysis we might do, as described in our [FAQ](https://darwinsdogs.org/?pg=faq#will-you-tell-me-about-my-dogs-breed-ancestry).
To elaborate, let me start by reiterating that we are not a breed test. That is not our purpose, and it is not our goal. Ancestry information will be a likely by-product of our analyses and we will share any such information that we find on your dog – but we are not offering breed testing as a product.
But if you are looking for a breed test, there are a few metrics you should be interested in to determine how well breeds can be assigned:
### 1. How is the variation in your dog’s DNA captured
Note that virtually no method of analyzing a DNA sample tells us *everything* about that DNA. There is always a trade-off between how much data we can get and the cost and complexity of the analysis. Whole genome sequencing at a depth of 30x can get pretty close to getting all the information in a DNA sample. But whole genome sequencing is orders of magnitude more expensive than most other approaches. No currently available breed test is doing whole genome sequencing.
All commercially available breed tests that I am aware of use genotyping. This is a much cheaper, easier, and more efficient analysis that samples a number of positions within the DNA that are known to vary among dogs. To put this in context, there are about 18 million positions in the dog genome that are known to differ between dogs. Until recently, genotyping tools for dogs only checked ~170 thousand of those positions. That’s *less than 1%* of the actual variation. To be fair, if those 170 thousand spots are well chosen, good inferences can be made about many more because of linkage, as I recently described in the [IAABC journal](http://iaabcjournal.org/2016/10/01/spell-behavior-darwins-dogs-use-gs-cs-ts/). But even under the best of conditions, those genotyping tools are only getting a portion of the variation. Very recently, a new genotyping tool for dogs was made available that samples ~650 thousand different positions. This is better than the 170 thousand tool, but it still has limitations, *especially* for mixed breed dogs.
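To make the arithmetic concrete, here is a small Python sketch that works out what share of the known variable positions each genotyping tool samples directly. It uses only the approximate counts quoted above; the function and constant names are mine, chosen for illustration.

```python
# Approximate figures quoted in the text above.
KNOWN_VARIANT_POSITIONS = 18_000_000  # positions known to differ between dogs


def fraction_sampled(n_markers, total=KNOWN_VARIANT_POSITIONS):
    """Share of the known variable positions that a genotyping tool reads directly."""
    return n_markers / total


older_tool = fraction_sampled(170_000)  # the ~170 thousand position tool
newer_tool = fraction_sampled(650_000)  # the ~650 thousand position tool

print(f"older tool: {older_tool:.2%}")  # older tool: 0.94%
print(f"newer tool: {newer_tool:.2%}")  # newer tool: 3.61%
```

These are directly measured positions only. Linkage lets us infer more than these percentages suggest, but the numbers show how small the measured share is to begin with.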
Here at Darwin’s Dogs we are not sure that new genotyping tool will be good enough for our study of behavioral variation. So we are experimenting with a new approach called low coverage sequencing. This works a lot like the whole genome sequencing mentioned at the start, but rather than getting high confidence measures of nearly every bit of the DNA we get just *some* measure of almost every bit of the DNA, and by comparing across a large number of dogs we can make good inferences about which measures are reliable and which may be false positives or misses.
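For a rough feel of what "low coverage" means, here is a minimal sketch under a textbook simplification: reads land at random, so the number of reads covering any one position follows a Poisson distribution with mean equal to the average depth. The depths in the loop are illustrative values I picked, not the depth used in our study, and real sequencing is less uniform than this model assumes.

```python
import math


def fraction_covered(mean_depth):
    """Expected share of positions hit by at least one read in a single dog,
    assuming read counts per position are Poisson with the given mean depth."""
    return 1.0 - math.exp(-mean_depth)


# Illustrative depths only (assumed for this example, not taken from the study).
for depth in (0.5, 1, 4, 30):
    print(f"{depth:>4}x: {fraction_covered(depth):.1%}")
```

Under this model a single dog sequenced at low depth has gaps and thinly supported measurements, which is why comparing across a large number of dogs matters for deciding which measurements to trust.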
So here’s a summary of relevant analysis techniques. Full sequencing gets high quality measures of nearly every spot in the DNA but it is prohibitively expensive – no breed test uses this technique. Genotyping gets high quality measures of a small subset of the DNA – this is what most breed tests use. Low coverage sequencing gets moderate quality measures of nearly every spot in the DNA.
### 2. How extensive is the reference data set
This is the question you are asking. To do proper breed assignment, one does need an extensive database of dogs from as many breeds as possible. Here at Darwin’s Dogs we do have access to some published and publicly available genomes of purebred dogs, but it’s really not enough. But we also have something else: we currently have 12,000 dogs, a fair portion of which are registered purebreds. I just did a quick count of the distinct breeds represented among our registered purebred dogs, and it looks to be about 400 breeds.
### 3. What algorithm is used to assign breed ancestry to various parts of your dog’s DNA
This may be the most “mysterious” part of the process, for two reasons. First, most commercially available breed tests consider this to be a proprietary secret recipe. They will not likely share their method. In contrast, we will absolutely share all the details of any ancestry mapping we do once it is up and running. But the second reason this is a bit “mysterious” is that the algorithms used are far from easy reading. Anyone with an advanced math degree would likely find them digestible, but for most of us they are pretty tricky. Most approaches to breed assignment work much like ancestry reconstruction in humans. A Google Scholar search for these topics will give lots of examples.
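To give a flavor of what "assigning breed ancestry to various parts of the DNA" means, here is a deliberately naive toy in Python. It is not our method and not what any commercial test uses (I don't know theirs): it simply cuts a sequence into windows and labels each window with whichever reference it matches best. The breed names, sequences, and window size are all made up for illustration. Real methods are statistical and handle noise, recombination, and many reference dogs per breed.

```python
from collections import Counter


def mismatches(a, b):
    """Number of positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def assign_windows(sample, references, window):
    """Label each window of `sample` with the closest-matching reference."""
    labels = []
    for start in range(0, len(sample), window):
        end = start + window
        chunk = sample[start:end]
        best = min(references,
                   key=lambda name: mismatches(chunk, references[name][start:end]))
        labels.append(best)
    return labels


def ancestry_fractions(labels):
    """Share of windows assigned to each reference."""
    counts = Counter(labels)
    return {name: n / len(labels) for name, n in counts.items()}


# Hypothetical toy data: one made-up sequence per made-up breed.
references = {
    "breed_a": "AAAAAAAAAAAA",
    "breed_b": "GGGGGGGGGGGG",
}
sample = "AAAAAAAAGGGG"

labels = assign_windows(sample, references, window=4)
fractions = ancestry_fractions(labels)
print(labels)     # ['breed_a', 'breed_a', 'breed_b']
print(fractions)
```

Even this toy shows why the first two questions matter: the windows can only be as informative as the positions measured, and a window can only be labeled with a breed that is present in the reference set.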
So, overall, how will our ancestry reconstruction compare to breed tests? First and most importantly, this is not our goal and we don’t promise any results, so if you really want a breed test, we don’t compare: send your dog’s sample to one of the commercial services. In terms of the depth of information we get out of your dog’s DNA, we will be getting much more. In terms of the number of reference purebreds used, commercial tests that have been in business for a while probably have quite a few more than we do, but we are growing quickly. As for the algorithm used for the analysis, I can’t compare, as I don’t know what the breed tests use.