Genome, Evolution, and Domestication of the Cat

Even though most of my posts on MassGenomics concern human genetics and genomics, today I’d like to highlight a milestone in another species, one that many humans care fiercely about. This guy:

domesticated cat

Credit: renekyllingstad on Flickr

Cat lovers, rejoice! This month in the Proceedings of the National Academy of Sciences, Mike Montague, Wes Warren, and colleagues published the first complete reference genome for the domestic cat. Their analyses offer insights into the genetics underlying feline biology, evolution, and most recently, domestication.

The cat recently surpassed the dog as the most popular pet in the world, with a global population size estimated at 600 million. Many people credit ancient Egypt with cat domestication, but there’s archaeological evidence showing that cats and humans lived together 5,000 years ago in China, and ~9,500 years ago in Cyprus. In both cases, the new relationship seemed to arise when people turned to agriculture to feed themselves. It seems obvious what happened: farming and storing grains drew rodents, and rodents drew cats.

The path to domestication for cats differs from that of most other domesticated animals that were selectively bred for food (livestock), herding, hunting, or security (dogs). Most of the 30-40 cat breeds recognized today originated within the last 150 years, and were selected mainly for aesthetic traits rather than functional ones. That’s a fancy way of saying we bred cats to be pretty, not to be useful.

The Cat Genome

The 19 chromosomes in the reference cat genome (18 autosomes and an X chromosome) span 2.35 billion base pairs. The genome contains about 19,500 protein-coding genes and 1,850 non-coding RNAs, numbers that are very similar to those of the dog. The authors first looked at the ~10,000 genes that had orthologs (matches in another species) in the tiger, dog, cow, and human genomes.

Genes Under Positive Selection

They searched in particular for genes under positive natural selection, and put those findings in context with what we know about cats relative to other carnivores.

Cats have the broadest hearing range among carnivores. There are at least six genes that look to be under positive selection in cats that are associated with hearing capacity; we know this because mutations in these genes cause nonsyndromic recessive hearing loss or deafness. At least 20 genes under positive selection in cats are associated with vision-related pathways, which fits with the importance of visual acuity for these natural-born hunters.

Felines are “crepuscular” hunters, meaning that they’re most active in the twilight periods before sunrise and after sunset. Thus it was fascinating to see positive selection on genes like CHM and CNGB3, in which mutations can cause retinal diseases featuring night blindness (i.e. choroideremia and retinitis pigmentosa) in humans.

Cats rely less on their sense of smell for hunting than dogs do, which is apparent from the smaller repertoire of olfactory receptor genes in the feline genome. However, the cat ancestor had more genes encoding vomeronasal sensation. The vomeronasal organ is a sort of auxiliary sense of smell, mainly used to detect pheromones. It’s been suggested that there’s a tradeoff between olfactory and vomeronasal capacity in evolution, and the cat’s genome supports that: sense of smell was traded for pheromone detection, on which cats rely for social communication.

Wildcats and Domestication Genes

What about genes that might be involved in the domestication process? To search for these, the authors combined sequencing data from 22 cats, including both domestic and wildcat breeds. Wildcats (Felis silvestris) are small cats found in Africa, Europe, and parts of Asia. They tend to be larger than domestic cats, with longer legs and more robust bodies. There are numerous subspecies of wildcat, but they generally fall into one of three specialties:

  1. Forest wildcats, like the European wildcat.
  2. Steppe wildcats, whose ancestors migrated to the Middle East and tend to have smaller bodies, longer tails, and lighter fur.
  3. Bay or bush wildcats, which have paler coats and more defined patterns (stripes and spots).

As you might have guessed, house cats are thought to have been domesticated from those fancy-looking bay wildcats, probably an African subspecies.

When the authors looked for evidence of selection, they found regions harboring genes like:

  • PCDHA1 and PCDHB4, which play a role in neural connection establishment/maintenance and fear conditioning.
  • GRIA1, a glutamate receptor gene involved in associating stimulus with reward.
  • DCC, encoding the netrin receptor, which is expressed in dopaminergic neurons. Knockouts of this gene in mice produced animals with defects in memory, behavior, and reward.

So it looks like cats chose to domesticate themselves because they noticed that, if they came in and helped out with the rodent control, we would reward them with food. And they stayed because they were afraid that we wouldn’t feed them if they remained in the wild. The last assertion might not be correct based on observations of my neighbor feeding strays, but please, no one tell the cats that.

More on Cats and Domestication

If you want more great stories about the cat genome and domestication, you’ll find good articles in Wired Magazine and the Washington Post. Senior author Wes Warren also appeared on NPR’s Science Friday last week.

Montague MJ, Li G, Gandolfi B, Khan R, Aken BL, Searle SM, Minx P, Hillier LW, Koboldt DC, Davis BW, Driscoll CA, Barr CS, Blackistone K, Quilez J, Lorente-Galdos B, Marques-Bonet T, Alkan C, Thomas GW, Hahn MW, Menotti-Raymond M, O’Brien SJ, Wilson RK, Lyons LA, Murphy WJ, & Warren WC (2014). Comparative analysis of the domestic cat genome reveals genetic signatures underlying feline biology and domestication. Proceedings of the National Academy of Sciences of the United States of America PMID: 25385592

Brace Yourself for Large-Scale Whole Genome Sequencing

The release of the Illumina HiSeq X Ten sequencing system, and its current use restriction (only human, only whole-genome sequencing) are going to cause a major paradigm shift in human genetics studies over the next few years. Until now, we’ve seen relatively few large-scale efforts to apply whole-genome sequencing (WGS) to large numbers of samples. But the capability of a single X Ten installation to sequence ~18,000 genomes per year at a relatively low cost means that, for the first time, it may become easier to apply WGS as the primary discovery tool.

large scale genome sequencing on the X Ten

The Illumina HiSeq X Ten

I’ve already written about the realities of the sequencing GWAS to discuss some of the considerations in going from genotyping (SNP arrays) to sequencing (next-gen) for genetic association studies. Unlike genotyping, sequencing enables both discovery and genotyping, with the caveat that you’ll end up with:

  • Many rare variants private to an individual or family
  • Increased missingness in the resulting genotypes
  • More false-positive variants
  • Additional QC challenges

These are simply the reality of going from clean, defined SNP array datasets (>99.1% call rate) to next-gen sequencing data, which depends on alignment and variant calling and depth/breadth of coverage.

Data Storage Demon

One of the major practical considerations for whole-genome sequencing data is on the computational requirements side: data processing, storage, and retention. A binary alignment/map (BAM) file — which contains the sequences, base qualities, and alignments to a reference sequence — for a 30x whole genome is about 80-90 gigabytes in size. The BAM files for a modest sample size (1,000) might consume 80 terabytes of disk space. And that disk space is not free. It costs actual dollars to purchase and maintain over time.
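To make that arithmetic concrete, here is a minimal sketch of the storage estimate. The per-genome BAM sizes and the sample counts come from this post; the function name is just for illustration, and I've assumed decimal units (1 TB = 1,000 GB).

```python
# Back-of-the-envelope BAM storage estimate.
# Sizes per genome (80-90 GB at 30x) come from the text above.
# Assumption: decimal units, 1 TB = 1,000 GB.

def bam_storage_tb(n_samples, gb_per_bam):
    """Total disk space in terabytes for n_samples BAM files."""
    return n_samples * gb_per_bam / 1000


# A "modest" study of 1,000 genomes, and one year of X Ten output (~18,000).
for n_samples in (1_000, 18_000):
    low = bam_storage_tb(n_samples, 80)
    high = bam_storage_tb(n_samples, 90)
    print(f"{n_samples:>6,} genomes: {low:,.0f}-{high:,.0f} TB")
```

By this estimate, a single year of X Ten output would need well over a petabyte for the BAM files alone.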

I’m resisting the urge to show you that cost of sequencing / Moore’s law comparison plot here.

Because disk space is both finite and costly, and these files are so huge, at some point researchers will have to choose between getting new data and actually deleting some old data. Kind of like a “one in, one out” policy at a crowded bar. No one likes throwing data away. We NGS analysts shudder at the idea of not being able to go back to the BAM file to run yet another variant caller, or review that interesting variant. At some point we may have to call the sample’s analysis DONE and leave it that way. Because, let’s be honest, 99% of the bases in a BAM file match the reference. It’s the variants that we’re truly interested in.

Data Transfer: Traffic Jams Ahead

Another consideration is the simple act of moving data around. With a $10 million price tag, few research groups will be able to afford an X Ten cluster, but those who can’t will be unable to stay competitive on the cost of WGS. On the other side of the table, the lucky X Ten installation sites will need to find samples. This means that most whole-genome sequencing will take place at a few locations, and the resulting data transferred to the investigators who sent in the samples.

Have you tried to download an 80 gigabyte file lately? The regular internet is just not going to work for this.
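For a sense of scale, here is a rough sketch of how long that download takes. The 80 GB file size is from this post; the connection speeds are assumptions picked for illustration, and the calculation ignores protocol overhead and shared bandwidth, so real transfers would be slower.

```python
# Idealized transfer-time estimate for a whole-genome BAM file.
# Assumptions: decimal units (1 GB = 8e9 bits) and a fully sustained link speed.

def transfer_hours(size_gb, mbit_per_s):
    """Hours needed to move size_gb gigabytes over a link of mbit_per_s."""
    bits = size_gb * 8e9
    seconds = bits / (mbit_per_s * 1e6)
    return seconds / 3600


# Hypothetical link speeds in megabits per second.
for speed in (10, 100, 1000):
    one_genome = transfer_hours(80, speed)
    study = transfer_hours(80 * 1000, speed) / 24  # 1,000 genomes, in days
    print(f"{speed:>5} Mbit/s: {one_genome:6.2f} h per genome, "
          f"{study:8.1f} days per 1,000 genomes")
```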

You There, with the Samples!

A couple of years ago, I wrote that in a world with widespread genome sequencing capacity, samples are the new commodity. That has never been more true than in the world of the X Ten. The institutions that have them will need to find several thousand samples per year in order to achieve the optimal per-genome cost.

I don’t know too much about the details of consenting samples, but I know that many, many research samples are not consented for whole genome sequencing. Because whole-genome has everything: your Y-haplogroup (for males), your APOE allele, your BRCA1/2 risk variants, etc. There’s no “we will only look at this gene or region” nonsense.

The Awkward Question

Who is going to pay for sequencing all of these samples? Don’t count on the X Ten centers to do it; remember, they had to shell out $10 million just to buy the thing. Even at a reagents/personnel cost of $1,000 per genome, an X Ten running at full capacity will cost $18 million per year. That’s a lot of cash, in an era when research budgets seem to be flat (if not shrinking). So now you need samples and the funds to sequence them.
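Here is the same math as a small sketch. The throughput, per-genome cost, and instrument price are the figures quoted in this post; the amortization period is purely my assumption, included only to show how the purchase price adds to the effective per-genome cost.

```python
# Running-cost arithmetic for an X Ten installation, using figures from the text.
GENOMES_PER_YEAR = 18_000      # approximate full capacity
COST_PER_GENOME = 1_000        # reagents/personnel, in dollars
INSTRUMENT_COST = 10_000_000   # purchase price, in dollars


def annual_running_cost(genomes=GENOMES_PER_YEAR, per_genome=COST_PER_GENOME):
    """Yearly reagents/personnel cost in dollars."""
    return genomes * per_genome


def effective_cost_per_genome(years_amortized, genomes=GENOMES_PER_YEAR):
    """Per-genome cost once the instrument price is spread over several years.

    years_amortized is a hypothetical planning horizon, not a figure from the post.
    """
    total = INSTRUMENT_COST + years_amortized * annual_running_cost(genomes)
    return total / (years_amortized * genomes)


print(f"Annual running cost: ${annual_running_cost():,}")
print(f"Effective cost over 4 years: ${effective_cost_per_genome(4):,.0f} per genome")
print(f"Same, at half capacity: ${effective_cost_per_genome(4, 9_000):,.0f} per genome")
```

The half-capacity line shows why the X Ten sites need a steady supply of samples: the instrument's share of each genome doubles when throughput is halved.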

It may actually be more difficult to persuade researchers to make the switch to sequencing, because it will still be five times more expensive than running a SNP array.

The Promise Ahead

I know that this post has had a bit of a negative tone, but I felt it necessary to get people thinking about the challenges ahead. Now, perhaps, we should talk about the promise of large-scale whole genome sequencing. At last, we’ll have sequencing studies that aren’t biased towards coding regions or certain genes. Every sequenced genome will harbor over 3 million sequence variants. We can go after non-SNP variation, too: indels and structural variants are far easier to detect by WGS, though SV calling is still a nascent area of bioinformatics.

The wonderful thing about WGS is that it both enables and forces us to look beyond the obvious (e.g. the nonsynonymous variants in known protein-coding genes). We’re headed into the unknown, the dark matter of the genome, whether we like it or not. And that is a good thing.

Next-gen Sequencing for DNA Forensics

NGS forensics

Image Credit: brilliantbias dot com

The criminal justice system in the United States is rarely called an early adopter of new technology. Despite the major impact of forensic DNA testing over the last quarter-century, the tools deployed by most forensics laboratories are rudimentary by modern standards. If you compare what the FBI does with an unknown DNA sample to what 23andMe can do (faster and with lower costs), the differences are quite striking. The FBI can determine whether the DNA matches a reference sample (or an entry in its CODIS database) and that’s about it. In contrast, 23andMe can tell you how much Neanderthal is in your DNA and help you find your third cousins.

There’s encouraging news for the field of applied DNA forensics, as earlier this month the National Institute of Justice awarded an $825,000 grant to the Battelle Memorial Institute to “conduct feasibility and validation tests on a suite of new investigative tools that use next-generation sequencing.” It came to my attention thanks to Julia Karow’s feature article over at In Sequence.

Now seems like an excellent time to discuss the potential for NGS applications as well as some of the significant challenges.

How DNA is Currently Used in Forensics

Codis DNA forensics loci

Codis Core Loci (Wikipedia)

Importantly, nearly all routine DNA testing for forensic purposes involves capillary electrophoresis of short tandem repeats, or STRs. Unlike SNPs, which tend to have two alleles, STRs have numerous alleles, defined by the number of repeats at each locus. Because they’re so polymorphic, STRs are well-suited for finding a “match” between two samples.

Theoretically speaking, about 10-12 carefully chosen STR loci are sufficient to identify an individual. In the U.S., the FBI uses a panel of 13 (plus the AMELX/AMELY loci for sex determination) for its Combined DNA Index System (CODIS) database.
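Why are so few loci enough? A minimal sketch of the reasoning, assuming the loci are inherited independently so that per-locus probabilities multiply: the chance that a random, unrelated person matches a full profile shrinks geometrically with each locus added. The per-locus value of 0.1 below is a made-up illustration, not a real population frequency, and this is not how any particular forensic lab computes its statistics.

```python
from math import prod


def random_match_probability(per_locus_match_probs):
    """Chance that an unrelated person matches at every locus.

    Assumes the loci are independent, so the probabilities multiply.
    """
    return prod(per_locus_match_probs)


# Hypothetical: a 1-in-10 chance of matching at any single STR locus.
PER_LOCUS = 0.1

for n_loci in (5, 10, 13):
    rmp = random_match_probability([PER_LOCUS] * n_loci)
    print(f"{n_loci:>2} loci: about 1 in {1 / rmp:.0e}")
```

Under these toy numbers, ten loci already push the odds of a coincidental match past one in the world's population.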

The current applications for DNA in forensics laboratories are focused on matters of the justice system:

  • Matching a crime scene DNA sample to a reference (e.g. a cheek swab) to implicate a suspect, either directly or indirectly (e.g. using a close relative).
  • Searching for a match among the profiles of criminals in federal databases like the FBI’s CODIS.
  • Comparing samples from two crime scenes to learn if the same suspect was responsible.
  • Identification of human remains or found individuals in missing persons cases.

Notably, although the STR profiles in CODIS are well-suited for matching DNA samples, they’re poorly suited for distinguishing an individual’s ancestry.

How DNA Could be Used in Forensics

DNA sequencing/genotyping technologies and applied human genetics have advanced considerably since the establishment of CODIS in the 1990s. One obvious application, hinted at in the proposed plans of the Battelle/NIJ grant, is to empower the identification of “unknown” DNA samples (from crime scenes or missing persons cases) by building a profile of likely physical characteristics:

  • Ancestry or continent of origin, based on ancestry-informative markers (AIMs). Back in 2006, when I worked in the lab of Raymond E. Miller, we developed a panel of about 25 SNPs that reliably distinguished between individuals of African, Asian, or European origins. With more SNPs one could trace ancestry with much higher resolution.
  • Physical appearance, especially eye or hair color. These are not simple Mendelian traits, but with sufficient markers one could determine the probability of brown versus blonde hair, or blue versus brown eyes. Of course, it’s now possible to change your apparent hair or eye color using dyes and colored contacts, respectively, but let’s not go there.
  • Deep familial matching is also possible, particularly with a sequencing-based assay that will detect rare variants unique to certain pedigrees.
  • Faster, less expensive results may ultimately be what drives the adoption of new technologies. The turnaround time for DNA testing in most situations is slow, and the backlog is substantial.

Challenges of DNA Forensics

Given the obvious utility of new sequencing and genotyping technologies in forensic applications, you might be wondering: Why has it taken so long? Having done some work in this field, and discussed the matter with multiple law enforcement agencies, I can tell you some of the reasons.

Judicial Merit, Acceptance, and Precedent

First, adoption of new technologies is slow in the criminal justice system. To be useful in a criminal investigation, they must establish judicial precedent, which takes a lot of vetting and a lot of time. No one wants to waste money on a technology that will be easily disputed by defense attorneys, right? The governmental agencies that operate most forensics laboratories exist as part of the justice system. They need to prove the merit of something before devoting substantial resources to it.

Established Infrastructure

It is difficult to overstate how important the CODIS database is to forensic laboratories. Any technology that displaces capillary electrophoresis of STRs absolutely must provide CODIS profiles. Remember, CODIS has been around a long time (officially since the federal “DNA Identification Act of 1994”). As of last year, it contained 10.3 million offender profiles, 1.5 million arrestee profiles, and 493,500 forensic profiles. And it had assisted more than 200,000 investigations.

In short, CODIS is not going away. Next-gen sequencing is theoretically capable of generating STR profiles, but short read lengths make this more difficult.

Impure and Degraded Samples

In a research setting, we panic when we have sample contamination or low amounts of genomic DNA. We know it affects things like mapping rate and duplication rate of the resulting reads. Just a heads-up here: the DNA samples often encountered in forensic situations are a lab manager’s nightmare. They’re almost always mixtures of DNA from multiple people. They may be heavily degraded. They might not even be human in origin. Rarely will pure samples with good amounts of DNA come into a forensics laboratory.

This makes many of the desirable applications of forensic DNA sequencing a bit more difficult. Degraded DNA is typically fragmented and thus more difficult to sequence. Even when you can, the coverage might be incomplete. As I mentioned, the FBI uses 13 identification loci in its CODIS database. To add a profile, all 13 must be attempted, and at least 10 must be tested successfully.
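That upload rule is simple enough to write down. Here is a minimal sketch of it as described above (all 13 loci attempted, at least 10 successful). The function name, the data layout, and the placeholder locus names are mine for illustration; this is not the FBI's actual software or data format.

```python
# Sketch of the profile rule described in the text: every core locus must be
# attempted, and at least 10 of the 13 must yield a result.

def eligible_for_codis(results, n_core=13, min_successful=10):
    """results maps each attempted locus to its allele call, or None on failure."""
    attempted = len(results)
    successful = sum(1 for call in results.values() if call is not None)
    return attempted == n_core and successful >= min_successful


# Hypothetical degraded sample: placeholder locus names, 4 of 13 loci failed.
degraded = {f"locus_{i}": (12, 14) for i in range(1, 10)}
degraded.update({f"locus_{i}": None for i in range(10, 14)})

# Hypothetical partial profile: only 2 of 13 loci failed.
partial = {f"locus_{i}": (12, 14) for i in range(1, 12)}
partial.update({f"locus_{i}": None for i in range(12, 14)})

print(eligible_for_codis(degraded))  # 9 successful loci
print(eligible_for_codis(partial))   # 11 successful loci
```

A sequencing assay with patchy coverage on fragmented DNA would need to clear the same bar, which is part of why degraded samples are such a problem.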

Ethics and Privacy Concerns

You know how much I like to talk about the elephant in the room: ethical, legal, and social considerations for expanded DNA testing capabilities by government agencies. Groups such as the ACLU are already concerned with how DNA testing can provide “partial” (familial) matches. Imagine getting a knock at your door, and opening it to find two police detectives.

“Are you Mr. Smith?” one of them asks.

“Uh, yes,” you answer, already nervous from the sight of their badges and guns.

“We’re investigating a homicide,” says the other detective. “DNA evidence from the scene was a partial match to you. We’d like to ask you about all of your second and third cousins.”

This is a slight exaggeration, but with the power of next-gen sequencing and human genetics, it’s still plausible. We can’t live our lives distrusting every single person and organization of authority, but we do need to have open conversations about implications of advanced DNA testing.

I tip my hat to the Battelle Institute for tackling both the hard science and the sticky issues. They have a bumpy road ahead.




Return of Results: Genetics Experts Weigh In

Genetics experts return of results

Image credit: 123 RF

In my last post, I wrote about the return of results from next-gen sequencing, specifically a recent paper in AJHG about secondary findings in ~6500 ESP exomes. Today we’ll delve into another paper in the same issue on the attitudes of genetics professionals on return of incidental findings from whole genome sequencing (WGS) and exome sequencing (ES).

Joon-Ho Yu and colleagues conducted a survey of around 850 genetics professionals to gauge their attitudes toward:

  1. The return of clinical ES/WGS results
  2. The process of returning results
  3. The ACMG recommendations for secondary findings

Responding Genetics Experts

To identify potential respondents, the authors first collected e-mail addresses, professional degrees, and states of residence from three societies: the American Society of Human Genetics (ASHG), the American College of Medical Genetics (ACMG), and the National Society of Genetic Counselors (NSGC). They sent out 9,857 invitation e-mails and had 847 respondents, for a completion rate of about 8.6%.

The majority of those respondents were:

  • Female (58%)
  • White (98%)
  • Non-Hispanic (96%)
  • Residents of the U.S. (81%)
  • In academia (73%)

Various professions were well-represented among respondents, including clinical geneticists (24%), genetic counselors (22%), and human geneticists (19%).

Return of Incidental Findings

Responses to the heady questions from the survey are depicted in Figure 1, which I’ve adapted here:

Return of results survey

Adapted from Figure 1 of Yu et al, AJHG 2014

Overall, genetics experts were very supportive of the idea that some secondary findings should be returned from clinical ES/WGS. The majority of respondents agreed that incidental results should be offered to:

  • Adult patients (85% agreed)
  • Healthy adults (75% agreed)
  • Parents of a child with a medical condition (74% agreed)

Where the Experts Agree

Nearly all experts (88%) supported offering results about childhood-onset conditions to the parents of child patients, and most (62%) would also offer information about the child’s results for adult-onset conditions. The last result in the figure above is an important one: the vast majority of experts (81%) agree that the preferences of a patient or family should guide which results are offered for return. And most (66%) agreed that a web-based tool would suffice to assess those preferences.

Where the Experts Don’t Agree

The experts were divided (~40% agreed/disagreed) on whether only actionable secondary results should be returned, and less than half (44%) thought that giving patients and families the option to choose which results to receive would improve care. Respondents also differed in their opinions on what kind of results to return.

Obligation to return results

Yu et al, Supp. Fig 1 (AJHG 2014)


When asked about the type(s) of conditions that merited return of positive incidental findings, the experts chose:

  • Mendelian disorders (67%)
  • Adverse drug reactions (61%)
  • Carrier status (49%)
  • Complex traits (20%)

And about 25% responded that healthcare providers had no obligation to return secondary results.

How to Return Results

Winning the survey’s least-surprising category was the part where respondents were asked to rank, by order of preference, the manner in which incidental findings should be communicated.

Method                                              1st Choice   2nd Choice   3rd Choice   4th Choice
A face-to-face meeting with a genetic counselor     78.5%        9.5%         7.2%         4.8%
A phone call with a genetic counselor               5.6%         63.0%        26.5%        4.9%
An interactive website with access to counseling    13.6%        23.4%        53.4%        9.5%
A report sent in the mail                           2.3%         4.0%         13.0%        80.8%

Unsurprisingly, most respondents put a face-to-face meeting with a genetic counselor as their first choice. The most popular second choice was a phone call with a genetic counselor. An interactive website with access to genetic counseling by phone or online was a popular third choice (37% ranked it first or second). Everyone hated the idea of sending reports by mail.
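If you want to check those summaries against the table yourself, here is a small sketch. The percentages are copied from the table above; the shortened method labels and helper names are mine.

```python
# Ranking percentages from the table, in order: 1st, 2nd, 3rd, 4th choice.
rankings = {
    "Face-to-face meeting": [78.5, 9.5, 7.2, 4.8],
    "Phone call": [5.6, 63.0, 26.5, 4.9],
    "Interactive website": [13.6, 23.4, 53.4, 9.5],
    "Report by mail": [2.3, 4.0, 13.0, 80.8],
}


def top_two(method):
    """Percent of respondents who ranked this method first or second."""
    return round(sum(rankings[method][:2]), 1)


def most_popular_at(rank):
    """Method chosen most often at a given rank (1 = first choice)."""
    return max(rankings, key=lambda method: rankings[method][rank - 1])


for method in rankings:
    print(f"{method:<22} ranked 1st or 2nd by {top_two(method)}%")

for rank in (1, 2, 3, 4):
    print(f"Most common choice #{rank}: {most_popular_at(rank)}")
```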

The ACMG Gene List

Recently, the ACMG published their recommended list of 57 genes/conditions for which incidental findings should be returned. Their recommendations got a lot of press, and received a vigorous (and mixed) response from the medical and research community. In this survey, 68% of genetics professionals agreed that results from the ACMG list should be reported, regardless of the indication for sequencing.

However, only 29% felt it was the responsibility of the health care professional to decide which results on the minimal list should be returned. And the majority of respondents (70%) disagreed with the notion of returning secondary findings from the ACMG list regardless of the patient/family preferences for getting that information.

Challenges of Returning Results

Next, the genetics professionals were asked for their perspective on the greatest challenge of returning results from clinical ES/WGS.

Challenges of returning results

Yu et al, Supp. Fig 1 (AJHG 2014)

The greatest concern is one that I’ve heard before, particularly in the roundtable discussion on genetic testing last year: health care providers simply don’t have the time and may not have the expertise to return incidental findings. There are also concerns about the effect of returning secondary findings on the patients and families:

Concern results may cause

Yu et al, Supp. Fig 1 (AJHG 2014)

The foremost concern by a long margin was the anxiety and stress that this knowledge might cause the patient. That’s why asking and honoring the patient/family preferences beforehand (i.e. before the sequencing even happens) is so important. There are privacy concerns as well; about a third of respondents worried that recipients of secondary findings might experience discrimination.

Clearly, the decisions about whether to return results, which results to return, and how to do so will be difficult to address. It’s also unclear, at least to me, which group or organization or (dare I say) government body should call the shots. We’ve heard from the genetics professionals, but it’s also important to hear from two other groups: primary care physicians and the general public (i.e. patients and families). These people arguably have the most at stake, so their opinions should carry significant weight.

We should move quickly to collect the information necessary for well-guided decision making, because one thing is clear: Next-gen sequencing will soon be a routine part of clinical care, whether we like it or not.

Yu JH, Harrell TM, Jamal SM, Tabor HK, & Bamshad MJ (2014). Attitudes of genetics professionals toward the return of incidental results from exome and whole-genome sequencing. American journal of human genetics, 95 (1), 77-84 PMID: 24975944