Chinese Samples

Mithra asked:

Almost all the Chinese are now around 50% SE Asian. I didn’t see this before; is it right?

So I decided to look at the Chinese samples in the Reference I dataset.

I ran Admixture on the whole Reference I dataset for K=10 ancestral populations. The green component is what I call Southeast Asian, blue is Northeast Asian (highest among the Japanese) and violet is Siberian (highest among the Yakut).

Here is the plot for the 106 HapMap Chinese samples from Denver (label: us chinese):

HapMap US Chinese

For the 137 HapMap samples from Beijing, China (label: han chinese):

HapMap Han Chinese

For the 34 HGDP Han samples (label: han):


For the 10 HGDP Han samples from North China (label: han-nchina):

HGDP Han North China

As you can see, the "Southeast Asian" component goes down from the top group to the bottom one, which is as expected.

I wasn't satisfied with these results, so I decided to run Admixture on the East Asian samples in Reference I separately.

East Asian Admixture K=3

At K=3, the results are about the same as at K=10 for the whole Reference I dataset. The Han all have a significant amount of the blue component, which is highest among the Southeast Asians.

East Asian Admixture K=4

At K=4, we get a Chinese ("East Asian") component. So we have Japanese, Chinese, Yakut and Southeast Asian components. This is what most of you were probably expecting.

Why did the Japanese become the modal population for the Northeast Asian component? I ran a PCA on the East Asian data to see how the different populations looked on a PCA plot. Remember that eigenvector 1 explains 1.49 times the variance of eigenvector 2 and 1.9 times the variance of eigenvector 3. Thus, eigenvector 2 explains 1.28 times the variation explained by eigenvector 3.
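The last ratio follows from the first two by simple division. Here is a quick check that uses only the two ratios quoted above (the variable names are just for illustration):

```python
# Ratios of explained variance quoted in the post.
eig1_over_eig2 = 1.49  # eigenvector 1 relative to eigenvector 2
eig1_over_eig3 = 1.9   # eigenvector 1 relative to eigenvector 3

# eig2/eig3 = (eig1/eig3) / (eig1/eig2)
eig2_over_eig3 = eig1_over_eig3 / eig1_over_eig2

print(round(eig2_over_eig3, 2))
```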

East Asian PCA eig1 vs eig2

East Asian PCA eig1 vs eig3

East Asian PCA eig2 vs eig3

As you can see, the Yakut are the farthest away, but the Japanese are also fairly well separated from the Chinese populations.

If I didn't have the 141 Japanese samples in my reference dataset, the Northeast Asian component would most likely be centered on the Han, which is the case for Dodecad.

I think this shows that it is not correct to think of the ancestral components inferred by Admixture as pure ancestral populations.


  1. Paul Ó Duḃṫaiġ

    Great blog post. My wife is Filipina, so it's great to see more posts about genetic admixture in Asia. Interestingly enough, apart from the usual "West Asian" that most Irish have, I also got 1.5% "South Asian" admixture according to the Dodecad project.

    Anyways keep up the good work! 😀

  2. What is meant by "Northeast Asian" and "Siberian"? Why do Japanese have virtually no Siberian admixture, while N. Chinese have more? Even Malays and Cambodians have more Siberian admixture??? Or is that a different shade of purple I'm not seeing correctly?

    Also in the eigenvector 3 vs eigenvector 4 graph, the Japanese are close to the Malays. In fact they are the closest group towards the Malay/Cambodian cluster. What does this mean? I always thought that the Japanese looked more Southeast Asian than Northeast Asian.

    • Northeast Asian and Siberian here are just loose descriptions for the ancestral components. Basically, the Northeast Asian component is highest among the Japanese and the Siberian component is highest among the Yakut.

      Northern Chinese should be expected to have some Mongolian and/or Siberian admixture due to geography.

      While the Japanese might be close to the Cambodians in one PCA plot, they are far from them in the others. Imagine a 4-dimensional space with the four eigenvectors as the axes. In that space, the Japanese are far from the Malays and Cambodians.
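      To make that concrete, here is a toy sketch. The coordinates are made up for illustration and are not values from the PCA in the post, and plain Euclidean distance is an assumption. The two points nearly coincide on eigenvectors 3 and 4 but are far apart once all four axes are taken into account.

```python
import math

# Made-up coordinates on four eigenvectors, for illustration only;
# these are NOT values from the PCA in the post.
pop_a = (0.10, 0.40, 0.02, 0.03)
pop_b = (-0.30, -0.20, 0.03, 0.02)


def distance(p, q, axes):
    """Euclidean distance between p and q restricted to the given axes."""
    return math.sqrt(sum((p[i] - q[i]) ** 2 for i in axes))


# Distance as seen in a single 2-D plot (eigenvector 3 vs eigenvector 4).
d_plot = distance(pop_a, pop_b, axes=(2, 3))

# Distance in the full 4-dimensional space.
d_full = distance(pop_a, pop_b, axes=(0, 1, 2, 3))

print(d_plot, d_full)
```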

  3. Japanese isolated and/but closer to Han Chinese