
Chinese Samples

Mithra asked:

Almost all the Chinese are now around 50% SE Asian, didn’t see this before is it right.

So I decided to look at the Chinese samples in the Reference I dataset.

I ran Admixture on the whole Reference I dataset for K=10 ancestral populations. The green component is what I call Southeast Asian, blue is Northeast Asian (highest among the Japanese) and violet is Siberian (highest among the Yakut).

Here is the plot for the 106 HapMap Chinese samples from Denver (label: us chinese):

HapMap US Chinese

For the 137 HapMap samples from Beijing, China (label: han chinese):

HapMap Han Chinese

For the 34 HGDP Han samples (label: han):

HGDP Han

For the 10 HGDP Han samples from North China (label: han-nchina):

HGDP Han North China

As you can see, the "Southeast Asian" component decreases from the first group (US Chinese) to the last (Han from North China), which is as expected.
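The same comparison can be made numerically instead of by eye. The sketch below is only an illustration: it assumes the usual Admixture `.Q` layout (one row per individual, one whitespace-separated ancestry fraction per component) and a matching list of population labels. The function names, group labels and numbers are made up for the example and are not results from Reference I.

```python
from collections import defaultdict


def parse_q(text):
    """Parse .Q-style text: one row per individual, K fractions per row."""
    return [[float(x) for x in line.split()]
            for line in text.splitlines() if line.strip()]


def mean_components(q_rows, labels):
    """Average each ancestry fraction over the individuals sharing a label."""
    sums = {}
    counts = defaultdict(int)
    for row, label in zip(q_rows, labels):
        if label not in sums:
            sums[label] = [0.0] * len(row)
        for k, frac in enumerate(row):
            sums[label][k] += frac
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


# Made-up toy values (K=3), NOT taken from the actual Admixture run.
toy_q = """\
0.50 0.40 0.10
0.46 0.44 0.10
0.30 0.60 0.10
0.34 0.56 0.10
"""
toy_labels = ["group-a", "group-a", "group-b", "group-b"]

means = mean_components(parse_q(toy_q), toy_labels)
for label, fracs in means.items():
    print(label, [round(f, 2) for f in fracs])
```

With real data, the labels would come from the sample list in the same order as the rows of the `.Q` file, and the column for a given component has to be identified by checking which population it peaks in.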

I wasn't satisfied with these results, so I decided to run Admixture on the East Asian samples in Reference I separately.

East Asian Admixture K=3

At K=3, the results are about the same as at K=10 for the whole Reference I dataset. The Han all have a significant amount of the blue component, which is highest among the Southeast Asians.

East Asian Admixture K=4

At K=4, we get a Chinese ("East Asian") component. So we have Japanese, Chinese, Yakut and Southeast Asian components. This is what most of you were probably expecting.

Why did the Japanese become the modal population for the Northeast Asian component? I ran a PCA on the East Asian data to see how the different populations looked on a PCA plot. Remember that eigenvector 1 explains 1.49 times the variance of eigenvector 2 and 1.9 times the variance of eigenvector 3. Thus, eigenvector 2 explains 1.28 times the variation explained by eigenvector 3.
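The third ratio follows from the first two by simple division. As a quick check, using only the two ratios quoted above (the variable names are just for illustration):

```python
# Ratios quoted in the text: variance of eigenvector 1 relative to 2 and to 3.
ratio_1_to_2 = 1.49
ratio_1_to_3 = 1.90

# (e1 / e3) / (e1 / e2) = e2 / e3
ratio_2_to_3 = ratio_1_to_3 / ratio_1_to_2
print(round(ratio_2_to_3, 2))

# Relative sizes with eigenvector 1 scaled to 1.0, handy when judging
# how much weight to give each axis of the plots.
relative = [1.0, 1.0 / ratio_1_to_2, 1.0 / ratio_1_to_3]
print([round(v, 2) for v in relative])
```

So the second and third axes carry roughly two thirds and one half of the variance of the first, which is worth keeping in mind when comparing distances across the three plots.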

East Asian PCA eig1 vs eig2

East Asian PCA eig1 vs eig3

East Asian PCA eig2 vs eig3

As you can see, the Yakut are the farthest away, but the Japanese are also fairly well separated from the Chinese populations.

If I didn't have the 141 Japanese samples in my reference dataset, the Northeast Asian component would most likely be centered on the Han, as is the case for Dodecad.

I think this shows that it is not correct to think of the ancestral components inferred by Admixture as pure ancestral populations.


SGVP is the Singapore Genome Variation Project. It sampled the following groups:

Ethnicity            Sample Count    SNP Count
Singapore Chinese    96              1,405,417
Singapore Malay      89              1,402,256
Singapore Indian     83              1,404,699

Singapore Indians are most likely to be of South Indian origin, especially Tamil.

These 268 samples were easy to convert to Plink format