I’m doing a research project evaluating Communist party support in the context of the application of Socialism with Chinese Characteristics, relating widespread support for policies to the relevant socialist theory. Anyway, while doing research I stumbled across this paper’s use of K-means clustering to analyze the data, and under that analysis the support for the party, while still high, differs considerably from what the raw surveys initially suggest.
Looking at it, I find some of the justifications they use for describing typologies a little fishy. The questions asked are whether you trust the CPC on a four-point scale, with 1 being no trust at all and 4 being a high amount of trust; the second question is about support for the one-party system on the same scale. They use K-means clustering to break respondents into the four possible typologies, then cluster the two middle groups together under the justification that people can be “ambivalent”. However, this feels like an unnecessary simplification of the clusters in order to present the “ambivalence” as more varied than it is. Just because people might have incoherent views on the issue doesn’t mean they do, and presenting the issue that way feels like it could be “gerrymandering” the data. I’m completely open to my speculations and reservations being off base, this is very far from my major, but I thought I would ask here for some help in understanding it.
You guys are pretty smart sometimes
The part I’m discussing occurs on page 56 where they begin to explain their statistics and methods.
I’m not a statistics person, but wouldn’t using a 4-point scale to create 3 clusters make the middle cluster, which spans 2 of the scale points, overrepresented? I feel like presenting 3 clusters implies that each cluster drew a comparable share of the sample, but I also may just not know how K-means clustering works.
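Here’s a toy sketch of what I mean (my own numbers, not anything from the paper): if you put three cluster centres on a 1–4 answer scale, the middle centre naturally absorbs two of the four possible answers, so even a perfectly uniform population would come out looking 50% “middle”.

```python
# Toy example (my own construction, not the paper's method): three cluster
# centres on a 1-4 answer scale. The middle centre absorbs two of the four
# possible answers, so a uniform population would look 50% "ambivalent".
scale = [1, 2, 3, 4]
centroids = [1.0, 2.5, 4.0]  # hypothetical low / middle / high centres

# assign each answer to its nearest centre (squared distance)
nearest = {p: min(centroids, key=lambda c: (p - c) ** 2) for p in scale}
middle = [p for p, c in nearest.items() if c == 2.5]
print(nearest)  # {1: 1.0, 2: 2.5, 3: 2.5, 4: 4.0}
print(middle)   # [2, 3] -- half the scale lands in the middle cluster
```

So the size of the middle group is partly an artifact of where the boundaries fall on the scale, not evidence that ambivalence is common.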
so they’ve defined ambivalent typologies based on their framework in table 1, and use that to impose 4 clusters onto the data
Based on these considerations, the study sets the number of clusters at four, although all three heuristics, i.e. the elbow method, the silhouette value and the gap statistic, suggest that the respondents form three clusters
so there are really only 3 clusters, but they’ve decided to set k=4 anyway. k-means then just minimizes the variance within each cluster around its mean: each observation gets assigned to whichever cluster mean is “closest” in that sense, but that doesn’t mean four groups is actually the best description of the data.
even after they “merge” the ambivalent classes and set k=3, assigning each observation to the closest cluster mean doesn’t make those clusters the best way to define each class; it just means the within-cluster variance is locally minimized.
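for concreteness, here’s a minimal k-means sketch in plain Python (a toy setup of my own, not the paper’s code or data): if you seed the four centroids at the four “typology” corners of the 4×4 answer grid, the algorithm just carves the grid into quadrants and converges there, whether or not four groups actually describe the respondents.

```python
# Minimal k-means (Lloyd's algorithm) on the 16 possible (trust, support)
# answer pairs. Toy setup of my own -- one respondent per grid point,
# NOT the paper's data. Empty clusters aren't handled in this sketch.
points = [(t, s) for t in range(1, 5) for s in range(1, 5)]

def assign(points, centroids):
    """Assign each point to its nearest centroid (squared Euclidean)."""
    clusters = [[] for _ in centroids]
    for p in points:
        d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
        clusters[d.index(min(d))].append(p)
    return clusters

def update(clusters):
    """Move each centroid to the mean of its assigned points."""
    return [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            for c in clusters]

def kmeans(points, centroids):
    while True:
        new = update(assign(points, centroids))
        if new == centroids:                  # converged: a local minimum
            return centroids, assign(points, centroids)
        centroids = new

# Seed the four centroids at the four "typology" corners of the grid.
corners = [(1.0, 1.0), (1.0, 4.0), (4.0, 1.0), (4.0, 4.0)]
centroids, clusters = kmeans(points, corners)
print(centroids)                    # [(1.5, 1.5), (1.5, 3.5), (3.5, 1.5), (3.5, 3.5)]
print([len(c) for c in clusters])   # [4, 4, 4, 4] -- the grid cut into quadrants
```

the point being: start it with four typology-shaped centroids and you get four typology-shaped clusters out, regardless of what structure the respondents actually have.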
the natopedia article has a good illustration:
https://en.wikipedia.org/wiki/K-means_clustering#/media/File:K-means_convergence.gif
Doesn’t it seem strange to apply this kind of statistical analysis to a four point survey?
the second question being about support for the one party system using the same scale
Ah, the illusion of choice. Does anyone think western opinions of China would improve if they had two competing Communist parties rather than one big CCP?
It does feel like the US/UK fetishization of choice leads to some awfully nasty negative externalities, without ever yielding the kind of popular candidates that these liberal democracies claim to venerate. If Donald Trump and Joe Biden were part of the same singular national party, I honestly think the US might be run better (or, at least, more smoothly) than in its current state. That’s primarily because you wouldn’t have enormous chunks of each candidate’s popular base driven to hysteria after every election cycle. How much time, labor, and material go into creating a boogeyman out of your rival and then spending the next six months to a year engaging in public flagellation over the impending success or failure of the given candidate?
Disclaimer: this is based on skimming. As far as survey analyses go, it’s not horrible. If it were me, I’d be asking why the two ambivalence categories are so different in size if they really do represent a similar intensity. The regression results suggest that people tend to support governments when they think the economy is doing well and corruption is low, an indication that the support question is capturing a vibe more than a person’s undergirding philosophy on the proper form of governance. I think the paper would be a lot stronger if it tried to tease out differences between weak supporters and weak dissenters. There’s some room for follow-up here, and the Asian Barometer data is publicly available for download, so you can have a look at the questionnaire to see if he omitted any useful questions. Or do your own analyses; these things aren’t as hard as PoliSci profs try to make them look.
So let me get this right: they got a bunch of {1,2,3,4}^2 points, applied k-means one (1) time with the standard l_2 metric, found 4 random-ass centroids, published it, and now you’re calling their methodology into question?
edit: The problem with k-means is that it converges to local minima, so depending on the starting point (in this case the 4 ‘typologies’) different runs can give different results. So any typologies found are going to be interpreted with the ideological slant of the authors. Also, a heat map would have been better for representing the data: it’s on a 2d grid.
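to illustrate the local-minima point (again a toy sketch of my own, not the paper’s code): on the same 4×4 answer grid, k=2 started “left/right” and k=2 started “top/bottom” both converge, but to two completely different partitions of equal quality, so the answer you publish genuinely depends on where you started.

```python
# Toy demo (my own, not the paper's code): k-means with squared Euclidean
# distance on the 16 possible (trust, support) pairs. Two different starting
# positions for k=2 converge to two different stable partitions.
points = [(t, s) for t in range(1, 5) for s in range(1, 5)]

def step(points, centroids):
    """One Lloyd iteration: assign to nearest centroid, recompute the means."""
    clusters = [[] for _ in centroids]
    for p in points:
        d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
        clusters[d.index(min(d))].append(p)
    return [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            for c in clusters]

def run(points, centroids):
    while True:
        new = step(points, centroids)
        if new == centroids:      # fixed point: a local minimum
            return centroids
        centroids = new

vertical = run(points, [(1.0, 2.5), (4.0, 2.5)])    # split on the trust axis
horizontal = run(points, [(2.5, 1.0), (2.5, 4.0)])  # split on the support axis
print(vertical)    # [(1.5, 2.5), (3.5, 2.5)]
print(horizontal)  # [(2.5, 1.5), (2.5, 3.5)]
```

both runs terminate (nothing moves anymore), but one says the data splits by trust and the other says it splits by one-party support. with real, asymmetric survey counts the two optima wouldn’t even have the same variance, which is why serious analyses rerun k-means from many random starts.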