Bypassing Chinese Internet Censorship: How I Built a Censored Microblog Aggregator


The Chinese government is known worldwide for enforcing strict internet censorship. The censorship system, commonly known as the Great Firewall of China, is operated by the Ministry of Public Security and is officially named the Golden Shield Project. It has been in operation since 2003.

International news sites that often carry politically sensitive content, such as the New York Times, and social media sites that do not comply with censorship rules, such as Facebook and Twitter, are usually blocked and unavailable to Chinese users. This is accomplished using a variety of sophisticated methods, such as DNS poisoning, IP blocking, and deep packet inspection.

For Chinese news and social media sites, virtually everything is under the government’s surveillance. In order to be allowed to operate, ISPs and internet content providers in China usually run their own content filtering mechanisms, which block or remove content published by their users, or even delete users’ accounts outright, if the content is deemed illegal under government policy. These companies run their own censorship software on their servers and maintain dedicated teams or departments that manually handle the cases automated software can’t manage. These teams cooperate closely with the local divisions of the Ministry of Public Security, from which they receive new orders and policies.

For web developers in China like me, internet censorship filters out not only our freedom of speech but also valuable professional resources from around the world. In my daily work, I have to bypass the censorship with a VPN to use Gmail, Dropbox, and many other crucial sites. I still remember how awkward it was in 2010, when Google’s services became unstable or inaccessible in China after Google refused to continue complying with censorship rules. This would be unbelievable for developers in other countries.

Censorship on Sina Weibo

Sina Weibo is the biggest microblogging social network in China. Since Twitter does not comply with China’s rules, Weibo does not have to compete with it for users. News spreads more quickly and directly on Weibo than on any other media outlet in China. Members of the younger generations, such as myself, like to use it to share news and discuss public events. But of course, under Chinese internet censorship, many hot or interesting posts are deleted soon after they are posted. Posts about politics and public events are the most likely to be deleted, while entertainment news is the least likely. A 2013 study by computer scientists Jed Crandall and Dan Wallach found that about 12% of Chinese microblogs are deleted every day.

On politically sensitive days like June 4th, an even higher number of microblog posts can be expected to be deleted. On these days, users often cannot even enter certain sensitive words when they try to write a microblog.

What does it look like when a post gets censored? When you refresh a new microblog on the site, you will often see something like this:

This is a censored Chinese microblog where content was removed by the government regulatory offices or the ISP.

This is the equivalent of a retweet, where the original message typically appears in the gray box. The box now reads “Sorry. The microblog has been deleted. Please see…” The original post was a mother’s plea for justice over the kidnapping, rape, and forced prostitution of her 11-year-old daughter in 2013.

2013 was a year in which many political scandals were revealed through the microblog platform. The popularity of Sina Weibo soared during this time. In response, the government got nervous and started to tighten its censorship of the platform.

Before the microblog, young people like me who were interested in politics usually had to use proxy servers or tunneling services to hunt down sensitive news from international websites. Suddenly, we had a relatively open Chinese social network platform. But the government stepped in quickly, and it turned out to be just a flash in the pan. This really infuriated me. I talked with friends, and we were all angry about the strengthening of censorship on the platform. My friends would ask, “Why can’t we do anything about this?” I decided I would try. So I built a website to begin bypassing internet censorship to see what exactly was being blocked or deleted from Sina Weibo.

Technical Discussion

Basically, I needed to set up a server that constantly scanned for blocked or deleted Chinese microblogs and showed them on a new website. I had planned to use a domestic cloud service like Aliyun, but it turned out that the platform imposes many constraints, such as on domain redirecting, and its prices are no lower than those of other cloud services. Of course, my additional concern was that the server itself would be under surveillance if I deployed it domestically. So I ended up buying a server from Linode, located in Japan. I also bought the domain freeweibo.me to begin bypassing the censorship of Sina Weibo.

The following diagram shows the overall architecture of the system: MongoDB, a web server, and a crawler. I chose Node.js for the development environment, as its event-driven, non-blocking I/O model suits network applications well and, personally, I have more experience with it. The web server was developed using the Express.js framework, and the crawler used the Weibo API to capture data. Initially, the crawler was designed to be a separate process, but later I found that bundling it as a module in the web server process was sufficient for the early stage.

This is the architecture of the system that would bypass censorship in China and retrieve microblogs that had been deleted.

The content of a microblog has two major parts of interest. One is the text data and its relevant attributes. The other is the images attached to the post. To save a post, we also want to download the images and save them as files on disk. For blocked or deleted blogs, these images are very important. In China it’s very common to post text as an image, because such content is much more difficult to catch with the automated text-based filtering that internet companies run on their servers.
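As a rough illustration of these two parts, the sketch below builds the record that might be stored for one post and maps each image URL to a file path on disk. The field names (`id`, `userId`, `text`, `createdAt`, `imageUrls`), the `images` directory, and both helper functions are assumptions made for this example; they are not the Weibo API's actual schema or the site's real code, and the download itself is left out.

```javascript
const crypto = require('crypto');
const path = require('path');

// Hypothetical helper: derive a stable local file name for a remote image,
// so the same URL always maps to the same file on disk.
function localImagePath(imageDir, imageUrl) {
  const hash = crypto.createHash('sha1').update(imageUrl).digest('hex');
  // Keep the original extension if there is one, otherwise assume JPEG.
  const ext = path.extname(new URL(imageUrl).pathname) || '.jpg';
  return path.join(imageDir, hash + ext);
}

// Hypothetical helper: turn a fetched post into the document to be stored.
function buildArchiveRecord(post, imageDir) {
  return {
    postId: post.id,
    userId: post.userId,
    text: post.text,
    createdAt: post.createdAt,
    // Each image keeps its source URL and the local file it is saved to.
    images: post.imageUrls.map((url) => ({
      url,
      file: localImagePath(imageDir, url),
    })),
    deleted: false, // flipped later if the post disappears from Weibo
  };
}

const record = buildArchiveRecord(
  {
    id: '1',
    userId: 'u1',
    text: 'hello',
    createdAt: 0,
    imageUrls: ['http://example.com/a/pic.png'],
  },
  'images'
);
console.log(record.images[0].file);
```

Hashing the URL avoids file name collisions between posts and makes a repeated download of the same image overwrite the same file rather than create a duplicate.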

The basic idea of detecting blocked or deleted posts is to constantly scan for new posts from a known list of users, and then recheck the posts’ availability at a later time. A microblog could be deleted or blocked within several minutes or several days. Thus, the crawler consists of two main tasks: the fetch task, to fetch newly posted content, and the check task, to check whether previously posted content has been censored.
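The two tasks can be sketched roughly as follows. This is a minimal illustration, not the site's actual code: the `api` object with its `getLatestPosts` and `postExists` methods is a hypothetical stand-in for the Weibo API, and an in-memory `Map` takes the place of MongoDB.

```javascript
// Hypothetical crawler factory; `api` is an assumed wrapper around the Weibo API.
function createCrawler(api) {
  const posts = new Map(); // postId -> stored record (stand-in for the database)

  // Fetch task: store any post from the watched users that is not yet known.
  async function fetchTask(userIds) {
    for (const userId of userIds) {
      const latest = await api.getLatestPosts(userId);
      for (const post of latest) {
        if (!posts.has(post.id)) {
          posts.set(post.id, { ...post, deleted: false });
        }
      }
    }
  }

  // Check task: re-query every stored post not yet marked as deleted
  // and flag the ones that are no longer available.
  async function checkTask() {
    const newlyDeleted = [];
    for (const record of posts.values()) {
      if (record.deleted) continue;
      const stillThere = await api.postExists(record.id);
      if (!stillThere) {
        record.deleted = true;
        newlyDeleted.push(record);
      }
    }
    return newlyDeleted;
  }

  return { posts, fetchTask, checkTask };
}

// Fake API used only for this example.
function createFakeApi() {
  const available = new Set(['p1', 'p2']);
  return {
    available,
    async getLatestPosts(userId) {
      return [
        { id: 'p1', userId, text: 'first post' },
        { id: 'p2', userId, text: 'second post' },
      ];
    },
    async postExists(postId) {
      return available.has(postId);
    },
  };
}

async function demo() {
  const api = createFakeApi();
  const crawler = createCrawler(api);
  await crawler.fetchTask(['userA']);
  api.available.delete('p2'); // simulate a deletion between two scans
  const deleted = await crawler.checkTask();
  return deleted.map((post) => post.id);
}

demo().then((ids) => console.log(ids)); // [ 'p2' ]
```

In a long-running process, both tasks would be scheduled repeatedly (for example with `setInterval`), with the check task revisiting posts over several days since deletion can happen long after posting.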

At first, I configured the crawler to crawl microblogs from the top 100 well-known users on Weibo. But it turned out that there were almost no deleted blogs being detected each day. The reason is that most of the top users have no interest in political or publicly sensitive topics - they never post or forward these kinds of microblogs. For example, this blogger, who is an actress with more than 10 million followers, is one of the most popular users, but she never posts sensitive blogs.

After some experimentation and thinking, I came up with a technique to adaptively find users who consistently get censored. A social network is interconnected by topic, and users tend to gather in groups by interest. If a user is interested in public or political topics, then they are more likely to repost blogs from other users with similar interests. These reposts provide a good way to identify new users to scan.

For example, say user A is already in the database, and the crawler detects that one blog, which was reposted by user A, is deleted. If user B, the original author of the blog, is not in the database, then the crawler will save user B. Next time, when the crawler rescans new blogs, it will also scan new blogs from user B. Thus, the quantity of scannable users will automatically grow by harnessing this kind of social interest connection.

Chinese internet censorship can be bypassed by leveraging microblog behavior.
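The user A and user B example above can be sketched in a few lines. The data shape is an assumption for illustration: each repost is taken to carry an `original` object with an `authorId` and a `deleted` flag, and the set of watched users stands in for the user collection in the database.

```javascript
// Hypothetical helper: grow the set of watched users from reposts whose
// original post has been detected as deleted.
function discoverUsers(watchedUsers, reposts) {
  const added = [];
  for (const repost of reposts) {
    const original = repost.original;
    // Only a deleted original tells us its author posts sensitive content.
    if (!original || !original.deleted) continue;
    if (!watchedUsers.has(original.authorId)) {
      watchedUsers.add(original.authorId);
      added.push(original.authorId);
    }
  }
  return added;
}

const watched = new Set(['userA']);
const added = discoverUsers(watched, [
  // userA reposted a post by userB that was later deleted: start watching userB.
  { userId: 'userA', original: { id: 'p9', authorId: 'userB', deleted: true } },
  // userA reposted a post by userC that is still online: nothing to learn.
  { userId: 'userA', original: { id: 'p10', authorId: 'userC', deleted: false } },
]);
console.log(added); // [ 'userB' ]
```

On the next scan, the fetch task would include `userB`, whose own reposts may in turn lead to further users.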

After tuning the crawler algorithm to take advantage of this methodology, I only needed to seed several key users who had strong interests in posting sensitive blogs, and the crawler automatically discovered new users to scan. The number of censored blogs detected each day rose steadily. The following is a snapshot of archived deleted blogs in my mailbox.

This is an example of censored Chinese microblogs on the social network.

  • A historic dialogue by Mao Zedong rebuking a local official for not pulling down the ancient city wall of Chengdu.
  • A post about Xu Zhiyong, who is an active rights lawyer. He has helped many underprivileged people and started the New Citizen’s Movement in China. He was sentenced to jail in January, 2014.
  • Criticism of the government’s newspaper People’s Daily
  • Comment on the arrest and trial of Wang Gongquan, a billionaire in China and leader of the New Citizen’s Movement.
  • A reference to the arrest of activists who take part in social movements.

Results

After two weeks of coding and debugging my Chinese microblog bypassing system, I deployed the site to freeweibo.me. However, after several weeks of running, the server stopped detecting new blogs. With some investigation I found two issues. One was that the Weibo platform had changed its original API interface. The other was that the crawler’s API requests were exceeding the rate limit (1000 per minute) due to the increase of blogs and users in the database. So I tuned my code to adopt the new interface and to decrease the number of API requests per minute. The crawler was stable from then on.
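One common way to keep a crawler under such a limit is a sliding-window rate limiter, sketched below. This is an assumed technique for illustration, not necessarily how the crawler was actually tuned; the `RateLimiter` class and its `tryAcquire` method are hypothetical names, and the clock is injectable so the behavior can be shown without waiting.

```javascript
// Sliding-window limiter: allows at most `limit` requests in any `windowMs` period.
class RateLimiter {
  constructor(limit, windowMs, now = Date.now) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.now = now; // injectable clock, defaults to the real time
    this.timestamps = []; // send times of the requests still inside the window
  }

  // Returns 0 if a request may be sent right now (and records it),
  // otherwise the number of milliseconds to wait before trying again.
  tryAcquire() {
    const t = this.now();
    // Drop timestamps that have left the window.
    while (this.timestamps.length > 0 && t - this.timestamps[0] >= this.windowMs) {
      this.timestamps.shift();
    }
    if (this.timestamps.length < this.limit) {
      this.timestamps.push(t);
      return 0;
    }
    return this.timestamps[0] + this.windowMs - t;
  }
}

// Small demonstration with a fake clock: 3 requests per 1000 ms.
let clock = 0;
const limiter = new RateLimiter(3, 1000, () => clock);
console.log(limiter.tryAcquire(), limiter.tryAcquire(), limiter.tryAcquire()); // 0 0 0
console.log(limiter.tryAcquire()); // 1000 (window is full, wait a full second)
clock = 400;
console.log(limiter.tryAcquire()); // 600
clock = 1000;
console.log(limiter.tryAcquire()); // 0 (the oldest requests have expired)

// For the limit mentioned above, leaving some headroom below 1000 per minute
// (the 900 is an arbitrary safety margin chosen for this sketch):
const weiboLimiter = new RateLimiter(900, 60 * 1000);
```

Before each API call, the crawler would call `tryAcquire()` and, if the result is non-zero, delay the call by that many milliseconds.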

I faced a dilemma over whether or not to let many people know about the site. I knew that the more people visited the site, the sooner it would be sniffed out by the government and blocked. So I only shared the site with some of my friends. Initially, there were only about 10 to 20 visits per day. But a month later, the visits hit 80 or more on some days, and I had dozens of email subscribers.

And then, as I had expected, the morning came when I found my site was blocked in China. It had lasted about three months. After that, users had to use a VPN or another tunneling service to reach the site. This is impractical for most Chinese internet users.

However, that same day I was relieved and pleased to find that another site, freeweibo.com, was providing exactly the same service, and was more sophisticated than what I had built. The freeweibo.com project is very resourceful. It is active on social media, and provides different means to access the content, like RSS feeds, email subscription, and mirror sites for domestic users. It even has a mobile app! I don’t know who built the site, but I’m glad we share the same vision.

Conclusion

Based on the circumstances, it was obvious that my site was not very useful anymore, and I closed it several months later.

Despite the outcome, I don’t feel like the project was in vain. On the contrary, it was a marvelous experience, even though it only survived for a few months. It helped me to deeply appreciate the reality in my country.

In China, to run an internet business, you have to be very cautious about censorship, or you will get into trouble sooner or later. There is barely any way for social media sites to be successful unless they comply with the strict censorship and compromise on users’ privacy.


Update

The freeweibo.me source code is now available on GitHub here. As stated above, this source code is not related to the similar website freeweibo.com.

About the author

Xiaolei Liu, China
member since January 2, 2014
Xiaolei is a JavaScript expert and full stack developer focusing on Node.js and AngularJS. He loves programming and enjoys working from home. He highly values the experience of working and building trust with colleagues and clients in a remote capacity.

Comments

Pablo Selín Carrasco
Wonderful article, thanks for the insight into developing something like this. I wonder if you got into any trouble with the authorities for running the crawler site, or did they just stop once they had blocked it?
Guest
Thanks Pablo, I didn't get into any trouble with the authorities. The site hadn't gained much popularity in China by then. For them, it just wasn't worth hunting me down.
佶辉 徐
It is a shame for a country to do that. I'm OK living without Facebook and Twitter, but when Google was blocked, Gist was blocked, Heroku was blocked, code.google was blocked, Dropbox was blocked, DuckDuckGo was blocked... I feel hopeless.
uyghur2014
Useful article, thanks a lot. Can you explain step by step how technical people can build a highly advanced crawler site?
arikira
This also seems like a potentially useful tool for the Chinese government to use to hunt down individuals who often create new "politically sensitive" blogs or posts. Has the government called you with a job offer yet, Xiaolei? :)
Johnny Arabia
I come from a country where activism has always been in the background of our society. In college, I too marched on the streets with tens of thousands of others to raise our voices in protest against tyranny and human rights violations. We don't have as much censorship as we used to, but activism runs healthily for us and that we can at least publicly voice out on social issues reasonably enough. So I totally get where you're coming from, Xiaolei. I hope that you and your countrymen can finally enjoy true freedom someday. If I may make a suggestion, the code that you wrote for your aggregator -- open source it. Release it on github with instructions on how to replicate what you did. Make it easy for others who can't code to have a way to help out in your country's struggle for freedom online. Hope this helps.
liuxiaolei
Thanks for your advice, Johnny, I will open source it soon.
liuxiaolei
Not yet:), arikira, the government actually has more sophisticated tools to do this.
liuxiaolei
uyghur, that's a big topic. This project is actually my first attempt at a crawler site. I think the most important thing for crawling is how to craft an efficient algorithm to fetch the information you are interested in. You need to analyse the context and interconnections in the data, and find the most productive way to crawl using limited resources.
asdf
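Does the truth about China really need something like this from you in order to be seen? You have far too much time on your hands.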
lsiden
Fantastic article! Great work! The authorities in China are incredibly petty, corrupt, paranoid, and short-sighted. The people of China deserve better. Keep chipping away at the wall!