
Friday, 16 January 2015

Why MongoDB is a bad choice for storing our scraped data


MongoDB was used early on at Scrapinghub to store scraped data because it’s convenient. Scraped data is represented as (possibly nested) records which can be serialized to JSON. The schema is not known ahead of time and may change from one job to the next. We need to support browsing, querying and downloading the stored data. This was very easy to implement using MongoDB (easier than the alternatives available a few years ago) and it worked well for some time.
Usage has grown from a simple store for scraped data used on a few projects to the back end of our Scrapy Cloud platform. Now we are experiencing limitations with our current architecture and rather than continue to work with MongoDB, we have decided to move to a different technology (more in a later blog post). Many customers are surprised to hear that we are moving away from MongoDB; I hope this blog post helps explain why it didn't work for us.

Locking

We have a large volume of short queries which are mostly writes from web crawls. These rarely cause problems as they are fast to execute and the volumes are quite predictable. However, we have a lower volume of longer running queries (e.g. exporting, filtering, bulk deleting, sorting, etc.) and when a few of these run at the same time we get lock contention. 
MongoDB has a readers-writer lock per database (per server prior to 2.2). Due to lock contention, all the short queries have to wait longer and the longer-running queries get much longer! Short queries take so long that they time out and are retried. Requests from our website (e.g. users browsing data) take so long that all the worker threads in our web server end up blocked querying MongoDB. Eventually the website and all web crawls stop working!
To address this we:
  • Modified the MongoDB driver to time out operations and retry certain queries with an exponential backoff (see the sketch after this list)
  • Sync data to our new backend storage and run some of the bulk queries there
  • Have many separate MongoDB databases with data partitioned between them
  • Scaled up our servers
  • Delayed implementing (or disabled) features that need to access a lot of fresh data
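As an illustration of the first point, here is a minimal sketch of the timeout-and-retry idea (not our actual driver patch; the collection name, query and timeouts are made up):

async function withRetry(queryFn, attempts = 5, baseDelayMs = 100) {
  // Retry a query function, waiting exponentially longer between attempts.
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await queryFn();
    } catch (err) {
      if (attempt === attempts - 1) throw err;   // out of retries, give up
      const delay = baseDelayMs * 2 ** attempt;  // 100ms, 200ms, 400ms, ...
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
}

// Hypothetical usage with the node.js driver: maxTimeMS caps server-side
// execution so a lock-contended query fails fast and gets retried.
const items = await withRetry(() =>
  db.collection('items').find({ job_id: 1234 }).maxTimeMS(2000).toArray()
);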

Poor space efficiency

MongoDB does not automatically reclaim disk space used by deleted objects. It will attempt to reuse that space for newly inserted objects, but we often end up with heavily fragmented data, and due to locking it is not feasible to reclaim the space or defragment without substantial downtime.
Scraped data often compresses well, but unfortunately there is no built-in compression in MongoDB. Compressing records before inserting doesn't make sense for us because the individual records are often small and we need to search the data.
Always storing object field names can be wasteful, particularly when they never change in some collections.
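To make that overhead concrete (a hypothetical collection, not our actual schema):

// Every document stores its key names in full, so with small values the
// repeated keys can account for a large share of the on-disk size.
db.products.insert({ product_title: "A", normalized_price_usd: 1.0 });
db.products.insert({ product_title: "B", normalized_price_usd: 2.0 });
// ...millions more documents, each repeating the same long key names.

A common workaround is to map long field names to short ones in application code, at the cost of readability.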

Too Many Databases

We run too many databases for MongoDB to comfortably handle. Each database has a minimum size allocation, so we waste space when a database holds only a small amount of data. If no data is in the disk cache (e.g. after a server restart), it can take a long time to start MongoDB, as it needs to check each database.

Ordered data

Some data (e.g. crawl logs) needs to be returned in the order it was written. Retrieving data in order requires sorting, which is impractical when the number of records gets large.
It is only possible to maintain order in MongoDB if you use capped collections, which are not suitable for crawl output.
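For reference, this is what a capped collection looks like in the shell (the collection name and sizes are made up): it preserves insertion order, but it has a fixed maximum size and documents cannot be grown or individually deleted, which rules it out for crawl output.

// Create a fixed-size capped collection; insertion order is preserved.
db.createCollection("crawl_logs", { capped: true, size: 1048576, max: 10000 });
db.crawl_logs.insert({ line: "first" });
db.crawl_logs.insert({ line: "second" });
// For capped collections, natural order is insertion order.
db.crawl_logs.find().sort({ $natural: 1 });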

Skip + Limit Queries are slow

There is no limit on the number of items written per crawl job and it's not unusual to see jobs with a few million items. When reading data from the middle of a crawl job, MongoDB needs to walk the index from the beginning to the specified offset. Browsing deep into a job with a lot of data gets slow.
Users may download job data via our API by paginating results. For large jobs (say, over a million items), it’s very slow and some users work around this by issuing multiple queries in parallel, which of course causes high server load and lock contention.
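To make the pattern concrete, paginating through a job boils down to queries like these (hypothetical collection and page size); the deeper the offset, the more index entries MongoDB walks before returning anything:

// Early pages are fast: the index walk starts near the beginning.
db.items.find({ job_id: 1234 }).skip(0).limit(1000);
// Deep pages are slow: MongoDB still walks the first 2,000,000 entries
// before it can return the next 1,000.
db.items.find({ job_id: 1234 }).skip(2000000).limit(1000);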

Restrictions

There are some odd restrictions, like the allowed characters in object field names. This is unfortunate, since we lack control over the field names we need to store.
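The classic example is that field names may not contain a dot or start with a dollar sign (newer server versions have relaxed some of these rules), so a scraped record like the following hypothetical one is rejected outright:

// Both inserts fail: '.' in a field name and a leading '$' are not allowed.
db.pages.insert({ "price.usd": 19.99 });
db.pages.insert({ "$ref_count": 3 });
// Since the field names come from external pages, they have to be escaped
// or renamed before storage.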

Impossible to keep the working set in memory

We have many TB of data per node. The frequently accessed parts are small enough that it should be possible to keep them in memory. The infrequently accessed data is often sequentially scanned crawl data.
MongoDB does not give us much control over where data is placed, so the frequently accessed data (or data that is scanned together) may be spread over a large area. When scanning data only once, there is no way to prevent that data from evicting the more frequently accessed data from memory. Once the frequently accessed data is no longer in memory, MongoDB becomes IO-bound and lock contention becomes an issue.

Data that should be good, ends up bad!

After embracing MongoDB, its use spread to many areas, including as a back end for our Django UI. The data stored here should be clean and structured, but MongoDB makes this difficult. Some limitations that affected us are:
  • No transactions – We often need to update a few collections at a time and in the case of failure (server crash, bug, etc.) only some of this data is updated. Of course this leads to inconsistent state. In some cases we apply a mix of batch jobs to fix the data, or various work-arounds in code. Unfortunately, it has become common to just ignore the problem, thinking it might be rare and unimportant (a philosophy encouraged by MongoDB).
  • Silent failures hide errors – It’s better to detect errors early and “let it crash”. Instead, MongoDB hides problems (e.g. writing to a non-existent collection) and encourages very defensive programming (does the collection exist? is there an index on the field I need? is the data the type I expect? etc.)
  • Safe mode poorly understood – Often developers don’t understand that without safe=True, the data may never get written (e.g. in case of error), or may get written at some later time. We had many problems (such as intermittently failing tests) where developers expected to read back data they had written with safe=False; see the sketch after this list.
  • Lack of a schema or data constraints – Bugs can lead to bad data being inserted in the database and going unnoticed.
  • No Joins – Joins are extremely useful, but with MongoDB you’re forced to either maintain denormalized data without triggers or transactions, or issue many queries loading reference data.
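To illustrate the safe-mode point, here is a rough sketch using the shell's write-concern option (the collection is hypothetical; w: 0 and w: 1 stand in for the older safe=False/safe=True driver flag):

// Unacknowledged write (the old safe=False behaviour): the call returns
// immediately, errors are dropped, and a read issued right after it may
// not see the document yet.
db.jobs.insert({ _id: 1, state: "running" }, { writeConcern: { w: 0 } });

// Acknowledged write (safe=True / w: 1): the server confirms the write,
// and errors such as a duplicate _id are reported back to the caller.
db.jobs.insert({ _id: 1, state: "running" }, { writeConcern: { w: 1 } });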

Summary

There is a niche where MongoDB can work well. Many customers tell us that they have positive experiences using MongoDB to store crawl data. So did Scrapinghub for a while, but it’s no longer a good fit for our requirements and we cannot easily work around the problems presented in this post.
I’ll describe the new storage system in future posts, so please follow @scrapinghub if you are interested!
Comment here or in the HackerNews thread.
26 Comments 
  1. I think these are all valid concerns but a few are things which are known about MongoDB and so should factor into the original decision to use it. In particular, no joins, no transactions and no schema are all features and if you need those then you shouldn’t have chosen MongoDB in the first place.
    “Safe mode” changed in November last year and is now defaulted to on, or acknowledged writes. This helps with the problem you described and is a nice default because it gives reasonable performance + safety. You can dial back safety to get performance or vice versa.
    Locking is probably the most often cited issue and it is better in 2.4, but could (and will) be improved. I think the biggest issue for us is disk space reuse. The compact command helps but it does require a maintenance window and some time to complete, which is inconvenient.
  2. I’ve been working with MongoDB for about a year now and encountered many of these problems, and managed to implement workarounds for most. I’ll list a few of these workarounds.
    Poor space efficiency:
    Hard drives are cheap so this was a non-issue
    No transactions:
    We had to implement it ourselves with a document-based Redis lock handled by our software. Good thing they are not required everywhere in our use case, far from it. Small atomic operations cover most of our cases.
    Safe mode:
    I’d also prefer the default to be safe, but well… this can be solved rather easily by coding some helpers.
    Lack of schema/data constraints:
    Depending on the language you use, some solutions are already there, e.g. Mongoose.
    That being said, even when we use Mongoose we have many helpers on top of it so that schemas require less code and respect specific traits.
    But… I do see your point that it might not be worth your time if you need a workaround for so many things, and you should find a database more fitting to your use case. Good for us, the workarounds we required were negligible overhead.
    • pro pro permalink
      Very nice indeed.
      Who’s going to maintain these ‘workarounds’ in three years if you’ve left this project?
      Especially the transaction “workaround”…
  3. I am glad to hear there are locking improvements in 2.4. Indeed, 2.2 was a decent improvement for us, but didn’t go far enough. The safe mode defaults are also a welcome change.
    The lack of joins & transactions of course did factor into the original decision. My point (which perhaps could be clearer) was that MongoDB ended up being used outside of the area in which we originally intended to use it. There was some reluctance to add another technology when we could get by with what we had for what was (initially) only a small use. Additionally, some limitations were not always well understood by web developers (who were new to mongo and enthusiastic to try it). I see this as our mistake. With hindsight, it’s clear we should have introduced an RDBMS immediately and kept MongoDB for managing the crawl data.
  4. Engineer permalink
    The ONLY reason to choose MongoDB is because you’re being lazy. Seriously, it has “NoSQL Cool” for people who want to write SQL. Any basic survey of the options out there- Cassandra, Riak, Couchbase, BigCouch, Voldemort, etc … gives you multiple actually distributed, scalable databases.
    Global Write Lock? You’re recommending a database with a global write lock? You’re fired.
    Seriously, when did engineering leave this profession? Where are people’s standards?
    MongoDB is only appropriate if your data fits on a single machine…. and if that’s the case there are many other, more established choices such as MySQL and Postgres.
    • Eric permalink
      Yeah, these companies and groups are all just lazy: SAP, Stripe, Sourceforge, Trello, Intuit, Bit.ly, Github, Ebay, LexisNexis, Shutterfly, ADP, Forbes, CERN, etc etc etc. And that’s just scratching the surface: http://www.mongodb.org/about/production-deployments/
    • Super Hans permalink
      MySQL? You’re recommending the shittiest RDBMS (lol) ever invented? You’re fired.
  5. When we chose MongoDB, we used it for “a simple store for scraped data used on a few projects” back in 2010. It was a useful tool we could use on some consulting projects. We deployed it on an AWS small instance (the 2GB limit was fine for a while, we are self-funded and didn’t want to spend much on hardware) along with other services. The early version of the platform was quick to develop and this is partly thanks to MongoDB.
    Once we realized that we had a useful service and as our data size, traffic and requirements grew, we knew MongoDB wasn’t the best fit.
    MySQL or Postgres would not have worked well for storing scraped data as we need to store arbitrary JSON objects and filter them (e.g. find products in the last crawl job with a price < 20, find blog posts crawled by this spider with more than 10 comments, etc.).
    • Tim Williams permalink
      Your query examples suggest to me that Elasticsearch might be a good fit. I have used it at a smaller scale and it worked well for me.
  6. For the love of living things, why would a database be used to store large volumes of scraped data, especially for a shared service like yours? It sounds like you need HDFS or some special-purpose datastore with good compression support. Relational DBs would be bad for this (i.e. slow) in similar ways to Mongo.
    I’m not really defending mongo, but it sounds like you picked the wrong tool for the job in any case.
    • AspieDBA permalink
      Actually I can guarantee that RDBMS can handle this. Has been working very well for 14 years and in some cases on ancient 32bit technology and we aren’t talking about trivial traffic either.
      It’s attention to detail and common sense that is needed
  7. Well that’s the thing with these databases, you really have to make sure that you design your collections as transactional boundaries.
  8. Great post about the limitations of MongoDB. Are you planning to keep MongoDB for some applications or are you implementing HBase for everything?
  9. Great post indeed regarding your experience. We’re planning to adopt Mongo as well. However, I’ve studied it long enough for all of our use cases (one of which is similar to yours in terms of storing crawled data) so I wasn’t surprised to see the issues you ultimately ran into. As your follow-up comment suggested, you definitely started using it outside the initial design, for stuff it wasn’t meant to do.
  10. pavan permalink
    Great post regarding the limitations of MongoDB. This is useful for all the techies! Thanks for this post!
  11. Vlad permalink
    What database do you use now?
  12. Pablo Hoffman permalink
    @Vlad we are using HBase.
  13. Praveen Addepally permalink
    Very nice post… We were thinking of using MongoDB for our new project. But after reading this post now we are re-thinking of our decision. Thanks though for the post.
  14. silviu dicu permalink
    The fact that you discovered mongodb doesn’t support transactions only after you implemented your system… says it all.
    For all it’s worth, I think your dev/ops guys don’t understand mongodb quite well – see this – “Impossible to keep the working set in memory” – OK, so what database will do that for you? On second thought, you actually can keep the whole working set in memory… but you may need to buy half of AWS’s instances :)
    One thing I can agree with, however, is that just using mongodb will NOT solve all your problems.
    You need to know very well the data structure you want (documents) as well the access/update patterns.

Top 5 syntactic weirdnesses to be aware of in MongoDB

Rage posts about MongoDB are quite popular these days. Most of them are about poor performance on specific data sets, reliability and sharding issues. Some of those blog posts might be right, others are just saying that the most popular NoSQL solution didn't fit their needs.
This article is not one of those. While most of those posts focus on the operations side, benchmarks and performance characteristics, I want to talk a little bit about MongoDB's query interfaces. That's right – programming interfaces, specifically the node.js native driver, but these are nearly identical across the different platform drivers and the mongo shell.
Disclaimer: I try hard not to hate on MongoDB. In fact, I work with MongoDB every work day as part of my full-time job. I also take part in the development of Minimongo, a pure-JavaScript clone of the MongoDB API for working with in-memory caches. There is no reason for me to mock Mongo other than to warn everyone about its sharp edges. Most of these gotchas were found by David Glasser. This article assumes you are familiar with MongoDB's API.

1. Keys order in a hash object

Let's say you want to store a simple object literal:
> db.books.insert({ title: "Woe from Wit", meta: { author: "A. Griboyedov", year: 1823 } });
Great! Now we have a book record. Let's say later we would want to find all books published in 1823, written by this author ("A. Griboyedov"). It is unlikely to return more than one result but at least it should return the "Woe from Wit" book as we just inserted it, right?
> db.books.find({ meta: { year: 1823, author: "A. Griboyedov" } });
< No results returned
What happened? Didn't we just insert a book with such meta-data? Let's try flipping the order of keys in the meta object:
> db.books.find({ meta: { author: "A. Griboyedov", year: 1823 } });
< { _id: ..., title: "Woe from Wit", meta: { ... } }
Here it is!
The gotcha: the order of keys matters in MongoDB, i.e. { a: 1, b: 2 } does not match { b: 2, a: 1 }.
Why does it happen: MongoDB uses a binary data format called BSON. In BSON, the order of keys always matters. Note that in JSON an object is an unordered set of key/value pairs.
What about JavaScript? ECMA-262 leaves it undefined. In some (usually older) browsers the order of pairs is not preserved, meaning it can be anything. Thankfully, most modern browsers' JavaScript engines preserve the order (sometimes even in arrays), so we can actually control it from node.js code.
Read more about it at John Resig's blog.
The answer to this is to either always specify pairs in the canonical form (keys are sorted lexicographically) or just to be consistent across your code base.
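One way to enforce the canonical form from application code is a small helper (a sketch, not part of any driver) that rebuilds objects with sorted keys before handing them to the driver:

// Rebuild an object (recursively) with keys in lexicographic order so it
// serializes to BSON with a canonical key order. Array elements are left
// untouched for brevity.
function canonicalize(value) {
  if (value === null || typeof value !== "object" || Array.isArray(value)) {
    return value;
  }
  const sorted = {};
  for (const key of Object.keys(value).sort()) {
    sorted[key] = canonicalize(value[key]);
  }
  return sorted;
}

// Both queries now serialize identically, regardless of key order in code:
db.books.find(canonicalize({ meta: { year: 1823, author: "A. Griboyedov" } }));
db.books.find(canonicalize({ meta: { author: "A. Griboyedov", year: 1823 } }));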
Another workaround would be to use a different selector, specifying certain key-paths rather than comparing to an object literal:
> db.books.find({ 'meta.year': 1823, 'meta.author': 'A. Griboyedov' });
It would work in this particular case but note that the meaning of this selector is different.
The gotcha: this behavior can be dangerous whenever you want to build a multi-key index.
> db.books.ensureIndex({ title: 1, 'meta.year': -1 });
In such a command the priority of title is higher than the priority of the meta.year field. This is important to the way MongoDB will lay out your data: read more in the docs.

2. undefined, null and undefined

Does anyone remember the times when the behavior of undefined and null, and the relation between them, was confusing? In the JavaScript world they are two different values which are not the same in a strict comparison (undefined !== null) but are equal in a non-strict comparison (undefined == null). Some people are very careful with them, others use them interchangeably. But the point is: you have two different but similar values in JavaScript.
MongoDB brings it to the next level. The BSON spec defines undefined as "deprecated".
The node.js native driver for MongoDB doesn't implement it at all.
In the current version (2.4.8), null and undefined are treated as the same value.
> db.things.insert({ a: null, b: 1 });
> db.things.insert({ b: 2 }); // the 'a' is undefined implicitly
> db.things.find({ a: null });
< { a: null, b: 1 }
< { b: 2 }
I am not sure about the actual implementation; it looks like undefined is just converted to null by the node driver but is rejected in the mongo shell.
In the following code we will get the same result printed twice: all 3 objects.
// from node.js code with mongo/node-native-driver
db.things.insert({ a: null, b: 1 });
db.things.insert({ b: 2 });
db.things.insert({ a: undefined, b: 3 });
console.log(db.things.find({ a: null }).toArray());      // all 3 documents
console.log(db.things.find({ a: undefined }).toArray()); // all 3 documents
// (synchronous style for brevity; the real driver call is asynchronous)
In the mongo shell, however, you can only query with null, and we get all three objects as well.
// from mongo-shell
> db.things.find({a: undefined});
< error: { "$err" : "can't have undefined in a query expression", "code" : 13629 }
> db.things.find({a: null});
< { "a" : null, "b" : 1, "_id" : "wMWNPm7zrYXTNJpiA" }
< { "b" : 2, "_id" : "RjrYvmZF5EukhpuAY" }
< { "a" : null, "b" : 3, "_id" : "kethQ2khbyfFjJ7Sa" }
We can see that mongo/node-native-driver converted the explicit undefined to null but left the implicit one as is (which is expected, really).
The cool stuff happens when we insert an explicit undefined from mongo-shell:
// from mongo-shell
> db.things.insert({ a: undefined, b: 4 });
> db.things.find({ a: null })
< { "a" : null, "b" : 1, "_id" : "wMWNPm7zrYXTNJpiA" }
< { "b" : 2, "_id" : "RjrYvmZF5EukhpuAY" }
< { "a" : null, "b" : 3, "_id" : "kethQ2khbyfFjJ7Sa" }
We get the same three values and no new object with b=4. Shouldn't undefined match null? Let's look at the new object:
> db.things.find({ b: 4 });
< { "_id" : ObjectId("52ca134f3e47d3d91146f2b5"), "a" : null, "b" : 4 }
It is still there: the a field holds something that looks like null but doesn't match the null in our selector.
The gotcha: there are more than two values that look like null in MongoDB: null, undefined, and an undefined inserted from the mongo shell, which displays as null in the shell but in reality matches the deprecated BSON undefined (type number six). The last one doesn't match null in selectors; the first two match both undefined and null. The absence of a value also matches both.
Read the original GitHub issue.

3. Soft limits, hard limits and no limits

Let's say you have a feed of items and you allow the user to specify the number of items to return. You would return the result of a query looking like this:
db.items.find({ ... }).limit(N);
Here N is supplied by the user. Of course we want to be careful and restrict the user to at most 50 items, otherwise anyone on the Internet would be able to overload our application server and the database simply by supplying a very large N:
function getItems (N) {
  if (N > 50)
    N = 50;
  return db.items.find({}).sort({ year: 1 }).limit(N);
}
Looks like reasonable code running server-side in your node.js app.
The gotcha: if the user supplies 0 (zero) as the number of items they want to get, MongoDB takes it as "give me everything".
It is well documented but not obvious right away: zero means "no limit" to MongoDB. My guess is some code just treats all falsy values the same way: undefined, null, 0, absence of value – everything means "no limit".
That's OK, we can treat 0 as a special case:
function getItems (N) {
  if (N > 50 || !N) // check if N is falsy ("no limit")
    N = 50;
  return db.items.find({}).sort({ year: 1 }).limit(N);
}
Looks good? But what happens if the user supplies a negative number? Is it even possible? What could it possibly mean?
In reality something like db.items.find().limit(-1000000000000) can return a bazillion items. It is hard to find documentation about it, but several months ago I saw a description of this behavior in the node.js driver's docs; it talked about "soft" and "hard" limits. I have no idea what that means.
So the final version of our server-side method would look like this:
function getItems (N) {
  if (N < 0) N = -N;
  if (N > 50 || !N) // check if N is falsy ("no limit")
    N = 50;
  return db.items.find({}).sort({ year: 1 }).limit(N);
}
The gotcha: the limit can be negative. In the broad sense it means the same as a positive limit, but the negative one is "soft".

4. Special treatment for arrays

A lot of people don't know this "feature" but arrays are treated specially.
> db.c.insert({ a: [{x: 2}, {x: 3}], _id: "aaa"})

> db.c.find({'a.x': { $gt: 1 }})
< { "_id" : "aaa", "a" : [  {  "x" : 2 },  {  "x" : 3 } ] }

> db.c.find({'a.x': { $gt: 2 }})
< { "_id" : "aaa", "a" : [  {  "x" : 2 },  {  "x" : 3 } ] }

> db.c.find({'a.x': { $gt: 3 }})
< Nothing found
So whenever there is an array in a document, the selector "branches" to every element, and this acts like "if any of those match, then the whole document matches".
Notably, it doesn't work for nested arrays:
> db.x.insert({ _id: "bbb", b: [ [{x: 0}, {x: -1}], {x: 1} ] })

> db.x.find({ 'b.x': 1 })
< { "_id" : "bbb", "b" : [  [  {  "x" : 0 },  {  "x" : -1 } ],  {  "x" : 1 } ] }

> db.x.find({ 'b.x': 0 })
< Nothing found

> db.x.find({ 'b.x': -1 })
< Nothing found
The same feature applies to field projections:
> db.z.insert({a:[[{b:1,c:2},{b:2,c:4}],{b:3,c:5},[{b:4, c:9}]]})
> db.z.find({}, {'a.b': 1})
< { "_id" : ObjectId("52ca24073e47d3d91146f2b7"), "a" : [  [  {  "b" : 1 },  {  "b" : 2 } ],  {  "b" : 3 },  [  {  "b" : 4 } ] ] }
If we play a bit more, combining this feature with numeric keys in selectors, the behavior becomes harder and harder to predict:
> db.z.insert({a: [[{x: "00"}, {x: "01"}], [{x: "10"}, {x: "11"}]], _id: "zzz"})
> db.z.find({'a.x': '00'})
< Nothing found
> db.z.find({'a.x': '01'})
< Nothing found
> db.z.find({'a.x': '10'})
< Nothing found
> db.z.find({'a.x': '11'})
< Nothing found

> db.z.find({'a.0.0.x': '00'})
< { "_id" : "zzz", "a" : [     [   {   "x" : "00" },   {   "x" : "01" } ],     [   {   "x" : "10" },   {   "x" : "11" } ] ] }

> db.z.find({'a.0.0.x': '01'})
< Nothing found

> db.z.find({'a.0.x': '00'})
< { "_id" : "zzz", "a" : [     [   {   "x" : "00" },   {   "x" : "01" } ],     [   {   "x" : "10" },   {   "x" : "11" } ] ] }

> db.z.find({'a.0.x': '01'})
< { "_id" : "zzz", "a" : [     [   {   "x" : "00" },   {   "x" : "01" } ],     [   {   "x" : "10" },   {   "x" : "11" } ] ] }

> db.z.find({'a.0.x': '10'})
< Nothing found
> db.z.find({'a.0.x': '11'})
< Nothing found
> db.z.find({'a.1.x': '00'})
< Nothing found
> db.z.find({'a.1.x': '01'})
< Nothing found

> db.z.find({'a.1.x': '10'})
< { "_id" : "zzz", "a" : [     [   {   "x" : "00" },   {   "x" : "01" } ],     [   {   "x" : "10" },   {   "x" : "11" } ] ] }

> db.z.find({'a.1.x': '11'})
< { "_id" : "zzz", "a" : [ [ { "x" : "00" }, { "x" : "01" } ], [ { "x" : "10" }, { "x" : "11" } ] ] }
And then it becomes simply inconsistent. The difference between the previous example and the next one is just the inner value: above it is an object, below it is a number. That is enough for the behavior to change:
> db.p.insert({a: [0], _id: "xxx"})

> db.p.find({'a': 0})
< { "_id" : "xxx", "a" : [  0 ] }

> db.q.insert({a: [[0]], _id: "yyy"})

> db.q.find({a: 0})
< Nothing found

> db.q.find({'a.0': 0})
< Nothing found

> db.q.find({'a.0.0': 0})
< { "_id" : "yyy", "a" : [  [  0 ] ] }
The gotcha: avoid arrays, nested arrays, and other one-to-many values in documents that you query with selectors written with the usual one-to-one intent. The combination with numeric keys (like { 'a.0.x': Y }, meaning the field x of the first element of field a must be Y) can become very confusing, as the result depends on your data.

5. $near geo-location operator

This one is simple. You have a collection of documents with a location field that represents a geo-location. The trick is that there are two different types of location MongoDB can index, and each type has a slightly different API and slightly different behavior.
The first one looks like this:
db.c.find({
  location: {
    $near: [12.3, 32.1],
    $maxDistance: 777
  }
});
The second one looks like this:
db.c.find({
  location: {
    $near: {
      $geometry: {
        type: "Point",
        coordinates: [ 12.3, 32.1 ]
      },
      $maxDistance: 777
    }
  }
});
The gotcha: the syntax of a geo-query is slightly different depending on the index type. $maxDistance is a sibling element of $near in the case of legacy coordinate pairs, and a child of $near in the case of GeoJSON.
But there is more! Sometimes you can get the same point twice in the result set! To understand this we need to recall the previous gotcha about nested arrays. Consider this code:
> db.c.insert({ location: [[1, 2], [1, 0]] }); // inserting an array of two points
> db.c.ensureIndex({ location: "2d" });
> db.c.find({ location: { $near: [0, 0], $maxDistance: 500 } });
< { "_id" : ObjectId("52ca30ec3e47d3d91146f2b8"), "location" : [  [  1,  2 ],  [  1,  0 ] ] }
< { "_id" : ObjectId("52ca30ec3e47d3d91146f2b8"), "location" : [  [  1,  2 ],  [  1,  0 ] ] }
The same document is returned twice, as both points from the array match the selector.

All these gotchas remind me of the days when I first started coding in JavaScript. There are corner cases, some of which work inconsistently across browsers; there are features you never want to use, and places where you want to be extra careful. All of those are well known in JavaScript land, but not so well in MongoDB land.
Almost every weird behavior listed here was found in the process of simulating MongoDB in the project called Minimongo, mostly by David Glasser.
This article will be updated as new weirdnesses come to mind.

Update of 1 April 2014: I talked about some of these issues and some new gotchas at the SF Meteor Devshop; the recording of the talk, "Don't get bitten by Mungos (or Mongos)", is below.
