Friday 16 January 2015

MongoDB Gotchas & How To Avoid Them


n.b. this post has been updated as of July 29th, 2014
A lot of people hate on MongoDB. In my opinion they’re misguided - the main reason so many people think like this is a lack of understanding. Everyone should be able to benefit from MongoDB’s power and simplicity, and so as a follow up to David’s article I have outlined some common and not-so-common things that hackers should know about MongoDB.
First though, why should you listen to me? I used to be a consultant specializing in Ops and helping companies (The Guardian, Experian) scale large web applications. As well as co-founding the official MongoDB London User Group, I am a MongoDB Master and have worked on installations from single servers all the way to projects with 30k queries per second and over a terabyte of active data. I have learnt all of the following from experience.

32-bit vs 64-bit

Most modern servers are either running 32-bit or 64-bit operating systems. Most modern hardware supports 64-bit operating systems, which are better because they permit more addressable memory space, i.e. more RAM.
MongoDB ships with two versions - 32-bit and 64-bit. Due to the way MongoDB uses memory mapped files, 32-bit builds can only store around 2GB of data. For standard replica sets MongoDB only has a single process type - mongod. If you intend to store more than 2GB of data you should use a 64-bit build of MongoDB. For sharded setups, you can use 32-bit builds for the mongos.
tl;dr - Just use 64-bit, or understand the limitations of 32-bit

Document size limits

Unlike a Relational Database Management System (RDBMS) which stores data in columns and rows, MongoDB stores data in documents. These documents are BSON, which is a binary format similar to JSON.
Like most other databases, there are limits to what you can store in a document. In older versions of MongoDB, documents were limited to 4MB each. All recent versions support documents up to 16MB in size. This may sound like an annoyance, but 10gen’s opinion on this is that if you are hitting this limit then either your schema design is wrong or you should be using GridFS, which allows arbitrarily sized documents.
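If you are unsure how close a document is to the limit, the shell can tell you. A quick check, using a hypothetical people collection:
Object.bsonsize(db.people.findOne({name: 'Russell'}))   // size of the document in bytes - must stay under the 16MB cap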
Generally, I would suggest avoiding storing large, irregularly updated objects in a database of any kind. Services such as Amazon S3 or Rackspace Cloudfiles are generally a much better option and don’t load your infrastructure unnecessarily.
tl;dr - Keep documents under 16MB each and you’ll be fine!

Write failure

MongoDB allows very fast writes and updates by default. The tradeoff is that you are not explicitly notified of failures. By default most drivers do asynchronous, ‘unsafe’ writes - this means that the driver does not return an error directly, similar to INSERT DELAYED with MySQL. If you want to know whether something succeeded, you have to manually check for errors using getLastError.
For cases where you want an error thrown if something goes wrong, it’s simple in most drivers to enable “safe” queries which are synchronous. This makes MongoDB act in a familiar way to those migrating from a more traditional database.
If you need more performance than a ‘fully safe’ synchronous write, but still want some level of safety, you can ask MongoDB to wait until a journal commit has happened using getLastError with ‘j’. The journal is flushed to disk every 100 milliseconds, rather than 60 seconds as with the main store.
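From the shell this looks roughly like the following; exact syntax varies between drivers:
db.people.insert({name: 'Russell'})
db.runCommand({getLastError: 1})            // returns any error from the last write on this connection
db.runCommand({getLastError: 1, j: true})   // additionally waits until the write is in the journal on disk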
tl;dr - use safe writes or use getLastError if you want to confirm writes

Schemaless does not mean you have no Schema

RDBMSs usually have a pre-defined schema: tables with columns, each with names and a data type. If you want to add an extra column, you have to add a column to the entire table.
MongoDB does away with this. There is no enforced schema per collection or document. This makes rapid development and changes easy.
However, this doesn’t mean you can ignore schema design. Having a properly designed schema will allow you to get the best performance from MongoDB. Read the MongoDB docs, or watch one of the many videos about schema design to get started:
  • Schema Design Basics
  • Schema Design at Scale
  • Schema Design Principles and Practice
tl;dr - design a schema and make it take advantage of MongoDB’s features

Updates only update one document by default

With a traditional RDBMS, updates affect everything they match unless you use a LIMIT clause. However, MongoDB applies the equivalent of a ‘LIMIT 1’ to each update by default. Whilst there is no way to do a ‘LIMIT 5’, you can remove the limit completely by doing the following:
db.people.update({age: {$gt: 30}}, {$set: {past_it: true}}, false, true)
There are similar options in all the official drivers - the option is usually called ‘multi’.
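Shells and drivers from 2.2 onwards also accept an options document instead of the positional booleans; a minimal sketch of the same update:
db.people.update({age: {$gt: 30}}, {$set: {past_it: true}}, {multi: true})   // named option instead of positional flags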
tl;dr - specify multi true to affect multiple documents

Case sensitive queries

Querying using strings may not quite work as expected - this is because MongoDB is by default case-sensitive.
For example, db.people.find({name: 'Russell'}) is different to db.people.find({name: 'russell'}). The ideal solution is to make sure your data is stored in a known case. You can also use regex searches like db.people.find({name: /russell/i}), although these aren’t ideal as they are relatively slow.
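One common pattern is to keep a normalised copy of the field purely for searching; a sketch, where name_lower is a made-up field name:
// store the display value plus a lowercased copy, and index the copy for fast lookups in a known case
db.people.insert({name: 'Russell', name_lower: 'russell'})
db.people.ensureIndex({name_lower: 1})
db.people.find({name_lower: 'russell'})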
tl;dr - queries are case sensitive

Type sensitive fields

When you try and insert data with an incorrect data type into a traditional database, it will generally either error or cast the data to a predefined value. However with MongoDB there is no enforced schema for documents, so MongoDB can’t know you are making a mistake. If you write a string, MongoDB stores it as a string. If you write an integer, it stores it as an integer.
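A quick illustration of how a type mismatch silently returns nothing (collection and field names are illustrative):
db.people.insert({age: "30"})   // the string "30", not the number 30
db.people.find({age: 30})       // matches nothing - 30 and "30" are different values
db.people.find({age: "30"})     // matches the document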
tl;dr - make sure you use the correct type for your data

Locking

When resources are shared between different parts of code sometimes locks are needed to ensure only one thing is happening at once.
Older versions of MongoDB - pre 2.0 - had a global write lock, meaning only one write could happen at once across the entire server. This could result in the database getting bogged down with locking under certain loads. This was improved significantly in 2.0, and again in the current stable 2.2, which introduced database-level locking - a big step forward. The next step, which I expect will be another large improvement, is collection-level locking, which is planned for the next stable version.
Having said this, most applications I’ve seen were limited by the application itself (too few threads, badly designed) rather than MongoDB itself.
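If you suspect locking is your bottleneck, you can get a rough view from the shell; the exact field layout varies between versions:
db.serverStatus().globalLock   // time spent holding and waiting on the global lock
db.currentOp()                 // operations currently running, including any waiting on a lock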
tl;dr - use a current stable to get the best performance

Packages

A lot of people have had issues with out-of-date versions of MongoDB being shipped in the standard repositories of common distributions. The solution is simple: use the official 10gen repositories, which are available for Ubuntu and Debian as well as Fedora and CentOS.
tl;dr - use official packages for the most up to date versions

Using an even number of Replica Set members

Replica Sets are an easy way to add redundancy and read performance to your MongoDB cluster. Data is replicated between all the nodes and one is elected as the primary. If the primary fails, the other nodes will vote between themselves and one will be elected the new primary.
It can be tempting to run with two machines in a replica set; it’s cheaper than three and is a pretty standard way of doing things with RDBMSs.
However, due to the way voting works in MongoDB, you should use an odd number of replica set members. If you use an even number and one node fails, the rest of the set may go read-only, because the remaining machines may not have enough votes to reach a quorum.
If you want to save some money, but still support failover and increased redundancy, you can use arbiters. Arbiters are a special type of replica set member - they do not store any user data (which means they can be on very small servers) but otherwise vote as normal.
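A minimal sketch of a two-data-node set plus an arbiter (the hostnames are placeholders):
rs.initiate({_id: "rs0", members: [
    {_id: 0, host: "db1.example.com:27017"},
    {_id: 1, host: "db2.example.com:27017"}
]})
rs.addArb("arbiter.example.com:27017")   // votes in elections but stores no user data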
tl;dr - only use an odd number of replica set members and be aware that arbiters can reduce the costs of running a redundant setup

No joins

MongoDB does not support joins: if you need to retrieve data from more than one collection you must do more than one query.
If you find yourself doing too many queries, you can generally redesign your schema to reduce the overall number you are doing. Documents in MongoDB can take any format, so you can de-normalize your data easily. Keeping it consistent is down to your application however.
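In practice that means either issuing two queries or embedding the data you read together; a sketch with made-up collection and variable names:
// two queries instead of a join
var post = db.posts.findOne({_id: postId})
var author = db.users.findOne({_id: post.author_id})
// or de-normalise: embed the author details you display alongside the post
db.posts.insert({title: 'MongoDB Gotchas', author: {_id: authorId, name: 'Russell'}})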
tl;dr - no joins, read how to design a schema in this post.

Journaling

MongoDB uses memory mapped files and flushes to disk are done every 60 seconds, which means you can lose at most 60 seconds plus the flush time worth of data.
To reduce the chance of losing data, MongoDB added journaling - since 2.0 it’s been enabled by default. Journaling writes changes to an on-disk journal every 100ms. If the database is shut down unexpectedly, the journal is replayed on startup to make sure the database is in a consistent state. This is the nearest thing in MongoDB to a commit in a more traditional database.
Journaling comes with a slight performance hit - around 5%. For most people the extra safety is well worth the overhead.
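When the journal is enabled, mongod exposes stats about it; a quick check from the shell (field layout varies by version):
db.serverStatus().dur   // journaling stats: commits, journaled MB, time spent committing, etc.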
tl;dr - don’t disable journaling

No authentication by default

MongoDB doesn’t have authentication by default. It’s expected that mongod is running in a trusted network and behind a firewall. However, authentication is fully supported; if you require it, you can enable it really easily.
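Enabling it amounts to creating an admin user and starting mongod with the --auth option; a minimal sketch using the 2.x shell helper (the credentials are placeholders):
db.getSiblingDB("admin").addUser("admin", "a-strong-password")   // then restart mongod with --auth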
tl;dr - secure MongoDB by using a firewall and binding it to the correct interface, or enable authentication

Lost data with Replica Sets

Running a replica set is a great way to make your system more reliable and easier to maintain. An understanding of what happens during a node failure or failover is important.
Replica Sets work by transferring the oplog - a list of the changes made to your database (updates, inserts, removes, etc.) - and replaying it on the other members of the set. If your primary fails and later comes back online, it will roll back to the last common point in the oplog. Any newer data it accepted that was never replicated will be removed from the database and placed in a special folder in your data directory called ‘rollback’ for you to restore manually. If you don’t know about this feature, you may think data has gone missing. Each time you have a failover you should check this folder. Restoring the data manually is really easy with the standard tools that ship with MongoDB.
tl;dr - ‘missing data’ after a failover will be in the rollback directory

Sharding too late

Sharding is a way of splitting data across multiple machines. This is usually done to increase performance when you find a replica set is too slow. MongoDB supports automatic sharding.
MongoDB allows migrating to a sharded setup with very little effort. However, if you leave it too late it can cause headaches. To shard, MongoDB splits the collections you choose into chunks based on a ‘shard key’ and distributes them amongst your shards automatically. Splitting and migrating chunks takes time and resources, and if your servers are already near capacity this could slow them to a standstill right when you need them most.
The solution is simple: use a tool to keep an eye on MongoDB, make a best guess of your capacity (flush time, queue lengths, lock percentages and faults are good gauges) and shard before you get to 80% of your estimated capacity. Example tools include MMS, Munin (+ Mongo plugin) and CloudWatch.
If you know you are going to have to shard from the outset, a nice alternative if you’re using AWS or similar is to start off sharded - but on smaller servers. Stopping and resizing machines is much quicker than migrating thousands of chunks.
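Turning sharding on for a collection is straightforward; a sketch with illustrative database, collection and key names:
sh.enableSharding("mydb")                         // allow this database to be sharded
db.people.ensureIndex({user_id: 1})               // the shard key must be indexed
sh.shardCollection("mydb.people", {user_id: 1})   // distribute chunks by user_id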
tl;dr - shard early to avoid any issues

You cannot update a shard key in a document

For sharded setups, shard keys are what MongoDB uses to work out which shard a particular document should be on.
After you’ve inserted a document you cannot update the shard key. The suggested solution to this is to remove the document and reinsert it - which will allow it to be allocated to the correct shard.
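In shell terms the workaround looks something like this (the names are illustrative, and note it is not atomic):
var doc = db.people.findOne({_id: someId})
db.people.remove({_id: someId})   // delete the old document
doc.user_id = newUserId           // change the shard key value
db.people.insert(doc)             // reinsert so it is allocated to the correct shard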
tl;dr - you can’t update a shard key

You cannot shard a collection over 256GB

Going back to leaving things too late - MongoDB won’t allow you to shard a collection once it has grown bigger than 256GB. This limit used to be a lot lower, and it will eventually be removed completely. There is no workaround other than recompiling MongoDB yourself, so avoid trying to shard collections larger than this.
tl;dr - shard collections before you reach 256GB

Unique indexes and sharding

Unique indexes are enforced per shard rather than globally, so uniqueness cannot be guaranteed across the whole cluster - except for the shard key itself.
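A sketch of the distinction, with illustrative names:
sh.shardCollection("mydb.users", {email: 1}, true)    // third argument asks for a unique shard key index - enforced cluster-wide
db.users.ensureIndex({username: 1}, {unique: true})   // any other unique index is only checked within each shard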
tl;dr - unique indexes are only enforced cluster-wide on the shard key

Choosing the wrong shard key

MongoDB requires you to choose a key to shard your data on. If you choose the wrong one, it’s not a fun process to correct. What counts as the wrong shard key depends on your application, but a common example would be using a timestamp for a news feed. This causes one shard to end up ‘hot’, with data constantly being inserted into it, migrated off it and queried.
The common process for altering the shard key is simple: dump and restore the collection.
MongoDB will support hashing a key for you in the next release (see SERVER-2001), which will make life easier if you need to shard on a key that happens to be sequential.
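For reference, the hashed shard key syntax that arrived with that work in 2.4 looks like this (database and collection names are illustrative):
sh.shardCollection("mydb.events", {_id: "hashed"})   // hash the sequential _id so inserts spread across shards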

Traffic to and from MongoDB is unencrypted

Connections to MongoDB aren’t encrypted by default, which means your data could be intercepted and read by a third party. If you’re running MongoDB on your own non-public network, this is unlikely to happen.
However, if you’re accessing MongoDB over a public connection you may want to encrypt the traffic. The public downloads and distributions of MongoDB do not have SSL support enabled; luckily it’s quite easy to compile your own version, and subscribers to 10gen support get SSL enabled in their build by default. Most of the official drivers also support SSL out of the box, so there should be little hassle there. Check out the docs here.
tl;dr - if you connect publicly, be aware stuff is unencrypted

Transactions

MongoDB only supports single-document atomicity, unlike a traditional database such as MySQL, which allows longer sequences of changes to either completely succeed or fail. This means that, without being creative, it can be hard to model things which have shared state across multiple collections. One way of getting around this is by implementing two-phase commits in your application - however this is not for everyone; it may be better to use more than one datastore if this is required.
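A heavily simplified sketch of the two-phase commit pattern, with illustrative collection names, a placeholder txnId and no failure recovery:
// 1. record the intent in a transactions collection, then mark it pending
db.transactions.insert({_id: txnId, state: "initial", from: "A", to: "B", amount: 100})
db.transactions.update({_id: txnId}, {$set: {state: "pending"}})
// 2. apply the change to each document, tagging it with the transaction id
db.accounts.update({_id: "A"}, {$inc: {balance: -100}, $push: {pendingTransactions: txnId}})
db.accounts.update({_id: "B"}, {$inc: {balance: 100}, $push: {pendingTransactions: txnId}})
// 3. mark the transaction committed and clear the tags
db.transactions.update({_id: txnId}, {$set: {state: "committed"}})
db.accounts.update({_id: "A"}, {$pull: {pendingTransactions: txnId}})
db.accounts.update({_id: "B"}, {$pull: {pendingTransactions: txnId}})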
tl;dr - there is no built in support for transactions over multiple documents

Journal allocation times

MongoDB may report that it is ready to accept connections while it is in fact still allocating the journal. If you have machines provisioned automatically, combined with a slow filesystem or disks, this may be an annoyance. Normally this won’t be an issue - but if it is, you can use the undocumented flag --nopreallocj to disable journal pre-allocation.
tl;dr - if you have slow disks or certain file systems, journal allocation may be slow

NUMA + Linux + MongoDB

“Linux, NUMA and MongoDB tend not to work well together. If you are running MongoDB on numa hardware, we recommend turning it off (running with an interleave memory policy). Problems will manifest in strange ways, such as massive slow downs for periods of time or high system cpu time.”
tl;dr - Disable NUMA

Process Limits in Linux

If you experience segfaults under load with MongoDB, you may find it’s because of low or default open files / process limits. 10gen recommend setting your limits to 4k+, however this may need to be varied depending on your setup. Read up on ulimit and its meaning here.
tl;dr - Permanently increase hard and soft limits for open files / user processes for Mongo on Linux

Why MongoDB is a bad choice for storing our scraped data


MongoDB was used early on at Scrapinghub to store scraped data because it’s convenient. Scraped data is represented as (possibly nested) records which can be serialized to JSON. The schema is not known ahead of time and may change from one job to the next. We need to support browsing, querying and downloading the stored data. This was very easy to implement using MongoDB (easier than the alternatives available a few years ago) and it worked well for some time.
Usage has grown from a simple store for scraped data used on a few projects to the back end of our Scrapy Cloud platform. Now we are experiencing limitations with our current architecture, and rather than continue to work with MongoDB, we have decided to move to a different technology (more in a later blog post). Many customers are surprised to hear that we are moving away from MongoDB; I hope this blog post helps explain why it didn’t work for us.

Locking

We have a large volume of short queries which are mostly writes from web crawls. These rarely cause problems as they are fast to execute and the volumes are quite predictable. However, we have a lower volume of longer running queries (e.g. exporting, filtering, bulk deleting, sorting, etc.) and when a few of these run at the same time we get lock contention. 
Each MongoDB database (or, prior to 2.2, the whole server) has a readers-writer lock. Due to lock contention all the short queries need to wait longer, and the longer running queries get much longer! Short queries take so long that they time out and are retried. Requests from our website (e.g. users browsing data) take so long that all worker threads in our web server get blocked querying MongoDB. Eventually the website and all web crawls stop working!
To address this we:
  • Modified the MongoDB driver to time out operations and retry certain queries with an exponential backoff (a minimal sketch follows this list)
  • Sync data to our new backend storage and run some of the bulk queries there
  • Have many separate MongoDB databases with data partitioned between them
  • Scaled up our servers
  • Delayed implementing (or disabled) features that need to access a lot of fresh data
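A minimal sketch of the retry-with-exponential-backoff idea, written here as mongo shell JavaScript rather than our actual driver code (names are illustrative):
function retryWithBackoff(op, maxAttempts) {
    var delayMs = 100;
    for (var attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            return op();
        } catch (e) {
            if (attempt === maxAttempts) throw e;   // give up after the last attempt
            sleep(delayMs);                         // the shell's sleep(), in milliseconds
            delayMs *= 2;                           // double the wait on each failure
        }
    }
}
retryWithBackoff(function() { return db.items.find({job: jobId}).limit(100).toArray(); }, 5)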

Poor space efficiency

MongoDB does not automatically reclaim disk space used by deleted objects. It will attempt to reuse the space for newly inserted objects, but we often end up with very fragmented data, and due to locking it is not feasible for us to defragment or otherwise reclaim the space without substantial downtime.
Scraped data often compresses well, but unfortunately there is no built in compression in MongoDB. It doesn’t make sense for us to compress data before inserting because the individual records are often small and we need to search the data.
Always storing object field names can be wasteful, particularly when they never change in some collections.
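Space can be reclaimed explicitly, but the commands block, which is exactly the downtime problem described above; a sketch with an illustrative collection name:
db.runCommand({compact: "items"})   // defragments one collection; blocks the database while it runs
db.repairDatabase()                 // rewrites the whole database; needs free disk space and a long window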

Too Many Databases

We run too many databases for MongoDB to comfortably handle. Each database has a minimum size allocation so we have wasted space if the size of the data in that DB is small. If no data is in the disk cache (e.g. after a server restart), then it can take a long time to start MongoDB as it needs to check each database. 

Ordered data

Some data (e.g. crawl logs) needs to be returned in the order it was written. Retrieving data in order requires sorting which is impractical when the number of records gets large.
It is only possible to maintain order in MongoDB if you use capped collections, which are not suitable for crawl output.
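For reference, a capped collection preserves insertion order and can be read back cheaply; a minimal sketch (names and sizes are illustrative):
db.createCollection("crawl_logs", {capped: true, size: 100 * 1024 * 1024})   // fixed size; oldest documents are aged out
db.crawl_logs.find().sort({$natural: 1})                                     // read back in the order written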

Skip + Limit Queries are slow

There is no limit on the number of items written per crawl job and it’s not unusual to see jobs with a few million items. When reading data from the middle of a crawl job, MongoDB needs to walk the index from the beginning to the offset specified, so browsing deep into a job with a lot of data gets slow.
Users may download job data via our API by paginating results. For large jobs (say, over a million items), it’s very slow and some users work around this by issuing multiple queries in parallel, which of course causes high server load and lock contention.
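The usual workaround is range-based paging on an indexed field instead of skip(); a sketch with illustrative names, assuming a suitable index:
db.items.find({job: jobId}).skip(1000000).limit(100)            // walks a million index entries before returning anything
db.items.find({job: jobId, _id: {$gt: lastSeenId}}).limit(100)  // resumes from the last _id returned by the previous page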

Restrictions

There are some odd restrictions, like the allowed characters in object field names (they cannot contain dots or null characters, or begin with a dollar sign). This is unfortunate, since we lack control over the field names we need to store.

Impossible to keep the working set in memory

We have many TB of data per node. The frequently accessed parts are small enough that it should be possible to keep them in memory. The infrequently accessed data is often sequentially scanned crawl data.
MongoDB does not give us much control over where data is placed, so the frequently accessed data (or data that is scanned together) may be spread over a large area. When scanning data only once, there is no way to prevent that data evicting the more frequently accessed data from memory. Once the frequently accessed data is no longer in memory, MongoDB becomes IO bound and lock contention becomes an issue.

Data that should be good ends up bad!

After embracing MongoDB, its use spread to many areas, including as a back end for our Django UI. The data stored here should be clean and structured, but MongoDB makes this difficult. Some limitations that affected us are:
  • No transactions – We often need to update a few collections at a time and in the case of failure (server crash, bug, etc.) only some of this data is updated. Of course this leads to inconsistent state. In some cases we apply a mix of batch jobs to fix the data, or various work-arounds in code. Unfortunately, it has become common to just ignore the problem, thinking it might be rare and unimportant (a philosophy encouraged by MongoDB).
  • Silent failures hide errors - It’s better to detect errors early, and “let it crash”. Instead MongoDB hides problems (e.g. writing to non-existing collection) and encourages very defensive programming (does the collection exist? is there an index on the field I need? Is the data the type I expect? etc.)
  • Safe mode poorly understood – Often developers don’t understand that without safe=True, the data may never get written (e.g. in case of error), or may get written at some later time. We had many problems (such as intermittently failing tests) where developers expected to read back data they had written with safe=False.
  • Lack of a schema or data constraints – Bugs can lead to bad data being inserted in the database and going unnoticed.
  • No Joins – Joins are extremely useful, but with MongoDB you’re forced to either maintain denormalized data without triggers or transactions, or issue many queries loading reference data.

Summary

There is a niche where MongoDB can work well. Many customers tell us that they have positive experiences using MongoDB to store crawl data. So did Scrapinghub for a while, but it’s no longer a good fit for our requirements and we cannot easily work around the problems presented in this post.
I’ll describe the new storage system in future posts, so please follow @scrapinghub if you are interested!
Comment here or in the HackerNews thread.
26 Comments 
  1. I think these are all valid concerns but a few are things which are known about MongoDB and so should factor into the original decision to use it. In particular, no joins, no transactions and no schema are all features and if you need those then you shouldn’t have chosen MongoDB in the first place.
    “Safe mode” changed in November last year and is now defaulted to on, or acknowledged writes. This helps with the problem you described and is a nice default because it gives reasonable performance + safety. You can dial back safety to get performance or vice versa.
    Locking is probably the most often cited issue and it is better in 2.4, but could (and will) be improved. I think the biggest issue for us is disk space reuse. The compact command helps but it does require a maintenance window and some time to complete, which is inconvenient.
  2. I’ve been working with MongoDB for about a year now and encountered many of these problems, and managed to implement workarounds for most; I’ll list a few of these workarounds.
    Poor space efficiency:
    Hard drives are cheap so this was a non-issue
    No transactions:
    we had to implement it ourselves with document based redis lock handled by our software. good thing they are not required everywhere in our use case, far from it. small atomic operations cover most of our cases.
    safe mode:
    I’d also prefer the default to be safe, but well.. this can be solved rather easily by coding some helpers.
    lack of schema/data constraint:
    depending on the language you use some solutions are already there, ex.: mongoose
    that being said even when we use mongoose we have many helpers on top of it so that schema require less code, and respect specific traits.
    but… I do see your point that it might not be worth your time if you need a workaround for many things and you should find a database more fitting to your use case. Good for us, the workarounds we required were negligible overhead.
    • pro pro permalink
      Very nice indeed.
      Who’s going to maintain these ‘workarounds’ in three years if you’ve left this project?
      Especially the transaction “workaround”…
  3. I am glad to hear there are locking improvements in 2.4. Indeed, 2.2 was a decent improvement for us, but didn’t go far enough. The safe mode defaults are also a welcome change.
    The lack of joins & transactions of course did factor into the original decision. My point (which perhaps could be clearer) was that MongoDB ended up being used outside of the area in which we originally intended to use it. There was some reluctance to add another technology when we could get by with what we had for what was (initially) only a small use. Additionally, some limitations were not always well understood by web developers (who were new to mongo and enthusiastic to try it). I see this as our mistake. With hindsight, it’s clear we should have introduced an RDBMS immediately and kept MongoDB for managing the crawl data.
  4. Engineer permalink
    The ONLY reason to choose MongoDB is because you’re being lazy. Seriously, it has “NoSQL Cool” for people who want to write SQL. Any basic survey of the options out there- Cassandra, Riak, Couchbase, BigCouch, Voldemort, etc … gives you multiple actually distributed, scalable databases.
    Global Write Lock? You’re recommending a database with a global write lock? You’re fired.
    Seriously, when did engineering leave this profession? Where are people’s standards?
    MongoDB is only appropriate if your data fits on a single machine…. and if that’s the case there are many other, more established choices such as MySQL and Postgres.
    • Eric permalink
      Yeah, these companies and groups are all just lazy: SAP, Stripe, Sourceforge, Trello, Intuit, Bit.ly, Github, Ebay, LexisNexis, Shutterfly, ADP, Forbes, CERN, etc etc etc. And that’s just scratching the surface: http://www.mongodb.org/about/production-deployments/
    • Super Hans permalink
      MySQL? You’re recommending the shittiest RDBMS (lol) ever invented? You’re fired.
  5. When we chose MongoDB, we used it for “a simple store for scraped data used on a few projects” back in 2010. It was a useful tool we could use on some consulting projects. We deployed it on an AWS small instance (the 2GB limit was fine for a while, we are self-funded and didn’t want to spend much on hardware) along with other services. The early version of the platform was quick to develop and this is partly thanks to MongoDB.
    Once we realized that we had a useful service and as our data size, traffic and requirements grew, we knew MongoDB wasn’t the best fit.
    MySQL or Postgres would not have worked well for storing scraped data as we need to store arbitrary JSON objects and filter them (e.g. find products in the last crawl job with a price < 20, find blog posts crawled by this spider with more than 10 comments, etc.).
    • Tim Williams permalink
      Your query examples suggest to me that Elasticsearch might be a good fit. I have used it at a smaller scale and it worked well for me.
  6. for the love of living things, why would a database be used to store large volumes of scraped data, especially for a shared service like yours? It sounds like you need HDFS or some special-purpose datastore with good compression support. Relational DBs would be bad for this (i.e. slow) in similar ways mongo is.
    I’m not really defending mongo, but it sounds like you picked the wrong tool for the job in any case.
    • AspieDBA permalink
      Actually I can guarantee that RDBMS can handle this. Has been working very well for 14 years and in some cases on ancient 32bit technology and we aren’t talking about trivial traffic either.
      It’s attention to detail and common sense that is needed
  7. Well that’s the thing with these databases, you really have to make sure that you design your collections as transactional boundaries.
  8. Great post about the limitations of MongoDB. Are you planning to keep MongoDB for some applications or are you implementing HBase for everything?
  9. Great post indeed regarding your experience. We’re planning to adopt Mongo as well. However, I’ve studied it long enough for all of our use cases (one of which is similar to yours in terms of storing crawled data) so I wasn’t surprised to see the issues you ultimately ran into. As your follow-up comment suggested, you definitely started using it outside the initial design, for stuff that it wasn’t meant to do.
  10. pavan permalink
    Great post regarding limitations of MongoDB. Nice post. This is useful for all the techies..! Thanks for this post..!
  11. Vlad permalink
    What database do you use now?
  12. Pablo Hoffman permalink
    @Vlad we are using HBase.
  13. Praveen Addepally permalink
    Very nice post… We were thinking of using MongoDB for our new project. But after reading this post now we are re-thinking of our decision. Thanks though for the post.
  14. silviu dicu permalink
    The fact that you discovered that mongodb doesn’t support transactions after you implemented your system … says it all.
    For all sakes, I think your dev/ops guys don’t understand mongodb quite well – see this – Impossible to keep the working set in memory – ok so what database will do that for you? On second thought, actually you can keep the whole working set in memory … but you may need to buy half of aws instances :)
    One thing I can agree with however is the fact that just using mongodb will NOT solve all your problems.
    You need to know very well the data structure you want (documents) as well the access/update patterns.
