Mongodb course notes

Posted by swapan chakrabarty on August 7, 2023

Mongodb document fundamentals

These are my notes from: https://learn.mongodb.com/courses/m320-mongodb-data-modeling

  • documents displayed in json but stored as bson
  • a document is like a joined row
  • Documents are like a map or an associative array
  • by default, documents in a collection dont have to have the same schema
  • by default, fields in a document dont have to have the same data type

Mongodb anti patterns

  • massive arrays
  • massive number of collections
  • bloated documents
  • unnecessary indexes
  • queries without indexes
  • data that is accessed together but stored in different collections

Mongodb patterns

  • keep frequently documents in RAM
  • keep indexes in RAM
  • solid state drives or hard disk
  • infrequently used data can use hard disks
  • You either model for simplicity or performance

Mongodb datatypes

json datatypes

  • string
  • object
  • array
  • boolean
  • null

bson datatypes

  • dates
  • numbers
  • object id(primary key)

image

every doc requires _id field which is the primary key mongodb will auto add it, if its not included in the insert

_id is a required field in every MongoDB document, but it’s not a data type. If an inserted document does not include an_id field, MongoDB will automatically create the_id field and populate it with a value of type ObjectId.

Mongodb to RDBMS translation

image

Relational vs mongodb data modeling

example relational data model:

image

Modeling in Mongodb:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
{
    "_id": "5ad88534e3632e1a35a58d00",
    "customerID": 12345,
    "name": {
        "first": "John",
        "last": "Doe"
    },
    "address": [
        {
            "location": "work",
            "address": {
                "street": "16 Hatfields",
                "city": "London",
                "postal_code": "SE1 8DJ"
            },
            "country": "United Kingdom",
            "geo": {
                "type": "Point",
                "coord": [
                    51.5065752,
                    -0.109081
                ]
            }
        }
    ],
    "email": "john.doe@acme.com",`
    "phone": [
        {
            "location": "work",
            "number": "+44-1234567890"
        }
    ],
    "dob": "1977-04-01T05:00:00Z",
    "interests": [
        "devops",
        "data science"
    ],
    "annualSpend": 1292815.75
}

Mongo db data relationships

data that is accessed together should be stored together

  • one-to-one
  • one-to-many
  • Many-to-many

Ways to model relationships

  • Embedding
  • Referencing

Embedding and Referencing example: actor document (cast) is embedded within movie document image

NOT candidates for Embedding:

A document is frequently read, but contains data that is rarely accessed. Embedding this data only increases the in-memory requirements

One part of a document is frequently updated and constantly growing in size, while the remainder of the document is relatively static

The combined document size would exceed MongoDB’s 16MB document limit

Mongodb indexing

By default, MongoDB creates an index on the document’s _id primary key field All user-defined indexes are secondary indexes. Indexes can be created on any part of the JSON document – including inside sub-documents and array elements

Index types:

  • unique
  • compound
  • array
  • TTL Indexes

In some cases data should expire automatically. Time to Live (TTL) indexes allow the user to specify a period of time after which the database will automatically delete the data. A common use of TTL indexes is applications that maintain a rolling window of history (e.g., most recent 100 days) for user actions such as clickstreams

Wildcard Indexes For workloads with many ad-hoc query patterns or that handle highly polymorphic document structures, wildcard indexes give you a lot of extra flexibility. You can define a filter that automatically indexes all matching fields, subdocuments, and arrays in a collection. As with any index, they also need to be stored and maintained, so will add overhead to the database. If your application’s query patterns are known in advance, then you should use more selective indexes on the specific fields accessed by the queries

Partial Indexes Partial Indexes can be viewed as a more flexible evolution of Sparse Indexes, where the DBA can specify an expression that will be checked to determine whether a document should be included in a particular index. e.g. for an “orders” collection, an index on state and delivery company might only be needed for active orders and so the index could be made conditional on {orderState: “active”} – thereby reducing the impact to memory, storage, and write performance while still optimizing searches over the active orders.

Hash Indexes Hash indexes compute a hash of the value of a field and index the hashed value. The primary use of this index is to enable hash-based sharding, a simple and uniform distribution of documents across

Text Search Indexes MongoDB provides a specialized index for text search that uses advanced, language-specific linguistic rules for stemming, tokenization and stop words shards

Optimizing Performance

  • MongoDB’s explain() method
  • MongoDB Compass GUI
  • Mongodb query profiler
  • mongodb ops manager MongoDB Atlas and Ops Manager eliminates this effort with the Performance Advisor which monitors queries that took more than 100ms to execute, and automatically suggests new indexes to improve performance.

MongoDB Aggregation Pipeline

MongoDB provides the Aggregation Pipeline natively within the database, which delivers similar functionality to the GROUP BY, JOIN, Materialized View, and related SQL features.

MongoDB Charts

MongoDB Charts is the quickest and easiest way create and share visualizations of your MongoDB data in real time,

Mongodb transactions

Multi-Record ACID Transactional Model Because documents can bring together related data that would otherwise be modelled across separate parent-child tables in a tabular schema, MongoDB’s atomic single-document operations provide transaction semantics that meet the data integrity needs of the majority of applications.

MongoDB 4.0 added support for multi-document ACID transactions in the 4.0 release and extended them in 4.2 with Distributed Transactions that operate across scaled-out, sharded clusters.

Mongodb Consistency

image

MongoDB handles the complexity of maintaining multiple copies of data via replication. Read and write operations are directed to the primary replica by default for strong consistency, but users can choose to read from secondary replicas for reduced network latency, especially when users are geographically dispersed, or for isolating operational and analytical workloads

When reading data from any cluster member, users cantune MongoDB’s consistency model to match application requirements

Foreign Keys

add validation to embedded sub documents which would act like foreign keys in a relational model

mongodb schema validation

is very tunable, can be applied to sub documents only

Example json schema validation:

  • Each document must contain a field named lineItems
  • The document may optionally contain other fields
  • lineItems must be an array where each element:
  • Must contain a title (string), price (number no smallerthan 0)
  • May optionally contain a boolean named purchased
  • Must contain no further fields
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
db.createCollection("orders",
    {
        validator: {
            $jsonSchema: {
                properties: {
                    lineItems: {
                        type: "array",
                        items: {
                            properties: {
                                title: {
                                    type: "string"
                                },
                                price: {
                                    type: "number",
                                    minimum: 0.0
                                },
                                purchased: { type: "boolean" }
                            },
                            required: [
                                "_id",
                                "title",
                                "price"
                            ],
                            additionalProperties: false
                        }
                    }
                },
                required: ["lineItems"]
            }
        }
    }
)

On-Demand Materialized Views Using the $merge stage, outputs from aggregation pipeline queries can now be merged with existing stored result sets whenever you run the pipeline, enabling you to create materialized views that are refreshed on-demand. Rather than a full stop replacement of the existing collection’s content, you can increment and enrich views of your result sets as new data is processed by the aggregation pipeline. With MongoDB’s Materialized Views you have the flexibility to output results to sharded collections – enabling you to scale-out your views as data volumes grow. You can also write the output to collections in different databases further isolating operational and analytical workloads from one another. As the materialized views are stored in a regular MongoDB collection, you can apply indexes to each view, enabling you to optimize query access patterns, and run deeper analysis against them using MongoDB Charts, or the BI and Apache Spark connectors.

Data modeling for Mongodb

  1. identify workloads how your use your data image image
  2. document relationships
  3. Embedding vs Reference

How to identify workloads

  1. apply design patterns

Modeling for simplicity vs performance

simplicity

  • fewer collections
  • bigger embedded documents
  • objects map well to documents
  • fewer disc IO

Performance

  • Sharding
  • supports very fast reads or writes

Performance criteria latency operations per second more collections

Identify workloads

image

qualify and quantify operations

image

image

Examples of types of data and their durability

image

image

image

Relationships and Cardinality

a mother and her children embed as children should be small amount

a twitter user and his/er followers reference as a user may have millions of followers

Better entity diagram:

  • movie has between 1 and 1000 actors
  • has 2 sets of financials
  • has between 0 and 100 reviews
  • with 30 reviews being the norm image

One to Many

embed on the side of the most queried collection

One to many using references

  • array of references
  • cascade deletes must be handled by application
  • use referencing when the associated documents are not always needed with the most queried documents

image

one to many exmaple

here when i delete or update a store, no need to handle the zip codes since they’re kept in the zips collection image

image

Join operations like SQL joins with $ operator

image

image

image

image

image

image

One to Zillions

Document this relationship in the code as it requires special handling image

image

image

image

image

Mongodb Patterns

image

embed address with orders as the shipping address does not changed for shipped and delivered orders so duplication ok image

ok to embed actors with movies as its unlikely to change so duplication ok image

image

how to maintain this sum? maybe data can be stale and some application process updates the gross sum image

image

image foreign keys and cascading deletes not supported

image

image

image

Wildcard indexes in mongodb 4.2 may replace the need for indexes on the third section. image

Note that wildcard indexes may remove the need to use the pattern below image

image

image What if i wanted to know the releases between two dates?

image image

image

image

Extended reference pattern image

image

Here we query orders than we do orders per customer image

image

image

image

Subset pattern image

image

image

image

image

image

Lab data attribute pattern

The purpose of this lab is complete the Pattern Attribute file in the Pattern-Attribute editor tab and then to run thevalidate_m320 in the tab mongodb to ensure this is the correct solution for the lab. The lab covers theAttribute pattern and converting our existing schema to apply this pattern.

User Story

The museum we work at has grown from a local attraction to one that is seen as saving very popular items. For this reason, other museums in the World have started exchanging pieces of art with our museum. Our database was tracking if our pieces are on display and where they are in the museum. To track the pieces we started exchanging with other museum, we added an array called events, in which we created an entry for each date a piece was loaned and the museum it was loaned to.

This schema has a problem where each time we start a exchange with a new museum, we must then add a new index. For example, when we started working with The Prado in Madrid, we added this index: { “events.prado” : 1 } Please review the Problem Schema and the Problem Document editor tabs to help understand the old schema and how we need to change the schema to apply the pattern.

Tasks

To address this issue, you will need to change the schema to:

use a single index on all event dates. transform the field that tracks the date when a piece was acquired, date_acquisition, so that it is also indexed with the values above.

Problem schema

1
2
3
4
5
6
7
8
9
10
11
12
13
{
  "_id": "<objectId>",
  "title": "<string>",
  "artist": "<string>",
  "date_acquisition": "<date>",
  "location": "<string>",
  "on_display": "<bool>",
  "in_house": "<bool>",
  "events": [{
    "moma": "<date>",
    "louvres": "<date>"
  }]
}

Problem document

1
2
3
4
5
6
7
8
9
10
11
12
13
{
  "_id": ObjectId("5c5348f5be09bedd4f196f18"),
  "title": "Cookies in the sky",
  "artist": "Michelle Vinci",
  "date_acquisition": ISODate("2017-12-25T00:00:00.000Z"),
  "location": "Blue Room, 20A",
  "on_display": false,
  "in_house": false,
  "events": [{
    "moma": ISODate("2019-01-31T00:00:00.000Z"),
    "louvres": ISODate("2020-01-01T00:00:00.000Z")
  }]
}

Answer schema

1
2
3
4
5
6
7
8
9
10
11
12
{
  "_id": "<objectId>",
  "title": "<string>",
  "artist": "<string>",
  "location": "<string>",
  "on_display": "<bool>",
  "in_house": "<bool>",
  "events": [{
    "k": "<string>",
    "v": "<date>"
  }]
}

Subset pattern lab

User Story

You are the lead developer for an online organic recycled clothing store.

Due to the growing number of environmentally-conscious consumers, our store’s inventory has increased exponentially. We now also have an increasingly large pool of makers and suppliers.

We recently found that our shopping app is getting slower due to the fact that the frequently-used documents can no longer all fit in RAM. This is happening largely due to having all product reviews, questions, and specs stored in the same document, which grows in size as reviews and ratings keep coming in.

To resolve this issue, we want to reduce the amount of data immediately available to the user in the app and only load additional data when the user asks for it.

Problem schema

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
{
  "_id": "<objectId>",
  "item_code": "<string>",
  "name": "<string>",
  "maker_brand": "<string>",
  "price": "<decimal>",
  "description": "<string>",
  "materials": ["<string>"],
  "country": "<string>",
  "image": "<string>",
  "available_sizes": {
    "mens": ["<string>"],
    "womens": ["<string>"]
  },
  "package_weight_kg": "<decimal>",
  "average_rating": "<decimal>",
  "reviews": [{
    "author": "<string>",
    "text": "<string>",
    "rating": "<int>"
  }],
  "questions": [{
    "author": "<string>",
    "text": "<string>",
    "likes": "<int>"
  }],
  "stock_amount": "<int>",
  "maker_address": {
    "building_number": "<string>",
    "street_name": "<string>",
    "city": "<string>",
    "country": "<string>",
    "postal_code": "<string>"
  },
  "makers": ["<string>"]
}

Problem document

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
{
  "_id": ObjectId("5c9be463f752ec6b191c3c7e"),
  "item_code": "AS45OPD",
  "name": "Recycled Kicks",
  "maker_brand": "Shoes From The Gutter",
  "price": 100.00,
  "description": "These amazing Kicks are made from recycled plastics and
  fabrics.They come in a variety of sizes and are completely unisex in design.
  If your feet don't like them within the first 30 days, we'll return your
  money no questions asked.
  ",
  "materials": [
    "recycled cotton",
    "recycled plastic",
    "recycled food waste",
  ],
  "country": "Russia",
  "image": "https:///www.shoesfromthegutter.com/kicks/AS45OPD.img",
  "available_sizes": {
      "mens": ["5", "6", "8", "8W", "10", "10W", "11", "11W", "12", "12W"],
      "womens": ["5", "6", "7", "8", "9", "10", "11", "12"]
  },
  "package_weight_kg": 2.00,
  "average_rating": 4.8,
  "reviews": [{
      "author": "i_love_kicks",
      "text": "best shoes ever! comfortable, awesome colors and design!",
      "rating": 5
    },
    {
      "author": "i_know_everything",
      "text": "These shoes are no good because I ordered the wrong size.",
      "rating": 1
    },
    "..."
  ],
  "questions": [{
      "author": "i_love_kicks",
      "text": "Do you guys make baby shoes?",
      "likes": 1223
    },
    {
      "author": "i_know_everything",
      "text": "Why do you make shoes out of garbage?",
      "likes": 0
    },
    "..."
  ],
  "stock_amount": 10000,
  "maker_address": {
    "building_number": 7,
    "street_name": "Turku",
    "city": "Saint-Petersburg",
    "country": "RU",
    "postal_code": 172091
  },
  "makers": ["Ilya Muromets", "Alyosha Popovich", "Ivan Grozniy", "Chelovek Molekula"],
}

Answer document

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
{
  "_id": ObjectId("5c9be463f752ec6b191c3c7e"),
  "item_code": "AS45OPD",
  "name": "Recycled Kicks",
  "maker_brand": "Shoes From The Gutter",
  "price": 100.00,
  "description": "These amazing Kicks are made from recycled plastics and
  fabrics.They come in a variety of sizes and are completely unisex in design.
  If your feet don't like them within the first 30 days, we'll return your
  money no questions asked.
  ",
  "materials": [
    "recycled cotton",
    "recycled plastic",
    "recycled food waste",
  ],
  "country": "Russia",
  "image": "https:///www.shoesfromthegutter.com/kicks/AS45OPD.img",
  "available_sizes": {
      "mens": ["5", "6", "8", "8W", "10", "10W", "11", "11W", "12", "12W"],
      "womens": ["5", "6", "7", "8", "9", "10", "11", "12"]
  },
  "package_weight_kg": 2.00,
  "average_rating": 4.8,
  "top_five_reviews": [{
      "author": "i_love_kicks",
      "text": "best shoes ever! comfortable, awesome colors and design!",
      "rating": 5
    },
    {
      "author": "i_know_everything",
      "text": "These shoes are no good because I ordered the wrong size.",
      "rating": 1
    },
    "..."
  ],
  "top_five_questions": [{
      "author": "i_love_kicks",
      "text": "Do you guys make baby shoes?",
      "likes": 1223
    },
    {
      "author": "i_want_to_know_everything",
      "text": "How are these shoes made?",
      "likes": 1120
    },
    "..."
  ],
  "stock_amount": 10000,
}

Answer schema

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
{
  "_id": "<objectId>",
  "item_code": "<string>",
  "name": "<string>",
  "maker_brand": "<string>",
  "price": "<decimal>",
  "description": "<string>",
  "materials": ["<string>"],
  "country": "<string>",
  "image": "<string>",
  "available_sizes": {
    "mens": ["<string>"],
    "womens": ["<string>"]
  },
  "package_weight_kg": "<decimal>",
  "average_rating": "<decimal>",
  "top_five_reviews": [{
    "author": "<string>",
    "text": "<string>",
    "rating": "<int>"
  }],
  "top_five_questions": [{
    "author": "<string>",
    "text": "<string>",
    "likes":"<int>"
  }],
  "stock_amount": "<int>"
}