
Fixing Notion's 1-Hour Expiring Image Problem

Use Cloudflare Workers and R2 to build a semi-realtime cache of your files and bypass Notion's 1-hour file expiration. These techniques apply to any file type and aren't limited to Notion; they work with any CMS.

2024-04-14


Understanding The Notion Problem

A growing number of developers choose Notion for hosting lightweight content due to its real-time syncing, platform compatibility, block-based note-taking, and feature-rich free tier. Notion even exposes an API that lets developers access their content and integrate it anywhere they like.
However, a significant hurdle for developers relying on Notion as a content management system (CMS) is its 1-hour file expiration. When you fetch content from Notion, it arrives in a “block” format that contains the information from a page query, and any embedded images, videos, and other files are delivered via a URL:
{
  "object": "block",
  "id": "72aec0d6-59c2-48ca-9a55-f8d000fdc56e",
  "parent": { ... },
  "created_time": "2024-03-02T04:41:00.000Z",
  "last_edited_time": "2024-03-09T09:21:00.000Z",
  "created_by": { ... },
  "last_edited_by": { ... },
  "has_children": false,
  "archived": false,
  "type": "image",
  "image": {
    "caption": [ ... ],
    "type": "file",
    "file": {
      "url": "https://prod-files-secure.s3.us-west-2.amazonaws.com/3175668c-d64e-4b9b-87fc-5fdfe186dc33/b7acf74e-4442-47bf-82ec-6890f97e714b/mountains.avif?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIAT73L2G45HZZMZUHI%2F20240323%2Fus-west-2%2Fs3%2Faws4_request&X-Amz-Date=20240323T013017Z&X-Amz-Expires=3600&X-Amz-Signature=3f0eedc342fc8ad12fc46b1f4eb5829b7f2ba78ba19fc7eb748107af61e1b9ce&X-Amz-SignedHeaders=host&x-id=GetObject",
      "expiry_time": "2024-03-23T02:30:17.991Z"
    }
  }
}

Our focus centers on the following field:
"expiry_time": "2024-03-23T02:30:17.991Z"
Notion tells us that the content at that URL will expire at expiry_time (one hour after the query was made, in UTC). Note that every time a fetch request is made, Notion returns a new URL and a new expiration time.
What’s the issue if we get a new URL and expiration? A few points:
  1. Caching layers. Browsers often key on the parameters of a fetch request and return a cached response, and if your browser doesn't, your web server likely will. You cannot guarantee that a new fetch request to the Notion API will actually execute if your browser or web server doesn't know it shouldn't cache the response. This can be controlled with directives like no-store in cache headers or with cache rules on your web server (see the sketch after this list), but you still have to initiate a new fetch request every time the hour elapses. That practice can quickly exhaust your rate limits, increase bandwidth, and increase total content-delivery time.
  2. Notion API rate limits. Even if the previous point is handled, Notion limits the number of calls you can make to its API. Notion states “… an average of three requests per second…”, which one Reddit user measured as roughly 2700 API calls over a 15-minute window. These low limits won't hold up, especially as your site scales.
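For point 1, here's a minimal sketch (not part of the pipeline we'll build) of opting out of HTTP caching for a single Notion call; blockId and NOTION_API_KEY are placeholders. Even with this in place, each fresh URL is still only valid for an hour, so you end up re-fetching anyway:
// a sketch: ask caches not to reuse the response for this Notion call
async function fetchBlocksUncached(blockId, NOTION_API_KEY) {
  return fetch(`https://api.notion.com/v1/blocks/${blockId}/children`, {
    method: "GET",
    headers: {
      Authorization: "Bearer " + NOTION_API_KEY,
      "Notion-Version": "2022-06-28",
      "Cache-Control": "no-store", // request directive: don't serve a cached copy
    },
    cache: "no-store", // fetch() cache mode, where the runtime supports it
  });
}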
Notion themselves don’t go into too much detail, but they make it clear in their developer docs:
The developer is responsible for the asset and making it available via a secure URL.
And as developers, that’s exactly what we’ll do.

High-Level Plan

We’ll need to do a few things:
  1. Choose A Cloud Storage Provider
  2. Sync Data Between Notion And Cloud Storage
  3. Retrieve Realtime Data From The Cloud
  4. Cache Control
I’ll be using Cloudflare for this article, but the techniques and principles discussed are largely applicable regardless of what provider(s) you use.
Here’s what we’re going for.
[Diagram: Cloudflare sync architecture]
The big idea: a global serverless function frequently queries our CMS for the most up-to-date content, and that content is synchronized into our cloud storage. The web server then retrieves the updated content directly from cloud storage, so the link stays static. Finally, we'll make sure users fetch page data with a revalidation field in the cache header, so they are served the latest information.

Choose A Cloud Storage Provider

As suggested, we'll need to store our files (for this example, images) in cloud storage that provides a static URL. Ideally this storage has high rate limits, low latency, and a large storage capacity; possible platforms include Amazon S3, Google's Cloud Storage for Firebase, Microsoft's OneDrive, and others. This site is hosted on Cloudflare, which makes it most practical to use Cloudflare's R2 storage.
💡
It's worth noting that while there are specialized storage systems like Cloudflare Images that focus on specific file types, these are typically designed for particular use cases and may not be suitable as general-purpose storage solutions.
Cloudflare’s R2 has pretty generous limits and storage pricing, but more on that later. Start by setting up a bucket with your chosen provider and ensure that you have the necessary read/write permissions. In the following section, we'll start interacting with this bucket.

Sync Data Between Notion And Cloud Storage

With access to a bucket, we'll need to periodically query Notion to get the most up-to-date information. Thankfully, Cloudflare provides Cloudflare Workers, which let us run small, global, purpose-built functions. If you're using Azure Functions, Google Cloud Functions, or AWS Lambda, the idea is the same: create a scheduled function that regularly fetches this information.
Worker runtimes are lightweight and don't support every API, so even though Notion offers a client library, we'll write fetch requests directly to Notion's endpoints. I suggest becoming familiar with their query API, and if you need help structuring your fetch requests, check out Postman.

Design & Test Sync Functionality

Ideally, we’ll aim to execute this Worker as frequently as possible to ensure we capture the most current information available. Later we’ll store this data, but it’s helpful to define our infrastructure now to help test our functions.
The API we’re working with allows 2700 calls per 15 minutes.
$$\frac{2700 \text{ API calls}}{15 \text{ minutes}} = 180 \text{ calls/minute} = 3 \text{ requests per second}$$
This figure from a Reddit user is on par with Notion's publicly stated “…average of three requests per second. Some bursts beyond the average rate are allowed”.
Consider how many calls to your CMS it takes to complete a sync.

We’ll roll with one sync per minute. A shorter sync interval is possible, but once a minute is fast enough and leaves plenty of headroom within the 180-calls-per-minute budget. Adjust your strategy based on your content's scale.
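As a rough check (assuming, as in the sync function later in this article, one database query plus one page-content query per published post, and ignoring pagination), a one-minute sync cycle stays under the limit as long as
$$1 + N_{\text{posts}} \le 180 \;\Rightarrow\; N_{\text{posts}} \le 179 \text{ published posts per sync.}$$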
First, let's structure our global function (Worker). Cloudflare can handle both scheduled events and one-off fetch requests in a single Worker; placing our implementation behind both means we can trigger a manual sync whenever needed.
export default {
  async scheduled(event, env, ctx) {
    // this is non-blocking
    ctx.waitUntil(sync(env));
  },

  async fetch(request, env, ctx) {
    // this is blocking
    return await sync(env);
  },
};
You can test this by returning a Response object, and reading logs in your provider’s console.
return new Response('Data fetched and stored successfully.', { status: 200 });
Don't forget to deploy your global function!

Grab Data From Notion

❗
Skip this section if you’re using a different CMS, as this is specific to Notion.
To store Notion file content, you first need to get the valid but volatile URLs. Notion is structured by blocks, and each block has an id and children. The following illustrates a manual fetch request to grab page content.
async function getPageContentFromID(env, id) {
  var requestOptions = {
    method: "GET",
    headers: {
      "Notion-Version": "2022-06-28",
      // use your secure environment variable
      Authorization: "Bearer " + env.NOTION_API_KEY,
    },
  };

  try {
	  // 100 blocks in a single fetch request is Notion's max. 100 is a lot but if
	  // you want to grab more, you'll need to implement some sort of pagination
	  // algorithm. experiment with moving Notion's cursor and their 'has_more'
	  // field
    const numBlocksToGrab = 100;
    const fetchResponseBlocks = await fetch(
      `https://api.notion.com/v1/blocks/${id}/children?page_size=${numBlocksToGrab}`,
      requestOptions
    );
    if (!fetchResponseBlocks.ok) {
      throw new Error(`Couldn't get page blocks: ${fetchResponseBlocks.status}`);
    }
    return await fetchResponseBlocks.json();
  } catch (error) {
    console.error('Error fetching page content:', error);
    // this function will be called in a try-catch to return a Response object
    // that contains this error information
    throw error;
  }
}
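The comment above mentions pagination; here's a minimal sketch of one way to walk past the 100-block limit by following Notion's has_more and next_cursor fields. Each extra page of blocks is another API call, so factor it into the rate-limit budget discussed earlier.
// a sketch: collect all child blocks of a page by following Notion's cursor
async function getAllBlocksFromID(env, id) {
  const requestOptions = {
    method: "GET",
    headers: {
      "Notion-Version": "2022-06-28",
      Authorization: "Bearer " + env.NOTION_API_KEY,
    },
  };

  const blocks = [];
  let cursor = undefined;
  do {
    const params = new URLSearchParams({ page_size: "100" });
    if (cursor) params.set("start_cursor", cursor);
    const res = await fetch(
      `https://api.notion.com/v1/blocks/${id}/children?${params}`,
      requestOptions
    );
    if (!res.ok) {
      throw new Error(`Couldn't get page blocks: ${res.status}`);
    }
    const json = await res.json();
    blocks.push(...json.results);
    cursor = json.has_more ? json.next_cursor : undefined;
  } while (cursor);
  return blocks;
}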
While we can grab information from a page directly, it's more likely that you'll use a Notion database to store your content. The following grabs all rows of a database, so you can then call getPageContentFromID from above on each post this function returns.
async function getDatabase(env) {
  var databaseRequestOptions = {
    method: "POST",
    headers: {
      Authorization: "Bearer " + env.NOTION_API_KEY,
      "Notion-Version": "2022-06-28",
      "Content-Type": "application/json",
    },
  };

  try {
    const fetchResponse = await fetch(
      `https://api.notion.com/v1/databases/${env.NOTION_DATABASE_ID}/query`,
      databaseRequestOptions
    );

    if (!fetchResponse.ok) {
      throw new Error(`Couldn't get database: ${fetchResponse.status}`);
    }

    const jsonResponse = await fetchResponse.json();
    return jsonResponse.results;
  } catch (error) {
    console.error('Error fetching published posts:', error);
    throw error;
  }
}

Sync Data To The Cloud

With the essential parts in place, let's define our sync function.
💡
Each post in my Notion database is represented by a row, with the first property being the page itself, followed by additional properties that contain general metadata about the post. The Published property (type: checkbox) marks posts that should appear on the site.
General procedure:
  1. Grab ALL data. Published and unpublished.
  2. Search for all image URLs. Published ones we store, unpublished ones we delete.
  3. Update our R2 instance with put and delete operations.
async function sync(env) {
  try {
    // get both published and unpublished data to store AND erase objects. This
    // is considered 'syncing', and not just 'pushing new data'
    const db = await getDatabase(env); // or replace with your CMS get-data
    const published = db.filter(result => {
      return result.properties.Published.checkbox;
    })
    const unpublished = db.filter(result => {
      return !result.properties.Published.checkbox;
    })

    const imageIDs = [];
    // store promises to execute in parallel. efficient for large operations
    const storeImagePromises = [];
    const deleteImagePromises = [];

    for (let pub of published) {
	    // you can replace this with your CMS's way to grab images
      const pageContent = await getPageContentFromID(env, pub.id);
      // store images, and record which ones were stored
      for (let block of pageContent.results) {
        if (block.type !== "image")
          continue;
        const imageID = block.id;
        imageIDs.push(imageID);
        // !!! we use the imageID (block's id) for this image's key. this is
        // important when we want to grab it later. notion's block ID's are
        // always static, so this is safe
        storeImagePromises.push(storeImage(env, imageID, block.image.file.url));
      }
    }

    // remove unused images. 'list' only retrieves the first 1000 items. larger
    // buckets will need to implement pagination to handle more
    const r2Objects = await env.SNUGL_NOTION_IMAGES.list();
    for (const object of r2Objects.objects) {
      if (!imageIDs.includes(object.key)) {
        deleteImagePromises.push(env.SNUGL_NOTION_IMAGES.delete(object.key));
      }
    }

    await Promise.all(storeImagePromises);
    await Promise.all(deleteImagePromises);
    return new Response('Data fetched and stored successfully.', {
      status: 200,
    });
  } catch (error) {
	  // ..
  }
}

async function storeImage(env, id, url) {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Failed to fetch image. Status: ${response.status}`);
  }
  const data = await response.arrayBuffer();

  await env.SNUGL_NOTION_IMAGES.put(id, data, {
    httpMetadata: {
      contentType: response.headers.get("Content-Type") || "application/octet-stream",
    },
  });
}
Congrats! 🎉 You now have a Worker that syncs data in realtime to the cloud!

Retrieve Realtime Data From The Cloud

At some point the user will request data from our R2 instance. What will that look like and how should we serve it?
[Diagram: Notion → client → Cloudflare R2 data flow]
Clients will access data directly from Notion* and obtain a list of blocks. While parsing these blocks, some will be identified as files; we look at their block IDs (the URLs don't matter here, since syncing and storage have already happened). These block IDs can then be used as keys to access the corresponding data in the R2 instance. However, to avoid the security risks of exposing credentials, it's advisable not to access your R2 instance directly from the client side. Instead, we'll set up another Worker to fetch this data safely.
export default {
  async fetch(request, env, ctx) {
    const url = new URL(request.url);
    const pathParts = url.pathname.split("/").filter((p) => p);

    // expecting "/r2/<key_name>"
    if (pathParts.length !== 2) {
      return new Response("Invalid URL format", { status: 400 });
    }

    const [storageType, key] = pathParts;

    switch (storageType) {
      case "r2":
        return handleR2Request(key, env);
      default:
        return new Response("Invalid storage type.", { status: 400 });
    }
  },
};

async function handleR2Request(key, env) {
  const object = await env.SNUGL_NOTION_IMAGES.get(key);
  if (object) {
    return new Response(object.body, {
      headers: { "Content-Type": object.httpMetadata.contentType },
      status: 200,
    });
  } else {
    return new Response("Object not found in R2", { status: 404 });
  }
}
Once we deploy this Worker, we can call it like https://<your.website.api>/r2/<block id> to retrieve your data!
🤓
If you’re trying to grab images and load them directly into your site, you can do something like:
<Image className="..." src={`https://<your.website.api>/r2/${id}`}/>
That’s all the coding necessary 😊
  • So far we’ve addressed how to store volatile file data from Notion. Ideally you'd store ALL data, including text; you can do so similarly with Cloudflare's KV (or any lightweight key-value storage), detailed later.

Cache Control

Perhaps the most tedious yet important part: revalidating the cache on your chosen sync interval.
If you’ve implemented the global functions correctly and constructed the proper fetch request from your client, you’ll still most likely get a cached response. Why?
Many providers implement caching rules to conserve bandwidth and reduce the load on the origin server. You can set up Cloudflare to operate at the edge of the network, where edge nodes equipped with caches can handle requests without needing to contact the origin server. In addition, the web server itself can be configured to cache fetch requests.
This local caching optimizes response times by preempting the need for calls to the API, hence "short-circuiting" requests. This dual-layer caching strategy, both at the edge and on the local server, enhances performance by minimizing latency and server load.
I won't detail the steps to modify cache rules in this post since they vary greatly between providers. However, it's crucial to set up rules with your provider that include a revalidation header, and to ensure that header is properly respected.
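As one hedged example (reusing the handleR2Request function from the fetch Worker above), you can attach a Cache-Control header to the Worker's response so cached copies expire on roughly the same schedule as the sync; max-age=60 matches the one-minute sync interval and is only illustrative:
// a sketch: serve R2 objects with a short, revalidating cache policy
async function handleR2Request(key, env) {
  const object = await env.SNUGL_NOTION_IMAGES.get(key);
  if (!object) {
    return new Response("Object not found in R2", { status: 404 });
  }
  return new Response(object.body, {
    status: 200,
    headers: {
      "Content-Type": object.httpMetadata.contentType,
      // caches may keep this for a minute, then must revalidate with the Worker
      "Cache-Control": "public, max-age=60, must-revalidate",
    },
  });
}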
Once properly configured, your entire pipeline should function smoothly. Congrats 🥳

Price Breakdown (Cloudflare-Specific)

There are costs (unsurprisingly…) associated with storing and accessing data.

R2 Pricing

The operations used in the code above, grouped by R2's pricing classes (per Cloudflare's pricing docs at the time of writing):
Class A Operations:
  • env.SNUGL_NOTION_IMAGES.put(id, data, { ... }) - PUT operation
  • env.SNUGL_NOTION_IMAGES.list() - LIST operation
Class B Operations:
  • env.SNUGL_NOTION_IMAGES.get(key) - GET operation (in the fetch Worker)
Free Operations:
  • env.SNUGL_NOTION_IMAGES.delete(...) - DELETE operation
The free tier limits are very generous for basic operations, and so we can ignore these for now.
Cloudflare also offers free storage for anything under 10 GB total, which is already plenty for a lightweight site. If you need to store and access your data more often, you can use their pricing calculator.
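As a rough sanity check (assuming the one-sync-per-minute schedule from earlier and, say, ten images re-uploaded per sync; both numbers are illustrative), each sync performs one list() plus one put() per image, all Class A:
$$60 \times 24 \times 30 \approx 43{,}200 \text{ syncs/month}, \qquad 43{,}200 \times 11 \approx 475{,}000 \text{ Class A operations/month}$$
That still sits comfortably under the one million free Class A operations per month Cloudflare advertises at the time of writing.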

Worker Pricing

Unless we exceed the high limits of the free tier, we’re good to ignore these.
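For a rough sense of scale (assuming scheduled invocations count toward the same request allowance, which is worth verifying against Cloudflare's current docs), the minute-by-minute sync adds
$$60 \times 24 = 1440 \text{ scheduled invocations per day,}$$
far below the roughly 100,000 requests per day the free tier allows at the time of writing; user-facing requests to the fetch Worker are the number to watch as traffic grows.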

Syncing Additional Content

We explored ways to sync/cache files and grab them from our Cloudflare Workers. While this gets us around Notion's expiring links, it doesn't solve the API limits Notion imposes on all other, static content.
This may not be an issue if you're running a small-scale site that doesn't need realtime updates. Below, I've provided a complete implementation using both R2 and KV, which I use to store text content from Notion and get the same benefits. I highly recommend you do the same, so feel free to copy and modify the following code.

Sync Worker

export default {
  async scheduled(event, env, ctx) {
    // 2700 calls per 15 mins is 180 calls per minute. For now I'll have it
    // sync to notion every 4 minutes. I can edit this in cron jobs. This
    // is non-blocking.
    ctx.waitUntil(sync(env));
  },

  async fetch(request, env, ctx) {
    // This is blocking
    return await sync(env);
  },
};

// Note that this code doesn't handle pagination.
async function sync(env) {
  try {
    // we get both published and unpublished data to store/erase objects. this
    // is considered 'syncing'
    const db = await getDatabase(env);
    const published = db.filter(result => {
      return result.properties.Published.checkbox;
    })
    const unpublished = db.filter(result => {
      return !result.properties.Published.checkbox;
    })

    const posts = [];
    const searchData = [];
    const imageIDs = [];

    // store promises to execute in parallel. efficient for large operations
    const storeImagePromises = [];
    const deleteImagePromises = [];

    for (let pub of published) {
      const props = pub.properties;
      const post = {
        title: props.Name.title[0].plain_text,
        slug: props.slug.rich_text[0].plain_text,
        date: props.Date.date.start,
        status: props.Status.multi_select[0].name,
        summary: props.Summary.rich_text[0].plain_text,
      };
      posts.push(post);
      searchData.push({ slug: post.slug, title: post.title });

      const pageContent = await getPageContentFromID(env, pub.id);
      await env.SNUGL_NOTION_TEXT.put(post.slug, JSON.stringify({
        title: post.title,
        date: post.date,
        summary: post.summary,
        content: pageContent
      }));

      // store images, and record which ones were stored
      for (let block of pageContent.results) {
        if (block.type !== "image")
          continue;
        const imageID = block.id;
        imageIDs.push(imageID);
        storeImagePromises.push(storeImage(env, imageID, block.image.file.url));
      }
    }
    posts.sort((a, b) => Date.parse(a.date) - Date.parse(b.date));

    // store search data and metadata of posts
    const search_key = "search_data";
    await env.SNUGL_NOTION_TEXT.put(search_key, JSON.stringify(searchData));
    const posts_key = "all_posts_details";
    await env.SNUGL_NOTION_TEXT.put(posts_key, JSON.stringify(posts));

    // remove unpublished articles
    for (let unpub of unpublished) {
      const unpubSlug = unpub.properties.slug.rich_text[0].plain_text;
      await env.SNUGL_NOTION_TEXT.delete(unpubSlug);
    }

    // remove unused images
    const r2Objects = await env.SNUGL_NOTION_IMAGES.list(); // only retrieves the first 1000
    for (const object of r2Objects.objects) {
      if (!imageIDs.includes(object.key)) {
        deleteImagePromises.push(env.SNUGL_NOTION_IMAGES.delete(object.key));
      }
    }

    await Promise.all(storeImagePromises);
    await Promise.all(deleteImagePromises);
    return new Response('Data fetched and stored successfully.', {
      status: 200,
    });
  } catch (error) {
    try {
      await storeErrorInKV(env, error);
      return new Response(error.message, {
        status: 500,
      });
    } catch (storeError) {
      console.error('Failed to store error in KV:', storeError);
      return new Response('Internal Server Error', {
        status: 500,
      });
    }
  }
}

async function storeImage(env, id, url) {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Failed to fetch image. Status: ${response.status}`);
  }
  const data = await response.arrayBuffer();

  await env.SNUGL_NOTION_IMAGES.put(id, data, {
    httpMetadata: {
      contentType: response.headers.get("Content-Type") || "application/octet-stream",
    },
  });
}

async function getDatabase(env) {
  var databaseRequestOptions = {
    method: "POST",
    headers: {
      Authorization: "Bearer " + env.NOTION_API_KEY,
      "Notion-Version": "2022-06-28",
      "Content-Type": "application/json",
    },
  };

  try {
    const fetchResponse = await fetch(
      `https://api.notion.com/v1/databases/${env.NOTION_DATABASE_ID}/query`,
      databaseRequestOptions
    );

    if (!fetchResponse.ok) {
      throw new Error(`Couldn't get database: ${fetchResponse.status}`);
    }

    const jsonResponse = await fetchResponse.json();
    return jsonResponse.results;
  } catch (error) {
    console.error('Error fetching published posts:', error);
    throw error;
  }
}

async function getPageContentFromID(env, id) {
  var requestOptions = {
    method: "GET",
    headers: {
      "Notion-Version": "2022-06-28",
      Authorization: "Bearer " + env.NOTION_API_KEY,
    },
  };

  try {
    // 100 is max. I can get more blocks by adjusting the cursor
    // TODO: if the number of blocks in the page is >100, move
    // the cursor and grab the next 100 blocks. repeat. I can use
    // the has_more boolean in the response to check.
    const numBlocksToGrab = 100;
    const fetchResponseBlocks = await fetch(
      `https://api.notion.com/v1/blocks/${id}/children?page_size=${numBlocksToGrab}`,
      requestOptions
    );
    if (!fetchResponseBlocks.ok) {
      throw new Error(`Couldn't get page blocks: ${fetchResponseBlocks.status}`);
    }
    return await fetchResponseBlocks.json();
  } catch (error) {
    console.error('Error fetching page content:', error);
    throw error;
  }
}

async function storeErrorInKV(env, error) {
  try {
    await env.SNUGL_NOTION_TEXT.put('ERROR', JSON.stringify({
      time: new Date().toString(),
      message: error.message,
      stack: error.stack
    }));
    console.error('Error stored: ', error);
  } catch (storeError) {
    console.error('Failed to store error in KV:', storeError);
  }
}

Fetch Worker

export default {
  async fetch(request, env, ctx) {
    const url = new URL(request.url);
    const pathParts = url.pathname.split('/').filter(p => p);

    // expecting "/kv/<key_name>" or "/r2/<key_name>"
    if (pathParts.length !== 2) {
      return new Response('Invalid URL format', { status: 400 });
    }

    const [storageType, key] = pathParts;
    
    switch (storageType) {
      case 'kv':
        return handleKVRequest(key, env);
      case 'r2':
        return handleR2Request(key, env);
      default:
        return new Response('Invalid storage type.', { status: 400 });
    }
  },
};

async function handleKVRequest(key, env) {
  const data = await env.SNUGL_NOTION_TEXT.get(key);
  if (data) {
    return new Response(data, {
      headers: { 'Content-Type': 'application/json' },
      status: 200
    });
  } else {
    return new Response('Key not found in KV', { status: 404 });
  }
}

async function handleR2Request(key, env) {
  const object = await env.SNUGL_NOTION_IMAGES.get(key);
  if (object) {
    return new Response(object.body, {
      headers: { 'Content-Type': object.httpMetadata.contentType },
      status: 200
    });
  } else {
    return new Response('Object not found in R2', { status: 404 });
  }
}
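As a quick, hypothetical usage example (assuming this Worker is deployed at https://<your.website.api> and that slug holds a post's slug stored during sync), a page can pull the post's text content from KV like so:
// sketch of a client-side call to the fetch Worker's KV route
const res = await fetch(`https://<your.website.api>/kv/${slug}`);
if (res.ok) {
  const post = await res.json(); // { title, date, summary, content }
}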
Happy coding! 🙏🏻

Tags: Serverless Architecture, Content Management Systems, API Rate Limits Handling, Notion API, Real-time Data Sync