On one hand, you have Cloud SQL, Google’s fully managed relational database service that supports MySQL, PostgreSQL, and SQL Server. BigQuery, on the other hand, is Google’s highly scalable and cost-effective serverless, multi-cloud data warehouse.
So, if you need a platform that can handle large data sets while maintaining high query speeds, consider moving your data from Cloud SQL to BigQuery.
This article describes several ways to migrate data from Cloud SQL to BigQuery. Let’s begin:
How to migrate data – Cloud SQL to BigQuery
There are several ways to accomplish this process, and not all of them require you to write code. Here are three approaches for migrating data from Cloud SQL to BigQuery:
1. Google Cloud Dataflow
Google Cloud Dataflow is a fully managed service for stream and batch data processing. This can help with the migration process as it can handle large amounts of data.
Here are the steps to follow:
- Before you begin migration, ensure that Cloud SQL, BigQuery, and Dataflow APIs are enabled in your Google Cloud project. Grant the appropriate IAM roles to your Dataflow service account, including roles to access Cloud SQL and BigQuery.
- Ensure that your Cloud SQL instance is configured to allow connections from Dataflow. You may need to set up an IP whitelist or configure a private IP.
- Install the Apache Beam SDK for Python or Java.
- Use ‘JdbcIO’ (Java) or a custom ‘DoFn’ (Python) to read data from Cloud SQL. If you first exported the data to Cloud Storage, use the appropriate I/O connector to read the exported files (e.g. ‘TextIO’ or ‘AvroIO’).
- Optionally, apply any necessary transformations to your data using Beam’s ‘PTransform’s.
- Use the ‘BigQueryIO’ connector to write data to BigQuery. Depending on your data size and pipeline requirements, you can choose different write methods, such as writing directly or staging the data in a temporary table first (a minimal Python sketch follows this list).
- Deploy your Dataflow pipeline using the gcloud command-line tool or the Google Cloud Console, specifying pipeline options such as project ID, region, and other relevant configuration.
- Monitor your pipeline execution in the Google Cloud Console for errors and ensure that data is moving correctly from Cloud SQL to BigQuery.
- Once your pipeline is complete, verify that the data in BigQuery matches your expectations and the source data in Cloud SQL.
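For illustration, here is a minimal Python sketch of such a pipeline, assuming the Cloud SQL table was first exported to a CSV file in Cloud Storage. The project, bucket, dataset, table, and column names are placeholders, not values from this article.

```python
# Minimal Apache Beam sketch: read a Cloud SQL CSV export from GCS and write it
# to BigQuery. All names (project, bucket, dataset, table, columns) are hypothetical.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_csv_line(line):
    # Assumes a simple two-column export (id, name) with no quoted commas.
    parts = line.split(",")
    return {"id": int(parts[0]), "name": parts[1]}


def run():
    options = PipelineOptions(
        runner="DataflowRunner",      # use "DirectRunner" to test locally
        project="my-project",         # placeholder project ID
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
    )
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadExport" >> beam.io.ReadFromText(
                "gs://my-bucket/exports/users.csv",
                skip_header_lines=1)  # drop a header row, if the export has one
            | "ParseRows" >> beam.Map(parse_csv_line)
            | "WriteToBQ" >> beam.io.WriteToBigQuery(
                "my-project:my_dataset.users",
                schema="id:INTEGER,name:STRING",
                write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            )
        )


if __name__ == "__main__":
    run()
```

For a one-off migration, a batch load like this is usually cheaper than streaming inserts; WRITE_TRUNCATE simply replaces the destination table on each run.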
Depending on your pipeline performance, you may need to adjust parallelism, choose a different machine type, or optimize sources/queries for better throughput.
2. Scheduled exports with BigQuery scheduled queries
If you don’t want to write a custom Dataflow pipeline, you can rely on scheduled exports and BigQuery’s scheduled queries instead. The steps are as follows:
- Use Cloud SQL’s export feature to export your data to GCS. This can be automated using Google Cloud’s Cloud Scheduler and Cloud Functions to trigger exports at regular intervals (make sure you export the data in a format that BigQuery can import, such as CSV, JSON, Avro, or Parquet); see the sketch after this list.
- If you haven’t already done so, create a BigQuery dataset to hold the data exported from Cloud SQL.
- Schedule data loads from GCS.
- Go to the BigQuery UI, click ‘Scheduled Query’ in the side menu, then click ‘Create Scheduled Query’.
- Select the data source as ‘Drive’ or ‘Cloud Storage’ and specify the data path in GCS.
- Write a ‘CREATE TABLE’ or ‘INSERT’ statement that loads the data into your BigQuery dataset. For large data sets or frequent updates, we recommend ‘CREATE OR REPLACE TABLE’ to overwrite an existing table, or ‘INSERT INTO’ to append to it.
- Configure a schedule for how often this query should run. This depends on how often you export Cloud SQL data to GCS and how fresh you need to keep the data in BigQuery.
- Use the BigQuery UI to monitor scheduled query execution and ensure that your data is updated as expected.
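As a rough illustration, the sketch below shows what the two automated pieces might look like in Python: a Cloud Function-style handler that triggers a Cloud SQL CSV export via the Cloud SQL Admin API, and a separate job that loads the exported file into BigQuery. All project, instance, bucket, and table names are placeholders, and error handling is omitted.

```python
# Sketch of scheduled export-and-load automation. Names are hypothetical;
# in practice each function would be triggered by its own Cloud Scheduler job.
from googleapiclient import discovery          # pip install google-api-python-client
from google.cloud import bigquery              # pip install google-cloud-bigquery


def trigger_cloudsql_export(event=None, context=None):
    """Ask Cloud SQL to export a table to GCS as CSV (asynchronous operation)."""
    sqladmin = discovery.build("sqladmin", "v1beta4")
    body = {
        "exportContext": {
            "fileType": "CSV",
            "uri": "gs://my-bucket/exports/users.csv",      # placeholder bucket
            "databases": ["my_database"],
            "csvExportOptions": {"selectQuery": "SELECT * FROM users"},
        }
    }
    response = sqladmin.instances().export(
        project="my-project", instance="my-instance", body=body).execute()
    print("Export operation started:", response.get("name"))


def load_export_into_bigquery(event=None, context=None):
    """Load the exported CSV from GCS into a BigQuery table."""
    client = bigquery.Client(project="my-project")
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        schema=[
            bigquery.SchemaField("id", "INTEGER"),
            bigquery.SchemaField("name", "STRING"),
        ],
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    load_job = client.load_table_from_uri(
        "gs://my-bucket/exports/users.csv",
        "my-project.my_dataset.users",
        job_config=job_config,
    )
    load_job.result()  # wait for the load to finish
    print("Loaded rows:", client.get_table("my-project.my_dataset.users").num_rows)
```

Loading directly with a load job is an alternative to a scheduled query over an external table; either way, schedule the load to run after the export has had time to complete.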
Because each export-and-load cycle introduces latency, this method does not give you real-time access to the data in BigQuery.
3. Cloud Datastream for CDC
You can also replicate data from Cloud SQL to BigQuery using Google Cloud Datastream for Change Data Capture (CDC). The steps are as follows:
- Set up:
- Make sure the Cloud Datastream API is enabled in your Google Cloud project.
- Configure the connection profile for your Cloud SQL instance. You can do this by specifying connection details such as instance ID, database type, and credentials.
- Configure a connection profile for BigQuery as the target.
- Create a new stream in Cloud Datastream by selecting the source and destination connection profiles you created previously. You must also specify the database tables in Cloud SQL that you want to replicate.
- Choose replication settings, such as the starting point for data replication: from now, from the earliest available point, or from a specific point in time.
- Once configuration is complete, start the stream. Cloud Datastream begins capturing changes to the specified Cloud SQL tables and replicating them to the destination in BigQuery.
- Make sure the dataset and table from which data will be replicated are set up correctly in BigQuery. You may need to define a schema to match your Cloud SQL tables.
- Datastream writes the captured changes to Google Cloud Storage in Avro or JSON format. Ingest these changes into BigQuery using a Dataflow job or a custom service, transforming the data as needed to fit your BigQuery schema (see the sketch after this list).
- Set up error handling and alerting mechanisms so you can quickly resolve any issues that arise during replication.
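As a minimal illustration of the ingestion step, the sketch below loads Datastream’s Avro output files from a GCS path into a BigQuery table using the google-cloud-bigquery client. The bucket path and table name are placeholders, and a production setup would more likely use a Dataflow template for continuous ingestion.

```python
# Sketch: load Datastream Avro output from GCS into BigQuery.
# The bucket path, dataset, and table name are hypothetical.
from google.cloud import bigquery


def load_datastream_avro():
    client = bigquery.Client(project="my-project")
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.AVRO,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    load_job = client.load_table_from_uri(
        "gs://my-datastream-bucket/my-stream/users/*.avro",  # wildcard over output files
        "my-project.my_dataset.users_changelog",
        job_config=job_config,
    )
    load_job.result()
    print("Appended Datastream change records into users_changelog")
```

Note that this appends raw change records; producing a current-state table would still require a downstream MERGE or deduplication step.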
Although this method is efficient and near real-time, the cost and complexity of the process are the two biggest considerations.
Final notes
BigQuery treats writes as immutable, so migrating data that changes frequently may require a more specific strategy than the solutions provided here. Additionally, BigQuery enforces a 4 GB per-file limit on certain load files, so you may need to chunk your data extraction before loading.
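If you do need to split a large table into multiple export files, one option is to export it in primary-key ranges. Below is a rough sketch reusing the hypothetical Cloud SQL Admin API call from earlier; the table, key ranges, bucket, and instance names are all placeholders.

```python
# Sketch: export a large table in primary-key ranges so each output file stays small.
# Table, key ranges, bucket, and instance names are hypothetical.
import time
from googleapiclient import discovery


def export_in_chunks(chunk_size=1_000_000, max_id=5_000_000):
    sqladmin = discovery.build("sqladmin", "v1beta4")
    for start in range(0, max_id, chunk_size):
        end = start + chunk_size
        body = {
            "exportContext": {
                "fileType": "CSV",
                "uri": f"gs://my-bucket/exports/users_{start}_{end}.csv",
                "databases": ["my_database"],
                "csvExportOptions": {
                    "selectQuery": f"SELECT * FROM users WHERE id >= {start} AND id < {end}"
                },
            }
        }
        op = sqladmin.instances().export(
            project="my-project", instance="my-instance", body=body).execute()
        # Cloud SQL runs only one export at a time, so wait for this one to finish.
        while sqladmin.operations().get(
                project="my-project", operation=op["name"]).execute()["status"] != "DONE":
            time.sleep(10)
```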