Data Lens¶
Available in FiftyOne Teams v2.2+
Data Lens is a feature built into the FiftyOne Teams App that lets you explore and import samples from external data sources.
Whether your data resides in a database like PostgreSQL or a data lake like Databricks or BigQuery, Data Lens provides a way to search your data sources, visualize sample data, and import samples into FiftyOne for further analysis.
How it works¶
- Define your search experience
Tailor the interactions with your data source to exactly what you need. Data Lens provides a flexible framework with extensive customization options, letting you home in on the questions that are most critical to your workflow. See Integrating with Data Lens to learn more.
- Connect your data source
Provide configuration to Data Lens to connect to your data source. Once connected, you can start searching for samples directly from FiftyOne. See Connecting a data source to learn more.
- Interact with your data
View your samples using Data Lens’ rich preview capabilities. Refine your search criteria to find samples of interest, then import your samples into a FiftyOne dataset for further analysis. See Exploring samples to learn more.
Connecting a data source¶
A data source represents anywhere that you store your data outside of FiftyOne. This could be a local SQL database, a hosted data lake, or an internal data platform. To connect to your data source, you’ll first implement a simple operator which FiftyOne can use to communicate with your data source.
Once your operator is defined, you can navigate to the “Data sources” tab by clicking on the tab header or by clicking on “Connect to a data source” from the “Home” tab.
Add a new data source by clicking on “Add data source”.
Enter a useful name for your data source and provide the URI for your operator. The URI should have the format <your-plugin-name>/<your-operator-name>. For example, the operator built in the integration guide below would have the URI my_plugin/my_custom_data_lens_operator.
Click “Connect” once you’re finished to save your configuration.
If you need to update your Data Lens configuration, simply click the action menu icon and select “Edit”. Similarly, you can delete a Data Lens configuration by clicking the action menu icon and selecting “Delete”.
That’s it. Now you’re ready to explore your data source. You can head over to the “Query data” tab and start interacting with your data.
Note
Don’t have a Data Lens operator yet? When you’re ready to get started, you can follow our detailed integration guide.
Exploring samples¶
Once you’ve connected to a data source, you can open the “Query data” tab to start working with your data.
In this tab, you can select from your connected data sources using the “Select a data source” dropdown.
Below this dropdown, you can parameterize your search using the available query parameters. These parameters are unique to each connected data source, allowing you to tailor your search experience to exactly what you need. Selecting a new data source will automatically update the query parameters to match those expected by your data source.
After you enter your query parameters, you can click the “Preview data” button at the bottom of the page to fetch samples which match your query parameters. These samples will be displayed in the preview panel, along with any features associated with the sample like labels or bounding boxes.
You can use the zoom slider to control the size of the samples, and you can modify the number of preview samples shown by changing the “Number of preview samples” input value and clicking “Preview data” again.
If you want to change your search, simply reopen the query parameters panel and modify your inputs. Clicking “Preview data” will fetch new samples matching the current query parameters.
If you want to import your samples into FiftyOne for further analysis, you can import your samples to a dataset.
Importing samples to FiftyOne¶
After generating a preview in Data Lens, you can click on the “Import data” button to open the import dialog.
Imports can be limited to a specific number of samples, or you can import all samples matching your query parameters.
The “Skip existing samples” checkbox allows you to configure the behavior for merging samples into a dataset. If checked, samples with a filepath which is already present in the dataset will be skipped. If left unchecked, all samples will be added to the dataset.
Note
If you elect to skip existing samples, this will also deduplicate samples within the data being imported.
After configuring the size/behavior of your import, select a destination dataset for the samples. This can be an existing dataset, or you can choose to create a new dataset.
You can optionally specify tags to append to the tags field of each imported sample.
When you click import, you will have the option to either execute immediately or to schedule this import for asynchronous execution.
If you are importing a small number of samples, then immediate execution may be appropriate. However, in most cases it is recommended to schedule the import, as this will result in more consistent and performant execution.
Note
Scheduled imports use the delegated operations framework to execute asynchronously on your connected compute cluster!
After selecting your execution preference, you will be able to monitor the status of your import through the information provided by the import panel.
In the case of immediate execution, you will be presented with an option to view your samples once the import is complete. Clicking on this button will open your destination dataset containing your imported samples.
In the case of scheduled execution, you will be presented with an option to visit the Runs page.
From the Runs page, you can track the status of your import.
Once your samples are imported, you will be able to leverage the full capabilities of FiftyOne to analyze and curate your data, and you can continue to use Data Lens to augment your datasets.
Integrating with Data Lens¶
Data Lens makes use of FiftyOne’s powerful plugins framework to allow you to tailor your experience to meet the needs of your data. As part of the plugin framework, you are able to create custom operators, which are self-contained Python classes that provide custom functionality to FiftyOne.
Data Lens defines an operator interface which makes it easy to connect to your data sources. We’ll walk through an example of creating your first Data Lens operator.
Setting up your operator¶
To assist with Data Lens integration, we can use the DataLensOperator base class provided with the Teams SDK. This base class handles the implementation of the operator's execute() method and defines a single abstract method that we'll implement.
```python
# my_plugin/__init__.py

from typing import Generator

import fiftyone.operators as foo
from fiftyone.operators.data_lens import (
    DataLensOperator,
    DataLensSearchRequest,
    DataLensSearchResponse
)


class MyCustomDataLensOperator(DataLensOperator):
    """Custom operator which integrates with Data Lens."""

    @property
    def config(self) -> foo.OperatorConfig:
        return foo.OperatorConfig(
            name="my_custom_data_lens_operator",
            label="My custom Data Lens operator",
            unlisted=True,
            execute_as_generator=True,
        )

    def handle_lens_search_request(
        self,
        request: DataLensSearchRequest,
        ctx: foo.ExecutionContext
    ) -> Generator[DataLensSearchResponse, None, None]:
        # We'll implement our logic here
        pass
```
Let’s take a look at what we have so far.
```python
class MyCustomDataLensOperator(DataLensOperator):
```
Our operator extends the DataLensOperator provided by the Teams SDK. This base class defines the abstract handle_lens_search_request() method, which we will need to implement.
```python
@property
def config(self) -> foo.OperatorConfig:
    return foo.OperatorConfig(
        # This is the name of your operator. FiftyOne will canonically
        # refer to your operator as <your-plugin>/<your-operator>.
        name="my_custom_data_lens_operator",

        # This is a human-friendly label for your operator.
        label="My custom Data Lens operator",

        # Setting unlisted to True prevents your operator from appearing
        # in lists of general-purpose operators, as this operator is not
        # intended to be directly executed.
        unlisted=True,

        # For compatibility with the DataLensOperator base class, we
        # instruct FiftyOne to execute our operator as a generator.
        execute_as_generator=True,
    )
```
The config property is part of the standard operator interface and provides configuration options for your operator.
```python
def handle_lens_search_request(
    self,
    request: DataLensSearchRequest,
    ctx: foo.ExecutionContext
) -> Generator[DataLensSearchResponse, None, None]:
    pass
```
The handle_lens_search_request() method provides us with two arguments: a DataLensSearchRequest instance, and the current operator execution context.
The DataLensSearchRequest is generated by the Data Lens framework and provides information about the Data Lens user's query. The request object has the following properties:
- request.search_params: a dict containing the search parameters provided by the Data Lens user.
- request.batch_size: a number indicating the maximum number of samples to return in a single batch.
- request.max_results: a number indicating the maximum number of samples to return across all batches.
Note
The Data Lens framework will automatically truncate responses to adhere to request.max_results. Any sample data beyond this limit will be discarded.
The ctx argument provides access to a range of useful capabilities which you can leverage in your operator, including things like providing secrets to your operator.
Using these inputs, we are expected to return a generator which yields DataLensSearchResponse objects. To start, we'll create some synthetic data to better understand the interaction between Data Lens and our operator. We'll look at a more realistic example later on.
Note
Why a generator? Generators provide a convenient approach for long-lived, lazy-fetching connections that are common in databases and data lakes. While Data Lens does support operators which do not execute as generators, we recommend using a generator for ease of integration.
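Before we implement the search logic, note that the operator must be exposed through a plugin so that it can be referenced by a URI like my_plugin/my_custom_data_lens_operator. Below is a minimal sketch of the usual plugin registration hook, assuming the plugin's standard fiftyone.yml scaffolding is already in place.

```python
# my_plugin/__init__.py (continued)

def register(p):
    # Registers the operator under this plugin, making it addressable
    # as my_plugin/my_custom_data_lens_operator
    p.register(MyCustomDataLensOperator)
```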
Generating search responses¶
To adhere to the Data Lens interface, we need to yield DataLensSearchResponse objects from our operator. A DataLensSearchResponse consists of the following fields:
- response.result_count: a number indicating the number of samples being returned in this response.
- response.query_result: a list of dicts containing serialized Sample data, e.g. obtained via to_dict().
Note
Data Lens expects sample data to adhere to the Sample format, which is easy to achieve by using the FiftyOne SDK to create your sample data, as shown below.
To see how Data Lens works, let’s yield a response with a single synthetic sample.
```python
# Note: this snippet uses the core SDK, which requires
# `import fiftyone as fo` at the top of the module

def handle_lens_search_request(
    self,
    request: DataLensSearchRequest,
    ctx: foo.ExecutionContext
) -> Generator[DataLensSearchResponse, None, None]:
    # We'll use a placeholder image for our synthetic data
    image_url = "https://placehold.co/150x150"

    # Create a sample using the SDK
    synthetic_sample = fo.Sample(filepath=image_url)

    # Convert our samples to dicts
    samples = [synthetic_sample.to_dict()]

    # We'll ignore any inputs for now and yield a single response
    yield DataLensSearchResponse(
        result_count=len(samples),
        query_result=samples
    )
```
Let’s see what this looks like in Data Lens.
After adding the operator as a data source, we can navigate to the “Query data” tab to interact with the operator. When we click the preview button, the Data Lens framework invokes our operator to retrieve sample data. Our operator yields a single sample, and we see that sample shown in the preview.
Let's modify our operator to incorporate the request.batch_size property.
```python
def handle_lens_search_request(
    self,
    request: DataLensSearchRequest,
    ctx: foo.ExecutionContext
) -> Generator[DataLensSearchResponse, None, None]:
    samples = []

    # Generate a number of samples equal to request.batch_size
    for i in range(request.batch_size):
        samples.append(
            fo.Sample(
                # We'll modify our synthetic data to include the
                # sample's index as the image text.
                filepath=f"https://placehold.co/150x150?text={i + 1}"
            ).to_dict()
        )

    # Still yielding a single response
    yield DataLensSearchResponse(
        result_count=len(samples),
        query_result=samples
    )
```
Now if we re-run our preview, we see that we get a number of samples equal to the “Number of preview samples” input.
If we modify that number and regenerate the preview, we can see that the number of samples remains in sync. For preview functionality, Data Lens fetches sample data in a single batch, so we can expect these values to be the same.
Working with user-provided data¶
Let’s now look at how Data Lens users are able to interact with our operator. Data Lens is designed to enable users to quickly explore samples of interest, and a key component is providing users a way to control the behavior of our operator.
To achieve this, we simply need to define the possible inputs to our operator in the resolve_input() method.
```python
def resolve_input(self):
    # We define our inputs as an object (requires
    # `import fiftyone.operators.types as types`).
    # We'll add fields to this object, each representing a single input.
    inputs = types.Object()

    # Add a string field named "sample_text"
    inputs.str(
        "sample_text",
        label="Sample text",
        description="Text to render in samples"
    )

    return types.Property(inputs)
```
Note
For more information on operator inputs, see the plugin documentation.
With this method implemented, Data Lens will construct a form allowing users to define any or all of these inputs.
We can then use this data to change the behavior of our operator. Let's add logic to integrate sample_text into our operator.
```python
def handle_lens_search_request(
    self,
    request: DataLensSearchRequest,
    ctx: foo.ExecutionContext
) -> Generator[DataLensSearchResponse, None, None]:
    # Retrieve our "sample_text" input from request.search_params.
    # These parameter names should match those used in resolve_input().
    sample_text = request.search_params.get("sample_text", "")

    samples = []

    # Create a sample for each character in our input text
    for char in sample_text:
        samples.append(
            fo.Sample(
                filepath=f"https://placehold.co/150x150?text={char}"
            ).to_dict()
        )

        # Yield batches when we have enough samples
        if len(samples) == request.batch_size:
            yield DataLensSearchResponse(
                result_count=len(samples),
                query_result=samples
            )

            # Reset our batch
            samples = []

    # We've generated all our samples, but might be in the middle of a batch
    if len(samples) > 0:
        yield DataLensSearchResponse(
            result_count=len(samples),
            query_result=samples
        )

    # Now we're done :)
```
Now when we run our preview, we can see that the text we provide as input is reflected in the samples returned by our operator. Modifying the text and regenerating the preview yields the expected result.
There are a couple of things to note about the changes we made here.
- Inputs can be specified with required=True, in which case Data Lens will ensure that the user provides a value for that input (see the sketch below). If an input is not explicitly required, then we should be sure to handle the case where it is not present.
- In most real scenarios, our operator will be processing more samples than fit in a single batch. (This is even true here, where there is no upper bound on our input length.) As such, our operator should respect the request.batch_size parameter and yield batches of samples as they are available.
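For instance, here is a minimal sketch of marking our sample_text input as required; the required flag is the only change from the resolve_input() shown earlier.

```python
def resolve_input(self):
    inputs = types.Object()

    # Data Lens will require a value for this input before
    # invoking the operator
    inputs.str(
        "sample_text",
        label="Sample text",
        description="Text to render in samples",
        required=True,
    )

    return types.Property(inputs)
```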
Note
This example is meant to illustrate how users can interact with our operator. For a more realistic view into how inputs can tailor our search experience, see our example integration with Databricks.
Differences in preview and import¶
While the examples here are focused on preview functionality, the Data Lens
framework invokes your operator in the same way to achieve both preview and
import functionality. The request.batch_size
and request.max_results
parameters can be used to optimize your data retrieval, but preview and import
should otherwise be treated as functionally equivalent.
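As a rough sketch of what that optimization might look like, an operator could stop fetching once request.max_results is reached. Here _fetch_batches() is a hypothetical helper standing in for your data source's pagination.

```python
def handle_lens_search_request(
    self,
    request: DataLensSearchRequest,
    ctx: foo.ExecutionContext
) -> Generator[DataLensSearchResponse, None, None]:
    total = 0

    for batch in self._fetch_batches(request):  # hypothetical helper
        # Anything beyond request.max_results would be discarded by the
        # framework anyway, so trim and stop early to avoid wasted work
        batch = batch[:request.max_results - total]
        total += len(batch)

        yield DataLensSearchResponse(
            result_count=len(batch),
            query_result=batch
        )

        if total >= request.max_results:
            break
```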
Example data source connectors¶
This section provides example Data Lens connectors for various popular data sources.
Databricks¶
Below is an example of a Data Lens connector for Databricks. This example uses a schema consistent with the Berkeley DeepDrive dataset format.
```python
import json
import time
from typing import Generator

import fiftyone as fo
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.sql import (
    StatementResponse,
    StatementState,
    StatementParameterListItem
)
from fiftyone import operators as foo
from fiftyone.operators import types
from fiftyone.operators.data_lens import (
    DataLensOperator,
    DataLensSearchRequest,
    DataLensSearchResponse
)


class DatabricksConnector(DataLensOperator):
    """Data Lens operator which retrieves samples from Databricks."""

    @property
    def config(self) -> foo.OperatorConfig:
        return foo.OperatorConfig(
            name="databricks_connector",
            label="Databricks Connector",
            unlisted=True,
            execute_as_generator=True,
        )

    def resolve_input(self, ctx: foo.ExecutionContext):
        inputs = types.Object()

        # Times of day
        inputs.bool(
            "daytime",
            label="Day",
            description="Show daytime samples",
            default=True,
        )
        inputs.bool(
            "night",
            label="Night",
            description="Show night samples",
            default=True,
        )
        inputs.bool(
            "dawn/dusk",
            label="Dawn / Dusk",
            description="Show dawn/dusk samples",
            default=True,
        )

        # Weather
        inputs.bool(
            "clear",
            label="Clear weather",
            description="Show samples with clear weather",
            default=True,
        )
        inputs.bool(
            "rainy",
            label="Rainy weather",
            description="Show samples with rainy weather",
            default=True,
        )

        # Detection label
        inputs.str(
            "detection_label",
            label="Detection label",
            description="Filter samples by detection label",
        )

        return types.Property(inputs)

    def handle_lens_search_request(
        self,
        request: DataLensSearchRequest,
        ctx: foo.ExecutionContext
    ) -> Generator[DataLensSearchResponse, None, None]:
        handler = DatabricksHandler()
        for response in handler.handle_request(request, ctx):
            yield response


class DatabricksHandler:
    """Handler for interacting with Databricks tables."""

    def __init__(self):
        self.client = None
        self.warehouse_id = None

    def handle_request(
        self,
        request: DataLensSearchRequest,
        ctx: foo.ExecutionContext
    ) -> Generator[DataLensSearchResponse, None, None]:
        # Initialize the client
        self._init_client(ctx)

        # Iterate over samples
        sample_buffer = []
        for sample in self._iter_data(request):
            sample_buffer.append(self._transform_sample(sample))

            # Yield batches of data as they are available
            if len(sample_buffer) == request.batch_size:
                yield DataLensSearchResponse(
                    result_count=len(sample_buffer),
                    query_result=sample_buffer,
                )
                sample_buffer = []

        # Yield final batch if it's non-empty
        if len(sample_buffer) > 0:
            yield DataLensSearchResponse(
                result_count=len(sample_buffer),
                query_result=sample_buffer,
            )

        # No more samples.

    def _init_client(self, ctx: foo.ExecutionContext):
        """Prepare the Databricks client for query execution."""
        # Initialize the Databricks client using credentials provided
        # via `ctx.secret`
        self.client = WorkspaceClient(
            host=ctx.secret("DATABRICKS_HOST"),
            account_id=ctx.secret("DATABRICKS_ACCOUNT_ID"),
            client_id=ctx.secret("DATABRICKS_CLIENT_ID"),
            client_secret=ctx.secret("DATABRICKS_CLIENT_SECRET"),
        )

        # Start a SQL warehouse instance to execute our query
        self.warehouse_id = self._start_warehouse()
        if self.warehouse_id is None:
            raise ValueError("No available warehouse")

    def _start_warehouse(self) -> str:
        """Start a SQL warehouse and return its ID."""
        last_warehouse_id = None

        # If any warehouses are already running, use the first available
        for warehouse in self.client.warehouses.list():
            last_warehouse_id = warehouse.id
            if warehouse.health.status is not None:
                return warehouse.id

        # Otherwise, manually start the last available warehouse
        if last_warehouse_id is not None:
            self.client.warehouses.start(last_warehouse_id)
            return last_warehouse_id

    def _iter_data(
        self, request: DataLensSearchRequest
    ) -> Generator[dict, None, None]:
        """Iterate over sample data retrieved from Databricks."""
        # Filter samples based on selected times of day
        enabled_times_of_day = tuple([
            f'"{tod}"'
            for tod in ["daytime", "night", "dawn/dusk"]
            if request.search_params.get(tod, False)
        ])

        # Filter samples based on selected weather
        enabled_weather = tuple([
            f'"{weather}"'
            for weather in ["clear", "rainy"]
            if request.search_params.get(weather, False)
        ])

        # Build Databricks query
        query = f"""
            SELECT * FROM datasets.bdd.det_train samples
            WHERE samples.attributes.timeofday IN ({", ".join(enabled_times_of_day)})
            AND samples.attributes.weather IN ({", ".join(enabled_weather)})
        """

        query_parameters = []

        # Filter samples based on detection label if provided
        if request.search_params.get("detection_label") not in (None, ""):
            query += f"""
                AND samples.name IN (
                    SELECT DISTINCT(labels.name)
                    FROM datasets.bdd.det_train_labels labels
                    WHERE labels.category = :detection_label
                )
            """
            query_parameters.append(
                StatementParameterListItem(
                    "detection_label",
                    value=request.search_params.get("detection_label")
                )
            )

        # Execute query asynchronously; we'll get a statement_id that we
        # can use to poll for results
        statement_response = self.client.statement_execution.execute_statement(
            query,
            self.warehouse_id,
            catalog="datasets",
            parameters=query_parameters,
            row_limit=request.max_results,
            wait_timeout="0s"
        )

        # Poll on our statement until it's no longer in an active state
        while (
            statement_response.status.state
            in (StatementState.PENDING, StatementState.RUNNING)
        ):
            statement_response = self.client.statement_execution.get_statement(
                statement_response.statement_id
            )
            time.sleep(2.5)

        # Process the first batch of data
        json_result = self._response_to_dicts(statement_response)
        for element in json_result:
            yield element

        # Databricks paginates samples using "chunks"; iterate over
        # chunks until next is None
        while statement_response.result.next_chunk_index is not None:
            statement_response = (
                self.client.statement_execution.get_statement_result_chunk_n(
                    statement_response.statement_id,
                    statement_response.result.next_chunk_index
                )
            )

            # Process the next batch of data
            json_result = self._response_to_dicts(statement_response)
            for element in json_result:
                yield element

    def _transform_sample(self, sample: dict) -> dict:
        """Transform a dict of raw Databricks data into a FiftyOne Sample dict."""
        return fo.Sample(
            filepath=f"cloud://bucket/{sample.get('name')}",
            detections=self._build_detections(sample),
        ).to_dict()

    def _build_detections(self, sample: dict) -> fo.Detections:
        # Images are a known, static size
        image_width = 1280
        image_height = 720

        # Extract detection labels and pre-process bounding boxes
        labels_list = json.loads(sample["labels"])
        for label_data in labels_list:
            if "box2d" in label_data:
                label_data["box2d"] = {
                    k: float(v)
                    for k, v in label_data["box2d"].items()
                }

        return fo.Detections(
            detections=[
                fo.Detection(
                    label=label_data["category"],
                    # FiftyOne expects bounding boxes to be of the form
                    #   [x, y, width, height]
                    # where values are normalized to the image's dimensions.
                    #
                    # Our source data is of the form
                    #   {x1, y1, x2, y2}
                    # where values are in absolute pixels.
                    bounding_box=[
                        label_data["box2d"]["x1"] / image_width,
                        label_data["box2d"]["y1"] / image_height,
                        (label_data["box2d"]["x2"] - label_data["box2d"]["x1"])
                        / image_width,
                        (label_data["box2d"]["y2"] - label_data["box2d"]["y1"])
                        / image_height
                    ]
                )
                for label_data in labels_list
                if "box2d" in label_data
            ]
        )

    def _response_to_dicts(self, response: StatementResponse) -> list[dict]:
        # Check for response errors before processing
        self._check_for_error(response)

        # Extract column names from response
        columns = response.manifest.schema.columns
        column_names = [column.name for column in columns]

        # Extract data from response
        data = response.result.data_array or []

        # Each element in data is a list of raw column values.
        # Remap ([col1, col2, ..., colN], [val1, val2, ..., valN]) tuples
        # to {col1: val1, col2: val2, ..., colN: valN} dicts
        return [
            {
                key: value
                for key, value in zip(column_names, datum)
            }
            for datum in data
        ]

    def _check_for_error(self, response: StatementResponse):
        if response is None:
            raise ValueError("received null response from databricks")

        if response.status is not None:
            if response.status.error is not None:
                raise ValueError(
                    "databricks error: ({0}) {1}".format(
                        response.status.error.error_code,
                        response.status.error.message
                    )
                )

            if response.status.state in (
                StatementState.CLOSED,
                StatementState.FAILED,
                StatementState.CANCELED,
            ):
                raise ValueError(
                    f"databricks error: response state = {response.status.state}"
                )
```
Google BigQuery¶
Below is an example of a Data Lens connector for BigQuery:
```python
from typing import Generator

import fiftyone.operators as foo
import fiftyone.operators.types as types
from fiftyone.operators.data_lens import (
    DataLensOperator,
    DataLensSearchRequest,
    DataLensSearchResponse
)
from google.cloud import bigquery


class BigQueryConnector(DataLensOperator):
    @property
    def config(self):
        return foo.OperatorConfig(
            name="bq_connector",
            label="BigQuery Connector",
            unlisted=True,
            execute_as_generator=True,
        )

    def resolve_input(self, ctx):
        inputs = types.Object()

        # We'll enable searching on detection labels
        inputs.str(
            "detection_label",
            label="Detection label",
            description="Enter a label to find samples with a matching detection",
            required=True,
        )

        return types.Property(inputs)

    def handle_lens_search_request(
        self,
        request: DataLensSearchRequest,
        ctx: foo.ExecutionContext,
    ) -> Generator[DataLensSearchResponse, None, None]:
        handler = BigQueryHandler()
        for batch in handler.handle_request(request, ctx):
            yield batch


class BigQueryHandler:
    def handle_request(
        self,
        request: DataLensSearchRequest,
        ctx: foo.ExecutionContext,
    ) -> Generator[DataLensSearchResponse, None, None]:
        # Create our client.
        # If needed, we can use secrets from `ctx.secrets` to provide
        # credentials or other secure configuration required to interact
        # with our data source.
        client = bigquery.Client()

        try:
            # Retrieve our Data Lens search parameters
            detection_label = request.search_params.get("detection_label", "")

            # Construct our query
            query = """
                SELECT media_path, tags, detections, keypoints
                FROM `my_dataset.samples_json`,
                UNNEST(JSON_QUERY_ARRAY(detections)) as detection
                WHERE JSON_VALUE(detection.label) = @detection_label
            """

            # Submit our query to BigQuery
            job_config = bigquery.QueryJobConfig(
                query_parameters=[
                    bigquery.ScalarQueryParameter(
                        "detection_label", "STRING", detection_label
                    )
                ]
            )
            query_job = client.query(query, job_config=job_config)

            # Wait for results
            rows = query_job.result(
                # BigQuery will handle pagination automatically, but
                # we can optimize its behavior by synchronizing with
                # the parameters provided by Data Lens
                page_size=request.batch_size,
                max_results=request.max_results
            )

            samples = []

            # Iterate over data from BigQuery
            for row in rows:
                # Transform sample data from BigQuery format to FiftyOne
                samples.append(self.convert_to_sample(row))

                # Yield next batch when we have enough samples
                if len(samples) == request.batch_size:
                    yield DataLensSearchResponse(
                        result_count=len(samples),
                        query_result=samples
                    )

                    # Reset our batch
                    samples = []

            # We've run out of rows, but might have a partial batch
            if len(samples) > 0:
                yield DataLensSearchResponse(
                    result_count=len(samples),
                    query_result=samples
                )

            # Our generator is now exhausted
        finally:
            # Clean up our client on exit
            client.close()
```
Let’s take a look at a few parts in detail.
```python
# Create our client
client = bigquery.Client()
```
In practice, you’ll likely need to use secrets to securely provide credentials to connect to your data source.
```python
# Retrieve our Data Lens search parameters
detection_label = request.search_params.get("detection_label", "")

# Construct our query
query = """
    SELECT media_path, tags, detections, keypoints
    FROM `my_dataset.samples_json`,
    UNNEST(JSON_QUERY_ARRAY(detections)) as detection
    WHERE JSON_VALUE(detection.label) = @detection_label
"""
```
Here we’re using our user-provided input parameters to tailor our query to only the samples of interest. This logic can be as simple or complex as needed to match our use case.
```python
# Wait for results
rows = query_job.result(
    # BigQuery will handle pagination automatically, but
    # we can optimize its behavior by synchronizing with
    # the parameters provided by Data Lens
    page_size=request.batch_size,
    max_results=request.max_results
)
```
Here we're using request.batch_size and request.max_results to help BigQuery align its performance with our use case. In cases where request.max_results is smaller than our universe of samples (such as during preview or small imports), we can prevent fetching more data than we need, improving both query performance and operational cost.
```python
# Transform sample data from BigQuery format to FiftyOne
samples.append(self.convert_to_sample(row))
```
Here we are converting our sample data from its storage format to a FiftyOne Sample. This is where we'll add features to our samples, such as labels.
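The example leaves convert_to_sample() undefined, since its details depend entirely on your storage schema. As a rough sketch, assuming the detections column of our hypothetical my_dataset.samples_json table holds a JSON array of objects with label and already-normalized bounding_box fields, it might look like:

```python
def convert_to_sample(self, row) -> dict:
    # Requires `import json` and `import fiftyone as fo` at the top
    # of the module.
    # The field names below are assumptions about our hypothetical
    # schema; adjust them to match your actual table
    detections = [
        fo.Detection(
            label=det["label"],
            bounding_box=det["bounding_box"],
        )
        for det in json.loads(row["detections"] or "[]")
    ]

    return fo.Sample(
        filepath=row["media_path"],
        tags=list(row["tags"] or []),
        detections=fo.Detections(detections=detections),
    ).to_dict()
```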
As we can see from this example, we can make our Data Lens search experience as powerful as it needs to be. We can leverage internal libraries and services, hosted solutions, and tooling that meets the specific needs of our data. We can expose flexible but precise controls to users to allow them to find exactly the data that’s needed.
Snippet: Dynamic user inputs¶
As the volume and complexity of your data grows, you may want to expose many options to Data Lens users, but doing so all at once can be overwhelming for the user. In this example, we’ll look at how we can use dynamic operators to conditionally expose configuration options to Data Lens users.
```python
class MyOperator(DataLensOperator):
    @property
    def config(self) -> foo.OperatorConfig:
        return foo.OperatorConfig(
            name="my_operator",
            label="My operator",
            # As in our earlier config
            unlisted=True,
            execute_as_generator=True,
            dynamic=True,
        )
```
By setting dynamic=True in our operator config, our operator will be able to customize the options shown to a Data Lens user based on the current state. Let's use this to optionally show an "advanced options" section in our query parameters.
```python
def resolve_input(self, ctx: foo.ExecutionContext):
    inputs = types.Object()

    inputs.str("some_param", label="Parameter value")
    inputs.str("other_param", label="Other value")

    inputs.bool("show_advanced", label="Show advanced options", default=False)

    # Since this is a dynamic operator,
    # we can use `ctx.params` to conditionally show options
    if ctx.params.get("show_advanced") is True:
        # In this example, we'll optionally show configuration which
        # allows a user to remap selected sample fields to another name.
        # This could be used to enable users to import samples into
        # datasets with varying schemas.
        remappable_fields = ("field_a", "field_b")
        for field_name in remappable_fields:
            inputs.str(
                f"{field_name}_remap",
                label=f"Remap {field_name} to another name"
            )

    return types.Property(inputs)
```
Our operator's resolve_input method will be called each time ctx.params changes, which allows us to create an experience that is tailored to the Data Lens user's behavior. In this example, we're optionally displaying advanced configuration that allows a user to remap sample fields. Applying this remapping might look something like this.
```python
def _remap_sample_fields(self, sample: dict, request: DataLensSearchRequest):
    remappable_fields = ("field_a", "field_b")
    for field_name in remappable_fields:
        remapped_field_name = request.search_params.get(f"{field_name}_remap")
        if remapped_field_name not in (None, ""):
            sample[remapped_field_name] = sample[field_name]
            del sample[field_name]
```
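A hypothetical call site for this helper, applying the remapping to each serialized sample before it is batched into a response, might look like:

```python
# Inside handle_lens_search_request(); _fetch_sample_dicts() is a
# hypothetical helper that yields serialized sample dicts
for sample in self._fetch_sample_dicts(request):
    self._remap_sample_fields(sample, request)
    samples.append(sample)
```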
Of course, dynamic operators can be used for much more than this. Search experiences can be broadened or narrowed to allow for both breadth and depth within your connected data sources.
As an example, suppose a user is searching for detections of “traffic light” in an autonomous driving dataset. A dynamic operator can be used to expose additional search options that are specific to traffic lights, such as being able to select samples with only red, yellow, or green lights. In this way, dynamic operators provide a simple mechanism for developing intuitive and context-sensitive search experiences for Data Lens users.
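To make that concrete, here is a minimal sketch of such a context-sensitive input, assuming a detection_label input like the ones above; the light_color option only appears once the user has entered "traffic light".

```python
def resolve_input(self, ctx: foo.ExecutionContext):
    inputs = types.Object()

    inputs.str("detection_label", label="Detection label")

    # Expose traffic-light-specific options only when they're relevant
    if ctx.params.get("detection_label") == "traffic light":
        inputs.enum(
            "light_color",
            values=["red", "yellow", "green"],
            label="Light color",
        )

    return types.Property(inputs)
```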