Data Quality Package

Inheritance

Submodules

Models

exception seed.models.data_quality.ComparisonError

Bases: exceptions.Exception

class seed.models.data_quality.DataQualityCheck(*args, **kwargs)

Bases: django.db.models.base.Model

Object that stores the high level configuration per organization of the DataQualityCheck

exception DoesNotExist

Bases: django.core.exceptions.ObjectDoesNotExist

exception MultipleObjectsReturned

Bases: django.core.exceptions.MultipleObjectsReturned

REQUIRED_FIELDS = {'PropertyState': ['address_line_1', 'custom_id_1', 'pm_property_id'], 'TaxLotState': ['address_line_1', 'custom_id_1', 'jurisdiction_tax_lot_id']}
add_result_comparison_error(row_id, rule, display_name, value, rule_check)
add_result_is_null(row_id, rule, display_name, value)
add_result_max_error(row_id, rule, display_name, value, rule_max)
add_result_min_error(row_id, rule, display_name, value, rule_min)
add_result_missing_and_none(row_id, rule, display_name, value)
add_result_missing_req(row_id, rule, display_name, value)
add_result_string_error(row_id, rule, display_name, value)
add_result_type_error(row_id, rule, display_name, value)
add_rule(rule)

Add a new rule to the Data Quality Checks

Parameters:rule – dict to be added as a new rule
Returns:None
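
The shape of the dict passed to add_rule is not spelled out here. The sketch below assumes it uses the same keys as the entries in Rule.DEFAULT_RULES; the make_rule helper and the example values are hypothetical and not part of SEED.

```python
# Hypothetical helper that builds a rule dict shaped like the entries in
# Rule.DEFAULT_RULES. It is an illustration only, not part of SEED.
RULE_TYPE_CUSTOM = 1   # mirrors Rule.RULE_TYPE_CUSTOM
SEVERITY_WARNING = 1   # mirrors Rule.SEVERITY_WARNING
TYPE_NUMBER = 0        # mirrors Rule.TYPE_NUMBER


def make_rule(table_name, field, minimum=None, maximum=None):
    """Build a custom, warning-level numeric rule dict."""
    rule = {
        'table_name': table_name,
        'field': field,
        'rule_type': RULE_TYPE_CUSTOM,
        'severity': SEVERITY_WARNING,
        'data_type': TYPE_NUMBER,
    }
    # Only include bounds that were actually supplied.
    if minimum is not None:
        rule['min'] = minimum
    if maximum is not None:
        rule['max'] = maximum
    return rule


new_rule = make_rule('PropertyState', 'energy_score', minimum=1, maximum=100)
```

Given a DataQualityCheck instance dq, the resulting dict would presumably be passed as dq.add_rule(new_rule).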
static cache_key(identifier)

Static method to return the location of the data_quality results from redis.

Parameters:identifier – Import file primary key
Returns:
check_data(record_type, rows)

Send in data as a queryset from the Property/Taxlot ids.

Parameters:
  • record_type – one of PropertyState | TaxLotState
  • rows – rows of data to be checked for data quality
Returns:

None

get_fieldnames(record_type)

Get fieldnames to apply to results.

id

A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.

static initialize_cache(identifier=None)

Initialize the cache for storing the results. This is called before the celery tasks are chunked up.

The cache_key is different from the identifier. The cache_key is where all the results of the data quality checks are stored; the identifier is the random number (or specified value) that is used to identify both the progress and the data storage.

Parameters:identifier – Identifier for cache, if None, then creates a random one
Returns:list, [cache_key and the identifier]
initialize_rules()

Initialize the default rules for a DataQualityCheck object

Returns:None
name

A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.

objects = <django.db.models.manager.Manager object>
organization

Accessor to the related object on the forward side of a many-to-one or one-to-one (via ForwardOneToOneDescriptor subclass) relation.

In the example:

class Child(Model):
    parent = ForeignKey(Parent, related_name='children')

child.parent is a ForwardManyToOneDescriptor instance.

organization_id

A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.

remove_all_rules()

Removes all the rules associated with this DataQualityCheck instance.

Returns:None
remove_status_label(label_class, rule, linked_id)

Remove label because it did not match any of the range exceptions

Parameters:
  • label_class – statuslabel object, either property label or taxlot label
  • rule – rule object
  • linked_id – id of propertystate or taxlotstate object
Returns:

boolean, whether the label was applied

reset_all_rules()

Delete all rules and reinitialize the default set of rules

Returns:None
reset_default_rules()

Reset only the default rules

Returns:
reset_results()
classmethod retrieve(organization_id)

DataQualityCheck was previously a simple object but has been migrated to a django model. This method ensures that the data quality model will be backwards compatible.

This is the preferred method to initialize a new object.

Parameters:organization_id – ID of the Organization
Returns:obj, DataQualityCheck
retrieve_result_by_address(address)

Retrieve the results of the data quality checks for a specific address.

Parameters:address – string, address to find the result for
Returns:dict, results of data quality check for specific building
retrieve_result_by_tax_lot_id(tax_lot_id)

Retrieve the results of the data quality checks by the jurisdiction ID.

Parameters:tax_lot_id – string, jurisdiction tax lot id
Returns:dict, results of data quality check for specific building
rules

Accessor to the related objects manager on the reverse side of a many-to-one relation.

In the example:

class Child(Model):
    parent = ForeignKey(Parent, related_name='children')

parent.children is a ReverseManyToOneDescriptor instance.

Most of the implementation is delegated to a dynamically defined manager class built by create_forward_many_to_many_manager() defined below.

save_to_cache(identifier)

Save the results to the cache database. The data in the cache are stored as a list of dictionaries. The data in this class are stored as a dict of dicts. This is important to remember because the data from the cache cannot simply be loaded into the above structure.

Parameters:identifier – Import file primary key
Returns:None
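
The difference between the two shapes can be illustrated with plain Python. The row contents and the use of an 'id' key for re-keying are assumptions made for this sketch.

```python
# Assumed in-memory shape: results keyed by row id, each value a dict.
results = {
    1: {'id': 1, 'address_line_1': '123 Main St', 'data_quality_results': []},
    2: {'id': 2, 'address_line_1': '456 Oak Ave', 'data_quality_results': []},
}

# Flatten to a list of dicts, the shape the cache is described as holding.
cached = [results[row_id] for row_id in sorted(results)]

# Loading back is not a plain assignment: the list has to be re-keyed.
# Using the 'id' entry of each row as the key is an assumption.
restored = {row['id']: row for row in cached}
```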
update_status_label(label_class, rule, linked_id)
Parameters:
  • label_class – statuslabel object, either property label or taxlot label
  • rule – rule object
  • linked_id – id of propertystate or taxlotstate object
Returns:

boolean, whether the label was applied

exception seed.models.data_quality.DataQualityTypeCastError

Bases: exceptions.Exception

class seed.models.data_quality.Rule(*args, **kwargs)

Bases: django.db.models.base.Model

Rules for DataQualityCheck

DATA_TYPES = [(0, 'number'), (1, 'string'), (2, 'date'), (3, 'year'), (4, 'area'), (5, 'eui')]
DEFAULT_RULES = [{'rule_type': 0, 'severity': 0, 'data_type': 1, 'not_null': True, 'field': 'address_line_1', 'table_name': 'PropertyState'}, {'rule_type': 0, 'severity': 0, 'data_type': 1, 'not_null': True, 'field': 'pm_property_id', 'table_name': 'PropertyState'}, {'rule_type': 0, 'field': 'custom_id_1', 'table_name': 'PropertyState', 'severity': 0, 'not_null': True}, {'rule_type': 0, 'field': 'jurisdiction_tax_lot_id', 'table_name': 'TaxLotState', 'severity': 0, 'not_null': True}, {'rule_type': 0, 'field': 'address_line_1', 'table_name': 'TaxLotState', 'severity': 0, 'not_null': True}, {'rule_type': 0, 'field': 'conditioned_floor_area', 'table_name': 'PropertyState', 'severity': 0, 'data_type': 4, 'min': 0, 'units': 'ft**2', 'max': 7000000}, {'rule_type': 0, 'severity': 1, 'data_type': 4, 'min': 100, 'field': 'conditioned_floor_area', 'table_name': 'PropertyState', 'units': 'ft**2'}, {'rule_type': 0, 'severity': 0, 'data_type': 0, 'min': 0, 'max': 100, 'field': 'energy_score', 'table_name': 'PropertyState'}, {'rule_type': 0, 'severity': 1, 'data_type': 0, 'min': 10, 'field': 'energy_score', 'table_name': 'PropertyState'}, {'rule_type': 0, 'severity': 0, 'data_type': 2, 'min': 18890101, 'max': 20201231, 'field': 'generation_date', 'table_name': 'PropertyState'}, {'rule_type': 0, 'field': 'gross_floor_area', 'table_name': 'PropertyState', 'severity': 0, 'data_type': 0, 'min': 100, 'units': 'ft**2', 'max': 7000000}, {'rule_type': 0, 'field': 'occupied_floor_area', 'table_name': 'PropertyState', 'severity': 0, 'data_type': 0, 'min': 100, 'units': 'ft**2', 'max': 7000000}, {'rule_type': 0, 'severity': 0, 'data_type': 2, 'min': 18890101, 'max': 20201231, 'field': 'recent_sale_date', 'table_name': 'PropertyState'}, {'rule_type': 0, 'severity': 0, 'data_type': 2, 'min': 18890101, 'max': 20201231, 'field': 'release_date', 'table_name': 'PropertyState'}, {'rule_type': 0, 'field': 'site_eui', 'table_name': 'PropertyState', 'severity': 0, 'data_type': 5, 'min': 0, 'units': 
'kBtu/ft**2/year', 'max': 1000}, {'rule_type': 0, 'severity': 1, 'data_type': 5, 'min': 10, 'field': 'site_eui', 'table_name': 'PropertyState', 'units': 'kBtu/ft**2/year'}, {'rule_type': 0, 'field': 'site_eui_weather_normalized', 'table_name': 'PropertyState', 'severity': 0, 'data_type': 5, 'min': 0, 'units': 'kBtu/ft**2/year', 'max': 1000}, {'rule_type': 0, 'field': 'source_eui', 'table_name': 'PropertyState', 'severity': 0, 'data_type': 5, 'min': 0, 'units': 'kBtu/ft**2/year', 'max': 1000}, {'rule_type': 0, 'severity': 1, 'data_type': 5, 'min': 10, 'field': 'source_eui', 'table_name': 'PropertyState', 'units': 'kBtu/ft**2/year'}, {'rule_type': 0, 'field': 'source_eui_weather_normalized', 'table_name': 'PropertyState', 'severity': 0, 'data_type': 5, 'min': 10, 'units': 'kBtu/ft**2/year', 'max': 1000}, {'rule_type': 0, 'severity': 0, 'data_type': 3, 'min': 1700, 'max': 2019, 'field': 'year_built', 'table_name': 'PropertyState'}, {'rule_type': 0, 'severity': 0, 'data_type': 2, 'min': 18890101, 'max': 20201231, 'field': 'year_ending', 'table_name': 'PropertyState'}]
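
Several fields in DEFAULT_RULES carry two entries. For example, conditioned_floor_area has a severity 0 (error) rule with min 0 and max 7000000, and a severity 1 (warning) rule with min 100. The sketch below shows how such layered rules might be evaluated against one value; the check_range function is hypothetical, and the comparison logic is an assumption based on the minimum_valid and maximum_valid descriptions.

```python
SEVERITY = {0: 'error', 1: 'warning'}  # mirrors Rule.SEVERITY

# Trimmed copies of the two conditioned_floor_area entries in DEFAULT_RULES.
floor_area_rules = [
    {'field': 'conditioned_floor_area', 'severity': 0, 'min': 0, 'max': 7000000},
    {'field': 'conditioned_floor_area', 'severity': 1, 'min': 100},
]


def check_range(value, rules):
    """Return the severity name of every rule whose range the value violates."""
    violated = []
    for rule in rules:
        below_min = 'min' in rule and value < rule['min']
        above_max = 'max' in rule and value > rule['max']
        if below_min or above_max:
            violated.append(SEVERITY[rule['severity']])
    return violated
```

Under these assumptions a value of 50 trips only the warning rule, while a negative value trips both.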
exception DoesNotExist

Bases: django.core.exceptions.ObjectDoesNotExist

exception MultipleObjectsReturned

Bases: django.core.exceptions.MultipleObjectsReturned

RULE_TYPE = [(0, 'default'), (1, 'custom')]
RULE_TYPE_CUSTOM = 1
RULE_TYPE_DEFAULT = 0
SEVERITY = [(0, 'error'), (1, 'warning')]
SEVERITY_ERROR = 0
SEVERITY_WARNING = 1
TYPE_AREA = 4
TYPE_DATE = 2
TYPE_EUI = 5
TYPE_NUMBER = 0
TYPE_STRING = 1
TYPE_YEAR = 3
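
The choice lists above pair each stored integer with a readable name. A small sketch of the lookup, using plain dicts rather than a model instance (the example rule dict is made up for illustration):

```python
# Copied from the class attributes listed above.
DATA_TYPES = [(0, 'number'), (1, 'string'), (2, 'date'),
              (3, 'year'), (4, 'area'), (5, 'eui')]
SEVERITY = [(0, 'error'), (1, 'warning')]

# Lists of (value, name) pairs convert directly into lookup dicts.
data_type_names = dict(DATA_TYPES)
severity_names = dict(SEVERITY)

rule = {'field': 'site_eui', 'severity': 1, 'data_type': 5}
label = '%s (%s, %s)' % (
    rule['field'],
    data_type_names[rule['data_type']],
    severity_names[rule['severity']],
)
```

On a saved Rule instance, the get_data_type_display() and get_severity_display() methods that Django generates for fields with choices return the same names.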
data_quality_check

Accessor to the related object on the forward side of a many-to-one or one-to-one (via ForwardOneToOneDescriptor subclass) relation.

In the example:

class Child(Model):
    parent = ForeignKey(Parent, related_name='children')

child.parent is a ForwardManyToOneDescriptor instance.

data_quality_check_id

A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.

data_type

A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.

description

A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.

enabled

A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.

field

A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.

format_strings(value)
get_data_type_display(**morekwargs)
get_rule_type_display(**morekwargs)
get_severity_display(**morekwargs)
id

A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.

max

A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.

maximum_valid(value)

Validate that the value is not greater than the maximum specified by the rule.

Parameters:value – Value to validate rule against
Returns:bool, True if valid, False if the value is out of range
min

A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.

minimum_valid(value)

Validate that the value is not less than the minimum specified by the rule.

Parameters:value – Value to validate rule against
Returns:bool, True if valid, False if the value is out of range
name

A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.

not_null

A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.

objects = <django.db.models.manager.Manager object>
required

A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.

rule_type

A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.

severity

A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.

status_label

Accessor to the related object on the forward side of a many-to-one or one-to-one (via ForwardOneToOneDescriptor subclass) relation.

In the example:

class Child(Model):
    parent = ForeignKey(Parent, related_name='children')

child.parent is a ForwardManyToOneDescriptor instance.

status_label_id

A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.

str_to_data_type(value)

If the check is coming from a field in the database then it will be typed correctly; however, for extra_data, the values are typically strings or unicode. Therefore, the values are typed before they are checked using the rule’s data type definition.

Parameters:value – variant, value to type
Returns:typed value
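
A rough sketch of this kind of casting follows. It is not the SEED implementation: the function takes the data type as an explicit argument instead of reading it from the rule, the date format is an assumption, and raising DataQualityTypeCastError on a failed cast is assumed from the exception's name.

```python
from datetime import datetime

TYPE_NUMBER, TYPE_STRING, TYPE_DATE, TYPE_YEAR = 0, 1, 2, 3  # mirror Rule.TYPE_*


class DataQualityTypeCastError(Exception):
    """Local stand-in for seed.models.data_quality.DataQualityTypeCastError."""


def str_to_data_type(data_type, value):
    # Values from typed database fields are assumed to be typed already.
    if not isinstance(value, str):
        return value
    if value.strip() == '':
        return None
    try:
        if data_type == TYPE_NUMBER:
            return float(value)
        if data_type == TYPE_YEAR:
            return int(value)
        if data_type == TYPE_DATE:
            # ISO dates are an assumption; extra_data may use other formats.
            return datetime.strptime(value, '%Y-%m-%d').date()
    except ValueError:
        raise DataQualityTypeCastError('cannot cast %r' % value)
    return value  # strings and unhandled types pass through unchanged
```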
table_name

A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.

text_match

A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.

units

A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.

valid_text(value)

Validate the rule matches the specified text. Text is matched by regex.

Parameters:value – Value to validate rule against
Returns:bool, True if valid, False if the value does not match
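
A minimal sketch of regex-based text validation. It assumes the pattern comes from the rule's text_match field, that an empty pattern means there is nothing to check, and that re.search (rather than re.match) is the intended matching mode; none of these are confirmed here.

```python
import re


def valid_text(text_match, value):
    """Return True if value satisfies the pattern, or if no pattern is set."""
    if text_match is None or text_match == '':
        return True   # no pattern configured, nothing to validate
    if value is None:
        return False  # a pattern exists but there is no text to test
    return re.search(text_match, str(value)) is not None
```

For example, the pattern r'^\d{5}$' accepts '80401' and rejects '8040'.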
seed.models.data_quality.format_pint_violation(rule, source_value)

Format a pint min, max violation for human readability.

Parameters:
  • rule
  • source_value – Quantity, value to format into range
Returns:

(formatted_value, formatted_min, formatted_max), a tuple of (String, String, String)

Tests

Views

class seed.views.data_quality.DataQualityViews(**kwargs)

Bases: rest_framework.viewsets.ViewSet

Handles Data Quality API operations within Inventory backend. (1) Post, wait, get… (2) Respond with what changed

create(request)

This API endpoint creates a new cleansing operation process in the background, potentially on a subset of properties/taxlots, and returns a query key.

Parameters:
  • organization_id – Organization ID (type: integer, required: true, paramType: query)
  • data_quality_ids – An object containing the IDs of the records to perform data quality checks on. It should contain two keys, property_state_ids and taxlot_state_ids, each of which is an array of the appropriate IDs (required: true, paramType: body)
Returns:
  • status – success or error (type: string, required: true)
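
A sketch of a request for this endpoint, built with the standard library only. It assumes the request body itself is the data_quality_ids object; the ID values are invented, and no endpoint URL is given because none is documented here.

```python
import json

# Query-string parameter described above; the value is made up.
params = {'organization_id': 1}

# Body with the two keys the endpoint is documented to expect.
# The IDs are placeholders, not real records.
payload = {
    'property_state_ids': [1, 2, 3],
    'taxlot_state_ids': [10, 11],
}
body = json.dumps(payload)
```

The response is described as carrying a status of success or error; the key it returns is what the results endpoint later takes.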
csv(request, *args, **kwargs)

Download a CSV of the data quality checks by the pk, which is the cache_key.

parameter_strategy: replace

Parameters:
  • pk – Import file ID or cache key (required: true, paramType: path)
data_quality_rules(request, *args, **kwargs)

Returns the data_quality rules for an org.

Parameters:
  • organization_id – Organization ID (type: integer, required: true, paramType: query)
Returns:
  • status – success or error (type: string, required: true)
  • rules – An object containing ‘properties’ and ‘taxlots’ arrays of rules (type: object, required: true)
reset_all_data_quality_rules(request, *args, **kwargs)

Resets an organization’s data_quality rules.

Parameters:
  • organization_id – Organization ID (type: integer, required: true, paramType: query)
Returns:
  • status – success or error (type: string, required: true)
  • in_range_checking – An array of in-range error rules (type: array[string], required: true)
  • missing_matching_field – An array of fields to verify existence (type: array[string], required: true)
  • missing_values – An array of fields to ignore missing values (type: array[string], required: true)
reset_default_data_quality_rules(request, *args, **kwargs)

Resets an organization’s data_quality rules.

Parameters:
  • organization_id – Organization ID (type: integer, required: true, paramType: query)
Returns:
  • status – success or error (type: string, required: true)
  • in_range_checking – An array of in-range error rules (type: array[string], required: true)
  • missing_matching_field – An array of fields to verify existence (type: array[string], required: true)
  • missing_values – An array of fields to ignore missing values (type: array[string], required: true)
results(request, *args, **kwargs)

Return the result of the data quality check based on the ID that was given when the data quality task was created. Note that this ID is not related to an object in the database, since the results are stored in redis.

save_data_quality_rules(request, *args, **kwargs)

Saves an organization’s settings: name, query threshold, shared fields. The method passes in all the fields again, so it is okay to remove all the rules in the db and just recreate them (albeit inefficiently).

parameter_strategy: replace

Parameters:
  • organization_id – Organization ID (type: integer, required: true, paramType: query)
  • body – JSON body containing organization rules information (pytype: RulesSerializer, required: true, paramType: body)
Returns:
  • status – success or error (type: string, required: true)
  • message – error message, if any (type: string, required: true)
class seed.views.data_quality.RulesIntermediateSerializer(instance=None, data=<class rest_framework.fields.empty>, **kwargs)

Bases: rest_framework.serializers.Serializer

class seed.views.data_quality.RulesSerializer(instance=None, data=<class rest_framework.fields.empty>, **kwargs)

Bases: rest_framework.serializers.Serializer

class seed.views.data_quality.RulesSubSerializer(instance=None, data=<class rest_framework.fields.empty>, **kwargs)

Bases: rest_framework.serializers.Serializer

class seed.views.data_quality.RulesSubSerializerB(instance=None, data=<class rest_framework.fields.empty>, **kwargs)

Bases: rest_framework.serializers.Serializer