Salesforce is one of the most widely used Customer Relationship Management (CRM) platforms. It is nothing short of a giant in the for-profit CRM world and is rapidly expanding in the non-profit world as well.

If you spend enough time interacting with Salesforce and the data that resides within the platform, you have likely come across Salesforce’s own Salesforce Object Query Language (SOQL).

SOQL is a great way to quickly query the data you need from one or several objects within your Salesforce instance. However, if you use Salesforce long enough in your organization, you may run into a common issue when querying certain fields.

Within Salesforce are objects and fields. Objects are similar to tables in a database: they hold data specific to an area of your organization. There are standard objects (Contacts, Accounts, Leads) and custom objects. Fields are similar to columns in a table. There are also standard fields (Id, Name, Email) and custom fields.

A great way to interact with Salesforce data is to write SOQL against a specific object. This can be accomplished through Force Workbench, a free and open-source project supported by the community; please note that it is not an official Salesforce product.

There are also packages available in Python (Simple Salesforce) and R (salesforcer) that can connect to your organization’s Salesforce instance and execute SOQL (among many other things!) through the Salesforce Platform APIs.
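As a rough sketch, connecting and querying with the salesforcer package in R might look like the following. The credentials are placeholders (calling sf_auth() with no arguments walks you through browser-based OAuth instead), and the query pulls only standard fields from the standard Contact object:

# install.packages("salesforcer")
library(salesforcer)

# Authenticate; the credentials below are placeholders.
# sf_auth() with no arguments uses browser-based OAuth instead.
sf_auth(username       = "me@example.com",
        password       = "my-password",
        security_token = "my-security-token")

# Run SOQL against the standard Contact object, selecting standard fields.
contacts <- sf_query("SELECT Id, Name, Email FROM Contact")

Simple Salesforce in Python follows the same pattern: authenticate once, then pass SOQL strings to the client.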

An issue that I have run into is when a custom field is added to an existing object in Salesforce. Data parsers in popular data-analysis programming languages do a great job of guessing column types based upon the first N rows of the data. But when a custom field is added to a Salesforce object and data is not back-filled for the historical records, the wrong data type can be inferred. For example, when using the salesforcer package in R, suppose the first 2,000+ rows returned from an object have missing data for a custom field (let’s call it *Custom_Field__c*). When guessing the data type of this column, R will likely store it as logical. The problem is that any values of a different type appearing after those first 2,000+ rows will not be stored correctly, or the data will be lost entirely.

A quick fix that I have used is to first sort the data by CreatedDate, with the most recently created records first (using the ORDER BY clause with the DESC keyword). An example is provided below:

SELECT Id
     , Name
     , CreatedDate
     , Custom_Field__c
FROM Account
ORDER BY CreatedDate DESC

This ensures that the most recently created records, which are the least likely to be missing values for the new field, are read in first. The data parser will then guess the column type based upon this valid data. This workaround has saved me hours of specifying column types by hand or forcing R to read a very large number of rows before guessing the column types.
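As a minimal sketch (assuming you have already authenticated with salesforcer as above), the workaround looks something like this in R. The readr::guess_parser() call is only there to illustrate why an all-missing sample falls back to logical; the exact guessing behavior can vary by package version:

library(salesforcer)

# Illustration: a readr-style guesser that only sees missing values
# falls back to the logical type.
readr::guess_parser(rep(NA_character_, 5))
#> [1] "logical"

# The workaround: the most recently created records, which are the most
# likely to have Custom_Field__c populated, come back first.
soql <- "SELECT Id, Name, CreatedDate, Custom_Field__c
         FROM Account
         ORDER BY CreatedDate DESC"

accounts <- sf_query(soql)

If even the newest records can be blank, falling back to specifying the column types explicitly is still an option, as noted above.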

Happy data sleuthing!