In this post I have selected and compiled some best practices and tips for Ruby users that I have learned so far. They are the result of a year and a half of experience working on a Business Intelligence and Big Data project.

I believe the development of this functionality would have been more effective (faster and higher quality) if I had followed these tips from the beginning.

There are 30 tips. To keep the post from getting too long, I split the list into two parts.

This is the first part, where I focus on optimizing tests (specs) and database calls (queries).

In the second part, I talk about data treatment and consistency, plus tips and tricks for dealing with import and export (CSV) files.

Testing (specs)

1. Use FactoryGirl

describe Car do
  subject(:car) { }
end

FactoryGirl has an excellent syntax and lets you write cleaner, more readable tests. In addition, it cuts down the time it takes to create specs, centralizes modeling code, is flexible, and enables advanced customization. In other words, it is a great help when optimizing build time and working with more complex models.

2. Use build, not create

FactoryGirl.build(:car)
# [Fast] Constructs the object in memory
# [Slow] Saves the record in the database and runs all validations and callbacks (eg after_create)

It is good practice to use the same kind of database in production and test environments. This avoids surprises in production, even when the tests are passing.

One of the consequences of this, especially when using ActiveRecord, is slower specs, caused by too many database calls in the tests.

To reduce this impact, use build whenever you can.

3. Start with the exception

describe Car do
  subject(:car) {, color: color) }

  context 'when no color is given' do
    let(:color) { nil }
    it { is_expected.not_to be_valid }
  end
end

Many times we test only the happy, error-free path. The problem is that we leave the exceptions for the end, as an afterthought, without considering all the test scenarios.

My tip is to start with the exceptions, thinking about the most unlikely paths.

4. Describe the behavior

describe Car do
  subject(:car) {, fuel: fuel) }

  describe '#drive_for' do
    subject(:drive) { car.drive_for(distance) }

    context 'when driving a positive distance' do
      let(:distance) { 100 }

      context 'and there is not enough fuel' do
        let(:fuel) { 10 }

        it 'drives less than the wanted distance' do
          expect(car.walked_distance).to be < distance
        end

        it 'consumes all fuel' do
          expect(car.fuel).to be 0
        end
      end
    end
  end
end

To understand how a model works, you should only need to read the descriptions of its specs. But it is not always so; it is quite common to see tests that do not describe the exact behavior of a model.

In the above example, everything is clearly described. I know that when I ask the car to drive a certain route and it does not have enough fuel, it does not travel the whole desired distance: it stops halfway, when the fuel runs out.

5. Test the functionality, not the implementation

def drive_for(distance)
  while fuel > 0 && distance > 0
    fuel.subtract(1)
    self.walked_distance += distance_per_liter
    distance -= distance_per_liter
  end
end

expect(car.fuel).to eq 2
# [Good] tests functionality
expect(fuel).to receive(:subtract).with(1).exactly(5).times
# [Bad] tests implementation

If I change the logic of the drive_for method to the one below, the top test keeps passing, while the bottom one fails even though the logic is correct.

def drive_for(distance)
  needed_fuel = distance.to_f / distance_per_liter
  spent_fuel = [self.fuel, needed_fuel].min
  self.walked_distance += spent_fuel * distance_per_liter
  self.fuel -= spent_fuel
end

It is rare for anyone to worry about testing functionality rather than implementation when writing tests, or to check for this in code review. But having to rewrite the tests every time you refactor or change the implementation is a waste of time.

6. rspec --profile

Top 20 slowest examples (8.79 seconds, 48.1% of total time):
  Lead stubed notify_lead_update #as_indexed_json #mailing events has mailing_events
    1.32 seconds ./spec/models/lead_spec.rb:209
  Lead stubed notify_lead_update .tags #untag_me untags the leads
    0.80171 seconds ./spec/models/lead_spec.rb:545
  Lead stubed notify_lead_update .tags #tag_me tags the leads
    0.778 seconds ./spec/models/lead_spec.rb:526
  Lead stubed notify_lead_update .tags #tag_me tags the leads
    0.75545 seconds ./spec/models/lead_spec.rb:531

This option allows us to have instant feedback on spec time, making it easier to optimize the time for each test.

A good unit test should take less than 0.02 seconds. Functional ones usually take longer.

So you do not have to type it on every run, you can add the line --profile to the .rspec file at the root of your project.
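For reference, the .rspec file is just a plain list of default command-line flags. Only --profile is the tip here; the other flags are common companions, not required:

```
--profile
--color
--require spec_helper
```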

Database calls (queries)

7. Use find_each, not each

Car.all.each { |car| do_something(car) }
# [Bad] Loads all elements in memory
Car.find_each { |car| do_something(car) }
# [Good] Loads only the elements of that batch (1000 by default)

With each, memory usage grows along with the size of the table, since it loads everything in one go.

find_each, on the other hand, has fixed memory consumption. It is influenced only by the size of the batch, which can easily be configured via the batch_size option.

In terms of implementation, nothing else changes. So use find_each at ease.

8. Use pluck, not map

Car.where(color: :black).map { |car| car.year }
# [SQL] SELECT  "cars".* FROM "cars" WHERE "car"."color" = "black"
# [Bad] loads all attributes of cars and uses only one
Car.where(color: :black).map(&:year)
# [Bad] Same as the previous example, just shorter syntax
Car.where(color: :black).select(:year).map(&:year)
# [SQL] SELECT  "cars"."year" FROM "cars" WHERE "car"."color" = "black"
# [Good] Only loads the cars year attribute
Car.where(color: :black).pluck(:year)
# [SQL] SELECT  "cars"."year" FROM "cars" WHERE "car"."color" = "black"
# [Good] Same as the previous example, with smaller and clearer syntax

9. Use select, not pluck, when cascading queries

owner_ids = Car.where(color: :black).pluck(:owner_id)
# [SQL] SELECT  "cars"."owner_id" FROM "cars" WHERE "car"."color" = "black"
# owner_ids = [1, 2, 3...]
owners = Owner.where(id: owner_ids).to_a
# [SQL] SELECT  "owners".* FROM "owner" WHERE "owner"."id" IN [1, 2, 3...]
# [Bad] Executes 2 queries
owner_ids = Car.where(color: :black).select(:owner_id)
# owner_ids = #<ActiveRecord::Relation [...]>
owners = Owner.where(id: owner_ids).to_a
# [SQL] SELECT  "owners".* FROM "owner" WHERE "owner"."id" IN (SELECT "cars"."owner_id" FROM "cars" WHERE "car"."color" = "black")
# [Good] Performs only 1 query with subselect

With pluck, the query is executed and the entire list of results is loaded into memory. Then another query is performed with the previously obtained result.

With select, ActiveRecord stores only a Relation and joins the two queries into one.

This saves the memory that would be required to store the results of the first query. In addition, it eliminates the overhead of an extra round trip to the database.

Databases have been evolving for a long time; if there is any optimization to be done in this query, the database will do it better than ActiveRecord and Ruby.

10. Use exists?, not any?

Car.where(color: :black).any?
# [SQL] SELECT COUNT(*) FROM "cars" WHERE "car"."color" = "black"
# [Bad] Counts the whole table
Car.where(color: :black).exists?
# [SQL] SELECT 1 AS one FROM "cars" WHERE "car"."color" = "black" LIMIT 1
# [Good] Query with LIMIT 1

The runtime of any? grows with the size of the table: it performs a count over the whole table and then checks whether the result is zero. exists?, on the other hand, adds LIMIT 1 to the query and always takes about the same time, regardless of the size of the table.

11. Load only what you will use

header = %i(id color owner_id)
CSV.generate do |csv|
  csv << header do |car|
    csv << car.attributes.values_at(*header.map(&:to_s))
  end
end

We often iterate over elements and use only a few of their fields. Loading all the other fields that will never be used wastes time and memory.

In the project, for example, I reduced memory usage by more than 88%. By reviewing the queries and adding select to all of them, I was able to increase concurrency on the same machine, and execution became 12 times faster.

All this optimization took less than 1 hour of work. And if this concern is already in your head at creation time, the additional cost is virtually zero.

Now we will focus on data treatment and consistency, plus tips and tricks for dealing with import and export (CSV) files.

Consistency of data

12. Set the time

Car.where(created_at: Time.current.all_day)
# [Bad] The result may change depending on the time it is run
selected_day = Date.yesterday
Car.where(created_at: selected_day.beginning_of_day..selected_day.end_of_day)
# [Acceptable] The result may still change but controlling this change is relatively easy
selected_day = Time.parse("2018/02/01")
Car.where(created_at: selected_day.beginning_of_day..selected_day.end_of_day)
# [Good] The result does not change

Let’s say we have a routine that runs daily at midnight to generate a report from the previous day.

The first implementation may generate erroneous results if it is not run at that specified time.

In the second, you just have to ensure it runs on the correct day and the result will be as expected.

In the third, it can be executed on any day; you just inform the correct target date.

My tip for periodic routines is to give the target date parameter a default of the previous period. :smirk:

13. Ensure ordering

# [Bad] Can return in a different order if more than one record has the same created_at
Car.order(:created_at, :id)
# [Good] Even with repeated created_at, the id is unique, so the returned order will always be the same
# [Good] The :id tie-breaker is not necessary because license_plate is already unique

In situations where order matters, always use a unique attribute as the tie-breaking criterion.

When we are not careful about this, we usually only find the error in production, because tests use dummy data and rarely cover cases like this.

My tip for this item is to invest in prevention. :innocent:

14. Beware of where by updated_at in batches

Car.where(updated_at: 1.day.ago..Time.current).find_each(batch_size: 10)
# [SQL] SELECT * FROM "cars" WHERE (updated_at BETWEEN '2017-02-19 11:48:51.582646' AND '2017-02-20 11:48:51.582646') ORDER BY "cars"."id" ASC LIMIT 10
# [SQL] SELECT * FROM "cars" WHERE (updated_at BETWEEN '2017-02-19 11:48:51.582646' AND '2017-02-20 11:48:51.582646') AND ("cars"."id" >  3580987) ORDER BY "cars"."id" ASC LIMIT 10
# [SQL] SELECT * FROM "cars" WHERE (updated_at BETWEEN '2017-02-19 11:48:51.582646' AND '2017-02-20 11:48:51.582646') AND ("cars"."id" > 21971397) ORDER BY "cars"."id" ASC LIMIT 10
# [SQL] ...
# [Bad] Records may be missing
Car.where(updated_at: 1.day.ago..Time.current).each
# [SQL] SELECT * FROM "cars" WHERE (updated_at BETWEEN '2017-02-19 11:48:51.582646' AND '2017-02-20 11:48:51.582646')
# [Less Bad] There are no batches so the records are correct, however it can consume a lot of memory
ids = Car.where(updated_at: 1.day.ago..Time.current).pluck(:id)
Car.where(id: ids).find_each(batch_size: 10)
# [SQL] SELECT id FROM "cars" WHERE (updated_at BETWEEN '2017-02-19 11:48:51.582646' AND '2017-02-20 11:48:51.582646')
# [SQL] SELECT * FROM "cars" WHERE (id IN [31122, 918723, ...]) ORDER BY "cars"."id" ASC LIMIT 10
# [SQL] SELECT * FROM "cars" WHERE (id IN [31122, 918723, ...]) AND ("cars"."id" > 3580987) ORDER BY "cars"."id" ASC LIMIT 10
# [SQL] ...
# [Best] Only the IDs of the records are preloaded and all records will be processed correctly in batches, although it can consume a lot of memory if the table is GIANT

The :updated_at attribute changes very frequently, and between the SQL of one batch and the next it can change, making you process repeated records or miss others.

Unfortunately I have not found an optimal solution for this case, but pre-selecting the ids and then iterating in batches over them (a fixed set) ensures that all records are processed correctly, exactly once.

Trying to understand all the workings and possible edge cases of these queries is very difficult, so, when in doubt, avoid batches filtered by updated_at. :sweat_smile:

Date and time

15. TimeZone

Time.parse("2018/02/01")
# [Bad] Does not consider the timezone"2018/02/01")
Time.current # same thing as
# [Good] Considers the timezone

Did you think you were going to get away without dealing with time?

The most common mistake is not considering the timezone in operations with dates and times.

Even if your system does not need to handle multiple timezones, always use the timezone-aware APIs so you do not waste time fixing everything later. :hourglass:

16. Be very careful with timezones in queries

Car.where("created_at > '2018/02/01'")
# [Bad] Does not consider the timezone
Car.where('created_at > ?',"2018/02/01"))
# [Good] Considers the timezone

When you build queries with ActiveRecord itself, it handles the timezone correctly, so no problem there.

But sometimes we need to write more "manual" queries, and in those cases the timezone care is all yours.

Use parameterized queries together with to stay safe. :+1:

17. DB always works with UTC

sql = <<-SQL
  INSERT INTO cars (id, created_at, updated_at)
  VALUES (#{id}, '#{created_at.utc}', #{Sequel::CURRENT_TIMESTAMP})
SQL

The important thing here is that when you generate SQL by hand, you need to put all dates in UTC. :globe_with_meridians:

18. Spread the time

class CarController
  def create
    CarCreateJob.perform_async(car_params.merge(created_at: Time.current))
    render :ok
  end
end

Often we receive a call in the API and enqueue a background job to complete the action.

When it is an action that creates something, it is very important to record the date/time at which the API was called.

This ensures that the object is saved with the correct date even if execution is delayed because of queues or slowness on the server. :memo:

Import and export

Transporting data between systems is always a challenge.

We invented APIs, services, and other mechanisms, but we still fall back on good old data transport using simple tables with comma-separated values (.csv).

So when dealing with transport via CSV, remember:

19. Do not let one case interrupt the whole process

# Ignore invalid rows in export
Car.find_each do |car|
  row = to_csv_row(car)
  csv << row if valid_row?(row)
end

# Begin-rescue so one bad row does not abort the import
# ( is a hypothetical import helper)
CSV.parse(file) do |row|
rescue => e
  errors.add(row, e)
end

# Background jobs in import treat each row separately:
# the code is very clean, uses little memory and performance is better
CSV.parse(file) do |row|
end

Have you ever been running that migration script and had it stop halfway because an exception popped up, forcing you to process everything all over again?

Avoid this by ensuring that even if one entry throws an error, the rest runs smoothly.

It's better to have a few unprocessed entries than to lose everything that comes after the failure.

Also put mechanisms in place so that these exceptions are communicated to you and can be solved in some way.

Do not just add to the logs, something has to alert you, be it an email, an alert on your dashboard, anything.

If it is a very specific case, it may be easier to simply handle it manually, but if it is something that can affect more cases, hurry up and fix it! :runner:

20. Use “tab” as a separator in CSV

TAB = "\t"
SPACE = " "

CSV.generate(col_sep: TAB) do |csv|
  csv << { |v| v.to_s.gsub(TAB, SPACE) }
end

For one reason or another, one day we have to import or export data in CSV format.

In an export, if we choose comma as the separator and some user-entered value contains a comma, we have to treat it so as not to break the CSV.

The most common approach is to surround such fields with quotation marks ("), and then if the value itself contains quotation marks we have to escape those too, and so on.

The code for all this is somewhat complex, increasing the chance of bugs and unforeseen cases, and performance may be degraded.

If we use the “tab” character as a separator the scenario changes.

Replacing a user-entered "tab" with a "space" is practically imperceptible in most cases, we do not have to worry about any other character, and the generated "CSV" is very clean and readable.

Of course there are some cases in which the user’s “tab” is important so we always have to think before choosing.

Trust me, the “tab” is your friend.

21. Treat the data

email.downcase.delete(" ").gsub(ANYTHING_BETWEEN_PLUS_AND_AT_INCLUSIVELY, "@")

Who has never suffered with “equal” texts being considered different because of uppercase and lowercase, spaces at the beginning and end, etc.?

In the United States dates are in month/day/year format; in most other countries it is day/month/year. If you simply read the value, you can easily end up swapping the day and the month.

And that user who fills out a form with a "+" alias in the email address and ends up as 2 different users for the same person?

There are many problems caused by this kind of nonsense, so before filling the CSV of an export or reading from an import, treat the data to ensure its integrity and be happy. :sunglasses:

22. Validate the data

time > MIN_DATE && time <= Time.current ? time : nil
object.present? && object.valid? ? object : nil
!option.blank? && VALID_OPTIONS.include?(option) ? option : DEFAULT_OPTION
# (the values after each ternary are illustrative fallbacks)

Even when everything is supposedly as it should be, we sometimes receive dates from before Christ or in the future.

Null or invalid objects raise all sorts of exceptions.

These checks are simple, so there is no excuse not to do them. :blush:

23. Encoding is Evil

I do not think I need to convince anyone that encoding is evil.

This will help you in this battle.

require 'charlock_holmes'
contents ='test.xml')
detection = CharlockHolmes::EncodingDetector.detect(contents)
# { :encoding => 'UTF-8', :confidence => 100, :type => :text }
encoding = detection[:encoding]
CharlockHolmes::Converter.convert(contents, encoding, 'UTF-8')

Even so, it is not a silver bullet: if the confidence is not very high, it is worth telling the user that the file could not be read and asking them to convert it to UTF-8 or another accepted format. :mag:

24. CSV with header: true

CSV.parse(file) do |row|
  # row => ['color', 'year', ...]
  next if header?(row)
  # row => ['black', '2017', ...] row[0], year: row[1])
end
# [Bad] Needs to handle the first line (header)
# [Bad] It will break if the CSV column order changes

options = { headers: true, header_converters: :symbol }
CSV.parse(file, options) do |row|
  # row => { color: 'black', year: '2017', ... } row[:color], year: row[:year])
end
# [Good] CSV already handles the header
# [Good] Implementation independent of the CSV column order

values = CSV.parse(file, options).map(&:to_h)
Car.create(values)
# Insert all at once

When reading a CSV with a header by hand, you must ignore the first line and require the user to keep the file's columns in the order you expect.

We know this does not happen, there will always be a CSV with the column in the wrong order.

Because of this we end up having to read the header and interpret each line accordingly.

There is a parameter in the CSV library that already does this for you.

With the option headers: true I can iterate over a CSV with an object similar to a hash.

It's free; enjoy and use it. :raised_hands:
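A stdlib-only example of what that hash-like access looks like:

```ruby
require 'csv'

file = "color,year\nblack,2017\nwhite,2018\n"
options = { headers: true, header_converters: :symbol }

colors = []
CSV.parse(file, **options) do |row|
  colors << row[:color]   # works no matter the column order in the file
end
colors  # => ["black", "white"]
```

If the file arrives as "year,color\n2017,black\n" instead, row[:color] still returns "black" with no changes to the code.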

25. CSV Lint

dialect = {
  header: true,
  delimiter: TAB,
  skip_blanks: true,
}

validator =, dialect)

The csvlint gem statically validates the CSV file.

It is so nice that it returns all the errors, indicating the reason and the line of each one.

Super easy and fast.

This eliminates much of the import errors. :gun:

26. Show user errors

Okay, you’ve protected your system well by validating encoding, checking for file syntax errors, and bypassing rows with errors.

Your system may be 100% ok, but the user still has not had 100% of their important data imported.

Whenever you detect and skip an error, let your dear user know in a very friendly way, giving tips on how they can fix it.

If possible, even generate a new CSV with only the lines with errors.

Help them and earn their loyalty. :heart:

We still need to worry about files

27. Parameterize

path = 'path/to/Filé Name%_!@#2017.tsv'
extension = File.extname(path)
File.basename(path, extension).parameterize

Files are saved on various operating systems, by users and by servers.

Accents and strange characters, especially, can cause a lot of headaches.

Fortunately, parameterize is a good remedy for this and has no contraindications. :pill:
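parameterize comes from ActiveSupport; roughly, it transliterates accents and collapses unsafe characters, something like this plain-Ruby approximation:

```ruby
# Rough plain-Ruby approximation of ActiveSupport's String#parameterize
# (the real method handles many more transliteration cases).
def parameterize(name)
  name.unicode_normalize(:nfkd)                                    # split "é" into "e" + accent
      .encode('ASCII', invalid: :replace, undef: :replace, replace: '')
      .gsub(/[^a-zA-Z0-9\-_]+/, '-')                               # collapse unsafe runs into "-"
      .gsub(/\A-+|-+\z/, '')                                       # trim leading/trailing "-"
      .downcase
end

parameterize('Filé Name%_!@#2017')  # => "file-name-_-2017"
```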

28. Avoid long file names

Depending on the file system or communication protocol, long file names can simply be truncated.

The FTP protocol, for example, does this.

I like to use meaningful names, and sometimes they end up long.

I have already had to go back and shorten everything because long names caused problems. :scream:

29. Compress the CSV

CSV files tend to have many repeated characters, so when compressed their size shrinks dramatically.

I have seen 32 MB files reduced to 6 MB, so it's worth it. :package:
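For example, with nothing but the stdlib (generic deflate here; a real file would more likely go through Zlib::GzipWriter or a zip gem):

```ruby
require 'zlib'

# Highly repetitive CSV data, like most exports.
csv_data = "color,year\n" + "black,2017\n" * 10_000

compressed = Zlib::Deflate.deflate(csv_data)

csv_data.bytesize     # => 110011
compressed.bytesize   # a few hundred bytes: the repetition compresses away
Zlib::Inflate.inflate(compressed) == csv_data  # => true, lossless round trip
```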

30. Use zipRuby, not RubyZip

RubyZip is 2x slower than zipRuby and allocates 700x more objects in memory.

I do not think I need to say anything else.

Got any questions?

Do you have killer tips on the subject?

Leave a message in the comments!