In undergrad I wrote a tutorial on #Julia programming language in which I analyzed the output of the Collatz function. The Collatz Conjecture was really fascinating to me due to its seemingly simple wording and almost impossible to solve mystery. Since then, #Julia changed significantly, and Terence Tao made a contribution that gets us closer to the proof of the Collatz Conjecture.

View on Julia Notebook nbviewer: https://lnkd.in/eGnHgy2

alt text

Update: In September 2019 Tao published a paper in which he proved that the Collatz Conjecture is “almost true” for “almost all numbers.”

Tao’s Paper: https://lnkd.in/eCus8YU

Tao’s Blog Post: https://lnkd.in/eb7ePAu

Table of contents

Download

Download Julia from http://julialang.org/

Download Julai IDEs:

  • Juno from http://junolab.org/

  • IJulia kernel for Jyputer notebook from https://github.com/JuliaLang/IJulia.jl

Juno is a good IDE for writing and evaluating julia code quickly. IJulia notebook is good for writing tutorials and reports with julia code results embeded in the document.

Once you’ve installed everything I recommend opening up the Juno IDE and going through the tutorial.

Quick start

I execute all julia code below in Juno. I suggest you create a folder on your desktop and make it your working directory where we will be able to write files. First, a couple of basic commands. To evaluate code in Juno you just need to press Ctrl-D (its in the Juno tutrial):

VERSION # print julia version number

pwd() # print working directory
homedir() # print the default home directory

# cd(pwd()) # set working directory to DirectoryPath
"/Users/kobakhitalishvili"
3+5 # => 8
5*7 # => 35
3^17 # => 129140163
3^(1+3im) # im stands for imaginary number => -2.964383781426573 - 0.46089998526262876im
log(7) # natural log of 7 => 1.9459101490553132
1.9459101490553132
1.9459101490553132

Interesting that julia has imaginary number built in. Now, variables and functions:

a = cos(pi) + im * sin(pi) # assigning to a variable
-1.0 + 1.2246467991473532e-16im
b = â„Ż^(im*pi)
-1.0 + 1.2246467991473532e-16im
a == b # boolean expression. It is an euler identity.
true

Lets see how to define functions. Here is a chapter on functions in julia docs for more info.

plus2(x) = x + 2 # a compact way

function plustwo(x) # traditional function definition
    return x+2
end
plustwo (generic function with 1 method)

plus2(11)
13
plustwo(11)
13

Here is a julia cheatsheet with above and additional information in a concise form. Next, lets write a function that will generate some data which we will write to a csv file, plot, and save the plot.

Data frames, plotting, and file Input/Output

I decided to write a function $f(x)$ that performs the process from the Collatz conjecture. Basically, for any positive integer $x$ if $x$ is even divide by $2$, if x is odd multiply by $3$ and add $1$. Repeat the process until you reach one. The Collatz conjecture proposes that regardless of what number you start with you will always reach one. Here it is in explicit form

\[f(x) = \begin{cases} x/2, & \mbox{if } x\mbox{ is even} \\ 3x+1, & \mbox{if } x\mbox{ is odd} \end{cases}\]

The function collatz(x) will count the number of iterations it took for the starting number to reach $1$.

function collatz(x)
    # Given a number x
    # - divide by 2 if x is even
    # - multiply by 3 and add 1 if x is odd
    # until x reaches 1
    # Args:
    # - param x: integer
    # - return: integer
    count = 0
    while x != 1
        if x % 2 == 0
            x = x/2
            count += 1
        else
            x = 3*x + 1
            count += 1
        end
    end
    return count
end

collatz(2)
1
collatz(3)
3

Data frames

Now, let’s create a dataframe with the number of steps needed to complete the Collatz process for each number from 1 to 1000. We will use the DataFrames package because the base julia library does not have dataframes.

# install DataFrames package
using Pkg
# Pkg.add("DataFrames")

using DataFrames

# Before populating the dataframe with collatz data lets see how to create one
df = DataFrame(Col1 = 1:10, Col2 = ["a","b","c","d","e","f","a","b","c","d"])
df

10 rows Ă— 2 columns

Col1Col2
Int64String
11a
22b
33c
44d
55e
66f
77a
88b
99c
1010d
# Neat. Now let's generate data using collatz function
df = DataFrame(Number = 1:1000, NumofSteps = map(collatz,1:1000))
first(df, 10)

10 rows Ă— 2 columns

NumberNumofSteps
Int64Int64
110
221
337
442
555
668
7716
883
9919
10106

map() applies collatz() function to every number in the 1:1000 array which is an array of numbers [1,2,3,...,1000]. In this instance map() returns an array of numbers that is the output of collatz() function.

# To get descriptive statistics 
describe(df)

2 rows Ă— 8 columns

variablemeanminmedianmaxnuniquenmissingeltype
SymbolFloat64Int64Float64Int64NothingNothingDataType
1Number500.51500.51000Int64
2NumofSteps59.542043.0178Int64

Before we save it lets categorize the points based on whether the original number is even or odd.

df.evenodd = map(x -> if x % 2 == 0 "even" else "odd" end, 1:1000) # create new evenodd column
# rename!(df, :x1, :evenodd) #rename it to evenodd
first(df,5)

5 rows Ă— 3 columns

NumberNumofStepsevenodd
Int64Int64String
110odd
221even
337odd
442even
555odd

I use the map() function with an anonymous function x -> if x % 2 == 0 "even" else "odd" end which checks for divisibility by 2 to create a column with “even” and “odd” as entries. Finally, I rename the new column “evenodd”.

Additionally, let’s identify the prime numbers as well.

# Pkg.add("Primes")
using Primes

isprime(3)
true
df.isprime = map(x -> if isprime(x) "yes" else "no" end,df.Number)
first(df,5)

5 rows Ă— 4 columns

NumberNumofStepsevenoddisprime
Int64Int64StringString
110oddno
221evenyes
337oddyes
442evenno
555oddyes
# Pkg.add("CSV")
using CSV

# To save the data frame in the working directory 
CSV.write("collatz.csv", df)
"collatz.csv"

Plotting data

To plot the data we will use the Gadfly package. Gadly resembles ggplot in its functionality. There is also the Plots package which brings mutliple plotting libraries into a single API.

To save plots in different image formats we will need the Cairo pakage.

# Pkg.add(["Cairo","Fontconfig","Plots", "Gadfly","PlotlyJS","ORCA"])
# Pkg.add("Gadfly")

# using Plots
# plotlyjs() use plotly backend
# pyplot() use pyplot backend
using Cairo
using Gadfly

Gadfly.plot(df,x=:Number, y=:NumofSteps, Geom.point, Guide.title("Collatz Conjecture"))

<?xml version=”1.0” encoding=”UTF-8”?>

Number -1500 -1000 -500 0 500 1000 1500 2000 2500 -1000 -950 -900 -850 -800 -750 -700 -650 -600 -550 -500 -450 -400 -350 -300 -250 -200 -150 -100 -50 0 50 100 150 200 250 300 350 400 450 500 550 600 650 700 750 800 850 900 950 1000 1050 1100 1150 1200 1250 1300 1350 1400 1450 1500 1550 1600 1650 1700 1750 1800 1850 1900 1950 2000 -1000 0 1000 2000 -1000 -900 -800 -700 -600 -500 -400 -300 -200 -100 0 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000 h,j,k,l,arrows,drag to pan i,o,+,-,scroll,shift-drag to zoom r,dbl-click to reset c for coordinates ? for help ? -250 -200 -150 -100 -50 0 50 100 150 200 250 300 350 400 450 -200 -190 -180 -170 -160 -150 -140 -130 -120 -110 -100 -90 -80 -70 -60 -50 -40 -30 -20 -10 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360 370 380 390 400 -200 0 200 400 -200 -180 -160 -140 -120 -100 -80 -60 -40 -20 0 20 40 60 80 100 120 140 160 180 200 220 240 260 280 300 320 340 360 380 400 NumofSteps Collatz Conjecture

Looks pretty. I will color the points based on whether the original number is even or odd.

Gadfly.plot(df,x=:Number, y=:NumofSteps, color = :evenodd, Geom.point) # assign plot to variable

<?xml version=”1.0” encoding=”UTF-8”?>

Number -1500 -1000 -500 0 500 1000 1500 2000 2500 -1000 -950 -900 -850 -800 -750 -700 -650 -600 -550 -500 -450 -400 -350 -300 -250 -200 -150 -100 -50 0 50 100 150 200 250 300 350 400 450 500 550 600 650 700 750 800 850 900 950 1000 1050 1100 1150 1200 1250 1300 1350 1400 1450 1500 1550 1600 1650 1700 1750 1800 1850 1900 1950 2000 -1000 0 1000 2000 -1000 -900 -800 -700 -600 -500 -400 -300 -200 -100 0 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000 odd even evenodd h,j,k,l,arrows,drag to pan i,o,+,-,scroll,shift-drag to zoom r,dbl-click to reset c for coordinates ? for help ? -250 -200 -150 -100 -50 0 50 100 150 200 250 300 350 400 450 -200 -190 -180 -170 -160 -150 -140 -130 -120 -110 -100 -90 -80 -70 -60 -50 -40 -30 -20 -10 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360 370 380 390 400 -200 0 200 400 -200 -180 -160 -140 -120 -100 -80 -60 -40 -20 0 20 40 60 80 100 120 140 160 180 200 220 240 260 280 300 320 340 360 380 400 NumofSteps

It looks like odd and even numbers might overlay each other. Let’s plot even and odd numbers side by side.

Gadfly.plot(df, xgroup=:evenodd, color = :evenodd,
    x=:Number, y=:NumofSteps,
    Geom.subplot_grid(Geom.point))

<?xml version=”1.0” encoding=”UTF-8”?>

Number by evenodd odd even evenodd even odd -1500 -1000 -500 0 500 1000 1500 2000 2500 -1000 -950 -900 -850 -800 -750 -700 -650 -600 -550 -500 -450 -400 -350 -300 -250 -200 -150 -100 -50 0 50 100 150 200 250 300 350 400 450 500 550 600 650 700 750 800 850 900 950 1000 1050 1100 1150 1200 1250 1300 1350 1400 1450 1500 1550 1600 1650 1700 1750 1800 1850 1900 1950 2000 -1000 0 1000 2000 -1000 -900 -800 -700 -600 -500 -400 -300 -200 -100 0 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000 -1500 -1000 -500 0 500 1000 1500 2000 2500 -1000 -950 -900 -850 -800 -750 -700 -650 -600 -550 -500 -450 -400 -350 -300 -250 -200 -150 -100 -50 0 50 100 150 200 250 300 350 400 450 500 550 600 650 700 750 800 850 900 950 1000 1050 1100 1150 1200 1250 1300 1350 1400 1450 1500 1550 1600 1650 1700 1750 1800 1850 1900 1950 2000 -1000 0 1000 2000 -1000 -900 -800 -700 -600 -500 -400 -300 -200 -100 0 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000 -250 -200 -150 -100 -50 0 50 100 150 200 250 300 350 400 450 -200 -190 -180 -170 -160 -150 -140 -130 -120 -110 -100 -90 -80 -70 -60 -50 -40 -30 -20 -10 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360 370 380 390 400 -200 0 200 400 -200 -180 -160 -140 -120 -100 -80 -60 -40 -20 0 20 40 60 80 100 120 140 160 180 200 220 240 260 280 300 320 340 360 380 400 NumofSteps

Even numbers up to 1000 require fewer steps to complete the Collatz procedure than odd numbers. Let’s do the same for prime numbers.

Gadfly.plot(df, xgroup=:isprime, color = :isprime,
    x=:Number, y=:NumofSteps,
    Geom.subplot_grid(Geom.point))

<?xml version=”1.0” encoding=”UTF-8”?>

Number by isprime no yes isprime yes no -1500 -1000 -500 0 500 1000 1500 2000 2500 -1000 -950 -900 -850 -800 -750 -700 -650 -600 -550 -500 -450 -400 -350 -300 -250 -200 -150 -100 -50 0 50 100 150 200 250 300 350 400 450 500 550 600 650 700 750 800 850 900 950 1000 1050 1100 1150 1200 1250 1300 1350 1400 1450 1500 1550 1600 1650 1700 1750 1800 1850 1900 1950 2000 -1000 0 1000 2000 -1000 -900 -800 -700 -600 -500 -400 -300 -200 -100 0 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000 -1500 -1000 -500 0 500 1000 1500 2000 2500 -1000 -950 -900 -850 -800 -750 -700 -650 -600 -550 -500 -450 -400 -350 -300 -250 -200 -150 -100 -50 0 50 100 150 200 250 300 350 400 450 500 550 600 650 700 750 800 850 900 950 1000 1050 1100 1150 1200 1250 1300 1350 1400 1450 1500 1550 1600 1650 1700 1750 1800 1850 1900 1950 2000 -1000 0 1000 2000 -1000 -900 -800 -700 -600 -500 -400 -300 -200 -100 0 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000 -250 -200 -150 -100 -50 0 50 100 150 200 250 300 350 400 450 -200 -190 -180 -170 -160 -150 -140 -130 -120 -110 -100 -90 -80 -70 -60 -50 -40 -30 -20 -10 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360 370 380 390 400 -200 0 200 400 -200 -180 -160 -140 -120 -100 -80 -60 -40 -20 0 20 40 60 80 100 120 140 160 180 200 220 240 260 280 300 320 340 360 380 400 NumofSteps

Seems like prime numbers follow a similar distribution of steps to that of non primes. Finally, let’s plot this data by separating even, odd and prime numbers.

p = Gadfly.plot(df, xgroup=:evenodd, ygroup=:isprime, 
    color = :isprime,
    x=:Number, y=:NumofSteps,
    Geom.subplot_grid(Geom.point),
    Guide.title("Number of steps to complete the Collatz procedure vs. Input"))

<?xml version=”1.0” encoding=”UTF-8”?>

Number by evenodd no yes isprime even odd -1500 -1000 -500 0 500 1000 1500 2000 2500 -1000 -950 -900 -850 -800 -750 -700 -650 -600 -550 -500 -450 -400 -350 -300 -250 -200 -150 -100 -50 0 50 100 150 200 250 300 350 400 450 500 550 600 650 700 750 800 850 900 950 1000 1050 1100 1150 1200 1250 1300 1350 1400 1450 1500 1550 1600 1650 1700 1750 1800 1850 1900 1950 2000 -1000 0 1000 2000 -1000 -900 -800 -700 -600 -500 -400 -300 -200 -100 0 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000 -1500 -1000 -500 0 500 1000 1500 2000 2500 -1000 -950 -900 -850 -800 -750 -700 -650 -600 -550 -500 -450 -400 -350 -300 -250 -200 -150 -100 -50 0 50 100 150 200 250 300 350 400 450 500 550 600 650 700 750 800 850 900 950 1000 1050 1100 1150 1200 1250 1300 1350 1400 1450 1500 1550 1600 1650 1700 1750 1800 1850 1900 1950 2000 -1000 0 1000 2000 -1000 -900 -800 -700 -600 -500 -400 -300 -200 -100 0 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000 -250 -200 -150 -100 -50 0 50 100 150 200 250 300 350 400 450 -200 -190 -180 -170 -160 -150 -140 -130 -120 -110 -100 -90 -80 -70 -60 -50 -40 -30 -20 -10 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360 370 380 390 400 -200 0 200 400 -200 -180 -160 -140 -120 -100 -80 -60 -40 -20 0 20 40 60 80 100 120 140 160 180 200 220 240 260 280 300 320 340 360 380 400 yes -250 -200 -150 -100 -50 0 50 100 150 200 250 300 350 400 450 -200 -190 -180 -170 -160 -150 -140 -130 -120 -110 -100 -90 -80 -70 -60 -50 -40 -30 -20 -10 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360 370 380 390 400 -200 0 200 400 -200 -180 -160 -140 -120 -100 -80 -60 -40 -20 0 20 40 60 80 100 120 140 160 180 200 220 240 260 280 300 320 340 360 380 400 no NumofSteps by isprime Number of steps to complete the Collatz procedure vs. Input

Obvisouly, there are no even prime numbers except 2 and hence the plot in bottom right contains a lonely (2,1) point. Let’s calculate the mean number of steps required for $x$ to complete the Collatz process by whether $x$ is even, odd or prime.

using Statistics
mean_steps = by(df, [:isprime, :evenodd], :NumofSteps => mean)
sort!(mean_steps,:NumofSteps_mean, rev=true)

4 rows Ă— 3 columns

isprimeevenoddNumofSteps_mean
StringStringFloat64
1noodd66.8348
2yesodd63.7305
3noeven53.3908
4yeseven1.0

You might have noticed the exclamation mark in sort!. The exclamation mark indicates to perform the operation in place instead of doing df = sort(df). It looks like even numbers on average require the least number of steps to complete the Collatz procedure. Non prime odd numbers require the most number of steps. There is only one prime even number which is 2 and 2 requires only one step.

What does it tell us about the Collatz Conjecture? Nothing. This kind of analysis will not uncover any helpful insights, even though it may help visualize the behavior of numbers and their outputs generated by the Collatz function.

The Collatz Conjecture is one of the simplest unsolved problems in mathematics actually and Terence Tao is the last mathematician to have made any kind of progress on it. In September 2019 he proved that the Conjecture holds “almost true” for “almost all numbers”.

# Save the plot in the working directory
draw(PNG("collatz-plot.png", 6inch, 4inch), p)

Conclusion

Julia is a comfortable language to work with and many say it is the future of scientific computing. It may very well be true. One of the main reasons is Julia’s JIT compiler which makes Julia almost as fast and sometimes faster than C. At this point, I find Julia not as good as R simply because R is more mature and has a bigger commmunity. R aslo has better documentation and more questions on Stackoverflow. There are $728,009$ questions with an R tag in contrast to $16,583$ questions with julia tag as of 12/12/2019.

Julia has been developing at a steady pace with growing community and ecosystem. Unlikely that Julia is going to be a competitor in the industry against Python, SAS and R, but in academia it is a different story.

Resources used