Side project: JustJoin.It data [#1]

I've been thinking today about starting a side project and documenting it on this blog. It's been difficult for me to post anything on a regular basis - I hope this will help with that.

What is the JustJoin.it portal?

JustJoin.it is a Polish job board for the IT industry. It offers a very convenient way of searching for IT jobs (filters for skills, location, company and so on), and the majority of job posts include a salary range. It's an SPA (single-page application) which uses an API underneath to fetch the data presented to end users. No authentication is required to browse the jobs - the job offers endpoint is public. I'm going to fetch this data and try to do something with it.

Extracting and taking a quick look at the data

When you open the JustJoin.it page with the developer console open (F12 on Chrome/Firefox) and go to the Network tab, you'll see a GET request to https://justjoin.it/api/offers. It returns a JSON array containing information about the available jobs. Let's do some basic exploration in an IPython shell to get an idea of what kind of data is returned. I'm going to use the excellent requests library for making the HTTP request.

import requests

# fetch the public offers endpoint; a timeout guards against a hanging request
response = requests.get("https://justjoin.it/api/offers", timeout=30)

# make sure the request succeeded before parsing the payload
assert 200 == response.status_code

results = response.json()

Let's try to explore fetched data a bit and ask some basic questions.

How many jobs are there?

len(results)

>> 6226

What does a single job's payload look like?

results[0]

>> {'title': 'React/React Native Expert',
 'street': 'Prosta 36',
 'city': 'Wrocław',
 'country_code': 'PL',
 'address_text': 'Prosta 36, Wrocław',
 'marker_icon': 'javascript',
 'workplace_type': 'remote',
 'company_name': 'Callstack',
 'company_url': 'https://callstack.com',
 'company_size': '45-55',
 'experience_level': 'senior',
 'latitude': '51.1032',
 'longitude': '17.0208962',
 'published_at': '2021-10-17T07:00:19.376Z',
 'remote_interview': True,
 'id': 'callstack-react-react-native-expert',
 'employment_types': [{'type': 'b2b',
   'salary': {'from': 24000, 'to': 28000, 'currency': 'pln'}}],
 'company_logo_url': 'https://bucket.justjoin.it/offers/company_logos/thumb/a6d79c6d4a5c754f9cf30144cdac8d6fadaf55cb.png?1612781235',
 'skills': [{'name': 'React Native', 'level': 4},
  {'name': 'ReactJS', 'level': 4},
  {'name': 'JavaScript', 'level': 4}],
 'remote': True}
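Note that the salary lives inside the nested employment_types list (and may be missing for offers with an undisclosed salary). Here's a small sketch for flattening it - the field names follow the example payload above, though I haven't verified the schema against every offer:

```python
def salary_ranges(job):
    """Yield (employment type, from, to, currency) tuples for one job payload."""
    for et in job.get('employment_types', []):
        salary = et.get('salary')
        if salary:  # salary can be None when the offer doesn't disclose it
            yield (et['type'], salary['from'], salary['to'], salary['currency'])

# the example payload from above, trimmed to the relevant fields
job = {
    'id': 'callstack-react-react-native-expert',
    'employment_types': [{'type': 'b2b',
                          'salary': {'from': 24000, 'to': 28000, 'currency': 'pln'}}],
}
print(list(salary_ranges(job)))  # [('b2b', 24000, 28000, 'pln')]
```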

Sanity check: is the job id unique?

job_ids = [job['id'] for job in results]

assert len(job_ids) == len(set(job_ids))

As expected, the job id uniquely identifies each job in the array.

How many unique companies do we have?

unique_companies = sorted({job['company_name'] for job in results})

len(unique_companies)

>> 1602

This number is probably a bit overstated, as there are cases of different names that likely refer to the same company. I'll try to catch them during the data-cleaning process.
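A first stab at catching such duplicates could be a simple normalization that lowercases names and strips common legal suffixes. The suffix list below is my own guess and certainly incomplete - the real cleaning pass will need more care:

```python
import re

# assumed set of common Polish/English legal suffixes - not exhaustive
LEGAL_SUFFIXES = re.compile(
    r'\s+(s\.a\.|sp\. z o\.o\.|sp\. j\.|ltd\.?|gmbh|inc\.?)$',
    re.IGNORECASE,
)

def normalize_company(name):
    """Lowercase the name and strip a trailing legal-form suffix, if any."""
    return LEGAL_SUFFIXES.sub('', name.strip().lower())

# variants like 'Packhelp' and 'Packhelp S.A.' now collapse to one key
print(normalize_company('Packhelp S.A.'))  # packhelp
```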

unique_companies

>> [
      ...
      'Packhelp',
      'Packhelp S.A.',
      ...
  ]

What are the dates for earliest/latest published job posts?

from dateutil import parser

published_at_dates = [parser.parse(job['published_at']) for job in results]

min(published_at_dates)
>> datetime.datetime(2021, 9, 20, 7, 52, tzinfo=tzutc())

max(published_at_dates)
>> datetime.datetime(2021, 10, 17, 7, 0, 19, 376000, tzinfo=tzutc())

(max(published_at_dates) - min(published_at_dates)).days
>> 26

Looks like the data exposed by the offers endpoint is a snapshot of jobs published during the most recent ~26 days.

Process for regular data extraction

Since the API exposes a ~1-month snapshot of data, I will try to create an ETL (Extract/Transform/Load) process that fetches & saves the raw data on a regular basis.

Storing raw data daily will allow me to maintain historical data and add the delta from the most recent snapshot on top of it - so no data is lost and historic values are preserved.
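Before wiring anything into the cloud, the extract step can be sketched locally: fetch the payload and dump it under a date-stamped path. The directory layout and file naming here are my own convention, not anything the API dictates:

```python
import datetime
import json
import pathlib

import requests

def snapshot_path(base_dir, day):
    """Date-stamped path for a raw snapshot, e.g. data/raw/offers_2021-10-17.json."""
    return pathlib.Path(base_dir) / f"offers_{day.isoformat()}.json"

def save_daily_snapshot(base_dir="data/raw"):
    # fetch the full offers payload and persist it verbatim as raw data
    response = requests.get("https://justjoin.it/api/offers", timeout=30)
    response.raise_for_status()
    path = snapshot_path(base_dir, datetime.date.today())
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(response.json()))
    return path
```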

Architecture

I have a private VPS which could serve as a host for a daily cron job (Python/Bash script) fetching & saving the data. But in order to learn something new, I'll try to host the whole process in the cloud (AWS). I'd also like to make the whole process serverless - so CPU resources are used only when required.

This post contains only an overview of the architecture; the actual implementation will be described in upcoming posts.

Services I plan to use

  • storage - AWS S3
  • computing - AWS Lambda
  • scheduling - AWS EventBridge

Thanks for reading - upcoming posts will contain the actual implementation details for the architecture mentioned above.


Take care,
Kuba
