Kafka Architect / Administration

Department: Engineering
Location: New York
Updated on: June 24, 2022

This job can be performed remotely, anywhere in the US.

MediaMath’s Platform Operations team handles all the challenges of a real-time advertising stack. We run a hybrid environment, with multiple globally placed on-prem datacenters and the AWS cloud, supporting a broad range of services: from low-latency bidding processes handling millions of transactions per second, through big data storage and analytics, to client-facing UI and reporting solutions. Each has its own unique operational challenges, and our team is a key partner in ensuring these workloads are managed in stable, scalable ways.

As part of the Platform Operations team, the Systems Engineer (data management) will be responsible for a range of globally distributed production pipelines and datastores, handling our most business-critical datasets. This includes a network of Kafka clusters and multiple large NoSQL database clusters, as well as support for other data solutions in partnership with existing SMEs. This engineer will be hands-on at all levels, from day-to-day care and feeding to fortifying and modernizing existing offerings, building solutions for management and monitoring, and helping guide overall organizational strategy.

ESSENTIAL DUTIES AND RESPONSIBILITIES

  • Build, maintain, and administer production Kafka clusters and related services (AWS and on-prem)
  • Monitor and regularly assess capacity needs for the global Kafka footprint, taking upcoming roadmap items into account.
  • Monitor the health of Kafka clusters and associated services, and implement appropriate alerting.
  • Migrate existing tooling for Kafka management into company-standard toolsets (Chef, Ansible, CircleCI)
  • Support application teams with Kafka development efforts.
  • Provide backup support for production Hadoop clusters, as requested by the SME.
  • Build, maintain, and administer business-critical database clusters (Scylla, Cassandra, Aerospike)
  • Work with development teams to ensure cluster architectures support business needs, including replication strategy, hardware/site redundancy, failover testing, etc.
  • Monitor health and regularly assess capacity needs for production databases
  • Monitor database cluster health and implement appropriate alerting
  • Support hardware strategy for database services, including specification, testing, maintenance, and tuning.
  • Consult on the design of new database offerings, and provide recommendations on strategy.
  • Provide backup support for production PostgreSQL clusters, as requested by the SME
  • Conduct training sessions to share knowledge with peers and development groups
  • Act as in-team SME for data pipelines & storage – providing guidance and oversight to others in-team, and across the development community.
  • Communicate current status of all projects, problems, and issues to the department management team
  • Support audit and compliance efforts, and initiate corrective action when appropriate for remediation
  • Participate in on-call rota as part of Platform Operations team.

QUALIFICATIONS

Ops basics:

  • Proficiency with Linux system administration (Debian, Ubuntu, CentOS)
  • Proficiency with basic AWS administration (IAM, EC2, Networking, cost analysis)
  • Proficiency with scripting and basic coding (Python, Ruby, Go)
  • Understanding of networking fundamentals, including application-layer protocols (HTTP, SSH, SSL), load balancing solutions (LVS, nginx), and DNS

Role specific experience:

  • Extensive production experience with at least one distributed data processing system (Kafka strongly preferred, Hadoop valuable)
  • Extensive production experience with at least one NoSQL database (Scylla preferred)
  • Experience with relational database software (PostgreSQL preferred)
  • Experience supporting low-latency, globally distributed services at scale.
  • Experience working with private datacenter infrastructure (“on-prem” servers)

Other nice-to-have:

  • Experience leveraging config management toolsets (Chef, Salt, Ansible)
  • Experience leveraging common deployment toolsets (CircleCI, Jenkins, Artifactory)
  • Experience collecting and analyzing metrics for service-level monitoring using Prometheus and Grafana
  • Working knowledge of Kubernetes-based deployments

SKILLS

  • Practical approach to real-world problems, with a “hands-on” approach to solutions.
  • Ability to think strategically, understand business context, and make collaborative decisions
  • Willing to gather information & recommend courses of action clearly and confidently.
  • Fosters open communication, speaks with impact, listens to others, and writes effectively
  • Comfortable partnering with internal clients in development teams to address issues, consult on solutions and plan for future needs.
  • Ability to communicate sometimes complex ideas to non-technical stakeholders, including product and support teams.
  • Desire to mentor and provide guidance to junior engineers, both technically and professionally.
  • Willingness to adhere to, streamline and help improve team processes for work tracking, knowledge sharing, incident response, and cross-org communication.